Part A
(Compulsory)
1. a. Define Data mining and list out its functionalities.
Data mining refers to the process of extracting or "mining" interesting knowledge or
patterns from large amounts of data.
Data mining functionalities specify the kinds of patterns that can be mined. On the basis of
the kinds of patterns to be mined, there are two categories of functions involved in data mining:
descriptive tasks and predictive tasks.
Descriptive Tasks
Descriptive tasks deal with the general properties of the data in the database. Here is the list
of descriptive functions:
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Predictive Tasks
Classification
Prediction
Outlier Analysis
Evolution Analysis
b. Explain the OLAP operations in the multidimensional data model.
Drill-down
Drill-down is performed by stepping down a concept hierarchy for a dimension such as time.
Suppose the concept hierarchy is "day < month < quarter < year." On drilling down,
the time dimension is descended from the level of quarter to the level of month. Drill-down
can also be performed by adding one or more new dimensions to the data cube. It
navigates from less detailed data to more detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a
new sub-cube. Consider the following diagram that shows how slice works.
(Diagram: slice operation on a data cube)
Here slice is performed for the dimension "time" using the criterion time = "Q1".
It forms a new sub-cube by fixing a value on a single dimension.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.
(Diagram: dice operation on a data cube)
The dice operation on the cube, based on the following selection criteria, involves three
dimensions:
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item = "Mobile" or "Modem")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order
to provide an alternative presentation of data. Consider the following diagram that
shows the pivot operation.
(Diagram: pivot operation on a data cube)
Here the item and location axes of a 2-D slice are rotated.
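The following minimal Python sketch (using pandas) illustrates how slice, dice, and pivot act on a small cube held as a flat table. The column names and figures are illustrative assumptions, not values taken from the diagrams above.

import pandas as pd

# Toy "cube" as a flat table; dimensions: location, time, item.
cube = pd.DataFrame({
    "location": ["Toronto", "Vancouver", "Toronto", "Vancouver"],
    "time":     ["Q1",      "Q1",        "Q2",      "Q2"],
    "item":     ["Mobile",  "Modem",     "Mobile",  "Modem"],
    "sales":    [605,       825,         14,        400],
})

# Slice: fix a single dimension (time = "Q1") to obtain a sub-cube.
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = cube[cube["location"].isin(["Toronto", "Vancouver"])
            & cube["time"].isin(["Q1", "Q2"])
            & cube["item"].isin(["Mobile", "Modem"])]

# Pivot: rotate a 2-D view so that item and location swap axes.
pivot = cube.pivot_table(values="sales", index="item",
                         columns="location", aggfunc="sum")
print(pivot)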
Part B
Unit I,II
2. a. Explain Data mining as a process of knowledge extraction.
Data mining is one step in the process of knowledge discovery from data (KDD), which
extracts knowledge through the following iterative sequence of steps:
1. Data cleaning: remove noise and inconsistent data.
2. Data integration: combine data from multiple sources.
3. Data selection: retrieve the data relevant to the analysis task from the database.
4. Data transformation: consolidate the data into forms appropriate for mining, for
example by summary or aggregation operations.
5. Data mining: apply intelligent methods to extract data patterns.
6. Pattern evaluation: identify the truly interesting patterns representing knowledge,
based on interestingness measures.
7. Knowledge presentation: use visualization and knowledge representation techniques
to present the mined knowledge to the user.
(or)
3. a. Describe Multi dimensional data model in detail.
The multidimensional data model (MDDM) is a logical design technique used in data
warehouses (DW). It is closely related to OLAP systems.
MDDM is a design technique for databases intended to support end-user queries in a
data warehouse.
It is oriented around understandability, as opposed to database administration.
MDDM Terminology
Grain
Fact
Dimension
Cube
Star
Snowflake
Grain
Identifying the grain means deciding the level of detail that will be made
available in the dimensional model.
Granularity is defined as the level of detail of the information stored in a table.
The more the detail, the lower the level of granularity.
The less the detail, the higher the level of granularity.
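As a rough illustration of grain, the Python sketch below rolls hypothetical daily sales up to monthly and quarterly totals; each roll-up stores less detail. The data and column name are made up for the example.

import pandas as pd

# Hypothetical sales recorded at the daily grain.
daily = pd.DataFrame(
    {"sales": [100, 150, 120, 90]},
    index=pd.to_datetime(["2024-01-05", "2024-02-10",
                          "2024-04-03", "2024-05-20"]),
)

monthly = daily.resample("M").sum()    # day -> month: detail is lost
quarterly = daily.resample("Q").sum()  # month -> quarter: even less detail
print(quarterly)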
Dimensions/Dimension Tables
Dimension tables contain descriptive attributes, which are typically static
values containing textual data or discrete numbers that behave as text values.
Main functionalities:
Query filtering/constraining
Query result set labeling
Star Schema
The basic star schema contains four components.
These are:
Fact table, Dimension tables, Attributes and Dimension hierarchies
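A minimal Python sketch of these components, with assumed table and column names: a central fact table holds foreign keys and a measure, and the dimension tables supply the descriptive attributes used for filtering and labeling query results.

import pandas as pd

# Dimension tables: descriptive attributes keyed by a surrogate key.
dim_item = pd.DataFrame({"item_key": [1, 2],
                         "item_name": ["Mobile", "Modem"]})
dim_location = pd.DataFrame({"loc_key": [10, 20],
                             "city": ["Toronto", "Vancouver"]})

# Fact table: foreign keys into each dimension plus a numeric measure.
fact_sales = pd.DataFrame({"item_key": [1, 2, 1],
                           "loc_key": [10, 20, 20],
                           "units_sold": [5, 3, 7]})

# A star join: the fact table is joined to each dimension, and the
# dimension attributes filter and label the result set.
report = (fact_sales
          .merge(dim_item, on="item_key")
          .merge(dim_location, on="loc_key"))
print(report[report["city"] == "Toronto"])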
(Diagram: star schema with a central fact table linked to dimension tables)
b. Explain the applications of data mining.
Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining.
Retail Industry
Data mining has great application in the retail industry because this industry collects large
amounts of data on sales, customer purchasing history, goods transportation, consumption,
and services. Data mining in the retail industry helps identify customer buying patterns and
trends, which leads to improved quality of customer service and better customer retention
and satisfaction.
Telecommunication Industry
The telecommunication industry provides various services such as fax, cellular phone,
internet messenger, e-mail, and web data transmission, and it generates and stores
tremendous amounts of data. Data mining in this industry helps identify telecommunication
patterns, catch fraudulent activities, make better use of resources, and improve quality of
service.
Biological Data Analysis
In recent times, we have seen tremendous growth in the field of biology, in areas such as
genomics, proteomics, functional genomics, and biomedical research. Biological data
mining is a very important part of bioinformatics.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data sets,
for which statistical techniques are appropriate. Huge amounts of data have been collected
from scientific domains such as geosciences and astronomy. Large data sets are also being
generated by fast numerical simulations in fields such as climate and ecosystem modeling,
chemical engineering, and fluid dynamics.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a major
issue. The increased usage of the internet and the availability of tools and tricks for intruding
into and attacking networks have made intrusion detection a critical component of network
administration.
Unit II, III
4. a. What are the characteristics of rule-based classification?
Mutually exclusive rules
A classifier contains mutually exclusive rules if the rules are independent of each other:
every record is covered by at most one rule.
Exhaustive rules
A classifier has exhaustive coverage if it accounts for every possible combination of
attribute values:
each record is covered by at least one rule.
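The two properties can be checked mechanically. The Python sketch below models rules as predicates over a record; the rule set and records are illustrative assumptions.

# Each rule is a (name, predicate) pair; a record is a dict.
def covering_rules(rules, record):
    return [name for name, pred in rules if pred(record)]

rules = [
    ("r1", lambda r: r["income"] == "high"),
    ("r2", lambda r: r["income"] == "low"),
]
records = [{"income": "high"}, {"income": "low"}, {"income": "medium"}]

for rec in records:
    hits = covering_rules(rules, rec)
    # Mutually exclusive: len(hits) <= 1 for every record.
    # Exhaustive: len(hits) >= 1 for every record; the "medium"
    # record is covered by no rule, so this rule set is not exhaustive.
    print(rec, hits)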
Information gain
ID3 uses information gain as its attribute selection measure. Let node N represent or hold
the tuples of partition D. The attribute with the highest information gain is chosen as the
splitting attribute for node N. This attribute minimizes the information needed to classify the
tuples in the resulting partitions and reflects the least randomness or impurity in these
partitions. Such an approach minimizes the expected number of tests needed to classify a
given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
The expected information needed to classify a tuple in D is given by
Info(D) = - sum_{i=1..m} p_i log2(p_i)
where p_i is the probability that an arbitrary tuple in D belongs to class Ci and is estimated
by |C_i,D| / |D|. If D is partitioned on attribute A into partitions D_1, ..., D_v, the
information still required after the split is Info_A(D) = sum_{j=1..v} (|D_j| / |D|) Info(D_j),
and the information gain is Gain(A) = Info(D) - Info_A(D).
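A short, self-contained Python sketch of Info(D) and Gain(A) as defined above, on made-up labels and attribute values:

from collections import Counter
from math import log2

def info(labels):
    # Expected information (entropy) of a list of class labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(values, labels):
    # Information gain of splitting on an attribute's values.
    n = len(labels)
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    info_a = sum(len(p) / n * info(p) for p in parts.values())
    return info(labels) - info_a

labels = ["yes", "yes", "no", "no"]
age    = ["youth", "youth", "senior", "senior"]
print(gain(age, labels))   # 1.0: this split separates the classes perfectly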
Gain ratio
The information gain measure is biased toward tests with many outcomes. That is, it
prefers to select attributes having a large number of values. For example, consider an
attribute that acts as a unique identifier, such as product ID. A split on product ID would
result in a large number of partitions (as many as there are values), each one containing
just one tuple. Because each partition is pure, the information required to classify data set
D based on this partitioning would be Info_product_ID(D) = 0. Therefore, the information
gained by partitioning on this attribute is maximal. Clearly, such a partitioning is useless
for classification.
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio,
which attempts to overcome this bias. It applies a kind of normalization to information
gain using a split information value defined analogously to Info(D) as
SplitInfo_A(D) = - sum_{j=1..v} (|D_j| / |D|) log2(|D_j| / |D|)
The gain ratio is then GainRatio(A) = Gain(A) / SplitInfo_A(D), and the attribute with
the maximum gain ratio is selected as the splitting attribute.
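The sketch below computes SplitInfo for a hypothetical unique-identifier attribute, showing how the large denominator counters the bias described above:

from collections import Counter
from math import log2

def split_info(values):
    # SplitInfo is the entropy of the attribute's value distribution.
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

# Eight tuples, each with a distinct product ID.
product_ids = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]
print(split_info(product_ids))   # log2(8) = 3.0
# GainRatio(A) = Gain(A) / SplitInfo_A(D): even a maximal gain is
# divided by 3.0 here, penalizing the many-valued split.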
Gini index
The Gini index is used in CART. Using the notation described above, the Gini index
measures the impurity of D, a data partition or set of training tuples, as
Gini(D) = 1 - sum_{i=1..m} p_i^2
When considering a binary split, we compute a weighted sum of the impurity of each
resulting partition. For example, if a binary split on A partitions D into D1 and D2, the
Gini index of D given that partitioning is
Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
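Both formulas are easy to evaluate directly; the Python sketch below uses an illustrative 9/5 class split:

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum of squared class probabilities.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(d1, d2):
    # Weighted impurity of a binary partition of D into D1 and D2.
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

d = ["yes"] * 9 + ["no"] * 5
print(gini(d))                              # about 0.459
print(gini_split(["yes"] * 9, ["no"] * 5))  # 0.0 for a pure split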
(or)
5. a. Describe Bayesian classification in detail.
Bayesian classifiers are statistical classifiers that predict class membership probabilities.
They are based on Bayes' theorem of posterior probability:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Naive Bayesian classification is called naive because it assumes class conditional
independence. That is, the effect of an attribute value on a given class is independent
of the values of the other attributes. This assumption is made to reduce computational
costs, and hence is considered "naive". The major idea behind naive Bayesian
classification is to classify data by maximizing P(X|Ci)P(Ci) (where i is an
index of the class) using Bayes' theorem of posterior probability. In general:
We are given a set of unknown data tuples, where each tuple is represented by an n-
dimensional vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple
from n attributes, respectively A1, A2, ..., An. We are also given a set of m classes, C1,
C2, ..., Cm. Using Bayes' theorem, the naive Bayesian classifier calculates the posterior
probability of each class conditioned on X. X is assigned the class label of the class
with the maximum posterior probability conditioned on X. Therefore, we try to
maximize P(Ci|X) = P(X|Ci)P(Ci) / P(X). However, since P(X) is constant for all
classes, only P(X|Ci)P(Ci) need be maximized. If the class prior probabilities are not
known, then it is commonly assumed that the classes are equally likely, i.e., P(C1) =
P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we
maximize P(X|Ci)P(Ci). The class prior probabilities may be estimated by P(Ci) =
si/s, where si is the number of training tuples of class Ci, and s is the total number of
training tuples. In order to reduce computation in evaluating P(X|Ci), the naive
assumption of class conditional independence is made. This presumes that the values
of the attributes are conditionally independent of one another, given the class label of
the tuple, i.e., that there are no dependence relationships among the attributes. Under
this assumption, P(X|Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci).
If Ak is a categorical attribute, then P(xk|Ci) is equal to the number of training tuples
in Ci that have xk as the value for that attribute, divided by the total number of training
tuples in Ci. If Ak is a continuous attribute, then P(xk|Ci) can be calculated using a
Gaussian density function.
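To make the procedure concrete, here is a small, self-contained categorical naive Bayes sketch in Python following the derivation above. The training data and attribute values are illustrative, and refinements such as Laplace smoothing and Gaussian handling of continuous attributes are omitted.

from collections import Counter, defaultdict

def train(rows, labels):
    prior = Counter(labels)                  # si per class Ci
    cond = defaultdict(Counter)              # (attr index, class) -> value counts
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            cond[(k, c)][v] += 1
    return prior, cond, len(labels)

def classify(x, prior, cond, n):
    best, best_p = None, -1.0
    for c, sc in prior.items():
        p = sc / n                           # P(Ci) = si / s
        for k, v in enumerate(x):
            p *= cond[(k, c)][v] / sc        # P(xk | Ci)
        if p > best_p:                       # maximize P(X|Ci) P(Ci)
            best, best_p = c, p
    return best

rows = [("youth", "high"), ("youth", "low"),
        ("senior", "low"), ("senior", "high")]
labels = ["no", "no", "yes", "yes"]
model = train(rows, labels)
print(classify(("senior", "low"), *model))   # -> "yes"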
b. Describe Decision Tree Induction algorithm in detail.
The major steps are as follows:
1. The tree starts as a single root node containing all of the training tuples.
2. If the tuples are all from the same class, then the node becomes a leaf, labeled with
that class.
3. Else, an attribute selection method is called to determine the splitting criterion. Such a
method may use a heuristic or statistical measure (e.g., information gain or Gini
index) to select the "best" way to separate the tuples into individual classes. The
splitting criterion consists of a splitting attribute and may also indicate either a split-
point or a splitting subset, as described below.
4. Next, the node is labeled with the splitting criterion, which serves as a test at the node.
A branch is grown from the node to each of the outcomes of the splitting criterion and
the tuples are partitioned accordingly. There are three possible scenarios for such
partitioning. (1) If the splitting attribute is discrete-valued, then a branch is grown for
each possible value of the attribute. (2) If the splitting attribute, A, is continuous-
valued, then two branches are grown, corresponding to the conditions A <= split point
and A > split point. (3) If the splitting attribute is discrete-valued and a binary tree
must be produced (e.g., if the Gini index was used as a selection measure), then the test
at the node is "A ∈ SA?", where SA is the splitting subset for A. It is a subset of the
known values of A. If a given tuple has value aj of A and aj ∈ SA, then the test at the
node is satisfied. The algorithm recurses to create a decision tree for the tuples at each
partition.
The stopping conditions are:
If all tuples at a given node belong to the same class, then transform that node into a
leaf, labeled with that class.
If there are no more attributes left to create more partitions, then majority voting can
be used to convert the given node into a leaf, labeled with the most common class
among the tuples.
If there are no tuples for a given branch, a leaf is created with the majority class from
the parent node.
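Putting the steps and stopping conditions together, the following compact Python sketch induces a tree on illustrative categorical data, using information gain for attribute selection and majority voting at the leaves. It is a teaching sketch, not ID3/C4.5/CART themselves: it grows one branch per observed attribute value and omits binary splits, continuous attributes, and pruning.

from collections import Counter
from math import log2

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build(rows, labels, attrs):
    if len(set(labels)) == 1:            # all tuples in one class -> leaf
        return labels[0]
    if not attrs:                        # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                         # information gain of attribute a
        parts = {}
        for r, y in zip(rows, labels):
            parts.setdefault(r[a], []).append(y)
        rem = sum(len(p) / len(labels) * info(p) for p in parts.values())
        return info(labels) - rem

    a = max(attrs, key=gain)             # best splitting attribute
    tree = {}
    for v in set(r[a] for r in rows):    # one branch per observed value
        sub = [(r, y) for r, y in zip(rows, labels) if r[a] == v]
        srows, slabels = zip(*sub)
        tree[(a, v)] = build(list(srows), list(slabels),
                             [b for b in attrs if b != a])
    return tree

rows = [("youth", "high"), ("youth", "low"), ("senior", "low")]
print(build(rows, ["no", "no", "yes"], [0, 1]))
# e.g. {(0, 'youth'): 'no', (0, 'senior'): 'yes'}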