
ADITYA COLLEGE OF ENGINEERING


PUNGANUR ROAD, MADANAPALLE-517325
III-B.Tech(R13) II-Sem I-Internal Examinations March-2017 (Descriptive)
13A05603 Datamining (Computer Science & Engineering)
Time: 90 min Max Marks: 30

Part A
(Compulsory)
1. a. Define Data mining and list out its functionalities.
Data mining (sometimes called data or knowledge discovery) is the process of analyzing data
from different perspectives and summarizing it into useful information which helps in
decision making. Technically, data mining is the process of finding correlations or patterns
among dozens of fields in large relational databases.
There are two types of data mining tasks: descriptive data mining tasks, which describe the
general properties of the existing data, and predictive data mining tasks, which attempt to make
predictions based on inference from the available data.

The data mining functionalities and the variety of knowledge they discover are briefly
presented in the following list:
Characterization: Data characterization is a summarization of general features of objects in
a target class, and produces what is called characteristic rules.
Discrimination: Data discrimination produces what are called discriminant rules and is
basically the comparison of the general features of objects between two classes referred to as
the target class and the contrasting class.
Association analysis: Association analysis is the discovery of what are commonly called
association rules.
Classification: Classification analysis is the organization of data in given classes. Also
known as supervised classification, the classification uses given class labels to order the
objects in the data collection.
Prediction: Prediction more often refers to forecasting missing numerical values, or
increase/decrease trends in time-related data.
Clustering: Clustering is the organization of data in classes. However, unlike classification,
in clustering, class labels are unknown and it is up to the clustering algorithm to discover
acceptable classes. Clustering is also called unsupervised classification.
Outlier analysis: Outliers are data elements that cannot be grouped in a given class or
cluster.
Evolution and deviation analysis: Evolution and deviation analysis pertain to the study of
time-related data that changes over time. Evolution analysis models evolutionary trends in data,
which allows characterizing, comparing, classifying or clustering of time-related data.
Deviation analysis, on the other hand, considers differences between measured values and
expected values, and attempts to find the cause of the deviations from the anticipated values.
b. Explain Principal Component analysis in detail.
Principal component analysis (PCA) is a technique used to emphasize variation and bring out
strong patterns in a dataset. It's often used to make data easy to explore and visualize.
Objectives of principal component analysis
To discover or to reduce the dimensionality of the data set.
To identify new meaningful underlying variables.
PCA takes a data matrix of n objects by p variables, which may be correlated, and
summarizes it by uncorrelated axes (principal components or principal axes) that are linear
combinations of the original p variables. The first k components display as much as
possible of the variation among objects.
The objective of PCA is to rigidly rotate the axes of this p-dimensional space to new positions
(principal axes) that have the following properties:
The axes are ordered such that principal axis 1 has the highest variance, axis 2 has the next highest
variance, ..., and axis p has the lowest variance.
The covariance among each pair of the principal axes is zero (the principal axes are
uncorrelated).
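
The rigid rotation described above can be illustrated for the special case p = 2, where the rotation angle has a closed form. This is only a minimal sketch: the data values and the function names `covariance` and `pca_2d` are illustrative assumptions, and real data sets with p > 2 would normally use an eigen-decomposition routine from a numerical library.

```python
import math

def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

def pca_2d(xs, ys):
    """Rotate centered 2-D data onto its principal axes; return both score lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Direction of maximum variance for a symmetric 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * (x - mean_x) + s * (y - mean_y) for x, y in zip(xs, ys)]
    pc2 = [-s * (x - mean_x) + c * (y - mean_y) for x, y in zip(xs, ys)]
    return pc1, pc2

# Hypothetical data: two correlated variables measured on ten objects.
xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

pc1, pc2 = pca_2d(xs, ys)
var1, var2 = covariance(pc1, pc1), covariance(pc2, pc2)
cov12 = covariance(pc1, pc2)
print("variance on axis 1:", var1)
print("variance on axis 2:", var2)
print("covariance between axes:", cov12)
```

Because the transformation is a pure rotation, the total variance (var1 + var2) equals the total variance of the original two variables; only its distribution across the axes changes.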

c. Write a short note on concept hierarchies.


Concept hierarchies define a sequence of mappings from a set of lower-level concepts
to higher-level, more general concepts and can be represented as a set of nodes
organized in a tree, in the form of a lattice, or as a partial order.
They are useful in data mining because they allow the discovery of knowledge at
multiple levels of abstraction and provide the structure on which data can be
generalized (rolled-up) or specialized (drilled-down). Together, these operations allow
users to view the data from different perspectives, gaining further insight into
relationships hidden in the data.
Generalizing has the advantage of compressing the data set, and mining on a
compressed data set will require fewer I/O operations. This will be more efficient than
mining on a large, uncompressed data set.
A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is
usually associated with each dimension in a data warehouse.
Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers.
Concept hierarchies can be automatically formed for both numeric and nominal data.

d. Define classification and list out few applications of it.


Classification is a data mining function that assigns items in a collection to target categories
or classes. The goal of classification is to accurately predict the target class for each case in
the data. For example, a classification model could be used to identify loan applicants as low,
medium, or high credit risks.
Applications of classification include:
Customer Target Marketing: Since the classification problem relates feature variables to
target classes, this method is extremely popular for the problem of customer target
marketing. In such cases, feature variables describing the customer may be used to predict
their buying interests on the basis of previous training examples. The target variable may
encode the buying interest of the customer.
Medical Disease Diagnosis: In recent years, the use of data mining methods in medical
technology has gained increasing traction. The features may be extracted from the medical
records, and the class labels correspond to whether or not a patient may pick up a disease
in the future. In these cases, it is desirable to make disease predictions with the use of such
information.
Supervised Event Detection: In many temporal scenarios, class labels may be associated
with time stamps corresponding to unusual events. For example, an intrusion activity may
be represented as a class label. In such cases, time-series classification methods can be very
useful.
Multimedia Data Analysis: It is often desirable to perform classification of large volumes
of multimedia data such as photos, videos, audio or other more complex multimedia data.
Multimedia data analysis can often be challenging, because of the complexity of the
underlying feature space and the semantic gap between the feature values and corresponding
inferences.
Biological Data Analysis: Biological data is often represented as discrete sequences, in
which it is desirable to predict the properties of particular sequences. In some cases, the
biological data is also expressed in the form of networks. Therefore, classification methods
can be applied in a variety of different ways in this scenario.
e. Explain multi dimensional data model.
The multidimensional data model (MDDM) is a logical design technique used in data warehouses (DW).
It is directly related to OLAP systems.
MDDM is a design technique for databases intended to support end-user queries in a DW.
It is oriented around understandability, as opposed to database administration.
MDDM Terminology
Grain
Fact
Dimension
Cube
Star
Snowflake

Grain
Identifying the grain also means deciding the level of detail that will be made available in
the dimensional model
Granularity is defined as the detailed level of information stored in a table
The more detail there is, the lower the level of granularity.
The less detail there is, the higher the level of granularity.

Facts and Fact Tables


A fact table contains two or more foreign keys.
It generally has a huge number of records.
Useful facts tend to be numeric and additive.

Dimensions/Dimension Tables
The dimensional tables contain attributes (descriptive) which are typically static values
containing textual data or discrete numbers which behave as text values.
Main functionalities:
Query filtering/constraining
Query result set labeling

Star Schema
The basic star schema contains four components.
These are:
Fact table, Dimension tables, Attributes and Dimension hierarchies

Snow Flake Schema


Normalization and expansion of the dimension tables in a star schema result in the
implementation of a snowflake design.
A dimension table is said to be snowflaked when the low-cardinality attributes in the
dimension have been moved to separate normalized tables and these normalized tables are
then joined back to the original dimension table.
Part-B
Unit I,II
2. a. Explain Data mining as a process of knowledge extraction.

Knowledge discovery as a process consists of an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)
4. Data transformation (where data are transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to
extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user)

b. Distinguish between OLAP and OLTP.

(or)
3. a. Write a short note on discretization and binarization.
Discretization is the process of converting a continuous attribute to a discrete attribute. A
common example is rounding off real numbers to integers. Binarization transforms a
continuous or discrete attribute into one or more binary attributes. Some data mining algorithms
require that the data be in the form of categorical or binary attributes. Thus, it is often
necessary to convert continuous attributes into categorical attributes and/or binary
attributes. It is fairly straightforward to convert categorical attributes into discrete or binary
attributes.
Discretization of continuous attributes: transforming a continuous attribute into a
categorical attribute involves
Deciding how many categories to have.
Deciding how to map the values of the continuous attribute to these categories.
A basic distinction between discretization methods for classification is whether class
information is used (supervised) or not (unsupervised).
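
As an illustration of the two decisions above (how many categories, and how to map values to them), the following sketch shows one common unsupervised choice, equal-width binning, followed by a one-binary-attribute-per-category binarization. The function names and the sample values are assumptions made for this example, not part of the syllabus text.

```python
def equal_width_bins(values, k):
    """Unsupervised discretization: map each value to one of k equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: everything in bin 0
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value is placed in the last bin rather than a (k+1)-th one.
    return [min(int((v - lo) / width), k - 1) for v in values]

def one_hot(categories):
    """Binarization: one binary attribute per distinct category."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, rows

ages = [18, 22, 35, 41, 58, 64, 70]          # hypothetical continuous attribute
age_bins = equal_width_bins(ages, 3)
print(age_bins)                               # [0, 0, 0, 1, 2, 2, 2]

levels, encoded = one_hot(["low", "high", "medium", "low"])
print(levels)                                 # ['high', 'low', 'medium']
print(encoded)                                # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

A supervised method would instead choose the bin boundaries using the class labels, for example by picking the split points that minimize entropy.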
b. Describe various OLAP operations in detail.
OLAP operations:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)

Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
The following diagram illustrates how roll-up works.

Roll-up is performed by climbing up a concept hierarchy for the dimension location.
Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the level of
city to the level of country.
The data is grouped into countries rather than cities.
When roll-up is performed by dimension reduction, one or more dimensions are removed from the data cube.
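
Both forms of roll-up can be sketched on a tiny in-memory cube. The sales figures, the `CITY_TO_COUNTRY` mapping, and the function names below are hypothetical and only serve to show the aggregation; an OLAP server would do the same with its own storage and query language.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, quarter, item, units_sold).
sales = [
    ("Toronto", "Q1", "Mobile", 605),
    ("Vancouver", "Q1", "Mobile", 825),
    ("Chicago", "Q1", "Mobile", 440),
    ("Toronto", "Q1", "Modem", 14),
    ("New York", "Q2", "Modem", 31),
]

# One step of the location concept hierarchy: city < country.
CITY_TO_COUNTRY = {
    "Toronto": "Canada", "Vancouver": "Canada",
    "Chicago": "USA", "New York": "USA",
}

def roll_up_location(rows):
    """Climb the location hierarchy: aggregate city-level cells to country level."""
    cube = defaultdict(int)
    for city, quarter, item, units in rows:
        cube[(CITY_TO_COUNTRY[city], quarter, item)] += units
    return dict(cube)

def roll_up_drop_item(rows):
    """Dimension reduction: aggregate away the item dimension entirely."""
    cube = defaultdict(int)
    for city, quarter, _item, units in rows:
        cube[(city, quarter)] += units
    return dict(cube)

by_country = roll_up_location(sales)
without_item = roll_up_drop_item(sales)
print(by_country[("Canada", "Q1", "Mobile")])   # 1430
print(without_item[("Toronto", "Q1")])          # 619
```

Drill-down is the inverse navigation: it requires the finer-grained rows (here, the city-level `sales` list) to still be available.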

Drill-down

Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:
By stepping down a concept hierarchy for a dimension
By introducing a new dimension.

The following diagram illustrates how drill-down works:

Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year". On drilling down, the
time dimension is descended from the level of quarter to the level of month. When drill-down
is performed by introducing a new dimension, one or more dimensions are added to the data cube.
Drill-down navigates from less detailed data to more detailed data.

Slice
The slice operation performs a selection on one dimension of a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.

Here slice is performed for the dimension "time" using the criterion time = "Q1".
It forms a new sub-cube by fixing that one dimension to a single value.

Dice

Dice performs a selection on two or more dimensions of a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.

The dice operation on the cube based on the following selection criteria involves three
dimensions.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item = "Mobile" or "Modem")

Pivot

The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows the
pivot operation.

Here, the item and location axes of the 2-D slice are rotated.
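
Slice, dice and pivot can be sketched on a small cube stored as a dictionary keyed by (location, time, item). The cell values and function names are assumptions chosen for illustration; the selection criteria mirror the ones quoted above.

```python
# Hypothetical cube cells keyed by (location, time, item) -> units sold.
cube = {
    ("Toronto", "Q1", "Mobile"): 605,
    ("Toronto", "Q2", "Modem"): 14,
    ("Vancouver", "Q1", "Mobile"): 825,
    ("Vancouver", "Q3", "Phone"): 50,
    ("Chicago", "Q1", "Modem"): 38,
}

def slice_cube(cells, time):
    """Slice: fix the time dimension to one value; that dimension drops out."""
    return {(loc, item): v for (loc, t, item), v in cells.items() if t == time}

def dice_cube(cells, locations, times, items):
    """Dice: select on several dimensions; the result keeps all dimensions."""
    return {
        key: v for key, v in cells.items()
        if key[0] in locations and key[1] in times and key[2] in items
    }

def pivot(slice_2d):
    """Pivot: swap the two axes of a 2-D slice without changing any value."""
    return {(b, a): v for (a, b), v in slice_2d.items()}

q1 = slice_cube(cube, "Q1")
diced = dice_cube(cube, {"Toronto", "Vancouver"}, {"Q1", "Q2"}, {"Mobile", "Modem"})
rotated = pivot(q1)

print(q1)       # cells for Q1, keyed by (location, item)
print(diced)    # 3-D sub-cube matching all three criteria
print(rotated)  # same Q1 cells, keyed by (item, location)
```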

c. Describe the general approach for a classification process.


A classification technique (or classifier) is a systematic approach to building classification
models from an input data set. Examples include decision tree classifiers, rule-based
classifiers, neural networks, support vector machines, and naive Bayes classifiers. Each
technique employs a learning algorithm to identify a model that best fits the relationship
between the attribute set and class label of the input data. The model generated by a learning
algorithm should both fit the input data well and correctly predict the class labels of records
it has never seen before. Therefore, a key objective of the learning algorithm is to build
models with good generalization capability, i.e., models that accurately predict the class
labels of previously unknown records.

First, a training set consisting of records whose class labels are known must be provided. The
training set is used to build a classification model, which is subsequently applied to the test
set, which consists of records with unknown class labels.
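
The train-then-apply workflow can be sketched with a deliberately simple learner. The example below uses a 1-nearest-neighbour classifier purely as a stand-in; the records, labels and function names are hypothetical, and the test labels are kept aside only so that accuracy can be measured afterwards.

```python
import math

def fit(training_set):
    """'Learning' for 1-nearest-neighbour is just storing the labeled records."""
    return list(training_set)

def predict(model, features):
    """Assign the class label of the closest stored training record."""
    _, label = min(model, key=lambda rec: math.dist(rec[0], features))
    return label

def accuracy(model, labeled_records):
    """Fraction of records whose predicted label matches the held-back label."""
    hits = sum(predict(model, x) == y for x, y in labeled_records)
    return hits / len(labeled_records)

# Hypothetical records: (attribute vector, class label).
training_set = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"),
    ((5.0, 5.5), "high"), ((6.0, 5.0), "high"),
]
test_set = [((1.2, 1.4), "low"), ((5.5, 5.2), "high")]

model = fit(training_set)                   # induction: build the model
predictions = [predict(model, x) for x, _ in test_set]   # deduction: apply it
print(predictions)                          # ['low', 'high']
print(accuracy(model, test_set))            # 1.0
```

Measuring accuracy on records that were not used for training is what gives an estimate of the generalization capability mentioned above.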

Unit II, III


4. a. What are the characteristics of rule based classification. Describe Bayesian Belief
Networks in detail.
Rule based classifier characteristics:
Mutually exclusive rules: A classifier contains mutually exclusive rules if the rules are independent of each other.
Every record is covered by at most one rule.
Exhaustive rules: A classifier has exhaustive coverage if it accounts for every possible combination of attribute
values.
Each record is covered by at least one rule.
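
Both properties can be checked mechanically by enumerating every combination of attribute values and counting how many rules fire. The rule set, attribute names and domains below are invented for illustration.

```python
from itertools import product

# Hypothetical rule set over two attributes; each rule is (condition, class).
RULES = [
    (lambda r: r["gives_birth"] == "yes", "mammal"),
    (lambda r: r["gives_birth"] == "no" and r["can_fly"] == "yes", "bird"),
    (lambda r: r["gives_birth"] == "no" and r["can_fly"] == "no", "reptile"),
]
DOMAINS = {"gives_birth": ["yes", "no"], "can_fly": ["yes", "no"]}

def coverage_counts(rules, domains):
    """For every combination of attribute values, count how many rules fire."""
    names = list(domains)
    counts = []
    for values in product(*(domains[n] for n in names)):
        record = dict(zip(names, values))
        counts.append(sum(1 for condition, _ in rules if condition(record)))
    return counts

counts = coverage_counts(RULES, DOMAINS)
mutually_exclusive = all(c <= 1 for c in counts)   # at most one rule per record
exhaustive = all(c >= 1 for c in counts)           # at least one rule per record
print(counts, mutually_exclusive, exhaustive)      # [1, 1, 1, 1] True True
```

If the third rule were removed, the combination (gives_birth = no, can_fly = no) would be covered by no rule and the set would no longer be exhaustive.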
Bayesian Belief Network
Belief networks represent the full joint distribution over the variables more compactly, with a
smaller number of parameters. They take advantage of conditional and marginal independences
among random variables:
A and B are independent: P(A, B) = P(A) P(B)
A and B are conditionally independent given C: P(A, B | C) = P(A | C) P(B | C),
or equivalently P(A | C, B) = P(A | C)
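
A small numerical sketch may make the compactness argument concrete. It assumes a three-node network in which C is the common parent of A and B, so the joint factorizes as P(A, B, C) = P(C) P(A | C) P(B | C); all probability values and names are made up for the example.

```python
from itertools import product

# Hypothetical conditional probability tables for the network A <- C -> B.
P_C = {True: 0.3, False: 0.7}
P_A_GIVEN_C = {True: 0.9, False: 0.2}   # P(A = true | C)
P_B_GIVEN_C = {True: 0.8, False: 0.1}   # P(B = true | C)

def bernoulli(p_true, value):
    """Probability of a boolean outcome given P(outcome = true)."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c):
    """Factored joint: P(A, B, C) = P(C) P(A | C) P(B | C)."""
    return P_C[c] * bernoulli(P_A_GIVEN_C[c], a) * bernoulli(P_B_GIVEN_C[c], b)

def prob(**fixed):
    """Marginal probability of an assignment to any subset of {a, b, c}."""
    total = 0.0
    for a, b, c in product([True, False], repeat=3):
        world = {"a": a, "b": b, "c": c}
        if all(world[k] == v for k, v in fixed.items()):
            total += joint(a, b, c)
    return total

# Conditional independence given C: P(A, B | C) equals P(A | C) P(B | C).
p_ab_given_c = prob(a=True, b=True, c=True) / prob(c=True)
print(p_ab_given_c)                                   # close to 0.9 * 0.8 = 0.72

# A and B are NOT marginally independent in this network.
print(prob(a=True, b=True), prob(a=True) * prob(b=True))
```

The network needs 5 numbers (1 for C, 2 each for A and B), whereas an unrestricted joint distribution over three binary variables needs 7.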

b. What are the measures of selecting the splitting attribute. Explain.


There are many measures that can be used to determine the best way to split the records.
These measures are defined in terms of the class distribution of the records before and after
splitting. An attribute selection measure is a heuristic for selecting the splitting criterion that
best separates a given data partition, D, of class-labeled training tuples into individual
classes. Attribute selection measures are also known as splitting rules because they determine
how the tuples at a given node are to be split. The attribute selection measure provides a
ranking for each attribute describing the given training tuples. The attribute having the best
score for the measure is chosen as the splitting attribute for the given tuples. If the splitting
attribute is continuous-valued or if we are restricted to binary trees then, respectively, either a
split point or a splitting subset must also be determined as part of the splitting criterion. The
tree node created for partition D is labeled with the splitting criterion, branches are grown for
each outcome of the criterion, and the tuples are partitioned accordingly. The three popular
attribute selection measures are information gain, gain ratio, and Gini index.

Information gain
ID3 uses information gain as its attribute selection measure. Let node N represent or hold
the tuples of partition D. The attribute with the highest information gain is chosen as the
splitting attribute for node N. This attribute minimizes the information needed to classify the
tuples in the resulting partitions and reflects the least randomness or impurity in these
partitions. Such an approach minimizes the expected number of tests needed to classify a
given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
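The expected information needed to classify a tuple in D is given by

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where m is the number of classes and p_i is the probability that an arbitrary tuple in D belongs to class C_i, estimated
by |C_{i,D}| / |D|, the fraction of tuples in D that belong to C_i.
If attribute A partitions D into v subsets D_1, ..., D_v, the information still needed after the split and the
resulting information gain are

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j), \qquad \mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$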

Gain ratio
The information gain measure is biased toward tests with many outcomes. That is, it
prefers to select attributes having a large number of values. For example, consider an
attribute that acts as a unique identifier, such as product ID. A split on product ID would
result in a large number of partitions (as many as there are values), each one containing
just one tuple. Because each partition is pure, the information required to classify data set
D based on this partitioning would be Info_{product ID}(D) = 0. Therefore, the information
gained by partitioning on this attribute is maximal. Clearly, such a partitioning is useless
for classification.
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio,
which attempts to overcome this bias. It applies a kind of normalization to information
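gain using a split information value defined analogously with Info(D) as

$$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right)$$

where D_1, ..., D_v are the partitions of D produced by the v outcomes of a test on attribute A.
The gain ratio is defined as

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}$$

where Gain(A) is the information gain of A. The attribute with the maximum gain ratio is
selected as the splitting attribute.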
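
The bias of information gain toward identifier-like attributes, and the correction applied by the gain ratio, can be seen on a four-tuple toy data set. The attribute names, rows and function names are hypothetical; the formulas are the standard entropy-based definitions used by ID3 and C4.5.

```python
import math
from collections import Counter

def info(labels):
    """Info(D): entropy, in bits, of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def partition(rows, labels, attr):
    """Group the class labels by the value each row takes on `attr`."""
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    return list(parts.values())

def gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    n = len(labels)
    info_a = sum(len(p) / n * info(p) for p in partition(rows, labels, attr))
    return info(labels) - info_a

def split_info(rows, labels, attr):
    """SplitInfo_A(D): entropy of the partition sizes themselves."""
    n = len(labels)
    return -sum(len(p) / n * math.log2(len(p) / n)
                for p in partition(rows, labels, attr))

def gain_ratio(rows, labels, attr):
    # Assumes the attribute takes at least two values, so split_info > 0.
    return gain(rows, labels, attr) / split_info(rows, labels, attr)

rows = [
    {"product_id": 1, "student": "yes"},
    {"product_id": 2, "student": "yes"},
    {"product_id": 3, "student": "no"},
    {"product_id": 4, "student": "no"},
]
labels = ["buys", "buys", "skips", "skips"]

print(gain(rows, labels, "product_id"), gain(rows, labels, "student"))              # 1.0 1.0
print(gain_ratio(rows, labels, "product_id"), gain_ratio(rows, labels, "student"))  # 0.5 1.0
```

Both attributes achieve the maximal information gain here, but the gain ratio penalizes the four-way split on the identifier and prefers the two-way split on student.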

Gini index
The Gini index is used in CART. Using the notation described above, the Gini index
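measures the impurity of D, a data partition or set of training tuples, as

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2$$

where p_i is the probability that a tuple in D belongs to class C_i.
When considering a binary split, we compute a weighted sum of the impurity of each
resulting partition. For example, if a binary split on A partitions D into D1 and D2, the
gini index of D given that partitioning is

$$\mathrm{Gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

The reduction in impurity that would be incurred by a binary split on a discrete- or
continuous-valued attribute A is

$$\Delta\mathrm{Gini}(A) = \mathrm{Gini}(D) - \mathrm{Gini}_A(D)$$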
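
A short numerical sketch of the Gini computations for one binary split follows. The class labels in the two partitions are invented for the example, and the function names are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Gini_A(D): size-weighted impurity of a binary split into D1 and D2."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def gini_reduction(left, right):
    """Reduction in impurity: Gini(D) - Gini_A(D)."""
    return gini(left + right) - gini_split(left, right)

# Hypothetical class labels of the two partitions produced by a binary split.
d1 = ["yes", "yes", "yes", "no"]
d2 = ["no", "no", "no", "no"]

print(gini(d1 + d2))           # 0.46875  (3 "yes" and 5 "no" out of 8)
print(gini_split(d1, d2))      # 0.1875
print(gini_reduction(d1, d2))  # 0.28125
```

The split with the largest reduction in impurity (equivalently, the smallest Gini_A(D)) would be chosen as the splitting criterion.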

c. Write a short note on Overfitting.


Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training data to the
extent that it negatively impacts the performance of the model on new data. This means that
the noise or random fluctuations in the training data are picked up and learned as concepts by
the model. The problem is that these concepts do not apply to new data and negatively
impact the model's ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more
flexibility when learning a target function. As such, many nonparametric machine learning
algorithms also include parameters or techniques to limit and constrain how much detail the
model learns.
For example, decision trees are a nonparametric machine learning algorithm that is very
flexible and is subject to overfitting training data. This problem can be addressed by pruning
a tree after it has learned in order to remove some of the detail it has picked up.
(or)
5. a. Explain about tree pruning.
The decision tree built may overfit the training data. There could be too many branches,
some of which may reflect anomalies in the training data due to noise or outliers. Tree
pruning addresses this issue of overfitting the data by removing the least reliable branches
(using statistical measures). This generally results in a more compact and reliable decision
tree that is faster and more accurate in its classification of data. Some pruning methods
evaluate the pruned tree on a separate set of tuples (a pruning set) rather than on the
training tuples. The drawback of using a
separate set of tuples to evaluate pruning is that it may not be representative of the training
tuples used to create the original decision tree. If the separate set of tuples is skewed, then
using it to evaluate the pruned tree would not be a good indicator of the pruned tree's
classification accuracy. Furthermore, using a separate set of tuples to evaluate pruning means
there are fewer tuples to use for creation and testing of the tree. While this is considered a
drawback in machine learning, it may not be so in data mining due to the availability of
larger data sets.
b. Describe Decision Tree Induction algorithm in detail.
The major steps are as follows:
1. The tree starts as a single root node containing all of the training tuples.
2. If the tuples are all from the same class, then the node becomes a leaf, labeled with
that class.
3. Else, an attribute selection method is called to determine the splitting criterion. Such a
method may use a heuristic or statistical measure (e.g., information gain or gini
index) to select the "best" way to separate the tuples into individual classes. The
splitting criterion consists of a splitting attribute and may also indicate either a
split-point or a splitting subset, as described below.
4. Next, the node is labeled with the splitting criterion, which serves as a test at the node.
A branch is grown from the node to each of the outcomes of the splitting criterion and
the tuples are partitioned accordingly. There are three possible scenarios for such
partitioning. (1) If the splitting attribute is discrete-valued, then a branch is grown for
each possible value of the attribute. (2) If the splitting attribute, A, is
continuous-valued, then two branches are grown, corresponding to the conditions A <= split point
and A > split point. (3) If the splitting attribute is discrete-valued and a binary tree
must be produced (e.g., if the gini index was used as a selection measure), then the test
at the node is "A ∈ S_A?", where S_A is the splitting subset for A. It is a subset of the
known values of A. If a given tuple has value a_j of A and if a_j ∈ S_A, then the test at the
node is satisfied. The algorithm recurses to create a decision tree for the tuples at each
partition.
The stopping conditions are:
If all tuples at a given node belong to the same class, then transform that node into a
leaf, labeled with that class.
If there are no more attributes left to create more partitions, then majority voting can
be used to convert the given node into a leaf, labeled with the most common class
among the tuples.
If there are no tuples for a given branch, a leaf is created with the majority class from
the parent node.
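
The steps and stopping conditions above can be sketched as a short recursive procedure. This is a simplified illustration, not the full algorithm: it handles only discrete-valued attributes with one branch per observed value (scenario 1), uses information gain as the selection measure, and handles the "no tuples for a branch" case at classification time by falling back to the parent's majority class. All names and the toy data are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Attribute selection method: highest information gain."""
    def gain(attr):
        parts = {}
        for row, label in zip(rows, labels):
            parts.setdefault(row[attr], []).append(label)
        remainder = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    """Return a class label (leaf) or (attribute, {value: subtree}, majority)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:        # all tuples in the same class: leaf
        return labels[0]
    if not attributes:               # no attributes left: majority voting
        return majority
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in {row[attr] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows, sub_labels = zip(*subset)
        branches[value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return (attr, branches, majority)

def classify(tree, row):
    while isinstance(tree, tuple):
        attr, branches, majority = tree
        # A value with no training tuples gets the parent's majority class.
        tree = branches.get(row[attr], majority)
    return tree

rows = [
    {"student": "yes", "credit": "fair"},
    {"student": "yes", "credit": "excellent"},
    {"student": "no", "credit": "fair"},
    {"student": "no", "credit": "excellent"},
]
labels = ["buys", "buys", "skips", "skips"]

tree = build_tree(rows, labels, ["credit", "student"])
print(tree[0], tree[1])   # splitting attribute at the root and its branches
print(classify(tree, {"student": "no", "credit": "fair"}))   # skips
```

In this toy data set the attribute student separates the classes perfectly, so it is selected at the root and both branches become leaves immediately.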

ADITYA COLLEGE OF ENGINEERING


PUNGANUR ROAD, MADANAPALLE-517325
III-B.Tech(R13) II-Sem I-Internal Examinations March-2017 (Objective)
13A05603 Datamining (Computer Science & Engineering)

Name : Roll No. :


Time: 20 min Max Marks: 10

Answer all the questions 5 × 1 = 5M


1. List any two applications of datamining.
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
Intrusion Detection
2. Define outliers and state their role in clustering.
Outliers are data points that do not fall into any cluster; they are often discarded as noise.
After clustering, the different clusters represent the
different kinds of data (transactions). Among the various kinds of clustering methods, density-based clustering
may be the most effective for identifying outliers.

3. What is a data warehouse and data mart?


The data mart is a subset of the data warehouse and is usually oriented to a specific business
line or team. Whereas data warehouses have an enterprise-wide depth, the information in
data marts pertains to a single department. ... Each data mart is dedicated to a specific
business function or region.

4. What are the characteristics of Rule based Classifier.


Mutually exclusive rules: A classifier contains mutually exclusive rules if the rules are independent of each other.
Every record is covered by at most one rule.
Exhaustive rules: A classifier has exhaustive coverage if it accounts for every possible combination of
attribute values.
Each record is covered by at least one rule.
5. Name some techniques for handling missing data.
Ignore the tuple: usually done when class label is missing.
Use the attribute mean (or majority nominal value) to fill in the missing value.
Use the attribute mean (or majority nominal value) for all samples belonging to the
same class.
Predict the missing value by using a learning algorithm: consider the attribute with
the missing value as a dependent (class) variable and run a learning algorithm
(usually Bayes or decision tree) to predict the missing value.
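
The two mean-based strategies listed above can be sketched as follows. The records, the attribute (income) and the function names are hypothetical; `None` is used here to mark a missing value.

```python
from statistics import mean

# Hypothetical records: (income, class_label); None marks a missing value.
records = [
    (30.0, "low"), (None, "low"), (50.0, "low"),
    (90.0, "high"), (None, "high"), (110.0, "high"),
]

def fill_with_global_mean(rows):
    """Replace each missing value with the attribute mean over all tuples."""
    m = mean(v for v, _ in rows if v is not None)
    return [(m if v is None else v, label) for v, label in rows]

def fill_with_class_mean(rows):
    """Replace each missing value with the mean of tuples in the same class."""
    class_means = {
        label: mean(v for v, lbl in rows if lbl == label and v is not None)
        for label in {lbl for _, lbl in rows}
    }
    return [(class_means[label] if v is None else v, label) for v, label in rows]

global_filled = fill_with_global_mean(records)
class_filled = fill_with_class_mean(records)
print(global_filled[1], global_filled[4])   # (70.0, 'low') (70.0, 'high')
print(class_filled[1], class_filled[4])     # (40.0, 'low') (100.0, 'high')
```

The class-conditional mean usually distorts the data less, because the filled-in value stays typical for the tuple's own class.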

Choose the correct answer from the following questions. 10 × 1/2 = 5M


1. OLAP system adopts which type of Data Model? [ A,B ]
a) Star b) Fact constellation c) Drill Down d) none
2. Which of the following is not OLAP operation? [ NONE ]
a) Roll-up b) Roll-down c) Slice d) Pivot
3. Binning Method is used for? [ A ]
a) Noisy data b) missing data c) a&b d) none
4. Outliers may be detected by which method? [ D ]
a) Reduction b) Aggregation c) transformation d) cluster analysis
5. Concept Hierarchies represents__? [ A ]
a) Background Knowledge b) Classification c) Prediction d) Association
6. Extracting knowledge from large amount of data is called _____. [ B ]
a) Warehousing b) data mining c) database d) cluster
7. A HOLAP server combines _____________ . [ C ]
a) ROLAP b) MOLAP c) a & b d) none
8. The output of KDD is ............. [ D ]
a) Data b) Information c) Query d) Useful information
9. Smoothing techniques are ________________ [ A ]
a) binning b) aggregation c) Normalization d) None
10. ...... is a summarization of the general characteristics or features of a target class of data. [ A ]
a) Data Characterization b) Data Classification c) Data discrimination d) Data selection