

ADITYA COLLEGE OF ENGINEERING


PUNGANUR ROAD, MADANAPALLE-517325
III-B.Tech(R13) II-Sem I-Internal Examinations March-2017 (Descriptive)
13A05603 Data Mining (Computer Science & Engineering)
Time: 90 min Max Marks: 30

Part A
(Compulsory)
1. a. Define Data mining and list out its functionalities.
Data mining refers to the process or method that extracts or "mines" interesting knowledge or
patterns from large amounts of data.
Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of
patterns to be mined, there are two categories of functions involved in data mining:
Descriptive Tasks and Predictive Tasks.
Descriptive Tasks
The descriptive task deals with the general properties of data in the database. Here is the list
of descriptive functions
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Predictive Tasks
Classification
Prediction
Outlier Analysis
Evolution Analysis

b. Write a short note on dimensionality reduction techniques.


Dimensionality reduction uses encoding mechanisms to reduce the data
set size. In dimensionality reduction, data encoding or transformations are applied so as to
obtain a reduced or compressed representation of the original data. If the original data can
be reconstructed from the compressed data without any loss of information, the data
reduction is called lossless. If, instead, we can reconstruct only an approximation of
the original data, then the data reduction is called lossy. Two popular and effective methods
of lossy dimensionality reduction are wavelet transforms and principal components analysis.
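As a hedged illustration (not part of the original answer; the data matrix and number of components are assumed), principal components analysis can be sketched in Python with scikit-learn:

# Hypothetical sketch of PCA-based (lossy) dimensionality reduction using scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 8)                   # assumed data set: 100 tuples, 8 attributes
pca = PCA(n_components=3)                    # keep 3 principal components
X_reduced = pca.fit_transform(X)             # compressed representation of the original data
X_approx = pca.inverse_transform(X_reduced)  # only an approximation can be reconstructed (lossy)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())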
c. Write a short note on concept hierarchies.
Concept hierarchies define a sequence of mappings from a set of lower-level concepts
to higher-level, more general concepts and can be represented as a set of nodes
organized in a tree, in the form of a lattice, or as a partial order.
B
They are useful in data mining because they allow the discovery of knowledge at
multiple levels of abstraction and provide the structure on which data can be
generalized (rolled-up) or specialized (drilled-down). Together, these operations allow
users to view the data from different perspectives, gaining further insight into
relationships hidden in the data.
Generalizing has the advantage of compressing the data set, and mining on a
compressed data set will require fewer I/O operations. This will be more efficient than
mining on a large, uncompressed data set.
Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is
usually associated with each dimension in a data warehouse
Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers
Concept hierarchy can be automatically formed for both numeric and nominal data.
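For illustration only (the mapping and figures below are assumed, not taken from the question paper), rolling sales values up a location concept hierarchy from city to country might look like:

# Hypothetical location hierarchy: street < city < country (values assumed for illustration).
city_to_country = {"Chicago": "USA", "New York": "USA", "Toronto": "Canada", "Vancouver": "Canada"}
sales_by_city = {"Chicago": 440, "New York": 1560, "Toronto": 395, "Vancouver": 605}

rolled_up = {}
for city, amount in sales_by_city.items():
    country = city_to_country[city]                            # generalize city -> country
    rolled_up[country] = rolled_up.get(country, 0) + amount    # aggregate while rolling up
print(rolled_up)   # {'USA': 2000, 'Canada': 1000}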

d. Write Hunt's algorithm.


In Hunt's algorithm, a decision tree is grown recursively. Let Dt be the set of training records
associated with node t. If all records in Dt belong to the same class, t is declared a leaf node
labeled with that class (Step 1); otherwise an attribute test condition is selected to partition the
records, a child node is created for each outcome of the test, the records in Dt are distributed to
the children, and the procedure is applied recursively to each child (Step 2).
Hunt's algorithm will work if every combination of attribute values is present in the training
data and each combination has a unique class label.
These assumptions are too stringent for use in most practical situations. Additional
conditions are needed to handle the following cases:
1. It is possible for some of the child nodes created in Step 2 to be empty: i.e., there are no
records associated with these nodes. This can happen if none of the training records have the
combination of attribute values associated with such nodes. In this case the node is declared
a leaf node with the same class label as the majority class of training records associated with
its parent node.
2. In Step 2, if all the records associated with Dt have identical attribute values (except for
the class label), then it is not possible to split these records any further. In this case, the node
is declared a leaf node with the same class label as the majority class of training records
associated with this node.
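The recursion can be illustrated with a rough Python sketch (an assumed outline, not the textbook pseudocode; choose_best_attribute is a placeholder helper):

# Rough, hypothetical sketch of Hunt's recursive tree-growing procedure.
from collections import Counter

def choose_best_attribute(records, attributes):
    # Placeholder: a real implementation would score attributes with information gain or the Gini index.
    return attributes[0]

def hunt(records, attributes, parent_majority=None):
    if not records:                                   # empty child node (case 1): parent's majority class
        return {"leaf": parent_majority}
    labels = [r["class"] for r in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:       # pure node, or nothing left to split on (case 2)
        return {"leaf": majority}
    attr = choose_best_attribute(records, attributes)
    node = {"test": attr, "children": {}}
    for value in {r[attr] for r in records}:          # one branch per observed attribute value
        subset = [r for r in records if r[attr] == value]
        node["children"][value] = hunt(subset, [a for a in attributes if a != attr], majority)
    return node

records = [{"refund": "yes", "class": "no"}, {"refund": "no", "class": "yes"}]
print(hunt(records, ["refund"]))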
e. Explain OLAP operations.
OLAP operations:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:
By climbing up a concept hierarchy for a dimension
By dimension reduction
The following diagram illustrates how roll-up works.

Roll-up is performed by climbing up a concept hierarchy for the dimension location.


Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the
level of city to the level of country.
The data is grouped into countries rather than cities.
When roll-up is performed, one or more dimensions from the data cube are removed.

Drill-down

Drill-down is the reverse operation of roll-up. It is performed by either of the


following ways:
By stepping down a concept hierarchy for a dimension
By introducing a new dimension.

The following diagram illustrates how drill-down works:



Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year." On drilling down,
the time dimension is descended from the level of quarter to the level of month. When
drill-down is performed, one or more dimensions are added to the data cube. It
navigates from less detailed data to more detailed data.

Slice

The slice operation selects one particular dimension from a given cube and provides a
new sub-cube. Consider the following diagram that shows how slice works.

Here slice is performed for the dimension "time" using the criterion time = "Q1".
It forms a new sub-cube by selecting a single value along one dimension.

Dice

Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.

The dice operation on the cube based on the following selection criteria involves three
dimensions.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item = "Mobile" or "Modem")

Pivot

The pivot operation is also known as rotation. It rotates the data axes in view in order
to provide an alternative presentation of data. Consider the following diagram that
shows the pivot operation.

Here, the item and location axes of the 2-D slice are rotated.
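As an illustrative sketch (the cube contents and column names are assumed, and pandas stands in for an OLAP engine), the five operations could be approximated as:

# Illustrative sketch of OLAP-style operations with pandas (data and column names assumed).
import pandas as pd

cube = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "Chicago"],
    "time":     ["Q1", "Q2", "Q1", "Q1"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "sales":    [605, 925, 395, 440],
})

roll_up  = cube.groupby("location")["sales"].sum()            # roll-up: aggregate away time and item
drill_dn = cube.groupby(["location", "time"])["sales"].sum()  # drill-down: add a dimension back for detail
slice_q1 = cube[cube["time"] == "Q1"]                         # slice: one value on one dimension
dice     = cube[cube["location"].isin(["Toronto", "Vancouver"]) & cube["time"].isin(["Q1", "Q2"])]
pivoted  = cube.pivot_table(values="sales", index="item", columns="location", aggfunc="sum")  # rotate axes
print(roll_up, slice_q1, pivoted, sep="\n")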
Part-B
Unit I,II
2. a. Explain Data mining as a process of knowledge extraction.

Knowledge discovery as a process consists of an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate
for mining by performing summary or aggregation operations, for instance)
5. Data mining (an essential process where intelligent methods are applied in order to
extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present the mined knowledge to the user)

b. Distinguish between OLAP and OLTP.

OLTP (on-line transaction processing) systems handle the day-to-day operations of an
organization: many short update transactions issued by clerks and front-line users, over
current, detailed, application-oriented data stored in normalized (entity-relationship) schemas,
with performance measured in transactions per second.
OLAP (on-line analytical processing) systems support analysis and decision making: mostly
read-only, complex queries issued by managers and analysts, over historical, summarized,
subject-oriented data organized in multidimensional (star or snowflake) schemas, with
performance measured by query response time.

(or)
3. a. Describe Multi dimensional data model in detail.
The Multi-Dimensional Data Model (MDDM) is a logical design technique used in Data Warehouses
(DW). It is quite directly related to OLAP systems.
MDDM is a design technique for databases intended to support end-user queries in a
data warehouse.
It is oriented around understandability, as opposed to database administration.
MDDM Terminology
Grain
Fact
Dimension
Cube
Star
Snowflake

Grain
Identifying the grain also means deciding the level of detail that will be made
available in the dimensional model
Granularity is defined as the detailed level of information stored in a table
The more detailed the data, the lower the level of granularity.
The less detailed the data, the higher the level of granularity.

Facts and Fact Tables


Consists of at least two or more foreign keys
Generally has huge numbers of records
Useful facts tend to be numeric and additive

Dimensions/Dimension Tables
The dimensional tables contain attributes (descriptive) which are typically static
values containing textual data or discrete numbers which behave as text values.
Main functionalities :
Query filtering / constraining
Query result set labeling

Star Schema
The basic star schema contains four components.
These are:
Fact table, Dimension tables, Attributes and Dimension hierarchies

Snowflake Schema


Normalization and expansion of the dimension tables in a star schema result in the
implementation of a snowflake design.
A dimension table is said to be snowflaked when the low-cardinality attributes in the
dimension have been removed to separate normalized tables, and these normalized
tables are then joined back into the original dimension table.
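A minimal sketch, with assumed table contents, of how a fact table references its dimension tables and is queried by joining and aggregating (pandas stands in for the warehouse database):

# Sketch of a star schema: one fact table with foreign keys into two dimension tables.
import pandas as pd

dim_item = pd.DataFrame({"item_key": [1, 2], "item_name": ["Mobile", "Modem"]})
dim_location = pd.DataFrame({"location_key": [10, 20], "city": ["Toronto", "Vancouver"]})
fact_sales = pd.DataFrame({
    "item_key": [1, 2, 1],          # foreign key to dim_item
    "location_key": [10, 10, 20],   # foreign key to dim_location
    "units_sold": [5, 3, 7],        # numeric, additive fact
})

# Dimensional query: join facts to dimensions, then filter and aggregate.
joined = fact_sales.merge(dim_item, on="item_key").merge(dim_location, on="location_key")
print(joined.groupby(["city", "item_name"])["units_sold"].sum())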

b. Describe various applications of data mining.


Data mining is widely used in diverse areas. There are a number of commercial data mining
systems available today, and yet there are many challenges in this field.

Financial Data Analysis

The financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining.

Retail Industry

Data mining has great application in the retail industry because it collects large amounts of
data on sales, customer purchasing history, goods transportation, consumption, and
services. Data mining in the retail industry helps in identifying customer buying patterns and
trends that lead to improved quality of customer service and good customer retention and
satisfaction.

Telecommunication Industry

Data mining in the telecommunication industry helps in identifying telecommunication
patterns, catching fraudulent activities, making better use of resources, and improving
quality of service.
Biological Data Analysis

In recent times, we have seen a tremendous growth in the field of biology such as genomics,
proteomics, functional Genomics and biomedical research. Biological data mining is a very
important part of Bioinformatics.

Other Scientific Applications

The applications discussed above tend to handle relatively small and homogeneous data sets
for which statistical techniques are appropriate. Huge amounts of data have been collected
from scientific domains such as geosciences, astronomy, etc. Large data sets are also being
generated by fast numerical simulations in various fields such as climate and ecosystem
modeling, chemical engineering, fluid dynamics, etc.

Intrusion Detection

Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a
major issue. The increased usage of the internet and the availability of tools and tricks for
intruding into and attacking networks have prompted intrusion detection to become a critical
component of network administration.
Unit II, III
4. a. What are the characteristics of rule-based classification?
Mutually exclusive rules
Classifier contains mutually exclusive rules if the rules are independent of each other
Every record is covered by at most one rule
Exhaustive rules
Classifier has exhaustive coverage if it accounts for every possible combination of
attribute values
Each record is covered by at least one rule.
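A small hypothetical check of these two properties on an assumed rule set and records (illustrative only, not from the original answer):

# Hypothetical rule set; checks mutual exclusivity and exhaustiveness over some records.
rules = [
    lambda r: r["outlook"] == "sunny",     # rule 1 condition
    lambda r: r["outlook"] == "rainy",     # rule 2 condition
]
records = [{"outlook": "sunny"}, {"outlook": "rainy"}, {"outlook": "overcast"}]

coverage = [sum(rule(r) for rule in rules) for r in records]   # rules triggered per record
mutually_exclusive = all(c <= 1 for c in coverage)   # at most one rule covers each record
exhaustive = all(c >= 1 for c in coverage)           # at least one rule covers each record
print(mutually_exclusive, exhaustive)                # True, False ("overcast" is uncovered)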

b. What are the measures for selecting the splitting attribute? Explain.


There are many measures that can be used to determine the best way to split the records.
These measures are defined in terms of the class distribution of the records before and after
splitting. An attribute selection measure is a heuristic for selecting the splitting criterion that
best separates a given data partition, D, of class-labeled training tuples into individual
classes. Attribute selection measures are also known as splitting rules because they determine
how the tuples at a given node are to be split. The attribute selection measure provides a
ranking for each attribute describing the given training tuples. The attribute having the best
score for the measure is chosen as the splitting attribute for the given tuples. If the splitting
attribute is continuous-valued or if we are restricted to binary trees, then, respectively, either a
split point or a splitting subset must also be determined as part of the splitting criterion. The
tree node created for partition D is labeled with the splitting criterion, branches are grown for
each outcome of the criterion, and the tuples are partitioned accordingly. The three popular
attribute selection measures are information gain, gain ratio, and Gini index.

Information gain
ID3 uses information gain as its attribute selection measure. Let node N represent or hold
the tuples of partition D. The attribute with the highest information gain is chosen as the
splitting attribute for node N. This attribute minimizes the information needed to classify the
tuples in the resulting partitions and reflects the least randomness or impurity in these
partitions. Such an approach minimizes the expected number of tests needed to classify a
given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
The expected information needed to classify a tuple in D is given by
Info(D) = - sum over i = 1..m of ( pi × log2(pi) )
where pi is the probability that an arbitrary tuple in D belongs to class Ci and is estimated
by |Ci,D| / |D|. The expected information still required after partitioning D on attribute A into
v partitions is Info_A(D) = sum over j = 1..v of ( |Dj| / |D| ) × Info(Dj), and the information
gain is Gain(A) = Info(D) - Info_A(D).

Gain ratio
The information gain measure is biased toward tests with many outcomes. That is, it
prefers to select attributes having a large number of values. For example, consider an
attribute that acts as a unique identifier, such as product ID. A split on product ID would
result in a large number of partitions (as many as there are values), each one containing
just one tuple. Because each partition is pure, the information required to classify data set
D based on this partitioning would be Info_product_ID(D) = 0. Therefore, the information
gained by partitioning on this attribute is maximal. Clearly, such a partitioning is useless
for classification.
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio,
which attempts to overcome this bias. It applies a kind of normalization to information
gain using a "split information" value defined analogously to Info(D) as
SplitInfo_A(D) = - sum over j = 1..v of ( |Dj| / |D| ) × log2( |Dj| / |D| )
The gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfo_A(D)
The attribute with the maximum gain ratio is selected as the splitting attribute.
Gini index
The Gini index is used in CART. Using the notation described above, the Gini index
measures the impurity of D, a data partition or set of training tuples, as
Gini(D) = 1 - sum over i = 1..m of pi^2
where pi is the probability that a tuple in D belongs to class Ci.
When considering a binary split, we compute a weighted sum of the impurity of each
resulting partition. For example, if a binary split on A partitions D into D1 and D2, the
Gini index of D given that partitioning is
Gini_A(D) = ( |D1| / |D| ) × Gini(D1) + ( |D2| / |D| ) × Gini(D2)
The reduction in impurity that would be incurred by a binary split on a discrete- or
continuous-valued attribute A is
ΔGini(A) = Gini(D) - Gini_A(D)
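These measures can be illustrated with a short Python sketch over an assumed toy data set (not part of the original answer):

# Sketch: entropy, information gain, and Gini index for a toy class-labeled data set.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(pairs):
    # pairs: list of (attribute_value, class_label) tuples for one candidate splitting attribute
    labels = [y for _, y in pairs]
    groups = {}
    for v, y in pairs:
        groups.setdefault(v, []).append(y)
    expected = sum(len(g) / len(pairs) * entropy(g) for g in groups.values())
    return entropy(labels) - expected

data = [("youth", "no"), ("youth", "no"), ("middle", "yes"), ("senior", "yes"), ("senior", "no")]
print(entropy([y for _, y in data]), gini([y for _, y in data]), info_gain(data))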

(or)
5. a. Describe Bayesian classification in detail.
Naive Bayesian classification is called naive because it assumes class conditional
independence. That is, the effect of an attribute value on a given class is independent
of the values of the other attributes. This assumption is made to reduce computational
costs, and hence is considered "naive". The major idea behind naive Bayesian
classification is to try to classify data by maximizing P(X|Ci)P(Ci) (where i is an
index of the class) using Bayes' theorem of posterior probability. In general:
We are given a set of unknown data tuples, where each tuple is represented by an n-
dimensional vector X = (x1, x2, ..., xn) depicting n measurements made on the tuple
from the n attributes A1, A2, ..., An, respectively. We are also given a set of m classes
C1, C2, ..., Cm. Using Bayes' theorem, the naive Bayesian classifier calculates the posterior
probability of each class conditioned on X. X is assigned the class label of the class
with the maximum posterior probability conditioned on X. Therefore, we try to
maximize P(Ci|X) = P(X|Ci)P(Ci) / P(X). However, since P(X) is constant for all
classes, only P(X|Ci)P(Ci) need be maximized. If the class prior probabilities are not
known, then it is commonly assumed that the classes are equally likely, i.e. P(C1) =
P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we
maximize P(X|Ci)P(Ci). The class prior probabilities may be estimated by P(Ci) =
si / s, where si is the number of training tuples of class Ci, and s is the total number of
training tuples. In order to reduce computation in evaluating P(X|Ci), the naive
assumption of class conditional independence is made. This presumes that the values
of the attributes are conditionally independent of one another, given the class label of
the tuple, i.e., that there are no dependence relationships among the attributes. Thus
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci).
If Ak is a categorical attribute, then P(xk|Ci) is equal to the number of training tuples
in Ci that have xk as the value for that attribute, divided by the total number of training
tuples in Ci.
If Ak is a continuous attribute, then P(xk|Ci) can be calculated using a
Gaussian density function.
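A minimal sketch of this computation for categorical attributes, using assumed training tuples and no smoothing:

# Sketch of naive Bayesian classification for categorical attributes (Laplace smoothing omitted).
from collections import Counter

train = [  # assumed training tuples: (attribute dict, class label)
    ({"outlook": "sunny", "windy": "false"}, "no"),
    ({"outlook": "sunny", "windy": "true"},  "no"),
    ({"outlook": "rainy", "windy": "false"}, "yes"),
    ({"outlook": "rainy", "windy": "true"},  "yes"),
]

def classify(x):
    classes = Counter(label for _, label in train)
    s = len(train)
    best_class, best_score = None, -1.0
    for ci, si in classes.items():
        score = si / s                                    # prior P(Ci) = si / s
        tuples_ci = [attrs for attrs, label in train if label == ci]
        for attr, value in x.items():                     # multiply in each P(xk | Ci)
            matches = sum(1 for t in tuples_ci if t.get(attr) == value)
            score *= matches / len(tuples_ci)
        if score > best_score:                            # keep the class maximizing P(X|Ci)P(Ci)
            best_class, best_score = ci, score
    return best_class

print(classify({"outlook": "sunny", "windy": "true"}))    # expected: "no"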
b. Describe Decision Tree Induction algorithm in detail.
The major steps are as follows:
1. The tree starts as a single root node containing all of the training tuples.
2. If the tuples are all from the same class, then the node becomes a leaf, labeled with
that class.
3. Else, an attribute selection method is called to determine the splitting criterion. Such a
method may use a heuristic or statistical measure (e.g., information gain or Gini
index) to select the "best" way to separate the tuples into individual classes. The
splitting criterion consists of a splitting attribute and may also indicate either a split-
point or a splitting subset, as described below.
4. Next, the node is labeled with the splitting criterion, which serves as a test at the node.
A branch is grown from the node to each of the outcomes of the splitting criterion and
the tuples are partitioned accordingly. There are three possible scenarios for such
partitioning. (1) If the splitting attribute is discrete-valued, then a branch is grown for
each possible value of the attribute. (2) If the splitting attribute, A, is continuous-
valued, then two branches are grown, corresponding to the conditions A <= split point
and A > split point. (3) If the splitting attribute is discrete-valued and a binary tree
must be produced (e.g., if the Gini index was used as a selection measure), then the test
at the node is "A ∈ SA?", where SA is the splitting subset for A. It is a subset of the
known values of A. If a given tuple has value aj of A and aj ∈ SA, then the test at the
node is satisfied. The algorithm recurses to create a decision tree for the tuples at each
partition.
The stopping conditions are:
If all tuples at a given node belong to the same class, then transform that node into a
leaf, labeled with that class.
If there are no more attributes left to create more partitions, then majority voting can
be used to convert the given node into a leaf, labeled with the most common class
among the tuples.
If there are no tuples for a given branch, a leaf is created with the majority class from
the parent node.

ADITYA COLLEGE OF ENGINEERING


PUNGANUR ROAD, MADANAPALLE-517325
III-B.Tech(R13) II-Sem I-Internal Examinations March-2017 (Objective)
13A05603 Data Mining (Computer Science & Engineering)
Name : Roll No. :
Time: 20 min Max Marks: 10

Answer all the questions. 5 × 1 = 5 M


1. List any two applications of classification.
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
Intrusion Detection
2. Define a data warehouse and data mart.
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection
of data that supports management's decision-making process. The data mart is a subset of the
data warehouse and is usually oriented to a specific business line or team. Whereas data
warehouses have an enterprise-wide depth, the information in data marts pertains to a single
department. Each data mart is dedicated to a specific business function or region.

3. Explain data preprocessing.


Data preprocessing is a data mining technique that involves transforming raw data into
an understandable format. Real-world data is often incomplete, inconsistent, and/or
lacking in certain behaviors or trends, and is likely to contain many errors. Data
preprocessing is a proven method of resolving such issues.
Real world data are generally
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
Noisy: containing errors or outliers
Inconsistent: containing discrepancies in codes or names
Tasks in data preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies.
Data integration: using multiple databases, data cubes, or files.
Data transformation: normalization and aggregation.
Data reduction: reducing the volume but producing the same or similar analytical results.
Data discretization: part of data reduction, replacing numerical attributes with nominal
ones.

4. Explain data transformation.


Data transformation is the process of converting data or information from one format to
another, usually from the format of a source system into the required format of a new
destination system. A few transformation activities are:
1. Normalization:
Scaling attribute values to fall within a specified range.
Example: to transform V in [min, max] to V' in [0,1], apply V'=(V-Min)/(Max-Min)
Scaling by using mean and standard deviation (useful when min and max are unknown or
when there are outliers): V'=(V-Mean)/StDev
2. Aggregation: moving up in the concept hierarchy on numeric attributes.
3. Generalization: moving up in the concept hierarchy on nominal attributes.
4. Attribute construction: replacing or adding new attributes inferred by existing attributes.
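A small Python illustration of the two normalization formulas above, with assumed values:

# Sketch: min-max and z-score (mean/standard deviation) normalization of an assumed attribute.
from statistics import mean, stdev

values = [200, 300, 400, 600, 1000]   # assumed attribute values V

v_min, v_max = min(values), max(values)
min_max = [(v - v_min) / (v_max - v_min) for v in values]   # V' = (V - Min) / (Max - Min), in [0, 1]

mu, sigma = mean(values), stdev(values)
z_score = [(v - mu) / sigma for v in values]                # V' = (V - Mean) / StDev

print(min_max)
print(z_score)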

5. List out the methods of handling noisy data.


To smooth out noisy data:
Binning
Sort the attribute values and partition them into bins;
then smooth by bin means, bin medians, or bin boundaries.
Clustering: group values in clusters and then detect and remove outliers
(automatic or manual)
Regression: smooth by fitting the data into regression functions.
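For illustration, smoothing by bin means on an assumed sorted list of values (equal-frequency bins of size 3):

# Sketch: equal-frequency binning and smoothing by bin means on assumed noisy values.
values = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(values), bin_size):
    bin_vals = values[i:i + bin_size]
    bin_mean = sum(bin_vals) / len(bin_vals)       # replace each value in the bin by the bin mean
    smoothed.extend([round(bin_mean, 1)] * len(bin_vals))
print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]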

Choose the correct answer for the following questions. 10 × 1/2 = 5 M


1. A HOLAP server combines _____________ . [ C ]
a) ROLAP b) MOLAP c) a & b d) none
2. Data warehouses and OLAP tools are based on a _____ dimensional data model. [ B ]
a)single b) multi c) three d) cube
3. The querying of multidimensional databases can be based on a _________ model. [ A ]
a) Star net b) data c) query d) none
4. Star schema consists of ____________& _______________ tables. [ A ]
a) fact, dimension b) dimension, star c) a & b d) none
5. ___________Operation performs a selection on one dimension of the given cube. [ A ]
a) slice b) dice c) pivot d) none
6. Users of data mining systems can be classified into____________ categories. [ D ]
a) 1 b) 2 c) 3 d) 4
7. Smoothing techniques are ________________ [ A ]
a) binning b) aggregation c) Normalization d) None
8. The full form of KDD is .................. [ B ]
a) Knowledge Database b) Knowledge Discovery in Database
c) Knowledge Data House d) Knowledge Data Definition
9. Which of the following is not a data mining functionality? [ C ]
a) Characterization and Discrimination b) Classification and regression
c) Selection and interpretation d) Clustering and Analysis
10. __________ is a summarization of the general characteristics or features of a target class of data. [ A ]
a) Data Characterization b) Data Classification c) Data discrimination d) Data selection
