
Matoshri College of Engineering & R. C.

Nashik
Department of Computer Engineering. [P.G.]
Subject :- ELE-1 DATA MINING [510105B ]
Semester:- 1
Subject I/c :- Mr. Ranjit Gawande.

Model Ans. Paper for Data Mining 2017 Pat.

Compulsory Module - 1, Introduction to D.M., Credit – 1
Syllabus :- Data, Information and Knowledge, Attribute Types: Nominal, Binary, Ordinal
and Numeric attributes, Discrete versus Continuous Attributes, Introduction to Data
Preprocessing, Data Cleaning, Data Integration, Data Reduction, Transformation and Data
Discretization. Concept of Class: Characterization and Discrimination, Basics/Introduction
to: Classification and Regression for Predictive Analysis, Mining Frequent Patterns,
Associations, and Correlations, Cluster Analysis
Reference Book :- Han, Jiawei; Kamber, Micheline; Pei, Jian, “Data Mining: Concepts
and Techniques”, Elsevier Publishers, Third Edition/Second Edition, ISBN: 9780123814791,
9780123814807
Model Question Based on Module -1
Sub Topic Covered :- Data, Attribute Types: Nominal, Binary, Ordinal and
Numeric attributes, Discrete versus Continuous Attributes
Q1.A What are the different attribute types in data mining? 5
Ans. Data sets are made up of data objects. A data object represents an entity: in a sales
database, the objects may be customers, store items, and sales; in a medical database, the
objects may be patients; in a university database, the objects may be students, professors,
and courses. Data objects are typically described by attributes. Data objects can also be
referred to as samples, examples, instances, data points, or objects. If the data objects are
stored in a database, they are data tuples; that is, the rows of a database correspond to the
data objects, and the columns correspond to the attributes. Thus data is a collection of data
objects and their attributes. An attribute is a property or characteristic of an object, for
example the eye color of a person or the temperature. An attribute is also known as a
variable, field, characteristic, or feature, and a collection of attributes describes an object.
An object is also known as a record, point, case, sample, entity, or instance. Attribute values,
on the other hand, are the numbers or symbols assigned to an attribute. There is a distinction
between attributes and attribute values:
a) The same attribute can be mapped to different attribute values • Example: height can be
measured in feet or meters
b) Different attributes can be mapped to the same set of values • Example: attribute
values for ID and age are integers • But the properties of the attribute values can be different:

c) ID has no limit but age has a maximum and minimum value
Types of Attribute
There are different types of attributes
1) Nominal • Examples: ID numbers, eye color, zip codes
2) Ordinal • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades,
height in {tall, medium, short}
3) Interval • Examples: calendar dates, temperatures in Celsius or Fahrenheit.
4) Ratio • Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
1) Distinctness: = and ≠   2) Order: < and >   3) Addition: + and −   4) Multiplication: * and /
Based on these properties:
• Nominal attribute :- distinctness
• Ordinal attribute :- distinctness & order
• Interval attribute :- distinctness, order & addition
• Ratio attribute :- all 4 properties

Discrete and Continuous Attributes


Discrete Attribute :- Has only a finite or countably infinite set of values. Examples: zip
codes, counts, or the set of words in a collection of documents. It is often represented as an
integer variable. Note that binary attributes are a special case of discrete attributes.
• Continuous Attribute :- Has real numbers as attribute values. Examples: temperature,
height, or weight. In practice, real values can only be measured and represented

using a finite number of digits. Continuous attributes are typically represented as
floating-point variables.
Q1.B Differentiate between ordered data set and graph data set? 5
Ans. Data sets are broadly categorized into three types, as shown below:
1) Record :- (a) Data Matrix (b) Document Data (c) Transaction Data
2) Graph :- (a) World Wide Web (b) Molecular Structures
3) Ordered :- (a) Spatial Data (b) Temporal Data (c) Sequential Data (d) Genetic
Sequence Data
Every data set exhibits the general characteristics of structured data, namely Dimensionality
(the curse of dimensionality), Sparsity (only presence counts), and Resolution (patterns
depend on the scale).
Record data consists of a collection of records, each of which has a fixed set of
attributes, for example the table shown below.

(1.a) Data Matrix :- If data objects have the same fixed set of numeric attributes, then the data
objects can be thought of as points in a multi-dimensional space, where each dimension
represents a distinct attribute
Such data set can be represented by an m by n matrix, where there are m rows,
one for each object, and n columns, one for each attribute, for example table shown below.
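For illustration only (not part of the original answer), such an m by n matrix can be held as a
NumPy array, one row per data object and one column per attribute; the attribute names and
values below are assumed:

import numpy as np

# Hypothetical data matrix: 4 objects (rows) x 3 numeric attributes (columns).
# Assumed columns: height_cm, weight_kg, age_years.
X = np.array([
    [170.0, 65.0, 25.0],
    [160.0, 55.0, 30.0],
    [180.0, 80.0, 22.0],
    [175.0, 72.0, 28.0],
])
m, n = X.shape          # m objects, n attributes
print(m, n)             # -> 4 3
print(X[0])             # first data object, viewed as a point in 3-dimensional space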

(1.b) Document Data


Each document becomes a “term” vector, each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term occurs in the
document.
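As a small illustrative sketch (the two documents below are invented), term vectors can be built by
counting how often each vocabulary term occurs in each document:

from collections import Counter

docs = ["data mining finds patterns in data",
        "frequent patterns and association rules"]          # hypothetical documents

vocabulary = sorted({term for d in docs for term in d.split()})
# Each document becomes a term vector: one component per vocabulary term,
# whose value is the number of times that term occurs in the document.
term_vectors = []
for d in docs:
    counts = Counter(d.split())
    term_vectors.append([counts[term] for term in vocabulary])

print(vocabulary)
print(term_vectors[0])   # e.g. the count of "data" in the first document is 2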

(1.c) Transaction Data :- A special type of record data, where each record (transaction)
involves a set of items. – For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a transaction, while the
individual products that were purchased are the items.

(2) Graph Data :- The best examples of graph data are generic graphs and HTML links.

(2.b) Chemical Data :- Such data can be represented using chemical compounds, e.g. the
Benzene molecule: C6H6.
(3) Ordered Data :- Sequences of transactions can be represented as shown below. Other
kinds of ordered data are genomic sequence data and spatio-temporal data.

Sub Topic Covered :- Introduction to Data Preprocessing, Data Cleaning, Data
Integration, Data Reduction, Transformation and Data Discretization
Q2A What kinds of data quality problems exist? How can we detect problems with the data? What can 5
we do about these problems?
Ans. The data quality problems are (a) noise and outliers (b) missing values (c) duplicate data
(a) Noise and outliers :- Noise refers to modification of the original values. Examples: distortion of
a person’s voice when talking on a poor phone line, and “snow” on a television screen.
Outliers are data objects with characteristics that are considerably different from
most of the other data objects in the data set.
(b) Missing values :- Reasons for missing values are that information is not collected (e.g.,
people decline to give their age and weight) or that attributes may not be applicable to all
cases (e.g., annual income is not applicable to children). To overcome this problem, missing
values can be handled by eliminating the data objects, estimating the missing values,
ignoring the missing value during analysis, or replacing it with all possible values
(weighted by their probabilities).
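A minimal pandas sketch of two of these strategies, eliminating objects and estimating missing
values, is shown below; the column names and values are hypothetical:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "income": [30000, 42000, np.nan, 50000]})   # hypothetical data

dropped   = df.dropna()                              # eliminate data objects with missing values
estimated = df.fillna(df.mean(numeric_only=True))    # estimate missing values (here: column mean)
# A third option is simply to ignore the missing value during analysis,
# which many mining algorithms support directly.
print(dropped)
print(estimated)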
(c) Duplicate data :- A data set may include data objects that are duplicates, or almost
duplicates, of one another. This is a major issue when merging data from heterogeneous
sources; an example is the same person appearing with multiple email addresses. To
overcome this problem the solution is data cleaning, i.e. the process of dealing with
duplicate data issues.
Q2B Explain the need for data preprocessing? Explain any two preprocessing techniques. 5
Ans. Data preprocessing, where the data are prepared for mining. For data preprocessing
following steps need to be carried out
1. Aggregation
2. Sampling
3. Dimensionality Reduction
4. Feature subset selection
5. Feature creation
6. Discretization and Binarization
7. Attribute Transformation
1. Aggregation :- Combining two or more attributes (or objects) into a single attribute (or
object), for the purpose of (a) data reduction, (b) change of scale, and (c) more “stable” data.
(a) Data reduction :- Reduce the number of attributes or objects
(b) Change of scale :- Cities aggregated into regions, states, countries, etc
(c) More “stable” data :- Aggregated data tends to have less variability
2 . Sampling :- Sampling is the main technique employed for data selection. It is often used
for both the preliminary investigation of the data and the final data analysis.
Statisticians sample because obtaining the entire set of data of interest is too
expensive or time consuming. Sampling is used in data mining because processing the entire
set of data of interest is too expensive or time consuming. The key principle for effective
sampling is the following:
* Using a sample will work almost as well as using the entire data set, if the sample is
representative
* A sample is representative if it has approximately the same property (of interest) as the

original set of data

Types of Sampling
1) Simple Random Sampling – There is an equal probability of selecting any particular item
2) Sampling without replacement – As each item is selected, it is removed from the
population
3) Sampling with replacement – Objects are not removed from the population as they are
selected for the sample. In sampling with replacement, the same object can be picked up
more than once
4) Stratified sampling – Split the data into several partitions; then draw random
samples from each partition
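The sampling types listed above can be sketched with pandas as follows; the population
DataFrame and the stratum column are made up for illustration:

import pandas as pd

df = pd.DataFrame({"value": range(100),
                   "stratum": ["A"] * 70 + ["B"] * 30})   # hypothetical population

srs_without = df.sample(n=10, replace=False, random_state=1)   # simple random sampling without replacement
srs_with    = df.sample(n=10, replace=True,  random_state=1)   # sampling with replacement (an object may repeat)
# Stratified sampling: split the data into partitions (strata) and draw a random sample from each.
stratified  = df.groupby("stratum", group_keys=False).apply(lambda g: g.sample(frac=0.1, random_state=1))
print(len(srs_without), len(srs_with), len(stratified))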

Q2C What are the issues involved in data integration? Also mention the suitable techniques to 5
handle each issue?
Ans. Data analysis task will involve data integration, which combines data from multiple sources
into a coherent data store, as in data warehousing. These sources may include multiple
databases, data cubes, or flat files
There are a number of issues to consider during data integration.
Schema integration and object matching can be tricky; this is the entity identification problem.
Blank, zero, or null values must be handled carefully to avoid errors in schema integration.
Redundancy is another important issue. An attribute (such as annual revenue, for
instance) may be redundant if it can be “derived” from another attribute or set of attributes.
Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting
data set. Some redundancies can be detected by correlation analysis. Given two attributes,
such analysis can measure how strongly one attribute implies the other, based on the
available data. For numerical attributes, we can evaluate the correlation between two
attributes, A and B, by computing the correlation coefficient . Note that correlation does not
imply causality. That is, if A and B are correlated, this does not necessarily imply that A
causes B or that B causes A. For example, in analyzing a demographic database, we may
find that attributes representing the number of hospitals and the number of car thefts in a
region are correlated. This does not mean that one causes the other. Both are actually
causally linked to a third attribute, namely, population.
For categorical (discrete) data, a correlation relationship between two
attributes, A and B, can be discovered by a χ2 (chi-square) test . Thus contingency table
can be created. The use of denormalized tables (often done to improve performance by

avoiding joins) is another source of data redundancy. Inconsistencies often arise between
various duplicates, due to inaccurate data entry or updating some but not all of the
occurrences of the data.
A third important issue in data integration is the detection and resolution
of data value conflicts. For example, for the same real-world entity, attribute values from
different sources may differ. This may be due to differences in representation, scaling, or
encoding. For instance, a weight attribute may be stored in metric units in one system and
British imperial units in another.
When matching attributes from one database to another during
integration, special attention must be paid to the structure of the data. This is to ensure that
any attribute functional dependencies and referential constraints in the source system match
those in the target system.
The semantic heterogeneity and structure of data pose great challenges in
data integration. Careful integration of the data from multiple sources can help reduce and
avoid redundancies and inconsistencies in the resulting data set. This can help improve the
accuracy and speed of the subsequent mining process.
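For illustration, both redundancy checks mentioned above (the correlation coefficient for numerical
attributes and the χ2 test for categorical attributes) can be sketched with NumPy and SciPy; all
values and the contingency table are invented:

import numpy as np
from scipy.stats import chi2_contingency

# Correlation coefficient between two numerical attributes A and B (hypothetical values).
A = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
B = np.array([12.0, 24.0, 33.0, 39.0, 52.0])
r = np.corrcoef(A, B)[0, 1]          # Pearson correlation coefficient
print(round(r, 3))                   # a value close to +1 suggests one attribute may be redundant

# Chi-square test for two categorical attributes, given their contingency table.
contingency = np.array([[250, 200],   # made-up counts, e.g. rows = gender,
                        [50, 1000]])  # columns = preferred reading category
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(round(chi2, 1), p_value < 0.05)  # a small p-value indicates the attributes are correlated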
Q3.A Explain the techniques used so that data are transformed or consolidated into forms suitable 5
for mining. Also explain how an attribute is normalized by scaling.
Ans. The following techniques are used in data transformation:
1) Smoothing, which works to remove noise from the data. Such techniques include
binning, regression, and clustering.
2) Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for analysis of
the data at multiple granularities.
3) Generalization of the data, where low-level or “primitive” (raw) data are replaced by
higher-level concepts through the use of concept hierarchies. For example, categorical
attributes, like street, can be generalized to higher-level concepts, like city or country.
Similarly, values for numerical attributes, like age, may be mapped to higher-level
concepts, like youth, middle-aged, and senior.
4) Normalization, where the attribute data are scaled so as to fall within a small specified
range, such as −1.0 to 1.0, or 0.0 to 1.0.
5) Attribute construction (or feature construction), where new attributes are constructed
and added from the given set of attributes to help the mining process.
Smoothing is a form of data cleaning; the data cleaning process may also use ETL tools,
where users specify transformations to correct data inconsistencies. Aggregation and

generalization serve as forms of data reduction.
An attribute is normalized by scaling its values so that they fall within a small
specified range, such as 0.0 to 1.0. Normalization is particularly useful for classification
algorithms involving neural networks, or distance measurements such as nearest-neighbor
classification and clustering. If using the neural network backpropagation algorithm for
classification mining , normalizing the input values for each attribute measured in the
training tuples will help speed up the learning phase. For distance-based methods,
normalization helps prevent attributes with initially large ranges (e.g., income) from
outweighing attributes with initially smaller ranges (e.g., binary attributes). There are many
methods for data normalization. The following are three methods : min-max normalization,
z-score normalization, and normalization by decimal scaling.
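A small sketch of the three normalization methods on a single attribute follows; the income values
are assumed for illustration:

import numpy as np

income = np.array([12000.0, 35000.0, 58000.0, 73600.0, 98000.0])   # hypothetical values

# 1) Min-max normalization to the new range [0.0, 1.0].
min_max = (income - income.min()) / (income.max() - income.min())

# 2) z-score normalization: subtract the mean and divide by the standard deviation.
z_score = (income - income.mean()) / income.std()

# 3) Normalization by decimal scaling: divide by 10^j, where j is the smallest
#    integer such that the largest absolute value becomes less than 1.
j = int(np.ceil(np.log10(np.abs(income).max())))
decimal_scaled = income / (10 ** j)

print(min_max.round(3), z_score.round(3), decimal_scaled.round(3), sep="\n")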

Q3.B Explain how data discretization techniques are used to reduce the number of values for a given 5
continuous attribute ?
Ans.  Data discretization and automatic generation of concept hierarchies for numerical data can
involve techniques such as (1) binning, (2) histogram analysis, (3) entropy-based
discretization, (4) χ2 analysis, (5) cluster analysis, and (6) discretization by intuitive
partitioning. For categorical data, concept hierarchies may be generated based on the number
of distinct values of the attributes defining the hierarchy. Each method assumes that the
values to be discretized are sorted in ascending order.
(1) binning :- Binning is a top-down splitting technique based on a specified number of bins.
Binning methods use for data smoothing. These methods are also used as discretization
methods for numerosity reduction and concept hierarchy generation. For example, attribute
values can be discretized by applying equal-width or equal-frequency binning, and then
replacing each bin value by the bin mean or median, as in smoothing by bin means or
smoothing by bin medians, respectively. These techniques can be applied recursively to the
resulting partitions in order to generate concept hierarchies. Binning does not use class
information and is therefore an unsupervised discretization technique. It is sensitive to the
user-specified number of bins, as well as the presence of outliers.
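An illustrative sketch of equal-width binning, equal-frequency binning, and smoothing by bin means
is given below; the attribute values are a small made-up sample:

import numpy as np

values = np.sort(np.array([4.0, 8.0, 15.0, 21.0, 21.0, 24.0, 25.0, 28.0, 34.0]))  # sorted attribute values
n_bins = 3

# Equal-width binning: each bin covers the same range of values.
edges = np.linspace(values.min(), values.max(), n_bins + 1)
width_bin = np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

# Equal-frequency (equal-depth) binning: each bin holds roughly the same number of values.
freq_bins = np.array_split(values, n_bins)

# Smoothing by bin means: replace every value in a bin by that bin's mean.
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in freq_bins])
print(width_bin)     # bin index of each value under equal-width binning
print(smoothed)      # values after smoothing by (equal-frequency) bin means, e.g. 9, 22, 29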
(2) histogram analysis :- Like binning, histogram analysis is an unsupervised discretization
technique because it does not use class information. Histograms partition the values for an
attribute, A, into disjoint ranges called buckets. Partitioning rules for defining histograms In
an equal-width histogram, for example, the values are partitioned into equal-sized partitions
or ranges. The histogram analysis algorithm can be applied recursively to each partition in
order to automatically generate a multilevel concept hierarchy, with the procedure
terminating once a prespecified number of concept levels has been reached. A minimum

interval size can also be used per level to control the recursive procedure. This specifies the
minimum width of a partition, or the minimum number of values for each partition at each
level. Histograms can also be partitioned based on cluster analysis of the data distribution
(3) entropy-based discretization :- Entropy is one of the most commonly used
discretization measures. Entropy-based discretization is a supervised, top-down splitting
technique. It explores class distribution information in its calculation and determination
of split-points (data values for partitioning an attribute range). To discretize a numerical
attribute, A, the method selects the value of A that has the minimum entropy as a split-point,
and recursively partitions the resulting intervals to arrive at a hierarchical discretization.
Such discretization forms a concept hierarchy for A. Entropy-based discretization can reduce
data size. Unlike the other methods , entropy-based discretization uses class information.
This makes it more likely that the interval boundaries (split-points) are defined to occur in
places that may help improve classification accuracy. The entropy and information gain
measures are also used for decision tree induction.
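A minimal sketch of choosing a split-point by minimum expected entropy is shown below; the
numerical attribute and class labels are invented for illustration:

import numpy as np

A      = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])          # numerical attribute (hypothetical)
labels = np.array(["no", "no", "no", "yes", "yes", "yes"])  # class labels (hypothetical)

def entropy(y):
    # Shannon entropy of a set of class labels.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

best_split, best_info = None, np.inf
for split in (A[:-1] + A[1:]) / 2.0:          # candidate split-points: midpoints of adjacent values
    left, right = labels[A <= split], labels[A > split]
    # Expected information requirement (weighted entropy) after splitting at this point.
    info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    if info < best_info:
        best_split, best_info = split, info

print(best_split, round(best_info, 3))   # the split with minimum entropy, here 5.5 with entropy 0.0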
(4) χ2 analysis The stopping criterion is typically determined by three conditions. First,
merging stops when χ2 values of all pairs of adjacent intervals exceed some threshold, which
is determined by a specified significance level. A too (or very) high value of significance
level for the χ2 test may cause overdiscretization, whereas a too (or very) low value may
lead to underdiscretization. A basic assumption of ChiMerge is that the relative class frequencies
should be fairly consistent within an interval. In practice, some inconsistency is allowed,
although this should be no more than a prespecified threshold, such as 3%, which may
be estimated from the training data. This last condition can be used to remove irrelevant
attributes from the data set.
(5) cluster analysis :- Cluster analysis is a popular data discretization method. A clustering
algorithm can be applied to discretize a numerical attribute, A, by partitioning the values of
A into clusters or groups. Clustering takes the distribution of A into consideration, as well as
the closeness of data points, and therefore is able to produce high-quality discretization
results. Clustering can be used to generate a concept hierarchy for A by following either a
top-down splitting strategy or a bottom-up merging strategy, where each cluster forms a node
of the concept hierarchy. In the former, each initial cluster or partition may be further
decomposed into several subclusters, forming a lower level of the hierarchy. In the latter,
clusters are formed by repeatedly grouping neighboring clusters in order to form higher-level
concepts
(6) discretization by intuitive partitioning :- The above discretization methods are useful in the
generation of numerical hierarchies, but we may like to see numerical ranges partitioned into
relatively uniform, easy-to-read intervals that appear intuitive or “natural.” The 3-4-5 rule

can be used to segment numerical data into relatively uniform, natural-seeming intervals. The
rule partitions a given range of data into 3, 4, or 5 relatively equal-width intervals,
recursively and level by level, based on the value range at the most significant digit.
The rule can be recursively applied to each interval, creating a
concept hierarchy for the given numerical attribute. Real-world data often contain extremely
large positive and/or negative outlier values, which could distort any top-down discretization
method based on minimum and maximum data values.

Sub Topic Covered :- Concept of class: Characterization and Discrimination,
basics/Introduction to: Classification and Regression for Predictive Analysis,
Mining Frequent Patterns, Associations, and Correlations, Cluster Analysis
Q3.C What are the various forms used to present the output of data characterization? Explain in 4
detail Data Characterization?
Ans.  The output of data characterization can be presented in various forms. Examples include pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables,
including crosstabs. The resulting descriptions can also be presented as generalized relations
or in rule form (called characteristic rules).
Data can be associated with classes or concepts. It can be useful to describe
individual classes and concepts in summarized, concise, and yet precise terms. Such
descriptions of a class or a concept are called class/concept descriptions. These descriptions
can be derived via (1) data characterization, by summarizing the data of the class under study
(often called the target class) in general terms, or (2) data discrimination, by comparison of
the target class with one or a set of comparative classes (often called the contrasting classes),
or (3) both data characterization and discrimination.
Data characterization is a summarization of the general characteristics or
features of a target class of data. The data corresponding to the user-specified class are
typically collected by a database query. There are several methods for effective data
summarization and characterization. Simple data summaries based on statistical measures
and plots . The data cube–based OLAP roll-up operation can be used to perform user-
controlled data summarization along a specified dimension. An attribute-oriented induction
technique can be used to perform data generalization and characterization without step-by-
step user interaction.

Q4.A What are different kinds of frequent patterns? Explain how frequent patterns lead to the 5
discovery of interesting associations and correlations within data?
Ans. There are many kinds of frequent patterns, including itemsets, subsequences, and substructures. A frequent itemset

typically refers to a set of items that frequently appear together in a transactional data set,
such as milk and bread. A frequently occurring subsequence, such as the pattern that
customers tend to purchase first a PC, followed by a digital camera, and then a memory card,
is a (frequent) sequential pattern. A substructure can refer to different structural forms, such
as graphs, trees, or lattices, which may be combined with itemsets or subsequences. If a
substructure occurs frequently, it is called a (frequent) structured pattern. Mining frequent
patterns leads to the discovery of interesting associations and correlations within data
Association Analysis :- Example: A marketing manager of AllElectronics may like to
determine which items are frequently purchased together within the same transactions.
An example of such a rule, mined from the AllElectronics transactional database, is
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%; confidence = 50%]
where X is a variable representing a customer. A confidence, or certainty, of 50% means
that if a customer buys a computer, there is a 50% chance that she will buy software
as well. A 1% support means that 1% of all of the transactions under analysis showed
that computer and software were purchased together. This association rule involves a
single attribute or predicate (i.e., buys) that repeats. Association rules that contain a single
predicate are referred to as single-dimensional association rules. Dropping the predicate
notation, the above rule can be written simply as “computer ⇒ software [1%, 50%]”.
Adopting the terminology used in multidimensional databases,
where each attribute is referred to as a dimension, rules that involve two or more
attributes or predicates (for example, age, income, and buys) are referred to as
multidimensional association rules. Frequent itemset mining is the simplest form of frequent
pattern mining. In the mining of frequent patterns, associations, and correlations, particular
emphasis is placed on efficient algorithms for frequent itemset mining; sequential pattern
mining and structured pattern mining are other important forms.
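For illustration, support and confidence for one candidate rule can be computed over a small,
made-up transaction list as follows:

transactions = [                       # hypothetical market-basket data
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "memory_card"},
    {"software", "antivirus"},
]

antecedent, consequent = {"computer"}, {"software"}
n = len(transactions)
count_both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
count_ante = sum(1 for t in transactions if antecedent <= t)

support    = count_both / n            # fraction of all transactions containing both itemsets
confidence = count_both / count_ante   # among transactions with the antecedent, fraction also holding the consequent
print(support, confidence)             # -> 0.5 and about 0.67 for these made-up transactions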

Q4.B How is the derived model presented using Classification and Prediction ? Explain in detail. 5
Ans.  Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts, for the purpose of being able to use the model to predict
the class of objects whose class label is unknown. The derived model is based on the analysis
of a set of training data (i.e., data objects whose class label is known).
“How is the derived model presented?” The derived model may be represented in various
forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae,
or neural networks (Figure below). A decision tree is a flow-chart-like tree structure, where
each node denotes a test on an attribute value, each branch represents an outcome of the
test, and tree leaves represent classes or class distributions. Decision trees can easily be
converted to classification rules. A neural network, when used for classification, is typically

a collection of neuron-like processing units with weighted connections between the
units. There are many other methods for constructing classification models, such as naïve
Bayesian classification, support vector machines, and k-nearest neighbor classification.
Whereas classification predicts categorical (discrete, unordered) labels, prediction
models continuous-valued functions. That is, it is used to predict missing or unavailable
numerical data values rather than class labels. Although the term prediction may
refer to both numeric prediction and class label prediction, in this book we use it to refer
primarily to numeric prediction. Regression analysis is a statistical methodology that is
most often used for numeric prediction, although other methods exist as well. Prediction
also encompasses the identification of distribution trends based on the available data.
Classification and prediction may need to be preceded by relevance analysis, which
attempts to identify attributes that do not contribute to the classification or prediction
process. These attributes can then be excluded.

Suppose that the resulting classification is expressed in the form of a decision


tree. The decision tree, for instance, may identify price as being the single factor that best
distinguishes the three classes. The tree may reveal that, after price, other features that
help further distinguish objects of each class from another include brand and place made.
Such a decision tree may help you understand the impact of the given sales campaign and
design a more effective campaign for the future
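As an illustration only (not the textbook's method), a decision tree model of the kind described
above can be derived from labelled training data using scikit-learn; the tiny training set below is
invented:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [price, brand_code], class label = sales category (0/1/2).
X = [[100, 0], [120, 0], [300, 1], [320, 1], [600, 2], [650, 2]]
y = [0, 0, 1, 1, 2, 2]

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(model, feature_names=["price", "brand_code"]))  # IF-THEN-like view of the tree
print(model.predict([[310, 1]]))   # predict the class of an object whose class label is unknown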

Q5.A What is cluster analysis, outlier analysis, evolution analysis? Explain each term in detail. 5
Ans. Clustering analyzes data objects without consulting a known class label. In general, the class

labels are not present in the training data simply because they are
not known to begin with. Clustering can be used to generate such labels. The objects are
clustered or grouped based on the principle of maximizing the intraclass similarity and
minimizing the interclass similarity. That is, clusters of objects are formed so that objects
within a cluster have high similarity in comparison to one another, but are very dissimilar
to objects in other clusters. Each cluster that is formed can be viewed as a class of objects,
from which rules can be derived. Clustering can also facilitate taxonomy formation, that
is, the organization of observations into a hierarchy of classes that group similar events
together.
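A hedged sketch of such clustering, grouping unlabeled objects so that intra-cluster similarity is
high, can be given with k-means from scikit-learn; the 2-D points below are invented:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D data objects (hypothetical): two visually separate groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # generated cluster labels, e.g. [0 0 0 1 1 1]; each cluster can be viewed as a class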
Outlier Analysis :- A database may contain data objects that do not comply with the general
behavior or model of the data. These data objects are outliers. Most data mining methods
discard outliers as noise or exceptions. However, in some applications such as fraud
detection, the
rare events can be more interesting than the more regularly occurring ones. The analysis
of outlier data is referred to as outlier mining.
Outliers may be detected using statistical tests that assume a distribution or probability
model for the data, or using distance measures where objects that are a substantial
distance from any other cluster are considered outliers. Rather than using statistical or
distance measures, deviation-based methods identify outliers by examining differences
in the main characteristics of objects in a group.
Evolution Analysis:- Data evolution analysis describes and models regularities or trends for
objects whose
behavior changes over time. Although this may include characterization, discrimination,
association and correlation analysis, classification, prediction, or clustering of time-related
data, distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data analysis.

Note:- This literature is only for private circulation.

Matoshri College of Engineering & R. C. Nashik
Department of Computer Engineering. [P.G.]
Subject :- ELE-1 DATA MINING [510105B ]
Semester:- 1
Subject I/c :- Mr. Ranjit Gawande.

Model Ans. Paper for Data Mining 2017 Pat.

Compulsory Module - 2 of 1, Introduction to D.M., Credit – 1
Syllabus :- Measuring the Central Tendency: Basics of Mean, Median, and Mode, Measuring the
Dispersion of Data, Variance and Standard Deviation. Measuring Data Similarity and Dissimilarity, Data
Matrix versus Dissimilarity Matrix, Proximity Measures for Nominal Attributes and Binary Attributes
Model Question Based on Module -2 of 1
Sub Topic Covered :- Measuring the Central Tendency: Basics of Mean, Median, and Mode,
Measuring the Dispersion of Data, Variance and Standard Deviation.
Q1.A What are various ways to measure the central tendency of data ? Explain in detail 5
Ans. 2.2 Descriptive Data Summarization 53
Q1.B Draw the Boxplot for the unit price data for items sold at four branches of a electronics shop 5
during a specific time period. Assume suitable time period.
Ans. 2.2 Descriptive Data Summarization 55
Q2A What are other popular types of graphs for the display of data summaries and distributions 5
apart from bar charts, pie charts, and line graphs?
Ans. Page 56 subpoint 2.2.3
Q2B With suitable algebraic equations, show how the variance and standard deviation can be computed 5
from distributive measures?
Ans. Page 55,56
Q2C What are the most common measures of data dispersion? Explain boxplots in detail? 5
Ans. Range, the five-number summary (based on quartiles), the interquartile range, and the
standard deviation. Boxplots can be plotted based on the five-number summary and are a
useful tool for identifying
outliers
Q3A What are the scores or values that are considered for the five-number summary? Explain how a 5
boxplot incorporates the five-number summary
Ans. The five-number summary of a distribution consists of the median, the quartiles Q1 and Q3,
and the smallest and largest individual observations, written in the order
Minimum, Q1, Median, Q3, Maximum.
Typically, the ends of the box are at the quartiles, so that the box length is the interquartile
range, IQR.
The median is marked by a line within the box.
Two lines (called whiskers) outside the box extend to the smallest (Minimum) and

largest (Maximum) observations.
When dealing with a moderate number of observations, it is worthwhile to plot
potential outliers individually. To do this in a boxplot, the whiskers are extended to the
extreme low and high observations only if these values are less than 1.5 × IQR
beyond the quartiles
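A small sketch computing the five-number summary and the 1.5 × IQR whisker rule is given below;
the unit-price values are assumed:

import numpy as np

prices = np.array([40, 43, 47, 52, 56, 60, 63, 70, 74, 120.0])   # hypothetical unit prices

q1, median, q3 = np.percentile(prices, [25, 50, 75])
summary = (prices.min(), q1, median, q3, prices.max())           # Minimum, Q1, Median, Q3, Maximum
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr
outliers = prices[(prices < lower_whisker) | (prices > upper_whisker)]
print(summary)
print(outliers)     # values beyond 1.5 x IQR are plotted individually as potential outliers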
Sub Topic Covered :- Measuring Data Similarity and Dissimilarity, Data Matrix versus Dissimilarity
Matrix, Proximity Measures for Nominal Attributes and Binary Attributes, Minkowski Distance,
Euclidean distance and Manhattan distance
Q3B What are measures of data similarity and dissimilarity? Explain the common properties of 5
Dissimilarity Measures
Ans. Distance or similarity measures are essential to solve many pattern recognition problems
such as classification and clustering. Various distance/similarity measures are available to
compare two data distributions. As the name suggests, a similarity measure quantifies how close two
distributions are. For multivariate data, complex summary methods are developed.
Similarity Measure
• Numerical measure of how alike two data objects are.
• Often falls between 0 (no similarity) and 1 (complete similarity).

Dissimilarity Measure
• Numerical measure of how different two data objects are.
• Range from 0 (objects are alike) to ∞ (objects are different).

Proximity refers to a similarity or dissimilarity.

Similarity/Dissimilarity for Simple Attributes


Here, p and q are the attribute values for two data objects.

Common Properties of Dissimilarity Measures

Distance, such as the Euclidean distance, is a dissimilarity measure and has some well
known properties:

d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,


d(p, q) = d(q,p) for all p and q,
d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r, where d(p, q) is the distance (dissimilarity)
between points (data objects), p and q.

A distance that satisfies these properties is called a metric.

Q3C Which data structure is more suitable for main memory-based clustering algorithms? 5

Ans. Page no 386 & 387 subpoint 7.2

Q4A What are interval-scaled variables? How can the data for a variable be standardized 5

Ans. Interval-scaled variables are continuous measurements of a roughly linear scale. Typical
examples include weight and height, latitude and longitude coordinates (e.g., when clustering
houses), and weather temperature

To standardize measurements, one choice is to convert the original measurements to unitless
variables.

Ref Page no 387 & 388 subpoint 7.2.1


Q4B Define following distance measured with suitable algebraic equations 5
1) Euclidean distance
2) Manhattan (or city block) distance
3) Minkowski distance
4) Mean absolute deviation
5) standardized measurement, or z-score
Ans. Ref Page no 389 subpoint 7.2.1
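For illustration only, the five measures listed above can be sketched in Python as follows; the sample
vectors and measurements are assumed, and the general Minkowski order h is taken as 3:

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])            # hypothetical data objects

euclidean = np.sqrt(((p - q) ** 2).sum())            # square root of the sum of squared differences
manhattan = np.abs(p - q).sum()                      # sum of absolute differences (city block)
minkowski = (np.abs(p - q) ** 3).sum() ** (1 / 3)    # general form with order h = 3

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # measurements of one variable f
mean_abs_dev = np.abs(x - x.mean()).mean()           # mean absolute deviation s_f
z_scores = (x - x.mean()) / mean_abs_dev             # standardized measurement (z-score) using s_f
print(euclidean, manhattan, round(minkowski, 3))
print(round(mean_abs_dev, 3), z_scores.round(3))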

Optional Module No 4, Classification, Credit 2

Syllabus :- Basic Concepts, General Approach to Classification, Decision Tree Induction,
Attribute Selection Measures, Tree Pruning, Scalability and Decision Tree Induction, Visual
Mining for Decision Tree Induction, Bayes Classification Methods, Bayes' Theorem, Naive
Bayesian Classification, Rule-Based Classification, Using IF-THEN Rules for Classification,
Rule Extraction from a Decision Tree, Rule Induction Using a Sequential Covering
Algorithm, Model Evaluation and Selection: Metrics for Evaluating Classifier Performance,
Holdout Method and Random Subsampling, Cross-Validation, Bootstrap, Model Selection
Using Statistical Tests of Significance, Comparing Classifiers Based on Cost-Benefit and
ROC Curves, Techniques to Improve Classification Accuracy: Introducing Ensemble
Methods, Bagging, Boosting and AdaBoost, Random Forests, Improving Classification
Accuracy of Class-Imbalanced Data.
Reference Book :- Han, Jiawei; Kamber, Micheline; Pei, Jian, “Data Mining: Concepts
and Techniques”, Elsevier Publishers, Third Edition, ISBN: 9780123814791, 9780123814807.
Model Question Based on Module -4

Sub Topic Covered :- Basic Concepts, General Approach to Classification, Decision Tree
Induction, Attribute Selection Measures, Tree Pruning
Q7B What Is Classification? What Is Prediction? 5

Ans. Ref. Page no 286

Q7C What is unsupervised learning ? Explain with suitable example 5

Ans. Ref. Page no 287

Q8A Explain how classification accuracy is estimated? How is numeric prediction different from 5
classification?
Ans. Ref. Page no 288

Q8B What are the criteria to be considered for comparing classification and prediction methods? 5

Ans. Ref. Page no 290 subpoint 6.2.2

Q8C Explain with a suitable tree structure how decision tree induction is used for classification? 5

Ans. Ref. Page no 291 Figure 6.2

Q8D Write the basic algorithm for inducing a decision tree from training tuples. Assume suitable data.

Ans. Ref. Figure 6.3 page no 293

Sub Topic Covered :- Scalability and Decision Tree Induction, Visual Mining for Decision
Tree Induction, Bayes Classification Methods, Bayes' Theorem, Naive Bayesian
Classification
Q9A Explain how scalable decision tree induction is? 5
Ans. 6.3.4 page 306

Q9B What are SLIQ and SPRINT? How do they overcome the limitations of decision tree induction? 5

Ans. SLIQ = Supervised Learning In Quest; SPRINT = Scalable PaRallelizable INduction of
decision Trees.
Both use an RID (a record identifier). Ref. page no 307
Q9C What are Bayesian classifiers? How can they predict class membership probabilities? 5

Ans. Ref page no 310

Q10A Write Bayes' theorem for the posterior probability. 5

Ans. Ref page no 310 subpoint 6.4.1

Q10B Explain the steps involved in the naïve Bayesian classifier? 5

Ans. Ref page no 311

Q10C What are the two components used to define a simple Bayesian belief network? Explain in 5
detail with suitable diagram.
Ans. 1) a directed acyclic graph and 2) a set of conditional probability tables
Ref Figure 6.11 page no 316

Sub Topic Covered :- Rule-Based Classification, Using IF-THEN Rules for Classification,
Rule Extraction from a Decision Tree, Rule Induction Using a Sequential Covering
Algorithm
Q1.A IF age = youth AND student = yes THEN buys computer = yes. 5
The above IF-THEN rule is for classification, Write which part is rule antecedent or
precondition and which part is rule consequent, Rewrite the above mentioned rule for “
suppose predicting whether a customer will buy a computer “
Ans. (age = youth) ^ (student = yes) ) (buys computer = yes).
Ref. Page no 319
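As a tiny illustrative sketch (the customer record and function name are hypothetical), applying such
an IF-THEN rule to a tuple looks like this:

def rule_buys_computer(tuple_):
    # Rule antecedent: (age = youth) AND (student = yes); rule consequent: buys computer = yes.
    if tuple_["age"] == "youth" and tuple_["student"] == "yes":
        return "yes"
    return "unknown"        # the rule does not fire, so it makes no prediction

customer = {"age": "youth", "student": "yes"}   # hypothetical customer
print(rule_buys_computer(customer))             # -> "yes"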
Q1.B Explain how size ordering and rule ordering are used in conflict resolution strategy ? 5
Ans. Ref page no 320
Q2A Which newer alternative approach for generating classification rules searches for attribute-value 5
pairs that occur frequently in the data?
Ans. In this newer alternative approach, classification rules can be generated using associative
classification algorithms, which search for attribute-value pairs that occur frequently in the
data. These pairs may form association rules, which can be analyzed and used in
classification. This latter approach is known as associative classification, since it is based on
association rule mining.

Page no 322 subtopic 6.5.3


Q2B Write the basic sequential covering algorithm. Clearly mention the input, output and steps involved. 5
Ans Figure 6.12 page 323
Q2C In ID3 and CART, which algorithmic approach is used? Explain in detail. 5
Ans. 6.3.1 page no 292
ID3 and CART adopt a greedy (i.e., nonbacktracking) approach in which decision trees are constructed in a top-down recursive divide-
and-conquer manner. Most algorithms for decision tree induction also follow such a top-down approach,

Q3A Draw and explain three possibilities for partitioning tuples based on the splitting criterion? 5

Ans. Figure 6.4 page 295

Sub Topic Covered :- Model Evaluation and Selection: Metrics for Evaluating Classifier
Performance, Holdout Method and Random Subsampling, Cross-Validation, Bootstrap,
Model Selection Using Statistical Tests of Significance
Q3B Explain measures for assessing how good or how “accurate” your classifier is at  5
predicting the class label of tuples
Ans. 8.5.1 page 365 D.M. 3rd edition

Q3C What are the four “building blocks” used in computing many evaluation measures? Give 5
algebraic formulae for the measure ?

Ans. Page 365 D.M. 3rd edition

Q4A Explain the confusion matrix with a suitable example of class labels? 5

Ans. Page 366 D.M. 3rd edition

Q4B What is class imbalance problem? Explain with suitable example? 5

Ans. Page 367 D.M. 3rd edition

Q4C Explain the use of sensitivity and specificity measures. Mention the algebraic formulation of 5
the same?
Ans. Page 367 D.M. 3rd edition

Q5A What are the additional aspects, besides accuracy, on which classifiers can also be compared? 5
Explain in detail.
Ans. Speed:
Robustness:
Scalability:
Interpretability:
Page 369 D.M. 3rd edition
Q5B Explain, with a labeled diagram, estimating accuracy with the holdout method? 5

Ans. Figure 8.17 Page 370 D.M. 3rd edition

Q5C Explain why k-fold cross-validation is recommended for estimating accuracy 5

Ans. Page 370-371 D.M. 3rd edition

Sub Topic Covered :- Comparing Classifiers Based on Cost-Benefit and ROC Curves,
Techniques to Improve Classification Accuracy: Introducing Ensemble Methods, Bagging,
Boosting and AdaBoost
Q6A What is the bootstrap method? Explain the most commonly used bootstrap method? 5

Ans. Page 371 D.M. 3rd edition

Q6B Explain why we require statistical tests of significance for model selection? 5

Ans. Page 372 D.M. 3rd edition sub point 8.5.5

Q6C Explain how costs and benefits are useful for Comparing Classifiers ?  5

Ans. Page 373 D.M. 3rd edition sub point 8.5.6

Q7A What are the major techniques to improve classification accuracy? Explain any one in detail?

Ans. Page 379 D.M. 3rd edition sub point 8.6.1

Q7B 5

In the above table, tuples are sorted by decreasing score, where the score is the value returned
by a probabilistic classifier. Draw the ROC curve for the data shown in the table.

Ans. Figure 8.19 page 376 D.M. 3rd edition

Q7C Write an algorithm for bagging. Clearly mention the input, output and method.

Ans. Figure 8.23 page 380 D.M. 3rd edition

Q8A Explain how accuracy increases in the Adaptive Boosting (AdaBoost) algorithm?

Ans. Subpoint 8.6.3 page 381 D.M. 3rd edition

Sub Topic Covered :- Scalability and Decision Tree Induction, Visual Mining for Decision
Tree Induction, Bayes Classification Methods, Bayes' Theorem, Naive Bayesian
Classification
Q8B

Ans.

Q8C

Ans.

Q9A

Ans.

Q9B

Ans.

Q9C

Ans.

Sub Topic Covered :- Random Forests, Improving Classification Accuracy of Class-Imbalanced Data.
Q10A

Ans.

Q10B

Ans.

Q10C

Ans.

Q11 A
Ans.

Q11 B

Ans.

Q11 C

Ans.

PRACTICE QUESTIONS (HOMEWORK)

1.

2.

3.

4.

5.

6.

7.

8.

9.

10.

