Academic documents
Professional documents
Culture documents
Distinguishing features of OLTP and OLAP systems:
Users and system orientation: an OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients and IT professionals; an OLAP system is market-oriented and is used for data analysis by knowledge workers such as managers, executives and analysts.
Data contents: an OLTP system manages current, detailed data; an OLAP system manages large amounts of historical, summarized and consolidated data.
Database design: an OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design; an OLAP system typically adopts a star or snowflake model and a subject-oriented design.
View: an OLTP system focuses mainly on the current data within an enterprise or department; an OLAP system spans multiple versions of a database schema and deals with information that originates from different organizations.
Access patterns: OLTP access consists mainly of short, atomic transactions (reads and writes); OLAP access is mostly read-only, often consisting of complex queries over historical data.
V = (68000 − 38400) / 12300 = 2.4065 ≈ 2.41.
5. Classification vs. prediction: classification predicts categorical (discrete, unordered) class labels, whereas prediction models continuous-valued functions, i.e. it predicts missing or unavailable numerical data values.
6.
Node   | C0 | C1 | Gini index
Node-1 |  3 |  3 | 1 − (3/6)² − (3/6)² = 0.500
Node-2 |  4 |  2 | 1 − (4/6)² − (2/6)² = 0.444
Node-3 |  6 |  0 | 1 − (6/6)² − (0/6)² = 0.000
Node-4 |  1 |  5 | 1 − (1/6)² − (5/6)² = 0.278
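A small Python sketch that reproduces the Gini values in the table (the helper name gini is ours):

def gini(counts):
    """Gini index of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

nodes = {"Node-1": (3, 3), "Node-2": (4, 2), "Node-3": (6, 0), "Node-4": (1, 5)}
for name, counts in nodes.items():
    print(name, round(gini(counts), 3))   # 0.5, 0.444, 0.0, 0.278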
7.
Agglomerative (bottom-up approach): starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one, or until a termination condition holds.

Divisive (top-down approach): starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object forms its own cluster, or until a termination condition holds.
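A minimal sketch of agglomerative (bottom-up) clustering with SciPy's hierarchical-clustering routines, assuming SciPy is available; the sample points and the distance cut-off are invented for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five invented 2-D points; agglomerative merging with single linkage.
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.2], [5.3, 4.9], [9.0, 0.5]])
Z = linkage(X, method="single")                      # merge history, closest pairs first
labels = fcluster(Z, t=2.0, criterion="distance")    # cut the dendrogram at distance 2.0
print(labels)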
8.
True positives (TP): the positive tuples that were correctly labeled by the classifier. Let TP be the number of true positives.
True negatives (TN): the negative tuples that were correctly labeled by the classifier. Let TN be the number of true negatives.
False positives (FP): the negative tuples that were incorrectly labeled as positive. Let FP be the number of false positives.
False negatives (FN): the positive tuples that were mislabeled as negative. Let FN be the number of false negatives.
Confusion matrix:

                        PREDICTED CLASS
                        YES     NO      TOTAL
ACTUAL CLASS   YES      TP      FN      P
               NO       FP      TN      N
               TOTAL    P'      N'      P + N

where P and N are the numbers of actual positive and negative tuples, and P' and N' are the numbers of tuples predicted as positive and negative, respectively.
Precision = TP / P' = TP / (TP + FP), the proportion of tuples labeled as positive that actually are positive.
Recall, also called sensitivity or the true positive (recognition) rate, = TP / P = TP / (TP + FN), the proportion of positive tuples that are correctly identified.
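A small Python sketch computing precision and recall from the four counts (the example counts are invented):

def precision_recall(tp, fp, fn, tn):
    """Precision = TP/(TP+FP); recall (sensitivity) = TP/(TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Invented counts: 90 true positives, 10 false positives, 30 false negatives, 870 true negatives.
p, r = precision_recall(tp=90, fp=10, fn=30, tn=870)
print(f"precision={p:.2f}, recall={r:.2f}")   # precision=0.90, recall=0.75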
11. (a)
Incomplete, inconsistent and noisy data are commonplace properties of large real-world databases. Attributes of interest may not always be available, and other data may have been included only because it was considered important at the time of entry. Relevant data may sometimes not be recorded. Furthermore, modifications to the data may not have been recorded. There are many possible reasons for noisy data (incorrect attribute values): human or computer errors that occurred during data entry, or inconsistencies in the naming conventions adopted. Sometimes duplicate tuples may occur.
Data cleaning routines work to clean the data by filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies in the data. Although mining routines have some form of handling noisy data, they are not always robust. If you want to include data from many sources in your analysis, data integration is required; naming inconsistencies may occur in this context, and a large amount of redundant data may confuse or slow down the knowledge discovery process. In addition to data cleaning, steps must be taken to remove redundancies in the data.
Sometimes the data has to be normalized so that it is scaled to a specific range, e.g. [0.0, 1.0], in order for data mining algorithms such as neural networks or clustering to work well. Furthermore, you may need to aggregate the data, e.g. as sales per region; when such a summary is not part of the stored data, data transformation methods need to be applied to the data.
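A minimal sketch of scaling an attribute to the range [0.0, 1.0] with min-max normalization (the helper name and the sample income values are our own, purely for illustration):

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

income = [12000, 38400, 68000, 98000]    # invented sample values
print(min_max_normalize(income))         # [0.0, 0.3069..., 0.6511..., 1.0]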
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same or almost the same analytical results. There are a number of strategies for data reduction: data compression, numerosity reduction and generalization.
Data reduction:
Normally the data used for data mining is huge. Complex analysis and data mining on huge amounts of data can take a very long time, making such analysis impractical or infeasible.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data. That is, mining on the reduced set should be more efficient and yet produce the same (or almost the same) analytical results.
Overview of Data Reduction Strategies
Data reduction strategies include dimensionality reduction, numerosity
reduction, and
data compression.
Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. Dimensionality reduction methods include wavelet transforms and principal components analysis, which transform or project the original data onto a smaller space. Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
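A minimal sketch of principal components analysis with scikit-learn, assuming scikit-learn is available (the random data and the choice of three components are only for illustration):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))            # 100 tuples described by 10 attributes
pca = PCA(n_components=3)                 # project onto the 3 strongest components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                    # (100, 3)
print(pca.explained_variance_ratio_)      # fraction of variance kept by each component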
Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric. For parametric methods, a model is used to estimate the data, so that typically only the model parameters need to be stored instead of the actual data (outliers may also be stored); regression and log-linear models are examples. Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling and data cube aggregation.
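A minimal sketch of numerosity reduction by simple random sampling without replacement, using only the Python standard library (the data size and sample size are invented):

import random

data = list(range(1, 10001))            # pretend this is a relation of 10,000 tuples
random.seed(42)
sample = random.sample(data, k=500)     # SRSWOR: a 5% sample stands in for the full data set
print(len(sample), sample[:5])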
In data compression, transformations are applied so as to obtain a reduced or compressed representation of the original data. If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. There are several lossless algorithms for string compression; however, they typically allow only limited manipulation of the data.
Basic heuristic methods of attribute subset selection include the following.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set; at each subsequent step, the best of the remaining attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes and, at each step, removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that, at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
If the mining task is classification, and the mining algorithm itself is used to determine the attribute subset, then it is called a wrapper approach; otherwise it is a filter approach. In general, the wrapper approach leads to greater accuracy, since it optimizes the evaluation measure of the algorithm while removing attributes. However, it requires much more computation than a filter approach.
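A minimal sketch of wrapper-style forward selection using scikit-learn's SequentialFeatureSelector, assuming scikit-learn 0.24 or later (the dataset and the choice of a decision tree are only for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# Wrapper approach: the classifier's own cross-validated score evaluates each candidate subset.
selector = SequentialFeatureSelector(tree, n_features_to_select=5, direction="forward", cv=3)
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the 5 selected attributes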
Data Compression:
In data compression, data encoding or transformations are applied so as to obtain a reduced or compressed representation of the original data. If the original data can be reconstructed from the compressed data without loss of information, the data compression technique is called lossless. If, instead, we can reconstruct only an approximation of the original data, it is called lossy.
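A small Python sketch of the lossless case using the standard-library zlib module (the sample byte string is arbitrary):

import zlib

original = b"data mining, data warehousing, " * 200
compressed = zlib.compress(original)        # reduced (compressed) representation
restored = zlib.decompress(compressed)      # exact reconstruction: no information loss
print(len(original), len(compressed), restored == original)   # original size, smaller size, True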
11. (b)
Characterization is a summarization of the general characteristics or features of a target class of data. For example, the characteristics of students can be summarized, generating a profile of all the university first-year computing science students, which may include such information as a high GPA and a large number of courses taken.
Discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. For example, the general features of students with high GPAs may be compared with the general features of students with low GPAs. The resulting description could be a general comparative profile of the students, such as: 75% of the students with high GPAs are fourth-year computing science students, while 65% of the students with low GPAs are not.
Association is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. For example, a data mining system may find association rules like
birth_date,
13. (a)
Given two data objects X = (22, 1, 24, 10, 46) and Y = (12, 2, 24, 23, 46):

Distance measure | Solution
Euclidean | sqrt((22−12)² + (1−2)² + (24−24)² + (10−23)² + (46−46)²) = sqrt(100 + 1 + 0 + 169 + 0) = sqrt(270) = 16.43
Manhattan (city block) | |22−12| + |1−2| + |24−24| + |10−23| + |46−46| = 10 + 1 + 0 + 13 + 0 = 24
Supremum (Chebyshev) | max(|22−12|, |1−2|, |24−24|, |10−23|, |46−46|) = |10−23| = 13. To compute it, we find the attribute f that gives the maximum difference in values between the two objects.
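A short Python sketch computing the three distances for the X and Y above:

import math

X = [22, 1, 24, 10, 46]
Y = [12, 2, 24, 23, 46]

diffs = [abs(x - y) for x, y in zip(X, Y)]
euclidean = math.sqrt(sum(d ** 2 for d in diffs))   # sqrt(270) ≈ 16.43
manhattan = sum(diffs)                              # 24
supremum = max(diffs)                               # 13 (Chebyshev / L-infinity)
print(euclidean, manhattan, supremum)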
13. (b)
OLAP operations. Let us look at some typical OLAP operations for multidimensional data. Each of the following operations is illustrated in Figure 4.12. At the center of the figure is a data cube for AllElectronics sales. The cube contains the dimensions location, time, and item, where location is aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item types. To aid in our explanation, we refer to this cube as the central cube. The measure displayed is dollars sold (in thousands). (For improved readability, only some of the cube's cell values are shown.) The data examined are for the cities Chicago, New York, Toronto, and Vancouver.
Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Figure 4.12 shows the result of a roll-up operation performed on the central cube by climbing up the concept hierarchy for location. This hierarchy was defined as the total order street < city < province or state < country. The roll-up operation shown aggregates the data by ascending the location hierarchy from the level of city to the level of country. In other words, rather than grouping the data by city, the resulting cube groups the data by country.
When roll-up is performed by dimension reduction, one or more dimensions are removed from the given cube. For example, consider a sales data cube containing only the location and time dimensions. Roll-up may be performed by removing, say, the time dimension, resulting in an aggregation of the total sales by location, rather than by location and by time.
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions. Figure 4.12 shows the result of a drill-down operation performed on the central cube by stepping down a concept hierarchy for time defined as day < month < quarter < year. Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month. The resulting data cube details the total sales per month rather than summarizing them by quarter.
Because a drill-down adds more detail to the given data, it can also be
performed by adding new dimensions to a cube. For example, a drill-down on
the central cube of Figure 4.12 can occur by introducing an additional
dimension, such as customer group.
Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. Figure 4.12 shows a slice operation where the sales data are selected from the central cube for the dimension time using the criterion time = "Q1". The dice operation defines a subcube by performing a selection on two or more dimensions. Figure 4.12 shows a dice operation on the central cube based on the following selection criteria, which involve three dimensions: (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer").
Pivot (rotate): Pivot (also called rotate) is a visualization operation that
rotates the data
axes in view to provide an alternative data presentation. Figure 4.12 shows a
pivot operation where the item and location axes in a 2-D slice are rotated.
Other examples
include rotating the axes in a 3-D cube, or transforming a 3-D cube into a
series of 2-D planes.
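As an aside, a minimal pandas sketch of roll-up and slice on a toy sales table (the table, its column names and its numbers are invented for illustration, not the AllElectronics data):

import pandas as pd

sales = pd.DataFrame({
    "country": ["Canada", "Canada", "USA", "USA"],
    "city": ["Toronto", "Vancouver", "Chicago", "New York"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "dollars_sold": [600, 800, 850, 1000],
})

# Roll-up: climb the location hierarchy from city to country.
rollup = sales.groupby(["country", "quarter"])["dollars_sold"].sum()

# Slice: select on one dimension (time = Q1), yielding a subcube.
slice_q1 = sales[sales["quarter"] == "Q1"]

print(rollup)
print(slice_q1)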
14. (a)
The Apriori algorithm: The first pass of the algorithm simply counts item occurrences to determine the large 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the large itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the Apriori candidate generation function (apriori-gen) described below. Next, the database is scanned and the support of the candidates in Ck is counted.
The Apriori algorithm is:
L1 = {large 1-itemsets};
for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
    Ck = apriori-gen(Lk-1);           // new candidates
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);           // candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = { c ∈ Ck | c.count ≥ minsup };
end
Answer = ∪k Lk;
The apriori-gen function takes as argument Lk-1, the set of all large (k-1)-itemsets.
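For reference, a compact runnable Python sketch of the same idea (not the original pseudocode; the names apriori and apriori_gen and the use of an absolute minimum support count are our own choices, and the transactions are those of the worked example below):

from itertools import combinations

def apriori_gen(prev_large, k):
    """Build candidate k-itemsets whose every (k-1)-subset is a large (k-1)-itemset."""
    items = sorted(set().union(*prev_large))
    candidates = set()
    for cand in combinations(items, k):
        cand = frozenset(cand)
        if all(frozenset(sub) in prev_large for sub in combinations(cand, k - 1)):
            candidates.add(cand)
    return candidates

def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]
    items = {frozenset([i]) for t in transactions for i in t}
    large = {c for c in items if sum(c <= t for t in transactions) >= minsup}
    answer, k = set(large), 2
    while large:
        candidates = apriori_gen(large, k)
        large = {c for c in candidates if sum(c <= t for t in transactions) >= minsup}
        answer |= large
        k += 1
    return answer

D = [{"K", "A", "D", "B"}, {"D", "A", "C", "E", "B"}, {"C", "A", "B", "E"}, {"B", "A", "D"}]
print(sorted(sorted(s) for s in apriori(D, minsup=2)))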
The transaction database D:

TID  | Items
T100 | K, A, D, B
T200 | D, A, C, E, B
T300 | C, A, B, E
T400 | B, A, D

Assume a minimum support count of 2.

Scan D for the count of each candidate in C1, then compare the counts with the minimum support count to obtain L1:

C1: {A}: 4, {B}: 4, {C}: 2, {D}: 3, {E}: 2, {K}: 1
L1: {A}: 4, {B}: 4, {C}: 2, {D}: 3, {E}: 2

Generate the C2 candidates from L1, scan D for their counts, and keep those meeting the minimum support to obtain L2:

C2: {A,B}: 4, {A,C}: 2, {A,D}: 3, {A,E}: 2, {B,C}: 2, {B,D}: 3, {B,E}: 2, {C,D}: 1, {C,E}: 2, {D,E}: 1
L2: {A,B}: 4, {A,C}: 2, {A,D}: 3, {A,E}: 2, {B,C}: 2, {B,D}: 3, {B,E}: 2, {C,E}: 2

Generate the C3 candidates from L2 (candidates containing {C,D} or {D,E}, which are not in L2, are pruned), scan D for their counts, and obtain L3:

C3: {A,B,C}: 2, {A,B,D}: 3, {A,B,E}: 2, {A,C,E}: 2, {B,C,E}: 2
L3: {A,B,C}: 2, {A,B,D}: 3, {A,B,E}: 2, {A,C,E}: 2, {B,C,E}: 2

Generate the C4 candidates from L3 ({A,B,C,D} and {A,B,D,E} are pruned because some of their 3-item subsets are not in L3), scan D for their counts, and obtain L4:

C4: {A,B,C,E}: 2
L4: {A,B,C,E}: 2
C5 is a null set and the algorithm terminates. Frequent item set is {A,B,C,E}
Rule generation
The frequent itemset is {A,B,C,E}. The subsets of {A,B,C,E}, excluding the empty set and the set itself, are {A}, {B}, {C}, {E}, {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}, {A,B,C}, {A,B,E}, {A,C,E}, {B,C,E}.
Rule no | Association rule    | Confidence in percentage
R1      | {A} => {B,C,E}      | (2/4)*100 = 50
R2      | {B} => {A,C,E}      | (2/4)*100 = 50
R3      | {C} => {A,B,E}      | (2/2)*100 = 100
R4      | {E} => {A,B,C}      | (2/2)*100 = 100
R5      | {A,B} => {C,E}      | (2/4)*100 = 50
R6      | {A,C} => {B,E}      | (2/2)*100 = 100
R7      | {A,E} => {B,C}      | (2/2)*100 = 100
R8      | {B,C} => {A,E}      | (2/2)*100 = 100
R9      | {B,E} => {A,C}      | (2/2)*100 = 100
R10     | {C,E} => {A,B}      | (2/2)*100 = 100
R11     | {A,B,C} => {E}      | (2/2)*100 = 100
R12     | {A,B,E} => {C}      | (2/2)*100 = 100
R13     | {A,C,E} => {B}      | (2/2)*100 = 100
R14     | {B,C,E} => {A}      | (2/2)*100 = 100
If the minimum confidence threshold is 80%, then the set of strong association rules is {R3, R4, R6, R7, R8, R9, R10, R11, R12, R13, R14}.
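A short Python sketch that enumerates these rules and recomputes their confidences from the transaction database D of the worked example (the helper name support_count is ours; minconf = 0.8 matches the 80% threshold above):

from itertools import combinations

D = [{"K", "A", "D", "B"}, {"D", "A", "C", "E", "B"}, {"C", "A", "B", "E"}, {"B", "A", "D"}]
freq = frozenset({"A", "B", "C", "E"})

def support_count(itemset):
    """Number of transactions containing the itemset."""
    return sum(itemset <= t for t in D)

minconf = 0.8
for r in range(1, len(freq)):
    for antecedent in map(frozenset, combinations(sorted(freq), r)):
        consequent = freq - antecedent
        conf = support_count(freq) / support_count(antecedent)
        if conf >= minconf:
            print(set(antecedent), "=>", set(consequent), f"confidence={conf:.0%}")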
14.(b)
Figure 1: The attribute age has the highest information gain and
therefore becomes the splitting attribute at the root node of the
decision tree. Branches are grown for each outcome of age.
The tuples are shown partitioned accordingly.
[Worked clustering computation: cluster centres C2 = (6, 5.6) and C3 = (2.5, 5.5), with the distance of each data point to the two centres tabulated; the underlying data points are not recoverable here.]
17. (a)
Standard deviation measures spread about the mean and should be used only when the mean is chosen as the measure of the center. Standard deviation = 0 only when there is no spread, that is, when all observations have the same value; otherwise standard deviation > 0. Variance and standard deviation are algebraic measures, so their computation is scalable in large databases.
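Because they are algebraic measures, the variance and standard deviation can be computed in a single scan by accumulating n, the sum of the values and the sum of their squares; a minimal sketch (the sample values are invented):

import math

def one_pass_std(values):
    """Single-scan (scalable) computation of mean, variance and standard deviation."""
    n = sum_x = sum_x2 = 0
    for x in values:            # one pass over the data
        n += 1
        sum_x += x
        sum_x2 += x * x
    mean = sum_x / n
    variance = sum_x2 / n - mean ** 2
    return mean, variance, math.sqrt(variance)

print(one_pass_std([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]))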
Graphic Displays
Boxplot: graphic display of a five-number summary.
Histogram: the x-axis shows values, the y-axis represents frequencies.
Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi% of the data are ≤ xi.
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane.
Measure            | Formula
Mean               | x̄ = (1/n) Σ xi
Weighted mean      | x̄ = (Σ wi·xi) / (Σ wi)
Median             | middle value of the ordered data (average of the two middle values when n is even); for grouped data, the empirical formula: median ≈ L1 + ((n/2 − (Σ freq)l) / freq_median) × width
Variance           | σ² = (1/n) Σ (xi − x̄)²
Standard deviation | σ = √(variance)
17. (c)
Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers: they can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' theorem is named after Thomas Bayes. There are two types of probabilities involved: the posterior probability, P(H | X), and the prior probability, P(H).
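Written out, with H a hypothesis (for example, that tuple X belongs to class C) and X the observed evidence, Bayes' theorem is:

P(H | X) = P(X | H) · P(H) / P(X)

That is, the posterior probability of H given X equals the likelihood of X under H, times the prior probability of H, divided by the probability of the evidence X.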