
1. Distinguishing features of OLTP and OLAP systems:

Users and system orientation
OLTP: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals.
OLAP: An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.

Data contents
OLTP: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making.
OLAP: An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use for informed decision making.

Database design
OLTP: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design.
OLAP: An OLAP system typically adopts either a star or a snowflake model and a subject-oriented database design.

View
OLTP: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations.
OLAP: An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.

Access patterns
OLTP: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms.
OLAP: Accesses to OLAP systems are mostly read-only operations (because most data warehouses store historical rather than up-to-date information), although many could be complex queries.

2. Data mining refers to extracting or mining knowledge from large amounts of data. Data mining is an essential step in the knowledge discovery process. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base. Data mining is only one step of the knowledge discovery process, but an essential one, because it uncovers hidden patterns for evaluation.
3. Normalization, where the attribute data are scaled so as to fall within a small specified range. In z-score normalization (or zero-mean normalization), the values for an attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v' by computing

v' = (v - mean(A)) / std(A)

where mean(A) and std(A) are the mean and standard deviation, respectively, of attribute A. This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate min-max normalization.

v' = (68000 - 38400) / 12300 = 2.40650406504065 ≈ 2.41
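As a quick check of the calculation above, here is a minimal Python sketch of z-score normalization (the function name is ours; only the value 68000, the mean 38400, and the standard deviation 12300 come from the example):

def z_score(v, mean, std):
    # z-score (zero-mean) normalization of a single value
    return (v - mean) / std

print(round(z_score(68000, 38400, 12300), 2))   # 2.41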

4. Objective measures of pattern interestingness exist, based on the structure of discovered patterns and the statistics underlying them. An objective measure for association rules of the form X ⇒ Y is rule support, representing the percentage of transactions from a transaction database that the given rule satisfies. This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of itemsets X and Y. Another objective measure for association rules is confidence, which assesses the degree of certainty of the detected association. This is taken to be the conditional probability P(Y|X), that is, the probability that a transaction containing X also contains Y. More formally, support and confidence are defined as

support(X ⇒ Y) = P(X ∪ Y)
confidence(X ⇒ Y) = P(Y|X)
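A minimal Python sketch of these two measures; the transactions below are made up purely for illustration:

def support(transactions, itemset):
    # fraction of transactions containing every item of itemset, i.e. P(X ∪ Y)
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, X, Y):
    # conditional probability P(Y|X) = support(X ∪ Y) / support(X)
    return support(transactions, set(X) | set(Y)) / support(transactions, X)

T = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
print(support(T, {"bread", "milk"}))       # 0.5
print(confidence(T, {"bread"}, {"milk"}))  # 0.666...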
5.
Classification: Constructs a set of models (or functions) that describe and distinguish data classes or concepts. It is used for predicting the class label of data objects.
Prediction: Builds a model to predict some missing or unavailable, and often numerical, data values. It is used for predicting missing numerical data values.

6.
Node      C0   C1   Gini index
Node-1    3    3    1 - (3/6)² - (3/6)² = 0.500
Node-2    4    2    1 - (4/6)² - (2/6)² = 0.444
Node-3    6    0    1 - (6/6)² - (0/6)² = 0.000
Node-4    1    5    1 - (1/6)² - (5/6)² = 0.278
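The values in the table can be reproduced with a short Python sketch, written here only to verify the arithmetic:

def gini(counts):
    # Gini index of a node, given the class counts at that node
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

for name, counts in [("Node-1", (3, 3)), ("Node-2", (4, 2)),
                     ("Node-3", (6, 0)), ("Node-4", (1, 5))]:
    print(name, round(gini(counts), 3))   # 0.5, 0.444, 0.0, 0.278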

7.
Agglomerative (bottom-up approach): Starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one, or until a termination condition holds.
Divisive (top-down approach): Starts with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object forms its own cluster, or until a termination condition holds.

8.
True positives (TP): These refer to the positive tuples that were correctly labeled by the classifier. Let TP be the number of true positives.
True negatives (TN): These are the negative tuples that were correctly labeled by the classifier. Let TN be the number of true negatives.
False positives (FP): These are the negative tuples that were incorrectly labeled as positive. Let FP be the number of false positives.
False negatives (FN): These are the positive tuples that were mislabeled as negative. Let FN be the number of false negatives.

Confusion matrix:

                          PREDICTED CLASS
                          YES      NO       TOTAL
ACTUAL CLASS   YES        TP       FN       P
               NO         FP       TN       N
               TOTAL      P'       N'       P+N

Here P = TP + FN is the number of actual positive tuples, N = FP + TN is the number of actual negative tuples, and P' = TP + FP is the number of tuples predicted positive.
Precision: TP/P', the proportion of tuples predicted positive that really are positive.
Recall (also called sensitivity or the true positive recognition rate): TP/P, the proportion of positive tuples that are correctly identified.
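A minimal sketch of these measures in Python, using hypothetical confusion-matrix counts:

def precision(tp, fp):
    # fraction of tuples predicted positive that really are positive: TP / P'
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of actual positive tuples that are correctly identified: TP / P
    return tp / (tp + fn)

TP, FP, FN, TN = 90, 10, 30, 870          # hypothetical counts
print(precision(TP, FP))                  # 0.9
print(recall(TP, FN))                     # 0.75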

9. MinPts specifies the density threshold of dense regions: it is the minimum number of points required in the Eps-neighbourhood of a point.


10.
Data analysis and decision support
- Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross-selling, market segmentation
- Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
- Fraud detection and detection of unusual patterns (outliers)
Other applications
- Text mining (newsgroups, email, documents) and Web mining
- Stream data mining
- Bioinformatics and bio-data analysis

11
(a) Incomplete, inconsistent, and noisy data are commonplace properties of large real-world databases. Attributes of interest may not always be available, and other data may have been included only because it was considered important at the time of entry. Relevant data may sometimes not be recorded, and modifications to the data may not have been recorded either. There are many possible reasons for noisy data (incorrect attribute values): there could have been human as well as computer errors during data entry, there could be inconsistencies in the naming conventions adopted, and sometimes duplicate tuples may occur.
Data cleaning routines work to clean the data by filling in missing values, smoothing noisy data, identifying and removing outliers, and resolving inconsistencies in the data. Although mining routines have some form of handling for noisy data, they are not always robust. Including data from many sources in the analysis requires data integration. Naming inconsistencies may occur in this context, and a large amount of redundant data may confuse or slow down the knowledge discovery process, so in addition to data cleaning, steps must be taken to remove redundancies in the data.
Sometimes data have to be normalized so that they are scaled to a specific range, e.g. [0.0, 1.0], in order for data mining algorithms such as neural networks or clustering to work well. Furthermore, you may need to aggregate data, e.g. compute sales per region, something that is not part of the source data; thus data transformation methods need to be applied to the data.
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same or almost the same analytical results. There are a number of strategies for data reduction: data compression, numerosity reduction, and generalization.
Data reduction:
Data used for data mining is normally huge. Complex analysis and mining on huge amounts of data can take a very long time, making such analysis impractical or infeasible. Data reduction techniques can be applied to obtain a reduced representation of the data set that is smaller in volume yet closely maintains the integrity of the original data. That is, mining on the reduced set should be more efficient and yet produce the same or almost the same analytical results.
Overview of Data Reduction Strategies
Data reduction strategies include dimensionality reduction, numerosity
reduction, and
data compression.
Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. Dimensionality reduction methods include wavelet transforms and principal components analysis, which transform or project the original data onto a smaller space. Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
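As an illustration of projecting data onto a smaller space, here is a minimal principal components sketch using NumPy's SVD; the data matrix is made up and this is only one of several ways to implement PCA:

import numpy as np

def pca_project(X, k):
    # project the rows of X onto the top-k principal components
    Xc = X - X.mean(axis=0)                       # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # coordinates in the reduced space

X = np.array([[2.5, 2.4, 1.0], [0.5, 0.7, 0.2], [2.2, 2.9, 1.1], [1.9, 2.2, 0.9]])
print(pca_project(X, 2))                          # 4 tuples reduced from 3 attributes to 2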
Numerosity reduction techniques replace the original data volume by alternative, smaller forms of data representation. These techniques may be parametric or nonparametric. For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored instead of the actual data (outliers may also be stored); regression and log-linear models are examples. Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling, and data cube aggregation.
In data compression, transformations are applied so as to obtain a reduced or compressed representation of the original data. If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. There are several lossless algorithms for string compression; however, they typically allow only limited data manipulation. Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.
Data Cube Aggregation:
Data cubes store multidimensional aggregated information. Each cell holds an aggregate data value corresponding to a point in the multidimensional space. Concept hierarchies may exist for each attribute, allowing analysis of data at multiple levels of abstraction. For example, a hierarchy for branch could allow branches to be grouped into regions, based on their address. Data cubes provide fast access to precomputed, summarized data, thereby benefiting online analytical processing as well as data mining.
The cube created at the lowest level of abstraction is referred to as the base cuboid. Data cubes created at various levels of abstraction are referred to as cuboids, so that a data cube may refer to a lattice of cuboids. Each higher level of abstraction further reduces the resulting data size. The smallest cuboid relevant to the given task should be used for data analysis.
Dimensionality Reduction:
Data sets may contain hundreds of attributes, most of which may be irrelevant to the data mining task or redundant. Although it may be possible for a domain expert to pick out some of the useful attributes, this can be a difficult and time-consuming task. Leaving out relevant attributes or keeping irrelevant attributes may cause confusion for the mining algorithm employed, and the redundant data may slow down the mining process. Dimensionality reduction reduces the data set size by removing such attributes (dimensions) from it. Typically, methods of attribute subset selection are applied. The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all the attributes. Mining on a reduced set of attributes has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
Basic heuristic methods of attribute subset selection include the following techniques:
1. Stepwise forward selection: The procedure starts with an empty set of attributes. The best of the original attributes is determined and added to the set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.

2. Stepwise backward elimination: The procedure starts with the full set
of attributes and at each step, it removes the worst attribute remaining in
the set.
3. Combination of forward selection and backward elimination: The
stepwise forward selection and backward elimination methods can be
combined so that at each step, the procedure selects the best attributes and
removes the worst from among the remaining attributes.
If the mining task is classification, and the mining algorithm itself is used to determine the attribute subset, then it is called a wrapper approach; otherwise, it is a filter approach. In general, the wrapper approach leads to greater accuracy since it optimizes the evaluation measure of the algorithm while removing attributes; however, it requires much more computation than a filter approach.
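A sketch of greedy stepwise forward selection in the wrapper spirit; the scoring function stands in for whatever evaluation measure (e.g., cross-validated accuracy of the mining algorithm) is used, and all names below are illustrative:

def forward_selection(attributes, score):
    # greedily add the attribute that most improves score(subset); stop when nothing helps
    selected, remaining = [], list(attributes)
    best = score(selected)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        cand_score = score(selected + [candidate])
        if cand_score <= best:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = cand_score
    return selected

# toy scoring function that rewards attributes 'a' and 'c' and penalizes subset size
toy_score = lambda subset: len(set(subset) & {"a", "c"}) - 0.1 * len(subset)
print(forward_selection(["a", "b", "c", "d"], toy_score))   # ['a', 'c']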
Data Compression:
In data compression, data encoding or transformations are applied so as to obtain a reduced or compressed representation of the original data. If the original data can be reconstructed from the compressed data without loss of information, the data compression technique is called lossless. If, instead, we can reconstruct only an approximation of the original data, it is called lossy.
11
(b)
Characterization is a summarization of the general characteristics or features of a target class of data. For example, the characteristics of students can be produced, generating a profile of all the university's first-year computing science students, which may include such information as a high GPA and a large number of courses taken.
Discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. For example, the general features of students with high GPAs may be compared with the general features of students with low GPAs. The resulting description could be a general comparative profile of the students, such as: 75% of the students with high GPAs are fourth-year computing science students, while 65% of the students with low GPAs are not.
Association is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. For example, a data mining system may find association rules like

major(X, "computing science") ⇒ owns(X, "personal computer") [support = 12%, confidence = 98%]

where X is a variable representing a student. The rule indicates that of the students under study, 12% (support) major in computing science and own a personal computer. There is a 98% probability (confidence, or certainty) that a student in this group owns a personal computer.
Classification differs from prediction in that the former constructs a set of
models (or functions) that describe and distinguish data classes or concepts,
whereas the latter builds a model to predict some missing or unavailable,
and often numerical, data values. Their similarity is that they are both tools
for prediction: Classification is used for predicting the class label of data
objects and prediction is typically used for predicting missing numerical data
values.
Clustering analyzes data objects without consulting a known class label.
The objects are clustered or grouped based on the principle of maximizing
the intraclass similarity and minimizing the interclass similarity. Each cluster
that is formed can be viewed as a class of objects. Clustering can also
facilitate taxonomy formation, that is, the organization of observations into a
hierarchy of classes that group similar events together.
Data evolution analysis describes and models regularities or trends for
objects whose behavior changes over time. Although this may include
characterization, discrimination, association, classification, or clustering of
time-related data, distinct features of such an analysis include time-series
data analysis, sequence or periodicity pattern matching, and similarity-based
data analysis.
12.
(a) Attribute-Oriented Induction
Attribute-oriented induction was proposed in 1989 (KDD '89 workshop) and is not confined to categorical data nor to particular measures. It collects the task-relevant data (the initial relation) using a relational database query, performs generalization by attribute removal or attribute generalization, applies aggregation by merging identical generalized tuples and accumulating their respective counts, and presents the results interactively to users.

Basic Principles of Attribute-Oriented Induction
Data focusing: select the task-relevant data, including dimensions; the result is the initial relation.
Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
Attribute generalization: if there is a large set of distinct values for A and there exists a set of generalization operators on A, then select an operator and generalize A.
Attribute-threshold control: typically 2-8, specified or default.
Generalized relation threshold control: controls the final relation/rule size.
Attribute-Oriented Induction: Basic Algorithm
InitialRel: Query processing of task-relevant data, deriving the initial
relation.
PreGen: Based on the analysis of the number of distinct values in each
attribute, determine generalization plan for each attribute: removal? or how
high to generalize?
PrimeGen: Based on the PreGen plan, perform generalization to the right
level to derive a prime generalized relation, accumulating the counts.
Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting,
(3) mapping into rules, cross tabs, visualization presentations.
Example
DMQL: Describe the general characteristics of graduate students in the Big-University database.

use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in "graduate"

Corresponding SQL statement:

Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
from student
where status in {"Msc", "MBA", "PhD"}

12. (b) Three-tier data warehouse architecture

1. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (e.g., customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations); or (2) a multidimensional OLAP (MOLAP) model (i.e., a special-purpose server that directly implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and
reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and
so on).

13. (a) Given two data objects X = (22, 1, 24, 10, 46) and Y = (12, 2, 24, 23, 46):

Distance measure          Solution
Manhattan (city block)    |22-12| + |1-2| + |24-24| + |10-23| + |46-46| = 10 + 1 + 0 + 13 + 0 = 24
Euclidean                 sqrt((22-12)² + (1-2)² + (24-24)² + (10-23)² + (46-46)²) = sqrt(100 + 1 + 0 + 169 + 0) = sqrt(270) = 16.43
Supremum (Chebyshev)      max(|22-12|, |1-2|, |24-24|, |10-23|, |46-46|) = |10-23| = 13

To compute the supremum distance, we find the attribute f that gives the maximum difference in values between the two objects.
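The three values above can be checked with a few lines of Python (a sketch using the two objects from the question):

X = [22, 1, 24, 10, 46]
Y = [12, 2, 24, 23, 46]

manhattan = sum(abs(x - y) for x, y in zip(X, Y))
euclidean = sum((x - y) ** 2 for x, y in zip(X, Y)) ** 0.5
supremum = max(abs(x - y) for x, y in zip(X, Y))

print(manhattan, round(euclidean, 2), supremum)   # 24 16.43 13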

13. (b) OLAP operations: Let's look at some typical OLAP operations for multidimensional data. Each of the operations described below is illustrated in Figure 4.12. At the center of the figure is a data cube for AllElectronics sales. The cube contains the dimensions location, time, and item, where location is aggregated with respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item types. To aid in our explanation, we refer to this cube as the central cube. The measure displayed is dollars_sold (in thousands). (For improved readability, only some of the cube's cell values are shown.) The data examined are for the cities Chicago, New York, Toronto, and Vancouver.
Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. Figure 4.12 shows the result of a roll-up operation performed on the central cube by climbing up the concept hierarchy for location. This hierarchy was defined as the total order street < city < province_or_state < country. The roll-up operation shown aggregates the data by ascending the location hierarchy from the level of city to the level of country. In other words, rather than grouping the data by city, the resulting cube groups the data by country.
When roll-up is performed by dimension reduction, one or more dimensions are removed from the given cube. For example, consider a sales data cube containing only the location and time dimensions. Roll-up may be performed by removing, say, the time dimension, resulting in an aggregation of the total sales by location, rather than by location and by time.
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions. Figure 4.12 shows the result of a drill-down operation performed on the central cube by stepping down a concept hierarchy for time defined as day < month < quarter < year. Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month.


The resulting data cube details the total sales per month rather than
summarizing them by quarter.
Because a drill-down adds more detail to the given data, it can also be
performed by adding new dimensions to a cube. For example, a drill-down on
the central cube of Figure 4.12 can occur by introducing an additional
dimension, such as customer group.
Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. Figure 4.12 shows a slice operation where the sales data are selected from the central cube for the dimension time using the criterion time = "Q1". The dice operation defines a subcube by performing a selection on two or more dimensions. Figure 4.12 shows a dice operation on the central cube based on the following selection criteria that involve three dimensions: (location = "Toronto" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home entertainment" or "computer").
Pivot (rotate): Pivot (also called rotate) is a visualization operation that
rotates the data
axes in view to provide an alternative data presentation. Figure 4.12 shows a
pivot operation where the item and location axes in a 2-D slice are rotated.
Other examples
include rotating the axes in a 3-D cube, or transforming a 3-D cube into a
series of 2-D planes.


Figure 4.12 Examples of typical OLAP operations on multidimensional data.

14. (a) The Apriori Algorithm: The first pass of the algorithm simply counts item occurrences to determine the large 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the large itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the apriori-gen candidate generation function described below. Next, the database is scanned and the support of the candidates in Ck is counted.
The Apriori algorithm is:
L1 = {large 1-itemsets};
for ( k = 2; Lk-1 ≠ ∅; k++ ) do begin
    Ck = apriori-gen(Lk-1);        // new candidates
    forall transactions t ∈ D do begin
        Ct = subset(Ck, t);        // candidates contained in t
        forall candidates c ∈ Ct do
            c.count++;
    end
    Lk = { c ∈ Ck | c.count ≥ minsup }
end
Answer = ∪k Lk;
The apriori-gen function takes as its argument Lk-1, the set of all large (k-1)-itemsets.
TID     Items
T100    K, A, D, B
T200    D, A, C, E, B
T300    C, A, B, E
T400    B, A, D

Given that the minimum support is 60%, the minimum support count is (60/100) × 4 = 2.4, which is rounded to 2.

Scan D for each candidate to obtain C1, then compare each candidate's support count with the minimum support count (2) to obtain L1:

C1: itemset  sup. count        L1: itemset  sup. count
    A        4                     A        4
    B        4                     B        4
    C        2                     C        2
    D        3                     D        3
    E        2                     E        2
    K        1

Generate the C2 candidates from L1, scan D for their counts, and compare with the minimum support count to obtain L2:

C2: itemset  sup. count        L2: itemset  sup. count
    A,B      4                     A,B      4
    A,C      2                     A,C      2
    A,D      3                     A,D      3
    A,E      2                     A,E      2
    B,C      2                     B,C      2
    B,D      3                     B,D      3
    B,E      2                     B,E      2
    C,D      1                     C,E      2
    C,E      2
    D,E      1

Generate the C3 candidates from L2, scan D for their counts, and compare with the minimum support count to obtain L3:

C3: itemset  sup. count        L3: itemset  sup. count
    A,B,C    2                     A,B,C    2
    A,B,D    3                     A,B,D    3
    A,B,E    2                     A,B,E    2
    A,C,D    1                     A,C,E    2
    A,C,E    2                     B,C,E    2
    A,D,E    1
    B,C,D    1
    B,C,E    2
    B,D,E    1
    C,D,E    1

Generate the C4 candidates from L3, scan D for their counts, and compare with the minimum support count to obtain L4:

C4: itemset    sup. count      L4: itemset    sup. count
    A,B,C,D    1                   A,B,C,E    2
    A,B,C,E    2
    A,B,D,E    1

C5 is empty and the algorithm terminates. The largest frequent itemset is {A,B,C,E}.

Rule generation:
The frequent itemset is {A,B,C,E}. Its subsets, excluding the empty set and the itemset itself, are {A}, {B}, {C}, {E}, {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}, {A,B,C}, {A,B,E}, {A,C,E}, {B,C,E}.

Rule no   Association rule      Confidence
R1        {A} ⇒ {B,C,E}         (2/4)×100 = 50%
R2        {B} ⇒ {A,C,E}         (2/4)×100 = 50%
R3        {C} ⇒ {A,B,E}         (2/2)×100 = 100%
R4        {E} ⇒ {A,B,C}         (2/2)×100 = 100%
R5        {A,B} ⇒ {C,E}         (2/4)×100 = 50%
R6        {A,C} ⇒ {B,E}         (2/2)×100 = 100%
R7        {A,E} ⇒ {B,C}         (2/2)×100 = 100%
R8        {B,C} ⇒ {A,E}         (2/2)×100 = 100%
R9        {B,E} ⇒ {A,C}         (2/2)×100 = 100%
R10       {C,E} ⇒ {A,B}         (2/2)×100 = 100%
R11       {A,B,C} ⇒ {E}         (2/2)×100 = 100%
R12       {A,B,E} ⇒ {C}         (2/2)×100 = 100%
R13       {A,C,E} ⇒ {B}         (2/2)×100 = 100%
R14       {B,C,E} ⇒ {A}         (2/2)×100 = 100%

Given the minimum confidence threshold of 80%, the set of strong association rules is {R3, R4, R6, R7, R8, R9, R10, R11, R12, R13, R14}.
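The whole exercise can be checked with a compact Python sketch (an illustration, not the textbook algorithm verbatim) that mines the same four transactions with minimum support count 2 and minimum confidence 80%:

from itertools import combinations

transactions = [{"K", "A", "D", "B"}, {"D", "A", "C", "E", "B"},
                {"C", "A", "B", "E"}, {"B", "A", "D"}]
min_sup, min_conf = 2, 0.8

def sup(itemset):
    # support count of an itemset in the transaction database
    return sum(itemset <= t for t in transactions)

# level-wise generation of frequent itemsets
frequent, k = {}, 1
current = [frozenset([i]) for i in set().union(*transactions)]
while current:
    level = {c: sup(c) for c in current if sup(c) >= min_sup}
    frequent.update(level)
    keys = list(level)
    # join step: combine frequent k-itemsets into (k+1)-item candidates
    current = {a | b for a in keys for b in keys if len(a | b) == k + 1}
    k += 1

largest = max(frequent, key=len)
print(sorted(largest), sup(largest))              # ['A', 'B', 'C', 'E'] 2

# strong rules generated from the largest frequent itemset
for r in range(1, len(largest)):
    for lhs in map(frozenset, combinations(sorted(largest), r)):
        conf = sup(largest) / sup(lhs)
        if conf >= min_conf:
            print(sorted(lhs), "=>", sorted(largest - lhs), round(conf, 2))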

14. (b)
If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. In these rules, quantitative values for items or attributes are partitioned into intervals. The following rule can also be considered a quantitative association rule, where the quantitative attributes age and income have been discretized:

age(X, "20..29") ∧ income(X, "52K..58K") ⇒ buys(X, "iPad")

where X is a variable representing a customer.
15.
The class label attribute, buys computer, has two distinct values (namely, yes and no); therefore, there are two distinct classes (i.e., m = 2). Let class C1 correspond to yes and class C2 correspond to no. There are nine tuples of class yes and five tuples of class no. A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, we must compute the information gain of each attribute. We first compute the expected information needed to classify a tuple in D:

Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits

Next, we need to compute the expected information requirement for each attribute. Let's start with the attribute age. We need to look at the distribution of yes and no tuples for each category of age. For the age category youth, there are two yes tuples and three no tuples. For the category middle aged, there are four yes tuples and zero no tuples. For the category senior, there are three yes tuples and two no tuples. The expected information needed to classify a tuple in D if the tuples are partitioned according to age is

Info_age(D) = (5/14)(-(2/5) log2(2/5) - (3/5) log2(3/5)) + (4/14)(-(4/4) log2(4/4)) + (5/14)(-(3/5) log2(3/5) - (2/5) log2(2/5)) = 0.694 bits

Hence, the gain in information from such a partitioning would be

Gain(age) = Info(D) - Info_age(D) = 0.940 - 0.694 = 0.246 bits
Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits. Because age has the highest information gain among the attributes, it is selected as the splitting attribute. Node N is labeled with age, and branches are grown for each of the attribute's values. The tuples are then partitioned accordingly, as shown in Figure 1. Notice that the tuples falling into the partition for age = middle aged all belong to the same class. Because they all belong to class yes, a leaf should therefore be created at the end of this branch and labeled yes. The final decision tree returned by the algorithm is shown in Figure 2.


Figure 1: The attribute age has the highest information gain and
therefore becomes the splitting attribute at the root node of the
decision tree. Branches are grown for each outcome of age.
The tuples are shown partitioned accordingly.

Figure 2 : A final decision tree for the concept buys computer
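The entropy and gain figures for the age attribute can be verified with a short Python sketch (the per-partition class counts come from the text; the code itself is just an illustration):

from math import log2

def entropy(counts):
    # expected information (in bits) for the given class counts
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

info_D = entropy([9, 5])                                    # 9 yes and 5 no tuples overall
partitions = {"youth": (2, 3), "middle aged": (4, 0), "senior": (3, 2)}
info_age = sum(sum(c) / 14 * entropy(c) for c in partitions.values())

print(round(info_D, 3), round(info_age, 3), round(info_D - info_age, 3))
# 0.94 0.694 0.247 (the text's 0.246 comes from rounding the Info values first)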


16. The centroids of the given clusters are:

Cluster      Centroid
Cluster-1    C1 = ((2+2+8)/3, (10+5+4)/3) = (4, 6.3)
Cluster-2    C2 = ((5+7+6)/3, (8+5+4)/3) = (6, 5.6)
Cluster-3    C3 = ((1+4)/2, (2+9)/2) = (2.5, 5.5)

First round (using Manhattan distance):

Point      C1(4,6.3)   C2(6,5.6)   C3(2.5,5.5)   Assigned to
A1(2,10)   5.7         8.4         5             Cluster-3
A2(2,5)    3.3         4.6         1             Cluster-3
A3(8,4)    6.3         3.6         7             Cluster-2
A4(5,8)    2.7         3.4         5             Cluster-1
A5(7,5)    4.3         1.6         5             Cluster-2
A6(6,4)    4.3         1.6         5             Cluster-2
A7(1,2)    7.3         8.6         5             Cluster-3
A8(4,9)    2.7         5.4         5             Cluster-1

Re-compute the centroids:

Cluster      Centroid
Cluster-1    C1 = ((5+4)/2, (8+9)/2) = (4.5, 8.5)
Cluster-2    C2 = ((8+7+6)/3, (4+5+4)/3) = (7, 4)
Cluster-3    C3 = ((2+2+1)/3, (10+5+2)/3) = (1.7, 5.7)

Second round:

Point      C1(4.5,8.5)   C2(7,4)   C3(1.7,5.7)   Assigned to
A1(2,10)   4             11        4.6           Cluster-1
A2(2,5)    6             6         1             Cluster-3
A3(8,4)    8             1         8             Cluster-2
A4(5,8)    1             6         5.6           Cluster-1
A5(7,5)    6             1         6             Cluster-2
A6(6,4)    6             1         6             Cluster-2
A7(1,2)    10            8         4.4           Cluster-3
A8(4,9)    1             8         5.6           Cluster-1

Re-compute the cluster centroids of all the clusters and repeat the above process until the previous and present cluster centroids (and hence the cluster assignments) no longer change.
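A compact Python sketch of the same k-means iteration with Manhattan distance (an illustration; it keeps the centroids as exact fractions rather than the rounded values used in the tables above, so intermediate distances differ slightly, but the final clusters agree):

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def centroid(cluster):
    xs, ys = zip(*(points[name] for name in cluster))
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# initial clusters as in the question: {A1,A2,A3}, {A4,A5,A6}, {A7,A8}
clusters = [["A1", "A2", "A3"], ["A4", "A5", "A6"], ["A7", "A8"]]
while True:
    centroids = [centroid(c) for c in clusters]
    new_clusters = [[] for _ in centroids]
    for name, p in points.items():
        nearest = min(range(len(centroids)), key=lambda i: manhattan(p, centroids[i]))
        new_clusters[nearest].append(name)
    if new_clusters == clusters:        # assignments no longer change: stop
        break
    clusters = new_clusters

print(clusters)   # [['A1', 'A4', 'A8'], ['A3', 'A5', 'A6'], ['A2', 'A7']]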

17 (a)

Methods to evaluate clustering
Unsupervised: measure the goodness of a clustering structure without reference to external information.
- Cluster cohesion: measures how closely related the objects in a cluster are.
- Cluster separation: measures how distinct or well-separated the clusters are from one another.
- The silhouette coefficient combines the ideas of both cohesion and separation, but for individual points.
Supervised: compare the results of clustering to externally known results, i.e., measure the degree of correspondence between the cluster labels and externally provided class labels.
- Entropy of cluster i
- Precision: the fraction of a cluster i that consists of objects of a class j
- Recall: the extent to which a cluster i contains all objects of a class j
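A small illustrative sketch of the supervised measures, with hypothetical cluster assignments and class labels:

from collections import Counter
from math import log2

cluster_of = ["c1", "c1", "c1", "c2", "c2", "c2", "c2", "c2"]   # hypothetical clustering
class_of   = ["x",  "x",  "y",  "y",  "y",  "y",  "x",  "y"]    # true class labels

for c in sorted(set(cluster_of)):
    members = [class_of[i] for i, cl in enumerate(cluster_of) if cl == c]
    counts = Counter(members)
    n = len(members)
    entropy = -sum(v / n * log2(v / n) for v in counts.values())
    for j, v in counts.items():
        precision = v / n                    # fraction of cluster c made up of class j
        recall = v / class_of.count(j)       # fraction of class j captured by cluster c
        print(c, j, round(precision, 2), round(recall, 2), round(entropy, 2))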
17.(b)
For data preprocessing, it is essential to have an overall picture of your data. Data summarization techniques can be used to identify the typical properties of the data and to highlight which data values should be treated as noise or outliers. Data dispersion characteristics such as the median, max, min, quantiles, outliers, and variance will help us find noise. From the data mining point of view, it is important to examine how these measures can be computed efficiently, which leads to the notions of distributive, algebraic, and holistic measures.
Measuring the Central Tendency
Mean (algebraic measure): the arithmetic mean of n values is mean = (x1 + x2 + ... + xn) / n, where n is the sample size. A distributive measure can be computed by partitioning the data into smaller subsets (e.g., sum and count). An algebraic measure can be computed by applying an algebraic function to one or more distributive measures (e.g., mean = sum / count). Sometimes each value xi is weighted, giving the weighted arithmetic mean. A drawback is that the mean is sensitive to extreme (e.g., outlier) values.
Median (holistic measure): the middle value if there is an odd number of values, or the average of the middle two values otherwise. A holistic measure must be computed on the entire data set, and holistic measures are much more expensive to compute than distributive measures; however, the median can be estimated by interpolation for grouped data:

median ≈ L1 + ((N/2 - (Σ freq)l) / freq_median) × width

where the median interval is the interval containing the median frequency, L1 is the lower boundary of the median interval, N is the number of values in the entire data set, (Σ freq)l is the sum of the frequencies of all intervals below the median interval, and freq_median and width are the frequency and width of the median interval.
Mode: the value that occurs most frequently in the data. It is possible that several different values share the greatest frequency: unimodal, bimodal, trimodal, multimodal. If each data value occurs only once, there is no mode.
Midrange: can also be used to assess the central tendency. It is the average of the smallest and the largest values in the set. It is an algebraic measure that is easy to compute.
The degree to which data tend to spread is called the dispersion, or variance, of the data. The most common measures of data dispersion are the range, the five-number summary (based on quartiles), the inter-quartile range, and the standard deviation.
Range: the distance between the largest and the smallest values.
Kth percentile: the value xi having the property that k% of the data lies at or below xi. The median is the 50th percentile. The most popular percentiles other than the median are the quartiles Q1 (25th percentile) and Q3 (75th percentile). The quartiles together with the median give some indication of the center, spread, and shape of a distribution.
Inter-quartile range: the distance between the first and third quartiles, IQR = Q3 - Q1. A simple measure of spread that gives the range covered by the middle half of the data.
Outlier: usually, a value falling at least 1.5 × IQR above the third quartile or below the first quartile.
Five-number summary: provides, in addition, information about the endpoints (e.g., tails): min, Q1, median, Q3, max. For example, fences can be placed at Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, and the summary is represented by a boxplot.
Variance and standard deviation: the variance is an algebraic measure with a scalable computation, and the standard deviation is the square root of the variance.
Basic properties of the standard deviation:
The standard deviation measures spread about the mean and should be used only when the mean is chosen as the measure of the center. The standard deviation is 0 only when there is no spread, that is, when all observations have the same value; otherwise it is greater than 0. Variance and standard deviation are algebraic measures, so their computation is scalable in large databases.
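A brief NumPy sketch (with made-up data) of the dispersion measures described above:

import numpy as np

data = np.array([4, 7, 8, 9, 10, 12, 13, 15, 18, 42])       # made-up sample; 42 looks like an outlier

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)                    # values outside are flagged as outliers
outliers = data[(data < fences[0]) | (data > fences[1])]

print("five-number summary:", data.min(), q1, median, q3, data.max())
print("IQR:", iqr, "outliers:", outliers)
print("variance:", data.var(), "standard deviation:", data.std())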
Graphic Displays
Boxplot: graphic display of the five-number summary.
Histogram: the x-axis shows values, the y-axis shows frequencies.
Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi% of the data are at or below xi.
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
Scatter plot: each pair of values is a pair of coordinates plotted as a point in the plane.

Measure                 Formula
Mean                    mean = (x1 + x2 + ... + xn) / n
Weighted mean           mean = (w1·x1 + w2·x2 + ... + wn·xn) / (w1 + w2 + ... + wn)
Median (grouped data)   median ≈ L1 + ((N/2 - (Σ freq)l) / freq_median) × width
Variance                variance = (1/n) Σ (xi - mean)²
Standard deviation      the square root of the variance
Empirical formula       mean - mode ≈ 3 × (mean - median) (for moderately skewed data)

17. (c)
Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers; they can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' theorem is named after Thomas Bayes. There are two types of probabilities:

Posterior probability, P(H|X)

Prior probability, P(H)

where X is a data tuple and H is some hypothesis. According to Bayes' theorem,

P(H|X) = P(X|H) P(H) / P(X)
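A tiny numerical sketch of the theorem; the prior and likelihoods below are made up for illustration:

def posterior(p_h, p_x_given_h, p_x_given_not_h):
    # P(H|X) via Bayes' theorem, with P(X) expanded by the law of total probability
    p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)
    return p_x_given_h * p_h / p_x

print(round(posterior(0.3, 0.8, 0.2), 3))   # 0.632 for P(H)=0.3, P(X|H)=0.8, P(X|not H)=0.2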
Bayesian Belief Network
Bayesian Belief Networks specify joint conditional probability distributions.
They are also known as Belief Networks, Bayesian Networks, or Probabilistic
Networks.

A Belief Network allows class conditional independencies to be defined


between subsets of variables.


It provides a graphical model of causal relationship on which learning


can be performed.

We can use a trained Bayesian Network for classification.

There are two components that define a Bayesian Belief Network

Directed acyclic graph

A set of conditional probability tables

Directed Acyclic Graph

Each node in a directed acyclic graph represents a random variable.

These variables may be discrete or continuous valued.

These variables may correspond to the actual attributes given in the data.

Directed Acyclic Graph Representation


The following diagram shows a directed acyclic graph for six Boolean
variables.

The arc in the diagram allows representation of causal knowledge. For


example, lung cancer is influenced by a person's family history of lung
cancer, as well as whether or not the person is a smoker. It is worth noting
that the variable PositiveXray is independent of whether the patient has a
family history of lung cancer or that the patient is a smoker, given that we
know the patient has lung cancer.

Conditional Probability Table

Each variable in the network has a conditional probability table (CPT) that gives the conditional probability of each value of the variable for each possible combination of values of its parent nodes.
