Académique Documents
Professionnel Documents
Culture Documents
1
The Course Book
Data Mining: A Tutorial Based Primer
by Richard J.Roiger, Michael Geatz.
• Amazon.com
•Paperback: 408 pages ; Dimensions (in inches): 0.67
x 9.14 x 7.44
•Publisher: Addison-Wesley Publishing; ; Book and
CD-ROM edition (September 26, 2002)
•ISBN: 0201741288
•List Price: $40.00
•Availability: Usually ships within 2 to 3
days
2
1.1 Data Mining: A Definition
The process of employing one or more
computer learning techniques to
automatically analyze and extract
knowledge from data.
Induction-based learning is the process
of forming generally applicable models
(or concept definitions) by observing
specific examples.
3
“Concepts”
•Definition: A “concept” is a set of objects, symbols or events grouped
together because they share certain characteristics.
Concept set, class, group, cluster, roughly
4
An Investment Dataset
Table 1.3 • Acme Investors Incorporated
5
Possible Business Questions
Table 1.3 • Acme Investors Incorporated
•Supervised Learning
–builds a learner model, or concept
definitions, using data instances of known
origin.
– and uses the model to determine the
outcome new instances of unknown origin.
•Unsupervised Learning
– A data mining method that builds models
from data without predefined classes.
–Usually for classification/clustering. 10
Supervised Learning:
A Decision Tree Example
A Decision Tree is a tree structure where non-terminal
nodes represent tests/decisions on one or more attributes
and terminal nodes reflect decision outcomes.
11
Table 1.1 • Hypothetical Training Data for Disease Diagnosis
• Consider each of the attributes in turn, to see which would be a “good” one to
start our Decision Tree with.
• Is there a perfect 1-1 relationship between any of the input variables and the
ourput variable:
• Sore Throat, Fever don’t seem “very good”.
• However,
{Swollen Glands = Yes} corresponds 1-1 with {Diagnosis = Strep throat}
i.e. If {Swollen Glands = Yes} then {Diagnosis = Strep throat}
• Hence we use “Swollen Glands” for our first Dicision Node.
• Etc… we get… 12
First
Test/Decision Swollen
Node Glands
No Yes
Fever
Terminal
Decision Node
No Yes
13
Notes on this Decision Tree:
14
Use of the Decision Tree for Prediction
We may now use the Decision Tree for future
diagnoses, (or prediction of diagnosis). Consider
the following symptomatic data:
Table 1.2 • Data Instances with an Unknown Classification
15
Production Rules
We may summarize the Decision Tree by listing
the decisions along each path from the starting
node to each terminal node.
16
Unsupervised Clustering
A data mining method that builds models from data without
predefined output classes.
Table 1.3 • Acme Investors Incorporated
• Use data query if you already almost know what you are
looking for, and you wish to work with large databases.
• Use OLAP if you wish to discover simple associations in
large databases.
• Use data mining to find patterns and relationships in data
that are not obvious.
Because of the relative slowness of datamining algorithms
this often means that the database has to be small, or
sampled. Devising Data Mining algorithms which scale to
large databases is a current research topic in Data Mining.
19
Data Mining Applications
• Data mining is a young discipline with wide and
diverse applications
– There is still a nontrivial gap between general
principles of data mining and domain-specific, effective
data mining tools for particular applications
• Some application domains
– Biomedical and DNA data analysis
– Financial data analysis
– Retail industry
– Telecommunication industry
20
Biomedical Data Mining and
DNA Analysis
• DNA sequences: 4 basic building blocks (nucleotides): adenine
(A), cytosine (C), guanine (G), and thymine (T).
• Gene: a sequence of hundreds of individual nucleotides arranged
in a particular order
• Humans have around 100,000 genes
• Tremendous number of ways that the nucleotides can be ordered
and sequenced to form distinct genes
• Semantic integration of heterogeneous, distributed genome
databases
– Current: highly distributed, uncontrolled generation and use
of a wide variety of DNA data
– Data cleaning and data integration methods developed in data
mining will help 21
DNA Analysis: Examples
• Similarity search and comparison among DNA sequences
– Compare the frequently occurring patterns of each class (e.g., diseased and
healthy)
– Identify gene sequence patterns that play roles in various diseases
• Association analysis: identification of co-occurring gene
sequences
– Most diseases are not triggered by a single gene but by a combination of
genes acting together
– Association analysis may help determine the kinds of genes that are likely
to co-occur together in target samples
• Path analysis: linking genes to different disease development
stages
– Different genes may become active at different stages of the disease
– Develop pharmaceutical interventions that target the different stages
separately
• Visualization tools and genetic data analysis 22
Data Mining for Financial Data Analysis
28