Alternative names
CS490D 1
Integration of Multiple Technologies
[Figure: Data Mining at the center, drawing on Machine Learning, Artificial Intelligence, Database Management, Statistics, Algorithms, and Visualization]
Knowledge Discovery in Databases: Process
[Figure: Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Patterns -> Interpretation/Evaluation]
adapted from:
U. Fayyad, et al. (1995), From Knowledge Discovery to Data
Mining: An Overview, Advances in Knowledge Discovery and
Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
Knowledge to be mined
Association, classification, clustering, trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.
Why Data Mining?
Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting, quality control, competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
Text mining (newsgroups, email, documents) and Web mining
Stream data mining
DNA and bio-data analysis
Market Analysis and Management
Where does the data come from?
Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
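One way to picture "finding clusters of customers who share the same characteristics" is a small k-means run. This is only a sketch: the slides do not name a clustering algorithm, so k-means is an assumed choice, and the customer records, the `kmeans` helper, and the starting centroids are all hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) on a list of numeric tuples."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return assignment, centroids

# Hypothetical customers: (annual income in $1000s, monthly spend in $100s).
customers = [(30, 2), (32, 3), (35, 2), (90, 9), (95, 10), (88, 8)]
labels, centers = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels)   # cluster index per customer
print(centers)  # one centroid per cluster
```

On real data the features would normally be rescaled first, since income would otherwise dominate the distance.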
Customer profiling
What types of customers buy what products (clustering or classification)
Retail industry
Analysts estimate that 38% of retail shrink is due to dishonest employees
Example: Use in retailing
Goal: Improved business efficiency
Improve marketing (advertise to the most likely buyers)
Size ranges from 50k records (research studies) to terabytes (years of data from chains)
Necessity for Data Mining
Large amounts of current and historical data are being stored
Only a small portion (~5-10%) of collected data is analyzed
Data Quality
How do we interpret results in light of low quality data?
Major Issues in Data Mining
Mining methodology
Are All the Discovered Patterns Interesting?
Data mining may generate thousands of patterns; not all of them are interesting
Suggested approach: human-centered, query-based, focused mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
Steps of a KDD Process
Learning the application domain
relevant prior knowledge and goals of application
Data Exploration
Statistical Analysis, Querying and Reporting
Pattern evaluation
[Figure: Databases, Data Warehouse]
Why Mine Data? Commercial Viewpoint
Data Mining vs. DBMS
Induction vs Deduction
DBMS:
Day  outlook   temperature  humidity  windy  play
1    sunny     85           85        false  no
2    sunny     80           90        true   no
3    overcast  83           86        false  yes
4    rainy     70           96        false  yes
5    rainy     68           80        false  yes
6    rainy     65           70        true   no
7    overcast  64           65        true   yes
8    sunny     72           95        false  no
9    sunny     69           70        false  yes
10   rainy     75           80        false  yes
11   sunny     75           70        true   yes
12   overcast  72           90        true   yes
13   overcast  81           75        false  yes
14   rainy     71           91        true   no
Example of DBMS, OLAP and Data Mining: Weather Data
outlook = sunny
    humidity = high: no
    humidity = normal: yes
outlook = overcast: yes
outlook = rainy
    windy = true: no
    windy = false: yes
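The contrast between deduction and induction can be sketched on the weather data. A DBMS-style query returns facts already stored in the table, while the decision tree is a rule generalized from those facts that can be applied to days not yet seen. The function names are hypothetical, and because the table stores humidity as a number while the tree speaks of "high" and "normal", the cutoff of 75 is an assumption made for this sketch.

```python
# Weather data: (day, outlook, temperature, humidity, windy, play).
WEATHER = [
    (1, "sunny", 85, 85, False, "no"),
    (2, "sunny", 80, 90, True, "no"),
    (3, "overcast", 83, 86, False, "yes"),
    (4, "rainy", 70, 96, False, "yes"),
    (5, "rainy", 68, 80, False, "yes"),
    (6, "rainy", 65, 70, True, "no"),
    (7, "overcast", 64, 65, True, "yes"),
    (8, "sunny", 72, 95, False, "no"),
    (9, "sunny", 69, 70, False, "yes"),
    (10, "rainy", 75, 80, False, "yes"),
    (11, "sunny", 75, 70, True, "yes"),
    (12, "overcast", 72, 90, True, "yes"),
    (13, "overcast", 81, 75, False, "yes"),
    (14, "rainy", 71, 91, True, "no"),
]

def sunny_days(rows):
    """Deduction (DBMS-style query): return stored facts matching a condition."""
    return [row[0] for row in rows if row[1] == "sunny"]

def predict_play(outlook, humidity, windy, high_humidity=75):
    """Induction: apply the decision tree to any day, seen or unseen.

    high_humidity is an assumed cutoff: humidity above it counts as "high".
    """
    if outlook == "sunny":
        return "no" if humidity > high_humidity else "yes"
    if outlook == "overcast":
        return "yes"
    return "no" if windy else "yes"  # outlook == "rainy"

# Count how many stored days the tree classifies correctly.
correct = sum(predict_play(o, h, w) == play for _, o, _, h, w, play in WEATHER)
print(sunny_days(WEATHER))
print(f"{correct} of {len(WEATHER)} days classified correctly")
```

With the assumed cutoff the tree reproduces the `play` column for all 14 stored days, but that says nothing about how well it would generalize to new data.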
Network Intrusion Detection
Mining market
Around 20 to 30 mining tool vendors
Major players:
WEKA,
IBM's Intelligent Miner,
SGI's MineSet,
SAS's Enterprise Miner.
All offer pretty much the same set of tools
Many embedded products: fraud detection, electronic commerce applications
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
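The brute-force steps above can be sketched in a few lines. For a rule X -> Y, support is the fraction of transactions containing every item of X and Y, and confidence is that support divided by the support of X. The transactions, the `brute_force_rules` helper, and the threshold values below are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def brute_force_rules(transactions, minsup, minconf):
    """List every rule X -> Y, score it, and prune by minsup and minconf."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in itemset.
        return sum(itemset <= t for t in transactions) / n

    rules, candidates = [], 0
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            sup = support(itemset)
            # Split the itemset into every non-empty antecedent/consequent pair.
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    candidates += 1
                    lhs_sup = support(frozenset(lhs))
                    if lhs_sup == 0:
                        continue  # confidence is undefined for this rule
                    conf = sup / lhs_sup
                    if sup >= minsup and conf >= minconf:
                        rhs = sorted(itemset - set(lhs))
                        rules.append((list(lhs), rhs, sup, conf))
    return rules, candidates

# Hypothetical market-basket transactions.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
rules, candidates = brute_force_rules(baskets, minsup=0.4, minconf=0.6)
print(f"{candidates} candidate rules examined, {len(rules)} kept")
for lhs, rhs, sup, conf in rules:
    print(f"{lhs} -> {rhs}  support={sup:.2f}  confidence={conf:.2f}")
```

Even with only six distinct items the loop examines 602 candidate rules (3^6 - 2^7 + 1), and each one triggers a full scan of the transactions. The count grows exponentially with the number of items, which is why the approach is computationally prohibitive.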
Mining Association Rules: Decoupling
Example of Rules: