Data Mining Third Lecture Updated 16 April New

What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
CS490D 1
Integration of Multiple Technologies
Figure: Data mining at the center, drawing on machine learning, artificial intelligence, database management, statistics, algorithms, and visualization.
Knowledge Discovery in Databases: Process
Figure: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge
Adapted from: U. Fayyad et al. (1995), "From Knowledge Discovery to Data Mining: An Overview," in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
Knowledge to be mined
Association, classification, clustering, trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.
Why Data Mining?
Potential Applications
Data analysis and decision support
Market analysis and management
Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting, quality control, competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
Text mining (news group, email, documents) and Web mining
Stream data mining
DNA and bio-data analysis
Market Analysis and Management
Where does the data come from?
Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Customer profiling
What types of customers buy what products (clustering or classification)
Customer requirement analysis
Identifying the best products for different customers
Predicting what factors will attract new customers
Provision of summary information
Multidimensional summary reports
Statistical summary information (data central tendency and variation)
Fraud Detection & Mining Unusual Patterns
Approaches: Clustering & model construction for frauds, outlier analysis
Applications: Health care, retail, credit card service, telecomm.
Money laundering: suspicious monetary transactions
Telecommunications: Phone-call fraud
Phone call model: destination of the call, duration, time of day or week.
Analyze patterns that deviate from an expected norm
Retail industry
Analysts estimate that 38% of retail shrink is due to dishonest employees
Example: Use in retailing
Goal: Improved business efficiency
Improve marketing (advertise to the most likely buyers)
Inventory reduction (stock only needed quantities)
Information source: Historical business data
Example: Supermarket sales records
Size ranges from 50k records (research studies) to terabytes (years of data from chains)
Data is already being warehoused
Sample question: What products are generally purchased together?
The answers are in the data, if only we could see them
What Can Data Mining Do?
Cluster
Classify
Categorical, Regression
Summarize
Summary statistics, Summary rules
Link Analysis / Model Dependencies
Association rules
Sequence analysis
Time-series analysis, Sequential associations
Detect Deviations
Clustering
Find groups of similar data items
Statistical techniques require some definition of distance (e.g. between travel profiles), while conceptual techniques use background concepts and logical descriptions
Example: Group people with similar travel profiles
Clusters: {George, Patricia}, {Jeff, Evelyn, Chris}, {Rob}
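A minimal sketch of distance-based grouping, to make the "definition of distance" requirement concrete. The numeric travel profiles (trips per year, average trip length in days), the Euclidean distance, and the threshold are all assumptions made up for illustration; only the names come from the slide.

```python
import math

# Hypothetical travel profiles: (trips per year, average trip length in days).
# The numbers are invented for illustration; only the names come from the slide.
profiles = {
    "George": (2, 14), "Patricia": (3, 12),
    "Jeff": (20, 2), "Evelyn": (22, 3), "Chris": (19, 2),
    "Rob": (10, 30),
}

def distance(a, b):
    """Euclidean distance between two profiles (one possible choice of distance)."""
    return math.dist(a, b)

def cluster(points, threshold):
    """Single-linkage grouping: two people end up in the same cluster if a
    chain of neighbours closer than `threshold` connects them."""
    clusters = []
    for name, p in points.items():
        # Every existing cluster that contains a sufficiently close member.
        near = [c for c in clusters
                if any(distance(p, points[m]) < threshold for m in c)]
        merged = {name}.union(*near)
        clusters = [c for c in clusters if c not in near] + [merged]
    return clusters

groups = cluster(profiles, threshold=5.0)
for g in groups:
    print(sorted(g))
```

Changing the distance function or the threshold changes the clusters, which is why the choice of distance is the key modeling decision here.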
Necessity for Data Mining
Large amounts of current and historical data being stored
Only a small portion (~5-10%) of collected data is analyzed.
Data that may never be analyzed is collected for fear that something that may prove important will be missed.
As databases grow larger, decision-making directly from the data is not possible; we need knowledge derived from the stored data
Data Mining Complications
Volume of Data
Clever algorithms needed for reasonable performance
Knowledge Discovery Process skill required

How to select tool, prepare data?
Data Quality
How do we interpret results in light of low quality data?
Data Source Heterogeneity

How do we combine data from multiple sources?
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web.
Performance: efficiency, effectiveness, and scalability.
Pattern evaluation: the interestingness problem.
Incorporation of background knowledge.
Handling noise and incomplete data.
Are All the Discovered Patterns Interesting?
Data mining may generate thousands of patterns; not all of them are interesting
Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
Can We Find All and Only Interesting Patterns?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns?
Heuristic vs. exhaustive search
Association vs. classification vs. clustering
Search for only interesting patterns: An optimization problem
Can a data mining system find only the interesting patterns?
Approaches
First generate all the patterns and then filter out the uninteresting ones.
Generate only the interesting patterns: mining query optimization
Data Mining and Business Intelligence
Figure: pyramid of layers, with increasing potential to support business decisions toward the top. From top to bottom:
Making Decisions (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA (DBA)
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Architecture: Typical Data Mining System
Figure: layered architecture, from top to bottom:
Graphical user interface
Pattern evaluation
Data mining engine
Database or data warehouse server
Data cleaning & data integration, filtering
Databases, data warehouse
Knowledge base (a separate component beside the stack)
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
Web data, e-commerce
Purchases at department/grocery stores
Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
Remote sensors on a satellite
Telescopes scanning the skies
Microarrays generating gene expression data
Scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
in classifying and segmenting data
in hypothesis formation
Integration of Data Mining and Data Warehousing
Coupling of data mining systems, DBMS, and data warehouse systems
On-line analytical mining of data
Integration of mining and OLAP technologies.
Interactive mining of multi-level knowledge
Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions
Characterized classification, first clustering and then association
Data Mining vs. DBMS
Example DBMS Reports
Last month's sales for each service type
Sales per service grouped by customer sex or age bracket
List of customers who lapsed their policy
Questions answered using Data Mining
What characteristics do customers that lapse their policy have in common and how do they differ from customers who renew their policy?
Which motor insurance policy holders would be potential customers for my House Content Insurance policy?
Data Mining and Induction Principle
Induction vs Deduction
Deductive reasoning is truth-preserving:

1. All horses are mammals
2. All mammals have lungs
3. Therefore, all horses have lungs
Inductive reasoning adds information:

1. All horses observed so far have lungs.
2. Therefore, all horses have lungs.
DBMS, OLAP, and Data Mining
Task
DBMS: Extraction of detailed and summary data
OLAP: Summaries, trends and forecasts
Data Mining: Knowledge discovery of hidden patterns
Type of result
DBMS: Information
OLAP: Analysis
Data Mining: Prediction
Method
DBMS: Deduction (ask the question, verify with data)
OLAP: Multidimensional data modeling, aggregation, statistics
Data Mining: Induction (build the model, apply it to new data, get the result)
Example question
DBMS: Who purchased mutual funds in the last 3 years?
OLAP: What is the average income of mutual fund buyers by region by year?
Data Mining: Who will buy a mutual fund in the next 6 months and why?
Example of DBMS, OLAP and Data Mining: Weather Data
DBMS:
Day  outlook   temperature  humidity  windy  play
1    sunny     85           85        false  no
2    sunny     80           90        true   no
3    overcast  83           86        false  yes
4    rainy     70           96        false  yes
6    rainy     65           70        true   no
7    overcast  64           65        true   yes
8    sunny     72           95        false  no
9    sunny     69           70        false  yes
11   sunny     75           70        true   yes
12   overcast  72           90        true   yes
13   overcast  81           75        false  yes
14   rainy     71           91        true   no
By querying a DBMS containing the above table we may answer questions like:
What was the temperature on the sunny days? {85, 80, 72, 69, 75}
On which days was the humidity less than 75? {6, 7, 9, 11}
On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11, 12, 13, 14}
On which days was the temperature greater than 70 and the humidity less than 75? The intersection of the above two: {11}
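The four questions above can be sketched as SQL queries against an in-memory SQLite database. This is only an illustration: the table name and column names are assumptions, and the rows for days 5 and 10 (which are not listed in the table above, although day 10 appears in one of the answers) are taken from the commonly circulated 14-row version of this weather dataset.

```python
import sqlite3

# (day, outlook, temperature, humidity, windy, play)
# ASSUMPTION: the rows for days 5 and 10 are not in the table above; they are
# taken from the commonly circulated 14-row version of this dataset.
rows = [
    (1, "sunny", 85, 85, 0, "no"),      (2, "sunny", 80, 90, 1, "no"),
    (3, "overcast", 83, 86, 0, "yes"),  (4, "rainy", 70, 96, 0, "yes"),
    (5, "rainy", 68, 80, 0, "yes"),     (6, "rainy", 65, 70, 1, "no"),
    (7, "overcast", 64, 65, 1, "yes"),  (8, "sunny", 72, 95, 0, "no"),
    (9, "sunny", 69, 70, 0, "yes"),     (10, "rainy", 75, 80, 0, "yes"),
    (11, "sunny", 75, 70, 1, "yes"),    (12, "overcast", 72, 90, 1, "yes"),
    (13, "overcast", 81, 75, 0, "yes"), (14, "rainy", 71, 91, 1, "no"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather (day INTEGER, outlook TEXT, temperature INTEGER, "
    "humidity INTEGER, windy INTEGER, play TEXT)"
)
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?)", rows)

def column(sql):
    """Run a query and return its first column as a list."""
    return [r[0] for r in conn.execute(sql)]

sunny_temps = column(
    "SELECT temperature FROM weather WHERE outlook = 'sunny' ORDER BY day")
low_humidity = column(
    "SELECT day FROM weather WHERE humidity < 75 ORDER BY day")
warm = column(
    "SELECT day FROM weather WHERE temperature > 70 ORDER BY day")
warm_and_dry = column(
    "SELECT day FROM weather WHERE temperature > 70 AND humidity < 75 ORDER BY day")

print(sunny_temps, low_humidity, warm, warm_and_dry)
```

Each query only retrieves facts already stored in the table, which is the deductive style of question a DBMS answers.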
OLAP:
Using OLAP we can create a Multidimensional Model of our data (Data Cube).
For example, using the dimensions time, outlook and play we can create the following model. Each cell gives the number of days with play = yes / play = no; the corner cell 9/5 is the overall total.
9/5     sunny  rainy  overcast
Week 1  0/2    2/1    2/0
Week 2  2/1    1/1    2/0
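A rough sketch of how such a cube can be computed by rolling days up to weeks and counting play outcomes per (week, outlook) cell. The week boundaries (days 1-7 and 8-14) and the rows for days 5 and 10 are assumptions; those two rows are not in the DBMS table above and are taken from the commonly used 14-row version of the dataset.

```python
from collections import defaultdict

# (day, outlook, play)
# ASSUMPTION: days 5 and 10 are not listed in the table above; their values
# are taken from the commonly used 14-row version of the dataset.
days = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"), (14, "rainy", "no"),
]

# cube[(week, outlook)] = [yes_count, no_count]
cube = defaultdict(lambda: [0, 0])
for day, outlook, play in days:
    week = (day - 1) // 7 + 1            # roll days up to weeks (assumed 7-day weeks)
    cube[(week, outlook)][0 if play == "yes" else 1] += 1

def cell(week, outlook):
    """Format one cube cell as 'yes/no' counts."""
    yes, no = cube[(week, outlook)]
    return f"{yes}/{no}"

# Aggregating over both week and outlook gives the grand total.
total_yes = sum(c[0] for c in cube.values())
total_no = sum(c[1] for c in cube.values())

print(f"{total_yes}/{total_no}", "sunny", "rainy", "overcast")
for week in (1, 2):
    print(f"Week {week}", *(cell(week, o) for o in ("sunny", "rainy", "overcast")))
```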

Data Mining:
Using the ID3 algorithm we can produce the following decision tree:
outlook = sunny
    humidity = high: no
    humidity = normal: yes
outlook = overcast: yes
outlook = rainy
    windy = true: no
    windy = false: yes
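The tree above can be written as a small function and checked against the rows listed in the weather table. This is a sketch, not an ID3 implementation: it only applies the finished tree. The tree uses nominal humidity (high/normal) while the table has numbers, so the cutoff of 80 used here is an assumption.

```python
def humidity_level(humidity, cutoff=80):
    """Discretize numeric humidity. The cutoff is an assumed value,
    not something stated in the slides."""
    return "high" if humidity > cutoff else "normal"

def predict(outlook, humidity, windy):
    """Apply the decision tree shown above to one day."""
    if outlook == "sunny":
        return "no" if humidity_level(humidity) == "high" else "yes"
    if outlook == "overcast":
        return "yes"
    # Remaining branch: outlook = rainy
    return "no" if windy else "yes"

# (day, outlook, humidity, windy, play): the 12 rows listed in the table above.
rows = [
    (1, "sunny", 85, False, "no"),      (2, "sunny", 90, True, "no"),
    (3, "overcast", 86, False, "yes"),  (4, "rainy", 96, False, "yes"),
    (6, "rainy", 70, True, "no"),       (7, "overcast", 65, True, "yes"),
    (8, "sunny", 95, False, "no"),      (9, "sunny", 70, False, "yes"),
    (11, "sunny", 70, True, "yes"),     (12, "overcast", 90, True, "yes"),
    (13, "overcast", 75, False, "yes"), (14, "rainy", 91, True, "no"),
]

# Days on which the tree disagrees with the recorded outcome.
mistakes = [day for day, outlook, humidity, windy, play in rows
            if predict(outlook, humidity, windy) != play]
print("misclassified days:", mistakes)
```

Unlike the DBMS and OLAP views, the function can also be applied to a day that is not in the table, for example `predict("rainy", 60, True)`.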
Steps of a KDD Process
Learning the application domain: relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining: summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
An OLAM Architecture
Figure: four-layer architecture, from top to bottom:
Layer 4, User Interface: User GUI API (accepts mining queries, returns mining results)
Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine
Data Cube API
Layer 2, MDDB: multidimensional database (MDDB) and Meta Data
Database API (filtering & integration)
Layer 1, Data Repository: Databases and Data Warehouse (data cleaning, data integration, filtering)
Data Mining and Visualization
Approaches
Visualization to display results of data mining
Help analyst to better understand the results of the data mining tool
Visualization to aid the data mining process
Interactive control over the data exploration process
Interactive steering of analytic approaches (grand tour)
Interactive data mining issues
Relationships between the analyst, the data mining tool and the visualization tool
Figure: Analyst, Data Mining Tool, and Visualized result
Some basic operations
Predictive:
Regression
Classification
Descriptive:
Clustering / similarity matching
Association rules and variants
Deviation detection
Data Warehousing & Mining 38

Classification
Given old data about customers and payments, predict a new applicant's loan eligibility.
Figure: Previous customers → Classifier → Decision rules (e.g. Salary > 5 L, Prof. = Exec) → Good/bad label for new applicants' data

Classification methods
Nearest neighbor
Regression: (linear or any polynomial)
a*salary + b*age + c = eligibility score.
Decision tree classifier
Probabilistic/generative models
Neural networks
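Two of the listed methods can be sketched in a few lines: a nearest-neighbor classifier and the linear eligibility score a*salary + b*age + c. The training records, the feature choice (salary in lakhs, age), and the coefficient values are all hypothetical and chosen only to make the example run.

```python
import math

# Hypothetical previous customers: ((salary in lakhs, age), label).
previous_customers = [
    ((8.0, 45), "good"), ((6.5, 38), "good"), ((7.2, 50), "good"),
    ((2.0, 23), "bad"),  ((3.1, 30), "bad"),  ((1.5, 41), "bad"),
]

def nearest_neighbor(applicant, training):
    """Return the label of the closest training record (1-NN, Euclidean
    distance on unscaled features; a real system would normalize them)."""
    _features, label = min(training, key=lambda row: math.dist(row[0], applicant))
    return label

def linear_score(salary, age, a=1.0, b=0.05, c=-6.0):
    """Eligibility score a*salary + b*age + c. The coefficients are made up;
    in practice they would be fitted to the old data."""
    return a * salary + b * age + c

for applicant in [(7.0, 40), (2.5, 28)]:
    label = nearest_neighbor(applicant, previous_customers)
    score = linear_score(*applicant)
    print(applicant, label, round(score, 2))
```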

Clustering
Unsupervised learning when old data with class labels is not available, e.g. when introducing a new product.
Group/cluster existing customers based on time series of payment history such that similar customers are in the same cluster.
Key requirement: Need a good measure of similarity between instances.
Identify micro-markets and develop policies for each

Regression
Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Extensively studied in statistics and neural network fields.
Examples:
Predicting sales amounts of a new product based on advertising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
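For the linear case, a minimal sketch of fitting a line by ordinary least squares and using it to predict. The advertising and sales figures are invented so that the fit is easy to check by hand (they lie exactly on sales = 2 * advertising + 1).

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend and resulting sales (same units).
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 6.0 + intercept      # predicted sales for a spend of 6.0
print(slope, intercept, predicted)
```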
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
Credit Card Fraud Detection
Network Intrusion Detection
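One simple way to make "significant deviation" concrete is a z-score test: flag values that lie more than a chosen number of standard deviations from the mean. This is only one of many possible approaches; the transaction amounts and the cutoff of 2.0 are assumptions for illustration.

```python
import statistics

def find_outliers(values, z_cutoff=2.0):
    """Return the values whose z-score exceeds z_cutoff (an assumed threshold)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)    # population standard deviation
    if stdev == 0:
        return []                        # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / stdev > z_cutoff]

# Hypothetical credit card transaction amounts for one customer.
amounts = [42, 38, 45, 40, 41, 39, 44, 43, 40, 950]
outliers = find_outliers(amounts)
print(outliers)
```

A single extreme value inflates both the mean and the standard deviation, so this test is crude; it is shown here only to illustrate the idea of deviation from an expected norm.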
Mining market
Around 20 to 30 mining tool vendors
Major players:
WEKA
IBM's Intelligent Miner
SGI's MineSet
SAS's Enterprise Miner
All pretty much the same set of tools
Many embedded products: fraud detection, electronic commerce applications
Dr. Sunita Sarawagi Data Warehousing & Mining 44

Definition: Frequent Itemset
Itemset
A collection of one or more items
Example: {Milk, Bread, Diaper}
k-itemset
An itemset that contains k items
Support count (σ)
Frequency of occurrence of an itemset
E.g. σ({Milk, Bread, Diaper}) = 2
Support
Fraction of transactions that contain an itemset
E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold
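The definitions above translate directly into code. The transaction table itself is not reproduced in the text, so the five transactions below are an assumed example, chosen to be consistent with the counts quoted above (support count 2, support 2/5). The function names are illustrative.

```python
# ASSUMPTION: the transaction table is not shown in the text. These five
# transactions are a hypothetical set consistent with the quoted counts.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent if its support is at least minsup."""
    return support(itemset, transactions) >= minsup

itemset = {"Milk", "Bread", "Diaper"}
print(support_count(itemset, transactions), support(itemset, transactions))
```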
Definition: Association Rule
Association Rule
An implication expression of the form X → Y, where X and Y are itemsets
Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
Support (s)
Fraction of transactions that contain both X and Y
Confidence (c)
Measures how often items in Y appear in transactions that contain X
Example: {Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
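The brute-force approach can be sketched as follows, mainly to show how quickly the number of candidate rules grows. The five transactions are an assumed example (the transaction table is not reproduced in the text), and the thresholds are arbitrary. Even with only six distinct items, the loop evaluates several hundred candidate rules.

```python
from itertools import combinations

# ASSUMPTION: hypothetical transactions; the original table is not in the text.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Support count: transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def brute_force_rules(minsup, minconf):
    """Enumerate every rule X -> Y over all items, then prune by thresholds."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    candidates = 0
    kept = []
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            both = count(itemset)
            # Every non-empty proper subset X gives one rule X -> itemset - X.
            for k in range(1, size):
                for lhs in combinations(combo, k):
                    candidates += 1
                    if both == 0:
                        continue         # rule never occurs; fails any minsup > 0
                    x = frozenset(lhs)
                    s = both / n
                    c = both / count(x)
                    if s >= minsup and c >= minconf:
                        kept.append((x, itemset - x, s, c))
    return candidates, kept

candidates, rules = brute_force_rules(minsup=0.4, minconf=0.6)
print(candidates, "candidate rules,", len(rules), "kept")
```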
Mining Association Rules: Decoupling
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but can have different confidence
Thus, we may decouple the support and confidence requirements
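The observation can be checked with a short sketch: the support of the itemset is computed once, and only the confidence varies across its binary partitions. The transactions are an assumed example (the original table is not in the text), chosen so that the support and confidence values match the rules listed above.

```python
from itertools import combinations

# ASSUMPTION: hypothetical transactions consistent with the figures above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Support count: transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from(itemset):
    """Yield (X, Y, support, confidence) for every binary partition of an
    itemset. Assumes the itemset occurs in at least one transaction."""
    itemset = frozenset(itemset)
    both = count(itemset)
    support = both / len(transactions)   # computed once, shared by all rules
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            x = frozenset(lhs)
            yield x, itemset - x, support, both / count(x)

# Map each left-hand side to (support, confidence rounded to 2 places).
table = {tuple(sorted(x)): (s, round(c, 2))
         for x, _y, s, c in rules_from({"Milk", "Diaper", "Beer"})}
for lhs, (s, c) in table.items():
    print(lhs, s, c)
```

This is the reason frequent itemsets can be found first using support alone, with confidence checked afterwards on the rules each frequent itemset generates.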
