
What Is Data Mining?

Data mining (knowledge discovery from data)
  Extraction of interesting (previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
  Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

CS490D 1

Integration of Multiple Technologies
[Figure: data mining at the intersection of machine learning, artificial intelligence, database management, statistics, algorithms, and visualization]

Knowledge Discovery in Databases: Process
[Figure: the KDD process. Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Data Mining -> Patterns -> Interpretation/Evaluation -> Knowledge]
Adapted from: U. Fayyad et al. (1995), "From Knowledge Discovery to Data Mining: An Overview," in Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Multi-Dimensional View of Data Mining
Data to be mined
  Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
  Association, classification, clustering, trend/deviation, outlier analysis, etc.
  Multiple/integrated functions and mining at multiple levels
Techniques utilized
  Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
  Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.

Why Data Mining? Potential Applications
Data analysis and decision support
  Market analysis and management
    Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  Risk analysis and management
    Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  Fraud detection and detection of unusual patterns (outliers)
Other applications
  Text mining (newsgroups, email, documents) and Web mining
  Stream data mining
  DNA and bio-data analysis

Market Analysis and Management
Where does the data come from?
  Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
  Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
  Determine customer purchasing patterns over time

Customer profiling
  What types of customers buy what products (clustering or classification)
Customer requirement analysis
  Identifying the best products for different customers
  Predicting what factors will attract new customers
Provision of summary information
  Multidimensional summary reports
  Statistical summary information (data central tendency and variation)

Fraud Detection & Mining Unusual Patterns
Approaches: clustering and model construction for frauds, outlier analysis
Applications: health care, retail, credit card services, telecommunications
  Money laundering: suspicious monetary transactions
  Telecommunications: phone-call fraud
    Phone call model: destination of the call, duration, time of day or week
    Analyze patterns that deviate from an expected norm
  Retail industry
    Analysts estimate that 38% of retail shrink is due to dishonest employees

Example: Use in Retailing
Goal: improved business efficiency
  Improve marketing (advertise to the most likely buyers)
  Inventory reduction (stock only needed quantities)
Information source: historical business data
  Example: supermarket sales records
  Size ranges from 50k records (research studies) to terabytes (years of data from chains)
  Data is already being warehoused
Sample question: what products are generally purchased together?
The answers are in the data, if only we could see them

What Can Data Mining Do?
Cluster
Classify
  Categorical, regression
Summarize
  Summary statistics, summary rules
Link analysis / model dependencies
  Association rules
Sequence analysis
  Time-series analysis, sequential associations
Detect deviations

Clustering
Find groups of similar data items
Statistical techniques require some definition of "distance" (e.g. between travel profiles), while conceptual techniques use background concepts and logical descriptions
Example: group people with similar travel profiles
  Clusters: {George, Patricia}, {Jeff, Evelyn, Chris}, {Rob}

Necessity for Data Mining
Large amounts of current and historical data are being stored
  Only a small portion (~5-10%) of collected data is analyzed
  Data that may never be analyzed is collected for fear that something that may prove important will be missed
As databases grow larger, decision-making directly from the data is not possible; we need knowledge derived from the stored data

Data Mining Complications
Volume of data
  Clever algorithms needed for reasonable performance
Knowledge discovery process skill required
  How to select a tool and prepare the data?
Data quality
  How do we interpret results in light of low-quality data?
Data source heterogeneity
  How do we combine data from multiple sources?

Major Issues in Data Mining
Mining methodology
  Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  Performance: efficiency, effectiveness, and scalability
  Pattern evaluation: the interestingness problem
  Incorporation of background knowledge
  Handling noise and incomplete data

Are All the Discovered Patterns Interesting?
Data mining may generate thousands of patterns; not all of them are interesting
  Suggested approach: human-centered, query-based, focused mining
Interestingness measures
  A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.

Can We Find All and Only Interesting Patterns?
Find all the interesting patterns: completeness
  Can a data mining system find all the interesting patterns?
  Heuristic vs. exhaustive search
  Association vs. classification vs. clustering
Search for only interesting patterns: an optimization problem
  Can a data mining system find only the interesting patterns?
  Approaches
    First generate all the patterns and then filter out the uninteresting ones
    Generate only the interesting patterns: mining query optimization

Data Mining and Business Intelligence
[Figure: pyramid of layers, listed here from top to bottom; the potential to support business decisions increases toward the top. The role drawn beside a layer is given in parentheses.]
  Making decisions (end user)
  Data presentation: visualization techniques (business analyst)
  Data mining: information discovery (data analyst)
  Data exploration: statistical analysis, querying and reporting
  Data warehouses / data marts: OLAP, MDA (DBA)
  Data sources: paper, files, information providers, database systems, OLTP

Architecture: Typical Data Mining System
[Figure: layered architecture, listed here from top to bottom; a knowledge base is drawn beside the upper layers]
  Graphical user interface
  Pattern evaluation
  Data mining engine
  Database or data warehouse server
  Data cleaning & data integration; filtering
  Databases; data warehouse

Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
  Web data, e-commerce
  Purchases at department/grocery stores
  Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
  Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
  Remote sensors on a satellite
  Telescopes scanning the skies
  Microarrays generating gene expression data
  Scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
  In classifying and segmenting data
  In hypothesis formation

Integration of Data Mining and Data Warehousing
Coupling of data mining systems, DBMS, and data warehouse systems
On-line analytical mining of data
  Integration of mining and OLAP technologies
Interactive mining of multi-level knowledge
  Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
Integration of multiple mining functions
  Characterized classification, first clustering and then association

Data Mining vs. DBMS
Example DBMS reports
  Last month's sales for each service type
  Sales per service grouped by customer sex or age bracket
  List of customers who lapsed their policy
Questions answered using data mining
  What characteristics do customers that lapse their policy have in common, and how do they differ from customers who renew their policy?
  Which motor insurance policy holders would be potential customers for my House Content Insurance policy?

Data Mining and Induction Principle
Induction vs. deduction
Deductive reasoning is truth-preserving:
  1. All horses are mammals.
  2. All mammals have lungs.
  3. Therefore, all horses have lungs.
Inductive reasoning adds information:
  1. All horses observed so far have lungs.
  2. Therefore, all horses have lungs.

DBMS, OLAP, and Data Mining

Task
  DBMS: Extraction of detailed and summary data
  OLAP: Summaries, trends and forecasts
  Data Mining: Knowledge discovery of hidden patterns
Type of result
  DBMS: Information
  OLAP: Analysis
  Data Mining: Prediction
Method
  DBMS: Deduction (ask the question, verify with data)
  OLAP: Multidimensional data modeling, aggregation, statistics
  Data Mining: Induction (build the model, apply it to new data, get the result)
Example question
  DBMS: Who purchased mutual funds in the last 3 years?
  OLAP: What is the average income of mutual fund buyers by region by year?
  Data Mining: Who will buy a mutual fund in the next 6 months and why?

Example of DBMS, OLAP and Data Mining: Weather Data

DBMS:
Day  outlook   temperature  humidity  windy  play
1    sunny     85           85        false  no
2    sunny     80           90        true   no
3    overcast  83           86        false  yes
4    rainy     70           96        false  yes
5    rainy     68           80        false  yes
6    rainy     65           70        true   no
7    overcast  64           65        true   yes
8    sunny     72           95        false  no
9    sunny     69           70        false  yes
10   rainy     75           80        false  yes
11   sunny     75           70        true   yes
12   overcast  72           90        true   yes
13   overcast  81           75        false  yes
14   rainy     71           91        true   no

By querying a DBMS containing the above table we can answer questions like:
  What was the temperature on the sunny days? {85, 80, 72, 69, 75}
  On which days was the humidity less than 75? {6, 7, 9, 11}
  On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11, 12, 13, 14}
  On which days was the temperature greater than 70 and the humidity less than 75? The intersection of the above two: {11}
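The queries above amount to simple filters over the rows. A minimal Python sketch, assuming the table is held in memory as a list of tuples (the names `WEATHER` and `days_where` are illustrative and not part of the slides):

```python
# Weather table from the slide: (day, outlook, temperature, humidity, windy, play)
WEATHER = [
    (1, "sunny", 85, 85, False, "no"),
    (2, "sunny", 80, 90, True, "no"),
    (3, "overcast", 83, 86, False, "yes"),
    (4, "rainy", 70, 96, False, "yes"),
    (5, "rainy", 68, 80, False, "yes"),
    (6, "rainy", 65, 70, True, "no"),
    (7, "overcast", 64, 65, True, "yes"),
    (8, "sunny", 72, 95, False, "no"),
    (9, "sunny", 69, 70, False, "yes"),
    (10, "rainy", 75, 80, False, "yes"),
    (11, "sunny", 75, 70, True, "yes"),
    (12, "overcast", 72, 90, True, "yes"),
    (13, "overcast", 81, 75, False, "yes"),
    (14, "rainy", 71, 91, True, "no"),
]


def days_where(predicate):
    """Return the set of day numbers whose row satisfies the predicate."""
    return {row[0] for row in WEATHER if predicate(row)}


# Temperatures on sunny days, in table order
sunny_temps = [temp for (_, outlook, temp, _, _, _) in WEATHER if outlook == "sunny"]

# Days with humidity below 75, days with temperature above 70, and both
low_humidity = days_where(lambda r: r[3] < 75)
warm = days_where(lambda r: r[2] > 70)
warm_and_dry = warm & low_humidity  # set intersection of the two answers
```

Each query only retrieves or combines facts already stored in the table, which is the deductive style the slides attribute to a DBMS.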

OLAP:
Using OLAP we can create a multidimensional model of our data (data cube). For example, using the dimensions time, outlook and play we can create the following model. Each cell gives the number of days with play = yes / play = no (Week 1 is days 1-7, Week 2 is days 8-14); the top-left cell is the overall total.

9/5      sunny  rainy  overcast
Week 1   0/2    2/1    2/0
Week 2   2/1    1/1    2/0
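The cube cells can be recomputed by counting rows per (week, outlook, play) combination. The sketch below is one way to do that with the standard library; the mapping of days 1-7 to week 1 and days 8-14 to week 2 is an assumption that happens to reproduce the figures above, and the helper names are illustrative.

```python
from collections import Counter

# (day, outlook, play) taken from the weather table
ROWS = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"), (14, "rainy", "no"),
]


def week_of(day):
    """Assumed time dimension: days 1-7 are week 1, days 8-14 are week 2."""
    return (day - 1) // 7 + 1


# Aggregate: one counter entry per (week, outlook, play) combination
cube = Counter((week_of(day), outlook, play) for day, outlook, play in ROWS)


def cell(week, outlook):
    """Return the 'yes/no' counts for one cell of the cube."""
    return f"{cube[(week, outlook, 'yes')]}/{cube[(week, outlook, 'no')]}"


def total():
    """Roll up over all weeks and outlooks."""
    yes = sum(n for (_, _, play), n in cube.items() if play == "yes")
    no = sum(n for (_, _, play), n in cube.items() if play == "no")
    return f"{yes}/{no}"
```

For example, `cell(1, "sunny")` gives `"0/2"` and `total()` gives `"9/5"`, matching the table.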


Data Mining:
Using the ID3 algorithm we can produce the following decision tree:

outlook = sunny
  humidity = high: no
  humidity = normal: yes
outlook = overcast: yes
outlook = rainy
  windy = true: no
  windy = false: yes
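ID3 chooses, at each node, the attribute with the highest information gain. The sketch below is not a full ID3 implementation; it only computes the gain of two candidate attributes on the weather table to suggest why outlook ends up at the root. The function names are illustrative.

```python
from collections import Counter
from math import log2

# (outlook, windy, play) for days 1-14 of the weather table
ROWS = [
    ("sunny", False, "no"), ("sunny", True, "no"), ("overcast", False, "yes"),
    ("rainy", False, "yes"), ("rainy", False, "yes"), ("rainy", True, "no"),
    ("overcast", True, "yes"), ("sunny", False, "no"), ("sunny", False, "yes"),
    ("rainy", False, "yes"), ("sunny", True, "yes"), ("overcast", True, "yes"),
    ("overcast", False, "yes"), ("rainy", True, "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def information_gain(rows, attr_index, label_index=2):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    base = entropy([r[label_index] for r in rows])
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[label_index] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gain_outlook = information_gain(ROWS, 0)  # about 0.247 bits
gain_windy = information_gain(ROWS, 1)    # about 0.048 bits
```

Splitting on outlook reduces uncertainty about play far more than splitting on windy, which is consistent with outlook being the first test in the tree.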

Steps of a KDD Process
Learning the application domain
  Relevant prior knowledge and goals of the application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation
  Find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining
  Summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
  Visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
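
An OLAM Architecture
[Figure: four-layer OLAM architecture, listed here from top to bottom]
  Layer 4, User Interface: user GUI API; mining queries go in, mining results come out
  Layer 3, OLAP/OLAM: OLAM engine and OLAP engine, above the data cube API
  Layer 2, MDDB: multidimensional database (MDDB) and metadata, above the database API (filtering and integration)
  Layer 1, Data Repository: databases and data warehouse, fed by data cleaning, data integration, and filtering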

Data Mining and Visualization
Approaches
  Visualization to display results of data mining
    Help the analyst better understand the results of the data mining tool
  Visualization to aid the data mining process
    Interactive control over the data exploration process
    Interactive steering of analytic approaches ("grand tour")
Interactive data mining issues
  Relationships between the analyst, the data mining tool and the visualization tool
[Figure: analyst, data mining tool, and visualized result]

Some Basic Operations
Predictive:
  Regression
  Classification
Descriptive:
  Clustering / similarity matching
  Association rules and variants
  Deviation detection

Data Warehousing & Mining 38


Classification
Given old data about customers and payments, predict a new applicant's loan eligibility.
[Figure: data on previous customers trains a classifier, which yields decision rules (e.g. Salary > 5 L, Prof. = Exec) that label a new applicant's data as good or bad]

Classification Methods
Nearest neighbor
Regression (linear or any polynomial)
  a*salary + b*age + c = eligibility score
Decision tree classifier
Probabilistic/generative models
Neural networks
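As an illustration of the first method in the list, here is a minimal nearest-neighbor sketch. The training data, the choice of k = 3, and the function name `knn_predict` are all hypothetical; the slides do not give a dataset or an implementation.

```python
from collections import Counter
from math import dist

# Hypothetical training data: (salary in lakhs, age in years) -> loan outcome
TRAINING = [
    ((2.0, 25), "bad"),
    ((3.0, 40), "bad"),
    ((2.5, 30), "bad"),
    ((8.0, 35), "good"),
    ((9.5, 45), "good"),
    ((7.0, 50), "good"),
]


def knn_predict(point, k=3):
    """Label a new applicant by majority vote among the k nearest customers."""
    # Sort previous customers by Euclidean distance to the new applicant
    nearest = sorted(TRAINING, key=lambda item: dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With this toy data, `knn_predict((8.5, 40))` returns `"good"` and `knn_predict((2.2, 28))` returns `"bad"`. The features are left unscaled here for brevity; in practice they would usually be normalized so that no single attribute dominates the distance.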

Clustering
Unsupervised learning, used when old data with class labels is not available, e.g. when introducing a new product.
Group/cluster existing customers based on the time series of their payment history, such that similar customers fall in the same cluster.
Key requirement: a good measure of similarity between instances.
Identify micro-markets and develop policies for each.
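One common way to form such groups is k-means, which the slides do not name; it is used here only as an example of a distance-based method. The sketch assumes each customer's payment history has already been summarized as two hypothetical numbers, and the starting centroids are fixed so the run is deterministic.

```python
from math import dist


def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate nearest-centroid assignment and mean update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign the point to the closest centroid (Euclidean distance)
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (average days late, average payment amount)
CUSTOMERS = [(1, 10), (2, 12), (1, 11), (20, 50), (22, 48), (21, 52)]
centroids, clusters = kmeans(CUSTOMERS, centroids=[(0, 0), (30, 60)])
```

The result depends entirely on the distance measure, which is the "key requirement" noted above; a different similarity measure would give different clusters.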

Regression
Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
Extensively studied in the statistics and neural network fields.
Examples:
  Predicting sales amounts of a new product based on advertising expenditure.
  Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  Time series prediction of stock market indices.
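For the linear case with a single predictor, the fit has a closed form. The sketch below applies ordinary least squares to made-up advertising and sales figures; the data and the name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend vs. sales (arbitrary units)
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 6.0 + intercept  # predicted sales for a spend of 6.0
```

Because the toy data lie exactly on a line, the fit recovers a slope of 2 and an intercept of 1; real data would leave residual error.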

Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
  Credit card fraud detection
  Network intrusion detection
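A very simple notion of "significant deviation" is distance from the mean measured in standard deviations. The sketch below uses that rule with an assumed threshold of 2; the transaction amounts are invented, and real fraud or intrusion detection would use richer models of normal behavior.

```python
from statistics import mean, stdev


def find_deviations(values, threshold=2.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]


# Hypothetical card transaction amounts; one is far larger than the rest
amounts = [20, 25, 22, 19, 24, 21, 23, 500]
outliers = find_deviations(amounts)
```

Here only the 500 transaction is flagged. Note that a large outlier inflates the standard deviation itself, so this rule can miss deviations in small samples.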

Mining Market
Around 20 to 30 mining tool vendors
Major players:
  WEKA
  IBM's Intelligent Miner
  SGI's MineSet
  SAS's Enterprise Miner
All pretty much the same set of tools
Many embedded products: fraud detection, electronic commerce applications

Dr. Sunita Sarawagi Data Warehousing & Mining 44


Definition: Frequent Itemset
Itemset
  A collection of one or more items
  Example: {Milk, Bread, Diaper}
k-itemset
  An itemset that contains k items
Support count (σ)
  Frequency of occurrence of an itemset
  E.g. σ({Milk, Bread, Diaper}) = 2
Support (s)
  Fraction of transactions that contain an itemset
  E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent itemset
  An itemset whose support is greater than or equal to a minsup threshold
Definition: Association Rule
Association rule
  An implication expression of the form X → Y, where X and Y are itemsets
  Example: {Milk, Diaper} → {Beer}
Rule evaluation metrics
  Support (s): fraction of transactions that contain both X and Y
  Confidence (c): measures how often items in Y appear in transactions that contain X
Example: {Milk, Diaper} → {Beer}
  $s = \frac{\sigma(\text{Milk, Diaper, Beer})}{|T|} = \frac{2}{5} = 0.4$
  $c = \frac{\sigma(\text{Milk, Diaper, Beer})}{\sigma(\text{Milk, Diaper})} = \frac{2}{3} \approx 0.67$
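The two metrics can be computed directly from a list of transactions. The transaction table behind the slide's numbers is not reproduced in this text, so the five transactions below are an assumption, chosen only because they are consistent with every count quoted on these slides; the function names are illustrative.

```python
# Assumed transactions (the slide's own table is missing from the text);
# they reproduce the support counts quoted on the slides.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


def support(itemset):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset) / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """How often the consequent appears in transactions containing the antecedent."""
    both = set(antecedent) | set(consequent)
    return support_count(both) / support_count(antecedent)
```

With these transactions, `support({"Milk", "Diaper", "Beer"})` is 0.4 and `confidence({"Milk", "Diaper"}, {"Beer"})` is 2/3, matching the worked example.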

Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having
  support ≥ minsup threshold
  confidence ≥ minconf threshold
Brute-force approach:
  List all possible association rules
  Compute the support and confidence for each rule
  Prune rules that fail the minsup and minconf thresholds
  Computationally prohibitive!
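The brute-force procedure can be written out in a few lines, which also makes its cost visible: with d distinct items there are 3^d - 2^(d+1) + 1 candidate rules. The sketch below again uses an assumed five-transaction dataset (the slide's table is not in the text), and the thresholds passed at the end are arbitrary example values.

```python
from itertools import combinations

# Assumed transactions, consistent with the counts quoted on the slides
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


def brute_force_rules(minsup, minconf):
    """Enumerate every rule X -> Y (X, Y non-empty, disjoint) and prune by thresholds."""
    items = sorted(set().union(*TRANSACTIONS))
    n = len(TRANSACTIONS)
    examined, kept = 0, []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            full = frozenset(itemset)
            full_count = support_count(full)
            # Every non-empty proper subset of the itemset is a possible left-hand side
            for k in range(1, size):
                for lhs in combinations(itemset, k):
                    examined += 1
                    x = frozenset(lhs)
                    x_count = support_count(x)
                    if x_count == 0:
                        continue  # confidence undefined; rule cannot qualify
                    s = full_count / n
                    c = full_count / x_count
                    if s >= minsup and c >= minconf:
                        kept.append((x, full - x, s, c))
    return examined, kept


examined, rules = brute_force_rules(minsup=0.4, minconf=0.6)
```

Even for the six items here, 602 candidate rules are examined, and the count grows exponentially with the number of items, which is why the slide calls this approach prohibitive.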

Mining Association Rules: Decoupling
Example of rules:
  {Milk, Diaper} → {Beer} (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper} (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk} (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper} (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer} (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
  All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
  Rules originating from the same itemset have identical support but can have different confidence
  Thus, we may decouple the support and confidence requirements
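The observation can be checked by generating every binary partition of one itemset and comparing the metrics. As before, the five transactions are an assumed dataset consistent with the figures on the slides, and the function names are illustrative.

```python
from itertools import combinations

# Assumed transactions, consistent with the counts quoted on the slides
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in TRANSACTIONS if set(itemset) <= t)


def rules_from_itemset(itemset):
    """Yield (X, Y, support, confidence) for every binary partition X -> Y."""
    full = frozenset(itemset)
    full_count = support_count(full)
    s = full_count / len(TRANSACTIONS)  # support depends only on the whole itemset
    for k in range(1, len(full)):
        for lhs in combinations(sorted(full), k):
            x = frozenset(lhs)
            yield x, full - x, s, full_count / support_count(x)


rules = list(rules_from_itemset({"Milk", "Diaper", "Beer"}))
supports = {s for _, _, s, _ in rules}
confidences = sorted(round(c, 2) for _, _, _, c in rules)
```

All six rules share the single support value 0.4 while their confidences range from 0.5 to 1.0, so support can be checked once per itemset and confidence only afterwards, per rule.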
