
DATA WAREHOUSING AND DATA MINING

Introduction
Data Warehousing, OLAP (On-Line Analytical Processing) and data mining: what and why (now)?

Relation to OLTP (On-Line Transaction Processing)
Dr. Pramod S.Nair, Medi-Caps
Group of Institutions, Indore

A producer wants to know...

Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
What is the most effective distribution channel?
What product promotions have the biggest impact on revenue?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?

Data, Data everywhere, yet ...

I can't find the data I need
  data is scattered over the network
  many versions, subtle differences
I can't get the data I need
  need an expert to get the data
I can't understand the data I found
  available data poorly documented
I can't use the data I found
  results are unexpected
  data needs to be transformed from one form to another

What is a Data Warehouse?


A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.
[Barry Devlin]

What are the users saying...


Data should be integrated across the enterprise
Summary data has a real value to the organization
Historical data holds the key to understanding data over time
What-if capabilities are required

What is Data Warehousing?


A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Forrester Research, April 1996]

[Figure: Data transformed into Information]

Evolution
60s: Batch reports
  hard to find and analyze information
  inflexible and expensive, reprogram for every new request
70s: Terminal-based DSS (Decision Support System) and EIS (Executive Information Systems)
  still inflexible, not integrated with desktop tools
80s: Desktop data access and analysis tools
  query tools, spreadsheets, GUIs
  easier to use, but only access operational databases
90s: Data warehousing with integrated OLAP engines and tools

Very Large Data Bases

Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Petabytes -- 10^15 bytes: Geographic Information Systems
Exabytes -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: Weather images
Yottabytes -- 10^24 bytes: Intelligence Agency Videos

Data Warehousing - It is a process

Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible

A decision support database maintained separately from the organization's operational database

Data Warehouse
A data warehouse is a
subject-oriented
integrated
time-varying
non-volatile

collection of data that is used primarily in organizational decision making.
-- Bill Inmon, Building the Data Warehouse 1996

Explorers, Farmers and Tourists


Tourists: Browse information harvested by farmers
Farmers: Harvest information from known access paths
Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data

Data Warehouse Architecture


[Figure: architecture diagram. Sources: relational databases, ERP systems, purchased data, legacy data. Components: extraction and cleansing, optimized loader, data warehouse engine, analyze/query, metadata repository.]

Data Warehouse for Decision Support & OLAP

Putting information technology to help the knowledge worker make faster and better decisions
Which of my customers are most likely to go to the competition?
What product promotions have the biggest impact on revenue?
How did the share price of software companies correlate with profits over the last 10 years?

Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used by managers and end-users to understand the business and make judgments

Data Mining works with Warehouse Data

Data Warehousing provides the Enterprise with a memory
Data Mining provides the Enterprise with intelligence

We want to know ...

Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
If I raise the price of my product by Rs. 2, what is the effect on my ROI (Return On Investment)?
If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
Which of my customers are likely to be the most loyal?

Data Mining helps extract such information

Application Areas
Industry | Application
Finance | Credit Card Analysis
Insurance | Claims, Fraud Analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data Service providers | Value added data
Utilities | Power usage analysis

Data Mining in Use


The US Government uses Data Mining to track fraud
A Supermarket becomes an information broker
Basketball teams use it to track game strategy
Warranty Claims Routing
Holding on to Good Customers
Weeding out Bad Customers

What makes data mining possible?


Advances in the following areas are making data mining deployable:
  data warehousing
  better and more data (i.e., operational, behavioral etc.)
  the emergence of easily deployed data mining tools
  the advent of new data mining techniques

Why Separate Data Warehouse?

Performance
  Operational databases are designed and tuned for known transactions and workloads.
  Complex OLAP queries would degrade performance for operational transactions.
  Special data organization, access and implementation methods are needed for multidimensional views and queries.
Function
  Missing data: decision support requires historical data, which operational databases do not typically maintain.
  Data consolidation: decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
  Data quality: different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.

What are Operational Systems?


They are OLTP systems
Run mission critical applications
Need to work with stringent performance requirements for routine tasks
Used to run a business!

RDBMS used for OLTP


Database Systems have been used traditionally for OLTP
  clerical data processing tasks
  detailed, up to date data
  structured repetitive tasks
  read/update a few records
  isolation, recovery and integrity are critical

Operational Systems
Run the business in real time
Based on up-to-the-second data
Optimized to handle large numbers of simple read/write transactions
Optimized for fast response to predefined transactions
Used by people who deal with customers, products -- clerks, salespeople etc.
They are increasingly used by customers

Examples of Operational Data


Data | Industry | Usage | Technology | Volumes
Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, Client/Server, relational databases | Very Large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very Large
Production Record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium

So, what's different?

Application-Orientation vs. Subject-Orientation
[Figure: the operational database is application-oriented (Loans, Credit Card, Trust, Savings); the data warehouse is subject-oriented (Customer, Vendor, Product, Activity)]

OLTP vs. Data Warehouse


OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
  e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
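
The phone-call example above can be written as a query that restricts one measure (amount) along three dimensions (location, time of day, month). The following is a minimal sketch, assuming a hypothetical `calls` table whose name, columns and sample rows are invented for illustration:

```python
import sqlite3

# Hypothetical call-detail table; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (city TEXT, call_date TEXT, call_hour INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?)",
    [
        ("Pune", "1998-12-03", 10, 12.0),    # qualifies
        ("Pune", "1998-12-15", 16, 8.0),     # qualifies
        ("Pune", "1998-12-20", 21, 30.0),    # outside 9AM-5PM
        ("Pune", "1998-11-28", 11, 50.0),    # not December
        ("Indore", "1998-12-05", 12, 40.0),  # not Pune
    ],
)

# Restrict by location, hour of day and month, then aggregate the measure.
(avg_amount,) = conn.execute(
    """
    SELECT AVG(amount)
    FROM calls
    WHERE city = 'Pune'
      AND call_hour >= 9 AND call_hour < 17
      AND strftime('%m', call_date) = '12'
    """
).fetchone()
print(avg_amount)
```

A query like this scans many rows and touches several dimensions at once, which is the kind of workload an OLTP schema is not tuned for.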

OLTP vs Data Warehouse


OLTP | Warehouse (DSS)
Application Oriented | Subject Oriented
Used to run business | Used to analyze business
Detailed data | Summarized and refined
Current, up to date | Snapshot data
Isolated Data | Integrated Data
Repetitive access | Ad-hoc access
Clerical User | Knowledge User (Manager)

OLTP vs Data Warehouse

OLTP | Data Warehouse
Performance sensitive | Performance relaxed
Few records accessed at a time (tens) | Large volumes accessed at a time (millions)
Read/Update access | Mostly read (batch update)
No data redundancy | Redundancy present
Database size 100 MB - 100 GB | Database size 100 GB - few terabytes

OLTP vs Data Warehouse


OLTP | Data Warehouse
Transaction throughput is the performance metric | Query throughput is the performance metric
Thousands of users | Hundreds of users
Managed in entirety | Managed by subsets

To summarize ...
OLTP Systems are used to run a business
The Data Warehouse helps to optimize the business

Data Warehouse Architecture


[Figure: architecture diagram. Sources: relational databases, ERP systems, purchased data, legacy data. Components: extraction and cleansing, optimized loader, data warehouse engine, analyze/query, metadata repository.]

Components of the Warehouse


Data Extraction and Loading
The Warehouse
Analyze and Query -- OLAP Tools
Metadata
Data Mining tools


Loading the Warehouse

Cleaning the data before it is loaded

[Figure: operational/source data types: Sequential, Legacy, Relational, External]

Data Transformation
Cleaning

Correction of misspelling
Resolution of state codes and zip codes
Default values of missing data
Elimination of duplicate data

Standardization
Standardize the data types and field lengths for the same data
Semantic standardization
  Synonym: two or more terms from different systems mean the same thing
  Homonym: a single term means many different things in different source systems

Summarization

Data Quality - The Reality


Tempting to think creating a data warehouse is simply extracting operational data and entering it into a data warehouse
Nothing could be farther from the truth
Warehouse data comes from disparate questionable sources

Data Quality - The Reality


Legacy systems no longer documented
Outside sources with questionable quality procedures
Production systems with no built-in integrity checks and no integration
Operational systems are usually designed to solve a specific business problem and are rarely developed to a corporate plan
  "And get it done quickly, we do not have time to worry about corporate standards..."

Data Integration Across Sources


[Figure: four source applications (Savings, Loans, Trust, Credit card) annotated with integration problems]
Same data, different name
Different data, same name
Data found here, nowhere else
Different keys, same data

Data Transformation Example


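[Figure: four source applications feed the data warehouse; each represents the same item differently]

Encoding of the same attribute
appl A - m, f
appl B - 1, 0
appl C - x, y
appl D - male, female

Unit of measure
appl A - pipeline - cm
appl B - pipeline - in
appl C - pipeline - feet
appl D - pipeline - yds

Field name
appl A - balance
appl B - bal
appl C - currbal
appl D - balcurr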
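
The three kinds of mismatch on this slide (value encoding, unit of measure, field name) can each be resolved with a per-application lookup. The sketch below is illustrative only: which source code means "male" in applications B and C is an assumption, and `to_warehouse` is a hypothetical helper, not part of any tool.

```python
# Per-application value encodings. The slide gives only the code pairs;
# treating the first code of each pair as "male" is an assumption.
GENDER_MAPS = {
    "A": {"m": "male", "f": "female"},
    "B": {"1": "male", "0": "female"},
    "C": {"x": "male", "y": "female"},
    "D": {"male": "male", "female": "female"},
}

# Unit used by each application for pipeline length, and exact factors to cm.
PIPELINE_UNIT = {"A": "cm", "B": "in", "C": "feet", "D": "yds"}
TO_CM = {"cm": 1.0, "in": 2.54, "feet": 30.48, "yds": 91.44}

# Name of the balance field in each application.
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}


def to_warehouse(app, record):
    """Convert one source record into the single warehouse representation."""
    return {
        "gender": GENDER_MAPS[app][str(record["gender"])],
        "pipeline_cm": record["pipeline"] * TO_CM[PIPELINE_UNIT[app]],
        "balance": record[BALANCE_FIELD[app]],
    }


row_b = to_warehouse("B", {"gender": 1, "pipeline": 10, "bal": 250.0})
row_d = to_warehouse("D", {"gender": "female", "pipeline": 2, "balcurr": 75.5})
print(row_b)
print(row_d)
```

Keeping the mappings in tables rather than in code branches makes it easier to add a fifth source application later.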


Data Integrity Problems in Transformation

Same person, different spellings
  Agarwal, Agrawal, Aggarwal etc...
Multiple ways to denote company name
  Persistent Systems, PSPL, Persistent Pvt. LTD.
Use of different names
  mumbai, bombay
Different account numbers generated by different applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale
  manual entry leads to mistakes
  "in case of a problem use 9999999"

Data Transformation Terms


Extracting
Conditioning
Scrubbing
Merging
Householding

Enrichment
Scoring
Loading
Validating
Delta Updating


Data Transformation Terms


Extracting
Capture of data from operational source in "as is" status

Conditioning
The conversion of data types from the source to the target data store (warehouse) -- always a relational database

Data Transformation Terms


Householding
Identifying all members of a household (living at the same address)
Ensures only one mail is sent to a household
Can result in substantial savings: 1 lakh catalogues at Rs. 50 each costs Rs. 50 lakhs. A 2% savings would save Rs. 1 lakh.
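
A minimal householding sketch, assuming addresses can be matched after simple normalization (real tools do much more, and the customer names and addresses here are made up). It also reproduces the savings arithmetic from the slide.

```python
from collections import defaultdict

# Hypothetical customer list: (name, address as entered in the source system).
customers = [
    ("Anand", "12 MG Road, Pune"),
    ("Seshadri", "12 mg road,  pune"),
    ("Meera", "7 Lake View, Indore"),
]


def normalize_address(addr):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(addr.lower().split())


# Group customers by normalized address: one entry per household.
households = defaultdict(list)
for name, addr in customers:
    households[normalize_address(addr)].append(name)

mailings = len(households)  # one catalogue per household, not per customer

# Savings arithmetic from the slide: 1 lakh catalogues at Rs. 50 each.
catalogues = 100_000
cost_per_catalogue = 50
total_cost = catalogues * cost_per_catalogue  # Rs. 50 lakhs
saving = total_cost * 2 // 100                # 2% fewer mailings

print(mailings, total_cost, saving)
```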

Data Transformation Terms


Enrichment
Bring data from external sources to augment/enrich operational data. Data sources include Dun & Bradstreet, A. C. Nielsen, CMIE, IMRA etc.

Scoring
Computation of the probability of an event, e.g., the chance that a customer will defect to AT&T from MCI, or the chance that a customer is likely to buy a new product.

Scrubbing Data
Sophisticated transformation tools.
Used for improving the quality of data
Clean data is vital for the success of the warehouse
Example
  Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person
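
One way to catch spelling variants like those in the example is approximate string matching. The sketch below uses Python's standard `difflib`; the 0.85 threshold and the token-by-token comparison are assumptions chosen for this example, not how any particular scrubbing tool works.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between two strings, in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def refers_to(name, surname, threshold=0.85):
    """True if any word in `name` is close enough to `surname`."""
    tokens = name.lower().replace(".", " ").replace(",", " ").split()
    return any(similarity(tok, surname.lower()) >= threshold for tok in tokens)


variants = [
    "Seshadri",
    "Sheshadri",
    "Sesadri",
    "Seshadri S.",
    "Srinivasan Seshadri",
]
matches = [v for v in variants if refers_to(v, "Seshadri")]
print(matches)
print(refers_to("Anand", "Seshadri"))
```

A threshold that is too low merges different people; one that is too high misses variants, so candidate matches are usually reviewed before records are merged.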

Loads
After extracting, scrubbing, cleaning, validating etc., need to load the data into the warehouse
Issues
  huge volumes of data to be loaded
  small time window available when warehouse can be taken off line (usually nights)
  when to build index and summary tables
  allow system administrators to monitor, cancel, resume, change load rates
  recover gracefully -- restart after failure from where you were and without loss of data integrity

Load Taxonomy
Incremental versus Full loads
Online versus Offline loads
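
The difference between a full and an incremental load can be sketched as follows. This assumes each source row carries a modification timestamp; the `modified` field and the row layout are hypothetical.

```python
# Hypothetical source table: account id -> current balance and last-modified time.
source = {
    101: {"balance": 500, "modified": 1},
    102: {"balance": 750, "modified": 3},
    103: {"balance": 120, "modified": 5},
}


def full_load(source):
    """Rebuild the warehouse copy from every source row."""
    return {key: row["balance"] for key, row in source.items()}


def incremental_load(warehouse, source, last_load_time):
    """Apply only rows changed since the previous load (the delta)."""
    delta = {
        key: row["balance"]
        for key, row in source.items()
        if row["modified"] > last_load_time
    }
    refreshed = dict(warehouse)
    refreshed.update(delta)
    return refreshed, len(delta)


# Warehouse state as of the previous load at time 2.
warehouse = {101: 500, 102: 700}
refreshed, rows_touched = incremental_load(warehouse, source, last_load_time=2)
print(refreshed, rows_touched)
```

Both approaches end in the same state here, but the incremental load touched two rows instead of three; on large tables that gap is what makes a short load window feasible.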


Refresh
Propagate updates on source data to
the warehouse
Issues:
when to refresh
how to refresh -- refresh techniques


When to Refresh?
periodically (e.g., every night, every week) or after significant events
on every update: not warranted unless the warehouse requires current data (e.g., up-to-the-minute stock quotes)
refresh policy set by administrator based on user needs and traffic
possibly different policies for different sources

Refresh Techniques
Full Extract from base tables
  read entire source table: too expensive
  may be the only choice for legacy systems

Data Extraction and Cleansing


Extract data from existing operational and legacy data
Issues:
  Sources of data for the warehouse
  Data quality at the sources
  Merging different data sources
  Data Transformation
  How to propagate updates (on the sources) to the warehouse
  Terabytes of data to be loaded

Data Granularity in Warehouse


Summarized data stored
  reduces storage costs
  reduces CPU usage
  increases performance since a smaller number of records is processed
  design around traditional high level reporting needs
  tradeoff with volume of data to be stored and detailed usage of data

Granularity in Warehouse
Cannot answer some questions with summarized data
  Did Anand call Seshadri last month?
  Not possible to answer if only the total duration of calls by Anand over a month is maintained and individual call details are not.
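
The point can be shown with a few made-up call records: once only the monthly total per caller is kept, the question about a specific pair of people can no longer be answered.

```python
# Hypothetical call detail records: (caller, callee, minutes).
call_details = [
    ("Anand", "Seshadri", 12),
    ("Anand", "Meera", 30),
    ("Meera", "Anand", 5),
]

# Summarized form: total minutes per caller for the month.
monthly_total = {}
for caller, callee, minutes in call_details:
    monthly_total[caller] = monthly_total.get(caller, 0) + minutes

# Answerable from detail data only: the summary has no record of the callee.
did_call = any(
    caller == "Anand" and callee == "Seshadri"
    for caller, callee, _ in call_details
)

print(monthly_total)
print(did_call)
```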

Detailed data too voluminous



From the Data Warehouse to Data Marts

[Figure: levels from Data to Information: organizationally structured (data warehouse), departmentally structured, individually structured. History, normalization and detail range from more at the data warehouse to less at the individual level.]

Data Warehouse and Data Marts


[Figure: two layers]
OLAP Data Mart: lightly summarized, departmentally structured
Detailed Data Warehouse Data: organizationally structured, atomic

Characteristics of the
Departmental Data Mart
OLAP
Small
Flexible
Customized by Department
Source is departmentally structured data warehouse

Data Mart Centric


[Figure: Data Sources -> Data Marts -> Data Warehouse]

Problems with Data Mart Centric Solution

If you end up creating multiple warehouses, integrating them is a problem

True Warehouse
[Figure: Data Sources -> Data Warehouse -> Data Marts]

Typical OLAP Queries

Write a multi-table join to compare sales for each product line YTD this year vs. last year.
Repeat the above process to find the top 5 product contributors to margin.
Repeat the above process to find the sales of a product line to new vs. existing customers.
Repeat the above process to find the customers that have had negative sales growth.
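
The first query above can be sketched with a two-table join. The schema, table names and figures below are invented for illustration, and year-to-date is simplified to whole-year totals to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_line TEXT);
    CREATE TABLE sales (product_id INTEGER, sale_year INTEGER, amount REAL);
    INSERT INTO product VALUES (1, 'Beverages'), (2, 'Beverages'), (3, 'Toiletries');
    INSERT INTO sales VALUES
        (1, 1997, 100), (2, 1997, 50), (3, 1997, 80),
        (1, 1998, 120), (2, 1998, 70), (3, 1998, 60);
    """
)

# Join facts to the product dimension and put the two years side by side.
rows = conn.execute(
    """
    SELECT p.product_line,
           SUM(CASE WHEN s.sale_year = 1998 THEN s.amount ELSE 0 END) AS this_year,
           SUM(CASE WHEN s.sale_year = 1997 THEN s.amount ELSE 0 END) AS last_year
    FROM sales AS s
    JOIN product AS p ON p.product_id = s.product_id
    GROUP BY p.product_line
    ORDER BY p.product_line
    """
).fetchall()

# Follow-up question: which product lines had negative growth?
declining = [line for line, this_year, last_year in rows if this_year < last_year]
print(rows)
print(declining)
```

Each follow-up question on the slide means writing another variation of this join by hand, which is the repetition OLAP tools are meant to remove.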

Strengths of OLAP
It is a powerful visualization paradigm
It provides fast, interactive response times
It is good for analyzing time series
It can be useful to find some clusters and outliers
Many vendors offer OLAP tools

OLAP Is FASMI
Fast
Analysis
Shared
Multidimensional
Information


Multi-dimensional Data
"Hey... I sold $100M worth of goods"
Dimensions: Product, Region, Time
Hierarchical summarization paths
  Product: Industry -> Category -> Product
  Region: Country -> Region -> City -> Office
  Time: Year -> Quarter -> Month -> Day, with Week -> Day as an alternate path
[Figure: sales cube with axes Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), Region (N, S, W) and Month (1-7)]

A Visual Operation: Pivot (Rotate)

[Figure: sales table with Product on one axis (Juice 10, Cola 47, Milk 30, Cream 12) and Date on the other (3/1 to 3/4), shown before rotation]

Slicing and Dicing


The Telecomm Slice
[Figure: cube with axes Product (Household, Telecomm, Video, Audio), Region (Europe, Far East, India) and Sales Channel (Retail, Direct, Special)]
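
Slicing fixes one dimension to a single value; dicing keeps a sub-cube by restricting several dimensions. A minimal sketch on a dictionary-based cube, with invented sales figures and hypothetical helper names:

```python
# Hypothetical cube: (product, region, sales channel) -> sales.
cube = {
    ("Telecomm", "India", "Retail"): 40,
    ("Telecomm", "Europe", "Direct"): 25,
    ("Video", "India", "Retail"): 10,
    ("Audio", "Far East", "Special"): 5,
}


def slice_cube(cube, product):
    """Fix the product dimension; the result has only region and channel."""
    return {
        (region, channel): value
        for (prod, region, channel), value in cube.items()
        if prod == product
    }


def dice_cube(cube, products, regions):
    """Keep the cells whose product and region fall in the given sets."""
    return {
        key: value
        for key, value in cube.items()
        if key[0] in products and key[1] in regions
    }


telecomm = slice_cube(cube, "Telecomm")
diced = dice_cube(cube, {"Telecomm", "Video"}, {"India"})
print(telecomm)
print(diced)
```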


Roll-up and Drill Down


[Figure: hierarchy running from higher level of aggregation (top) to low-level details (bottom)]
Sales Channel
Region
Country
State
Location Address
Sales Representative
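
Roll-up aggregates along a hierarchy toward coarser levels; drill-down goes the other way by keeping more levels of the key. A small sketch with made-up facts and a hypothetical `roll_up` helper:

```python
from collections import defaultdict

# Hypothetical facts: (country, state, sales representative, amount).
facts = [
    ("India", "Maharashtra", "rep1", 100),
    ("India", "Maharashtra", "rep2", 50),
    ("India", "Madhya Pradesh", "rep3", 70),
    ("Japan", "Kanto", "rep4", 30),
]


def roll_up(facts, level):
    """Sum amounts over the first `level` hierarchy columns."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[:level]] += row[-1]
    return dict(totals)


by_country = roll_up(facts, 1)  # highest level of aggregation
by_state = roll_up(facts, 2)    # drill down one level
print(by_country)
print(by_state)
```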

What is data mining?

The iterative and interactive process of discovering valid, novel, useful, and understandable knowledge (patterns, models, rules etc.) in massive databases

What is data mining?


 Valid: generalize to the future
 Novel: what we don't know
 Useful: be able to take some action
 Understandable: leading to insight
 Iterative: takes multiple passes
 Interactive: human in the loop


Why data mining?

Data volume too large for classical analysis
  Number of records too large (millions or billions)
  High dimensional (attributes/features/fields) data (thousands)
Increased opportunity for access
  Web navigation, on-line collections

Data mining goals


Prediction
What? Opaque

Description
Why? Transparent


Data mining operations


Verification driven
  Validating hypothesis
  Querying and reporting (spreadsheets, pivot tables)
  Multidimensional analysis (dimensional summaries); On-Line Analytical Processing
  Statistical analysis

Data mining operations


Discovery driven
Exploratory data analysis
Predictive modelling
Database segmentation
Link analysis
Deviation detection


Data mining process


[Figure: Original Data -> (Selection) -> Target Data -> (Preprocessing) -> Preprocessed Data -> (Transformation) -> Transformed Data -> (Data Mining) -> Patterns -> (Interpretation) -> Knowledge]

Knowledge discovery in databases (KDD) is the non-trivial process of identifying valid, potentially useful and ultimately understandable patterns in data

[Figure: Operational Databases -> (Clean, Collect, Summarize) -> Data Warehouse -> (Data Preparation) -> Training Data -> (Data Mining) -> Model, Patterns -> (Verification, Evaluation)]

Knowledge Discovery


The Steps Involved: step 1


Developing an Understanding of
The application domain
The relevant prior knowledge
The goals of the end user


The Steps Involved: step 2


Creating a target data set
Selecting a data set, or focusing on a subset of variables, or data samples, on which discovery is to be performed

The Steps Involved: step 3

Data cleaning and preprocessing
Removal of noise or outliers
Collecting necessary information to model or account for noise
Strategies for handling missing data fields
Accounting for time sequence information and known changes

The Steps Involved: step 4


Data reduction and projection
Finding useful features to represent the data depending on the goal of the task
Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.

The Steps Involved: step 5


Choosing data mining task
Deciding whether the goal of the KDD process is classification, regression, clustering etc.

The Steps Involved: step 6


Choosing the data mining algorithms
Selecting method(s) to be used for searching for patterns in the data
Deciding which models and patterns may be appropriate
Matching a particular data mining method with the overall criteria of the KDD process

The Steps Involved: step 7


Data Mining
Searching for patterns of interest in a particular representational form

The Steps Involved: steps 8 & 9


Interpreting mined patterns
Consolidating discovered knowledge


Data mining process


Understand application domain
  Prior knowledge, user goals
Create target dataset
  Select data, focus on subsets
Data cleaning and transformation
  Remove noise, outliers, missing values
  Select features, reduce dimensions

Data mining process


Apply data mining algorithm
  Associations, sequences, classification, clustering, etc.
Interpret, evaluate and visualize patterns
  What's new and interesting?
  Iterate if needed
Manage discovered knowledge
  Close the loop

Data mining process


[Figure: flowchart starting from the Original DB. Steps shown: create/select target database, select sample, supply missing values, eliminate noisy data, normalize values, transform values, create derived attributes, relevant attributes, select DM task(s), select DM method(s), extract knowledge, test knowledge, refine knowledge, presentation and visualization]

How Data Mining is used


1. Identify the problem
2. Use data mining techniques to transform the data into information
3. Act on the information
4. Measure the results

The Data Mining Process


1. Understand the domain
2. Create a dataset:
   Select the interesting attributes
   Data cleaning and preprocessing
3. Choose the data mining task and the specific algorithm
4. Interpret the results, and possibly return to 2

Origins of Data Mining

Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Must address:
  Enormity of data
  High dimensionality of data
  Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics, AI / Machine Learning, and Database systems]

Data Mining Tasks


1. Classification: learning a function that maps an item into one of a set of predefined classes
2. Regression: learning a function that maps an item to a real value
3. Clustering: identify a set of groups of similar items

Data Mining Tasks


4. Dependencies and associations: identify significant dependencies between data attributes
5. Summarization: find a compact description of the dataset or a subset of the dataset

Data Mining Methods


1. Decision Tree Classifiers:
   Used for modeling, classification
2. Association Rules:
   Used to find associations between sets of attributes
3. Sequential patterns:
   Used to find temporal associations in time series
4. Hierarchical clustering:
   Used to group customers, web users, etc.
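
As an illustration of method 2, association rules are commonly evaluated with two measures, support and confidence. The sketch below computes both on a handful of made-up market baskets; the function names are hypothetical.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "cola"},
]


def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(lhs, rhs):
    """Of the baskets containing `lhs`, the fraction that also contain `rhs`."""
    return support(lhs | rhs) / support(lhs)


rule_support = support({"milk", "bread"})
rule_confidence = confidence({"milk"}, {"bread"})
print(rule_support, rule_confidence)
```

Here the rule "milk implies bread" holds in half of all baskets and in two of the three baskets that contain milk.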

Why Data Preprocessing?

Data in the real world is dirty
  incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  noisy: containing errors or outliers
  inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
  Quality decisions must be based on quality data
  Data warehouse needs consistent integration of quality data
  Required for both OLAP and Data Mining!

Why can Data be Incomplete?


 Attributes of interest are not available (e.g.,
customer information for sales transaction data)
 Data were not considered important at the time
of transactions, so they were not recorded!
 Data not recorded because of misunderstanding
or malfunctions
 Data may have been recorded and later deleted!
 Missing/unknown values for some data


Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy
data
Correct inconsistent data
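
The three cleaning tasks can be sketched on toy data as follows. Filling with the median, the 0-120 plausibility range for ages, and the city lookup table are all assumptions made for this example; real cleaning rules depend on the data.

```python
from statistics import median

# Hypothetical customer ages: one missing value and one obvious outlier.
ages = [25, 27, None, 26, 250, 24]

# 1. Fill in missing values (here: with the median of the known values).
known = [a for a in ages if a is not None]
fill_value = median(known)
filled = [fill_value if a is None else a for a in ages]

# 2. Identify outliers and smooth them (here: anything outside an assumed
#    plausible range is replaced by the median).
cleaned_ages = [a if 0 <= a <= 120 else fill_value for a in filled]

# 3. Correct inconsistent data (here: two names for the same city).
CITY_CANONICAL = {"bombay": "Mumbai", "mumbai": "Mumbai"}
cities = ["mumbai", "Bombay", "Pune"]
cleaned_cities = [CITY_CANONICAL.get(c.lower(), c) for c in cities]

print(cleaned_ages)
print(cleaned_cities)
```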

