AND
DATA MINING
Introduction
Data Warehousing, OLAP (On Line Analytical Processing) and data mining: what and why (now)?
Relation to OLTP (On Line Transaction Processing)
Dr. Pramod S.Nair, Medi-Caps
Group of Institutions, Indore
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
Data
A process of transforming data into information and making it available to users in a timely enough manner to make a difference [Forrester Research, April 1996]
Evolution
60s: Batch reports
hard to find and analyze information
inflexible and expensive, reprogram every new request
Weather images
Intelligence Agency
Videos
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data used in support of decision making.
[Diagram: Purchased Data and Legacy Data; Extraction, Cleansing; Data Warehouse Engine; Metadata Repository; Analyze, Query]
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used by managers and end-users to understand the business and make judgments
Application Areas
Industry | Application
Finance | Credit Card Analysis
Insurance | Claims, Fraud Analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data Service providers | Value added data
Utilities | Power usage analysis
Function
Missing data: Decision support requires historical data, which operational databases do not typically maintain.
Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
Operational Systems
Run the business in real time
Based on up-to-the-second data
Optimized to handle large numbers of simple read/write transactions
Optimized for fast response to predefined transactions
Used by people who deal with customers, products -- clerks, salespeople etc.
They are increasingly used by customers
Data | Industry | Usage | Technology | Volumes
Customer File | All | Track customer details | |
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, Client/Server, relational databases | Very Large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very Large
Production Record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium
Application-Orientation vs. Subject-Orientation
Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity
Warehouse (DSS)
Subject Oriented
Used to analyze business
Summarized and refined
Snapshot data
Integrated Data
Ad-hoc access
Knowledge User (Manager)
Data Warehouse
Performance relaxed
Large volumes accessed at a time (millions)
Mostly Read (Batch Update)
Redundancy present
Database Size: 100 GB - few terabytes
Data Warehouse
Query throughput is the performance metric
Hundreds of users
Managed by subsets
To summarize ...
OLTP Systems are used to run a business
The Data Warehouse helps to optimize the business
[Diagram: ERP Systems, Purchased Data and Legacy Data; Extraction, Cleansing; Data Warehouse Engine; Metadata Repository; Analyze, Query]
Source Data
Operational/Source Data: Sequential, Legacy, Relational, External
Data Transformation
Cleaning
Correction of misspelling
Resolution of state codes and zip codes
Default values of missing data
Elimination of duplicate data
Standardization
Standardize the data types and field lengths for same data
Semantic standardization
Synonym: two or more terms from different systems mean the same thing
Homonym: a single term means many different things in different source systems
Summarization
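The cleaning steps listed above (resolving state codes, defaulting missing values, eliminating duplicates) can be sketched roughly as follows. This is a minimal illustration only: the records, the `STATE_CODES` lookup, and the `DEFAULT_ZIP` placeholder are all hypothetical and not taken from the slides.

```python
# Hypothetical customer records from two source systems (illustrative data only).
RAW_ROWS = [
    {"name": "A. Kumar", "state": "madhya pradesh", "zip": "452001"},
    {"name": "A. Kumar", "state": "MP", "zip": "452001"},  # duplicate once cleaned
    {"name": "R. Shah", "state": "Mah.", "zip": None},      # missing zip code
]

# Assumed lookup table that resolves spelling variants to one state code.
STATE_CODES = {"madhya pradesh": "MP", "mp": "MP", "mah.": "MH", "maharashtra": "MH"}

# Assumed placeholder used when a zip code is missing.
DEFAULT_ZIP = "000000"


def clean(rows):
    """Resolve state codes, default missing zips, and drop duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        # Resolution of state codes: unknown spellings are kept unchanged.
        state = STATE_CODES.get(row["state"].strip().lower(), row["state"])
        # Default values for missing data.
        zip_code = row["zip"] if row["zip"] else DEFAULT_ZIP
        key = (row["name"], state, zip_code)
        if key in seen:  # elimination of duplicate data
            continue
        seen.add(key)
        cleaned.append({"name": row["name"], "state": state, "zip": zip_code})
    return cleaned


CLEANED = clean(RAW_ROWS)
```

Note that duplicates are only detected after standardization, which is why cleaning and standardization are usually applied before de-duplication.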
Same data, different name
Different data, same name
Different keys, same data
[Diagram: source applications shown are Loans, Trust, Credit card]
appl A - m,f
appl B - 1,0
appl C - x,y
appl D - male, female

appl A - pipeline - cm
appl B - pipeline - in
appl C - pipeline - feet
appl D - pipeline - yds

appl A - balance
appl B - bal
appl C - currbal
appl D - balcurr
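The encodings listed above differ per application in three ways: gender codes, units of measure, and field names. A rough sketch of reconciling them into one warehouse representation might look like this. The target names (`gender`, `pipeline_cm`, `balance`) and the function `integrate` are hypothetical, and the meaning of the codes `1,0` and `x,y` is an assumption, since the slide does not say which value stands for which gender.

```python
# Gender encodings per application, as listed on the slide.
GENDER_MAPS = {
    "A": {"m": "M", "f": "F"},
    "B": {"1": "M", "0": "F"},  # assumption: 1 = male, 0 = female
    "C": {"x": "M", "y": "F"},  # assumption: x = male, y = female
    "D": {"male": "M", "female": "F"},
}

# Conversion factors to centimetres: cm, inches, feet, yards.
TO_CM = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}

# Name of the balance field in each application.
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}


def integrate(app, record):
    """Convert one source record into the common warehouse representation."""
    return {
        "gender": GENDER_MAPS[app][str(record["gender"]).lower()],
        "pipeline_cm": round(record["pipeline"] * TO_CM[app], 2),
        "balance": record[BALANCE_FIELD[app]],
    }


ROW_B = integrate("B", {"gender": 1, "pipeline": 10, "bal": 250.0})
ROW_D = integrate("D", {"gender": "Female", "pipeline": 2, "balcurr": 10})
```

Keeping the mappings in lookup tables rather than in code makes it easier to add a fifth application without touching `integrate`.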
Enrichment
Scoring
Loading
Validating
Delta Updating
Conditioning
The conversion of data types from the source to the target data store (warehouse), always a relational database
Scoring
Computation of a probability of an event, e.g., the chance that a customer will defect to AT&T from MCI, or the chance that a customer is likely to buy a new product.
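One common way to turn customer attributes into such a probability is a logistic model. The sketch below is only an illustration of the idea, not the method used in the slides: the attribute names, `WEIGHTS`, and `BIAS` are hypothetical, and in practice the weights would be fitted on historical data.

```python
import math

# Hypothetical weights; real values would be learned from historical data.
WEIGHTS = {"complaints_last_year": 0.8, "years_as_customer": -0.3}
BIAS = -1.0


def defection_score(customer):
    """Return a probability-like score in (0, 1) using a logistic function."""
    z = BIAS + sum(WEIGHTS[name] * customer[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


LOYAL = defection_score({"complaints_last_year": 0, "years_as_customer": 10})
AT_RISK = defection_score({"complaints_last_year": 5, "years_as_customer": 1})
```

The computed score would typically be stored as an extra column alongside the customer record when the warehouse is loaded.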
Scrubbing Data
Sophisticated transformation tools.
Used to clean data and improve its quality
Clean data is vital for the success of the warehouse
Example
Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person
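The name variants in the example above can be matched approximately with string similarity. The following sketch uses `difflib.SequenceMatcher` from the Python standard library; the canonical spelling, the `THRESHOLD` of 0.8, and the token-wise comparison are assumptions chosen for illustration, and real scrubbing tools use considerably more elaborate rules.

```python
from difflib import SequenceMatcher

CANONICAL = "seshadri"  # assumed reference spelling
THRESHOLD = 0.8         # assumed similarity cutoff


def refers_to_canonical(name):
    """True if any word in the name is close enough to the canonical spelling."""
    tokens = [token.strip(".,").lower() for token in name.split()]
    return any(
        SequenceMatcher(None, token, CANONICAL).ratio() >= THRESHOLD
        for token in tokens
        if token
    )


VARIANTS = ["Seshadri", "Sheshadri", "Sesadri", "Seshadri S.", "Srinivasan Seshadri"]
MATCHES = [refers_to_canonical(name) for name in VARIANTS]
```

A similarity match only suggests that two records may be the same person; a scrubbing process would normally confirm with further attributes such as address or date of birth.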
Loads
After extracting, scrubbing, cleaning, validating etc., the data needs to be loaded into the warehouse
Issues
huge volumes of data to be loaded
small time window available when warehouse can be taken off line (usually nights)
when to build index and summary tables
allow system administrators to monitor, cancel, resume, change load rates
Recover gracefully -- restart after failure from where you were and without loss of data integrity
Load Taxonomy
Incremental versus Full loads
Online versus Offline loads
Refresh
Propagate updates on source data to the warehouse
Issues:
when to refresh
how to refresh -- refresh techniques
When to Refresh?
periodically (e.g., every night, every week) or after significant events
on every update: not warranted unless warehouse users require current data (e.g., up-to-the-minute stock quotes)
refresh policy set by administrator based on user needs and traffic
possibly different policies for different sources
Refresh Techniques
Full Extract from base tables
read entire source table: too expensive
maybe the only choice for legacy systems
Granularity in Warehouse
Cannot answer some questions with summarized data
Did Anand call Seshadri last month?
Not possible to answer if only the total duration of calls by Anand over a month is maintained and individual call details are not.
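The point can be made concrete with a small sketch. The call records below are hypothetical; the example only shows that the question is answerable from detail rows but not from the monthly totals derived from them.

```python
from collections import defaultdict

# Hypothetical call detail records: (caller, callee, minutes).
CALLS = [("Anand", "Seshadri", 5), ("Anand", "Ravi", 12), ("Ravi", "Anand", 3)]


def did_call(calls, caller, callee):
    """Answerable only when individual call details are kept."""
    return any(c == caller and d == callee for c, d, _minutes in calls)


def monthly_totals(calls):
    """Summarized form: total minutes per caller, callee information is lost."""
    totals = defaultdict(int)
    for caller, _callee, minutes in calls:
        totals[caller] += minutes
    return dict(totals)


TOTALS = monthly_totals(CALLS)
```

`TOTALS` records that Anand talked for 17 minutes in total, but nothing in it says to whom, so `did_call` cannot be computed from it.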
[Diagram labels: Less / More; History; Normalized; Detailed; Departmentally Structured; Organizationally Structured; Data; Data Warehouse]
Organizationally structured
Atomic
Detailed Data Warehouse Data
Characteristics of the Departmental Data Mart
OLAP
Small
Flexible
Customized by Department
Source is departmentally structured data warehouse
Data Marts
Data Warehouse
True Warehouse
Data Sources
Data Warehouse
Data Marts
Strengths of OLAP
It is a powerful visualization paradigm
It provides fast, interactive response times
It is good for analyzing time series
It can be useful to find some clusters and outliers
Many vendors offer OLAP tools
OLAP Is FASMI
Fast
Analysis
Shared
Multidimensional
Information
Multi-dimensional Data
Hey... I sold $100M worth of goods
Dimensions: Product, Region, Time
Hierarchical summarization paths
Product: Industry > Category > Product
Region: Country > Region > City > Office
Time: Year > Quarter > Month or Week > Day
[Cube diagram: Product (Juice, Cola, Milk, Cream, Toothpaste, Soap) by Region (W, S, N) by Month (1 to 7)]
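Summarizing along these hierarchies (often called a roll-up) can be sketched as a simple aggregation. The fact rows and the product-to-category mapping below are hypothetical; only the dimension names and the Month to Quarter and Product to Category paths come from the slide.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, region, month, sales).
FACTS = [
    ("Juice", "N", 1, 10), ("Juice", "S", 2, 5),
    ("Cola", "N", 1, 20), ("Cola", "W", 4, 7),
    ("Soap", "S", 4, 3),
]

# Assumed Product -> Category mapping for the products on the slide.
CATEGORY = {
    "Juice": "Beverage", "Cola": "Beverage",
    "Milk": "Dairy", "Cream": "Dairy",
    "Toothpaste": "Personal care", "Soap": "Personal care",
}


def quarter(month):
    """Map a month number (1-12) to its quarter (1-4)."""
    return (month - 1) // 3 + 1


def roll_up(facts):
    """Aggregate sales to (category, quarter), summing over all regions."""
    cube = defaultdict(int)
    for product, _region, month, sales in facts:
        cube[(CATEGORY[product], quarter(month))] += sales
    return dict(cube)


SUMMARY = roll_up(FACTS)
```

Dropping the region from the key is what sums over that dimension; keeping it in the key would give a roll-up on Product and Time only.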
[Diagram: axes Product and Date; values Juice 10, Cola 47, Milk 30, Cream 12]
[Diagram: Product (Household, Telecomm, Video, Audio); Europe, Far East, India; Sales Channel (Retail, Direct, Special)]
[Diagram labels: Sales Channel; Region, Country, State, Location Address, Sales Representative; Low-level Details]
Description
Why? Transparent
[Diagram: Original Data; Selection; Target Data; Preprocessed Data; Transformed Data; Patterns; Knowledge]
[Diagram: Operational Databases; Clean, Collect, Summarize; Data Warehouse; Data Preparation; Training Data; Data Mining; Model, Patterns; Verification, Evaluation]
Knowledge Discovery
Create/select target database
Select sample
Eliminate noisy data
Supply missing values
Create derived attributes
Relevant attributes
Select DM method(s)
Select DM task(s)
Refine knowledge
Presentation, visualization
[Diagram labels: Statistics; AI / Machine Learning; Database systems; Data Mining]
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
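A rough sketch of the first two tasks follows, using only the Python standard library. The specific choices here are assumptions, not prescribed by the slides: missing values are filled with the median, outliers are flagged with Tukey's interquartile-range rule, and noise is smoothed with a centered moving average.

```python
import statistics


def fill_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def find_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


def smooth(values, window=3):
    """Centered moving average; endpoints average the neighbors that exist."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

For example, `fill_missing([1, None, 3])` fills the gap with the median of 1 and 3. Other fill strategies (a constant, the mean, or a model-based estimate) fit the same interface.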