
Data Warehousing & Data Mining

D Venkata Subramanin
August 2nd 2011
For TJ Institute of Technology,
DWH/DataMining for CSE IV A & B sections

Topics To Be Covered
Recap and Overview of DWH UNIT I
Business Analysis & tools UNIT II
Details about OLAP UNIT II
Data Mining & Algorithms UNIT III TO V
Quick introduction of the topic and algorithms

Decision Support Systems


Created to facilitate the decision
making process
So much information that it is
difficult to extract it all from a
traditional database
Need for a more comprehensive data
storage facility
Data Warehouse

Decision Support Systems


Extract Information from data to use as
the basis for decision making
Used at all levels of the Organization
Tailored to specific business areas
Interactive
Ad Hoc queries to retrieve and display
information
Combines historical operation data with
business activities

4 Components of DSS
Data Store: The DSS Database
Business Data
Business Model Data
Internal and External Data

Data Extraction and Filtering


Extract and validate data from the
operational database and the external
data sources

4 Components of DSS
End-User Query Tool
Create Queries that access either the
Operational or the DSS database

End User Presentation Tools


Organize and Present the Data

What is a Data Warehouse ?


Can I see credit report from Accounts, Sales from marketing and
open order report from order entry for this customer?

Data from multiple sources is integrated for a subject

A data warehouse is a subject-oriented, integrated, nonvolatile,
time-variant collection of data in support
of management's decisions.
- WH Inmon

Identical queries will give the same results at different
times. Supports analysis requiring historical data

WH Inmon - Regarded As Father Of Data Warehousing

Data is stored for a historical period. Data is populated in the
data warehouse on a daily/weekly basis depending upon the
requirement.

Data Growth

In 2 years (2003 to 2005),
the size of the largest database TRIPLED!

Data Growth Rate


Twice as much information was created
in 2002 as in 1999 (~30% growth rate)
Other growth rate estimates even
higher
Very little data will ever be looked at by
a human

Knowledge Discovery is NEEDED to make
sense and use of data.

Data Warehouse: Subject-Oriented
Organized around major subjects, such as
customer, product, sales.
Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing.
Provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process.
April 26, 2016

Data Mining: Concepts and Techniques


Data Warehouse: Integrated
Constructed by integrating multiple,
heterogeneous data sources
relational databases, flat files, on-line
transaction records
Data cleaning and data integration techniques
are applied.
Ensure consistency in naming conventions,
encoding structures, attribute measures, etc.
among different data sources
E.g., Hotel price: currency, tax, breakfast covered,
etc.

When data is moved to the warehouse, it is converted.

Data Warehouse: Time-Variant
The time horizon for the data warehouse is
significantly longer than that of operational systems.
Operational database: current value data.
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
Contains an element of time, explicitly or implicitly
But the key of operational data may or may not
contain time element.
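
The difference in key structure can be sketched as follows. This is only an illustration: the account identifier, the dates, and the function names are assumptions made for the example, not part of the slides. The operational store is keyed by account alone, so each write overwrites the current value, while the warehouse key includes a time element, so each load adds a row.

```python
from datetime import date

# Operational store: keyed by account only, so a write replaces the value.
operational = {}
# Warehouse: the key contains a time element (account, snapshot date),
# so every load adds a new row and earlier values are kept.
warehouse = {}


def record_balance(account, balance, as_of):
    """Apply one balance reading to both stores (hypothetical helper)."""
    operational[account] = balance
    warehouse[(account, as_of)] = balance


def balance_history(account):
    """Return (date, balance) pairs for one account, oldest first."""
    return sorted((d, b) for (a, d), b in warehouse.items() if a == account)


record_balance("ACC-1", 100.0, date(2011, 6, 30))
record_balance("ACC-1", 250.0, date(2011, 7, 31))
```

After the two calls the operational store holds only the latest balance, while `balance_history("ACC-1")` returns both snapshots.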

Data Warehouse: Nonvolatile
A physically separate store of data transformed from
the operational environment.
Operational update of data does not occur in the
data warehouse environment.
Does not require transaction processing, recovery,
and concurrency control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data.


Data Warehouse vs. Heterogeneous DBMS

Traditional heterogeneous DB integration:

Build wrappers/mediators on top of heterogeneous databases


Query driven approach
When a query is posed to a client site, a meta-dictionary is
used to translate the query into queries appropriate for
individual heterogeneous sites involved, and the results are
integrated into a global answer set
Complex information filtering, compete for resources

Data warehouse: update-driven, high performance


Information from heterogeneous sources is integrated in
advance and stored in warehouses for direct query and analysis


Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing)


Major task of data warehouse system
Data analysis and decision making

Distinct features (OLTP vs. OLAP):


User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries


OLTP vs. OLAP
(each row reads: OLTP vs. OLAP)
users: clerk, IT professional vs. knowledge worker
function: day to day operations vs. decision support
DB design: application-oriented vs. subject-oriented
data: current, up-to-date, detailed, flat relational, isolated vs. historical, summarized, multidimensional, integrated, consolidated
usage: repetitive vs. ad-hoc
access: read/write, index/hash on prim. key vs. lots of scans
unit of work: short, simple transaction vs. complex query
# records accessed: tens vs. millions
# users: thousands vs. hundreds
DB size: 100MB-GB vs. 100GB-TB
metric: transaction throughput vs. query throughput, response

Why Separate Data Warehouse?

High performance for both systems


DBMS tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
Different functions and different data:
missing data: Decision support requires historical
data which operational DBs do not typically maintain
data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
data quality: different sources typically use
inconsistent data representations, codes and
formats which have to be reconciled

Subject-Oriented
Data is arranged and optimized to
provide answer to questions from
diverse functional areas
Data is organized and summarized by
topic
Sales / Marketing / Finance / Distribution /
Etc.

Integrated
The data warehouse is a centralized,
consolidated database that
integrates data derived from the
entire organization
Multiple Sources
Diverse Sources
Diverse Formats

Time-Variant
The Data Warehouse represents the
flow of data through time
Can contain projected data from
statistical models
Data is periodically uploaded then
time-dependent data is recomputed

Nonvolatile
Once data is entered it is NEVER removed
Represents the company's entire history
Near term history is continually added to it
Always growing
Must support terabyte databases and
multiprocessors

Read-Only database for data analysis and
query processing

Subject-Oriented - Characteristics of a Data Warehouse
[Figure: Operational systems are organized by application (Leads, Quotes, Prospects, Orders); the Data Warehouse is organized by subject (Customers, Products, Regions, Time)]
Focus is on Subject Areas rather than Applications

Integrated - Characteristics of a Data Warehouse
Encoding: Appl A - m,f; Appl B - 1,0; Appl C - male,female -> warehouse: m,f
Attribute measurement: Appl A - balance dec fixed (13,2); Appl B - balance pic 9(9)V99; Appl C - balance pic S9(7)V99 comp-3 -> warehouse: balance dec fixed (13,2)
Naming: Appl A - bal-on-hand; Appl B - current-balance; Appl C - cash-on-hand -> warehouse: Current balance
Date format: Appl A - date (julian); Appl B - date (yymmdd); Appl C - date (absolute) -> warehouse: date (julian)

Integrated View Is The Essence Of A Data Warehouse
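
The encoding and naming conversions listed above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the method of any particular tool: the mapping tables, the function name `to_warehouse`, and the sample records are hypothetical, and which of 1 and 0 stands for "male" in Appl B is assumed from the order in which the slide lists the codes.

```python
# Hypothetical per-application rules. Only the code values (m/f, 1/0,
# male/female) and the field names come from the slide; the assumption that
# 1 maps to "m" and 0 to "f" follows the order in which the codes are listed.
GENDER_CODES = {
    "A": {"m": "m", "f": "f"},
    "B": {"1": "m", "0": "f"},
    "C": {"male": "m", "female": "f"},
}
BALANCE_FIELDS = {
    "A": "bal-on-hand",
    "B": "current-balance",
    "C": "cash-on-hand",
}


def to_warehouse(app, record):
    """Convert one source record to the single warehouse representation."""
    gender_code = str(record["gender"]).lower()
    return {
        "gender": GENDER_CODES[app][gender_code],
        # One name and one precision (two decimals) for every source.
        "current_balance": round(float(record[BALANCE_FIELDS[app]]), 2),
    }


rows = [
    to_warehouse("A", {"gender": "m", "bal-on-hand": "120.50"}),
    to_warehouse("B", {"gender": 0, "current-balance": 75}),
    to_warehouse("C", {"gender": "Female", "cash-on-hand": 10.129}),
]
```

Every record in `rows` ends up with the same two fields and the same codes, whichever application it came from. Date conversion is left out because the slide does not define the three date formats precisely.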

Non-volatile - Characteristics of a Data Warehouse
[Figure: Operational data is changed record by record (insert, change, delete, replace); the Data Warehouse is only loaded and then accessed read-only]

Time Variant - Characteristics of a Data Warehouse
Operational
Current Value data
time horizon : 60-90 days
key may not have element of time
Data Warehouse
Snapshot data
time horizon : 5-10 years
key has an element of time
data warehouse stores historical data
Data Warehouse Typically Spans Across Time

Alternate Definitions
A collection of integrated, subject
oriented databases designed to
support the DSS function, where
each unit of data is relevant to some
moment of time
- Imhoff

Alternate Definitions
Data Warehouse is a repository of data
summarized or aggregated in
simplified form from operational
systems. End user orientated data
access and reporting tools let user
get at the data for decision support
- Babcock

12 Rules of a Data
Warehouse
Data Warehouse and Operational
Environments are Separated
Data is integrated
Contains historical data over a long
period of time
Data is snapshot data captured at a
given point in time
Data is subject-oriented

12 Rules of Data Warehouse


Mainly read-only with periodic batch
updates
Development Life Cycle has a data
driven approach versus the
traditional process-driven approach
Data contains several levels of detail
Current, Old, Lightly Summarized, Highly
Summarized

12 Rules of Data Warehouse


Environment is characterized by Read-only
transactions to very large data sets
System that traces data sources,
transformations, and storage
Metadata is a critical component
Source, transformation, integration,
storage, relationships, history, etc
Contains a chargeback mechanism for
resource usage that enforces optimal use
of data by end users

Need for Data Warehousing


Better business intelligence for end-users
Reduction in time to locate, access, and
analyze information
Consolidation of disparate information sources
Strategic advantage over competitors
Faster time-to-market for products and
services
Replacement of older, less-responsive decision
support systems
Reduction in demand on IS to generate reports

Data Warehouse
Architecture

ICS 541 - 062

Data Mining Concepts


Multi-Tiered Architecture
[Figure, four tiers from left to right:
Data Sources: Operational DBs, other sources
Data Storage: Extract, Transform, Load, Refresh; Monitor & Integrator; Metadata; Data Warehouse; Data Marts
OLAP Engine: OLAP Server
Front-End Tools: Analysis, Query, Reports, Data mining]

Typical Data Warehouse Architecture
[Figure components: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); Data Warehouse; Data Marts; Metadata; Middleware/API; front ends: EIS/DSS, Query Tools, OLAP/ROLAP, Web Browsers, Data Mining]
Multi-tiered Data Warehouse without ODS

Typical Data Warehouse Architecture
[Figure components: Operational Systems/Data; Data Preparation (Select, Extract, Transform, Integrate, Maintain); ODS; Data Preparation (Select, Extract, Transform, Load); Data Warehouse; Data Marts; Metadata at each stage]
Multi-tiered Data Warehouse with ODS

ETL - Extract, Transform and Load
As the name suggests, the ETL process
covers the following phases:
Extraction of data from data sources.
Transforming the extracted data to meet business
requirements.
Loading the data into the target warehouse/database.

ETL - Extract, Transform and Load
Data Extract
Get Data from source
Data Transformation
Data Cleansing - Data Quality Assurance
Data Scrubbing - Removing errors and inconsistencies
Processing Calculations
Applying Business Rules
Changing Data Types
Making the Data More Readable
Replacing Codes with Actual Values
Summarizing the Data
Data Load
Load data into Warehouse

Extraction
The first part of the ETL process.
Data under consideration is being extracted from the
different data sources.
The source data may use a different data
organization/format.
Some of the common data sources are :
Databases
Flat files

Transform
It involves applying a series of rules to the data
extracted from the source to derive the data to load
the target.
Depending on the requirement of the target, the
transformation rules may be simple or complex.
Transformation may involve :
Selecting only certain columns to load
Filtering
Sorting
Combining data from multiple sources
Generating Surrogate keys, etc.
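
The transformation steps listed above can be sketched in plain Python. The source rows, column names, and the `transform` function are assumptions made up for this illustration. Real ETL tools express the same steps through their own mappings.

```python
from itertools import count

# Hypothetical extracts from two sources with partly different columns.
crm_rows = [
    {"cust_code": "C-20", "name": "Asha", "country": "IN", "fax": "n/a"},
    {"cust_code": "C-07", "name": "Ravi", "country": "IN", "fax": "n/a"},
]
billing_rows = [
    {"cust_code": "C-07", "name": "Ravi", "country": "IN", "status": "active"},
    {"cust_code": "C-31", "name": "Test", "country": "", "status": "active"},
]


def transform(*sources):
    """Apply the listed transformation steps to the extracted rows."""
    keep = ("cust_code", "name", "country")
    combined = {}
    for rows in sources:                      # combine data from multiple sources
        for row in rows:
            slim = {k: row[k] for k in keep}  # select only certain columns
            if slim["country"]:               # filter out rows with no country
                # First occurrence of a customer code wins.
                combined.setdefault(slim["cust_code"], slim)
    ordered = sorted(combined.values(), key=lambda r: r["cust_code"])  # sort
    keys = count(1)                           # generate surrogate keys
    return [{"customer_key": next(keys), **row} for row in ordered]


dim_customer = transform(crm_rows, billing_rows)
```

The result has one row per customer code, ordered by code, each carrying a warehouse-generated `customer_key` that is independent of the source systems' own identifiers.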

Load
The last step of the ETL process
The load phase loads the transformed data to the
end target.
Depending on the requirement, the load phase may
be:
Full Load
Incremental Load
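
The two load modes can be contrasted with a small sketch using Python's built-in `sqlite3` module. The `sales` table, its columns, and the "highest sale_id already loaded" rule for detecting new rows are assumptions chosen for the example. Other change-detection rules (timestamps, change logs) are equally possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL, loaded_on TEXT)"
)


def full_load(rows):
    """Rebuild the whole target table from the current source extract."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM sales")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


def incremental_load(rows):
    """Append only rows whose sale_id is above the highest one already loaded."""
    (high_water,) = conn.execute(
        "SELECT COALESCE(MAX(sale_id), 0) FROM sales"
    ).fetchone()
    new_rows = [r for r in rows if r[0] > high_water]
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", new_rows)
    return len(new_rows)


day1 = [(1, 10.0, "2011-08-01"), (2, 20.0, "2011-08-01")]
day2 = day1 + [(3, 5.0, "2011-08-02")]

full_load(day1)                  # initial population
added = incremental_load(day2)   # only sale 3 is new
total_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

The incremental load touches one row instead of three, which is why it is usually preferred once the target has been populated.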

Importance of ETL
Data of an organization spread across multiple
geographies and domains.
Data organized in different format in different
sources.
Consolidation of the data to make it more
meaningful.
Applying Business rules enriches the value the data
provides.
Identifying the inconsistencies and providing a
unified view.
Improving the data quality.

Sample ETL Tools

Teradata Warehouse Builder from Teradata


DataStage from IBM
SAS System from SAS Institute
PowerMart/PowerCenter from Informatica
Sagent Solution from Sagent Software
Hummingbird Genio Suite from
Hummingbird Communications
Ab Initio
Oracle Data Warehouse Builder and ODI
from Oracle Corporation

Data Access and Analysis


It is the process of timely access and
analysis of data
It is the means by which the End
Users see the data warehouse or
the ODS or the Operational Systems

Data Access and Analysis Terminologies


Reporting
A category of data access solution in which the information is
presented in the form of reports
Reporting tools are also referred as Query and reporting tools

OLAP (On-Line Analytic Processing)


Defined as Fast Analysis of Multidimensional Information by
the OLAP council
Used interchangeably with BI
OLAP tools are synonymous with Multidimensional tools or
applications

DSS tools that use multidimensional data analysis
techniques
Support for a DSS data store
Data extraction and integration filter
Specialized presentation interface

Data Access and Analysis Terminologies


Data Mining

A process that uses a variety of statistical and
artificial intelligence frameworks to discover
patterns and relationships in data
Used to make valid predictions in data analysis
problems where the exact sequence and
nature of queries/questions to
be written/asked against the data to make the
prediction is not known and the number of
variables involved in the analysis is too large
to be intuitively handled by structured
querying or OLAP tools

Web Access

A category of data access solutions in which
information is viewed through a web browser

Importance of Data Access


Businesses today face challenges like

Large volume of data


User demands of flexible and timely access
to information
Extracting value from key business data

Data Access is the last mile that
enables decision makers to
reach the database infrastructure

Prompt, reliable data access


Lowers operating costs
Reduces error
Increases productivity.

Traditional Decision Making techniques
Spreadsheets and SQL are traditionally used
as tools for analysis and decision making
Limitations of Traditional techniques
It is very difficult to define the aggregation
levels, views in spreadsheets
SQL does not have a natural way of providing
flexible view reorganizations that will transpose
the data
Common analytic functions such as cumulative
average and total are not supported in SQL
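
For reference, the cumulative total and cumulative average mentioned above are simple to state procedurally. The sketch below computes them in Python over a hypothetical list of monthly sales figures. The data and variable names are assumptions for illustration, and SQL dialects that offer window functions can express the same calculation directly.

```python
from itertools import accumulate

# Hypothetical monthly sales figures, in month order.
monthly_sales = [100, 150, 50, 200]

# Cumulative total: running sum up to and including each month.
running_total = list(accumulate(monthly_sales))

# Cumulative average: running total divided by the number of months so far.
running_avg = [total / n for n, total in enumerate(running_total, start=1)]
```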

Design Considerations

Platform - SMP, MPP, NT, Unix


Target Database - RDBMS, MDDB
Partitioning
Data Preparation - Data Quality Audit,
Cleansing, Extraction, Transformation
Modeling - Facts & Dimensions
Information Directory - Metadata
Management
Warehouse Administration
End User Tools
Granularity - Detail and Summarization

Data Warehouse Hardware


Hardware Considerations
Parallelism
SMP or MPP
Disk Storage

Hardware Considerations
Parallelism
Most deployments of VLDB Data
Warehouses are on SMP or MPP

Hardware Considerations
Three options for Hardware
Symmetric Multiprocessing (SMP)
Shared Memory Architecture

Massively Parallel Processing (MPP)


Shared Nothing Architecture
Each node has its own memory and I/O

Non Uniform Memory Access (NUMA)


Cluster of SMP machines
Classified as large SMP machines

SMP vs MPP machines
(each row reads: SMP vs. MPP)
For Mission Critical OLTP or medium DSS vs. Complex Analytical large-scale DSS
Scale to 10-12 CPUs (now 30) vs. Scale to more than 100 CPUs
Growth is Slow and Steady vs. Growth is rapid and unpredictable
Database Size < 200GB vs. Database Size > 500 GB
Aim is Automation or Basic Decision Support vs. Primary aim is strategic advantage

Server Scalability
Massively Parallel Processing Machines (MPP)
Can scale up to 100s of processors
Suitable for Data Warehouse > 500 GB
Extremely complex data mining algorithms
Non Uniform Memory Architecture (NUMA)
High End SMP clusters
Suitable for Data Warehouses < 500 GB
Symmetric Multiprocessing Machines (SMP)
Scale up to 10-12 CPUs
Good for Data Warehouse < 200 GB
Single CPU Systems (entry level)
Desktops
Suitable only for small datamarts (<20GB)

OLTP Vs Warehouse
(each row reads: Operational System vs. Data Warehouse)
Transaction Processing vs. Query Processing
Time Sensitive vs. History Oriented
Operator View vs. Managerial View
Organized by transactions (Order, Input, Inventory) vs. Organized by subject (Customer, Product)
Relatively smaller database vs. Large database size
Many concurrent users vs. Relatively few concurrent users
Volatile Data vs. Non Volatile Data
Stores all data vs. Stores relevant data
Not Flexible vs. Flexible

Capacity Planning
[Figure: Processing Power plotted against Time of day]
Processing Load Peaks During the Beginning and End of Day

Examples Of Some Applications
[Figure labels: Manufacturers, Retailers, Customers]
Financial Reporting and Consolidation
Target Marketing
Market Segmentation
Budgeting
Credit Rating Agencies
Churn Analysis
Profitability Management
Event tracking

Data Marts
Small Data Stores
More manageable data sets
Targeted to meet the needs of small
groups within the organization
Small, Single-Subject data warehouse
subset that provides decision support
to a small group of people

Data Marts
Enterprise wide data warehousing
projects have a very large cycle time
Getting consensus between multiple
parties may also be difficult
Departments may not be satisfied with
priority accorded to them
Sometimes individual departmental
needs may be strong enough to warrant
a local implementation
Application/database distribution is also
an important factor

Data Marts
Subject or Application Oriented
Business View of Warehouse
Quick Solution to a specific Business
Problem
Finance, Manufacturing, Sales etc.
Smaller amount of data used for
Analytic Processing

A Logical Subset of The Complete Data Warehouse

Data Warehouses or Data Marts
For companies interested in changing their corporate
cultures or integrating separate departments, an
enterprise wide approach makes sense.
Companies that want a quick solution to a specific
business problem are better served by a standalone data
mart.
Some companies opt to build a warehouse
incrementally, data mart by data mart.

Data Warehouse and Data Mart
Scope
Data Warehouse: Application Neutral; Centralized, Shared; Cross LOB/enterprise
Data Marts: Specific Application Requirement; LOB, department; Business Process Oriented
Data Perspective
Data Warehouse: Historical Detailed data; Some summary
Data Marts: Detailed (some history); Summarized
Subjects
Data Warehouse: Multiple subject areas
Data Marts: Single partial subject; Multiple partial subjects

Data Warehouse and Data Mart
Data Sources
Data Warehouse: Many; Operational/External Data
Data Marts: Few; Operational, external data
Implementation Time Frame
Data Warehouse: 9-18 months for first stage; Multiple stage implementation
Data Marts: 4-12 months
Characteristics
Data Warehouse: Flexible, extensible; Durable/Strategic; Data orientation
Data Marts: Restrictive, non extensible; Short life/tactical; Project Orientation

Warehouse or Mart First?
Data Warehouse First
Expensive
Large development cycle
Change management is difficult
Difficult to obtain continuous corporate support
Technical challenges in building large databases
Data Mart First
Relatively cheap
Delivered in < 6 months
Easy to manage change
Can lead to independent and incompatible marts
Cleansing, transformation, modeling techniques may be incompatible

OLTP Systems Vs Data Warehouse

Remember
Between OLTP and Data Warehouse systems
users are different
data content is different,
data structures are different
hardware is different
Understanding The Differences Is The Key

Operational Data Store Definition
[Figure: systems A, B and C, the ODS and the Data Warehouse, placed across the Operational and DSS environments]

Operational Data Store


The ODS applies only to the world of
operational systems.
The ODS contains current valued and
near current valued data.
The ODS contains almost exclusively
all detail data
The ODS requires a full function,
update, record oriented environment.

Operational Data Store


Functions of an ODS
Converts Data,
Decides Which Data of Multiple Sources
Is the Best,
Summarizes Data,
Decodes/encodes Data,
Alters the Key Structures,
Alters the Physical Structures,
Reformats Data,
Internally Represents Data,
Recalculates Data.

Different kinds of Information Needs
Current: Is this medicine available in stock?
Recent: What are the tests this patient has completed so far?
Historical: Has the incidence of Tuberculosis increased in last 5 years in Southern region?

OLTP Vs ODS Vs DWH
Audience
OLTP: Operating Personnel
ODS: Analysts
Data Warehouse: Managers and analysts
Data access
OLTP: Individual records, transaction driven
ODS: Individual records, transaction or analysis driven
Data Warehouse: Set of records, analysis driven
Data content
OLTP: Current, real-time
ODS: Current and near-current
Data Warehouse: Historical
Data Structure
OLTP: Detailed
ODS: Detailed and lightly summarized
Data Warehouse: Detailed and Summarized
Data organization
OLTP: Functional
ODS: Subject-oriented
Data Warehouse: Subject-oriented
Type of Data
OLTP: Homogeneous
ODS: Homogeneous
Data Warehouse: Vast supply of very heterogeneous data

OLTP Vs ODS Vs DWH
Data redundancy
OLTP: Non-redundant within system; Unmanaged redundancy among systems
ODS: Somewhat redundant with operational databases
Data Warehouse: Managed redundancy
Data update
OLTP: Field by field
ODS: Field by field
Data Warehouse: Controlled batch
Database size
OLTP: Moderate
ODS: Moderate
Data Warehouse: Large to very large
Development Methodology
OLTP: Requirements driven, structured
ODS: Data driven, somewhat evolutionary
Data Warehouse: Data driven, evolutionary
Philosophy
OLTP: Support day-to-day operation
ODS: Support day-to-day decisions & operational activities
Data Warehouse: Support managing the enterprise

Methodology
Philosophy

END OF UNIT I RECAP


BEGINNING OF UNIT II
TOPIC
BUSINESS ANALYSIS

Principles of Data Warehouse / Business Analysis

The principal purpose of data warehousing is to provide
information to business users for strategic decision making.
The decision making process is the business analysis of the
information stored in a data warehouse
The business analysis is enabled by
Number of applications
Number of tools
Number of techniques
To provide various business focused views to business
domain experts.

TOOL CATEGORIES
5 Main Categories of decision support
tools
Reporting
Managed Query
Executive Information Systems
Online Analytical Processing ( OLAP)
Data Mining

Category 1 - Reporting Tools


Production reporting tools
Used by companies to generate regular reports or
support high volume batch jobs such as calculating
and printing pay checks or summary of revenues
by month
Written in COBOL or high level languages such
as .NET or Java, or built with custom tools; these are
expensive and are developed or customized based
on the needs of an organization

Desktop report writers
Designed for end users and used by end users or
business users on their desktops for designing,
developing and generating reports daily or on demand
Example: Crystal Reports

Category 2 - Managed Query Tools


These tools shield end users from the
complexities of SQL and database
structures by inserting a meta-layer
between users and the database.
The meta-layer is the software that provides
subject-oriented views of the database and
supports point-and-click or drag-and-drop
creation of the complex SQL needed
to search for or produce information. These tools
follow a three-tiered architecture to
improve scalability.
Examples: Cognos, Business Objects

Category 3 - Executive Information Systems

These tools predate report writers and
managed query tools
They were first deployed on mainframes
They provide customized graphical decision
support applications that give
managers and executives a high level view
of the business and access to external
sources such as custom and online feeds
Examples: Pilot Software, Platinum
Technology Forest and Trees, SAS

Category 4 - OLAP Tools

OLAP tools provide an intuitive way to view
corporate data
These tools aggregate data along common
business subjects or dimensions and then let
users navigate through the hierarchies and
dimensions with the click of a mouse button
Users can drill down, across or up levels in
each dimension or pivot and swap out
dimensions to change their view of the data.
Examples: Cognos PowerPlay, Brio Query

Need for the tools and applications for business analysis
Simple tabular form reporting
Ad-hoc user-specified queries
Pre-defined repeatable queries & Complex queries
with multi-table joins, multi-level
sub-queries & sophisticated search criteria
Ranking, Multivariable Analysis
Time Series Analysis
Data Visualization - graphing, charting & pivoting
Complex Textual Search
Statistical Analysis
Artificial Intelligence techniques for testing hypothesis
Information Mapping
Interactive Drill-Down Reporting and Analysis
(Mining)

QUERY AND REPORTING TOOLS
Must help with the following three
distinct types of reporting
1. Creation and viewing of STANDARD REPORTS
2. Definition and creation of AD HOC REPORTS
3. Data Exploration

Check Google, Wikipedia or any web site
to know more about some of the tools
1). Cognos Impromptu
2). PowerBuilder
3). Forte
4). Information Builders Cactus & Focus
5). Microsoft SQL Server (I'll provide the notes)
***Read about Purpose, Architecture,
Features, Supported DBMS and
applications

UNIT II
TOPIC
OLAP & MULTI-DIMENSIONAL MODELS

OLAP
Need or Drivers for OLAP
Need for More Intensive Decision Support
Multi-dimensional nature of the problems
Retrieval of very large data sets (100s of GBs or TBs) and
summarize them on the fly
The result set may look like a multi-dimensional spread-sheet hence
the term multi-dimensional. (traditional RDBMS supports two
dimensional relational model through SQL)
Solving modern business problems such as market analysis, financial
forecasting requires
Query centric and array oriented and multi-dimensional
database schemas

Nature of OLAP Analysis
Aggregation -- (total sales, percent-to-total)
Comparison -- Budget vs. Expenses
Ranking -- Top 10, quartile analysis
Access to detailed and aggregate data
Complex criteria specification
Visualization
Need interactive response to aggregate queries

Multi-dimensional Data
Measure - sales (actual, plan, variance)
[Figure: data cube with axes Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), Region (W, S, N) and Month (1 to 7)]
Dimensions: Product, Region, Time
Hierarchical summarization paths (levels listed from coarsest to finest)
Product: Industry, Category, Product
Region: Country, Region, City, Office
Time: Year, Quarter, Month, Week, Day

Conceptual Model for OLAP


Numeric measures to be analyzed
e.g. Sales (Rs), sales (volume), budget,
revenue, inventory

Dimensions
other attributes of data, define the
space
e.g., store, product, date-of-sale
hierarchies on dimensions
e.g. branch -> city -> state

Operations
Rollup: summarize data
e.g., given sales data, summarize sales
for last year by product category and
region

Drill down: get more details


e.g., given summarized sales as above,
find breakup of sales by city within each
region, or within the Andhra region
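
Roll-up and drill-down can both be viewed as the same aggregation run at different levels of a dimension hierarchy. The sketch below illustrates this with a handful of made-up fact rows. The products, cities, amounts, and the `aggregate` helper are assumptions for the example only.

```python
from collections import defaultdict

# Hypothetical fact rows: (category, product, region, city, sales).
facts = [
    ("Drinks", "Cola", "Andhra", "Hyderabad", 120),
    ("Drinks", "Juice", "Andhra", "Vijayawada", 80),
    ("Drinks", "Cola", "Kerala", "Kochi", 60),
    ("Soap", "Soap", "Andhra", "Hyderabad", 40),
]
COLS = {"category": 0, "product": 1, "region": 2, "city": 3}


def aggregate(rows, by):
    """Sum sales grouped by the named dimension levels."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[COLS[c]] for c in by)] += row[4]
    return dict(totals)


# Roll up: summarize sales by product category and region.
rollup = aggregate(facts, ["category", "region"])

# Drill down: break the Andhra region out by city.
andhra = [r for r in facts if r[2] == "Andhra"]
drilldown = aggregate(andhra, ["category", "region", "city"])
```

Here `rollup` gives one total per (category, region) pair, and `drilldown` splits the Andhra totals further by city, so the finer figures always add back up to the coarser ones.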


More Cube Operations


Slice and dice: select and project
e.g.: Sales of soft-drinks in Andhra over
the last quarter

Pivot: change the view of data

        Q1    Q2    Total
L       22    33     55
S       15    44     59
Total   37    77    114

        L     S     Total
Red     14     7     21
Blue    41    52     93
Total   55    59    114
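
Slice, dice and pivot can be sketched over a small set of cell-level rows. The eight rows below are hypothetical: the slides give only the two cross-tabs, so the individual values were chosen purely so that their totals agree with those tables. The `crosstab` helper is likewise an assumption for illustration.

```python
from collections import defaultdict

# Hypothetical cell-level data (size, colour, quarter, amount), chosen only
# so that the totals match the two cross-tabs shown above.
cells = [
    ("L", "Red", "Q1", 6), ("L", "Red", "Q2", 8),
    ("L", "Blue", "Q1", 16), ("L", "Blue", "Q2", 25),
    ("S", "Red", "Q1", 3), ("S", "Red", "Q2", 4),
    ("S", "Blue", "Q1", 12), ("S", "Blue", "Q2", 40),
]
AXES = {"size": 0, "colour": 1, "quarter": 2}


def crosstab(rows, row_dim, col_dim):
    """Sum amounts into a two-dimensional view keyed by (row, column)."""
    table = defaultdict(int)
    for r in rows:
        table[(r[AXES[row_dim]], r[AXES[col_dim]])] += r[3]
    return dict(table)


# First view: size by quarter.
size_by_quarter = crosstab(cells, "size", "quarter")

# Pivot: same data, viewed as colour by size instead.
colour_by_size = crosstab(cells, "colour", "size")

# Slice: fix one dimension value (colour = Red), keep the others.
red_slice = [r for r in cells if r[1] == "Red"]

# Dice: restrict several dimensions at once (size L, quarter Q2).
diced = [r for r in cells if r[0] == "L" and r[2] == "Q2"]
```

Pivoting changes only which dimensions label the rows and columns, so the grand total is the same in both views.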


More OLAP Operations


Hypothesis driven search: E.g.
factors affecting defaulters
view defaulting rate on age aggregated over
other dimensions
for particular age segment detail along
profession

Need interactive response to aggregate queries
=> precompute various aggregates

MOLAP vs ROLAP
MOLAP: Multidimensional array OLAP
ROLAP: Relational OLAP

Type    Size   Colour   Amount
Shirt   S      Blue         10
Shirt   L      Blue         25
Shirt   ALL    Blue         35
Shirt   S      Red           3
Shirt   L      Red           7
Shirt   ALL    Red          10
Shirt   ALL    ALL          45

ALL     ALL    ALL        1290

SQL Extensions
Cube operator
group by on all subsets of a set of
attributes (month,city)
redundant scan and sorting of data can
be avoided

Various other non-standard SQL
extensions by vendors
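
What the cube operator computes can be sketched in Python by grouping on every subset of the chosen attributes. The base rows reuse the Shirt amounts from the MOLAP vs ROLAP example, while the function name `cube` and the use of the string "ALL" for a rolled-up dimension are assumptions for this illustration. The naive version below rescans the rows once per subset, whereas a real cube implementation shares work between groupings to avoid exactly that redundant scanning.

```python
from collections import defaultdict
from itertools import combinations

# Base rows taken from the Shirt example: (size, colour, amount).
rows = [("S", "Blue", 10), ("L", "Blue", 25), ("S", "Red", 3), ("L", "Red", 7)]
DIMS = ("size", "colour")


def cube(rows, dims=DIMS):
    """Group by every subset of dims; a rolled-up dimension reads 'ALL'."""
    result = defaultdict(int)
    positions = range(len(dims))
    for r in range(len(dims) + 1):
        for kept in combinations(positions, r):
            # One pass over the data per subset (deliberately naive).
            for row in rows:
                key = tuple(row[i] if i in kept else "ALL" for i in positions)
                result[key] += row[-1]
    return dict(result)


totals = cube(rows)
```

For two dimensions this yields four groupings and nine result rows, including `("ALL", "Blue")`, `("ALL", "Red")` and the grand total `("ALL", "ALL")`.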


Strengths of OLAP
It is a powerful visualization
tool
It provides fast, interactive
response times
It is good for analyzing time
series
It can be useful to find
some clusters and outliers
Many vendors offer OLAP
tools

Brief History

Express and System W DSS


Online Analytical Processing - coined by
EF Codd in 1994 - white paper by
Arbor Software
Generally synonymous with earlier terms such as
Decisions Support, Business Intelligence, Executive
Information System
MOLAP: Multidimensional OLAP (Hyperion (Arbor
Essbase), Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube,
Microstrategy DSS Agent)

OLAP and Executive Information Systems

Andyne Computing -- Pablo
Arbor Software -- Essbase
Cognos -- PowerPlay
Comshare -- Commander OLAP
Holistic Systems -- Holos
Information Advantage -- AXSYS, WebOLAP
Informix -- Metacube
Microstrategies -- DSS/Agent
Oracle -- Express
Pilot -- LightShip
Planning Sciences -- Gentium
Platinum Technology -- ProdeaBeacon, Forest & Trees
SAS Institute -- SAS/EIS, OLAP++
Speedware -- Media

Microsoft OLAP strategy


Plato: OLAP server: powerful, integrating various
operational sources
OLE-DB for OLAP: emerging industry standard
based on MDX --> extension of SQL for OLAP
Pivot-table services: integrate with Office 2000
Every desktop will have OLAP capability.

Client side caching and calculations


Partitioned and virtual cube
Hybrid relational and multidimensional storage


Multidimensional Data Analysis Techniques
Advanced Data Presentation Functions
3-D graphics, Pivot Tables, Crosstabs, etc.
Compatible with Spreadsheets & Statistical
packages
Advanced data aggregations, consolidation
and classification across time dimensions
Advanced computational functions
Advanced data modeling functions

Advanced Database Support


Advanced Data Access Features
Access to many kinds of DBMSs, flat files,
and internal and external data sources
Access to aggregated data warehouse
data
Advanced data navigation (drill-downs
and roll-ups)
Ability to map end-user requests to the
appropriate data source
Support for Very Large Databases

Easy-to-Use End-User
Interface
Graphical User Interfaces
Much more useful if access is kept
simple

Client/Server Architecture
Framework for the new systems to be
designed, developed and
implemented
Divide the OLAP system into several
components that define its
architecture
Components may run on the same computer
or be distributed among several computers

OLAP Architecture
3 Main Modules
GUI
Analytical Processing Logic
Data-processing Logic

OLAP: 3 Tier DSS
Data Warehouse (Database Layer)
Store atomic data in industry standard Data Warehouse.
OLAP Engine (Application Logic Layer)
Generate SQL execution plans in the OLAP engine to obtain OLAP functionality.
Decision Support Client (Presentation Layer)
Obtain multidimensional reports from the DSS Client.

OLAP Client/Server Architecture

Relational OLAP
Relational Online Analytical Processing
OLAP functionality using relational
database and familiar query tools to store
and analyze multidimensional data

Multidimensional data schema support


Data access language & query
performance for multidimensional data
Support for Very Large Databases
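
As a rough illustration of relational OLAP, the sketch below uses Python's built-in sqlite3 module as a stand-in relational database and answers a multidimensional question with ordinary SQL. The sales table, its columns and its figures are hypothetical and are not taken from the slides.

```python
import sqlite3

# Hypothetical sales table; names and figures are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, city TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("South", "Chennai", "Q1", 100.0),
        ("South", "Chennai", "Q2", 150.0),
        ("South", "Madurai", "Q1", 50.0),
        ("North", "Delhi", "Q1", 200.0),
    ],
)

def totals(*dimensions):
    """Aggregate sales along the given dimensions with plain SQL GROUP BY.

    The column names are trusted constants here; real code should not
    build SQL from untrusted input.
    """
    cols = ", ".join(dimensions)
    sql = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(sql).fetchall()

rolled_up = totals("region")             # coarse view: one row per region
drilled_down = totals("region", "city")  # finer view: one row per city
```

Adding or removing a column from the GROUP BY is the relational counterpart of drilling down or rolling up along a dimension.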

Data Modeling for Data Warehouse
How to structure the data in your
data warehouse?
Process that produces abstract data
models for one or more database
components of the data warehouse
Modeling for Warehouse is different
from that for Operational database
Dimensional Modeling, Star Schema
Modeling or Fact/Dimension Modeling

Modeling Techniques
Entity-Relationship Modeling
Traditional modeling technique
Technique of choice for OLTP
Suited for corporate data warehouse

Dimensional Modeling
Analyzing business measures in the specific
business context
Helps visualize very abstract business
questions
End users can easily understand and navigate
the data structure

Entity-Relationship Modeling Basic Concepts


The ER modeling technique is a
discipline used to illuminate the
microscopic relationships among
data elements.
The highest art form of ER modeling
is to remove all redundancy in the
data.

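An Order Processing ER Model
[Figure: ER diagram of order processing. Entities shown: City, Sales District, Sales Region, Sales Country, Salesrep table, Customer Table, Order Header, Order Details, Item Table, Product Brand, Product Category; tables are linked by foreign keys (FK).]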

Entity-Relationship Modeling Basic Concepts


Entity
Object that can be observed and
classified by its properties and
characteristics
Business definition with a clear boundary
Characterized by a noun
Example
Product
Employee

Entity-Relationship Modeling Basic Concepts


Relationship
Structural interaction and association between entities
Described by a verb
Cardinality
1-1
1-M
M-M
Example: Books belong to Printed Media

Entity-Relationship Modeling Basic Concepts


Attributes
Characteristics and properties of entities
Example :
Book Id, Description, book category are
attributes of entity Book

Attribute names should be unique and self-explanatory
Primary Key, Foreign Key, Constraints
are defined on Attributes

Entity-Relationship Modeling
Why Not ?
Use of the ER modeling technique
defeats the basic allure of data
warehousing, namely intuitive and
high-performance retrieval of data.

Dimensional Modeling - Basic Concepts
Represents the data in a standard, intuitive
framework that allows for high-performance
access
Schema designed to process large, complex,
ad hoc and data-intensive queries
No concern for concurrency, locking and
insert/update/delete performance
Every dimensional model is composed of one
table with a multipart key, called the fact table,
and a set of smaller tables called dimension
tables
This characteristic "star-like" structure is often
called a star join
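
A minimal sketch of this structure, assuming Python's built-in sqlite3 module as the database. The table names (fact_sales, dim_product, dim_period), columns and rows are hypothetical, chosen only to show one fact table with a multipart key joined to its dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE dim_period  (period_key  INTEGER PRIMARY KEY, year INTEGER);
-- Fact table: multipart key made of the dimension keys, plus additive measures.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    period_key  INTEGER REFERENCES dim_period(period_key),
    units INTEGER,
    sales_amount REAL,
    PRIMARY KEY (product_key, period_key)
);
INSERT INTO dim_product VALUES (1, 'Acme'), (2, 'Zenith');
INSERT INTO dim_period  VALUES (1, 2010), (2, 2011);
INSERT INTO fact_sales  VALUES (1, 1, 10, 100.0), (1, 2, 20, 220.0), (2, 2, 5, 75.0);
""")

# Star join: group by dimension attributes, sum the facts.
star_join = conn.execute("""
    SELECT p.brand, t.year, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_period  t ON t.period_key  = f.period_key
    GROUP BY p.brand, t.year
    ORDER BY p.brand, t.year
""").fetchall()
```

Every join in the query goes from the fact table directly to one dimension table, which is what gives the schema its star shape.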

Star Schema Representation


Fact and Dimensions are represented by
physical tables in the data warehouse database
Fact tables are related to each dimension table
in a Many to One relationship (Primary/Foreign
Key Relationships)
Fact Table is related to many dimension tables
The primary key of the fact table is a
composite key made up of the foreign keys
to the dimension tables
Each fact table is designed to answer a specific
DSS question

Star Schema
The fact table is typically the largest
table in the star schema
Each dimension record is related to
thousands of fact records
Star Schema facilitates data retrieval
functions
DBMS first searches the Dimension
Tables before the larger fact table

Star Schema for Sales
[Figure: star schema for sales. The central fact table is keyed by CITY, PRODUCT, PERIOD and CUSTOMER and holds the measures SALES AMOUNT and UNITS. Surrounding dimension tables and their attributes: CITY (DISTRICT, STATE, REGION), PRODUCT (BRAND, CATEGORY, COLOR, SIZE), PERIOD (DAY, MONTH, QUARTER, YEAR), CUSTOMER (ADDRESS, CATEGORY, CONTACT).]

Dimensional Modeling - Basic Concepts
Fact Tables
The most useful facts in a fact table are
numeric and additive
Typically represents a business
transaction, or event that can be used in
analyzing business process
By nature fact tables are sparse
Usually very large - billions of records

Dimensional Modeling - Basic Concepts
Dimension Tables
Each dimension table has a single-part primary
key that corresponds exactly to one of the
components of the multipart key in the fact
table.
Dimension tables most often contain
descriptive textual information
Determine contextual background for facts
Examples :
Time
Location/Region
Customers

Dimensional Modeling - Basic Concepts
Measures
A numeric attribute of a fact
Represents performance or behavior of the
business relative to the dimensions
The actual numbers are called variables
Occupy very little space compared to Fact
Tables
Examples :
Quantity supplied
Transaction amount
Sales volume

Fact Table & Dimension Tables
Fact Tables
Numerical measurements of the business are stored in Fact Tables.
Dimension Tables
Dimensions are attributes about facts.

Conformed Dimensions
Dimension that means the same thing
with every possible fact table that it can
be joined with
Conformed dimensions are most essential for
The Bus Architecture
The integrated function of the Data Warehouse

Some common dimensions are :


Customer
Product
Location
Time

Surrogate Keys
All tables (facts and dimensions) should
not use production keys but Data
Warehouse generated surrogate keys
Production keys sometimes get reused
In case of mergers/acquisitions, surrogate keys protect you
from different key formats
Production systems may change their
systems to generalize key definitions
Using surrogate keys is usually faster
Can handle Slowly Changing dimensions well

Slowly Changing Dimensions


Certain kinds of dimension attribute changes need
to be handled differently in the Data Warehouse
Type I - Overwrite
e.g. Name Correction, Description changes

Type II - Partition History
e.g. Packing change, Customer movement
Create a new dimension record with a new surrogate key

Type III - Organizational changes
e.g. Sales Force Reorganization
Show sales broken down by new and old organizations
Need to create an old and a new field
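
The three change types can be sketched in a few lines of Python. This is only an illustration under assumptions: the CustomerRow record, its fields and the helper functions are hypothetical names, and a real warehouse would apply these changes to dimension tables rather than in-memory objects.

```python
from dataclasses import dataclass, replace
from itertools import count
from typing import Optional

@dataclass(frozen=True)
class CustomerRow:
    surrogate_key: int                   # warehouse-generated, never reused
    production_key: str                  # key from the source system
    name: str
    city: str
    previous_city: Optional[str] = None  # only used by Type III

_next_key = count(1)  # simple surrogate key generator

def new_row(production_key, name, city):
    return CustomerRow(next(_next_key), production_key, name, city)

def type1_overwrite(row, name):
    """Type I: correct the value in place; history is lost."""
    return replace(row, name=name)

def type2_new_version(row, city):
    """Type II: keep the old row and add a new one under a new surrogate key."""
    return [row, replace(row, surrogate_key=next(_next_key), city=city)]

def type3_old_and_new(row, city):
    """Type III: keep old and new values side by side in one row."""
    return replace(row, previous_city=row.city, city=city)

original = new_row("C-100", "A. Kumr", "Chennai")
corrected = type1_overwrite(original, "A. Kumar")
versions = type2_new_version(corrected, "Bangalore")
side_by_side = type3_old_and_new(corrected, "Bangalore")
```

Type II relies on surrogate keys: both versions share the production key C-100 but can be told apart, and joined to different facts, by their surrogate keys.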

Factless Fact Tables
For event tracking, e.g. attendance
[Figure: attendance fact table with no measures, only the keys Date_Key, Student_Key, Course_Key, Teacher_Key and Facility_Key, each joined to its dimension table (Date, Student, Course, Teacher, Facility).]
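
A small sketch of how a factless fact table is queried, assuming Python's built-in sqlite3 module. The key values are made up; the point is that the table has no measure columns, so analysis is done by counting rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Factless fact table: only dimension keys, no measure columns.
CREATE TABLE attendance (
    date_key INTEGER, student_key INTEGER, course_key INTEGER,
    teacher_key INTEGER, facility_key INTEGER
);
INSERT INTO attendance VALUES
    (20110801, 1, 10, 100, 5),
    (20110801, 2, 10, 100, 5),
    (20110802, 1, 11, 101, 6);
""")

# Each row records that an event happened, so questions are answered with COUNT.
attendance_by_course = conn.execute(
    "SELECT course_key, COUNT(*) FROM attendance GROUP BY course_key ORDER BY course_key"
).fetchall()
```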

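Coverage Tables
Problem: How to find out which products
on promotion did not sell?
The sales fact table only has rows for products that sold,
so it cannot answer this on its own
[Figure: sales fact table with keys Date_Key, Product_Key, Store_Key, Promotion_Key and measures Dollars Sold, Units Sold, joined to the Date, Product, Store and Promotion dimensions.]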

Coverage Tables
Solution - Coverage Tables
[Figure: Sales Promotion Coverage Table with keys Date_Key, Product_Key, Store_Key, Promotion_Key, joined to the Date, Product, Store and Promotion dimensions.]
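
One way to use a coverage table, sketched with Python's built-in sqlite3 module: list every promoted product in the coverage table, then keep those with no matching row in the sales fact table. Table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (date_key, product_key, store_key, promotion_key,
                         dollars_sold, units_sold);
-- Coverage table: one row per product on promotion, whether or not it sold.
CREATE TABLE promotion_coverage (date_key, product_key, store_key, promotion_key);

INSERT INTO promotion_coverage VALUES (1, 101, 7, 3), (1, 102, 7, 3), (1, 103, 7, 3);
INSERT INTO sales_fact VALUES (1, 101, 7, 3, 40.0, 4), (1, 103, 7, 3, 15.0, 1);
""")

# Promoted products with no matching sales row: coverage minus sales.
unsold_promoted = conn.execute("""
    SELECT c.product_key
    FROM promotion_coverage c
    LEFT JOIN sales_fact s
      ON  s.date_key = c.date_key AND s.product_key = c.product_key
      AND s.store_key = c.store_key AND s.promotion_key = c.promotion_key
    WHERE s.product_key IS NULL
    ORDER BY c.product_key
""").fetchall()
```

Here product 102 was on promotion but never appears in the sales fact table, so it is the only row returned.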

Snowflake Schema
Dimension tables are normalized by
decomposing at the attribute level
Each dimension has one key for each
level of the dimension's hierarchy
Good performance when queries
involve aggregation
Complicated maintenance and
metadata, explosion in the number of tables
Makes user representation more
complex and intricate
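
To make the normalization concrete, here is a sketch using Python's built-in sqlite3 module in which a product dimension is split into one table per hierarchy level. The product, brand and category tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is normalized into one table per hierarchy level.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_brand (brand_key INTEGER PRIMARY KEY, brand TEXT,
                        category_key INTEGER REFERENCES dim_category(category_key));
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product TEXT,
                          brand_key INTEGER REFERENCES dim_brand(brand_key));
CREATE TABLE fact_sales (product_key INTEGER REFERENCES dim_product(product_key),
                         units INTEGER);
INSERT INTO dim_category VALUES (1, 'Snacks'), (2, 'Drinks');
INSERT INTO dim_brand VALUES (1, 'Acme', 1), (2, 'Zenith', 2);
INSERT INTO dim_product VALUES (1, 'Chips', 1), (2, 'Cola', 2), (3, 'Nuts', 1);
INSERT INTO fact_sales VALUES (1, 10), (2, 4), (3, 6);
""")

# Aggregating by category walks the whole chain product -> brand -> category.
units_by_category = conn.execute("""
    SELECT c.category, SUM(f.units)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_key  = f.product_key
    JOIN dim_brand    b ON b.brand_key    = p.brand_key
    JOIN dim_category c ON c.category_key = b.category_key
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()
```

The extra tables and joins are the maintenance and complexity cost listed above; in a star schema, brand and category would simply be columns of a single product dimension table.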

Snowflake schema Example
[Figure: snowflake schema with a fact table and four dimension tables.]

UNIT III to V
TOPIC
Data Mining

Data Mining - Definition
Data mining is the automated detection of
new, valuable and non-trivial information
in large volumes of data.
It predicts future trends and finds behavior
that the experts may miss because it lies
outside their expectations
Data mining lets you be proactive
Prospective rather than Retrospective

Data mining leads to simplification and
automation of the overall statistical
process of deriving information from huge
volumes of data.

Examples of Data Modeling Tools
ERWIN
Supports Data Warehouse design as a
modeling technique

Powersoft WarehouseArchitect
Module of Power Designer specifically for DW
Modeling

Oracle Designer
Can be extended for Warehouse modeling

Others like Infomodeler, Silverrun are also used

Data Mining Introduction


DM - what it can do
Exploit patterns & relationships in data to
produce models
Two uses for models:
Predictive
Descriptive

DM - what it can't do
Automatically find relationships
without user intervention
when no relationships exist

Data Mining Introduction


Data Mining and Data Warehousing
Data preparation for DM may be part of the
Data Warehousing
Data Warehouse not a requirement for Data
Mining

DM and OLAP
OLAP = Classic descriptive model
Requires significant user input
Example : Beer and diaper sales
An OLAP tool shows reports giving sales of different
items
A data mining tool analyses the data and predicts
how often beer and diapers are sold together
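
A minimal sketch of the beer-and-diapers idea in Python: count how often pairs of items appear in the same basket, then derive the usual support and confidence figures. The baskets are invented for illustration, and this brute-force pair count is a simplification, not the algorithm of any particular data mining tool.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    # Sort so each pair has one canonical ordering.
    pair_counts.update(combinations(sorted(basket), 2))

together = pair_counts[("beer", "diapers")]
support = together / len(baskets)            # share of all baskets with both items
confidence = together / item_counts["beer"]  # share of beer baskets that also have diapers
```

An OLAP report would show the per-item totals in item_counts; the pair counts are the extra relationship that the mining step surfaces.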

Data Mining
Proactive
Automatically searches
Anomalies
Possible Relationships
Identify Problems before the end-user
Data Mining tools analyze the data, uncover
problems or opportunities hidden in data
relationships, form computer models based
on their findings, and then use the models
to predict business behavior with minimal
end-user intervention

Data Mining
A methodology designed to perform
knowledge-discovery expeditions
over the database data with minimal
end-user intervention
3 Stages of Data
Data
Information
Knowledge

Extraction of Knowledge from Data

4 Phases of Data Mining


Data Preparation
Identify the main data sets to be used by
the data mining operation (usually the data
warehouse)

Data Analysis and Classification


Study the data to identify common data
characteristics or patterns
Data groupings, classifications, clusters,
sequences
Data dependencies, links, or relationships
Data patterns, trends, deviation

4 Phases of Data Mining


Knowledge Acquisition
Uses the results of the Data Analysis and Classification phase
Data mining tool selects the appropriate modeling or
knowledge-acquisition algorithms

Neural Networks
Decision Trees
Rules Induction
Genetic algorithms
Memory-Based Reasoning

Prognosis

Predict Future Behavior


Forecast Business Outcomes
65% of customers who did not use a particular credit card in
the last 6 months are 88% likely to cancel the account.

Data Mining
Still a New Technique
May find many meaningless
Relationships
Good at finding Practical Relationships
Define Customer Buying Patterns
Improve Product Development and Acceptance
Etc.

Has the potential to become the next frontier in
database development

Why Data Mining


Credit ratings/targeted marketing:
Given a database of 100,000 names, which persons are
the least likely to default on their credit cards?
Identify likely responders to sales promotions

Fraud detection
Which types of transactions are likely to be fraudulent,
given the demographics and transactional history of a
particular customer?

Customer relationship management:
Which of my customers are likely to be the most loyal,
and which are most likely to leave for a competitor?

Data Mining helps extract such information

Data mining
Process of semi-automatically analyzing
large databases to find patterns that are:
valid: hold on new data with some certainty
novel: non-obvious to the system
useful: should be possible to act on the item
understandable: humans should be able to
interpret the pattern

Also known as Knowledge Discovery in Databases (KDD)

Applications
Banking: loan/credit card approval

predict good customers based on old customers

Customer relationship management:

identify those who are likely to leave for a competitor.

Targeted marketing:

identify likely responders to promotions

Fraud detection: telecommunications, financial transactions

from an online stream of events, identify fraudulent events

Manufacturing and production:

automatically adjust knobs when process parameters change

Applications (continued)
Medicine: disease outcome, effectiveness of
treatments
analyze patient disease history: find relationship
between diseases

Molecular/Pharmaceutical: identify new drugs


Scientific data analysis:
identify new galaxies by searching for sub clusters

Web site/store design and promotion:


find affinity of visitor to pages and modify layout

Knowledge Discovery
Definition
Knowledge Discovery in Data is the
non-trivial process of identifying

valid
novel
potentially useful
and ultimately understandable
patterns in data.

from Advances in Knowledge Discovery and Data
Mining, Fayyad, Piatetsky-Shapiro, Smyth, and
Uthurusamy, (Chapter 1), AAAI/MIT Press 1996

Related Fields
[Figure: Data Mining and Knowledge Discovery at the intersection of Machine Learning, Visualization, Statistics and Databases.]

Statistics, Machine Learning and Data Mining
Statistics:
more theory-based
more focused on testing hypotheses

Machine learning:
more heuristic
focused on improving performance of a learning agent
also looks at real-time learning and robotics, areas not part
of data mining

Data Mining and Knowledge Discovery:
integrates theory and heuristics
focus on the entire process of knowledge discovery,
including data cleaning, learning, and integration and
visualization of results

Distinctions are fuzzy


Historical Note:
Many Names of Data Mining
Data Fishing, Data Dredging (1960-): used by statisticians (as a bad name)
Data Mining (1990-): used in DB community, business
Knowledge Discovery in Databases (1989-): used by AI, Machine Learning community
also Data Archaeology, Information Harvesting,
Information Discovery, Knowledge Extraction, ...
Currently: Data Mining and Knowledge Discovery are used interchangeably

Some Definitions
Instance (also Item or Record):
an example, described by a number of
attributes,
e.g. a day can be described by temperature,
humidity and cloud status

Attribute or Field
measuring aspects of the Instance, e.g.
temperature

Class (Label)
grouping of instances, e.g. days good for
playing
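
These three terms map directly onto a small data structure. The sketch below is illustrative only: the Day class, its units and its sample values are assumptions that follow the slide's weather example.

```python
from dataclasses import dataclass, fields

@dataclass
class Day:
    # Attributes (fields) measuring aspects of the instance.
    temperature: float   # degrees Celsius (assumed unit)
    humidity: float      # percent
    cloud_status: str    # e.g. "sunny", "overcast", "rainy"
    # Class label: the grouping the instance belongs to.
    good_for_playing: bool

# Each Day object is one instance (item, record).
days = [
    Day(29.0, 85.0, "sunny", False),
    Day(21.0, 70.0, "overcast", True),
    Day(18.0, 96.0, "rainy", False),
]

attribute_names = [f.name for f in fields(Day) if f.name != "good_for_playing"]
labels = [day.good_for_playing for day in days]
```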

Major Data Mining Tasks
Classification: predicting an item class (e.g. with a decision tree)
Clustering: finding clusters in data
Associations: e.g. A & B & C occur
frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Deviation Detection: finding changes
Estimation: predicting a continuous value
Link Analysis: finding relationships

Data Growth
In 2 years (2003 to 2005),
the size of the largest database TRIPLED!

Data Growth Rate


Twice as much information was created
in 2002 as in 1999 (~30% growth rate)
Other growth rate estimates even
higher
Very little data will ever be looked at by
a human
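
As a quick check of the doubling figure, assume (this is an assumption, not stated on the slide) that information created grows at a constant compound annual rate r over the three years from 1999 to 2002:

```latex
(1 + r)^{3} = 2
\quad\Longrightarrow\quad
r = 2^{1/3} - 1 \approx 0.26
```

That is roughly 26% per year, which is consistent with the slide's rounded figure of about 30%.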

Knowledge Discovery is NEEDED to make sense and use of data.
Knowledge Discovery Process flow, according to CRISP-DM
see www.crisp-dm.org for more information
Monitoring
Continuous monitoring and improvement is
an addition to CRISP
