Vous êtes sur la page 1sur 49

Intro.

to Data Warehouse
รศ.ดร. วรพจน์ กรีสุระเดช
Worapoj Kreesuradej, Ph.D.
Associate Professor

Data Mining & Data Exploration Laboratory (DME Lab),


Faculty of Information Technology,
King Mongkut's Institute of Technology Ladkrabang,
Web: www.it.kmitl.ac.th/dme
Email: worapoj@it.kmitl.ac.th
Book

 Paulraj Ponniah, Data Warehousing


Fundamentals, John Wiley & Sons, 2001.

 Ralph Kimbal and Margy Ross, The Data


Warehouse Toolkit, John Wiley and Sons,
2002.
Definition of DW
 “A collection of integrated, subject-
oriented databases designed to supply
the information required for decision-
making.” - W. Inmon
 A decision support database that is
maintained separately from the
organization’s operational databases.
 A physical repository where relational
data are specially organized to provide
enterprise-wide, cleansed data in a
standardized format –E. Turban and etc.
R. Kimball’s definition of a DW
 A data warehouse is a copy of
transactional data specifically
structured for querying and analysis.
Problem: Data Management
in Large Enterprises
 Vertical fragmentation of informational
systems
 Result of application (user)-driven
development of operational systems
Sales Planning Suppliers Num. Control
Stock Mngmt Debt Mngmt Inventory
... ... ...

Sales Administration Finance Manufacturing ...


Problem: Data Management
in Large Enterprises
 Two Approaches for accessing
data:
 Query-Driven (Lazy) ?
 Warehouse (Eager)
Source Source
The Need for DW
 Query-driven (lazy, on-demand)
Clients

Integration System Metadata

...
Wrapper Wrapper Wrapper

...
Source Source Source
Disadvantages of Query-
Driven Approach
 Delay in query processing
 Inefficient and potentially expensive
for frequent queries
 Competes with local processing at
sources
The Warehousing Approach
 Information Clients
integrated in
advance Data
Warehouse
 Stored in wh
for direct
Integration System Metadata
querying and
analysis ...
Extractor/ Extractor/ Extractor/
Monitor Monitor Monitor

...
Source Source Source
Advantages of Warehousing
Approach
 High query performance
 Doesn’t interfere with local processing
at sources
 Information copied at warehouse
 Can modify, annotate, summarize,
restructure, etc.
 Can store historical information
 Security, no auditing
Characteristics of DW
Subject oriented Data are organized by how
users refer to it
Integrated Inconsistencies are removed
in both nomenclature and
conflicting information; (i.e.
data are ‘clean’)
Non-volatile Read-only data. Data do not
change over time.
Time variant Data are time series, not
current status
Subject Oriented
 Data Warehouse is designed around
“subjects” rather than processes
 A company may have
 Retail Sales System
 Outlet Sales System
 Catalog Sales System
 DW will have a Sales Subject Area
Subject Oriented
OLTP Systems

Retail Sales Outlet Sales Catalog Sales


System System System

Data Warehouse
Sales Subject Area

Subject-Oriented Sales Information


Integrated
 Heterogeneous Source Systems
 Need to Integrate source data
 For Example: Product codes could
be different in different systems
 Arrive at common code in DW
Integrated
 Information
Clients
integrated in
advance
Data
 Stored in DW Warehouse

for direct
querying and Integration System Metadata

analysis
...
Extractor/ Extractor/ Extractor/
Monitor Monitor Monitor

...
Source Source Source
Non-Volatile
 Operational update of data does not occur
in the data warehouse environment.
 Does not require transaction
processing, recovery, and concurrency
control mechanisms
 Requires only two operations in data
accessing:
 initial loading of data and access of
data.
Non-Volatile(Read-Mostly)
Write
USER
OLTP
Read

USER DW
Read
Time Variant
 The time horizon for the data warehouse is
significantly longer than that of operational
systems.
 Operational database: current value data.
 Data warehouse data: provide information
from a historical perspective (e.g., past 5-
10 years)
Time Variant

 Most business Sales

analysis has a
time component

2001 2002 2003 2004


 Trend Analysis
(historical data is
required)
Data Warehousing
Process Overview
Data Warehousing
Process Overview
 The major components of a data
warehousing process
 Data sources
 Data extraction
 Data loading
 Comprehensive Database /Data Store
 Data Mart
 Metadata
 Middleware tools /information delivery
tools
ETL
• Data Extraction
• Data Cleaning and Transformation
Convert from legacy/host format to
warehouse format
• Load
Sort, summarize, consolidate,
compute views, check integrity,
build indexes, partition
The ETL Process

Source Staging DW
Systems Area Database

Extract Transform Load


Data Staging Area
• A storage area where extracted data is
cleaned, transformed and deduplicated.
• Initial storage for data
• Need not be based on Relational model
• Mainly sorting and Sequential
processing
• Does not provide data access to users
• Analogy – kitchen of a restaurant
ETL Process
Issues & Challenges
• Consumes 70-80% of project time
• Heterogeneous Source Systems
• Little or no control over source systems
• Source systems scattered
• Different currencies, measurement units
• Ensuring data quality
Comprehensive Database
/Data Store
 Mostly a relational DB
 Oracle, DB2, Sybase, SQL Server

 New DB design for special purpose of


DW (e.g., scale up, speed up, parallel
processing)
Data Warehouse Design
 OLTP Systems are Data Capture Systems
 “DATA IN” systems
 DW are “DATA OUT” systems

OLTP DW
Dimensional Modeling
 Facts are stored in FACT Tables
 Dimensions are stored in
DIMENSION tables
 Dimension tables contains textual
descriptors of business
 Fact and dimension tables form a
Star Schema
 “BIG” fact table in center surrounded
by “SMALL” dimension tables
Star Schema
CUSTOMER
# CUSTOMER _KEY
* C ID
TIME * C NAME
# TIME_KEY referenced by * STATE
* ORD ERD ATE * C ITY
* D AY_ OF_ WEEK referenced by
* D AY_ NU MBER_IN_MONTH SALES reference
* D AY_ NU MBER_IN_YEAR # TIME_KEY
* WEEK_N UMBER # PRODUC T_KEY
* MON TH # CUSTOMER _KEY
* QUARTER reference
* PRIC E
* H OL IDAY_FL AG * QUANTITY
* FISC AL_ YEAR * SALES
* FISC AL_ QUARTER

reference

referenced by

PRODUCT
# PRODUC T_KEY
* PID
* PNAME
* PCN AME
Star Schema
Data mart
 Data mart = subset of DW for community
users, e.g. accounting department
 Sometimes exist as Multidimensional
Database
 Info mart = summarized data + report for
community users
Meta Data
 Data about data
 Needed by both information technology
personnel and users
 IT personnel need to know data sources and
targets; database, table and column names;
refresh schedules; data usage measures; etc.
 Users need to know entity/attribute
definitions; reports/query tools available;
report distribution information; help desk
contact information, etc.
Information Delivery Tools
 Tools
 Query & reporting
 OLAP
 Data mining, visualization, segmentation,
clustering
 New developments: text mining, web mining
& personalization
 Mining multimedia data
Information Delivery Tools

 Commercial tools
 Crystal Report, Impromptu, WebFocus

 Increasingly common mode of delivery:


Web-enabled
Data Warehouse Architecture

 Data Flow Architecture


 System Architecture
Data Flow Architecture
Data Flow Architecture
Data Flow Architecture
 Operational data stores (ODS)
A type of database often used as an
interim area for a data warehouse,
especially for customer information files

 MDB=Multidimensional databases
System Architectures

 Three parts of the data warehouse


 The data warehouse that contains the data
and associated software
 Data acquisition (back-end) software that
extracts data from legacy systems and
external sources, consolidates and
summarizes them, and loads them into the
data warehouse
 Client (front-end) software that allows
users to access and analyze data from the
warehouse
System Architectures
System Architectures
System Architecture
System Architecture
Data Warehouse Development
 Data warehouse development
approaches
 Inmon Model: EDW approach, Enterprise-
wide warehouse, top down
 Kimball Model: Data mart approach, Data
mart, bottom up
 Which model is best?
 There is no one-size-fits-all strategy to data
warehousing
 When properly executed, both result in an
enterprise-wide data warehouse, but with
different architectures
The Data Mart Strategy
 The most common approach
 Begins with a single mart and architected
marts are added over time for more subject
areas
 Relatively inexpensive and easy to implement
 Can be used as a proof of concept for data
warehousing
 Can perpetuate the “silos of information”
problem
 Can postpone difficult decisions and
activities
 Requires an overall integration plan
The Enterprise-wide
Strategy
 A comprehensive warehouse is built initially
 An initial dependent data mart is built using a
subset of the data in the warehouse
 Additional data marts are built using subsets
of the data in the warehouse
 Like all complex projects, it is expensive, time
consuming, and prone to failure
 When successful, it results in an integrated,
scalable warehouse
DW Lifecycle (Ralph Kimball )
Data Warehouse Development
 Some best practices for implementing a
data warehouse (Weir, 2002):
 Project must fit with corporate strategy and
business objectives
 There must be complete buy-in to the
project by executives, managers, and users
 It is important to manage user expectations
about the completed project
 The data warehouse must be built
incrementally
 Build in adaptability
Data Warehouse Development
 Some best practices for implementing a
data warehouse (Weir, 2002):
 The project must be managed by both IT
and business professionals
 Develop a business/supplier relationship
 Only load data that have been cleansed and
are of a quality understood by the
organization
 Do not overlook training requirements
 Be politically aware

Vous aimerez peut-être aussi