Vous êtes sur la page 1sur 33

Data Warehousing and

Mining
Lecture 1
Lecture 1
Introduction and general information about the
subject
Introduction to Data Warehouse Concepts and
Designs

Reading
Data Warehousing Fundamentals, Paulraj Ponniah,
John Wiley & Sons, 2001.
The Data Warehouse Lifecycle Toolkit: Expert Methods
for Designing, Developing and Deploying Data
Warehouses, Ralph Kimball, John Wiley & Sons, 1998.
Handouts
Assessment

Activity Weightage
Mid-Term 20%
Final Exam 30%
Case Studies/CP 25%
Final Project 15%
Quizzes 10%
Total 100%
Requirements for Passing
Minimum 80% attendance
50% or more in the final components
which is
50% or more for the assignment
50% or more for the exams (Mid and
final)
Topics to be covered
Introduction and uses of a data
warehouse
Data Warehouse Concepts and Design
Extraction, Transformation and Loading
(ETL)
On-Line Analytic Processing (OLAP) and
On-Line Transactional Processing
(OLTP)
Introduction to DW Concepts
and Design
What is a Data Warehouse?
The purpose of a DW
An Operational Database V a Data
Warehouse
Schema Types for Data Warehouses
What is a Data Warehouse?
A Decision Support Database that is
maintained separately from the
organizations operational database
OR
Support Information Processing by
providing a solid platform for historical
consolidated data
What is a Data Warehouse?
A Data Warehouse is
Subject Oriented
Integrated
Time-variant
Non-Volatile

A copy of transaction data specifically


structured for querying, analysis and
reporting
Subject-Oriented
Operational databases are function-
oriented while a data warehouses design is
subject-oriented

Function oriented focuses on functions of


the business i.e. A/P, Ledger etc.
Subject-oriented focuses on departments
or subjects in a particular business e.g.
sales, marketing, Human Resource
Subject-Oriented
A DW works around major components of the

business which include, marketing, HR,


Customers, Sales etc.
DW works towards helping decision makers

make the right decisions. It is not modeled on


daily operations or transactions
Makes the data brief and easy to understand

by not displaying unwanted data for anyone


requiring decision support
Subject-oriented
For example:

Instead of storing an invoice, the data


warehouse stores its customer and product
components, because DSS activities require
sales summaries by product or customer5

5
Coronel, C and Rob, P., Database Systems Design, Implementation &
Management, Thompson Learning, 2000.
Integrated
The sources of a data warehouse can be more than one
and of different types (relational database,
transactions etc)
However, the data must be standardized before it is
input inside the DW
For example, consider the following two data sources:
SOURCE A
ID Name Gende Heigh Data cleaning and data
r t integration techniques are
123 John Male 1.82 applied.
4 Smith SOURCE B
ID FNam LNam Gende Height
e e r
C00 David Jones M 178
1
Time Variant
DW provides a time-based perspective of
historical data;
At least one unit of time is mandatory;
Each part of the DW should have been accurate
at one point in time i.e. changes stored as new
entries;
Time-dependent aggregations (i.e. expenses)
are re-done after new uploads
Uploaded expenses for Jan= 100$
Uploaded expenses for Feb= 100$
The total will be re-calculated = 200$
Non-Volatile
Data is stored separately from the
Operational Environment
If the data has been entered into the
system, changing or modifying it is
strictly controlled. The purpose is to
store time-variant data for a long time.
Updating data in ODS is different from
Updating Data in a DW.
If any updating is required, the data is
uploaded with a new timestamp.
Historical data can be kept this way.
++Non-Volatile
Unlike the ODS, a DW does not allow you

to change or update previously entered


data
As it deals only with accessing data, only
two processes occur,
Initial loading
Access
Introduction to DW Concepts
and Design
What is a Data Warehouse?
The purpose of a DW
An Operational Database V a Data
Warehouse
Schema Types for Data Warehouses
Goals of a Data Warehouse
Easy access
Supports forecasting and Decision-
making
Adaptive and resilient to change
Privacy of Information
Data should be correct to make the right
decisions
Introduction to DW Concepts
and Design
What is a Data Warehouse?
The purpose of a DW
An Operational Database vs a Data
Warehouse
Schema Types for Data Warehouses
Operational Database Systems
(ODS)
The ER-Diagrams you have made are used in
an ODS. They are relational databases.
Operational Database Systems (ODS )
Transactions and queries
also called OLTP (Online Transaction
Processing)
day to day operations in various areas
(e.g. purchasing, inventory, manufacturing,
banking, accounting etc.)
Data Warehouse
Data Warehouse (DW)
Provides support for decision makers
analysing data.
Also called OLAP (Online Analytical
Processing)
Why a separate DW?
High performance for both systems

DBMS tuned for OLTP: access methods, indexing,


concurrency control, recovery
Warehousetuned for OLAP: complex OLAP queries,
multidimensional view, consolidation
Different functions and different data:

missing data: Decision support requires historical data


which operational DBs do not typically maintain
data consolidation: DW requires consolidation (aggregation,
summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent
data representations, codes and formats which have to be
reconciled
Data Warehouse vs. Operational
DBMS
Traditional Operational DB
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
Data warehouse:
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making

22
Differences between OLTP and OLAP

Users and system orientation


OLTP
customer oriented
used for transaction & query processing by clerks,
clients and IT professionals
OLAP
market oriented
used by knowledge workers (managers, executives,
analysts)
Data contents
OLTP
current data (too detailed for decision making)
OLAP
large amounts of historical data
facilities for summarisation & aggregation
23
Differences between OLTP and OLAP

Database design
OLTP
Entity relationship (ER) data model
Application oriented design
OLAP
Star or snowflake model (multidimensional -discuss later)
Subject oriented database design
View
OLTP
Current data within enterprise or department
Does not refer to historical data or from other organisations
OLAP
Multiple versions of database schema
Information from other organisations
24
Differences between OLTP and OLAP

Access patterns
OLTP
Short,atomic transactions
Hence requires concurrency control and
recovery mechanisms
OLAP
Mostlyread-only operations
Complex queries over large intervals of time

25
Feature OLTP OLAP
characteristic operational processing informational processing
orientation transaction analysis
clerk, DBA, database
user professional knowledge worker
function day to day operation decision support
star/snowflake, subject-
DB design ER based, application oriented oriented
historical, maintained over
data current, guaranteed up to date time
summarisation high detail summarised, consolidated
view detailed, flat relational multidimensional
unit of work short, simple transaction complex query
access read/write mostly read
focus data in information out
number of users thousands hundreds
DB size 100MB to GB 100GB to TB
high performance, high
26
priority availability high flexibility
+++Data Warehouses DSS/BI
Why is understanding data warehouse
concepts and design important for the
business manager/analyst?
To know what business problems are solvable
using data warehouse technology
To have an understanding of the workload
involved in designing and implementing a
data warehouse solution
To converse with database designers who
also need to be aware of the business needs
of their users

27
Data Warehouse as a Support for DSS
With a correctly implemented data warehouse
the DSS analyst should not have to worry
about:
Searching for the definitive source of data
Creating special extract programs from existing
systems
Dealing with un-integrated data
Compiling and linking detailed and summary data and
the linkage between the two
Finding an appropriate time basis of data (finding
historical data)
Management constantly changing its mind about what
needs to be looked at next
In addition,
9 the DSS
Source: Inmon, W.H., analyst has
Building the DataaWarehouse,
rich supply of2005
Wiley,
28 summary data available.9
Introduction to Data Warehouse
Concepts and Design

What is a Data Warehouse?

Goals of a Data Warehouse


Differences between Operational Database
Systems and Data warehouses

Schema types for Data Warehouses

29
Schema types for Data
Warehouses
Star schema
One fact document, multiple not normalized
dimensions

Snowflake
One fact document, multiple normalized
dimensions

Fact constellation
Multiple fact documents, multiple dimensions

30
STAR SCHEMA DATA
WAREHOUSE
Time dimension
Item dimension
Time_key
Day_of_the_week Item_key
Month Sales Fact Table Item_name
Quarter Brand
Semester Time_key Supplier
Year Item_key
Branch_key
Location_key
Branch dimension Dollars_sold Location dimension
Units_sold
Branch_key Location_key
Branch_name Street
Branch_type City
Branch_address State
Country

31
SNOWFLAKE DATA WAREHOUSE
Time dimension
Item dimension Supplier
Time_key dimension
Day_of_the_week Item_key
Month Item_name Supplier_key
Quarter Sales Fact Table Brand Supplier_name
Semester Supplier_key Supplier_address
Year Time_key
Item_key
Branch_key
Location_key City
Dollars_sold dimension
Branch dimension Units_sold City_key
Location City_name
dimension State
Branch_key
country
Branch_name Location_key
Branch_type Street
Branch_address City_key ***
dimension
***
dimension .

.
32
FACT CONSTELLATION DATA WAREHOUSE
Supplier
Item dimension
dimension
Sales Fact Table
Item_key
Time dimension Supplier_key
Time_key Item_name
Supplier_name
Item_key Brand
Time_key Supplier_address
Branch_key Supplier_key
Day_of_the_week
Month Location_key
Quarter Dollars_sold
Semester Units_sold Location
Year dimension

Location_key
Street
Shipping fact
City
table
State
Branch dimension Country
Item_key
Time_key
Branch_key Shipper_key Shipper dimension
Branch_name From_location
Branch_type To_location Shipper_key
Branch_address Dollars_cost Shipper_name
Units_shipped Location_key
Shipper_type
Shipper_address
33

Vous aimerez peut-être aussi