Warehousing
Topics Covered
High Volume
Changes with time
Only current Data available
Answers simple queries
Little help to decision maker
OLTP Shortcomings:
Focus on transaction
Large amount of data but
Related to transaction
Does not maintain historical data
Does not maintain summarized data
DWH Emergence:
Repository of information
Improved access to integrated data
Provides historical perspective
Variety of end-users use it for different purposes
Requires a major system integration effort
Reduces the reporting and analysis impact on operational systems
Define requirements
Analysis and Planning based on requirements
Model (E-R Model)
Physical Design
Development (Coding)
Quality Assurance and User Acceptance
Implementation
Subject Definition
Data Identification or Data Discovery
Data Acquisition
Data Cleansing
Data Transformation
Data Loading
Exploitation
Logical Concept
Build logical data model
Develop transformational model
Translate logical model to physical model
Data from RDBMS/DBMS/Flat files
Data Cleansing
Removal of inconsistent data
Removal of Unwanted Data
Removing Extreme Cases (data-mining)
Data Transformation
Convert to consistent Business oriented format.
Generate derived information not stored in OLTP systems.
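The cleansing and transformation steps listed above can be sketched roughly as follows. This is a minimal illustration, not a prescribed method: the row layout, the country mapping, and the "20 times the median" cut-off for extreme cases are all assumptions made for the example.

```python
from statistics import median

# Hypothetical raw rows pulled from an OLTP source; all names are illustrative.
raw_rows = [
    {"order_id": 1, "country": "us", "qty": 2, "unit_price": 10.0},
    {"order_id": 2, "country": "US", "qty": 1, "unit_price": 25.0},
    {"order_id": 3, "country": "U.S.", "qty": -4, "unit_price": 10.0},  # inconsistent
    {"order_id": 4, "country": "de", "qty": 3, "unit_price": 12.0},
    {"order_id": 5, "country": "DE", "qty": 500, "unit_price": 12.0},   # extreme case
    {"order_id": None, "country": "de", "qty": 1, "unit_price": 12.0},  # unwanted
]

# Assumed mapping to one consistent, business-oriented country code.
COUNTRY_CODES = {"us": "US", "u.s.": "US", "de": "DE"}


def cleanse(rows, max_ratio=20):
    """Drop unwanted, inconsistent, and extreme rows (thresholds are assumptions)."""
    rows = [r for r in rows if r["order_id"] is not None]  # unwanted: no key
    rows = [r for r in rows if r["qty"] > 0]               # inconsistent: bad quantity
    typical = median(r["qty"] for r in rows)
    return [r for r in rows if r["qty"] <= max_ratio * typical]  # extreme cases


def transform(rows):
    """Convert to a consistent format and derive a value the source does not store."""
    return [
        {
            "order_id": r["order_id"],
            "country": COUNTRY_CODES[r["country"].lower()],
            "sales_amount": r["qty"] * r["unit_price"],  # derived information
        }
        for r in rows
    ]


clean_rows = transform(cleanse(raw_rows))
```

Only orders 1, 2 and 4 survive cleansing here, and each carries a derived sales_amount that the source rows did not hold.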
Exploitation
Enables the users to view, analyze and report on data
Simple query and reporting
Multidimensional analysis
OLAP using Slice and Dice, Drilling
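As a rough illustration of slice, dice, and drilling, the sketch below treats a cube as a plain list of cells. The dimensions (year, quarter, region, product), the sample figures, and the function names are assumptions for the example; real OLAP tools expose these operations through their own interfaces.

```python
from collections import defaultdict

# Hypothetical cube cells: one record per (year, quarter, region, product).
cells = [
    {"year": 2003, "quarter": "Q1", "region": "East", "product": "Pens", "sales": 100},
    {"year": 2003, "quarter": "Q1", "region": "West", "product": "Pens", "sales": 80},
    {"year": 2003, "quarter": "Q2", "region": "East", "product": "Pens", "sales": 120},
    {"year": 2003, "quarter": "Q2", "region": "East", "product": "Ink", "sales": 40},
    {"year": 2004, "quarter": "Q1", "region": "West", "product": "Ink", "sales": 60},
]


def slice_cube(rows, dim, value):
    """Slice: fix a single dimension to one value."""
    return [r for r in rows if r[dim] == value]


def dice_cube(rows, **allowed):
    """Dice: keep cells whose value is in the allowed set for every named dimension."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]


def aggregate(rows, *dims):
    """Total sales grouped by the given dimensions (fewer dimensions = rolled up)."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r["sales"]
    return dict(totals)


east = slice_cube(cells, "region", "East")
east_pens_2003 = dice_cube(cells, region={"East"}, product={"Pens"}, year={2003})
by_year = aggregate(cells, "year")                # summary level
by_quarter = aggregate(cells, "year", "quarter")  # drill down one level
```

Drilling down is simply aggregating on one more level of the time hierarchy; rolling up is the reverse.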
DWH - Architecture
Major Components
Data identification
Cleanup
Extraction, Transformation and loading tools
Metadata repository
Data Marts
Data query, reporting, analysis and mining tools
Data Warehouse administration and management
Data Warehouse
Advantages of DWH:
There are many advantages to using a data
warehouse, some of them are:
Data warehouses enhance end-user access
to a wide variety of data.
Decision support system users can obtain
specified trend reports, e.g. the item with
the most sales in a particular area within
the last two years.
Data warehouses can be a significant
enabler of commercial business
applications, particularly customer
relationship management (CRM) systems.
Business Intelligence:
Business intelligence (BI) is a broad category of applications and
technologies for gathering, storing, analyzing, and providing access to
data to help enterprise users make better business decisions. BI
applications include the activities of decision support systems, query
and reporting, online analytical processing (OLAP), statistical analysis,
forecasting, and data mining. Business intelligence applications can be:
DWH - Architecture
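[Figure: DWH architecture. Source file systems (FS1 ... FSn) and legacy systems are extracted and transmitted over the network to a staging area, where cleansing and transformation take place. Data mart population, aggregation and summarization then feed the ODS, the data warehouse (DW) and the data marts (DM1, DM2 ... DMn), which support OLAP analysis and knowledge discovery. A metadata layer spans the architecture.]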
Dimension Modeling
Business model translates into a specific design called
DIMENSIONAL MODEL (also called STAR MODEL).
The outcome of the DIMENSIONAL MODEL is the STAR SCHEMA
or SNOWFLAKE SCHEMA
Star Schema
Attributes
A single fact table, with detail and summary data
Fact table primary key has only one key column per dimension
Each dimension is a single table, highly de-normalized
Benefits
Easy to understand, easy to define hierarchies, reduces number
of physical joins, low maintenance, very simple metadata.
Drawbacks
Summary data in the fact table yields poorer performance for
summary levels; huge dimension tables are a problem.
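A star schema along these lines can be sketched with Python's built-in sqlite3 module. The table and column names (dim_time, dim_product, dim_customer, fact_sales) and the sample rows are assumptions for illustration only; the point is that the fact table's primary key has one key column per dimension and each dimension is a single de-normalized table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, state TEXT);

-- Fact table: one key column per dimension, plus the measure.
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    amount       REAL,
    PRIMARY KEY (time_key, product_key, customer_key)
);

INSERT INTO dim_time     VALUES (1, 2003, 1), (2, 2003, 2);
INSERT INTO dim_product  VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO dim_customer VALUES (1, 'Christina', 'Illinois');
INSERT INTO fact_sales   VALUES (1, 1, 1, 100.0), (1, 2, 1, 20.0), (2, 1, 1, 50.0);
""")

# One join per dimension actually used by the query.
rows = conn.execute("""
SELECT p.category, t.month, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_time t    ON t.time_key = f.time_key
GROUP BY p.category, t.month
ORDER BY p.category, t.month
""").fetchall()
```

Because each dimension is one table, a report by category and month needs only two joins from the fact table.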
A group of facts connected to multiple dimensions.
[Figure: star schema with the Financial Transactions fact table at the center, connected to the Channel, Time, Customer, Organization and Product dimensions.]
Snowflake Schema
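Snowflake schema (= extended star schema)
A group of facts connected to dimensions, which are split across
multiple hierarchies and attributes.
[Figure: snowflake schema with the Financial Transactions fact table connected to the Time, Product, Channel, Organization and Customer dimensions, with Segment and Geography split out as further tables.]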
Design Principle
Designing a Fact Table.
The first step in designing a fact table is to determine the granularity of
the fact table. By granularity, we mean the lowest level of information that
will be stored in the fact table. This constitutes two steps:
1. Determine which dimensions will be included.
2. Determine where along the hierarchy of each dimension the information
will be kept.
Which Dimensions To Include
Determining which dimensions to include is usually a straightforward
process, because the business processes will often clearly dictate which
dimensions are relevant. The determining factors usually go back to the
requirements.
For example, in an off-line retail world, the dimensions for a sales fact
table are usually time, geography, and product. This list, however, is by
no means a complete list for all off-line retailers.
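To make the idea of granularity concrete, the sketch below rolls individual sales transactions up to an assumed grain of one row per day, store, and product. The grain, the field names, and the sample values are hypothetical; a different business requirement could just as well call for a weekly or per-transaction grain.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical source transactions, finer than the chosen fact table grain.
transactions = [
    {"ts": "2003-01-15 09:12", "store": "S1", "product": "P1", "amount": 5.0},
    {"ts": "2003-01-15 17:40", "store": "S1", "product": "P1", "amount": 7.5},
    {"ts": "2003-01-15 11:05", "store": "S2", "product": "P1", "amount": 3.0},
    {"ts": "2003-01-16 10:00", "store": "S1", "product": "P1", "amount": 4.0},
]


def roll_up(rows):
    """Aggregate to the assumed grain: one fact row per (day, store, product)."""
    fact = defaultdict(float)
    for r in rows:
        # The time dimension is kept at the day level of its hierarchy.
        day = datetime.strptime(r["ts"], "%Y-%m-%d %H:%M").date().isoformat()
        fact[(day, r["store"], r["product"])] += r["amount"]
    return dict(fact)


daily_fact = roll_up(transactions)
```

Once the grain is fixed at the day level, the time of day of each sale can no longer be recovered from the fact table, which is why the grain has to be decided first.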
Data Marts
What is a Data Mart?
It is a subset of Data Warehouse with a specific purpose in mind.
The key to a successful Data Warehouse lies in getting a data mart in
place as soon as possible, rather than implementing the entire Data
Warehouse initiative in one go.
Metadata
What is Metadata?
Data about data.
Metadata repository / document gives detailed description of the
source, structure, content and attributes of the data warehouse.
Metadata is created using data modeling tools, ETL tools (e.g. Oracle
Warehouse Builder, Informatica), or manually.
Surrogate Keys
A surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used
for the primary key to the table. The only requirement for a surrogate
primary key is that it is unique for each row in the table.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which
business users regard as the primary key. However, these values can
change, and indexing on a numerical value is probably better, so you
could consider creating a surrogate key called, say, AIRPORT_KEY. This
key would be internal to the system; as far as the client is concerned,
you may display only the AIRPORT_NAME.
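A minimal sketch of the AIRPORT_KEY idea follows. The in-memory dictionary standing in for the dimension table, the sequence generator, the airport names, and the flights fact rows are all assumptions for illustration.

```python
from itertools import count

_next_key = count(1)  # simple sequence generator for surrogate keys
airport_dim = {}      # AIRPORT_KEY -> row of descriptive attributes


def add_airport(airport_name, city_name):
    """Insert a dimension row and return its newly generated surrogate key."""
    key = next(_next_key)
    airport_dim[key] = {"AIRPORT_NAME": airport_name, "CITY_NAME": city_name}
    return key


# Hypothetical sample rows.
key_a = add_airport("Northfield Airport", "Northfield")
key_b = add_airport("Lakeside Airport", "Lakeside")

# Fact rows reference the surrogate key, never the name.
flights = [
    {"AIRPORT_KEY": key_a, "passengers": 120},
    {"AIRPORT_KEY": key_b, "passengers": 95},
    {"AIRPORT_KEY": key_a, "passengers": 130},
]

# The airport is renamed: only the dimension row changes, no fact row is touched.
airport_dim[key_a]["AIRPORT_NAME"] = "Northfield International Airport"


def passengers_by_name(fact_rows):
    """Report by the display name while joining on the internal key."""
    totals = {}
    for row in fact_rows:
        name = airport_dim[row["AIRPORT_KEY"]]["AIRPORT_NAME"]
        totals[name] = totals.get(name, 0) + row["passengers"]
    return totals


report = passengers_by_name(flights)
```

The report shows only AIRPORT_NAME to the client, while the rename leaves every fact row valid.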
Pros
Surrogate Keys never need changing
Save space
Improve query performance
Cons
Overhead in the key generation process
The user cannot understand the key, thus the table.
If new developers take over, they will also have to figure out the keys.
Types of Facts
There are three types of facts:
Additive: Additive facts are facts that can be summed up through all of
the dimensions in the fact table.
Semi-Additive: Semi-additive facts are facts that can be summed up
for some of the dimensions in the fact table, but not the others.
Non-Additive: Non-additive facts are facts that cannot be summed up
for any of the dimensions present in the fact table.
Example Additive:
Fact table (Retailer) with the following columns:
Date
Store
Product
Sales_Amount
The purpose of this table is to record the sales amount for each product
in each store on a daily basis. Sales_Amount is the fact. In this case,
Sales_Amount is an additive fact, because you can sum up this fact
along any of the three dimensions present in the fact table -- date,
store, and product. For example, the sum of Sales_Amount for all 7
days in a week represents the total sales amount for that week.
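The additive property can be checked with a small sketch: summing Sales_Amount along any one of the three dimensions yields the same grand total. The sample rows and the helper name are made up for the example.

```python
from collections import defaultdict

# (Date, Store, Product, Sales_Amount); sample values are made up.
sales_fact = [
    ("2003-01-06", "Store A", "Pens", 10.0),
    ("2003-01-06", "Store B", "Pens", 5.0),
    ("2003-01-07", "Store A", "Ink", 8.0),
    ("2003-01-07", "Store A", "Pens", 12.0),
]


def total_by(rows, position):
    """Sum Sales_Amount grouped by the dimension at the given tuple position."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[position]] += row[3]
    return dict(totals)


by_date = total_by(sales_fact, 0)
by_store = total_by(sales_fact, 1)
by_product = total_by(sales_fact, 2)
grand_total = sum(row[3] for row in sales_fact)
```

Every grouping is meaningful on its own, and each one adds back up to grand_total.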
Example Semi-Additive/Non-Additive:
Fact table (bank) with the following columns:
Date
Account
Current_Balance
Profit_Margin
The purpose of this table is to record the current balance for each
account at the end of each day, as well as the profit margin for each
account for each day. Current_Balance and Profit_Margin are the
facts. Current_Balance is a semi-additive fact, as it makes sense to add
them up for all accounts (what's the total current balance for all
accounts in the bank?), but it does not make sense to add them up
through time (adding up all current balances for a given account for
each day of the month does not give us any useful information).
Profit_Margin is a non-additive fact, for it does not make sense to add
it up at either the account level or the day level.
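The semi-additive behaviour of Current_Balance can be sketched as follows. The account identifiers and balances are invented, and taking the latest balance as the value over a period is one common choice assumed here (an average would be another).

```python
# (Date, Account, Current_Balance); sample values are made up.
balance_fact = [
    ("2003-01-01", "ACC-1", 100.0),
    ("2003-01-01", "ACC-2", 50.0),
    ("2003-01-02", "ACC-1", 120.0),
    ("2003-01-02", "ACC-2", 40.0),
]


def bank_total_on(rows, date):
    """Meaningful: add balances across all accounts for a single day."""
    return sum(balance for d, _, balance in rows if d == date)


def closing_balance(rows, account):
    """Across time, take the latest balance instead of summing (an assumed policy)."""
    history = sorted((d, balance) for d, acc, balance in rows if acc == account)
    return history[-1][1]


def naive_sum_over_time(rows, account):
    """Not meaningful: adding one account's daily balances together."""
    return sum(balance for _, acc, balance in rows if acc == account)


total_jan_2 = bank_total_on(balance_fact, "2003-01-02")
acc1_closing = closing_balance(balance_fact, "ACC-1")
acc1_naive = naive_sum_over_time(balance_fact, "ACC-1")
```

Here acc1_naive comes out larger than any balance the account ever held, which shows why the fact is not additive along the date dimension.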
Christina is a customer with ABC Inc. She first lived in Chicago, Illinois.
So, the original entry in the customer lookup table has the following record:
Cust_Key   Name        State
1001       Christina   Illinois
At a later date, in January 2003, she moved to Los Angeles, California.
How should ABC Inc. now modify its customer table to reflect this change?
This is the "Slowly Changing Dimension" problem.
Type 1
Cust_Key   Name        State
1001       Christina   Illinois
Type 2
Cust_Key   Name        State
1001       Christina   Illinois
1010       Christina   California
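The slides show only the tables for Type 1 and Type 2, so the sketch below relies on the usual reading of the two approaches, stated here as an assumption: Type 1 overwrites the existing row, while Type 2 keeps it and adds a new row under a new surrogate key. It also assumes the new row records California, the state Christina moved to, and reuses the key 1010 from the Type 2 table.

```python
import copy

original_dim = [{"Cust_Key": 1001, "Name": "Christina", "State": "Illinois"}]


def apply_type1(dim, cust_key, new_state):
    """Type 1 (assumed semantics): overwrite in place; the previous value is lost."""
    updated = copy.deepcopy(dim)
    for row in updated:
        if row["Cust_Key"] == cust_key:
            row["State"] = new_state
    return updated


def apply_type2(dim, old_key, new_key, new_state):
    """Type 2 (assumed semantics): keep the old row, append a row with a new key."""
    updated = copy.deepcopy(dim)
    old_row = next(row for row in updated if row["Cust_Key"] == old_key)
    updated.append({**old_row, "Cust_Key": new_key, "State": new_state})
    return updated


type1_dim = apply_type1(original_dim, 1001, "California")
type2_dim = apply_type2(original_dim, 1001, 1010, "California")
```

After Type 1 the table still has one row and no trace of Illinois; after Type 2 it has two rows for Christina, so history is preserved at the cost of table growth.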
Type 3
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular
attribute of interest, one indicating the original value, and one indicating the current value. There
will also be a column that indicates when the current value becomes active.
To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
Customer Key, Name, Original State, Current State, Effective Date
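After Christina moved from Illinois to California, the original information gets updated, and we
have the following table (assuming the effective date of change is January 15, 2003):
Customer Key   Name        Original State   Current State   Effective Date
1001           Christina   Illinois         California      January 15, 2003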
Advantages:
This does not increase the size of the table, since new information is updated.
This allows us to keep some part of history.
Disadvantages:
Type 3 will not be able to keep all history where an attribute is changed more than once. For
example, if Christina later moves to Texas on December 15, 2003, the California information will
be lost.
Usage:
Type 3 is rarely used in actual practice.
When to use Type 3:
Type 3 slowly changing dimension should only be used when it is necessary for the data
warehouse to track historical changes, and when such changes will only occur a finite number
of times.
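The Type 3 behaviour described above, including the loss of the California value after a second move, can be sketched like this. The dictionary layout mirrors the five columns listed earlier; the initial row (current state equal to original state, no effective date yet) and the date strings are assumptions.

```python
initial_row = {
    "Customer Key": 1001,
    "Name": "Christina",
    "Original State": "Illinois",
    "Current State": "Illinois",  # assumed: same as the original before any move
    "Effective Date": None,       # assumed: no change recorded yet
}


def apply_type3(row, new_state, effective_date):
    """Type 3: update the current-value column and its effective date in place."""
    updated = dict(row)
    updated["Current State"] = new_state
    updated["Effective Date"] = effective_date
    return updated


after_first_move = apply_type3(initial_row, "California", "2003-01-15")
after_second_move = apply_type3(after_first_move, "Texas", "2003-12-15")
```

The row never gains columns or siblings, which is the size advantage, but after the second move only Illinois and Texas remain.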
Conformed Dimension
A conformed dimension is a single, coherent view of the same piece of
data throughout the organization. The same dimension is used in all
subsequent star schemas defined. This enables reporting across the
complete data warehouse in a simple format.
A conformed dimension is a set of data attributes that have been
physically implemented in multiple data marts using the same
structure, attributes, domain values, definitions and concepts in each
implementation.
Factless Fact
A factless fact table is a fact table that does not contain numeric
additive values; it is composed exclusively of keys.
There are two types of factless fact tables:
Event-tracking
Coverage.
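The slides name the two types without defining them, so the sketch below follows the usual interpretation, which is an assumption here: an event-tracking table records that something happened (analysis is done by counting rows), and a coverage table records what could have happened so it can be compared with what did. The attendance and promotion examples and all key values are hypothetical.

```python
from collections import Counter

# Event-tracking: one row per attendance event, keys only, no numeric measure.
attendance_fact = [
    ("2003-01-06", "student-1", "class-A"),
    ("2003-01-06", "student-2", "class-A"),
    ("2003-01-07", "student-1", "class-A"),
    ("2003-01-07", "student-1", "class-B"),
]
# Analysis is done by counting rows rather than summing a fact.
attendance_per_class = Counter(class_key for _, _, class_key in attendance_fact)

# Coverage: which products were on promotion, whether or not they sold.
promotion_coverage = {("2003-01-06", "product-1"), ("2003-01-06", "product-2")}
# Keys that actually appear in a (hypothetical) sales fact table.
sales_keys = {("2003-01-06", "product-1")}
promoted_but_unsold = promotion_coverage - sales_keys
```

The coverage table answers a question the sales fact table alone cannot: which promoted products did not sell.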
There are patterns in your company's data, waiting to be detected.
Detection of patterns in your data is called Data Mining.
De-normalize data
ROLAP/MOLAP/HOLAP
ROLAP stands for Relational OLAP. Users see their data organized in
cubes with dimensions, but the data is really stored in a relational
database (RDBMS) like Oracle. Because the RDBMS stores data at a
fine-grained level, response times are usually slow.
MOLAP stands for Multidimensional OLAP. Users see their data
organized in cubes with dimensions, but the data is stored in a
multidimensional database (MDBMS) like Oracle Express Server. In a
MOLAP system, many queries have a finite answer; performance is
usually critical, and response times are fast.
HOLAP stands for Hybrid OLAP; it is a combination of both worlds.
Seagate Software's Holos is an example of a HOLAP environment. In a
HOLAP system one will find queries on aggregated data as well as on
detailed data.
Data Mining
Given databases of sufficient size and quality, data mining technology
can generate new business opportunities by providing these
capabilities:
Automated prediction of trends and behaviors
Automated discovery of previously unknown patterns
Commonly used data mining techniques
Decision trees
Rule induction
Artificial Neural Networks
Clustering
Market Basket Analysis
Link Analysis
Applications
Forecasting
Risk Management
Market Management
ETL Tools
Informatica
DataStage
Oracle Warehouse Builder (OWB)
OLAP Tools
Cognos Products
Impromptu
Transformer
PowerPlay
Visualizer
Oracle Products
Oracle Discoverer Administrator
Discoverer Plus
Discoverer Desktop
Primary Business Objects products
BusinessObject Miner
Cognos 4Thought
Cognos Scenarios
Oracle Data Miner
Seagate
www.seagate.com
Data Warehousing
www.dw-institute.com
www.dwinfocenter.org
SAS Institute
www.sas.com
Other Links:
http://www.kimballgroup.com/html/designtips.html
http://www.learndatamodeling.com
http://www.1keydata.com/datawarehousing/concepts.html
Thank You
Null values
Unlike most other types of indexes, bitmap indexes include rows that have NULL values. Indexing of nulls
can be useful for some types of SQL statements, such as queries with the aggregate function COUNT.
Cardinality
The advantages of using bitmap indexes are greatest for columns in which the ratio of the number of distinct
values to the number of rows in the table is small. We refer to this ratio as the degree of cardinality. A gender
column, which has only two distinct values (male and female), is optimal for a bitmap index. However, data
warehouse administrators also build bitmap indexes on columns with higher cardinalities. For example, on a
table with one million rows, a column with 10,000 distinct values is a candidate for a bitmap index. A bitmap
index on this column can outperform a B-tree index, particularly when this column is often queried in
conjunction with other indexed columns. In fact, in a typical data warehouse environment, a bitmap index can
be considered for any non-unique column.
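To illustrate the points about NULL values, cardinality, and combining indexed columns, here is a toy bitmap index that uses Python integers as bitmaps. It is a conceptual sketch only and does not reflect how any particular database stores or compresses its bitmaps; the column data and function names are assumptions.

```python
def build_bitmap_index(column):
    """Map each distinct value (None stands in for NULL) to an int used as a bitmap:
    bit i is set when row i holds that value."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def row_ids(bitmap):
    """Row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def degree_of_cardinality(distinct_values, row_count):
    """Ratio of distinct values to rows; small ratios favor bitmap indexes."""
    return distinct_values / row_count


# Hypothetical low-cardinality columns; None represents a NULL.
gender = ["M", "F", None, "F", "M", "F"]
region = ["East", "East", "West", "West", "East", None]
gender_index = build_bitmap_index(gender)
region_index = build_bitmap_index(region)

# Two indexed columns queried together: a single bitwise AND of their bitmaps.
female_east = row_ids(gender_index["F"] & region_index["East"])

# NULL rows are indexed too, so counting them needs only the bitmap.
null_gender_count = bin(gender_index[None]).count("1")

# The example from the text: 10,000 distinct values in one million rows.
ratio = degree_of_cardinality(10_000, 1_000_000)
```

The bitwise AND is what makes bitmap indexes attractive when a column is queried in conjunction with other indexed columns.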