Module 6
Course Agenda
Rationale for dimensional modeling
Dimensional modeling basics
Dimensional modeling details
Fact table details
Dimension table details
Design process
Aggregate schemas
Multiple fact tables
Architected data marts
A series of interrelated business processes which contribute to increased product value for the customer, and to profit for the enterprise
Porter 1985
Product Development
Operations
Customer Services
Drive to Compete
Businesses constantly strive to optimize each process in the value chain
Optimization requires measuring and analyzing the effectiveness of each process as well as the value chain as a whole
Process optimization
Supported by on-line transaction processing (OLTP) systems
Supported by 'analytic' systems (data warehouse)
Operations Sales and Marketing Customer Services
Book an order, print a pick list, record a cash withdrawal, post a payment
Operational reporting: run invoices, print ledger, pull up customer detail
Focused on detail
Predictable requirements and query patterns
Does not reveal the overall performance of a process
Design goals
For designing relational database structures
For use in analytic systems
Packaged goods industry
1996 book: 'The Data Warehouse Toolkit'
Q&A
Measurement Focus
Product Development Operations Sales and Marketing Customer Services
gross margin
receivables
return rate
Process Measurement
Measures
Metrics or indicators by which people evaluate a business process
Referred to as facts
Examples: margin, inventory amount, sales dollars, receivable dollars, return rate
2,400   1,632   68%
2,073   1,658   80%
9,473   7,090   75%
Facts
Perspective Focus
Product Development Operations Sales and Marketing Customer Services
category
G/L account
Product, warehouse
supplier
Process Perspectives
Dimensions
The parameters by which measures are viewed
Used to break out, filter or roll up measures
Often found after the word "by" in a business question
Descriptive business terms
Examples: Product, Warehouse, Customer, Supplier
Dimensions
Dimensional Model
Definition
Logical data model used to represent the measures and dimensions that pertain to one or more business subject areas
Dimensional model = star schema

Serves as the basis for the design of a relational database schema
Can easily translate into a multi-dimensional database design if required
Overcomes OLTP design shortcomings
Understandable
Systematically represents history
Reliable join paths
High-performance queries
Enterprise scalability
Schema Simplicity
Fewer tables
Denormalized
Consolidated
Familiar to users
Facts go in the fact tables
Dimensions go in the dimension tables
Dimensional
Product
Increases understandability
Star Schema
Data Familiarity
Single source field
Expanded into parts
Decoded into business terms
Add special indicators and flags
Example: the source field ord_date expands into the Time dimension attributes year, quarter, month, date, day of the week, holiday flag
Increases understandability
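The expansion of a single source date into descriptive time attributes can be sketched in Python. This is only an illustration: the function name `time_dimension_row`, the `HOLIDAYS` set, and the attribute formats are assumptions, not part of the course material.

```python
from datetime import date

# Hypothetical holiday calendar; a real warehouse would load this
# from a business calendar maintained by the organization.
HOLIDAYS = {date(2001, 1, 1), date(2001, 12, 25)}


def time_dimension_row(ord_date: date) -> dict:
    """Expand one source date (e.g. ord_date) into time-dimension attributes."""
    return {
        "year": ord_date.year,
        "quarter": f"Q{(ord_date.month - 1) // 3 + 1}",
        "month": ord_date.strftime("%B"),
        "date": ord_date.isoformat(),
        "day_of_week": ord_date.strftime("%A"),
        "holiday_flag": "Y" if ord_date in HOLIDAYS else "N",
    }


row = time_dimension_row(date(2001, 1, 31))
print(row)
```

Because every attribute is decoded once at load time, report users can filter on terms such as quarter or holiday flag without writing date arithmetic.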
Representing History
Time dimension
Part of every star schema
Marks the date when the facts (process measurements) occurred
Allows the schema to easily add and query data over time
Especially useful for performing comparison queries
Facts
Product
Time Dimension
Defined during schema design, not at runtime
Business people can easily understand these relationships
One-to-many relations between dimensions and facts
Referential integrity always enforced

Fewer joins mean less 'expensive' queries
Deterministic query patterns
Star schema query optimization supported by all major RDBMS vendors
Enterprise Models
Enterprise Scope E/R model
Exercise 1
Scenario
Industry: Automobile manufacturing
Company: Millennium Motors
Value chain focus: Sales
What are the top 10 selling car models this month?
How do this month's top 10 selling models compare to the top 10 over the last six months?
Show me dealer sales by region by model by day
What is the total number of cars sold by month by dealer by state?
Exercise 1 - worksheet
Exercise 1 Solution
Facts
Dimensions
Q&A
Dimension tables
Store dimension values
Textual content
Dimension tables are usually referred to simply as 'dimensions'
Spend extra effort to add dimensional attributes
Dimension Dimension
Dimension
Dimension Keys
Synthetic keys
Dimension Dimension
key key
Each table is assigned a unique primary key, specifically generated for the data warehouse
Primary keys from source systems may be present in the dimension, but are not used as primary keys in the star schema
Dimension
key
Dimension Columns
Dimension attributes
Dimension Dimension
Key attribute attribute attribute Key attribute attribute attribute
Specify the way in which measures are viewed: rolled up, broken out or summarized
Often follow the word "by", as in "Show me sales by region and quarter"
Frequently referred to as 'dimensions'
Dimension
Key attribute attribute attribute
Process measures
Start by assigning one fact table per business subject area
Fact tables store the process measures (aka facts)
Compared to dimension tables, fact tables usually have a very large number of rows
Fact Table
Fact Table
key key key fact1 fact2 fact3
Sparsity
Term used to describe the very common situation where a fact table does not contain a row for every combination of every dimension table row for a given time period.
Because fact tables contain a very small percentage of all possible combinations, they are said to be "sparsely populated" or "sparse".
Grain
The level of detail represented by a row in the fact table
Must be identified early
Cause of greatest confusion during the design process
Example: each row in the fact table represents the daily item sales total
Sparsity Example
Assume:
5,000 rows in the 'dealer' dimension
50 rows in the 'model' dimension
5,000 * 50 = 250,000 possible dealer/model combinations every day
250,000 * 365 = 91,250,000 possible combinations every year
That figure assumes every dealer sells at least one car of every model every day
In practice, only a small fraction of the possible 250,000 combinations will have a sale on a given day
Generally, record only sales in the fact table, not zeroes
Sparsity
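The sparsity arithmetic above can be checked with a few lines of Python. The dimension sizes come from the slide; the figure of 10,000 actual sales rows per day is a made-up assumption used only to show how a density would be computed.

```python
dealers = 5_000
models = 50
days_per_year = 365

# Every dealer/model combination that could appear in the fact table.
possible_rows_per_day = dealers * models
possible_rows_per_year = possible_rows_per_day * days_per_year

# Hypothetical: suppose only 10,000 dealer/model combinations
# actually record a sale on a given day.
actual_rows_per_day = 10_000
density = actual_rows_per_day / possible_rows_per_day

print(possible_rows_per_day, possible_rows_per_year, f"{density:.0%}")
```

Under that assumption the fact table holds 4% of the rows a fully dense table would, which is why zero-sales combinations are normally not stored.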
Five initial design steps
Based on Kimball's six steps
Start designing in order
Re-visit and adjust over project life
Step One
Step Two
Step Three
Identify dimensions
Step Four
Select facts
Step Five
Exercise 2
Scenario
Industry: Automobile manufacturing
Company: Millennium Motors
Value chain focus: Sales
What are the top 10 selling car models this month?
How do this month's top 10 selling models compare to the top 10 over the last six months?
Show me dealer sales by region by model by day
What is the total number of cars sold by month by dealer by state?
Exercise 2 - continued
Using these source data elements, design a star schema that answers the proposed business questions
Sales revenue, Quantity sold, Model name, Dealer name, Dealer city, Product line, Region where sold, State, Vehicle category, Month, Date of sales
Exercise 2 - worksheet
Exercise 2 - solution
'Sales facts'
Every row in the sales facts table is a summary of car model sales for that day at a single dealer
Step 3 - Dimensions: Time, Model, Dealer
Step 4 - Facts: Total revenue, Quantity sold
See next page
Model
model_key category line model
Sales Facts
model_key dealer_key time_key revenue quantity
Dealer
dealer_key region state city dealer
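The star schema above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the `_dim` table-name suffix, the column types, and every sample row are assumptions made for illustration, and the Time dimension uses the year, quarter, month, date attributes shown elsewhere in the course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE model_dim  (model_key INTEGER PRIMARY KEY, category TEXT, line TEXT, model TEXT);
CREATE TABLE dealer_dim (dealer_key INTEGER PRIMARY KEY, region TEXT, state TEXT, city TEXT, dealer TEXT);
CREATE TABLE time_dim   (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT, month TEXT, date TEXT);
CREATE TABLE sales_facts (
    model_key  INTEGER REFERENCES model_dim(model_key),
    dealer_key INTEGER REFERENCES dealer_dim(dealer_key),
    time_key   INTEGER REFERENCES time_dim(time_key),
    revenue    REAL,
    quantity   INTEGER,
    PRIMARY KEY (model_key, dealer_key, time_key)
);
""")

# Illustrative rows only; none of these values come from the course.
conn.executemany("INSERT INTO model_dim VALUES (?,?,?,?)", [
    (1, "Sedan", "Family", "Model A"),
    (2, "SUV", "Utility", "Model B"),
])
conn.executemany("INSERT INTO dealer_dim VALUES (?,?,?,?,?)", [
    (1, "Northeast", "NY", "Albany", "Dealer One"),
    (2, "Southeast", "FL", "Tampa", "Dealer Two"),
])
conn.executemany("INSERT INTO time_dim VALUES (?,?,?,?,?)", [
    (1, 2001, "Q1", "January", "2001-01-31"),
    (2, 2001, "Q1", "February", "2001-02-01"),
])
conn.executemany("INSERT INTO sales_facts VALUES (?,?,?,?,?)", [
    (1, 1, 1, 40000.0, 2),
    (2, 1, 1, 30000.0, 1),
    (1, 2, 2, 20000.0, 1),
    (1, 1, 2, 60000.0, 3),
])

# "Cars sold by region by model by month": one join per dimension.
rows = conn.execute("""
    SELECT d.region, m.model, t.month, SUM(f.quantity) AS cars_sold
    FROM sales_facts f
    JOIN dealer_dim d ON d.dealer_key = f.dealer_key
    JOIN model_dim  m ON m.model_key  = f.model_key
    JOIN time_dim   t ON t.time_key   = f.time_key
    GROUP BY d.region, m.model, t.month
    ORDER BY d.region, m.model, t.month
""").fetchall()

for r in rows:
    print(r)
```

Each business question in the scenario follows the same pattern: join the fact table to the dimensions named after "by", group by their attributes, and sum the facts.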
Q&A
Sales Facts
model_key dealer_key time_key revenue quantity
model_key   dealer_key   quantity
1           1            2
2           1            3
3           1            1
4           1            4
5           1            1
1           2            1
3           2            2
5           2            2
Primary Key
Facts
Facts
Fully additive
Can be summed across any and all dimensions
Stored in the fact table
Examples: revenue, quantity
Model
model_key brand category line model
Sales Facts
model_key dealer_key time_key revenue quantity
Dealer
dealer_key region state city dealer
Facts
Semi-additive
Can be summed across most dimensions but not all
Examples: inventory quantities, account balances, or personnel counts
Anything that measures a level
Must be careful with ad-hoc reporting
Often aggregated across the forbidden dimension by averaging
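A short Python sketch of the semi-additive rule, using inventory levels as the measure. The sample rows and function names are hypothetical. Inventory is summed across dealers for a single day, but averaged across days rather than summed.

```python
from collections import defaultdict

# (dealer, day, inventory on hand); illustrative values only.
snapshot = [("D1", 1, 10), ("D2", 1, 20), ("D1", 2, 12), ("D2", 2, 18)]


def total_by_day(rows):
    """Summing across the dealer dimension is valid for a level measure."""
    totals = defaultdict(int)
    for _dealer, day, inventory in rows:
        totals[day] += inventory
    return dict(totals)


def average_over_time(rows):
    """Across the time dimension, average the daily totals instead of summing."""
    by_day = total_by_day(rows)
    return sum(by_day.values()) / len(by_day)


daily = total_by_day(snapshot)
avg = average_over_time(snapshot)
naive = sum(inventory for _dealer, _day, inventory in snapshot)

print(daily, avg, naive)
```

The naive total of 60 counts the same cars on two different days, which is the kind of mistake an ad-hoc report can make when it sums a level across time.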
Model
model_key brand category line model
Sales Facts
model_key dealer_key time_key
inventory
Dealer
dealer_key region state city dealer
Facts
Non-Additive
Cannot be summed across any dimension
All ratios are non-additive
Break down to fully additive components and store them in the fact table
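The reason for storing additive components instead of a ratio can be shown with a small Python example. The two fact rows are invented for illustration; `revenue` and `margin_amt` mirror the additive columns used in the course's sales facts table.

```python
# (revenue, margin_amt) per fact row; illustrative values only.
fact_rows = [(1000.0, 100.0), (100.0, 50.0)]

# Wrong: averaging the per-row ratios ignores how much revenue each row carries.
average_of_ratios = sum(margin / revenue for revenue, margin in fact_rows) / len(fact_rows)

# Right: sum the additive components first, then compute the ratio once.
total_revenue = sum(revenue for revenue, _margin in fact_rows)
total_margin = sum(margin for _revenue, margin in fact_rows)
margin_rate = total_margin / total_revenue

print(f"average of ratios: {average_of_ratios:.1%}, ratio of sums: {margin_rate:.1%}")
```

The two numbers differ widely here (30% versus roughly 13.6%), so the ratio is best computed at query time from the summed components.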
Sales Facts
model_key dealer_key time_key
revenue margin_amt
Time
time_key
year quarter month date
Dealer
dealer_key
region state city dealer
Unit Amounts
Are numeric, but not measures
Store the extended amounts, which are additive
Unit amounts may be useful as dimensions for price point analysis
May store unit values to save space
A fact table with no measures in it
Nothing to measure, except the convergence of dimensional attributes
Sometimes store a 1 for convenience
Examples: attendance, customer assignments, coverage
Q&A
Dimension Table
Details
time_key
year quarter month date
Dealer
dealer_key
region state city dealer
Synthetic Key
Attributes
Dimension Tables
Characteristics
Hold the dimensional attributes
Usually have a large number of attributes (wide)
Add flags and indicators that make it easy to perform specific types of reports
Have a small number of rows in comparison to fact tables (most of the time)

Saves very little space
Impacts performance
Can confuse matters when multiple hierarchies exist
A star schema with normalized dimensions is called a "snowflake schema"
Usually advocated by software vendors whose products require snowflake for performance
Sales Facts
model_key dealer_key time_key revenue quantity
Day
date_key date month_key
Month
month_key month quarter_key
Quarter
quarter_key quarter year_key
Year
year_key year
Line
line_key line category_key
Category
category_key category brand_key
Brand
brand_key brand
Dealer
dealer_key dealer city_key
City
city_key city state_key
State
state_key state region_key
Region
region_key region
Dimension source data may change over time
Relative to fact tables, dimension records change slowly
Allows dimensions to have multiple 'profiles' over time to maintain history
Each profile is a separate record in a dimension table

Existing facts need to remain associated with her single profile
New facts need to be associated with her married profile
Type 1
Updates existing record with modifications
Does not maintain history
Type 2
Adds new record
Does maintain history
Maintains old record
Type 3
Keeps old and new values in the existing row
Requires a design change
Gather SCD requirements when designing data mapping and loading
SCD needs to be defined and implemented at the dimensional attribute level
Each column in a dimension table needs to be identified as a Type 1 or a Type 2 SCD
If one Type 1 column changes, then all Type 1 columns will be updated
If one Type 2 column changes, then a new record will be inserted into the dimension table
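The column-level Type 1 / Type 2 rules above can be sketched as a small Python load routine. Everything here is a hypothetical illustration: the column classification, the `current` flag, the helper names, and the sample values are assumptions, and a real load would also maintain effective dates.

```python
# Hypothetical classification of customer dimension columns.
TYPE1_COLS = {"name"}                              # overwrite, no history
TYPE2_COLS = {"marital_status", "home_income"}     # new profile row, history kept


def next_key(dim_rows):
    """Generate the next synthetic (surrogate) key for the dimension."""
    return max((r["cust_key"] for r in dim_rows), default=0) + 1


def apply_change(dim_rows, source):
    """Apply one source (OLTP) customer record to the dimension rows."""
    profiles = [r for r in dim_rows if r["cust_id"] == source["cust_id"]]

    # Type 1: overwrite in place on every existing profile of this customer.
    for row in profiles:
        for col in TYPE1_COLS:
            row[col] = source[col]

    # Type 2: if any Type 2 column differs, retire the current profile
    # and insert a new one under a new synthetic key.
    current = next((r for r in profiles if r["current"]), None)
    if current is None or any(current[c] != source[c] for c in TYPE2_COLS):
        if current is not None:
            current["current"] = False
        dim_rows.append({"cust_key": next_key(dim_rows), **source, "current": True})
    return dim_rows


dim = []
apply_change(dim, {"cust_id": 100, "name": "Sue Smith",
                   "marital_status": "S", "home_income": 30000})
apply_change(dim, {"cust_id": 100, "name": "Sue Jones",
                   "marital_status": "M", "home_income": 60000})

for row in dim:
    print(row)
```

After the second call the dimension holds two profiles for the same customer id. Facts loaded earlier keep pointing at synthetic key 1, while new facts would use key 2.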
For large dimension tables, change data capture techniques may be used to minimize the data volume
For smaller dimension tables, compare all OLTP records with dimension table records
Balance data volume with change data capture logic complexities
Type 1 Example
OLTP
Customer OLTP
Cust ID Name Marital Home Status Income S $30K
Star Schema
Customer Dim
Cust Cust Key ID 1 Name Marital Home Status Income Status S $30K 0
Sales Facts
Cust Key 1 Day Key 1 Sales $40
Day Dim
Day Key 1 Business Date 1/31/01
Customer Dim
Name
Sales Facts
Day Key Sales 1 2 $40 $50
Day Dim
Day Key 1 2 Business Date 1/31/01 2/01/01
Type 1 Example
Observations
Customer history is not maintained in the OLTP system
Customer history is not maintained in the star schema
Sue only has one customer 'profile' in the customer dimension table
Sue's sales facts across all history are associated with her married profile
Sales facts that were associated with Sue's single profile have been lost
Type 2 Example
OLTP
Customer OLTP
Cust ID Name Marital Home Status Income S 30K
Star Schema
Customer Dim
Cust Cust Key ID 1 Name Marital Home Status Income Status S $30K 0
Sales Facts
Cust Key 1 Day Key Sales 1 $40
Day Dim
Day Key 1 Business Date 1/31/01
Customer Dim
Name Sue Jones
Sales Facts
Day Key Sales 1 2 $40 $50
Day Dim
Day Key 1 2 Business Date 1/31/01 2/01/01
Type 2 Example
Type 2 Observations
Customer history is not maintained in the OLTP system
Customer history is maintained in the star schema
Sue has two 'profiles' in the customer dimension
Sue's sales facts may be analyzed for when she was single, when she was married, and across all history by using the customer id field
Home income was updated in the new profile record
Values change rapidly over time.
There is no yardstick for telling whether a dimension is slowly or rapidly changing; this rests on the judgment of the data modeler.
An SCD may become an RCD (rapidly changing dimension) over time, or vice versa.
Large Dimensions
Dimensions containing several million records
How to support?
Use a database whose indexing technology supports rapid browsing
Find and suppress duplicate entries in the dimension (e.g. name and address matching)
Never use Type 2 to solve changing dimensions (adding records)

How to support?
Break the monster dimension into separate dimension tables
Keep constant information in the original table
The new dimension table can have discrete values for each attribute
Choose a pre-defined set of values per attribute
Indexing
Bitmap indexes on the foreign key columns in the fact tables
Bitmap indexes on low cardinality columns in dimension tables, like Month, Product Category, Store Category, etc.
B-tree indexes on dimension key columns
Build the data in this dimension with all possible combinations of values for each attribute
Identify each combination uniquely
Every time an event occurs and is recorded in the fact table, attach the unique combination ID to it
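Pre-building every combination of attribute values, as described above, maps naturally onto `itertools.product`. The attribute names, the value bands, and the `profile_key` column below are hypothetical examples, not taken from the course.

```python
from itertools import product

# Hypothetical banded attributes split out of a large customer dimension.
ATTRIBUTES = {
    "age_band": ["<30", "30-50", ">50"],
    "income_band": ["low", "medium", "high"],
    "marital_status": ["S", "M"],
}


def build_mini_dimension(attributes):
    """Create one row per possible combination, each with a unique key."""
    names = list(attributes)
    rows = {}
    for key, combo in enumerate(product(*attributes.values()), start=1):
        rows[combo] = {"profile_key": key, **dict(zip(names, combo))}
    return rows


mini_dim = build_mini_dimension(ATTRIBUTES)


def profile_key_for(**values):
    """Look up the combination ID to attach to a new fact row."""
    combo = tuple(values[name] for name in ATTRIBUTES)
    return mini_dim[combo]["profile_key"]


print(len(mini_dim))
print(profile_key_for(age_band="<30", income_band="low", marital_status="M"))
```

The table size is fixed at 3 * 3 * 2 = 18 rows no matter how often an individual customer's attributes change, because a change only alters which key the next fact row carries.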
Example RCD
Degenerate Dimensions
Dimensions with no other place to go
Stored in the fact table
Are not facts
Common examples include invoice numbers or order numbers
Junk/Dirty Dimension
A convenient grouping of random flags and attributes
After carving out all the dimensions, some flags or text attributes are left over in the fact table but do not belong to any of the dimension tables
Junk/Dirty Dimension
Alternatives to be avoided:
Leaving the flags and attributes unchanged in the fact table record
Making each flag and attribute into its own separate dimension
Stripping out all of these flags and attributes from the design
Recommended: make a convenient grouping of the flags and attributes to get them out of the fact table and into a useful dimensional framework
Drilling
Drilling down
Southeast
Drilling
Quarterly Auto Sales Summary
Rolling up
Region Northeast
Units Sold
Revenue
Southeast
Drilling
Drilling across
A query that involves more than one fact table
Not necessarily an action that changes how a user is looking at the data
Best resolved by multiple SQL passes
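The multi-pass idea can be sketched in plain Python: each fact table is summarized separately to the same dimension attribute, and the two result sets are then merged. The fact rows, the `aggregate` helper, and the column layout are illustrative assumptions.

```python
def aggregate(rows, key_index, value_index):
    """One 'pass': summarize a single fact table by a shared dimension value."""
    totals = {}
    for row in rows:
        key = row[key_index]
        totals[key] = totals.get(key, 0) + row[value_index]
    return totals


# (model, quantity) rows from two separate fact tables; illustrative values.
sales_facts = [("Model A", 5), ("Model B", 3), ("Model A", 2)]
shipment_facts = [("Model A", 10), ("Model C", 4)]

# Pass 1 and pass 2 run independently against each fact table.
sold = aggregate(sales_facts, 0, 1)
shipped = aggregate(shipment_facts, 0, 1)

# Merge the passes on the shared dimension value.
report = {
    model: {"sold": sold.get(model, 0), "shipped": shipped.get(model, 0)}
    for model in sorted(sold.keys() | shipped.keys())
}

for model, measures in report.items():
    print(model, measures)
```

Joining the two fact tables directly in one query would risk multiplying rows, which is why each table is aggregated on its own before the merge.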
Q&A
Design Phase
Development Phase
Deployment Phase
Design phase: determine requirements and design schema
Development phase: iterative build and feedback
Deployment phase: automate load, document, train users
Project Deliverables
Design: project definition document, project plan, schema design, mapping document, report design
Development and deployment: populated data mart, load routines (Sagent Plans), query and reporting environment
Project Approach
The dimensional model is developed during the design stage
Scope of the project has already been determined

Gather requirements through requirements workshops
Develop star schema
Conduct design review
Gather Requirements
Requirements definition
Design Deliverables
Deliverables
How these primary components are delivered will depend on needs and format chosen
Notation Example
IDEF1X
Model
model_key
Sales Facts
time_key model_key dealer_key
Dealer
dealer_key
Notation Example
Martin IE
Model
Sales Facts
Dealer
Notation Example
Kimball
Time
time_key
Model
model_key
Sales Facts
time_key model_key dealer_key
Dealer
dealer_key
Suggested conventions
Clear descriptions
Example of Data
As it will exist in the warehouse
After decoding
Adds to model understanding
Removes ambiguity/uncertainty
Data Transformation
Serves as spec for the ETL process
Decodes
Type conversion
Conditional logic
Handling of NULLs
Q&A
Aggregate Schemas
Aggregate Designs
Aggregates
Pre-stored fact summaries
Along one or more dimensions
The most effective tool for improving performance
Examples
Aggregate Background
Aggregate rationale
Improve end user query performance
Reduce required CPU cycles
Powerful cost saving tool
Restrictions
Aggregate Guidelines
Don't start with aggregates
Design and build based on usage
Sooner or later you'll need to build aggregates
Aggregate Types
Separate Tables
Separate fact table for every aggregate
Separate dimension table for every aggregate dimension
Same number of fact records as level field tables
Advantage: removes possibility of double counting; schema clarity
Caveat: requires software with aggregate navigation capability
Separate Tables
One Way Aggregate
Mthly Sales Facts Agg
month_key product_key market_key Quantity Amount
Month
month_key Year Fiscal Period Month
Market
market_key Region District State City
Product
product_key Category Brand Product Diet Indicator
Sales Facts
time_key product_key market_key Quantity Amount
Time
time_key Year Fiscal Period Month Day Day of Week
Separate Tables
Two Way Aggregate
Category
category_key Category
Month
month_key Year Fiscal Period Month
Market
market_key Region District State City
Time
time_key Year Fiscal Period Month Day Day of Week
Aggregate Pitfalls
Sparsity failure
Term used to describe the result of building too many aggregate fact tables that do not summarize enough rows.
When sparsity failure occurs, a relatively small star schema can grow (in terms of disk size) thousands of times.
Sparsity failure = aggregate explosion
Rule of twenty
To avoid aggregate explosion, make sure each aggregate record summarizes 20 or more lower-level records
Total number of possible fact tables in any given dimensional model = Cartesian product of all levels in all the dimensions
Remember
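The rule of twenty can be expressed as a simple ratio check before an aggregate is built. The function names and the row counts in the example are hypothetical; in practice the counts would come from profiling the actual fact table, since sparsity means the real ratio is usually lower than the dimension cardinalities suggest.

```python
def summarization_ratio(base_rows: int, aggregate_rows: int) -> float:
    """Average number of lower-level rows summarized by each aggregate row."""
    return base_rows / aggregate_rows


def passes_rule_of_twenty(base_rows: int, aggregate_rows: int, threshold: int = 20) -> bool:
    """True when each aggregate row summarizes at least `threshold` base rows on average."""
    return summarization_ratio(base_rows, aggregate_rows) >= threshold


# Hypothetical row counts for a base fact table and two candidate aggregates.
base = 1_000_000
good_candidate = 40_000     # ratio 25: worth building
poor_candidate = 200_000    # ratio 5: likely to bloat the schema

print(passes_rule_of_twenty(base, good_candidate))
print(passes_rule_of_twenty(base, poor_candidate))
```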
Hierarchy diagram
Helps visualize options for building aggregates
Adding cardinalities ensures following the rule of 20
20 quarters
Quarter (4)
60 months
Month (12)
1825 days
Date (365)
Aggregate Navigation
Description
Function provided by a software layer: the aggregate navigator
Directs user queries to the most favorable available aggregate
Transparent to the end user
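A toy version of the navigator's decision can be written in Python: pick the smallest table whose grain is fine enough to answer the requested levels. The table names, row counts, and the `LEVELS` ordering are assumptions for illustration; commercial aggregate navigators work on SQL and metadata rather than dictionaries.

```python
# Levels per dimension, ordered from finest to coarsest.
LEVELS = {
    "time": ["date", "month", "quarter", "year"],
    "dealer": ["dealer", "city", "state", "region"],
}

# Hypothetical base table and aggregates with their grain and size.
TABLES = [
    {"name": "sales_facts", "grain": {"time": "date", "dealer": "dealer"}, "rows": 1_000_000},
    {"name": "sales_month_state_agg", "grain": {"time": "month", "dealer": "state"}, "rows": 3_000},
    {"name": "sales_month_dealer_agg", "grain": {"time": "month", "dealer": "dealer"}, "rows": 50_000},
]


def can_answer(table, requested):
    """A table can answer a query if its grain is at or below every requested level."""
    for dim, level in requested.items():
        order = LEVELS[dim]
        if order.index(table["grain"][dim]) > order.index(level):
            return False
    return True


def choose_table(requested):
    """Return the name of the smallest table able to answer the request."""
    candidates = [t for t in TABLES if can_answer(t, requested)]
    return min(candidates, key=lambda t: t["rows"])["name"]


print(choose_table({"time": "quarter", "dealer": "region"}))
print(choose_table({"time": "month", "dealer": "city"}))
print(choose_table({"time": "date", "dealer": "region"}))
```

The user asks the same business question in every case; only the table chosen behind the scenes changes, which is what makes the navigation transparent.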
Aggregate Framework
Business View
Designer View
Aggregate Architecture
Aggregate Aware SQL RDBMS Client PC
SQL
Client PC
SQL
Client PC
Aggregate Deployment
Exercise 3
Scenario
Given the original star schema and the following hierarchy, design a two-way aggregate table structure that will drastically increase performance
Make your own assumptions about summary levels
Model
model_key category line model
Sales Facts
model_key dealer_key time_key revenue quantity
Dealer
dealer_key region state city dealer
Exercise 3
Scenario
Industry: Automobile manufacturing
Company: Millennium Motors
Value chain focus: Sales
What are the top 10 selling car models this month?
How do this month's top 10 selling models compare to the top 10 over the last six months?
Show me dealer sales by region by model by day
What is the total number of cars sold by month by dealer by state?
Exercise 3
Dealer
All
Model
All
Time
All
Region
5 Category 20
Year
50
State
10
Quarter
1000
City
20
Line
60
Month
1000
Dealer name
40
Model name
1825
Date
Exercise 3 Worksheet
Exercise 3 Solution
State
state_key
region state
Month
month_key
year quarter month
Time
time_key
year quarter month date
Model
model_key
category line model
Sales Facts
model_key dealer_key time_key
revenue quantity
Dealer
dealer_key
region state city dealer
Q&A
Different business processes usually require different fact tables
There are also several cases where a single business process will require multiple fact tables

Different business processes usually require different fact tables
In practice, it may be hard to identify what a process is
Sometimes you can spot different processes because measures are recorded
Shipment Facts
Product
Sales Facts
The 'not applicable' dimension value
Using a 'not applicable' row in a dimension confuses the grain and can introduce reporting difficulty

Sometimes, it is not easy to identify the discrete business processes
All measures may have the same dimensionality or grain
Different measures are recorded at different times

Different Timing
Building a single fact table would require recording zero or null for measures that are not applicable at a point in time
Reports would contain a confusing combination of zeros, nulls, and absence of data
Product
product_key Category Brand Product Diet Indicator
Market
market_key Region District State City
Product
product_key Category Brand Product Diet Indicator
Look at the measures in question
Sort them into fact tables based on

There is a set of dimension attributes and measures shared in all cases
Depending on the value in a dimension, certain extra dimension attributes or measures are recorded
Account Facts
time_key product_key branch_key customer_key Balance Transaction_count
Time
time_key ...
Customer
customer_key ...
Branch
branch_key ...
Checking Account
checking_key ...custom checking attributes
All attributes shared no matter what
Appropriate for analysis across the entire subject area

Contain attributes specific to a particular dimension value (e.g. Checking)
Only appropriate when the business question is limited to that particular dimension value
Should repeat shared facts to minimize the need to access two fact tables
Coverage Schema
A star schema usually measures events that happen
Relationships between the dimensions involved are not captured if events do not happen
A coverage table fills the gap
What did not sell that was on promotion?
Who was assigned to that customer?
Usually factless
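The question "what did not sell that was on promotion?" amounts to a set difference between the coverage table and the sales facts. The keys and rows below are invented for illustration; a real coverage table would carry the full set of dimension keys.

```python
# Coverage rows: (time_key, product_key) pairs that were on promotion.
# Factless: the row's existence is the information.
coverage = {(1, "P1"), (1, "P2"), (2, "P1")}

# Sales facts: (time_key, product_key, quantity); illustrative values.
sales_facts = [(1, "P1", 5), (2, "P3", 1)]

# Combinations that actually recorded a sale.
sold = {(time_key, product_key) for time_key, product_key, _qty in sales_facts}

# On promotion but with no matching sales fact.
promoted_but_unsold = sorted(coverage - sold)

print(promoted_but_unsold)
```

The sales facts alone could never produce this answer, because the unsold combinations have no rows there to query.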
Sales facts do not reveal who is assigned to a customer if they do not sell
Time
time_key Year Fiscal Period Month Day Day of Week
Product
product_key Category Brand Product SKU
Sales Facts
time_key product_key customer_key rep_key quantity sales_dollars
Customer
customer_key Name Company Account Phone_num
Sales_rep
rep_key rep_name rep_phone Region District State City
Coverage Table
Customer
customer_key Name Company Account Phone_num
Sales_rep
rep_key rep_name rep_phone Region District State City
Transaction: the changes to what is being measured. Example: changes to inventory
Snapshot: the status at a point in time. Example: current status of inventory
Snapshot
Inventory Snapshot
time_key product_key location_key quantity_on_hand
Time
time_key Year Fiscal Period Month Day Day of Week
Location
location_key Warehouse WH_code City State
Transaction
How did inventory change today?
How much product was returned due to failed inspection?
Inventory Transactions
time_key product_key location_key transaction_type_key transaction_amount
Time
time_key Year Fiscal Period Month Day Day of Week
Product
product_key Category Brand Product SKU
Location
location_key Warehouse WH_code City State
Transaction_type
transaction_type_key transaction_type_code transaction_type transaction_category
Aggregate Tables
Aggregate table
A fact table that summarizes another fact table
Created for performance reasons
Covered in previous section

Mark where facts apply to dimensions
Mark where facts apply to dimensional attributes
When facts don't apply, assume a separate fact table
Bus Matrix
A planning methodology for large data warehouses with multiple data marts or dimensional models
Enables technical planning as well as executive communication
Exceptionally effective for distributed data warehouses without a center
Is simply a vertical list of data marts and a horizontal list of dimensions
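A bus matrix is small enough to model directly as a mapping from data marts to the dimensions they use. The marts, the dimension assignments, and the helper names below are hypothetical; they only show how the matrix layout and the list of shared dimensions fall out of that mapping.

```python
# Hypothetical data marts (rows) and the dimensions each one uses (columns).
MARTS = {
    "Sales": {"Time", "Product", "Customer", "Sales rep"},
    "Inventory": {"Time", "Product", "Location"},
    "Shipments": {"Time", "Product", "Customer", "Location"},
}


def bus_matrix(marts):
    """Return the column headers and one row of X marks per data mart."""
    dims = sorted(set().union(*marts.values()))
    rows = [[name] + ["X" if d in used else "" for d in dims]
            for name, used in marts.items()]
    return dims, rows


def shared_dimensions(marts):
    """Dimensions used by more than one mart: the candidates for conforming."""
    dims = set().union(*marts.values())
    return sorted(d for d in dims if sum(d in used for used in marts.values()) > 1)


dims, rows = bus_matrix(MARTS)
print(["Data mart"] + dims)
for row in rows:
    print(row)
print(shared_dimensions(MARTS))
```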
Example Matrix
Attribute 3
Attribute 1
X X X X
Attribute 2
X X X X
Attribute 4
X X X X
Attribute 5
Attribute 6
X X X X X X
Fact Table 2
Attribute 7
Attribute 8
Fact Table 1
Exercise 4
Scenario
Industry: Automobile manufacturing
Company: Millennium Motors
Value chain focus: Sales
What are the top 10 selling car models this month?
How do this month's top 10 selling models compare to the top 10 over the last six months?
Show me dealer sales by region by model by day.
How many cars have been purchased over the last six months by customers with yearly household incomes greater than $200,000?
Exercise 4 - continued
Using these source data elements, design a star schema that answers the proposed business questions
Daily sales revenue, Daily quantity sold, Model, Dealer, Dealer city, Product line, Region where sold, State, Vehicle category, Date of sales
Customer name, Customer zip code, Customer yearly income, P.O. Number, Purchase price, Discount amount, Brand of car
Exercise 4 - worksheet
facts
daily_sales
daily_quantity Customer name Customer zip code Model Customer income X X X X X X X X X X X X X X X X Dealer P.O. Number Dealer city Product line Brand of car Region where sold State X X Vehicle category
discount_amount
X X X X X X X X X X X X X
Time
time_key
year quarter month date
dealer_key
region state city dealer
Q&A
Data Mart
Meaning of the term 'data mart' has shifted over the last several years...
E.T.L. Software
E.T.L. Software
Operational Systems
Data Warehouse
Software
Data Marts
Analysis Users
Query & Reporting Software Data Mart Data Warehouse Analysis Users
Data Mart
Incremental warehouse development
Centralized architecture
Not new
Well-suited to star schemas

Inconsistent and overlapping data
Difficult and costly to maintain
Redundant data load
Can't drill across
Integration requires starting over
Product
Product
Conformed Dimensions
Definition
Dimensions are conformed when they are the same
-or-
When one dimension is a strict rollup of another
Conformed Dimensions
1. ... have exactly the same set of primary keys and
2. ... have the same number of records
Conformed Dimensions
Rolled up dimension
Which means
Two conformed dimensions can be combined into a single logical dimension by creating a union of the attributes
Conformed Dimensions
Description
Shared common dimensions
Integrates logical design
Ensures consistency between data marts
Allows incremental development
Independent of physical location
Some re-work may be required
Conformed Dimensions
Advantages
Enables an incremental development approach
Easier and cheaper to maintain
Drastically reduces extraction and loading complexity
Answers business questions that cross data marts
Supports both centralized and distributed architectures
Product Dimension
Warehouse Dimension
Conformed Dimensions
Store
Product
Day
Warehouse
Month
When to Conform
Two approaches
Conform Up Front
Conform As-You-Go
Q&A
Course Review
Rationale for dimensional modeling
Dimensional modeling basics
Dimensional modeling details
Fact table details
Dimension table details
Design process
Aggregate schemas
Multiple fact tables
Architected data marts