Lecture 05 and 06

Data Warehousing & Data Mining (SE-409)
Lecture-5 and 6
Huma Ayub
Software Engineering department
University of Engineering and Technology, Taxila

Relational OLAP (ROLAP)
Ahsan Abdullah 2
Why ROLAP?
Issue of scalability, i.e. the curse of dimensionality, for MOLAP
– Aggregate awareness allows some front-end tools to use pre-built summary tables.
– Star schema designs are usually used to facilitate ROLAP querying.
ROLAP as a “Cube”
• OLAP data is stored in a relational database (e.g. a star schema).
• The fact table can be visualized as an “un-rolled” cube.
• So where is the cube?
– It’s a matter of perception.
– Visualize the fact table as an elementary cube.
Fact Table
Month   Product   Zone   Sale (K Rs.)
M1      P1        Z1     250
M2      P2        Z1     500
[Figure: the fact table drawn as a cube, with Product and Time among its axes]
How to create “Cube” in ROLAP
• Cube is a logical entity containing values of a certain fact
at a certain aggregation level at an intersection of a
combination of dimensions.
• The following table can be created using 3 queries

SUM(Sales_Amt) by Product_ID (rows) and Month_ID (columns):
Product_ID   M1   M2   M3   ALL
P1
P2
P3
Total
How to create “Cube” in ROLAP using SQL
• For the table entries, without the totals
SELECT S.Month_Id, S.Product_Id,
SUM(S.Sales_Amt)
FROM Sales S
GROUP BY S.Month_Id, S.Product_Id;
• For the row totals
SELECT S.Product_Id, SUM(S.Sales_Amt)
FROM Sales S
GROUP BY S.Product_Id;
• For the column totals
SELECT S.Month_Id, SUM(S.Sales_Amt)
FROM Sales S
GROUP BY S.Month_Id;
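The three queries can be tried end to end with a small in-memory database. The sketch below uses Python's standard sqlite3 module. The sample rows are hypothetical (the slides do not list the contents of Sales), and the variable names are chosen only for illustration.

```python
import sqlite3

# Hypothetical sample rows: (Month_Id, Product_Id, Sales_Amt).
sample_rows = [
    ("M1", "P1", 10), ("M1", "P2", 20),
    ("M2", "P1", 30), ("M2", "P2", 40),
    ("M1", "P1", 5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales (Month_Id TEXT, Product_Id TEXT, Sales_Amt INTEGER)"
)
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?)", sample_rows)

# Query 1: the table entries, without the totals.
cells = conn.execute(
    "SELECT S.Month_Id, S.Product_Id, SUM(S.Sales_Amt) "
    "FROM Sales S GROUP BY S.Month_Id, S.Product_Id"
).fetchall()
cell_map = {(month, product): amount for month, product, amount in cells}

# Query 2: the row totals (one per product).
row_totals = dict(conn.execute(
    "SELECT S.Product_Id, SUM(S.Sales_Amt) FROM Sales S GROUP BY S.Product_Id"
).fetchall())

# Query 3: the column totals (one per month).
col_totals = dict(conn.execute(
    "SELECT S.Month_Id, SUM(S.Sales_Amt) FROM Sales S GROUP BY S.Month_Id"
).fetchall())
```

Together, cell_map, row_totals and col_totals hold every entry of the cross-tab except the grand total in the ALL/Total corner, which would need one more query without a GROUP BY.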
Problem With Simple Approach
• The number of required queries increases exponentially with the number of dimensions.
– It's wasteful to compute all the queries separately [more time and space required to store].
– In the example, the first query can do most of the work of the other two queries.
– If we could save that result and aggregate over Month_Id and Product_Id, we could compute the other queries more efficiently.
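The reuse idea can be sketched in a few lines: once the result of the first query is saved, the row and column totals follow from one pass over it, without scanning the Sales table again. The saved values below are hypothetical and the names are illustrative only.

```python
from collections import defaultdict

# Hypothetical saved result of the first query:
# (Month_Id, Product_Id) -> SUM(Sales_Amt)
saved_cells = {
    ("M1", "P1"): 15, ("M1", "P2"): 20,
    ("M2", "P1"): 30, ("M2", "P2"): 40,
}

row_totals = defaultdict(int)  # per product, aggregated over Month_Id
col_totals = defaultdict(int)  # per month, aggregated over Product_Id
for (month, product), amount in saved_cells.items():
    row_totals[product] += amount
    col_totals[month] += amount

# The grand total also comes from the saved cells.
grand_total = sum(saved_cells.values())
```

This works because SUM is distributive: a sum of partial sums equals the sum over the detail rows. Aggregates such as AVG would need the partial sums and counts to be saved instead.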
CUBE Clause
• The CUBE clause is part of SQL:1999
– GROUP BY CUBE (v1, v2, …, vn)
– Equivalent to a collection of GROUP BYs, one for each of the subsets of v1, v2, …, vn
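The "one GROUP BY per subset" equivalence can be illustrated by emulating CUBE by hand. The sketch below is an emulation, not the CUBE clause itself: it uses Python's sqlite3 module, and SQLite has no GROUP BY CUBE, so each subset is issued as its own query. The helper names cube_groupings and emulate_cube and the sample rows are hypothetical.

```python
import sqlite3
from itertools import combinations

def cube_groupings(columns):
    """Every subset of the columns, from all of them down to the empty set."""
    return [subset
            for size in range(len(columns), -1, -1)
            for subset in combinations(columns, size)]

def emulate_cube(conn, table, columns, measure):
    """Run one GROUP BY per subset; None marks a column aggregated away (ALL).

    Identifiers are pasted into the SQL text, so they must be trusted names.
    """
    results = []
    for subset in cube_groupings(columns):
        select_list = ", ".join(subset + (f"SUM({measure})",))
        sql = f"SELECT {select_list} FROM {table}"
        if subset:
            sql += " GROUP BY " + ", ".join(subset)
        for row in conn.execute(sql):
            keys = dict(zip(subset, row))
            results.append(tuple(keys.get(c) for c in columns) + (row[-1],))
    return results

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales (Month_Id TEXT, Product_Id TEXT, Sales_Amt INTEGER)"
)
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [("M1", "P1", 15), ("M1", "P2", 20), ("M2", "P1", 30), ("M2", "P2", 40)],
)
cube_rows = emulate_cube(conn, "Sales", ("Month_Id", "Product_Id"), "Sales_Amt")
```

For n grouping columns the helper issues 2^n queries, which is the exponential growth the CUBE clause lets the database engine handle in a single statement.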
ROLAP & Space Requirement
If one is not careful with aggregation, the number of summary tables gets very large as the number of dimensions increases.
Consider the example discussed earlier, with the following two dimensions on the fact table:
Time: Day, Week, Month, Quarter, Year, All Days
Product: Item, Sub-Category, Category, All Products
EXAMPLE: ROLAP & Space Requirement
A naïve implementation will require a summary table for every combination of aggregation levels across the dimensions.
The 6 Time levels and 4 Product levels give 6 x 4 = 24 summary tables; adding geography results in 120 tables.
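The table counts follow from multiplying the number of aggregation levels in each dimension. In the sketch below, the Time and Product levels are the ones listed earlier. The geography level count of 5 is an assumption: the slides only give the total of 120, which implies 120 / 24 = 5 levels.

```python
from math import prod

time_levels = ["Day", "Week", "Month", "Quarter", "Year", "All Days"]
product_levels = ["Item", "Sub-Category", "Category", "All Products"]

def summary_table_count(*level_counts):
    """One summary table per combination of one level from each dimension."""
    return prod(level_counts)

two_dims = summary_table_count(len(time_levels), len(product_levels))

# Assumption: 5 geography levels, the count implied by the total of 120.
geography_level_count = 5
three_dims = summary_table_count(
    len(time_levels), len(product_levels), geography_level_count
)
```

Each extra dimension multiplies the count by its number of levels, which is why the naïve approach does not scale.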
HOLAP
• The target is to get the best of both worlds.
• HOLAP (Hybrid OLAP) allows pre-built MOLAP cubes to co-exist alongside relational OLAP (ROLAP) structures.
• The environment can query from both.
DOLAP
[Figure: the cube resides on the remote server; a subset of the cube is transferred to the local machine/server]
Dimensional Modeling (DM)
The need for ER modeling?
• Problems with early COBOLian data processing systems.
• Collection of data
• Data redundancies
• From flat file to table: each entity ultimately becomes a table in the physical schema.
• Simple O(n²) join to work with tables
Why ER Modeling has been so successful?
– Coupled with normalization, it drives all the redundancy out of the database.
– Change (or add or delete) the data at just one point.
– Can be used with indexing for very fast access.
– Resulted in the success of OLTP systems.
Need for DM: Un-answered Qs
• Let's have a look at a typical ER data model first.
• Some observations:
– All tables look alike; as a consequence it is difficult to identify:
• Which table is more important?
• Which is the largest?
• Which tables contain numerical measurements of the business?
• Which tables contain nearly static descriptive attributes? [dimension info]
Need for DM: Complexity of Representation
– Many topologies for the same ER diagram, all appearing different.
• Very hard to visualize and remember.
[Figure: two differently arranged layouts of the same ER diagram, with tables numbered 1 to 12]
• A large number of possible connections between any two (or more) tables
Need for DM: The Paradox
• The Paradox: trying to make information accessible using tables resulted in an inability to query them!
• ER and normalization result in a large number of tables, which are:
– Hard to understand by the users (DB programmers): ERP systems span many tables
– Hard to navigate optimally by DBMS software
• The real value of ER is in using tables individually or in pairs [good performance with one table, or few tables in a join]
• Too complex for queries that span multiple tables with a large number of records
ER vs. DM
ER | DM
Constituted to optimize OLTP performance. | Constituted to optimize DSS query performance.
Models the micro/detail relationships among data elements. | Models the macro relationships among data elements with an overall deterministic strategy.
A wild variability of the structure of ER models. | All dimensions serve as equal entry points to the fact table.
Very vulnerable to changes in the user's querying habits, because such schemas are asymmetrical. | Changes in users' querying habits can be accommodated by automatic SQL generators.
How to simplify an ER data model?
• Bring it to DSS
• Two general methods:
– De-Normalization
– Dimensional Modeling (DM)
What is DM?…
• A simpler logical model optimized for decision support.
• Inherently dimensional in nature [fact + dimension], with a single central fact table and a set of smaller dimensional tables.
• Multi-part key for the fact table (long in terms of data; contains numerical data such as how many items were sold, what revenue we get from the sales, and how much sale we need; + single-column primary key).
• Dimensional tables with a single-part PK (one or more tables, but small; single-column key; information regarding the time, geography, and product dimensions).
• Keys are usually system generated (FK between fact and dimensional tables).
• Changes in business: a maintenance issue.
What is DM?...
• Results in a star-like structure, called a star schema or star join: the fact table in the center and the dimensions around it.
– All relationships are mandatory many-to-one (M-1).
– Single path between any two levels [fact vs. dimensional table].
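A minimal star schema and star join might look like the following sketch, run through Python's sqlite3 module. All table and column names (time_dim, product_dim, geo_dim, sales_fact and their keys) are hypothetical, as is the category value. The two fact rows reuse the M1/P1/Z1 and M2/P2/Z1 figures from the earlier fact table slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: small, each with a single-part, system-generated key.
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, item TEXT, category TEXT);
CREATE TABLE geo_dim     (geo_key INTEGER PRIMARY KEY, zone TEXT);

-- Fact table: multi-part key built from the dimension keys (mandatory M-1).
CREATE TABLE sales_fact (
    time_key    INTEGER NOT NULL REFERENCES time_dim(time_key),
    product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
    geo_key     INTEGER NOT NULL REFERENCES geo_dim(geo_key),
    sale_k_rs   INTEGER,
    PRIMARY KEY (time_key, product_key, geo_key)
);

INSERT INTO time_dim VALUES (1, 'M1'), (2, 'M2');
INSERT INTO product_dim VALUES (1, 'P1', 'Books'), (2, 'P2', 'Books');
INSERT INTO geo_dim VALUES (1, 'Z1');
INSERT INTO sales_fact VALUES (1, 1, 1, 250), (2, 2, 1, 500);
""")

# Star join: the fact table joins directly to each dimension, one hop each.
star_join = """
SELECT t.month, p.item, g.zone, SUM(f.sale_k_rs)
FROM sales_fact f
JOIN time_dim t    ON t.time_key = f.time_key
JOIN product_dim p ON p.product_key = f.product_key
JOIN geo_dim g     ON g.geo_key = f.geo_key
GROUP BY t.month, p.item, g.zone
ORDER BY t.month
"""
result = conn.execute(star_join).fetchall()
```

Every dimension is reached through exactly one join from the fact table, which is the "single path" property noted above.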
Dimensions have Hierarchies
Items
  Books
    Fiction
    Text
      Engg
      Medical
  Clothes
    Men
    Women
Analysts tend to look at the data through a dimension at a particular “level” in the hierarchy.
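Looking at data "at a level" amounts to rolling leaf values up to their ancestor at that level. The sketch below assumes the hierarchy reads Items > Books/Clothes > Fiction/Text/Men/Women, with Engg and Medical placed under Text (an interpretation of the slide's figure). The sales figures are hypothetical.

```python
from collections import defaultdict

# Child -> parent links; placing Engg and Medical under Text is an assumption.
parent = {
    "Books": "Items", "Clothes": "Items",
    "Fiction": "Books", "Text": "Books",
    "Men": "Clothes", "Women": "Clothes",
    "Engg": "Text", "Medical": "Text",
}

def ancestors(node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def roll_up(leaf_sales, level):
    """Sum leaf sales at a hierarchy level (0 = root, 1 = its children, ...).

    A leaf that sits above the requested level is reported as itself.
    """
    totals = defaultdict(int)
    for leaf, amount in leaf_sales.items():
        path = ancestors(leaf)  # leaf first, root last
        at_level = path[len(path) - 1 - level] if level < len(path) else leaf
        totals[at_level] += amount
    return dict(totals)

# Hypothetical sales recorded at the leaves of the hierarchy.
leaf_sales = {"Fiction": 10, "Engg": 20, "Medical": 30, "Men": 40, "Women": 50}

by_category = roll_up(leaf_sales, 1)
by_subcategory = roll_up(leaf_sales, 2)
```

Moving from level 2 to level 1 is a roll-up; going the other way is a drill-down over the same leaf data.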
The two Schemas
Star
Snow-flake
“Simplified” 3NF (Retail)
[Figure: "simplified" 3NF retail schema; the tables below are linked by 1-M relationships]
Geography: zone (ZONE, CITY), district (CITY, DISTRICT), division (DISTRICT, DIVISION), a DIVISION-PROVINCE table, store (STORE #, STREET, ZONE, ...)
Time: week (DATE, WEEK), month (WEEK, MONTH), quarter (MONTH, QTR), year (QTR, YEAR)
Sales: sale_header (RECEIPT #, STORE #, DATE, ...), sale_detail (RECEIPT #, ITEM #, ..., $)
Product: item_x_cat (ITEM #, CATEGORY), item_x_splir (ITEM #, SUPPLIER), cat_x_dept (CATEGORY, DEPT)
Vastly Simplified Star Schema
Fact Table: RECEIPT#, STORE#, ITEM#, DATE, ..., facts (Sale Rs.)
Product Dim: ITEM#, CATEGORY, DEPT, SUPPLIER
Geography Dim: STORE#, ZONE, CITY, DISTRICT, DIVISION, PROVINCE
Time Dim: DATE, WEEK, MONTH, QUARTER, YEAR
Each dimension table has a 1-M relationship with the fact table.
The Benefit of Simplicity
Beauty lies in close correspondence with the business, evident even to business users [means simplicity].
Features of Star Schema
Dimensional hierarchies are collapsed into a single table for each dimension. Loss of information? Relationships are lost.
A single fact table is created with a single header from the detail records, resulting in:
– A vastly simplified physical data model!
– Fewer tables (there are thousands of tables in some ERP systems).
– Fewer joins, resulting in high performance.
– Some requirement of additional space.
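Collapsing a hierarchy can be sketched as a one-off join that materializes all levels into a single dimension table. The example below uses Python's sqlite3 module. The table names item_x_cat and cat_x_dept follow the 3NF retail figure, while product_dim, the column names and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized product hierarchy: one table per level.
CREATE TABLE item_x_cat (item TEXT PRIMARY KEY, category TEXT);
CREATE TABLE cat_x_dept (category TEXT PRIMARY KEY, dept TEXT);
INSERT INTO item_x_cat VALUES ('I1', 'C1'), ('I2', 'C1'), ('I3', 'C2');
INSERT INTO cat_x_dept VALUES ('C1', 'D1'), ('C2', 'D1');

-- Collapse the hierarchy into a single table for the dimension.
CREATE TABLE product_dim AS
SELECT ic.item, ic.category, cd.dept
FROM item_x_cat ic
JOIN cat_x_dept cd ON cd.category = ic.category;
""")

product_dim = conn.execute(
    "SELECT item, category, dept FROM product_dim ORDER BY item"
).fetchall()

# The department is now stored once per item instead of once per category.
dept_values_normalized = conn.execute(
    "SELECT COUNT(dept) FROM cat_x_dept"
).fetchone()[0]
dept_values_collapsed = conn.execute(
    "SELECT COUNT(dept) FROM product_dim"
).fetchone()[0]
```

Queries against product_dim need no join to reach the category or department, at the cost of repeating those values on every item row, which is the additional space noted in the list above.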

Quantifying space requirement
Quantifying use of additional space using star schema
There are about 10 million mobile phone users in Pakistan.
Say the top company has half of them = 5 million
Number of days in 1 year = 365
Number of calls recorded each day = 250 million (assumed)
Maximum number of records in fact table = 365 x 250 million ≈ 91 billion rows
Assuming a relatively small header size = 128 bytes
Fact table storage used ≈ 11.6 terabytes
Average length of city name = 8 characters ≈ 8 bytes
Total number of cities with telephone access = 170 (a city code fits in 1 byte)
Space used for city name in fact table using star = 8 x 0.091 = 0.728 TB
Space used for city code using snow-flake = 1 x 0.091 = 0.091 TB
Additional space used ≈ 0.637 TB
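The arithmetic can be checked with a short script. It assumes 250 million calls per day, which is the daily volume that yields the slide's figure of about 91 billion rows over 365 days, and it takes 1 TB as 10^12 bytes. The constant names are illustrative.

```python
DAYS_PER_YEAR = 365
CALLS_PER_DAY = 250_000_000  # assumption: the volume that gives ~91 billion rows
HEADER_BYTES = 128           # "relatively small" record size from the slide
CITY_NAME_BYTES = 8          # average city name, stored per row in the star schema
CITY_CODE_BYTES = 1          # 170 cities fit in one byte (170 <= 256)
TB = 10**12                  # decimal terabyte

fact_rows = DAYS_PER_YEAR * CALLS_PER_DAY
fact_table_tb = fact_rows * HEADER_BYTES / TB

star_city_tb = fact_rows * CITY_NAME_BYTES / TB
snowflake_city_tb = fact_rows * CITY_CODE_BYTES / TB
additional_tb = star_city_tb - snowflake_city_tb

print(f"rows: {fact_rows:,}")
print(f"fact table: {fact_table_tb:.2f} TB")
print(f"extra space for city names: {additional_tb:.3f} TB")
```

The script works with the unrounded row count, so its extra-space figure comes out slightly above the 0.637 TB obtained after rounding the rows to 0.091 trillion. Either way, the extra space is a small fraction of the fact table itself.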