
DATA WAREHOUSING AND DATA MINING

10BI23

Instructor: Dr. S. Natarajan
Professor and Key Resource Person
Department of Information Science and Engineering
PES Institute of Technology
Bangalore
Part 2

Aggregations: SQL and Aggregation
Aggregation functions and grouping

"The single most dramatic way to affect performance in a large data warehouse
is to provide a proper set of aggregate (summary) records ... in some cases
speeding queries by a factor of 100 or even 1,000. No other means exist to
harvest such spectacular gains."

- Ralph Kimball
Aggregates
• Pre-computed aggregates
– Materialized views
– Aggregate navigation
– Dimension and fact aggregates
• Selection of aggregates
– Manual selection
– Greedy algorithm
– Limitations of greedy approach
Star Schema
 Single data (fact) table surrounded by multiple descriptive (dimension) tables

 [Figure: a central Fact table joined to several surrounding Dim tables]
Star Schema for Sales
[Figure: a sales fact table surrounded by its dimension tables]
Star Schema Representation
 Fact and Dimensions are represented by physical
tables in the data warehouse database
 Each fact table is related to each of its dimension tables in
a many-to-one relationship (primary key / foreign key
relationships)
 A fact table is related to many dimension tables
– The primary key of the fact table is a composite of the
foreign keys to the dimension tables
 Each fact table is designed to answer a specific
DSS question
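To make the physical layout concrete, here is a minimal star-schema DDL sketch; the table and column names (DimDate, DimProduct, DimStore, SalesFact and their attributes) are invented for illustration, and a real design would carry many more attributes:

  -- Dimension tables: one surrogate primary key each, plus descriptive attributes
  CREATE TABLE DimDate    (DateKey    INTEGER PRIMARY KEY, FullDate DATE,
                           Month VARCHAR(10), Year INTEGER);
  CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, SKU VARCHAR(20),
                           Brand VARCHAR(30), Category VARCHAR(30));
  CREATE TABLE DimStore   (StoreKey   INTEGER PRIMARY KEY, StoreName VARCHAR(50),
                           District VARCHAR(30), State VARCHAR(20));

  -- Fact table: a composite key of foreign keys to the dimensions, plus measures
  CREATE TABLE SalesFact (
    DateKey    INTEGER REFERENCES DimDate(DateKey),
    ProductKey INTEGER REFERENCES DimProduct(ProductKey),
    StoreKey   INTEGER REFERENCES DimStore(StoreKey),
    DollarAmt  DECIMAL(12,2),
    Quantity   INTEGER,
    PRIMARY KEY (DateKey, ProductKey, StoreKey)  -- many-to-one to each dimension
  );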
Physical Database Design
• Logical database design
– What are the facts and dimensions?
– Goals: Simplicity, Expressiveness
– Make the database easy to understand
– Make queries easy to ask
• Physical database design
– How should the data be arranged on disk?
– Goal: Performance
• Manageability is an important secondary concern
– Make queries run fast
Load vs. Query Trade-off
• Trade-off between query performance and load
performance
• To make queries run fast:
– Precompute as much as possible
– Build lots of data structures
• Indexes
• Materialized views
• But…
– Data structures require disk space to store
– Building/updating data structures takes time
– More data structures → longer load time
Typical Storage Allocations
• Base data
– Fact tables and dimension tables
– Fact table space >> Dimension table space
• Indexes
– 100%-200% of base data
• Aggregates / Materialized Views
– 100% of base data
• Extra data structures 2-3 times size of base data
Why This Syntax?
 Abstract syntax
select <field list> <aggregate list>
from <table expression>
where <search condition>
group by [ cube | drill down] <aggregate list>
having <search condition>

 Allows functional aggregations (e.g., sales by quarter):


select store, quarter, sum(units)
from sales
where nation = 'Mexico' and year = 1994
group by drill down store, quarter(date) as quarter;
Group, Cube, Roll-Up Algebra
 The operators have the following properties:
 CUBE(ROLLUP) = CUBE
 ROLLUP (GROUP BY) = ROLLUP
 SQL extended GROUP BY operator
 GROUP BY [<aggregation list>]
[ROLLUP < aggregation list>]
[CUBE < aggregation list>]
 CUBE creates nested relations, that is,
relations can be values.
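As a sketch in standard SQL:1999 syntax (the sales table and its store and quarter columns are assumed for illustration), the extended GROUP BY operators are shorthand for unions of ordinary groupings:

  -- GROUP BY CUBE (store, quarter) is equivalent to grouping by every subset:
  SELECT store, quarter, SUM(units)
  FROM sales
  GROUP BY GROUPING SETS ((store, quarter), (store), (quarter), ());

  -- GROUP BY ROLLUP (store, quarter) keeps only the prefixes of the list:
  SELECT store, quarter, SUM(units)
  FROM sales
  GROUP BY GROUPING SETS ((store, quarter), (store), ());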
SQL: 1999
 SQL: 1999 defines a rich set of
aggregate functions
 New aggregate functions on
single attributes:
 stddev
 variance
 Binary aggregate functions
 correlations
 covariances
 regression curves
SQL: 1999
 Generalization of GROUP BY
construct:
 Cube
 Rollup
 Window Queries
 Top N Queries (Ranking)
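A small sketch of a ranking (Top-N) query using the window-function syntax most SQL:1999-era DBMSs support; it reuses the Sales/Times/Locations schema that appears later in these slides:

  -- The two best-selling states per year
  SELECT * FROM (
    SELECT T.year, L.state, SUM(S.sales) AS total_sales,
           RANK() OVER (PARTITION BY T.year
                        ORDER BY SUM(S.sales) DESC) AS rnk
    FROM Sales S, Times T, Locations L
    WHERE S.timeid = T.timeid AND S.locid = L.locid
    GROUP BY T.year, L.state
  ) ranked
  WHERE rnk <= 2;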



The MOLAP Cube

Fact table view:                     Multi-dimensional cube:

sale   prodId   storeId   amt                 s1    s2    s3
       p1       s1        12            p1    12          50
       p2       s1        11            p2    11     8
       p1       s3        50
       p2       s2         8

dimensions = 2
3-D Cube
Fact table view:                          Multi-dimensional cube:

sale   prodId   storeId   date   amt       day 2         s1    s2    s3
       p1       s1        1      12                 p1   44     4
       p2       s1        1      11                 p2
       p1       s3        1      50        day 1         s1    s2    s3
       p2       s2        1       8                 p1   12          50
       p1       s1        2      44                 p2   11     8
       p1       s2        2       4

dimensions = 3
Cube Aggregation: Roll-up
Example: computing sums

Base cube:
  day 2        s1    s2    s3          day 1        s1    s2    s3
          p1   44     4                        p1   12          50
          p2                                   p2   11     8

Roll-up over time (day 1 + day 2):     Roll-up over products (p1 + p2):
          s1    s2    s3                      sum    s1    s2    s3
     p1   56     4    50                             67    12    50
     p2   11     8

Roll-up over stores (s1 + s2 + s3):    p1 = 110,  p2 = 19;  grand total = 129

Moving to coarser summaries is a roll-up; moving back to finer detail is a drill-down.
Cube Operators for Roll-up
The same roll-ups can be written with an ALL (*) placeholder in the sale() notation:

  sale(s1, *, *)  =  67     all products, all days at store s1
  sale(s2, p2, *) =   8     product p2 at store s2, over all days
  sale(*, *, *)   = 129     grand total

  Roll-up over time:                   Roll-up over products:
          s1    s2    s3                      sum    s1    s2    s3
     p1   56     4    50                             67    12    50
     p2   11     8
                                       Roll-up over stores:  p1 = 110,  p2 = 19
Extended Cube

The extended cube adds an ALL (*) member to every dimension, so each plane carries
its own subtotal row and column:

  all days (*)       s1    s2    s3     *
                p1   56     4    50    110
                p2   11     8           19
                *    67    12    50    129

  day 1              s1    s2    s3     *
                p1   12          50     62
                p2   11     8           19
                *    23     8    50     81

  day 2              s1    s2    s3     *
                p1   44     4           48
                p2
                *    44     4           48

  For example, sale(*, p2, *) = 19 is the subtotal cell for product p2 over all
  stores and all days.
Aggregation Using Hierarchies

The Store dimension rolls up along the hierarchy store → region → country.

Base cube:
  day 2        s1    s2    s3          day 1        s1    s2    s3
          p1   44     4                        p1   12          50
          p2                                   p2   11     8

Rolled up to the region level (store s1 is in region A; stores s2 and s3 are in
region B):

               region A    region B
          p1      56          54
          p2      11           8
Slicing
A slice fixes one dimension value and keeps the remaining sub-cube.

Slice PRODUCT = p1:                    Slice TIME = day 1:
            s1    s2    s3                        s1    s2    s3
  day 2     44     4                        p1    12          50
  day 1     12          50                  p2    11     8
Cross Tabulation

 A table such as the product-by-store tables shown earlier is an example of a
cross-tabulation (cross-tab), also referred to as a pivot table.
 Values for one of the dimension attributes form the row
headers
 Values for another dimension attribute form the column
headers
 Other dimension attributes are listed on top
 Values in individual cells are (aggregates of) the values of the
dimension attributes that specify the cell.
Relational Representation of Cross-
tabs

 Cross-tabs can be represented as relations
 The value 'all' is used to represent aggregates
 The SQL:1999 standard actually uses null values in place of 'all', despite the
potential confusion with regular null values
Data Cube
 A data cube is a multidimensional generalization of a cross-tab
 Can have n dimensions; we show 3 below
 Cross-tabs can be used as views on a data cube
Cube Operator
Sales (fact table)                    Locations
pid   timeid   locid   sales          Locid   City      State   Country
11    1        1       25             1       Madison   WI      USA
11    2        1        8             2       Fresno    CA      USA
11    3        1       15             5       Chennai   TN      India
12    1        1       30
12    2        1       20             Products
12    3        1       50             Pid   Pname       Category     Price
13    1        1        8             11    Lee Jeans   Apparel      25
13    2        1       10             12    Zord        Toys         18
13    3        1       10             13    Biro Pen    Stationery    2
11    1        2       35
11    2        2       22             Times
11    3        2       10             Timeid   Date       Month   Year   Holiday
12    1        2       26             1        10/11/05   Nov     1995   N
12    2        2       45             2        11/11/05   Nov     1996   N
12    3        2       20             3        12/11/05   Nov     1997   N
13    1        2       20
13    2        2       40
13    3        2        5
Cube Operator
The goal is the following cross-tab of sales by year and state:

          WI     CA    Total
  1995    63     81     144
  1996    38    107     145
  1997    75     35     110
  Total  176    223     399

Filling it in with plain GROUP BY takes four separate queries:

  SELECT T.year, L.state, SUM(S.sales)
  FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY T.year, L.state

  SELECT T.year, SUM(S.sales)
  FROM Sales S, Times T
  WHERE S.timeid = T.timeid
  GROUP BY T.year

  SELECT L.state, SUM(S.sales)
  FROM Sales S, Locations L
  WHERE S.locid = L.locid
  GROUP BY L.state

  SELECT SUM(S.sales)
  FROM Sales S, Locations L
  WHERE S.locid = L.locid
  -- or, equivalently, FROM Sales S, Times T WHERE S.timeid = T.timeid

How many such SQL queries are needed to build a cross-tab?


Cube Operator
  SELECT T.year, L.state, SUM(S.sales)
  FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY CUBE (T.year, L.state)

  Result:                              Equivalent cross-tab:
  T.Year   L.State   SUM(sales)                  WI     CA    Total
  1995     WI         63                 1995    63     81     144
  1995     CA         81                 1996    38    107     145
  1995     All       144                 1997    75     35     110
  1996     WI         38                 Total  176    223     399
  1996     CA        107
  1996     All       145
  1997     WI         75
  1997     CA         35
  1997     All       110
  All      WI        176
  All      CA        223
  All      All       399

A GROUP BY clause with the CUBE keyword is equivalent to a collection of ordinary
GROUP BY statements, one for each subset of the k grouping attributes (2^k
groupings in total).


Rollup Operator
  SELECT T.year, L.state, SUM(S.sales)
  FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY ROLLUP (T.year, L.state)

  Result:                              Cross-tab for comparison:
  T.Year   L.State   SUM(sales)                  WI     CA    Total
  1995     WI         63                 1995    63     81     144
  1995     CA         81                 1996    38    107     145
  1995     All       144                 1997    75     35     110
  1996     WI         38                 Total  176    223     399
  1996     CA        107
  1996     All       145
  1997     WI         75
  1997     CA         35
  1997     All       110
  All      All       399

ROLLUP produces only the subtotals implied by the listed order (per-year totals
and the grand total); the per-state "All years" rows that CUBE would add are
omitted.




Rollup Operator
  SELECT T.year, L.state, SUM(S.sales)
  FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY ROLLUP (L.state, T.year)

  Result:                              Cross-tab for comparison:
  T.Year   L.State   SUM(sales)                  WI     CA    Total
  1995     WI         63                 1995    63     81     144
  1996     WI         38                 1996    38    107     145
  1997     WI         75                 1997    75     35     110
  All      WI        176                 Total  176    223     399
  1995     CA         81
  1996     CA        107
  1997     CA         35
  All      CA        223
  All      All       399

Reversing the argument order changes which subtotals ROLLUP produces: here we get
per-state totals instead of per-year totals.




Example
Dimensions: Time, Product, Store
Attributes: Product (descr, price, ...), Store (...), ...
Hierarchies:
  Product → Brand → ...
  Day → Week → Quarter
  Store → Region → Country

[Figure: a 3-D cube with Product (Juice, Milk, Coke, Cream, Soap, Bread), Store
(S1, S2, S3) and Time (M, T, W, Th, F, S, S) axes. The cell (Bread, S1, M) holds
56, i.e., 56 units of bread were sold in S1 on Monday. Arrows indicate roll-up to
brand, roll-up to region, and roll-up to week.]
Twelve rules for evaluating OLAP products
(E. F. Codd)
• 1. Multidimensional Conceptual View
• 2. Transparency
• 3. Accessibility
• 4. Consistent Reporting Performance
• 5. Client-Server Architecture
• 6. Generic Dimensionality
• 7. Dynamic Sparse Matrix Handling
• 8. Multi-User Support
• 9. Unrestricted Cross-dimensional Operations
• 10. Intuitive Data Manipulation
• 11. Flexible Reporting
• 12. Unlimited Dimensions and Aggregation Levels
Materialized Views
• Many DBMSs support materialized views
– Precomputed result of a particular query
– Goal: faster response for related queries
• Example:
– View definition:
SELECT State, SUM(Quantity)
FROM Sales GROUP BY State
– Query:
SELECT SUM(Quantity)
FROM Sales
WHERE State = 'CA'
– Scan view rather than Sales table
• View matching problem
– When can a query be re-written using a materialized view?
– Difficult to solve in its full generality
– DBMSs handle common cases via limited set of re-write rules
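A minimal sketch of the same example using materialized-view DDL (the exact syntax and refresh/rewrite options vary by DBMS; the form below follows Oracle/PostgreSQL style):

  -- Precompute the aggregate once and store it
  CREATE MATERIALIZED VIEW StateSales AS
  SELECT State, SUM(Quantity) AS TotalQty
  FROM Sales
  GROUP BY State;

  -- The user still writes the query against Sales:
  SELECT SUM(Quantity) FROM Sales WHERE State = 'CA';

  -- A rewrite-capable optimizer can answer it from the much smaller view,
  -- effectively running:
  SELECT TotalQty FROM StateSales WHERE State = 'CA';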
Cross Tab Report

Example: simple cross-tabular report with subtotals

The table below is a sample cross-tabular report showing the total sales by
country_id and channel_desc for the US and UK through the Internet and direct
sales channels in September 2000.

  Channel         Country
                  UK          US          Total
  Direct Sales    1,378,126   2,835,557   4,213,683
  Internet          911,739   1,732,240   2,643,979
  Total           2,289,865   4,567,797   6,857,662

The query requests SUM(amount_sold) grouped by CUBE(channel_desc, country_id):
SELECT channel_desc, country_id,
TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels
WHERE sales.time_id=times.time_id AND
sales.cust_id=customers.cust_id AND
sales.channel_id= channels.channel_id AND
channels.channel_desc IN ('Direct Sales', 'Internet') AND
times.calendar_month_desc='2000-09'
AND country_id IN ('UK', 'US')
GROUP BY CUBE(channel_desc, country_id);
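In the CUBE output it can be hard to tell a stored NULL from a subtotal row; the standard GROUPING() function (available in Oracle and most other DBMSs) distinguishes them. A sketch of the same query with the extra flags:

  SELECT channel_desc, country_id,
         GROUPING(channel_desc) AS g_channel,  -- 1 on rows aggregated over all channels
         GROUPING(country_id)   AS g_country,  -- 1 on rows aggregated over all countries
         TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
  FROM sales, customers, times, channels
  WHERE sales.time_id = times.time_id AND
        sales.cust_id = customers.cust_id AND
        sales.channel_id = channels.channel_id AND
        channels.channel_desc IN ('Direct Sales', 'Internet') AND
        times.calendar_month_desc = '2000-09' AND
        country_id IN ('UK', 'US')
  GROUP BY CUBE(channel_desc, country_id);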
CUBE and Roll Up Operators
[Figure: "The Data Cube and the Sub-Space Aggregates" (after Gray et al.). A plain
GROUP BY on Color produces one aggregate per color plus a total; adding Make gives
a cross-tab (Chevy, Ford by RED, WHITE, BLUE) with row and column sums; adding
Year (1994, 1995) gives the full 3-D data cube, whose sub-space aggregates are the
sums by Make, by Year, by Color, by Make & Year, by Make & Color, by Color & Year,
and the grand total.]
Sub-cube Derivation
  <M,Y,C>
  <M,Y,*>    <M,*,C>    <*,Y,C>
  <M,*,*>    <*,Y,*>    <*,*,C>
  <*,*,*>

 Each sub-cube is derived from a parent by collapsing (aggregating away) one more
dimension; * denotes ALL.
Cube Operator Example
SALES
Model   Year   Color   Sales
Chevy   1990   red       5
Chevy   1990   white    87
Chevy   1990   blue     62
Chevy   1991   red      54
Chevy   1991   white    95
Chevy   1991   blue     49
Chevy   1992   red      31
Chevy   1992   white    54
Chevy   1992   blue     71
Ford    1990   red      64
Ford    1990   white    62
Ford    1990   blue     63
Ford    1991   red      52
Ford    1991   white     9
Ford    1991   blue     55
Ford    1992   red      27
Ford    1992   white    62
Ford    1992   blue     39

Applying CUBE to SALES adds the following aggregate rows (the DATA CUBE):
Model   Year   Color   Sales
ALL     ALL    ALL     942
chevy   ALL    ALL     510
ford    ALL    ALL     432
ALL     1990   ALL     343
ALL     1991   ALL     314
ALL     1992   ALL     285
ALL     ALL    red     165
ALL     ALL    white   273
ALL     ALL    blue    339
chevy   1990   ALL     154
chevy   1991   ALL     199
chevy   1992   ALL     157
ford    1990   ALL     189
ford    1991   ALL     116
ford    1992   ALL     128
chevy   ALL    red      91
chevy   ALL    white   236
chevy   ALL    blue    183
ford    ALL    red     144
ford    ALL    white   133
ford    ALL    blue    156
ALL     1990   red      69
ALL     1990   white   149
ALL     1990   blue    125
ALL     1991   red     107
ALL     1991   white   104
ALL     1991   blue    104
ALL     1992   red      59
ALL     1992   white   116
ALL     1992   blue    110
User Defined Aggregate Function
Generalized For Cubes
 Aggregates have graduated difficulty:
   Distributive: the cube can be computed from the next lower dimension's
    values (count, min, max, ...)
   Algebraic: the cube can be computed from the next lower dimension's
    scratchpads (average, ...)
   Holistic: need the base data (median, mode, rank, ...)

 Distributive and algebraic functions have a simple and efficient algorithm:
build the higher dimensions from the core.
 Holistic computation seems to require multiple passes;
 real systems use sampling to estimate them

 (e.g., sample to find median, quartile boundaries)
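A sketch of why algebraic functions need only small scratchpads: an average at a coarser grain can be rebuilt from per-group sums and counts kept at the finer grain. It reuses the hypothetical SalesFact/DimDate tables sketched earlier:

  -- The finer-grained aggregate keeps the scratchpad (sum, count), not just the AVG
  CREATE TABLE DailyProductSales AS
  SELECT ProductKey, DateKey, SUM(DollarAmt) AS sum_amt, COUNT(*) AS row_cnt
  FROM SalesFact
  GROUP BY ProductKey, DateKey;

  -- The monthly average is recomputed from the scratchpads alone (no base data)
  SELECT D.Year, D.Month, S.ProductKey,
         SUM(S.sum_amt) / SUM(S.row_cnt) AS avg_amt
  FROM DailyProductSales S, DimDate D
  WHERE S.DateKey = D.DateKey
  GROUP BY D.Year, D.Month, S.ProductKey;

  -- A holistic function such as a median cannot be rebuilt this way;
  -- it needs the base rows (or sampling, as noted above).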


Interesting Aggregate
Functions
 From RedBrick systems
 Rank (in sorted order)
 N-Tile (histograms)
 Running average (cumulative functions)
 Windowed running average
 Percent of total
 Users want to define their own aggregate
functions
 statistics
 domain specific
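Most of these are expressible today with standard SQL window functions; a sketch over a hypothetical MonthlySales(month, sales) table:

  SELECT month, sales,
         RANK() OVER (ORDER BY sales DESC)                      AS sales_rank,    -- rank in sorted order
         NTILE(4) OVER (ORDER BY sales)                         AS quartile,      -- n-tile / histogram bucket
         AVG(sales) OVER (ORDER BY month
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg,   -- cumulative average
         AVG(sales) OVER (ORDER BY month
              ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)         AS moving_avg_3m, -- windowed running average
         100.0 * sales / SUM(sales) OVER ()                     AS pct_of_total   -- percent of total
  FROM MonthlySales;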
Aggregate Tables
• Also known as summary tables
• Common form of precomputation in data
warehouses
• Reduce dimensionality of fact by
aggregating across some dimensions
– Result is smaller version of fact table
• Can be used to answer queries that refer
only to dimensions that are retained
– Similar to the idea of a covering index
Aggregate Table Example
• Sales Fact table
– Dimensions (Date, Product, Store, Promotion, Transaction ID)
– Measurements DollarAmt, Quantity
• Create aggregate table with (Date, Store)
– SELECT DateKey, StoreKey, SUM(DollarAmt),
SUM(Quantity)
FROM Sales
GROUP BY DateKey, StoreKey
– Store the result in Sales2 table
• Queries that only reference Date and Store attributes
can use the aggregate table instead
– SELECT Store.District, SUM(Quantity)
FROM Sales, Store, Date
WHERE Sales.Date_key = Date.Date_key
AND Sales.Store_key = Store.Store_key
AND Date.Month = 'September 2004'
GROUP BY Store.District
– Replace “Sales” by “Sales2” → Same query result!
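As a sketch, the rewritten query (assuming the Sales2 aggregate table keeps the key columns and stores its sums under the same measure names used in the query above):

  -- Same result, but scans the much smaller aggregate table
  SELECT Store.District, SUM(Sales2.Quantity)
  FROM Sales2, Store, Date
  WHERE Sales2.Date_key = Date.Date_key
    AND Sales2.Store_key = Store.Store_key
    AND Date.Month = 'September 2004'
  GROUP BY Store.District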
Aggregate Tables vs. Indexes
• Idea behind covering fact index:
– Thinner version of fact table
– Index takes up less space than fact table
– Fewer I/Os required to scan it
• Idea behind aggregate table:
– Thinner and shorter version of fact table
– Aggregate table takes up much less space than fact table
– Fewer I/Os required to scan it
• Aggregate table has fewer rows
– Index has 1 index entry per fact table row
• Regardless of how many columns are in the index
– Aggregate table has 1 row per unique combination of dimensions
• Often many fewer rows compared to the fact table!
• Index supports efficient lookup on leading terms
– Useful when filters are selective
– Avoid scanning rows that will be filtered out
– Can build indexes on aggregate tables, too!
Aggregate Navigation
• Two techniques to manage aggregates
– Let the database management system do it
– Do it yourself
• Let the database do it
– Use materialized view capabilities built in to the DBMS
– Query re-writing happens automatically
• Do it yourself
– Create and populate additional tables
– Perform explicit query re-write (aggregate navigation)
• Pros and Cons of Do-It-Yourself
– Pros:
• More flexibility & re-writing power
• Better load performance (possibly)
– Cons:
• More tables to manage
• Load becomes more complex
• Need to write aggregate navigation code
• Clients need to use aggregate navigator
Aggregate Navigation
• Clients are unaware of aggregates
• Client queries reference base-level fact and dimension tables
• [Figure: the warehouse client sends ordinary SQL queries to the Aggregate
  Navigator, which rewrites them into aggregate-aware queries against the data
  warehouse. Aggregate metadata tells the navigator which aggregates exist and
  how large each one is.]
Dimension Aggregates
• Methods to define aggregates
– 1. Include or leave out entire dimensions
– 2. Include some columns of each dimension, but not others
– Second approach is more common / more useful
• Example: Several versions of Date dimension
– Base Date dimension
• 1 row per day
• Includes all Date attributes
– Monthly aggregate dimension
• 1 row per month
• Includes Date attributes at month level or higher
– Yearly aggregate dimension
• 1 row per year
• Includes only year-level Date attributes
• Each dimension aggregate has its own set of surrogate keys
• Each aggregated fact joins to 1 version of the Date dimension
– Or else the Date dimension is omitted entirely…
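A sketch of how the monthly aggregate Date dimension might be derived from the base Date dimension (names are illustrative; CREATE TABLE ... AS SELECT syntax varies slightly across DBMSs, and a production build would manage surrogate keys more carefully):

  -- One row per month, keeping only month-level-or-higher attributes
  CREATE TABLE DimMonth AS
  SELECT ROW_NUMBER() OVER (ORDER BY MIN(FullDate)) AS MonthKey,  -- new surrogate key
         Month, Year
  FROM DimDate
  GROUP BY Month, Year;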
Choosing Dimension Aggregates
• Dimension aggregates often roll up along hierarchies
– Day – month – year
– SKU – brand – category – department
– Store – city – county – state – country
• Any subset of dimension attributes can be used
– Promotion dimension includes attributes for coupon, ad,
discount, and end-of-aisle display
– Promotion aggregate might include only ad-related attributes
– Customer aggregate might include only a few frequently-queried
columns (age, gender, income, marital status)
• Goal: reduced number of distinct combinations
– Results in fewer rows in aggregated fact
– Customer aggregate that included SSN would be pointless
Aggregate Examples
• Sales fact table
– Date, Product, Store, Promotion, Transaction ID
• Date aggregates
– Month
– Year
• Product aggregates
– Brand
– Manufacturer
• Promotion aggregates
– Ad
– Discount
– Coupon
– In-Store Display
• Store aggregates
– District
– State
Choosing Fact Aggregates
• Aggregated fact table includes:
– Foreign keys to dimension aggregate tables
– Aggregated measurement columns
• For example, Quantity column holds SUM(Quantity)
• Need to choose which version of each dimension to use
• Transaction ID will never be included
– Aggregates with degenerate dimensions are rare
• Number of possible fact aggregates: 4 * 4 * 6 * 4 = 384
– Dimension with n aggregates → n+2 possibilities
• Include base dimension
• Omit dimension entirely
• n dimension aggregates to choose from
– Constructing fact aggregates for all combinations would be
impractical
– How to decide which combinations to construct?
Aggregate Selection
• Two approaches
– Manual selection
• Data warehouse designer chooses aggregates
• Could be time-consuming and error-prone
– Automatic selection
• Use algorithms to optimize choice of aggregates
• Good in principle; however, problem is hard
• Heuristics for manual selection
– Include a mixture of breadth and depth
• A few broad aggregates
– Lots of dimensions at fairly fine-grained level of detail
– Achieve moderate speedup on a wide range of queries
• Lots of highly targeted aggregates with only a few rows each
– Dimensions are highly rolled up or omitted entirely
– Each aggregate achieves large speedup on a small class of queries
• Roughly equal allocation of space to each type
– Consider the query workload
• What attributes are queried most frequently?
• What sets of attributes are often queried together?
• Make sure that common / important queries run fast
• Aggregate design can be adjusted over time (learn from experience)
Constructing Aggregates
• Constructing dimension aggregates
– Determine attributes in aggregate
– Generate unique combinations of those attributes by SELECT
DISTINCT from dimension table
– Assign surrogate keys to aggregate table rows
• Constructing fact aggregates
– Build mapping table that maps dimension keys to dimension
aggregate keys, for each dimension
– Join fact table to mapping tables and group by aggregate keys
• Constructing aggregates from other aggregates
– Fact aggregate on (Product, Year) can be built from (Product,
Month) aggregate
– Faster than using base table
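A sketch of these steps in SQL, continuing the hypothetical SalesFact/DimDate/DimMonth schema, with the Date grain rolled up to Month:

  -- 1. Mapping table from each base Date key to its Month-aggregate key
  CREATE TABLE DateToMonth AS
  SELECT D.DateKey, M.MonthKey
  FROM DimDate D, DimMonth M
  WHERE D.Month = M.Month AND D.Year = M.Year;

  -- 2. Join the fact table to the mapping table and group by the aggregate keys
  CREATE TABLE SalesByMonthStore AS
  SELECT DM.MonthKey, F.StoreKey,
         SUM(F.DollarAmt) AS DollarAmt, SUM(F.Quantity) AS Quantity
  FROM SalesFact F, DateToMonth DM
  WHERE F.DateKey = DM.DateKey
  GROUP BY DM.MonthKey, F.StoreKey;

  -- 3. A (Product, Year) aggregate could likewise be built from a (Product, Month)
  --    aggregate instead of from SalesFact, which is faster.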
Data Cube Lattice
                 (State, Month, Color)
   (State, Month)   (State, Color)   (Month, Color)
       (State)          (Month)          (Color)
                        (Total)

  Moving toward Total is roll-up; moving toward (State, Month, Color) is
  drill-down.
Sparsity Revisited
• Fact tables are usually sparse
– Not all possible combinations of dimension values actually occur
– E.g. not all products sell in all stores on all days
• Aggregate tables are not as sparse
– Coarser grain → lesser sparsity
– All products sell in SOME store in all MONTHS
– Thus space savings from aggregation can be less than one
might think
• Example: Date dimension vs. Year dimension aggregate
– Year table has 1/365 as many rows as Date table
– Fact aggregate with (Product, Store, Year) has fewer rows than
fact aggregate with (Product, Store, Date)
– But more than 1/365 of the rows!
– Some potential space savings are lost due to reduced sparsity
Automatic Selection of Aggregates
• Problem Definition
– Inputs:
• Query workload
• Set of candidate aggregates
• Query cost model
• Maximum space to use for aggregates
– Output:
• Set of aggregates to construct
– Objective:
• Minimize cost of executing query workload
• Problem is NP-Complete
– Approximation is required
– We’ll discuss a Greedy algorithm for the problem
– Due to Harinarayan, Rajaraman, and Ullman (1996)
Problem Inputs
• Set of candidate aggregates
– We’ll consider the data cube lattice
• Query workload
– Each OLAP query maps to a node in the lattice
• Union of grouping and filtering attributes
– Workload = weight assigned to each lattice node
• Weight = fraction of queries in workload that correspond to this node
• Query cost model
– We’ll use simple linear cost model
– Cost proportional to size of fact aggregate used to answer query
– Justification:
• Dominant cost is I/O to retrieve tables
• Dimension tables are small relative to the fact table
– Oversimplification of reality
• Makes the problem easier to analyze
• Maximum space to use for aggregates
– We’ll fix a maximum number of aggregates, regardless of their size
– Another simplification for purposes of analysis
Aggregation

Still, aggregation is underused. Why?
We are still not comfortable with redundancy!
It requires extra space
Most of us are not sure which aggregates to store
A bizarre phenomenon called SPARSITY FAILURE
Aggregates
Need new indexes that can quickly and "logically" get us to millions of records
Logically, because we need only the summarized result and not the individual records
Aggregates as Summary Indexes!


Aggregates
Aggregates belong to the same DB as the low-level atomic data that is indexed
(unlike data marts)
Queries should always target the atomic data
Aggregate navigation automatically rewrites queries to access the best presently
available aggregates
Aggregate navigation is a form of query optimization
It should be offered by DB query optimizers or by intelligent middleware
Aggregation: Thumb Rule

The size of the database should not grow to more than double its original size.


Aggregates: Trade-Offs
Query performance vs. costs
Costs:
– Storing
– Building
– Maintaining
– Administering
Imbalance: a retail DW that collapsed under the weight of more than 2500
aggregates and that took more than 24 hours to refresh!!!


Aggregates: Guidelines
Set an aggregate storage limit (not more than double the original size of the DB)
Keep a dynamic portfolio of aggregates that changes with changing demands
Define small aggregates: 10 to 20 times smaller than the fact table or aggregate
on which each is based
– Monthly product sales aggregate: how many times smaller than the daily product
sales table?
– If your answer is 30 ... you are forgiven, but you are likely to be wrong
– Reason: sparsity failure




Aggregates: Guidelines
Spread aggregates: the goal should be to accelerate a broad spectrum of queries

[Figure 1: Poor use of the space allocated for aggregates.]


Aggregation

[Figure taken from the Neil Raden article (www.hiredbrains.com/artic9.html)]
Aggregates
Issues:
Which aggregates to create?
How to guard against sparsity failure?
How to store them?
– New Fact Table approach
– New Level Field approach
How are queries directed to the appropriate aggregates?


Aggregates: Example
Grocery Store
3 dimensions – Product, Location, & Time
10000 products
1000 stores
100 time periods
10% Sparsity

Total no. of records = 100 million


Aggregates: Example

Hierarchies
10000 products in 2000 categories
1000 stores in 100 districts
100 time periods rolling up into 30 aggregate time periods


Aggregates: Example

How many aggregates are possible?


1-way: Category by Store by Day
1-way: Product by District by Day
1-way: Product by Store by Month
2-way: Category by District by Day
2-way: Category by Store by Month
2-way: Product by District by Month
3-way: Category by District by Month



Aggregates: Example
What is Sparsity?
Fact tables are sparse in their keys!
10% sparsity at the base level means that only 10% of the products are sold on
any given day (on average)
As we move from the base level to a 1-way aggregate, the sparsity increases!
What effect will sparsity have on the size of the aggregate fact tables?


Aggregates: Example

Let us assume that the sparsity for 1-way aggregates is 50%,
for 2-way aggregates 80%,
and for 3-way aggregates 100%.
Do you agree with this?
Is it logical?


Aggregates: Example

Table         Prod.    Store   Time   Sparsity   # Records
Base          10000    1000    100    10%        100 million
1-way         2000     1000    100    50%        100 million
1-way         10000    100     100    50%        50 million
1-way         10000    1000    30     50%        150 million
2-way         2000     100     100    80%        16 million
2-way         2000     1000    30     80%        48 million
2-way         10000    100     30     80%        24 million
3-way         2000     100     30     100%       6 million
Grand Total                                      494 million


Aggregates: Example
An increase of almost 400%!
Why did it happen?
Look at the aggregates involving Location and Time!
How can we control this aggregate explosion?
Do the calculations again with 500 categories and 5 time aggregates


Aggregates: Example
Table         Prod.    Store   Time   Sparsity   # Records
Base          10000    1000    100    10%        100 million
1-way         500      1000    100    50%        25 million
1-way         10000    100     100    50%        50 million
1-way         10000    1000    5      50%        25 million
2-way         500      100     100    80%        4 million
2-way         500      1000    5      80%        2 million
2-way         10000    100     5      80%        4 million
3-way         500      100     5      100%       0.25 million
Grand Total                                      210.25 million


Aggregate Design Principle

Each aggregate must summarize at least 10, and preferably 20 or more,
lower-level items.


Aggregates:
Shrunken Dimensions

[Figures: the base dimension tables alongside the shrunken (rolled-up) dimension
tables used by the aggregate fact tables.]


Aggregate Navigator

How are queries directed to the appropriate aggregates?
Do end-user query tools have to be "hardcoded" to take advantage of aggregates?
If the DBA changes the aggregates, all end-user applications have to be recoded.
How do we overcome this problem?


Aggregate Navigator

The Aggregate Navigator (AN) is the solution
So what is an AN?
A middleware layer sitting between user queries and the DBMS
With an AN, user applications speak just base-level SQL
The AN uses metadata to transform base-level SQL into "aggregate-aware" SQL
AN Algorithm
1. Rank-order all the aggregate fact tables from the smallest to the largest.
2. For the smallest fact table, look in the associated dimension tables to verify
that all the dimensional attributes of the current query can be found. If found,
we are through: replace the base-level fact table with the aggregate fact table
and the base dimension tables with the aggregate dimension tables.
3. If step 2 fails, find the next smallest aggregate fact table and try step 2
again. If we run out of aggregate fact tables, then we must use the base tables.
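The navigator itself is procedural middleware, but the lookup in step 2 can be sketched as a query against a hypothetical aggregate-metadata catalog: agg_tables(table_name, row_count) lists each aggregate fact table, agg_attributes(table_name, attr_name) lists the dimensional attributes it can supply, and query_attributes(attr_name) holds the attributes referenced by the current query:

  -- Smallest aggregate fact table that supplies every attribute the query needs
  SELECT A.table_name
  FROM agg_tables A
  WHERE NOT EXISTS (                        -- no required attribute is missing
          SELECT Q.attr_name
          FROM query_attributes Q
          WHERE Q.attr_name NOT IN (
                SELECT T.attr_name
                FROM agg_attributes T
                WHERE T.table_name = A.table_name))
  ORDER BY A.row_count ASC
  FETCH FIRST 1 ROW ONLY;                    -- no row returned: fall back to the base tables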



Design Requirements

#1 Aggregates must be stored in their own fact tables, separate from the
base-level data. In addition, each distinct aggregation level must occupy its own
unique fact table.
#2 The dimension tables attached to the aggregate fact tables must, wherever
possible, be shrunken versions of the dimension tables associated with the base
fact table.


Design Requirements

#3 The base Fact table and all of its related


aggregate Fact tables must be associated
together as a "family of schemas" so that the
aggregate navigator knows which tables are
related to one another.
#4 Force all SQL created by any end user or
application to refer exclusively to the base fact
table and its associated full-size dimension
tables.

10/7/2019 Dr. Navneet Goyal, BITS, Pilani 89


Storing Aggregates

New fact and dimension table approach


(Approach 1)
New Level Field approach (Approach 2)
Both require same space?
Approach 1 is recommended
Reasons?

10/7/2019 Dr. Navneet Goyal, BITS, Pilani 90


Lost Dimensions

[Figure: an aggregate schema in which an entire dimension has been dropped.]


Collapsed Dimensions

[Figure: an aggregate schema in which a dimension has been rolled up (collapsed)
to a coarser grain.]


Configurations
[Figure: a lattice in which three nodes (A, B, C) have been chosen as the
configuration; every other node is assigned the cost of its smallest chosen
ancestor.]

A configuration consists of a set of aggregates.
To compute the cost of a configuration:
– For each lattice node, find its smallest ancestor in the configuration
– Cost of that node = size of that smallest ancestor
– Cost of the configuration = weighted sum of the node costs
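In symbols, a small LaTeX restatement of this linear cost model, where w_q is the workload weight of lattice node q and anc_C(q) is its smallest ancestor available in configuration C (the base fact table is always available):

  \mathrm{cost}(C) \;=\; \sum_{q \in \text{lattice}} w_q \cdot \bigl|\mathrm{anc}_C(q)\bigr|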
Greedy Algorithm
Greedy aggregate selection algorithm
– Add aggregates one at a time until the space budget is exhausted
– Always add the aggregate whose addition most decreases the cost of the current
configuration
Performance guarantee for Greedy
– Benefit of configuration C = (cost of the no-aggregate configuration) - (cost of C)
– B_G,k = benefit of the k-aggregate configuration chosen by the greedy algorithm
– B_OPT,k = benefit of the best possible k-aggregate configuration
– Theorem: B_G,k > 0.63 * B_OPT,k
Greedy always achieves at least 63% of the optimal benefit.
For the proof, see "Implementing Data Cubes Efficiently" by Harinarayan,
Rajaraman, and Ullman, 1996.
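The 0.63 constant is the familiar greedy bound for this kind of benefit function; in the notation above it reads:

  B_{G,k} \;\ge\; \Bigl(1 - \frac{1}{e}\Bigr) B_{OPT,k} \;\approx\; 0.632 \, B_{OPT,k}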
Example of Greedy Algorithm
[Lattice: the fact table (size 500) is at the top; below it is aggregate A (size
200); the next level holds B (size 100), C (size 99), and a third node of size
100; the bottom level holds four nodes of size 90, one of which is D. Node A and
the four bottom nodes each carry query weight 1.]

The empty configuration has cost = 2500
 – 5 queries * 500
Which aggregate should be added first?
 – A cost: 1000   (5 * 200)
 – B cost: 1700   (2 * 100 + 3 * 500)
 – C cost: 1698   (2 * 99 + 3 * 500)
 – D cost: 2100   (1 * 100 + 4 * 500)
Number of allowed aggregates = 3
Add aggregate A first
Example of Greedy Algorithm
Which aggregate should be added next?
 – AB cost: 800    (2 * 100 + 3 * 200)
 – AC cost: 798    (2 * 99 + 3 * 200)
 – AD cost: 890    (1 * 90 + 4 * 200)
Add C next
Final 3-aggregate configuration = ACD
 – Cost = 688      (1 * 90 + 2 * 99 + 2 * 200)
(Number of allowed aggregates = 3; the lattice is the same as on the previous
slide.)
Example of Greedy Algorithm
Greedy configuration: total cost = 688     Optimal configuration: total cost = 600

[Figure: the same lattice (node sizes 200; 100, 99, 100; 90, 90, 90, 90, with the
same weight-1 query nodes), with the three aggregates chosen by greedy marked on
the left and the three aggregates of the optimal configuration marked on the
right.]
Practical Limitations
Sizes of aggregates
– Greedy algorithm assumed sizes of aggregates known
– In reality, computing the size of an aggregate can be expensive
Essentially, need to construct the aggregate table
– Estimation techniques
Based on sampling or hashing
Number of candidates
– n total dimension attributes → 2^n possible fact aggregates
– Considering all possible aggregates would take too long even for
moderate n
– Need to prune the search space before applying greedy selection
Number of aggregates vs. space consumed
– Space consumption is more appropriate limit than maximum number
– Modified greedy algorithm: At each step, add aggregate that has best
ratio of (cost improvement) / (size of aggregate)
Impact of indexes
– Selection of indexes also affects query performance
– Should not be done independently of aggregate selection
Thursday: A more practical technique
