
DATA WAREHOUSING AND DATA MINING

10BI23

Instructor: Dr. S. Natarajan
Professor and Key Resource Person
Department of Information Science and Engineering
PES Institute of Technology
Bangalore
Part 2

Aggregations: SQL and Aggregation
Aggregation functions and grouping

"The single most dramatic way to affect performance in a large data warehouse
is to provide a proper set of aggregate (summary) records ... in some cases
speeding queries by a factor of 100 or even 1,000. No other means exist to
harvest such spectacular gains."

- Ralph Kimball
Aggregates
• Pre-computed aggregates
– Materialized views
– Aggregate navigation
– Dimension and fact aggregates
• Selection of aggregates
– Manual selection
– Greedy algorithm
– Limitations of greedy approach
Star Schema
 Single data (fact) table surrounded by multiple descriptive (dimension) tables

 [Figure: a central Fact table joined to several surrounding Dim tables]
Star Schema for Sales
[Figure: a sales fact table surrounded by its dimension tables]
Star Schema Representation
 Fact and Dimensions are represented by physical
tables in the data warehouse database
 Each fact table is related to each of its dimension tables in
a many-to-one relationship (primary key / foreign key
relationships)
 A fact table is related to many dimension tables
– The primary key of the fact table is a composite of the
foreign keys to the dimension tables
 Each fact table is designed to answer a specific
DSS question
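To make the physical layout concrete, here is a minimal star-schema DDL sketch; the table and column names (DimDate, DimProduct, DimStore, SalesFact and their attributes) are invented for illustration, and a real design would carry many more attributes:

  -- Dimension tables: one surrogate primary key each, plus descriptive attributes
  CREATE TABLE DimDate    (DateKey    INTEGER PRIMARY KEY, FullDate DATE,
                           Month VARCHAR(10), Year INTEGER);
  CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, SKU VARCHAR(20),
                           Brand VARCHAR(30), Category VARCHAR(30));
  CREATE TABLE DimStore   (StoreKey   INTEGER PRIMARY KEY, StoreName VARCHAR(50),
                           District VARCHAR(30), State VARCHAR(20));

  -- Fact table: a composite key of foreign keys to the dimensions, plus measures
  CREATE TABLE SalesFact (
    DateKey    INTEGER REFERENCES DimDate(DateKey),
    ProductKey INTEGER REFERENCES DimProduct(ProductKey),
    StoreKey   INTEGER REFERENCES DimStore(StoreKey),
    DollarAmt  DECIMAL(12,2),
    Quantity   INTEGER,
    PRIMARY KEY (DateKey, ProductKey, StoreKey)  -- many-to-one to each dimension
  );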
Physical Database Design
• Logical database design
– What are the facts and dimensions?
– Goals: Simplicity, Expressiveness
– Make the database easy to understand
– Make queries easy to ask
• Physical database design
– How should the data be arranged on disk?
– Goal: Performance
• Manageability is an important secondary concern
– Make queries run fast
Load vs. Query Trade-off
• Trade-off between query performance and load
performance
• To make queries run fast:
– Precompute as much as possible
– Build lots of data structures
• Indexes
• Materialized views
• But…
– Data structures require disk space to store
– Building/updating data structures takes time
– More data structures → longer load time
Typical Storage Allocations
• Base data
– Fact tables and dimension tables
– Fact table space >> Dimension table space
• Indexes
– 100%-200% of base data
• Aggregates / Materialized Views
– 100% of base data
• Extra data structures 2-3 times size of base data
Why This Syntax?
 Abstract syntax
select <field list> <aggregate list>
from <table expression>
where <search condition>
group by [ cube | drill down] <aggregate list>
having <search condition>

 Allows functional aggregations (e.g., sales by quarter):


select store, quarter, sum(units)
from sales
where nation = 'Mexico' and year = 1994
group by drill down store, quarter(date) as quarter;
Group, Cube, Roll-Up Algebra
 The operators have the following properties:
 CUBE(ROLLUP) = CUBE
 ROLLUP (GROUP BY) = ROLLUP
 SQL extended GROUP BY operator
 GROUP BY [<aggregation list>]
[ROLLUP < aggregation list>]
[CUBE < aggregation list>]
 CUBE creates nested relations, that is,
relations can be values.
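As a sketch in standard SQL:1999 syntax (the sales table and its store and quarter columns are assumed for illustration), the extended GROUP BY operators are shorthand for unions of ordinary groupings:

  -- GROUP BY CUBE (store, quarter) is equivalent to grouping by every subset:
  SELECT store, quarter, SUM(units)
  FROM sales
  GROUP BY GROUPING SETS ((store, quarter), (store), (quarter), ());

  -- GROUP BY ROLLUP (store, quarter) keeps only the prefixes of the list:
  SELECT store, quarter, SUM(units)
  FROM sales
  GROUP BY GROUPING SETS ((store, quarter), (store), ());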
SQL: 1999
 SQL: 1999 defines a rich set of
aggregate functions
 New aggregate functions on
single attributes:
 stddev
 variance
 Binary aggregate functions
 correlations
 covariances
 regression curves
SQL: 1999
 Generalization of GROUP BY
construct:
 Cube
 Rollup
 Window Queries
 Top N Queries (Ranking)
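A small sketch of a ranking (Top-N) query using the window-function syntax most SQL:1999-era DBMSs support; it reuses the Sales/Times/Locations schema that appears later in these slides:

  -- The two best-selling states per year
  SELECT * FROM (
    SELECT T.year, L.state, SUM(S.sales) AS total_sales,
           RANK() OVER (PARTITION BY T.year
                        ORDER BY SUM(S.sales) DESC) AS rnk
    FROM Sales S, Times T, Locations L
    WHERE S.timeid = T.timeid AND S.locid = L.locid
    GROUP BY T.year, L.state
  ) ranked
  WHERE rnk <= 2;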



The MOLAP Cube

Fact table view:                     Multi-dimensional cube:

sale   prodId   storeId   amt                 s1    s2    s3
       p1       s1        12            p1    12          50
       p2       s1        11            p2    11     8
       p1       s3        50
       p2       s2         8

dimensions = 2
3-D Cube
Fact table view:                          Multi-dimensional cube:

sale   prodId   storeId   date   amt       day 2         s1    s2    s3
       p1       s1        1      12                 p1   44     4
       p2       s1        1      11                 p2
       p1       s3        1      50        day 1         s1    s2    s3
       p2       s2        1       8                 p1   12          50
       p1       s1        2      44                 p2   11     8
       p1       s2        2       4

dimensions = 3
Cube Aggregation: Roll-up
Example: computing sums

Base cube:
  day 2        s1    s2    s3          day 1        s1    s2    s3
          p1   44     4                        p1   12          50
          p2                                   p2   11     8

Roll-up over time (day 1 + day 2):     Roll-up over products (p1 + p2):
          s1    s2    s3                      sum    s1    s2    s3
     p1   56     4    50                             67    12    50
     p2   11     8

Roll-up over stores (s1 + s2 + s3):    p1 = 110,  p2 = 19;  grand total = 129

Moving to coarser summaries is a roll-up; moving back to finer detail is a drill-down.
Cube Operators for Roll-up
The same roll-ups can be written with an ALL (*) placeholder in the sale() notation:

  sale(s1, *, *)  =  67     all products, all days at store s1
  sale(s2, p2, *) =   8     product p2 at store s2, over all days
  sale(*, *, *)   = 129     grand total

  Roll-up over time:                   Roll-up over products:
          s1    s2    s3                      sum    s1    s2    s3
     p1   56     4    50                             67    12    50
     p2   11     8
                                       Roll-up over stores:  p1 = 110,  p2 = 19
Extended Cube

The extended cube adds an ALL (*) member to every dimension, so each plane carries
its own subtotal row and column:

  all days (*)       s1    s2    s3     *
                p1   56     4    50    110
                p2   11     8           19
                *    67    12    50    129

  day 1              s1    s2    s3     *
                p1   12          50     62
                p2   11     8           19
                *    23     8    50     81

  day 2              s1    s2    s3     *
                p1   44     4           48
                p2
                *    44     4           48

  For example, sale(*, p2, *) = 19 is the subtotal cell for product p2 over all
  stores and all days.
Aggregation Using Hierarchies

The Store dimension rolls up along the hierarchy store → region → country.

Base cube:
  day 2        s1    s2    s3          day 1        s1    s2    s3
          p1   44     4                        p1   12          50
          p2                                   p2   11     8

Rolled up to the region level (store s1 is in region A; stores s2 and s3 are in
region B):

               region A    region B
          p1      56          54
          p2      11           8
Slicing
A slice fixes one dimension value and keeps the remaining sub-cube.

Slice PRODUCT = p1:                    Slice TIME = day 1:
            s1    s2    s3                        s1    s2    s3
  day 2     44     4                        p1    12          50
  day 1     12          50                  p2    11     8
Cross Tabulation

 A table such as the product-by-store tables shown earlier is an example of a
cross-tabulation (cross-tab), also referred to as a pivot table.
 Values for one of the dimension attributes form the row
headers
 Values for another dimension attribute form the column
headers
 Other dimension attributes are listed on top
 Values in individual cells are (aggregates of) the values of the
dimension attributes that specify the cell.
Relational Representation of Cross-
tabs

 Cross-tabs can be represented as relations
 The value 'all' is used to represent aggregates
 The SQL:1999 standard actually uses null values in place of 'all', despite the
potential confusion with regular null values
Data Cube
 A data cube is a multidimensional generalization of a cross-tab
 Can have n dimensions; we show 3 below
 Cross-tabs can be used as views on a data cube
Cube Operator
Sales (fact table)                    Locations
pid   timeid   locid   sales          Locid   City      State   Country
11    1        1       25             1       Madison   WI      USA
11    2        1        8             2       Fresno    CA      USA
11    3        1       15             5       Chennai   TN      India
12    1        1       30
12    2        1       20             Products
12    3        1       50             Pid   Pname       Category     Price
13    1        1        8             11    Lee Jeans   Apparel      25
13    2        1       10             12    Zord        Toys         18
13    3        1       10             13    Biro Pen    Stationery    2
11    1        2       35
11    2        2       22             Times
11    3        2       10             Timeid   Date       Month   Year   Holiday
12    1        2       26             1        10/11/05   Nov     1995   N
12    2        2       45             2        11/11/05   Nov     1996   N
12    3        2       20             3        12/11/05   Nov     1997   N
13    1        2       20
13    2        2       40
13    3        2        5
Cube Operator
The goal is the following cross-tab of sales by year and state:

          WI     CA    Total
  1995    63     81     144
  1996    38    107     145
  1997    75     35     110
  Total  176    223     399

Filling it in with plain GROUP BY takes four separate queries:

  SELECT T.year, L.state, SUM(S.sales)
  FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY T.year, L.state

  SELECT T.year, SUM(S.sales)
  FROM Sales S, Times T
  WHERE S.timeid = T.timeid
  GROUP BY T.year

  SELECT L.state, SUM(S.sales)
  FROM Sales S, Locations L
  WHERE S.locid = L.locid
  GROUP BY L.state

  SELECT SUM(S.sales)
  FROM Sales S, Locations L
  WHERE S.locid = L.locid
  -- or, equivalently, FROM Sales S, Times T WHERE S.timeid = T.timeid

How many such SQL queries are needed to build a cross-tab?


Cube Operator
  SELECT T.year, L.state, SUM(S.sales)
  FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY CUBE (T.year, L.state)

  Result:                              Equivalent cross-tab:
  T.Year   L.State   SUM(sales)                  WI     CA    Total
  1995     WI         63                 1995    63     81     144
  1995     CA         81                 1996    38    107     145
  1995     All       144                 1997    75     35     110
  1996     WI         38                 Total  176    223     399
  1996     CA        107
  1996     All       145
  1997     WI         75
  1997     CA         35
  1997     All       110
  All      WI        176
  All      CA        223
  All      All       399

A GROUP BY clause with the CUBE keyword is equivalent to a collection of ordinary
GROUP BY statements, one for each subset of the k grouping attributes (2^k
groupings in total).


Rollup Operator
  SELECT T.year, L.state, SUM(S.sales)
  FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY ROLLUP (T.year, L.state)

  Result:                              Cross-tab for comparison:
  T.Year   L.State   SUM(sales)                  WI     CA    Total
  1995     WI         63                 1995    63     81     144
  1995     CA         81                 1996    38    107     145
  1995     All       144                 1997    75     35     110
  1996     WI         38                 Total  176    223     399
  1996     CA        107
  1996     All       145
  1997     WI         75
  1997     CA         35
  1997     All       110
  All      All       399

ROLLUP produces only the subtotals implied by the listed order (per-year totals
and the grand total); the per-state "All years" rows that CUBE would add are
omitted.




Rollup Operator
  SELECT T.year, L.state, SUM(S.sales)
  FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY ROLLUP (L.state, T.year)

  Result:                              Cross-tab for comparison:
  T.Year   L.State   SUM(sales)                  WI     CA    Total
  1995     WI         63                 1995    63     81     144
  1996     WI         38                 1996    38    107     145
  1997     WI         75                 1997    75     35     110
  All      WI        176                 Total  176    223     399
  1995     CA         81
  1996     CA        107
  1997     CA         35
  All      CA        223
  All      All       399

Reversing the argument order changes which subtotals ROLLUP produces: here we get
per-state totals instead of per-year totals.




Example
Dimensions: Time, Product, Store
Attributes: Product (descr, price, ...), Store (...), ...
Hierarchies:
  Product → Brand → ...
  Day → Week → Quarter
  Store → Region → Country

[Figure: a 3-D cube with Product (Juice, Milk, Coke, Cream, Soap, Bread), Store
(S1, S2, S3) and Time (M, T, W, Th, F, S, S) axes. The cell (Bread, S1, M) holds
56, i.e., 56 units of bread were sold in S1 on Monday. Arrows indicate roll-up to
brand, roll-up to region, and roll-up to week.]
Twelve rules for evaluating OLAP products
(E. F. Codd)
• 1. Multidimensional Conceptual View
• 2. Transparency
• 3. Accessibility
• 4. Consistent Reporting Performance
• 5. Client-Server Architecture
• 6. Generic Dimensionality
• 7. Dynamic Sparse Matrix Handling
• 8. Multi-User Support
• 9. Unrestricted Cross-dimensional Operations
• 10. Intuitive Data Manipulation
• 11. Flexible Reporting
• 12. Unlimited Dimensions and Aggregation Levels
Materialized Views
• Many DBMSs support materialized views
– Precomputed result of a particular query
– Goal: faster response for related queries
• Example:
– View definition:
SELECT State, SUM(Quantity)
FROM Sales GROUP BY State
– Query:
SELECT SUM(Quantity)
FROM Sales
WHERE State = 'CA'
– Scan view rather than Sales table
• View matching problem
– When can a query be re-written using a materialized view?
– Difficult to solve in its full generality
– DBMSs handle common cases via limited set of re-write rules
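A minimal sketch of the same example using materialized-view DDL (the exact syntax and refresh/rewrite options vary by DBMS; the form below follows Oracle/PostgreSQL style):

  -- Precompute the aggregate once and store it
  CREATE MATERIALIZED VIEW StateSales AS
  SELECT State, SUM(Quantity) AS TotalQty
  FROM Sales
  GROUP BY State;

  -- The user still writes the query against Sales:
  SELECT SUM(Quantity) FROM Sales WHERE State = 'CA';

  -- A rewrite-capable optimizer can answer it from the much smaller view,
  -- effectively running:
  SELECT TotalQty FROM StateSales WHERE State = 'CA';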
Cross Tab Report

Example: simple cross-tabular report with subtotals

The table below is a sample cross-tabular report showing the total sales by
country_id and channel_desc for the US and UK through the Internet and direct
sales channels in September 2000.

  Channel         Country
                  UK          US          Total
  Direct Sales    1,378,126   2,835,557   4,213,683
  Internet          911,739   1,732,240   2,643,979
  Total           2,289,865   4,567,797   6,857,662

The query requests SUM(amount_sold) grouped by CUBE(channel_desc, country_id):
SELECT channel_desc, country_id,
TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels
WHERE sales.time_id=times.time_id AND
sales.cust_id=customers.cust_id AND
sales.channel_id= channels.channel_id AND
channels.channel_desc IN ('Direct Sales', 'Internet') AND
times.calendar_month_desc='2000-09'
AND country_id IN ('UK', 'US')
GROUP BY CUBE(channel_desc, country_id);
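In the CUBE output it can be hard to tell a stored NULL from a subtotal row; the standard GROUPING() function (available in Oracle and most other DBMSs) distinguishes them. A sketch of the same query with the extra flags:

  SELECT channel_desc, country_id,
         GROUPING(channel_desc) AS g_channel,  -- 1 on rows aggregated over all channels
         GROUPING(country_id)   AS g_country,  -- 1 on rows aggregated over all countries
         TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
  FROM sales, customers, times, channels
  WHERE sales.time_id = times.time_id AND
        sales.cust_id = customers.cust_id AND
        sales.channel_id = channels.channel_id AND
        channels.channel_desc IN ('Direct Sales', 'Internet') AND
        times.calendar_month_desc = '2000-09' AND
        country_id IN ('UK', 'US')
  GROUP BY CUBE(channel_desc, country_id);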
CUBE and Roll Up Operators
[Figure: "The Data Cube and the Sub-Space Aggregates" (after Gray et al.). A plain
GROUP BY on Color produces one aggregate per color plus a total; adding Make gives
a cross-tab (Chevy, Ford by RED, WHITE, BLUE) with row and column sums; adding
Year (1994, 1995) gives the full 3-D data cube, whose sub-space aggregates are the
sums by Make, by Year, by Color, by Make & Year, by Make & Color, by Color & Year,
and the grand total.]
Sub-cube Derivation
  <M,Y,C>
  <M,Y,*>    <M,*,C>    <*,Y,C>
  <M,*,*>    <*,Y,*>    <*,*,C>
  <*,*,*>

 Each sub-cube is derived from a parent by collapsing (aggregating away) one more
dimension; * denotes ALL.
Cube Operator Example
SALES
Model   Year   Color   Sales
Chevy   1990   red       5
Chevy   1990   white    87
Chevy   1990   blue     62
Chevy   1991   red      54
Chevy   1991   white    95
Chevy   1991   blue     49
Chevy   1992   red      31
Chevy   1992   white    54
Chevy   1992   blue     71
Ford    1990   red      64
Ford    1990   white    62
Ford    1990   blue     63
Ford    1991   red      52
Ford    1991   white     9
Ford    1991   blue     55
Ford    1992   red      27
Ford    1992   white    62
Ford    1992   blue     39

Applying CUBE to SALES adds the following aggregate rows (the DATA CUBE):
Model   Year   Color   Sales
ALL     ALL    ALL     942
chevy   ALL    ALL     510
ford    ALL    ALL     432
ALL     1990   ALL     343
ALL     1991   ALL     314
ALL     1992   ALL     285
ALL     ALL    red     165
ALL     ALL    white   273
ALL     ALL    blue    339
chevy   1990   ALL     154
chevy   1991   ALL     199
chevy   1992   ALL     157
ford    1990   ALL     189
ford    1991   ALL     116
ford    1992   ALL     128
chevy   ALL    red      91
chevy   ALL    white   236
chevy   ALL    blue    183
ford    ALL    red     144
ford    ALL    white   133
ford    ALL    blue    156
ALL     1990   red      69
ALL     1990   white   149
ALL     1990   blue    125
ALL     1991   red     107
ALL     1991   white   104
ALL     1991   blue    104
ALL     1992   red      59
ALL     1992   white   116
ALL     1992   blue    110
User Defined Aggregate Function
Generalized For Cubes
 Aggregates have graduated difficulty:
   Distributive: the cube can be computed from the next lower dimension's
    values (count, min, max, ...)
   Algebraic: the cube can be computed from the next lower dimension's
    scratchpads (average, ...)
   Holistic: need the base data (median, mode, rank, ...)

 Distributive and algebraic functions have a simple and efficient algorithm:
build the higher dimensions from the core.
 Holistic computation seems to require multiple passes;
 real systems use sampling to estimate them

 (e.g., sample to find median, quartile boundaries)
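A sketch of why algebraic functions need only small scratchpads: an average at a coarser grain can be rebuilt from per-group sums and counts kept at the finer grain. It reuses the hypothetical SalesFact/DimDate tables sketched earlier:

  -- The finer-grained aggregate keeps the scratchpad (sum, count), not just the AVG
  CREATE TABLE DailyProductSales AS
  SELECT ProductKey, DateKey, SUM(DollarAmt) AS sum_amt, COUNT(*) AS row_cnt
  FROM SalesFact
  GROUP BY ProductKey, DateKey;

  -- The monthly average is recomputed from the scratchpads alone (no base data)
  SELECT D.Year, D.Month, S.ProductKey,
         SUM(S.sum_amt) / SUM(S.row_cnt) AS avg_amt
  FROM DailyProductSales S, DimDate D
  WHERE S.DateKey = D.DateKey
  GROUP BY D.Year, D.Month, S.ProductKey;

  -- A holistic function such as a median cannot be rebuilt this way;
  -- it needs the base rows (or sampling, as noted above).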


Interesting Aggregate
Functions
 From RedBrick systems
 Rank (in sorted order)
 N-Tile (histograms)
 Running average (cumulative functions)
 Windowed running average
 Percent of total
 Users want to define their own aggregate
functions
 statistics
 domain specific
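Most of these are expressible today with standard SQL window functions; a sketch over a hypothetical MonthlySales(month, sales) table:

  SELECT month, sales,
         RANK() OVER (ORDER BY sales DESC)                      AS sales_rank,    -- rank in sorted order
         NTILE(4) OVER (ORDER BY sales)                         AS quartile,      -- n-tile / histogram bucket
         AVG(sales) OVER (ORDER BY month
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg,   -- cumulative average
         AVG(sales) OVER (ORDER BY month
              ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)         AS moving_avg_3m, -- windowed running average
         100.0 * sales / SUM(sales) OVER ()                     AS pct_of_total   -- percent of total
  FROM MonthlySales;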
Aggregate Tables
• Also known as summary tables
• Common form of precomputation in data
warehouses
• Reduce dimensionality of fact by
aggregating across some dimensions
– Result is smaller version of fact table
• Can be used to answer queries that refer
only to dimensions that are retained
– Similar to the idea of a covering index
Aggregate Table Example
• Sales Fact table
– Dimensions (Date, Product, Store, Promotion, Transaction ID)
– Measurements DollarAmt, Quantity
• Create aggregate table with (Date, Store)
– SELECT DateKey, StoreKey, SUM(DollarAmt),
SUM(Quantity)
FROM Sales
GROUP BY DateKey, StoreKey
– Store the result in Sales2 table
• Queries that only reference Date and Store attributes
can use the aggregate table instead
– SELECT Store.District, SUM(Quantity)
FROM Sales, Store, Date
WHERE Sales.Date_key = Date.Date_key
AND Sales.Store_key = Store.Store_key
AND Date.Month = 'September 2004'
GROUP BY Store.District
– Replace “Sales” by “Sales2” → Same query result!
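As a sketch, the rewritten query (assuming the Sales2 aggregate table keeps the key columns and stores its sums under the same measure names used in the query above):

  -- Same result, but scans the much smaller aggregate table
  SELECT Store.District, SUM(Sales2.Quantity)
  FROM Sales2, Store, Date
  WHERE Sales2.Date_key = Date.Date_key
    AND Sales2.Store_key = Store.Store_key
    AND Date.Month = 'September 2004'
  GROUP BY Store.District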
Aggregate Tables vs. Indexes
• Idea behind covering fact index:
– Thinner version of fact table
– Index takes up less space than fact table
– Fewer I/Os required to scan it
• Idea behind aggregate table:
– Thinner and shorter version of fact table
– Aggregate table takes up much less space than fact table
– Fewer I/Os required to scan it
• Aggregate table has fewer rows
– Index has 1 index entry per fact table row
• Regardless of how many columns are in the index
– Aggregate table has 1 row per unique combination of dimensions
• Often many fewer rows compared to the fact table!
• Index supports efficient lookup on leading terms
– Useful when filters are selective
– Avoid scanning rows that will be filtered out
– Can build indexes on aggregate tables, too!
Aggregate Navigation
• Two techniques to manage aggregates
– Let the database management system do it
– Do it yourself
• Let the database do it
– Use materialized view capabilities built in to the DBMS
– Query re-writing happens automatically
• Do it yourself
– Create and populate additional tables
– Perform explicit query re-write (aggregate navigation)
• Pros and Cons of Do-It-Yourself
– Pros:
• More flexibility & re-writing power
• Better load performance (possibly)
– Cons:
• More tables to manage
• Load becomes more complex
• Need to write aggregate navigation code
• Clients need to use aggregate navigator
Aggregate Navigation
• Clients are unaware of aggregates
• Client queries reference base-level fact and dimension tables
• [Figure: the warehouse client sends ordinary SQL queries to the Aggregate
  Navigator, which rewrites them into aggregate-aware queries against the data
  warehouse. Aggregate metadata tells the navigator which aggregates exist and
  how large each one is.]
Dimension Aggregates
• Methods to define aggregates
– 1. Include or leave out entire dimensions
– 2. Include some columns of each dimension, but not others
– Second approach is more common / more useful
• Example: Several versions of Date dimension
– Base Date dimension
• 1 row per day
• Includes all Date attributes
– Monthly aggregate dimension
• 1 row per month
• Includes Date attributes at month level or higher
– Yearly aggregate dimension
• 1 row per year
• Includes only year-level Date attributes
• Each dimension aggregate has its own set of surrogate keys
• Each aggregated fact joins to 1 version of the Date dimension
– Or else the Date dimension is omitted entirely…
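A sketch of how the monthly aggregate Date dimension might be derived from the base Date dimension (names are illustrative; CREATE TABLE ... AS SELECT syntax varies slightly across DBMSs, and a production build would manage surrogate keys more carefully):

  -- One row per month, keeping only month-level-or-higher attributes
  CREATE TABLE DimMonth AS
  SELECT ROW_NUMBER() OVER (ORDER BY MIN(FullDate)) AS MonthKey,  -- new surrogate key
         Month, Year
  FROM DimDate
  GROUP BY Month, Year;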
Choosing Dimension Aggregates
• Dimension aggregates often roll up along hierarchies
– Day – month – year
– SKU – brand – category – department
– Store – city – county – state – country
• Any subset of dimension attributes can be used
– Promotion dimension includes attributes for coupon, ad,
discount, and end-of-aisle display
– Promotion aggregate might include only ad-related attributes
– Customer aggregate might include only a few frequently-queried
columns (age, gender, income, marital status)
• Goal: reduced number of distinct combinations
– Results in fewer rows in aggregated fact
– Customer aggregate that included SSN would be pointless
Aggregate Examples
• Sales fact table
– Date, Product, Store, Promotion, Transaction ID
• Date aggregates
– Month
– Year
• Product aggregates
– Brand
– Manufacturer
• Promotion aggregates
– Ad
– Discount
– Coupon
– In-Store Display
• Store aggregates
– District
– State
Choosing Fact Aggregates
• Aggregated fact table includes:
– Foreign keys to dimension aggregate tables
– Aggregated measurement columns
• For example, Quantity column holds SUM(Quantity)
• Need to choose which version of each dimension to use
• Transaction ID will never be included
– Aggregates with degenerate dimensions are rare
• Number of possible fact aggregates: 4 * 4 * 6 * 4 = 384
– Dimension with n aggregates → n+2 possibilities
• Include base dimension
• Omit dimension entirely
• n dimension aggregates to choose from
– Constructing fact aggregates for all combinations would be
impractical
– How to decide which combinations to construct?
Aggregate Selection
• Two approaches
– Manual selection
• Data warehouse designer chooses aggregates
• Could be time-consuming and error-prone
– Automatic selection
• Use algorithms to optimize choice of aggregates
• Good in principle; however, problem is hard
• Heuristics for manual selection
– Include a mixture of breadth and depth
• A few broad aggregates
– Lots of dimensions at fairly fine-grained level of detail
– Achieve moderate speedup on a wide range of queries
• Lots of highly targeted aggregates with only a few rows each
– Dimensions are highly rolled up or omitted entirely
– Each aggregate achieves large speedup on a small class of queries
• Roughly equal allocation of space to each type
– Consider the query workload
• What attributes are queried most frequently?
• What sets of attributes are often queried together?
• Make sure that common / important queries run fast
• Aggregate design can be adjusted over time (learn from experience)
Constructing Aggregates
• Constructing dimension aggregates
– Determine attributes in aggregate
– Generate unique combinations of those attributes by SELECT
DISTINCT from dimension table
– Assign surrogate keys to aggregate table rows
• Constructing fact aggregates
– Build mapping table that maps dimension keys to dimension
aggregate keys, for each dimension
– Join fact table to mapping tables and group by aggregate keys
• Constructing aggregates from other aggregates
– Fact aggregate on (Product, Year) can be built from (Product,
Month) aggregate
– Faster than using base table
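A sketch of these steps in SQL, continuing the hypothetical SalesFact/DimDate/DimMonth schema, with the Date grain rolled up to Month:

  -- 1. Mapping table from each base Date key to its Month-aggregate key
  CREATE TABLE DateToMonth AS
  SELECT D.DateKey, M.MonthKey
  FROM DimDate D, DimMonth M
  WHERE D.Month = M.Month AND D.Year = M.Year;

  -- 2. Join the fact table to the mapping table and group by the aggregate keys
  CREATE TABLE SalesByMonthStore AS
  SELECT DM.MonthKey, F.StoreKey,
         SUM(F.DollarAmt) AS DollarAmt, SUM(F.Quantity) AS Quantity
  FROM SalesFact F, DateToMonth DM
  WHERE F.DateKey = DM.DateKey
  GROUP BY DM.MonthKey, F.StoreKey;

  -- 3. A (Product, Year) aggregate could likewise be built from a (Product, Month)
  --    aggregate instead of from SalesFact, which is faster.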
Data Cube Lattice
                 (State, Month, Color)
   (State, Month)   (State, Color)   (Month, Color)
       (State)          (Month)          (Color)
                        (Total)

  Moving toward Total is roll-up; moving toward (State, Month, Color) is
  drill-down.
Sparsity Revisited
• Fact tables are usually sparse
– Not all possible combinations of dimension values actually occur
– E.g. not all products sell in all stores on all days
• Aggregate tables are not as sparse
– Coarser grain → lesser sparsity
– All products sell in SOME store in all MONTHS
– Thus space savings from aggregation can be less than one
might think
• Example: Date dimension vs. Year dimension aggregate
– Year table has 1/365 as many rows as Date table
– Fact aggregate with (Product, Store, Year) has fewer rows than
fact aggregate with (Product, Store, Date)
– But more than 1/365 of the rows!
– Some potential space savings are lost due to reduced sparsity
Automatic Selection of Aggregates
• Problem Definition
– Inputs:
• Query workload
• Set of candidate aggregates
• Query cost model
• Maximum space to use for aggregates
– Output:
• Set of aggregates to construct
– Objective:
• Minimize cost of executing query workload
• Problem is NP-Complete
– Approximation is required
– We’ll discuss a Greedy algorithm for the problem
– Due to Harinarayan, Rajaraman, and Ullman (1996)
Problem Inputs
• Set of candidate aggregates
– We’ll consider the data cube lattice
• Query workload
– Each OLAP query maps to a node in the lattice
• Union of grouping and filtering attributes
– Workload = weight assigned to each lattice node
• Weight = fraction of queries in workload that correspond to this node
• Query cost model
– We’ll use simple linear cost model
– Cost proportional to size of fact aggregate used to answer query
– Justification:
• Dominant cost is I/O to retrieve tables
• Dimension tables are small relative to the fact table
– Oversimplification of reality
• Makes the problem easier to analyze
• Maximum space to use for aggregates
– We’ll fix a maximum number of aggregates, regardless of their size
– Another simplification for purposes of analysis
Aggregation

Still, aggregation is underused. Why?
We are still not comfortable with redundancy!
It requires extra space
Most of us are not sure which aggregates to store
A bizarre phenomenon called SPARSITY FAILURE
Aggregates
Need new indexes that can quickly and "logically" get us to millions of records
Logically, because we need only the summarized result and not the individual records
Aggregates as Summary Indexes!


Aggregates
Aggregates belong to the same DB as the low-level atomic data that is indexed
(unlike data marts)
Queries should always target the atomic data
Aggregate navigation automatically rewrites queries to access the best presently
available aggregates
Aggregate navigation is a form of query optimization
It should be offered by DB query optimizers or by intelligent middleware
Aggregation: Thumb Rule

The size of the database should not grow to more than double its original size.


Aggregates: Trade-Offs
Query performance vs. costs
Costs:
– Storing
– Building
– Maintaining
– Administering
Imbalance: a retail DW that collapsed under the weight of more than 2500
aggregates and that took more than 24 hours to refresh!!!


Aggregates: Guidelines
Set an aggregate storage limit (not more than double the original size of the DB)
Keep a dynamic portfolio of aggregates that changes with changing demands
Define small aggregates: 10 to 20 times smaller than the fact table or aggregate
on which each is based
– Monthly product sales aggregate: how many times smaller than the daily product
sales table?
– If your answer is 30 ... you are forgiven, but you are likely to be wrong
– Reason: sparsity failure




Aggregates: Guidelines
Spread aggregates: the goal should be to accelerate a broad spectrum of queries

[Figure 1: Poor use of the space allocated for aggregates.]


Aggregation

[Figure taken from the Neil Raden article (www.hiredbrains.com/artic9.html)]
Aggregates
Issues:
Which aggregates to create?
How to guard against sparsity failure?
How to store them?
– New Fact Table approach
– New Level Field approach
How are queries directed to the appropriate aggregates?


Aggregates: Example
Grocery Store
3 dimensions – Product, Location, & Time
10000 products
1000 stores
100 time periods
10% Sparsity

Total no. of records = 100 million


Aggregates: Example

Hierarchies
10000 products in 2000 categories
1000 stores in 100 districts
100 time periods rolling up into 30 aggregate time periods


Aggregates: Example

How many aggregates are possible?


1-way: Category by Store by Day
1-way: Product by District by Day
1-way: Product by Store by Month
2-way: Category by District by Day
2-way: Category by Store by Month
2-way: Product by District by Month
3-way: Category by District by Month



Aggregates: Example
What is Sparsity?
Fact tables are sparse in their keys!
10% sparsity at the base level means that only 10% of the products are sold on
any given day (on average)
As we move from the base level to a 1-way aggregate, the sparsity increases!
What effect will sparsity have on the size of the aggregate fact tables?


Aggregates: Example

Let us assume that the sparsity for 1-way aggregates is 50%,
for 2-way aggregates 80%,
and for 3-way aggregates 100%.
Do you agree with this?
Is it logical?


Aggregates: Example

Table         Prod.    Store   Time   Sparsity   # Records
Base          10000    1000    100    10%        100 million
1-way         2000     1000    100    50%        100 million
1-way         10000    100     100    50%        50 million
1-way         10000    1000    30     50%        150 million
2-way         2000     100     100    80%        16 million
2-way         2000     1000    30     80%        48 million
2-way         10000    100     30     80%        24 million
3-way         2000     100     30     100%       6 million
Grand Total                                      494 million


Aggregates: Example
An increase of almost 400%!
Why did it happen?
Look at the aggregates involving Location and Time!
How can we control this aggregate explosion?
Do the calculations again with 500 categories and 5 time aggregates


Aggregates: Example
Table         Prod.    Store   Time   Sparsity   # Records
Base          10000    1000    100    10%        100 million
1-way         500      1000    100    50%        25 million
1-way         10000    100     100    50%        50 million
1-way         10000    1000    5      50%        25 million
2-way         500      100     100    80%        4 million
2-way         500      1000    5      80%        2 million
2-way         10000    100     5      80%        4 million
3-way         500      100     5      100%       0.25 million
Grand Total                                      210.25 million


Aggregate Design Principle

Each aggregate must summarize at least 10, and preferably 20 or more,
lower-level items.


Aggregates:
Shrunken Dimensions

[Figures: the base dimension tables alongside the shrunken (rolled-up) dimension
tables used by the aggregate fact tables.]


Aggregate Navigator

How are queries directed to the appropriate aggregates?
Do end-user query tools have to be "hardcoded" to take advantage of aggregates?
If the DBA changes the aggregates, all end-user applications have to be recoded.
How do we overcome this problem?


Aggregate Navigator

The Aggregate Navigator (AN) is the solution
So what is an AN?
A middleware layer sitting between user queries and the DBMS
With an AN, user applications speak just base-level SQL
The AN uses metadata to transform base-level SQL into "aggregate-aware" SQL
AN Algorithm
1. Rank-order all the aggregate fact tables from the smallest to the largest.
2. For the smallest fact table, look in the associated dimension tables to verify
that all the dimensional attributes of the current query can be found. If found,
we are through: replace the base-level fact table with the aggregate fact table
and the base dimension tables with the aggregate dimension tables.
3. If step 2 fails, find the next smallest aggregate fact table and try step 2
again. If we run out of aggregate fact tables, then we must use the base tables.
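The navigator itself is procedural middleware, but the lookup in step 2 can be sketched as a query against a hypothetical aggregate-metadata catalog: agg_tables(table_name, row_count) lists each aggregate fact table, agg_attributes(table_name, attr_name) lists the dimensional attributes it can supply, and query_attributes(attr_name) holds the attributes referenced by the current query:

  -- Smallest aggregate fact table that supplies every attribute the query needs
  SELECT A.table_name
  FROM agg_tables A
  WHERE NOT EXISTS (                        -- no required attribute is missing
          SELECT Q.attr_name
          FROM query_attributes Q
          WHERE Q.attr_name NOT IN (
                SELECT T.attr_name
                FROM agg_attributes T
                WHERE T.table_name = A.table_name))
  ORDER BY A.row_count ASC
  FETCH FIRST 1 ROW ONLY;                    -- no row returned: fall back to the base tables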



Design Requirements

#1 Aggregates must be stored in their own fact tables, separate from the
base-level data. In addition, each distinct aggregation level must occupy its own
unique fact table.
#2 The dimension tables attached to the aggregate fact tables must, wherever
possible, be shrunken versions of the dimension tables associated with the base
fact table.


Design Requirements

#3 The base Fact table and all of its related


aggregate Fact tables must be associated
together as a "family of schemas" so that the
aggregate navigator knows which tables are
related to one another.
#4 Force all SQL created by any end user or
application to refer exclusively to the base fact
table and its associated full-size dimension
tables.

10/7/2019 Dr. Navneet Goyal, BITS, Pilani 89


Storing Aggregates

New fact and dimension table approach


(Approach 1)
New Level Field approach (Approach 2)
Both require same space?
Approach 1 is recommended
Reasons?

10/7/2019 Dr. Navneet Goyal, BITS, Pilani 90


Lost Dimensions

[Figure: an aggregate schema in which an entire dimension has been dropped.]


Collapsed Dimensions

[Figure: an aggregate schema in which a dimension has been rolled up (collapsed)
to a coarser grain.]


Configurations
[Figure: a lattice in which three nodes (A, B, C) have been chosen as the
configuration; every other node is assigned the cost of its smallest chosen
ancestor.]

A configuration consists of a set of aggregates.
To compute the cost of a configuration:
– For each lattice node, find its smallest ancestor in the configuration
– Cost of that node = size of that smallest ancestor
– Cost of the configuration = weighted sum of the node costs
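In symbols, a small LaTeX restatement of this linear cost model, where w_q is the workload weight of lattice node q and anc_C(q) is its smallest ancestor available in configuration C (the base fact table is always available):

  \mathrm{cost}(C) \;=\; \sum_{q \in \text{lattice}} w_q \cdot \bigl|\mathrm{anc}_C(q)\bigr|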
Greedy Algorithm
Greedy aggregate selection algorithm
– Add aggregates one at a time until the space budget is exhausted
– Always add the aggregate whose addition most decreases the cost of the current
configuration
Performance guarantee for Greedy
– Benefit of configuration C = (cost of the no-aggregate configuration) - (cost of C)
– B_G,k = benefit of the k-aggregate configuration chosen by the greedy algorithm
– B_OPT,k = benefit of the best possible k-aggregate configuration
– Theorem: B_G,k > 0.63 * B_OPT,k
Greedy always achieves at least 63% of the optimal benefit.
For the proof, see "Implementing Data Cubes Efficiently" by Harinarayan,
Rajaraman, and Ullman, 1996.
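The 0.63 constant is the familiar greedy bound for this kind of benefit function; in the notation above it reads:

  B_{G,k} \;\ge\; \Bigl(1 - \frac{1}{e}\Bigr) B_{OPT,k} \;\approx\; 0.632 \, B_{OPT,k}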
Example of Greedy Algorithm
[Lattice: the fact table (size 500) is at the top; below it is aggregate A (size
200); the next level holds B (size 100), C (size 99), and a third node of size
100; the bottom level holds four nodes of size 90, one of which is D. Node A and
the four bottom nodes each carry query weight 1.]

The empty configuration has cost = 2500
 – 5 queries * 500
Which aggregate should be added first?
 – A cost: 1000   (5 * 200)
 – B cost: 1700   (2 * 100 + 3 * 500)
 – C cost: 1698   (2 * 99 + 3 * 500)
 – D cost: 2100   (1 * 100 + 4 * 500)
Number of allowed aggregates = 3
Add aggregate A first
Example of Greedy Algorithm
Which aggregate should be added next?
 – AB cost: 800    (2 * 100 + 3 * 200)
 – AC cost: 798    (2 * 99 + 3 * 200)
 – AD cost: 890    (1 * 90 + 4 * 200)
Add C next
Final 3-aggregate configuration = ACD
 – Cost = 688      (1 * 90 + 2 * 99 + 2 * 200)
(Number of allowed aggregates = 3; the lattice is the same as on the previous
slide.)
Example of Greedy Algorithm
Greedy configuration: total cost = 688     Optimal configuration: total cost = 600

[Figure: the same lattice (node sizes 200; 100, 99, 100; 90, 90, 90, 90, with the
same weight-1 query nodes), with the three aggregates chosen by greedy marked on
the left and the three aggregates of the optimal configuration marked on the
right.]
Practical Limitations
Sizes of aggregates
– Greedy algorithm assumed sizes of aggregates known
– In reality, computing the size of an aggregate can be expensive
Essentially, need to construct the aggregate table
– Estimation techniques
Based on sampling or hashing
Number of candidates
– n total dimension attributes → 2^n possible fact aggregates
– Considering all possible aggregates would take too long even for
moderate n
– Need to prune the search space before applying greedy selection
Number of aggregates vs. space consumed
– Space consumption is more appropriate limit than maximum number
– Modified greedy algorithm: At each step, add aggregate that has best
ratio of (cost improvement) / (size of aggregate)
Impact of indexes
– Selection of indexes also affects query performance
– Should not be done independently of aggregate selection
Thursday: A more practical technique
