10BI23
Instructor: Dr S.Natarajan
Professor and Key Resource Person
Department of Information Science and Engineering
PES Institute of Technology
Bangalore
Part 2
Aggregations: SQL and Aggregation
Aggregation functions and grouping
Aggregation
- Ralph Kimball
10/7/2019 Dr. Navneet Goyal, BITS, Pilani
Aggregates
• Pre-computed aggregates
– Materialized views
– Aggregate navigation
– Dimension and fact aggregates
• Selection of aggregates
– Manual selection
– Greedy algorithm
– Limitations of greedy approach
Star Schema
Single data (fact) table surrounded by multiple descriptive (dimension) tables
[Figure: a central Fact table connected to four surrounding Dim tables]
Star Schema for Sales
[Figure: star schema for Sales, with the Dimension Tables arranged around the central Fact Table]
Star Schema Representation
• Facts and dimensions are represented by physical tables in the data warehouse database
• The fact table is related to each dimension table in a many-to-one relationship (primary/foreign key relationships)
• A fact table is related to many dimension tables
– The primary key of the fact table is a composite key made up of the foreign keys to the dimension tables
• Each fact table is designed to answer a specific DSS (decision support system) question
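The key relationships above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names (date_dim, product_dim, store_dim, sales_fact) and the sample rows are assumptions, not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE date_dim    (date_key    INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, city TEXT, state TEXT);
-- Fact table: one foreign key per dimension; together they form the composite primary key
CREATE TABLE sales_fact (
    date_key    INTEGER NOT NULL REFERENCES date_dim(date_key),
    product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
    store_key   INTEGER NOT NULL REFERENCES store_dim(store_key),
    quantity    INTEGER,
    PRIMARY KEY (date_key, product_key, store_key)
);
""")
conn.execute("INSERT INTO date_dim VALUES (1, '2019-10-07', 'Oct', 2019)")
conn.execute("INSERT INTO product_dim VALUES (1, 'Jeans', 'Apparel')")
conn.execute("INSERT INTO store_dim VALUES (1, 'Madison', 'WI')")
conn.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 25)")

def try_insert(row):
    """Try to add a fact row and report whether the constraints accepted it."""
    try:
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

duplicate = try_insert((1, 1, 1, 5))   # same composite key as the existing row
orphan = try_insert((1, 1, 99, 5))     # store 99 does not exist in store_dim
```

Both extra inserts are refused: the first violates the composite primary key, the second the many-to-one link to the store dimension.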
Physical Database Design
• Logical database design
– What are the facts and dimensions?
– Goals: Simplicity, Expressiveness
– Make the database easy to understand
– Make queries easy to ask
• Physical database design
– How should the data be arranged on disk?
– Goal: Performance
• Manageability is an important secondary concern
– Make queries run fast
Load vs. Query Trade-off
• Trade-off between query performance and load
performance
• To make queries run fast:
– Precompute as much as possible
– Build lots of data structures
• Indexes
• Materialized views
• But…
– Data structures require disk space to store
– Building/updating data structures takes time
– More data structures → longer load time
Typical Storage Allocations
• Base data
– Fact tables and dimension tables
– Fact table space >> Dimension table space
• Indexes
– 100%-200% of base data
• Aggregates / Materialized Views
– 100% of base data
• Extra data structures 2-3 times size of base data
Why This Syntax?
Abstract syntax
select <field list> <aggregate list>
from <table expression>
where <search condition>
group by [ cube | drill down] <aggregate list>
having <search condition>
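[Figure: fact table view next to the equivalent multi-dimensional cube; dimensions = 2]
3-D Cube
[Figure: fact table view next to the equivalent multi-dimensional cube; dimensions = 3]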
Cube Aggregation: Roll-up
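Example: computing sums (an empty cell is shown as -)

day 1    s1   s2   s3
p1       12    -   50
p2       11    8    -

day 2    s1   s2   s3
p1       44    4    -
p2        -    -    -

Roll-up over time (day1 + day2):
         s1   s2   s3
p1       56    4   50
p2       11    8    -

Store-wise sum, rolling up over products as well (p1 + p2):
         s1   s2   s3
sum      67   12   50

Product-wise sum, rolling up over stores instead (s1 + s2 + s3):
         sum
p1       110
p2        19

Grand total: 129
Roll-up moves from the detailed cells toward these sums; drill-down moves back toward the detail.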
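The roll-up sums in this example can be reproduced with a few lines of Python. The cell values are copied from the slide; the function name roll_up and the (day, product, store) key layout are illustrative choices.

```python
from collections import defaultdict

# (day, product, store) -> sales, copied from the example cube
sales = {
    ("day1", "p1", "s1"): 12, ("day1", "p1", "s3"): 50,
    ("day1", "p2", "s1"): 11, ("day1", "p2", "s2"): 8,
    ("day2", "p1", "s1"): 44, ("day2", "p1", "s2"): 4,
}

def roll_up(cells, keep):
    """Sum all cells, keeping only the key positions listed in `keep`."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

by_product_store = roll_up(sales, keep=(1, 2))  # day1 + day2
by_store = roll_up(sales, keep=(2,))            # also p1 + p2
by_product = roll_up(sales, keep=(1,))          # s1 + s2 + s3 over both days
grand_total = roll_up(sales, keep=())[()]
```

Each call drops one or more dimensions from the key, which is all a roll-up with SUM does.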
Cube Operators for Roll-up
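[Figure: the same day 1 / day 2 cube and its roll-ups, with cells labelled using cube operators, where * stands for all values of a dimension]
sale(s1,*,*) = 67 (store s1, all products, all days)
sale(s2,p2,*) = 8 (store s2, product p2, all days)
sale(*,*,*) = 129 (grand total)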
Extended Cube
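Each dimension is extended with the value * ("all"); an empty cell is shown as -

day 1    s1   s2   s3    *
p1       12    -   50   62
p2       11    8    -   19
*        23    8   50   81

day 2    s1   s2   s3    *
p1       44    4    -   48
p2        -    -    -    -
*        44    4    -   48

*        s1   s2   s3    *
p1       56    4   50  110
p2       11    8    -   19
*        67   12   50  129

Example: sale(*,p2,*) = 19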
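A small sketch of how the extended cube can be built: every base cell contributes to each combination in which any of its coordinates is replaced by *. The values come from the slide; the function name extended_cube and the (store, product, day) key order, chosen to match the sale(store, product, day) notation, are assumptions.

```python
from itertools import product

# (store, product, day) -> sales, copied from the example cube
sales = {
    ("s1", "p1", "day1"): 12, ("s3", "p1", "day1"): 50,
    ("s1", "p2", "day1"): 11, ("s2", "p2", "day1"): 8,
    ("s1", "p1", "day2"): 44, ("s2", "p1", "day2"): 4,
}

def extended_cube(cells):
    """Return every cell of the cube, including those with '*' coordinates."""
    cube = {}
    for key, value in cells.items():
        # Each coordinate either stays as it is or is replaced by '*'
        for mask in product((False, True), repeat=len(key)):
            agg_key = tuple("*" if starred else k for k, starred in zip(key, mask))
            cube[agg_key] = cube.get(agg_key, 0) + value
    return cube

cube = extended_cube(sales)
s1_total = cube[("s1", "*", "*")]
p2_total = cube[("*", "p2", "*")]
grand_total = cube[("*", "*", "*")]
```

With n dimensions each base cell feeds 2^n cube cells, which is why the full cube is much larger than the base data.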
Aggregation Using Hierarchies
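[Figure: the day 1 / day 2 cube by product and store, with the hierarchy store → region → country on the store dimension]
Rolling up from store to region (summed over both days):
         region A   region B
p1             56         54
p2             11          8
(store s1 in Region A; stores s2, s3 in Region B)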
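Rolling up along a hierarchy only needs a mapping from the lower level to the higher one. The sketch below uses the slide's values and its store-to-region assignment; the variable names are illustrative.

```python
from collections import defaultdict

# (day, product, store) -> sales, copied from the example cube
sales = {
    ("day1", "p1", "s1"): 12, ("day1", "p1", "s3"): 50,
    ("day1", "p2", "s1"): 11, ("day1", "p2", "s2"): 8,
    ("day2", "p1", "s1"): 44, ("day2", "p1", "s2"): 4,
}
# Hierarchy step store -> region, as stated on the slide
store_to_region = {"s1": "A", "s2": "B", "s3": "B"}

by_product_region = defaultdict(int)
for (day, prod, store), value in sales.items():
    # Replace the store by its region and sum over days
    by_product_region[(prod, store_to_region[store])] += value
by_product_region = dict(by_product_region)
```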
Slicing
[Figure: slices cut from the day 1 / day 2 cube]
Slice PRODUCT = p1:
         s1   s2   s3
day 1    12    -   50
day 2    44    4    -
Slice TIME = day 1:
         s1   s2   s3
p1       12    -   50
p2       11    8    -
Cross Tabulation
• Cross-tabs can be represented as relations
• The value 'all' is used to represent aggregates
• The SQL:1999 standard actually uses null values in place of 'all', despite the confusion with regular null values
Data Cube
A data cube is a multidimensional generalization of a cross-tab
Can have n dimensions; we show 3 below
Cross-tabs can be used as views on a data cube
Cube Operator
Fact table (sales):
pid  timeid  locid  sales
11   1       1      25
11   2       1      8
11   3       1      15
12   1       1      30
12   2       1      20
12   3       1      50
13   1       1      8
13   2       1      10
13   3       1      10
11   1       2      35
11   2       2      22
11   3       2      10
12   1       2      26
12   2       2      45

Location dimension:
locid  city     state  country
1      Madison  WI     USA
2      Fresno   CA     USA
5      Chennai  TN     India

Product dimension:
pid  pname      category    price
11   Lee Jeans  Apparel     25
12   Zord       Toys        18
13   Biro Pen   Stationery  2

Time dimension:
timeid  date      month  year  holiday
1       10/11/05  Nov    1995  N
2       11/11/05  Nov    1996  N
Cross-tab of sales by year and state (rows shown on the slide):
         WI    CA   Total
1997     75    35    110
Total   176   223    399

Corresponding rows of the cube result (year, state, sales):
1997  CA   35
ALL   CA   223
ALL   ALL  399
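[Figure: sales cube with Product (Milk, Coke, Cream, Soap, Bread), Time (M T W Th F S S), and Store axes; the highlighted cell shows 56 units of bread sold in S1 on M, and the Time axis can be rolled up to week]
Hierarchies:
• Product → Brand → …
• Day → Week → Quarter
• Store → Region → Country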
Twelve rules for evaluating OLAP products (E. F. Codd)
• 1. Multidimensional Conceptual View
• 2. Transparency
• 3. Accessibility
• 4. Consistent Reporting Performance
• 5. Client-Server Architecture
• 6. Generic Dimensionality
• 7. Dynamic Sparse Matrix Handling
• 8. Multi-User Support
• 9. Unrestricted Cross-dimensional Operations
• 10. Intuitive Data Manipulation
• 11. Flexible Reporting
• 12. Unlimited Dimensions and Aggregation Levels
Materialized Views
• Many DBMSs support materialized views
– Precomputed result of a particular query
– Goal: faster response for related queries
• Example:
– View definition:
SELECT State, SUM(Quantity)
FROM Sales GROUP BY State
– Query:
SELECT SUM(Quantity)
FROM Sales
WHERE State = 'CA'
– Scan view rather than Sales table
• View matching problem
– When can a query be re-written using a materialized view?
– Difficult to solve in its full generality
– DBMSs handle common cases via limited set of re-write rules
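The view-matching example above can be tried out with Python's sqlite3 module. SQLite has no materialized views, so this sketch stands in a precomputed summary table (SalesByState, a hypothetical name) for the view; the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (State TEXT, Quantity INTEGER)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?)",
    [("CA", 10), ("CA", 25), ("WI", 7), ("WI", 3), ("NY", 5)],
)

# Stand-in for the materialized view: the stored result of the view definition
conn.execute("""
    CREATE TABLE SalesByState AS
    SELECT State, SUM(Quantity) AS TotalQuantity
    FROM Sales GROUP BY State
""")

# Original query against the base table
from_base = conn.execute(
    "SELECT SUM(Quantity) FROM Sales WHERE State = 'CA'"
).fetchone()[0]

# Rewritten query: scan the (much smaller) summary instead of Sales
from_view = conn.execute(
    "SELECT TotalQuantity FROM SalesByState WHERE State = 'CA'"
).fetchone()[0]
```

The rewrite is valid here because the query's filter column (State) is one of the view's grouping columns, which is the kind of common case a rewrite rule can recognise.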
Cross Tab Report
[Figure: The Data Cube and the Sub-Space Aggregates. A cross-tab of sales by Make, Year, and Color (RED, WHITE, BLUE) builds up to a cube, with aggregates labelled Sum, By Make, By Year, By Color, By Make & Year, By Make & Color, and By Color & Year]
The Data Cube Concept
[Figure: data cube with axes MAKE (Ford, Chevy), YEAR (1994, 1995), and COLOR (Black, White)]
Sub-cube Derivation
<M,Y,C>
<*,*,C>
<M,*,*> <*,Y,*>
<*,*,*>
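The sub-cubes over M, Y, and C form a lattice: a sub-cube can be derived from any sub-cube that keeps at least the same dimensions. A minimal sketch, with illustrative function names (group_bys, label, derivable_from):

```python
from itertools import combinations

dims = ("M", "Y", "C")  # Make, Year, Color

def group_bys(dims):
    """All 2^n sub-cubes, from the most detailed to the grand total."""
    nodes = []
    for r in range(len(dims), -1, -1):
        for kept in combinations(dims, r):
            nodes.append(frozenset(kept))
    return nodes

def label(node):
    """Write a sub-cube in the slide's notation, e.g. <M,*,*>."""
    return "<" + ",".join(d if d in node else "*" for d in dims) + ">"

def derivable_from(target, source):
    """A sub-cube can be computed from any sub-cube that keeps a superset of its dimensions."""
    return target <= source

nodes = group_bys(dims)
labels = [label(n) for n in nodes]
```

For example, `<M,*,*>` is derivable from `<M,Y,*>` or `<M,Y,C>`, but not from `<*,Y,C>`.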
CUBE
Base rows (Make, Year, Color, Sales):
Chevy  1991  blue   49
Chevy  1992  red    31
Chevy  1992  white  54
Chevy  1992  blue   71
Ford   1990  red    64
Ford   1990  white  62
Ford   1990  blue   63
Ford   1991  red    52
Ford   1991  white  9
Ford   1991  blue   55
Ford   1992  red    27
Ford   1992  white  62
Ford   1992  blue   39

Aggregate rows added by CUBE (Make, Year, Color, Sales):
ALL    ALL   red    165
ALL    ALL   white  273
ALL    ALL   blue   339
Chevy  1990  ALL    154
Chevy  1991  ALL    199
Chevy  1992  ALL    157
Ford   1990  ALL    189
Ford   1991  ALL    116
Ford   1992  ALL    128
Chevy  ALL   red    91
Chevy  ALL   white  236
Chevy  ALL   blue   183
Ford   ALL   red    144
Ford   ALL   white  133
Ford   ALL   blue   156
ALL    1990  red    69
ALL    1990  white  149
ALL    1990  blue   125
ALL    1991  red    107
ALL    1991  white  104
ALL    1991  blue   104
ALL    1992  red    59
ALL    1992  white  116
ALL    1992  blue   110
User Defined Aggregate Function
Generalized For Cubes
Aggregates have graduated difficulty
• Distributive: can compute cube from next lower dimension values (count, min, max, ...)
• Algebraic: can compute cube from next lower dimension scratchpads (average, ...)
• Holistic: need base data (median, mode, rank, ...)
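The three classes can be illustrated on data split into partitions (standing in for lower-level sub-cubes). The numbers below are made up for the illustration.

```python
from statistics import median

# Base data split into three partitions (hypothetical values)
partitions = [[4, 8, 15], [16, 23], [42]]
base = [x for part in partitions for x in part]

# Distributive: the aggregate of the sub-aggregates is the aggregate of the whole
total = sum(sum(part) for part in partitions)
largest = max(max(part) for part in partitions)

# Algebraic: average needs a fixed-size scratchpad (sum, count) per partition
scratchpads = [(sum(part), len(part)) for part in partitions]
average = sum(s for s, _ in scratchpads) / sum(n for _, n in scratchpads)

# Holistic: no fixed-size summary is enough; combining medians gives the wrong answer
median_of_medians = median(median(part) for part in partitions)
true_median = median(base)
```

Here the median of the partition medians differs from the median of the base data, which is why holistic aggregates have to go back to the base rows.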
[Figure: aggregate navigator. Warehouse clients send queries to the aggregate navigator, which uses aggregate metadata (What aggregates exist? How large is each one?) to send rewritten queries to the data warehouse]
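One simple way to picture the navigator's decision: among the aggregates whose attributes cover the query, pick the one with the fewest rows. The metadata below (table names, attribute sets, row counts) is entirely hypothetical, and a real navigator would also rewrite the SQL text.

```python
# Hypothetical aggregate metadata: name -> (attributes available at that grain, row count)
metadata = {
    "sales_base":           ({"date", "month", "year", "product", "brand", "store", "state"}, 1_000_000),
    "sales_by_month_brand": ({"month", "year", "brand"}, 20_000),
    "sales_by_year_state":  ({"year", "state"}, 150),
    "sales_by_year":        ({"year"}, 3),
}

def choose_aggregate(query_attrs, metadata):
    """Pick the smallest table that contains every attribute the query needs."""
    usable = [
        (rows, name)
        for name, (attrs, rows) in metadata.items()
        if set(query_attrs) <= attrs
    ]
    if not usable:
        raise LookupError("no table can answer this query")
    return min(usable)[1]

target_year_brand = choose_aggregate({"year", "brand"}, metadata)
target_state = choose_aggregate({"state"}, metadata)
target_product = choose_aggregate({"product", "year"}, metadata)
```

Clients keep writing queries against the base schema; only the navigator needs to know which aggregates exist and how large they are.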
Dimension Aggregates
• Methods to define aggregates
– 1. Include or leave out entire dimensions
– 2. Include some columns of each dimension, but not others
– Second approach is more common / more useful
• Example: Several versions of Date dimension
– Base Date dimension
• 1 row per day
• Includes all Date attributes
– Monthly aggregate dimension
• 1 row per month
• Includes Date attributes at month level or higher
– Yearly aggregate dimension
• 1 row per year
• Includes only year-level Date attributes
• Each dimension aggregate has its own set of surrogate keys
• Each aggregated fact joins to 1 version of the Date dimension
– Or else the Date dimension is omitted entirely…
Choosing Dimension Aggregates
• Dimension aggregates often roll up along hierarchies
– Day – month – year
– SKU – brand – category – department
– Store – city – county – state – country
• Any subset of dimension attributes can be used
– Promotion dimension includes attributes for coupon, ad,
discount, and end-of-aisle display
– Promotion aggregate might include only ad-related attributes
– Customer aggregate might include only a few frequently-queried
columns (age, gender, income, marital status)
• Goal: reduced number of distinct combinations
– Results in fewer rows in aggregated fact
– Customer aggregate that included SSN would be pointless
Aggregate Examples
• Sales fact table
– Date, Product, Store, Promotion, Transaction ID
• Date aggregates
– Month
– Year
• Product aggregates
– Brand
– Manufacturer
• Promotion aggregates
– Ad
– Discount
– Coupon
– In-Store Display
• Store aggregates
– District
– State
Choosing Fact Aggregates
• Aggregated fact table includes:
– Foreign keys to dimension aggregate tables
– Aggregated measurement columns
• For example, Quantity column holds SUM(Quantity)
• Need to choose which version of each dimension to use
• Transaction ID will never be included
– Aggregates with degenerate dimensions are rare
• Number of possible fact aggregates: 4 * 4 * 6 * 4 = 384
– Dimension with n aggregates → n+2 possibilities
• Include base dimension
• Omit dimension entirely
• n dimension aggregates to choose from
– Constructing fact aggregates for all combinations would be
impractical
– How to decide which combinations to construct?
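The count follows directly from the n+2 rule applied to the dimension aggregates listed under Aggregate Examples. A short check (the dictionary layout is just an illustration):

```python
from math import prod

# Dimension aggregates from the Aggregate Examples slide (Transaction ID is never included)
dimension_aggregates = {
    "Date": ["Month", "Year"],
    "Product": ["Brand", "Manufacturer"],
    "Promotion": ["Ad", "Discount", "Coupon", "In-Store Display"],
    "Store": ["District", "State"],
}

# n aggregates -> n + 2 choices: any of the n aggregates, the base dimension, or omit it
choices = {dim: len(aggs) + 2 for dim, aggs in dimension_aggregates.items()}
possible_fact_aggregates = prod(choices.values())
```

The product counts every combination, including the one that keeps all base dimensions (the base fact table itself) and the one that omits every dimension (the grand total).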
Aggregate Selection
• Two approaches
– Manual selection
• Data warehouse designer chooses aggregates
• Could be time-consuming and error-prone
– Automatic selection
• Use algorithms to optimize choice of aggregates
• Good in principle; however, problem is hard
• Heuristics for manual selection
– Include a mixture of breadth and depth
• A few broad aggregates
– Lots of dimensions at fairly fine-grained level of detail
– Achieve moderate speedup on a wide range of queries
• Lots of highly targeted aggregates with only a few rows each
– Dimensions are highly rolled up or omitted entirely
– Each aggregate achieves large speedup on a small class of queries
• Roughly equal allocation of space to each type
– Consider the query workload
• What attributes are queried most frequently?
• What sets of attributes are often queried together?
• Make sure that common / important queries run fast
• Aggregate design can be adjusted over time (learn from experience)
Constructing Aggregates
• Constructing dimension aggregates
– Determine attributes in aggregate
– Generate unique combinations of those attributes by SELECT
DISTINCT from dimension table
– Assign surrogate keys to aggregate table rows
• Constructing fact aggregates
– Build mapping table that maps dimension keys to dimension
aggregate keys, for each dimension
– Join fact table to mapping tables and group by aggregate keys
• Constructing aggregates from other aggregates
– Fact aggregate on (Product, Year) can be built from (Product,
Month) aggregate
– Faster than using base table
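The construction steps above can be sketched in SQL, run here through Python's sqlite3 module. All table names, column names, and rows are assumptions made for the example; only the steps themselves come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
INSERT INTO date_dim VALUES
    (1, '2019-01-05', '2019-01', 2019),
    (2, '2019-01-20', '2019-01', 2019),
    (3, '2019-02-03', '2019-02', 2019);
CREATE TABLE sales_fact (date_key INTEGER, product_key INTEGER, quantity INTEGER);
INSERT INTO sales_fact VALUES (1, 10, 5), (2, 10, 7), (3, 10, 2), (2, 11, 4);

-- Dimension aggregate: unique month-level combinations, with new surrogate keys
CREATE TABLE month_dim (month_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
INSERT INTO month_dim (month, year)
    SELECT DISTINCT month, year FROM date_dim ORDER BY year, month;

-- Mapping table: base dimension key -> dimension aggregate key
CREATE TABLE date_to_month AS
    SELECT d.date_key, m.month_key
    FROM date_dim d JOIN month_dim m ON d.month = m.month AND d.year = m.year;

-- Fact aggregate: join the fact table to the mapping and group by the aggregate keys
CREATE TABLE sales_by_month AS
    SELECT map.month_key, f.product_key, SUM(f.quantity) AS quantity
    FROM sales_fact f JOIN date_to_month map ON f.date_key = map.date_key
    GROUP BY map.month_key, f.product_key;
""")

rows = conn.execute("""
    SELECT m.month, s.product_key, s.quantity
    FROM sales_by_month s JOIN month_dim m ON s.month_key = m.month_key
    ORDER BY m.month, s.product_key
""").fetchall()
```

A yearly fact aggregate could then be built from sales_by_month in the same way, without touching sales_fact again.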
Data Cube Lattice
[Figure: data cube lattice over State, Month, and Color, from the (State, Month, Color) node at the top down to Total]
Sparsity Revisited
• Fact tables are usually sparse
– Not all possible combinations of dimension values actually occur
– E.g. not all products sell in all stores on all days
• Aggregate tables are not as sparse
– Coarser grain → less sparsity
– All products sell in SOME store in all MONTHS
– Thus space savings from aggregation can be less than one
might think
• Example: Date dimension vs. Year dimension aggregate
– Year table has 1/365 as many rows as Date table
– Fact aggregate with (Product, Store, Year) has fewer rows than
fact aggregate with (Product, Store, Date)
– But more than 1/365 of the rows!
– Some potential space savings are lost due to reduced sparsity
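A toy illustration of the effect, with made-up fact rows: a product that sells in a store on only a few days of the year still needs one full row in the yearly aggregate.

```python
from datetime import date

# Hypothetical sparse fact rows: (product, store, date) combinations that actually occurred
fact_rows = {
    ("p1", "s1", date(2019, 1, 5)), ("p1", "s1", date(2019, 3, 9)),
    ("p1", "s1", date(2019, 7, 1)), ("p1", "s2", date(2019, 2, 2)),
    ("p2", "s1", date(2019, 5, 5)), ("p2", "s1", date(2019, 5, 6)),
}

# Rows of the (Product, Store, Year) aggregate: one per distinct combination
year_rows = {(product, store, day.year) for product, store, day in fact_rows}

reduction = len(year_rows) / len(fact_rows)
```

In this toy data the aggregate keeps half as many rows as the base fact, far more than the 1/365 suggested by the sizes of the Date and Year dimension tables.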
Automatic Selection of Aggregates
• Problem Definition
– Inputs:
• Query workload
• Set of candidate aggregates
• Query cost model
• Maximum space to use for aggregates
– Output:
• Set of aggregates to construct
– Objective:
• Minimize cost of executing query workload
• Problem is NP-Complete
– Approximation is required
– We’ll discuss a Greedy algorithm for the problem
– Due to Harinarayan, Rajaraman, and Ullman (1996)
Problem Inputs
• Set of candidate aggregates
– We’ll consider the data cube lattice
• Query workload
– Each OLAP query maps to a node in the lattice
• Union of grouping and filtering attributes
– Workload = weight assigned to each lattice node
• Weight = fraction of queries in workload that correspond to this node
• Query cost model
– We’ll use simple linear cost model
– Cost proportional to size of fact aggregate used to answer query
– Justification:
• Dominant cost is I/O to retrieve tables
• Dimension tables are small relative to the fact table
– Oversimplification of reality
• Makes the problem easier to analyze
• Maximum space to use for aggregates
– We’ll fix a maximum number of aggregates, regardless of their size
– Another simplification for purposes of analysis
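Under these simplifications (linear cost model, a fixed number of aggregates), a greedy selection in the spirit of Harinarayan, Rajaraman, and Ullman can be sketched as follows. The lattice over three dimensions P, S, D, the aggregate sizes, and the uniform weights are hypothetical, and the function names are illustrative; this is a sketch of the idea, not the authors' implementation.

```python
def query_cost(node, materialized, size):
    """Linear cost model: size of the smallest materialized aggregate that can answer the node."""
    return min(size[v] for v in materialized if node <= v)

def total_cost(materialized, size, weight):
    """Weighted cost of the whole workload given the materialized aggregates."""
    return sum(weight[w] * query_cost(w, materialized, size) for w in size)

def greedy_select(size, weight, base, k):
    """Pick up to k aggregates, each time adding the one with the largest benefit."""
    chosen = [base]  # the base fact table is always available
    for _ in range(k):
        def benefit(v):
            # Cost saved on every query that v can answer
            return sum(
                weight[w] * max(0, query_cost(w, chosen, size) - size[v])
                for w in size if w <= v
            )
        candidates = [v for v in size if v not in chosen]
        if not candidates:
            break
        best = max(candidates, key=benefit)
        if benefit(best) <= 0:
            break
        chosen.append(best)
    return chosen[1:]

F = frozenset
# Hypothetical sizes (in rows) for the lattice over dimensions P, S, D
size = {
    F("PSD"): 100, F("PS"): 60, F("PD"): 80, F("SD"): 50,
    F("P"): 20, F("S"): 10, F("D"): 30, F(): 1,
}
weight = {node: 1 for node in size}  # every lattice node queried equally often

picked = greedy_select(size, weight, base=F("PSD"), k=2)
cost_before = total_cost([F("PSD")], size, weight)
cost_after = total_cost([F("PSD")] + picked, size, weight)
```

With these numbers the first pick is (S, D), which helps four lattice nodes at once, and the second is (P), which helps only two nodes but by a large margin.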
Aggregation Hierarchies
• 10000 products in 2000 categories
• 1000 stores in 100 districts
• 30 aggregates in 100 time periods
[Figure: two example selections among aggregates A, B, and C over eight nodes labelled 90, each with Weight=1. Total cost = 688 for one selection and 600 for the other.]
Practical Limitations
• Sizes of aggregates
– Greedy algorithm assumed sizes of aggregates are known
– In reality, computing the size of an aggregate can be expensive
• Essentially, need to construct the aggregate table
– Estimation techniques
• Based on sampling or hashing
• Number of candidates
– n total dimension attributes → 2^n possible fact aggregates
– Considering all possible aggregates would take too long even for moderate n
– Need to prune the search space before applying greedy selection
• Number of aggregates vs. space consumed
– Space consumption is a more appropriate limit than maximum number
– Modified greedy algorithm: at each step, add the aggregate that has the best ratio of (cost improvement) / (size of aggregate)
• Impact of indexes
– Selection of indexes also affects query performance
– Should not be done independently of aggregate selection
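The modified greedy step with a space limit can be sketched as below. As before, the lattice over P, S, D, the sizes, the weights, and the space budget are hypothetical, and the function names are illustrative.

```python
def query_cost(node, materialized, size):
    """Linear cost model: size of the smallest materialized aggregate that can answer the node."""
    return min(size[v] for v in materialized if node <= v)

def greedy_by_ratio(size, weight, base, space_budget):
    """Repeatedly add the aggregate with the best (cost improvement) / (size) that still fits."""
    chosen, remaining = [base], space_budget
    while True:
        def benefit(v):
            return sum(
                weight[w] * max(0, query_cost(w, chosen, size) - size[v])
                for w in size if w <= v
            )
        fits = [
            v for v in size
            if v not in chosen and size[v] <= remaining and benefit(v) > 0
        ]
        if not fits:
            return chosen[1:]
        best = max(fits, key=lambda v: benefit(v) / size[v])
        chosen.append(best)
        remaining -= size[best]

F = frozenset
# Hypothetical sizes (in rows) for the lattice over dimensions P, S, D
size = {
    F("PSD"): 100, F("PS"): 60, F("PD"): 80, F("SD"): 50,
    F("P"): 20, F("S"): 10, F("D"): 30, F(): 1,
}
weight = {node: 1 for node in size}

picked = greedy_by_ratio(size, weight, base=F("PSD"), space_budget=60)
space_used = sum(size[v] for v in picked)
```

With a budget of 60 rows the ratio rule favours several small aggregates (the grand total, then S, then P) over one larger aggregate such as (S, D), which shows how the choice of limit changes the selection.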
Thursday: A more practical technique