Ozsu Chap5

Chap. 5 Distributed Database Design
• Introduction
– Alternative design strategies
• Distribution design issues
• Data fragmentation
• Data allocation
M.H. Kim, KAIST

Introduction: Design Strategies
• Alternative design strategies
» approaches to distribution design
– Top-down design
» suitable when designing systems from scratch
» mostly in homogeneous systems
» the main focus of this chapter
– Bottom-up design
» suitable when DBs already exist at a number of sites
» mostly in heterogeneous systems
» discussed in Chapter 15
Introduction: Design Strategies (cont’d)
[Figure: Top-down design process. Requirements Analysis (objectives, user input) leads to Conceptual Design and View Design, which are reconciled by View Integration; these yield the GCS, access information, and external schemas; Distribution Design (with user input) then produces the LCSs, and Physical Design produces the LISs.]
Introduction: Design Strategies (cont’d)
Distribution design
» design the local conceptual schemas
· by distributing entities over the sites of the DCS
– fragmentation
– allocation
Distribution Design Issues (cont’d)
• Reasons for fragmentation
– a whole relation may not be a suitable unit of distribution
– application views are usually subsets of relations
» i.e., locality or proximity
– fragmentation permits a number of transactions to execute concurrently
» i.e., transactions that access different portions of a relation
» inter-query concurrency
» intra-query concurrency
· i.e., parallel execution of a single query
Disadvantages of fragmentation
– may require extra processing, e.g., joins
» for views that cannot be defined on a single fragment
– semantic data control is more difficult
» especially integrity enforcement
• Fragmentation alternatives
– horizontal fragmentation
– vertical fragmentation
(Ex) Horizontal fragmentation
PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston
PROJ1: projects with budgets less than $200,000
PROJ2: projects with budgets greater than or equal to $200,000
(Example cont’d)
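PROJ1
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
PROJ2
PNO  PNAME        BUDGET  LOC
P3   CAD/CAM      250000  New York
P4   Maintenance  310000  Paris
P5   CAD/CAM      500000  Boston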
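The horizontal split of PROJ can be sketched as two selections. This is only a minimal illustration, assuming the relation is held in memory as a list of dictionaries; the helper name `select` is hypothetical and not part of the slides.

```python
# PROJ relation from the example, one dict per tuple.
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000, "LOC": "Paris"},
    {"PNO": "P5", "PNAME": "CAD/CAM", "BUDGET": 500000, "LOC": "Boston"},
]


def select(relation, predicate):
    """Relational selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


# PROJ1: budgets below 200,000; PROJ2: budgets of 200,000 or more.
PROJ1 = select(PROJ, lambda t: t["BUDGET"] < 200000)
PROJ2 = select(PROJ, lambda t: t["BUDGET"] >= 200000)

print([t["PNO"] for t in PROJ1])  # ['P1', 'P2']
print([t["PNO"] for t in PROJ2])  # ['P3', 'P4', 'P5']
```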

(Ex) Vertical fragmentation
PROJ
PNO  PNAME              BUDGET  LOC
P1   Instrumentation    150000  Montreal
P2   Database Develop.  135000  New York
P3   CAD/CAM            250000  New York
P4   Maintenance        310000  Paris
P5   CAD/CAM            500000  Boston
PROJ1: information about project budgets
PROJ2: information about project names and locations
(Example cont’d)
PROJ1
PNO  BUDGET
P1   150000
P2   135000
P3   250000
P4   310000
P5   500000
PROJ2
PNO  PNAME              LOC
P1   Instrumentation    Montreal
P2   Database Develop.  New York
P3   CAD/CAM            New York
P4   Maintenance        Paris
P5   CAD/CAM            Boston
• Degree of fragmentation
– a large number of alternatives, ranging from
» no fragmentation at all (the whole relation), to
» fragmentation down to individual tuples or individual attributes
– find the suitable level of partitioning within this range
» can only be defined w.r.t. the applications that will run on the DB
• Correctness of fragmentation
– Completeness
» decomposition of relation R into fragments R1, R2, …, Rn is complete iff each data item in R can also be found in some Ri
– Disjointness
» if relation R is decomposed into fragments R1, R2, …, Rn, and data item di is in Rj, then di should not be in any other fragment Rk (k ≠ j)
– Reconstruction
» if relation R is decomposed into fragments R1, R2, …, Rn, then there should exist some relational operator ∇ such that
R = ∇_{1≤i≤n} Ri
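The three rules can be checked mechanically on the PROJ example. The sketch below is an illustration under simple assumptions: tuples are Python tuples, ∇ is union for the horizontal fragments and a join on the key PNO for the vertical ones, and all function names are hypothetical.

```python
PROJ = [
    ("P1", "Instrumentation", 150000, "Montreal"),
    ("P2", "Database Develop.", 135000, "New York"),
    ("P3", "CAD/CAM", 250000, "New York"),
    ("P4", "Maintenance", 310000, "Paris"),
    ("P5", "CAD/CAM", 500000, "Boston"),
]

# Horizontal fragments (selection on BUDGET).
H1 = [t for t in PROJ if t[2] < 200000]
H2 = [t for t in PROJ if t[2] >= 200000]

# Vertical fragments (projection); the key PNO is kept in both.
V1 = [(pno, budget) for pno, _, budget, _ in PROJ]
V2 = [(pno, pname, loc) for pno, pname, _, loc in PROJ]


def is_complete(relation, fragments):
    """Every tuple of the relation appears in some fragment."""
    return all(any(t in f for f in fragments) for t in relation)


def is_disjoint(fragments):
    """No tuple appears in more than one fragment."""
    seen = set()
    for fragment in fragments:
        for t in fragment:
            if t in seen:
                return False
            seen.add(t)
    return True


def reconstruct_by_union(fragments):
    """Reconstruction operator for horizontal fragments."""
    return sorted(set().union(*map(set, fragments)))


def reconstruct_by_join(v1, v2):
    """Reconstruction operator for vertical fragments: join on PNO."""
    budget_of = dict(v1)
    return sorted((pno, pname, budget_of[pno], loc) for pno, pname, loc in v2)


print(is_complete(PROJ, [H1, H2]), is_disjoint([H1, H2]))
print(reconstruct_by_union([H1, H2]) == sorted(PROJ))
print(reconstruct_by_join(V1, V2) == sorted(PROJ))
```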
• Allocation alternatives
– non-replicated
» partitioned: each fragment resides at only one site
– replicated
» fully replicated: each fragment resides at every site
» partially replicated: each fragment resides at some of the sites
» good for reliability and for the efficiency of read-only queries
» may cause trouble for updates, since all copies must be kept consistent
Comparison of replication alternatives
                      full replication       partial replication  partitioning
query processing      easy                   same difficulty      same difficulty
directory management  easy or non-existent   same difficulty      same difficulty
concurrency control   moderate               difficult            easy
reliability           very high              high                 low
reality               possible applications  realistic            possible applications
• Information requirements
– Database information
– Application information
– Communication network information
– Site information
» i.e., computer system information
– database and application information are used for fragmentation; all four kinds are used for allocation
Fragmentation
• Horizontal Fragmentation (HF)
– Primary Horizontal Fragmentation (PHF)
– Derived Horizontal Fragmentation (DHF)
• Vertical Fragmentation (VF)
• Hybrid Fragmentation (HF)
Fragmentation (cont’d)
• Horizontal fragmentation (HF)
– Information requirements
» database information
» application information
Database Information
– Join graph
» equi-join relationships among relations
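[Figure: Join graph]
PAY(TITLE, SAL)
EMP(ENO, ENAME, TITLE)
PROJ(PNO, PNAME, BUDGET, LOC)
ASG(ENO, PNO, RESP, DUR)
» link L1: owner PAY, member EMP
» link L2: owner EMP, member ASG
» link L3: owner PROJ, member ASG
» each link runs from its owner (the "1" side) to its member (the "n" side)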
Application Information
– qualitative information
» minterm predicate
· denotes access patterns of user applications
– quantitative information
» minterm selectivity
· given a minterm predicate, how many tuples are accessed
» access frequency of the query
· how frequently the query is issued
80/20 rule
» analyzing all the user applications may not be possible
– the most active 20% of user queries account for 80% of the total data accesses
– may be used as a guideline
• Simple predicate
– given relation R(A1, A2, …, An),
» a simple predicate pj has the form
Ai θ Value
» where θ ∈ {=, <, ≤, >, ≥, ≠} and Value is a value from the domain of attribute Ai.
· (ex) a single condition in an SQL WHERE clause
– Pr = {p1, p2, …, pm} denotes the set of all simple predicates defined on R.
(Ex) simple predicates
» PNAME = “Maintenance”
» BUDGET ≤ 200000
• Minterm predicate
– given relation R and Pr = {p1, p2, …, pm},
» define M = {m1, m2, …, mz} as
M = {mi | mi = ∧_{pj ∈ Pr} pj*}, 1 ≤ j ≤ m, 1 ≤ i ≤ z
» where pj* = pj or pj* = ¬(pj).
· i.e., each simple predicate occurs in a minterm predicate either in its natural form or in its negated form
(Ex) minterm predicates
m1: PNAME=“Maintenance” ∧ BUDGET≤200000
m2: NOT(PNAME=“Maintenance”) ∧ BUDGET≤200000
m3: PNAME=“Maintenance” ∧ NOT(BUDGET≤200000)
m4: NOT(PNAME=“Maintenance”) ∧ NOT(BUDGET≤200000)
– Access frequency of a query: acc(qi)
» frequency with which a user application qi accesses data
– Minterm selectivity: sel(mi)
» number of tuples of the relation that would be accessed by a user query specified according to a given minterm predicate mi
· e.g., for the PROJ relation and the minterms m1 and m2 of the minterm example, sel(m1) = 0 and sel(m2) = 2
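Minterm enumeration and selectivity can be sketched together. The block below is an illustration, assuming the five-tuple PROJ relation from the earlier fragmentation example and the two simple predicates of the minterm example; `minterms` and `sel` are hypothetical helper names.

```python
from itertools import product

PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM", "BUDGET": 250000, "LOC": "New York"},
    {"PNO": "P4", "PNAME": "Maintenance", "BUDGET": 310000, "LOC": "Paris"},
    {"PNO": "P5", "PNAME": "CAD/CAM", "BUDGET": 500000, "LOC": "Boston"},
]

# Simple predicates as (label, test) pairs.
simple_predicates = [
    ("PNAME=Maintenance", lambda t: t["PNAME"] == "Maintenance"),
    ("BUDGET<=200000", lambda t: t["BUDGET"] <= 200000),
]


def minterms(predicates):
    """Yield (label, test) for every natural/negated combination."""
    for combo in product([True, False], repeat=len(predicates)):
        signs = combo[::-1]  # vary the first predicate fastest, as in m1..m4
        label = " AND ".join(
            name if keep else f"NOT({name})"
            for (name, _), keep in zip(predicates, signs)
        )

        def test(t, signs=signs):
            return all(p(t) == keep for (_, p), keep in zip(predicates, signs))

        yield label, test


def sel(minterm_test, relation):
    """Minterm selectivity: number of tuples satisfying the minterm."""
    return sum(1 for t in relation if minterm_test(t))


selectivities = [(label, sel(test, PROJ)) for label, test in minterms(simple_predicates)]
for label, count in selectivities:
    print(label, count)
```

With m predicates the enumeration produces 2^m minterms, which is why the contradictory ones are pruned before fragments are defined.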
Fragmentation: PHF
• Primary horizontal fragmentation (PHF)
– defined by a selection operation on the owner relations in the join graph
– given relation R, its horizontal fragments are
Rj = σ_{Fj}(R), 1 ≤ j ≤ w,
» where Fj is a selection formula, which is (preferably) a minterm predicate mj
Fragmentation: PHF (cont’d)
– horizontal fragment Ri of relation R consists of

» all the tuples of R that satisfy a minterm predicate mi
– there are as many horizontal fragments of relation R
» as there are minterm predicates
Minterm fragment
» horizontal fragment defined by a minterm predicate
• Outline of PHF
– given
» a relation R, and the set of simple predicates Pr
– output
» the set of fragments of R = {R1, R2, …, Rw} that obey the fragmentation rule
Rule 1 (fragmentation rule)

» a relation or fragment is partitioned into at least two parts that are accessed differently by at least one application
» i.e., for some application, the access frequency af1 of tuples in fragment ft1 differs from the access frequency af2 of tuples in fragment ft2
• Requirements for the set of simple predicates Pr
– Pr should be complete
– Pr should be minimal
Completeness of simple predicates

– a set of simple predicates Pr is said to be complete
» if and only if any two tuples of the same minterm fragment defined by Pr have the same probability of being accessed by every application
» i.e., any two tuples in the fragment have the same access frequency for every application
(Ex) Completeness of simple predicates

» assume relation PROJ(PNO, PNAME, BUDGET, LOC) has two applications defined on it
– application 1
» Find the budgets of projects at “Montreal”.
» Find the budgets of projects at “New York”.
» Find the budgets of projects at “Paris”.
– application 2
» Find projects with budgets less than or equal to $200000.
» Find projects with budgets greater than $200000.
(Example cont’d)
– according to application 1,
» Pr = {LOC=“Montreal”, LOC=“New York”, LOC=“Paris”}
» but this is not complete with respect to application 2
– thus, modify
» Pr = {LOC=“Montreal”, LOC=“New York”, LOC=“Paris”, BUDGET≤200000, BUDGET>200000}
– then, it is complete.
Minimality of simple predicates

– If a predicate influences the fragmentation,
» i.e., causes a fragment f to be further fragmented into, say, fi and fj,
» then there should be at least one application that accesses fi and fj differently
– In other words, the simple predicate should be relevant in determining a fragmentation
– If all the predicates of a set Pr are relevant,
» then Pr is minimal
(Ex) Minimality of simple predicates

Pr = {LOC=“Montreal”, LOC=“New York”, LOC=“Paris”, BUDGET≤200000, BUDGET>200000}
– is minimal (in addition to being complete).
– However, if we add PNAME = “Instrumentation” to Pr,
» then Pr is not minimal,
· because there is no application that would access the resulting fragments any differently
Algorithm COM_MIN
– input
» a relation R and a set of simple predicates Pr
– output
» a complete and minimal set of simple predicates Pr´ for Pr
» F: set of minterm fragments
– Rule 1 (fragmentation by a relevant predicate)
» a relation or fragment is partitioned into at least two parts that are accessed differently by at least one application
1. Initialization
– find a pi ∈ Pr such that pi partitions R according to Rule 1
– set Pr´ ← {pi}; Pr ← Pr − {pi}; F ← {fi}
/* fi: fragment defined according to a minterm predicate defined over the predicates of Pr´ */
2. Iteratively add predicates to Pr´ until it is complete
– find a pj ∈ Pr such that pj partitions some fragment fk according to Rule 1
– set Pr´ ← Pr´ ∪ {pj}; Pr ← Pr − {pj}; F ← F ∪ {fj}
– if some pk ∈ Pr´ has become non-relevant, remove it: Pr´ ← Pr´ − {pk}; F ← F − {fk}
Primary horizontal fragmentation algorithm

» makes use of COM_MIN to perform fragmentation
– input
» a relation R and a set of simple predicates Pr
– output
» a set of minterm predicates M according to which relation R is to be fragmented

1. Pr´ ← COM_MIN(R, Pr)
2. determine the set M of minterm predicates
3. determine the set I of implications among pi ∈ Pr´
» (ex) LOC=“Montreal” implies NOT(LOC=“New York”) ∧ NOT(LOC=“Paris”)
4. eliminate the contradictory minterms from M by using the set of implications I
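Steps 2 to 4 can be sketched for the PROJ predicates. Note that this sketch swaps the explicit implication set I for a brute-force satisfiability check over a small domain, which has the same effect here. It assumes that LOC takes only the three values named in the predicates and that two sample budgets, one on each side of the threshold, represent the BUDGET domain; both are assumptions of this illustration.

```python
from itertools import product

# Simple predicates p1..p5 for PROJ, as tests on (loc, budget).
PR = [
    ("LOC=Montreal", lambda loc, budget: loc == "Montreal"),
    ("LOC=New York", lambda loc, budget: loc == "New York"),
    ("LOC=Paris", lambda loc, budget: loc == "Paris"),
    ("BUDGET<=200000", lambda loc, budget: budget <= 200000),
    ("BUDGET>200000", lambda loc, budget: budget > 200000),
]

# Assumed domains (hypothetical simplification, see lead-in).
LOC_DOMAIN = ["Montreal", "New York", "Paris"]
BUDGET_SAMPLES = [200000, 200001]


def satisfiable(signs):
    """A minterm is kept if some (loc, budget) pair satisfies it."""
    return any(
        all(test(loc, budget) == keep for (_, test), keep in zip(PR, signs))
        for loc in LOC_DOMAIN
        for budget in BUDGET_SAMPLES
    )


def describe(signs):
    """Show only the predicates that occur in natural (non-negated) form."""
    return " AND ".join(name for (name, _), keep in zip(PR, signs) if keep)


all_minterms = list(product([True, False], repeat=len(PR)))
kept = [signs for signs in all_minterms if satisfiable(signs)]

print(len(all_minterms), "minterms,", len(kept), "non-contradictory")
for signs in kept:
    print(describe(signs))
```

Of the 32 candidate minterms, only the six that pair one location with one budget range survive, matching the list m1 to m6 used for PROJ.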
(Ex) Primary horizontal fragmentation for PAY and PROJ
• Fragmentation of relation PAY
– application: “Check the salary info and determine raise.”
» employee records are managed at two sites
· one site handles records for salaries greater than 30,000
· the other site handles the rest of the records
» this application can run at two sites
– simple predicates
p1: SAL ≤ 30000, p2: SAL > 30000
» Pr = {p1, p2}, which is complete and minimal; thus Pr´ = Pr
(Example cont’d)
– Minterm predicates
m1: (SAL ≤ 30000)
m2: (SAL > 30000), i.e., NOT(SAL ≤ 30000)
PAY1
TITLE       SAL
Mech. Eng.  27000
Programmer  24000
PAY2
TITLE        SAL
Elect. Eng.  40000
Syst. Anal.  34000
(Example cont’d)
• Fragmentation of relation PROJ
– applications
1. Given a location, find the name and budget of projects in that location.
» issued at three sites
2. Access project information according to budget.
» one site accesses BUDGET ≤ 200000,
» the other site accesses BUDGET > 200000
(Example cont’d)
– simple predicates
» for application 1,
p1 : LOC = “Montreal”
p2 : LOC = “New York”
p3 : LOC = “Paris”
» for application 2,
p4 : BUDGET ≤ 200000
p5 : BUDGET > 200000
» Pr = Pr´ = {p1, p2, p3, p4, p5}
(Example cont’d)
– Minterm predicates
» left after eliminating the contradictory ones
» m1: (LOC = “Montreal”) ∧ (BUDGET ≤ 200000)
» m2: (LOC = “Montreal”) ∧ (BUDGET > 200000)
» m3: (LOC = “New York”) ∧ (BUDGET ≤ 200000)
» m4: (LOC = “New York”) ∧ (BUDGET > 200000)
» m5: (LOC = “Paris”) ∧ (BUDGET ≤ 200000)
» m6: (LOC = “Paris”) ∧ (BUDGET > 200000)
(Example cont’d)
PROJ1 (defined by m1)
PNO  PNAME            BUDGET  LOC
P1   Instrumentation  150000  Montreal
PROJ3 (defined by m3)
PNO  PNAME              BUDGET  LOC
P2   Database Develop.  135000  New York
PROJ4 (defined by m4)
PNO  PNAME    BUDGET  LOC
P3   CAD/CAM  250000  New York
PROJ6 (defined by m6)
PNO  PNAME        BUDGET  LOC
P4   Maintenance  310000  Paris
» the fragments for m2 and m5 are empty, since no project satisfies them
Correctness of the PHF algorithm

– Completeness
» clear from the method
· every tuple is in some minterm fragment
– Reconstruction
» if relation R is fragmented into FR = {R1, R2, …, Rr}, then
R = ∪_{Ri ∈ FR} Ri
– Disjointness
» the minterm predicates that form the basis of fragmentation are mutually exclusive, so the fragments are disjoint
Fragmentation: DHF
• Derived horizontal fragmentation (DHF)
– defined on a member relation of a link
» according to a selection operation specified on its owner
[Figure: Join graph with PAY(TITLE, SAL), EMP(ENO, ENAME, TITLE), PROJ(PNO, PNAME, BUDGET, LOC), and ASG(ENO, PNO, RESP, DUR); L1 links owner PAY to member EMP, L2 links owner EMP to member ASG, and L3 links owner PROJ to member ASG.]
Fragmentation: DHF (cont’d)
– each link is an equijoin
» tuples of a member relation participating in an equijoin can be obtained by means of semijoins
– given a link L where owner(L) = S and member(L) = R,
» the derived horizontal fragments of R are defined as
Ri = R ⋉ Si, 1 ≤ i ≤ w
· where Si = σ_{Fi}(S)
» here, Fi is the formula according to which the primary horizontal fragment Si is defined
– main objective: efficient join
(Ex) Derived horizontal fragmentation

– Consider L1
» owner(L1) = PAY and member(L1) = EMP
» EMP1 = EMP ⋉ PAY1, where PAY1 = σ_{SAL≤30000}(PAY)
» EMP2 = EMP ⋉ PAY2, where PAY2 = σ_{SAL>30000}(PAY)
EMP1
ENO  ENAME      TITLE
E3   A. Lee     Mech. Eng.
E4   J. Miller  Programmer
E7   R. Davis   Mech. Eng.
EMP2
ENO  ENAME     TITLE
E1   J. Doe    Elect. Eng.
E2   M. Smith  Syst. Anal.
E5   B. Casey  Syst. Anal.
E6   L. Chu    Elect. Eng.
E8   J. Jones  Syst. Anal.
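The derivation of EMP1 and EMP2 can be sketched as a semijoin on TITLE. This is a minimal illustration using the PAY and EMP tuples shown in the examples; the function name `semijoin_on_title` is hypothetical.

```python
# Owner relation PAY(TITLE, SAL) and member relation EMP(ENO, ENAME, TITLE).
PAY = [
    ("Elect. Eng.", 40000),
    ("Syst. Anal.", 34000),
    ("Mech. Eng.", 27000),
    ("Programmer", 24000),
]
EMP = [
    ("E1", "J. Doe", "Elect. Eng."),
    ("E2", "M. Smith", "Syst. Anal."),
    ("E3", "A. Lee", "Mech. Eng."),
    ("E4", "J. Miller", "Programmer"),
    ("E5", "B. Casey", "Syst. Anal."),
    ("E6", "L. Chu", "Elect. Eng."),
    ("E7", "R. Davis", "Mech. Eng."),
    ("E8", "J. Jones", "Syst. Anal."),
]

# Primary horizontal fragments of the owner.
PAY1 = [p for p in PAY if p[1] <= 30000]
PAY2 = [p for p in PAY if p[1] > 30000]


def semijoin_on_title(emp, pay_fragment):
    """EMP semijoin PAY_i: EMP tuples whose TITLE occurs in the PAY fragment."""
    titles = {title for title, _ in pay_fragment}
    return [e for e in emp if e[2] in titles]


# Derived horizontal fragments of the member.
EMP1 = semijoin_on_title(EMP, PAY1)
EMP2 = semijoin_on_title(EMP, PAY2)

print([e[0] for e in EMP1])  # ['E3', 'E4', 'E7']
print([e[0] for e in EMP2])  # ['E1', 'E2', 'E5', 'E6', 'E8']
```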
Complication in DHF
» there can be multiple links into the target (i.e., member) relation
· i.e., there can be several ways to perform DHF
– Criteria for deciding which DHF to use
» the fragmentation used by more applications
· try to focus on the heavy users
» the fragmentation with better join characteristics
· joins can be performed on smaller relations
· joins can be performed in a distributed fashion, i.e., as distributed joins
Distributed join
» sub-joins between horizontally fragmented relations
– efficiency of a distributed join:
» affected by the nature of the join graph
· simple join graph between fragments
· complex join graph between fragments
• Simple join graph between fragments
– each fragment has exactly one link coming in or going out
» occurs when the link in the join graph is a one-to-many relationship
» sub-joins can proceed independently and in parallel
» allocating the fragments of the owner and member at the same site may be very effective
• Complex join graph between fragments
– some sub-graph is not a simple join graph
» occurs when the link in the join graph is a many-to-many relationship
» difficult to process efficiently in parallel
[Figure: two join graphs between the fragments of R (R1 to R4) and the fragments of S. Left: simple join graph between fragments. Right: complex join graph between fragments.]
Correctness of the DHF algorithm

– Completeness
» if DHF is based on foreign keys, the proof is simple
· thus, referential integrity must be preserved
» otherwise, difficult
– Reconstruction
» reconstruction can be performed by the union operator
– Disjointness
» guaranteed if the join graph (between fragments) is simple
· so fragmentations that yield simple join graphs between fragments should be produced
» otherwise, difficult
Fragmentation: VF
• Vertical fragmentation
– has been studied within the centralized context
» physical clustering for the most active sub-relations
– number of alternatives is very large
» for m non-primary-key attributes, the number of possible fragmentations is B(m), i.e., the m-th Bell number
» for large m, B(m) ≈ m^m
· e.g., B(10) ≈ 115,000, B(15) ≈ 10^9, B(30) ≈ 10^23
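The quoted magnitudes can be checked with a short computation. The sketch below uses the Bell triangle recurrence, which is one standard way to compute Bell numbers; the function name `bell` is hypothetical.

```python
def bell(m):
    """Return the m-th Bell number using the Bell triangle."""
    row = [1]  # row 0 of the triangle; its first entry is B(0)
    for _ in range(m):
        nxt = [row[-1]]  # each row starts with the last entry of the previous row
        for value in row:
            nxt.append(nxt[-1] + value)
        row = nxt
    return row[0]


for m in (10, 15, 30):
    print(m, bell(m))
```

Even at m = 10 there are more than a hundred thousand candidate fragmentations, which is why heuristic approaches are used instead of exhaustive search.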
Fragmentation: VF (cont’d)
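• Two types of heuristic approaches
– grouping
» from a set of attributes to fragments: start with each attribute as its own fragment and merge them into larger fragments
– splitting
» from an entire relation to fragments: start with the whole relation and split it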
Replication of the global relation’s key

– easier for completeness, i.e., lossless decomposition
» necessary for reconstruction of the global relation
– easier to enforce many functional dependencies, e.g., for integrity checking
» most dependencies involve key attributes
Information requirements
– attribute affinity
» a measure indicating how closely the attributes are related
» can be obtained from more primitive usage data
– attribute usage values: use(qi, Aj)
» given a set of queries Q = {q1, q2, …, qq} that will run on the relation R(A1, A2, …, An),
use(qi, Aj) = 1 if attribute Aj is referenced by query qi, 0 otherwise
• Attribute affinity measure: aff(Ai, Aj)
» measures the bond between two attributes of a relation according to how they are accessed by applications
– the attribute affinity measure between two attributes Ai and Aj of a relation R(A1, A2, …, An) with respect to the set of applications Q = {q1, q2, …, qq} is defined as follows:
aff(Ai, Aj) = Σ_{all queries qk that access both Ai and Aj} query_accesses(qk)
» query_accesses(qk) = Σ_{all sites} (frequency of query qk at the site) * (# of accesses to Ai and Aj together per execution of qk)
(Ex) Attribute affinity

– Consider the following 4 queries for relation PROJ:
q1: SELECT BUDGET FROM PROJ WHERE PNO = Value
q2: SELECT PNAME, BUDGET FROM PROJ
q3: SELECT PNAME FROM PROJ WHERE LOC = Value
q4: SELECT SUM(BUDGET) FROM PROJ WHERE LOC = Value
(Example cont’d)
– Let A1 = PNO, A2 = PNAME, A3 = BUDGET, A4 = LOC.
– attribute usage matrix
     A1  A2  A3  A4
q1    1   0   1   0
q2    0   1   1   0
q3    0   1   0   1
q4    0   0   1   1
(Example cont’d)
– assume
» each query accesses the attributes once during each execution.
» the following frequencies of queries at three sites
– matrix of query frequencies at three sites
     S1  S2  S3
q1   15  20  10
q2    5   0   0
q3   25  25  25
q4    3   0   0
(Example cont’d)
– total accesses per query (frequencies summed over sites S1, S2, S3): q1 = 45, q2 = 5, q3 = 75, q4 = 3
– Then, the attribute affinity (AA) matrix is
     A1  A2  A3  A4
A1   45   0  45   0
A2    0  80   5  75
A3   45   5  53   3
A4    0  75   3  78
» e.g., aff(A1, A3) = 15*1 + 20*1 + 10*1 = 45
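The affinity matrix can be recomputed from the usage and frequency matrices. The sketch below is an illustration that assumes, as the example does, one access per execution; `affinity_matrix` is a hypothetical name.

```python
# Attribute usage matrix: rows q1..q4, columns A1..A4.
USE = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]

# Query frequencies: rows q1..q4, columns S1..S3.
FREQ = [
    [15, 20, 10],
    [5, 0, 0],
    [25, 25, 25],
    [3, 0, 0],
]


def affinity_matrix(use, freq, accesses_per_execution=1):
    """aff(Ai, Aj): total accesses of the queries that use both Ai and Aj."""
    n = len(use[0])
    totals = [sum(sites) * accesses_per_execution for sites in freq]
    return [
        [
            sum(total for row, total in zip(use, totals) if row[i] and row[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


AA = affinity_matrix(USE, FREQ)
for row in AA:
    print(row)
```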

Fragmentation: VF-Clustering
• BEA algorithm: Clustering algorithm
» take the attribute affinity matrix AA, and
» reorganize the attribute order to form clusters
· where the attributes in each cluster demonstrate high affinity to one another
– The Bond Energy Algorithm (BEA) finds an ordering of attributes such that the global affinity measure
AM = Σ_i Σ_j (affinity of Ai and Aj with their neighbors)
is maximized.
Fragmentation: VF-Clustering (cont’d)
• Global affinity measure AM
AM = \sum_{i=1}^{n} \sum_{j=1}^{n} aff(A_i, A_j) [ aff(A_i, A_{j-1}) + aff(A_i, A_{j+1}) + aff(A_{i-1}, A_j) + aff(A_{i+1}, A_j) ]
» where aff(A_0, A_j) = aff(A_i, A_0) = aff(A_{n+1}, A_j) = aff(A_i, A_{n+1}) = 0
– Since the AA matrix is symmetric, we can simply define:
AM = \sum_{i=1}^{n} \sum_{j=1}^{n} aff(A_i, A_j) [ aff(A_i, A_{j-1}) + aff(A_i, A_{j+1}) ]
» maximizing AM groups large values with large ones, and small values with small ones
Bond Energy Algorithm (BEA)

» input: AA matrix
» output: clustered affinity matrix CA
1. Initialization
– place and fix one of the columns of AA in CA.
2. Iteration
» suppose i is the number of columns already placed in CA
– pick each of the remaining n-i columns in turn, and try placing it in each of the i+1 possible positions in the CA matrix.
– for each column, choose the placement that makes the largest contribution to the global affinity measure.
3. Row ordering
– order the rows according to the column ordering.
• Contribution of placing attribute Ak between Ai and Aj
cont(Ai, Ak, Aj) = 2bond(Ai, Ak) + 2bond(Ak, Aj) - 2bond(Ai, Aj)
where
bond(A_x, A_y) = \sum_{z=1}^{n} aff(A_z, A_x) aff(A_z, A_y)
– definition of AM
AM = \sum_{i=1}^{n} \sum_{j=1}^{n} aff(A_i, A_j) [ aff(A_i, A_{j-1}) + aff(A_i, A_{j+1}) ]
   = \sum_{j=1}^{n} [ \sum_{i=1}^{n} aff(A_i, A_j) aff(A_i, A_{j-1}) + \sum_{i=1}^{n} aff(A_i, A_j) aff(A_i, A_{j+1}) ]
– Bond between two attributes
» bond(A_x, A_y) = \sum_{z=1}^{n} aff(A_z, A_x) aff(A_z, A_y)
– then, AM = \sum_{j=1}^{n} [ bond(A_j, A_{j-1}) + bond(A_j, A_{j+1}) ]
– Consider the following n attributes, where Aj = Ai+1:
A1 A2 … Ai-1 Ai Aj Aj+1 … An
– then, the global affinity measure can be written as:
AM_{old} = \sum_{l=1}^{i-1} [ bond(A_l, A_{l-1}) + bond(A_l, A_{l+1}) ] + \sum_{l=i+2}^{n} [ bond(A_l, A_{l-1}) + bond(A_l, A_{l+1}) ] + 2 bond(A_i, A_j) + bond(A_i, A_{i-1}) + bond(A_{i+1}, A_{i+2})
» the last three terms collect the summands for l = i and l = i+1:
· l = i: bond(A_{i-1}, A_i) + bond(A_i, A_{i+1})
· l = i+1: bond(A_i, A_{i+1}) + bond(A_{i+1}, A_{i+2})
– Consider a new attribute Ak placed between Ai and Aj
A1 A2 … Ai-1 Ai Ak Aj Aj+1 … An
AMnew = AMold + 2bond(Ai, Ak) + 2bond(Ak, Aj) - 2bond(Ai, Aj)
– thus, the contribution of placing Ak between Ai and Aj is:
cont(Ai, Ak, Aj) = AMnew - AMold
= 2[bond(Ai, Ak) + bond(Ak, Aj) - bond(Ai, Aj)]
(Ex) Attribute clustering by BEA

– Consider the following AA matrix and the CA matrix
» where A1 and A2 have been placed.
– Now try to place A3:
AA =
     A1  A2  A3  A4
A1   45   0  45   0
A2    0  80   5  75
A3   45   5  53   3
A4    0  75   3  78
CA =
     A1  A2
A1   45   0
A2    0  80
A3   45   5
A4    0  75
(Example cont’d)
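Ordering (0-3-1):
cont(A0, A3, A1) = 2bond(A0, A3) + 2bond(A3, A1) - 2bond(A0, A1)
= 0 + 2(45*45 + 45*53) - 0 = 2*4410 = 8820
Ordering (1-3-2):
cont(A1, A3, A2) = 2bond(A1, A3) + 2bond(A3, A2) - 2bond(A1, A2)
= 2*4410 + 2(80*5 + 5*53 + 75*3) - 2(45*5)
= 2*4410 + 2*890 - 2*225 = 10150
Ordering (2-3-4):
cont(A2, A3, A4) = 2bond(A2, A3) + 2bond(A3, A4) - 2bond(A2, A4)
= 2*890 + 0 - 0 = 1780
» A4 has not been placed yet, so position 4 is the empty boundary column and its bonds are 0
» ordering (1-3-2) makes the largest contribution, so A3 is placed between A1 and A2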
(Example cont’d)
Therefore, the CA matrix has the form
     A1  A3  A2
A1   45  45   0
A2    0   5  80
A3   45  53   5
A4    0   3  75
(Example cont’d)
– Now, try to place A4

» by similar calculation A4 should be placed to the right of A2 .
» Thus, the CA matrix has the form
A1 A3 A2 A4
A1 45 45 0 0
A2 0 5 80 75
A3 45 53 5 3
A4 0 3 75 78
(Example cont’d)
– Row ordering
» the final form of the CA matrix (after row ordering) is
A1 A3 A2 A4
A1 45 45 0 0
A3 45 53 5 3
A2 0 5 80 75
A4 0 3 75 78
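The whole clustering example can be reproduced with a short sketch of BEA. It is an illustration under stated assumptions: the first two columns (A1, A2) are placed first, as in the example, the remaining columns are taken in index order, ties go to the leftmost position, and a missing neighbor (the boundary columns A0 and An+1) is represented by `None`. The function names are hypothetical.

```python
# Attribute affinity matrix from the example (rows/columns A1..A4).
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """bond(Ax, Ay) = sum over z of aff(Az, Ax) * aff(Az, Ay); 0 at a boundary."""
    if x is None or y is None:
        return 0
    return sum(row[x] * row[y] for row in aa)


def cont(aa, left, k, right):
    """Contribution of placing column k between columns left and right."""
    return 2 * bond(aa, left, k) + 2 * bond(aa, k, right) - 2 * bond(aa, left, right)


def bea_order(aa):
    """Return the column ordering chosen by the greedy BEA placement."""
    order = [0, 1]  # assumption: A1 and A2 are placed first
    for k in range(2, len(aa)):
        best_pos, best_gain = None, None
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            gain = cont(aa, left, k, right)
            if best_gain is None or gain > best_gain:
                best_pos, best_gain = pos, gain
        order.insert(best_pos, k)
    return order


order = bea_order(AA)
CA = [[AA[r][c] for c in order] for r in order]  # reorder columns, then rows

print(["A%d" % (i + 1) for i in order])
for row in CA:
    print(row)
```

The sketch recomputes every bond from scratch for clarity; a real implementation would cache the bond values.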
Fragmentation: VF-Partitioning
• Partitioning algorithm
– divide a set of clustered attributes {A1, A2, …, An} into two (or more) sets {A1, A2, …, Ai} and {Ai+1, …, An}
» such that these sets of attributes are accessed solely, or for the most part, by distinct applications
Fragmentation: VF-Partitioning (cont’d)
[Figure: the clustered affinity matrix CA, with attributes A1 … An along both axes, split at a point on the diagonal between Ai and Ai+1. TA (top attributes) = {A1, …, Ai}; BA (bottom attributes) = {Ai+1, …, An}.]
• Sets of applications
TQ = set of applications that access only TA
BQ = set of applications that access only BA
OQ = set of applications that access both TA and BA
• Cost for the applications
CTQ = total number of accesses to attributes by applications that access only TA
CBQ = total number of accesses to attributes by applications that access only BA
COQ = total number of accesses to attributes by applications that access both TA and BA
• Then, find the point along the diagonal that maximizes
CTQ * CBQ − COQ²
– defines two fragments
» such that CTQ and CBQ are as equal as possible
· balances the processing loads when the fragments are distributed
– the partitioning algorithm has complexity O(n)
» n: number of attributes
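The search for the split point can be sketched on the clustered ordering A1, A3, A2, A4 from the BEA example. This is an illustration that assumes the per-query access totals 45, 5, 75, and 3 from the affinity example and evaluates only the split points along the diagonal (no shifting); `split_score` is a hypothetical name.

```python
# Clustered attribute order and the attributes used by each query.
ORDER = ["A1", "A3", "A2", "A4"]
USES = {
    "q1": {"A1", "A3"},
    "q2": {"A2", "A3"},
    "q3": {"A2", "A4"},
    "q4": {"A3", "A4"},
}
# Total accesses per query (frequencies summed over all sites).
ACCESSES = {"q1": 45, "q2": 5, "q3": 75, "q4": 3}


def split_score(order, uses, accesses, point):
    """CTQ * CBQ - COQ^2 for TA = order[:point], BA = order[point:]."""
    top, bottom = set(order[:point]), set(order[point:])
    ctq = sum(accesses[q] for q, attrs in uses.items() if attrs <= top)
    cbq = sum(accesses[q] for q, attrs in uses.items() if attrs <= bottom)
    coq = sum(
        accesses[q] for q, attrs in uses.items() if attrs & top and attrs & bottom
    )
    return ctq * cbq - coq ** 2


scores = {p: split_score(ORDER, USES, ACCESSES, p) for p in range(1, len(ORDER))}
best = max(scores, key=scores.get)

print(scores)
print("TA =", ORDER[:best], "BA =", ORDER[best:])
```

Under these assumptions the best split separates (A1, A3) from (A2, A4), since only q2 and q4 span both sides.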
(Ex) Partitioning the CA matrix

– Consider the CA matrix in the previous example.
– after partitioning,
» F = {F1, F2}
· F1 = (A1, A3)
· F2 = (A2, A4)
Problems in the partitioning algorithm

– Design of m-way partitioning
» try 1, 2, …, m-1 split points along the diagonal, and check the best placement for each of these
· problem: the cost has complexity O(2^m)
» recursive application of binary partitioning
· apply the binary partitioning algorithm recursively to each fragment obtained in the previous iteration
· may be the better alternative
– Cluster forming in the middle of the CA matrix
» shift a row up and a column left, and apply the algorithm to find the “best” partitioning point
» do this for all possible shifts
» refer to Algorithm PARTITION on page 144 of the text: O(n²)
Correctness of the partitioning algorithm

» consider a relation R, defined over attribute set A and key K, that is vertically partitioned into FR = {R1, R2, …, Rr}.
– Completeness
» the following should be true for A:
A = ∪ A_Ri, where A_Ri is the attribute set of fragment Ri
– Reconstruction
» reconstruction can be achieved by the join on the key K:
R = ⋈_K Ri, ∀Ri ∈ FR
– Disjointness
» duplicated keys (or TIDs) are not considered to be overlapping
Fragmentation: HF
• Hybrid fragmentation
– VF may be followed by HF, or vice versa
» producing a tree-structured partitioning
[Figure: partitioning tree. R is split by HF into R1 and R2; R1 is split by VF into R11 and R12; R2 is split by VF into R21, R22, and R23.]
Fragment Allocation
• Allocation problem
» allocation of resources across the network has been much studied
· however, most of this work is about placing files rather than DDB design
– Given
F = {F1, F2, …, Fn} fragments
S = {S1, S2, …, Sm} network sites
Q = {q1, q2, …, qq} applications
– Find the “optimal” distribution of F to S.
Fragment Allocation (cont’d)
Definition of optimality
1. Minimal cost
» communication cost +
» storage cost +
» processing cost (read & update)
2. Performance
» response time and/or
» throughput
– the optimality measure should include both performance and cost factors
File allocation (FAP) vs Database allocation (DAP)

» FAP is NP-complete
» DAP is more complex
– fragments are not individual files
· relationships between fragments have to be considered
– access to databases is more complicated
· relationship between allocation and query processing
» a simple “remote file access model” may not be applicable
– cost of integrity enforcement should be considered
– cost of concurrency control should be considered
• Information requirements
– Database information
» selectivity of fragment Fj with respect to query qi
· # of tuples in Fj that need to be accessed for qi
» size of a fragment
– Application information
» number of read accesses of a query to a fragment
» number of update accesses of a query to a fragment
» a matrix indicating which queries update which fragments
» a similar matrix for retrievals
» originating site of each query
– Site information
» storage capacity
» processing capacity
» unit cost of storing data at a site
» unit cost of processing at a site
– Network information
» communication cost per frame between two sites
» frame size
• Allocation model
» minimize the total cost of processing and storage while trying to meet response time constraints
– min(Total cost)
» subject to
· response time constraint
· storage constraint
· processing constraint
– decision variable
xij = 1 if fragment Fi is stored at site Sj, 0 otherwise
Total cost
Σ_{all queries} (query processing cost) + Σ_{all sites} Σ_{all fragments} (storage cost of a fragment at a site)
– storage cost of fragment Fj at site Sk
» (unit storage cost at Sk) * (size of Fj) * xjk
– query processing cost
» processing component + transmission component
• Query processing cost for one application
1. Processing component
» access cost + integrity enforcement cost + concurrency control cost
– access cost
Σ_{all sites} Σ_{all fragments} (# of read accesses + # of update accesses) * xij * (local processing cost at a site)
» simplifying assumption: read cost = update cost
– integrity enforcement and concurrency control costs
» can be similarly calculated
2. Transmission component
» cost for updates + cost for retrievals
– cost for updates
Σ_{all sites} Σ_{all fragments} (cost of update message) * xij + Σ_{all sites} Σ_{all fragments} (cost of acknowledgment) * xij
– cost for retrievals
Σ_{all fragments} min_{all sites} (cost of retrieval request + cost of sending back the result) * xij
Constraints
– response time constraint
» (execution time of a query) ≤ (maximum allowable response time for that query)
– storage capacity constraint (for a site)
» Σ_{all fragments} (storage requirement of a fragment at that site) ≤ (storage capacity of that site)
– processing capacity constraint (for a site)
» Σ_{all queries} (processing load of a query at that site) ≤ (processing capacity of that site)
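A toy instance shows how the cost function and a constraint interact. The sketch below is not the full model: it considers only non-replicated placements, storage cost plus a flat local or remote access cost, and the storage capacity constraint, and it searches exhaustively rather than heuristically. Every number and name in it is hypothetical.

```python
from itertools import product

# Hypothetical problem data.
FRAGMENT_SIZE = {"F1": 40, "F2": 60, "F3": 30}
SITE_CAPACITY = {"S1": 80, "S2": 100}
UNIT_STORAGE_COST = {"S1": 1.0, "S2": 1.5}
# (originating site, fragment) -> number of accesses
ACCESSES = {
    ("S1", "F1"): 50,
    ("S2", "F1"): 5,
    ("S1", "F2"): 10,
    ("S2", "F2"): 40,
    ("S2", "F3"): 20,
}
LOCAL_COST, REMOTE_COST = 1.0, 5.0


def total_cost(placement):
    """Storage cost plus access cost for a fragment -> site placement."""
    storage = sum(
        UNIT_STORAGE_COST[site] * FRAGMENT_SIZE[f] for f, site in placement.items()
    )
    access = sum(
        count * (LOCAL_COST if placement[f] == origin else REMOTE_COST)
        for (origin, f), count in ACCESSES.items()
    )
    return storage + access


def feasible(placement):
    """Storage capacity constraint for every site."""
    used = {site: 0 for site in SITE_CAPACITY}
    for f, site in placement.items():
        used[site] += FRAGMENT_SIZE[f]
    return all(used[s] <= SITE_CAPACITY[s] for s in SITE_CAPACITY)


def best_placement():
    """Exhaustive search over all m^n non-replicated placements."""
    fragments = list(FRAGMENT_SIZE)
    candidates = (
        dict(zip(fragments, sites))
        for sites in product(SITE_CAPACITY, repeat=len(fragments))
    )
    return min((p for p in candidates if feasible(p)), key=total_cost)


best = best_placement()
print(best, total_cost(best))
```

The search space grows as m^n for n fragments and m sites, which is why realistic instances need heuristics.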
• Solution methods
– FAP is NP-complete, and DAP is also NP-complete
» so we have to look for heuristic methods
– heuristics commonly adopted for FAP and DAP
» knapsack problem solutions
» branch and bound techniques
» network flow problem solutions
Attempts to reduce the complexity of the problem

– find an optimal non-replicated solution in the first step
» i.e., ignore replication at first
– handle replication in the second step
» by applying a greedy algorithm to improve the initial feasible solution