
Hierarchical Computing: An Architecture for Efficient Transaction Processing

Juan Rubio and Lizy K. John
Laboratory for Computer Architecture
The University of Texas at Austin
Austin, TX 78712

{jrubio,ljohn}@ece.utexas.edu

Abstract

Transaction processing workloads impose heavy demands on the memory and storage sub-systems and often result in large amounts of traffic on I/O and memory buses. In this paper, we propose to utilize processing elements distributed across the memory hierarchy, with the objective of performing the computation close to where the data resides. Leveraging active memory modules and active disk devices emerging from other research groups and available in the market, we propose a hierarchical computing system in which the distributed processing elements operate concurrently and communicate using a hierarchical interconnect. Transactions are partitioned across the different layers in the hierarchy depending on the affinity of code to a particular layer or other heuristics. Commands percolate down into the lower layers of the hierarchy and preprocessed/partially processed information flows up into the higher layers. All layers ACTIVELY participate in the processing of the transaction by doing tasks for which they are particularly suited. The lower layers contain inexpensive processor units and, in conjunction with the powerful central processor and other collaborating memory and disk processors, yield high performance in a cost-effective fashion. This concept is then applied to the online transaction processing benchmark TPC-C, and schemes for code partitioning are outlined. A hierarchical computing system containing four inexpensive memory processors and 32 very inexpensive disk processors can yield speedups of up to 4.52x when compared with a traditional system. Since transaction processing has been seen to contain fine-grained and coarse-grained parallelism, the proposed hierarchical computing paradigm, which exploits parallelism and reduces data transport requirements, seems to be a feasible model for future database servers.

Keywords: Computer Architecture, Parallel Processing, Memory Hierarchy, High Performance Computing, Database Servers.

Technical Area: Architecture

Introduction

It is a well-known fact that the server market is the driving factor for several of the technological advancements in the computer industry. A few years back, this market was mainly dominated by technical workloads, but during the last two decades it has changed to power a large portion of commercial operations.
One important type of application in the group of commercial workloads is Transaction Processing (TP). Transaction processing workloads are classified into two types: Online Transaction Processing (OLTP) and Decision Support Systems (DSS). OLTP systems are used to handle those operations that occur during the normal operation of a business (e.g. a client buys products, the managers check the inventory or adjust the price of an item). On the other hand, DSS systems are used to make decisions based on the data gathered by a business, which usually comes from an OLTP system (e.g. find the most popular product within a given demographic bracket, estimate the net profit of all sales in the last three months). Even though both workloads fit within the category of transaction processing, they have many differences:

- OLTP operations are of short duration, taking milliseconds to complete, whereas DSS operations take minutes.

- The same contrast applies to the dataset of an operation. While OLTP operations usually have datasets in the order of kilobytes or megabytes, DSS operations usually access megabytes or hundreds of megabytes of data. Recent literature suggests that DSS systems will be accessing gigabytes in the next couple of years [1].

- The number of concurrent operations in an OLTP system is in the order of thousands, while DSS systems normally have less than a hundred concurrent operations.

- OLTP systems constantly modify the data stored in the databases (e.g. enter a sale, deliver a package). DSS systems, on the other hand, use mostly read operations during their execution.
Transaction processing systems are typically implemented using a multi-tier architecture. The idea is to implement a functional pipeline that streams transactions from the clients to the server database in an efficient way. As can be seen from Figure 1, clients on the left are connected to an intermediate server, or Middle-Tier, through a switched network. The function of the middle-tier server is to act as a filter and reject those requests presented by the clients that are incorrectly generated. It also enforces the security in the system and serves as a parser that transforms requests formulated in one language domain (e.g. HTML) to another domain (e.g. SQL).
In this example, the Middle-Tier server is implemented as a server cluster, with a front-side connection to the clients through a load-balancing switch.
[Figure 1: Conventional System Level Architecture for a Transaction Processing System. Clients connect through a front-side switch to the Middle-Tier server, which connects through a back-end switch to the Back-End Tier server.]
The function of this switch is to distribute the load across all the nodes of the cluster. Work done in this area includes locality-aware distribution algorithms like the one developed by Pai et al. [2]. The nodes in the cluster communicate with each other through a back-side network interface, which they also use to send the requests to the database server, also referred to as the Back-End Tier server.
The final component of the system is the Back-End Tier, which is also the focus of this paper. This server is the one that manipulates the primary data of the commercial operation (e.g. it keeps the list of clients, the orders they place, the prices of items and their quantities in the warehouses). As such, the back-end tier has complete control over a large portion of the data, which is normally local to it and accessed using a Relational Database Management System (RDBMS, or commonly DBMS). Implementations of this server include symmetric multiprocessor systems (SMP) as well as cluster servers.
When we look at the execution behavior of commercial workloads, we observe that they differ from technical workloads and place heavier demands on the memory and storage sub-systems [3, 4, 5]. In fact, studies that analyzed transaction processing workloads indicate that systems spend around 90% of the time waiting for the I/O devices to access the data [6]. Once the data are brought to memory, the processor spends between 25% and 45% of the execution time handling memory accesses [7]. This results in a sub-optimal utilization of the latency-hiding features of modern dynamically scheduled processors [8].
One of the reasons for this imbalance between computation and data access traces back to the principles of traditional memory hierarchies, where data moves from the storage sub-system to the processor before it can be processed. Although we have become accustomed to this execution model, which works well for technical and some other applications, it is far from optimal when used with a transaction processing workload. The action of moving data back and forth between the storage and the computing elements not only results in a high volume of traffic, which hurts the scalability of the system, but also creates an artificial bottleneck by serializing the execution in an environment with ample parallelism.
This paper presents the Hierarchical Computing model as a possible solution to the problems presented above. The next section introduces the idea, giving special attention to the operation of the hardware and the communication of the devices. Section 3 covers the programming model used in the system, how we plan to partition the problem, and the use of a set of basic primitives to operate on the data. Section 4 presents a basic code partitioning scheme and applies it to a conventional transaction processing workload, which helps us determine the feasibility of the model. Section 5 performs a mathematical analysis of the idea and identifies the parameters that affect the performance of this technique. Section 6 looks at other ideas proposed in the literature. Section 7 concludes with a highlight of the most significant contributions.

Hierarchical Computing

To address the problems presented in the previous section, most transaction processing systems exploit the coarse-grain parallelism present in the form of concurrent transactions. Before proceeding further it is important to define what a transaction is. A transaction is a sequence of operations which are executed atomically (atomic), which always maintain a consistent state in the database (consistent), whose execution is not affected by concurrent transactions (isolation) and whose effects are permanent (durable). This set of properties, commonly referred to as the ACID properties after the first letter of each property, is the basis of transaction processing theory. Current system implementations use a thin layer of software known as the Transaction Manager. Its function is to enforce an order in the arrival of the transactions and the ACID properties required by the transaction model. The queries that form a transaction are then passed to the database processes which run on each one of the processors in the system. Although this approach has been relatively successful, it can cause resource management problems, manifested in the form of hot spots during the access of the database tables, thus inhibiting the exploitation of all the parallelism in the system.
The Hierarchical Computing execution model exploits the parallelism available within a single transaction, in addition to the thread-level or coarse-level parallelism that can be exploited using other means. The idea is to distribute the computation across a computer system on behalf of a single transaction. In the model, communication follows a message-based approach over a hierarchical topology of interconnects. Since this model exploits a different type of parallelism than a traditional system (intra-transaction vs. inter-transaction), its use can be orthogonal to existing methods. While in a traditional system all the computations are required to be performed by the CPU, under this model there is no need to insist on that; in fact, several operations may be efficiently performed at the data residence. This is an important difference between the two models, visible in the way a server handles a query from a client, and it brings a set of interesting tradeoffs, which are presented and evaluated in the next sections.
To distribute computing in this way, some computing power is located in memory and some in disk, close to the location of the data, where it performs simple computations such as comparison and accumulation. The components are coupled using a hierarchical interconnect (any sort of directed acyclic graph, such as a binary tree).
[Figure 2: Sample topology for a Hierarchical Computing system. Processing elements (shaded) appear at five levels: (1) processor (ILP), (2) interconnect, (3) main memory (multibank), (4) interconnect, (5) disk (array).]


Figure 2 shows the topology of a sample system based on the above ideas, in which we perform computations at five different levels. The additional points of computation are represented by shaded areas and are located in the memory banks, storage devices and interconnects between the main levels. To exemplify the operation of the system, we can introduce the analogy of a corporate office, where the employees are organized in a well-defined hierarchy, each of them with an amount of data in their close vicinity and over which they have complete ownership. As a team, they handle each of the transactions received by the office in a very distributed fashion. Even though at a particular point in time only some members work on the same task, each one operates on the transaction at some point. Managers at different levels do not need to know every detail of the operations of their subordinates. Each level in the hierarchy conveys only the right amount of information to the layer above.
A system with, say, 1 main memory module and 4 disk modules needs only 3 levels of computation (1 in the main CPU, 1 in memory and 1 in the disks), whereas processors in the interconnect will be critical to larger hierarchical computing systems. The number of levels of interconnect processors depends on the number of disks reporting to the same memory, as well as on the interconnect network; for a binary tree the number of levels of interconnects is given by:

    Levels_interconnect = Max(log_2(Number of disks) - log_2(Number of memory units) - 1, 0)    (1)
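Equation 1 is reconstructed here from partially garbled fragments, so its exact form should be read as an assumption rather than as a definitive formula. A minimal Python sketch of the reconstruction:

    import math

    def interconnect_levels(num_disks: int, num_memory_units: int) -> int:
        # Levels of interconnect processors in a binary-tree hierarchy,
        # per the reconstruction of Equation 1 above (an assumption).
        return max(int(math.log2(num_disks)) - int(math.log2(num_memory_units)) - 1, 0)

    # For the N = (1, 4, 32) configuration evaluated later: 4 memories, 32 disks.
    print(interconnect_levels(32, 4))   # -> 2 levels of interconnect processors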
The intelligence in the different levels can be realized using intelligent memory modules investigated in recent research [9, 10, 11, 12] and intelligent or active disks [13, 14, 15, 16, 17]. It is possible to find storage devices in the market with a 150 MIPS core and up to 2 MB of main memory [18, 19, 20, 21]. Ongoing efforts in intelligent or active memories and disk components can thus be leveraged to implement hierarchical computing systems. If computing capability is required in the network or switches, it can be realized using chips similar to micro-controllers embedded in the switch/bus interface. It may be noted that the computing resources required in the storage devices or memory are significantly cheaper than the central processor. The use of a powerful processor which exploits instruction-level parallelism and thread-level parallelism is favored at the root of the hierarchy.
The hierarchical mode of operation has worked relatively well in our society, and we expect it to work as well in a computer system due to the following characteristics:

- Raw data stay local to a level of the hierarchy, which gives more freedom to the upper levels to operate and hold temporary results of the operations in their fast but reduced storage space.

- It suggests a specialization of the units, which permits a system to use dynamically scheduled processors in the upper levels of the hierarchy, where complex decisions are required and control flow is hard to predict. In-order processors, or narrower power-efficient processors, can be used to handle the bulk of the data in the lower levels.

The implications of the first point affect the mode of operation of the hierarchy and its programming model and are covered in the next section. The second point directly affects the hardware used in the system and will be studied in subsequent sections.

Programming Model

As mentioned in the previous section, the basic element behind the Hierarchical Computing model is the use of computation engines that sit close to the location of the data. The idea is to expedite the movement of data from its natural point of residence to the computation unit. This computation unit can be the main processor, for tasks that require the high computation power provided by a high-frequency, dynamically scheduled processor, or it could as well be a much simpler microprocessor or microcontroller that functions as an intelligent memory or disk controller. The partitioning of the data and operations to take advantage of this new architecture is covered in Section 4. This section deals with the programming model used to support a transaction processing workload.
The heart of a transaction processing system is the database server. Nowadays, databases are based on the relational model [22]. In this model data is stored in the form of tables with a variable number of rows of a predetermined width. To access these tables, these systems use data manipulation languages (DML), of which the structured query language (SQL) is the one most frequently used. The operations supported by the SQL language include:

- Scan: locates rows that match a particular criterion. The criterion can be a single predicate (e.g. list all flights to Pittsburgh) or a complex predicate (e.g. find all vehicles in the state of Texas registered after 1973). It is called select in some data manipulation languages. This operates on a single table.

- Join: similar to the scan operation, but operates on two or more tables (e.g. find all vehicles registered before 1966 and currently owned by individuals born after 1966). The two pieces of information are contained in separate tables.

- Insert: creates a new row in an existing table.

- Remove: homologous to insert, it removes a row from a table.

- Update: modifies one or more fields within one or more rows.

- Sort: is not an independent type of operation, but can be applied to a scan operation (a sketch of these semantics appears below).
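The following minimal Python sketch illustrates the semantics of these operations over in-memory tables represented as lists of dictionaries; the table contents and field names are illustrative only.

    def scan(table, predicate):
        # Scan: rows of a single table that match a criterion.
        return [row for row in table if predicate(row)]

    def join(left, right, on):
        # Join: pairs of rows from two tables satisfying a predicate
        # (naive nested loops, for illustration).
        return [(l, r) for l in left for r in right if on(l, r)]

    def insert(table, row):
        # Insert: create a new row in an existing table.
        table.append(row)

    def remove(table, predicate):
        # Remove: delete the rows that match.
        table[:] = [row for row in table if not predicate(row)]

    def update(table, predicate, changes):
        # Update: modify fields within the matching rows.
        for row in table:
            if predicate(row):
                row.update(changes)

    def sorted_scan(table, predicate, key):
        # Sort: applied on top of a scan, not an independent operation.
        return sorted(scan(table, predicate), key=key)

    flights = [{"dest": "Pittsburgh", "no": 101}, {"dest": "Austin", "no": 205}]
    print(scan(flights, lambda r: r["dest"] == "Pittsburgh"))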
In order to implement the functionality required by the relational model and the SQL language and to take advantage of the concurrency provided by the Hierarchical Computing model, we have opted for a message-based dataflow model. This concept is presented in Figure 3, which shows two levels of a sample hierarchy (although the model is flexible enough to support several levels).
We assume that data is already partitioned and that it resides in the lower level (L{1,2}). The top level (T{1}) is the level that initiates the operation by issuing a command (CMD) to one or more modules in the lower levels (L{1} and/or L{2}). The action of sending the command can be a broadcast or multicast (e.g. select all rows which match a criterion), or it can also be a unicast (e.g. insert row in table). The commands encompass enough information to allow the lower levels to perform the computations on behalf of the upper level. Once the upper level sends the command to the lower level, it waits for data, which depending on the operation can be a null response, a single element or a sequence of elements. The hardware provides basic flow-control signals to help the upper level handle the amount of data that might result from an operation.
[Figure 3: Execution of an operation in the Hierarchical Computing model. The top level T{1} issues a CMD to the lower level L{1,2}, whose active components reply with a stream of DATUM messages.]


From the perspective of the lower level, once it receives a command from the upper level, it performs a preorder traversal starting at its own level. Thus, the node proceeds to access the data over which it has control. If the data is not present in its level, it forwards the command to the level immediately under it and relays all responses to the upper level. If there are no levels under it, a blank response is sent to the upper level.
In order to support this mechanism, both commands and responses need to be tagged with a unique identifier. This technique is similar to the tokens present in traditional dataflow machines [23]. We also tag the commands with the ID of the level that initiated them, which reduces the overhead of processing the responses. Finally, the model also implements a name-space locator in the form of a table allocation index. This index permits the processor in a level to locate data within its boundaries. It is also used to determine whether data is absent, thus avoiding a lengthy traversal of all the data.
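A minimal Python sketch of this mechanism is shown below. The Node class, the command layout (tag, originating level, table name, predicate) and the sample data are hypothetical, introduced only to make the traversal concrete.

    import itertools

    _tags = itertools.count()   # unique identifier for every command

    class Node:
        def __init__(self, tables=None, children=()):
            self.tables = tables or {}   # table allocation index: name -> rows
            self.children = children     # nodes in the level immediately below

        def handle_command(self, cmd):
            # Preorder traversal: look at local data first, then go down.
            tag, origin, table, predicate = cmd
            if table in self.tables:     # data under this node's control
                return [(tag, row) for row in self.tables[table] if predicate(row)]
            if not self.children:        # leaf without the data: blank response
                return []
            responses = []               # forward the CMD, relay responses upward
            for child in self.children:
                responses.extend(child.handle_command(cmd))
            return responses

    disks = [Node(tables={"stock": [{"item": 1, "qty": 3}, {"item": 2, "qty": 9}]}),
             Node()]                     # second disk holds no relevant table
    memory = Node(children=disks)
    cmd = (next(_tags), "T1", "stock", lambda r: r["qty"] < 5)
    print(memory.handle_command(cmd))    # -> [(0, {'item': 1, 'qty': 3})]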
After studying the different SQL operations and the algorithms used in transaction processing workloads, we designed two types of operations: individual and aggregate. They are based on the execution presented in Figure 3, but differ in the way the lower level generates the results and how the upper level interprets them.

- Type 1: Individual
This primitive is named individual because the lower level informs the upper level of every single result. It effectively acts as an unbuffered filter. The semantics can be designed to allow the operation to return on the first event triggered or to continue operating until it reaches the end of the region.
Example: Search a range of data for a string.

[Figure 4: Primitive Type 1 (Individual). The CMD is propagated down to the active components; every matching element (5 and 7 in the example) is returned individually to the issuing level.]

- Type 2: Aggregate
For this primitive, the lower level accesses its associated data and finds those elements that match a particular criterion. However, it does not send all these results to the upper level. Instead it produces an aggregate number and sends it once all its data has been analyzed. In this context, an aggregate function Aggregate() is any function that produces a single number based on a set of numbers ({I_0, ..., I_N}). The most common aggregate functions in transaction processing workloads are sum(), count(), average(), max(), and min(). In this case, the results returned by the different nodes in the lower level might need to be combined in the upper level to produce a unique answer. We can accomplish that by applying the same aggregate function in the upper level to the values returned. This works for all the examples shown above except for the average() function. For those circumstances the algorithm is changed to return both sum() and count() from the nodes in the lower levels, and the upper level performs the computation of average(), as the sketch below illustrates.
Example: Count orders placed in January 2001.
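A minimal Python sketch of this recombination, with illustrative data: each lower-level node reduces its local rows to a partial result, and the upper level combines the partials (shipping (sum, count) pairs so that average() can be finished at the top).

    def node_aggregate(rows, predicate):
        # Lower level: filter local rows, return (sum, count) over the matches.
        amounts = [r["amount"] for r in rows if predicate(r)]
        return (sum(amounts), len(amounts))

    def combine_average(partials):
        # Upper level: apply sum() to the partials, then finish the average.
        total = sum(s for s, _ in partials)
        count = sum(c for _, c in partials)
        return total / count if count else 0.0

    disk0 = [{"month": "2001-01", "amount": 10}, {"month": "2001-02", "amount": 99}]
    disk1 = [{"month": "2001-01", "amount": 30}]
    jan_2001 = lambda r: r["month"] == "2001-01"
    partials = [node_aggregate(d, jan_2001) for d in (disk0, disk1)]
    print(sum(c for _, c in partials))   # count of January 2001 orders -> 2
    print(combine_average(partials))     # average order amount -> 20.0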

Code partitioning

When we introduced the programming model in Section 3, we assumed that the data was already laid out correctly on disk. The distribution of code among the different levels of the hierarchy impacts performance. In this section we address the issue of how to partition a transaction in order to achieve good performance with the Hierarchical Computing model.

[Figure 5: Primitive Type 2 (Aggregate). The CMD is propagated as before, but each node returns a single aggregate value (aggr<3>, aggr<5,7> in the example) computed over its local matches rather than the individual elements.]


To study the data and ode partitioning, we begin by looking at the TPC-C ben hmark [24, a popular
transa tion pro essing workload. The TPC-C ben hmark is developed by the Transa tion Pro essing
Coun il (TPC), and is intended to serve as a standard ben hmark for Online Transa tion Pro essing
systems. The ben hmark models the operation of a business with ve di erent types of transa tions (new
order, payment, order status, delivery and sto k level).
Table 1 shows the hara teristi s of the database tables used by the ben hmark. The parameter W
represents the number of warehouses in the ben hmark, and is used to s ale it to di erent hardware
on gurations. The ardinality olumn indi ates the number of rows in a database table. The next olumn
shows the size of a row for our implementation using IBM DB2 Universal Database [25. The last olumn
shows the size of the tables for a on guration with 17,500 warehouses, whi h is lose to the highest
non- lustered re ord to TPC-C by the time we ondu ted this study.
Specified as part of the TPC-C benchmark is a high-level description of each of the five transactions. For each of these transactions, we show the query execution plans (Figures 6 to 10). Execution plans are directed graphs that represent the tables in the database, the flow of data and the operations performed over the data. The tables are represented by circles, operations by rectangles and the flow of data by the arcs.
Note that there is no reference to time in this representation, as the computation is driven by the arrival of data. Also, to read the plans correctly, they should be traversed in postorder (i.e. visit children before entering the node). So if we look at, say, Figure 8, we observe that we cannot perform the scan over table order-line before we perform the scan over table order, which itself needs the scan over table customer together with a sort operation. An additional clarification is needed for the transaction shown in Figure 6, where there is a dotted box around some operations. It indicates that a section of the plan is repeated several times, in this particular case once for every item in the order.

[Figure 6: Execution Plan for the TPC-C New-Order Transaction. Operations: Update over Scan(Warehouse), Scan(District), Scan(Customer); Insert(Order), Insert(Order-Line); a dotted box encloses the repeated Update over Scan(Item), Scan(Stock).]

[Figure 7: Execution Plan for the TPC-C Payment Transaction. Operations: Updates over Scan(Warehouse), Scan(District), Scan(Customer); Insert(History).]

[Figure 8: Execution Plan for the TPC-C Order-Status Transaction. Operations: Scan(Customer) with Sort, then Scan(Order), then Scan(Order-Line).]

[Figure 9: Execution Plan for the TPC-C Delivery Transaction. Operations: Scan(New-Order) with Sort and Remove; Scan(Order) with Update; Scan(Order-Line) with Update and Aggregate; Update over Scan(Customer).]

[Figure 10: Execution Plan for the TPC-C Stock-Level Transaction. Operations: Scan(District), Scan(Order-Line), Join with Scan(Stock), Count.]

Table        Cardinality   Row Size (bytes)   Table Size (GB) (W=17.5k)
Warehouse    W             101                <0.1
District     W x 10        107                <0.1
Customer     W x 30k       701                342.8
Stock        W x 100k      330                537.9
Item         100k          90                 <0.1
Order        W x 30k+      40                 19.6
New-Order    W x 9k+       10                 1.5
Order-Line   W x 300k+     80                 391.2
History      W x 30k+      68                 33.3
Total                                         1,270.4

Table 1: Dimensions of tables for the TPC-C benchmark.
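The per-table sizes in Table 1 follow directly from the cardinality and row-size columns, as the short check below reproduces (to within rounding; for the "+" entries the initial cardinalities are used, and sizes are in binary gigabytes):

    W = 17_500                      # warehouses, as in the last column of Table 1
    tables = {                      # name: (rows, row size in bytes)
        "Warehouse":  (W,           101),
        "District":   (W * 10,      107),
        "Customer":   (W * 30_000,  701),
        "Stock":      (W * 100_000, 330),
        "Item":       (100_000,      90),
        "Order":      (W * 30_000,   40),
        "New-Order":  (W * 9_000,    10),
        "Order-Line": (W * 300_000,  80),
        "History":    (W * 30_000,   68),
    }
    for name, (rows, row_size) in tables.items():
        print(f"{name:10s} {rows * row_size / 2**30:8.1f} GB")   # e.g. Customer -> 342.8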
In a cold system all the data is stored on disk, but as the execution progresses, we can expect tables, a subset of the rows in a table, or indices to move to upper levels in the hierarchy. We call this process data promotion. Unlike what happens in a traditional system with caches, this operation is not transparent to the software, and is controlled by the mapping algorithms.
Based on the query execution plans, we use a static cost analysis similar to the one used in some of the first query optimizers [26], where the cost (in instructions) of accessing a table is computed as a function of the dimensions of the table and the memory capacity. Given a table Table_j of size Size(Table_j) and with Rows(Table_j) rows, in a hierarchy with M levels, the process used to partition the code is shown in Figure 11. For join operations we use a second table Table_k of size Size(Table_k) and Rows(Table_k) rows. This is a simple model, and by no means should it be considered an optimal partition. Additional information used by the algorithm might include the notion of processing affinity [27], where a module of computation is sent to the component which will execute it in the minimum amount of time.
Evaluation

To evaluate the potential of the hierarchical computing model, we estimate the time needed to perform a task in this model and compare it with the time needed in a conventional system. Equation 2 shows the basic expression for the time necessary to complete a task, where performance gains can be obtained by improving any of the factors of the equation. We have chosen to express the time required to execute an instruction (TPI) as the product of the clock period and the CPI (cycles-per-instruction). The first is commonly associated with the hardware implementation details, while the second is a factor of the processor architecture and the workload being executed.

Initial table allocation:
  for each Level_i
    for each Table_j
      if (i = M)
        TableInLevel(Table_j, Level_i) <- True
      else if (Size(Table_j) < Threshold(Level_i))
        TableInLevel(Table_j, Level_i) <- True
      else
        TableInLevel(Table_j, Level_i) <- False

Cost estimation: traverse the execution plan
  for each operation
    obtain type(operation)
    set the level where the computation is performed:
      Levels <- { Level_i | TableInLevel(Table(operation), Level_i) = True }
      HighestLevel <- Min(Levels)
      if (type(operation) in {SCAN, INSERT, REMOVE, UPDATE})
        ExecOpInLevel(operation) <- HighestLevel
      if (type(operation) = JOIN)
        ExecOpInLevel(operation) <- level of the table with Min(Rows(Table_j), Rows(Table_k))
        Assist(operation) <- Max(HighestLevel - 1, 1)
    remember the tables used by the level:
      LevelUsesTable(ExecOpInLevel(operation), TablesUsedBy(operation)) <- True
    compute cost(operation):
      if (type(operation) = SCAN)      cost(operation) <- Rows(Table_j)
      if (type(operation) = INSERT)    cost(operation) <- 1
      if (type(operation) = REMOVE)    cost(operation) <- 1
      if (type(operation) = UPDATE)
        if (dependencies(operation) > 0) cost(operation) <- Rows(Table_j)
        else                             cost(operation) <- 0
      if (type(operation) = AGGREGATE) cost(operation) <- 0
      if (type(operation) = SORT)      cost(operation) <- Rows(Table_j)
      if (type(operation) = JOIN)      cost(operation) <- Max(Rows(Table_j), Rows(Table_k))
      add cost(operation) to CostOfLevel(ExecOpInLevel(operation))

Remove unused copies of tables:
  for each Level_i with i < M
    for each Table_j
      if (LevelUsesTable(Level_i, Table_j) = False)
        TableInLevel(Table_j, Level_i) <- False

Figure 11: Simple scheme for partitioning transaction processing workloads.
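For concreteness, a small Python rendering of the Figure 11 scheme follows. It is a sketch of the reconstruction above: the function names and the threshold values are illustrative, not taken from the paper.

    # Levels are numbered 1..M from the main processor down; level M (disk)
    # always receives a copy of every table.
    def allocate_tables(tables, thresholds):
        # tables: {name: (rows, size_bytes)}; thresholds: capacity per level 1..M-1.
        M = len(thresholds) + 1
        in_level = {}
        for i in range(1, M + 1):
            for name, (rows, size) in tables.items():
                in_level[(i, name)] = (i == M) or (size < thresholds[i - 1])
        return in_level, M

    def cost(op_type, rows_j, rows_k=0, dependencies=0):
        # Instruction-count cost per operation type, as in Figure 11.
        return {"SCAN": rows_j, "INSERT": 1, "REMOVE": 1,
                "UPDATE": rows_j if dependencies > 0 else 0,
                "AGGREGATE": 0, "SORT": rows_j,
                "JOIN": max(rows_j, rows_k)}[op_type]

    tables = {"Item": (100_000, 9_000_000), "Stock": (1_750_000_000, 577_500_000_000)}
    in_level, M = allocate_tables(tables, thresholds=[64 * 2**20, 4 * 2**30])
    print(in_level[(1, "Item")], in_level[(1, "Stock")])  # True False: only Item fits in level 1
    print(cost("SCAN", rows_j=100_000))                   # scanning Item costs 100,000 instructions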

    t = (Instructions / N_processors) × TPI    (2)

    TPI = T_clk × CPI    (3)

This execution model performs computations in the different levels of the storage hierarchy, in what can be considered heterogeneous computing elements; thus Equation 4 expresses the time spent computing by a single level. The index i goes from 1 to M, where M is the total number of levels in the hierarchy.^1 The parameter N_i indicates the number of processing elements in the particular level i.

    t_i = (Instructions_i / N_i) × TPI_i    (4)

Another interesting characteristic of this model is that it allows operations to be performed concurrently in the different levels. However, there might be algorithms that do not allow that level of parallelism because they have serial components in their execution. Taking that into account, Equations 5 and 6 show the range of values for the total time taken by a task. When we can exploit full parallelism, the resulting time will be the maximum of the individual times. Situations where all the operations need to be serialized result in a time equal to the sum of all the times. Hence the range of execution time can be calculated as follows:

    t_max = sum_{i=1}^{M} t_i    (no overlapping)     (5)

    t_min = max_{i=1}^{M} t_i    (full overlapping)   (6)

To enable this model, computation is partitioned so that operations are assigned to the computation elements in each one of the levels. Considering the nature of transaction processing workloads, the partitioning of operations can be done statically or using query optimizers like the ones built into most commercial databases [26]. With these systems, we express the distribution of computation in Equation 7. The coefficients φ_i indicate the fraction of the total number of instructions executed in level i of the hierarchy.

    Instructions = sum_{i=1}^{M} Instructions_i    (7)

    φ_i = Instructions_i / Instructions    (8)

^1 Our convention is to number the hierarchy from the top to the bottom, so the level with the main processor is assigned the value i = 1.

The number of instructions a processor is capable of executing is a function of a myriad of factors in the system. Among them, the type of workload plays an important role. Given that we are evaluating this model in the light of transaction processing workloads, we decompose the CPI into a computation component and a storage component:

    CPI = CPI_computation + CPI_storage    (9)

A characteristic of this model is the use of relatively simple computation engines in the lower levels of the hierarchy. This affects the balance of the operations in the system, according to Equation 10:

    CPI_computation,i < CPI_computation,j    for i < j    (10)

Assuming a hierarchy of three levels like the one shown in Figure 12, it is possible to expand the above relations, showing the factors into which the time decomposes. We have assigned the first level to operations computed by the main processor, the second level to the main memory and the third level to those operations performed by the disk controller.
[Figure 12: Topology of a Simple Hierarchical Computing system with M = 3. P.main sits at the Main level (i = 1), N2 P.memory processors at the Memory level (i = 2), and N3 P.disk processors at the Disk level (i = 3).]


    t = (Instructions_1 / N_1) × TPI_1 + (Instructions_2 / N_2) × TPI_2 + (Instructions_3 / N_3) × TPI_3    (11)
              (main)                           (memory)                          (disk)

Since typically the disk and memory processors will not be as sophisticated as the central processor, we use a degradation factor to indicate how the memory and disk processors compare to the central processor. The degradation factor for TPI is defined as δ_i = TPI_i / TPI_1, which we use to obtain the speedup for the non-overlapping mode using Equation 5:

    Speedup = 1 / (φ_1/N_1 + φ_2·δ_2/N_2 + φ_3·δ_3/N_3)    (12)

Similarly, the speedup when the algorithm shows full overlapping is expressed as:

    Speedup = 1 / Max(φ_1/N_1, φ_2·δ_2/N_2, φ_3·δ_3/N_3)    (13)

[Figure 13: Speedup surface for Φ = (17%, 25%, 58%) and N = (1, 4, 32) as the degradation factors δ2 (2.0 to 9.6) and δ3 (3.0 to 14.5) vary; the speedup ranges from under 0.25 up to 3.00.]


Let us use the tuple N to represent the number of computation elements in each level of the hierarchy. Code will be partitioned between the different layers based on affinity or other heuristics. We use the tuple Φ to represent the fractions of the instructions executed in each level of the hierarchy (φ_i). To analyze the effect of the selection of devices, we generate the speedup surface for variations of δ2 and δ3 (Figure 13). As expected, we achieve the maximum speedup when we use the smallest values of δ2 and δ3 (i.e. P.memory and P.disk are not significantly slower than P.main). There are, however, several points with equal speedup for multiple pairs of degradation factors, which gives designers a great degree of freedom when selecting components.
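A small Python sketch of Equations 12 and 13; the degradation factors used here are illustrative endpoints of the ranges assumed in the sensitivity analysis below (2 < δ2 < 10, 4 < δ3 < 15):

    def speedup_serial(phi, delta, N):
        # Equation 12: no overlap between levels (per-level times add up).
        return 1.0 / sum(p * d / n for p, d, n in zip(phi, delta, N))

    def speedup_overlap(phi, delta, N):
        # Equation 13: full overlap between levels (slowest level dominates).
        return 1.0 / max(p * d / n for p, d, n in zip(phi, delta, N))

    phi  = (0.00, 0.25, 0.75)   # code partition of the best row of Table 2
    N    = (1, 4, 32)           # one CPU, 4 memory processors, 32 disk processors
    fast = (1, 2, 4)            # optimistic degradation factors (delta_1 = 1)
    slow = (1, 10, 15)          # pessimistic degradation factors
    print(round(speedup_serial(phi, fast, N), 2))   # ~4.57, close to the 4.52x in Table 2
    print(round(speedup_serial(phi, slow, N), 2))   # ~1.02, close to the 1.01 minimum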



N            Φ                  Speedup Min   Speedup Max
(1, 4, 32)   (10%,  5%, 85%)    1.60          4.32
(1, 4, 32)   (35%, 10%, 55%)    1.17          2.13
(1, 4, 32)   (15%, 25%, 60%)    0.95          2.86
(1, 4, 32)   ( 0%, 25%, 75%)    1.01          4.52
(1, 4, 32)   (10%, 20%, 70%)    1.08          3.48
(1, 4, 16)   (10%, 20%, 70%)    0.80          2.67
(1, 4, 32)   (10%, 20%, 70%)    1.08          3.48
(1, 4, 64)   (10%, 20%, 70%)    1.31          4.10
(1, 2, 32)   (10%, 20%, 70%)    0.70          2.58
(1, 2, 64)   (10%, 20%, 70%)    0.79          2.91

Table 2: Speedups of a hierarchical computing system with conventional memory and disk components.
The algorithm presented in Figure 11 is used to obtain a preliminary estimate of the code partitioning. For the situation where Threshold(Level_i) = Capacity(Level_i), Table 3 shows the fractions of the computation that should be performed in each level for a configuration of N = (1, 4, 16).

Transaction type   φ_1      φ_2      φ_3
New Order          10.0 %   0.8 %    89.2 %
Payment            36.3 %   9.1 %    54.6 %
Order Status       16.7 %   25.0 %   58.3 %
Delivery           0.0 %    25.0 %   75.0 %
Stock Level        6.3 %    18.8 %   74.9 %

Table 3: Partition of code for the TPC-C benchmark.


Now we proceed to do a sensitivity analysis assuming a hierarchical computing system built with available technologies. We assume a main processor running at 1 GHz, memory processors running between 100 and 500 MHz (i.e. 2 < δ2 < 10), and disk processors running between 66 and 250 MHz (i.e. 4 < δ3 < 15). Considering the database sizes of some implementations of transaction processing systems, we selected a configuration with four memory processors and 32 disks (expressed by N = (1, 4, 32)). The first section of Table 2 shows the maximum and minimum speedups obtained for different partitions of the code (i.e. varying values of Φ). Depending on the distribution of work, we can observe speedups of up to 4.52x when comparing with a traditional uniprocessor system. If the memory and disk processors are very slow, potential slowdowns may be observed for certain partitions of the code. For instance, the distribution (15%, 25%, 60%) encounters a slowdown when δ2 = 10 and δ3 = 15. However, the same configuration is capable of a speedup of 2.86x given more powerful nodes.
Code partitioning is dependent on the nature of the transaction, and as such is relatively hard to change without restructuring the algorithm. It is therefore important to analyze the effect that changing the hardware configuration has on the speedup for a given code partition. This is shown in the lower half of Table 2. Here we fix the partition at (10%, 20%, 70%) and change the number of memory and disk modules. Given the large amount of code assigned to the disk modules, we observe that incrementing the number of disks results in an increase in the speedup. As in any other multiprocessor system, the increase is not linear in the number of processing elements added. It may also be noted that active memory systems and active disk systems are subsets of the proposed general hierarchical computing system.

Related Work

During the 1970s, computer scientists looked at database applications and proposed specially designed machines to handle the increasing performance gap between primary and secondary storage as well as the overwhelming software complexity present in database applications [28, 29]. Known as Database Machines, these systems incorporated specialized components in the form of per-disk, per-track and per-head processors and associative memories, in order to facilitate the access of data. The problem with these systems was that the use of non-commodity hardware drastically increased their cost. Additionally, they were designed to handle only database workloads, which resulted in a declining interest by the rest of the architecture and software communities.
Database machines saw their last days with the development of parallel databases [30], which proved to be a cost-effective solution for the problems of the day. Since then, modern commercial databases have adopted several of the proposed algorithms: parallel sort [31], parallel join [32] and other algorithms that trade off memory utilization for I/O bandwidth [33].
In addition to database machines, several research projects have looked at the idea of having computation elements close to the data. Intelligent memories have mostly been targeted at regular numeric applications [9, 10], but recent attempts also look at their use in non-regular applications [11, 34, 12, 35]. Likewise, the idea of the intelligent disk has been covered by different research groups. The workloads considered for this technology consist of Decision Support Systems [13, 14, 15, 17], data-mining and multimedia applications [16].
Related to our research, the X-Tree architecture [36, 37] looks at a multiprocessor organization where processors are connected using a binary tree. This topology facilitates the design of high-bandwidth systems, as the average distance between the nodes increases only logarithmically with the number of nodes in the system. However, the emphasis in the X-Tree system was to build VLSI chips based on the idea of recursive architectures [38], where it was possible to design a computer system by constructing a hierarchy with the same type of processors. Another example of a recursive system is the Data Driven Machine (DDM1), designed by Davis et al. [39]. DDM1 was able to exploit concurrency due to its implementation of Data Driven Nets (DDN), which constitute a form of dataflow similar to the one used by our model.
While the computing paradigm presented in this paper has similarities to the aforementioned research efforts, it must be noted that the merits of several past architectural paradigms are being synergistically combined and applied to transaction processing in our current research effort. We are leveraging advancements in active memories and active disks while at the same time taking advantage of the advancements made during the last twenty years in parallel databases and query optimizers.

Conclusions

In this paper, we have presented a hierarchical model of computing, which is based on the concept of performing computations in a hierarchical manner distributed over the memory hierarchy. This research is intended to alleviate the imbalance of computation and data accessing experienced by large-scale transaction processing systems. The building blocks that help to realize the proposed hierarchical model are the active memory units and active disk devices that are emerging in the market. The basic principle is to use computation engines that sit close to the location of the data and to use a hierarchy to connect these computation engines. This paradigm also brings in benefits of the dataflow model of computation.
We described the computation paradigm and outlined a simple code partitioning scheme. Then we applied the code partitioning scheme to the TPC-C benchmark, and observed that it is possible to split the transactions using static information about the database tables and the storage capacity of the computation nodes. Using the simple partitioning algorithm and assuming inexpensive memory and disk processors, we show that a hierarchical computing system with four memory processors and 32 disk processors can obtain speedups of up to 4.52x when compared with traditional uniprocessor systems. While performance degradations will occur in non-parallelizable code, or in systems with very slow memory or disk processors, it is seen that judicious partitioning of code can yield performance improvements. Since transaction processing has been shown to contain significant amounts of parallelism, partitioning code in a fruitful way is not immensely difficult. In summary, hierarchical computing, which exploits parallelism, distributes computations, and reduces the data transport requirements, is a feasible model of computation for future database servers.
One attractive result of using this paradigm is the feasibility of using processors with different performance ratings in the same system. This is possible due to the use of heterogeneous processing elements in the different layers of the hierarchy. Processors which are used at the top of the hierarchy move down as memory processors once a new high-performance processor generation arrives. Simultaneously, the current memory processors move to the disk as disk processors. This maximizes the lifetime of a processor design, amortizing the cost of the design.

References

[1] R. Winter, "The growth of enterprise data: Implications for the storage infrastructure," Whitepaper, Winter Corporation, 1998.
[2] V. S. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. Nahum, "Locality-aware request distribution in cluster-based network servers," in Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, (San Jose, CA, USA), pp. 205-216, Oct. 2-7, 1998.
[3] A. M. G. Maynard, C. M. Donnelly, and B. R. Olszewski, "Contrasting characteristics and cache performance of technical and multi-user commercial workloads," in Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, (San Jose, CA, USA), pp. 145-156, Oct. 4-7, 1994.
[4] S. E. Perl and R. L. Sites, "Studies of Windows NT performance using dynamic execution trees," in Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation, (Seattle, WA, USA), pp. 169-184, Oct. 28-31, 1996.
[5] L. Barroso, K. Gharachorloo, and E. Bugnion, "Memory system characterization of commercial workloads," in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), (Barcelona, Spain), pp. 3-14, June 27-July 1, 1998.
[6] M. Rosenblum, E. Bugnion, S. A. Herrod, E. Witchel, and A. Gupta, "The impact of architectural trends on operating system performance," in Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, (Copper Mountain, CO), pp. 285-298, ACM Press, Dec. 1995.
[7] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood, "DBMSs on a modern processor: Where does time go?," in Proceedings of the 25th Conference on Very Large Data Bases (VLDB'99), (Edinburgh, Scotland), pp. 15-26, Sept. 7-10, 1999.
[8] K. Keeton, D. Patterson, Y. Q. He, R. C. Raphael, and W. E. Baker, "Performance characterization of a Quad Pentium Pro SMP using OLTP workloads," in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), (Barcelona, Spain), pp. 15-26, June 27-July 1, 1998.
[9] D. G. Elliott, W. M. Snelgrove, and M. Stumm, "Computational RAM: A memory-SIMD hybrid and its application to DSP," in Custom Integrated Circuits Conference, pp. 30.6.1-30.6.4, May 1992.
[10] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, "A case for intelligent RAM: IRAM," IEEE Micro, Apr. 1997.
[11] M. Oskin, F. Chong, and T. Sherwood, "Active pages: A computation model for intelligent memory," in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), (Barcelona, Spain), pp. 192-203, June 27-July 1, 1998.
[12] M. Hall, P. Kogge, J. Koller, P. Diniz, J. Chame, J. Draper, J. LaCoss, J. Granacki, J. Brockman, A. Srivastava, W. Athas, V. Freeh, J. Shin, and J. Park, "Mapping irregular applications to DIVA, a PIM-based data-intensive architecture," in Proceedings of the High Performance Networking and Computing Conference (SC99), (Portland, OR), Nov. 13-19, 1999.
[13] K. Keeton, D. A. Patterson, and J. M. Hellerstein, "A case for intelligent disks (IDISKs)," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD-98), (Seattle, WA, USA), pp. 42-52, June 1-4, 1998.
[14] A. Acharya, M. Uysal, and J. Saltz, "Active disks: Programming model, algorithms and evaluation," in Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, (San Jose, CA, USA), pp. 81-91, Oct. 2-7, 1998.
[15] G. A. Gibson, D. F. Nagle, K. Amiri, J. Butler, F. W. Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg, and J. Zelenka, "A cost-effective, high-bandwidth storage architecture," in Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, (San Jose, CA, USA), pp. 92-103, Oct. 2-7, 1998.
[16] E. Riedel, G. Gibson, and C. Faloutsos, "Active storage for large-scale data mining and multimedia," in Proceedings of the 24th VLDB Conference, Aug. 24-27, 1998.
[17] G. Memik, M. T. Kandemir, and A. Choudhary, "Design and evaluation of smart disk architecture for DSS commercial workloads," in Proceedings of the 2000 International Conference on Parallel Processing, (Toronto, ON, Canada), pp. 335-342, Aug. 21-24, 2000.
[18] Cirrus Logic, Inc., "Preliminary product bulletin CL-SH8665," June 1998.
[19] Intel Corporation, "i960 HX microprocessor developer's manual," Order Number 272484-002, Sept. 1998.
[20] Siemens Microelectronics, "TriCore architecture overview handbook," Feb. 1999.
[21] A. Tessardo, "TMS320C27x: New generation of embedded processors looks like a microcontroller, runs like a DSP," White Paper SPRA446, Digital Signal Processing Solutions, 1998.
[22] E. F. Codd, "A relational model of data for large shared data banks," Communications of the ACM, vol. 13, no. 6, pp. 377-387, 1970.
[23] Arvind and R. S. Nikhil, "Executing a program on the MIT tagged-token dataflow architecture," in PARLE '87, Parallel Architectures and Languages Europe, Volume 2: Parallel Languages (J. W. de Bakker, A. J. Nijman, and P. C. Treleaven, eds.), Berlin: Springer-Verlag, 1987. Lecture Notes in Computer Science 259.
[24] "TPC-C specification." http://www.tpc.org/cspec.html.
[25] "IBM DB2 Universal Database." http://www.software.ibm.com/data/db2/udb/.
[26] M. Jarke and J. Koch, "Query optimization in database systems," ACM Computing Surveys, vol. 16, pp. 111-152, June 1984.
[27] J. Lee, Y. Solihin, and J. Torrellas, "Automatically mapping code on an intelligent memory architecture," in Proceedings of the Seventh International Symposium on High Performance Computer Architecture (HPCA-7), (Monterrey, Mexico), Jan. 19-24, 2001.
[28] D. K. Hsiao, ed., Advanced Database Machine Architecture. Englewood Cliffs, NJ: Prentice-Hall, 1983.
[29] L. L. Miller, A. R. Hurson, and S. H. Pakzad, eds., Parallel Architectures for Data/Knowledge-Based Systems. Los Alamitos, CA: IEEE Computer Society Press, 1995.
[30] D. J. DeWitt and J. Gray, "Parallel database systems: The future of high-performance database systems," Communications of the ACM, vol. 35, pp. 85-98, June 1992.
[31] M. H. Nodine and J. S. Vitter, "Greed sort: Optimal deterministic sorting on parallel disks," Journal of the ACM, vol. 42, pp. 919-933, July 1995.
[32] A. Segev, "Optimization of join operations in horizontally partitioned database systems," ACM Transactions on Database Systems, vol. 11, pp. 48-80, Mar. 1986.
[33] L. D. Shapiro, "Join processing in database systems with large main memories," ACM Transactions on Database Systems, vol. 11, pp. 239-264, Sept. 1986.
[34] Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, and J. Torrellas, "FlexRAM: Toward an advanced intelligent memory system," in Proceedings of the International Conference on Computer Design (ICCD99), (Austin, TX, USA), Oct. 1999.
[35] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz, "Smart Memories: A modular reconfigurable architecture," in Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA'00), (Vancouver, BC, Canada), pp. 161-171, June 12-14, 2000.
[36] A. M. Despain and D. A. Patterson, "X-TREE: A tree structured multi-processor computer architecture," in Proceedings of the 5th Annual International Symposium on Computer Architecture, (Palo Alto, CA, USA), pp. 144-151, Apr. 3-5, 1978.
[37] D. A. Patterson, E. S. Fehr, and C. H. Sequin, "Design considerations for the VLSI processor of X-TREE," in Proceedings of the 6th Annual International Symposium on Computer Architecture, (Philadelphia, PA, USA), pp. 90-101, Apr. 23-25, 1979.
[38] P. C. Treleaven, "VLSI processor architectures," IEEE Computer, pp. 33-45, June 1982.
[39] A. L. Davis, "The architecture and system method of DDM1: A recursively structured data driven machine," in Proceedings of the 5th Annual International Symposium on Computer Architecture, (Palo Alto, CA, USA), pp. 210-215, Apr. 3-5, 1978.
