{jrubio,ljohn}@ece.utexas.edu
Abstract

Transaction processing workloads impose heavy demands on the memory and storage sub-systems and often result in large amounts of traffic on I/O and memory buses. In this paper, we propose to utilize processing elements distributed across the memory hierarchy, with the objective of performing the computation close to the data residence. Leveraging active memory modules and active disk devices emerging from other research groups and available in the market, we propose a hierarchical computing system in which the distributed processing elements operate concurrently and communicate using a hierarchical interconnect. Transactions are partitioned across the different layers in the hierarchy depending on the affinity of code to a particular layer or on other heuristics. Commands percolate down into the lower layers of the hierarchy, and preprocessed or partially processed information flows up into the higher layers. All layers actively participate in the processing of the transaction by doing tasks for which they are particularly suited. The lower layers contain inexpensive processor units and, in conjunction with the powerful central processor and the other collaborating memory and disk processors, yield high performance in a cost-effective fashion.

This concept is then applied to the online transaction processing benchmark TPC-C, and schemes for code partitioning are outlined. A hierarchical computing system containing four inexpensive memory processors and 32 very inexpensive disk processors can yield speedups of up to 4.52x when compared with a traditional system. Since transaction processing has been seen to contain fine-grained and coarse-grained parallelism, the proposed hierarchical computing paradigm, which exploits parallelism and reduces data transport requirements, seems to be a feasible model for future database servers.
Keywords: Computer Architecture, Parallel Processing, Memory Hierarchy, High Performance Computing, Database Servers.
Technical Area: Architecture
Introduction
It is a well-known fact that the server market is the driving factor for several of the technological advancements in the computer industry. A few years back, this market was mainly dominated by technical workloads, but during the last two decades it has changed to power a large portion of commercial operations.
One important type of application in the group of commercial workloads is Transaction Processing (TP). Transaction processing workloads are classified in two types: Online Transaction Processing (OLTP) and Decision Support Systems (DSS). OLTP systems are used to handle the operations that occur during the normal operation of a business (e.g., a client buys products, the managers check the inventory or adjust the price of an item). On the other hand, DSS systems are used to make decisions based on the data gathered by a business, which usually comes from an OLTP system (e.g., find the most popular product within a given demographic bracket, or estimate the net profit of all sales in the last three months). Even though both workloads fit within the category of transaction processing, they have many differences:
- OLTP operations are of short duration, taking milliseconds to complete, whereas DSS operations take minutes.
- The same applies to the dataset of an operation. While OLTP operations usually have datasets in the order of kilobytes or megabytes, DSS operations usually access megabytes or hundreds of megabytes of data. Recent literature suggests that DSS systems will be accessing gigabytes in the next couple of years [1].
- The number of concurrent operations in an OLTP system is in the order of thousands, while DSS systems normally have fewer than a hundred concurrent operations.
Transaction processing systems are typically implemented using a multi-tier architecture. The idea is to implement a functional pipeline that streams transactions from the clients to the server database in an efficient way. As can be seen from Figure 1, clients on the left are connected to an intermediate server, or Middle-Tier, through a switched network. The function of the middle-tier server is to act as a filter and reject those requests presented by the clients that are incorrectly generated. It also enforces the security in the system and serves as a parser that transforms requests formulated in one language domain (e.g., HTML) to another domain (e.g., SQL).
The next component is the Middle-Tier server. In this example, the Middle-Tier server is implemented as a server cluster, with a front-side connection to the clients through a load-balancing switch, whose
[Figure 1: clients connect through a front-side switch to the middle-tier server cluster, which connects through a back-end switch to the back-end-tier server.]
Figure 1: Conventional System-Level Architecture for a Transaction Processing System.
function is to create a distributed load across all the nodes of the cluster. Work done in this area includes locality-aware distribution algorithms like the one developed by Pai et al. [2]. The nodes in the cluster communicate with each other through a back-side network interface, which they also use to send the requests to the database server, also referred to as the Back-end-Tier server.
The final component of the system is the Back-end-Tier, which is also the focus of this paper. This server is the one that manipulates the primary data of the commercial operation (e.g., it keeps the list of clients, the orders they place, the prices of items and their quantities in the warehouses). As such, the back-end-tier has complete control over a large portion of the data, which is normally local to it and accessed using a Relational Database Management System (RDBMS or commonly DBMS). Implementations of this server include symmetric multiprocessor (SMP) systems as well as cluster servers.
When we look at the execution behavior of commercial workloads, we observe that they are different from technical workloads and present more vigorous demands on the memory and storage sub-systems [3, 4, 5]. In fact, studies that analyzed transaction processing workloads indicate that systems spend around 90% of the time waiting for the I/O devices to access the data [6]. Once the data are brought to memory, the processor spends between 25% and 45% of the execution time handling memory accesses [7]. That results in a sub-optimal utilization of the latency-hiding features of modern dynamically scheduled processors [8].
One of the reasons for this imbalance between computation and data access traces back to the principles of traditional memory hierarchies, where data moves from the storage sub-system to the processor before it can be processed. Although we have become accustomed to this execution model, which works well for technical and some other applications, it is far from optimal when used with a transaction processing workload. The action of moving data back and forth between the storage and the computing elements not only results in a high volume of traffic, which hurts the scalability of the system, but it also creates an artificial bottleneck by serializing the execution in an environment with ample parallelism.
This paper presents the Hierarchical Computing model as a possible solution to the problems presented above. The next section introduces the idea, giving special attention to the operation of the hardware and the communication of the devices. Section 3 covers the programming model used in the system, how we plan to partition the problem, and how a set of basic primitives is used to operate on the data. Section 4 presents a basic code partitioning scheme and inspects the idea using a conventional transaction processing workload, which helps us determine the feasibility of the model. Section 5 performs a mathematical analysis of the idea and identifies the parameters that affect the performance of this technique. Section 6 looks at other ideas proposed in the literature. Section 7 concludes with a highlight of the most significant contributions.
Hierarchical Computing
To address the problems presented in the previous section, most transaction processing systems exploit the coarse-grain parallelism present in the form of concurrent transactions. Before proceeding further it is important to define what a transaction is. A transaction is a sequence of operations which are executed atomically (atomic), which always maintain a consistent state in the database (consistent), whose execution is not affected by concurrent transactions (isolation) and whose effects are permanent (durable). This set of properties, commonly referred to as the ACID properties after the first letter of each property, is the basis of transaction processing theory. Implementations of current systems use a thin layer of software known as the Transaction Manager. Its function is to enforce an order in the arrival of the transactions and the ACID properties required by the transaction model. The queries that form a transaction are then passed to the database processes which run on each one of the processors in the system. Although this approach has been relatively successful, it can cause resource management problems, manifested in the form of hot spots during the access of the database tables, thus inhibiting the exploitation of all the parallelism in the system.
The Hierarchical Computing execution model exploits the parallelism available within a single transaction, in addition to the thread-level or coarse-level parallelism that can be exploited using other means. The idea is to distribute the computation across a computer system on behalf of a single transaction. In the model, the communication follows a message-based approach using a hierarchical topology of interconnects. Since this model exploits a different type of parallelism than a traditional system (intra-transaction vs. inter-transaction), its use can be orthogonal to existing methods. While in a traditional system all the computations are required to be performed by the CPU, under this model there is no need to insist on that. There is an important difference between both models, which is seen in the way a server handles a query from a client: several operations may be efficiently performed at the data residence. This brings a set of interesting tradeoffs, which are presented and evaluated in the next sections.
To distribute computing in this way, some computing power is located in memory and some in disk, close to the location of the data, where it performs simple computations such as comparison and accumulation. The components are coupled using a hierarchical interconnect (any sort of directed acyclic graph, such as a binary tree).
[Figure: levels of the hierarchy — (1) processor (ILP), (2) interconnect, (3) main memory (multibank), (4) interconnect, (5) disk (array) — each with an attached processing element.]
Upper levels do not need to know every detail of the operations of their subordinates. Each level in the hierarchy conveys only the right amount of information to the layer above.
A system with, say, 1 main memory module and 4 disk modules needs to have only 3 levels of computation (1 in the main CPU, 1 in memory and 1 in the disks), whereas processors in the interconnect will be critical to larger hierarchical computing systems. The number of levels of interconnect processors depends on the number of disks reporting to the same memory, as well as on the interconnect network; thus for a binary tree we need a number of levels of interconnects expressed by:

    max( log2( Number of disks / Number of memory units ) - 1, 0 )    (1)
The intelligence in the different levels can be realized using intelligent memory modules investigated in recent research [9, 10, 11, 12] and intelligent or active disks [13, 14, 15, 16, 17]. It is possible to find storage devices in the market with a 150 MIPS core and up to 2 MB of main memory [18, 19, 20, 21]. Ongoing efforts in intelligent or active memories and disk components can thus be leveraged to implement hierarchical computing systems. If computing capability is required in the network or switches, it can be realized using chips similar to micro-controllers embedded in the switch/bus interface. It may be noted that the computing resources required in the storage devices or memory are significantly cheaper than the central processor. Use of a powerful processor which exploits instruction-level parallelism and thread-level parallelism is favored at the root of the hierarchy.
The hierarchical mode of operation has worked relatively well in our society, and we expect it to work as well in a computer system due to the following characteristics:
- Raw data stay local to a level of the hierarchy, which gives more freedom to the upper levels to operate and hold temporary results of the operations in their fast but reduced storage space.
- It suggests a specialization of the units, which permits a system to use dynamically scheduled processors in the upper levels of the hierarchy, where complex decisions are required and control flow is hard to predict. In-order processors, or narrower power-efficient processors, can be used to handle the bulk of the data in the lower levels.
The implications of the first point affect the mode of operation of the hierarchy and its programming model and are covered in the next section. The analysis of the second point directly affects the hardware used in the system and will be studied in subsequent sections.
Programming Model
As mentioned in the previous section, the basic element behind the Hierarchical Computing model is the use of computation engines that sit close to the location of the data. The idea is to expedite the movement of data from its natural point of residence to the computation unit. This computation unit can be the main processor, for tasks that require the high computation power provided by a high-frequency, dynamically scheduled processor, or it could as well be a much simpler microprocessor or microcontroller that functions as an intelligent memory or disk controller. The partitioning of the data and operations to take advantage of this new architecture is covered in Section 4. This section deals with the programming model used to support a transaction processing workload.
The heart of a transaction processing system is the database server. Nowadays, databases are based on the Relational model [22]. In this model data is stored in the form of tables with a variable number of rows of a predetermined width. To access these tables, these systems use data manipulation languages (DML), of which the structured query language (SQL) is the one most frequently used. The operations supported by the SQL language include:

- Scan: locates rows that match a particular criterion. The criterion can be a single predicate (e.g., list all flights to Pittsburgh) or a complex predicate (e.g., find all vehicles in the state of Texas registered after 1973). It is called select under some data manipulation languages. This operates on a single table.
Once the upper level sends the command to the lower level, it waits for data, which depending on the operation can be a null response, a single element or a sequence of elements. The hardware provides basic control-flow signals to help the upper level handle the amount of data that might result from an operation.
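The command/response exchange described above can be modeled in software. The sketch below is an illustrative model only (the names `lower_level`, `region` and `command` are ours, not the paper's): the lower level streams back zero, one, or many elements, and a Python generator stands in for the hardware flow-control signals, since the lower level produces a new element only when the upper level is ready to consume it.

```python
# Model of the inter-level protocol: the upper level issues a command and
# consumes a stream of results that may be empty, a single element, or a
# sequence. The generator mimics hardware flow control: elements are
# produced only as fast as the upper level pulls them.

def lower_level(region, command):
    for element in region:
        if command(element):
            yield element            # one DATUM streamed up at a time

region = [3, 9, 4, 9]
stream = lower_level(region, lambda x: x == 9)
print(next(stream))                  # upper level pulls when ready -> 9
print(list(stream))                  # drain the remainder -> [9]
```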
[Figure: the upper level sends a command (CMD) down to an active component, which streams data (DATUM) back up.]
The lower level can either stop when an event is triggered or continue operating until it reaches the end of the region.

Example: search a range of data for a string.
[Figure: scan example — the command (CMD) percolates down to the active components, and the matching elements (5 and 7) flow back to the upper level.]
Type 2: Aggregate

For this primitive, the lower level accesses its associated data and finds the elements that match a particular criterion. However, it does not send all these results to the upper level. Instead, it produces an aggregate number and sends it once all its data has been analyzed. In this context, an aggregate function Aggregate() is any function that produces a single number based on a set of numbers {I_0, ..., I_N}. The most common aggregate functions in transaction processing workloads are sum(), count(), average(), max(), and min(). In this case, the results returned by the different nodes in the lower level might need to be combined in the upper level to produce a unique answer. We can accomplish that if we apply the same aggregate function in the upper level to the values returned. This works for all the examples shown above, except for the average() function. For those circumstances the algorithm is changed to return both sum() and count() from the nodes in the lower levels, and the upper level performs the computation of average().
Example: count orders placed in January 2001.
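A sketch of the aggregate primitive, including the average() special case described above. The data layout and helper names are illustrative assumptions; note that partial counts are recombined by summing them, and average() is rebuilt in the upper level from the partial sums and counts.

```python
# Illustrative model of the aggregate primitive: each lower-level node
# reduces its local matches to a single number; the upper level combines
# the partial results.

def node_aggregate(data, predicate, fn):
    # Performed at the lower level, over local data only.
    return fn([x for x in data if predicate(x)])

def total_count(nodes, predicate):
    # Partial counts are recombined by summing them in the upper level.
    return sum(node_aggregate(d, predicate, len) for d in nodes)

def average(nodes, predicate):
    # average() is the exception: nodes return sum() and count(), and
    # the upper level finishes the division.
    sums   = [node_aggregate(d, predicate, sum) for d in nodes]
    counts = [node_aggregate(d, predicate, len) for d in nodes]
    return sum(sums) / sum(counts)

# Count orders placed in January 2001 (orders keyed by month, hypothetical layout).
orders = [[("2001-01", 5), ("2001-02", 1)], [("2001-01", 2)]]
print(total_count(orders, lambda o: o[0] == "2001-01"))   # -> 2
```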
Code partitioning

When we introduced the programming model in Section 3, we assumed that the data was laid out correctly on disk. The distribution of code among the different levels of the hierarchy impacts performance. In this section we address the issue of how to partition a transaction in order to achieve good performance with the Hierarchical Computing model.
[Figure: aggregate example — the active components return partial aggregates (aggr<5,7> and aggr<3>) that the upper level combines into a single result.]
Figure 6: Execution Plan for the TPC-C New-Order Transaction (Scans of Warehouse, District, Customer, Item and Stock; Updates of District and Stock; Inserts into Order and OrderLine).
Figure 7: Execution Plan for the TPC-C Payment Transaction (Scans and Updates of Warehouse, District and Customer; Insert into History).
Figure 8: Execution Plan for the TPC-C Order-Status Transaction (Scans of Customer, Order and OrderLine, with a Sort).
Figure 9: Execution Plan for the TPC-C Delivery Transaction (Scans of NewOrder, Order, OrderLine and Customer; Sort; Remove; Updates; Aggregate).
Figure 10: Execution Plan for the TPC-C Stock-Level Transaction (Scans of District and OrderLine, Join with Stock, and a Count).
Evaluation
To evaluate the potential of the hierarchical computing model, we estimate the time needed to perform a task in this model and compare it with the time needed in a conventional system. Equation 2 shows the basic expression for the time necessary to complete a task, where performance gains can be obtained by improving any of the factors of the equation. We have chosen to express the time required to execute an instruction (TPI) as the product of the clock period and the CPI (cycles per instruction). The first is commonly associated with the hardware implementation details, while the second is a factor of the processor architecture and the workload being executed.
[Figure 11 flowchart: for each level, for each table — estimate the cost of placing the table's operations at that level; if the estimate fits (and unconditionally at the last level, i = M), assign them to it.]

Figure 11: Simple scheme for partitioning transaction processing workloads.

    t = ( Instructions × TPI ) / N_processors    (2)

    TPI = T_clk × CPI    (3)
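A minimal sketch of the greedy partitioning scheme of Figure 11, under our reading of the flowchart: walk the levels bottom-up and assign each table's operations to the first level whose threshold still has room, with the last level visited taking whatever remains. The table names, sizes and thresholds are hypothetical.

```python
# Greedy partitioning sketch ("for each Level, for each Table"): assign
# each table to the lowest level with remaining capacity; the final
# level takes everything left over.

def partition(tables, levels, threshold):
    """tables: {name: size}; levels: level ids ordered bottom-up
    (disk first, main CPU last); threshold: {level: capacity}."""
    assignment = {}
    used = {lid: 0 for lid in levels}
    for lid in levels:                      # For each Level
        for name, size in tables.items():   # For each Table
            if name in assignment:
                continue
            # Cost estimation reduced to a capacity check here.
            if lid == levels[-1] or used[lid] + size <= threshold[lid]:
                assignment[name] = lid
                used[lid] += size
    return assignment

tables = {"Stock": 40, "Item": 5, "Customer": 10}   # hypothetical sizes
levels = [3, 2, 1]                                  # disk, memory, CPU
threshold = {3: 50, 2: 8, 1: float("inf")}
print(partition(tables, levels, threshold))
# -> {'Stock': 3, 'Item': 3, 'Customer': 1}
```

Setting Threshold(Level_i) = Capacity(Level_i), as in the evaluation below, makes the check a pure storage-capacity test.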
This execution model performs computations in the different levels of the storage hierarchy, in what can be considered heterogeneous computing elements; thus Equation 4 expresses the time spent computing by a single level. The level index i goes from 1 to M, where M is the total number of levels in the hierarchy.¹ The parameter N_i indicates the number of processing elements in the particular level i.

    t_i = ( Instructions_i × TPI_i ) / N_i    (4)
Another interesting characteristic of this model is that it allows operations to be performed concurrently in the different levels. However, there might be algorithms that do not allow that level of parallelism because they have serial components in their execution. Taking that into account, Equations 5 and 6 show the range of values for the total time taken by a task. When we can exploit full parallelism, the resulting time will be the maximum of the individual times. Situations where all the operations need to be serialized would result in a time equal to the sum of all the times. Hence the range of execution times can be calculated as follows:

    t_max = Σ_{i=1..M} t_i    (no overlapping)    (5)

    t_min = MAX_{i=1..M} t_i    (full overlapping)    (6)
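Equations 4 to 6 can be exercised numerically. The instruction counts, TPI values and element counts below are illustrative assumptions only, not measurements from the paper.

```python
# Per-level times (Equation 4) and the serial/parallel bounds of
# Equations 5 and 6: with no overlap the task takes the sum of the
# per-level times; with full overlap only the slowest level matters.

def level_time(instructions, tpi, n_elements):
    return instructions * tpi / n_elements          # Equation 4

times = [level_time(8e8, 1.0, 1),     # level 1: main CPU (assumed)
         level_time(2e8, 10.0, 4),    # level 2: 4 memory processors
         level_time(7e8, 15.0, 32)]   # level 3: 32 disk processors
t_max = sum(times)   # Equation 5: fully serialized execution
t_min = max(times)   # Equation 6: fully overlapped execution
print(t_min, t_max)
```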
To enable this model, computation is partitioned so that operations are assigned to the computation elements in each one of the levels. Considering the nature of transaction processing workloads, the partitioning of operations can be done statically or using query optimizers like the ones built into most commercial databases [26]. With these systems, we express the distribution of computation in Equation 7. The coefficients φ_i indicate the fraction of the total number of instructions executed in level i of the hierarchy.

    Instructions = Σ_{i=1..M} Instructions_i    (7)

    φ_i = Instructions_i / Instructions    (8)
¹ Our convention is to number the hierarchy from the top to the bottom, so the level with the main processor is assigned the value i = 1.
The number of instructions a processor is capable of executing is a function of a myriad of factors in the system. Among them, the type of workload plays an important role. Given that we are evaluating this model in the light of transaction processing workloads, we decompose the CPI into a computation and a storage component:

    CPI = CPI_computation + CPI_storage    (9)
A characteristic of this model is the use of relatively simple computation engines in the lower levels of the hierarchy. This affects the balance of the operations in the system, according to Equation 10:

    CPI_computation,i < CPI_computation,j    for i < j    (10)
Assuming a hierarchy of three levels like the one shown in Figure 12, it is possible to present an expansion of the above relations, showing the factors into which the time decomposes. We have assigned the first level to operations computed by the main processor, the second level to the main memory, and the third level to the operations performed by the disk controllers.
[Figure 12: a three-level hierarchy — the main processor P.main (i = 1), N2 memory processors P.memory (i = 2), and N3 disk processors P.disk (i = 3).]
    t = ( Instructions_1 / N_1 ) × TPI_1  [main]
      + ( Instructions_2 / N_2 ) × TPI_2  [memory]
      + ( Instructions_3 / N_3 ) × TPI_3  [disk]    (11)
Since typically the disk and memory processors will not be as sophisticated as the central processor, we use a degradation factor β_i = TPI_i / TPI_1 to indicate how the memory and disk processors compare to the central processor. The speedup when the algorithm shows no overlapping is then:

    Speedup = [ φ_1 + φ_2 β_2 (N_1/N_2) + φ_3 β_3 (N_1/N_3) ]^(-1)    (12)
Similarly, the speedup when the algorithm shows full overlapping is expressed as:
    Speedup = [ MAX( φ_1, φ_2 β_2 (N_1/N_2), φ_3 β_3 (N_1/N_3) ) ]^(-1)    (13)
[Figure: speedup surface for partition F = (17%, 25%, 58%) and configuration Y = (1, 4, 32), as a function of the level-2 and level-3 degradation factors (β2, β3); the speedup ranges from 0.00 to 3.00.]
  (N1, N2, N3)   (φ1, φ2, φ3)       Speedup Min   Speedup Max
  (1, 4, 32)     (10%,  5%, 85%)    1.60          4.32
  (1, 4, 32)     (35%, 10%, 55%)    1.17          2.13
  (1, 4, 32)     (15%, 25%, 60%)    0.95          2.86
  (1, 4, 32)     ( 0%, 25%, 75%)    1.01          4.52
  (1, 4, 32)     (10%, 20%, 70%)    1.08          3.48
  (1, 4, 16)     (10%, 20%, 70%)    0.80          2.67
  (1, 4, 32)     (10%, 20%, 70%)    1.08          3.48
  (1, 4, 64)     (10%, 20%, 70%)    1.31          4.10
  (1, 2, 32)     (10%, 20%, 70%)    0.70          2.58
  (1, 2, 64)     (10%, 20%, 70%)    0.79          2.91

Table 2: Speedups of the hierarchical computing system with conventional memory and disk components.
The algorithm presented in Figure 11 is used to obtain a preliminary estimate of the code partitioning. For the situation where Threshold(Level_i) = Capacity(Level_i), Table 3 shows the fractions of the computation that should be performed in each level for a configuration of (1, 4, 16).
[Table 3: fractions φ_1, φ_2 and φ_3 per level for the configuration (1, 4, 16).]
without restructuring the algorithm. It is then important to analyze the effect that changing the hardware configuration has on the speedup for a given code partition. This is shown in the lower half of Table 2. Here we work on a partition of (10%, 20%, 70%) and change the number of memory and disk modules. Given the large amount of code given to the disk modules, we observe that incrementing the number of disks results in an increase in the speedup. As in any other multiprocessor system, the increase is not linear with the number of processing elements added. It may also be noted that active memory systems and active disk systems are subsets of the proposed general hierarchical computing system.
Related Work
During the 1970s, computer scientists looked at database applications and proposed specially designed machines to handle the increasing gap in performance between primary and secondary storage, as well as the overwhelming software complexity present in database applications [28, 29]. Known as Database Machines, these systems incorporated specialized components in the form of per-disk, per-track and per-head processors and associative memories, in order to facilitate the access of data. The problem with these systems was that the use of non-commodity hardware drastically increased their cost. Additionally, they were designed to handle only database workloads, which resulted in declining interest from the rest of the architecture and software communities.
Database machines saw their last days with the development of parallel databases [30], which proved to be a cost-effective solution for the problems of the day. Since then, modern commercial databases have adopted several of the proposed algorithms: parallel sort [31], parallel join [32] and other algorithms that trade off memory utilization for I/O bandwidth [33].
In addition to database machines, several research projects have looked at the idea of having computation elements close to the data. Intelligent memories have mostly been targeted at regular numeric applications [9, 10], but recent attempts also look at their use in non-regular applications [11, 34, 12, 35]. Likewise, the idea of the intelligent disk has been covered by different research groups. The workloads considered for this technology consist of Decision Support Systems [13, 14, 15, 17], data-mining and multimedia applications [16].
Related to our research, the X-Tree architecture [36, 37] looks at a multiprocessor organization where processors are connected using a binary tree. This topology facilitates the design of high-bandwidth systems, as the average distance between the nodes increases only logarithmically with the number of nodes in the system. However, the emphasis in the X-Tree system was to build VLSI chips based on the idea of recursive architectures [38], where it was possible to design a computer system by constructing a hierarchy with the same type of processors. Another example of a recursive system is the Data Driven Machine (DDM1), designed by Davis et al. [39]. DDM1 was able to exploit concurrency due to its implementation of Data Driven Nets (DDN), which constitute a form of dataflow similar to the one used by our model.
While the computing paradigm presented in this paper has similarities to the aforementioned research efforts, it must be noted that the merits of several past architectural paradigms are being synergistically combined and applied to transaction processing in our current research effort. We are leveraging advancements in active memories and active disks while at the same time taking advantage of the advancements made during the last twenty years in parallel databases and query optimizers.
Conclusions
In this paper, we have presented a hierarchical model of computing, which is based on the concept of performing computations in a hierarchical manner distributed over the memory hierarchy. This research is intended to alleviate the imbalance of computation and data accessing experienced by large-scale transaction processing systems. The building blocks that help to realize the proposed hierarchical model are active memory units and active disk devices that are emerging in the market. The basic principle is to use computation engines that sit close to the location of the data, and to use a hierarchy to connect these computation engines. This paradigm also brings in benefits of the dataflow model of computation.
We described the computation paradigm and outlined a simple code partitioning scheme. Then we applied the code partitioning scheme to the TPC-C benchmark, and observed that it is possible to split the transactions using static information about the database tables and the storage capacity of the computation nodes. Using the simple partitioning algorithm and assuming inexpensive memory and disk processors, we show that a hierarchical computing system with four memory processors and 32 disk processors can obtain speedups of up to 4.52x when compared with traditional uniprocessor systems. While performance degradations will occur with non-parallelizable code, or in systems with very slow memory or disk processors, it is seen that judicious partitioning of code can yield performance improvements. Since transaction processing has been shown to contain significant amounts of parallelism, partitioning code in a fruitful way is not immensely difficult. In summary, hierarchical computing, which exploits parallelism, distributes computations, and reduces the data transport requirements, is a feasible model of computation for future database servers.
One attractive result of using this paradigm is the feasibility of using processors with different performance ratings in the same system. This is possible due to the use of heterogeneous processing elements in the different layers of the hierarchy. Processors which are used at the top of the hierarchy move down to serve as memory processors once a new high-performance processor generation arrives. Simultaneously, the current memory processors move to the disks as disk processors. This maximizes the lifetime of a processor design, amortizing its cost.
Referen
es
[1 R. Winter, \The growth of enterprise data: Impli
ations for the storage infrastru
ture," in Whitepaper
Winter Corporation, 1998.
[2 V. S. Pai, M. Aron, G. Banga, M. Svendsen, P. Drus
hel, W. Zwaenepoel, and E. Nahum, \Lo
alityaware request distribution in
luster-based network servers," in Pro
eedings of the Eight International
Conferen
e on Ar
hite
tural Support for Programming Languages and Operating Systems, (San Jose,
CA, USA), pp. 205{216, O
t. 2{7 1998.
[3 A. M. G. Maynard, C. M. Donnelly, and B. R. Olszewski, \Contrasting
hara
teristi
s and
a
he performan
e of te
hni
al and multi-user
ommer
ial workloads," in Pro
eedings of the Sixth International
Conferen
e on Ar
hite
tural Support for Programming Languages and Operating Systems, (San Jose,
CA, USA), pp. 145{156, O
t. 4{7 1994.
[4 S. E. Perl and R. L. Sites, \Studies of Windows NT Performan
e Using Dynami
Exe
ution Trees,"
in Pro
eedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation,
(Seattle, WA, USA), pp. 169{184, O
t. 28{31 1996.
[5 L. Barroso, K. Ghara
horloo, and E. Bugnion, \Memory system
hara
terization of
ommer
ial workloads," in Pro
eedings of the 25th Annual International Symposium on Computer Ar
hite
ture (ISCA98), (Bar
elona, Spain), pp. 3{14, June 27{July 1 1998.
[6 M. Rosenblum, E. Bugnion, S. A. Herrod, E. Wit
hel, and A. Gupta, \The impa
t of ar
hite
tural
trends on operating system performan
e," in Pro
eedings of the Fifteenth ACM Symposium on Operating Systems Prin
iples, (Copper Mountain, CO), pp. 285{298, ACM Press, De
. 1995.
[7 A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood, \DBMSs on a modern pro
essor: Where does
time go?," in Pro
eedings of the 25th Conferen
e on Very Large Data Bases (VLDB'99), (Edinburgh,
S
otland), pp. 15{26, Sept. 7{10 1999.
[8] K. Keeton, D. Patterson, Y. Q. He, R. C. Raphael, and W. E. Baker, "Performance Characterization of a Quad Pentium Pro SMP using OLTP Workloads," in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), (Barcelona, Spain), pp. 15–26, June 27–July 1 1998.
[9] D. G. Elliott, W. M. Snelgrove, and M. Stumm, "Computational RAM: A memory-SIMD hybrid and its application to DSP," in Custom Integrated Circuits Conference, pp. 30.6.1–30.6.4, May 1992.
[10] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, "A case for intelligent RAM: IRAM," IEEE Micro, Apr. 1997.
[11] M. Oskin, F. Chong, and T. Sherwood, "Active pages: A computation model for intelligent memory," in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), (Barcelona, Spain), pp. 192–203, June 27–July 1 1998.
[12] M. Hall, P. Kogge, J. Koller, P. Diniz, J. Chame, J. Draper, J. LaCoss, J. Granacki, J. Brockman, A. Srivastava, W. Athas, V. Freeh, J. Shin, and J. Park, "Mapping irregular applications to DIVA, a PIM-based data-intensive architecture," in Proceedings of the High Performance Networking and Computing Conference (SC99), (Portland, OR), Nov. 13–19 1999.
[13] K. Keeton, D. A. Patterson, and J. M. Hellerstein, "A case for intelligent disks (IDISKs)," in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD-98), (Seattle, WA, USA), pp. 42–52, June 1–4 1998.
[14] A. Acharya, M. Uysal, and J. Saltz, "Active disks: Programming model, algorithms and evaluation," in Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, (San Jose, CA, USA), pp. 81–91, Oct. 2–7 1998.
[15] G. A. Gibson, D. F. Nagle, K. Amiri, J. Butler, F. W. Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg, and J. Zelenka, "A cost-effective, high-bandwidth storage architecture," in Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, (San Jose, CA, USA), Oct. 2–7 1998.
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29] L. L. Miller, A. R. Hurson, and S. H. Pakzad, eds., Parallel architectures for data/knowledge-based systems. Los Alamitos, CA: IEEE Computer Society Press, 1995.
[30] D. J. DeWitt and J. Gray, "Parallel database systems: The future of high-performance database systems," Communications of the ACM, vol. 35, pp. 85–98, June 1992.
[31] M. H. Nodine and J. S. Vitter, "Greed sort: Optimal deterministic sorting on parallel disks," Journal of the ACM, vol. 42, pp. 919–933, July 1995.
[32] A. Segev, "Optimization of join operations in horizontally partitioned database systems," ACM Transactions on Database Systems, vol. 11, pp. 48–80, Mar. 1986.
[33] L. D. Shapiro, "Join processing in database systems with large main memories," ACM Transactions on Database Systems, vol. 11, pp. 239–264, Sept. 1986.
[34] Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, and J. Torrellas, "FlexRAM: Toward an advanced intelligent memory system," in Proceedings of the International Conference on Computer Design (ICCD'99), (Austin, TX, USA), Oct. 1999.
[35] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz, "Smart Memories: A modular reconfigurable architecture," in Proceedings of the 27th Annual International Symposium on Computer Architecture (ISCA'00), (Vancouver, BC, Canada), pp. 161–171, June 12–14 2000.
[36] A. M. Despain and D. A. Patterson, "X-TREE: A tree structured multi-processor computer architecture," in Proceedings of the 5th Annual International Symposium on Computer Architecture, (Palo Alto, CA, USA), pp. 144–151, Apr. 3–5 1978.
[37] D. A. Patterson, E. S. Fehr, and C. H. Sequin, "Design considerations for the VLSI processor of X-TREE," in Proceedings of the 6th Annual International Symposium on Computer Architecture, (Philadelphia, PA, USA), pp. 90–101, Apr. 23–25 1979.
[38] P. C. Treleaven, "VLSI processor architectures," IEEE Computer, pp. 33–45, June 1982.
[39] A. L. Davis, "The architecture and system method of DDM1: A recursively structured data driven machine," in Proceedings of the 5th Annual International Symposium on Computer Architecture, (Palo Alto, CA, USA), pp. 210–215, Apr. 3–5 1978.