
A Local-Optimization based Strategy for Cost-Effective Datasets Storage of Scientific Applications in the Cloud

Many slides from the authors' presentation at CLOUD 2011

Presenter: Guagndong Liu
Mar 13th, 2012
Dec 8th, 2011
Outline
Introduction
A Motivating Example
Problem Analysis
Important Concepts and Cost Model of
Datasets Storage in the Cloud
A Local-Optimization based Strategy for
Cost-Effective Datasets Storage of
Scientific Applications in the Cloud
Evaluation and Simulation


Introduction
Scientific applications
Computation and data intensive
Generated data sets: terabytes or even
petabytes in size
Huge computation: e.g., scientific workflows
Intermediate data: important!
For reuse or reanalysis
For sharing between institutions
Regeneration vs. storing



Introduction
Cloud computing
A new way for deploying scientific applications
Pay-as-you-go model
Storage strategy
Which generated datasets should be stored?
Trade-off between cost and user preference
Cost-effective strategy
A Motivating Example
Parkes radio telescope and pulsar survey
Pulsar searching workflow

[Figure: pulsar searching workflow — Record Raw Data → Extract Beam → Compress Beam → De-disperse (Trial Measure 1, Trial Measure 2, ..., Trial Measure 1200) → Accelerate → Seek (FFT Seek / FFA Seek / Pulse Seek) → Get Candidates → Eliminate candidates → Fold to XML → Make decision]
A Motivating Example




Current storage strategy
Delete all the intermediate data, due to storage limitations
Some intermediate data should be stored; some need not be

[Table: intermediate datasets of the workflow — raw beam data, extracted & compressed beam, de-dispersion files, accelerated de-dispersion files, seek results files, candidate list, XML files; sizes: 20 GB, 25 KB, 1 KB, 16 MB, 90 GB, 90 GB; generation times: 245 mins, 1 min, 80 mins, 300 mins, 790 mins, 27 mins]
Problem Analysis
Which datasets should be stored?
Data challenge: data double every year over the next decade and beyond -- [Szalay et al., Nature, 2006]
Different strategies correspond to different costs
Scientific workflows are very complex, and there are dependencies among datasets
Furthermore, a single scientist can no longer decide the storage status of a dataset alone
Data accessing delay
Datasets should be stored based on the trade-off between computation cost and storage cost
A cost-effective datasets storage strategy is needed

Important Concepts
Data Dependency Graph (DDG)
A classification of the application data
Original data and generated data
Data provenance
A kind of meta-data that records how data
are generated
[Figure: example DDG with datasets d_1 through d_8]
Important Concepts
Attributes of a Dataset in DDG
A dataset d_i in DDG has the attributes <x_i, y_i, f_i, v_i, provSet_i, CostR_i>:
x_i ($) denotes the generation cost of dataset d_i from its direct predecessors.
y_i ($/t) denotes the cost of storing dataset d_i in the system per time unit.
f_i (Boolean) is a flag, which denotes whether dataset d_i is stored or deleted in the system.
v_i (Hz) denotes the usage frequency, which indicates how often d_i is used.
Important Concepts
Attributes of a Dataset in DDG
provSet_i denotes the set of stored provenances that are needed when regenerating dataset d_i.

CostR_i ($/t) is d_i's cost rate, which means the average cost per time unit of d_i in the system.

Cost = C + S
C: total cost of computation resources
S: total cost of storage resources

genCost(d_i) = x_i + Σ_{ d_k ∈ { d_k | d_j ∈ provSet_i ∧ d_j → d_k → d_i } } x_k

CostR_i = y_i,                 if f_i = "stored"
CostR_i = genCost(d_i) · v_i,  if f_i = "deleted"
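As a rough illustration of the two formulas above, the sketch below (illustrative Python, not the authors' code) computes genCost and CostR for a linear DDG. It assumes the provenance needed to regenerate a deleted dataset is its nearest stored predecessor, and the dict keys x, y, v, f are assumed names mirroring the attributes above.

```python
def gen_cost(ddg, i):
    """genCost(d_i): x_i plus the x of every deleted predecessor back to
    the nearest stored dataset (its stored provenance)."""
    cost = ddg[i]["x"]
    j = i - 1
    while j >= 0 and not ddg[j]["f"]:  # walk back over deleted datasets
        cost += ddg[j]["x"]
        j -= 1
    return cost

def cost_rate(ddg, i):
    """CostR_i: y_i if stored, genCost(d_i) * v_i if deleted."""
    d = ddg[i]
    return d["y"] if d["f"] else gen_cost(ddg, i) * d["v"]
```

With a stored dataset followed by two deleted ones, the second deleted dataset's genCost accumulates both generation costs, matching the summation in the formula.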
Cost Model of Datasets Storage in the
Cloud
Total cost rate of a DDG, where S is the storage strategy of the DDG:

TCR_S = Σ_{ d_i ∈ DDG } CostR_i

For a DDG with n datasets, there are 2^n different storage strategies.

[Figure: linear DDG d_1 → d_2 → d_3 with attributes (x_1, y_1, v_1), (x_2, y_2, v_2), (x_3, y_3, v_3)]

S_1: f_1 = 1, f_2 = 0, f_3 = 0
TCR_S1 = y_1 + x_2·v_2 + (x_2 + x_3)·v_3

S_2: f_1 = 0, f_2 = 0, f_3 = 1
TCR_S2 = x_1·v_1 + (x_1 + x_2)·v_2 + y_3
...
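To make the 2^n search space concrete, here is a brute-force sketch (illustrative Python with made-up x, y, v values, not the paper's code) that enumerates every storage strategy of a 3-dataset linear DDG and picks the cheapest; evaluating strategy (1, 0, 0) reproduces the slide's TCR_S1 = y_1 + x_2·v_2 + (x_2 + x_3)·v_3.

```python
from itertools import product

def total_cost_rate(xs, ys, vs, flags):
    """TCR_S = sum of CostR_i under storage strategy `flags`."""
    tcr = 0.0
    for i, stored in enumerate(flags):
        if stored:
            tcr += ys[i]
        else:
            # regeneration cost: x_i plus x of every deleted predecessor
            # back to the nearest stored dataset
            gen = xs[i]
            j = i - 1
            while j >= 0 and not flags[j]:
                gen += xs[j]
                j -= 1
            tcr += gen * vs[i]
    return tcr

# made-up attribute values for a 3-dataset linear DDG
xs, ys, vs = [5.0, 3.0, 4.0], [1.0, 2.0, 1.5], [0.5, 0.4, 0.3]

# enumerate all 2^3 storage strategies and keep the cheapest
best = min(product([True, False], repeat=3),
           key=lambda f: total_cost_rate(xs, ys, vs, f))
```

The exhaustive search is exact but clearly infeasible for large n, which motivates the CTT-SP algorithm on the next slides.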
CTT-SP Algorithm
To find the minimum cost storage strategy
for a DDG
Philosophy of the algorithm:
Construct a Cost Transitive Tournament (CTT) based on the DDG.
In the CTT, the paths from the start to the end dataset map one-to-one to the storage strategies of the DDG.
The length of each path equals the total cost rate of the corresponding storage strategy.
The Shortest Path (SP) therefore represents the minimum cost storage strategy.



CTT-SP Algorithm
Example

[Figure: linear DDG d_1 → d_2 → d_3 with attributes (x_1, y_1, v_1), (x_2, y_2, v_2), (x_3, y_3, v_3), and its CTT built over virtual start and end datasets d_s and d_e, with cost edges:
e<d_s, d_1> = y_1
e<d_s, d_2> = x_1·v_1 + y_2
e<d_s, d_3> = x_1·v_1 + (x_1 + x_2)·v_2 + y_3
e<d_s, d_e> = x_1·v_1 + (x_1 + x_2)·v_2 + (x_1 + x_2 + x_3)·v_3
e<d_1, d_2> = y_2
e<d_1, d_3> = x_2·v_2 + y_3
e<d_1, d_e> = x_2·v_2 + (x_2 + x_3)·v_3
e<d_2, d_3> = y_3
e<d_2, d_e> = x_3·v_3
e<d_3, d_e> = 0]

The weights of cost edges:

e<d_i, d_j> = y_j + Σ_{ d_k ∈ { d_k | d_k ∈ DDG ∧ d_i → d_k → d_j } } genCost(d_k) · v_k
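For a linear DDG the CTT-SP idea reduces to a shortest path in a DAG. The sketch below (an assumed implementation, not the paper's code) builds the cost edges with the weight formula above and relaxes nodes in topological order; on a path every traversed dataset is stored, so genCost of a deleted d_k between d_i and d_j is x_{i+1} + ... + x_k.

```python
def ctt_sp(xs, ys, vs):
    """Minimum-cost storage strategy of a linear DDG via CTT shortest path.
    Nodes: 0 = d_s (virtual start), 1..n = datasets, n+1 = d_e (virtual end)."""
    n = len(xs)
    INF = float("inf")
    dist = [INF] * (n + 2)
    prev = [None] * (n + 2)
    dist[0] = 0.0
    for i in range(n + 2):                     # topological order
        if dist[i] == INF:
            continue
        for j in range(i + 1, n + 2):
            w = ys[j - 1] if j <= n else 0.0   # y_j: store d_j (0 for d_e)
            gen = 0.0
            for k in range(i + 1, j):          # deleted datasets between i and j
                gen += xs[k - 1]               # genCost(d_k) from stored d_i
                w += gen * vs[k - 1]           # plus genCost(d_k) * v_k
            if dist[i] + w < dist[j]:
                dist[j] = dist[i] + w
                prev[j] = i
    # datasets on the shortest path are the ones to store
    stored, node = set(), n + 1
    while node is not None:
        if 1 <= node <= n:
            stored.add(node - 1)
        node = prev[node]
    return dist[n + 1], stored
```

On the 3-dataset example this finds the same optimum as exhaustive enumeration, but in polynomial rather than exponential time.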
A Local-Optimization based Datasets
Storage Strategy
Requirements of Storage Strategy
Efficiency and Scalability
The strategy is used at runtime in the cloud, and the DDG may be large
The strategy itself takes computation resources
Reflect users' preference and data accessing delay
Users may want to store some datasets
Users may have a certain tolerance of data accessing delay


A Local-Optimization based Datasets
Storage Strategy
Introduce two new attributes of the datasets in DDG to represent users' accessing delay tolerance, which are <T_i, λ_i>

T_i is a duration of time that denotes users' tolerance of dataset d_i's accessing delay

λ_i is the parameter to denote users' cost-related tolerance of dataset d_i's accessing delay, which is a value between 0 and 1

∀ d_k ∈ DDG ∧ (d_i → d_k → d_j) ⇒ genCost(d_k) / CostCPU < T_k
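A hedged sketch of the delay-tolerance constraint above: for every deleted dataset d_k between two stored datasets, regenerating d_k must take less than the tolerated delay T_k. CostCPU (the price of computation per hour) converts regeneration cost in dollars back into time; the function and parameter names here are illustrative, not from the paper.

```python
def tolerance_satisfied(xs, stored, T, cost_cpu):
    """Check genCost(d_k) / CostCPU < T_k for every deleted dataset d_k
    in a linear DDG (xs: generation costs in $, T: tolerances in hours)."""
    gen = 0.0
    for k in range(len(xs)):
        if stored[k]:
            gen = 0.0          # a stored dataset needs no regeneration
            continue
        gen += xs[k]           # genCost accumulates over deleted predecessors
        if gen / cost_cpu >= T[k]:
            return False       # regenerating d_k would exceed the tolerance
    return True
```

A strategy violating this check must store an extra dataset in the offending stretch, even if that raises the total cost rate.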
A Local-Optimization based Datasets
Storage Strategy
Efficiency and Scalability
A general DDG is very complex. The computational complexity of the CTT-SP algorithm on it is O(n^9), which is not efficient or scalable enough to be used on large DDGs
Partition the large DDG into small linear segments
Utilize the CTT-SP algorithm on the linear DDG segments in order to guarantee a localized optimum


[Figure: a DDG partitioned at partitioning-point datasets into Linear DDG 1, Linear DDG 2, Linear DDG 3, and Linear DDG 4]
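One simple way to locate the partitioning-point datasets shown in the figure (an illustrative sketch, not necessarily the paper's partitioning rule): a dataset whose in-degree or out-degree in the DDG exceeds 1 ends or starts a linear segment, so splitting there leaves maximal linear chains for CTT-SP.

```python
def partition_points(edges, nodes):
    """Datasets where the DDG branches or merges, i.e. candidate
    partitioning points between linear segments."""
    indeg = {n: 0 for n in nodes}
    outdeg = {n: 0 for n in nodes}
    for a, b in edges:           # edge (a, b): b is generated from a
        outdeg[a] += 1
        indeg[b] += 1
    return {n for n in nodes if indeg[n] > 1 or outdeg[n] > 1}
```

In a diamond-shaped DDG the branch dataset and the merge dataset come out as partitioning points, and the two parallel arms become separate linear segments.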
Evaluation
Use randomly generated DDGs for simulation
Size: randomly distributed from 100 GB to 1 TB
Generation time: randomly distributed from 1 hour to 10 hours
Usage frequency: randomly distributed from 1 day to 10 days (time between every usage)
Users' delay tolerance (T_i): randomly distributed from 10 hours to one day
Cost parameter (λ_i): randomly distributed from 0.7 to 1 for every dataset in the DDG
Adopt the Amazon cloud services price model (EC2 + S3):
$0.15 per Gigabyte per month for the storage resources
$0.1 per CPU hour for the computation resources
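A quick check of what this pricing model implies for one dataset (an illustrative sketch; how the simulation combines these rates is an assumption): S3-style storage at $0.15/GB/month versus EC2-style regeneration at $0.1/CPU-hour, expressed as a per-day cost rate.

```python
STORAGE_PER_GB_MONTH = 0.15   # $ per GB per month (S3-style rate above)
CPU_PER_HOUR = 0.1            # $ per CPU hour (EC2-style rate above)

def daily_cost_rate(size_gb, gen_hours, uses_per_day, stored):
    """Cost per day of one dataset: storage rent if stored,
    expected regeneration cost if deleted."""
    if stored:
        return size_gb * STORAGE_PER_GB_MONTH / 30.0   # storage $/day
    return gen_hours * CPU_PER_HOUR * uses_per_day     # regeneration $/day
```

For example, a 500 GB dataset costs more per day to store than a 5-hour regeneration used once every 5 days costs to recompute, which is exactly the kind of trade-off the simulated strategies must weigh.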


Evaluation
Compare the proposed strategy with different storage strategies:
Usage based strategy
Generation cost based strategy
Cost rate based strategy
Evaluation

[Figure: change of daily cost rate (USD/day) vs. number of datasets in DDG (0–1100), with 4% of datasets stored by users, comparing: store all datasets, store none, usage based strategy, generation cost based strategy, cost rate based strategy, and the local-optimisation based strategy]
Evaluation
CPU Time of the strategies
[Figure: CPU time (s) vs. number of datasets in DDG (50–1000), comparing: cost rate based strategy, local-optimisation based strategy (n_i = 10), local-optimisation based strategy, local-optimisation based strategy (m = 5), and the CTT-SP algorithm]
2007 The Board of Regents of the University of Nebraska. All rights reserved.
Thanks