
Scheduling Shared Scans of Large Data Files

Parag Agrawal (Stanford University)    Daniel Kifer (Yahoo! Research)    Christopher Olston (Yahoo! Research)

ABSTRACT

We study how best to schedule scans of large data files, in the presence of many simultaneous requests to a common set of files. The objective is to maximize the overall rate of processing these files, by sharing scans of the same file as aggressively as possible, without imposing undue wait time on individual jobs. This scheduling problem arises in batch data processing environments such as Map-Reduce systems, some of which handle tens of thousands of processing requests daily, over a shared set of files.

As we demonstrate, conventional scheduling techniques such as shortest-job-first do not perform well in the presence of cross-job sharing opportunities. We derive a new family of scheduling policies specifically targeted to sharable workloads. Our scheduling policies revolve around the notion that, all else being equal, it is good to schedule nonsharable scans ahead of ones that can share IO work with future jobs, if the arrival rate of sharable future jobs is expected to be high. We evaluate our policies via simulation over varied synthetic and real workloads, and demonstrate significant performance gains compared with conventional scheduling approaches.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '08, August 24-30, 2008, Auckland, New Zealand
Copyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

1. INTRODUCTION

As disk seeks become increasingly expensive relative to sequential access, data processing systems are being architected to favor bulk sequential scans of large files. Database, warehouse and mining systems have incorporated scan-centric access methods for a long time, but at the moment the most prominent example of scan-centric architectures is Map-Reduce [4]. Map-Reduce systems execute UDF-enhanced group-by programs over extremely large, distributed files. Other architectures in this space include Dryad [10] and River [1].

Large Map-Reduce installations handle tens of thousands of jobs daily, where a job consists of a scan of a large file accompanied by some processing and perhaps communication work. In many cases the processing is relatively light (e.g., count the number of times Britney Spears is mentioned on the web), and the communication is minimal (distributive and algebraic aggregation functions enable early aggregation on the Map side of the job, and the data transmitted to the Reduce side is small). Many jobs even disable the Reduce component, because they do not require global processing (e.g., generate a hash-based synopsis of every document in a large collection).

The execution time of these jobs is dominated by scanning the input file. If the number of unique input files is small relative to the number of daily jobs (e.g., in a search engine company many jobs process the web crawl, user click log, and search query log), then it is desirable to amortize the work of scanning one of these files across multiple jobs. Unfortunately, caching is not good enough because often these data sets are so large that they do not fit in memory, even if spread across a large cluster of machines.

Cooperative scans [6, 8, 21] can help here: multiple jobs that require scanning the same file can be executed simultaneously, with the scanning performed once and the scanned data fed into each job's processing component. The work on cooperative scans has focused on mechanisms to realize IO savings across multiple co-executing jobs. However, there is another opportunity here: in the Map-Reduce context jobs tend to run for a long time, and users do not expect quick turnaround. It is acceptable to reorder pending jobs, within a reasonable limit on delaying individual jobs, if doing so can improve the total amount of useful work performed by the system.

In this paper we study how to schedule jobs that can benefit from shared scans over a common set of files. To our knowledge this scheduling problem has not been posed before. Existing scheduling techniques such as shortest-job-first do not necessarily work well in the presence of sharable jobs, and it is not obvious how to design ones that do work well. We illustrate these points via a series of informal examples (rigorous formal analysis follows).

1.1 Motivating Examples

Example 1

Suppose the system's work queue contains two pending jobs, J1 and J2, which are unrelated (i.e., they scan different files), and hence there is no benefit in executing them jointly. Therefore we execute them sequentially, and we must decide which one to execute first. We might consider executing them in order of arrival (FIFO), or perhaps in order of expected running time (a policy known as shortest-job-first scheduling, which aims for low average response time in nonsharable workloads). If J1 arrived slightly earlier and has a slightly shorter execution time than J2, then both FIFO and shortest-job-first would schedule J1 first. This decision, which is made without taking sharing into account, seems reasonable because J1 and J2 are unrelated.

However, one might want to consider the fact that additional jobs may arrive in the queue while J1 and J2 are being executed. Since future jobs may be sharable with J1 or J2, they can influence the optimal execution order of J1 and J2. Even if one does not anticipate the exact arrival schedule of future jobs, a simple stochastic model of future job arrivals can influence the decision of which of J1 or J2 to execute first.

Suppose J1 scans file F1, and J2 scans file F2. Let λi denote the frequency with which jobs that scan Fi are submitted. In our example, if λ1 > λ2, then all else being equal it might make sense to schedule J2 first. While J2 is executing, new jobs that are sharable with J1 may arrive, permitting us to amortize J1's work across multiple jobs. This amortization of work, in turn, can lead to lower average job response times going forward. The schedule produced by considering future job arrival rates differs from the one produced by FIFO and shortest-job-first.

Example 2

In a more subtle scenario, suppose instead that λ1 = λ2. Suppose F1 is 1 TB in size, and F2 is 10 TB. Assume each job's execution time is dominated by scanning the file. Hence, J2 takes about ten times as long to execute as J1. Now, which one of J1 and J2 should we execute first?

Perhaps J1 should be executed first because J2 can benefit more from sharing, and postponing J2's execution permits additional, sharable F2 jobs to accumulate in the queue. On the other hand, perhaps J2 ought to be executed first since it takes roughly ten times as long as J1, thereby allowing ten times as many F1 jobs to accumulate for future joint execution with J1.

Which of these opposing factors dominates in this case? How can we reason about these issues in general, in order to maximize system productivity or minimize average job response time?

1.2 Contributions and Outline

In this paper we formalize and study the problem of scheduling sharable jobs, using a combination of analytical and empirical techniques. We demonstrate that scheduling policies that work well in the traditional context of nonsharable jobs can yield poor schedules in the presence of sharing. We identify simple policies that do work well in the presence of sharing, and are robust to fluctuations in the workload such as bursts of job arrivals.

The remainder of this paper is structured as follows. We discuss related work in Section 2, and give our formal model of scheduling jobs with shared scans in Section 3. Then in Section 4 we derive a family of scheduling policies, which have some convenient properties that make them practical as we discuss in Section 5. We perform some initial empirical analysis of our policies in Section 6. Then in Section 7 we extend our family of policies to include hybrid ones that balance multiple scheduling objectives. We present our final empirical evaluation in Section 8.

2. RELATED WORK

We are not aware of any prior work that addresses the problem studied in this paper. That said, there is a tremendous amount of work, in both the database and scheduling theory communities, that is peripherally related. We survey this work below.

2.1 Database Literature

Prior work on cooperative scans [6, 8, 21] focused on mechanisms for sharing scans across jobs or queries that get executed at the same time. Our work is complementary: we consider how to schedule a queue of pending jobs to ensure that sharable jobs get executed together and can benefit from cooperative scan techniques.

Gupta et al. [7] study how to select an execution order for enqueued jobs, to maximize the chance that data cached on behalf of one job can be reused for a subsequent job. That work only takes into account jobs that are already in the queue, whereas our work focuses on scheduling in view of anticipated future jobs.

2.2 Scheduling Literature

Scheduling theory is a vast field with countless variations on the scheduling problem, including various performance metrics, machine environments (such as single machine, parallel machines, and shop), and constraints (such as release times, deadlines, precedence constraints, and preemption) [11]. Some of the earliest complexity results for scheduling problems are given in [13]. In particular, the problem of minimizing the sum of completion times on a single processor in the presence of release dates (i.e., job arrival times) is NP-hard. On the other hand, minimizing the maximum absolute or relative wait times can be done in polynomial time using the algorithm proposed in [12]. Both of these problems are special cases of the problem considered in this paper when all of the shared costs are zero.

In practice, the quality of a schedule depends on several factors (such as maximum completion time, average completion time, maximum earliness, maximum lateness). Optimizing schedules with respect to several performance metrics is known as multicriteria scheduling [9].

Online scheduling algorithms [18, 20] make scheduling decisions without knowledge of future jobs. In non-clairvoyant scheduling [16], the characteristics of the jobs (such as running time) are not known until the job finishes. Online algorithms are typically evaluated using competitive analysis [18, 20]: if C(I) is the cost of an online schedule on instance I and Copt(I) is the cost of the optimal schedule, then the online algorithm is c-competitive if C(I) ≤ c·Copt(I) + b for all instances I and for some constant b.

Divikaran and Saks [5] studied the online scheduling problem with setup times. In this scenario, jobs belong to job families and a setup cost is incurred whenever the processor switches between jobs of different families. For example, jobs in the same family can perform independent scans of the same file, in which case the setup cost is the time it takes to load a file into memory. The problem considered in this paper differs in two ways: all jobs executed in one batch have the same completion time since the scans occur concurrently instead of serially; also, once a batch has been processed, the next batch still has a shared cost even if it is from the same job family (for example, if the entire file does not fit into memory).
Stochastic scheduling [15] considers another variation on the scheduling problem: the processing time of a job is a random variable, usually with finite mean and variance, and typically only the distribution or some of its moments are known. Online versions of these problems for minimizing expected weighted completion time have also been considered [3, 14, 19] in cases where there is no sharing of work among jobs.

3. MODEL

Map-Reduce and related systems execute jobs on large clusters, over data files that are spread across many nodes (each node serves a dual storage and computation role). Large files (e.g., a web crawl, or a multi-day search query and result log) are spread across essentially all nodes, whereas smaller files may only occupy a subset of nodes. Correspondingly, jobs that access large files are spread onto the entire cluster, and jobs over small files generally only use a subset of nodes.

In this paper we focus on the issue of ordering jobs to maximize shared scans, rather than the issue of how to allocate data and jobs onto individual cluster nodes. Hence for the purpose of this paper we abstract away the per-node details and model the cluster as a single unit of storage and execution. For workloads dominated by large data sets and jobs that get spread across the full cluster, this abstraction is appropriate.

Our model of a data processing engine has two parts: an executor module that processes jobs, and an input queue that holds pending jobs. Each job Ji requires a scan over a (large) input file Fi, and performs some custom processing over the content of the file. Jobs can be categorized based on their input file into job families, where all jobs that access file Fi belong to family Fi. It is useful to think of the input queue as being divided into a set of smaller queues, one per job family, as shown in Figure 1.

[Figure 1: Model: input queues and job executor.]

The executor is capable of executing a batch of multiple jobs from the same family, in which case the input file is scanned once and each job's custom processing is applied over the stream of data generated by scanning the file. For simplicity we assume that one batch is executed at a time, although our techniques can easily be extended to the case of k simultaneous batches.

The time to execute a batch consisting of n jobs from family Fi equals ts_i + n·tn_i, where ts_i represents the cost of scanning the input file Fi (i.e., the sharable execution cost), and tn_i represents the custom processing cost incurred by each job (i.e., the nonsharable cost). We assume that ts_i is large relative to tn_i, i.e., the jobs are IO-bound as discussed in Section 1.

Given that ts_i is the dominant cost, for simplicity we treat the nonshared execution cost tn_i as being the same for all jobs in a batch, even though in reality each job may incur a different cost in its custom processing. We verify empirically in Section 6 that nonuniform within-batch processing costs do not throw off our results.

3.1 System Workload

For the purpose of our analysis we model job arrival as a stationary process (in Section 8.2.2 we study the effect of bursty job arrivals empirically). In our model, for each job family Fi, jobs arrive according to a Poisson process with rate parameter λi.

Obviously, a high enough aggregate job arrival rate can overwhelm a given system, regardless of the scheduling policy. To reason about what job workload a system is capable of handling, it is instructive to consider what happens if jobs are executed in extremely large batches. In the asymptote, as batch sizes approach infinity, the tn values dominate and the ts values become insignificant, so system load converges to Σi λi·tn_i. If this quantity exceeds the system's intrinsic processing capacity, then it is impossible to keep queue lengths from growing without bound, and the system can never "catch up" with pending work under any scheduling regime. Hence we impose a workload feasibility condition:

    asymptotic load = Σi λi·tn_i < 1

3.2 Scheduling Objectives

The performance metric we use in this paper is average perceived wait time. The perceived wait time (PWT) of job J is the difference between the system's response time in handling J, and the minimum possible response time t(J). (Response time is the total delay between submission and completion of a job.)

As stated in Section 1, the class of systems we consider is geared toward maximizing overall system productivity, rather than committing to response time targets for individual jobs. This stance would seem to suggest optimizing for system throughput. However, in our context maximizing throughput means maximizing batch sizes, which leads to indefinite job wait times. While these systems may find it acceptable to delay some jobs in order to improve overall throughput, it does not make sense to delay all jobs.

Optimizing for average PWT still gives an incentive to batch multiple jobs together when the sharing opportunity is large (thereby improving throughput), but not so much that the queues grow indefinitely. Furthermore, PWT seems like an appropriate metric because it corresponds to users' end-to-end view of system performance. Informally, average PWT can be thought of as an indicator of how unhappy users are, on average, due to job processing delays. Another consideration is the maximum PWT across all jobs, which indicates how unhappy the least happy user is.

Our aim is to minimize average PWT, while keeping maximum PWT from being excessively high. We focus on steady-state behavior, rather than a fixed time period such as one day, to avoid knapsack-style tactics that "squeeze" short jobs in at the end of the period. Knapsack-style behavior only makes sense in the context of real-time scheduling, which is not a concern in the class of systems we study.

For a given job J, PWT can either be measured on an absolute scale as the difference between the system's response time and the minimum possible response time (e.g., 10 minutes), or on a relative scale as the ratio of the system's response time to the minimum possible response time (e.g., 1.5 × t(J)). (Relative PWT is also known as stretch [17].)

[Figure 2: Ways to measure perceived wait time.]

The space of PWT metric variants is shown in Figure 2. For convenience we adopt the abbreviations AA, MA, AR and MR to refer to the four variants.

3.3 Scheduling Policy

A scheduling policy is an online algorithm that is (re)invoked each time the executor becomes idle. Upon invocation, the policy leaves the executor idle for some period of time (possibly zero time), and then removes a nonempty subset of jobs from the input queue, packages them into an execution batch, and submits the batch to the executor.

In this paper, to simplify our analysis we impose two very reasonable restrictions on our scheduling policies:

• No idle. If the input queue is nonempty, do not leave the executor idle. Given the stochastic nature of job arrivals, this policy seems appropriate.

• Always share. Whenever a job family Fi is scheduled for execution, all enqueued jobs from family Fi are included in the execution batch. While it is true that if tn > ts, one achieves lower average absolute PWT by scheduling jobs sequentially instead of in a batch, in this paper we assume ts > tn, as stated above. If ts > tn it is always beneficial to form large batches, in terms of average absolute PWT of jobs in the batch. In all cases, large batches reduce the wait time of jobs outside the batch that are executed afterward.

4. BASIC SCHEDULING POLICIES

We derive scheduling policies aimed at minimizing each of average absolute PWT (Section 4.1) and maximum absolute PWT (Section 4.2).[1]

The notation we use in this section is summarized in Table 1.

Table 1: Notation.
    Fi      ith job family
    ts_i    sharable execution time for Fi jobs
    tn_i    nonsharable execution time for Fi jobs
    λi      arrival rate of Fi jobs
    Bi      theoretical batch size for Fi
    ti      theoretical time to execute one Fi batch
    Ti      theoretical scheduling period for Fi
    fi      theoretical processing fraction for Fi
    ωi      perceived wait time for Fi jobs
    Pi      scheduling priority of Fi
    B̂i      queue length for Fi
    T̂i      waiting time of oldest enqueued Fi job

[1] We tried deriving policies that directly aim to minimize relative PWT, but the resulting policies did not perform well, perhaps due to breakdowns in the approximation schemes used to derive the policies.

4.1 Average Absolute PWT

If there is no sharing, low average absolute PWT is achieved via shortest-job-first (SJF) scheduling and its variants. (In a stochastic setting, the generalization of SJF is asymptotically optimal [3].) We generalize SJF to the case of sharable jobs as follows.

Let Pi denote the scheduling priority of family Fi. If there is no sharing, SJF sets Pi equal to the time to complete one job. If there is sharing, then we let Pi equal the average per-job execution time of a job batch. Suppose B̂i is the number of enqueued jobs in family Fi, in other words, the current batch size for Fi. Then the total time to execute a batch is ts_i + B̂i·tn_i. The average per-job execution time is (ts_i + B̂i·tn_i)/B̂i, which gives us the SJF scheduling priority:

    SJF Policy: Pi = −(ts_i/B̂i + tn_i)

Unfortunately, as we demonstrate empirically in Section 6, SJF does not work well in the presence of sharing. To understand why, consider a simple example with two job families:

    F1: ts_1 = 1, tn_1 = 0, λ1 = a
    F2: ts_2 = a, tn_2 = 0, λ2 = 1

for some constant a > 1.

In this scenario, F2 jobs have long execution time (ts_2 = a) so SJF schedules F2 infrequently: once every a² time units, on expectation. The average perceived wait time under this schedule is O(a) due to holding back F2 jobs a long time between batches. A policy that is aware of the fact that F2 jobs are relatively rare (λ2 = 1) would elect to schedule F2 more often, and schedule F1 less often but in much larger batches. In fact, a policy that schedules F2 every a^(3/2) time units achieves an average PWT of only O(a^(1/2)). For large a, SJF performs very poorly in comparison.

Since SJF does not always produce good schedules in the presence of sharing, we begin from first principles. Unfortunately, as discussed in Section 2.2, solving even the non-shared scheduling problem exactly is NP-hard. Hence, to make our problem tractable we consider a relaxed version of the problem, find an optimal solution to the relaxed problem, and apply this solution to the original problem.

4.1.1 Relaxation 1

In our initial, simple relaxation, each job family (each queue in Figure 1) has a dedicated executor. The total work done by all executors, in steady state, is constrained to be less than or equal to the total work performed by the one executor in the original problem. Furthermore, rather than discrete jobs, in our relaxation we treat jobs as continuously arriving, infinitely divisible units of work.

In steady state, an optimal schedule will exhibit periodic behavior: For each job family Fi, wait until Bi jobs have arrived on the queue and execute those Bi jobs as a batch.
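The generalized SJF priority described above is easy to state in code. The following is a minimal sketch (the `Family` container and the concrete numbers are ours, not the paper's); it also checks the two-family example's pathology: with queues of equal length, SJF always prefers F1, and F2 wins only once its queue is roughly a times longer.

```python
from dataclasses import dataclass

@dataclass
class Family:
    ts: float      # sharable scan cost of the family's input file
    tn: float      # nonsharable per-job processing cost
    queued: int    # number of enqueued jobs (the batch size if run now)

def sjf_priority(f: Family) -> float:
    """Generalized SJF: negated average per-job time of the batch,
    (ts + B*tn)/B = ts/B + tn; the highest-priority family runs next."""
    return -(f.ts / f.queued + f.tn)

# Two-family example with a = 100: F1 jobs are cheap and frequent,
# F2 jobs are expensive and rare.
a = 100.0
f1 = Family(ts=1.0, tn=0.0, queued=5)
f2 = Family(ts=a, tn=0.0, queued=5)
assert sjf_priority(f1) > sjf_priority(f2)   # SJF still prefers F1
# F2 overtakes F1 only once its queue is ~a times longer than F1's:
assert sjf_priority(Family(ts=a, tn=0.0, queued=600)) > sjf_priority(f1)
```

At arrival rate λ2 = 1, accumulating a queue that long takes on the order of a² time units, which is the source of the O(a) average PWT noted above.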
Given the arrival rate λi, on expectation a new batch is executed every Ti = Bi/λi time units. A batch takes time ti = ts_i + Bi·tn_i to complete. The fraction of time Fi's executor is in use (rather than idle) is fi = ti/Ti.

We arrive at the following optimization problem:

    minimize Σi λi·ωi^AA   subject to   Σi fi ≤ 1

where ωi^AA is the average absolute PWT for jobs in Fi.

There are two factors that contribute to the PWT of a newly-arrived job: (1) the delay until the next batch is formed, and (2) the fact that a batch of size Bi takes longer to finish than a singleton batch. The expected value of Factor 1 is Ti/2. Factor 2 equals (Bi − 1)·tn_i. Overall,

    ωi^AA = Ti/2 + (Bi − 1)·tn_i

We solve the above optimization problem using the method of Lagrange Multipliers. In the optimal solution the following quantity is constant across all job families Fi:

    (Bi²/(λi·ts_i)) · (1 + 2·λi·tn_i)

Given the λ, ts and tn values, one can select batch sizes (B values) accordingly.

4.1.2 Relaxation 2

Unfortunately, the optimal solution to Relaxation 1 can differ substantially from the optimal solution to the original problem. Consider the simple two-family example we presented earlier in Section 4.1. The optimal policy under Relaxation 1 schedules job families in a round-robin fashion, yielding an average PWT of O(a). Once again this result is much worse than the achievable O(a^(1/2)) value we discussed earlier.

Whereas SJF errs by scheduling F2 too infrequently, the optimal Relaxation 1 policy errs in the other direction: it schedules F2 too frequently. Doing so causes F1 jobs to wait behind F2 batches too often, hurting average wait time.

The problem is that Relaxation 1 reduces the original scheduling problem to a resource allocation problem. Under Relaxation 1, the only interaction among job families is the fact that they must share the overall processing time (Σi fi ≤ 1). In reality, resource allocation is not the only important consideration. We must also take into account the fact that the execution batches must be serialized into a single sequential schedule and executed on a single executor. When a long-running batch is executed, other batches must wait for a long time.

Consider a job family Fi, for which a batch of size Bi is executed once every Ti time units. Whenever an Fi batch is executed, the following contributions to PWT occur:

• In-batch jobs. The Bi Fi jobs in the current batch are delayed by (Bi − 1)·tn_i time units each, for a total of D1 = Bi·(Bi − 1)·tn_i time units.

• New jobs. Jobs that arrive while the Fi batch is being executed are delayed. The expected number of such jobs is ti·Σj λj. The delay incurred to each one is ti/2 on average, making the overall delay incurred to other new jobs equal to

    D2 = (ti²/2) · Σj λj

• Old jobs. Jobs that are already in the queue when the Fi batch is executed are also delayed. Under Relaxation 1, the expected number of such jobs is Σ_{j≠i} (Tj·λj)/2. The delay incurred to each one is ti, making the overall delay incurred to other in-queue jobs equal to

    D3 = (ti/2) · Σ_{j≠i} Tj·λj

The total delay imposed on other jobs per unit time is proportional to (1/Ti)·(D1 + D2 + D3). If we minimize the sum of this quantity across all families Fi, again subject to the resource utilization constraint Σi fi ≤ 1 using the Lagrange method, we obtain the following invariant across job families:

    Bi²/(λi·ts_i) − ts_i·Σj λj + (Bi²/(λi·ts_i)) · (λi·tn_i) · (tn_i·Σj λj + 1)

The scheduling policy resulting from this invariant does achieve the hoped-for O(a^(1/2)) average PWT in our example two-family scenario.

4.1.3 Implementation and Intuition

Recall the workload feasibility condition Σi λi·tn_i < 1 from Section 3.1. If the executor's load is spread across a large number of job families, then for each Fi, λi·tn_i is small. Hence, it is reasonable to drop the terms involving λi·tn_i from our above formulae, yielding the following simplified invariants[2]:

• Relaxation 1 result: For all job families Fi, the following quantity is equal:

    Bi²/(λi·ts_i)

• Relaxation 2 result: For all job families Fi, the following quantity is equal:

    Bi²/(λi·ts_i) − ts_i·Σj λj

A simple way to translate these statements into implementable policies is as follows: Assign a numeric priority Pi to each job family Fi. Every time the executor becomes idle, schedule the family with the highest priority, as a single batch of B̂i jobs, where B̂i denotes the queue length for family Fi. If we are in steady state, then B̂i should roughly equal Bi. This observation suggests the following priority values for the scheduling policies implied by Relaxations 1 and 2, respectively:

    AA Policy 1: Pi = B̂i²/(λi·ts_i)

    AA Policy 2: Pi = B̂i²/(λi·ts_i) − ts_i·Σj λj

[2] There are also practically-motivated reasons to drop terms involving tn, as we discuss in Section 5. In Section 6 we give empirical justification for dropping the tn terms.
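Translated into code, the two AA priority formulas and one step of the no-idle, always-share scheduling loop look roughly as follows. This is a sketch under our own data layout (the `Family` container, field names, and example numbers are not from the paper); `lam` stands for the family's arrival rate λi.

```python
from dataclasses import dataclass

@dataclass
class Family:
    lam: float     # arrival rate lambda_i (estimated online)
    ts: float      # sharable scan cost ts_i
    queued: int    # current queue length B-hat_i

def aa_priority(f: Family, families, policy: int = 2) -> float:
    """Scheduling priority of family f under AA Policy 1 or 2."""
    if f.ts == 0:                        # nonsharable job: run right away
        return float("inf")
    p = f.queued ** 2 / (f.lam * f.ts)   # B^2 / (lambda * ts)
    if policy == 2:                      # penalize long scans when the
        p -= f.ts * sum(g.lam for g in families)  # overall arrival rate is high
    return p

def pick_next(families, policy: int = 2) -> Family:
    """One scheduler step: among families with pending jobs, run the
    whole queue of the highest-priority family as a single batch."""
    ready = [f for f in families if f.queued > 0]
    return max(ready, key=lambda f: aa_priority(f, families, policy))
```

Note how a family with ts = 0 gets infinite priority, matching the rule that nonsharable jobs are scheduled ahead of sharable ones.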
These formulae have a fairly simple intuitive explanation. This policy can be thought of as FIFO applied to job family
First, if many new jobs with a high degree of sharing are batches, since it schedules the family of the job that has
expected to arrive in the future (λi · tsi in the denomina- been waiting the longest.
tor, which we refer to as the sharability of family Fi ), we
should postpone execution of Fi and allow additional jobs
to accumulate into the same batch, so as to achieve greater 5. PRACTICAL CONSIDERATIONS
sharing with little extra waiting. On the other hand, as the The scheduling policies we derived in Section 4 rely on
number of enqueued jobs becomes large (Bi2 in the numer- several parameters related to job execution cost and job ar-
ator), the execution priority increases quadratically, which rival rates. In this section we explain how these parameters
eventually forces the execution of a batch from family Fi to can be obtained in practice.
avoid imposing excessive delay on the enqueued jobs.
Policy 2 has an extra subtractive term, which penalizes Robust cost estimation: The fact that we were able to
long batches (i.e., ones with largePts ) if the overall rate of drop the nonsharable execution time tn from our scheduling
arrival of jobs is high (i.e., high priority formulae not only keeps them simple, it also means
j λj ). Doing so allows
short batches to execute ahead of long batches, in the spirit that the scheduler does not need to estimate this quantity.
of shortest-job-first. In practice, estimating the full execution time of a job accu-
For singleton job families (families with just one job), rately can be difficult, especially in the Map-Reduce context
tsi = 0 and the priority value Pi goes to infinity. Hence in which processing is specified via opaque user-defined func-
nonsharable jobs are to be scheduled ahead of sharable ones. tions. (In Section 6 we verify empirically that the perfor-
The intuition is that nonsharable jobs cannot be beneficially mance of our policies is not sensitive to whether the factors
coexecuted with future jobs, so we might as well execute them right away. If there are multiple nonsharable jobs, ties can be broken according to shortest-job-first.

4.2 Maximum Absolute PWT

Here, instead of optimizing for average absolute PWT, we optimize for the maximum. We again adopt a relaxation of the original problem that assumes parallel executors and infinitely divisible work. Under the relaxation, the objective function is:

min max_i ω_i^MA

where ω_i^MA is the maximum absolute PWT for F_i jobs. As stated in Section 4.1.1, there are two factors that contribute to the PWT of a newly-arrived job: (1) the delay until the next batch is formed, and (2) the fact that a batch of size B_i takes longer to finish than a singleton batch. The maximum values of these factors are T_i and (B_i − 1) · t^n_i, respectively. Overall,

ω_i^MA = T_i + (B_i − 1) · t^n_i

or, written differently:

ω_i^MA = T_i · (1 + λ_i · t^n_i) − t^n_i

In the optimal solution ω_i^MA is constant across all job families F_i. The intuition behind this result is that if one of the ω_i^MA values is larger than the others, we can decrease it somewhat by increasing the other ω_i^MA values, thereby reducing the maximum PWT. Hence in the optimal solution all ω_i^MA values are equal.

4.2.1 Implementation and Intuition

As justified in Section 4.1.3, we drop terms involving λ_i · t^n_i from our ω_i^MA formula and obtain ω_i^MA ≈ T_i − t^n_i. As stated in Section 3, we assume the t^n values to be a small component of the overall job execution times, so we also drop the −t^n_i term and arrive at the approximation ω_i^MA ≈ T_i.

Let T̂_i denote the waiting time of the oldest enqueued F_i job, which should roughly equal T_i in steady state. We use T̂_i as the basis for our priority-based scheduling policy:

MA Policy (FIFO): P_i = T̂_i

…involving t^n are included.)

Our formulae do require estimates of the sharable execution time t^s, i.e., the IO cost of scanning the input file. For large files, this cost is nearly linearly proportional to the size of the input file, a quantity that is easy to obtain from system metadata. (The proportionality constant can be dropped, as linear scaling of the t^s values does not affect our priority-based scheduling policies.)

Dynamic estimation of arrival rates: Some of our priority formulae contain λ values, which denote job arrival rates. Under the Poisson model of arrival, one can estimate the λ values dynamically, by keeping a time-decayed count of arrivals. In this way the arrival rate estimates (λ values) automatically adjust as the workload shifts over time. (See Section 6.1 for details.)

6. BASIC EXPERIMENTS

In this section we present experiments that:
• Justify ignoring the nonsharable execution time component t^n in our scheduling policies (Section 6.2).
• Compare our scheduling policy variants empirically (Section 6.3).
(We compare our policies against baseline policies in Section 8.)

6.1 Experimental Setup

We built a simulator and a workload generator. Our workload consists of 100 job families. For each job family, the sharable cost t^s is generated from the heavy-tailed distribution 1 + |X|, where X is a Cauchy random variable. For greater realism, the nonsharable cost t^n is on a per-job basis, rather than a per-family basis as in our model in Section 3. In our default workload, each time a job arrives, we select a nonshared cost randomly as follows: with probability 0.6, t^n = 0.1 · t^s; with probability 0.2, t^n = 0.2 · t^s; with probability 0.2, t^n = 0.3 · t^s. (The scenario we focus on in this paper is one in which the shared cost dominates, because it represents IO and jobs tend to be IO-bound, as discussed in Section 3.) In some of our experiments we deviate from this default workload and study what happens when t^n tends to be larger than t^s.
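For concreteness, the cost-generation scheme just described can be sketched as follows. This is a minimal sketch of the workload generator, not the paper's actual simulator; the function names and the use of Python's random module are our own assumptions.

```python
import math
import random

def sample_shared_cost(rng: random.Random) -> float:
    # Sharable cost t^s ~ 1 + |X|, with X a standard Cauchy variate
    # (heavy-tailed). A Cauchy sample is tan(pi * (U - 0.5)) for uniform U.
    x = math.tan(math.pi * (rng.random() - 0.5))
    return 1.0 + abs(x)

def sample_nonshared_cost(ts: float, rng: random.Random) -> float:
    # Default per-job mixture: t^n = 0.1*t^s with probability 0.6,
    # 0.2*t^s with probability 0.2, and 0.3*t^s with probability 0.2.
    u = rng.random()
    if u < 0.6:
        return 0.1 * ts
    if u < 0.8:
        return 0.2 * ts
    return 0.3 * ts

# 100 job families, each with a fixed sharable cost.
rng = random.Random(42)
family_ts = [sample_shared_cost(rng) for _ in range(100)]
```

Because the Cauchy distribution has no finite mean, a few families end up with very large scan costs, which is what makes the workload heavy-tailed.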
[Figure 3: t^n-awareness versus t^n-ignorance for AA Policy 2 (y-axis: AA PWT; x-axis: shared cost divisor).]

[Figure 4: t^n-awareness versus t^n-ignorance for MA Policy (y-axis: MA PWT; x-axis: shared cost divisor).]

[Figure 5: AA Policy 1 versus AA Policy 2, varying shared cost skew (y-axis: AA PWT; x-axis: shared cost skew).]

[Figure 6: AA Policy 1 versus AA Policy 2, varying shared cost skew, λ_i · t^s_i = const (y-axis: average absolute PWT; x-axis: skew).]

Job arrival events are generated using the standard homogeneous Poisson point process [2]. Each job family F_i has an arrival parameter λ_i which represents the expected number of jobs that arrive in one unit of time. There are 500,000 units of time in each run of the experiments. The λ_i values are initially chosen from a Pareto distribution with parameter α = 1.9 and then are rescaled so that Σ_i λ_i E[t^n_i] = load. The total asymptotic system load (Σ_i λ_i · t^n_i) is 0.5 by default.

Some of our scheduling policies require estimation of the job arrival rate λ_i. To do this, we maintain an estimate I_i of the difference in the arrival times of the next two jobs in family F_i. We adjust I_i as new job arrivals occur, by taking a weighted average of our previous estimate I_i and A_i, the difference in arrival times of the two most recent jobs from F_i. Formally, the update step is I_i ← 0.05·A_i + 0.95·I_i. Given I_i and the time t since the last arrival of a job in F_i, we estimate λ_i as 1/I_i if t < I_i, and as 1/(0.05·t + 0.95·I_i) otherwise.

6.2 Influence of Nonshared Execution Time

In our first set of experiments, we measure how knowledge of t^n affects our scheduling policies. Recall that in Sections 4.1.3 and 4.2.1 we dropped t^n from the priority formulae, on the grounds that the factors involving t^n are small relative to other factors. To validate ignoring t^n in our scheduling policies, we compare t^n-aware variants (which use the full formulae with t^n values) against the t^n-ignorant variants presented in Sections 4.1.3 and 4.2.1. (The t^n-aware variants are given knowledge of the precise t^n value of each job instance in the queue.)

Figures 3 and 4 plot the performance of the t^n-aware and t^n-ignorant variants of our policies (AA Policy 2 and MA Policy, respectively) as we vary the magnitude of the shared cost (keeping the t^n distribution and λ values fixed). In both graphs, the y-axis plots the metric the policy is tuned to optimize (AA PWT and MA PWT, respectively). The x-axes plot the shared cost divisor, which is the factor by which we divided all shared costs. When the shared cost divisor is large (e.g., 100), the t^s values become quite small relative to the t^n values, on average.

Even when nonshared costs are large relative to shared costs (right-hand side of Figures 3 and 4), t^n-awareness has little impact on performance. Hence from this point forward we only consider the simpler, t^n-ignorant variants of our policies.

6.3 Comparison of Policy Variants

6.3.1 Relaxation 1 versus Relaxation 2

We now turn to a comparison of AA Policy 1 versus AA Policy 2 (recall that these are based on Relaxation 1 (Section 4.1.1) and Relaxation 2 (Section 4.1.2) of the original AA PWT minimization problem, respectively). Figure 5 shows that the two variants exhibit nearly identical performance, even as we vary the skew in the shared cost (t^s) distribution among job families (here there are five job families F_i with shared cost t^s_i = i^α, where α is the skew parameter).
However, if we introduce the invariant that λ_i · t^s_i (which represents the "sharability" of jobs in family F_i; see Section 4.1.3) remain constant across all job families F_i, a different picture emerges. Figure 6 shows the result of varying the shared cost skew, as we hold λ_i · t^s_i constant across job families. (Here there are two job families: t^s_2 = λ_1 = 1 and t^s_1 = λ_2 = skew parameter (x-axis).) In this case, we see a clear difference in performance between the policies based on the two relaxations, with the one based on Relaxation 2 (AA Policy 2) performing much better.

Overall, it appears that AA Policy 2 dominates AA Policy 1, as expected. As to whether the case in which AA Policy 2 performs significantly better than AA Policy 1 is likely to occur in practice, we do not know. Clearly, using AA Policy 2 is the safest option, and besides it is not much more complex to implement than AA Policy 1.

6.3.2 Use of Different Estimators

Recall that our AA Policies 1 and 2 (Section 4.1.3) have a B_i^2/λ_i term. In the model assumed by Relaxation 1, using the equivalence B_i = T_i · λ_i, we can rewrite this term in four different ways: B_i^2/λ_i (using batch size), T̂_i^2 · λ_i (using waiting time), B_i · T̂_i (the geometric mean of the two previous options), and max[B_i^2/λ_i, T̂_i^2 · λ_i].

In Figure 7 we compare these variants, and also compare using the true λ values versus using an online estimator for λ as described in Section 6.1. We used a more skewed nonshared cost (t^n) distribution than in our other experiments, to get a clear separation of the variants. In particular we used: with probability 0.6, t^n = 0.1 · t^s; with probability 0.2, t^n = 0.2 · t^s; with probability 0.1, t^n = 0.5 · t^s; with probability 0.1, t^n = 1.0 · t^s. We generated 20 sample workloads, and for each workload we computed the best AA PWT among the policy variants. For each policy variant, Figure 7 plots the fraction of times the policy variant had an AA PWT that was more than 3% worse than the best AA PWT for that workload. The result is that the variant that uses B_i^2/λ_i (the form given in Section 4.1.3) clearly outperforms the rest. Furthermore, estimating the arrival rates (λ values) works fine, compared to knowing them in advance via an oracle.

[Figure 7: Relative effectiveness of different priority formula variants.]

6.4 Summary of Findings

The findings from our basic experiments are:
• Estimating the arrival rates (λ values) online, as opposed to knowing them from an oracle, does not hurt performance.
• It is not necessary to incorporate t^n estimates into the priority functions.
• AA Policy 2 (which is based on Relaxation 2) dominates AA Policy 1 (based on Relaxation 1).
From this point forward, we use t^n-ignorant AA Policy 2 with online λ estimation.

7. HYBRID SCHEDULING POLICIES

The quality of a scheduling policy is generally evaluated using several criteria [9], and so optimizing for either the average or maximum perceived wait time, as in Section 4, may be too extreme. If we optimize solely for the average, there may be certain jobs with very high PWT. Conversely, if we optimize solely for the maximum, we end up punishing the majority of jobs in order to help a few outlier jobs. In practice it may make more sense to optimize for a combination of average and maximum PWT. A simple approach is to optimize for a linear combination of the two:

min Σ_i [α · ω_i^AA + (1 − α) · ω_i^MA]

where ω^AA denotes average absolute PWT and ω^MA denotes maximum absolute PWT. The parameter α ∈ [0, 1] denotes the relative importance of having low average PWT versus low maximum PWT.

We apply the methods used in Section 4 to the hybrid optimization objective, resulting in the following policy:

Hybrid Policy: P_i = α · [1/(2 · Σ_j λ_j)] · [B_i^2/(λ_i · t^s_i) − t^s_i · Σ_j λ_j] + x_i · (1 − α) · T̂_i^2/t^s_i

where x_i = 1 if T̂_i = max_j T̂_j, and x_i = 0 otherwise.

The hybrid policy degenerates to the nonhybrid policies of Section 4 if we set α = 0 or α = 1. For intermediate values of α, job families receive the same relative priority as they would under the average PWT regime, except the family that has been waiting the longest (i.e., the one with x_i = 1), which gets an extra boost in priority. This "extra boost" reduces the maximum wait time, while raising the average wait time a bit.

8. FURTHER EXPERIMENTS

We are now ready for further experiments. In particular we study:
• The behavior of our hybrid policy (Section 8.1).
• The performance of our policies compared to baseline policies (Section 8.2.1).
• The ability to cope with large bursts of job arrivals (Section 8.2.2).
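Before turning to the experiments, here is a sketch of how the Section 6.1 arrival-rate estimator and the Section 7 hybrid priority fit together. The class layout, names, and the small guards against zero rate estimates are our own assumptions, not the paper's implementation.

```python
class Family:
    """State for one job family F_i (this layout is ours, not the paper's)."""
    def __init__(self, shared_cost):
        self.ts = shared_cost      # sharable scan cost t^s_i
        self.queue = []            # arrival times of enqueued jobs
        self.I = None              # EWMA estimate of the inter-arrival gap
        self.last = None           # time of the most recent arrival

    def observe_arrival(self, now):
        # Section 6.1 update: I_i <- 0.05*A_i + 0.95*I_i, where A_i is the
        # gap between the two most recent arrivals in this family.
        if self.last is not None:
            gap = now - self.last
            self.I = gap if self.I is None else 0.05 * gap + 0.95 * self.I
        self.last = now
        self.queue.append(now)

    def rate(self, now):
        # lambda_i = 1/I_i if the current gap t < I_i,
        # else 1/(0.05*t + 0.95*I_i), so stale estimates decay.
        if self.I is None:
            return 0.0
        t = now - self.last
        return 1.0 / self.I if t < self.I else 1.0 / (0.05 * t + 0.95 * self.I)

def hybrid_priority(fam, families, now, alpha=0.99, eps=1e-9):
    """Hybrid Policy (Section 7):
    P_i = alpha * [1/(2*sum_j lambda_j)]
              * [B_i^2/(lambda_i*ts_i) - ts_i*sum_j lambda_j]
        + x_i * (1 - alpha) * That_i^2 / ts_i,
    with x_i = 1 only for the family whose oldest job has waited longest."""
    lam_sum = max(sum(f.rate(now) for f in families), eps)
    lam = max(fam.rate(now), eps)
    B = len(fam.queue)
    t_hat = now - fam.queue[0]                 # wait of oldest enqueued job
    longest = max(now - f.queue[0] for f in families if f.queue)
    x = 1.0 if t_hat >= longest else 0.0
    aa_term = (B * B / (lam * fam.ts) - fam.ts * lam_sum) / (2.0 * lam_sum)
    return alpha * aa_term + x * (1.0 - alpha) * t_hat * t_hat / fam.ts
```

At each scheduling decision, the family with the largest P_i would be batched and executed; with alpha = 0.99 (the setting used in Section 8), the second term only matters for the family holding the oldest job.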
[Figure 8: Hybrid Policy performance on average and maximum absolute PWT, as we vary the hybrid parameter α.]

[Figure 10: Policy performance on AA PWT metric, as job arrival rates increase (both SJF variants shown).]

[Figure 9: Hybrid Policy performance on average and maximum relative PWT, as we vary α.]

8.1 Hybrid Policy

Figure 8 shows the performance of our Hybrid Policy (Section 7), in terms of both average and maximum absolute PWT. Figure 9 shows the same thing, but for relative PWT. In both graphs the x-axis plots the hybrid parameter α (this axis is not on a linear scale, for the purpose of presentation). The decreasing curve plots average PWT, whose scale is on the left-hand y-axis; the increasing curve plots maximum PWT, whose scale is on the right-hand y-axis.

With α = 0, the hybrid policy behaves like the MA Policy (FIFO), which achieves low maximum PWT at the expense of very high average PWT. On the other extreme, with α = 1 it behaves like the AA Policy, which achieves low average PWT but very high maximum PWT. Using intermediate values of α trades off the two objectives. In both the absolute and relative cases, a good balance is achieved at approximately α = 0.99: maximum PWT is only slightly higher than with α = 0, and average PWT is only slightly higher than with α = 1.

Basically, when configured with α = 0.99, the Hybrid Policy mimics the AA Policy most of the time, but makes an exception if it notices that one job has been waiting for a very long time.

8.2 Comparison Against Baselines

In the following experiments, we compare the policies AA Policy 2, MA Policy (FIFO), and the Hybrid Policy with α = 0.99 against two generalizations of shortest-job-first (SJF): The policy "Aware SJF" is the one given in Section 4.1, which knows the nonshared cost of jobs in its queue, and chooses the job family for which it can execute the most number of jobs per unit of time (i.e., the family that minimizes (batch execution cost)/B). By a simple interchange argument it can be shown that this policy is optimal for the case when jobs have stopped arriving. The policy "Oblivious SJF" does not know the nonshared cost of jobs and so it chooses the family for which t^s/B is minimized. This policy is optimal for the case when jobs have stopped arriving and the nonshared costs are small.

In these experiments we tested how these policies are affected by the total load placed on the system. (Recall from Section 3.1 that asymptotic load = Σ_i λ_i · t^n_i.) To vary load, we started with workloads with asymptotic load = 0.1, and then caused load to increase by various increments, in one of two ways: (1) increase the nonshared costs (t^n values), or (2) increase the job arrival rates (λ values). In both cases, all other workload parameters are held constant.

In Section 8.2.1 we report results for the case where job arrivals are generated by a homogeneous Poisson point process. In Section 8.2.2 we report results under bursty arrivals.

8.2.1 Stationary Workloads

In Figure 10 we plot AA PWT as the job arrival rate, and thus total system load, increases. It is clear that Aware SJF has terrible performance. The reason is as follows: In our workload generator, expected nonshared costs are proportional to shared costs (e.g., the cost of a CPU scan of the file is roughly proportional to its size on disk). Hence, Aware SJF has a very strong preference for job families with small shared cost (essentially ignoring the batch size), which leads to starvation of ones with large shared cost.

In the rest of our experiments we drop Aware SJF, so we can focus on the performance differences among the other policies. Figure 11 is the same as Figure 10, with Aware SJF removed and the y-axis re-scaled. Here we see that AA Policy 2 and the Hybrid Policy outperform both FIFO and SJF, especially at higher loads.

In Figure 12 we show the corresponding graph with MA PWT on the y-axis. Here, as expected, FIFO and the Hybrid Policy perform very well.

Figures 13 and 14 show the corresponding plots for the case where load increases due to a rise in nonshared cost.
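The two SJF generalizations can be sketched as follows; the dict-based family representation is our own assumption, used only for illustration.

```python
def oblivious_sjf_pick(families):
    # "Oblivious SJF": choose the family minimizing t^s / B, i.e., the
    # cheapest shared scan per enqueued job (nonshared costs unknown).
    nonempty = [f for f in families if f["tn_jobs"]]
    return min(nonempty, key=lambda f: f["ts"] / len(f["tn_jobs"]))

def aware_sjf_pick(families):
    # "Aware SJF": choose the family minimizing (batch execution cost)/B,
    # where a batch costs t^s plus the nonshared cost of each queued job.
    nonempty = [f for f in families if f["tn_jobs"]]
    return min(nonempty,
               key=lambda f: (f["ts"] + sum(f["tn_jobs"])) / len(f["tn_jobs"]))
```

When nonshared costs are proportional to shared costs, the batch-cost numerator pulls Aware SJF toward families with small shared cost regardless of batch size, which is the starvation behavior discussed in Section 8.2.1.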
[Figure 11: Policy performance on AA PWT metric, as job arrival rates increase.]

[Figure 12: Policy performance on MA PWT metric, as job arrival rates increase.]

[Figure 13: Policy performance on AA PWT metric, as nonshared costs increase.]

[Figure 14: Policy performance on MA PWT metric, as nonshared costs increase.]
These graphs are qualitatively similar to Figures 11 and 12, but the differences among the scheduling policies are less pronounced.

Figures 15, 16, 17 and 18 are the same as Figures 11, 12, 13 and 14, respectively, but with the y-axis measuring relative PWT. If we are interested in minimizing relative PWT, our policies, which aim to minimize absolute PWT, do not necessarily do as well as SJF. Devising policies that specifically optimize for relative PWT is an important topic of future work.

8.2.2 Bursty Workloads

To model bursty job arrival behavior we use two different Poisson processes for each job family. One Poisson process corresponds to a low arrival rate and the other corresponds to an arrival rate that is ten times as fast. We switch between these processes using a Markov process: after a job arrives, we switch states (from high arrival rate to low arrival rate or vice versa) with probability 0.05, and stay in the same state with probability 0.95. The initial probability of either state is the stationary distribution of this process (i.e., with probability 0.5 we start with a high arrival rate). The expected number of jobs coming from bursts is the same as the expected number of jobs not coming from bursts. If λ_i is the arrival rate for the non-burst process, then the expected λ_i (number of jobs per second) asymptotically equals 20λ_i/11. (Half of all arrivals occur in each state; N jobs arriving at rate λ_i take N/λ_i expected time and N jobs at rate 10λ_i take N/(10λ_i), so the long-run rate is 2N/(N/λ_i + N/(10λ_i)) = 20λ_i/11.) Thus the load is Σ_i E[λ_i]·E[t^n_i].

In Figures 19 and 20 we show the average and maximum absolute PWT, respectively, for bursty job arrivals as load increases via increasing non-shared costs. Here, SJF slightly outperforms our policies on AA PWT, but our Hybrid Policy performs well on both average and maximum PWT.

Figure 21 shows average absolute PWT as the job arrival rate increases, while keeping the nonshared cost distribution constant. Here AA Policy 2 and Hybrid slightly outperform SJF.

To visualize the temporal behavior in the presence of bursts, Figure 22 shows a moving average of absolute PWT on the y-axis, with time plotted on the x-axis. This time series is a sample realization of the experiment that produced Figure 19, with load = 0.7.

Since our policies focus on exploiting job arrival rate (λ) estimates, it is not surprising that under extremely bursty workloads where there is no semblance of a steady-state λ, they do not perform as well relative to the baselines as under stationary workloads (Section 8.2.1). However, it is reassuring that our Hybrid Policy does not perform noticeably worse than shortest-job-first, even under these extreme conditions.

8.3 Summary of Findings

The findings from our experiments on the absolute PWT metric, which our policies are designed to optimize, are:
• Our MA Policy (a generalization of FIFO to shared workloads) is the best policy on maximum PWT, but performs poorly on average PWT, as expected.
[Figure 15: Policy performance on AR PWT metric, as job arrival rates increase.]

[Figure 16: Policy performance on MR PWT metric, as job arrival rates increase.]

[Figure 17: Policy performance on AR PWT metric, as nonshared costs increase.]

[Figure 18: Policy performance on MR PWT metric, as nonshared costs increase.]

• Our Hybrid Policy, if properly tuned, achieves a "sweet spot" in balancing average and maximum PWT, and is able to perform quite well on both.
• With stationary workloads, our Hybrid Policy substantially outperforms the better of two generalizations of shortest-job-first to shared workloads.
• With extremely bursty workloads, our Hybrid Policy performs on par with shortest-job-first.

9. SUMMARY

In this paper we studied how to schedule jobs that can share scans over a common set of input files. The goal is to amortize expensive file scans across many jobs, but without unduly hurting individual job response times.

Our approach builds a simple stochastic model of job arrivals for each input file, and takes into account anticipated future jobs while scheduling jobs that are currently enqueued. The main idea is as follows: If an enqueued job J requires scanning a large file F, and we anticipate the near-term arrival of additional jobs that also scan F, then it may make sense to delay J if it has not already waited too long and other, less sharable, jobs are available to run.

We formalized the problem and derived a simple and effective scheduling policy, under the objective of minimizing perceived wait time (PWT) for completion of user jobs. Our policy can be tuned for average PWT, maximum PWT, or a combination of the two objectives. Compared with the baseline shortest-job-first and FIFO policies, which do not account for future sharing opportunities, our policies achieve significantly lower perceived wait time. This means that users' jobs will generally complete earlier under our scheduling policies.

10. REFERENCES
[1] R. H. Arpaci-Dusseau. Run-time adaptation in River. ACM Trans. on Computing Systems, 21(1):36–86, Feb. 2003.
[2] P. Billingsley. Probability and Measure. John Wiley & Sons, Inc., New York, 3rd edition, 1995.
[3] M. C. Chou, H. Liu, M. Queyranne, and D. Simchi-Levi. On the asymptotic optimality of a simple on-line algorithm for the stochastic single-machine weighted completion time problem and its extensions. Operations Research, 54(3):464–474, 2006.
[4] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI, 2004.
[5] S. Divakaran and M. Saks. Online scheduling with release times and set-ups. Technical Report 2001-50, DIMACS, 2001.
[6] P. M. Fernandez. Red brick warehouse: A read-mostly RDBMS for open SMP platforms. In Proc. ACM SIGMOD, 1994.
[Figure 19: Policy performance on AA PWT metric, as nonshared costs increase, with bursty job arrivals.]

[Figure 21: Policy performance on AA PWT metric, as arrival rates increase, with bursty job arrivals.]

[Figure 20: Policy performance on MA PWT metric, as nonshared costs increase, with bursty job arrivals.]

[Figure 22: Performance over time, with bursty job arrivals.]
[7] A. Gupta, S. Sudarshan, and S. Vishwanathan. Query scheduling in multiquery optimization. In International Symposium on Database Engineering and Applications (IDEAS), 2001.
[8] S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. QPipe: A simultaneously pipelined relational query engine. In Proc. ACM SIGMOD, 2005.
[9] H. Hoogeveen. Multicriteria scheduling. European Journal of Operational Research, 167(3):592–623, 2005.
[10] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. European Conference on Computer Systems (EuroSys), 2007.
[11] D. Karger, C. Stein, and J. Wein. Scheduling algorithms. In M. J. Atallah, editor, Handbook of Algorithms and Theory of Computation. CRC Press, 1997.
[12] E. L. Lawler. Optimal sequencing of a single machine subject to precedence constraints. Management Science, 19(5):544–546, 1973.
[13] J. Lenstra, A. R. Kan, and P. Brucker. Complexity of machine scheduling problems. Annals of Discrete Mathematics, 1:343–362, 1977.
[14] N. Megow, M. Uetz, and T. Vredeveld. Models and algorithms for stochastic online scheduling. Mathematics of Operations Research, 31(3), 2006.
[15] R. H. Möhring, F. J. Radermacher, and G. Weiss. Stochastic scheduling problems I – general strategies. Mathematical Methods of Operations Research, 28(7):193–260, 1984.
[16] R. Motwani, S. Phillips, and E. Torng. Non-clairvoyant scheduling. In Proc. SODA Conference, pages 422–431, 1993.
[17] S. Muthukrishnan, R. Rajaraman, A. Shaheen, and J. E. Gehrke. Online scheduling to minimize average stretch. In Proc. FOCS Conference, 1999.
[18] K. Pruhs, J. Sgall, and E. Torng. Online scheduling, chapter 15. Handbook of Scheduling: Algorithms, Models, and Performance Analysis. Chapman & Hall/CRC, 2004.
[19] A. S. Schulz. New old algorithms for stochastic scheduling. In Algorithms for Optimization with Incomplete Information, Dagstuhl Seminar Proceedings, 2005.
[20] J. Sgall. Online scheduling – a survey. In On-Line Algorithms, Lecture Notes in Computer Science. Springer-Verlag, 1997.
[21] M. Zukowski, S. Heman, N. Nes, and P. Boncz. Cooperative scans: Dynamic bandwidth sharing in a DBMS. In Proc. VLDB Conference, 2007.
