
Best-effort Computing: Re-thinking Parallel Software and Hardware
Srimat T. Chakradhar and Anand Raghunathan

Systems Architecture Department, NEC Laboratories America

School of Electrical and Computer Engineering, Purdue University
chak@nec-labs.com, raghunathan@purdue.edu

ABSTRACT

With the advent of mainstream parallel computing, applications can obtain better performance only by scaling to platforms with larger numbers of cores. This is widely considered to be a very challenging problem due to the difficulty of parallel programming and the bottlenecks to efficient parallel execution. Inspired by how networking and storage systems have scaled to handle very large volumes of packet traffic and persistent data, we propose a new approach to the design of scalable, parallel computing platforms. For decades, computing platforms have gone to great lengths to ensure that every computation specified by applications is faithfully executed. While this design philosophy has remained largely unchanged, applications and the basic characteristics of their workloads have changed considerably. A wide range of existing and emerging computing workloads have an inherent forgiving nature. We therefore argue that adopting a best-effort service model for various software and hardware components of the computing platform stack can lead to drastic improvements in scalability. Applications are cognizant of the best-effort model, and separate their computations into those that may be executed on a best-effort basis and those that require the traditional execution guarantees. Best-effort computations may be exploited to simply reduce the computing workload, to shape it to be more suitable for parallel execution, or to execute it on unreliable hardware components. Guaranteed computations are realized either through an overlay software layer on top of the best-effort substrate, or through the use of application-specific strategies. We describe a system architecture for a best-effort computing platform, provide examples of parallel software and hardware that embody the best-effort model, and show that large improvements in performance and energy efficiency are possible through the adoption of this approach.

Categories and Subject Descriptors
C.0 [Computer Systems Organization]: General

General Terms
Algorithms, Design, Performance, Reliability

Keywords
Best-effort systems, parallel computing, multi-core, performance, scalability

1. INTRODUCTION

Improvements in semiconductor devices have fueled improvements in the performance of computing systems for several decades. However, as we approach the limits of scaling, several limitations and challenges introduced by scaling have changed the nature of the game significantly. First, the inability to continue classical scaling (due to rapidly increasing leakage) led to the end of frequency scaling as the means to performance improvements. Consequently, parallelism has emerged as the primary avenue to improved performance. However, achieving consistent performance improvements in this new regime is quite challenging due to the difficulty of parallel programming and the bottlenecks to efficient parallel execution. As devices scale (possibly into the post-CMOS era), further challenges are expected, such as increasing device unreliability due to defects, soft errors, and extreme process variations.

In light of the above challenges, it is important to consider new approaches to the design of computing platforms. These new approaches can help sustain the performance scaling that applications are used to seeing, by (i) complementing the current scaling approach of simply increasing the number of cores, (ii) addressing the bottlenecks to parallel execution so that applications can take advantage of computing platforms with larger numbers of cores, and (iii) allowing computing platforms to be built from unreliable components (without incurring high overheads for fault tolerance).

In this paper, we argue for re-thinking the design of the entire computing platform stack - from applications to hardware - using an approach that we call best-effort computing. Put simply, the idea is to design most if not all components of the stack to provide services using a best-effort model, realizing drastic improvements in efficiency and scalability. Applications (or the programming models that they use) are structured appropriately to take advantage of a best-effort computing platform, by exposing computations that can be executed on a best-effort basis. Computations that are not suitable for best-effort execution can be handled by (i) separate hardware and software that provide the traditional guaranteed model of computation, (ii) a software overlay layer within the platform that ensures their correct and complete execution on the best-effort substrate, or (iii) faster, application-specific strategies that compensate for the imperfections in the computing platform. We argue for a best-effort computing model by drawing analogies from networking and storage systems, and by examining how a best-effort approach has enabled them to scale dramatically. We provide concrete examples of how to design parallel software and hardware using the best-effort computing model, and present evidence of improvements in performance and energy efficiency.

The rest of this paper is organized as follows. Section 2 outlines the rationale for pursuing a best-effort model for computing systems. Section 3 discusses the best-effort computing model, and Section 4 identifies characteristics of computing workloads that make them good candidates for best-effort computing. Section 5 provides various examples of designing best-effort parallel software and hardware. Section 6 summarizes our proposal and discusses some possible ramifications.

2. RATIONALE

Scalability is a desirable property of a wide range of modern information systems, including networking, storage, and computing systems. It indicates the ability of the system to handle a growing amount of work in a graceful manner. It can also refer to the capability of a system to increase its total throughput (as load increases) in direct proportion to the addition of more resources (in the context of parallel computing systems, more processors). There are two broad methods of adding more computing resources to a system. Scale-up adds more resources to a single node in a system, typically involving the addition of processors or memory to a single computer. Scale-out adds more nodes to a system, such as adding a new computer to a cluster or distributed system. In the context of high-performance computing, there are two other notions of scalability. Strong scaling refers to how solution time varies with the number of processors for a fixed total problem size. Weak scaling refers to how solution time varies when we fix the problem size per processor and add more processors. Unfortunately, adding more resources to a system does not necessarily translate to improved performance. Irrespective of the road taken to achieve scalability, or the metrics used to measure scalability, a fundamental challenge in designing computing systems is to translate increased resources into improved application performance. We pose the question: Is there a simple system design principle that has served us well over the past several decades to build scalable systems? While the focus of this paper is on computing systems, we answer the question by examining how scaling challenges have been addressed in networking and storage systems.

[Figure 1: Best-effort model in networking and storage systems. Applications rely on the networking, storage, and computing sub-systems, which offer best-effort packet delivery (the Internet Protocol), relaxed-consistency query processing, and - in our proposal - best-effort computing, respectively.]

As shown in Figure 1, applications rely on functions provided by the networking, storage, and compute sub-systems. Scalability of an application is intimately related to the scalability of each of these sub-systems. We briefly examine the scalability of networking and storage sub-systems to identify a common design principle for system scalability.

Networking: The Internet Protocol (IP), on which the Internet is built, was developed three decades ago with system scaling as a first-order priority. Towards that end, an important design decision was to make the end hosts responsible for reliability rather than the core network. Excessive workload in the network is dealt with in a simple manner, by allowing switches and routers in the network to drop packets. The IP protocol has been successful in managing ever-increasing volumes of end points and packet traffic for over three decades (exponential growth, with packet traffic doubling every two years). While this could be attributed to a combination of factors, one of the key design decisions to enable scalability was simply reserving the right to drop packets. By sacrificing guarantees, it has become possible to build simpler, faster and scalable networks. When the IP protocol was developed, there were very few applications that could directly use the best-effort packet delivery service, and most applications used TCP to guarantee packet delivery. However, over the years, numerous new scenarios have emerged where applications prefer to routinely use the higher-performance, best-effort packet delivery protocols. For example, the UDP protocol is built on top of the IP protocol, and it does not guarantee reliability or correct sequencing of packets. Applications that use UDP handle reliability requirements, if any, on their own. DNS is a perfect illustration of this use case: the costs of connection set-up for reliable communication are too high, so DNS implements its own re-send mechanism for reliability. Another scenario is delivering data that can be lost (without adversely affecting the application) because new data is being generated that will replace the lost data. Weather data, video streaming, VoIP, Skype, stock quotation services, and gaming applications are good examples of such a use case. Speed is of utmost importance and it does not matter if some updates are missed. It is indeed remarkable that a simple best-effort packet delivery service like IP has supported everything from file transfers, e-mail, and web traffic to streaming media, voice communications, social networking, and collaborative computing [1].
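
To make this division of labor concrete, the following sketch (in C++) shows how an end host can layer its own reliability on top of a best-effort datagram service through bounded re-sends, in the spirit of the DNS example above. The best_effort_send and try_receive_reply functions, and the 30% loss rate, are purely illustrative stand-ins for an unreliable transport such as a UDP socket; they are not part of any real API.

#include <chrono>
#include <cstdlib>
#include <optional>
#include <string>
#include <thread>

// Simulated best-effort datagram service: a send may be silently dropped.
// (Stand-in for a UDP socket; the names and the loss model are illustrative.)
static std::optional<std::string> g_in_flight;

void best_effort_send(const std::string& datagram) {
    if (std::rand() % 100 < 30) return;        // ~30% of datagrams are lost
    g_in_flight = "reply-to:" + datagram;      // otherwise a reply will arrive
}

std::optional<std::string> try_receive_reply() {
    auto reply = g_in_flight;
    g_in_flight.reset();
    return reply;
}

// Application-level reliability on top of the best-effort service: bounded
// re-sends, and acceptance that the request may ultimately still fail.
std::optional<std::string> query_with_retries(const std::string& request,
                                              int max_attempts = 3) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        best_effort_send(request);                              // may be dropped
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        if (auto reply = try_receive_reply()) return reply;     // success
    }
    return std::nullopt;                                        // caller must cope
}

int main() {
    auto reply = query_with_retries("lookup example.com");
    return reply ? 0 : 1;   // 0 if a reply arrived within the retry budget
}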

Storage: Every popular website built on top of a traditional relational database experiences problems with scaling the storage back end [2, 3, 4, 5, 6, 7, 8]. The most important consideration for a data storage system is its ability to rapidly scale to handle more users. Here, scaling refers to the capability of the system to service more users while keeping the cost per user constant. Due to the interactive nature of users' queries, the response time for any given query must be independent of the number of users in the system. Data scale independence allows the user base to grow by orders of magnitude without changing the application. For example, consider the social networking web site Facebook, which has several billion dynamically generated page views per day [9]. Traffic of this magnitude results in over 23,000 page views a second (and each page view could result in many queries to the database). Facebook responded by developing their own complex, proprietary storage system [10]. Several functions and services that are commonly implemented in traditional relational data stores hindered the scalability of the storage system. Notably, their system provides only eventual or relaxed consistency, wherein a write to the storage system will not be seen by all users for an unspecified, variable period of time. For Facebook and many other popular collaborative applications like Flickr, Yelp, etc., the relaxed consistency model offers superior scalability and performance as compared to a traditional database that provides full atomicity, consistency, isolation, and durability (ACID) compliance.

We see that there are really two reasons to place functionality in the network or storage sub-system rather than in the end hosts or applications: either all applications need it, or a large number of applications benefit from an increase in performance due to the clever implementation of a complex service or function in the sub-system. Unfortunately, when new applications like VoIP or Facebook use existing services and functions like guaranteed delivery or consistency, their performance deteriorates with increased load. Also, the workload characteristics of these new applications (VoIP, Facebook, etc.) are inherently different. Therefore, new packet delivery and data consistency mechanisms with radically different semantics and goals were necessary to improve the scalability and performance of these applications. In general, new applications will continuously force us to re-evaluate the semantics and goals of the functions and services that are included in common sub-systems.

Note that the adjective "best-effort" does sometimes get interpreted as "highly unreliable", but this is unfortunate. In fact, in networking systems, best-effort packet delivery is referred to as a highly reliable delivery service [1, 11]. Therefore, equating best-effort with unreliability is neither appropriate nor factual. Reliability is not a discrete quantity with only two values, yes or no. Rather, it allows for a spectrum of possibilities that range from completely reliable to outright unreliable. As the networking and storage applications show, modest relaxations of traditional strict guarantees do not materially alter the core functionality of these applications. Rather, such relaxations have enabled disproportionate performance and scalability benefits. Our proposal of best-effort computing platforms should be viewed in a similar spirit.

3. BEST-EFFORT COMPUTING MODEL

Since a large number of emerging applications already tolerate imperfections in the network or storage systems in return for higher, scalable performance, it begs the question of whether computing platforms have to be perfect. From a different perspective, can some of the functions currently being handled by the computing platform be shifted to the applications in return for higher, more scalable computing performance? The best-effort computing paradigm is a proposal that offers applications an unprecedented, scalable computing platform, by making a fundamental change in the contract between applications and the computing platform.
Like networking and storage systems, best-effort computing shifts some of the traditional concerns (like guaranteed execution of tasks) to the applications, rather than the computing platform being solely responsible for these functions. In return, the best-effort computing paradigm promises increased scalability and higher performance for the applications.

The basic tenet of our proposal is that the computing platform provides computing as a best-effort service. For example, (i) the computing platform may drop (i.e., not execute) some of the computations requested by the application (every application can be viewed as a collection of small computations), (ii) the computing platform may not honor some of the task data dependencies or synchronization requirements of these computations while executing them in parallel, or (iii) the computations may be executed on unreliable hardware that introduces occasional errors into the results. This view is similar to the Internet Protocol model in computer networking, where the network is unable to guarantee delivery of all packets, or the relaxed-consistency model in large-scale data stores, where the storage sub-system is unable to guarantee consistency of data. By sacrificing guarantees, it has become possible to build simpler and faster networks. By sacrificing consistency, it has become possible to build large data stores with fast query response times. Similarly, by sacrificing guaranteed computation, it is possible to build faster and more scalable computing systems. There are several scenarios that may warrant a best-effort computing service: defects in hardware, real-time constraints on response times, excessive computation load, or power constraints.

[Figure 2: Conceptual model of a best-effort computing system. Applications use a best-effort programming model to separate optional computations from guaranteed computations, which are handled by a best-effort computation layer and a guaranteed computation layer, respectively, running on a best-effort operating system and best-effort parallel hardware.]

Figure 2 depicts a conceptual model of the overall system architecture for a best-effort computing system. Unreliability of the underlying computing platform forces the application to re-structure its computations into two categories: (i) optional computations that may be dropped by the computing platform or executed incorrectly, and (ii) mandatory computations that must be executed correctly in order to maintain the integrity of the application. This is again similar to the re-structuring of network applications today to utilize the unreliable UDP protocol or the reliable TCP protocol, both of which are realized on the unreliable IP protocol. New parallel programming models that provide high-level programming templates to easily and intuitively express various algorithms for execution on a parallel, unreliable computing platform will enable rapid development of new applications. Using these programming models, applications can easily specify optional and mandatory computations. Optional computations can be leveraged in at least three ways: (1) drop computations to reduce the overall workload and improve performance, (2) make the computations more amenable to parallel execution by increasing parallelism and reducing communication, synchronization, or memory accesses, or (3) execute the computations on unreliable but efficient hardware. The best-effort computation layer allows the application to easily experiment with a variety of strategies for the execution of optional computations. It implements these best-effort strategies, and manages the execution of guaranteed computations. Like the TCP protocol in computer networking, the best-effort computation layer also implements a mechanism to ensure reliable execution of a mandatory computation by repeated re-execution, if necessary. Many of the functions and services provided by traditional operating systems can also be simplified by using the best-effort service philosophy. Finally, the best-effort model can be taken down to the underlying parallel hardware, by trading off correct hardware operation for increased performance or energy efficiency.
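
As one concrete (and purely illustrative) reading of this contract, the C++ sketch below shows how an application might hand computations to a best-effort runtime: tasks submitted as optional may be dropped when a load budget is exceeded, while tasks submitted as guaranteed are re-executed until they succeed, echoing the UDP/TCP analogy. The BestEffortRuntime class and its drop policy are our own assumptions for illustration, not an interface defined in this paper.

#include <cstddef>
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

// Illustrative best-effort computation layer: the application labels each
// computation as optional (droppable) or guaranteed (must complete).
class BestEffortRuntime {
public:
    void submit_optional(std::function<void()> task) { optional_.push_back(std::move(task)); }
    void submit_guaranteed(std::function<bool()> task) { guaranteed_.push_back(std::move(task)); }

    // Execute under a load budget: optional tasks beyond the budget are simply
    // dropped (like packets in an overloaded router); guaranteed tasks are
    // re-executed until they report success.
    void run(std::size_t optional_budget) {
        std::size_t executed = 0;
        for (auto& task : optional_) {
            if (executed >= optional_budget) break;   // shed the remaining load
            task();
            ++executed;
        }
        for (auto& task : guaranteed_) {
            while (!task()) { /* re-execute until it succeeds */ }
        }
    }

private:
    std::vector<std::function<void()>> optional_;
    std::vector<std::function<bool()>> guaranteed_;
};

int main() {
    BestEffortRuntime rt;
    int refinements = 0, committed = 0;
    for (int i = 0; i < 100; ++i)
        rt.submit_optional([&refinements] { ++refinements; });          // e.g., refinement steps
    rt.submit_guaranteed([&committed] { ++committed; return true; });   // e.g., producing the final output
    rt.run(60);   // under load, 40 of the optional computations are dropped
    std::cout << refinements << " optional executed, " << committed << " guaranteed executed\n";
}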

4. FORGIVING NATURE OF COMPUTING WORKLOADS

The success of a best-effort service model greatly depends on the nature of the applications that utilize it. In the context of the Internet, the best-effort architecture has worked extremely well for data traffic, and even for real-time applications that have some amount of adaptivity [12].

In this section, we focus on the characteristics of applications that are well-suited to a best-effort computing platform. At an abstract level, we pose the question: which applications can live with a computing platform that is not guaranteed to execute computations in a 100% correct manner? We believe that a large class of computing workloads possess characteristics, which we collectively refer to as their forgiving nature, that enable them to execute well on best-effort computing platforms. Computing workloads with a forgiving nature are all around us; they include digital signal processing, multimedia processing (image, video, audio), network processing, wireless communications, web search, and recognition and data mining. The forgiving nature of these applications may be attributed to a variety of factors, some of which are listed below.

- Noisy input data: They process input data that is derived from the real world and is inherently noisy. Since the applications and algorithms are designed to deal with noisy data, they are also often able to tolerate erroneous computations.

- Large, redundant data sets: They process large input data sets that have significant redundancy. This enables them to be inherently resilient to errors in computations without degrading the result.

- No perfect or golden result: They come with an implicit understanding (usage model) that a perfect result is not possible or may not be necessary. Often, there may also be no single golden result, i.e., several different outputs are equally acceptable.

- Limited perceptual ability of users: Even when there is a notion of a perfect result, many applications produce outputs for human consumption. The limited perceptual ability of humans to discern minor variations in the output implies that a good enough output is often sufficient.

- Statistical or probabilistic computations: Applications that employ statistical or probabilistic computations are often tolerant to imprecision or errors in some of the computations.

- Self-healing: They employ iterative computations where the result is refined until a certain criterion is satisfied; therefore, errors in the computations may be corrected or healed by performing additional iterations or refinement of the result.

In order to quantitatively illustrate the forgiving nature of workloads, we focus on two representative applications from the domains of Recognition and Mining (RM), which represent an emerging class of applications that are expected to be prevalent on future computing platforms [13, 14]. Specifically, we focus on K-means clustering and Generalized Learning Vector Quantization (GLVQ). K-means is a widely used clustering algorithm that is based on unsupervised learning [15]. It clusters a given set of points that are represented as vectors in a multi-dimensional space.
The algorithm begins by picking a random set of cluster centroids, and performs the following operations to assign points to clusters in an iterative manner, until the clustering does not change any more.

1. Compute the distance between every point and every cluster centroid.

2. Assign each point to the cluster with the closest centroid.

3. Re-compute the new centroid for each cluster as the mean of all the points in the cluster.

A common application of K-means is to segment images into regions with similar color and texture characteristics (each pixel represents a point in the K-means clustering algorithm). Image segmentation is a useful pre-processing step for image content analysis or compression. In order to evaluate the forgiving nature of the K-means algorithm, we executed a software implementation of K-means to perform image segmentation across several image data sets, while injecting errors into the cluster centroid that is computed as a result of each iteration (see footnote 1). Figure 3(a) shows how the quality of the clustering computed by K-means varies with different rates of error injection. Each curve corresponds to a different error injection rate, and shows the improvement of clustering quality as the algorithm iterates to convergence. The results suggest that executing K-means on a computing platform that introduces fairly large error rates (up to 1%) will have virtually no impact on the clustering quality. However, it is important to note that not all computations in the algorithm can be subject to errors - for example, the operations that determine whether to continue iterating must be executed without any errors. This corresponds to the view that most applications will consist of computations that may be executed on a best-effort basis, and others that may not.

Footnote 1: The errors are injected to represent the impact of executing K-means on a best-effort computing platform. More detailed evaluations considering the exact nature of the errors introduced are presented in [16, 17, 18, 19]. We believe that this simple model is adequate to illustrate the forgiving nature of K-means, since the net effect of errors in the computations of each iteration of the K-means algorithm is to change the clusters to which the points are assigned.

[Figure 3: Illustration of the forgiving nature of K-means clustering and GLVQ classification. (a) K-means clustering quality with centroid perturbation: clustering quality impact (%) versus number of iterations, for error injection rates of 0%, 0.1%, 1%, 5%, and 10%. (b) GLVQ classification quality with model perturbation: classification accuracy (%) versus number of epochs (iterations), for error injection rates of 0%, 10%, 50%, and 90%.]
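
The C++ sketch below follows the three steps above and mimics the experiment just described: after each centroid update, the freshly computed coordinates are perturbed with a given probability, while the test that decides whether to keep iterating is left untouched. The error model (random multiplicative perturbation at rate error_rate) is a simplifying assumption in the spirit of footnote 1, not the exact fault model evaluated in [16, 17, 18, 19].

#include <cstddef>
#include <cstdlib>
#include <vector>

using Point = std::vector<double>;

static double dist2(const Point& a, const Point& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// K-means (steps 1-3 above) with error injection into the centroid update.
// error_rate is the probability that a freshly computed centroid coordinate
// is perturbed, mimicking execution on a best-effort platform.
std::vector<int> kmeans_with_errors(const std::vector<Point>& pts, int k,
                                    double error_rate, int max_iters = 30) {
    std::vector<Point> centroids(pts.begin(), pts.begin() + k);  // simple initial guess
    std::vector<int> assign(pts.size(), -1);
    for (int iter = 0; iter < max_iters; ++iter) {
        bool changed = false;
        // Steps 1-2: assign every point to the cluster with the closest centroid.
        for (std::size_t p = 0; p < pts.size(); ++p) {
            int best = 0;
            for (int c = 1; c < k; ++c)
                if (dist2(pts[p], centroids[c]) < dist2(pts[p], centroids[best])) best = c;
            if (assign[p] != best) { assign[p] = best; changed = true; }
        }
        // Step 3: recompute each centroid as the mean of its points.
        std::vector<Point> sum(k, Point(pts[0].size(), 0.0));
        std::vector<int> count(k, 0);
        for (std::size_t p = 0; p < pts.size(); ++p) {
            for (std::size_t d = 0; d < pts[p].size(); ++d) sum[assign[p]][d] += pts[p][d];
            ++count[assign[p]];
        }
        for (int c = 0; c < k; ++c)
            for (std::size_t d = 0; d < sum[c].size(); ++d) {
                if (count[c] > 0) centroids[c][d] = sum[c][d] / count[c];
                // Error injection: occasionally corrupt the computed coordinate.
                if (std::rand() / (double)RAND_MAX < error_rate)
                    centroids[c][d] *= 1.0 + 0.1 * ((std::rand() % 200 - 100) / 100.0);
            }
        // The convergence test is a mandatory computation and is never perturbed.
        if (!changed) break;
    }
    return assign;
}

int main() {
    std::vector<Point> pts;
    for (int i = 0; i < 100; ++i)
        pts.push_back({double(i % 10), double(i / 10)});   // a small synthetic grid
    std::vector<int> labels = kmeans_with_errors(pts, 4, 0.01);
    return labels.empty() ? 1 : 0;
}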

GLVQ is a supervised learning algorithm that is used for classifying input data into one of a pre-determined set of classes [20]. Similar to K-means, inputs are encoded as vectors in a multi-dimensional space. During training, the algorithm builds a model from the labeled training data set by creating a set of reference vectors to represent each class of data. During classification, the distance from the input to all the reference vectors is computed and used to assign the input to one of the classes. We focus on the computation-intensive training phase of the GLVQ algorithm, which performs the following operations for each training input vector.

1. Compute distances between the labeled training vector and all reference vectors.

2. Identify the closest reference vector RV1 in the correct class, and the closest reference vector RV2 from all the incorrect classes.

3. Suitably update the two chosen reference vectors so that RV1 moves closer to the training vector, and RV2 moves farther from it.

The above computations are performed for each training vector, and the process is iterated until the quality of the model does not improve.
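
A minimal sketch of one such training step is shown below in C++. It implements the three operations above with a fixed learning rate; the actual GLVQ update in [20] uses a loss-driven, adaptive step size, so this should be read as an illustration of the computation structure rather than a faithful re-implementation.

#include <cstddef>
#include <limits>
#include <vector>

struct RefVector {
    std::vector<double> w;   // reference vector in the input space
    int label;               // class it represents
};

static double dist2(const std::vector<double>& a, const std::vector<double>& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

// One GLVQ-style training step for a labeled input vector x:
//   1. compute distances to all reference vectors,
//   2. find the closest same-class vector (RV1) and the closest other-class vector (RV2),
//   3. move RV1 towards x and RV2 away from x.
// A fixed learning rate is used purely for illustration.
void train_one(std::vector<RefVector>& refs, const std::vector<double>& x,
               int label, double lr = 0.05) {
    int rv1 = -1, rv2 = -1;
    double d1 = std::numeric_limits<double>::max(), d2 = d1;
    for (std::size_t i = 0; i < refs.size(); ++i) {
        double d = dist2(refs[i].w, x);                              // step 1
        if (refs[i].label == label && d < d1) { d1 = d; rv1 = int(i); }
        if (refs[i].label != label && d < d2) { d2 = d; rv2 = int(i); }
    }
    if (rv1 < 0 || rv2 < 0) return;                                  // degenerate input
    for (std::size_t k = 0; k < x.size(); ++k) {                     // step 3
        refs[rv1].w[k] += lr * (x[k] - refs[rv1].w[k]);              // pull RV1 closer
        refs[rv2].w[k] -= lr * (x[k] - refs[rv2].w[k]);              // push RV2 away
    }
}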

In the case of GLVQ, we evaluated the forgiving nature of the algorithm by executing a software implementation that was used to detect faces in images. During the execution of the algorithm, we injected errors into the results computed by each iteration, i.e., the reference vectors. Figure 3(b) presents the results of our evaluation. Again, the different curves represent different rates of error injection, and each curve shows how the quality of the model improves as the algorithm iterates to convergence. In the case of GLVQ, we note that error rates of up to 50% have very little impact on the final quality of the model constructed.

While we have focused on just two representative applications, we would like to reiterate that the forgiving nature is observed across a wide range of application domains, including digital signal processing, multimedia (image, video, and audio) processing, network processing, wireless communications, and recognition and data mining. The causes and extent of the forgiving nature may vary, but we believe that most if not all of these workloads can benefit from the proposed best-effort computing model.

5. BEST-EFFORT SOFTWARE AND HARDWARE: ILLUSTRATIONS

The forgiving nature of computing workloads suggests that the traditional interface of guaranteed execution of computations may be changed by the underlying computing platform. In this section, we provide concrete examples of how parallel hardware and software can be re-designed based on the best-effort service model. First, we address the problem of how to partition the computations in an application into best-effort and guaranteed computations through parallel programming models. Second, we discuss how a given computing platform can scale to handle larger workloads by dropping (not executing) some of the best-effort computations. Next, we discuss how scalable parallel execution can be obtained by taking advantage of the forgiving nature of the workload to alleviate the bottlenecks to parallel execution. We then describe how the best-effort model can be taken down into the hardware by incorporating knobs that modulate the effort expended by the hardware towards computing correct results. Finally, we discuss how the best-effort model may be used to build computing platforms from unreliable components, without incurring excessive overheads for fault tolerance.

5.1 Best-effort parallel programming models

Applications that execute on a best-effort computing platform must address the separation of computations into best-effort and guaranteed components. While there are several possible approaches to achieving this separation, we believe that approaches that minimize the additional burden on the programmer are desirable. Towards that goal, Meng et al. [16] proposed parallel programming templates such that the identification of best-effort computations is a natural by-product of specifying the algorithm using the template. For example, Figure 4 presents a template for iterative-convergence algorithms that naturally embodies the concept of best-effort computing.

iterate {
    mask[0:M] = filter();
    parallel_for(0 to N with mask batch P) {
        ..
    }
} until converged();

Figure 4: Iterative-convergence programming model for best-effort computing [16]

The template naturally captures algorithms that iteratively perform a parallel computation until a specified convergence criterion is achieved. This iterative-convergence template was applied to specify the K-means and GLVQ algorithms, and other recognition and mining algorithms [16, 17]. Several strategies to identify best-effort computations were presented in [16] based on the iterative-convergence template. For example, computations within the inner parallel_for loop that do not have an impact on the result (e.g., their outputs do not change) for a few iterations can be identified through the filter operator and marked as best-effort computations. Although the iterations of the parallel_for loop may carry dependencies, batches of the loop iterations may be executed in parallel, ignoring the dependencies. Finally, various convergence criteria may be used to identify iterations of the outer iterate loop that have a minimal impact on the output and mark them as best-effort computations.
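
To suggest how the template of Figure 4 maps onto ordinary code, the C++ sketch below instantiates it generically: a filter step masks out loop iterations whose results have been stable for a few outer iterations (these become droppable, best-effort computations), while the reduction and the convergence test remain guaranteed. The function signature, the stability counter, and the threshold are our own illustrative choices, not the framework of [16].

#include <functional>
#include <vector>

// Generic rendering of the iterative-convergence template (Figure 4). Inner
// iterations whose results have been stable for `stable_threshold` outer
// iterations are masked out and dropped (best-effort); the reduction and the
// convergence test remain guaranteed computations.
void iterate_until_converged(
        int n,                                          // number of inner loop iterations
        const std::function<bool(int)>& update_point,   // returns true if iteration i changed its result
        const std::function<void()>& reduce,            // e.g., recompute centroids
        const std::function<bool()>& converged,         // guaranteed convergence check
        int stable_threshold = 3) {
    std::vector<int> stable(n, 0);
    do {
        // mask[0:M] = filter(): mark long-stable iterations as droppable.
        std::vector<bool> active(n);
        for (int i = 0; i < n; ++i) active[i] = (stable[i] < stable_threshold);

        // parallel_for(0 to N with mask batch P): shown sequentially here;
        // batches could also run in parallel, ignoring loop-carried dependencies.
        for (int i = 0; i < n; ++i) {
            if (!active[i]) continue;             // best-effort: drop this computation
            if (update_point(i)) stable[i] = 0;   // result changed: reset its counter
            else ++stable[i];                     // unchanged: candidate for dropping
        }
        reduce();                                 // guaranteed computation
    } while (!converged());                       // until converged()
}

Dropping the masked iterations corresponds to strategy (1) of Section 3, and executing batches in parallel while ignoring their dependencies corresponds to strategy (2).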

5.2 Managing computing workloads

The ability of a given computing platform to handle workloads can be enhanced by selectively dropping (not executing) some of the best-effort computations. This is a strategy for load management that is used in network routers, as well as in specific applications such as web servers. We propose an extension of this concept to manage a broader class of computing workloads. Meng et al. [16] demonstrated that, in the context of Recognition and Mining workloads, significant performance improvements can be obtained by not executing some of the best-effort computations. Reducing the workload leads to speedups on both sequential and parallel computing platforms. Figure 5 illustrates this scalability in the context of the K-means algorithm executing on an 8-core parallel platform. Using various best-effort computing strategies described in [16], we observe that it is possible to scale the computing workload by two orders of magnitude, while the error introduced in the output due to dropping the best-effort computations varies up to around 10% of the points being mis-classified. This scalability can be exploited to process larger data sets on a given computing platform, or to handle surges in workload without degrading application performance.

[Figure 5: Scalability of the K-means workload (error vs. execution time) for various best-effort computing strategies [16]]
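
The router analogy can be made concrete with a small sketch: the C++ work queue below executes every guaranteed task but sheds optional (best-effort) tasks whenever its backlog exceeds a high-water mark. The threshold-based drop policy is an illustrative assumption and is not taken from [16].

#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

struct Task {
    std::function<void()> run;
    bool optional;     // true: a best-effort computation that may be dropped
};

// A work queue that sheds best-effort tasks when the backlog grows too large,
// the way an overloaded router sheds packets. Guaranteed tasks are never dropped.
class SheddingQueue {
public:
    explicit SheddingQueue(std::size_t high_watermark) : limit_(high_watermark) {}

    void push(Task t) {
        if (t.optional && queue_.size() >= limit_) {
            ++dropped_;                    // overload: drop the optional task
            return;
        }
        queue_.push_back(std::move(t));
    }

    void drain() {
        while (!queue_.empty()) {
            queue_.front().run();
            queue_.pop_front();
        }
    }

    std::size_t dropped() const { return dropped_; }

private:
    std::deque<Task> queue_;
    std::size_t limit_;
    std::size_t dropped_ = 0;
};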

5.3 Improving scalability of parallel execution

The forgiving nature of applications can be exploited to enable them to scale to computing platforms with larger numbers of processing cores. In this case, we do not wish to reduce the amount of computation executed; rather, we would like to shape the workload so that its performance scales better with larger numbers of cores. Towards this goal, Meng et al. [17] and Byna et al. [18] analyze workloads whose parallel scalability is limited by bottlenecks to parallel execution, such as communication, synchronization, and run-time library overheads, leading to sub-optimal performance on multi-core and many-core (GPU) platforms. They propose a technique called dependency relaxation, which selectively relaxes data dependencies between tasks, and show how it can be applied to create increased parallelism, improve the granularity of parallel tasks, and reduce the off-chip memory and coherence traffic. Figure 6 presents representative results for the GLVQ application running on a multi-core platform with 8 cores. In contrast to the original implementation, whose parallel scalability is severely limited by the granularity of the parallel tasks and run-time overheads, the dependency-relaxed implementation demonstrates near-linear speedup for up to 8 cores.

[Figure 6: Improving scalability of parallel execution for GLVQ classification [17]]
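
The C++ sketch below conveys the flavor of dependency relaxation for a GLVQ-like training loop. Strictly, each training vector should see the model updated by its predecessor; the relaxed version lets a whole batch compute its updates in parallel against the same stale model snapshot and then applies them, deliberately ignoring the loop-carried dependency within a batch. The batching scheme, the threading, and the placeholder compute_update function are our own simplifications of the idea described in [17, 18].

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

using Vec = std::vector<double>;
struct Update { int target; Vec delta; };   // which reference vector to nudge, and by how much

// Application-specific: compute the update implied by one training sample,
// reading (but not writing) the current model. Declared here as a placeholder.
Update compute_update(const std::vector<Vec>& model, const Vec& sample);

// Strict version: every sample sees the model left behind by the previous one.
void train_strict(std::vector<Vec>& model, const std::vector<Vec>& samples) {
    for (const Vec& s : samples) {
        Update u = compute_update(model, s);
        for (std::size_t k = 0; k < u.delta.size(); ++k) model[u.target][k] += u.delta[k];
    }
}

// Relaxed version: a whole batch computes its updates in parallel against the
// same (stale) model snapshot, and the updates are then applied. The
// loop-carried dependency between samples within a batch is deliberately ignored.
void train_relaxed(std::vector<Vec>& model, const std::vector<Vec>& samples,
                   std::size_t batch, unsigned num_threads = 4) {
    for (std::size_t start = 0; start < samples.size(); start += batch) {
        std::size_t end = std::min(samples.size(), start + batch);
        std::vector<Update> updates(end - start);
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < num_threads; ++t) {
            workers.emplace_back([&, t] {
                // Each thread fills a disjoint set of slots, so no locking is needed.
                for (std::size_t i = start + t; i < end; i += num_threads)
                    updates[i - start] = compute_update(model, samples[i]);  // reads the stale model
            });
        }
        for (std::thread& w : workers) w.join();
        for (const Update& u : updates)       // apply the batch's updates serially
            for (std::size_t k = 0; k < u.delta.size(); ++k) model[u.target][k] += u.delta[k];
    }
}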

5.4 Taking best-effort into hardware

The best-effort model can be used to re-think hardware design and implementation. Traditionally, hardware has been designed to implement a given specification exactly, based on the concepts of Boolean or numerical equivalence. Chippa et al. [19] propose a design approach called scalable-effort hardware that takes advantage of the forgiving nature of applications to achieve improvements in energy efficiency. The key idea is to incorporate knobs that regulate the computational effort expended by the hardware towards performing computation correctly, and to regulate these knobs to achieve the desired quality of the output at minimal energy consumption. They illustrate how such knobs can be identified at different levels of design abstraction, namely at the algorithm, architecture, and circuit levels. These concepts are applied to the design of a hardware processor for Support Vector Machines (SVMs). At the algorithm level, the number of support vectors and the number of dimensions of each vector can be modulated to trade off the number of operations against the accuracy of the result.
At the architecture level, the number of processing elements and the precision of each processing element are regulated in order to provide just enough computational accuracy. At the circuit level, the concept of voltage over-scaling (scaling the supply voltage while keeping the clock frequency fixed, thereby introducing errors in the outputs of processing elements) is utilized together with enabling circuit-level design techniques, in order to obtain improved energy efficiency at the cost of errors introduced into the operations of the algorithm. By combining these knobs, the hardware implementation can be regulated to expend just enough effort to achieve the desired output accuracy. As illustrated in Figure 7, energy improvements of 2X-3X are reported as compared to a well-optimized conventional hardware implementation.

[Figure 7: Energy-efficiency improvements through scalable-effort hardware [19]]
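
A software analogue of one such knob is sketched below in C++: an SVM-style dot product whose operand precision can be dialed down by quantizing inputs to a given number of bits, trading result accuracy for what, in hardware, would be narrower and cheaper arithmetic. The quantization scheme and its exposure as a runtime parameter are illustrative assumptions; the circuit-level knobs of [19], such as voltage over-scaling, cannot be reproduced in portable code.

#include <cmath>
#include <cstddef>
#include <vector>

// Quantize a value to `bits` bits of precision over [-range, range], modeling
// the output of a reduced-precision processing element.
static double quantize(double x, int bits, double range = 8.0) {
    double step = range / std::ldexp(1.0, bits - 1);   // range / 2^(bits-1)
    double q = std::round(x / step) * step;
    if (q > range) q = range;
    if (q < -range) q = -range;
    return q;
}

// Dot product (the inner loop of an SVM kernel evaluation) with a precision
// knob: fewer bits means less computational effort in hardware, at the cost
// of accuracy in the result.
double dot_scaled_effort(const std::vector<double>& a, const std::vector<double>& b, int bits) {
    double acc = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        acc += quantize(a[i], bits) * quantize(b[i], bits);
    return acc;
}

A controller would start from a low-effort setting and raise the bit-width only if an application-level quality metric (e.g., classification accuracy) falls below the desired threshold.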

5.5 Dealing with unreliable hardware

With continued scaling of feature sizes, transistors and interconnects are expected to become increasingly unreliable due to escalating defects, soft errors, and process variations. In the face of high defect or error rates, classical fault-tolerant design techniques (based on spatial and/or temporal redundancy) will become either very expensive or ineffective. Recently, there has been a proposal to design multi-core computing platforms by combining a small number of reliable processing cores with a large number of unreliable cores, and exploiting this asymmetric reliability using a suitable software architecture [21]. Such an architecture will have a larger number of cores, thereby potentially improving performance, compared to an architecture that only uses fully reliable cores (since reliable cores are significantly larger than unreliable cores). We have recently demonstrated that the best-effort computing model can be combined with the architecture proposed in [21] to take full advantage of the forgiving nature of computing workloads. The separation of computations into best-effort and guaranteed computations can be used to drive the assignment of computations to the reliable and unreliable cores. We have also demonstrated that simultaneously exploiting the forgiving nature of the workloads through software techniques (dropping computations, relaxing data dependencies) and execution on unreliable cores can lead to significantly better performance than either of these techniques alone [22].
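
In software terms, the assignment can be as simple as routing each computation by its label, as in the C++ sketch below: guaranteed computations go to the (few) reliable cores, best-effort computations go to the (many) unreliable cores, and a cheap application-level acceptance check triggers a fallback to a reliable core only when an unreliable result is unusable. The run_on_reliable / run_on_unreliable interfaces and the acceptance-check fallback are hypothetical, intended only to illustrate how the best-effort/guaranteed separation can drive the assignment described above; they are not the software architecture of [21] or [22].

#include <functional>
#include <vector>

// Hypothetical interfaces to two pools of cores with asymmetric reliability.
// On real hardware, run_on_unreliable may occasionally return a wrong result;
// run_on_reliable never does.
using Work = std::function<double()>;
double run_on_reliable(const Work& w)   { return w(); }
double run_on_unreliable(const Work& w) { return w(); /* may be erroneous on real hardware */ }

struct Computation {
    Work work;
    bool guaranteed;                          // separation produced by the programming model
    std::function<bool(double)> acceptable;   // cheap application-level sanity check
};

// Assign computations to cores based on the best-effort/guaranteed separation.
std::vector<double> schedule(const std::vector<Computation>& cs) {
    std::vector<double> results;
    results.reserve(cs.size());
    for (const Computation& c : cs) {
        if (c.guaranteed) {
            results.push_back(run_on_reliable(c.work));            // correctness required
        } else {
            double r = run_on_unreliable(c.work);                  // cheap, possibly wrong
            if (!c.acceptable(r)) r = run_on_reliable(c.work);     // rare fallback
            results.push_back(r);
        }
    }
    return results;
}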

6. SUMMARY AND DISCUSSION

The emergence of mainstream parallel computing mandates that applications scale in performance primarily by taking advantage of increasing numbers of cores. Motivated by approaches used to design large-scale networking and storage systems, we propose best-effort computing as a model for future parallel computing platforms. Under the best-effort model, different layers of the computing platform stack are designed to provide a best-effort service, relaxing conventional guarantees of complete or correct execution. We presented a conceptual model for a best-effort computing platform, described characteristics of applications that can leverage such a platform, and provided concrete examples of parallel software and hardware design techniques that adopt the best-effort approach to significantly improve performance and energy efficiency. We believe that the best-effort model can also be exploited to optimize many other computing platform metrics, like the cost or system management effort of the computing infrastructure.

We would also like to mention some of the challenges that must be overcome in order to facilitate adoption of the best-effort model described in this paper. The best-effort model increases the burden on application developers, since they must partition their applications into computations that can work with a best-effort service model and computations that cannot. The development of programming models, such as the iterative-convergence template described in this paper, for a wide range of algorithms and computational patterns, will alleviate this burden. Alternatively, for some applications it may be better to use programming language extensions to mark the regions in a program that correspond to best-effort computations. An important question that must be addressed is to what extent the underlying platform should actually take advantage of the forgiving nature of the best-effort computations (i.e., how aggressively it should drop or incorrectly execute these computations). In this context, we note that in the networking domain, best-effort does not equal highly unreliable; only a very small degradation in reliability is actually encountered in practice. Finding the right extent to which the forgiving nature of each application should be exploited will require further investigation. Similar to the concept of differentiated services in networking, it may be interesting to explore further classification of the best-effort computations into different categories that are treated differently by the computing platform. Finally, in the context of both parallel hardware and software, techniques used for verifying the implementation must account for the fact that numerical or Boolean equivalence may no longer be maintained. Despite these challenges, we believe that the potential benefits of the best-effort computing model (such as improvements in performance, energy efficiency, and scalability) make it very appealing as a direction for future research in parallel hardware and software systems.

Acknowledgment: We acknowledge Jiayuan Meng, Suren Byna, Srihari Cadambi, Hyungmin Cho, Vinay Chippa, Debabrata Mohapatra, and Kaushik Roy, whose inputs have greatly shaped our thoughts on best-effort computing.

7. REFERENCES

[1] S. Floyd and M. Allman. Comments on the Usefulness of Simple Best-Effort Traffic. IETF RFC 5290, http://tools.ietf.org/html/rfc5290, July 2008.
[2] B. Fitzpatrick. LiveJournal: Behind the scenes scaling storytime (Invited talk). In USENIX, 2007.
[3] G. Linden. Early Amazon: Splitting the website. Online: http://glinden.blogspot.com/2006/02/early-amazon-splitting-web.
[4] J. Newton. Scaling out like Technorati. http://newton.typepad.com/content/2007/09/scaling-out-lik.html.
[5] T. O'Reilly. Web 2.0 and databases part 1: Second Life. http://radar.oreilly.com/archives/2006/04/web-20-and-databases-part-1-se.html.
[6] T. O'Reilly. Database war stories 5: craigslist. http://radar.oreilly.com/archives/2006/04/database-war-stories-5-craigsl.html.
[7] T. O'Reilly. Database war stories 3: Flickr. http://radar.oreilly.com/archives/2006/04/database-war-stories-3-flickr.html.
[8] M. Armbrust et al. SCADS: Scale-independent storage for social computing applications. In Proc. 4th Biennial Conference on Innovative Data Systems Research (CIDR), 2009.
[9] Facebook infosession. Sponsored by the Industrial Relations Office (IRO), EECS Department, UC Berkeley, September 2008.
[10] J. Sobel. Scaling out. http://www.facebook.com/note.php?note_id=23844338919.
[11] S. Armstrong et al. Multicast Transport Protocol. RFC 1301, http://www.rfc-archive.org/getrfc.php?rfc=1301, Feb. 1992.
[12] L. Breslau and S. Shenker. Best-effort versus reservations: a simple comparative analysis. In Proc. SIGCOMM, pages 3-16, 1998.
[13] P. Dubey. A Platform 2015 Workload Model: Recognition, Mining and Synthesis Moves Computers to the Era of Tera. White Paper, Intel Corporation, 2008.
[14] Y. K. Chen et al. Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications. Proc. of the IEEE, 96:790-807, 2008.
[15] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, 1967.
[16] J. Meng, S. Chakradhar, and A. Raghunathan. Best-effort parallel execution framework for recognition and mining applications. In Proc. IEEE Int. Parallel and Distributed Processing Symp., May 2009.
[17] J. Meng, A. Raghunathan, S. Chakradhar, and S. Byna. Exploiting the forgiving nature of applications for scalable parallel execution. In Proc. IEEE Int. Parallel and Distributed Processing Symp., Apr. 2010.
[18] S. Byna, J. Meng, A. Raghunathan, S. Chakradhar, and S. Cadambi. Best-Effort Semantic Document Search on GPUs. In Proc. Third Workshop on General-Purpose Computation on Graphics Processing Units, Mar. 2010.
[19] V. Chippa, D. Mohapatra, A. Raghunathan, K. Roy, and S. T. Chakradhar. Scalable Effort Hardware: Exploiting Algorithmic Resilience for Energy Efficiency. In Proc. ACM/IEEE Design Automation Conf., June 2010.
[20] N. R. Pal, J. C. Bezdek, and E. C.-K. Tsao. Generalized clustering networks and Kohonen's self-organizing scheme. IEEE Trans. on Neural Networks, 4(4):549-557, Jul 1993.
[21] L. Leem, H. Cho, J. Bau, Q. Jacobson, and S. Mitra. ERSA: Error-Resilient System Architecture for Probabilistic Applications. In Proc. ACM/IEEE Design, Automation and Test in Europe, Mar. 2010.
[22] H. Cho, S. T. Chakradhar, and A. Raghunathan. Trading reliability for parallelism. Personal Communication, Sep. 2009.
