
Ben Trapani
Professor Enos
Advanced Writing in the Technical Professions
2/3/16
The High Cost of Variable Latency in Distributed Systems
As a result of the collapse of Moore's Law, data center and compute center engineers have
been moving toward increasing the number of compute nodes in a center rather than
upgrading the performance of each node. It is usually cheaper to purchase a higher volume of
lower-performance chips than a lower volume of high-performance chips offering the same
total compute power. Computer scientists are therefore under pressure to design
frameworks and systems that effectively exploit the performance gains of distributed systems.
As of 2016, many frameworks exist to distribute computation across a set of networked
computers, including OpenCL, NVIDIA CUDA, and OpenMPI. However, highly variable latency
has been an inherent problem in all modern distributed system frameworks and
implementations, and it has been an acute cost for industries that depend on real-time
distributed processing of data.

Unpredictable latency results from variable per-node performance combined with
variable network latency, which together create a large potential latency per execution. Since
the probability of any one node being slow is fairly low, the vast majority of executions complete
within a reasonable latency window, while a small but significant fraction occupy the far end of
the potential latency window. This phenomenon is called tail latency. Some sectors dependent on
cloud technology have felt latency fluctuations in cloud compute services most acutely,
including the financial sector, air traffic control systems, and real-time machine
learning algorithms. Imagine a hedge fund manager placing an order to buy $500 million worth
of a security. If the order is a market order and is not filled within a tenth of a second, the price
of even the largest securities could fluctuate by 0.5%, increasing the chances that the hedge
fund loses money. If the system the hedge fund uses executes 90% of orders in
1 microsecond and the remaining 10% in 1 second, the fund risks a large sum of
money every time it places an order. The system exhibits the high tail latency
characteristic of most modern distributed systems.
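
To make the arithmetic of this scenario concrete, the short Python sketch below simulates the hypothetical order-execution system just described. The 90%/10% split and the one-microsecond and one-second figures come from the example above; the sample size and the use of NumPy are illustrative assumptions.

```python
import numpy as np

# Simulate the order-execution system described above: 90% of orders
# complete in ~1 microsecond, 10% stall for ~1 second (the "tail").
rng = np.random.default_rng(seed=0)
n_orders = 100_000
fast = rng.random(n_orders) < 0.90
latency_s = np.where(fast, 1e-6, 1.0)  # latency in seconds per order

print(f"mean latency:    {latency_s.mean():.4f} s")              # ~0.1 s
print(f"median latency:  {np.median(latency_s):.6f} s")          # ~1 microsecond
print(f"99th percentile: {np.percentile(latency_s, 99):.4f} s")  # ~1 s
```

The median order confirms almost instantly, yet the mean and the 99th percentile are dominated by the slow 10% of orders, which is exactly the exposure the hedge fund cares about.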
As a result of the demand for distributed systems with deterministic latency, many
computer scientists have modeled different ways of distributing data and have proposed
different solutions that provide predictable latency in distributed systems. The proposed
solutions typically depend on the type of computation being performed. For example,
the field of real-time 3D rendering requires extensive data parallelism and deterministic low
latency.
Scientists accommodated the demand for deterministic latency in 3D rendering by
building specialized compute units known as GPUs, which parallelize operations on similar data
at the hardware level and guarantee almost-equal latency per operation while ensuring that the
final product includes every completed operation. This approach works well for graphics
processing because the per-data computation is usually a set of three to four 3x3 matrix
multiplications. So long as each piece of data is given an equal slice of the compute resources
and performing the computation for one piece of data does not affect the computation of other
pieces of data, the GPU can execute all operations in parallel on separate mini-cores within the
unit and can then build the resulting texture synthesized from the results of each data
transformation. When every transformation has completed, the most recently built frame
buffer replaces the existing frame buffer and the user's display reflects the change. The fact
that most modern 3D video games are able to operate at a stable 60 frames per second is proof
that distributed, predictable low latency systems are possible if the system is tailored to specific
data and transformation types.
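
To illustrate the kind of per-data work a GPU parallelizes, the following sketch applies a single 3x3 transform to a small vertex buffer. It runs on the CPU with NumPy rather than on an actual GPU, and it uses one matrix rather than the chain of three to four multiplications mentioned above; the point is only that each vertex is transformed independently of every other vertex, which is what lets the hardware run all of them in parallel.

```python
import numpy as np

# Data-parallel vertex transform: the same 3x3 matrix is applied to every
# vertex, and no vertex's result depends on any other vertex's result.
# A GPU maps each row to its own mini-core; NumPy stands in for that here.
def transform_vertices(vertices: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """vertices: (N, 3) array, matrix: (3, 3). Returns the transformed (N, 3) array."""
    return vertices @ matrix.T

# Example: rotate a small vertex buffer 90 degrees about the z-axis.
rotation_z = np.array([[0.0, -1.0, 0.0],
                       [1.0,  0.0, 0.0],
                       [0.0,  0.0, 1.0]])
vertices = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
print(transform_vertices(vertices, rotation_z))
```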

However, GPUs are only well-suited to perfectly data-parallel operations that depend
heavily on linear algebra. Take a securities trading system, for example. If we attempted to use a
GPU to execute buy/sell order matching in parallel, we would run into data races, since the
price of certain securities, such as ETFs, futures, and options, depends on multiple other
securities. Many computer scientists would point out that we could build a dependency tree to
model the dependencies between securities and could then leverage the GPU for computations
at the lower levels of the tree. However, this only yields a performance gain for a few iterations,
since GPUs are optimized for linear algebra and have very poor per core performance
compared to traditional CPUs. Although we can solve the predictable low latency distributed
computation problem by tailoring a system to a specific set of data and transformation types,
we have not been able to architect a general system with the same properties.
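
The dependency-tree approach mentioned above can be sketched as a level-by-level (topological) evaluation: securities with no dependencies are priced first and in parallel, then any derivative whose inputs are all ready, and so on. The instruments and dependency graph below are hypothetical placeholders; only the grouping logic matters here.

```python
# Hypothetical dependency graph: each derivative lists the securities its
# price depends on. Base securities have no dependencies.
dependencies = {
    "AAPL": [], "MSFT": [],
    "TECH_ETF": ["AAPL", "MSFT"],
    "TECH_ETF_OPTION": ["TECH_ETF"],
}

def levels(deps):
    """Group securities into levels; each level depends only on earlier levels."""
    resolved, remaining, out = set(), dict(deps), []
    while remaining:
        ready = [s for s, d in remaining.items() if all(x in resolved for x in d)]
        out.append(ready)            # everything in `ready` can be priced in parallel
        resolved.update(ready)
        for s in ready:
            del remaining[s]
    return out

for i, level in enumerate(levels(dependencies)):
    print(f"level {i}: {level}")  # level 0: base equities, level 1: the ETF, level 2: the option
```

Each level is perfectly data-parallel and could in principle be handed to a GPU, but the levels themselves must execute sequentially, which is where the poor per-core performance of the GPU starts to hurt.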

Suresh, Schmid, and Feldmann discuss the issue of tail latency in distributed systems in
great depth in an article published by USENIX in May 2015, titled "C3: Cutting Tail
Latency in Cloud Data Stores via Adaptive Replica Selection." The authors explain the costs of
high tail latency in the fields of finance and real-time machine learning and data processing.
They also identify some key technical challenges encountered when designing a general-purpose
distributed compute cluster, such as fault tolerance and the necessity of data
replication. Making a distributed system highly available and ensuring data integrity can introduce
significant tail latency if done naively. In the worst case, data is replicated across an arbitrary
percentage of the nodes in the system with no load balancing, which allows a single failure
or costly execution to cause a latency spike across all incoming requests over a certain time
interval. Suresh, Schmid, and Feldmann propose an adaptive replica selection scheme called C3. They
outline the parameters of the scheme, provide pseudocode for a control application and
document the reduction in tail latency by examining the impact that their replica selection
scheme has on the tail latency of a distributed Cassandra database running on Amazon EC2
instances (Suresh, Schmid, & Feldmann, 2015). They assume that their readers have an advanced understanding of the
field of distributed systems and computer science, the ability to read technical diagrams of
systems, and an advanced understanding of coding and probability. Knowledge of the basic idea
of data replication is also assumed. The article seeks to present a solution to both the high
availability and data replication challenges in cloud infrastructures without significantly
contributing to the already-significant tail latency present in distributed systems.
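
The article provides its own pseudocode for C3; the sketch below is not that pseudocode, but a heavily simplified illustration of the general idea behind adaptive replica selection: score each replica using recently observed service times and the amount of outstanding work, and send the next request to the replica with the lowest estimated cost. The scoring formula, field names, and numbers here are assumptions made for illustration and do not reproduce C3's actual ranking function or its rate-control component.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    avg_service_time_ms: float   # weighted average of recently observed service times
    outstanding_requests: int    # requests sent to this replica but not yet answered

def score(replica: Replica) -> float:
    # Simplified cost estimate: expected wait grows with queue depth.
    # (Illustrative only; C3's actual ranking function differs.)
    return (replica.outstanding_requests + 1) * replica.avg_service_time_ms

def pick_replica(replicas: list[Replica]) -> Replica:
    """Route the next request to the replica with the lowest estimated cost."""
    return min(replicas, key=score)

replicas = [
    Replica("node-a", avg_service_time_ms=2.0, outstanding_requests=5),
    Replica("node-b", avg_service_time_ms=3.5, outstanding_requests=1),
    Replica("node-c", avg_service_time_ms=2.5, outstanding_requests=2),
]
print(pick_replica(replicas).name)  # node-b: slower per request, but far less queued work
```

In this toy example node-b wins despite having the slowest per-request service time because it has the least queued work; choosing replicas this way is what keeps a single overloaded node from dragging every request into the tail.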

Although the article presents an effective solution for reducing tail latency in a
distributed Cassandra database on Amazon EC2, the article does not propose
a more general solution to curbing tail latency in distributed compute clusters. The lack of
generality in the article, and in others on the same subject, suggests that computer
scientists have had difficulty identifying and relating the key parameters of distributed systems
in a general manner that optimizes these systems for deterministic latency. Unfortunately, it
appears as if the solution outlined in the paper is of little more use to general-purpose
low-latency distributed computing than a GPU is: it is useful only given a specific set of parameters of the
system. When computer scientists discover how to automate the parameter adjustments of
distributed systems to optimize system latency based on properties of the data and the
transformations performed by each respective system, distributed computing will reach its full
potential in terms of deterministic latency. Industries such as financial services, real-time data analytics, and real-time navigation will advance rapidly as a result.

References:
Suresh, L., Schmid, S., & Feldmann, A. (2015). C3: Cutting Tail Latency in Cloud Data Stores via
Adaptive Replica Selection. In Proceedings of USENIX NSDI 2015.
