
Interview with Raffi Krikorian

on Twitter's Infrastructure
Raffi Krikorian, Vice President of Platform Engineering at Twitter,
gives an insight on how Twitter prepares for unexpected traffic peaks
and how system architecture is designed to support failure.
Scalability
eMag Issue 11 - April 2014
FACILITATING THE SPREAD OF KNOWLEDGE AND INNOVATION IN ENTERPRISE SOFTWARE DEVELOPMENT
INTERVIEW: ADRIAN COCKCROFT ON HIGH AVAILABILITY,
BEST PRACTICES, AND LESSONS LEARNED IN THE CLOUD
TO EXECUTION PROFILE OR TO MEMORY PROFILE? THAT IS THE QUESTION.
VIRTUAL PANEL: USING JAVA IN LOW LATENCY ENVIRONMENTS
RELIABLE AUTO-SCALING USING FEEDBACK CONTROL
Contents
Interview with Raffi Krikorian on Twitter's Infrastructure
Raffi Krikorian, Vice President of Platform Engineering at Twitter, gives an insight on how
Twitter prepares for unexpected traffic peaks and how system architecture is designed to
support failure.
Interview: Adrian Cockcroft on High Availability,
Best Practices, and Lessons Learned in the Cloud
Netflix is a widely referenced case study for how to effectively operate a cloud application
at scale. While their hyper-resilient approach may not be necessary at most organizations,
Netflix has advanced the conversation about what it means to build modern systems. In this
interview, InfoQ spoke with Adrian Cockcroft, who is the Cloud Architect for the Netflix
platform.
To Execution-Profile or to Memory-Profile?
That Is the Question
There are times when memory profiling will provide a clearer picture than execution
profiling to find execution hot spots. In this article Kirk Pepperdine talks through some
indicators for determining when to use which kind of profiler.
Virtual Panel: Using Java in Low-Latency Environments
Java is increasingly being used for low-latency work where previously C and C++ were the
de facto choice. InfoQ brought together four experts in the field to discuss what is driving the
trend and some of the best practices when using Java in these situations.
Reliable Auto-Scaling using Feedback Control
Philipp K. Janert explains how to reliably auto-scale systems using a reactive approach based
on feedback control, which provides a more accurate solution than deterministic or rule-
based ones.
Interview with Raffi Krikorian on
Twitter's Infrastructure
Twitter's Raffi Krikorian gives insight on how the company prepares for unexpected
traffic peaks and how system architecture is designed to support failure.
by Xuefeng Ding
InfoQ: Hi, Raffi. Would you please introduce yourself
to the audience and the readers of InfoQ?
Raffi: Sure. My name is Raffi Krikorian. I'm the
vice-president of platform engineering at Twitter.
We're the team that runs basically the backend
infrastructure for all of Twitter.
InfoQ: With the help of Castle in the Sky, Twitter
created a new peak tweets-per-second record. How
does Twitter deal with unpredictable peak traffic?
Raffi: What you're referring to is the Castle in the
Sky event, which is what we call it internally. That
was a television show that aired in Tokyo. We set
our new record of around 34,000 tweets a second
coming into Twitter during that event. Normally,
Twitter experiences something on the order of 5,000 to
10,000 tweets a second, so this is pretty far out of
our standard operating bounds. I think it says a few
things about us. I think it says how Twitter reacts to
the world at large, like things happen in the world and
they get reflected on Twitter.
So the way that we end up preparing for something
like this is really years of work beforehand. This type
of event could happen at any time without real notice.
So we do load tests against the Twitter infrastructure.
We'd run those on the order of every month (I don't
know what the exact schedule is these days) and
then we do analyses of every single system at Twitter.
When we build architecture and systems at Twitter,
we look at the performance of all those systems
on a weekly basis to really understand what the
theoretical capacity of the systems looks like,
right now on a per-service basis, and then we try
to understand what the theoretical capacity looks
like overall. From that, we can decide whether we
have the right number of machines in production
at any given time or whether we need to buy more
computers, and we can have a rational conversation
on whether or not the system is operating efficiently.
So if we have certain services, for example, that
can only take half the number of requests a second
as other services, we should look at those and
understand architecturally: are they performing
correctly or do we need to make a change?
So for us, the architecture to get to something like the
Castle in the Sky event is a slow evolutionary process.
We make a change, we see how that change reacts
and how that change behaves in the system, and we
make a decision on the slow-rolling basis of whether
or not this is acceptable to us. We make a tradeoff,
like do we buy more machinery or do we write new
software in order to withstand this?
While we had never experienced an event like Castle
in the Sky before, some of our load tests have pushed
us to those limits already, so we were comfortable
when it happened in real life. We're like, "Yes, it
actually worked."
InfoQ: Are there any emergency plans at Twitter? Do
you practice for unusual times, such as shutting down
some servers or switches?
Raffi: Yeah. We do two different things, basically, as
our emergency planning (maybe three, depending
on how you look at it). Every system is carefully
documented for what would turn it on and what
would turn it off. We have what we call runbooks for
every single system so we understand what we would
do in an emergency. We've already thought through
the different types of failures. We don't believe we've
thought through everything, but we think we've
documented at least the most common ones and we
understand what we need to do.
Two, we're always running tests against production,
so we understand what the system would look like
when we hit it really hard and we can practice. So we
hit it really hard and teams on call might get a page or
something, and we can try to decide whether or not
we do need to do something differently and how to
react to that.
And third, we've taken some inspiration from Netflix.
Netflix has what they call their Chaos Monkey, which
kills machines in production. We have something
similar to that within Twitter that helps make sure
that we didn't accidentally introduce a single point of
failure somewhere. We can randomly kill machines
within the data center and make sure that the service
doesn't see a blip while that's happening.
All this requires us to have excellent transparency
with respect to the success rate of all the different
systems. We have a massive board. It's a glass wall
with all these graphs on it that show us what's going
on within Twitter. And when these events happen,
we can see in an instant whether or not something is
changing, whether it would be traffic to Twitter or a
failure within a data center, so that we can react to it
as quickly as we can.
InfoQ: How do you isolate the broken module in the
system? When something goes wrong, what's your
reaction at the first moment?
Raffi: The way that Twitter is architected these days
is that a failure should stay relatively constrained to
the feature in which the failure occurred. Of course,
the deeper you get down the stack, the bigger the
problem becomes. So if our storage mechanisms
all of a sudden have a problem, a bunch of different
systems would show something going wrong. For
example, if someone made a mistake on the Web site,
it won't affect the API these days.
The way that we know that something is going wrong
again is by being able to see the different graphs
of the system. We have alerts set up over different
thresholds on a service-by-service basis. So, if the
success rate of the API fell below some number, a
bunch of pagers immediately go off; there's always
someone on call for every single service at Twitter
and they can react to that as quickly as they can.
Our operations team and our network command
center will also see this and might try some really
rudimentary things, the equivalent of "should we
turn it off and on again and see what happens?"
Meanwhile, the actual software developers on a
second track try to understand what is going wrong
with the system. So, operations is trying to make sure
the site comes back as quickly as it can while software
development is trying to understand what actually
went wrong and to determine whether we have a bug
that we need to take care of.
So this is how we end up reacting. But, like I said,
the architecture at Twitter keeps failure fairly well
constrained. If we think it's going to propagate or
we think that, for example, the social graph is having
a problem, the social-graph team will then start
immediately notifying everyone else just in case they
should be on alert for something going wrong.
One of our strengths these days, I like to say jokingly,
is emergency management: what we do in a case of
disaster, because it could happen at any time. My
contract with the world is that Twitter will be up so
you don't have to worry about it.
InfoQ: The new architecture helps a lot in stability
and performance. Could you give us a brief
introduction to it?
Raffi: Sure. When I joined Twitter a couple of years
ago, we ran the system on what we call the monolithic
codebase. Everything you had to do with the
software at Twitter was in one codebase that anyone
could deploy, anyone could touch, anyone could
modify. That sounds great. In theory, that's actually
excellent. It means that every developer at Twitter is
empowered to do the right thing.
In practice, however, there's a balancing act.
Developers then need to understand how everything
actually works in order to make a change. And in
practical reality, the concern I would have is that
with the speed at which Twitter is writing new
code, people don't give deep thought to places
they haven't seen before. I think this is standard in
the way developers write software. It's like, "I don't
understand what I fully need to do to make this
change, but if I change just this one line it probably
gets the effect I want." I'm not saying that this is a bad
behavior. It's a prudent and expedient behavior. But
this means that technical debt builds up when you do
that.
So what we've done is we've taken this monolithic
codebase and broken it up into hundreds of different
services that comprise Twitter. This way, we can
have actual real owners for every single piece of
business logic and every single piece of functionality
at Twitter. There's actually a team responsible for
managing photos for Twitter. There's another team
who manages the URLs for Twitter. There are now
subject experts throughout the company, and you
could consult them when you want to make a feature
change that would change something, for example,
how URLs work.
Breaking up the codebase in all these different ways
and having subject-matter experts also allows things
that we've spoken about: isolation for failure and
isolation for feature development. If you want to
change the way tweets work, you only have to change
a certain number of systems. You don't have to
change everything in Twitter anymore, so we can have
good isolation both for failure and for development.
InfoQ: What's the role of Decider in the system?
Raffi: Decider is one of our runtime configuration
mechanisms at Twitter. What I mean by that is
we can turn off features and software in Twitter
without doing a deployment. Every single service at
Twitter is constantly looking to the Decider system
for the current runtime values of Twitter. How
that practically maps is, for example, the Discover
homepage has a Decider value that wraps it, and that
Decider value tells Discover whether it's on or off
right now.
So I can deploy Discover into Twitter and have it
deployed in the state that Decider says it should be in.
We don't get an inconsistent state. The Discover page,
or any feature at Twitter, runs across many machines.
You don't want to get into the inconsistent state where
some of the machines have the feature and some of
them don't. So we can deploy it in the off state using
Decider and then, when it is on all the machines that
we want it to be on, we can turn it on across the data
center by flipping a Decider switch.
This also gives us the ability to do percentage-
based control. I can say that now that it's on all of
the machines, I only want 50% of users to get it. I can
actually make that decision as opposed to it being a
side effect of the way that things are being deployed
in Twitter. This allows us to have runtime control over
Twitter without having to push code. Pushing code is
a dangerous thing; the highest correlation to failure in
a system like ours, not just Twitter but any big system,
is software-development error. This way we can
deploy software in a relatively safe way because it's
off. Turn it on really slowly, purposefully, make sure
it's good, and then ramp it up as fast as I want.
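To illustrate the kind of runtime switch Krikorian describes, here is a rough sketch of a Decider-style feature gate (this is not Twitter's actual Decider code; the class name, value store, and ramp logic are hypothetical). The idea is that every service reads the current values at runtime, so a feature can ship deployed-but-dark and then be ramped from 0% to 100% of users without a code push:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical Decider-style runtime switch. The values would be refreshed
// out of band (for example, polled from a central configuration service),
// so turning a feature off or changing its ramp needs no deployment.
public class FeatureGate {

    // feature name -> percentage of users who should see it (0 = dark, 100 = fully on)
    private final Map<String, Integer> ramp = new ConcurrentHashMap<String, Integer>();

    public void setRamp(String feature, int percentage) {
        ramp.put(feature, percentage);
    }

    // Deterministic per-user decision so a given user gets a stable experience
    // while the feature is partially ramped.
    public boolean isEnabled(String feature, long userId) {
        Integer configured = ramp.get(feature);
        int percentage = (configured == null) ? 0 : configured; // unknown features stay dark
        return Math.floorMod(userId, 100L) < percentage;
    }
}

A caller would wrap the feature's code path in something like if (gate.isEnabled("discover_homepage", userId)) { ... }, which matches the deploy-off, ramp-up-slowly flow described above.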
InfoQ: How does Twitter push code online? Would
you please share the deployment process with us?
For example, how many different stages are there? Do you
choose daily pushes or weekly pushes or both?
Raffi: Twitter deployment, because we have this
services architecture, is up to the control of every
individual team. So the onus is on the team to make
sure that when they're deploying code, everyone
that may be affected by it should know that they're
doing it, and the network control center should also
know what they're doing so they have a global view of
the system. But it's really up to every single team to
decide when and if they want to push.
On average, I would say teams have a bi- or tri-
weekly deployment schedule. Some teams deploy
every single day; some teams only deploy once a
month. But the deployment process looks about the
same to everybody: you deploy into a development
environment. This is so developers can hack on it
quickly, make changes, check in with the product manager
and with the designer, and make sure it does the right
thing. Then we deploy into what we call the Canary
system within Twitter, which means that it's getting
live production traffic but we don't rely on its results
just yet. So it's just basically loading it to make sure it
handles it efficiently, and we can look at the results
that it would have returned and inspect them to make
sure that it did what we thought it would do given live
traffic.
Our testing scenarios may not have covered all the
different edge cases that the live traffic gets, so it's
one way we learn what the real testing scenarios
should look like. After we go into Canary, we deploy
it dark, then we slowly start to ramp it up to really
understand what it looks like at scale. That ramp-
up could take anywhere from a day to a week. We've
had different products that we've ramped to 100%
in the course of a week or two. We've had other
products that we've ramped up to 100% in the course
of minutes.
Again, it's really up to the team. And each team is
responsible for their feature, is responsible for their
service. So it's their call on how they want to do it, but
those stages of development (Canary, dark reading,
ramp-up by Decider) are the pattern that everyone
follows.
InfoQ: There are huge amounts of data in Twitter.
You must have some special infrastructure (such as
Gizzard and Snowflake) and methods to store the
data, and even to process it in real time.
Raffi: That's really two different questions, I
think. There is how we ingest all this data that's
coming into Twitter, because Twitter is a real-time
system with latency for a tweet to get delivered in
milliseconds to Twitter users. And then there's the
second question of what we do with all that data.
For the first one, you're right; we have systems like
Snowflake, Gizzard, and things like that to handle
tweet ingestion. Tweets are only one piece of data
that comes into Twitter, obviously. We have things
like favorites. We have retweets. We have people
sending direct messages. People change their avatar
images, their background images, and things like that.
People click on URLs and load Web pages. These are
all events that are coming into Twitter.
So we begin to ingest all this and log them so we can
do analysis. It's a pretty hard thing. We actually have
different SLAs depending on what kind of data comes
in. Tweets, we measure in milliseconds. In order to get
around database locking, for example, we developed
Snowflake, which can generate unique IDs for us
incredibly quickly and do it decentralized so that we
don't have a single point of failure in generating IDs
for us.
We have Gizzard, which handles data flowing in and
shards it as quickly as possible so that we don't have
hot spots on different clusters in the system. It tries to
probabilistically spread the load so that the amount of
data coming in doesn't overload the databases. Again,
tweets go through the system very fast.
Logs of, for example, people clicking on things or
viewing tweets have their SLA measured in minutes
as opposed to milliseconds. Those go into a completely
different pipeline. Most of it is based on Scribe these
days. So, those slowly trickle through, get aggregated,
get collected, and get dumped into HDFS so we can
analyze them later.
For long-term retention, all of the data, whether it be
real-time or not, ends up in HDFS, and that's where we
run massive MapReduce and Hadoop jobs to really
understand what's going on in the system.
So, we try to achieve a balance of what needs to be
taken care of right now, especially given the onslaught
of data we have, and where we put things, because
this data accumulates very fast. If Twitter
sees 400 million tweets a day and has been running
for a couple of years now, you can imagine the size of
our corpus. HDFS handles all that for us, and we can
run these massive MapReduce jobs that way.
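Snowflake's general scheme is public: a 64-bit, roughly time-ordered ID built from a millisecond timestamp, a worker ID, and a per-millisecond sequence, so each worker can generate IDs without any coordination or database round trip. The sketch below is a simplified illustration of that scheme, not Twitter's production code; the bit widths, epoch, and clock handling are assumptions.

// Simplified Snowflake-style decentralized ID generator.
// Assumed layout: 41 bits of millisecond timestamp, 10 bits of worker ID,
// 12 bits of per-millisecond sequence.
public class SnowflakeStyleIdGenerator {
    private static final long EPOCH = 1288834974657L; // custom epoch (assumed)

    private final long workerId;      // unique per process, 0..1023
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeStyleIdGenerator(long workerId) {
        this.workerId = workerId & 0x3FF;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;   // up to 4,096 IDs per ms per worker
            if (sequence == 0) {                  // sequence exhausted: spin to the next ms
                while (now <= lastTimestamp) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
    }
}

Because no lock on a shared database is involved, ID generation stays fast and has no single point of failure, which is the property the interview highlights.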
InfoQ: Twitter is an amazing place for engineers.
What's the growth path of an engineer at Twitter?
How would one become a successful geek like you?
Raffi: Well, I can't say I'm a successful engineer since
I don't write software anymore. I started at Twitter as
an engineer, and I've risen into this position of running
a lot of engineering these days.
Twitter has a couple of different philosophies and
mentalities around it, but we have a career path for
engineers that basically involves tackling harder and
harder and harder problems. We would like to say
that it doesnt actually matter how well the feature
you built does. In some cases, it does. But we really
like the level of technical thought and technical merit
youve put into the project you work on.
So growth in Twitter is done very much through a
peer-based mechanism. To be promoted from one
level to the next level at Twitter requires consensus.
It requires a bunch of engineers at that higher level to
agree that, yes, youve done the work needed in order
to get to this level at Twitter.
To help with that, managers make sure projects go
to engineers that are looking for big challenges.
Engineers can move between teams. They're not
stuck on the tweet team, for example, or the timeline
team. If an engineer says, "I want to work on the
mobile team because that's interesting. I think
there's career growth for me," my job as a person
that manages a lot of this is to make that possible.
You can do almost whatever you want within Twitter.
I tell engineers what my priorities are in running
engineering and what the company's priorities are in
user growth or money or features to build. And then
engineers should flow to the projects that they think
they can make the biggest impact on.
On top of that, I run a small university within Twitter
that we call Twitter University. It's a group of people
whose whole job is training. For example, if an
engineer wants to join the mobile team but is a back-
end Java developer, we'd say, "Great. We've created
a training class so you can learn Android engineering
or iOS engineering, and you can take a one-week-long
class that will get you to the point where you can
commit to that codebase, and then you can
join that team for real." This gives you a way to sort
of expand your horizons within Twitter and a way to
safely decide whether or not you want to go and try
something new.
We invest in our engineers because, honestly, they're
the backbone of the company. The engineers build the
thing that we all thrive on within Twitter, and that the
world uses, so I give them as many opportunities as I
can in order to try different things and to geek out in
lots of different ways.
ABOUT THE INTERVIEWEE
Raffi Krikorian is vice-president of
platform engineering at Twitter. His
teams manage the business logic, scalable
delivery, APIs, and authentication of
Twitter's application. His group helped
create the iOS 5 Twitter integration as
well as The X Factor Twitter voting
mechanism.
Interview: Adrian Cockcroft on
High Availability, Best Practices, and
Lessons Learned in the Cloud
Netflix is a widely referenced case study for how to effectively operate a cloud
application at scale. While their hyper-resilient approach may not be necessary at
most organizations (and the jury is out on that assumption), Netflix has advanced
the conversation about what it means to build modern systems. InfoQ spoke with
Adrian Cockcroft, who is the cloud architect for the Netflix platform.
by Richard Seroter
InfoQ: What does high availability 101 look like
for a new Netflix engineer? How do they learn best
practices, and what are the main areas of focus?
Cockcroft: We run an internal boot camp every few
months for new engineers. The most recent version
is a mixture of presentations about how everything
works and some hands-on work making sure that
everyone knows how to build code that runs in
the cloud. We use a version of the Netflix OSS RSS
Reader as a demo application.
InfoQ: Are there traditional Web-development
techniques or patterns that you often ask engineers
to forget when working with cloud-scale
distributed systems?
Cockcroft: Sticky session-based programming
doesn't work well, so we make everything request
scoped, and any cross-request information must be
stored in memcached using our EVcache mechanism
(which replicates the data across zones).
InfoQ: You and others at Netflix have spoken at
length about expecting failures in distributed
systems. How do you specifically recommend that
architects build out circuit breakers and employ
other techniques for preventing cascading failures
in systems?
Cockcroft: The problem with dependencies between
services is that it rapidly gets complicated to keep
track of them, and it's important to multi-thread calls
to different dependencies, which gets tricky when
managing nested calls and responses. Our solution to
this is based on the functional reactive pattern that
we've implemented using RxJava, with a backend
circuit-breaker pattern wrapped around each
dependency using Hystrix. To test that everything
works properly under stress, we use Latency Monkey
to inject failures and high latency into dependent
service calls. This makes sure we have the timeouts
and circuit breakers calibrated properly, and
uncovers any unsafe dependencies that are being
called directly, since those can still cause cascading
failures.
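For readers unfamiliar with the pattern, wrapping a dependency call in a Hystrix command looks roughly like the sketch below (an illustrative example, not Netflix code; the group name and the body of run() are placeholders). The command runs with a timeout on a Hystrix-managed thread pool, and when the error rate for the group gets high enough the circuit opens, so callers receive the fallback instead of piling onto a failing dependency.

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Minimal Hystrix command wrapping a single remote dependency call.
public class UserProfileCommand extends HystrixCommand<String> {
    private final long userId;

    public UserProfileCommand(long userId) {
        super(HystrixCommandGroupKey.Factory.asKey("UserProfileService"));
        this.userId = userId;
    }

    @Override
    protected String run() throws Exception {
        // In a real service this would be the network call to the dependency;
        // Hystrix executes it on its own thread pool with a timeout.
        return "profile-for-" + userId;
    }

    @Override
    protected String getFallback() {
        // Returned when run() throws, times out, or the circuit is open,
        // so a broken dependency degrades gracefully instead of cascading.
        return "default-profile";
    }
}

A caller simply invokes new UserProfileCommand(42L).execute(), or uses queue()/observe() for asynchronous and reactive use.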
InfoQ: Netflix OSS projects cover a wide range of
services including application deployment, billing,
and more. Which of these projects do you consider
most indispensable to your team at Netflix, and
why?
Cockcroft: One of our most powerful mechanisms,
and a somewhat overlooked Netflix OSS project, is
the Zuul gateway service. This acts as an intelligent
routing layer, which we can use for many purposes:
handling authentication; geographic and content-
aware routing; scatter/gather of underlying services
into a consistent external API; etc. It's dynamically
programmable and can be reconfigured in seconds.
In order to route traffic to our Zuul gateways, we
need to be able to manage a large number of DNS
endpoints with ensemble operations. We've built
the Denominator library to abstract away multiple
DNS-vendor interfaces to provide the same high
levels of functionality. We have found many bugs and
architectural problems in the commonly used DNS-
vendor-specific APIs, so as a side effect we have been
helping fix DNS management in general.
InfoQ: Frameworks often provide a useful
abstraction on top of complex technology. However,
are there cases where an abstraction shields
developers from truly understanding something
more complex, but useful?
Cockcroft: Garbage collection lets developers
forget about how much memory they are using and
consuming. While it helps them quickly write code,
the sheer volume of garbage and the number of times
data is copied from one memory location to another
is not usually well understood. While there are some
tools to help (we open-sourced our JVM GCviz tool),
it's a common blind spot. The tuning parameters for
setting up heaps and garbage-collection options are
confusing and are often set poorly.
InfoQ: Netflix is a big user of Cassandra, but is there
any aspect of the public-facing Netflix system that
uses a relational database? How do you think that
modern applications should decide between NoSQL
and relational databases?
Cockcroft: The old Netflix DVD-shipping service
still runs on the old code base on top of a few large
Oracle databases. The streaming service has all
its customer-request-facing services running on
Cassandra, but we do use MySQL for some of our
internal tools and non-customer-facing systems
such as the processes that we use to ingest metadata
about new content. If you want to scale and be
highly available, use NoSQL. If you are doing rapid
continuous delivery of functionality, you will
eventually want to denormalize your data model and
give each team its own data store so they can iterate
their data models independently. At that point, most
of the value of a unified relational schema is gone
anyway.
InfoQ: Can you give us an example of something
at Netflix that didn't work because it was too
sophisticated and made you opt for a simpler
approach?
Cockcroft: There have been cases where teams
decided that they wanted to maintain strong
consistency, so they invented complex schemes that
they thought would also keep their services available,
but this tends to end up with a lot of downtime, and
eventually a much simpler and more highly available
model takes over. There is less of a consistency guarantee
with the replacement, and perhaps we had to build
a data-checking process to fix things up after the
event if anything went wrong. A lot of Netflix outages
around two years ago were due to an attempt to keep
a datacenter system consistent with the cloud, and
cutting the tie to the datacenter so that Cassandra
in the cloud became the master copy made a big
difference.
InfoQ: How about something you built at Netflix
that failed because it was too simple?
Cockcroft: Some groups use Linux load average as a
metric to tell if their instances are overloaded. They
then want to use this as an input to autoscaling. I
don't like this because load average is time-decay
weighted, so it's slow to respond, and it's non-linear,
so it tends to make autoscaler rules over-react. As
a simple rule, total (user+system) CPU utilization
is a much better metric, but it can still react too
slowly. We're experimenting with more sophisticated
algorithms that have a lot more inputs, and hope
to have a Netflix Tech Blog post on this issue fairly
soon (keep watching http://techblog.netflix.com for
technology discussion and open-source project
announcements).
InfoQ: How do you recommend that developers
(at Netflix and other places) set up appropriate
sandboxes to test their solutions at scale? Do you
use the same production-level deployment tools
to push to developer environments? Should each
developer get their own?
Cockcroft: Our build system delivers into a test
AWS account that contains a complete running set
of Netflix services. We automatically refresh test
databases from production backups every weekend
(overwriting the old test data). We have multiple
stacks or tagged versions of specific services that
are being worked on, and ways to direct traffic by
tags are built into Asgard, our deployment tool. There's
a complete integration stack that is intended to be
close to production availability but reflect the next
version of all the services. Each developer has their
own tagged stack of things they are working on,
which others will ignore by default, but they share
the common versions. We re-bake AMIs from a test
account to push to the production account with a few
changes to environment variables. There is no tooling
support to build an AMI directly for the production
account without launching it in test first.
InfoQ: Given the size and breadth of the Netflix
cloud deployment, how and when do you handle
tuning and selection of the ideal AWS instance size
for a given service? Do you run basic performance
profiling on a service to see if it's memory-bound,
I/O-bound, or CPU-bound, and then choose the right
type of instance? At what stage of the service's
lifecycle do you make these assessments?
Cockcroft: Instance type is chosen primarily based
on memory need. We're gradually transitioning
where possible from the m2 family of instances to
the m3 family, which has a more modern CPU base
(Intel E5 Sandy Bridge) that runs Java code better.
We then run enough instances to get the CPU we
need. The only instances that are I/O intensive are
Cassandra, and we use the hi1.4xlarge for most of
them. We've built a tool to measure how efficiently
we use instances, and it points out the cases where a
team is running more instances than they need.
ABOUT THE INTERVIEWEE
Adrian Cockcroft has had a long career
working at the leading edge of technology.
Before joining Battery in 2013, Adrian
helped lead Netflix's migration to a
large-scale, highly available public-cloud
architecture and the open sourcing of
the cloud-native NetflixOSS platform.
Prior to that at Netflix he managed a team
working on personalization algorithms
and service-oriented refactoring. He
graduated from The City University,
London, with a BSc in Applied Physics and
Electronics, and was named one of the
top leaders in Cloud Computing in 2011
and 2012 by SearchCloudComputing
magazine. He can usually be found on
Twitter @adrianco.
To Execution-Profile or to Memory-
Profile? That Is the Question
by Kirk Pepperdine
I recently had a group of developers troubleshoot a
problem-riddled application from my performance
workshop. After dispensing with a couple of easy wins,
the group was faced with a CPU that was running
very hot. The group reacted in exactly the same
way that I see most teams do when faced with a hot
CPU; they fired up an execution profiler hoping that
it would help them sort things out. In this particular
case, the problem was related to how the application
was burning through memory. Now, while an
execution profiler can find these problems, memory
profilers will paint a much clearer picture. My group
had somehow missed a key metric that was telling
them that they should have been using a memory
profiler. Let's run through a similar exercise here so
that we can see when and why it is better to use a
memory profiler.
Profilers work by either sampling the top of the stack
or instrumenting the code with probes, or a
combination of both. These techniques are good
at finding computations that happen frequently
or take a long time. As my group experienced, the
information gathered by execution profilers often
correlates well with the source of the memory
inefficiency. However, it points to an execution
problem, which can sometimes be confusing.
The code found in Listing 1 defines the method
API findCustomer(String,String). The problem here
isn't so much in the API itself but more in how the
method treats the String parameters. The code
concatenates the two strings to form a key that is
used to look up the data in a map. This misuse of
strings is a code smell in that it indicates that there
is a missing abstraction. As we will see, that missing
abstraction is not only at the root of the performance
problem, but adding it also improves the readability
of the code. In this case, the missing abstraction is a
CompositeKey<String,String>, a class that wraps the
two strings and implements both the equals(Object)
and hashCode() methods.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CustomerList {
    private final Map customers = new ConcurrentHashMap();

    public Customer addCustomer(String firstName, String lastName) {
        Customer person = new Customer(firstName, lastName);
        customers.put(firstName + lastName, person);
        return person;
    }

    public Customer findCustomer(String firstName, String lastName) {
        return (Customer) customers.get(firstName + lastName);
    }
}

Listing 1. Source for CustomerList
Another downside to the style of API used in this
example is that it will limit scalability because of
the amount of data the CPU is required to write to
memory. In addition to the extra work to create the
data, the volume of data being written to memory by
the CPU creates a back pressure that will force the
CPU to slow down. Though this benchmark is artificial in
how it presents the problem, it's a problem that is
not so uncommon in applications using the popular
logging frameworks. That said, don't be fooled into
thinking only String concatenation can be at fault.
Memory pressure can be created by any application
that is churning through memory, regardless of the
underlying data structure.
The easiest way to determine if our application
is burning through memory is to examine the
garbage-collection (GC) logs. GC logs report on
heap occupancy before and after each collection.
Subtracting the occupancy after the previous collection
from the occupancy before the current collection
yields the amount of memory allocated between
collections. If we do this for many records, we can
get a pretty clear picture of the application's memory
needs. Moreover, getting the needed GC log is both
cheap and, with the exception of a couple of edge
cases, will have no impact on the performance of
your application. I used the flags -Xloggc:gc.log
and -XX:+PrintGCDetails to create a GC log with a
sufficient level of detail. I then loaded the GC log file
into Censum, jClarity's GC-log analysis tool.
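To make the arithmetic concrete, the sketch below (not part of Censum or this article's tooling; the record fields are assumptions) shows how allocation between collections, and from it the allocation rate, fall out of the occupancy figures a GC log reports:

// One record per GC log entry.
class GcRecord {
    final double timestampSec;    // when the collection happened
    final long occupancyBeforeKB; // heap occupancy before the collection
    final long occupancyAfterKB;  // heap occupancy after the collection

    GcRecord(double timestampSec, long occupancyBeforeKB, long occupancyAfterKB) {
        this.timestampSec = timestampSec;
        this.occupancyBeforeKB = occupancyBeforeKB;
        this.occupancyAfterKB = occupancyAfterKB;
    }
}

class AllocationRate {
    // Memory allocated between two consecutive collections is the occupancy
    // just before this collection minus what was left after the previous one.
    static double kbPerSecond(GcRecord previous, GcRecord current) {
        long allocatedKB = current.occupancyBeforeKB - previous.occupancyAfterKB;
        double elapsedSec = current.timestampSec - previous.timestampSec;
        return allocatedKB / elapsedSec;
    }
}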
Table 1. Summary of garbage-collection activity
Censum provides a whole host of statistics (see
Table 1), of which we're interested in the Collection
Type Breakdown (at the bottom). The % Paused
column (the sixth column in Table 1) tells us that the
total time paused for GC was 0.86%. In general, we'd
like GC pause time to be less than 5%, which it is.
This number suggests that the collectors are able to
reclaim memory without too much effort. Keep in
mind, however, that when it comes to performance, a
single measure rarely tells you the whole story. In this
case, we need to see the allocation rates, and in Chart
1 we can see just that.
Chart 1. Allocation rates
In this chart, we can see that the allocation rates
initially start out at about 2.75 GB per second. The
laptop that I used to run this benchmark under
ideal conditions can sustain an allocation rate of
about 4 GB per second. Thus this value of 2.75
GB/s represents a significant portion of the total
memory bandwidth. In fact, the machine is not able
to sustain this rate, as is evidenced by the drop over
time in allocation rates. While your production
servers may have a larger capacity to consume
memory, it is my experience that any machine trying
to maintain object-creation rates greater than 500
MB per second will spend a significant amount of
time allocating memory. It will also have a very
limited ability to scale. Since memory efficiency is the
overriding bottleneck in our application, the biggest
wins will come from making it more memory efficient.
Execution Profiling
It should go without saying that if we're looking
to improve memory efficiency, we should be using
a memory profiler. However, when faced with a
hot CPU, our group decided that they should use
execution profiling, so let's start with that and see
where it leads. I used the NetBeans profiler running in
VisualVM in its default configuration to produce the
profile in Chart 2.
Chart 2. Execution profile
Contents
Page 14
Scalability / eMag Issue 11 - April 2014
Looking at the chart, we can see that outside of the
Worker.run() method, most of the time is spent in
CustomerList.findCustomer(String,String). If the
source code were a bit more complex, you could
imagine it being difficult to understand why the
code is a problem or what you should do to improve
performance. Let's contrast this view with the one
presented by memory profiling.
Memory Profiling
Ideally, I would like my memory profiler to show me
how much memory is being consumed and how many
objects are being created. I would also like to know
the causal execution paths, that is, the path through
the source code that is responsible for churning
through memory. I can get these statistics using the
NetBeans profiler, once again running in VisualVM.
However, I will need to configure the profiler to collect
allocation stack traces. This configuration can be seen
in Figure 1.
Figure 1. Configuring the NetBeans memory profiler
Note that the profiler will not collect data for every
allocation but only for every 10th allocation. Sampling
in this manner should produce the same result as if
you were capturing data from every allocation but
with much less overhead. The resulting profile is
shown in Chart 3.
Chart 3. Memory profile
The chart identifies char[] as the most popular object.
Having this information, the next step is to take a
snapshot and then look at the allocation stack traces
for char[]. The snapshot can be seen in Chart 4.
Chart 4. char[] allocation stack traces
The chart shows three major sources of char[]
creation, of which one is opened up so that you can see
the details. In all three cases, the root can be traced
back to the firstName + lastName operation.
It was at this point that the group tried to come up
with numerous alternatives. However, none of the
proposed solutions were as efficient as the code
produced by the compiler. It was clear that to have
the application run faster, we were going to have
to eliminate the concatenation. The solution that
eventually solved the problem was to introduce a Pair
class that took the first and last name as arguments.
We called this class CompositeKey as it introduced
the missing abstraction. The improved code can be
seen in Listing 2.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CustomerList {
    private final Map customers = new ConcurrentHashMap();

    public Customer addCustomer(String firstName, String lastName) {
        Customer person = new Customer(firstName, lastName);
        customers.put(new CompositeKey(firstName, lastName), person);
        return person;
    }

    public Customer findCustomer(String firstName, String lastName) {
        return (Customer) customers.get(new CompositeKey(firstName, lastName));
    }
}
Listing 2. Improved implementation using CompositeKey
abstraction
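The article does not show CompositeKey itself; a minimal version consistent with the description (an immutable pair with equals(Object) and hashCode() defined over the two strings) might look like this sketch:

import java.util.Objects;

// Immutable key wrapping the two strings, so no concatenation (and none of
// the resulting char[] churn) is needed to build a map key.
public final class CompositeKey {
    private final String firstName;
    private final String lastName;

    public CompositeKey(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof CompositeKey)) return false;
        CompositeKey that = (CompositeKey) other;
        return firstName.equals(that.firstName) && lastName.equals(that.lastName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(firstName, lastName);
    }
}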
Contents
Page 15
Scalability / eMag Issue 11 - April 2014
CompositeKey implemented both hashCode() and
equals(), thus eliminating the need to concatenate
the strings together. While the first benchmark
completed in ~63 seconds, the improved version
ran in ~21 seconds, a threefold improvement. The
garbage collector ran four times, making it impossible
to get an accurate picture, but the application
consumed in aggregate just under 3 GB of data as
opposed to the more than 141 GB consumed by the
first implementation.
Two Ways to Fill a Water Tower
A colleague of mine once said that you can fill a
water tower one teaspoon at a time. This example
proves that you certainly can. However, it's not the
only way to fill the tower; you could also run a large
hose to fill it very quickly. In those cases, it's unlikely
that an execution profiler would pick up on the
problem. However, the garbage collector will see the
allocation and the recovery, and certainly the memory
profiler will see the allocation in sheer byte count.
In one application where these large allocations
predominated, the development team had exhausted
the vast majority of the gains they were going to get
by using an execution profiler, yet they still needed to
squeeze more out of the app. At that point, we turned
on the memory profiler and it exposed one allocation
hotspot after another, and with that information
we were able to extract a number of significant
performance gains. What that team learned is that
not only was memory profiling giving them the
right view, it was giving them the only view into the
problem. This is not to say that execution profiling
isn't productive. What it is saying is that sometimes
it's not able to tell you where your application is
spending all of its time, and in those cases, getting a
different perspective on the problem can make all the
difference in the world.
ABOUT THE AUTHOR
Kirk Pepperdine has worked in high-
performance and distributed computing
for nearly 20 years. Since 1998, Kirk has
been working on all aspects of performance
and tuning in each phase of a project
life cycle. In 2005, he helped author
the foremost Java-performance tuning
workshop, which has been presented to
hundreds of developers worldwide.
Author, speaker, consultant, Kirk was
recognized in 2006 as a Java Champion
for his contributions to the Java
community. He was the first non-Sun
employee to present a technical lab at
JavaOne, an achievement that opened
the opportunity for others in the industry
to do so. He was named a JavaOne
Rockstar in 2011 and 2012 for his talks
on garbage collection. You can reach him
by e-mail at kirk@kodewerk.com or on
Twitter @kcpeppe.
Virtual Panel: Using Java in
Low-Latency Environments
by Charles Humble

Java is increasingly being used for low-latency work where previously C and C++
were the de facto choices.
InfoQ brought together four experts in the field to discuss what is driving the trend
and some of the best practices when using Java in these situations.
The Participants
Peter Lawrey is a Java consultant interested
in low-latency and high-throughput systems.
He has worked for a number of hedge funds,
trading firms, and investment banks.
Martin Thompson is a high-performance and
low-latency specialist, with over two decades
working with large-scale transactional and
big-data systems, in the automotive, gaming,
financial, mobile, and content-management
domains.
Todd L. Montgomery is vice-president of
architecture for Informatica Ultra Messaging
and the chief designer and implementer of
the 29West low-latency messaging products.
Andy Piper recently joined Push Technology
as chief technology officer, from Oracle.
The Questions
Q1: What do we mean by low latency? Is it the
same thing as real time? How does it relate to high-
performance code in general?
Lawrey: A system with a measured latency
requirement that is too fast to see. This could be
anywhere from 100 ns to 100 ms.
Montgomery: Real time and low latency can be
quite different. The majority view on real time would
be determinism over pure speed with very closely
controlled, or even bounded, outliers. However, low
latency typically implies that pure speed is given
much higher priority and some outliers may be,
however slightly, more tolerable. This is certainly
the case when thinking about hard real time. One of
the key prerequisites for low latency is a keen eye
for efficiency. From a system view, this efficiency
must permeate the entire application stack, the
OS, and the network. This means that low-latency
systems have to have a high degree of mechanical
sympathy to all those components. In addition, many
of the techniques that have emerged in low-latency
systems over the last several years have come from
high-performance techniques in OSs, languages,
Contents
Page 17
Scalability / eMag Issue 11 - April 2014
VMs, protocols, other system-development areas,
and even hardware design.
Thompson: Performance is about two things:
throughput, i.e. units per second, and response time,
otherwise known as latency. It is important to define
the units and not just say something should be fast.
Real time has a very specific definition and is often
misused. Real time is to do with systems that have
a real-time constraint from input event to response
time regardless of system load. In a hard real-time
system, if this constraint is not honored then a total
system failure can occur. Good examples are heart
pacemakers or missile-control systems.
With trading systems, real time tends to have a
different meaning in that the system must have high
throughput and react as quickly as possible to an
event, which can be considered low latency. Missing
a trading opportunity is typically not a total system
failure, so you cannot really call this real time.
A good trading system will have a high quality of
execution, for which one aspect is to have a low-
latency response with little deviation in response
time.
Piper: Latency is simply the delay between decision
and action. In the context of high-performance
computing, low latency has typically meant that
transmission delays across a network are low or
that the overall delays from request to response
are low. What defines "low" depends on the context:
low latency over the Internet might be 200 ms,
whereas low latency in a trading application might
be 2 µs. Technically, low latency is not the same
as real time: low latency typically is measured as
percentiles where the outliers (situations in which
latency has not been low) are extremely important
to know about. With real time, guarantees are made
about the behavior of the system so, instead of
measuring percentile delays, you are enforcing a
maximum delay. You can see how a real-time system
is also likely to be a low-latency system, whereas the
converse is not necessarily true. Today, however, the
notion of enforcement is gradually being lost so that
many people now use the terms interchangeably.
If latency is the overall delay from request to
response, then it is obvious that many things
contribute to this delay: CPU, network, OS,
application, even the laws of physics! Thus low-
latency systems typically require high-performance
code so that software elements of latency can be
reduced.
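As a small illustration of the percentile view described above (a generic sketch, not code from any of the panelists' systems), latencies are typically recorded per request and reported at the tail rather than as an average:

import java.util.Arrays;

// Records request latencies and reports tail percentiles, which matter far
// more than the mean in a low-latency system.
public class LatencyPercentiles {
    private final long[] samplesNanos;
    private int count;

    public LatencyPercentiles(int capacity) {
        samplesNanos = new long[capacity];
    }

    public void record(long latencyNanos) {
        if (count < samplesNanos.length) {
            samplesNanos[count++] = latencyNanos;
        }
    }

    // percentile(0.99) returns the latency that 99% of recorded requests beat.
    public long percentile(double p) {
        if (count == 0) {
            return 0L;
        }
        long[] sorted = Arrays.copyOf(samplesNanos, count);
        Arrays.sort(sorted);
        int index = (int) Math.ceil(p * count) - 1;
        return sorted[Math.max(0, index)];
    }
}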
Q2: Some of the often-cited advantages of using
Java in other situations include access to the rich
collection of libraries, frameworks, application
servers, and so on, and also the large number of
available programmers. Do these advantages apply
when working on low-latency code? If not, what
advantages does Java have over C++?
Lawrey: If your application spends 90% of the time
in 10% of your code, Java makes optimizing that 10%
harder, but writing and maintaining 90% of your code
easier, especially for teams of mixed ability.
Montgomery: In the capital markets, especially
algorithmic trading, there are a number of factors
that come into play. Often, the faster an algorithm
can be put into the market, the more advantage
it has. Many algorithms have a shelf life, and
quicker time to market is key in taking advantage
of that. With the community around Java and the
options available, it can definitely be a competitive
advantage, as opposed to C or C++ where the options
may not be as broad for the use case. Sometimes,
though, pure low latency can rule out other concerns.
I think the current difference in performance
between Java and C++ is so close that it's not a
black-and-white decision based solely on speed.
Improvements in GC techniques, JIT optimizations,
and managed runtimes have made traditional Java
weaknesses with respect to performance into some
very compelling strengths that are not easy to ignore.
Thompson: Low-latency systems written in Java
tend to not use third-party or even standard libraries,
for two major reasons. Firstly, many libraries have
not been written with performance in mind and often
do not have sufficient throughput or response time.
Secondly, they tend to use locks when concurrent,
and they generate a lot of garbage. Both of these
contribute to highly variable response times, due to
lock contention and garbage collection respectively.
Java has some of the best tooling support of any
language, which results in significant productivity
gains. Time to market is often a key requirement
when building trading systems, and Java can often
get you there sooner.
Piper: In many ways the reverse is true: writing good
low-latency code in Java is relatively hard since
the developer is insulated from the guarantees of
the hardware by the JVM itself. The good news is
that this is changing. Not only are JVMs constantly
getting faster and more predictable, but developers
are now able to take advantage of hardware
guarantees through a detailed understanding of
the way that Java works, in particular the Java
memory model, and how it maps to the underlying
hardware. (Indeed, Java was the first popular
language to provide a comprehensive memory
model that programmers could rely on. C++ only
provided one later on.) A good example is the lock-
free, wait-free techniques that Martin Thompson
has been promoting and that our company, Push,
has adopted into its own development with great
success. Furthermore, as these techniques become
more mainstream, we are starting to see their
uptake in standard libraries (e.g. the Disruptor) so
that developers can adopt the techniques without
needing such a detailed understanding of the
underlying behavior.
Even without these techniques, the safety
advantages of Java (memory management, thread
management, etc.) can often outweigh the perceived
performance advantages of C++, and of course JVM
vendors have claimed for some time that modern
JVMs are often faster than custom C++ code because
of the holistic optimizations that they can apply
across an application.
Q3. How does the JVM support concurrent
programs?
Lawrey: Java has had built-in multi-threading
support from the start and high-level concurrency
support as standard for almost 10 years.
Montgomery: The JVM is a great platform for
concurrent programs. The memory model allows a
consistent model for developers to utilize lock-free
techniques across hardware, which is a great plus for
getting the most out of the hardware by applications.
Lock-free and wait-free techniques are great for
creating efficient data structures, something we very
desperately need in the development community. In
addition, some of the standard library constructs for
concurrency are quite handy and can make for more
resilient applications. With C++11, certain specifics
aside, Java is not the only one with access to a lot of
these constructs. And the C++11 memory model is a
great leap forward for developers.
Thompson: Java (1.5) was the first major language
to have a specified memory model. A language-
level memory model allows programmers to reason
about concurrent code at an abstraction above the
hardware. This is critically important, as hardware
and compilers will aggressively reorder our code to
gain performance, which has visibility issues across
threads. With Java, it is possible to write lock-free
algorithms that, when done well, can provide some
pretty amazing throughput at low and predictable
latencies. Java also has rich support for locks.
However, when locks are contended, the operating
system must get involved as an arbitrator with huge
performance costs. The latency difference between
a contended and uncontended lock is typically three
orders of magnitude.
Piper: Support for concurrent programs in Java starts
with the Java Language Specification itself: the
JLS describes many Java primitives and constructs
that support concurrency. At a basic level, this is
the java.lang.Thread class for the creation and
management of threads and the synchronized
keyword for the mediation of access to shared
resources from different threads. On top of this,
Java provides a whole package of data structures
optimized for concurrent programs
(java.util.concurrent), from concurrent hash tables to task
schedulers to different lock types. One of the biggest
areas of support, however, is the Java memory model
(JMM) that was incorporated into the JLS as part
of JDK 5. This provides guarantees around what
developers can expect when dealing with multiple
threads and their interactions. These guarantees
have made it much easier to write high-performance,
thread-safe code. In the development of Diffusion,
we rely very heavily on the JMM in order to achieve
the best possible performance.
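As a trivial illustration of the lock-free style these answers refer to (a generic example, not code from any of the panelists' products), a counter can be built on java.util.concurrent.atomic with a compare-and-set retry loop instead of a lock; the Java memory model is what guarantees the update is visible to other threads:

import java.util.concurrent.atomic.AtomicLong;

// Lock-free counter: threads never block; a thread that loses the race
// simply retries its compare-and-set.
public class EventCounter {
    private final AtomicLong count = new AtomicLong();

    public long increment() {
        while (true) {
            long current = count.get();
            long next = current + 1;
            // Succeeds only if no other thread updated the value in between.
            if (count.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    public long current() {
        return count.get();
    }
}

In practice one would just call AtomicLong.incrementAndGet(); the explicit loop shows the retry pattern that more elaborate lock-free data structures generalize.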
Q4. Ignoring garbage collection for a moment,
what other Java-specific techniques (things that
wouldn't apply if you were using C++) are there for
writing low-latency code? I'm thinking here about
things like warming up the JVM, getting all your
classes into permgen to avoid I/O, Java-specific
techniques for avoiding cache misses, and so on.
Lawrey: Java allows you to write, test, and profile
your application with limited resources more
effectively. This gives you more time to ensure you
cover the entire big picture. I have seen many C/
C++ projects spend a lot of time drilling down to the
low level and still end up with longer latencies end
to end.
Montgomery: That is kind of tough. The only obvious
one would be warm-up for JVMs to do appropriate
optimizations. However, some of the class and
method-call optimizations that can be done via
class-hierarchy analysis at runtime are not possible
currently in C++. Most other techniques can also
be done in C++ or, in some cases, don't need to be
done. Low-latency techniques in any language are
often about what you don't do, and that can have the biggest
impact. In Java, there are a handful of things to avoid
that can have undesirable side effects for low-latency
applications. One is the use of specific APIs, such as
the Reflection API. Thankfully, there are often better
choices for how to achieve the same end result.
Thompson: You mention most of the issues in your
question. :-) Basically, Java must be warmed up to
get the runtime to a steady state. Once in this steady
state, Java can be as fast as native languages and in
some cases faster. One big Achilles heel for Java is
lack of memory-layout control. A cache miss is a lost
opportunity to have executed ~500 instructions on
a modern processor. To avoid cache misses, we need
control of memory layout and then we must access
it in a predictable fashion to avoid cache misses. To
get this level of control, and reduce GC pressure, we
often have to create data structures in direct byte
buffers or go off heap and use Unsafe. Both of these
allow for the precise layout of data structures. This
need could be removed if Java introduced support
for arrays of structures. This does not need to be a
language change and could be introduced by some
new intrinsics.
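To make the off-heap idea concrete, here is a small sketch (a generic illustration with assumed field offsets, not code from any of the panelists) of a flyweight that lays fixed-size records out contiguously in a direct ByteBuffer, so iteration is cache-friendly and the records never become garbage-collected objects:

import java.nio.ByteBuffer;

// A fixed-size "struct" of (long id, double price) packed into off-heap memory.
// One flyweight instance is repositioned over each record instead of
// allocating an object per record.
public class TradeFlyweight {
    private static final int ID_OFFSET = 0;
    private static final int PRICE_OFFSET = 8;
    public static final int RECORD_SIZE = 16;

    private final ByteBuffer buffer;
    private int recordStart;

    public TradeFlyweight(int maxRecords) {
        // allocateDirect keeps the data out of the Java heap, so it adds no GC pressure.
        buffer = ByteBuffer.allocateDirect(maxRecords * RECORD_SIZE);
    }

    public TradeFlyweight wrap(int index) {
        recordStart = index * RECORD_SIZE;
        return this;
    }

    public void put(long id, double price) {
        buffer.putLong(recordStart + ID_OFFSET, id);
        buffer.putDouble(recordStart + PRICE_OFFSET, price);
    }

    public long id() {
        return buffer.getLong(recordStart + ID_OFFSET);
    }

    public double price() {
        return buffer.getDouble(recordStart + PRICE_OFFSET);
    }
}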
Piper: The question seems to be based on false
premises. At the end of the day, writing a low-latency
program is very similar to writing other programs
where performance is a concern; the input is code
provided by a developer (whether C++ or Java),
which executes on a hardware platform with some
level of indirection in between (e.g. through the
JVM or through libraries, compiler optimizers,
etc. in C++). The fact that the specifics vary makes
little difference. This is essentially an exercise in
optimization and the rules of optimization are, as
always:
1. Don't.
2. Don't yet (for experts only).
And if that does not get you where you need to be:
1. See if you actually need to speed it up.
2. Profile the code to see where it's actually spending
its time.
3. Focus on the few high-payoff areas and leave the
rest alone.
Now, of course, the tools you would use to achieve
this and the potential hotspots might be different
between Java and C++, but thats just because they
are different. Granted, you might need to understand
in a little more detail than your average Java
programmer would what is going on but the same
is true for C++, and, of course, by using Java there
are many things you dont need to understand so
well because they are adequately catered for by the
runtime. In terms of the types of things that might
need optimizing these are the usual suspects of
code paths, data structures, and locks. In Diffusion,
we have adopted a benchmark-driven approach
where we are constantly profiling our application and
looking for optimization opportunities.
Q5. How has managing GC behavior affected the
way people code for low latency in Java?
Lawrey: There are different solutions for different
situations. My preferred solution is to produce so
little garbage that it no longer matters. You can cut
your GCs to less than once a day.
At this point, the real reason to reduce garbage is
to ensure you are not filling your CPU caches with
garbage. Reducing the garbage you are producing
can improve the performance of your code by two to
five times.
Montgomery: Most of the low-latency systems
I have seen in Java have gone to great lengths to
minimize or even try to eliminate the generation
of garbage. As an example, avoiding the use of
Strings altogether is not uncommon. Informatica
Ultra Messaging (UM) itself has provided specific
Java methods to cater to the needs of many users
with respect to object reuse and avoiding some
usage patterns. If I had to guess, the most common
implication has been the prevalent use of object
reuse. This pattern has also influenced many other
non-low-latency libraries such as Hadoop. It's a
common technique now within the community to
provide options or methods for users of an API or
framework to utilize them in a low or zero garbage
manner.
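As a small illustration of that object-reuse pattern (a sketch only; the Quote type and nextQuoteInto method are hypothetical and not part of any product mentioned here), the caller supplies a mutable holder that the API fills in, so the hot path allocates nothing:

// Mutable holder reused across calls instead of a new result object per message.
final class Quote {
    long instrumentId;
    long priceMantissa;   // fixed-point price avoids allocating BigDecimal
    long timestampNanos;
}

interface QuoteSource {
    // fills the supplied instance rather than returning a freshly allocated one
    boolean nextQuoteInto(Quote target);
}

final class HotLoop {
    static void drain(QuoteSource source) {
        Quote scratch = new Quote();   // allocated once, reused for every message
        while (source.nextQuoteInto(scratch)) {
            // process scratch here without retaining a reference to it
        }
    }
}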
In addition to the effect on coding practices, there is
also an operational impact for low-latency systems.
Many systems will take some, shall we say, creative
control of GC. It's not uncommon to only allow GC
to occur at specific times of the day. The implications
on application design and operational requirements
are a major factor in controlling outliers and gaining
more determinism.
Thompson: Object pools are employed or, as
mentioned in the previous response, most data
structures need to be managed in byte buffers or off
heap. This results in a C style of programming in Java.
If we had a truly concurrent garbage collector then
this could be avoided.
Piper: How long is a piece of java.lang.String?
Sorry, I'm being facetious. The truth is that some
of the biggest changes to GC behaviour have come
about through JVM improvements rather than
through programmers' individual coding decisions.
HotSpot, for instance, has come an incredibly long
way from the early days when you could measure
GC pauses in minutes. Many of these changes have
been driven by competition: it used to be that
BEA JRockit behaved far better than HotSpot from
a latency perspective, creating much lower jitter.
These days, however, Oracle is merging the JRockit
and HotSpot codebases precisely because the gap
has narrowed so much. Similar improvements have
been seen in other, more modern, JVMs such as
Azul's Zing, and in many cases developer attempts
to improve GC behavior have actually had no net
benefit or made things worse.
However, that's not to say that there aren't things
that developers can do to manage GC, for instance
by reducing object allocations through either pooling
or using off-heap storage to limit memory churn. It's
still worth bearing in mind, however, that these are
problems that JVM developers are also very focused
on, so it still may well be either not necessary to do
anything at all or easier to simply buy a commercial
JVM. The worst thing you can do is prematurely
optimize this area of your applications without
knowing whether it is actually a problem or not,
since these kinds of techniques increase application
complexity through the bypass of a very useful Java
feature (GC) and therefore can be hard to maintain.
Q6. When analyzing low-latency applications,
are there any common causes or patterns you see
behind spikes or outliers in performance?
Lawrey: Waiting for I/O of some type. CPU
instruction or data-cache disturbances. Context
switches.
Montgomery: In Java, GC pauses are beginning to be
well understood and, thankfully, we have better GCs
that are available. System effects are common for all
languages though. OS scheduling delay is one of the
many causes behind spikes. Sometimes it is the direct
delay and sometimes it is a knock-on effect caused by
the delay that is the real killer. Some OSs are better
than others when it comes to scheduling under heavy
load. Surprisingly, for many developers the impact
that poor application choices can make on scheduling
is something that often comes as a surprise and is
often hard to debug sufficiently. On a related note
is the delay inherent from I/O and contention that
I/O can cause on some systems. A good assumption
to make is that any I/O call may block and will block
at some point. Thinking through the implications
inherent in that is very often key. And remember,
network calls are I/O.
There are a number of network-specifc causes for
poor performance to cover as well. Let me list the key
items to consider.
• Networks take time to traverse. In WAN environments, the time it takes to propagate data across the network is non-trivial.
• Ethernet networks are not reliable; it is the protocols on them that provide reliability.
• Loss in networks causes delay due to retransmission and recovery, as well as second-order effects such as TCP head-of-line blocking.
• Loss in networks can occur on the receiver side due to resource starvation in various ways when UDP is in use.
• Loss in networks can occur within switches and routers due to congestion. Routers and switches are natural contention points, and when contended for, loss is the tradeoff.
• Reliable network media, like InfiniBand, trade off loss for delay at the network level. The end result of loss causing delay is the same, though.
To a large degree, low-latency applications that make
heavy use of networks often have to look at a whole
host of causes of delay and additional sources of
jitter within the network. Besides network delay, loss
is probably a high contender for the most common
cause of jitter in many low-latency applications.
Thompson: I see many causes of latency spikes.
Garbage collection is the one most people are
aware of but I also see a lot of lock contention,
TCP-related issues, and many Linux-kernel issues
related to poor configuration. Many applications
have poor algorithm design that does not amortize
the expensive operations like I/O and cache misses
under bursty conditions and thus suffers queuing
effects. Algorithm design is often the largest cause
of performance issues and latency spikes in the
applications I've seen.
Time to safepoint (TTS) is a major consideration
when dealing with latency spikes. Many JVM
operations require all user threads to be stopped by
bringing them to a safepoint. Safepoint checks are
typically performed on method returns. The need
for safepoints can be anything from revoking biased
locks or some JNI interactions and de-optimizing
code, through to many GC phases. Often, the time
taken to bring all threads to a safepoint is more
significant than the work to be done. The work is
then followed by the significant costs in waking
all those threads to run again. Getting a thread to
safepoint quickly and predictably is often not a
considered or optimized part of many JVMs, e.g.
object cloning and array copying.
Piper: The most common cause of outliers are
GC pauses, however the most common cure for
GC pauses is GC tuning rather than actual code
changes. For instance, simply changing from the
parallel collector that is used by default in JDK 6
and JDK 7 to the concurrent mark-sweep collector
can make a huge difference to stop-the-world GC
pauses that typically cause latency spikes. Beyond
tuning, another thing to bear in mind is the overall
heap size being used. Very large heaps typically put
more pressure on the garbage collector and can
cause longer pause times; often, simply eliminating
memory leaks and reducing memory usage can make
a big difference to the overall behavior of a low-
latency application.
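For the JDK 6/7-era HotSpot JVMs Piper is describing, that collector switch is a launch-time flag rather than a code change. A sketch of such an invocation (heap sizes and the jar name are placeholders) might be:

java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -jar trading-app.jar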
Apart from GC, lock contention is another major
cause of latency spikes, but this can be rather
harder to identify and resolve due to its often non-
deterministic nature. It's worth remembering also
that any time the application is unable to proceed,
it will yield a latency spike. This could be caused by
many things, even things outside the JVM's control,
e.g. access to kernel or OS resources. If these kinds
of constraint can be identified then it is perfectly
possible to change an application to avoid the use of
these resources or to change the timing of when they
are used.
Q7. Java 7 introduced support for Sockets Direct
Protocol (SDP) over InfiniBand fabric. Is this
something you've seen exploited in production
systems yet? If it isn't being used, what other
solutions are you seeing in the wild?
Lawrey: I haven't used it for Ethernet because it
creates quite a bit of garbage. In low-latency systems,
you want to minimize the number of network hops
and usually it's the external connections that are
the only ones you cannot remove. These are almost
always Ethernet.
Montgomery: We have not seen this that much. It
has been mentioned, but we have not seen it being
seriously considered. Ultra Messaging is used as
the interface between SDP and the developer using
messaging. SDP fits much more into a (R)DMA access
pattern than a push-based usage pattern. Turning a
DMA pattern into a push pattern is possible, but SDP
is not that well-suited for it, unfortunately.
Thompson: I've not seen this used in the wild. Most
people use a stack like OpenOnload and network
adapters from the likes of Solarflare or Mellanox.
At the extreme I've seen RDMA over InfiniBand
with custom lock-free algorithms accessing shared
memory directly from Java.
Piper: Oracle's Exalogic and Coherence products
have used Java and SDP for some time so in that
sense we've seen usage of this feature in production
systems for some time also. In terms of developers
actually using the Java SDP support directly rather
than through some third-party product, no, not so
much, but if it adds business benefit then we expect
this to change. We ourselves have made use of
latency-optimized hardware (e.g. Solarflare 10GbE
adapters) where the benefits are accrued from
kernel-driver installation rather than specific Java
tuning.
Q8. Perhaps a less Java-specific question, but
why do we need to try and avoid contention? In
situations where you can't avoid it, what are the
best ways to manage it?
Lawrey: For ultra-low latency, this is an issue, but for
multi-microsecond latencies, I don't see it as an issue.
In situations where you can't avoid it, be aware of and
minimize the impact of any resource contention.
Montgomery: Contention is going to happen.
Managing it is crucial. One of the best ways to deal
with contention is architecturally. The single-writer
principle is an effective way to do that. In essence,
just don't have the contention, assume a single writer
and build around that base principle. Minimize the
work on that single write and you would be surprised
what can be done.
Asynchronous behavior is a great way to avoid
contention. It all revolves around the principle of
always be doing useful work.
This also normally turns into the single-writer
principle. I often like a lock-free queue in front of
a single writer on a contended resource and use a
thread to do all the writing. The thread does nothing
but pull off a queue and do the writing operation in a
loop. This works great for batching as well. A wait-
free approach on the enqueue side pays off big here
and that is where asynchronous behavior comes into
the play for me from the perspective of the caller.
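A minimal sketch of that arrangement (illustrative names; a production version would use a bounded, wait-free queue and a tuned idle strategy rather than the bare loop shown here) is a lock-free queue drained by one dedicated writer thread:

import java.util.concurrent.ConcurrentLinkedQueue;

// Producers enqueue from any thread; exactly one thread runs drainLoop()
// and is therefore the only writer touching the contended resource.
final class SingleWriterQueue<T> {
    interface Writer<E> { void write(E item); }

    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<T>();
    private final Writer<T> writer;

    SingleWriterQueue(Writer<T> writer) { this.writer = writer; }

    // lock-free enqueue; safe to call from any number of producer threads
    void submit(T item) { queue.offer(item); }

    // run on a single dedicated thread
    void drainLoop() {
        while (!Thread.currentThread().isInterrupted()) {
            T item = queue.poll();
            if (item != null) {
                writer.write(item);   // single writer, so no lock is needed here
            }
            // a real implementation would spin, yield, or park when the queue is empty
        }
    }
}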
Thompson: Once we have contention in an
algorithm, we have a fundamental scaling bottleneck.
Queues form at the point of contention and Little's
law kicks in. We can also model the sequential
constraint of the contention point with Amdahl's
law. Most algorithms can be reworked to avoid
contention from multiple threads or execution
contexts, giving a parallel speed up, often via
pipelining. If we really must manage contention on
a given data resource then the atomic instructions
provided by processors tend to be a better
solution than locks because they operate in user
space without ever involving the kernel. The next
generation of Intel processors (Haswell) expands on
these instructions to provide hardware transactional
memory support for updating small amounts of data
atomically. Unfortunately, Java is likely to take a long
time to offer such support directly to programmers.
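As a small illustration of that point (a sketch, not code from any panelist's system), java.util.concurrent.atomic already exposes those processor-level atomic instructions, so a shared counter can be updated without ever taking a lock:

import java.util.concurrent.atomic.AtomicLong;

// Contrast between a lock-based counter and one built on compare-and-swap.
final class Counters {
    private long lockedCount;
    private final Object lock = new Object();
    private final AtomicLong atomicCount = new AtomicLong();

    long incrementWithLock() {
        synchronized (lock) {   // under contention this can involve the kernel scheduler
            return ++lockedCount;
        }
    }

    long incrementAtomically() {
        return atomicCount.incrementAndGet();   // lock-free CAS, stays in user space
    }
}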
Piper: Lock contention can be one of the biggest
performance impediments for low-latency
applications. Locks in themselves don't have to
be expensive and in the uncontended case, Java
synchronized blocks perform extremely well.
However, with contended locks, performance can
fall off a cliff, not just because a thread holding a
lock prevents another thread that wants the same
lock from doing work, but also because simply the
fact that more than one thread is accessing the
lock makes the lock more expensive for the JVM
to manage. Obviously, avoidance is key, so don't
synchronize stuff that doesn't need it: remove locks
that are not protecting anything, reduce the scope
of locks that are protecting, reduce the time that
locks are held, don't mix the responsibilities of locks,
etc. Another common technique is to remove multi-threaded
access: instead of giving multiple threads
access to a shared data structure, updates can be
queued as commands with the queue being tended
by a single thread. Lock contention then simply
comes down to adding items to the queue, which
itself can be managed through lock-free techniques.
Q9. Has the way you approach low-latency
development in Java changed in the past couple of
years?
Lawrey: Build a simple system that does what you
want. Profile it as end-to-end as possible. Optimize
and/or rewrite where you measure the bottlenecks
to be.
Montgomery: Entirely changed. Ultra Messaging
started in 2004. At the time, the thought of using
Java for low latency was just not a very obvious
choice. But a few certainly did consider it. And more
and more have ever since. Today I think the landscape
is totally changed. Java is not only viable, it may be
the predominant option for low-latency systems. It's
the awesome work done by Martin Thompson and
[Azul Systems] Gil Tene that has really propelled this
change in attitude within the community.
Thompson: The main change over the past few
years has been the continued refinement of lock-
free and cache-friendly algorithms. I often have fun
getting involved in language shootouts that just keep
proving that the algorithms are way more important
to performance than the language. Clean code that
displays mechanical sympathy tends to give amazing
performance, regardless of language.
Piper: Java VMs and hardware are constantly
changing, so low-latency development is always
an arms race to stay in the sweet spot of target
infrastructure. JVMs have also gotten more robust
and dependable in their implementation of the Java
memory model and concurrent data structures
that rely on underlying hardware support, so that
techniques such as lock-free/wait-free have moved
into the mainstream. Hardware also is now on a
development track of increasing concurrency based
on increasing execution cores, so that techniques
that take advantage of these changes and minimize
disruption (e.g. by giving more weight to avoiding
lock contention) are becoming essential to
development activities.
In Diffusion, we have now got down to single-digit
microsecond latency, all on stock Intel hardware
using stock JVMs.
Q10. Is Java suitable for other performance-
sensitive work? Would you use it in a high-
frequency trading system, for example, or is C++
still a better choice here?
Lawrey: For time to market, maintainability, and
support from teams of mixed ability, I believe Java
is the best. The space for C or C++ between where
you would use Java and FPGAs or GPUs is getting
narrower all the time.
Montgomery: Java is definitely an option for most
high-performance work. For HFT, Java already has
most everything needed. There is more room for
work, though: more intrinsics is an obvious one. In
other domains, Java can work well, I think. Just like
low latency, I think it will take developers willing to
try to make it happen, though.
Thompson: With sufficient time, I can make a C/
C++/ASM program perform better than Java, but
there is not that much in it these days. Java is often
the much quicker delivery route. If Java had a good
concurrent garbage collector, control of memory
layout, unsigned types, and some more intrinsics for
access to SIMD and concurrent primitives then I'd be
a very happy bunny.
Piper: I see C++ as an optimization choice. Java
is by far the preferred development environment
from a time-to-market, reliability, higher-quality
perspective, so I would always choose Java frst and
then switch to something else only if bottlenecks
are identified that Java cannot address. It's the
optimization mantra all over again.
ABOUT THE PANELISTS
Peter Lawrey is a Java consultant
interested in low-latency and high-
throughput systems. He has worked
for a number of hedge funds, trading
firms, and investment banks. Peter
is third for Java on StackOverflow,
his technical blog gets 120K page
views per month, and he is the
lead developer for the OpenHFT
project on GitHub. The OpenHFT
project includes Chronicle, which
supports up to 100 million persisted
messages per second. Peter offers
a free hourly session on different
low-latency topics twice a month to
the Performance Java Users Group.
Todd L. Montgomery is vice-president of
architecture for the Messaging Business Unit
of 29West, now part of Informatica. As the
chief architect of Informatica's Messaging
Business Unit, Todd is responsible for the design
and implementation of the Ultra Messaging
product family, which has over 170 production
deployments within the financial services sector.
In the past, Todd has held architecture positions
at TIBCO and Talarian, as well as research and
lecture positions at West Virginia University.
He has contributed to the IETF and performed
research for NASA in various software fields.
With a deep background in messaging systems,
reliable multicast, network security, congestion
control, and software assurance, Todd brings
a unique perspective tempered by 20 years of
practical development experience.
Martin Thompson is a high-
performance and low-latency
specialist, with experience gained
over two decades working with
large-scale transactional and
big-data domains, including
automotive, gaming, financial,
mobile, and content management.
He believes mechanical sympathy
- applying an understanding of
the hardware to the creation
of software - is fundamental
to delivering elegant, high-performance
solutions. Martin
was the co-founder and CTO of
LMAX, until he left to specialize in
helping other people achieve great
performance with their software.
The Disruptor concurrent
programming framework is just one
example of what his mechanical
sympathy has created.
Andy Piper recently joined the
Push Technology team as chief
technology officer. Previously
a technical director at Oracle
Corporation, Andy has over 18
years' experience working at
the forefront of the technology
industry. In his role at Oracle,
Andy led development for Oracle
Complex Event Processing (OCEP)
and drove global product strategy
and innovation. Prior to Oracle,
Andy was an architect for the
WebLogic Server Core at BEA
Systems, a provider of middleware
infrastructure technologies.
Reliable Auto-Scaling
Using Feedback Control
by Philipp K. Janert
Introduction
When deploying a server application to production,
we need to decide on the number of active server
instances to use. This is a difficult decision, because
we usually do not know how many instances will
be required to handle a given traffic load. As a
consequence, we are forced to use more, possibly
significantly more, instances than actually required in
order to be safe. Since servers cost money, this makes
things unnecessarily expensive.
In fact, things are worse than that. Traffic is rarely
constant throughout the day. If we deploy instances
with peak traffic in mind, we basically guarantee that
most of the provisioned servers will be underutilized
most of the time. In particular, in a cloud-based
deployment scenario where instances can come
and go at any moment, we should be able to realize
significant savings by having only as many instances
active as are required to handle the load at any
moment.
One approach to this problem is to use a fixed
schedule, in which we somehow figure out the
required number of instances for each hour of
the day. The difficulty is that such a fixed schedule
cannot handle random variations: if for some
reason traffic is 10% higher today than yesterday,
the schedule will not be capable of providing the
additional instances that are required to handle the
unexpected load. Similarly, if traffic peaks half an
hour early, a system based on a fixed schedule will
not be able to cope.
Instead of a fixed (time-based) schedule, we might
consider a rule-based solution: we have a rule that
specifies the number of server instances to use for
any given traffic intensity. This solution is more
flexible than the time-based schedule, but it still
requires us to predict how many servers we need for
each traffic load. And what happens when the nature
of the traffic changes, as may happen, for example,
if the fraction of long-running queries increases?
The rule-based solution will not be able to respond
properly.
Feedback control is a design paradigm that is fully
capable of handling all these challenges. Feedback
works by constantly monitoring some quality-of-
service metric (such as the response time), then
making appropriate adjustments (such as adding
or removing servers) if this metric deviates from its
desired value. Because feedback bases its control
actions on the actual behavior of the controlled
system, it is capable of handling even unforeseen
events, such as traffic that exceeds all expectations.
Moreover, and in contrast to the rule-based solution
sketched earlier, feedback control requires very little
a priori information about the controlled system.
The reason is that feedback is truly self-correcting:
because the quality-of-service metric is monitored
constantly, any deviation from the desired value is
spotted and corrected immediately, and this process
repeats as necessary. To put it simply: if the response
time deteriorates, a feedback controller will simply
activate additional instances, and if that does not
help, it will add more. That's all.
Feedback control has long been a standard method
in mechanical and electrical engineering, but it does
not seem to be used much as a design concept in
software architecture. As a paradigm that specifically
applies in situations of incomplete information
and random variation, it is rather different than
the deterministic, algorithmic solutions typical of
computer science.
Although feedback control is conceptually simple,
deploying an actual controller to a production
environment requires knowledge and understanding
of some practical tricks in order to work. In this
article, we will introduce the concepts and point out
some of the difficulties.
Nature of a Feedback Loop
The basic structure of a feedback loop is shown in the
figure. On the right, we see the controlled system.
Its output is the relevant quality-of-service metric.
The value of this metric is continuously supplied to
the controller, which compares it to its desired value,
which is supplied from the left. (The desired value
of the systems output metric is referred to as the
setpoint.) Based on the two inputs of the desired
and the actual value of the quality-of-service metric,
the controller computes an appropriate control
action for the controlled system. For instance, if the
actual value of the response time is worse than the
desired value, the control action might consist of
activating a number of additional server instances.
The figure shows the generic structure
of all feedback loops. Its essential components are
the controller and the controlled system. Information
flows from the system's output via the return path to
the controller, where it is compared to the setpoint.
Given these two inputs, the controller decides on an
appropriate control action.
So, what does a controller actually do? How does it
determine what action to take?
To answer these questions, it helps to remember
that the primary purpose of feedback control is to
minimize the deviation of the actual system output
from the desired output. This deviation can be
expressed as tracking error:
error = actual - desired
The controller can do anything it deems suitable
to reduce this error. We have absolute freedom in
designing the algorithm but we will want to take
knowledge of the controlled system into account.
Let's consider again the data-center situation. We
know that increasing the number of servers reduces
the average response time. So, we can choose a
control strategy that will increase the number of
active servers by one whenever the actual response
time is worse than its desired value (and decrease
the server count in the opposite case). But we can
do better than that, because this algorithm does not
take the magnitude of the error into account, only its
sign. Surely, if the tracking error is large, we should
make a larger adjustment than when the tracking
error is small. In fact, it is common practice to let the
control action be proportional to the tracking error:
action = k × error
where the factor k is a numerical constant.
With this choice of control algorithm, large
deviations lead to large corrective actions, whereas
small deviations lead to correspondingly smaller
corrections. Both aspects are important. Large
actions are required in order to reduce large
deviations quickly but it is also important to let
control actions become small if the error is small;
only if we do this does the control loop ever settle to
a steady state. Otherwise, the behavior will always
oscillate around the desired value, an effect we
usually wish to avoid.
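A minimal sketch of such a proportional controller, assuming response time in milliseconds as the quality-of-service metric and an untuned, purely illustrative gain k, might look like this:

// Proportional control: action = k * error, rounded to a whole number of servers.
final class ProportionalScaler {
    private final double k;                      // controller gain, set by tuning
    private final double desiredResponseMillis;  // the setpoint

    ProportionalScaler(double k, double desiredResponseMillis) {
        this.k = k;
        this.desiredResponseMillis = desiredResponseMillis;
    }

    // positive result: add that many instances; negative: remove that many
    int controlAction(double actualResponseMillis) {
        double error = actualResponseMillis - desiredResponseMillis;
        return (int) Math.round(k * error);      // large error -> large correction
    }
}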
We said earlier that there is considerable
freedom in choosing a particular algorithm for the
implementation of the feedback controller but it is
usually a good idea to keep it simple. The magic of
feedback control lies in the loopback structure of
the information flow, not so much in a particularly
sophisticated controller. Feedback control incurs a
more complicated system architecture in order to
allow for a simpler controller.
One thing, however, is essential: the control action
must be applied in the correct direction. In order to
guarantee this, we need to have some understanding
of the behavior of the controlled system. Usually, this
is not a problem: we know that more servers means
better response times and so on. But it is a crucial
piece of information that we must have.
Implementation Issues
Thus far, our description of feedback control has
been largely conceptual. However, when attempting
to turn these high-level ideas into a concrete
realization, some implementation details need to be
settled. The most important of these concerns the
magnitude of the control action that results from a
tracking error of a given size. (If we use the formula
given earlier, this amounts to choosing a value for the
numerical constant, k.)
The process of choosing specifc values for the
numerical constants in the controller implementation
is known as controller tuning. Controller tuning
is the expression of an engineering tradeoff: if we
choose to make relatively small control actions,
then the controller will respond slowly and tracking
errors will persist for a long time. If, on the other
hand, we choose to make rather large control actions,
then the controller will respond much faster, but
at risk of over-correcting and incurring an error
in the opposite direction. If we let the controller
make even larger corrections, it is possible for the
control loop to become unstable. If this happens,
the controller tries to compensate each deviation
with an ever-increasing sequence of control actions,
swinging wildly from one extreme to the other
while increasing the magnitude of its actions all the
time. Instability of this form is highly detrimental to
smooth operations and therefore must be avoided.
The challenge of controller tuning therefore amounts
to finding control actions that are as large as possible
without making the loop unstable.
A first rule of thumb in choosing the size of control
actions is to work backwards: given a tracking error
of a certain size, how large would a correction need
to be to eliminate this error entirely? Remember that
we do not need to know this information precisely:
the self-correcting nature of feedback control
assures that there is considerable tolerance in
choosing values for the tuning parameters. But we do
need to get at least the order of magnitude right. In
other words, to improve the average query response
time by 0.1 seconds, do we need to add roughly one
server, 10 servers, or 100?
Some systems are slow to respond to control
actions. For instance, it may take several minutes
before a newly requested (virtual) server instance
is ready to receive requests. If this is the case, we
must take this lag or delay into account: while the
additional instances spin up, the tracking error will
persist, and we must prevent the controller from
requesting further and further instances. Otherwise,
we will eventually have way too many active servers
online! Systems that do not respond immediately
pose specifc challenges and require more care but
systematic methods exist to tune such systems.
(Basically, one first needs to understand the duration
of the lag or delay before using specialized plug-in
formulas to obtain values for the tuning parameters.)
Special Considerations
We must keep in mind that feedback control is
a reactive control strategy: things must first go
out of whack, at least a little, before any corrective
action can take place. If this is not acceptable,
feedback control might not be suitable. In practice,
this is usually not a problem: a well-tuned feedback
controller will detect and respond even to very small
deviations and generally keep a system much closer
to its desired behavior than a rule-based strategy or
a human operator would.
A more serious concern is that no reactive control
strategy is capable of handling disturbances that
occur much faster than it can apply its control
actions. For instance, if it takes several minutes to
bring additional server instances online, we will not
be able to respond to traffic spikes that build up
within a few seconds or less. (At the same time, we
will have no problem handling changes in traffc that
build up over several minutes or hours.) If we need
to handle spiky loads, we must either find a way to
speed up control actions (for instance, by having
servers on hot standby) or employ mechanisms that
are not reactive (such as message buffers).
Another question that deserves some consideration
is the choice of the quality-of-service metric to
be used. Ultimately, the only thing the feedback
controller does is to keep this quantity at its desired
value, hence we should make sure that the metric we
choose is indeed a good proxy for the behavior that
we want to maintain. At the same time, this metric
must be available, immediately and at all times. (We
cannot build an effective control strategy on some
metric that is only available after a significant delay,
for instance.)
A final consideration is that this metric should not
be too noisy, because noise tends to confuse the
controller. If the relevant metric is naturally noisy,
then it usually needs to be smoothed before it can
be used as control signal. For instance, the average
response time over the last several requests provides
a better signal than just the response time of the
most recent request. Taking the average has the
effect of smoothing out random variations.
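A simple moving average is one way to perform this smoothing. The following sketch (window size and names are illustrative) returns the average of the last N response times each time a new measurement arrives, and that smoothed value is what gets fed to the controller:

// Fixed-window moving average used to smooth a noisy quality-of-service metric.
final class MovingAverage {
    private final double[] window;
    private double sum;
    private int next;    // index of the slot to overwrite
    private int filled;  // how many slots hold real samples so far

    MovingAverage(int size) { window = new double[size]; }

    double record(double sample) {
        sum -= window[next];              // drop the oldest sample (0.0 while filling)
        window[next] = sample;
        sum += sample;
        next = (next + 1) % window.length;
        if (filled < window.length) filled++;
        return sum / filled;              // smoothed value for the controller
    }
}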
Summary
Although we have introduced feedback control here
in terms of data-center autoscaling, it has a much
wider area of applicability: wherever we need to
maintain some desired behavior, even in the face of
uncertainty and change, feedback control should
be considered an option. It can be more reliable
than deterministic approaches and simpler than
rule-based solutions, but it requires a novel way of
thinking and knowledge of some special techniques
to be effective.
Further Reading
This short article can only introduce the basic notions
of feedback control. More information is available
on my blog and in my book on the topic (Feedback
Control for Computer Systems. O'Reilly, 2013).
ABOUT THE AUTHOR
Philipp K. Janert provides consulting
services for data analysis and
mathematical modeling, drawing on
his previous careers as physicist and
software engineer. He is the author
of the best-selling Data Analysis with
Open Source Tools (O'Reilly), as well as
Gnuplot in Action: Understanding Data
with Graphs (Manning Publications).
In his latest book, Feedback Control for
Computer Systems, he demonstrates
how the same principles that govern
cruise control in your car also apply
to data-center management and
other enterprise systems. He has
written for the O'Reilly Network, IBM
developerWorks, and IEEE Software.
He holds a Ph.D. in theoretical physics
from the University of Washington.
Visit his company's Web site.