April 2019
Notices
Customers are responsible for making their own independent assessment of
the information in this document. This document: (a) is for informational
purposes only, (b) represents AWS’s current product offerings and practices,
which are subject to change without notice, and (c) does not create any
commitments or assurances from AWS and its affiliates, suppliers or licensors.
AWS’s products or services are provided “as is” without warranties,
representations, or conditions of any kind, whether express or implied. AWS’s
responsibilities and liabilities to its customers are controlled by AWS
agreements, and this document is not part of, nor does it modify, any
agreement between AWS and its customers.
© 2019 Amazon Web Services, Inc. or its affiliates. All rights reserved.
Contents
Introduction
  Reliability
  Design Principles
  Definition
Foundation – Limit Management
Foundation – Networking
Application Design for High Availability
  Understanding Availability Needs
  Application Design for Availability
  Operational Considerations for Availability
Example Implementations for Availability Goals
  Dependency Selection
  Single Region Scenarios
  Multi-Region Scenarios
Conclusion
Contributors
Document Revisions
Appendix A: Designed-For Availability for Select AWS Services
Abstract
The focus of this paper is the reliability pillar of the AWS Well-Architected
Framework. It provides guidance to help you apply best practices in the design,
delivery, and maintenance of Amazon Web Services (AWS) environments.
Amazon Web Services Reliability Pillar AWS Well-Architected Framework
Introduction
The AWS Well-Architected Framework helps you understand the pros and cons
of decisions you make while building systems on AWS. By using the
Framework you will learn architectural best practices for designing and
operating reliable, secure, efficient, and cost-effective systems in the cloud. It
provides a way to consistently measure your architectures against best
practices and identify areas for improvement. We believe that having well-
architected systems greatly increases the likelihood of business success.
The Framework is based on five pillars:

• Operational Excellence
• Security
• Reliability
• Performance Efficiency
• Cost Optimization
This paper focuses on the reliability pillar and how to apply it to your solutions.
Achieving reliability can be challenging in traditional on-premises environments
due to single points of failure, lack of automation, and lack of elasticity. By
adopting the practices in this paper you will build architectures that have strong
foundations, consistent change management, and proven failure recovery
processes.
This paper is intended for those in technology roles, such as chief technology
officers (CTOs), architects, developers, and operations team members. After
reading this paper, you will understand AWS best practices and strategies to
use when designing cloud architectures for reliability. This paper includes high-
level implementation details and architectural patterns, as well as references to
additional resources.
Reliability
The reliability pillar encompasses the ability of a system to recover from
infrastructure or service disruptions, dynamically acquire computing resources
to meet demand, and mitigate disruptions such as misconfigurations or
transient network issues.
Design Principles
In the cloud, there are a number of principles that can help you increase
reliability:

• Test recovery procedures
• Automatically recover from failure
• Scale horizontally to increase aggregate system availability
• Stop guessing capacity
• Manage change in automation
Definition
Service availability is commonly defined as the percentage of time that an
application is operating normally. That is, it’s the percentage of time that it’s
correctly performing the operations expected of it. This percentage is calculated
over periods of time, such as a month, year, or trailing 3 years. Applying the
strictest possible interpretation, availability is reduced any time the application
isn’t operating normally, including both scheduled and unscheduled
interruptions. We define availability using the following criteria:

• Availability = normal operation time / total time
• A percentage of uptime (such as 99.9%) over a period of time (commonly a year)
• Common shorthand refers only to the “number of 9s”; for example, “five
nines” translates to 99.999% availability
Here is a table of common application availability design goals and the possible
length of interruptions that can occur within a year while still meeting the goal.
The table contains examples of the types of applications we commonly see at
each availability tier. In this document, we will refer to these values.
Availability   Max Disruption (per year)   Application Categories
99%            3 days 15 hours             Batch processing, data extraction, transfer, and load jobs
99.9%          8 hours 45 minutes          Internal tools like knowledge management, project tracking
99.95%         4 hours 22 minutes          Online commerce, point of sale
99.99%         52 minutes                  Video delivery, broadcast systems
99.999%        5 minutes                   ATM transactions, telecommunications systems
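The maximum-disruption column follows directly from the availability percentage. A minimal sketch, assuming a 365-day (8,760-hour) year; the function names are illustrative:

```python
def max_disruption_per_year(availability_pct: float) -> float:
    """Return the allowed downtime, in hours per year, for a given availability %."""
    return (1 - availability_pct / 100) * 365 * 24

def format_hours(hours: float) -> str:
    """Render fractional hours as whole hours and minutes."""
    whole_hours = int(hours)
    minutes = round((hours - whole_hours) * 60)
    return f"{whole_hours}h {minutes}m"

# Print the allowed disruption for each design goal in the table above.
for goal in (99.0, 99.9, 99.95, 99.99, 99.999):
    print(f"{goal}% -> {format_hours(max_disruption_per_year(goal))}")
```

Small differences from the table (a minute here or there) come from rounding conventions, not from the formula.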
For example, if the MTBF is 150 days and the MTTR is 1 hour, the availability
estimate is 99.97%.
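The estimate above follows from the standard formula availability = MTBF / (MTBF + MTTR). A minimal sketch reproducing the example (the function name is illustrative):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return estimated availability as a percentage from MTBF and MTTR."""
    return 100 * mtbf_hours / (mtbf_hours + mttr_hours)

mtbf = 150 * 24  # 150 days, expressed in hours
print(f"{availability(mtbf, 1):.2f}%")  # matches the 99.97% estimate above
```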
For additional details on how to calculate availability, refer to the linked document.
Another way that costs escalate in systems that operate with higher availability
design goals is in the selection of dependencies. At these higher goals, the set
of software or services that can be chosen as dependencies will diminish based
on which of these services have had the deep investments we previously
described. As the availability design goal increases, it’s typical to find fewer
multi-purpose services (such as a relational database) and more purpose-built
services. This is because the latter are easier to evaluate, test, and automate,
and have a reduced potential for surprise interactions with included but unused
functionality.
Foundation – Limit Management

Service limits fall into two categories: those that can be raised on request (soft
limits) and fixed design constraints (hard limits). The final piece of limit management is
alerting and reporting, which enable you to know when you hit a limit or are
about to hit a limit, and then you can react accordingly.
The default limits for cloud resources created by AWS services are documented
for each service. These limits are tracked per account, so if you use multiple
accounts, you need to know what the limits are in each account. Other limits
may be based on your configuration. Examples of these limits are number of
instances in an Auto Scaling group, provisioned IOPS, RDS storage allocated,
EBS volume allocations, network IO, available IP addresses in a subnet or
VPC, etc.
Limits are enforced per AWS Region and per AWS account. If you are planning
to deploy into multiple regions or AWS accounts, then you should ensure that
you increase limits in the Regions and accounts that you are using. Additionally,
ensure you have sufficient buffer accounted for, such that an Availability Zone
event will not cause you to exceed your limits when requesting additional
resources while the unhealthy resources are being terminated.
AWS provides a list of some service limits via AWS Trusted Advisor, and others
are available from the AWS Management Console. The default service limits
are documented in the Service Limits documentation. If you have not tracked
your limit increase requests, you can contact AWS Support to obtain your
current limits for the services you are using. For rate limits on
throttled APIs, the SDKs provide mechanisms (retry, exponential back off) to
handle throttled responses. You should evaluate your use cases to decide
which scheme works better for you. If you have a use case where the throttling
limits impact your application’s performance, then contact AWS Support to see
if there are mitigations or if the limit can be increased.
Ideally, limit tracking is automated. You can store what your current service
limits are in a persistent data store like Amazon DynamoDB. If you integrate
your Configuration Management Database (CMDB) or ticketing system with
AWS Support APIs, you can automate the tracking of limit increase requests
and current limits. If you integrate with a CMDB, then it is likely that you can
store the service limits within that system.
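The tracking described above might be sketched as follows. This is not an AWS API: a plain dict stands in for a persistent store such as DynamoDB or a CMDB, and all names (check_limits, the 80% threshold, the sample limits) are illustrative:

```python
WARN_THRESHOLD = 0.8  # alert when usage reaches 80% of a limit

# Current limit and usage per (service, limit-name), as you might mirror
# them from the Service Limits documentation or limit-increase tickets.
limits = {
    ("ec2", "running-on-demand-instances"): {"limit": 20, "usage": 17},
    ("vpc", "vpcs-per-region"): {"limit": 5, "usage": 2},
}

def check_limits(limits: dict, threshold: float = WARN_THRESHOLD) -> list:
    """Return the limits whose usage meets or exceeds the warning threshold."""
    warnings = []
    for (service, name), entry in limits.items():
        ratio = entry["usage"] / entry["limit"]
        if ratio >= threshold:
            warnings.append((service, name, ratio))
    return warnings

for service, name, ratio in check_limits(limits):
    print(f"WARNING: {service}/{name} at {ratio:.0%} of limit")
```

A buffer like this is what lets you detect, before an Availability Zone event, that replacing unhealthy resources would push you over a limit.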
• Amazon CloudWatch: You can set alarms to indicate when you are
getting close to limits in Network IO, Provisioned IOPS, EBS and
ephemeral volume capacity (through custom metrics), etc. You can also
set alarms for when you are approaching the maximum capacity of your Auto
Scaling groups.
Resources
Refer to the following resources to learn more about AWS best practices for
identifying and managing limits, and see AWS Answers for an example of
automated limit monitoring:
Video
• How do I manage my AWS service limits?
Documentation
• AWS Service Limits
Foundation – Networking
When architecting systems using IP-address-based networks you need to plan
network topology and addressing in anticipation of future growth and integration
with other systems and their networks. You might find that your infrastructure is
limited if you don’t plan for growth, or you might have difficulties integrating
incompatible addressing structures.
Amazon Virtual Private Cloud (Amazon VPC) lets you provision a private,
isolated section of the AWS Cloud where you can launch AWS resources in a
virtual network.
When you plan your network topology, the first step is to define the IP address
space itself. Following RFC 1918 guidelines, Classless Inter-Domain Routing
(CIDR) blocks should be allocated for each VPC. Consider doing the following
things as part of this process:
• Allow IP address space for more than one VPC per Region.
• Within a VPC, allow space for multiple subnets that span multiple
Availability Zones.
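The planning steps above can be sketched with Python’s standard ipaddress module. The 10.0.0.0/16 block, the /20 subnet size, and the Availability Zone names are assumptions chosen for illustration:

```python
import ipaddress

# An RFC 1918 CIDR block allocated to one VPC.
vpc = ipaddress.ip_network("10.0.0.0/16")  # 65,536 addresses

# Carve one public and one private subnet per Availability Zone,
# leaving the remaining /20s free for future subnets.
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
subnets = list(vpc.subnets(new_prefix=20))  # sixteen /20s of 4,096 addresses

plan = {}
for i, az in enumerate(azs):
    plan[az] = {"public": subnets[i], "private": subnets[i + len(azs)]}

for az, nets in plan.items():
    print(az, "public:", nets["public"], "private:", nets["private"])
```

Working through the arithmetic this way before you create a VPC makes it obvious how quickly address space is consumed, and why erring toward larger blocks is cheaper than renumbering later.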
The second step in planning your network topology is to ensure the resiliency of
connectivity:
AWS has many features that will influence your design. How many VPCs do
you plan to use? Will you use Amazon VPC peering between your VPCs? Will
you connect virtual private networks (VPNs) to any of these VPCs? Are you
going to use AWS Direct Connect or the internet?
Ensure that your VPC design
includes sufficient address space for the number of subnets you need to deploy,
the potential size of Elastic Load Balancing (ELB) load balancers, the number of
concurrent Lambda invocations within your VPC, and your servers (including
machine learning servers) and containers deployed within your subnets. In
general, you should plan on deploying large VPC CIDR blocks. Note that VPC
CIDR blocks can be changed after they are deployed, but if you allocate large
CIDR ranges for your VPC, it will be easier to manage in the long term. Subnet
CIDRs cannot be changed. Keep in mind that deploying the largest VPC
possible (a /16) provides over 65,000 IP addresses, and the base 10.0.0.0/8
private address space provides over 16,000,000 IP addresses. You should err on the
side of too large instead of too small for all these decisions.
An additional option for inter-VPC connectivity is VPC Endpoint Services. This
enables you to use a Network Load Balancer as a private entry point from
another VPC.
You should consider the resiliency and bandwidth requirements that you need
when you select the vendor and instance size on which you need to run the
appliance. If you use a VPN appliance that is not resilient in its implementation,
then you should have a redundant connection through a second appliance. For
all these scenarios, you need to define acceptable time to recovery (TTR) and
test to ensure you can meet those requirements.
If you choose to connect your VPC to your data center via an AWS Direct
Connect connection and you need this connection to be highly available,
provision a redundant fallback connection through a second AWS Direct
Connect connection from another location. If you have multiple data centers, then ensure
the connections terminate at different locations.
If you choose to fail over to VPN over the internet using AWS Managed VPN, it
is important to understand that it supports up to 1.25 Gbps throughput per VPN
tunnel, but does not support Equal Cost Multi Path (ECMP) for egress in the
case of multiple AWS Managed VPN tunnels terminating on the same VGW.
You should use existing standards for protecting your resources within this
private address space. A subnet or set of subnets (one per Availability Zone)
should be used as a barrier between the internet and your applications. In an
on-premises environment, you often have firewalls with features to deflect
common web attacks, and you often have load balancers or edge routers to
deflect DoS attacks, such as SYN floods. AWS provides many services that can
provide this functionality, such as AWS Shield and AWS Shield Advanced, an
integrated web application firewall (AWS WAF) deployed on Amazon
CloudFront and on ELB, ELB itself, and features of AWS virtual networking like
VPC security groups and network access control lists (ACLs). You can augment
these features with virtual appliances from AWS Partners and the AWS
Marketplace to meet your needs.
• AWS Global Accelerator: This is a network layer service that you can
use to create accelerators that direct traffic to optimal endpoints over the
AWS global network.
Videos
Documentation
• Transit Gateways
Application Design for High Availability

Note: If the topic of Reliability is new to you, or if you’re new to AWS, check out
the Automate Deployments to Eliminate Impact (Change Management) and
Recovery Oriented Computing (Failure Management) sections later in this
whitepaper. These sections will cover some concepts that will be helpful as you
read this section.
When designing a new application, it’s common to assume that it must be “five
nines” (99.999%) available without appreciating the true cost to build and
operate applications at that level of availability. Doing so requires that every
piece of networking and infrastructure from end customers to the service
application, data stores, and all other components must be built and operated to
achieve 99.999%. As just one example, most internet service providers aren’t
built to achieve five nines of availability. Therefore, multiple service providers
(with no common point of failure) are required for the application to be 99.999%
available to a specific end customer.
In addition, the application and all of its dependencies need to be built and
tested to this level of availability, also avoiding single points of failure. This
typically requires significant additional engineering investment and operational rigor.
Applications that truly require 99.999% availability can be built on AWS, as the
following example illustrates.
Consider a network of automated teller machines (ATMs). The devices
themselves are not available all of the time, nor is the network
connectivity of any single device, so you deploy a large number of them to
enable a customer to easily use a different device if one is down or lacks
connectivity. In “availability-speak”, they are redundant and fail independently.
The host processor is actually at least two computers and storage that are
deployed across independent AWS Regions, with synchronous replication
between the Regions. The host processors have redundant connections to the
merchant bank and banking networks, and the host processors have standby
copies in an independent location. When cash is required, the host processor
requests an Electronic Funds Transfer to take the money from the customer’s
account and put it in the host processor account. After it has the funds
transferred, it will send a signal to the ATM to dispense the money. It then
initiates an Automated Clearing House (ACH) transfer of those funds to the
merchant bank’s account over the redundant connectivity to the merchant bank.
This will reimburse the merchant bank for the funds dispensed. If a problem
happens to the connectivity between the ATM and the host processor, or to the
ATM itself, after the funds have been transferred to the host processor,
it can’t tell the ATM to dispense the cash. It will transfer the funds back to the
customer’s account. The ATM will timeout the transaction and then mark itself
out of service.
Multiple AWS Direct Connect connections will be established from the host
processors on AWS to the merchant bank. The connectivity from the ATMs
will also be run through redundant Direct Connect providers to the
host processor. If the redundant connectivity to the merchant bank is severed
from both host processors, they will durably store the request for the ACH
transfer until the connection is restored, while marking the ATMs that it operates
out of service.
Let’s explore an example where you might initially assume the application
needs to be 99.999% available, but in reality it can be successful despite a
lower availability design goal.
Are there customers during those hours who will not come back to conduct
business at another time if you have
an interruption? Could you use a lower level of availability with a fallback
mechanism to handle failures?
Upon deep analysis, the work involved in achieving and maintaining high
availability applications seems daunting. That often leads to a more refined
definition and prioritization of requirements:
• What are the most valuable transactions to your customers and to your
business?
• Can you predict when certain geographies have the greatest demand?
• What times of day, week, month, quarter, or year are your peak times?
The good news is that AWS provides numerous services that abstract the work
required to do many of these things, provided the right services are chosen to
meet your availability design goals. Defining the right design goal for availability
will start you down the path of how to implement your application and identify
what the additional costs are to achieve that availability. The remainder of this
whitepaper will help you think about selecting the right design goal, and then
choosing AWS services, building software, and constructing operational
practices that meet your goals.
Recommendation
Within AWS, we commonly divide services into the “data plane” and the “control
plane.” The data plane is responsible for delivering real time service while
control planes are used to configure the environment. For example, Amazon
EC2 instances, Amazon RDS databases, and Amazon DynamoDB table
read/write operations are all data plane operations. In contrast, launching new
EC2 instances or RDS databases, or adding or changing table meta-data in
DynamoDB are all considered control plane operations. While high levels of
availability are important for all of these capabilities, the data planes typically
have higher availability design goals than the control planes.
With this information in hand, availability design goals are then tailored to the
sub-component, and work is done to meet the specific design goal of each sub-
component. Naturally, components that have higher availability design goals will
necessitate deeper investment in the engineering, testing, and operations
automation.
Availability design goals are then tailored to the different aspects, and the
appropriate work efforts are executed to engineer the system. AWS has
significant experience engineering applications with a range of availability
design goals, including services with 99.999% or greater availability. AWS
Solution Architects (SAs) can help you design appropriately for your availability
goals. Involving AWS early in your design process improves our ability to help
you meet your availability goals. Planning for availability is not only done before
your workload launches. It is done continuously to refine your design as you
gain operational experience, learn from real world events, and endure failures
of different types. You can then apply the appropriate work effort to improve
upon your implementation.
AWS also employs the fault isolation construct of Availability Zones (AZs). AWS
Regions are composed of two or more Availability Zones that are designed to
be independent. Each Availability Zone is separated by a large physical
distance from other zones to avoid correlated failure scenarios due to
environmental hazards like fires, floods, and tornadoes. Each Availability Zone
has independent physical infrastructure: dedicated connections to utility power,
standalone backup power sources, independent mechanical services, and
independent network connectivity within and beyond the Availability Zone.
Despite being geographically separated, Availability Zones are located in the
same regional area. This enables synchronous data replication (for example,
between databases) without undue impact on application latency. This allows
customers to use Availability Zones in an active/active or active/standby
configuration. Availability Zones are independent, and therefore application
availability is increased when multiple AZs are used. Some AWS services
(including the EC2 instance data plane) are deployed as strictly zonal services
where they have shared fate with the Availability Zone as a whole. These
services are used to independently operate resources (instances, databases,
and other infrastructure) within the specific Availability Zone. AWS has long
offered multiple Availability Zones in our Regions.
While AWS control planes typically provide the ability to manage resources
within the entire Region (multiple Availability Zones), certain control planes
(including Amazon EC2 and Amazon EBS) have the ability to filter results to a
single Availability Zone. When this is done, the request is processed only in the
specified Availability Zone, reducing exposure to disruption in other Availability
Zones.
Recommendation
When your application relies on the availability of control plane APIs during
a disruption of one Availability Zone, use API filters to request results for a
single Availability Zone with each API request (for example, with
DescribeInstances).
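The recommendation above might look like the following. The az_filter helper is illustrative; only the availability-zone filter name comes from the EC2 API, and the boto3 call itself is shown commented out because it requires AWS credentials:

```python
def az_filter(zone: str) -> list:
    """Return an EC2 API Filters list restricting results to a single AZ."""
    return [{"Name": "availability-zone", "Values": [zone]}]

filters = az_filter("us-east-1a")

# With credentials configured, the filtered request would look like:
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# response = ec2.describe_instances(Filters=filters)
```

Scoping each request this way keeps a disruption in one Availability Zone from impacting your ability to manage resources in the others.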
The most pervasive fault isolation construct is that of the AWS Region. Regions
are designed to be autonomous, with dedicated copies of services deployed in
each Region. Regional AWS services internally use multiple Availability Zones
in an active/active configuration to achieve the availability design goals we
establish.
Redundant Components
One of the bedrock principles for service design in AWS is the avoidance of
single points of failure in underlying physical infrastructure. This motivates us to
build software and systems that use multiple Availability Zones and are resilient
to the failure of any single zone.
Micro-Service Architecture
At AWS, we have built our systems using a concept called micro-services.
While micro-services have several attractive qualities, the most important
benefit for availability is that micro-services are smaller and simpler. They allow
you to differentiate the availability required of different services, and thereby
focus investments more specifically to the micro-services that have the greatest
availability needs. For example, to deliver product information pages on
Amazon.com (“detail pages”), hundreds of micro-services are invoked to build
discrete portions of the page. While there are a few services that must be
available to provide the price and the product details, the vast majority of
content on the page can simply be excluded if the service isn’t available. Even
such things as photos and reviews are actually not required to provide an
experience where a customer can buy a product.
• The team can deploy at any time as long as they meet their API and
other “contract” requirements.
• The team can use any technology stack they want to as long as they
meet their API and other “contract” requirements.
Recommendation
Recovery-Oriented Computing
Complementing the AWS focus on building fault isolation zones and avoiding
single points of failure, we also work to minimize the disruption time when
failures do occur. Since impact duration is a primary input to calculating
availability, reducing recovery time has a direct impact on improving availability.
Recovery-Oriented Computing (ROC) is the term applied to systematic
approaches to improving recovery.
Rather than treating each failure event as a “special case,” ROC applies a similar strategy of retrying with
exponential back-off. Another example is the use of Auto Scaling with Amazon
Elastic Compute Cloud (Amazon EC2) to manage fleet capacity. An instance
may fail due to hardware failure, operating system bug, memory leak, or other
causes. Rather than building custom remediation for each, treat any as an
instance failure, terminate the instance, and allow Auto Scaling to replace the
instance.
A pattern to avoid is developing recovery paths that are rarely executed. For
example, you might have a secondary data store that is used for read-only
queries. When you write to a data store and the primary fails, you might want to
fail over to the secondary data store. If you don’t frequently test this failover,
you might find that your assumptions about the capabilities of the secondary are
incorrect. The capacity of the secondary data store, which might have been
sufficient when you last tested, may no longer be able to tolerate the load
under this scenario. Our experience has shown that the only error recovery that
works is the path you test frequently. This is why having a small number of
recovery paths is best. You can establish recovery patterns and regularly test
them. If you have a complex or critical recovery path, you still need to regularly
exercise that failure path in production to convince yourself that the recovery path
works. In the example we just discussed, you should fail over to the standby
regularly, regardless of need.
AWS approaches the design of our services with fault recovery in mind. We
design services to minimize the time to recover from failures and impact on
data. Our services primarily use data stores that acknowledge requests only
after they are durably stored across multiple replicas. These services and
resources include Amazon Aurora, Amazon Relational Database Service
(Amazon RDS) Multi-AZ DB instances, Amazon S3, Amazon DynamoDB,
Amazon Simple Queue Service (Amazon SQS), and Amazon Elastic File
System (Amazon EFS). They are constructed to use cell-based isolation and
use the independence of Availability Zones. We use automation extensively in
our operational procedures. We also optimize our replace-and-restart
functionality to recover quickly from interruptions.
Many systems built today are distributed systems. They rely on
communications networks to interconnect components. Particularly when
traversing longer distances or intermediary networks, these systems can have
high latency or loss. Individual services may see spikes of requests that
temporarily overwhelm their ability to respond. There are a number of best
practices that can be applied to allow these services to continue to operate
normally in the presence of these “normal” issues.
Retry with exponential back-off: This is the invoking side of the throttling
pattern we just discussed. AWS SDKs implement this by default, and it can be
configured. The pattern is to pause and then retry at a later time. If it fails again,
pause longer and retry. This increase in pause time is often called “backing off.”
After a configured number of attempts or elapsed time, it will quit retrying and
return failure.
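The pattern described above might be sketched as follows. AWS SDKs already implement this for AWS API calls; this illustrative version adds full jitter to spread retries out (the function name and default values are assumptions):

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call operation(), pausing longer after each failure ("backing off")."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # quit retrying and return the failure to the caller
            # Exponential back-off with full jitter, so many clients
            # retrying at once don't synchronize into a retry storm.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```

The jitter matters: without it, every client that failed at the same moment retries at the same moment, recreating the original spike.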
Fail fast: Simply return an error as soon as possible. This will allow releasing of
resources associated with requests and can often allow a service to recover if it
is running out of resources. It’s preferable to fail fast rather than allowing
requests to be queued. Queues can be created at multiple levels of a system,
and can seriously impede an ability to quickly recover. Be very mindful of places
where queues exist (they often hide in workflows or in work that’s recorded to a
database).
Constant work: Systems can fail when there are rapid changes in load. If you
know your service needs to process 100 units of work per second at peak,
then you should design and tune your system for 100 units of work. If there is
work to be done, it takes one of the slots. But if not, you put “filler” work in the
slot just so you know you’re always doing 100 units per second. An example is
a video player that plays data in a buffer. If you have no work to perform
because no new data has come in, you may be able to process the same data
you last received again and effectively render the same video frame, performing
the same work.
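The 100-units-per-second example might be sketched like this; the slot count and all names are illustrative:

```python
SLOTS_PER_TICK = 100  # the constant amount of work done every second

def fill_slots(pending: list, slots: int = SLOTS_PER_TICK) -> list:
    """Return exactly `slots` units of work: real work first, then filler."""
    real = pending[:slots]
    filler = ["filler"] * (slots - len(real))
    return real + filler

# A quiet tick: two real requests, the rest padded with filler work,
# so the system does the same amount of work every second.
tick = fill_slots(["request-1", "request-2"])
print(len(tick), tick[:3])
```

Because every tick performs the same 100 units, a sudden burst of real work changes nothing about the system’s behavior, only what fills the slots.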
Testing is an important part of the delivery pipeline. Aside from common unit
tests and functional tests that are performed at component levels, it is important
to perform sustained load testing. Load tests should discover the breaking point
of your workload, test performance, and perform fault injection testing. In
addition, your monitoring service must be able to add or remove monitoring of
capabilities that are added or deprecated. AWS finds it useful to perform
operational readiness reviews that evaluate the completeness of the testing,
the ability to monitor, and, importantly, the ability to audit the application’s
performance against its SLAs and provide data in the event of an interruption or other
operational anomaly.
Recommendation
Although conventional wisdom suggests that you keep humans in the loop
for the most difficult operational procedures, we suggest that you automate
the most difficult procedures for that very reason.
Deployment patterns that minimize the risk of a failed change include:
• Canary deployment
• Blue-Green deployment
• Feature toggles
• Failure isolation zone deployments
Blue-green deployments are similar to canary deployments, except that a full
fleet of the application is deployed in parallel. You alternate your deployments
across the two stacks (blue and green). Once again, you can send traffic to the
new version, and fail back to the old version if you see problems with the
deployment. You can also send a fraction of your traffic to each version to dial
up the adoption of the new version. AWS CodeDeploy can be configured with a
deployment configuration that enables a blue-green deployment.
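A sketch of the traffic-dialing idea follows (the router class and weights are illustrative; in practice a load balancer or CodeDeploy manages the shift):

```python
import random

class BlueGreenRouter:
    """Routes a fraction of traffic to the 'green' (new) stack; the
    remainder stays on 'blue'. Dial green_weight up as confidence grows,
    or set it back to 0.0 to fail back instantly."""

    def __init__(self, green_weight: float = 0.0):
        self.green_weight = green_weight

    def route(self, rng=random.random) -> str:
        # rng is injectable so routing decisions can be tested deterministically
        return "green" if rng() < self.green_weight else "blue"
```

Failing back is a configuration change (setting the weight to zero), not a redeployment, which is what makes the pattern fast to reverse.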
Feature toggles are configuration options on an application. You can deploy the
software with a feature turned off, so that customers don’t see it. You can then
turn the feature on, as you would for a canary deployment, or you can set the
change pace to 100% to see the effect. If the deployment has problems, you
can simply turn the feature back off without rolling back.
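A minimal sketch of a toggle store (the flag name and render function are hypothetical):

```python
class FeatureToggles:
    """Runtime flags: ship code dark, enable it later, and disable it on
    trouble without rolling the deployment back."""

    def __init__(self):
        self._flags = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)   # default: feature off

def render_page(toggles: FeatureToggles) -> str:
    # Hypothetical call site guarded by the "new-checkout" flag
    if toggles.is_enabled("new-checkout"):
        return "new checkout flow"
    return "old checkout flow"
```

Turning the flag off restores the old behavior immediately, with no deployment in the critical path of the recovery.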
One of the most important rules AWS has established for its own deployments
is to avoid touching multiple Availability Zones within a Region at the same
time. This is critical to ensuring that Availability Zones are independent for
purposes of our availability calculations. We recommend that you use similar
considerations in your deployments.
Testing
The testing effort should be commensurate with your availability goals. Your
application’s resiliency to transient failures of dependencies should be tested for
durations that may last from less than a second to hours. Testing to ensure that
you can meet your availability goals is the only way you can have confidence
that you will meet those goals. Our experience is that canary testing that can
run constantly and simulate customer behavior is among the most important
testing processes. You should unit test, load test, performance test, and
simulate your failure modes while under these tests. Don’t forget to test for
external dependency unavailability and deployment failures. Achieving very
high availability requires implementing fault-tolerant software patterns, and
testing that they are effective.
Netflix has provided some example open source software that can be a basis
for this type of testing. You can use their software or develop your own for
simulating failure modes. For simulating conditions that might produce
brownouts, you can use extensions to common proxies to introduce latency,
dropped messages, etc., or you can create your own.
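One way to sketch such an extension is a wrapper that injects latency and dropped messages around any dependency call (all names and defaults below are illustrative):

```python
import random

def flaky(call, latency_s=0.0, drop_rate=0.0, rng=random.random,
          sleep=lambda s: None):
    """Wrap a dependency call to simulate brownout conditions: add
    latency and randomly drop messages. `sleep` and `rng` are injectable
    so tests run fast and deterministically."""
    def wrapped(*args, **kwargs):
        sleep(latency_s)                  # injected delay
        if rng() < drop_rate:
            raise TimeoutError("simulated dropped message")
        return call(*args, **kwargs)
    return wrapped
```

Wrapping a client this way lets you exercise timeout handling, retries, and fallback paths without touching the real dependency.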
1. Generation
2. Aggregation
3. Real-time processing and alarming
4. Storage
5. Analytics
Generation
First, determine which services and/or applications require monitoring, define
important metrics and how to extract them from log entries if necessary, and
finally create thresholds and corresponding alarm events. AWS makes an
abundance of monitoring and log information available for consumption, which
can be used to define change-in-demand processes. The following is just a
partial list of services and features that generate log and metric data.
• VPC Flow Logs can be enabled on any or all elastic network interfaces
(ENIs) within a VPC.
• Amazon ECS and AWS Lambda stream log data to CloudWatch Logs.
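As a sketch of extracting a metric from log entries and alarming on a threshold (the log format, regex, and 5% threshold are assumptions; in practice CloudWatch metric filters and alarms perform these steps):

```python
import re

STATUS_RE = re.compile(r"status=(\d{3})")

def error_rate(log_lines):
    """Derive a metric from raw log entries: the fraction of 5xx responses."""
    statuses = [int(m.group(1)) for line in log_lines
                if (m := STATUS_RE.search(line))]
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if s >= 500) / len(statuses)

def alarm_state(rate, threshold=0.05):
    """Compare the extracted metric to a threshold, as a metric filter
    plus CloudWatch alarm would."""
    return "ALARM" if rate > threshold else "OK"
```

The same three steps apply at any scale: parse the metric out of the log entry, aggregate it, and compare it against an alarm threshold.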
Aggregation
Amazon CloudWatch and Amazon S3 serve as the primary aggregation and
storage layers. For some services, like Auto Scaling and ELB, default metrics
are provided out of the box for CPU load or average request latency across a
cluster or instance. For streaming services, like VPC Flow Logs or AWS
CloudTrail, event data is forwarded to CloudWatch Logs, and you need to define
and apply filters to extract metrics from the event data. This gives you time-series
data, against which you can define an array of CloudWatch alarms to trigger alerts.
There are also a number of tools provided by partners and third parties that
allow for aggregation, processing, storage, and analytics.
Storage and Analytics
• Amazon S3: Acts as the storage layer, and allows for lifecycle policies
and data management.
• Amazon EMR: Use this service to gain further insight into log and metric
data.
• Amazon Athena: Use this service to gain further insight into data that is
in supported formats.
Auditing
Auditing your monitoring will ensure that you know when an application is
meeting its availability goals. Root cause analyses require the ability to
discover what happened when failures occurred. AWS provides services that
allow you to track the state of your services during an incident:
• Amazon CloudWatch Logs: You can store your logs in this service and
inspect their contents.
• AWS Config: You can see what AWS infrastructure was in use at points
in time.
• AWS CloudTrail: You can see which AWS APIs were invoked, at what
time, and by what principal.
Each availability scenario that follows addresses how to:
• Use monitoring
• Deploy changes
• Back up data
• Implement resiliency
• Test resiliency
Dependency Selection
We have chosen to use Amazon EC2 for our applications. We will show how
using Amazon RDS and multiple Availability Zones improves the availability of
our applications. We will use Amazon Route 53 for DNS. When we use multiple
Availability Zones, we will use Elastic Load Balancing. Amazon S3 is used for
backups and static content. As we design for higher reliability, we can only
adopt services with higher availability themselves. See the Appendix for the
design goals for the respective services.
These applications can be deployed in one Region and one Availability Zone.
We will deploy the software, including the database, to a single instance. We
will use a vendor or purpose-built backup solution to send encrypted backup
data to Amazon S3 using a runbook. We will test that the backups work by
regularly restoring them and ensuring that they are usable, again using a
runbook. We will configure versioning on our Amazon S3 objects and remove
permissions for deletion of the backups. We will use an Amazon S3 bucket
lifecycle policy to archive or permanently delete backups according to our
requirements.
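Such a lifecycle policy might look like the following sketch, shaped for boto3’s `put_bucket_lifecycle_configuration` (the prefix, retention periods, and storage class are illustrative assumptions, not recommendations):

```python
# Illustrative lifecycle rule: transition backups to Glacier after 30 days
# and expire them after 365 (both retention periods are examples only).
lifecycle_config = {
    "Rules": [{
        "ID": "backup-retention",
        "Filter": {"Prefix": "backups/"},   # hypothetical backup prefix
        "Status": "Enabled",
        "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": 365},
    }]
}
# Applied with, for example:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-backup-bucket", LifecycleConfiguration=lifecycle_config)
```

Encoding retention in a lifecycle policy keeps archival and deletion out of the runbook: S3 applies the rule continuously rather than depending on an operator.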
Let’s examine the implications of recovery time on availability. We take 30
minutes to understand the failure and decide to execute recovery, deploy the
whole stack with AWS CloudFormation in 10 minutes to a new Availability
Zone, and assume that the database can be restored in 30 minutes. This
implies that it takes about 70 minutes to recover from a failure. Assuming one
failure per quarter, our estimated impact time for the year is 280 minutes, or
four hours and 40 minutes.
This means the upper limit on availability is 99.9%. The actual availability will
also depend on the real rate of failure, duration of failure and how quickly each
factor actually recovers. For this architecture we require the application to be
offline for updates (estimating 24 hours per year: four hours per change, six
times per year), plus actual events. So referring to the table on application
availability earlier in the whitepaper we see that our availability design goal is
99%.
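The arithmetic in these scenarios can be checked with a short helper; the tier list and the 525,600-minutes-per-year figure reflect our reading of the availability table earlier in the whitepaper:

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes

# Common availability tiers, highest first
TIERS = [99.999, 99.99, 99.95, 99.9, 99.0]

def downtime_budget(availability_pct: float) -> float:
    """Annual downtime budget, in minutes, for an availability level."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

def availability_tier(impact_minutes: float) -> float:
    """Highest standard tier whose annual budget covers the impact time."""
    for tier in TIERS:
        if impact_minutes <= downtime_budget(tier):
            return tier
    return 0.0
```

For this scenario, 280 minutes of failure impact fits within the 99.9% budget (about 526 minutes per year), and adding 24 hours of update downtime pushes the total past that budget, which is why the design goal drops to 99%.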
3 9s (99.9%) Scenario
The next availability goal is for applications for which it is important to be highly
available, but they can tolerate short periods of unavailability. This type of
application is typically used for internal operational systems that have an effect
on employees when they are down. This type of application can also be used
for customer-facing systems that are not high revenue for the business and can
tolerate a longer recovery time or recovery point. Such applications include an
administrative system for account or information management.
We can improve availability for applications by using two Availability Zones for
our deployment and by separating the applications to separate tiers. We will
use services that work across multiple Availability Zones, such as Elastic Load
Balancing, Auto Scaling and Amazon RDS Multi-AZ with encrypted storage via
AWS Key Management Service. This will ensure tolerance to failures on the
resource level and on the Availability Zone level. Backup and restore can be
done using Amazon RDS. It will be executed regularly using a runbook to
ensure that we can meet recovery requirements.
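As a sketch, the key parameters for such an RDS instance, in the shape that boto3’s `create_db_instance` expects (the identifier, engine, sizes, and KMS key alias are illustrative assumptions):

```python
# Illustrative parameters for a Multi-AZ, encrypted Amazon RDS instance.
db_params = {
    "DBInstanceIdentifier": "app-db",     # hypothetical instance name
    "Engine": "mysql",
    "DBInstanceClass": "db.m5.large",
    "AllocatedStorage": 100,              # GiB
    "MultiAZ": True,                      # synchronous standby in a second AZ
    "StorageEncrypted": True,             # encryption at rest
    "KmsKeyId": "alias/app-db-key",       # AWS KMS key (hypothetical alias)
}
# Applied with, for example:
#   boto3.client("rds").create_db_instance(**db_params)
```

With `MultiAZ` enabled, RDS handles failover to the standby automatically, which is what provides the Availability Zone-level tolerance described above.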
Runbooks exist for total system recovery and common reporting. We will have
playbooks for common database problems, security-related incidents, failed
deployments, and establishing root cause of problems. After the root cause has
been identified, the correction for the error will be identified by a combination of
the operations and development teams. The correction will be deployed when
the fix is developed.
Let’s examine the implications of recovery time on availability. We assume
that at least some failures will require a manual decision to execute recovery.
However, with the greater automation in this scenario, we assume that only two
events per year will require this decision. We take 30 minutes to decide to
execute recovery, and assume that recovery is completed within 30 minutes.
This implies 60 minutes to recover from a failure. Assuming two incidents per
year, our estimated impact time for the year is 120 minutes.
This means the upper limit on availability is 99.95%. The actual availability will
also depend on the real rate of failure, duration of failure and how quickly each
factor actually recovers. For this architecture we require the application to be
briefly offline for updates, but these updates are automated. We estimate 150
minutes per year for this: 15 minutes per change, 10 times per year. This adds
up to 270 minutes per year when the service is not available, so our availability
design goal is 99.9%.
4 9s (99.99%) Scenario
This availability goal requires the application to be highly available and
tolerant of component failures. The application must be able to absorb failures
without needing to procure additional resources. This availability goal is for
mission-critical applications that are main or significant revenue drivers for a
corporation, such as an e-commerce site, a business-to-business web service,
or a high-traffic content or media site.
Playbooks will exist for undiscovered failure modes and security incidents.
Playbooks will also exist for establishing the root cause of failures. We will
practice our failure recovery procedures constantly through game days, using
runbooks to ensure that we can perform the tasks without deviating from the
procedures. The team that builds the website also operates it. That team will
identify the correction of error for any unexpected failure and prioritize the fix
to be deployed after it is implemented. We will also engage AWS Support for
its Infrastructure Event Management offering.
Let’s examine the implications of recovery time on availability. We assume
that at least some failures will require a manual decision to execute recovery.
However, with the greater automation in this scenario, we assume that only two
events per year will require this decision, and that the recovery actions will be
rapid. We take 10 minutes to decide to execute recovery, and assume that
recovery is completed within five minutes. This implies 15 minutes to recover
from a failure. Assuming two events per year, our estimated impact time for the
year is 30 minutes.
This means the upper limit on availability is 99.99%. The actual availability will
also depend on the real rate of failure, duration of failure and how quickly each
factor actually recovers. For this architecture we assume the application is
online continuously through updates. Based on this, our availability design
goal is 99.99%.
Multi-Region Scenarios
Implementing our application in multiple AWS Regions will increase the cost of
operation, partly because we isolate regions to maintain their independence. It
should be a very thoughtful decision to pursue this path. That said, regions
provide a very strong isolation boundary and we take great pains to avoid
correlated failures across regions. Using multiple regions will give you greater
control over your recovery time in the event of a hard dependency failure on a
regional AWS service. In this section, we’ll discuss various implementation
patterns and their typical availability.
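The failover idea can be sketched as a health-check-driven choice between Regions (the Region names and health map are illustrative; in practice Amazon Route 53 failover routing plays this role):

```python
def pick_region(health: dict, primary: str = "us-east-1",
                secondary: str = "us-west-2") -> str:
    """DNS-failover-style selection: route to the primary Region while
    its health check passes; otherwise fail over to the secondary.
    An unknown health status is treated as unhealthy."""
    return primary if health.get(primary, False) else secondary
```

Treating missing health data as failure is a deliberate choice here: during a regional event, the health check itself may be unreachable, and failing over is the safer default.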
Let’s examine the implications of recovery time on availability. We assume
that at least some failures will require a manual decision to execute recovery.
However, with the good automation in this scenario, we assume that only two
events per year will require this decision. We take 20 minutes to decide to
execute recovery, and assume that recovery is completed within 10 minutes.
This implies 30 minutes to recover from a failure. Assuming two events per
year, our estimated impact time for the year is 60 minutes.
This means the upper limit on availability is 99.95%. The actual availability will
also depend on the real rate of failure, duration of failure and how quickly each
factor actually recovers. For this architecture we assume the application is
online continuously through updates. Based on this, our availability design
goal is 99.95%.
The deployment pipeline will have a full test suite, including performance, load,
and failure injection testing. We will deploy updates using canary or blue-green
deployments to one isolation zone at a time, in one Region before starting in
the other. During a deployment, the old versions will be kept running to
facilitate a faster rollback. Deployments are fully automated, including rollback
if KPIs indicate a problem. Monitoring will include success metrics as well as
alerting when problems occur.
Let’s examine the implications of recovery time on availability. We assume
that heavy investments are made to automate all recovery.
Conclusion
Whether you are new to the topics of availability and reliability or a seasoned
veteran seeking insights to maximize your mission critical service’s availability,
we hope this section has triggered your thinking, offered a new idea, or
introduced a new line of questioning. We hope this leads to a deeper
understanding of the right level of availability based on the needs of your
business. We encourage you to take advantage of the design, operational, and
recovery-oriented recommendations offered here as well as the knowledge and
experience of our AWS Solution Architects. We’d love to hear from you –
especially about your success stories achieving high levels of availability on
AWS. Contact your account team or use Contact Us via our website.
Resources
Refer to the following resources to learn more about AWS best practices in this
area.
Documentation:
• DynamoDB: Global Tables
• AMI Copies
• Canary Release
• Feature Toggles
• Microservice Trade-Offs
Contributors
The following individuals and organizations contributed to this document:
Document Revisions
Date Description
April 2019 Appendix A updated
June 2018 Added Design Principles and Limit Management sections.
Updated links, removed ambiguity of upstream/downstream
terminology, and added explicit references to the remaining
Reliability Pillar topics in the availability scenarios.
NOTE: The numbers in the table below do not refer to durability (long-term
retention of data); they are availability numbers (access to data or functions).
Availability
Service Component Design Goal