RISK MANAGEMENT FOR
IT INFRASTRUCTURE
Executive Handbook
Volume 1

Series Editors
Julian Kudritzki and Matt Stansberry, Uptime Institute

Executive Publisher
Martin McCarthy, CEO, 451 Group

Designed by David Wilson

Seattle and New York


Uptime Institute & 451 Group
20 W. 37th Street
6th Floor
New York, NY 10018

ISBN 978-0-9982850-0-9
Printed in the United States of America
Printed on 100% post-consumer waste paper
© 2016 by Uptime Institute, LLC. All Rights Reserved.

Uptime Institute is an independent division of The 451 Group. Reproduction and distribution of this publication, in
whole or in part, in any form without prior written permission is forbidden. The information contained herein has
been obtained from sources believed to be reliable and opinions expressed in this publication are solely those
of the authors and do not represent the position of Uptime Institute or its Affiliates. Uptime Institute disclaims
all warranties as to the accuracy, completeness or adequacy of such information. Although selections of this
publication may discuss legal issues related to the information technology business, Uptime Institute does not
provide legal advice or services and this publication should not be construed or used as such. Uptime Institute shall
have no liability for errors, omissions, or inadequacies in the information contained herein or for interpretations
thereof. The reader assumes sole responsibility for the selection of these materials to achieve its intended results.
The opinions expressed herein are subject to change without notice.

CONTENTS
INTRODUCTION / vii
How to overcome the operations risk to IT infrastructure

2016 DATA CENTER INDUSTRY SURVEY / 2


Responses from 1,000 IT and data center end users provide an
overview of an industry and profession in transition

DATA CENTER WALKTHROUGH CHECKLIST / 14


Identifying lurking vulnerabilities in even the best-designed data centers

WHY EFFECTIVE GOVERNANCE NEEDS INDUSTRY CERTIFICATIONS / 18


The benchmarks are a function of your business needs, but the award
on the wall substantiates Risk Management

AVOIDING DATA CENTER CAPITAL PROJECT FAILURES / 24


Identify and mitigate costly mistakes

BALANCING LIFE SAFETY, INFRASTRUCTURE INVESTMENT, AND DOWNTIME / 32
Due to the uninterruptible nature of IT infrastructure, many
organizations allow high-risk maintenance activities

COMPLEX SYSTEMS FAILURE THEORY / 40
Conventional wisdom blames “human error” for the majority of IT
outages, but those failures are incorrectly attributed to front-line
operator errors, rather than management oversights

APPLY EFFICIENT IT PRINCIPLES TO ADDRESS SUSTAINABILITY RISKS / 56


As Corporate Sustainability programs become increasingly important
to C-level execs and investors, IT organizations need to adopt more
meaningful KPIs to remain relevant

IT RESILIENCE DURING A NATURAL DISASTER / 64


The most common cause of disruption to IT services during a natural
disaster is preventable

A HOLISTIC APPROACH TO VENDOR SELECTION FOR CLOUD AND COLOCATION / 68
As companies rely more on colocation, cloud, and other off-premise
computing models, enterprise IT needs to improve how it selects and
manages vendors

INTRODUCTION
How to overcome the operations risk to IT infrastructure

IT infrastructure decisions are fraught with risk from almost every angle due
to the extensive investment and high stakes of technology deployments.

Organizations are exposed to financial and reputational risk from IT service


outages. There are market risks associated with lagging behind industry peers
and losing agility and competitiveness. There are operational risks associated
with life safety and heavy industrial equipment. Increasingly, companies are
facing sustainability and regulatory risks due to the intensity of IT energy use.

All of these concerns require Risk and IT executives to ensure attention to


detail at the trade and technician level, while maintaining a holistic view of
the overall business goals and challenges. A lack of insight in either area can
result in costly organizational blind spots.

In this volume, we have included excerpts of Uptime Institute’s extensive


thought leadership in this area. This publication provides senior executives
the insight of our leadership team.

Uptime Institute is an unbiased advisory organization focused on improving


the performance, efficiency, and reliability of business critical infrastructure
through innovation, collaboration, and independent certifications.

Our organization’s tagline is The Global Data Center Authority, and we have assessed and certified over 1,000 IT infrastructure capital projects and operations programs around the globe. Our Network, a user group of professionals, has recorded more than two decades’ worth of data and insights into how and why IT failures happen. Our subject matter experts have held leadership positions in IT infrastructure organizations at some of the world’s largest companies.

The keys to risk management for IT infrastructure are identifying the risk factors
your organization faces, assessing your organization’s exposure, and ensuring
that your processes and procedures are in line with industry recommendations.

To that end, we have included a chapter titled “Complex Systems Failure Theory” in this volume. We believe it is important for operations risk stakeholders to consider that the reasons for failures are human and addressable.

We have also assembled multiple practical pieces of thought leadership to
bring stakeholder attention to life safety, operational upsets, and loss of
revenue risks inherent in IT operations. Often, these three are interlaced.

It is a historical fact that IT has only reluctantly welcomed outside scrutiny and tends to limit stakeholder input when making strategic decisions. Thus, we have included Industry Survey data to quantify the evolving practices of IT stakeholders, as well as to highlight latent issues.

We have also placed IT in the context of natural disaster risk planning, using
extensive research into the IT solutions and operations programs that outlasted
Superstorm Sandy. Additionally, we call attention to the need for IT to adapt to
and adopt the principles of Corporate Sustainability. For many organizations,
IT is the leading consumer of resources by street address or headcount. As
pressure mounts to prioritize and communicate IT’s resource consumption, a
lack of response represents an image and fiduciary risk to enterprises.

We hope that you find this Executive Handbook compelling. This is only
an excerpt of our substantial body of thought leadership. We welcome your
questions and comments via e-mail below.

As Chief Operating Officer of Uptime Institute, Julian Kudritzki directly leads the
strategic corporate and content initiatives with the new Efficient IT and Corporate
Governance Advisory Services, standardizing global data center portfolios, and
reducing resource consumption in IT infrastructure.

After joining Uptime Institute in 2004, he served as one of the architects of the Tier
Certification program, which has redefined standardization and accountability within
the design-build-operations of capital IT infrastructure projects. Since the inception
of this program, he managed the rapid expansion of Uptime Institute commercial
offerings outside the U.S. and formed local teams in Brasil, Latin America, Europe,
Middle East, and throughout Asia Pacific.

jkudritzki@uptimeinstitute.com

Matt Stansberry is Senior Director of Content and Publications for the Uptime Institute
and also serves as Program Director for the Uptime Institute Symposium, an annual
event that brings together 1,500 stakeholders in enterprise IT, data center facilities, and
corporate real estate to deal with the critical issues surrounding enterprise computing.
He was formerly Editorial Director for Tech Target’s Data Center and Virtualization media
group, and was managing editor of Today’s Facility Manager magazine. He has reported
on the convergence of IT and facilities for more than a decade.

mstansberry@uptimeinstitute.com

UPTIME INSTITUTE 2016 DATA CENTER
INDUSTRY SURVEY RESULTS
Enterprise IT budgets are shrinking, and executives are projected to
outsource heavily to the cloud in the coming 5 years


EXECUTIVE SYNOPSIS
Many enterprise IT departments are shrinking, due to budget pressures,
IT hardware advances, and the outsourcing of workloads to cloud and
colocation providers. At present, the majority of IT groups maintains a mix
of assets across enterprise-owned data centers, colocation partners, and cloud
platforms, which is consistent with several years of survey data that suggested
the shift to cloud computing would be gradual for conservative enterprise
IT organizations. However, this year’s data indicate those assumptions may
be incorrect. A major shift in IT’s role in the enterprise is imminent—or has
already happened—unbeknownst to the enterprise IT professional.

This survey explores the rapidly growing relationship between enterprise IT


and colocation providers and also how enterprise IT can work effectively
with business functions outside of its discipline, specifically Corporate
Sustainability, in order to drive efficiencies and demonstrate responsible
stewardship of resources.

Uptime Institute concludes that IT will need to move away from its role
as a slow-moving centralized service provider, as IT assets become more
distributed across locations and platforms, and instead provide corporate
governance across the various business lines—evaluating security, costs, and
performance of IT for end users.

DEMOGRAPHICS
The sixth annual Uptime Institute Data Center Industry Survey was conducted
via email in February 2016 and includes responses from over 1,000 data center
operators and IT practitioners (see Figure 1).

2016 Survey Respondents

Job Function
• Executive 33%
• IT Management 34%
• Facilities Management 33%

Location
• U.S. and Canada 40%
• Europe 22%
• APAC 13%
• Africa and Middle East 12%
• Latin America 10%
• Russia and CIS 3%

Top Verticals
• Colocation or Multi-tenant Data Centers 26%
• Financial 18%
• Telecommunications 14%
• Government 10%
• Manufacturing 6%
• Utilities/Energy 6%
Figure 1: Uptime Institute’s survey respondents include 1,000 data center owners, operators, and IT practitioners from
various industries and locations around the world.


The survey respondents are end users—those responsible for managing


infrastructure at the world’s largest IT organizations. The participants
represent a wide range of industries, with about a 50-50 split between
enterprise IT leaders and service providers—those with operational or
executive responsibilities in colocation or cloud computing companies.

The roles of the participants range from IT and Facilities Management to


Executive, with senior-level participants at the VP-level and above. Multiple
geographic regions are represented, providing a global perspective.

Regional Differences Are Minor in a Global Economy


Having conducted this survey for six years, Uptime Institute has noticed less variance in responses
between regions. Put another way, an IT director at a bank in São Paulo, Brasil, responds to questions
in much the same way as a London-based IT exec in the financial industry. The biggest variances
appear to relate to company size, job function, and vertical.

Select regional economies are growing faster and might adopt certain technologies more quickly
than others, but these differences have no impact on the purpose of this survey: to examine and
evaluate the decision making of enterprise IT and data center leaders.

Defining Enterprise, and Addressing Enterprise Issues


For this survey, Uptime Institute has divided respondents into two categories: enterprise IT
and service providers. The enterprise IT category includes government, financial industry,
manufacturers, retailers, and any other vertical that deploys IT to serve an internal business
function. Service providers include cloud and colocation vendors—any organization that provides
IT or infrastructure for customers.

Uptime Institute withheld questions from service providers in several areas of this survey. For
example, the survey did not ask service providers about their server hardware footprints or cloud
computing adoption plans. The survey is largely focused on how enterprise IT and infrastructure
is deployed and managed, both in enterprise-owned data centers and through off-premise
computing models.

Lastly, throughout the survey, Uptime Institute uses the generic term colocation. In this survey,
colocation applies broadly to any service provider supplying a data center facility, from dedicated
facilities to multi-tenant spaces.

BUDGETS: STABLE OR SHRINKING ENTERPRISE IT?


Is the glass half-full or trending toward empty?

For the last five years, around half of enterprise IT departments have faced
flat or shrinking overall budgets (combined technology infrastructure of IT
and data center facilities). This percentage has held steady in each of Uptime
Institute’s surveys and speaks to a close scrutiny of enterprise IT spending.
Some enterprise IT organizations are receiving modest budget increases, but
fewer than 10% are seeing any significant growth.


This trend is corroborated by 451 Research’s “Voice of the Enterprise Data


Centers Q4 2015” report, which found 48% of budgets are flat or shrinking
and less than 6% of respondents will see an annual spending increase over 25%.

Over half of enterprise respondents reported a flat or shrinking server hardware


footprint (see Figure 2).

• 50% of enterprise budgets flat or shrinking
• 55% of enterprise server footprints flat or shrinking

Figure 2. Enterprise IT organizations are looking for ways to decrease spending on telecommunications,
staffing, facilities infrastructure, and server hardware.

HP ProLiant and Dell PowerEdge servers were listed as the two most critical
server platforms for enterprise IT users in 451 Research’s “Voice of the
Enterprise, Servers and Converged Infrastructure Q4 2015” report. Over half
of the respondents planned to cut spending on HP equipment in 2016, and
nearly half plan to cut spending on Dell hardware. Respondents report plans
for increased spending toward Cisco’s converged hardware platforms in 2016.
But, for companies that supply the x86 hardware that makes up the bulk of data
center capacity, dramatic cuts may be coming. Nearly 30% of respondents
planned to cut spending on HP server hardware by over 50% in 2016.

The impact of flat enterprise IT budgets and shrinking server hardware


footprints is now trickling down to the colocation providers (see Figure 3).

For the last 5 years, colocation providers have experienced massive growth, trying
to keep up with demand. Yet the forces shrinking enterprise IT deployments are
now impacting the capital project cycle, even for colocation providers.

Despite experiencing a slowdown in new capital projects, colocation or multi-tenant


data center providers are playing a major role in many enterprise IT teams’ asset
mix. According to the survey, a significant portion of an enterprise’s IT workload is
deployed in colocation provider sites. The following section will address the drivers
and trends for managing IT assets in these third-party service provider sites.


Why Are Enterprise Server Footprints Shrinking?


A conversation with 451 Research Director Peter Christy

The number of server hardware units seems to be flat or declining across the enterprise. Uptime Institute
first noted the trend in conversations with the people running advanced enterprise IT departments who
participated in Uptime Institute’s Server Roundup program two years ago, and now the survey data
confirm that the rest of the industry is starting to see something similar. Does this shrinking server
footprint fit your view from the market side?

Christy: It does. All the traditional enterprise server suppliers are seeing flat or even declining
businesses. New servers have more capacity than the older ones they replace, and server virtualization
makes it much easier to refresh the server infrastructure and take advantage of more powerful and
cost-effective hardware. All of that fits what you are seeing.

How are these server hardware trends impacting deployment?

Christy: Moore’s Law can’t last forever, but it is still making progress. Intel just introduced its most
recent Xeon E5 v4 series, which features a 25% performance bump over last year. There are more cores
per processor. The last version had 18; now they’re up to 22. The cores are smaller and offer more
throughput.

Also, server virtualization allows higher server utilization; fewer servers are needed to do the same
work. Server virtualization also makes it easier to refresh servers with newer, more cost- and space-
effective replacements. Server virtualization has happened rapidly by the historical standards of data
center change because it could be done by the server team and because it yielded a quick ROI.

Also, server virtualization has made it easier to move workloads to the cloud, and enterprises are
starting to outsource elements of IT to the public cloud so that fewer servers are needed in their
private enterprise data centers.

Our surveys have shown a conservative adoption rate of cloud computing among enterprise IT
groups. But, we also see some indications that the industry is poised to make a major shift. What do
you think happens next?

Christy: The shift to the public cloud is fascinating. It’s being driven now by the need for enterprises
to be more agile—to respond to issues and opportunities more quickly. IT plays an important role in
business agility. If IT isn’t agile, it’s hard for a business to be agile. Public cloud services, in particular the
market leader Amazon Web Services (AWS), have played a key role in demonstrating the potential value
of agile IT and demonstrating what is possible, at least on a platform like AWS.

Most enterprises would prefer a private alternative to the public cloud, but so far that’s been hard to
accomplish. Although IT surveys clearly show this reluctance and would lead you to believe the evolution
to the public cloud will be slow, other data suggest otherwise.

For the last year, Amazon has broken out AWS as a separate business, and we see that revenues are
approaching a US$10-billion run rate, which is growing at more than 50% year over year with 25%
profitability. These are all breathtaking numbers, especially in an IT industry that is at best slowly
growing.

The shift is also seen in what Intel reports about server CPU sales. Three years ago, Intel said that the
cloud segment was growing more rapidly than the enterprise segment but was still much smaller. More
recently, Intel said that 2016 would be the crossover year in which cloud sales would exceed enterprise
sales.

Finally, all the traditional enterprise IT suppliers (HP, Dell, and IBM) are struggling just to keep the enterprise
business flat and looking for ways they can sell to cloud providers. Although the surveys may show that
enterprises don’t want rapid evolution to the cloud, other data suggest the change may happen quickly.

Peter Christy is the Research Director of 451 Research’s Networking Practice. For more than 30 years,
Peter has worked with segment leaders in a spectrum of IT and networking technologies. He managed
software and system technology for companies including HP, Sun, IBM, Digital Equipment Corp, and Apple.


Enterprise cuts affecting the colocation industry (2014–2016)

• Colocation budget increases: 86% (2014), 74% (2015), 64% (2016)
• Colocation builds: 45% (2014), 29% (2015), 24% (2016)
• Enterprise builds: 18% (2014), 15% (2015), 15% (2016)
Figure 3. Despite experiencing a slowdown in new capital projects, colocation or multi-tenant data center
providers are playing a major role in many enterprise IT teams’ asset mix.

THIRD-PARTY SERVICE PROVIDER ADOPTION AND MANAGEMENT


For the last 5 years, the majority of survey respondents have reported that
some percentage of their IT portfolio resides outside of their enterprise-
owned data centers, either in the cloud or in a colocation facility (see Figure
4). In 2016, over 75% of respondents claimed to use some form of off-
premise computing. Arguably that number is closer to 100%, as many of the
respondents may be unaware of initiatives outside of their purview and end
users might even purposely circumvent traditional IT channels and barriers.

In the years (2012-2016) that the survey has included this question, these numbers have remained fairly static. Despite massive growth in cloud computing revenues, cloud adoption appears to be conservative. And yet, asking this question in a different way to a different segment of the audience yields results that suggest the industry is poised for a major realignment.

Where are your current IT assets located? (Estimate percentages)

• Enterprise-owned Data Center 71%
• Colocation or Multi-tenant Data Center Provider 20%
• Cloud Computing 9%

Figure 4. Uptime Institute says that these survey numbers may understate the move to off-premises computing by
enterprises.

About half of senior executives say they expect the majority of their IT
workloads to reside off-premise in cloud or colocation sites in the future.
Around 70% of those respondents expect that shift to happen by 2020, and
23% expect that shift to happen by next year.

Survey: Top Drivers for Multi-Tenant Data Center Adoption


In your own words, what are the drivers for colocation adoption?

• Reduce churn of noncritical workloads into critical space

• Mergers/Acquisitions activity

• Disaster recovery site on a separate power grid

• Executive directive to divest owned data center infrastructure

• Global expansion

• Avoid large capital expenses of new site build

• Not core business

• Lack of confidence in staff/resources

Uptime Institute saw it coming. “The 2013 Data Center Industry
Survey” found that C-level execs were not paying attention to data center
infrastructure cost or performance metrics. At that time, Uptime Institute
advocated that data center and IT professionals become more effective at
articulating their value to the business or risk being outsourced.

At Uptime Institute Symposium that year, an operations director at a very


large U.S.-based company said that he was making simple, no-cost changes to
the data center that would save his company hundreds of thousands of dollars
annually and extend the life of legacy data center assets—offsetting a looming
eight-figure capital investment.

To the question “What does your CIO think of your projects?” he responded,
“I’ll let you know if I ever meet him.”

Fundamentally this survey takes the pulse of enterprise IT and data center
professionals—stakeholders who are not motivated to go to the public cloud,
who will attempt to diminish the speed with which it will happen, and who
will emphasize the potential problems with cloud computing. While CIOs
may be loath to relinquish their empires, other executives in the business
lines will demand more scalable, responsive IT on demand.


To be clear, there will not be an exodus of enterprise data center workloads


to the cloud. Sunk investments, human nature, and organizational resistance
will sustain many traditional enterprise IT roles into the foreseeable future.
And yet, as business lines demand agility and transparency, enterprise IT will
need to emulate service providers, as they will increasingly be competing
with them. Additionally, IT and data center teams can reorient to provide
corporate governance, advising and assisting business lines with service
provider procurement, and managing vendor relationships.

Enterprise IT on Colocation Providers


What is the length of the typical contract commitment you make to a colocation or multi-
tenant data center provider?
• Under 2 years 12%
• 2-4 years 39%
• 4-5 years 25%

• Over 5 years 24%

How many separate colocation or multi-tenant data center providers is your organization
currently using?
• One 30%
• Two to Three 35%
• Three to Five 13%
• Over Five 22%

Does your organization consider third-party certifications such as the Uptime Institute’s Tier
Certification and/or M&O Stamp of Approval as part of the vetting process for considering
potential colocation candidates?
• Yes 65%
• No 35%

Despite major adoption of cloud and colocation, the outsourcing model is not a panacea. According to 2016 Survey Data:

• 40% of enterprise respondents are paying more for colocation contracts than they had initially planned or expected

• Nearly one-third of respondents had experienced an outage at a colocation vendor site

• Over 60% of respondents said the penalty clause in their Service Level Agreement (SLA) would not adequately offset the cost of that outage to the business (see the sketch below)
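The gap between SLA credits and outage losses is easy to see with rough numbers. The sketch below uses hypothetical contract terms and downtime costs purely for illustration; actual figures vary widely by contract and by business.

```python
# Hypothetical figures for illustration only; real contracts and outage costs vary widely.
monthly_colo_fee = 40_000          # assumed monthly colocation fee (USD)
sla_credit_rate = 0.10             # assumed credit: 10% of the monthly fee per qualifying outage
outage_hours = 4
business_cost_per_hour = 100_000   # assumed cost of lost revenue, recovery labor, reputation

sla_credit = monthly_colo_fee * sla_credit_rate        # 4,000
business_loss = outage_hours * business_cost_per_hour  # 400,000

print(f"SLA credit:    ${sla_credit:,.0f}")
print(f"Business loss: ${business_loss:,.0f}")
print(f"Credit covers {sla_credit / business_loss:.1%} of the loss")  # ~1.0%
```

Under these assumed terms, the SLA credit recovers only about one percent of the business impact, which is why the survey respondents above do not regard penalty clauses as adequate risk transfer.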


To be fair, customer satisfaction levels are high. Almost half the respondents
reported being satisfied or very satisfied with their primary provider, while
7% said they were dissatisfied or very dissatisfied. In 2015, enterprise IT
organizations reported experiencing slightly more outages in their enterprise-
owned sites over a 2-year period than at their colocation sites.

That said, enterprise IT organizations paying a premium for a third party to


deliver data center capacity should hold service providers to higher standards
than their own organization. There is significant room for improvement in
vetting, negotiating, and managing those relationships.

THE CORPORATE SUSTAINABILITY DEPARTMENT


Uptime Institute has tracked data center trends and sentiments in this survey
for 6 years, but the last 3 years of survey responses illustrate a major shift in
IT infrastructure efficiency prioritization. The survey results from 2014–16
demonstrate how Corporate Sustainability will have a major impact on IT
departments going forward.

In 2014, many enterprise IT organizations were sitting on recently built (within the previous 5 years) data center facilities that were underutilized due to forecasting errors. Senior IT and data center staff relied heavily on skewed Power Usage Effectiveness (PUE) metrics as an indicator of success—touting the efficiency of the cooling systems in a partially loaded facility (that in hindsight should never have been built at that scale) as a metric senior management should care about.

Nearly half the respondents were not auditing their sites for comatose server
hardware at all, as addressing this issue would only make overall utilization
look worse. Yet 80% of respondents reported achieving U.S. Green Building
Council’s LEED designation or other green building award, which provided
very little environmental or financial return in the context of the data center.
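As a brief aside, PUE is defined as total facility power divided by IT equipment power, so power drawn by comatose or idle servers still counts toward the "productive" IT load. The minimal sketch below, using assumed figures rather than survey data, shows how a flattering PUE can coexist with poor overall efficiency.

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power Usage Effectiveness: total facility power divided by IT equipment power."""
    return total_facility_kw / it_kw

# Assumed example figures, for illustration only.
it_load_kw = 800            # all powered IT hardware, including comatose servers
overhead_kw = 400           # cooling, UPS losses, lighting, etc.
comatose_fraction = 0.30    # assumed share of IT load doing no useful work

print(f"Reported PUE: {pue(it_load_kw + overhead_kw, it_load_kw):.2f}")  # 1.50

# PUE says nothing about how much of the IT load is useful:
useful_it_kw = it_load_kw * (1 - comatose_fraction)
print(f"Facility kW per useful IT kW: {(it_load_kw + overhead_kw) / useful_it_kw:.2f}")  # ~2.14
```

In this assumed case the facility reports a respectable PUE of 1.50, yet once comatose servers are discounted, the organization is spending more than two watts of facility power for every watt of useful IT work.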

It is not surprising that respondents reported C-suite executives were not


interested in data center efficiency or performance.

In 2015, organizations with major IT infrastructure investments began to


try to address root problems but struggled to get organizational buy-in. The
primary culprits of organizational inefficiency had been ignored for years:

• Poor demand and capacity planning within and across functions


• Significant failings in asset management and utilization

• Lack of financial accountability


These problems stemmed from a disconnect between IT infrastructure


costs and the business lines. Accurate forecasting and asset utilization are
not prioritized if no one is held accountable for those functions. In many
cases, the facility or corporate real estate teams owned an underutilized data
center investment that was treated as an undifferentiated cost center, totally
unattributed to the IT department or the lines of business. Many IT leaders
realized chargeback would address the chronic problems with IT efficiency.

Chargeback is a method of charging internal consumers (e.g., departments, functional units) for
the IT services they use. Instead of bundling all IT costs under the IT department, a chargeback
program allocates the various costs of delivering IT (e.g., services, hardware, software,
maintenance) to the business units that consume them (See “IT Chargeback Drives Efficiency”
The Uptime Institute Journal, vol. 6, page 22 and journal.uptimeinstitute.com).

In 2015, less than a third of survey respondents said that their organizations
had deployed a chargeback accounting method. In May of that year, Uptime
Institute gathered a group of senior stakeholders for the Executive Assembly
for Efficient IT. The group comprised leaders from large financial, healthcare,
retail, and web-scale IT organizations; the purpose of the meeting was to share
experiences, success stories, and challenges to improving IT efficiency.

Nearly every organization in the room had struggled to implement


chargeback, but almost all of them were in process and realized they needed
to reach out to counterparts in other disciplines within the business to make
the most of these efforts.

In 2016, infrastructure leaders said that they faced strong internal resistance
to addressing chronic inefficiency. In order to implement accountability
and efficiency measures, the projects needed senior-level support. These
efforts would not succeed as bottom-up initiatives. With that knowledge,
infrastructure teams reached out to their counterparts in other disciplines
(see Figure 5).

Increasingly, Corporate Sustainability drives decisions at large companies, as


this function can affect the investor community, stock price, and capitalization.
Many companies meet these challenges by creating sustainability offices that
have both C-level visibility and broad staff participation across all business
units and facilities, including IT.


Which major business functions or departments are consistently absent from major IT infrastructure decisions?

1. Finance   2. Risk   3. Sustainability

Figure 5. IT reached out to partner with finance, risk, and even Corporate Sustainability to gain executive
visibility and traction to address chronic problems.

As Uptime Institute Chief Operating Officer Julian Kudritzki wrote in Network World
in April 2016:
Two years ago, a place at the table for sustainability would have been provocative, and perhaps
evoked derision. In 2015, less than a tenth of enterprise IT stakeholders had confidence in
Corporate Sustainability to affect IT efficiency and costs. One short year later, 2016 is a vastly
different matter, and the data suggests that the time of Corporate Sustainability in IT is here now:
70% of enterprise IT organizations actively participate in Corporate Sustainability efforts. The
influence of an outside party breaks down the ‘thwart by silo’ effect that has been the cause of so
much well meaning, and often fruitless, energies to reshape IT.

The relationship between Corporate Sustainability and enterprise IT is really just


getting started. There are good signs for the potential of this relationship, but
also signals that entrenched behaviors and metrics will be difficult to overcome.

The relationship so far, according to the survey stakeholders, has been


overwhelmingly positive:

• 73% report that sustainability executives understand the data IT provides and use it properly

• 44% report having a beneficial relationship with Corporate Sustainability

• Less than 10% claim that Corporate Sustainability efforts pose a risk
to IT performance or availability or create needless work

And yet, the reporting functionality has a long way to go. If IT infrastructure
leaders are motivated to improve the accountability and efficiency of their
organizations, Corporate Sustainability is a great partner for gaining C-level
buy-in and funding for projects.

But in many cases, IT infrastructure teams are still relying on the least
meaningful metrics to drive efficiency.


The majority of IT departments position total data center power consumption and LEED certifications as primary indications of efficient stewardship of environmental and corporate resources.

Infrastructure leaders should not co-opt the Corporate Sustainability


department to continue to perpetuate the fallacy that efficient computer
room air conditioning is indicative of an efficient IT organization.

Rather, companies need to use this visibility and executive influence to address
the chronic efficiency problems in a holistic manner, not only for the sake of
their businesses but also for the very existence of enterprise IT, as increasingly
these organizations will be forced to compete with the cloud (See “A Holistic
Approach to Reducing Cost and Resource Consumption” The Uptime Institute
Journal, vol. 4, page 18 and at journal.uptimeinstitute.com).

CONCLUSIONS
Enterprise IT budgets and server footprints are in decline, and that trend will
continue. Outsourcing is rampant in the face of opaque costs and chronically
poor capacity planning.

In the face of budget constraints and competition from service providers,


leading enterprise IT organizations are trying to drive efficiency and
transparency to compete with the cloud.

In the face of these challenges, infrastructure executives have reached


out to business stakeholders in other parts of the organization to become
more responsive. Most IT organizations are partnering with Corporate
Sustainability, but the efforts are still primarily focused on the least impactful
aspects of efficiency.

As IT assets become more distributed across locations and platforms, IT needs


to move away from its role as a slow-moving centralized service provider, and
instead provide corporate governance across the various lines of business–
evaluating security, costs, efficiency, and performance of IT for end users.

By Matt Stansberry, Senior Director of Content and Publications, Uptime Institute

DATA CENTER WALKTHROUGH CHECKLIST
Identifying lurking vulnerabilities in even the
best-designed data centers


Even the best-designed data centers have vulnerabilities. Companies with


complex IT systems design safeguards against failure with multiple layers of
protection and backup. Thus, when IT infrastructure fails, it is not due to a
lack of backup systems but rather a failure of management.

Uptime Institute has delivered operations assessments across hundreds of data


center facilities and has identified indicators of management shortfalls.

But how do you evaluate management behaviors?

You do not need to be an expert in data center operations to determine


whether underlying risk factors are being left untended by management.
Use this checklist to ensure processes and documentation are in place—the
organization’s responsiveness, familiarity, and adherence to documented
procedures are key to evaluating performance.

The majority of IT outages occur for practical and predictable reasons that aren’t
sexy and aren’t attended to. Management structures were not in place or were not
followed; a lack of processes or enforcement of processes defeated the investment.

Use this checklist to identify areas of improvement and further inquiry for
your staff or service provider.

WALKTHROUGH CHECKLIST

Are there any combustible materials (cardboard, paper, etc.) on the raised floor,
battery room, or electrical rooms? All incoming equipment should be stripped of
packaging outside of critical space.

Are unrelated items—office furniture, shelving units, tools—stored in critical space?


This is a fire, safety, and contamination issue.

Review fire extinguishers for out-of-date tags.

Ask to see the housekeeping policy and procedure documentation.

If the facility operates a raised floor, review condition of underfloor plenum. This area
should be cleaned regularly—ask to see the schedule.

How many employees have access to the critical space? Does your organization even
have an access policy for staff?

Ask to see the vendor check-in and training requirements; non-vetted individuals
should not be allowed in critical areas.


Are panels, switchboards, and valves labeled to indicate “normal” operating positions?

Ensure arc flash labeling is installed on all panels and PDUs. (See “Balancing Life
Safety, Infrastructure Investment, and Downtime,” page 32)

Data center cooling practices for over a decade have called for airflow isolation—cool
air delivered to the front of a rack of IT equipment and hot air exhausted out the back.
In a raised floor environment, rows of equipment are typically arranged in what is
called Hot Aisle-Cold Aisle configuration—perforated tiles deliver cool air to the cold
aisle or server intakes.

Are any grated or perforated panels in the Hot Aisle?



Are there unsealed cutouts in the raised floor?

Are there uncovered gaps in the racks between IT hardware?

All these are indicators of poor bypass airflow management. This results in cooling
inefficiency, wasted money, and poor adherence to management best practices.

Ask to see records and schedules for maintenance activities on batteries, engine
generators, and mechanical systems.

Ask to see staffing documentation—overtime rates greater than 10% can lead to
an increase in human error that causes outages. Are roles and responsibilities
documented? Are qualifications listed?

Ask to see list of preventive maintenance activities. Are the activities fully scripted?
What is the quality control process?

Who keeps critical documentation on equipment, including warranty info, maintenance


records, and performance data?

Ask to see training records, annual budget, and time allocation.

What is the process for keeping the reference library (staffing, equipment,
maintenance, procedures, and scripts) up-to-date?


Many data center programs, even successful rigorous operations, are subject to
vulnerabilities and would benefit from continuous improvement. Some of the items
on this checklist will raise red flags and identify areas to focus attention.

But there are some other symptoms to look for that indicate a crisis in management
rigor:

Are data center staff voice mail boxes full, emails not responded to,
email inbox size limit exceeded, or meetings missed or routinely cancelled?

Does your data center team report having no time for training? Shortage
of qualified staff? Personnel performing work outside their competency?
High personnel turnover?

Has Maintenance exceeded its budget? How about energy cost estimates?

Does the back of the server or cable trays look like a spaghetti pot
blew up? Is the cabling all correctly labeled? Is there a unique labeling
system for equipment? If it looks like a mess, it is a mess.

Successful IT infrastructure teams are preoccupied with failure and display attention
to detail and a commitment to process over personality. Use this checklist to discover
vulnerabilities in your IT operations and start a conversation with your staff and
service providers.

Management and operations have the biggest impact on your IT infrastructure


performance, and provide the biggest opportunity for change and improvement.

By Uptime Institute senior technical staff

WHY EFFECTIVE GOVERNANCE NEEDS
INDUSTRY CERTIFICATIONS
The benchmarks are a function of your business needs, but the
award on the wall substantiates Risk Management


Discussions around standards and certifications tend to take two forms: technical debates on the nuances of criteria and discussions around proof—whether self-attested or audited by a third party. Both are valid discussions, and Uptime Institute has written a wide-ranging body of content addressing both (See The Uptime Institute Journal, industry publications, and conference archives).

But what is lost in these discussions is the governance imperative of a rigorous


certification that considers the special attributes of the outcome being validated.

When evaluating any certification, executives should focus on its governance benefit.

• Will the process of certification fundamentally improve my project outcome?

• Will it have benefit to my organization beyond the testing/evaluation period?

• Is its process perfunctory? An adaptation of accounting and stock-


keeping practices?

• Will its accomplishment speak to a ‘through and through’ insight?

• Will my Board understand it?

The certification business is unforgiving in practice. You are only as good


as the last thing you certified. Uptime Institute has certified around 1,000
projects in dozens of countries. Our business is to release standards to the
industry royalty free, but we exclusively audit and certify. Reserving the audit
right is key to consistency, as Uptime Institute experts adjudicate the criteria
without conflict of interest or other temptation. And, by maintaining a central
auditing resource, we are best able to ensure consistency and accountability.

Our certifications were developed to secure the governance-level requirements


of a major capital investment. For example, Tier Certification is a three-
sequence process of Design Documents, Constructed Facility, and Operations.
This sequence enforces discipline and transparency because each Certification is a prerequisite for the next. If an organization has only the first award, it immediately raises the question of its capability to build what has been designed. On the other hand, each award allows the confident transfer from design to implementation to service.

The need for this rigor is just as potent for the 1,000th Certification as it was
for the first. The reason is complex systems failure theory. An application of
complex systems failure theory to data center and IT infrastructure may be
found in this book (see page 40).


In summary, as the complexity of a project increases and as the number of disciplines involved grows, the opportunity for failure increases exponentially.
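One way to see why is a simple probability model, sketched below. It assumes, purely for illustration, that each discipline independently delivers defect-free work with a fixed probability; the chance of at least one latent defect then grows quickly with the number of disciplines.

```python
def p_at_least_one_defect(n_disciplines: int, p_defect_free: float) -> float:
    """Probability of at least one latent defect if each of n independent
    disciplines is defect-free with probability p_defect_free."""
    return 1 - p_defect_free ** n_disciplines

# Assumed 98% defect-free rate per discipline, for illustration only.
for n in (5, 20, 50):
    print(n, round(p_at_least_one_defect(n, 0.98), 2))
# 5 0.1
# 20 0.33
# 50 0.64
```

Under these assumptions, a project touching 50 trades has roughly a two-in-three chance of carrying at least one latent defect into service, even when every individual trade performs at a 98% standard.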

Human foible is the thief of success. And as long as complex capital projects
require multiple human beings in various trades, the risk will not go away.
And, the stature of each engineering, construction, testing, or operations
vendor is not a guarantee and should not be mistaken for a Governance
decision or Risk Management.

Despite specialization of brands and an industry that is arguably in its late second
or early third generation of professionals, the need for Governance through
Certification is not abating. We have a unique vantage point as a certification body.
And, as we have chosen to certify at the system level, we stand at the intersection,
alignment, and battleground of the many trades and disciplines.

When we developed our capital project Assessment & Certification regimen,


we based it on technical criteria that would verify each and every technical
and engineering detail. But, we also instilled criteria that were consistent with
the fundamental tenets of a high availability IT investment, regardless of the
size, scale, or sophistication of the project. The fundamental tenets include life
safety, independent verification of design intent (such as loss of utility power),
and proving the site’s investment in functionality.

Over 70% of projects fail at least one of our fundamental tenets. In a majority
of assessments, we discovered at least one unidentified incident looming over
steady-state operations. In every one of these failed assessments, the owner
had offset risk with appropriate capital investment and the team had signed
off on the project as ready for service.

The vast majority of capital projects, regardless of location or market maturity,


experience one or more of the following:

• Disarray of project milestones


o Design decisions being made well into construction or even
commissioning

• Absence of Owners’ Representation


o Project schedule is not published and updated
o Design or other key documentation has revision control issues
o Project requirements are broadly written or bypassed

• Non-binding KPIs
o Testing and commissioning regimen is not established
o Certifications are not communicated and coordinated


EXIT REPORTS FROM RECENT PROJECTS


An exit report from a 2016 capital project in the Americas reads:

• Ordinary Design
• Acceptable Construction
• Minor Testing & Commissioning
• Failed

The review continues in exacting detail, but, at an executive level, the back-up
power systems failed during a simulated electrical utility outage. This was an
anticipated design condition—and arguably the most rudimentary function—
of the new data center. The underlying reason was a ‘feature’ that had been
engineered into the back-up power systems, but the owner did not receive
training, did not have appropriate knowledge, and had not been informed of
its existence, thereby defeating the purpose of the data center.

An exit report from a 2016 capital project in Europe reads:

• Repurposing of an Existing Building in Historical Area


• Well Thought Out Solutions to Limitations of Materials & Configuration
• Minor Testing & Commissioning
• Failed

The review continues in exacting detail, but, at an executive level, the data
center posed a threat to life safety. Service work on the power systems
necessitated placing a screwdriver on a live 400-volt connection. Additional
failures were attributed to incorrect fuse ratings and errors in the building
monitoring and automation system. Any of these three issues would have
resulted in a service interruption of the new data center.

It is important to note that our assessment was the very last step in the capital
project, immediately preceding the new data center being placed into service.
All of the capital project stakeholders had signed off on the data center before
our assessment began.

The reasons for the high rate of shortfalls are consistent with the social science of disaster and the normalization of deviance. Schedule pressures, cost pressures, and the absence of anyone focused on the comprehensive or the ‘whole’ lead to these outcomes even when the requisite skills and experience are in place.


These items were discovered because Tier Certification was the barrier to the
asset being placed in service. Without a Certification in place, the investor
had no recourse. By requiring Certification, the rigor of the project itself was
put to the test. And the award on the wall demonstrated compliance at the
profound level of the intent of the investment and its core business value.

During the genesis of Tier Certification, it was decided that any project
claiming to be of high availability, regardless of Tier, had to follow a high
standard of care. The project must demonstrate exactitude and coordination
that was responsive to its complexity and the enormity of the investment.
This process enforced a clarity and sequencing of the design, construction,
and testing. The team that developed Tier Certification understood that it
was an inherent challenge to uphold the sequence of project milestones, but
that within each project phase, tens of trades were involved, any one of which
could defeat the outcome.

Industry certifications with a holistic focus transcend technical acumen and reach the governance level. On numerous occasions, we have been engaged at the Board level in either a protectionist or triage role. These engagements showed the capability of industry certifications to bridge the abstruse nature of technology and the Board’s mandate of fiduciary responsibility, Risk Mitigation, and Risk Management.

By Julian Kudritzki, Chief Operating Officer, Uptime Institute

AVOIDING DATA CENTER
CAPITAL PROJECT FAILURES
Identify and mitigate costly mistakes


For enterprise IT organizations embarking on a data center capital project, the stakes are undeniably high. Building a new data center is a massive investment, and it either enables or hampers an organization’s IT strategy and capability—affecting business performance for years to come. As more organizations rely on colocation data center providers, ensuring that the design and construction of these projects meet your requirements is critical as well.

With multiple vendors, subcontractors, and typically more than 50 different


disciplines involved in any data center project—structural, electrical, HVAC,
plumbing, fuel pumps, networking, and more—it would be remarkable if there
were no errors introduced or corners cut during the construction process.

Lapses in construction oversight, planning, and budget can mean that an


expensive new data center facility will fail to meet the owner’s requirements,
with the end result offering poor performance, limited flexibility, insufficient
compute resources, or excess stranded capacity.

Addressing problems as they reveal themselves may delay construction and the start of operations, and it usually requires significant spending. In
some cases, the problems continue to hamper operations for the life of the
data center and may eventually require the facility to be replaced prematurely.
Even if the facility should continue operations for its expected life, it may
cost more than expected to operate, suffer more downtime incidents, and
complicate efforts to introduce new products and services.

Any data center capital project is subject to complex challenges. Inclement


weather, delayed equipment delivery, overwhelmed local resources, slow-moving
permitting and approval bureaucracies, lack of availability of public utilities
(power, water, gas), merger or acquisition, or other shift in corporate strategy can
delay construction or increase costs. However, capital project teams must prepare
for and resolve problems that result from unexpected conditions.

In Uptime Institute’s experience delivering more than 1,000 Certifications


across the globe, problems in construction are most often attributed to:

• Poor integration of complex systems

• Lack of thorough commissioning or compressed commissioning schedules

• Unapproved or unexamined design changes

• Substitution of materials or products


In a number of cases, data center failures, delays, or cost overruns occur


during the construction phase because of misaligned construction incentives
or poor contractor performance. In reality, the seeds of both these issues are
sown in the earliest phases of the capital project, when design objectives,
budgets, contracts, and schedules are developed; RFPs and RFIs issued; and
the construction team assembled.

These issues arise during construction, commissioning, or even after operations


have commenced and may impact cost, schedule, or IT operations. These
construction problems often occur because of poor change management
processes, inexperienced project teams, misaligned objectives of project
participants, or lack of third-party verification.

CONTRACTS AND OWNER’S REPRESENTATIVES


At the project outset, all parties should recognize that owner objectives differ
greatly from builder objectives. The owner wants a data center that best meets
cost, schedule, and overall business needs, including data center availability.
The builder wants to meet project budget and schedule requirements while
preserving project margin. Data center uptime (availability) and operations
considerations are usually outside the builder’s scope and expertise.

Thus, it is imperative that the project owner—or owner’s representatives—


devise contract language, processes, and controls that limit the contractors’
ability to change or undermine design decisions while making use of the
contractors’ experience in materials and labor costs, equipment availability,
and local codes and practices, which can save money and help construction
follow the planned timeline without compromising availability and reliability.

Data center owners should appoint an experienced owner’s representative


to properly vet contractors. This representative should review contractor
qualifications, experience, staffing, leadership, and communications. Less
experienced and cheaper contractors can often lead to quality control problems
and design compromises.

Data center owners should also gather a design, construction, and project
management team with extensive data center experience. If necessary, outside
experts may be needed to focus on the owner’s project requirements. Keep in
mind that an IT group may not understand schedule risk or the complexity
of a project. Experienced teams are more likely to push back on unrealistic
schedules or suggestions that would compromise the project.

The owner or owner’s representative must work through all the project
requirements and establish an agreed upon sequence of operations and an


appropriate and incentivized construction schedule that includes sufficient


time for rigorous and complete commissioning. In addition, the owner’s
representative should regularly review the project schedule and apprise team
members of the project status to ensure that the time allotted for testing and
commissioning is not reduced.

Project managers or contractors looking to keep on schedule may perform


tasks out of sequence. Tasks performed out of sequence often have to be
reworked to allow access to space allocated to another system or to correct
misplaced electrical service, conduits, ducts, etc., which only exacerbates
scheduling problems.

Construction delays should not be allowed to compromise commissioning.


Incorporating penalties for delays into the construction contract is one
solution that should be considered.

CHANGE CONTROL IS CRITICAL TO PROJECT SUCCESS


Once a design has been finalized, change control processes are essential to
managing and reducing risk during the construction phase. For various
reasons, many builders, and even some owners, may be unfamiliar with the
criticality of change control as it relates to data center projects. No project
will be completely error free; however, good processes and documentation will
reduce the number and severity of errors and sometimes make the errors that
do occur easier to fix.

Value Engineering is a common construction practice to reduce the expected


cost of building a completed design. The process has its benefits, but it tends
to focus just on the first costs of the build. Often conducted by a building
contractor, the practice has a poor reputation among designers because it
often leads to changes that compromise the design intent. Yet other designers
believe that in qualified hands, value engineering can yield savings for the
project owner, without affecting reliability, availability, or operations.

Value Engineering needs to be integrated in the change control process. If it


is performed without input from Operations and appropriate design review,
any initial savings realized from these changes may be far less than charges
for remedial work needed to restore features necessary to achieve Concurrent
Maintainability or Fault Tolerance and increased operating costs over the life
of the data center.

As a result, each and every change must be scrutinized for its effect on the
design. Retaining the original design engineer or a project engineer with
experience in data centers may reduce the number of inappropriate changes


generated during the process. Even so, data center owners should be aware that
Uptime Institute personnel have observed that improperly conducted Value
Engineering has led to equipment substitutions or systems consolidations
that compromised owner expectations of Fault Tolerance or Concurrent
Maintainability. Contractors may substitute lower-priced equipment that has
different capacity, control methodology, tolerances, or specifications without
realizing the effect on reliability.

THE SORRY STATE OF COMMISSIONING


Uptime Institute’s extensive global field experience reveals that the vast majority of even the world’s most elite data center projects do not operate as designed and installed
on day one. Uptime Institute consultants find system faults in nearly every Tier
Certification site visit. Data center owners often comment that Tier Certification
demonstrations are more rigorous than their commissioning program.

Commissioning activities represent a unique opportunity for data center owners.
The ability to rigorously test the capabilities of the critical infrastructure
that supports the data center, without any risk to mission critical IT loads, is
an opportunity that should be capitalized on to the maximum possible extent.

Uptime Institute observes that this critical opportunity is being wasted far
too often in data center facilities, with not nearly enough emphasis on the
rigor and depth of the commissioning program required for a mission critical
facility until critical IT hardware is already connected.

A well-planned and executed commissioning program will help validate the
capital investment in the facility to date. It will also put the operations team
in a far better position to manage and operate the critical infrastructure for
the rest of the data center’s useful life, and ultimately ensure that the facility
realizes its full potential.

Construction teams that are insufficiently experienced in the rigors of data
center commissioning often underestimate the time required or regard the
commissioning period as a kind of buffer that can be accessed when work
runs late. For both these reasons, it is important that the owner or owner’s
representative take care to schedule adequate time for commissioning and ensure
that contractors meet or exceed construction deadlines. One recommendation is
to engage the commissioning agent and general contractor early in the process
as partners in the development of the project schedule.

In addition, data center capital projects include requirements that might be
unfamiliar to teams lacking experience in mission critical environments; these
requirements often have budgetary impacts.

For example, owners and owner’s representatives must scrutinize construction
bids to ensure that they include funding and time for:
• Factory witness tests of critical equipment

• Extended Level 4 and Level 5 commissioning with vendor support

• Load banks to simulate full IT load within the critical environment

• Diesel fuel to test and verify engine-generator systems

Because experienced teams understand the importance of data center-specific
commissioning, the commissioning agent will be able to work more effectively
early in the process, setting the stage for the transition to operations.

In addition, Operations should be part of the design and construction team from
the start of the project through commissioning and handover. Including
Operations in change management gives it the opportunity to share and learn key
information about how the data center will run, including set points, equipment
rotation, change management, training, and spare inventory, all of which will be
essential in everyday operations and in dealing with incidents.

THIRD-PARTY CERTIFICATION
Third-party verifications can assure the owner that the project delivered meets
the owner’s project requirements. Uptime Institute has witnessed third-party
verification improve contractor performance. The verifications motivate the
contractors to work more diligently, perhaps because verification increases the
likelihood that shortcuts or corner cutting will be found and repaired at the
contractor’s expense.

Certifications and verifications are only effective when conducted by an
unbiased, vendor-neutral third party. Many certifications in the market fail
to meet this threshold. Some certifications and verification processes are little
more than a vendor stamp of approval on pieces of equipment. Others take a
checklist approach, without examining causes of test failures.

Serious mistakes can take place at almost any time during the construction
process, including during the bidding process. In one such instance, an
owner’s procurement department tried to maximize a vendor discount for a
piece of equipment but failed to order components to connect it.

In another example, a contractor won a bid based on the cost of transporting
completely assembled generators on skids for more than 800 miles. When the
vendor threatened to void warranty support for this creative use of product,
the contractor was forced to absorb the substantial costs of transporting the
equipment in a more conventional way. In such instances, owners might be wise
to watch closely whether the contractor tries to recoup these costs by changing
the design or making other equipment substitutions.

In another case, the Uptime Institute team found that the builder implemented
a design as it saw fit, without considering maintenance access or labeling of
this critical infrastructure. The builder had instead rerouted the bus ducts into
a shared compartment and neglected to label any of the conductors.

Many more examples are found in almost every exit report from Uptime
Institute Tier Certification engagements.

CONCLUSIONS
Data center capital projects are subject to complex challenges, with multiple
stakeholders and contractors coming together across multiple disciplines. To
ensure that the infrastructure investment meets an organization’s business
requirements, project leaders need to ensure that they have selected the right
partners, empowered a competent owner’s representative, and left adequate
time for rigorous commissioning and third-party certification.

By Kevin Heslin, Chief Editor, Uptime Institute, with Keith Klesner, Senior Vice President, North America, and
input from additional senior and technical staff

BALANCING LIFE SAFETY, INFRASTRUCTURE
INVESTMENT, AND DOWNTIME
Due to the uninterruptible nature of IT infrastructure, many
organizations allow high-risk maintenance activities


Due to the uninterruptible nature of data center operations, a large percentage
of organizations allows maintenance activities on energized electrical equipment.
These conditions put personnel at risk of an arc flash accident.

Electrical accidents such as arc flash occur all too often in facility environments
that have high-energy use requirements, a multitude of high-voltage electrical
systems and components, and frequent maintenance and equipment
installation activities.

The U.S. Occupational Safety and Health Administration (OSHA) defines arc
flash as “a phenomenon where a flashover of electric current leaves its intended
path and travels through the air from one conductor to another, or to ground.
The results are often violent and when a human is in close proximity to the
arc flash, serious injury and even death can occur.”

When these accidents occur they can derail operations and cause serious harm
to workers and equipment. Costs to businesses can include lost work time,
downtime, OSHA investigation, fines, medical costs, litigation, lost business,
equipment damage, and most tragically, loss of life. According to the Workplace
Safety Awareness Council (WPSAC), the average cost of hospitalization for
electrical accidents is US$750,000, with many exceeding US$1,000,000.

There are regulations in the U.S. and globally that set safety requirements, but there
is wide-ranging industry confusion over how to comply with those regulations
and understandable uneasiness with requirements that put personnel at risk.

According to Uptime Institute’s annual data center industry survey, about one-third
of organizations allow maintenance activities on energized electrical equipment at
voltage levels that could cause health or human-safety consequences (see Figure 1).

Does your organization allow maintenance activities on energized
electrical equipment?

YES: 31%
NO: 69%

Figure 1. Source: Uptime Institute Annual Data Center Industry Survey 2015


OSHA and the National Fire Protection Association (NFPA) Standard 70E address
electrical safety in the workplace and provide guidance and regulations on safety
programs, warning labels, personal protective equipment, boundary requirements,
and hazard analysis. And yet, there is widespread confusion over how the codes
should be applied in the data center industry, as evidenced by the responses from
North American data center operators and executives (see Figure 2).

This confusion over how regulations and codes should be applied is clearly a
major issue facing this industry.

Even highly informed experts can disagree on how these regulations should
be applied. The confusion creates opportunities for accidents and operational
exposures to risk that can cause significant injuries and even death.

Do the relevant codes, laws, or regulations in your region prohibit
maintenance activities on energized equipment?

Yes: 29%
Yes, but unenforced: 14%
No: 35%
Don’t know: 22%

Figure 2. North American respondents demonstrate inconsistent understanding and enforcement.

The most effective way to eliminate the risk of electrical shock or arc flash hazard is
to de-energize the equipment. Uptime Institute’s Tier III and Tier IV criteria both
require design and installation of systems that enable equipment to be fully de-
energized to allow planned activities such as repair, maintenance, replacement, or
upgrade without exposing personnel to the risks of working on energized equipment.

INDUSTRY STANDARDS AND REGULATIONS


To prevent these kinds of accidents and injuries, it is imperative that data center
operators understand and follow appropriate safety standards for working with
electrical equipment. Both the NFPA and OSHA have established standards
and regulations that help protect workers against electrical hazards and prevent
electrical accidents in the workplace.

OSHA 29 CFR Part 1910, Subpart S and OSHA 29 CFR Part 1926, Subpart
K include requirements for electrical installation, equipment, safety-related
work practices, and maintenance for general industry and construction
workplaces, including data centers.


Are you uncomfortable with maintenance activities on
energized electrical equipment?

YES: 59%
NO: 41%

Figure 3. Even for organizations with appropriate training, protective equipment, and analysis, many operators
are uncomfortable with “hot work.”

NFPA 70E is a set of detailed standards (issued at the request of OSHA and
updated periodically) that address electrical safety in the workplace. It covers
safe work practices associated with electrical tasks and for performing other
non-electrical tasks that may expose an employee to electrical hazards. OSHA
revised its electrical standard to reference NFPA 70E-2000 and continues to
recognize NFPA 70E today.

OSHA requires that facilities:

• Provide and be able to demonstrate a safety program with defined responsibilities.
• Calculate the degree of arc flash hazard.
• Use correct personal protective equipment (PPE) for workers.
• Train workers on the hazards of arc flash.
• Use appropriate tools for safe working.

• Provide warning labels on equipment.

NFPA 70E further defines “electrically safe work conditions” to mean that
equipment is not and cannot be energized. To ensure these conditions, personnel
must identify all power sources, interrupt the load and disconnect power, visually
verify that a disconnect has opened the circuit, lock out and tag the circuit, test
for absence of voltage, and ground all power conductors, if necessary.

JUSTIFICATION FOR “HOT WORK”


NFPA 70E and OSHA require employers to prove that working in a de-
energized state creates more or worse hazards than the risk presented by
working on live components or is not practical because of equipment design or
operational limitations, for example, when working on circuits that are part of
a continuous process that cannot be completely shut down. Other exceptions
include situations in which isolating and deactivating system components
would create a hazard for people not associated with the work, for example,
when working on life-support systems, emergency alarm systems, ventilation
equipment for hazardous locations, or extinguishing illumination for an area.

In addition, OSHA makes provision for situations in which it would be
“infeasible” to shut down equipment. For example, some maintenance and testing
operations can only be done on live electric circuits or equipment. The decision
to work hot should be made only after careful analysis of what constitutes
infeasibility. In recent years, some well-publicized OSHA actions and
statements have centered on how to interpret this term.

ELECTRICAL SAFETY MEASURES IN PRACTICE


Only qualified persons should work on electrical conductors or circuits that have
not been put into an electrically safe work condition. A qualified person is one
who has received training in and possesses skills and knowledge in the construction
and operation of electric equipment and installation and the hazards involved with
this type of work. Knowledge or training should encompass the skill to distinguish
exposed live parts from other parts of electric equipment, determine the nominal
voltage of exposed live parts, and calculate the necessary clearance distances and the
corresponding voltages to which a worker will be exposed.

An arc flash hazard analysis for any work must be conducted to determine the
appropriate arc flash boundary, the incident energy at the working distance,
and the necessary protective equipment for the task.

NFPA 70E outlines strict standards for the type of PPE required for any
employees working in areas where electrical hazards are present based on the
task, the parts of the body that need protection, and the suitable arc rating
to match the potential flash exposure. PPE includes items such as a flash suit,
switching coat, mask, hood, gloves, and leather protectors. Flame-resistant
clothing underneath the PPE gear is also required.

After an arc flash hazard analysis has been performed, the correct PPE can
be selected according to the equipment’s arc thermal performance value
(ATPV) and the breakopen threshold energy rating (EBT). Together,
these components determine the calculated hazard level that any piece of
equipment is capable of protecting a worker from (measured in calories per
square centimeter). For example, a hard hat with an attached face shield
provides adequate protection for Hazard/Risk Category 2, whereas an arc flash
protection hood is needed for a worker exposed to Hazard/Risk Category 4.
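
To make the selection logic concrete, the short Python sketch below maps a
calculated incident energy (in calories per square centimeter) to a PPE category
and example gear. It is a minimal illustration only: the thresholds, category
labels, and gear descriptions are assumptions loosely based on commonly cited
NFPA 70E-style bands, and any real selection must come from the current edition
of the standard and a site-specific arc flash hazard analysis.

# Illustrative sketch only: maps a calculated incident energy (cal/cm^2)
# to an assumed PPE category. Thresholds and gear lists are placeholders;
# real selections must come from the current NFPA 70E tables and a
# site-specific arc flash hazard analysis.

PPE_TABLE = [
    # (max incident energy in cal/cm^2, category, example protection)
    (1.2,  0, "Non-melting clothing; safety glasses"),
    (4.0,  1, "Arc-rated shirt and pants; hard hat with face shield"),
    (8.0,  2, "Arc-rated clothing; hard hat with attached face shield"),
    (25.0, 3, "Arc flash suit with hood; arc-rated gloves"),
    (40.0, 4, "Heavier arc flash suit with protection hood"),
]

def select_ppe(incident_energy_cal_cm2):
    """Return (category, example gear) for a calculated incident energy."""
    for max_energy, category, gear in PPE_TABLE:
        if incident_energy_cal_cm2 <= max_energy:
            return category, gear
    # Above the highest tabulated band: energized work should not proceed.
    return None, "De-energize the equipment; no PPE category applies"

if __name__ == "__main__":
    for energy in (3.5, 12.0, 45.0):
        category, gear = select_ppe(energy)
        print(f"{energy:5.1f} cal/cm^2 -> category {category}: {gear}")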


PPE is the last line of defense in an arc flash incident; it is not intended to
prevent all injuries, but to mitigate the impact of a flash, should one occur. In
many cases, the use of PPE has saved lives or prevented serious injury.

CONCLUSIONS
It can be argued that some of today’s data center operations approach the status
of being “essential” for much of the underlying infrastructure that runs our
24x7 digitized society. Data centers support the functioning of global financial
systems, power grids and utilities, air traffic control operations, communication
networks, and the information processing that supports vital activities ranging
from daily commerce to national security.

Each facility must assess its operations and system capabilities to enable adherence
to safe electrical work practices as much as possible without jeopardizing critical
mission functions. In many cases, the answer for a specific data center business
requirement may ultimately be a jurisdictional decision.

Balancing the need for appropriate electrical safety measures and compliance
with the need to maintain and sustain uninterrupted production capacity in
an energy-intensive environment is a challenge.

But it is a challenge the data center industry is perhaps better prepared to meet than
many other industry segments. It is apparent that those in the data center industry
who subscribe to high-availability concepts such as the Tier Standards: Topology
and Operational Sustainability have adopted a rigorous approach to cleaning,
maintenance, installation, training, and other tasks that forestall arc flash.

If you had resources, would you invest in infrastructure to prevent
the need for maintenance on energized platforms?

YES: 92%

Figure 4. The vast majority of survey respondents would strongly prefer investment in concurrently
maintainable topology that would eliminate the need for “hot work.”

Organizations that subscribe to Tier standards and maintain stringent
operational practices are better prepared to take on the challenges of compliance
with OSHA and NFPA 70E requirements, in particular the requirements for
safely performing work on energized systems, when such work is allowed per
the safety standards.


No measure will ever completely remove the risk of working on live, energized
equipment. In instances where working on live systems is necessary and allowed
by NFPA 70E rules, the application of Uptime Institute Tier III and Tier IV
criteria can help minimize the risks. Tier III and IV both require the design and
installation of systems that enable equipment to be fully de-energized to allow
planned activities such as repair, maintenance, replacement, or upgrade without
exposing personnel to the risks of working on energized electrical equipment.

ARC FLASH POP QUIZ

• Does this data center [specify] perform site work and maintenance
on energized electrical equipment?

o If no, and you are in a Tier III or IV Certified data center,
then—by design and Uptime Institute award—your
organization has no reason to risk exposure to hot work.

o If no, and you are not in a Tier III or IV Certified data center,
then you are exposed to the risk of equipment failure due to
indefinitely deferred site work and maintenance.

o If yes, these questions may help you to understand your risk
exposure in terms of life safety, unplanned downtime, disrupted
business processes, code violations and penalties, and/or adverse
revenue impact:

— What is the established corporate policy for
performing work on energized electrical equipment?

— Who is informed of, and has signed off on, this policy?

· Data Center Operations
· Maintenance & Site Work Contractors
· IT Systems
· Risk/Compliance
· Life Safety/Health
· Regulatory/Oversight (3rd-Party or Internal)

— When was the last time that work was performed
on energized electrical equipment?


— Who was alerted before the work was performed
on energized electrical equipment?

· Data Center Operations
· IT Systems
· Risk/Compliance
· Life Safety/Health
· Regulatory/Oversight (3rd-Party or Internal)

— Who and how many performed the work (contractors
or employees)?

— What were the safety precautions?

— How long was the hot work period scheduled for?

— How long did it actually take?

— What was the process of QA/QC before the hot
work period concluded and normal operations were
restored?

— What are the scheduled and upcoming hot work
periods?

— As noted in this article, regulations and codes
affecting hot work are changing. Who is responsible
for checking on the latest impacts to corporate
policy and site work activities?

· How often does that check-up occur?

By Matt Stansberry, Senior Director of Content and Publications, Uptime Institute, and Uptime Institute senior
technical staff

EXAMINING AND LEARNING FROM
COMPLEX SYSTEMS FAILURES
Conventional wisdom blames “human error” for the majority
of outages, but those failures are incorrectly attributed to
front-line operator errors, rather than management oversights


Data centers, oil rigs, ships, power plants, and airplanes may seem like vastly
different entities, but all are large and complex systems that can be subject
to failure—sometimes catastrophic failure. Natural events like earthquakes
or storms may initiate a complex system failure. But often blame is assigned
to “human error”—front-line operator mistakes combined with a lack of
appropriate procedures and resources or compromised structures that result
from poor management decisions.

Human error is an insufficient and misleading term. The front-line operator’s
presence at the site of the incident ascribes responsibility to the operator for
failure to rescue the situation. But this masks the underlying causes of an
incident. It is more helpful to consider the site of the incident as a spectacle
of mismanagement.

Responsibility for an incident, in most cases, can be attributed to a senior
management decision (e.g., design compromises, budget cuts, staff reductions,
vendor selection, and resourcing) seemingly disconnected in time and space
from the site of the incident.

What decisions led to a situation where front-line operators were unprepared
or untrained to respond to an incident and mishandled it?

To safeguard against failures, standards and practices have evolved in many
industries that encompass strict criteria and requirements for the design and
operation of systems, often including inspection regimens and certifications.
Compiled, codified, and enforced by agencies and entities, these programs
and requirements help protect the service user from the bodily injuries or
financial effects of failures and spur industries to maintain preparedness and
best practices.

Twenty years of Uptime Institute research into the causes of data center
incidents places predominant accountability for failures at the management
level and finds only single-digit percentages of spontaneous equipment failure.

This fundamental and permanent truth compelled the Uptime Institute to step
further into standards and certifications that were unique to the data center
and IT industry. Uptime Institute undertook a collaborative approach with a
variety of stakeholders to develop outcome-based criteria that would be lasting
and developed by and for the industry. Uptime Institute’s Certifications were
conceived to evaluate, in an unbiased fashion, front-line operations within the
context of management structure and organizational behaviors.


EXAMINING FAILURES
The sinking of the Titanic. The Deepwater Horizon oil spill. DC-10 air crashes
in the 1970s. The failure of New Orleans’ levee system. The Three Mile Island
nuclear release. The northeast (U.S.) blackout of 2003. Battery fires in Boeing
787s. The space shuttle Challenger disaster. Fukushima Daiichi nuclear disaster.
The grounding of the Kulluk arctic drilling rig. These are a few of the most
infamous, and in some cases tragic, engineering system failures in history. While
the examples come from vastly different industries and each story unfolded in
its own unique way, they all have something in common with each other—
and with data centers. All exemplify highly complex systems operating in
technologically sophisticated industries.

John Maclean, author of numerous books analyzing deadly wildfires, including Fire on
the Mountain (Morrow 1999), suggests rebranding “high reliability organizations,” a
concept fundamental to firefighting crews, the military, and the commercial airline
industry, as “high-risk organizations.” A high-reliability organization, like a
goalkeeper, can only fail, because flawless performance is so highly anticipated. A
high-risk organization is tasked with averting or minimizing impact and may gauge
success in a non-binary fashion. It is a recurring theme in Mr. Maclean’s forensic
analyses of deadly fires that front-line operators, including those who perished,
carry the blame for the outcome, while management shortfalls are far less exposed.

The hallmarks of so-called complex systems are “a large number of interacting
components, emergent properties difficult to anticipate from the knowledge
of single components, adaptability to absorb random disruptions, and highly
vulnerable to widespread failure under adverse conditions” (Dueñas-Osorio and
Vemuru 2009). Additionally, the components of complex systems typically
interact in non-linear fashion, operating in large interconnected networks.

Large systems and the industries that use them have many safeguards against
failure and multiple layers of protection and backup. Thus, when they fail it
is due to much more than a single element or mistake.

It is a truism that complex systems tend to fail in complex ways. Looking
at just a few examples from various industries, again and again we see that
it was not a single factor but the compound effect of multiple factors that
disrupted these sophisticated systems. Often referred to as “cascading failures,”
complex system breakdowns usually begin when one component or element
of the system fails, requiring nearby “nodes” (or other components in the
system network) to take up the workload or service obligation of the failed
component. If this increased load is too great, it can cause other nodes to
overload and fail as well, creating a waterfall effect as every component failure
increases the load on the other, already stressed components. The following
transferable concept is drawn from the power industry:

Power transmission systems are heterogeneous networks of large numbers of
components that interact in diverse ways. When component operating limits are
exceeded, protection acts to disconnect the component and the component ‘fails’ in the
sense of not being available... Components can also fail in the sense of misoperation
or damage due to aging, fire, weather, poor maintenance, or incorrect design or
operating settings... The effects of the component failure can be local or can involve
components far away, so that the loading of many other components throughout the
network is increased... the flows all over the network change (Dobson, et al. 2009).
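
The load-redistribution dynamic described in that passage can be illustrated with
a short Python sketch (not drawn from the Dobson et al. model; the topology,
loads, and capacities below are arbitrary placeholder values). One component
fails, its load is shared among its surviving neighbors, and any neighbor pushed
past its capacity fails in turn:

# Minimal cascading-failure sketch: when a component fails, its load is
# redistributed to surviving neighbors; any neighbor pushed past capacity
# fails as well. Topology, loads, and capacities are illustrative only.

network = {
    "A": {"load": 60.0, "capacity": 100.0, "neighbors": ["B", "C"]},
    "B": {"load": 80.0, "capacity": 100.0, "neighbors": ["A", "C"]},
    "C": {"load": 90.0, "capacity": 100.0, "neighbors": ["A", "B"]},
}

def cascade(net, first_failure):
    failed = set()
    queue = [first_failure]
    while queue:
        name = queue.pop(0)
        if name in failed:
            continue
        failed.add(name)
        survivors = [n for n in net[name]["neighbors"] if n not in failed]
        if not survivors:
            continue
        share = net[name]["load"] / len(survivors)  # even redistribution
        for n in survivors:
            net[n]["load"] += share
            if net[n]["load"] > net[n]["capacity"]:
                queue.append(n)  # the overloaded neighbor fails next
    return failed

print("Failed components:", sorted(cascade(network, "A")))
# The 60 units of load from A split onto B and C, pushing both past their
# 100-unit capacity: the whole small network fails even though only one
# component failed initially.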

A component of the network can be mechanical, structural, or human, as when
front-line operators respond to an emerging crisis. Just as engineering
components can fail when overloaded, so can human effectiveness and
decision-making capacity diminish under duress. A defining characteristic of
a high-risk organization is that it provides structure and guidance despite
extenuating circumstances—duress is its standard operating condition.

The sinking of the Titanic is perhaps the most well-known complex system
failure in history. This disaster was caused by the compound effect of structural
issues, management decisions, and operating mistakes that led to the tragic
loss of 1,495 lives. Just a few of the critical contributing factors include design
compromises (e.g., reducing the height of the watertight bulkheads that allowed
water to flow over the tops and limiting the number of lifeboats for aesthetic
considerations), poor discretionary decisions (e.g., sailing at excessive speed
on a moonless night despite reports of icebergs ahead), operator error (e.g.,
the lookout in the crow’s nest had no binoculars—a cabinet key had been left
behind in Southampton), and misjudgment in the crisis response (e.g., the pilot
tried to reverse thrust when the iceberg was spotted, instead of continuing at full
speed and using the momentum of the ship to turn course and reduce impact).
And, of course, there was the hubris of believing the ship was unsinkable.

Figure 1a. (Left) NTSB photo of the burned auxiliary power unit battery from a JAL Boeing 787 that caught fire
on January 7, 2013 at Boston’s Logan International Airport. Photo credit: By National Transportation Safety
Board (NTSB) [Public domain], via Wikimedia Commons. Figure 1b. (Right) A side-by-side comparison of an
original Boeing Dreamliner (787) battery and a damaged Japan Airlines battery. Photo credit: By
National Transportation Safety Board (NTSB) [Public domain], via Wikimedia Commons.


Looking at a more recent example, the issue of battery fires in Japan Airlines
(JAL) Boeing 787s, which came to light in 2013 (see Figure 1), was ultimately
blamed on a combination of design, engineering, and process management
shortfalls (Gallagher 2014). Following its investigation, the U.S. National
Transportation Safety Board reported (NTSB 2014):

• Manufacturer errors in design and quality control. The manufacturer
failed to adequately account for the thermal runaway phenomenon:
an initial overheating of the batteries triggered a chemical reaction that
generated more heat, thus causing the batteries to explode or catch
fire. Battery “manufacturing defects and lack of oversight in the cell
manufacturing process” resulted in the development of lithium mineral
deposits in the batteries. Called lithium dendrites, these deposits can
cause a short circuit that reacts chemically with the battery cell, creating
heat. Lithium dendrites occurred in wrinkles that were found in some
of the battery electrolyte material, a manufacturing quality control issue.

• Shortfall in certification processes. The NTSB found shortcomings in
U.S. Federal Aviation Administration (FAA) guidance and certification
processes. Some important factors were overlooked that should have
been considered during safety assessment of the batteries.

• Lack of contractor oversight and proper change orders. A cadre of
contractors and subcontractors were involved in the manufacture of
the 787’s electrical systems and battery components. Certain entities
made changes to the specifications and instructions without proper
approval or oversight. When the FAA performed an audit, it found
that Boeing’s prime contractor wasn’t following battery component
assembly and installation instructions and was mislabeling parts. A lack
of “adherence to written procedures and communications” was cited.

How many of these circumstances parallel those that can happen during the
construction and operation of a data center? It is all too common to find deviations
from as-designed systems during the construction process, inconsistent quality control
oversight, and the use of multiple subcontractors. Insourced and outsourced resources
may disregard or hurry past written procedures, documentation, and communication
protocols (see “Avoiding Data Center Capital Project Failures,” page 24).

THE NATURE OF COMPLEX SYSTEM FAILURES


Large industrial and engineered systems are risky by their very nature. The
greater the number of components, and the higher the energy and heat levels,
velocity, size, and weight of those components, the greater the skill and
teamwork required to plan, manage, and operate the systems safely. Between
mechanical components and human actions, there are thousands of possible
points where an error can occur and potentially trigger a chain of failures.

In his seminal article on the topic of complex system failure, “How Complex
Systems Fail”—first published in 1998 and still widely referenced today—
Dr. Richard I. Cook identifies and discusses 18 core elements of failure in
complex systems:

1. Complex systems are intrinsically hazardous systems.
2. Complex systems are heavily and successfully defended against failure.
3. Catastrophe requires multiple failures—single point failures are not enough.
4. Complex systems contain changing mixtures of failures latent within them.
5. Complex systems run in degraded mode.
6. Catastrophe is always just around the corner.
7. Post-accident attribution to a ‘root cause’ is fundamentally wrong.
8. Hindsight biases post-accident assessments of human performance.
9. Human operators have dual roles: as producers and as defenders against failure.
10. All practitioner actions are gambles.
11. Actions at the sharp end resolve all ambiguity.
12. Human practitioners are the adaptable element of complex systems.
13. Human expertise in complex systems is constantly changing.
14. Change introduces new forms of failure.
15. Views of ‘cause’ limit the effectiveness of defenses against future events.
16. Safety is a characteristic of systems and not of their components.
17. People continuously create safety.
18. Failure-free operations require experience with failure (Cook 1998).

Let’s examine some of these principles in the context of a data center. Certainly high-
voltage electrical systems, large-scale mechanical and infrastructure components,
high-pressure water piping, power generators, and other elements create hazards
[Element 1] for both humans and mechanical systems/structures. Data center
systems are defended from failure by a broad range of measures [Element 2], both
technical (e.g., redundancy, alarms, and safety features of equipment) and human
(e.g., knowledge, training, and procedures). Because of these multiple layers
of protection, a catastrophic failure would require the breakdown of multiple
systems or multiple individual points of failure [Element 3].

RUNNING NEAR CRITICAL FAILURE


Complex systems science suggests that most large-scale complex systems, even
well-run ones, by their very nature are operating in “degraded mode” [Element
5], i.e., close to the critical failure point. This is due to the progression over
time of various factors including steadily increasing load demand, engineering
forces, and economic factors.


The enormous investment in data center and other highly available
infrastructure systems perversely incents conditions of elevated risk and higher
likelihood of failure. Maximizing capacity, increasing density, and hastening
production from installed infrastructure improve the return on investment
(ROI) on these major capital investments. Deferred maintenance, whether
due to lack of budget or to hands-off periods during heightened production,
further pushes equipment toward its performance limits—the breaking point.

The increasing density of data center infrastructure exemplifies the dynamics that
continually and inexorably push a system towards critical failure. Server density is
driven by a mixture of engineering forces (advancements in server design and efficiency)
and economic pressures (demand for more processing capacity without increasing
facility footprint). Increased density then necessitates corresponding increases in the
number of critical heating and cooling elements. Now the system is running at higher
risk, with more components (each of which is subject to individual fault/failure), more
power flowing through the facility, and more heat generated, etc.

This development trajectory demonstrates just a few of the powerful “self-
organizing” forces in any complex system. According to Dobson et al. (2009),
“these forces drive the system to a dynamic equilibrium that keeps [it] near a
certain pattern of operating margins relative to the load. Note that engineering
improvements and load growth are driven by strong, underlying economic
and societal forces that are not easily modified.”

Because of this dynamic mix of forces, the potential for a catastrophic outcome
is inherent in the very nature of complex systems [Element 6]. For large-scale
mission critical and business critical systems, the profound implication is that
designers, system planners, and operators must acknowledge the potential for
failure and build in safeguards.

WHY IS IT SO EASY TO BLAME HUMAN ERROR?


Human error is often cited as the root cause of many engineering system
failures, yet it does not often cause a major disaster on its own. Based on
analysis of 20 years of data center incidents, Uptime Institute holds that human
error must signify management failure to drive change and improvement.
Leadership decisions and priorities that result in a lack of adequate staffing
and training, an organizational culture that becomes dominated by a fire drill
mentality, or budget cutting that reduces preventive/proactive maintenance
could result in cascading failures that truly flow from the top down.

Although front-line operator error may sometimes appear to cause an incident,
a single mistake (just like a single data center component failure) is not often
sufficient to bring down a large and robust complex system unless conditions
are such that the system is already teetering on the edge of critical failure and
has multiple underlying risk factors.

For example, media reports after the 1989 Exxon Valdez oil spill zeroed in
on the fact that the captain, Joseph Hazelwood, was not at the bridge at
the time of the accident and accused him of drinking heavily that night.
However, more measured assessments of the accident by the NTSB and others
found that Exxon had consistently failed to supervise the captain or provide
sufficient crew for necessary rest breaks (see Figure 2).

Figure 2. Shortly after leaving the Port of Valdez, the Exxon Valdez ran aground on Bligh Reef. The picture
was taken three days after the vessel grounded, just before a storm arrived. Photo credit: Office of Response
and Restoration, National Ocean Service, National Oceanic and Atmospheric Administration [Public domain],
via Wikimedia Commons.

Perhaps even more critical was the lack of essential navigation systems: the
tanker’s radar was not operational at the time of the accident. Reports indicate that
Exxon’s management had allowed the RAYCAS radar system to stay broken
for an entire year before the vessel ran aground because it was expensive to
operate. There was also inadequate disaster preparedness and an insufficient
quantity of oil spill containment equipment in the region, despite the
experiences of previous small oil spills. Four years before the accident, a letter
written by Captain James Woodle, who at that time was the Exxon oil group’s
Valdez port commander, warned upper management, “Due to a reduction in
manning, age of equipment, limited training and lack of personnel, serious
doubt exists that [we] would be able to contain and clean-up effectively a
medium or large size oil spill” (Palast 1999).

As Dr. Cook points out, post-accident attribution to a root cause is
fundamentally wrong [Element 7]. Complete failure requires multiple faults,
thus attribution of blame to a single isolated element is myopic and, arguably,
scapegoating. Exxon blamed Captain Hazelwood for the accident, and his
share of the blame obscures the underlying mismanagement that led to the
failure. Inadequate enforcement by the U.S. Coast Guard and other regulatory
agencies further contributed to the disaster.

Similarly, the grounding of the oil rig Kulluk was the direct result of a cascade of
discrete failures, errors, and mishaps, but the disaster was first set in motion by
Royal Dutch Shell’s executive decision to move the rig off of the Alaskan coastline
to avoid tax liability, despite high risks (Lavelle 2014). As a result, the rig and its
tow vessels undertook a challenging 1,700-nautical-mile journey across the icy
and storm-tossed waters of the Gulf of Alaska in December 2012 (Funk 2014).


There had already been a chain of engineering and inspection compromises
and shortfalls surrounding the Kulluk, including the installation of used and
uncertified tow shackles, a rushed refurbishment of the tow vessel Discovery,
and electrical system issues with the other tow vessel, the Aivik, which had
not been reported to the Coast Guard as required. (Discovery experienced
an exhaust system explosion and other mechanical issues in the following
months. Ultimately the tow company—a contractor—was charged with a
felony for multiple violations.)

This journey would be the Kulluk’s last, and it included a series of
additional mistakes and mishaps. Gale-force winds put continual stress on the
tow line and winches. The tow ship was captained on this trip by an
inexperienced replacement, who seemingly mistook tow line tensile alarms (set
to go off when tension exceeded 300 tons) for another alarm that was known to
be falsely annunciating. At one point the Aivik, in attempting to circle back
and attach a new tow line, was swamped by a wave, sending water into the fuel
pumps (a problem that had previously been identified but not addressed), which
caused the engines to begin to fail over the next several hours (see Figure 3).

Figure 3. Waves crash over the drilling unit Kulluk where it sits aground on the southeast side of
Sitkalidak Island, AK, on January 1, 2013. Photo Credit: By Petty Officer 3rd Class Jonathan Klingenberg,
United States Coast Guard [Public domain], via Wikimedia Commons.

Despite harrowing conditions, Coast Guard helicopters were eventually able
to rescue the 18 crew members aboard the Kulluk. Valiant last-ditch tow
attempts were made by the (repaired) Aivik and Coast Guard tugboat Alert,
before the effort had to be abandoned and the oil rig was pushed aground by
winds and currents.

Poor management decision making, lack of adherence to proper procedures
and safety requirements, taking shortcuts in the repair of critical mechanical
equipment, insufficient contractor oversight, lack of personnel training/
experience—all of these elements of complex system failure are readily seen as
contributing factors in the Kulluk disaster.

EXAMINING DATA CENTER SYSTEM FAILURES


Two recent incidents demonstrate how the dynamics of complex systems
failures can quickly play out in the data center environment.


Example A
The data center in this example had been designed appropriately with
fuel pumps and engine-generator controls powered from multiple circuit
panels. As built, however, a single panel powered both, whether due to
implementation oversight or cost reduction measures. At issue is not the
installer, but rather the quality of communications from the implementation
team to the operations team.

In the course of operations, technicians had to shut off utility power while
performing routine maintenance on an electrical switchgear. This
meant the building was running on engine-generator sets. However, when
the engine-generator sets started to surge due to a clogged fuel line, the UPS
automatically switched the facility to battery power. The day tanks for the
engine-generator sets were starting to run dry. If quick-thinking operators had
not discovered the fuel pump issue in time, there would have been an outage
to the entire facility: a cascade of events leading down a rapid pathway from
simple routine maintenance activity to complete system failure.

Example B
In this example, an enterprise data center shared space with corporate offices
in the same building, with a single chilled water plant used to cool both sides
of the building. The office air handling units also brought in outside air to
reduce cooling costs.

One night, the site experienced particularly cold temperatures and the
control system did not switch from outside air to chilled water for office
building cooling, which affected data center cooling as well. The freeze stat
(a temperature sensing device that monitors a heat exchanger to prevent its
coils from freezing) failed to trip; thus the temperature continued to drop and
the cooling coil froze and burst, leaking chilled water onto the floor of the
data center. There was a limited leak detection system in place and connected,
but it had not been fully tested yet. Chilled water continued to leak until
pressure dropped and then the chilled water machines started to spin offline
in response. Once the chilled water machines went offline neither the office
building nor data center had active cooling.

At this point, despite the extreme outside cold, temperatures in the data hall rose
through the night. As a result of the elevated indoor temperature conditions,
the facility experienced myriad device-level (e.g., servers, disc drives, and fans)
failures over the following several weeks. Though a critical shutdown was
avoided, the damage to components and systems—and the cost of cleanup,
replacement parts, and labor—was significant. A single initiating factor—a
cold night—combined with other elements in a cascade of failures.
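
The control gap in Example B can be made concrete with a brief sketch. The set
points, device names, and logic below are illustrative assumptions rather than
the site’s actual sequence of operations; the point is simply that a changeover
to chilled water at low outside air temperatures, a freeze stat trip, and a leak
detection alarm should each produce an explicit, monitored response:

# Illustrative control-logic sketch for Example B. Set points, names, and
# structure are hypothetical assumptions, not the site's actual sequence.

ECONOMIZER_LOW_LIMIT_C = 4.0   # below this, stop using outside air (assumed value)
FREEZE_STAT_TRIP_C = 2.0       # coil freeze-protection threshold (assumed value)

def cooling_control(outside_air_c, coil_temp_c, leak_detected):
    """Return (cooling_source, alarms) for one control cycle."""
    alarms = []

    # Changeover: very cold outside air should force chilled water cooling
    # with the outside air dampers closed, protecting the cooling coil.
    if outside_air_c > ECONOMIZER_LOW_LIMIT_C:
        source = "outside_air"
    else:
        source = "chilled_water"

    # Freeze protection: a tripped freeze stat must raise an alarm so that
    # operators can respond before the coil freezes and bursts.
    if coil_temp_c <= FREEZE_STAT_TRIP_C:
        alarms.append("FREEZE STAT TRIP: close outside air dampers, notify operations")

    # Leak detection must be commissioned and monitored, not merely installed.
    if leak_detected:
        alarms.append("LEAK DETECTED: isolate affected loop, notify operations")

    return source, alarms

# The cold night in Example B: the changeover and both alarms should fire.
print(cooling_control(outside_air_c=-10.0, coil_temp_c=1.5, leak_detected=True))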


In both of these cases, severe disaster was averted, but relying on front-line
operators to save the situation is neither robust nor reliable.

PREVENTING FAILURES IN THE DATA CENTER


Facility infrastructure is only one component of failure prevention; how
a facility is run and operated on a day-to-day basis is equally critical. As
Dr. Cook noted, humans have a dual role in complex systems as both the
potential producers (causes) of failure as well as, simultaneously, some of the
best defenders against failure [Element 9].

The fingerprints of human error can be seen on both data center examples. In
Example A, the electrical panel was not set up as originally designed; in
Example B, the leak detection system, which could have alerted operators to the
problem, had not been fully activated.

Dr. Cook also points out that human operators are the most adaptable
component of complex systems [Element 12], as they “actively adapt the
system to maximize production and minimize accidents.” For example,
operators may “restructure the system to reduce exposure of vulnerable
parts,” reorganize critical resources to focus on areas of high demand, provide
“pathways for retreat or recovery,” and “establish means for early detection of
changed system performance in order to allow graceful cutbacks in production
or other means of increasing resiliency.” Given the highly dynamic nature of
complex system environments, this human-driven adaptability is key.

STANDARDIZATION CAN ADDRESS MANAGEMENT SHORTFALLS


In most of the notable failures in recent decades, there was a breakdown or
circumvention of established standards and certifications. It was not a lack of
standards, but a lack of compliance or sloppiness that contributed the most
to the disastrous outcomes. For example, in the case of the Boeing batteries,
the causes were bad design, poor quality inspections, and lack of contractor
oversight. In the case of the Exxon Valdez, inoperable navigation systems and
inadequate crew manpower and oversight—along with insufficient disaster
preparedness—were critical factors. If leadership, operators, and oversight
agencies had adhered to their own policies and requirements and had not cut
corners for economics or expediency, these disasters might have been avoided.

Ongoing operating and management practices and adherence to recognized
standards and requirements, therefore, must be the focus of long-term risk
mitigation. In fact, Dr. Cook states that “failure-free operations are the result
of activities of people who work to keep the system within the boundaries
of tolerable performance.... human practitioner adaptations to changing
conditions actually create safety from moment to moment” [Element 17].


This emphasis on human activities as decisive in preventing failures dovetails
with Uptime Institute’s advocacy of operational excellence as set forth in the
Tier Standard: Operational Sustainability. This was the data center industry’s
first standardization, developed by and for data centers, to address the
management shortfalls that could unwind the most advanced, complex, and
intelligent of solutions. Uptime Institute was compelled by its findings that
the vast majority of data center incidents could be attributed to operations,
despite advancements in technology, monitoring, and automation.

“It is human nature for elected officials and the general public to tend to disregard the
fact that the ubiquitous complex systems that we rely on in our daily lives operate at a
certain level of risk—a likelihood that they may fail, and that their failures need to be
addressed proactively.” (ASME 2011)

The Operational Sustainability criteria pinpoint the elements that impact long-
term data center performance, encompassing site management and operating
behaviors, and documentation and mitigation of site-specific risks. The detailed
criteria include personnel qualifications and training, as well as policies and procedures
that support operating teams in effectively preventing failures and responding
appropriately when small failures occur to avoid having them cascade into large
critical failures. As Dr. Cook states, “Failure free operations require experience with
failure” [Element 18]. We have the opportunity to learn from the experience of other
industries, and, more importantly, from the data center industry’s own experience,
as collected and analyzed in Uptime Institute’s Abnormal Incident Reports database.
Uptime Institute has captured and catalogued the lessons learned from more than
5,000 errors and incidents over the last 20 years and used that research knowledge
base to help develop an authoritative set of benchmarks. It has ratified these with
leading industry experts and gained the consensus of global stakeholders from each
sector of the industry. Uptime Institute’s Tier Certifications and Management &
Operations (M&O) Stamp of Approval provide the most definitive guidelines for
and verification of effective risk mitigation and operations management.

Dr. Cook explains, “More robust system performance is likely to arise in
systems where operators can discern the ‘edge of the envelope.’ It also depends
on calibrating how their actions move system performance towards or away
from the edge of the envelope. [Element 18]” Uptime Institute’s deep subject
matter expertise, long experience, and evidence-based standards can help data
center operators identify and stay on the right side of that edge. Organizations
like CenturyLink are recognizing the value of applying a consistent set of
standards to ensure operational excellence and minimize the risk of failure
in the complex systems represented by their data center portfolio (See the
sidebar CenturyLink and the M&O Stamp of Approval).


CENTURYLINK AND THE M&O STAMP OF APPROVAL


The IT industry has growing awareness of the importance of management-people-process
issues. That’s why Uptime Institute’s Management & Operations (M&O) Stamp of Approval
focuses on assessing and evaluating both operations activities and management as
equally critical to ensuring data center reliability and performance. The M&O Stamp can
be applied to a single data center facility, or administered across an entire portfolio to
ensure consistency.

Recognizing the necessity of making a commitment to excellence at all levels of an
organization, CenturyLink is the first service provider to embrace the M&O assessment
for all of its data centers. It has contracted Uptime Institute to assess 57 data center
facilities across a global portfolio. This decision shows the company is willing to hold
itself to a uniform set of high standards and operate with transparency. The company
has committed to achieve M&O Stamp of Approval standards and certification across the
board, protecting its vital networks and assets from failure and downtime and providing
its customers with assurance.

CONCLUSION
Complex systems fail in complex ways, a reality exacerbated by the business
need to operate complex systems on the very edge of failure. The highly
dynamic environments of building and operating an airplane, ship, or oil
rig share many traits with running a high availability data center. The risk
tolerance for a data center is similarly very low, and data centers are susceptible
to the heroics and missteps of many disciplines. The coalescing element is
management, which makes sure that front-line operators are equipped with the
hands, tools, parts, and processes they need, and with the unbiased oversight and
certifications to identify risks and drive continuous improvement against the
continuous exposure to complex failure.

By Julian Kudritzki, COO, Uptime Institute with Uptime Institute editorial and research staff

REFERENCES

ASME (American Society of Mechanical Engineers). 2011. Initiative to Address Complex


Systems Failure: Prevention and Mitigation of Consequences. Report prepared by Nexight
Group for ASME (June). Silver Spring MD: Nexight Group. http://nexightgroup.com/wp-
content/uploads/2013/02/initiative-to-address-complex-systems-failure.pdf

Bassett, Vicki. (1998). “Causes and effects of the rapid sinking of the Titanic,” working
paper. Department of Mechanical Engineering, the University of Wisconsin. http://writing.
engr.vt.edu/uer/bassett.html#authorinfo.

BBC News. 2015. “Safety worries lead US airline to ban battery shipments.” March 3, 2015.
http://www.bbc.com/news/technology-31709198

Brown, Christopher and Matthew Mescal. 2014. View From the Field. Webinar presented
by Uptime Institute, May 29, 2014. https://uptimeinstitute.com/research-publications/
asset/webinar-recording-view-from-the-field

Cook, Richard I. 1998. “How Complex Systems Fail (Being a Short Treatise on the Nature
of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and
the Resulting New Understanding of Patient Safety).” Chicago, IL: Cognitive Technologies
Laboratory, University of Chicago. Copyright 1998, 1999, 2000 by R.I. Cook, MD, for CtL.
Revision D (00.04.21), http://web.mit.edu/2.75/resources/random/How%20Complex%20
Systems%20Fail.pdf

Dobson, Ian, Benjamin A. Carreras, Vickie E. Lynch and David E. Newman. 2009. “Complex
systems analysis of a series of blackouts: Cascading failure, critical points, and self-
organization.” Chaos: An Interdisciplinary Journal of Nonlinear Science 17: 026103
(published by the American Institute of Physics).

Dueñas-Osorio, Leonard and Srivishnu Mohan Vemuru. 2009. Abstract for “Cascading
failures in complex infrastructure systems.” Structural Safety 31 (2): 157-167.

Funk, McKenzie. 2014. “The Wreck of the Kulluk.” New York Times Magazine December 30,
2014. http://www.nytimes.com/2015/01/04/magazine/the-wreck-of-the-kulluk.html?_r=0

Gallagher, Sean. 2014. “NTSB blames bad battery design—and bad management—in Boeing 787
fires.” Ars Technica, December 2, 2014. http://arstechnica.com/information-technology/2014/12/
ntsb-blames-bad-battery-design-and-bad-management-in-boeing-787-fires/

Glass, Robert, Walt Beyeler, Kevin Stamber, Laura Glass, Randall LaViolette, Stephen
Contrad, Nancy Brodsky, Theresa Brown, Andy Scholand, and Mark Ehlen. 2005. Simulation
and Analysis of Cascading Failure in Critical Infrastructure. Presentation (annotated
version) Los Alamos National Laboratory, National Infrastructure Simulation and Analysis
Center (Department of Homeland Security), and Sandia National Laboratories, July 2005.
New Mexico: Sandia National Laboratories. http://www.sandia.gov/CasosEngineering/
docs/Glass_annotatedpresentation.pdf

Kirby, R. Lee. 2012. “Reliability Centered Maintenance: A New Approach.” Mission Critical,
June 12, 2012. http://www.missioncriticalmagazine.com/articles/84992-reliability-
centered-maintenance--a-new-approach

Klesner, Keith. 2015. “Avoiding Data Center Construction Problems.” The Uptime Institute
Journal. 5: Spring 2014: 6-12. https://journal.uptimeinstitute.com/avoiding-data-center-
construction-problems/

Lipsitz, Lewis A. 2012. “Understanding Health Care as a Complex System: The Foundation
for Unintended Consequences.” Journal of the American Medical Association 308 (3):
243–244. http://jama.jamanetwork.com/article.aspx?articleid=1217248

Lavelle, Marianne. 2014. “Coast Guard blames Shell risk taking in the wreck of the
Kulluk.” National Geographic, April 4, 2014. http://news.nationalgeographic.com/news/
energy/2014/04/140404-coast-guard-blames-shell-in-kulluk-rig-accident/

New York Times. “Exxon Valdez Oil Spill.” NYTimes.com, last updated August 3, 2010.
http://topics.nytimes.com/top/reference/timestopics/subjects/e/exxon_valdez_oil_spill_1989/index.html

NTSB (National Transportation Safety Board). 2014. “Auxiliary Power Unit Battery Fire
Japan Airlines Boeing 787-8, JA829J.” Aircraft Incident Report released 11/21/14.
Washington, DC: National Transportation Safety Board. http://www.ntsb.gov/
Pages/..%5Cinvestigations%5CAccidentReports%5CPages%5CAIR1401.aspx

Palast, Greg. 1999. “Ten Years After But Who Was to Blame?” for Observer/Guardian UK,
March 20, 1999. http://www.gregpalast.com/ten-years-after-but-who-was-to-blame/

Pederson, Brian. 2014. “Complex systems and critical missions—today’s data center.”
Lehigh Valley Business, November 14, 2014. http://www.lvb.com/article/20141114/
CANUDIGIT/141119895/complex-systems-and-critical-missions--todays-data-center

Plsek, Paul. 2003. Complexity and the Adoption of Innovation in Healthcare. Presentation,
Accelerating Quality Improvement in Health Care Strategies to Speed the Diffusion of
Evidence-Based Innovations, conference in Washington, DC, January 27-28, 2003.
Roswell, GA: Paul E Plsek & Associates, Inc. http://www.nihcm.org/pdf/Plsek.pdf

Reason, J. 2000. “Human Error: Models and Management.” British Medical Journal 320 (7237): 768–770. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1117770/

Reuters. 2014. “Design flaws led to lithium-ion battery fires in Boeing 787: U.S. NTSB.” December 2, 2014. http://www.reuters.com/article/2014/12/02/us-boeing-787-battery-idUSKCN0JF35G20141202

Wikipedia, s.v. “Cascading Failure,” last modified April 12, 2015. https://en.wikipedia.org/
wiki/Cascading_failure

Wikipedia, s.v. “Sinking of the RMS Titanic,” last modified July 21, 2015. https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic

Wikipedia, s.v. “SOLAS Convention,” last modified June 21, 2015. https://en.wikipedia.org/wiki/SOLAS_Convention

APPLY EFFICIENT IT PRINCIPLES TO ADDRESS
SUSTAINABILITY RISKS
As Corporate Sustainability programs become increasingly
important to C-level execs and investors, IT organizations need
to adopt more meaningful KPIs to remain relevant


Increasingly, Corporate Sustainability drives decisions at large companies, as this function can affect a company’s standing with the investor community, its stock performance and capitalization.

For many companies, IT infrastructure is the organization’s largest expense, carbon emitter, and environmental liability due to the resource intensiveness of technology operations.

Many of the corporate risk considerations associated with global sustainability concerns have an impact on IT infrastructure decisions, including but not limited to:

• Climate change and other direct environmental impacts
• Government regulation
• Resource scarcity
• Ethical investment and sourcing
• Waste management

Large companies are under pressure from activists, investors, regulators, and
customers to adopt sustainable business practices and communicate those
methods and metrics to stakeholders. Noncompliance with sustainable practices
can result in reputational damage, litigation, and penalties.

Starting in 2009, NGOs like Greenpeace pilloried Facebook and other web scale
companies for sourcing carbon-intensive utility providers. Many companies have
significantly shifted utility sourcing in the interim and worked to ensure they are
demonstrating efficiency and environmental stewardship, but these companies
continue to be closely monitored. More traditional enterprise organizations are
also under scrutiny—especially companies with large IT footprints.

Figure 1: Studies find companies that focus on sustainability issues achieve real and quantifiable financial impacts.


Figure 2: Stockholders, financial institutions, and customers increasingly scrutinize a company’s sustainability record before deciding to invest money.

While many companies approach sustainability to avoid negative outcomes, the positive benefits of a robust sustainability program may be even more profound (see Figures 1 and 2). The seventh global executive study on corporate sustainability
from MIT Sloan Management Review and The Boston Consulting Group (BCG)
found that 75% of investors cite improved revenue performance and operational
efficiency from sustainability as strong reasons to invest. More than 60% believe
that solid sustainability performance reduces a company’s risks. Nearly the same
number also strongly believes that it lowers a company’s cost of capital. At the same
time, nearly half of investors say that they won’t invest in a company with a record
of poor sustainability performance. Some 60% of investment firm board members
say they are willing to divest from companies with a poor sustainability footprint.

Other benefits include cost savings, increased employee attraction and retention,
and increased customer loyalty.

Figure 3: Business Drivers for Efficient IT (Innovation, Risk Management, Cost Savings). Embracing Efficient IT principles provides an immediate financial return, but also delivers value to risk management and enables business agility.


Yet only 60% of managers in publicly traded companies believe that good sustainability performance is materially important to investors’ investment decisions. For rank-and-file employees, across IT and other lines of business, sustainability is rarely a priority. IT has an especially poor track record, tolerating inefficiency and waste in favor of expedience and performance. To implement an effective sustainability program, companies have had to mandate participation at the corporate level.

Figure 4: The vast majority of large companies have implemented a formal sustainability program. IT
departments are participating in these programs at a nascent level.

Most organizations meet these challenges by creating sustainability offices that have both C-level visibility and broad staff participation across all business units and facilities, including IT (see Figure 4).

Two years ago, a place at the IT table for sustainability would have been provocative, and perhaps evoked derision. In 2015, fewer than a tenth of enterprise IT stakeholders had confidence in corporate sustainability’s ability to affect IT efficiency and costs.

IT often stood apart, isolated from the rest of the company because of the
perceived complexity of its needs, the robustness of its procedures, and low
prioritization of cost and resource savings.

One short year later, 2016 is a vastly different matter, and the data suggest that the time of corporate sustainability in IT is here now: 70% of enterprise IT organizations actively participate in corporate sustainability efforts. The influence of an outside party breaks down the ‘thwart by silo’ effect that has undermined so many well-meaning, and often fruitless, efforts to reshape IT.

The relationship between Corporate Sustainability and Enterprise IT is really just getting started. There are good signs for the potential of this relationship, but also signals that entrenched behaviors and metrics will be difficult to overcome.


The relationship so far, according to survey respondents, has been overwhelmingly positive (see Figure 5):

Figure 5: The relationship between IT and Corporate Sustainability has been positive so far, with limited
downside and increased visibility of achievement to senior management.

And yet, looking at what IT actually reports, we still have a long way to go.
If IT infrastructure leaders are motivated to improve the accountability and
efficiency of their organizations, Corporate Sustainability is a great partner for
gaining C-level buy-in and funding for projects.

But in many cases, IT Infrastructure teams are still relying on the least meaningful metrics to drive efficiency. Of the 70% of IT organizations that participate in corporate sustainability, the majority focuses on metrics with the least impact on the cost and carbon picture.

The majority of IT departments submit metrics to sustainability departments that focus wholly on the data center—such as total data center power consumption and Power Usage Effectiveness, or PUE (the ratio of total facility power to the power delivered to IT equipment, which describes the overhead of a data center’s cooling and power infrastructure).
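PUE itself is simple arithmetic. A minimal sketch, using hypothetical annual energy figures rather than survey data:

```python
# Hypothetical annual energy figures for a single site; not survey data.
total_facility_energy_kwh = 10_500_000   # utility-metered: IT + cooling + power losses + lighting
it_equipment_energy_kwh = 7_000_000      # measured at the UPS or PDU output

pue = total_facility_energy_kwh / it_equipment_energy_kwh
print(f"PUE = {pue:.2f}")  # 1.50: half a kWh of facility overhead for every kWh delivered to IT
```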

These KPIs address less than 15% of the opportunity to improve IT efficiency.
The data center facility is only the site of resource consumption and can only,
to a limited extent, be made more efficient in and of itself.

The actual decision making that affects resource consumption generally does not occur at that street address—sometimes not even in that same city, state, or country. Additionally, the invoices and/or allocations for those resources consumed are not all sent or charged to the right location or department.

Thus, Efficient IT relies on organizational navigation rather than spot improvements in a specific site.


The problem of comatose servers is a prime example of rampant inefficiency in IT that can be addressed only through the org chart and not the street address. Industry reports suggest server CPU utilization in many enterprises is in the single-digit percentages. Furthermore, Uptime Institute research has demonstrated that over 20% of IT equipment serves no business function whatsoever. Hardware long abandoned by application owners and users, but still racked and running, hides in plain sight within even the most sophisticated IT organizations.
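As an illustration only, a team with access to basic utilization telemetry might flag decommissioning candidates with a filter like the sketch below; the thresholds and record fields are assumptions, not an Uptime Institute methodology.

```python
from dataclasses import dataclass

@dataclass
class ServerSample:
    hostname: str
    avg_cpu_pct: float        # trailing 90-day average CPU utilization
    peak_network_kbps: float  # trailing 90-day peak network throughput
    last_login_days: int      # days since last interactive login or deployment activity

def comatose_candidates(samples, cpu_max=2.0, net_max=50.0, idle_days=90):
    """Flag servers that look abandoned: negligible CPU, negligible traffic,
    and no owner activity for the observation window."""
    return [s.hostname for s in samples
            if s.avg_cpu_pct < cpu_max
            and s.peak_network_kbps < net_max
            and s.last_login_days >= idle_days]

fleet = [
    ServerSample("app-web-01", 18.4, 9_200.0, 1),
    ServerSample("legacy-batch-07", 0.6, 12.0, 240),
]
print(comatose_candidates(fleet))  # ['legacy-batch-07']
```

Any list produced this way is only a starting point; confirming that a flagged server truly has no business owner still requires walking the org chart.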

By decommissioning an abandoned server or eliminating the need to deploy a server through asset optimization, kilowatts of power and cooling are saved. This leads to savings in expense and carbon that are appreciated at the data center. If the organization is trying to avoid another data center build or purchase, the recovered kilowatts are all the more precious, because the savings are then counted as capital cost avoidance (tens of millions) rather than cost savings (tens of thousands).
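A hedged back-of-envelope calculation, with every input hypothetical, shows how the same recovered kilowatts can be valued either as an annual energy saving or as avoided capital:

```python
servers_decommissioned = 500
avg_draw_kw = 0.35            # hypothetical average draw per abandoned server
pue = 1.5                     # facility overhead multiplier
energy_price = 0.10           # USD per kWh, hypothetical
build_cost_per_kw = 12_000    # hypothetical cost to build new critical capacity, USD per kW

it_load_kw = servers_decommissioned * avg_draw_kw
annual_energy_savings = it_load_kw * pue * 8760 * energy_price   # 8760 hours per year
capacity_avoided_value = it_load_kw * build_cost_per_kw

print(f"Recovered IT load: {it_load_kw:.0f} kW")
print(f"Annual energy savings: ${annual_energy_savings:,.0f}")              # ~ $230,000
print(f"Capital build avoided (one-time): ${capacity_avoided_value:,.0f}")  # ~ $2,100,000
```

The numbers are invented, but the pattern holds: the capital-avoidance framing is typically an order of magnitude larger than the utility-bill framing, and it scales with the size of the build being deferred.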

Yet the decision to deploy or decommission a server is rarely made at the data
center street address, but rather in the line of business or IT function outside
of the purview of the data center. Understanding how to address the issue
of comatose hardware requires navigating the org chart to find owners and
executive sponsors to address underlying efficiency problems.

Server optimization and consolidation are only a portion of efficient IT opportunities but serve to illustrate how the big gains are reached by starting with the org chart, not the street address.

CONCLUSIONS
As Corporate Sustainability increasingly exercises influence on IT decision making,
the question becomes “How will sustainability affect technology adoption trends?”

Will cloud adoption accelerate as a result?

Enterprise IT’s reputation for low-utilization assets, comatose equipment, and inability to document resource consumption seems retrograde when compared to cloud providers, who are very ready to have the discussion—and provide facts and figures—about the costs, operational efficiencies, and waste reduction inherent to their service offerings.

The public cloud provides a good story for Corporate Sustainability in its
“reveal” of resources consumed.


Extant enterprise IT needs to move beyond emotionally driven efforts to stem or claw back cloud adoption. Instead, IT teams need to work collaboratively with their Corporate Sustainability counterparts to develop business-level KPIs demonstrating “best use of assets.”
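One possible shape for such a KPI, shown purely as a sketch and not as a published metric, is utilization-weighted capacity: the share of the energized IT footprint that is actually doing business work.

```python
def best_use_of_assets(servers):
    """servers: list of (provisioned_kw, avg_utilization_fraction) tuples.
    Returns the share of powered capacity that is productively utilized."""
    powered = sum(kw for kw, _ in servers)
    productive = sum(kw * util for kw, util in servers)
    return productive / powered if powered else 0.0

# Hypothetical fleet: one busy server, one lightly used, one comatose.
fleet = [(0.4, 0.55), (0.4, 0.08), (0.3, 0.0)]
print(f"Utilization-weighted asset use: {best_use_of_assets(fleet):.0%}")  # ~23%
```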

Outsourcing is no guarantee of efficiency or fiduciary responsibility—there is no better example than enterprises overprovisioning space or power in a colocation environment. Rather than focusing on the compute venue, IT needs to address the management behaviors that lead to waste.

To assess inherent best use of resources, Corporate Sustainability will have to gain insight and influence into the ways and means of IT provisioning and procurement. This will have to encompass both extant IT infrastructure, which is arguably obscured in terms of cost and utilization, and public cloud, which is believed to be the opposite. The latter will be easier and thus preferred, and that natural bias could be carried forward to more cloud adoption.

Cost and utilization insights into IT are not impossible, and they don’t require investment in a new suite of tools. Moreover, IT infrastructure management may want to demonstrate its value to the organization in ways that are resonant with, or even competitive with, public cloud.

Figure 6 lists questions that executives should ask extant IT leadership.

Figure 6. IT sustainability programs can be more effective if designed to address five key questions.

Corporate Sustainability’s mission and outreach will be the impetus for new forms
of control even in old systems. But we can’t be surprised if the inflexibility of existing
IT, when weighed against the perceived ease of public cloud, persuades Corporate
Sustainability to hasten the speed of the transition to off-premise solutions.

By Matt Stansberry, Senior Director of Content and Publications, Uptime Institute, and Julian Kudritzki, COO,
Uptime Institute

IT RESILIENCE DURING NATURAL DISASTERS
The most common cause of disruption to IT services during a
natural disaster is preventable


In late October 2012, Superstorm Sandy tore through the Caribbean and up the east coast of the U.S., killing more than 100 people, leaving millions without power, and causing billions of dollars in damage. In the aftermath of the storm, Uptime Institute surveyed data center operators to gather information on how Sandy affected data center and IT operations.

The survey focused primarily on the northeast corridor of the U.S. as the
greater New York City area took the brunt of the storm and suffered the most
devastating losses.

Uptime Institute examined how facilities fared during the storm, as well as the actions taken by data center owners and operators to ensure availability and safeguard critical equipment. Of all respondents across North America, approximately one-third said they were affected by the storm in some way. The results show that natural disasters can bring unexpected risks, but they also reveal that planning and preemptive measures can be applied in anticipation of a catastrophic event.

When sites lost their IT computing services, it was due largely either to critical infrastructure components being located below grade or to dependence on external resources for resupply of engine-generator fuel—both preventable outcomes.

Full preparedness for a natural disaster is not a simple proposition. However, being prepared with a robust infrastructure system, sufficient on-site fuel, available staff, and knowledge of past events will go a long way toward ensuring operational readiness.

IMPACTS
Almost all the respondents in the path of the storm went on engine generators at some point, with a few following industry best practice by transferring to engine generators before losing utility power. About three-quarters
of the respondents who turned to engine generators successfully rode out the
storm and the days after. The remainder turned to disaster recovery sites or
underwent an orderly shutdown. For all who turned to engine-generator power,
maintaining sufficient on-site fuel storage was a key to remaining operational.

Utility power outages were widespread due to high winds and flooding.
Notably, two sites that had two separate commercial power feeds were not
immune. One site lost one utility feed completely, and large voltage swings
rendered the other unstable. Thus, the additional infrastructure investment
was unusable in these circumstances.


Respondents reported engine-generator runtimes ranging from one hour to eight days. Facilities with engine generators, fuel storage, or fuel pumps in underground basements experienced problems. Flooding affected fuel storage, fuel pumps, or fuel piping distribution for 25% of respondents. Several operators remained on engine-generator power after utility power was restored due to risks of instability in the grid. Additionally, for some respondents, timely delivery of fuel to fill tanks was not available, as fuel shortages caused fuel vendors to prioritize hospitals and other life-safety facilities. In short, fuel delivery SLAs were unenforceable.
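A rough runtime estimate is a useful preparedness exercise. The sketch below uses hypothetical figures; actual burn rates depend on the generator model, the load it carries, and the manufacturer's fuel curve.

```python
usable_fuel_gal = 8_000          # on-site storage, net of unusable tank bottom (hypothetical)
generator_load_kw = 900          # expected critical load on the engine generator (hypothetical)
burn_rate_gal_per_kwh = 0.07     # hypothetical; consult the manufacturer's fuel consumption curve

hours_of_runtime = usable_fuel_gal / (generator_load_kw * burn_rate_gal_per_kwh)
print(f"Estimated runtime without refueling: {hours_of_runtime:.0f} hours "
      f"({hours_of_runtime / 24:.1f} days)")  # ~127 hours, or about 5.3 days
```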

WHAT WORKED? PREPARATION


Almost all the respondents reported that they topped off fuel or arranged additional fuel, and one-third made sleeping and food provisions for operators or vendors expected to be on extended shift. About one-quarter of respondents reported checking that business continuity and maintenance actions were up to date. Others ensured that roof and other drainage structures were clear and working. A handful obtained sandbags.

All respondents reviewed operational procedures with their teams to ensure a thorough understanding of standard and emergency procedures. Several reported that they brought in key vendors to work with their crews on site during the event, which they said proved helpful.

Some firms also had remote operations emergency response teams to relieve the local staff, but one respondent reported that blocked roadways and flight cancellations significantly delayed their arrival.

Multiple respondents said that conducting an in-depth review of emergency procedures in preparation for the storm resulted in the staff being better aware of how to respond to events. Preparations enabled all the operators to continue IT operations during the storm. For example, unexpected water leaks materialized, but precautions such as sandbags and tarps successfully safeguarded the IT gear.

Overwhelmingly, respondents said emergency preparations were valuable and enabled personnel to anticipate and prevent problems. Reviewing rehearsed procedures, load transfer testing to switch the electrical load from utility power to engine generators, and extra attention to fuel storage all provided benefits.

WHAT FAILED? FUEL SUPPLY AND STORAGE


This storm showed that even a backup plan for fuel delivery is no guarantee that fuel will be available to replenish stock. In some cases, fuel supplier power outages, fuel shortages, or closed roadways prevented deliveries. In other cases, however, companies that thought they were on a priority list learned that hospitals, fire stations, and police had even higher priority.


In addition, fuel suppliers had their own issues remaining operational. Due to
the widespread outages, some refineries shut down and some suppliers could
not dispense fuel. There were also problems with phone systems, making it
difficult for suppliers to run their businesses and communicate with customers.

Also, respondents reported lessons about the location of fuel tanks. Fuel
storage tanks or fuel pumps should not be located below-grade. As indicated,
many areas below grade flooded due to either storm water or pipe damage.
Respondents experiencing this problem plan to move pumps or other engine-
generator support equipment to higher locations.

No one expected that water infiltration would have such an impact. Wind
speeds were so extreme that building envelopes were not watertight, with
water entering buildings through roofing and entryways.

A few respondents saw a need to move critical facilities away from an area
susceptible to a hurricane or flood. While some respondents plan to increase
the resiliency of their site infrastructure, they are also evaluating extending
the use of existing facilities in other locations to pick up the computing needs
during the emergency response period.

CONCLUSIONS
In order to maintain functionality through a region-wide disaster, it is important
for executives and infrastructure staff to identify the risks and mitigate them.

Solutions include the following:

• Thoroughly review site location and elevation of critical components, including fuel storage and fuel pumps, for flooding potential.

• Perform regular testing and maintenance of the infrastructure systems, in particular switching power from utility to engine generator.

• Ensure sufficient duration of engine-generator fuel stored on site.

• Maintain up-to-date disaster recovery, business continuity, and IT load shedding plans. Brief stakeholders on these plans regularly to ensure confidence and common understanding.

In a major disaster, unforeseen issues can arise. The goal is to reduce potential
impacts as much as possible.

By Uptime Institute Senior Technical Staff

A HOLISTIC APPROACH TO VENDOR SELECTION
FOR CLOUD AND COLOCATION
As companies rely more on colocation, cloud, and other
off-premise computing models, enterprise IT needs to improve
how it selects and manages vendors


According to Uptime Institute’s sixth annual Data Center Industry Survey, the majority of IT departments maintain a mix of assets across their own data centers, colocation partners, and cloud platforms.

The percentage of IT processed at in-house sites has remained steady at around 70%, but the data point to a major shift to colocation and cloud for new workloads in the coming years.

Half of senior IT execs expect the majority of their IT workloads to reside off-premise in the future. Of those, 70% expect that shift to happen by 2020.

It is hard to predict what percentage will go to public cloud, but a significant portion
of those workloads will be shifting to colocation providers—companies that provide
data center facilities and varying levels of operations management and support.

Many colocation suppliers have been growing rapidly in recent years. Survey
respondents listed the following as top drivers for colocation adoption:

• Reduce churn of noncritical workloads into critical space
• Mergers/Acquisitions activity
• Disaster recovery site on a different power grid
• Executive directive to divest owned data center infrastructure
• Global expansion
• Avoid large capital expenses of new site build
• Not core business
• Lack of confidence in staff/resources

Yet, many decisions to deploy IT assets in colocation or cloud computing environments are made without a holistic view of the financial, risk, performance, or other impacts of that decision.

Survey results show executives are not confident in their ability to evaluate
deployment alternatives due to the following challenges:

• Incomplete data when evaluating internal assets, such as data center capital costs that aren’t included in TCO calculations for IT projects, or lack of insight into personnel costs associated with providing internal IT services.


• Lack of insight into cloud computing security, pricing models, and reliability data.

• Lack of credible cloud computing case studies.

• Inconsistency in reporting structures across geographies and divisions and between internal resources and colocation providers.

• Difficulty articulating business value for criteria not tied to a specific cost metric, like redundancy or service quality.

• Difficulty connecting IT metrics to business performance metrics.

• Challenge of capacity planning for IT requirements forecast beyond six months due to evolving architecture/application strategy and shifting vendor roadmaps.

• Difficulty collecting information across the various stakeholders, from application development to corporate real estate.

INTRODUCING FORCSS
Uptime Institute developed a system called FORCSS to enable enterprise IT to identify, weigh, and communicate the advantages and risks of IT application deployment options using consistent and relevant criteria based on business drivers and influences.

FORCSS helps the enterprise to overcome this challenge by focusing on the critical selection factors, thereby reducing or eliminating unfounded assumptions and organizational “blind spots.” FORCSS establishes a consistent and repeatable set of evaluation criteria and a structure to communicate the informed decision to stakeholders.

A coherent IT deployment strategy is often difficult to achieve because the staff responsible for IT assets and IT services are themselves spread across multiple geographies and operating units. The result can be a range of operating goals, modes, and needs that are virtually impossible to incorporate into a single, unified deployment strategy. And when a single strategy is developed from the “top down,” the staff responsible for implementing that strategy often struggle to adapt it to their operational requirements and environments.

FORCSS was developed to provide organizations with the flexibility to respond to varying organizational needs while maintaining a consistent overall strategic approach to IT deployments. FORCSS represents a process a) to apply consistent selection criteria to specific deployment options, and b) to translate the outcome of the key criteria into a concise structure that can be presented to “non-IT” executive management.

Figure 1. FORCSS helps organizations evaluate IT alternatives by examining 18 relevant inputs. Six FORCSS factors, 18 inputs: Net Revenue Impact; Comparative Cost of Ownership; Cash and Funding Commitment; Time to Value; Scalable Capacity; Business Leverage and Synergy; Cost of Downtime vs. Availability; Acceptable Security Assessment; Supplier Flexibility; Government Mandates; Corporate Policies; Compliance & Certifications to Industry Standards; Carbon and Water Impact; Efficient IT Certification; Sustainability Metrics & Documentation; Application Availability; Application Performance; End-User Satisfaction.

The FORCSS system is composed of six necessary and sufficient selection factors supported by three underlying inputs per factor. These six factors, or criteria, provide a holistic evaluation system and drive a succinct decision exercise that avoids analytical paralysis (see Figure 1).

And, by scaling the importance of the criteria within the system, FORCSS
allows each organization to align the decision process to organizational needs
and business drivers.

FORCSS DEFINITIONS

Financial: The fiscal consequences associated with deployment alternatives.

• Net Revenue Impact: What is the net revenue impact or value of the IT deployment to the business?

• Comparative Cost of Ownership: The identified differential cost of deploying the alternative. A detailed accounting analysis is not necessary to procure services; what matters is the ability to effectively identify and communicate the financial consequences of each alternative.

• Cash and Funding Commitment: Representation of liquidity—cash necessary at appropriate intervals for the projected duration of the business service.


Opportunity: A deployment alternative’s ability to fulfill compute capacity demand over time.

• Time to Value: The time period from decision to IT service availability. The timeline must include the deployment schedules of the IT, facilities, network, and service provider departments.

• Scalable Capacity: Available capacity for expansion of a given deployment alternative.

• Business Leverage and Synergy: Significant ancillary benefits of a deployment alternative outside of the specific application or business service. For example: improved economies of scale and pricing for other applications, or a geographic location that provides business benefits beyond the scope of a single application.

Risk: A deployment alternative’s potential for negative business impacts.

• Cost of Downtime vs. Availability: Estimated cost of an IT service outage vs. forecasted availability of the deployment alternative.

• Acceptable Security Assessment: Internal security staff evaluation of the deployment alternative’s physical and data security.

• Supplier Flexibility: Potential “lock-ins” from a technical or contractual standpoint.

Compliance: Verification, internal and/or third-party, of a deployment alternative’s compliance with regulatory, industry, or other relevant criteria.

• Government Mandates: Legally mandated reporting obligations associated with the application or business service. For example: HIPAA, Sarbanes-Oxley, PCI-DSS.

• Corporate Policies: Internal reporting requirements associated with the application or business service. For example: Data protection and privacy, ethical procurement, Corporate Social Responsibility.

• Compliance & Certifications to Industry Standards: Current or recurring validations achieved by the site or service provider, beyond internal and governmental regulations. For example: SAS 70, SSAE 16, Uptime Institute Tier Certification or M&O Stamp of Approval, ISO.


Sustainability: Environmental consequences of a deployment alternative.

• Carbon and Water Impact: Carbon and water impacts for a given site or service.

• Efficient IT Certifications: Current or recurring validations achieved by the site or service provider of sustainable IT operations practices. For example: Uptime Institute Efficient IT Stamp of Approval.

• Sustainability Metrics: Transparency and accountability demonstrated by meaningful IT efficiency metrics including a documented energy management plan and IT asset utilization program.

Service Quality: A deployment alternative’s capability to meet end-user performance requirements.

• Application Availability: Computing environment uptime at the application or operating system level.

• Application Performance: Evaluation of an application’s functional response and acceptable speeds at the end-user level.

• End-User Satisfaction: Stakeholder response that an application or deployment alternative addresses end-user functional needs. For example: End-user preference for graphical user interfaces or operating/management systems tied to a specific deployment alternative.

Many organizations already perform due diligence that would include most of this process. But this system was validated by thought leaders in the enterprise IT industry to ensure its usefulness to those who inform senior-level decision makers.

Uptime Institute acknowledges that there are overlaps and dependencies across all six factors. But, in order to provide a succinct, sufficient process to inform C-level decision makers, categories must be finite and separate to avoid analysis paralysis. The purpose of FORCSS is to identify the business requirements of the IT service and pragmatically evaluate the capabilities of potential deployment options as defined. The Uptime Institute FORCSS system provides a set of common criteria agreed upon by an elite group of data center owners and operators from around the world.


THE FORCSS INDEX


The FORCSS Index is the executive summary presentation tool of a FORCSS
process. Once the organization has gathered the inputs in the FORCSS process,
the Index provides a graphical means to compare the deployment alternatives
and the relative impacts of each alternative on each factor (see Figure 2).

In conversations with senior management, the Index will also facilitate discussions of the weighting of each FORCSS Factor in the ultimate decision.

The indicators may be placed in relative positions (high, medium, low) to reflect the advantages or the exposures within any given factor.

The FORCSS Index effectively compares multiple alternatives at the application or physical layer(s). Organizations executing a FORCSS analysis can populate multiple indexes to compare a range of deployment alternatives.

Certain data inputs may be weighted more heavily than others (positively or negatively)
in determining the indicator position for a factor. These special considerations are
defined as Key Determinants and are specifically labeled in the FORCSS Index output.

Several data inputs may be used to determine the indicator position for one factor,
and one data input may affect the placement of indicators in several factors.

The FORCSS Index is designed as a means of relative comparison of any number of alternatives. Although it would seem that the logical extension of this approach would be to assign numerical scores to each data input for each factor, during FORCSS development numerical scoring was found to add unnecessary complexity that can obscure the key determinants. Scoring can also mislead, as the score assigned to one factor can numerically erase the score assigned to another (prohibitive) factor, thereby defeating one of the major benefits of the FORCSS process.
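A minimal sketch of how the Index's relative indicators might be recorded for presentation, consistent with the qualitative high/medium/low approach described above; the alternatives, indicator positions, and key-determinant flags are invented for illustration and are not Uptime Institute output.

```python
FACTORS = ["Financial", "Opportunity", "Risk", "Compliance", "Sustainability", "Service Quality"]
LEVELS = ("low", "medium", "high")

# Hypothetical indicator positions for three deployment alternatives.
# A trailing "*" marks a Key Determinant identified during the evaluation.
index = {
    "1. Refurbishment": ["medium", "low",    "medium", "high",   "medium", "medium"],
    "2. Build":         ["low*",   "medium", "high",   "high",   "high",   "high"],
    "3. Colocation":    ["high",   "high*",  "medium", "medium", "medium", "high"],
}

# Basic sanity check that every indicator uses one of the allowed qualitative levels.
for positions in index.values():
    assert all(p.rstrip("*") in LEVELS for p in positions)

# Print a simple side-by-side Index for an executive summary.
print(f"{'Alternative':<18}" + "".join(f"{f:<17}" for f in FACTORS))
for alternative, positions in index.items():
    print(f"{alternative:<18}" + "".join(f"{p:<17}" for p in positions))
```

Note that the sketch deliberately avoids numeric scoring, in keeping with the finding above that scores can hide prohibitive factors.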

Figure 2. The graphical output of a FORCSS evaluation suggests the strengths and weaknesses of three alternative IT plans: 1. Refurbishment, 2. Build, and 3. Colocation.


For a full explanation of how to apply FORCSS to a colocation decision, see “How
FORCSS Works: Case Study 1” in The Uptime Institute Journal (Vol. 3, page 92 or
journal.uptimeinstitute.com).

INSIGHT FROM A FORCSS LEADER:


An executive at a large financial organization in North America was an early adopter of FORCSS and has used the system for deployment decisions in his organization. The excerpt below describes his experience using FORCSS in his own words:

“Our organization has historically tried to self-provision first. But, the doors have opened up. The IT departments aren’t holding onto hardware anymore, and the shorter timelines are having a huge impact on how we respond. You have to pick projects that you can do better, and you have to be ready to let go of things you can’t do fast enough. Most builds will take longer than buying the service. An IT organization isn’t linear anymore. There are multiple stakeholders who have different influences and different impacts on decisions. If you’re not thinking 3 to 5 years out, an enterprise organization won’t be able to respond to the business demand.

“Quantifying the opportunity is a difficult, but important, aspect of the FORCSS process. One of the biggest considerations in this decision was the business synergy [providing business continuity infrastructure for an adjacent back-office function], documented in the Opportunity section. You have to be well connected across your organization or you will miss Opportunity.

“Vetting a multi-tenant data center provider required due diligence. We attended site tours. We had them provide single-lines [one-line electrical diagrams] of how the infrastructure would look. We got as close to apples-to-apples as we could get, down to the cabinet layout of the room. We had three detailed meetings where my engineering and operations teams sat down with the colocation fulfillment team.

“One of the biggest risks of an outsourcer is not about the immediate contract,
but about how you deal with change going forward. How do you handle change,
like a new business opportunity, that isn’t in the contracts? How do you deal
with non-linear growth? We never got to the point of pulling the trigger, but
had a frank discussion with our board about risk associated with outsourcing
and they were comfortable with the alternative.

“The Board ultimately funded the Build option. I believe that our FORCSS
process was successful with decision makers due to the thoroughness of our
preparation. For today’s enterprise, speed is key: speed of decision making;
speed of deployment. In this environment, the decision-making methodology
must be credible and consistent and timely. We adopted FORCSS because it
was thorough, independent, and industry accepted.”


RFPs and SLAs for Colocation


Despite increasing adoption of cloud and colocation, the outsourcing model is not a
panacea. According to 2016 Survey Data:

• 40% of enterprise respondents are paying more for colocation contracts than
they had initially planned or expected.

• Nearly one-third of respondents had experienced an outage at a colocation vendor site.

• Over 60% of respondents said the penalty clause in their Service Level
Agreement (SLA) would not adequately offset the cost of that outage to the
business.

Enterprise IT organizations pay a premium for a third party to deliver data center capacity
and should hold service providers to higher standards than their own organization. There is
significant room for improvement in vetting, negotiating, and managing those relationships.

Companies need to become much savvier about defining their requirements. The following
recommendations can help enterprise IT organizations to improve vendor selection and
management:

• The Site: Availability is at the forefront of all colocation discussions. Ask for
industry certifications and documentation. The vast majority of major suppliers
claim to build to Uptime Institute’s Tier III standard. Which ones can provide
verification that the site was built to that specification? If your IT workloads
are critical to your business, can you afford to take their word for it? Look
for trusted third party validation: Uptime Institute Tier Certification Constructed
Facility or Management and Operations Stamp of Approval can shorten the due
diligence cycle.

• The Operations: The biggest risk of an IT outage comes from operations failures. If you want to see how a data center really runs, ask to review the last 5 years of incident reports. Demand to inspect maintenance records. Ask to see commissioning reports. Negotiate increased control and transparency with your provider to ensure operational excellence.

• The Business: A data center that has been operated by the same team with the same vendors and clients for several years will likely be very stable. But if that provider is bought by another company, or is conducting a consolidation project, changing operations programs, or installing new equipment, you will have a higher risk of failure.


Ask about the current occupancy rate. If you are the first tenant in a shared space, every other person coming in is an opportunity for your equipment to be de-energized or for a technician to make a mistake. Ask about turnover of staff and average tenure with the company. Turnover can be a red flag. Is the equipment infrastructure at the end of its lifespan? If so, the provider is likely planning upgrades that you should be aware of before signing.

Many problems with colocation providers can be avoided by setting more effective terms up front
and by writing better RFPs (request for proposal) and SLAs (service level agreement).

To that end, Uptime Institute conducted a panel with its user group, the Uptime Institute
Network, including participants from large enterprise organizations and colocation vendors,
to come up with the following recommendations for writing better RFPs and SLAs:

Don’t use the RFP for due diligence: 20-page RFPs are expensive and time-consuming to write and evaluate. Don’t waste time with a huge document. Focus on your business requirements and must-haves in a short 2-3 page RFP.

Customers are not a substitute for due diligence: Large companies renting space in the facility are not an indication of whether the site will meet your business requirements. You have no way to find out whether their workloads are business critical.

Avoid overprovisioning: Common mistakes include relying on IT equipment faceplate data to calculate power draw requirements and underestimating the impact a hardware refresh could have with increasingly efficient equipment.
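The gap between faceplate ratings and measured draw is usually large. The sketch below uses hypothetical values to show how far the two sizing estimates can diverge when contracting colocation capacity:

```python
cabinets = 20
servers_per_cabinet = 30
faceplate_watts = 750        # nameplate rating per server (hypothetical)
measured_watts = 280         # measured average draw per server (hypothetical)

total_servers = cabinets * servers_per_cabinet
faceplate_kw = total_servers * faceplate_watts / 1000
measured_kw = total_servers * measured_watts / 1000

print(f"Contracted on faceplate data: {faceplate_kw:.0f} kW")   # 450 kW
print(f"Likely actual draw:           {measured_kw:.0f} kW")    # 168 kW
print(f"Overprovisioned capacity:     {faceplate_kw - measured_kw:.0f} kW paid for but possibly never used")
```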

Worst-case scenario: In the instance of multiple outages, do not focus on increasing financial penalties. SLA penalties will not cover your cost but will in fact make your contract worth less and less to your provider and will likely drive down the service levels you receive. Rather, structure your SLA so that multiple outages trigger an exit: negotiate your move-out costs and free rent while you find a new space.
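A back-of-envelope comparison with hypothetical figures illustrates why SLA credits rarely offset the business impact of an outage, and why the exit terms matter more than the penalty schedule:

```python
monthly_colo_fee = 40_000            # USD, hypothetical contract value
outage_hours = 4
cost_of_downtime_per_hour = 150_000  # USD, from the business's own impact analysis (hypothetical)

sla_credit = 0.10 * monthly_colo_fee            # e.g., a 10% monthly service credit for the breach
business_impact = outage_hours * cost_of_downtime_per_hour

print(f"SLA credit recovered:  ${sla_credit:,.0f}")       # $4,000
print(f"Business impact:       ${business_impact:,.0f}")  # $600,000
print(f"Credit covers {sla_credit / business_impact:.1%} of the loss")
```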

As companies increasingly rely on colocation and other off-premise computing models, enterprise IT and data center staff will need to develop the planning skills, expertise, and coordination to play an important governance role in their organizations going forward.

By Matt Stansberry, Senior Director of Content and Publications, Uptime Institute, and Julian Kudritzki, COO,
Uptime Institute
