
NETWORKING INTELLIGENCE

Reality Check On Five-Nines


from the May 2002 issue of Business Communications Review, pp. 22-27
by Gary Audin, president of Delphi, Inc., an independent consulting and training firm. He has extensive
experience in the planning, design, implementation and operation of all kinds of networks, and he is the
instructor for a number of BCR seminars.
While everyone in telecom has been using the term "five-nines" for decades, I'm willing to bet
that few actually know what 99.999 percent really means, and even fewer know if it's really
necessary. Of course, we all know that five-nines relates to reliability, and expectations are
high: dialtone is expected at all times. Way back when AT&T's #1ESS switch was being
developed, the development team's goal was to have less than one day of outage in 40 years.
Obviously no one kept one in service long enough to find out if that target was actually met,
but the 1Es cranked along pretty predictably.
Availability vs. Reliability
But when we talk about "five-nines," we're talking more about availability than reliability,
although the latter is integral to the former. Availability is a function of two basic factors: Mean
Time Between Failures (MTBF) and Mean Time To Repair (MTTR). Both are usually measured in
hours. Availability is described by the following equation:

Availability = MTBF / (MTBF + MTTR)

It turns out, however, that MTTR is really the mean time to
restore rather than to repair, and it includes the following five activities (note, as discussed
below, the first four items can be significantly reduced or eliminated by redundant components
and automatic reconfiguration):
Failure Detection.
Failure Notification.
Vendor/User Response.
Repair/Replacement.
Recovery/Restart/Reboot.
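The availability equation above is easy to express in code. The sketch below is illustrative (the function name and example numbers are mine, not from the article); it simply shows how sensitive the result is to MTTR.

```python
# Availability as a function of MTBF and MTTR, both in hours,
# per the equation above: Availability = MTBF / (MTBF + MTTR).
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative numbers: with a 4-hour restore time, reaching
# five-nines requires an MTBF of about 400,000 hours (~45 years).
print(f"{availability(400_000, 4):.6%}")  # -> 99.999000%
```

Note how the restore time dominates: halving MTTR buys as much availability as doubling MTBF.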
MTBF provides a measure of a system's reliability. But over the course of a system's life, the
metric doesn't necessarily tell you everything you need to know; your system could have
99-percent availability and still suffer a disaster (one huge outage) or hundreds of short
outages. But while the metrics are indifferent to the impact of an outage (or outages), they
still provide a useful function: they give us a frame of reference (Table 1).
TABLE 1: Translating the Metrics

Availability   Downtime Per Year (365¼ days x 24 hours)
99.9999%       32 seconds
99.999%        5 minutes, 15 seconds
99.99%         52 minutes, 36 seconds
99.95%         4 hours, 23 minutes
99.9%          8 hours, 46 minutes
99.5%          1 day, 19 hours, 48 minutes
99%            3 days, 15 hours, 40 minutes
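The figures in Table 1 fall out of simple arithmetic against the 525,960 minutes in a 365¼-day year. A minimal sketch (the function name is mine):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, per Table 1

def downtime_per_year(avail: float) -> float:
    """Minutes of downtime per year at a given availability."""
    return (1 - avail) * MINUTES_PER_YEAR

# Reproduce Table 1's rows:
for a in (0.999999, 0.99999, 0.9999, 0.9995, 0.999, 0.995, 0.99):
    print(f"{a:.4%}: {downtime_per_year(a):,.1f} minutes down per year")
```

Five-nines, for example, works out to 5.26 minutes, i.e., the 5 minutes, 15 seconds shown in the table.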

Myth vs. Reality


For years, the legacy PBX vendors not only created systems that delivered five-nines, they also
performed on-site surveys to ensure that the rest of the telephone infrastructure did not reduce
reliability. In general, customers got what they paid for: The PBX almost always delivered five-
nines, but the performance of the phones, cabling and power was another matter. That does
not mean that these components are unreliable, just that their performance is not included in
the calculation of reliability.
In short, the metrics for five-nines include performance for some, but not all, system elements
and components. They include the following:
1. Hardware components.

2. Power supplies.
3. Switch matrix.
4. Any other hardware component that can cause a total failure.
It does not include:
1. Shut-down of the operating system.
2. Loss of electrical power.
3. Network loss.
4. Downtime for application software upgrades and fixes (can be 1-3 hours per month).
5. Preventive maintenance (hours per month).
6. The fact that some call servers must be shut down when line cards, trunk cards or gateways are
installed.
7. Complete server shutdown to install operating system changes or new releases.
When those nonincluded factors are tabulated, the downtime can grow to one to three days a
year, in addition to the time for the failures that are included in the definition. Assuming 48
hours (2,880 minutes) downtime a year, then:

Availability = (525,960 - 2,880) / 525,960 = 99.45%

Moreover, "reliability" is a moving target. For
example, the reliability of any new device can only be predicted, not measured. While new
products may go through extensive lab testing, it takes about two years of field experience
before reliability numbers can be considered dependable. The first system delivered is the least
reliable. Reliability improves after field experience, and after fixes are installed and faulty
components are replaced.
The reliability numbers that the vendors publish are more accurate if the hardware is not
changed or modified, except for repairs. Each time a customer modifies something, or
whenever vendor personnel get into the system, there is a chance of a mistake. In the old IBM
mainframe days, the NETGEN (updating of software for network reconfiguration) was scheduled
and limited in scope to reduce these mistakes.
Reliability also decreases with the age of the components. That hasn't been a problem for data
network devices, particularly routers, because they're rarely in service long enough for age to
become a factor. Traditional PBXs, on the other hand, often remain in service for 10+ years
(Figure 1).
For purposes of this discussion, a legacy PBX has
five critical elements, which are assembled in a series of interconnected cabinets:
Central Processor.
Switch Matrix.
Line Cards.
Trunk Cards.
Power Supply(s).
A PBX failure occurs when the central processor, the switch matrix or all of the power supplies fail.
The failure of some of the line or trunk cards or a backup component does not constitute a PBX
failure.
By contrast, IP-PBXs (client/server) have some different elements (Figure 2). The server is the
central processor, the LAN and LAN switch are the switch matrix, while gateways and routers
behave like line and trunk cards. Power supplies are power supplies; available electric backup
power systems operate equally, independent of the type of PBX they support.
The five-nines goal can be
achieved with hardware and power, but software and networks are more problematic. Any PBX
(legacy, IP-enabled or pure "converged") uses an underlying network to connect the components
and cabinets. But whenever a call server is remotely located, network reliability becomes an
issue. Having gateways and IP phones access two call servers in separate locations improves
reliability, especially during major power failures or a disaster at one of the server sites.
But, this improved reliability is reduced by the insertion of a network with less than 99.999-
percent availability. The end result: The benefits of server distribution may be negated,
sometimes completely, by the poorer network reliability.
The vendors' availability predictions for their hardware are probably accurate, but you need to
make sure that you understand their assumptions. For example, are their calculations based on
having all components operate in parallel (primary and backup) or sequentially? The difference
is important.
When components, systems or circuits operate in parallel with no switchover time, availability
increases significantly, because the probability that both parallel elements will fail at the same
time is extremely low. The availability of combined parallel components is greater than either
component by itself, and a parallel design of hardware components can deliver 99.999 percent.
But this configuration is expensive, not easy to do and it does not resolve power loss, software
failures or network failures.
By contrast, in a sequence of components, systems or circuits, the overall availability is poorer
than the availability of the worst component; availability degrades as more components are
added to the chain. In short, the chain is weaker than its weakest link. You can calculate a
system's availability using the process shown in Figure 3.
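The parallel-versus-series arithmetic described above can be sketched in a few lines of Python (the function names are mine; the parallel case assumes instant switchover, as the text does):

```python
def series(*avails: float) -> float:
    """Chained components: all must work, so availabilities multiply
    and the result is worse than any single component."""
    out = 1.0
    for a in avails:
        out *= a
    return out

def parallel(a1: float, a2: float) -> float:
    """Redundant pair with instant switchover: the combination fails
    only if both components are down at the same time."""
    return 1 - (1 - a1) * (1 - a2)

# Two components, each 99.9% available:
print(f"series:   {series(0.999, 0.999):.5%}")    # worse than either alone
print(f"parallel: {parallel(0.999, 0.999):.5%}")  # better than either alone
```

Two 99.9-percent components in series deliver about 99.8 percent; the same pair in parallel delivers 99.9999 percent, which is why redundancy is the usual path to five-nines.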

It's The Software, Stupid


If the underlying/internal
network is one huge vulnerability, software is another. PBX software can be divided into two
parts: operating system and applications. There are other software modules-utilities, diagnostic
tools, etc.-but these are not critical to the operating reliability of a PBX.
Predicting software reliability is a guess; there is no standard for prediction, nor are there
formulas like those used for predicting hardware reliability. One key issue in software availability
is the reboot restoration time multiplied by the number of software outage recurrences. Cisco
uses an average reboot time of six minutes, based on field experience with IOS software running
on its routers; that figure is fine, assuming an outage occurs only once each year.
There are a variety of choices for IP-PBX operating systems, and while Windows (NT or 2000)
was the early favorite, it is being supplanted by Unix, VxWorks, Linux and proprietary systems.
Mitel, for example, started with Windows, but last September announced that it would move to
Unix; others will follow. The reason for the change is simple: Windows is not industrial strength,
nor was it designed for the hardware interfaces that occur in the telephony world.
Unix is a favorite because of its maturity, stability and wide usage, while VxWorks is a real-time
operating system with Unix-like interfaces, commonly found in manufacturing and assembly-line
environments. Linux, an open environment, has gained favor with recent announcements by Avaya.
During a debate at the VoiceCon2002 conference, Avaya's Steve Markman described how Linux
would enable much more rapid turnarounds for fixes (several days vs. months) than would
Windows. But it also needs to be pointed out that while many in the industry expect that Linux
will do the job, several companies that were created to support Linux have had to reduce staff
or shut down.
The discussion of operating system reliability has overshadowed even more important
software: the application software, the place where all PBX features and functions reside.
IP-enabled PBXs have a strong case here; they port existing mature, field-proven, stable software
to the new environment. In contrast, the pure converged IP-based PBXs may contain a mix of
existing software-operating system and applications-plus new add-ons that deliver the IP
capabilities, and new server-system software.
But no matter what system is chosen, the "good old days," when major software releases came
out every year or two, are disappearing. Today, some vendors seem to have a "release-a-
month" philosophy, as they seek to catch up with the feature lists on traditional PBXs. The new
releases certainly have value, but there are risks: Did the vendor do sufficient testing? What
mistakes/errors will occur by whoever actually installs the update? This rapid updating
eventually will slow down, but in the meantime it produces a less reliable environment.
Moreover, as noted above, there is no standard for software reliability prediction. Counting lines
of code and looking at quality-assurance methods can be used to predict the probability of
some kind of failure, and if a program is left alone, it will eventually become more reliable
through field experience and software fixes.
But customers can't take anything for granted: Challenge the reliability numbers for software
provided by vendors, and find out the extent to which they're based on optimistic guesses or
blind faith. And don't let yourself get distracted from a fundamental reality: Given the frequency
of software releases and attendant problems, it's likely to be another few years before we know
what to expect.
Power Trips
No PBX can be more reliable than the power supplied. And since no power supply operates at
five-nines, some alternate power is required; many traditional PBXs have a 5- to 20-minute UPS so
that there can be a graceful shutdown of service.
How much downtime occurs with various UPS/generator configurations is shown in Table 2.
TABLE 2: UPS And Downtime (Minutes Down Per Year)

                         Raw AC (No UPS)  5-minute UPS  60-minute UPS  UPS w/ generator
Instant Restart          113 minutes      100 minutes   10 minutes     1 minute
Auto Reboot (6 minutes)  203 minutes      112 minutes   10 minutes     1 minute

U.S. Power Reliability:
Average number of outages per year for IT departments: 15
90% of the outages are less than 5 minutes.
99% of the outages are less than 60 minutes.
Overall IT power availability: 99.98%

Source: American Power Conversion Technote #26

The bottom line: The longer the backup power lasts, the less downtime per year; it takes more
than 1 hour of backup power to meet 99.999-percent availability. Note also that the downtime
in Table 2 is the result of power loss plus the software-reboot time (per above, an average of 6
minutes). Therefore:
Raw AC with a 6-minute reboot = 99.96% availability
But note that result assumes that all loss of availability is due to power failure, and that there
is no loss of availability due to hardware, software or network failure. Moreover, to meet the
five-nines level, there can be only one power failure of a few seconds' duration, combined with
only one software reboot per year.
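The raw-AC arithmetic can be reproduced from the APC statistics in Table 2 (15 outages per year, 113 minutes of outage, a 6-minute reboot after each). The helper below is my illustration, not a vendor formula:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960

def power_availability(outage_minutes: float, reboots: int = 0,
                       reboot_minutes: float = 6.0) -> float:
    """Availability given annual power-outage minutes plus the reboot
    time after each outage (6-minute average reboot, per the text)."""
    downtime = outage_minutes + reboots * reboot_minutes
    return 1 - downtime / MINUTES_PER_YEAR

# Raw AC, auto reboot: 113 outage minutes + 15 reboots x 6 minutes
# = 203 minutes down per year, Table 2's figure.
print(f"{power_availability(113, reboots=15):.3%}")  # roughly 99.96 percent
```

Running the same helper with the 60-minute-UPS figure (10 minutes down) gives about 99.998 percent, which is why more than an hour of backup power is needed to reach five-nines.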
Nortel recommends an 8-hour battery UPS, but the amount of downtime you can tolerate will
depend on the business, organization and location. For example, the Department of Defense
has a goal of 8 hours of service via battery backup. They have power generators that can
switch in to replace the battery UPS in seconds, and they may test those generators daily to
make sure they're ready.
Hospitals also need long-term support for the entire facility and all of its normal power users
(lights, etc.). Scott Silliman, director of communications, St. Johns Hospital in Springfield, IL,
installed battery backup on the hospital's servers and routers to avoid the several-minute reboot
time for the server and router software. The PBX has generator backup with an 8-second
startup time for the PBX network and the rest of the hospital, and the generators are tested
every two weeks.
Can VOIP PBXs Meet The Challenge?
There are three approaches to providing IP-PBXs:
IP-enabled PBXs are legacy PBXs, equipped with IP adapters for line and trunk cards. These
offer the reliability and field experience that comes with a mature product. If your existing PBX
delivered five-nines, the addition of an IP line card and trunk line should not reduce the
availability. Its hardware and software are known quantities. The power availability depends on
the UPS and generator investment, not the PBX design. There is no network connecting the
pieces together to reduce availability.
Converged PBXs have both circuit- and packet-switching processors with analog/digital and IP
line cards. The reliability of a converged PBX depends on its design. If there are two processors
and two switch matrices, one circuit and one packet switch, the availability of the system will
rival that of a redundant configuration. If one node fails, the other can still operate. If the
circuit-switch portion is built upon proven technology, it will probably deliver five-nines, as will
the packet-switch hardware.
If, however, both processors and switch matrices are new, then the hardware availability metric
will be a prediction, and software availability will also only be an estimate. The power
availability is the same as a legacy PBX, and no network is involved unless some of the line or
trunk cards are remotely located in gateways.
The underlying network, at best, will deliver 99.9 percent, the figure Sprint's website quotes for
the carrier's frame-relay service level agreement. This means remote devices will deliver less
than 99.9 percent availability.
Client/server IP-PBXs are all-packet-switched systems; they come with IP phones but can
also support legacy interfaces. Since these are new systems, metrics for availability can only be
estimated. The hardware can probably deliver five-nines, if the configuration is redundant
(parallel primary/backup devices). As for software, it's hard to know without more field
experience. If there are frequent software releases, reliability, at least in the short term, will not
be great. The power reliability, on the other hand, is the same as all other forms of PBXs.
The client/server version supports remote gateways (IP line cards), and needs a network in
between the gateway and server. This reduces the reliability and therefore the availability of the
PBX features and functions. Dial tone may be provided locally (i.e., in the gateway), but this is
of little or no value if the server is inaccessible. Distributed control (servers) can be an
advantage: with two or more servers to back up each other, one site can fail and the remote
site can take over. But as noted above, the underlying network may not be able to deliver more
than 99.9 percent.
So, can IP-PBXs deliver five-nines? There's no single answer, but given the breakdown of
system components discussed above, the availability of IP-PBXs looks to be something like this:
Hardware: 99.999%
Software: 99.5% (this is really a guess)
Network: 99.9%
Power: 99.98 %
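These four figures are in series (each element must work for the system to work), so the end-to-end result is their product. A quick check, using the values from the list above:

```python
# Component availabilities from the estimates above; the software
# figure is, as the text says, really a guess.
components = {
    "hardware": 0.99999,
    "software": 0.995,
    "network":  0.999,
    "power":    0.9998,
}

total = 1.0
for a in components.values():
    total *= a  # series combination: all components must be up

print(f"end-to-end availability: {total:.4%}")  # about 99.38 percent
```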
Multiplying them together produces a total availability of 99.38 percent (.9938). Is that enough?
That's for you to decide. But, whatever you do, you want to ensure that your system provides
the highest-level availability possible, given your investment. Here's a checklist to follow:
Have your vendor demonstrate how your hardware configuration meets your availability
requirements. Do not let the vendor give you some general model for the hardware.
Get MTBF and MTTR figures. You want a short MTTR (in minutes, not hours). How is the short MTTR
achieved? Redundant components? Fast hardware swap? Automatic switchover?
Discuss your electrical power needs with the local power company, UPS supplier and generator
supplier to determine what is or can be realistically delivered. Exercise the backup power at least
every two weeks to ensure proper operation.
Have the PBX vendor demonstrate the method used for determining software reliability. Separate
the demonstration into two parts: operating system and application software. Are the reliability
figures a calculated prediction, field experience or a guess? Find out how the vendor tests new
software for reliability. Was the software testing functional only or was it stressed, for example,
loaded with traffic?
Focus on new software releases. Do you really need to install it? Can installed software be easily
removed? Can software modules be suspended (isolated) so that the PBX can continue operating
when there is a release problem?
Check the service level agreement in your network contract, and verify that the stated availability
is being delivered. What is the network restoration time? Is the local access line (loop) as reliable
as needed? What is the backup procedure and MTTR for the local loop? These become important
with distributed call processing servers.
A good tutorial on this topic is "The Change Costs of System Availability," from Enabling
Technologies Group, Inc. It also discusses the difference between high-availability (HA) and
continuous availability (CA) systems, and the attendant costs and risks.
Are Five-Nines Really Necessary?
While there's no question that high availability is essential in a voice networking system, when
you come right down to it, five-nines may not be a realistic or even necessary goal. An office
that is in operation 12 hours per day, 5 days a week and 52 weeks per year would require its
PBX to be in use 187,200 minutes a year out of a possible 525,960 minutes. That equates to
35.6 percent of the full year.
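The arithmetic behind that fraction takes only a couple of lines:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60      # 525,960 minutes

# 12 hours/day x 5 days/week x 52 weeks/year, in minutes:
office_minutes = 12 * 60 * 5 * 52        # 187,200 minutes

print(f"{office_minutes / MINUTES_PER_YEAR:.1%} of the full year")
```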
If any changes, fixes or failures occur outside this time period, none of the users would ever
know about them or be affected, provided the problem is fixed before the next business day
resumes. The fix might be a repair, a reboot or an automatic reconfiguration, and as we all
know, most changes and fixes are made during off-hours for that very reason-to keep everyone
from being affected.
So, the metric of five-nines, in and of itself, seems like an unnecessary goal. My personal
opinion: It's nice to have, but you may not need it.
Reliability Prediction
It takes two years for a new product to generate accurate, performance-based MTBF and MTTR
measurements. Therefore, methods have been created to develop reliability prediction models,
and the two most popular techniques are MIL-HDBK 217 and the Telcordia prediction models.
There are also the mechanical models NSWC-94/L07, CNET 93 and HRD5.
MIL-HDBK 217: The original standard for reliability, it was designed by the military but is also
used by commercial organizations. The latest version-Revision F Notice 2-was released in
February 1995. It provides mathematical models for reliability prediction for a huge range of
electronic devices, from phones to space vehicles to satellites, and defines two analysis
techniques: parts count and parts stress.
Parts count analysis is often used in early product design, when detailed information is not
available or when only a rough estimate is required. Parts stress analysis provides a more
accurate estimate, by taking into account more detailed information about the components that
make up the product.
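A parts-count prediction reduces to summing, over each part type, the quantity times a tabulated base failure rate. The sketch below shows the shape of the calculation only; the part types and failure rates are invented for illustration, not taken from the MIL-HDBK-217 tables:

```python
# Parts-count reliability prediction (sketch). Failure rates are
# illustrative placeholders, expressed in failures per million hours.
parts = [
    # (quantity, failures per 1e6 hours)
    (200, 0.05),   # e.g., resistors (hypothetical rate)
    (50,  0.10),   # e.g., capacitors (hypothetical rate)
    (10,  0.50),   # e.g., ICs (hypothetical rate)
]

# System failure rate is the sum of quantity x base rate:
lambda_total = sum(qty * rate for qty, rate in parts)  # failures / 1e6 h
mtbf_hours = 1e6 / lambda_total

print(f"predicted MTBF: {mtbf_hours:,.0f} hours")  # -> 50,000 hours here
```

Parts stress analysis refines this by adjusting each base rate for temperature, electrical stress and environment, which is why it needs the detailed component information mentioned above.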
Telcordia Issue 1: The Telcordia reliability prediction model, developed by Bell Labs, is the
successor to Bellcore Issue 6 and was released in May 2000.
It uses modified equations from MIL HDBK 217 to better reflect what telephone equipment
experiences in the field. Parts count and parts stress analysis are supported, but they are called
Calculation Methods. There are 10 Telcordia Calculation Methods, each of which is designed to
take into consideration different information.
In comparing the two, a large number of factors need to be considered, but since the Telcordia
prediction model was designed for the commercial telecommunications industry, it is the better
method to use for a PBX or IP-based phone system.
