Introduction to Realtimepublishers
by Don Jones, Series Editor
For several years now, Realtime has produced dozens and dozens of high-quality books that just
happen to be delivered in electronic format—at no cost to you, the reader. We’ve made this
unique publishing model work through the generous support and cooperation of our sponsors,
who agree to bear each book’s production expenses for the benefit of our readers.
Although we’ve always offered our publications to you for free, don’t think for a moment that
quality is anything less than our top priority. My job is to make sure that our books are as good
as—and in most cases better than—any printed book that would cost you $40 or more. Our
electronic publishing model offers several advantages over printed books: You receive chapters
literally as fast as our authors produce them (hence the “realtime” aspect of our model), and we
can update chapters to reflect the latest changes in technology.
I want to point out that our books are by no means paid advertisements or white papers. We’re an
independent publishing company, and an important aspect of my job is to make sure that our
authors are free to voice their expertise and opinions without reservation or restriction. We
maintain complete editorial control of our publications, and I’m proud that we’ve produced so
many quality books over the past years.
I want to extend an invitation to visit us at http://nexus.realtimepublishers.com, especially if
you’ve received this publication from a friend or colleague. We have a wide variety of additional
books on a range of topics, and you’re sure to find something that’s of interest to you—and it
won’t cost you a thing. We hope you’ll continue to come to Realtime for your educational needs
far into the future.
Until then, enjoy.
Don Jones
Table of Contents
Introduction to Realtimepublishers.................................................................................................. i
Chapter 1: The State of Systems Management ................................................................................1
Overview..........................................................................................................................................1
Goals of Systems Management........................................................................................................2
Business Alignment .............................................................................................................3
Coherent Business Strategy .....................................................................................3
Multiple Business Objectives ..................................................................................4
Dynamic Requirements............................................................................................4
Technical Integrity ...............................................................................................................5
Malfunctioning Applications ...................................................................................6
Malicious Software ..................................................................................................7
System Configuration Vulnerabilities......................................................................8
Improperly Managed Access Controls...................................................................11
System Availability............................................................................................................12
Compliance ........................................................................................................................13
Spectrum of Systems Management Practices ................................................................................15
Ad Hoc Systems Management...........................................................................................15
Ad Hoc Systems Management in “Practice” .........................................................15
Effects of Ad Hoc Systems Management ..............................................................16
Controlled Systems Management ......................................................................................17
Continuous Improvement...................................................................................................19
Rationalizing Systems Management: SOM ...................................................................................21
Elements of SOM...............................................................................................................21
Unified Management Framework ..........................................................................22
Modular Services ...................................................................................................22
Open Architecture..................................................................................................22
Benefits of Service-Oriented Systems Management .........................................................23
Summary ........................................................................................................................................24
Chapter 2: Core Processes in Systems Management .....................................................................25
Aligning Business Objective and IT Operations ...........................................................................25
Ad Hoc Growth of IT Infrastructure..................................................................................26
Managing IT to the Big Picture .........................................................................................27
Planning and Risk Management in IT ...........................................................................................27
Basics of IT Planning.........................................................................................................27
Planning Technical Architecture............................................................................28
Organizational Structure ........................................................................................29
Budget and Staff Management...............................................................................30
Communications ....................................................................................................30
Risk Management in IT .....................................................................................................31
Prioritizing Business Objectives ............................................................................31
Assessing Risks and Impacts .................................................................................31
Mitigating Risks.....................................................................................................32
Business Continuity .......................................................................................................................33
Maintaining Security and Ensuring Compliance ...........................................................................33
Regulations and Compliance .............................................................................................34
Privacy and Confidentiality ...................................................................................34
Information Integrity..............................................................................................35
Information Security ..........................................................................................................36
Threat Assessment .................................................................................................36
Vulnerability Management ....................................................................................37
Change Control ......................................................................................................38
Auditing for Security and Systems Management ..............................................................40
System Events........................................................................................................40
Application-Level Auditing ...................................................................................41
User Auditing.........................................................................................................41
Incident Response ..............................................................................................................41
Capacity Planning and Asset Management....................................................................................42
Capacity Planning ..............................................................................................................42
Asset Management.............................................................................................................42
Acquiring Assets....................................................................................................43
Deploying and Configuring Assets........................................................................43
Maintaining and Retiring Assets............................................................................44
Service Delivery.............................................................................................................................44
Service Level Management................................................................................................45
Financial Management of IT Services ...............................................................................45
Capacity and Availability Management.............................................................................45
Workload Management....................................................................................................134
Application Sizing and Modeling ....................................................................................134
Availability and Continuity Management....................................................................................135
Availability and SLAs......................................................................................................135
Continuity Management...................................................................................................135
Summary ......................................................................................................................................136
Chapter 7: Implementing Systems Management Services, Part 3: Managing Applications and
Assets ...........................................................................................................................................137
Application Life Cycles ...............................................................................................................137
Business Justification.......................................................................................................139
Requirements Phase .........................................................................................................140
Functional Requirements .....................................................................................140
Security Requirements .........................................................................................141
Integration Requirements.....................................................................................142
Non-Functional Requirements .............................................................................143
Analysis and Design ........................................................................................................145
Solution Frameworks ...........................................................................................146
Buy vs. Build .......................................................................................................149
Detailed Design....................................................................................................150
Development ....................................................................................................................151
Source Code Management ...................................................................................151
System Builds ......................................................................................................151
Regression Testing...............................................................................................152
Software Testing ..............................................................................................................153
Software Deployment ......................................................................................................154
Software Maintenance .....................................................................................................155
Role of Application Development Life Cycle in Systems Management .........................155
Managing Application Dependencies ..........................................................................................156
Data Dependencies...........................................................................................................156
Time Dependencies..........................................................................................................157
Software Dependencies....................................................................................................157
Hardware Dependencies ..................................................................................................157
Application Asset Management...................................................................................................158
Acquiring Assets..............................................................................................................158
Copyright Statement
© 2007 Realtimepublishers.com, Inc. All rights reserved. This site contains materials that
have been created, developed, or commissioned by, and published with the permission
of, Realtimepublishers.com, Inc. (the “Materials”) and this site and any such Materials are
protected by international copyright and trademark laws.
THE MATERIALS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE,
TITLE AND NON-INFRINGEMENT. The Materials are subject to change without notice
and do not represent a commitment on the part of Realtimepublishers.com, Inc. or its web
site sponsors. In no event shall Realtimepublishers.com, Inc. or its web site sponsors be
held liable for technical or editorial errors or omissions contained in the Materials,
including without limitation, for any direct, indirect, incidental, special, exemplary or
consequential damages whatsoever resulting from the use of any information contained
in the Materials.
The Materials (including but not limited to the text, images, audio, and/or video) may not
be copied, reproduced, republished, uploaded, posted, transmitted, or distributed in any
way, in whole or in part, except that one copy may be downloaded for your personal, non-
commercial use on a single computer. In connection with such use, you may not modify
or obscure any copyright or other proprietary notice.
The Materials may contain trademarks, service marks and logos that are the property of
third parties. You are not permitted to use these trademarks, service marks or logos
without prior written consent of such third parties.
Realtimepublishers.com and the Realtimepublishers logo are registered in the US Patent
& Trademark Office. All other product or service names are the property of their
respective owners.
If you have any questions about these terms, or if you would like information about
licensing materials from Realtimepublishers.com, please contact us via e-mail at
info@realtimepublishers.com.
Chapter 1
[Editor's Note: This eBook was downloaded from Realtime Nexus—The Digital Library. All
leading technology guides from Realtimepublishers can be found at
http://nexus.realtimepublishers.com.]
Overview
The book consists of twelve chapters that begin with background on systems management
practices, then describe SOM in terms of several well-known frameworks for systems
management and related areas, and finally move on to a detailed discussion of how to
implement SOM. Specifically, the chapters address the following:
• Chapter 1 discusses the goals of systems management, typical implementation styles, and
the need for a rationalized process, such as SOM.
• Chapter 2 describes essential parts of systems management, including aligning with
business objectives, managing assets, delivering services, and maintaining compliance.
• Chapter 3 discusses SOM in terms of well-known frameworks such as ITIL, COBIT, and
ISO-17799.
• Chapter 4 describes the infrastructure required to implement a rational, efficient systems
management environment, including a configuration management database.
• Chapter 5 examines the elements of service support, such as incident, configuration, and
change management.
• Chapter 6 explores how to address financial issues, capacity planning, and availability
management issues in SOM.
• Chapter 7 discusses application life cycles, software asset management, and managing
hardware elements of IT infrastructure.
• Chapter 8 looks at systems management as a tool for supporting control objectives and
management guidelines that govern IT operations.
• Chapter 9 examines the role of systems management in threat assessments, vulnerability
management, incident response, and other aspects of security management.
• Chapter 10 describes the practice of risk management and shows how identifying risks,
prioritizing assets, and mitigating risks serve both risk management and systems
management objectives.
• Chapter 11 examines the business case for SOM with particular attention paid to the cost
of not adequately managing systems.
• Chapter 12 describes how to assess the current state of an organization’s systems
management practice, how to plan the transition to a SOM model, and how to implement
a mature systems management practice.
The responsibilities of systems administrators and IT managers are growing in complexity.
Supporting a growing number of systems with increasing dependencies between them, meeting
rising quality of service (QoS) expectations, and preparing for constant security threats are just
a few of the challenges systems managers face. Fortunately, as the
demands have expanded so too have the tools and practices for meeting those demands. The
purpose of this guide is to help managers and administrators apply these practices and tools to
their specific systems management challenges.
This chapter presents a high-level overview of the nature of systems management with a
discussion of three aspects of the discipline:
• The goals of systems management
• The spectrum of systems management practices
• Rationalizing systems management with SOM
Let’s begin with a fundamental issue for all IT operations—aligning with business objectives.
Business Alignment
The objective of business alignment is to ensure that the information processing needs of lines of
business (LOB) are met by IT applications and infrastructures. This sounds so logical that it
seems like common sense—and it is. Unfortunately, the constraints of real-world organizations
present significant challenges to realizing business alignment.
The challenge with business alignment is not in understanding the need for it or even convincing
business or IT personnel about its importance; the challenge is with the execution. Three
common problems encountered with business strategy execution are:
• Formulating a coherent business strategy
• Addressing multiple objectives
• Meeting dynamic requirements
In many cases, IT managers have to address one or more of these in the course of their systems
management work.
Dynamic Requirements
Businesses must constantly respond to changes in the marketplace. Some of these changes are
relatively slow, such as the move from strictly internal combustion-powered cars to hybrid cars;
some are more moderately paced, such as the shift from buying music on discs to buying it as
downloads; and still others are rapid, such as price spikes in the cost of petroleum products.
How effectively an organization can respond to these changes is dictated, in part, by the
organization’s ability to change its IT systems. Consider some of the roles IT applications play in
adapting to market changes:
• As partnerships are formed with other businesses, financial systems must be changed to
accommodate new compensation models
• Following mergers, IT infrastructures must be integrated to accommodate combined
business operations
• Downward price pressures drive process re-engineering and the adoption of greater
automation
• Outsourcing of operations may require changes to network infrastructure, security
policies and practices, and hardware configurations
In addition to these common business dynamics, there are the expected but unpredictable events,
such as natural disasters, that can disrupt operations and shift priorities.
There will always be a time delay between changing a business objective and making the
necessary changes to implement it in IT infrastructure and procedures. The length of the delay,
though, can have a qualitative impact on the ability to execute the new business operations. As
Figure 1.1 shows, IT responses may take so long that not long after they are implemented, new
changes are required. In the worst case, the modifications are not finished before the next round
of changes is defined.
Aligning IT operations with long-term business strategies and shorter-term objectives is a
process; it is not a static state that is ever reached. The goal of IT should be to minimize the time
it takes to align with business objectives, and well-developed systems management practices are
fundamental to reaching that goal.
Figure 1.1: Time delays between changes in business strategy and IT’s implementation of those changes slow
an organization’s ability to adapt to changing market conditions.
IT and business alignment is based on a number of assumptions, including the technical integrity
of the IT infrastructure.
Technical Integrity
Technical integrity is the quality of information systems that ensures data is accurate, reliable,
and not subject to malicious or accidental changes. Like business alignment, technical integrity is
one of those characteristics that is so logical we take it for granted. Systems administrators do
not take it for granted, though. Consider some of the challenges to maintaining technical
integrity:
• Malfunctioning applications
• Malicious software
• System configuration vulnerabilities
• Improperly managed access controls
Each of these categories of threats can create substantial disruption to IT operations.
Malfunctioning Applications
Let’s face it, software has bugs. Complex software is difficult to build and programmers know
this all too well; an old quip among programmers is that “if builders built buildings the way
programmers built programs, one woodpecker would destroy civilization.” Although this
statement is an obvious exaggeration, the sentiment reflects the frustration even software
developers have with the practice of programming.
Programmers and software engineers do not just decry the state of programming; much work has
been done to improve software development practices. See, for example, the software development
maturity models developed by the Software Engineering Institute at http://www.sei.cmu.edu/, spiral
development methodology at
http://ieeexplore.ieee.org/iel1/2/6/00000059.pdf?tp=&arnumber=59&isnumber=6&htry=2, and Agile
development methodology at http://zsiie.icis.pcz.pl/ksiazki/Agile%20Software%20Development.pdf.
Using software development best practices can increase the quality of software, but even with
these methodologies, the cost of tracking and eliminating all bugs would make any reasonably
complex program either too expensive or available too late to be of practical use. As a result,
users, systems managers, and developers have all learned to manage less-than-perfect
applications.
One practice used to manage software deficiencies is patch management. Software developers
release corrections to applications, known as patches, which are applied by systems
administrators or through an automated process to correct known errors in software. Patching is
not a trivial process and several factors should be considered when patching:
• How to implement user acceptance testing (UAT)
• How to roll back a patch if problems occur
• Whether any of the problems corrected by the patch will have an impact on operations
• How to distribute the patch to all systems that require it
• How to track the version and patch level of all applications
In addition to these basic considerations, a single patch can affect multiple applications. For example, a relational
database may be used to support two applications. One application does not function because of
a bug that can be patched but the patch breaks another function required by the second
application. Should the database administrator patch the system anyway? Keep the current
version and work around the bug? Install another instance of the database, patch one instance,
leave the other instance un-patched, then run the two applications on their respective instances of
the database? To find the right answer, systems administrators and database administrators have
to weigh the costs and benefits. Just as business objectives can create competing demands when
addressing IT and business alignment, patch management can leave systems administrators to
choose between equally undesirable options.
Malicious Software
Malicious software, commonly known as malware, includes software that ranges from annoying
to destructive. Some of the best known forms are:
• Viruses—Programs that replicate with the use of other programs or by user interaction
and carry code that performs malicious actions. Viruses consist of at least replication
code and a payload but may also include encryption or code-morphing routines to
improve their chances of avoiding detection.
• Worms—Similar to viruses in payload and obfuscation techniques, worms are
self-replicating and spread without the need for a host program or user interaction.
• Spyware—Software that installs on devices without user consent and collects private
information, such as usernames, passwords, and Web sites visited.
• Trojan horses—Programs that purport to do one thing but perform malicious activities.
For example, a Trojan horse might advertise itself as a program for synchronizing
computer clocks to atomic clocks, but it may deploy a remote control program that listens
on a specific chat room or Internet Relay Chat (IRC) channel for commands from the
malware developer.
• Keyloggers—Programs that intercept operating system (OS) messages sending keystroke
data to applications. The more sophisticated versions of these programs filter most
activity to focus on usernames, passwords, and other identifying information.
• Frame grabbers—Software that copies the contents of video buffers, which store data
about the contents displayed on the computer’s screen.
The list of malicious software categories is intimidating. To make matters worse, malware
writers seem locked in a constant cycle with anti-malware developers: each countermeasure
prompts new tricks from malware writers, which in turn prompt new countermeasures, and so
on. On the positive side, systems administrators are able to keep malware at bay with effective
anti-malware tools and devices.
The use of anti-malware software, such as desktop antivirus software, and network appliances,
such as content filters, can provide adequate protection for most needs. The need for these
countermeasures introduces additional responsibilities for systems administrators. In addition to
the desktop productivity applications, application servers, databases, routers, Web servers, and
all the other business-specific and supporting tools that must be managed in an IT environment,
systems administrators must now manage a large class of mission-critical security applications
and devices. Malicious software often takes advantage of poorly secured configurations.
Figure 1.2: The Microsoft Update service can determine necessary critical patches and install them on
Windows OSs.
For more thorough vulnerability scanning, the Microsoft Baseline Security Analyzer (MBSA)
can detect configuration vulnerabilities as well as missing patches, and MBSA allows systems
administrators to scan multiple systems in a single session.
Figure 1.3: The Microsoft Baseline Security Analyzer scans systems for vulnerabilities as well as missing
patches.
In other cases, systems administrators must keep abreast of critical patches by subscribing to
mailing lists or checking vendors’ support Web sites to find patches. Even when patches are
released, they should not be installed in production without first testing on a quality control
platform. Although installing a critical security patch for Windows can create unanticipated
problems, in most cases, the risks are far outweighed by the benefits. Nonetheless, systems
administrators should have a contingency plan in place for restoring the original configuration if
unanticipated problems occur.
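The contingency plan described above reduces to a simple pattern: snapshot the current configuration, apply the patch, verify, and roll back on failure. The following sketch illustrates only the pattern; the dictionary-based configuration and the verify callback are stand-ins for whatever a real platform actually provides:

```python
def apply_patch(config, patch, verify):
    """Apply a patch to a configuration, keeping a snapshot so the
    original state can be restored if post-patch verification fails."""
    snapshot = dict(config)      # contingency: save the original state
    config.update(patch)         # apply the patch
    if verify(config):           # post-patch acceptance check
        return True
    config.clear()               # verification failed: roll back
    config.update(snapshot)
    return False

# A patch that breaks verification is rolled back automatically.
cfg = {"service": "running", "build": 100}
ok = apply_patch(cfg, {"build": 101}, lambda c: c["build"] > 100)
rolled_back = not apply_patch(cfg, {"service": "stopped"},
                              lambda c: c["service"] == "running")
```

In practice, the snapshot and restore steps would be a system image, database backup, or package-level rollback rather than an in-memory copy, but the apply-verify-restore sequence is the same.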
Two useful security-oriented mailing lists are NTBugtraq (http://www.ntbugtraq.com/) and Secunia
(http://secunia.com/).
It should be noted that even with security analysis tools, one type of malware—rootkits—is
particularly difficult to control once it has compromised a system. Rootkits hide their presence
and activities by changing registry settings or other system parameters, hiding files, erasing log
entries, and other techniques. Once installed, it is difficult to guarantee that rootkits have been
removed without performing a full reinstall of system software.
Even with updated patches, securely configured servers and desktops, and malware
countermeasures, systems managers must contend with yet another threat: inappropriate access
controls.
System Availability
Disasters happen. Some are natural, such as hurricanes, and some are technical, such as the SQL
Slammer worm that effectively shut down large segments of the Internet in 2003. In both cases,
businesses and organizations lose some degree of access to their systems. The practice of
business continuity planning has evolved to address these kinds of disasters as well as other less
dramatic events that can nonetheless have an impact on systems availability.
Systems administrators play a key role in continuity planning because of their knowledge of how
systems are organized, the dependencies between systems, and the processes that operate on those
systems. As IT infrastructure becomes more complex, business continuity planning becomes more
difficult. Information about the state of IT systems is needed to adequately prioritize services (for
example, in the event of a service disruption, restore payroll systems and then customer service
systems, leaving other production systems for later) and ensure that all necessary systems and
processes are accounted for. A centralized and up-to-date database with information on the IT
infrastructure is required for cost-effective business continuity planning and execution. Figure
1.4 shows an example of how IT assets can be centrally managed in relation to other assets and
organizational structure.
Figure 1.4: Centralized information about IT assets is a key enabler of effective business continuity planning.
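The prioritized, dependency-aware restore ordering described above can be sketched in a few lines of Python; the CMDB layout, system names, and priorities here are invented for illustration, not taken from any real configuration management product.

```python
# Sketch: ordering system restores by business priority, using a hypothetical
# CMDB represented as a dict (names and priorities are illustrative).
cmdb = {
    "payroll":          {"priority": 1, "depends_on": ["database"]},
    "customer_service": {"priority": 2, "depends_on": ["database"]},
    "reporting":        {"priority": 3, "depends_on": ["database"]},
    "database":         {"priority": 1, "depends_on": []},
}

def restore_order(cmdb):
    """Return systems in restore order: dependencies first, then by priority."""
    ordered, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in cmdb[name]["depends_on"]:
            visit(dep)  # restore what this system needs before the system itself
        ordered.append(name)
    for name in sorted(cmdb, key=lambda n: cmdb[n]["priority"]):
        visit(name)
    return ordered
```

With the sample data, the shared database is restored first, then payroll, then the lower-priority systems, which is exactly the ordering decision the text describes.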
For more information about the structure and function of configuration management databases, see
Chapter 4.
Compliance
The term compliance is getting a lot of press these days, perhaps to the point where we’ve
stopped paying attention to it, but doing so would be a mistake. The importance of maintaining
the privacy and accuracy of information is becoming more broadly recognized. Some of the best
known regulations make that clear:
• The Health Insurance Portability and Accountability Act (HIPAA) defines categories of
“protected healthcare information” and strict rules governing how that information is
gathered, stored, used, and shared. The act also defines stiff penalties for violating these
rules.
• The Sarbanes-Oxley Act, enacted in the wake of Enron, WorldCom, and similar
corporate scandals, raises the bar on ensuring accuracy in corporate reporting. CEOs and
CFOs now have to sign off on the accuracy of the information or face penalties.
• California's Senate Bill 1386 (SB 1386) was passed in response to fears of the growing threat
of identity theft. Under this law, if identifying information of a California resident is
stolen or released in an unauthorized manner, the resident must be notified of the
disclosure.
A host of other IT-related regulations have been enacted by governments around the world. In
addition, non-governmental or quasi-governmental bodies have adopted standards and
frameworks related to financial reporting and security best practices. Table 1.1 lists some
relevant but less well-known regulations and frameworks that apply to particular countries or
industries.
These and similar regulations are placing new demands on IT managers and systems
administrators not only to comply with these regulations but also to demonstrate that they are in
compliance. As with business continuity planning, compliance requires a centralized
management view of all information assets to meet these demands efficiently.
The goals of systems management range from maintaining technical integrity and system
availability to achieving compliance with regulations and aligning with the strategic plans of the
business. These are demanding goals and reaching them is not guaranteed, especially if systems
management practices are not sufficient for the task.
• A department decides that the reporting from the financial system is insufficient for their
needs and installs a database and reporting tool on a high-end desktop computer running
in their office. One of the staff in the department just read a book on data marts and
decides to implement one. The department uses an extraction, transformation, and load
tool that came with the database to pull data from the financial system every night. This
task puts additional load on the financial system at the same time it runs close-of-day
batch jobs and delays the generation of morning financial reports. The reports generated
from the data mart use different calculations, so performance measures from the financial
system do not agree with those from the data mart.
Although these examples are fictitious, the consequences described will probably sound familiar
to many IT professionals. A lack of central planning, uncoordinated decision making, and the
willingness to make changes to IT infrastructure to meet an immediate need without concern for
the ripple effects on the rest of the organization are the hallmarks of ad hoc systems management
practices. The consequences are predictable.
Lack of Compliance
Auditors would quickly point out that a lack of well-defined policies and procedures leaves the
company potentially in violation of regulations governing information integrity (for example, the
Sarbanes-Oxley Act) and privacy (such as HIPAA).
Poor Security
Security suffers because of poor management practices. Malicious software, information theft,
and other threats are a constant problem for systems administrators and IT managers. At the very
least, basic information security management requires:
• Comprehensive inventory of hardware and software in use on a network
• Configuration details on all servers, desktops, mobile devices, and network hardware
• Detailed information on users and access controls protecting assets
• The ability to audit and monitor system and network activity and to identify anomalous
events
• The ability to deploy patches and critical updates rapidly to all vulnerable devices
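As a rough sketch of why the comprehensive inventory comes first in this list, the following compares devices observed on the network against a recorded inventory; the hostnames are invented for illustration:

```python
# Sketch: comparing devices observed on the network against the recorded
# inventory; hosts seen but not inventoried are blind spots for security
# management (hostnames are invented for illustration).
inventory = {"web01", "db01", "hr-lap3"}
observed_on_network = {"web01", "db01", "hr-lap3", "unknown-device-7"}

def uninventoried(observed, inventory):
    """Devices on the network that no inventory record accounts for."""
    return sorted(observed - inventory)
```

A device that appears in the observed set but not in the inventory cannot be configured, audited, or patched through any of the other mechanisms on the list.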
Clearly, the lack of centralized management information and ineffectual or poorly implemented
procedures that characterize ad hoc management undermine even the most basic security
requirements.
Lack of management controls, poor use of resources, lack of compliance, and the potential of
security threats should be motivation enough to move beyond ad hoc management to a well-
defined and centrally controlled management model.
For more information about ISACA’s COBIT, see http://www.isaca.org/cobit/, and ITIL at
http://www.itil.co.uk/. These topics are also covered in more detail in Chapter 3.
Figure 1.5: Controlled systems management depends on well-defined policies and procedures that address
each of the key services provided in an IT environment.
Continuous Improvement
Much has been written in the popular business press about quality and improvement. These
topics do not garner the attention they once did, but the principles of quality improvement and
innovation are still relevant, even in systems management. Perhaps the best-known and most
well-established approach to realizing continuous improvement is Six Sigma, a data-centric
quality approach; another practice, Management by Fact (MBF), also emphasizes the importance
of managing by using measurements of performance.
Once an IT group has implemented controlled systems management practices, the group will
have information about assets, business use of those assets, and the changes those assets
undergo. In essence, the organization will have procedures for effectively managing assets as
well as data about the performance of those assets and related operations. With this information,
IT managers and systems administrators can find ways to improve operations.
For example, by measuring the time from identifying the need for a new application to deploying
the solution, as well as key milestones in between, the organization can better understand the
average time to deploy, common bottlenecks in the process, and shared characteristics of failed
efforts. These and other key performance indicators (KPIs) form the foundation for measuring
improvement.
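As an illustration, the deployment-time KPI described above might be computed from milestone records like these (the projects and dates are invented):

```python
from datetime import date

# Sketch: computing deployment-time KPIs from per-project milestone records.
projects = [
    {"identified": date(2024, 1, 5), "approved": date(2024, 1, 20), "deployed": date(2024, 3, 1)},
    {"identified": date(2024, 2, 1), "approved": date(2024, 3, 15), "deployed": date(2024, 4, 1)},
]

def avg_days(projects, start, end):
    """Average number of days between two milestones across projects."""
    spans = [(p[end] - p[start]).days for p in projects]
    return sum(spans) / len(spans)
```

Comparing the average identified-to-approved span with the approved-to-deployed span is one simple way to locate the bottleneck the text mentions.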
Another example relates to security. If a security breach occurs, a well-managed environment
will have audit trails and logs to help diagnose the breach as well as recovery procedures for
getting operations back online and data restored to its correct state. Configuration management
information, along with audit trails and logs, can help identify both specific and general
vulnerabilities in the current environment, which can be addressed to prevent future breaches of
the same sort.
Exceptionally well-run IT operations are not the result of one or two geniuses formulating a
perfect solution; they are instead the product of disciplined policies and procedures tightly linked
to business objectives. Like the business objectives themselves, the policies and procedures are
not static but are subject to innovation. As the well-respected management researcher and writer
Peter Drucker noted,
The purposeful innovation resulting from analysis, system and hard work is all that can
be discussed and presented as the practice of innovation. But this is all that need be
presented since it surely covers at least 90 percent of all effective innovations. And the
extraordinary performer in innovation, as in every other area, will be effective only if
grounded in the discipline and master of it (Source: Peter Drucker, “Principles of
Innovation,” in The Essential Drucker, Harper Business, 2001).
The discipline of systems management can be mastered and the practice of systems management
can be adapted and improved to meet the specific needs of different organizations.
The spectrum of systems management ranges from the reactive, uncontrolled ad hoc approach,
through a controlled, procedure-guided method to an adaptive model built on well-defined
controls that use performance measures to improve operations. Although there are many ways to
organize systems management operations, one of the most promising for the complex
heterogeneous IT environments of today is the SOM model.
Elements of SOM
The domains of systems management can be viewed as services provided to users, applications,
and the organization as a whole. These services are managed within an umbrella framework that
is both modular and open. Some of the most important are:
• Service level management
• Financial management for IT services
• Capacity management
• Change management
• Availability management
• IT service continuity management
• Application management
• Software and hardware asset management
At first glance, these domains seem unrelated—such as financial management and change
management—but they are all required for effective systems management and therefore must be
included in any framework that purports to support the full breadth of demands in systems
management. The details of these domains are beyond the scope of this chapter; instead, this
chapter will examine the defining characteristics of a service-oriented architecture:
• Unified management framework
• Modular services
• Open architecture
The details of how these services are managed are addressed in Chapters 4 through 8.
Modular Services
It is important to treat domains within systems management as distinct areas with their own set
of requirements. For example, change management requires information about the state of
software and hardware configurations throughout the enterprise. Before a port is closed on a
firewall, a network administrator needs to know whether an application is using that port. This
type of information is not required to manage the financial aspects of systems management and
should be isolated from financial functions. At the same time, however, some of the
configuration information has a definite impact on financial matters. For example, knowing the
number and versions of OSs running within the organization is essential to managing licensing
costs.
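The firewall example above can be sketched as a simple lookup against a configuration store; the port-to-application mapping is hypothetical:

```python
# Sketch: consulting a hypothetical configuration store before closing a
# firewall port (the port-to-application mapping is illustrative; ports
# absent from the store are treated as unused here).
port_usage = {
    443: ["intranet-web"],
    8443: ["hr-portal", "expense-app"],
    5432: [],
}

def safe_to_close(port_usage, port):
    """A port is safe to close only if no recorded application depends on it."""
    return len(port_usage.get(port, [])) == 0
```

The same store answers the financial question too: a separate query over OS versions feeds license-cost management without the financial module needing the port details.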
In addition to isolating information complexity, a modularized approach to systems management
enables the framework to incorporate new services and management models as needed. A
midsized company might not need capacity planning services initially, but as the company grows
and the complexity of the IT infrastructure increases, manual methods for capacity planning may
no longer be efficient or sufficient. For this reason, it is critical that a SOM framework be open.
Open Architecture
An open architecture is one that uses common, well-known protocols that are not proprietary to
any one vendor or organization. In the world of systems management, an open architecture lends
itself to incorporating multiple modules from a single vendor as well as leveraging services
available from third parties. For example, a router vendor may provide data on router
performance through the Simple Network Management Protocol (SNMP), which is collected in
the centralized configuration management database, then integrated with other data collected
from other network devices used in management reports generated by the systems management
reporting module. By combining the benefits of a unified management framework, modular
services, and an open architecture, organizations can realize the benefits of service-oriented
systems management.
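Leaving the actual SNMP collection aside (which would typically be done with a dedicated library or agent), the aggregation step described here reduces to folding several per-vendor feeds into one central store; the device names and counters below are invented:

```python
# Sketch: merging performance records from several device feeds into one
# central store, standing in for the SNMP-fed configuration management
# database the text describes (device names and counters are invented).
def merge_feeds(central, *feeds):
    """Fold per-vendor device readings into a single store keyed by device."""
    for feed in feeds:
        for device, metrics in feed.items():
            central.setdefault(device, {}).update(metrics)
    return central

router_feed = {"rtr-01": {"if_util_pct": 42}}
switch_feed = {"sw-07": {"if_util_pct": 13}, "rtr-01": {"cpu_pct": 9}}
```

Because each feed only has to produce device-keyed records, a module from any vendor can contribute, which is the point of the open architecture.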
Figure 1.6: Service oriented systems management enables organizations to meet both operational and
strategic objectives.
Summary
Any organization with IT systems practices some form of systems management; how well it
does so varies. The goal of systems management ultimately is to meet the strategic objectives of
the organization, which include aligning with business operations, preserving the integrity of
systems and information, and adapting to the changing needs of users. Systems management is a
broad discipline with many domains; some of the domains are similar, some are less so.
Underlying the entire practice, though, is a common set of information, processes, and
procedures that are best managed as a unified whole. At the same time, the complexity of
systems management requires a modularized approach to enable cost-effective and manageable
solutions.
SOM builds on the best practices of systems management, information security, governance, and
related areas. The remaining chapters of this guide will describe in detail the elements of these
best practices, the tools needed to implement the best practices, and the organizational direction
and policies needed to realize the benefits of SOM.
Chapter 2
Throughout this guide the words “business” and “organization” are both used to describe enterprises
that implement systems management practices. Even when the word business is used, the
discussion can equally apply to government departments, agencies, and non-profit organizations.
Basics of IT Planning
Once an IT department understands the objectives of the enterprise and has aligned the strategic
plan of IT with those of lines of business, the planning phase can begin. The planning process
entails a number of areas, including:
• Technical architecture
• Organizational structure
• Budget and staff management
• Communications
Of these, the technical architecture is the one most often addressed in IT planning.
Perhaps one of the most complex data reference models is the U.S. Federal Enterprise Architecture
Data Reference Model designed for cross-agency information sharing and analysis. For more details,
see http://xml.coverpages.org/ni2005-12-28-a.html.
The other part of planning technical architecture focuses on the systems that manipulate
enterprise data.
The OASIS organization coordinates a large number of XML standards in a wide range of areas, from
e-government and financial services to printing and plumbing. For more information see
http://www.oasis-open.org/home/index.php.
Organizational Structure
Planning around organizational structure is about answering questions related to who is
responsible for parts of IT infrastructure and services. Common assignments include:
• Help desk support
• Network management
• Server and storage management
• Training
• Security and compliance
• Application and database administration
• Auditing
One goal of organizational structure planning is to ensure that all critical functions are identified
and clearly assigned to a business unit. This does not necessarily mean there is a department or
group within IT dedicated solely to a single task, but that all tasks are covered. For example,
auditing may be assigned to the same group as security and compliance, while Help desk support
and training are managed by the same staff.
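The coverage goal described above can be sketched as a simple check that every critical function maps to some responsible group; the function list and assignments are illustrative only:

```python
# Sketch: verifying that every critical IT function is assigned to a
# responsible group (the assignment map and function list are invented).
assignments = {
    "help desk": "support-team",
    "training":  "support-team",
    "auditing":  "security-team",
    "security":  "security-team",
    "network":   "infrastructure-team",
}
critical_functions = ["help desk", "network", "training",
                      "security", "auditing", "backups"]

def unassigned(critical_functions, assignments):
    """Critical functions with no responsible group are coverage gaps."""
    return [f for f in critical_functions if f not in assignments]
```

Note that several functions sharing one group is fine, as the text says; only a function mapped to no group at all is flagged.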
Another goal of organizational planning is operational efficiency. For example, it is far more cost
effective if a single group evaluates anti-malware systems and selects the applications best suited
to the organization than if every department purchases its own antivirus software. Similarly,
allowing disparate lines of business to install different database systems will increase
development and support costs as well as introduce application dependencies that can drive up
costs long after the initial purchase.
Clearly demarcating lines of authority and responsibility is essential for efficient and effective IT
resource management. With an overall organizational structure in place, the next step is to
address budgeting and staffing.
Communications
Communication across lines of business and operational units is sometimes difficult. Each part
of the organization has its own priorities, and they are not always in sync. What is important to
one department is a marginal issue to another. At the same time, vertical communication up and
down the organizational structure is an important aspect of keeping IT operations aligned with
business objectives. By formally planning and implementing a communication plan, IT systems
managers can keep executives informed of the status of operations and projects and keep lines of
business apprised of service changes, development backlogs, and dependencies on systems
that can impact their performance.
Communications across the organization must include more than technical details, project plans,
and delivery schedules. Understanding and planning for risks is a major factor in IT planning.
Risk Management in IT
Risk management is the process of identifying and assessing potential loss to an organization.
This process includes three main steps:
• Prioritizing business objectives
• Assessing risks and impact
• Mitigating risks
Together, these provide the means to identify risks as well as options for dealing with them.
Operator errors are less likely to cause the loss of physical assets but more likely to result in the
loss of information. For example, an operator might accidentally overwrite a backup tape that
contains necessary data, or a data entry clerk might accidentally delete records from a transaction
processing system. In the first case, the information may be permanently lost or recoverable from
other backup tapes. In the case of a data entry error, the lost data might be recovered from
database redo logs if caught before changes are committed or from backups in other cases. The
recovery methods range from quick and inexpensive to slow and costly procedures.
Hardware and software failures as well as security breaches can range from annoyances to
significant disruptions. When assessing the impact of these types of failures, one should address
both the direct consequences—for example, an order entry system is down—as well as
dependencies, such as the data warehouse cannot be updated and management reports cannot be
generated because of the delay in getting operational data. Clearly, the range of impacts is broad;
mitigation strategies should be selected based on that range.
Mitigating Risks
Risk mitigation is a balancing act. Formally speaking, a risk mitigation strategy should not cost
more than the value of the resource at risk multiplied by the probability that the loss will occur.
Unfortunately, quantifiable measures are available for only a small set of risks. For example,
hardware manufacturers can cite mean time between failures (MTBF) statistics for a device, but
there are no good statistics on the mean time between significant bugs in an ERP system, the
likelihood of a Denial of Service (DoS) attack, or the chances that an operator will accidentally
corrupt a backup script that then fails to execute the backups properly. Often, risk mitigation
strategies are based on best guesses and past experience.
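The cost ceiling described above is essentially an annualized loss expectancy calculation. A minimal sketch, with invented asset values and probabilities:

```python
# Sketch: the expected-loss ceiling on mitigation spending described in the
# text (asset values and probabilities below are invented for illustration).
def expected_loss(asset_value, annual_probability):
    """Annualized loss expectancy: value of the asset times loss probability."""
    return asset_value * annual_probability

def worth_mitigating(mitigation_cost, asset_value, annual_probability):
    """Mitigation is formally justified if it costs less than the expected loss."""
    return mitigation_cost < expected_loss(asset_value, annual_probability)
```

For a $100,000 asset with a 5 percent annual loss probability, spending $2,000 on mitigation clears the ceiling and spending $8,000 does not, though in practice, as noted, the probability input is often a best guess.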
Risk mitigation strategies, therefore, tend to fall into general approaches that address a number of
different risks. Typical examples include:
• Multiple, overlapping backups of critical data
• Failover servers in the event of a hardware failure
• Off-site storage of backups and alternative servers in case of physical damage
• Preventive measures, such as firewalls, intrusion prevention systems (IPSs), and content-
filtering applications to prevent breaches and the introduction of malware
• Application user interfaces (UIs) designed to prevent accidental destruction of data
• Database integrity constraints to prevent accidental loss of information—for example,
deleting a customer record when the customer has open orders in the database
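The last bullet can be demonstrated with an in-memory SQLite database; the table layout is invented, but the principle, a foreign key constraint rejecting the delete while open orders exist, is general:

```python
import sqlite3

# Sketch: a database integrity constraint blocking deletion of a customer
# with open orders (hypothetical schema, in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id))""")
conn.execute("INSERT INTO customers VALUES (1)")
conn.execute("INSERT INTO orders VALUES (10, 1)")

def delete_customer(conn, customer_id):
    """Attempt the delete; the constraint rejects it while orders reference it."""
    try:
        conn.execute("DELETE FROM customers WHERE id = ?", (customer_id,))
        return True
    except sqlite3.IntegrityError:
        return False
```

The accidental delete fails while order 10 exists and succeeds once the orders are cleared, with no application-level code needed to enforce the rule.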
Understanding the types of risks that confront IT operations is a fundamental part of the planning
process. It is also closely related to another core IT process: business continuity planning.
Business Continuity
The goal of business continuity planning and management is to minimize the chance of a
business disruption. The risks outlined can lead to an outage of business service. Although the
risk management planning process tries to minimize the chance of these risks actually disrupting
operations, business continuity addresses what to do when those risks are realized.
Business continuity planning creates policies and procedures that dictate what to do in the event
of a business disruption. These plans leverage the resources put in place as part of the risk
mitigation strategy. For example, offsite backups can be restored to a backup server at a remote
site in the event the primary site is destroyed by fire. To be effective, these plans must be:
• Detailed: application managers and network administrators should not have to think of
undocumented but necessary steps to restore operation (for example, updating a DNS
record to point to a backup instead of a primary server)
• Tested to ensure the procedures accomplish the prescribed goals
• Rehearsed so that staff are not executing these procedures for the first time during a
disruptive event
Business continuity planning is not an isolated set of tasks that are done at one time,
documented, and put on the shelf until the next audit. It is tightly linked to the risk management
aspects of IT planning as well as to the security operations of an IT organization.
Figure 2.1: Information security and compliance share the common goals of information integrity and
confidentiality.
It is worth noting that the United States has adopted a decentralized approach to privacy,
protecting, for example, healthcare information at the federal level while leaving general privacy
regulations to the states. Unlike the U.S., many other countries, including the European Union
members, Australia, and Canada, have adopted comprehensive privacy legislation at the national
and transnational levels.
Information Integrity
Maintaining the integrity of business and government information is essential to maintaining the
trust of markets, constituents, and others outside those organizations. This reality became
abundantly clear with the fiscal reporting scandals that occurred at Enron, WorldCom, Tyco, and
other large businesses just a few short years ago.
In response to the growing awareness of the importance of maintaining the integrity of publicly
reported information, governments passed a number of regulations to minimize the chance of any
further corporate accounting debacles. The most well-known legislation is probably the
Sarbanes-Oxley Act (SOX), which defines responsibilities for maintaining and reporting
accurate information on publicly traded companies in the United States.
In addition to SOX, some less well-known integrity measures include:
• Computer Fraud and Abuse Act
• Electronic Signatures in Global and National Commerce Act
• Gramm-Leach-Bliley Act
Like privacy protections, the movement to preserve accurate business reporting is a transnational
undertaking. For example, the Bank for International Settlements established the Basel II
standards to ensure that banks accurately report risks associated with their investments.
Information integrity regulations have also targeted other industries. The U.S. Food and Drug
Administration (FDA), for example, has established policies governing the recording, reporting,
and storing of information related to the production of pharmaceutical products in the 21 CFR
Part 11 regulations.
For more information about compliance from an IT perspective, see the IT Compliance Institute at
http://www.itcinstitute.com/.
Information Security
Of all the areas that comprise systems management, information security is the largest and most
difficult. It is the most difficult because there are adversaries actively trying to compromise
security measures. It is the largest because there are so many areas that have to be addressed;
virtually every aspect of IT is touched by security issues or plays a role in maintaining security.
The areas of information security most closely associated with systems management include:
• Threat assessment
• Vulnerability management
• Managing countermeasures
• Auditing
• Incident response
• Change control
• Information security management
These domains within information security require individual planning and management yet
depend on each other to be effective.
Threat Assessment
Threats to IT seem ubiquitous since the widespread adoption of the Internet. A threat is a person,
program, or process that can compromise the confidentiality, integrity, or availability of
information or systems.
Threats should not be confused with vulnerabilities, which are weaknesses, deficiencies, or errors in
applications, OSs, network devices, or procedures that can be exploited by a threat. Vulnerabilities
are addressed in the next section.
Threat assessment is the practice of determining who and what can damage an IT system. Of
course, a human is ultimately responsible for every threat, but the direct actions of a hacker
trying to break into a system require different responses than those of a malware writer who
unleashes a virus to delete randomly selected files from victims’ hard drives. For this reason, it is
useful to think in terms of categories of threats, such as:
• Information theft, a threat to confidentiality
• Information tampering, a threat to integrity
• DoS attacks, a threat to availability
• Viruses, worms, and other malware, potential threats to confidentiality, integrity, and
availability
• Spam, a threat to availability
• Phishing attacks, a threat to confidentiality
• Spyware and other potentially unwanted programs (PUPs), a threat to confidentiality and
availability
With an understanding of the broad categories of threats, the next step is to understand how these
threats are carried out. For example, information theft can occur when a hacker compromises a
database server and steals credit card information; it can also occur when a disgruntled employee
uses legitimate access rights to collect data for unauthorized purposes. In the case of malware, a
virus can be downloaded along with email through an organization’s email server; infection can
also occur when a laptop user browses a compromised Web site from a poorly secured network
at home.
Threat assessment is the practice of discovering potential threats and understanding the motives
for those threats. In general, you cannot prevent threats—they exist outside of your control. You
can, however, minimize the chances that a threat can successfully compromise your
infrastructure. This is the role of vulnerability management.
Vulnerability Management
Vulnerability management is the practice of identifying and compensating for weaknesses in
systems, applications, and procedures that can be exploited by threats to breach a system. Like
threats, there are a variety of types of vulnerabilities, including:
• Misconfigured network software that allows hackers to use those programs to gain access
to protected resources
• Errors in OS software that allow malware writers to gain elevated privileges and execute
destructive programs on a compromised host
• Poorly designed programs that do not validate input parameters, resulting in the
commonly exploited condition known as a buffer overflow
• Organizational policies and procedures that do not account for the potential for attacks or
thefts from internal personnel—for example, not rotating duties of employees in critical
functions
There are several ways to combat vulnerabilities. First, keep OSs, applications, and network
software up to date with security patches. Some systems, such as Microsoft Windows, make it
relatively easy for single users or small organizations by offering tools such as Windows Update.
As the size of an enterprise increases, more sophisticated tools are required that include
centralized management and rollback capabilities. Of course, not all applications have tools for
automatically downloading patches from a vendor site. For example, updating a database
typically requires a manual download of a patch, which is then applied by a database
administrator.
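At its core, the centralized patch management described here needs to answer one question: which hosts still lack a given patch? A sketch with invented host and patch identifiers:

```python
# Sketch: selecting hosts that still need a given patch, assuming a
# hypothetical record of applied patches per host (IDs are invented).
applied = {
    "db01":  {"KB5001", "KB5002"},
    "web01": {"KB5001"},
    "app02": set(),
}

def needs_patch(applied, patch_id):
    """Hosts that have not yet recorded the patch as applied."""
    return sorted(host for host, patches in applied.items()
                  if patch_id not in patches)
```

A real tool would add the scheduling and rollback capabilities the text mentions, but they all build on this host-by-patch view.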
Applying patches to production systems can introduce as well as resolve problems. See the section
on change management for more details and caveats.
Second, employ code reviews and software analysis tools to check for common vulnerabilities in
custom developed software. This is more within the realm of software engineering than systems
management, but systems administrators should be confident that reasonable and prudent
measures have been taken to ensure the quality and safety of any application before they deploy
it on their networks.
Third, implement organizational policies and procedures that minimize the chance of a breach or
theft by an internal staff member. Unfortunately, these crimes are more common and serious than
you might expect. For example, a Florida man who was the controlling owner of an Internet
advertising company was recently convicted and sentenced to 8 years in federal prison for
stealing more than 1 billion records containing personal information, such as
names, physical addresses, and email addresses, from Acxiom Corporation, a personal
information repository and distributor (details at http://www.cybercrime.gov/levineSent.htm).
For more examples of internal-based breaches, see the U.S. Department of Justice Cybercrime site
at http://www.cybercrime.gov/cccases.html.
Finally, understand that vendors are not always the first to detect a vulnerability in their
software. Researchers, developers, systems managers, and others may discover and report
vulnerabilities to the public through one of the large, public repositories of system
vulnerabilities.
Tracking Vulnerabilities
A number of public databases and related tools are available to systems managers in addition to
information provided by vendors. These include:
● The National Vulnerability Database (http://nvd.nist.gov/) is a government-sponsored database of
all publicly known vulnerabilities. It contains tens of thousands of vulnerabilities as well as a
number of cybersecurity alerts cross referenced from the U.S. Computer Emergency Response
Team (CERT) from http://www.us-cert.gov/cas/techalerts/.
● The Open Source Vulnerability Database (OSVDB—http://www.osvdb.org/) project also maintains
a database of known vulnerabilities. The OSVDB includes support for exporting entries to XML
files for importing into other databases.
● The Common Vulnerability and Exposure dictionary (http://cve.mitre.org/) is a standard naming
convention for identifying vulnerabilities. It is not a separate database of vulnerabilities but a tool
for sharing information across vulnerability databases and making it easier for systems
administrators, developers, and other users to query those databases.
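Because CVE is a naming convention rather than a database, even a simple format check is useful when exchanging vulnerability data across these repositories. A sketch (the pattern reflects the published CVE-YYYY-NNNN format, where the sequence number is four or more digits):

```python
import re

# Sketch: validating CVE identifiers against the published naming convention.
CVE_PATTERN = re.compile(r"^CVE-\d{4}-\d{4,}$")

def is_cve_id(text):
    """True if the string follows the CVE identifier format."""
    return bool(CVE_PATTERN.match(text))
```

A shared, checkable identifier format is what lets one tool query several vulnerability databases for the same entry.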
Once a vulnerability is found, it should be addressed by either patching the vulnerable code or
deploying a workaround. This part of vulnerability management overlaps with some of the tasks
associated with change control.
Change Control
Change control in IT is like maintaining a plane while it is in flight. Too often, systems
administrators do not have the luxury of shutting down systems and keeping them offline to
update software and hardware, test it thoroughly, and bring back users in a controlled manner.
Instead, software patches, upgrades, and software installations have to be done with minimal
disruption to operational systems.
38
Chapter 2
Configuration management databases are essential to efficient change management. See Chapter 4
for more information about this topic.
Although change management tools help to plan for infrastructure-level changes, auditing helps
administrators understand what is happening within those systems at any given moment.
System Events
System events occur within OSs. Three of the most important types of events are access control
events, configuration change events, and performance measurements. Access control events
include:
• Successful and failed logins
• User lockouts due to multiple failed attempts
• Failed file access due to access control violations
When these events occur, the identity of the user as well as the time and device (for example, IP
address) should be tracked.
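The tracking described above can be sketched as a small event recorder that logs identity, time, and source device for each login attempt and locks an account after repeated failures. The threshold and field names are illustrative assumptions, not a recommendation.

```python
from collections import defaultdict
from datetime import datetime

LOCKOUT_THRESHOLD = 3  # illustrative policy value

failed_counts = defaultdict(int)
audit_log = []

def record_login(user, success, ip, when=None):
    """Record one login event and return the resulting account state."""
    when = when or datetime.now()
    audit_log.append({"user": user, "success": success, "ip": ip, "time": when})
    if success:
        failed_counts[user] = 0  # a successful login resets the counter
        return "ok"
    failed_counts[user] += 1
    if failed_counts[user] >= LOCKOUT_THRESHOLD:
        return "locked"
    return "failed"

record_login("alice", False, "10.0.0.5")
record_login("alice", False, "10.0.0.5")
print(record_login("alice", False, "10.0.0.5"))  # locked
```

Note that the audit record keeps the user identity, timestamp, and IP address even for failed attempts, which is exactly the forensic detail described above.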
Configuration change events occur when, in the case of Windows, a registry setting is changed
or, in UNIX OSs, when configuration files are changed. The identity of the user making the
change, the old and new values of the change, and the time and the device from which the
change is made are some of the characteristics that may be tracked.
Performance measurements indicate levels of system activity. A wide variety of
performance measurements may be collected, including:
• Disk I/O rates
• Page fault rates
• Percent of CPU time in different modes
• Number of files open
• Number of network connections established and connected
• Network segments received per second
These measures are specific to OS and network performance; individual applications may be
monitored as well.
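In practice, raw samples of these counters are aggregated into per-interval summaries. The sketch below uses hard-coded sample values for illustration; in a real system they would come from OS performance counters.

```python
# Hypothetical raw samples for one polling interval.
samples = [
    {"cpu_busy_pct": 42.0, "disk_io_per_s": 120, "page_faults_per_s": 15},
    {"cpu_busy_pct": 58.0, "disk_io_per_s": 180, "page_faults_per_s": 25},
    {"cpu_busy_pct": 50.0, "disk_io_per_s": 150, "page_faults_per_s": 20},
]

def summarize(samples):
    """Average each counter across the sampling interval."""
    keys = samples[0].keys()
    return {k: sum(s[k] for s in samples) / len(samples) for k in keys}

print(summarize(samples))
```

Averaging (or taking minima/maxima) over intervals keeps the volume of stored measurement data manageable while preserving the trend.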
Application-Level Auditing
The type and volume of audit information tracked by applications varies widely. Some
applications will log details of startup and shutdown processes, error events, database accesses,
files opened, and other details of normal operations.
In addition, some applications support a detailed, debugging level of auditing that provides much
more detail than is normally recorded in audit logs. Debugging detail is designed to log
information about the execution path of a program, indicating which modules are executed,
conditions of key variables at the time of execution, and other details that help programmers and
support personnel identify problems. This level of detail is not normally needed for application
monitoring, only for problem resolution.
User Auditing
In some especially secure environments, it is important to have a record of user activity. This
record can include login attempts, use of various resources (including files and applications), and
programs executed. It may also record details of commands issued. For example, if someone
attempts to copy a file from a secure server to another server using FTP, the file name, the target
FTP site, and the date, time, and user identity should be recorded.
Auditing information is useful for systems management as well as for security purposes. It can
be especially useful for incident response.
Incident Response
The purpose of incident response is to limit the damage caused by a security breach. Ideally,
organizations will have incident response plans in place that dictate how IT staff and
management should respond to a security incident. Depending on the type of incident (for
example, a virus infection, a database break-in, or a DoS attack), the incident response plan
should describe the steps to mitigate the risks of damage. These steps can include:
• Removing a compromised server from the network
• Blocking traffic at a firewall
• Monitoring user activity if an unauthorized action is underway
• Notifying management
• Securing audit logs for forensic analysis
Like so many other security and systems management activities, incident response is most
effective when a comprehensive set of information is available about servers, applications, and
other devices within the IT infrastructure. An accurate and up-to-date centralized configuration
database is as important to enterprise security management as it is to operational systems
management.
Capacity Planning
Capacity planning is one of the better examples of a systems management domain that leverages
the information and practices of other domains. To accurately gauge how much storage space,
how many CPUs, or how much bandwidth will be required to support operations at some point in
the future requires information about:
• Current loads on servers and the network, which is gathered during performance
monitoring
• Growth in application loads, which in part, is determined when aligning IT operations
with business strategy
• Dependencies between existing systems and proposed additions to infrastructure, which
uses data from change management practices
• Trends in security issues, such as the rate of growth in spam and malware targeted to the
enterprise network
Capacity planning requires a combination of looking backward for data and looking forward to
anticipated changes. It also requires a firm understanding of existing resources and their levels of
use. This is one of the elements of asset management.
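The backward-looking side of capacity planning can be sketched as a simple least-squares trend fitted to historical utilization, projected forward. The monthly storage figures below are invented for illustration; real forecasts would also fold in anticipated business changes.

```python
# Hypothetical history: months elapsed and storage used (GB).
months = [0, 1, 2, 3, 4, 5]
used_gb = [100, 110, 121, 128, 141, 150]

# Ordinary least-squares fit of used_gb against months.
n = len(months)
mean_x = sum(months) / n
mean_y = sum(used_gb) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, used_gb)) / \
        sum((x - mean_x) ** 2 for x in months)
intercept = mean_y - slope * mean_x

def forecast(month):
    """Project storage demand for a future month along the fitted trend."""
    return intercept + slope * month

print(round(forecast(12)))  # projected GB needed a year out: 220
```

With roughly 10 GB of growth per month in this invented data, a one-year projection gives planners a concrete acquisition target well before the space is exhausted.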
Asset Management
Assets are hardware and software components that provide for particular services within the IT
infrastructure. Servers, desktops, routers, firewalls, databases, ERP applications, LDAP directory
servers, and a range of other devices and applications fall into this category. The scope of asset
management, at a minimum, includes:
• Acquiring assets
• Deploying assets
• Configuring assets
• Maintaining assets
• Retiring assets
The specific details of each of these will vary with the type of asset, but some general principles
hold for all.
Acquiring Assets
The acquisition of assets is closely tied to capacity planning. When capacity planning finds
that additional resources are required, the acquisition process is initiated.
Requirements are defined, designs are formulated, configurations are determined, and the
necessary assets are purchased. Also during this phase, dependencies are analyzed to determine
how the introduction of the new asset will impact other parts of the infrastructure.
This process is especially important with assets that serve multiple business services. For
example, firewalls provide a core network service and could potentially affect every other
service and device on the network. A single-user desktop application, however, would have
limited impact on others in the organization and could be introduced with less thorough
planning.
If testing was thorough, the configuration of the new asset should function in the production
environment, but there is always the potential for overlooking a configuration parameter or
missing a dependency, and configuration changes may be needed after an asset is deployed in
production. These steps begin to border on maintenance.
Service Delivery
Service delivery is the process of ensuring that functions and resources needed by the
organization are provided in a reliable and cost-effective manner. As is common in systems
management, there is some overlap with other core processes. The main components of service
delivery are:
• Service level management
• Financial management for IT services
• Capacity management
• Availability management
• IT service continuity management
Summary
The core operations of systems management are designed to support the strategic objectives of
an organization. That is the starting point for the core services of systems management. With a
clear and well-defined alignment of business objectives, IT professionals can plan for the
capacity needs of the organization, weigh potential risks and mitigate them appropriately, and ensure
the continuity and operational integrity of IT operations. Systems management professionals have
always had significant responsibility in the area of systems security, and those responsibilities
have expanded to support organizational efforts to remain in compliance with a host of
government regulations. Other areas of systems management attend to the needs for capacity
planning, asset management, and service delivery. As this chapter has demonstrated, the range of
systems management is broad and extends beyond the boundaries of the traditional IT
department into the business units they serve.
Chapter 3
For more information about the three IT management methods, see Chapter 1.
You can take much from what others have learned if you keep in mind several principles about
the use of best practices as they apply to SOM:
• IT services have much in common
• IT services are interdependent
• IT services can and should be measured
• IT services are repeatable processes
• IT services are broadly applicable
These principles speak to management of IT services within as well as across organizations.
They are also embodied in the four frameworks described in the following sections.
KPIs
KPIs are events or attributes that are measurable and correspond to the level of service delivered.
There are several types of KPIs with varying characteristics:
• Technical
• Financial
• Organizational
Best practices for a particular area might include more of some of these than others, but the most
comprehensive best practices address all the main types.
Technical KPIs
Some KPIs are easily identified, especially technical ones, such as megabytes of data transmitted
over a network segment in a given period of time, the latency on a network, the storage utilized
on a disk array, and the percent of available CPU time utilized for application processing. By
their very nature, technical KPIs are easily quantified. They are also easily gathered, relatively
speaking. Applications, OSs, and dedicated appliances can generate large amounts of data about
performance and capacity.
The ease with which data on technical measures is generated is both an advantage and a
disadvantage; information overload is a constant problem when managing with technical
elements of IT services. Thus, the goal of measuring IT services is not to measure all services or
every dimension of an operation but to focus on a small number of key measures that are
indicative of the overall performance of the service.
As Figure 3.1 shows, even simple operations, such as measuring CPU and disk activity, can
generate too much data to allow for quick assessments of the state of an operation. KPIs for
server performance might include:
• Percent of non-idle CPU time
• Disk reads and writes per second
• Total bytes received and sent per second from a network interface
• Number of page faults per second
This set of measurements provides one measure per major functional area of a server (CPU, disk,
network, and memory) and can be monitored nearly continuously or polled at longer intervals
with the data aggregated to provide a performance measure for a specific period of time.
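The "small number of key measures" idea can be sketched by reducing the four server KPIs to a simple status check against thresholds. The threshold values here are illustrative assumptions, not tuning recommendations.

```python
# Hypothetical alert thresholds for the four server KPIs named above.
thresholds = {
    "cpu_busy_pct": 85.0,
    "disk_ops_per_s": 500.0,
    "net_bytes_per_s": 1e8,
    "page_faults_per_s": 1000.0,
}

def kpi_status(reading):
    """Flag any KPI whose reading exceeds its threshold."""
    return {k: ("alert" if reading[k] > limit else "ok")
            for k, limit in thresholds.items()}

print(kpi_status({"cpu_busy_pct": 91.0, "disk_ops_per_s": 120.0,
                  "net_bytes_per_s": 2e6, "page_faults_per_s": 40.0}))
```

An operator scanning four statuses per server gets a quick assessment of overall state; the full counter detail is consulted only when a KPI raises an alert.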
Figure 3.1: Information overload is a common problem when measuring technical performance.
As important as technical measures are, they do not provide a complete picture of the state of IT
operations. Financial measures are another critical component of IT operations management.
Financial KPIs
Financial KPIs allow managers to assess the value of specific IT operations and services relative
to their costs. Unlike technical measures, financial measures do not generate the massive
volumes of data associated with machine-generated measures.
Financial measures tend to focus on the cost of labor and equipment, the return on investment
(ROI) of proposed purchases, and financial management issues, such as budgeting and cash flow.
These tasks are well understood and documented elsewhere; the focus here is on topics that are
too often overlooked or under-addressed in textbook discussions of IT management.
For information about other aspects of IT financial management, see resources such as
ComputerWorld’s IT Management Knowledge Center at
http://www.computerworld.com/managementtopics/management, and CIO Magazine’s CIO Resource
Center at http://www.cio.com/leadership/itvalue/.
When formulating financial measures, be sure to understand the scope of the measure. For
example, the “cost” of a server may be stated as $20,000, when in fact that is the cost to purchase
the server from the vendor. The full cost of introducing that server into the organization would
have to include at least the vendor invoice amount, plus:
• Labor costs to install and configure the server and its OS
• Staff time dedicated to change management operations, including plan review for the
server
• Information security staff time spent locking down the server and auditing it as needed
• Compliance management staff time spent understanding implications of the use of the
server—for example, will confidential financial information be stored on the server?
• Network services support time spent updating routers, firewalls, intrusion prevention
systems (IPSs), and other services that must be aware of the presence of new devices
• Server support staff time required to add the server to the backup and disaster recovery
process
• Application support time required to install and configure packaged or custom
applications running on the server
• Additional software licenses incurred because of the new server
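The full-cost calculation described above amounts to summing every component alongside the invoice price. All dollar figures in this sketch are invented for illustration.

```python
# Hypothetical cost components for bringing one server into production.
cost_components = {
    "vendor_invoice": 20000,
    "install_and_configure_labor": 2400,
    "change_management_review": 600,
    "security_lockdown_and_audit": 1200,
    "compliance_review": 800,
    "network_services_updates": 500,
    "backup_dr_integration": 900,
    "application_install": 1500,
    "additional_licenses": 3000,
}

total_cost = sum(cost_components.values())
print(total_cost)  # full cost, versus the $20,000 "sticker" price: 30900
```

Even with invented numbers, the point stands: the invoice amount can understate the true cost of a server by half or more once labor and licensing are counted.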
Accurate financial measures are often difficult to formulate and, in reality, we often settle for
estimates. In addition to understanding the breadth of costs related to IT, it is important to avoid
unintentionally equivocating about the meaning of terms.
Related to identifying the scope of terms appropriately, you also must use terms precisely. Too
often within an organization, a single term will take on multiple meanings, depending on the
context. For example, to the sales department, the cost of goods sold may include the price paid
for a good, shipping costs, and storage and inventory management costs; the finance department
may include all those factors as well as the sales commission paid to the salesperson who made
the sale. It is not the case that one group is wrong and another is right. The problem lies in
multiple uses of the same term. Using multiple terms, such as pre-sales cost of goods sold and
post-sales cost of goods sold, can help avoid this confusion. As difficult as financial measures are
to formulate precisely, they are not as challenging as organizational KPIs.
Organizational KPIs
Organizational KPIs are soft measures; they do not have obvious quantifiable aspects, as
technical and financial measures do. Technical measures are relatively easy to grasp. The
problem with them tends to be too much information. In the case of financial measures, you must
define terms precisely and with appropriate scope to accurately reflect the costs and benefits of
investments. Just defining organizational KPIs is difficult. Some of the areas that are included in
organizational KPIs are:
• Training level of staff
• Ability to incorporate emerging technologies into existing infrastructure
• Ability to execute new organizational models, such as partnering and outsourcing
• Ability of IT to meet needs and expectations of business units
• Level of overall compliance with government regulations
Although difficult to quantify, organizational measures reflect the ability of an organization to
execute strategies and perform operations.
Another aspect of these different types of KPIs is that they are not independent of each other.
The ability to effectively provide key technical services depends upon the ability to fund the staff
and equipment needed; having a well-trained staff that understands change management
procedures and executes them appropriately is an organizational KPI that has direct impact on
technical operations.
Figure 3.2: The three types of factors that are measured by KPIs interact and influence each other.
Measurement is a key process in SOM, particularly in the frameworks and best practices that
support it. Another characteristic of these best practices is the ability to leverage repeatable
processes.
For more information about organizational reengineering, see Michael Porter’s Competitive
Advantage: Creating and Sustaining Superior Performance (New York: The Free Press, 1985), Peter
Drucker’s “The Coming of the New Organization” (Harvard Business Review, Jan-Feb. 1988), and M.
Hammer and S.A. Stanton’s The Reengineering Revolution: A Handbook (New York, Harper
Business, 1995).
Process reengineering has had its counterpart in IT with the widespread adoption of standard
process management policies and procedures. The goal is typically to improve consistency and
quality of services while controlling costs. Many of the frameworks described in this chapter
emphasize specific processes, including service level management, change management, disaster
recovery, capacity planning, security management, and a host of other essential IT services.
The focus on processes within IT has been driven by several advantages provided by their
adoption:
• Ability to deliver consistent and predictable performance—For example, for tasks
ranging from simple ones, such as adding user access rights to an application, to more
complex processes, such as incident response
• Ability to measure performance and compare results—With consistent, repeatable
processes, KPIs can be identified and measured
• Ability to improve procedures—Again, with consistent procedures, organizations can
measure performance, analyze performance data, and identify weak areas in those
processes
• Ability to justify budgetary needs—With hard numbers on system capacity and trends in
growth of users and applications, IT managers can more effectively defend their requests
for appropriate funding
Processes are common to virtually all IT operations, so it is not surprising to find them
prominently in best practice frameworks, especially those so closely associated with SOM
practices. This fact highlights another aspect of these frameworks—that is, they leverage broadly
applicable models across industries.
The following sections will examine the particulars of each of these best practice frameworks.
ITIL is an open standard, so it can be freely adopted by organizations. The content of the ITIL
references is copyright protected, however. To purchase ITIL framework books, see
https://securewsch01.websitecomplete.com/itilsurvival/shop/showDept.asp?dept=17. Community
support is available at http://www.15000.net/.
The final element of service delivery addressed in ITIL is financial management with an
emphasis on understanding the total cost of ownership (TCO) of IT resources. As described
earlier in the section on financial KPIs, comprehensive measures, which take into account all
costs, are fundamental to financial management.
Although service delivery tends to address longer-term planning challenges in IT, the service
support discipline of ITIL concentrates on shorter-term needs and issues.
This shift from a narrow, problem-centric approach to a more comprehensive view only works when
service support staff has comprehensive information. A central aspect of SOM is the use of a
centralized repository of information in the form of the configuration management database (CMDB).
Without a CMDB or similar database, service support reverts to a less-effective silo-based problem
management practice.
Chapter 4 will provide details about CMDBs and their role in SOM.
A centralized approach to information sharing supports other areas of service support, including
problem management, configuration management, change management, and release
management.
Chapter 1 includes a more detailed discussion of business alignment with a discussion of coherent
business strategies, managing multiple objectives, and dynamic requirements.
Security Management
ITIL has adopted ISO 17799 as a basis for security management. That framework is discussed in
more detail later in this chapter.
Infrastructure Management
ITIL’s section on infrastructure management addresses four elements: design and planning,
deployment, operations, and technical support. The design and planning part of infrastructure
management spans business requirements to technical and architectural issues surrounding the
development of IT infrastructure. Tasks include developing business cases for plans, conducting
feasibility studies, and designing architectures. The deployment operations include project
management and release management procedures to improve the likelihood of a successful
rollout of new hardware and applications. Operations management addresses the day-to-day
activities that keep an IT infrastructure operational. These include system monitoring, log
review, job scheduling, backup and restore operations, and utilization monitoring. Technical
support encompasses a number of services, including documentation, specialist support for
problem resolution, and support for technical planning.
Release Management
Once software components have been acquired or developed, and tested in a quality assurance
environment, they are ready for production release. Release management is the practice of
moving software components into operation; this entails several steps, including:
• Adding software to a definitive software library
• Analyzing dependencies in the production environment and ensuring that the new
software is configured to function properly
• Scheduling resources to install and configure software
• Coordinating with training, Help desk, and other support personnel
Release management is a bridge process that moves software from project to operational status.
COBIT
Governance has grown in importance along with increasing demands for compliance with
government regulations. For publicly traded companies and government agencies in particular,
ad hoc management procedures are no longer sufficient. Well-defined policies and practices that
support specific objectives defined in regulations are demanded of IT professionals.
COBIT was developed by the Information Systems Audit and Control Association (ISACA) as a
framework for controlling IT operations. Although there is less emphasis on execution than ITIL
offers, much of COBIT can help improve operations. COBIT is well designed to support
governance and complements ITIL’s focus on operational processes.
COBIT is a process-centric framework with four broad subdivisions:
• Planning and organizing
• Acquiring and implementing
• Delivering and supporting
• Monitoring and evaluating
Like the disciplines in ITIL, these processes are common to IT operations regardless of size or
industry. Within the COBIT framework, these processes are managed through a series of
controls. Each control includes an objective that is to be achieved, a method for achieving it, and
metrics for measuring the success of the control objective.
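A COBIT-style control, then, can be thought of as a record with three parts: objective, method, and metrics. The sketch below represents one such control as a data structure; the example control itself is hypothetical, not taken from the COBIT text.

```python
from dataclasses import dataclass, field

@dataclass
class Control:
    """One control: an objective, a method for achieving it, and success metrics."""
    objective: str
    method: str
    metrics: list = field(default_factory=list)

# Hypothetical change-control example.
change_control = Control(
    objective="All production changes are authorized and documented",
    method="Changes pass through a review board before deployment",
    metrics=["percent of changes with approved requests",
             "number of emergency (unreviewed) changes per quarter"],
)

print(len(change_control.metrics))  # 2
```

Keeping metrics attached to each control makes it straightforward to report, per control, whether its objective is actually being met.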
As the name implies, controls are in place to ensure objectives are met and processes can be
improved. These controls help to define the operational tasks that must be performed to maintain
compliance with both internal and external process requirements. Although COBIT is not
designed for a particular regulation, the breadth and focus of the framework makes it well suited
for meeting the demands of many regulations.
For details on COBIT, see the ISACA Web site’s COBIT offerings at
http://www.isaca.org/Template.cfm?Section=COBIT6&Template=/TaggedPage/TaggedPageDisplay.c
fm&TPLID=55&ContentID=7981.
For more information about ISO 17799, including training material, articles, and compliant policies,
see the ISO 17799 Information Security Portal at http://www.computersecuritynow.com/. Two user-
supported sites provide additional information—the ISO 17799 Guide at http://iso-
17799.safemode.org/ and the ISO 17799 Community Portal at http://www.17799.com/. The full
standard can be purchased and downloaded from http://17799.standardsdirect.org/.
The full Risk Management Guide for Information Technology Systems is freely available at
http://csrc.nist.gov/publications/nistpubs/800-30/sp800-30.pdf.
While recognizing the need to align business and technical objectives of risk management, the
guide defines three processes in risk management:
• Risk assessment
• Risk mitigation
• Evaluation and assessment
The first step, risk assessment, comprises seven steps:
• System characterization, which defines the scope of the risk management effort and
identifies the assets and organizational units (OUs) involved in the effort.
• Threat assessment identifies threats, or potential agents of disruption, along with their
sources.
• Vulnerability assessment discovers weaknesses in existing infrastructure that leave the
system predisposed to disruption by threats.
• Control analysis, the fourth step, examines the controls, or countermeasures, in place or
planned for deployment that mitigate the potential for disruption by threats.
• Likelihood determination tries to pin down the probability of disruption given a set of
threats and vulnerabilities. This process takes into account motivation and capabilities of
the potential perpetrators, the nature of system vulnerabilities, and the effectiveness of
existing controls.
• Impact analysis determines the cost of disruptions caused by a threat being exercised
against an organization.
• Risk determination takes into account the impact of a threat and the likelihood to
determine the risk to the organization from that threat.
With the outcome of the risk determination phase, an organization can move to the next stage,
risk mitigation.
Risk Mitigation
During the risk mitigation phase, information learned in the risk assessment phase is used to
determine appropriate measures for reducing risk for the least cost and with the least disruptive
impact on the organization. The risk mitigation phase has several components:
• Understanding risk mitigation options
• Developing and implementing risk mitigation strategy
• Conducting cost benefit analysis and dealing with residual risk
There are several risk mitigation options outlined in the NIST guide:
• Risk assumption, which essentially accepts the risks or provides for some controls to
reduce the risk
• Risk avoidance, which requires steps to remove the cause of the risk
• Risk limitation, which lessens the impact of a risk by use of preventive controls
• Risk planning, which introduces prioritized controls
• Research, which entails investigating the risk in an effort to discover new controls
• Risk transfer, which entails purchasing insurance to transfer the risk to a third party
The guide provides several rules of thumb for risk mitigation strategies. First, if a risk does exist,
try to reduce the likelihood the vulnerability will be exercised by applying layered protection and
other architectural devices and administrative controls. Second, increase the cost to the potential
perpetrator so that the cost exceeds the value of the information stolen. Finally, when the cost is
great, purchase insurance to mitigate risk.
The risk mitigation strategy is implemented through a series of technical, management, and
operational controls. Technical controls contain some element of hardware, software, or
architectural countermeasure to mitigate risks. Management controls focus on policies,
procedures, and guidelines that work in conjunction with other types of controls to mitigate risks.
Operational controls focus on the governance of security measures and the identification of
weaknesses in the existing security posture of an organization.
Cost benefit analysis studies help to identify the set of controls in place, their cost, and their
impact on reducing risk. The purpose of conducting a cost benefit analysis is to find the best
combination of controls that mitigate the greatest risks for the least cost. However, even with
properly implemented controls and solid governance processes, risks may still remain. These are
known as residual risks.
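One simple way to approach the "greatest risk reduction for the least cost" selection is a greedy pass over candidate controls, ranked by risk reduction per dollar, until the budget is spent. The controls, costs, and reduction figures below are invented for illustration; real analyses weigh many more factors.

```python
# Hypothetical candidate controls with cost and estimated risk reduction.
controls = [
    {"name": "patch automation", "cost": 10000, "risk_reduction": 40},
    {"name": "network IPS", "cost": 25000, "risk_reduction": 50},
    {"name": "staff training", "cost": 5000, "risk_reduction": 15},
]

def select_controls(controls, budget):
    """Greedily pick controls with the best reduction-per-dollar ratio."""
    chosen, remaining = [], budget
    for c in sorted(controls, key=lambda c: c["risk_reduction"] / c["cost"],
                    reverse=True):
        if c["cost"] <= remaining:
            chosen.append(c["name"])
            remaining -= c["cost"]
    return chosen

print(select_controls(controls, budget=20000))
```

Whatever risk the selected controls do not cover is the residual risk the organization must consciously accept or insure against.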
Summary
There is no need to reinvent the wheel of IT management. Best practice frameworks, ranging
from broad frameworks covering all major areas of IT management to more targeted guidelines,
have been developed and are readily available for adoption by IT practitioners. These guidelines
provide details about what should be done. The next chapter begins to analyze how to implement
these practices using the tools of SOM.
Chapter 4
Figure 4.1: System management depends upon tools for implementing policies and procedures that are
based on an IT strategy that is aligned, along with other division strategies, with the overall organizational
strategy.
The policies, procedures, and related management tools are links from strategy to
implementation. Today, common application development models are distributed and service
oriented. The procedures and tools for managing those applications should be as well.
The remainder of this chapter will begin a detailed examination of what is required to move to a
service-oriented management model and will address the following areas:
• Asset tracking
• Structure of configuration management databases (CMDB)
• CMDBs and asset life cycles
• CMDBs and service-oriented architecture
Together, these describe the fundamental components and processes that will support other
aspects of service-oriented systems management.
Asset Tracking
IT management has dual aspects by nature: it is both a process-oriented system closely aligned
with business objectives and an asset-centric system of highly interdependent devices that are
often in a state of change. Although the process-oriented aspects of IT are critical to the
successful use of IT, this chapter will focus on asset tracking. Asset tracking is the discovery and
tracking process of asset management; asset management also includes the management of the
financial and contractual details of the assets in the IT environment. Asset tracking can be
divided into several sub-topics (with some overlap with other areas of systems management,
which will be addressed in future chapters):
• Inventory management
• Patch management
• System security
• Risk management
• Licensing
• Service delivery
Together, these management areas constitute the fundamental areas of asset management.
Inventory Management
Inventory management sounds like a pretty basic operation. After all, how difficult can it be to
count a bunch of PCs, servers, and peripherals? If only it were that easy. Inventory management
is as complex as the devices encompassed in the process and as diverse as the allocation of
assets. The term inventory management might conjure up images of a warehouse stocked with
boxes, clearly labeled and bar-coded, organized in an optimal way to make the most efficient use
of storage. In the world of manufacturing or distribution, this might be the norm, but IT
inventory is more varied.
Consider a desktop workstation. What needs to be tracked to ensure the device is properly
inventoried? This is essentially the question of which physical and electronic units must be
tracked. A workstation is not a single monolithic device; it consists of multiple
devices that could be tracked:
• Monitor
• Keyboard and mouse
• Case, motherboard, and CPU
• Memory
• Hard drives, CD drives, and DVD drives
• Peripherals such as speakers, Web cams, and similar devices
And that is just the hardware. A workstation will also have software, including:
• Operating system (OS)
• Productivity software, such as word processors, spreadsheet, and presentation
applications
• Utility software, such as file transfer programs
• Security software, such as antivirus and personal firewalls
• Custom applications
All of these components can be tracked as part of a workstation or may be tracked individually
depending on your needs. It is difficult to imagine, though, a realistic case in which all these
items could be successfully managed by grouping them into a single inventoried entity.
Figure 4.2: Over time, the cumulative effects of small changes in configurations can result in compounded
changes and great variation.
Patch Management
Enterprise software—such as application servers, databases, and OSs—is regularly updated.
Patches (relatively small amounts of application code and configuration data) are released by
developers and vendors to correct flaws and vulnerabilities in deployed software. Unlike
upgrades, patches do not generally contain significant new functionality. To effectively manage
patches, there are several steps systems administrators should conduct as part of the patch
management process. These steps include:
• Assessing the relevance of patches
• Testing patches
• Scheduling patch installations
• Implementing change control procedures for patches
• Deploying patches
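These steps can be sketched as a simple pipeline. The following Python fragment is purely illustrative; the patch fields, device records, and helper names are invented for the example and do not correspond to any particular product:

```python
# Minimal sketch of the patch management steps listed above:
# assess relevance -> test -> record the change -> deploy.
# All field names and data are hypothetical.

def process_patch(patch, inventory, change_log):
    """Walk one patch through assess -> test -> record -> deploy."""
    # Assess relevance: which managed devices run the affected product?
    targets = [d for d in inventory if patch["product"] in d["software"]]
    if not targets:
        return "not relevant"
    # Test before deployment; an untested patch stops here.
    if not patch.get("passed_lab_testing"):
        return "failed testing"
    # Change control: record the planned change before deploying.
    change_log.append({"patch": patch["id"],
                       "targets": [d["name"] for d in targets]})
    # Deploy to every relevant device.
    for device in targets:
        device["patches"].append(patch["id"])
    return f"deployed to {len(targets)} device(s)"

inventory = [
    {"name": "ws-01", "software": {"os", "browser"}, "patches": []},
    {"name": "srv-01", "software": {"os", "database"}, "patches": []},
]
change_log = []
patch = {"id": "KB-100", "product": "database", "passed_lab_testing": True}
print(process_patch(patch, inventory, change_log))  # deployed to 1 device(s)
```

The point of the sketch is the ordering: a patch that is irrelevant or untested never reaches the deployment step, and every deployment leaves a change-control record behind.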
Figure 4.3 shows a high-level flowchart describing the decision points and sequence of steps in
the patch management process.
Figure 4.3: This patch management decision flowchart includes several decision points related to patch
relevance and the organization’s tolerance for partial functionality.
Many patches are released to address vulnerabilities in applications and OSs. Without these
patches, vulnerabilities can be exploited, resulting in disclosure of data, tampering with data, and
compromise of system control. When a security patch applies to an application that is not used,
there is no need to install the patch. Systems administrators must be careful to distinguish
between policies that dictate certain services are not used (for example, "Windows desktops will
not use Trivial File Transfer Protocol") and what is actually done on devices (for example, TFTP
was mistakenly not removed from a desktop, and a user, unaware of the policy, uses it to transfer
files).
The fact that discrepancies can exist between policies and implementations is an argument for
centralized management of assets. Although organizations can have well-defined policies, without
auditing and enforcement they can experience drift between policies and implementations. Automated
collection of asset information and a centralized repository of integrated information are essential
for cost-effective systems management.
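Automated auditing can make this kind of drift visible. The following sketch compares a policy list of banned services against what an inventory scan actually found; the device and service names are invented for illustration:

```python
# Sketch: detect drift between policy ("TFTP must not be present") and what
# an automated inventory actually discovered on each device. The data below
# is hypothetical; in practice it would come from agents reporting to a CMDB.

banned_services = {"tftp"}   # policy: services that must not be installed

discovered = {
    "ws-01": {"ssh", "http"},
    "ws-02": {"ssh", "tftp"},   # a violation that slipped through
}

# Devices whose discovered services intersect the banned list are in drift.
violations = {
    device: sorted(services & banned_services)
    for device, services in discovered.items()
    if services & banned_services
}
print(violations)  # {'ws-02': ['tftp']}
```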
Testing Patches
Patches should be tested before deployment, even when applying security patches. Never forget
that patches are software components and, like all other software, may contain errors. The errors
may be within the patches themselves (that is, bugs in the program), or they may arise at the
system level when the patch replaces code on which other applications depend. Dependencies are
not always obvious. For example, an OS patch might replace a library routine with a more robust
version but, in the process, change the behavior of some undocumented routines within that
library. An undocumented routine may, for instance, have silently ignored certain errors in the
applications that call it; the patched version may detect those errors, and an application that has
worked for months is suddenly broken. What should be done? Systems and applications managers
have basically three options:
• Install the patch and tolerate the loss of functionality in the dependent application
• Install the OS patch and, assuming it exists, a patch for the broken application
• Do not install the patch and tolerate the security vulnerability that prompted the release of
the patch
The first option is appropriate when the security vulnerability is so serious it outweighs the cost
of disrupting the other application and there is no time to investigate patches for the application.
An example of such a case would be when a fast-spreading piece of malware threatens to disrupt
services across the organization.
The second option highlights the chain reaction that patching can sometimes trigger. In today’s
complex array of distributed applications and service-oriented programming models, applications
cannot be considered monolithic, self-contained units. Applications are clusters of modules and
services that depend on other programs, so testing is essential to maintain the integrity of
distributed applications. Scheduling patch installations is also affected by the nature of
distributed applications.
Deploying Patches
Once the patches have been identified, tested, and scheduled, the last step is deployment. At this
point, the bulk of the work shifts from analyzing the patch and other software to actually getting
the patch where it is needed. The key goals of this stage are to ensure:
• All devices that require the patch receive it
• Patches install correctly
• Software is cataloged in the organization’s definitive software library
A CMDB can provide a list of devices that should receive the patch based on characteristics of
those devices. For example, the listing could be based on the OS, service pack level, browser
version installed on the device, combination of applications running on the device, and so on.
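Such a query can be reduced to simple attribute filtering. The following Python sketch selects patch targets by OS and service pack level; the device schema is hypothetical:

```python
# Sketch: select patch targets from CMDB-style device records by attribute.
# The schema and data are invented for illustration.

devices = [
    {"name": "ws-01", "os": "Windows", "service_pack": 1, "browser": "IE6"},
    {"name": "ws-02", "os": "Windows", "service_pack": 2, "browser": "IE6"},
    {"name": "srv-01", "os": "Linux", "service_pack": 0, "browser": None},
]

def patch_targets(devices, os_name, patch_service_pack):
    """Devices whose OS matches and whose service pack is below the patch level."""
    return [d["name"] for d in devices
            if d["os"] == os_name and d["service_pack"] < patch_service_pack]

print(patch_targets(devices, "Windows", 2))  # ['ws-01']
```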
Distributing Patches
Manual distribution of patches may be feasible in relatively small environments, but in all but
the smallest organizations, an automated software distribution system is generally preferable.
Automated systems have a number of advantages:
• An automated tool applies the same logic to every patch distribution, reducing the
chance of inconsistent implementations.
• Automated tools can respond to unexpected events or conditions in a predefined manner.
For example, if a patch cannot be installed because a device is powered down, the
distribution server could reschedule the patch delivery for the next day. If a patch cannot
be installed because the disk is full, an error alert can be sent to the systems
administrator.
• Automated tools can log the installation of the patch. Such logs can be an important part
of the organization’s compliance regimen.
• Automated tools can deploy patches much more rapidly than manual distribution. Not
only does this reduce costs, but it can also improve security. Deploying patches in hours
instead of days could conceivably prevent the exploitation of a known vulnerability.
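The predefined responses described above amount to a small dispatch routine. In this illustrative sketch, the status values and handlers are invented:

```python
# Sketch of predefined responses to distribution failures: reschedule when a
# device is unreachable, alert the administrator when the disk is full, and
# log the result either way. Status codes and fields are hypothetical.

def distribute(device, alerts, retry_queue):
    status = device["status"]
    if status == "powered_down":
        retry_queue.append(device["name"])      # try again tomorrow
        return "rescheduled"
    if status == "disk_full":
        alerts.append(f"{device['name']}: disk full, patch not installed")
        return "alerted"
    device["patched"] = True                    # normal case: install
    return "installed"

alerts, retry_queue = [], []
fleet = [
    {"name": "ws-01", "status": "ok", "patched": False},
    {"name": "ws-02", "status": "powered_down", "patched": False},
    {"name": "ws-03", "status": "disk_full", "patched": False},
]
results = [distribute(d, alerts, retry_queue) for d in fleet]
print(results)  # ['installed', 'rescheduled', 'alerted']
```

The returned status strings double as the installation log entries that a compliance regimen would record.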
The SQL Slammer worm spread through large segments of the Internet in less than 15 minutes;
do not underestimate the speed at which malware or distributed attacks can spread. See Paul
Boutin’s article “Slammed: An Inside View of the Worm that Crashed the Internet in 15 Minutes” at
http://www.wired.com/wired/archive/11.07/slammer.html.
Getting a patch to a device is the first step of deployment. Ensuring it installed correctly is the
next.
Verifying Patches
During the installation process, status checks should be done to ensure patches are installed
correctly. There are many things that can go wrong during a patch installation:
• The process installing the patch does not have sufficient privileges
• The patch process depends on a network service that is not available
• The configuration of the target device does not match the data in the configuration
database
• The device runs out of disk space
• The device is rebooted during the installation process
If a technician were installing the patch, these problems could be addressed immediately or
avoided altogether; however, given the need for automated patch distribution, the distribution
process will need sufficient logic to check for these and other failure conditions.
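That logic can be expressed as a set of pre-installation checks. This sketch uses hypothetical device and patch fields:

```python
# Sketch: run the failure-condition checks listed above before attempting an
# automated install. Device and patch fields are invented for illustration.

def preinstall_checks(device, patch):
    """Return the reasons an install would fail (empty list = safe to proceed)."""
    problems = []
    if not device["has_admin_rights"]:
        problems.append("insufficient privileges")
    if not device["network_service_up"]:
        problems.append("required network service unavailable")
    if device["actual_config"] != device["cmdb_config"]:
        problems.append("configuration drift vs. CMDB")
    if device["free_disk_mb"] < patch["size_mb"]:
        problems.append("not enough disk space")
    return problems

device = {"has_admin_rights": True, "network_service_up": True,
          "actual_config": "v2", "cmdb_config": "v1", "free_disk_mb": 50}
print(preinstall_checks(device, {"size_mb": 120}))
# ['configuration drift vs. CMDB', 'not enough disk space']
```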
Cataloging Patches
The end of the patching process occurs when the patch is checked into the definitive software
library. The patch is now an asset of the organization; systems and operations depend on it. The
definitive software library can provide version control, reporting services, and most importantly,
a secure copy of the patch should it be needed again. The patch life cycle sometimes intersects
with an area of asset tracking: system security.
System Security
System security is driven by three goals:
• Protecting the confidentiality of information
• Ensuring the integrity of systems and operations
• Maintaining the availability of applications and services
Asset management plays a significant role in security efforts to realize these three goals.
Not all losses of confidential information are due to poor system configuration; sometimes the cause
is a lack of policy enforcement. One of the worst cases on record was the theft of names, Social
Security numbers, and dates of birth of 26.5 million veterans and some spouses from the U.S.
Department of Veterans Affairs (VA). An employee had, against VA policy, taken home a notebook
computer containing the records. The employee’s house was burglarized and the notebook stolen. See
“Department of Veterans Affairs Statement Announcing the Loss of Veteran’s Personal Information May 22, 2006” at
http://www.va.gov/opa/data/docs/initann.doc#May22Statement.
Information can be lost in a number of ways, ranging from social engineering techniques that get
people to reveal details about systems and accounts, to probing for known vulnerabilities in OSs
and applications.
Maintaining Availability
The best system is of no use when it is unavailable. Managing assets and their configurations is
especially important for maintaining availability. An improperly configured router or firewall
might not respond properly to a Denial of Service (DoS) attack. An intrusion prevention system
with out-of-date attack signatures may miss a new type of attack. As with preserving
confidentiality and protecting the integrity of systems, maintaining system availability is
dependent, in part, upon asset management practices.
Figure 4.4: Both external and internal information is needed to properly assess the threat of system
vulnerabilities.
Risk Management
Risk management techniques are used to determine the appropriate response to risks:
• The risk of having a large number of desktops infected with a virus
• The risk of a DoS attack shutting down a customer support site
• The risk of natural disaster shutting down operations
• The risk of an attacker stealing trade secrets and proprietary designs
Risk management practices take into account the cost if a risk is realized (for example, if a virus
infects a large number of desktops and disrupts operations), the cost of countermeasures, such as
antivirus software, and the likelihood that a risk will be realized.
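This cost/likelihood trade-off is often expressed as annualized loss expectancy (ALE): expected yearly loss equals the cost per incident multiplied by the expected number of incidents per year, which can then be compared with the yearly cost of a countermeasure. All figures in the following sketch are invented for illustration:

```python
# Annualized loss expectancy sketch (all figures invented): a countermeasure
# is worth deploying when the loss it prevents exceeds its yearly cost.

loss_per_outbreak = 250_000      # cost if a virus infects many desktops
outbreaks_per_year = 0.4         # likelihood: expected incidents per year
antivirus_cost = 60_000          # yearly cost of the countermeasure
effectiveness = 0.9              # fraction of outbreaks the control prevents

ale_without = loss_per_outbreak * outbreaks_per_year   # expected yearly loss
ale_with = ale_without * (1 - effectiveness)           # residual loss
net_benefit = ale_without - ale_with - antivirus_cost
print(round(net_benefit))  # 30000
```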
Configuration management and asset management information is essential raw data for risk
assessments. Evaluators need to know how many devices may be vulnerable to a particular risk.
What is the value of those assets? What other systems could act as backup or temporary
replacements in the event of a local disaster at one site? Questions such as these can be answered
but only with up-to-date and comprehensive information about the state of IT assets.
For more information about risk management, see the Software Engineering Institute’s Risk
Management information at http://www.sei.cmu.edu/risk/index.html and the National Institute of
Standards and Technology’s Risk Management Guide for IT Systems at
http://csrc.nist.gov/publications/nistpubs/800-30/sp800-30.pdf.
Licensing
Software licenses are a special type of asset—one that can easily be overlooked. Licenses are not
concrete assets, so you do not see them as you walk through the office or into the data center.
Sometimes, the physical manifestation of a license is little more than a contract filed away
somewhere. The way the paperwork is often managed belies the complexities of license
management.
To begin with, there is no single, standardized form of licenses. Software vendors have
developed a number of licensing models in response to market demands and their own quest to
maximize revenues. Example licensing models include:
• Licensing by the number of concurrent users
• Licensing by the number of named users
• Licensing by site
• Enterprise licenses
• Leased licenses
• Licensing by CPU
• Pay-per-use
• Feature-based licenses
• Evaluation licenses
Given the variety of licensing models and the sometimes high rates of change within IT
environments, it is easy to imagine how quickly license management can become a burden. The
license life cycle adds another dimension of complexity. As Figure 4.5 shows, once software is
procured, it can pass through multiple states before it is finally retired.
Figure 4.5: The multiple paths through the software license life cycle compound the complexity of licensing
models to make license management especially challenging.
Like other aspects of asset tracking, successfully and cost-effectively managing software
licenses requires a detailed database of information about licenses, their deployment, and how
they are being used, if at all. Understanding how licenses are being used allows IT and
procurement to know how many licenses need to be procured and whether unused licenses can be
harvested and reallocated to get the most out of the organization’s investments. Organizations
should have neither too few licenses, which would put them out of compliance with their software
vendors, nor too many, which incurs unnecessary expense.
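Reconciling entitlements against installations is essentially a comparison of two counts per product. The product names and counts in this sketch are invented:

```python
# Sketch: reconcile license entitlements against deployed installations.
# Shortfalls are compliance risks; surpluses are candidates for harvesting.
# Product names and counts are invented for illustration.

entitled = {"office_suite": 100, "cad_tool": 10, "db_server": 4}
installed = {"office_suite": 112, "cad_tool": 6, "db_server": 4}

report = {}
for product in entitled:
    gap = entitled[product] - installed.get(product, 0)
    if gap < 0:
        report[product] = f"{-gap} over entitlement"   # buy licenses or remove installs
    elif gap > 0:
        report[product] = f"{gap} unused"              # harvest and reallocate
print(report)
# {'office_suite': '12 over entitlement', 'cad_tool': '4 unused'}
```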
As licenses often move with other assets—for example, a Web server is moved from one
department to another—it makes sense to manage the hardware and software together. This is yet
another example of a common systems management problem that can be effectively addressed
by a comprehensive asset and configuration management database. Another area of asset
management that benefits from configuration management information is service delivery.
Service Delivery
Service delivery entails a broad range of tasks related to managing IT operations, including
service level management, capacity management, contingency planning, and financial
management. Knowing the state of assets and their deployment is part of each of these
operations. For example, to maintain service levels at expected rates of growth, systems
managers will need information about the assets deployed for a particular service and their
current utilization as well as the utilization of similar assets that may be redeployed in service to
another function. Asset management also benefits from accurate and up-to-date reporting about
the allocation of assets, especially when IT departments use a charge-back model to bill
departments and lines of business for IT services.
Asset management is a broad topic. A common requirement of the tasks within asset
management is the need for comprehensive information about the state of hardware, software,
and license assets. A CMDB serves that purpose. By no means is a CMDB a panacea for IT
management problems, but it is a fundamental tool for effective systems management. As
applications become more distributed and service oriented, it is only appropriate that system
management practices align themselves in a service-oriented model. The CMDB is central to that
model.
Structure of CMDBs
The purpose of a CMDB is to store and integrate information about IT assets—known as
configuration items (CIs)—and their configuration status as well as process-related information.
The CMDB model consists of four logical layers: the service layer, such as Human Resource
services; the system layer, which is the system, such as SAP, that implements a service; the sub-
system, or components of the system; and finally the physical layer, which consists of IT assets
that support the service (see Figure 4.6).
In addition to supporting the technical aspects of asset management, the CMDB model supports
financial management. Most CIs are shared resources, making it difficult to allocate costs easily
to business units, cost centers, or departments. If an IT organization is able to model its
services, it can then allocate costs to departments based on the services, not the assets.
Figure 4.6: The CMDB is a multi-layered model that links configuration items to define what is included in a
service.
The database consists of two parts: a definitive software library and a configuration and status
data repository.
Figure 4.7: The DSL is a logical construct with multiple physical instantiations.
Consider an analogy: Any company large enough to have multiple departments has a single
finance department that keeps official finance records. Individual departments do not keep their
own set of books (at least not an official set of books). Financial information of public
companies must adhere to strict standards relating to what kind of information is tracked, how it
is reported, and how it is audited. For financial reporting, there must be one set of books.
Similarly for software, there must be a single set of software that constitutes the body of
applications functioning in an organization. Without it, there would be no way to confidently
reconstruct the state of applications across an organization or confidently release code to
production.
The definitive software library is relatively static; it is updated only when code is ready for
production release. The configuration and status data repository, the second part of the CMDB, is
much more dynamic.
Assets should be individually tracked through their entire life cycles. Keeping the lineage of
devices can be useful in a number of ways, including financial recordkeeping, compliance
reporting (for example, was a device with export-controlled security software ever used outside
the United States?), version control, and assessing the usefulness of the CI in service delivery.
The stages of the asset life cycle are similar to the software license life cycle depicted in Figure
4.5 and include:
• Procurement
• Deployment
• Transfer to another organizational unit (OU)
• Transfer outside the organization
• Decommissioned
In addition to the coarse-grained changes, finer-grained changes, such as software updates or
hardware upgrades, are tracked by other processes within asset management services. Asset
management is just one of the services that are both central to effective systems management and
supported by CMDBs.
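Life-cycle tracking can be modeled as a small state machine that records each transition, preserving the lineage described earlier. The states and allowed transitions below are one illustrative reading of the list above:

```python
# Sketch: enforce asset life-cycle transitions and keep each device's lineage.
# The state names and allowed transitions are illustrative, not definitive.

ALLOWED = {
    "procured":             {"deployed"},
    "deployed":             {"transferred_internal", "transferred_external",
                             "decommissioned"},
    "transferred_internal": {"deployed", "decommissioned"},
    "transferred_external": set(),   # asset has left the organization
    "decommissioned":       set(),   # terminal state
}

def transition(asset, new_state):
    if new_state not in ALLOWED[asset["state"]]:
        raise ValueError(f"{asset['state']} -> {new_state} not allowed")
    asset["history"].append(new_state)   # lineage for compliance reporting
    asset["state"] = new_state

asset = {"state": "procured", "history": ["procured"]}
transition(asset, "deployed")
transition(asset, "transferred_internal")
transition(asset, "deployed")
transition(asset, "decommissioned")
print(asset["history"])
# ['procured', 'deployed', 'transferred_internal', 'deployed', 'decommissioned']
```

Recording every transition, rather than only the current state, is what makes questions like the export-control example above answerable later.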
Summary
This chapter opened with the question: What methods and resources are required to execute and
manage transitions in an efficient and effective manner? The answer to that question is
multifaceted. The methods that are required include the various processes of asset management:
• Inventory management
• Patch management
• System security
• Risk management
• Licensing
• Service delivery
These methods are coupled with resources, such as the definitive software library and the
configuration and status repository of the CMDB, to support systems management. As first
described in Chapter 1, systems management consists of several disciplines:
• Service level management
• Financial management for IT services
• Capacity management
• Change management
• Availability management
• IT service continuity management
• Application management
• Software and hardware asset management
Each of these services depends in some way on the data collected, integrated, and aggregated in
the CMDB. Although each of these disciplines presents different aspects of IT management, they
are all subject to common constraints, especially the rapid and persistent state of change that
characterizes many IT infrastructures.
Moving to a service-oriented management model depends on the effective use of policies that are
aligned to overall business strategy and on mechanisms for ensuring those policies are implemented
and enforced. One of the first tasks that must be addressed in a service-oriented management
model is the ability to track assets, both their physical characteristics and their life cycles. Many
of the sub-disciplines of asset management depend upon a common base of information, which
should be stored and integrated in a CMDB.
CMDBs are not tied to a particular aspect of systems management. Rather, they are designed to
house information on the breadth of IT operations to allow for an integrated and service-oriented
view of operations. Organizations could instead support the various disciplines of systems
management with individual applications, but those applications essentially become silos of
information.
Chapter 5
Figure 5.1: Service support processes are interdependent and can support or trigger each other.
It is clear from these simple examples that information relevant to one process can be vital to the
proper implementation of other processes. It is also clear that errors in one set of procedures can
have ripple effects that cause other processes to be activated. For both of these reasons,
automated configuration management services can improve service support.
Automated configuration management is a mechanism that supports all parts of service support, not
just traditional configuration management. This mechanism should not be confused with early
configuration management tools that provided limited information and little support for related
service support operations.
The frequency of data collection determines how often the agent sends information to the
central repository. There is a tradeoff with this setting: devices that frequently update the central
repository are less likely to have outdated data, but the data collection process places additional
demands on the device that can adversely affect the performance of other applications.
When depending on agents, it is important for the central repository to accept data only from
authenticated agents. Distributed applications such as these are vulnerable to spoofing—that is,
an attacker or an attacker’s program pretending to be the real agent. An attacker, for example,
might want to cover his or her tracks by sending false information about failed login attempts or
the amount of disk space in use. By using cryptographic techniques, such as digitally signing all
transmissions, the repository can significantly reduce the chance of attacks (see the sidebar
“Digital Signatures” for details about this security measure).
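The idea can be illustrated in a few lines. For brevity, this sketch uses an HMAC (a shared-secret message authentication code) rather than a true public-key digital signature; the principle of rejecting unauthenticated or tampered reports is the same, and the secret shown is a placeholder:

```python
# Sketch: authenticate agent reports before accepting them into the central
# repository. An HMAC stands in for a full digital signature here; the secret
# and report format are illustrative.

import hashlib
import hmac

SECRET = b"per-agent-secret-placeholder"   # provisioned out of band per agent

def sign(report: bytes) -> str:
    return hmac.new(SECRET, report, hashlib.sha256).hexdigest()

def accept(report: bytes, signature: str) -> bool:
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(sign(report), signature)

report = b'{"host": "ws-01", "failed_logins": 0}'
tag = sign(report)
print(accept(report, tag))                                        # True
print(accept(b'{"host": "ws-01", "failed_logins": 999}', tag))    # False
```

A spoofed or altered report fails verification and is discarded rather than written into the repository.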
A process flow engine within a configuration management system could meet these requirements
if it supports:
• Ordered deployment of modules
• Tests for success of each step
• Conditional processing—for example, if the browser does not contain a particular patch,
it is installed; otherwise, it is not
• Detailed logs of each step
Logged information about the deployment process should be available by querying the CMDB.
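A minimal flow engine meeting these requirements might look like the following sketch; the step names and conditions are invented:

```python
# Sketch of the flow-engine requirements above: ordered steps, a success test
# after each, conditional execution, and a detailed log. Step names and
# conditions are illustrative.

def run_flow(steps, state):
    log = []
    for name, condition, action in steps:
        if not condition(state):
            log.append((name, "skipped"))     # conditional processing
            continue
        ok = action(state)
        log.append((name, "ok" if ok else "failed"))
        if not ok:
            break                             # stop the flow on failure
    return log

state = {"browser_patch_installed": False}
steps = [
    ("install browser patch",
     lambda s: not s["browser_patch_installed"],          # only if missing
     lambda s: s.update(browser_patch_installed=True) or True),
    ("verify browser patch",
     lambda s: True,
     lambda s: s["browser_patch_installed"]),
]
print(run_flow(steps, state))
# [('install browser patch', 'ok'), ('verify browser patch', 'ok')]
```

Running the same flow against a device that already has the patch yields a "skipped" entry for the install step, which is exactly the conditional behavior the requirement describes.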
Information Retrieval
Information retrieval sounds trivial: you simply want to display data that is stored in a database.
What is not trivial is precisely specifying which data you want displayed. At one end of
the information retrieval spectrum, there are query languages used by database developers and
the occasional power user. Even for relatively simple queries, this is not a reasonable tool for
most users. Consider the following query: a systems manager wants to list all resource
associations, the associated resource type, the name of the resource, and a brief description,
sorted by resource type. The corresponding database query would look something like (the
details depend on the database structure, but the example holds for a typical normalized
relational database):
SELECT
    ra.resource_assoc_name,
    rt.resource_assoc_type_name,
    rt.resource_type_name,
    r.resource_name,
    r.resource_descr
FROM
    resources r,
    resource_type rt,
    resource_associations ra
WHERE
    r.resource_id = ra.resource_id
    AND r.resource_type_id = rt.resource_type_id
ORDER BY
    rt.resource_type_name
Query languages are not practical tools for working with CMDBs—they require an
understanding of the underlying data model and knowledge of the database query language,
typically a variation on ANSI standard SQL. However, query languages are quite flexible and
with the right query, one can find anything that is in the database.
Static reports lie at the other end of the information retrieval spectrum. They require no
knowledge of the implementation details of the database, but they are limited in their usefulness.
Static reports provide information about a limited amount of data and typically represent
designers’ and developers’ best guess at what information a systems manager will need.
Between the two extremes lie parameterized reports. They provide some of the flexibility of
query languages along with some of the ease of use of static reports. Properly configured, these
reports can help guide users to the information they need (see Figure 5.2 for an example).
Figure 5.2: Information retrieval from complex data structures should use a combination of search and
guided querying.
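A parameterized report can be sketched with an embedded database: the query text is fixed, and the user supplies only a value. The schema and data here are invented:

```python
# Sketch of a parameterized report: the user supplies only a resource type;
# the query text, joins, and sort order are fixed by the report designer.
# Schema and data are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (name TEXT, type TEXT, descr TEXT)")
conn.executemany("INSERT INTO resources VALUES (?, ?, ?)", [
    ("ws-01", "workstation", "accounting desktop"),
    ("srv-01", "server", "HR application server"),
    ("ws-02", "workstation", "engineering desktop"),
])

def report_by_type(resource_type):
    """Parameterized report: one user-supplied value, fixed query text."""
    rows = conn.execute(
        "SELECT name, descr FROM resources WHERE type = ? ORDER BY name",
        (resource_type,))
    return rows.fetchall()

print(report_by_type("workstation"))
# [('ws-01', 'accounting desktop'), ('ws-02', 'engineering desktop')]
```

The user never sees the SQL or the data model, yet can still vary the report, which is the middle ground between raw query languages and static reports.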
Automated configuration management tools provide several mechanisms important for efficient
service support, including a centralized data repository, automated data collection, support for
process flow, and flexible reporting. The following sections describe how automated
configuration management can support the particular requirements of several service support
areas.
Incident Management
Incidents are events outside of normal operations that disrupt operational processes. An
incident can be a relatively minor event, such as running out of disk space on a desktop machine,
or a major disruption, such as a breach of database security and the loss of private and
confidential customer information. Incident management is a set of policies and processes for
responding to incidents, the goals of which are to:
• Restore normal operations as quickly as possible
• Track information about incidents for further analysis
• Support problem management by analyzing patterns of incidents
Incident management begins with defining what constitutes an incident, categorizing those
incidents, and measuring their occurrences.
Characteristics of Incidents
Something as generalized as “any event outside of normal operations” covers quite a large space
of possible events. By focusing on just those that are so disruptive that they cause a call to the
Help desk or other IT support services, you can limit the discussion to a manageable domain.
Within this domain of incidents, you can categorize incidents by several characteristics:
• Cause of problem
• Severity
• Asset or assets causing the incident
• Role of personnel experiencing disruption
• Resolution method
The causes of problems cover a wide range of topics.
Severity
Incidents should be categorized by severity; at the very least a three-point scale of minor,
moderately severe, and severe should be used. For each level of severity, IT organizations should
define acceptable resolution times, escalation procedures, and reporting procedures. For
example, minor incidents, such as password resets, should not consume too much time or
resources from the Help desk. A security breach, however, should immediately escalate, trigger
reporting to management and executives, and require rapid resolution.
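Severity-driven handling can be captured in a simple policy table. The thresholds and escalation targets below are invented policy values:

```python
# Sketch: map incident severity to resolution targets and escalation, as
# described above. The thresholds and escalation targets are hypothetical
# policy values an organization would define for itself.

POLICY = {
    "minor":    {"resolve_within_hours": 24, "escalate_to": None},
    "moderate": {"resolve_within_hours": 8,  "escalate_to": "it_manager"},
    "severe":   {"resolve_within_hours": 1,  "escalate_to": "executives"},
}

def route(incident):
    rule = POLICY[incident["severity"]]
    return {
        "incident": incident["summary"],
        "deadline_hours": rule["resolve_within_hours"],
        "notify": rule["escalate_to"],
    }

print(route({"summary": "security breach", "severity": "severe"}))
# {'incident': 'security breach', 'deadline_hours': 1, 'notify': 'executives'}
```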
Assets
The asset or assets causing an incident are important dimensions for tracking incident trends. If a
particular version of desktop application is causing an inordinate number of support calls, IT
managers should be able to detect this during problem management procedures. (There is more
information about problem management later in this chapter.)
Personnel
Just as assets involved in incidents should be tracked, so should the users encountering the
disruptions. If a large number of personnel from a single department are generating a large
number of Help desk calls, there might be a problem with training or an application specific to
that department.
Resolution Method
The method for resolving an incident should also be tracked. This data can help determine
guidelines for selecting the appropriate response to an incident. For example, data about
resolution methods might reveal that most OS problems that require more than 2 hours to solve
eventually require reinstallation. Given that, a support desk policy could be instituted requiring
that OS errors that cannot be resolved within 2 hours be addressed by formatting the OS drive and
restoring it from an image backup. These characteristics are especially useful when measuring
incident rates and analyzing trends.
Incident Types
Defining the cause of a problem can be more difficult than it seems at first because there are
sometimes multiple pre-conditions that must be in place for an incident to occur. Consider a few
examples. Password resets are one of the most common incidents reported to Help desks. The
causes of this type of incident include users allowing passwords to expire and forgetting
passwords—especially when users are expected to remember passwords to multiple systems
while not re-using passwords. All of these causes can factor into a single password reset incident.
In another example, an employee is saving a document to a network drive when the save
operation fails. An error message is displayed stating the network drive cannot be found.
Because the employee had been saving the document regularly, something must have occurred
since the last save operation. After the user has contacted the service desk, the service desk
technician tests several possible causes and determines that the problem is a failed network
interface device. In this case, determining the exact cause of the failure is not relevant unless the
problem occurs repeatedly; hardware has well-known mean times between failures (MTBFs) and
further root cause analysis is not likely to help reduce these types of incidents.
The final example is more complex. A security breach results in a large number of customer
account and credit card numbers being exposed to attackers. The causes could include:
• Improperly configured firewalls that allow traffic on a port that should have been closed
• An un-patched database listener (a program that accepts requests to connect to the
database) that is vulnerable to known attacks
• Access controls within the database that do not adequately limit read access to sensitive
data
• Vulnerability in a database management system that allows for escalation of privileges
• Lax OS privileges that allow execute privileges on database administration tools
• Poorly designed applications that use over privileged database accounts
A database breach is a case in which a series of vulnerabilities must be in place for a successful
attack to occur. Had one of the vulnerabilities been compensated for with adequate
countermeasures, the attack would not have occurred as it did. For example, had the access
controls on database tables and views been sufficiently restrictive, the attacker could not query
the sensitive data even though he or she had made it through network, OS, and database
authentication security measures.
Information security is often compromised by a weak link; conversely, one effective countermeasure
can stop an attack that has exploited a number of other weaknesses. A best practice in security,
known as defense in depth, deploys multiple countermeasures to protect assets even if, in theory,
only one is needed. Security practitioners know that no single measure is perfect and that multiple
countermeasures are needed to reduce the risk of information theft and other threats.
The general categories of incident causes that cut across these examples include:
• Improper documentation
• Insufficient user training
• Configuration errors
• Previously unknown bug
• Known but un-patched vulnerabilities
• Unexpected changes in operating loads
• User error
Determining the cause of incidents is essential to understand both how to resolve the problems
and how many resources to commit to reduce the likelihood of those problems in the future.
Chapter 5
Resolving Incidents
Of all the topics in service support, the most time could be spent on resolving incidents; in fact, it
could be the topic of a very long book. The problem with resolving incidents is that there are so
many types and each can require a customized response. In some ways, resolving incidents is
like cooking—there is a different recipe for every dish, and there is a different response to every
incident. At the same time, general principles can be found that apply to a broad range of
challenges, whether culinary or technical.
The general principles for resolving incidents include:
• The time, effort, and resources committed to incident resolution must be commensurate
with the impact of the incident.
• Responses should be formalized with well-defined procedures that serve more as
frameworks than as strict, precise sets of steps. Formulating exhaustively precise
procedures would be too time-consuming to be practical.
• All incidents and the response should be documented. In some cases, this can be as trivial
as incrementing a count of simple incidents, such as password resets, or as complex as a
detailed report describing a security breach.
• As with other service support operations, coordinate incident resolution information with
other asset information.
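As a rough sketch of the documentation principle, trivial incidents can be tallied with a simple counter while significant ones get a structured record tied to affected assets; the field names here are illustrative, not a prescribed schema:

```python
from collections import Counter
from dataclasses import dataclass, field

# Sketch of two ends of the incident-documentation spectrum
# (field names are assumptions, not a prescribed schema).

simple_incidents = Counter()
simple_incidents["password reset"] += 1  # trivial: just increment a count

@dataclass
class IncidentRecord:
    incident_id: str
    category: str
    impact: str                     # e.g. "low", "medium", "high"
    affected_assets: list = field(default_factory=list)
    resolution: str = ""            # detailed report for complex incidents

rec = IncidentRecord("INC-0001", "security breach", "high",
                     affected_assets=["db-server-01"],
                     resolution="Restricted table access; patched listener.")
```

The impact field is what lets resolution effort stay commensurate with the incident, per the first principle above.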
Consider examples from the extremes of resolving incidents: Password resets are one of the
simplest types of incidents to resolve. Many organizations now use self-service methods to
address them. One could attempt to drive down the number of password resets, but after a certain
point, the economics do not justify the effort to do so because the marginal cost of resetting a
password with a self-service system is small. As the next section on trend analysis will show,
password vulnerabilities could become a factor in broader security management issues in which
the costs of poor password management grow much higher.
Security incidents are some of the most costly. According to the FBI/Computer Security Institute
(CSI) Computer Crime and Security Survey, 639 respondents reported a total loss of almost $43
million due to virus attacks and more than $31 million due to unauthorized access. Individual
incidents can be extremely costly. For example, 40 million credit card accounts were
compromised at CardSystems Solutions, a credit card processor, causing it to lose major credit
card customers.
For more information about the CardSystems Solutions breach, see Clint Boulton’s “MasterCard: 40M
Credit Card Accounts Exposed” at http://www.crime-research.org/news/28.06.2005/1321/. The
FBI/Computer Security Institute 2005 Computer Crime and Security Survey is available at
http://i.cmpnet.com/gocsi/db_area/pdfs/fbi/FBI2005.pdf.
Resolving incidents requires detailed information, whether one is dealing with password resets or
security breaches. A centralized repository of configuration information is especially helpful
when the incident is caused, in part, by hardware, software, or system configurations.
Problem Management
Problem management is focused on reducing incidents and their impact on an organization’s
operations. Problem management and incident management, although tightly coupled, differ in
several ways:
• A problem is the underlying cause for multiple disruptions; an incident is one of those
disruptions.
• Problem management addresses the underlying cause of multiple incidents; incident
management entails responding to an instance of disrupted operations caused by a
problem.
• Problem management attempts to detect and address root causes of problems; incident
management attempts to restore normal operating functions, possibly without fully
correcting the underlying cause.
Problem management depends on data from multiple incidents, so a CMDB and incident
repository can support the investigation and analysis of root causes. For example, if an end user
application repeatedly crashes on some but not all client devices, the CMDB can be used to
determine what the affected systems have in common that are not found in the unaffected
devices.
Figure 5.3: CMDBs can help to rapidly identify common characteristics of devices affected by an incident,
thus supporting root cause analysis and problem management.
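The kind of query the figure describes can be sketched as a set comparison over configuration attributes pulled from a CMDB; the device names and attributes below are invented for illustration:

```python
# Root-cause sketch: find what the affected clients have in common
# that no unaffected client shares. Attribute data is invented.

affected = {
    "pc-01": {"os=XP SP1", "app=CRM 2.1", "driver=video 7.0"},
    "pc-02": {"os=XP SP2", "app=CRM 2.1", "driver=video 7.0"},
}
unaffected = {
    "pc-03": {"os=XP SP2", "app=CRM 2.1", "driver=video 6.5"},
}

common_to_affected = set.intersection(*affected.values())
seen_on_unaffected = set.union(*unaffected.values())
suspects = common_to_affected - seen_on_unaffected
print(suspects)  # {'driver=video 7.0'}
```

Here the shared CRM version is ruled out because an unaffected machine runs it too, leaving the video driver as the prime suspect.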
Once the cause of a problem is identified and a solution developed, the problem and solution
should be documented for future reference. Even if the identical problem is not likely to occur
again—for example, all servers are patched for a known vulnerability—the solution description
may help to solve other somewhat similar problems.
Trend Analysis
Another part of problem management, and closely related to incident management, is trend
analysis. The function of trend analysis is to determine the frequency of particular types of
problems and determine which, if any, incident types are increasing. Trend analysis can lead to
introducing new methods or devices. For example:
• The increasing number of password resets, coupled with the cost of staffing Help desks,
can create a cost justification for a self-service password reset.
• Rapid growth in email storage requirements may justify the use of a network appliance to
filter spam.
• Discovery of an increasing number of conflicts between newly deployed applications and
legacy applications can lead to changes in software testing methodology.
Trend analysis in itself does not solve problems but identifies categories of problems that are
growing in severity or frequency. A general problem that can have ripple effects throughout an
IT infrastructure is errors in configuration management.
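One way to sketch trend analysis is to fit a simple least-squares slope to monthly incident counts and flag the categories that are growing; the counts and threshold below are illustrative only:

```python
# Trend-analysis sketch: flag incident categories whose monthly counts
# are trending upward, using a least-squares slope. Counts are invented.

def slope(counts):
    """Least-squares slope of counts over equally spaced periods."""
    n = len(counts)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

monthly_counts = {
    "password reset": [40, 44, 50, 57, 63, 71],
    "printer jam":    [12, 11, 13, 12, 11, 12],
}
growing = [cat for cat, c in monthly_counts.items() if slope(c) > 1.0]
print(growing)  # ['password reset']
```

A steadily climbing password-reset count is exactly the kind of signal that builds the cost justification for a self-service reset tool mentioned above.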
Configuration Management
Configuration management is the process of controlling changes to device configurations in an
IT environment. There are five basic operations in configuration management:
• Planning
• Identification
• Control
• Status accounting
• Verification and audit
Together, these operations provide the means to control the establishment and maintenance of
device configurations.
Planning
Planning within configuration management is similar to other IT operations; that is, the focus is
on setting an overall strategy, defining the policies and procedures necessary to implement that
strategy, and identifying configuration items that should be tracked within the CMDB. The
configuration management strategy defines the scope and objectives of the configuration
management process. For example, the scope of a typical plan includes all managed devices
within an organization; the objectives include maintaining the availability and integrity of
devices, ensuring efficient use of resources, and minimizing maintenance and training costs.
Managed devices are those that are under the control of an organization and function within the IT
infrastructure; unmanaged devices function within the IT infrastructure but are uncontrolled by the IT
department. Examples of unmanaged devices include servers used by business partners and
desktops used by customers to access online services.
The planning process also defines roles and responsibilities. A single device may be maintained
by several roles. A server, for example, may be the responsibility of a systems manager who is
responsible for the OS and access controls, a network administrator who is responsible for
configuring network hardware and protocols, and an application administrator who maintains
services provided by the server. The CMDB is used across service support operations, but its
function and maintenance fall under the scope of configuration management planning.
It should be noted that configuration management planning is not a one-time event. These plans are
typically subject to change as business and organizational requirements change. A comprehensive
review of configuration management plans is recommended twice a year.
Although the planning process focuses on the overall configuration management process, the
identification process addresses the details of the operation.
Identification
Any entity tracked by configuration management is known as a configuration item (CI). Several
characteristics of configuration items are recorded:
• Name and description
• Owner of item
• Relationships to other items
• Versions
• Unique identifiers
It is important to identify CIs to the level of independent change. For example, if laptops are
treated as a single unit and hard drives are not moved among laptops, there is no need to track the
hard drives independently of the laptop. However, optical drives used for backups and moved
among servers should be managed as distinct devices.
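A minimal sketch of a CI record covering the characteristics listed above might look as follows; the field names are assumptions for illustration, not a prescribed CMDB schema:

```python
from dataclasses import dataclass, field

# Sketch of a configuration item record. CIs are tracked only to the
# level of independent change: the shared optical drive is its own CI,
# while a laptop's fixed hard drive is not.

@dataclass
class ConfigurationItem:
    ci_id: str                      # unique identifier
    name: str
    description: str
    owner: str
    version: str
    related_to: list = field(default_factory=list)  # ids of related CIs

laptop = ConfigurationItem("CI-1001", "laptop-17", "Sales laptop",
                           owner="J. Smith", version="rev B")
drive = ConfigurationItem("CI-2044", "optical-drive-3",
                          "Shared backup drive", owner="Ops",
                          version="rev A", related_to=["CI-1001"])
```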
Control
The control process ensures that all configuration items are properly identified, their information
is recorded in the CMDB, and any changes are done in accordance with change management
procedures. (Change control is discussed in detail later.)
Status Accounting
Status accounting is the process of recording state changes to a configuration item. The most
common states are:
• On order
• Received, pending testing
• Under test
• Installed to production
• Under repair
• Disposed
All state changes should be recorded so that the CMDB always has an accurate representation of
the IT infrastructure. This information is also useful for problem management, especially for
detecting devices with high incidences of repair or long repair periods.
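Status accounting can be sketched as a small state machine in which only sanctioned transitions are recorded; the allowed transitions below are plausible assumptions for illustration, not a mandated life cycle:

```python
# Status-accounting sketch: record state changes for a CI, rejecting
# transitions the policy forbids. The transition table is an assumption.

ALLOWED = {
    "on order":                  {"received, pending testing"},
    "received, pending testing": {"under test"},
    "under test":                {"installed to production"},
    "installed to production":   {"under repair", "disposed"},
    "under repair":              {"installed to production", "disposed"},
    "disposed":                  set(),
}

def change_state(history, new_state):
    """Record a state change, rejecting illegal transitions."""
    current = history[-1]
    if new_state not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current} -> {new_state}")
    history.append(new_state)  # the CMDB keeps every state change

history = ["on order"]
change_state(history, "received, pending testing")
change_state(history, "under test")
change_state(history, "installed to production")
```

Keeping the full history, not just the current state, is what makes repair-frequency and repair-duration analysis possible later.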
Change Management
Change management is the process of controlling modifications to configuration items so as to
minimize incidents that disrupt normal operations. The reason change management is so
important is that one change can have ripple effects through multiple other assets (see Figure
5.4).
Figure 5.4: Changes in one configuration item can have ripple effects through other items.
Clearly, what appears at first to be a software change can quickly propagate ripple effects to
other software components, hardware devices, and network settings.
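Impact analysis over CMDB relationships can be sketched as a graph walk: starting from the changed item, follow the "depends on me" edges to find every CI the change could ripple to. The dependency data here is invented for illustration:

```python
from collections import deque

# Ripple-effect sketch: breadth-first walk of everything downstream
# of a changed configuration item. Dependency data is invented.

dependents = {
    "app-server-lib": ["crm-app", "billing-app"],
    "crm-app":        ["crm-db-schema"],
    "billing-app":    [],
    "crm-db-schema":  [],
}

def impacted(start):
    """Return every CI reachable downstream of the changed item."""
    seen, queue = set(), deque([start])
    while queue:
        ci = queue.popleft()
        for dep in dependents.get(ci, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

print(sorted(impacted("app-server-lib")))
# ['billing-app', 'crm-app', 'crm-db-schema']
```

A change review board armed with this kind of reachability result can see at a glance which teams and assets a proposed change touches.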
Large numbers of emergency change requests are an indication of failures in other processes, such as
planning, testing, patch management, and security management.
Change Controls
Formal change control procedures are one way to ensure that the effects of a change are
understood before the change is implemented. Formal methods are often based on a standardized
change request mechanism and a change review board.
Release Management
Release management is a demanding operation. The goal of release management is to preserve
the integrity and availability of production systems while deploying new software and hardware.
Several processes are included under the umbrella of release management:
• Planning software and hardware releases
• Testing releases
• Developing software distribution procedures
• Coordinating communications and training about releases
Release management is the bridge that moves assets from development into production.
Figure 5.5: Release management is the bridge between two high-level IT life cycles: development and
production.
Planning Releases
Planning releases is often the most time-consuming area of release management because there
are so many factors that must be taken into consideration. For example, when deploying a new
sales support system, the release managers must address:
• How to distribute client software to all current users
• How to migrate data from the current applications database to the new database with
minimal disruption to database access
• How to verify the correct migration of data
• How to uninstall and decommission the applications replaced by the new system
• Verifying that all change control approvals are secured
Each of these issues breaks down into a series of more granular tasks. Consider distributing
client software. Release managers must account for variations in OSs and patch levels of client
devices, the need for administrative rights to update the registry if software is installed, and the
possibility of conflicts or missing software on clients.
One of the often-discussed advantages of Web-based applications is that client software does not
need to be permanently installed. This is true for the most part, but some software is still required to
support Web applications, including browsers, browser helper objects (BHOs), plug-ins, and, in some
cases, a Java Runtime Environment (JRE). This supporting software is subject to some of the same
constraints and limitations as client/server software: it sometimes requires administrative
privileges to install, it must be patched as needed to maintain security, and it is subject to its own
upgrade life cycle. Web-based applications may ease some of the burdens associated with release
management, but they do not eliminate them.
Software Testing
It goes without saying that software should be thoroughly tested before it is released. In the ideal
world, software developers work in the development environment and deploy their code to a
testing and quality assurance environment that is identical to the production environment. It is in
the test environment that integrated module testing and client acceptance testing is performed.
This is not always possible. Large production environments may not be duplicated in test
environments because of cost and other practical limitations. It is especially important in these
cases that release managers work closely with software developers.
With responsibility for deploying software, release managers can provide valuable
implementation details about the production environment that developers should test. For
example, release managers will have information about the types of client devices and the types
of network connectivity supported as well as other applications that may coexist with the system
under development. Release managers may need to address data migration issues as well.
Integration Testing
Integration testing is the process of testing the flow of processing across different applications
that support an operation. For example, an order processing system may send data to business
partners’ order fulfillment system, which then sends data to a billing system and an inventory
management system. Certainly these would have been tested prior to deployment, but real-world
conditions can vary and uncommon events can cause problems. For example, spikes in network
traffic can increase the number of server response timeouts, forcing an unacceptable number of
transactions to roll back. In this case, it is not that the systems have a bug that is disrupting
operations but that the expected quality of service (QoS) levels are not maintained. Testing and verifying software
functions, data migration, and integrated services can be easily overlooked as “someone else’s
job,” but release managers have to share some of this responsibility.
Software Distributions
Software distribution entails several planning steps. At the aggregate level, release managers
must determine whether a phased release is warranted, and if so, which users will be included in
each phase. Phases can be based on several characteristics, including:
• Organizational unit
• Geographic distribution
• Role within the organization
• Target device
When deploying new software or major upgrades, a pilot group often receives the software first.
This deployment method limits the risks associated with the release. (Even with extensive testing
and evaluation, unexpected complications can occur—especially with end users’ response to a
new application).
When distributing software, several factors must be considered:
• Will all clients receive the same version of the application? Slightly different versions
may be required for Windows 2000 (Win2K) clients and Windows XP clients.
• Will all clients receive the same set of modules? If a new business intelligence
application is to be deployed, power users may need full functionality of an ad hoc query
tool and analysis application, while managers and executives may require only summary
reports and basic drill-down capability.
• How will installation recover from errors or failure? Downloads can fail and need to be
restarted. There may be a power failure during the installation process. Disk drives can
run out of space. In some cases, the process can restart without administrator intervention
(for example, when the power is restored) but not in other cases (such as when disk space
must be freed).
• How will the installation be verified? Depending on the regulations and policies
governing IT operations, differing levels of verification may be required. At the very
least, the CMDB must be updated with basic information about the changes.
Software distribution is the heart of release management, but the ancillary processes of
communication and training are also important.
Summary
Service delivery depends on a mosaic of interdependent processes, including incident
management, problem management, configuration management, change management, and
release management. These processes constitute core operations within the SOM model.
The focus here, as with other SOM elements, is to define management tasks in terms of generic
operations that apply to a wide range of assets and can be adapted to new technologies as they
emerge. The center of management is not desktops, servers, and network hardware but the
operations that deploy, maintain, and secure them.
This chapter has introduced the first part of systems management services. Chapter 6 will
continue the discussion by examining management issues in service delivery, including service
level management, financial management of IT resources, capacity management, and availability
management.
Chapter 6
Service-Level Management
How much service-level management is necessary? If you had to distill service-level
management to its most essential form and present it as a question, that would be it. IT managers
must understand how much storage, computing resources, network bandwidth, training, and time
from developers, quality control specialists, and a host of other IT services are needed by users.
In addition, IT managers must know when these resources are needed. For example, will the data
warehouse extraction, transformation, and load process run during normal business hours or at
night? If it is during the day, the application hosting the data will need to accommodate its
normal workload plus export potentially large amounts of data. The network must also
accommodate the additional load. If the data warehouse is loaded in the middle of the night, the
demands on both the application and the network would be less. Something as simple as when a
process runs can have a major impact on the performance of that process.
Throughout this guide and in best practices and control frameworks, such as those documented
in ITIL and COBIT, there is a major emphasis on formalizing processes and procedures. This
idea applies to service-level management as well. The mechanism most commonly used in
service-level management is the service level agreement (SLA). An SLA is essentially a contract
between business units and IT service providers, such as in-house IT departments or outsourced
service providers. These agreements typically define the scope and levels of service provided.
Service-level requirements define the functional requirements that the business needs in order to
carry out its functions. They also entail, although this is not necessarily explicit, the need for
communications between business units and service providers. Business unit requirements are
rarely static and even in the best situations, requirements may not capture all nuances of a
business unit’s needs. Success for both business units and service providers requires that
communications do not stop once requirements are defined.
Requirements will vary according to business objectives, but several topic areas are common to
most business applications:
• Application functions
• Training
• Backup and recovery
• Availability
• Access controls
• Service catalog and satisfaction metrics
Each of these areas should be documented in service-level requirements.
Application Functionality
Within the section on application functionality, the project sponsors should define what the
system is to do. It is important to avoid becoming mired in implementation details at this point.
The goal is to define what the system should do—not how it should be done. For example, if an
application must be accessible from both traditional Web clients and mobile device clients, state
that purpose; there is no need to include design considerations, such as whether to use Handheld
Device Markup Language (HDML) or Wireless Markup Language (WML).
It is important to think of application functionality in terms of business tasks, such as:
• Providing customer support
• Verifying inventory
• Reporting on the status of operations
• Confirming customer orders
The specific functions might cover a broad range of options and they should be as inclusive as
possible when dealing with service requirement agreements, especially if part or all of the
service will be outsourced.
Training
Training should address both service use and service administration. User training is relevant
when an application, as well as network and hardware infrastructure, is included as part of the
provided service. For example, if an outsourcing firm is providing a CRM service that has never
been used by the customer, end user training should be included in the scope of the requirements.
Administrator training is almost always required, even when most of the systems infrastructure
will be managed by the IT department or an outsourcing firm. Application administrators are
often responsible for implementing and maintaining users, roles, and access controls as well as
organization-specific configurations related to the application’s functions.
Figure 6.1: Although implementation details are not part of service requirements, the cost of different options
can be a factor.
The time to recover is only one aspect of recovery criteria; another is specifying what it is that
will be recovered.
Figure 6.2: Backups without further availability measures can leave work performed since the backup
vulnerable to system failures.
Fortunately, there are availability procedures (discussed in more detail later) that can provide
recovery up to the point of failure. These tend to require more complex software, but they are
often used in applications designed for midsized and large enterprises.
Availability
Availability criteria answer the question “What is the tolerance for downtime with this service?”
The answer is obviously closely related to requirements for backup and recovery but also focuses
on the tolerance for downtime. Although backup and recovery procedures are designed for
particular recovery times and recovery points, availability addresses the question of how
frequently the business is willing to tolerate downtime.
For example, a server might go down at 11:00am and be back by 1:00pm the same day and still
meet backup and recovery requirements. If the same server goes down every day, it might still
meet the recovery objectives, but the business users are not likely to tolerate a system that is
down 2 hours of the day. The key questions with regard to availability in service requirements
are:
• How long can the system be down?
• How frequently can the system go down?
The length of time a system can be down is expressed in minutes, hours, or days. The amount of
disruption in the ideal world is virtually none, but in reality, the cost of countermeasures to
prevent downtime must be balanced with the benefits.
The rational choice is to allocate resources to availability measures until the cost of those
measures exceeds the expected cost of the corresponding downtime. For example, if a high-
availability solution is available for $50,000 and promises to keep downtime to less than 5
minutes, and another solution is available for $5,000 but reduces downtime to 1 hour, which
solution is better? The answer depends on the lost revenue or cost of being down. If, for
example, the business would lose $10,000 if the system were down for 1 hour, the less
expensive solution is a better choice.
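The trade-off in this example reduces to simple arithmetic (assuming, for simplicity, a single outage in the period being costed); the dollar figures come straight from the text:

```python
# Availability cost-benefit sketch: solution price plus the expected
# cost of the downtime it allows, assuming one outage in the period.

LOSS_PER_HOUR = 10_000  # from the example: $10,000 lost per hour down

def total_cost(solution_price, downtime_hours):
    return solution_price + downtime_hours * LOSS_PER_HOUR

high_end = total_cost(50_000, 5 / 60)  # ~$50,833 for <5 min downtime
low_end = total_cost(5_000, 1.0)       # $15,000 for 1 hour downtime
assert low_end < high_end              # the cheaper solution wins here
```

If outages were frequent rather than one-off, the downtime term would dominate and the conclusion could reverse, which is why the frequency question below matters as much as the duration question.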
The frequency is usually expressed as a percentage of total uptime. For example, if a system
should be available 24 hours a day, 7 days a week, and the requirement is 99 percent uptime, the
system could be down 87.6 hours, or more than 3 days per year. Table 6.1 shows the amount of
downtime allowed under several requirements.
[Table omitted: availability rates and the downtime each allows, computed against 8,760 total hours per year.]
Table 6.1: System availability requirements are often expressed as a percentage of total possible hours a
service could be available.
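The downtime figures behind such a table follow directly from the unavailability fraction, as in the 99 percent example above:

```python
# Allowed downtime is total hours times the unavailability fraction
# (8,760 hours in a non-leap year of 24x7 operation).

HOURS_PER_YEAR = 8760

def allowed_downtime_hours(uptime_pct):
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% uptime -> {allowed_downtime_hours(pct):.2f} hours down/year")
# 99.0% uptime -> 87.60 hours down/year
# 99.9% uptime -> 8.76 hours down/year
# 99.99% uptime -> 0.88 hours down/year
```

Each added "nine" cuts the allowable downtime by a factor of ten, which is why the cost of availability measures climbs so steeply.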
Additional areas that should be addressed in service requirements are security and access
controls.
Access Controls
Access controls dictate who can do what with information assets. When developing service
agreements, access controls tend to be high level, unlike application-specific access controls,
which can be detailed and fine-grained. Access controls are dependent on three mechanisms:
• Identification and identity management
• Authentication
• Authorization
Figure 6.3: LDAP directories maintain identity and organization information that can be leveraged for access
control management.
Active Directory (AD) can be used to store detailed information about users, including
organizational role, phone numbers, email addresses, public keys (when a public key
infrastructure—PKI—is in use) and other identifying information. AD and other types of
network directories can store information about other structures and assets on a network:
• Organizations and organizational units (OUs)
• Organizational role
• Groups of users
• Devices
• Applications
An advantage of directory-based identity management is that applications do not need to
maintain separate databases of user information. Centrally managing basic user information still
allows applications to tailor authentication rules to their particular needs.
Authentication
Authentication is the process of proving one’s identity. Passwords are commonly used for this
purpose, but with all the well-known limitations of passwords, other techniques have become
more popular. Some other methods for authenticating to systems are:
• Smart cards
• Fingerprints
• Palm scans
• Hand geometry
• Retina scan
• Iris scan
• Signature dynamics
• Keyboard dynamics
• Voice print
• Facial scan
• Token devices
The biometric methods also serve as identification methods. The objective of authentication is to
grant access to a system only to legitimate users. Because a single method, such as a password,
can be compromised, systems with high security requirements may use multi-factor
authentication.
With multi-factor authentication, two or more authentication methods are used to verify a user’s
identity. This method often combines multiple types of mechanisms, relying on, for example,
something the user knows (such as a password), something the user has (such as a smart card),
and something the user is (such as a unique fingerprint). Once a user has been identified and
authenticated, the user is granted access to the system. What the user is able to do with that
system is dictated by the authorization rules defined for that user.
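A toy sketch of the two-factor idea follows, with a placeholder credential store standing in for a real authentication back end; the stored values and check logic are invented for illustration:

```python
import hashlib

# Multi-factor sketch: grant access only when factors of two different
# kinds check out. STORED is a placeholder, not a real credential store.

STORED = {
    "password_sha256": hashlib.sha256(b"s3cret").hexdigest(),  # knows
    "token_serial": "TOK-4411",                                # has
}

def authenticate(password, token_serial):
    knows = hashlib.sha256(password.encode()).hexdigest() == STORED["password_sha256"]
    has = token_serial == STORED["token_serial"]
    return knows and has  # both factor types must succeed

assert authenticate("s3cret", "TOK-4411")
assert not authenticate("s3cret", "TOK-9999")  # one factor is not enough
```

The point of the `and` is the whole idea: compromising the password alone (or stealing the token alone) does not get an attacker in.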
Authorization
Authorizations are sets of rules applied to users and resources describing how the user may
access and use the resource. For example, users may be able to log into a network and access
their own directories as well as directories shared by all users in their department. The following
list highlights considerations for defining authorization requirements with regards to SLAs:
• Who are legitimate users of the system or network?
• How is their identity information maintained?
• How are users grouped into roles?
• How are privilege assignments to roles managed?
• Will the auditing capabilities of access control systems meet the audit requirements of the
customer?
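A minimal role-based sketch of these questions: users are grouped into roles, and privileges are assigned to roles rather than to individual users. Names and privileges are invented for illustration:

```python
# Role-based authorization sketch: privileges attach to roles, users
# attach to roles, and a check walks the user's roles. Data is invented.

user_roles = {
    "alice": {"sales", "all-staff"},
    "bob":   {"all-staff"},
}
role_privs = {
    "all-staff": {("read", "/shared")},
    "sales":     {("read", "/sales"), ("write", "/sales")},
}

def authorized(user, action, resource):
    """True if any of the user's roles grants the requested privilege."""
    return any((action, resource) in role_privs.get(r, set())
               for r in user_roles.get(user, set()))

assert authorized("alice", "write", "/sales")
assert not authorized("bob", "write", "/sales")
```

Managing privilege assignments at the role level, rather than per user, is what keeps the fourth question above tractable as organizations grow.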
As a rule, service level requirements should focus on what a service should provide, not how it is
provided—but access controls can be an exception to that rule. For example, if an organization
has invested in an identity management system, with constituent LDAP or other directories,
single sign-on (SSO) services, and authorization services, then an SLA can, and should, dictate
the use of that system. Sometimes you cannot avoid having to manage multiple access control
systems; in those cases, you should at least try to minimize their number. Another aspect of
service level management that spans multiple areas of IT is maintaining a catalog of IT services
and service metrics.
In addition to keeping track of what services are provided, service management best practices
dictate that you measure how well services are provided. Some common measures include:
• Response time
• Time to resolution
• Number of incidents by category
• Unit costs, such as cost of service per user or number of users supported per device
• Direct customer satisfaction surveys
These metrics, especially when applied to a service desk, should be categorized by priority. A
security breach that leaves a database of customer information vulnerable is an urgent incident
that must be responded to immediately. A problem that delays or inconveniences without
disrupting core business operations might be categorized as normal and addressed on a first-
come first-served basis.
Service level management spans the range of IT services. It includes some elements of business
continuity planning, security services, and capacity planning. Successful service level
management begins with well-defined SLAs that identify user needs in several areas as well as
the level of service users can expect in each of those areas. In addition, service managers are
expected to measure performance and maintain and improve service. Of course, all this
management, along with the rest of IT resources, costs money.
Cost Accounting
Cost accounting is the process of allocating the cost of providing service to the recipients of that
service. It sounds like a reasonable method—you pay for the services you use. When you are
buying relatively simple products, like a spindle of DVDs for backing up files, you can go to an
office supply store, pick out the right product for your needs, and pay the pre-set price. Why
can’t you do that for all IT services? The answer is, as it often seems to be, that the simplified
models of how things work start to break down when you get to real-world scenarios that are
more complex than example cases.
Competing Requirements
Consider the following scenario: An IT department provides a backup service. Some
departments have relatively simple backup and recovery requirements, while others are more
involved. The remote sales departments need their network file servers backed up at night, and
their backups should be kept for a week. After that, the backup media can be reused. A full
backup on the weekends and incremental backups during the week are sufficient. The total
amount of data backed up is in the hundreds of gigabytes. Another department has a terabyte-scale
customer management database that must be backed up every day. Audit requirements
necessitate keeping a month’s worth of data. Recovery time requirements are so tight that there
cannot be many incremental backups between full backups (recovery from a single full backup
is faster than recovery from a full plus several incremental backups). To meet the requirements
of the department using the customer management system, the IT department has to buy a high-
end backup tape solution with robotic components and high-speed tape devices. How should the
costs be allocated?
Cost Allocation
This is where it gets complicated because there are a number of options. The remote office has
minimal requirements that could have been fulfilled without the high-end solution needed by the
customer management department. The remote office could be charged a rate competitive with
what it would pay an outside provider for the same service. In this model, the remote
office does not incur additional cost because of the needs of another department.
Another model allocates the cost based on units of service provided. If the customer management
department uses 95 percent of the backup storage and the remote office uses 5 percent, the
former is charged 95 percent of the total cost of the service and the latter is charged 5 percent. In
this case, the remote office is paying a premium for high-end hardware it does not need.
A third option is to use a graduated schedule of charges so that the customers using the least
amount of service pay less than the customer that forces the IT department to use high-end
solutions.
Yet another option is to have two backup solutions: one for low-end needs and one for high-end
needs. Each department would pay the full cost of its solution. Unfortunately, this could be the
most expensive option because two types of systems would have to be purchased and
maintained. This is the least rational solution for the organization as a whole.
In practice, the second option, allocating costs based on usage, is the easiest to implement. It
avoids the competitive analysis required by the first option, the political battles associated with a
graduated scale, and the extra expense of the two-solution option.
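Allocating by usage is straightforward to compute. A minimal sketch follows; the department names, usage measure, and dollar figures are hypothetical:

```python
def allocate_costs(total_cost, usage_by_dept):
    """Allocate the total service cost in proportion to each department's usage."""
    total_usage = sum(usage_by_dept.values())
    return {dept: total_cost * usage / total_usage
            for dept, usage in usage_by_dept.items()}

# Hypothetical figures: a $100,000 backup service, usage measured in GB stored.
charges = allocate_costs(100_000, {"customer_mgmt": 950, "remote_office": 50})
print(charges)  # customer_mgmt pays 95% of the cost, remote_office pays 5%
```

The same function works for any usage measure (storage, CPU hours, tickets), which is part of why this model is the easiest to implement.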
Forecasting
Forecasting is as much an art as a science. It is fundamentally about estimating the cost of future
services, which include several types of costs, such as:
• Labor, including both employees and contracted labor
• Capital expenditures
• Lease costs
• Service contracts, such as maintenance
• Consulting
Within the areas of forecasting that can be standardized, a few general observations can be made:
[Chart omitted: “Forecasted Costs,” plotting labor and hardware costs in dollars over months 1 through 12 in the future.]
Figure 6.4: Patterns of growth in cost can vary; some are continuous and others are more step-like.
For example, the number of servers can increase for a while before an additional systems
administrator will have to be added to the staff. However, the total cost of adding that one server
that necessitates hiring another administrator is far higher than the cost of adding the previous
server. This interaction among resource types must be accounted for when forecasting.
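The step-like interaction between server count and staffing can be modeled with a simple cost function. All of the figures here (hardware cost, salary, servers per administrator) are illustrative assumptions, not data from the text:

```python
import math

def monthly_cost(servers, hw_cost=500, admin_salary=8000, servers_per_admin=20):
    """Hardware cost grows linearly; labor cost jumps each time a new admin is needed."""
    admins = math.ceil(servers / servers_per_admin)
    return servers * hw_cost + admins * admin_salary

# The 21st server triggers hiring a second administrator:
print(monthly_cost(20))  # 20*500 + 1*8000 = 18000
print(monthly_cost(21))  # 21*500 + 2*8000 = 26500
```

The jump from server 20 to server 21 costs $8,500, not $500, which is exactly the interaction among resource types a forecast must capture.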
Cash flow, essentially the money coming into a business minus the funds going out, can vary
over time, and expenditures must be timed to occur after sufficient cash is on hand. For example,
if the IT department plans to purchase additional servers and hire a new systems administrator,
the business needs cash on hand to pay for the hardware and meet payroll. When forecasting,
consider the timing of cash flows in the business, especially if your business is subject to
seasonal variations.
When forecasting, it helps to distinguish types of costs and how their growth patterns vary. It is
especially important to watch for costs that introduce jumps in the total cost of a project or
operation as well as the timing of expenditures that should be staged according to expected cash
flows in the organization.
It should also be noted that forecasting for operational expenses, such as labor, leases, and small
equipment, requires a different type of analysis than major expenditures for equipment with
multi-year life spans. Those large expenditures warrant a more investment-oriented approach
known as capital expenditure analysis.
NPV
The NPV of an investment is a measure, in today’s dollars, of the value of future savings or
returns due to an investment made today. To determine the value of future savings or returns,
you must take into account the time value of money. For example, if you were given the
choice of receiving $1,000 today or $1,100 one year from now, which choice would
maximize your return? That depends on the prevailing interest rate in the open market. If
the interest rate is 5 percent per year, then $1,000 invested today would yield only $1,050 in
one year; the better choice would be to wait and receive the $1,100 in one year.
The interest rate used in this calculation is known as the discount rate. It is used to determine the
relative value of an investment. The NPV calculation takes into account the fact that savings or
returns accrue over time, and it uses the discount rate to account for changes in the value of
money over time.
Let’s look at an example to see how this works: The IT department is considering investing in a
new database server to replace two existing servers. The cost of the database server is $50,000.
The department estimates that it will save $15,000 per year in maintenance, service contracts,
and licensing costs. Will the investment in a new server save money in the long run?
To answer that question, you use the formula for calculating NPV. Assuming the useful life of
the database server is 3 years, the formula for NPV is:
Amount saved in Year 1 / (1 + Discount Rate) +
Amount saved in Year 2 / (1 + Discount Rate)^2 +
Amount saved in Year 3 / (1 + Discount Rate)^3
Assuming a 5 percent discount rate, the calculations are shown in Table 6.2.
Table 6.2: NPV Calculations

Year    Amount Saved    Present Value
1       $15,000         $14,286
2       $15,000         $13,605
3       $15,000         $12,958

NPV (sum of present values): $40,849
In this example, the NPV of the investment is $40,849, less than the $50,000 investment. Unless
there are other reasons to make the investment, the organization would be better off keeping the
$50,000 in the bank than spending it on the server. Another commonly used measure in capital
expenditure analysis is ROI.
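The NPV calculation above can be checked programmatically. This sketch assumes the same $15,000 annual savings and 5 percent discount rate from the example:

```python
def npv_of_savings(savings, discount_rate):
    """Present value of a stream of future savings; the first entry is year 1."""
    return sum(s / (1 + discount_rate) ** year
               for year, s in enumerate(savings, start=1))

pv = npv_of_savings([15_000, 15_000, 15_000], 0.05)
print(round(pv))  # 40849 -- less than the $50,000 cost of the server
```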
ROI
ROI is a commonly used measure for a number of reasons; ROI
• Takes into account the total cost and benefit of an investment
• Is expressed as a percentage, not a dollar amount, so it is easy to compare ROIs for
different investment options
• Is well known, perhaps due in large part to the first two reasons
ROI is a calculation that takes into account the present value of future savings (like the NPV
calculation), increased income generated by the investment, and the initial costs of the
investment.
In the NPV calculation, you started with the amount saved in a given year. With the ROI
calculation, you start with the net benefit of an investment for a given year. The formula for net
benefit is:
Net Benefit = Savings due to Investment + Increased Revenue due to Investment –
Recurring Costs
The net benefit fits into the ROI formula, which is similar to the net present value formula. For a
three-year period, the ROI formula is:
[ Net Benefit in Year 1 / (1 + Discount Rate) +
Net Benefit in Year 2 / (1 + Discount Rate)^2 +
Net Benefit in Year 3 / (1 + Discount Rate)^3 ] / Initial Costs
Let’s use the formulas in an example. An organization is considering an investment in a network
security appliance. The appliance will allow the IT department to retire or repurpose two servers
running content-filtering and antivirus software. The appliance requires less time to administer
than the two servers currently running security countermeasures, so there will be some savings
on labor. The appliance will also filter traffic faster, allowing for the rollout of new Web-based
services expected to generate additional revenue in the future. What is the ROI?
Start by calculating the net benefit for each of the next 3 years as shown in Table 6.3.
Year    Savings    Additional Revenue    Recurring Costs    Net Benefit
1       $10,000    $10,000               $0                 $20,000
2       $20,000    $40,000               $5,000             $55,000
3       $10,000    $80,000               $5,000             $85,000
The savings are expenses that would be incurred if the existing servers and the applications
running on them were kept. Year 1 and year 3 savings consist of software license costs,
administration costs, and routine maintenance costs. Year 2 includes those as well as several
hardware upgrades or replacements expected based on mean time between failures (MTBF) of
several of the server components.
The additional revenue comes from a new service made possible by the higher
throughput of the security appliance. The first year will consist primarily of a small
pilot program and initial marketing efforts. Projections for the plan estimate significant growth
starting in the second year and continuing into the third year. The recurring costs are the costs
associated with maintenance. These include minimal administration charges as well as
maintenance fees charged by the appliance vendor. The net benefit is calculated according to the
formula shown earlier.
You can now move on to the ROI formula. Assuming a 5 percent discount rate, and an initial
cost of $25,000 the ROI is:
[ $20,000 / (1 + 0.05) +
$55,000 / (1 + 0.05)^2 +
$85,000 / (1 + 0.05)^3 ] / $25,000 = 569%
This is clearly a good investment option, due largely to the increased revenue enabled by the new
appliance and, to a lesser degree, to the second-year savings from not having to maintain the
existing servers. ROI is a broadly understood and easily used calculation. Another
capital expenditure calculation that allows for comparison between projects is IRR.
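The ROI calculation can be expressed the same way; the figures are taken from the security-appliance example above:

```python
def net_benefit(savings, revenue, recurring):
    """Net Benefit = Savings + Increased Revenue - Recurring Costs."""
    return savings + revenue - recurring

def roi(net_benefits, discount_rate, initial_cost):
    """Present value of yearly net benefits divided by the initial investment."""
    pv = sum(b / (1 + discount_rate) ** year
             for year, b in enumerate(net_benefits, start=1))
    return pv / initial_cost

benefits = [net_benefit(10_000, 10_000, 0),      # year 1
            net_benefit(20_000, 40_000, 5_000),  # year 2
            net_benefit(10_000, 80_000, 5_000)]  # year 3
print(f"{roi(benefits, 0.05, 25_000):.0%}")  # 569%
```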
IRR
Like ROI, IRR is expressed as a percentage; however, IRR does not depend on
knowing, or estimating, a discount rate. Rather, IRR calculates the discount rate for which the
NPV of an investment is zero. The advantage of this approach is that it is easy to compare two
projects to determine which is a better investment regardless of the size of the investment.
IRR is an iterative calculation that starts with the initial cost of an investment and subsequent
revenues or savings generated by the investment. Microsoft Excel provides a built-in IRR
function as well as a modified version, MIRR, that addresses some of the shortcomings
of IRR in certain situations.
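The iterative search can be sketched with a simple bisection. This is an illustrative implementation, not Excel's algorithm, applied to the cash flows of the security-appliance example (an initial outlay followed by positive returns):

```python
def npv(rate, cash_flows):
    """cash_flows[0] occurs now (the investment); later entries are one year apart."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

def irr(cash_flows, lo=0.0, hi=10.0, tol=1e-7):
    """Bisection search for the rate at which NPV is zero.
    Assumes NPV is positive at lo and negative at hi."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid, cash_flows) > 0:
            lo = mid  # NPV still positive: the zero-crossing rate is higher
        else:
            hi = mid
    return (lo + hi) / 2

# Cash flows from the security-appliance example: -$25,000 now, then net benefits.
rate = irr([-25_000, 20_000, 55_000, 85_000])
print(f"{rate:.1%}")  # roughly 135%
```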
Some financial analysts have questioned the use of IRR in capital expenditure analysis because, they
claim, it can make poor investments look better than they actually are. See John C.
Kelleher and Justin J. MacCormack’s “Internal Rate of Return: A Cautionary Tale” at
http://www.cfo.com/article.cfm/3304945/1/c_3348836 for more information.
Project Management
Project management is a well-defined practice for achieving a set of one-time objectives, such as
developing an application or migrating a service (for example, email) from one platform to another.
Unlike operations, specific projects are usually not repeated. However, the nature of projects is
sufficiently similar to warrant a set of best practices. The following sections summarize the core
activities within project management, especially as they relate to financial management.
Not all identified risks must have a mitigation strategy; it is best to focus on those that could be the
most disruptive.
For more information about best practices in project management, see resources at the Project
Management Institute, http://www.pmi.org/.
Financial and service level management are major parts of service delivery management. More
narrowly focused but still essential areas of service delivery management are addressed in the
remainder of this chapter.
Capacity Management
Capacity management is the practice of understanding the demands for IT resources, such as
computing, memory, storage, and network bandwidth, and ensuring adequate resources are in
place when they are needed. This is done in three ways:
• Performance management
• Workload management
• Application sizing and modeling
These are closely related but address the needs of capacity management in slightly different
ways.
Performance Management
Performance management entails monitoring systems to ensure resources are used efficiently, to
detect trends in the growth or reduction in the needs for particular resources, and to identify
performance bottlenecks. When a performance problem occurs, the key to resolving the problem
is identifying the point in the process that is causing the slowdown.
Consider a customer management system that generates reports on sales activity. The company
has been growing and sales activity is increasing; at the same time, the report-generation process
is taking longer and longer, out of proportion with the growth in sales activities. The causes of
the problem could be:
• Insufficient memory in the database server
• Insufficient bandwidth on the network
• Poorly coded SQL within the database application that does not scale well
• Insufficient CPU capacity for the number and size of the reports
It is critical to identify the bottleneck. If the problem is poorly written SQL code, adding more
processors and memory may reduce the problem; however, this option will incur a significant
expense and will probably work only for a short time. If the problem is insufficient memory, the
CPU is probably not being used to capacity; adding faster or additional CPUs will not reduce the
problem. Still another possible solution is to adjust the overall load on the system.
Workload Management
Workload management entails understanding the full set of processes that must be run, their
dependencies, and their resource requirements, and then scheduling jobs and resources to maximize
the use of computing resources. The first step to workload management is identifying the
resources needed by each job. Some will require large amounts of bandwidth but little CPU, such
as transferring data for a data warehouse load; others will be both disk and memory intensive,
such as sorting large data sets for reports.
The second step is to schedule complementary jobs so that the contention for a single resource is
minimized. Assuming there are no linear dependencies between the jobs (for example, job A
must finish before job B starts), processes with different resource requirements should be
scheduled together. Another rule is to schedule a job early when multiple other jobs depend on
it; this maximizes the scheduling options of the later jobs. The other
core process in workload management is monitoring. This should focus on both current
performance and trends in growth or reduction in the need for particular resources. Another area
of capacity management entails analyzing the needs of new applications.
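The two scheduling rules (respect dependencies; run jobs with many dependents early) can be sketched as a priority-driven topological sort. The job names and dependency graph are hypothetical:

```python
from collections import defaultdict
import heapq

def schedule(jobs, deps):
    """Return jobs in dependency order; among ready jobs, run the one with
    the most dependents first. deps maps a job to the set of jobs it waits for."""
    dependents = defaultdict(set)
    for job, prereqs in deps.items():
        for p in prereqs:
            dependents[p].add(job)
    remaining = {j: set(deps.get(j, ())) for j in jobs}
    # Min-heap keyed on negated dependent count, so "most depended-upon" pops first.
    ready = [(-len(dependents[j]), j) for j in jobs if not remaining[j]]
    heapq.heapify(ready)
    order = []
    while ready:
        _, job = heapq.heappop(ready)
        order.append(job)
        for d in dependents[job]:
            remaining[d].discard(job)
            if not remaining[d]:
                heapq.heappush(ready, (-len(dependents[d]), d))
    return order

# Hypothetical batch: the warehouse load feeds both reports, so it runs first.
print(schedule(["load", "sales_report", "audit_report", "cleanup"],
               {"sales_report": {"load"}, "audit_report": {"load"}}))
```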
Continuity Management
The goal of continuity management is to ensure business operations are able to continue in case
of a significant disruption in multiple services. IT continuity planning should be done as part of a
broader exercise in business continuity management.
If there is a significant disruption in services, the business should determine which systems are
mission critical and the order in which they should be brought back online. There are also
financial considerations in continuity management. How much should be spent to ensure the
customer management application is available in the event of a disaster at the primary data
center? How long can the system be down before the business suffers adverse impacts? These
questions are best answered using a formal risk analysis procedure that includes:
• Identifying assets
• Assessing the value of assets
• Identifying potential threats to assets and the likelihood of their occurrence
• Prioritizing the allocation of resources based on asset value and threat level
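These steps can be reduced to a simple exposure score (asset value multiplied by threat likelihood) for prioritization. The asset names, values, and likelihoods below are illustrative assumptions:

```python
def prioritize(assets):
    """Rank assets by exposure = asset value x annual likelihood of the threat."""
    return sorted(assets, key=lambda a: a["value"] * a["likelihood"], reverse=True)

assets = [
    {"name": "customer database", "value": 2_000_000, "likelihood": 0.02},
    {"name": "intranet wiki",     "value":    50_000, "likelihood": 0.10},
    {"name": "order pipeline",    "value": 1_000_000, "likelihood": 0.05},
]
for a in prioritize(assets):
    print(a["name"], a["value"] * a["likelihood"])
```

Note that a moderately valuable asset facing a likelier threat can outrank the most valuable asset, which is why both factors belong in the analysis.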
The purpose of availability and continuity management is to keep systems up and running at a
level that meets SLAs. The tasks involved are varied because the problems addressed range from
the relatively minor (the system is sluggish for short periods of time) to major (operations need
to be relocated to a backup data center).
Summary
Service delivery entails many tasks, both operational and management. The previous chapter
examined the operational aspects; this chapter covered the management side of service delivery.
Although management issues are broad, they are dominated by service level management and
financial management. If these areas are well managed, three other key areas—capacity
management, availability management, and continuity management—are already well on their
way to being effectively implemented. The next chapter continues to examine elements of
systems management with a focus on application, software, and hardware management issues.
Chapter 7
Again, it is worth noting that although this list and the discussion that follows might give the
impression that the life cycle logically marches from one stage to the next, never veering from the
predefined sequence, that is often not the case. New questions may arise during the analysis and
design stage that trigger revisions to requirements. Testing might reveal unanticipated combinations
of conditions that force a redesign of a module. Of course, the business justification for an entire
project can change if there is a change in business conditions, leading management to scrap
everything developed to that point.
In addition, it should be noted that software developers use different methodologies for creating
applications. Most of these methodologies use the stages described in this chapter in one form or
another. The major differences in methodologies tend to focus on whether to use one or more passes
through these stages and how much to try to accomplish at each iteration through the stages. (See
the sidebar “Software Development Methodologies” for more information about this topic.)
Figure 7.1: The dominant progression through the life cycle follows the solid lines, but in practice, there are
many other paths through the life cycle as shown by the dashed lines.
The first step in the application life cycle is initiated by an organizational need.
Business Justification
Why would an organization commit resources—money, staff, and time—to developing or
acquiring an application? There must be some benefit that outweighs the cost, of course.
Sometimes an organization may make a decision to invest in the development of an application
because the organization believes it will be a key to strategic success. Small startups can work
like this. A few developers and managers with a vision for starting a new business can be
justification enough. In larger organizations—such as midsized and large corporations,
government agencies, educational institutions, and major non-profits—a more formal approach is
usually required.
A business justification is essentially an argument for developing or purchasing an application
because it will serve a need of the organization. These documents often include:
• A description of the current state of the business or organization and a missing service or
unrealized opportunity.
• An overview of the benefits of implementing the proposed application, such as improved
customer service, which leads to higher customer retention rates; reduced cost of
manufacturing a product by eliminating older, higher cost IT systems used to support the
current manufacturing process; or higher throughput of a transaction processing system,
which will lower the marginal cost of each transaction.
• A formal assessment of the costs of the proposed project. Cost analysis often includes
measures such as the return on investment (ROI) or the internal rate of return (IRR),
which quantify the financial impact of the project and aid in allocating resources among
multiple proposed projects.
• A discussion of the risks associated with a proposed project. Any projection, such as a
business justification, is based on assumptions. The risk discussion points out what could
go wrong and how those risks can be mitigated.
The business justification should also demonstrate how the proposed application will further
align IT services with the strategic objectives of the organization. There are probably many
applications that can be cost justified but still do not align with strategic objectives. The goal of
deploying IT applications is to further the objectives that have already been defined; it is not to
introduce side services that might generate revenue for the organization but distract from core
services. Once it has been demonstrated that an application will serve the broader business
objectives of an organization, the application project is formalized, and the requirements-
gathering phase begins.
Requirements Phase
The purpose of the requirements phase of an application project is to define what the application
will do. At this point, the question of how the application will operate is not addressed; that is
left for the analysis and design stage that follows. The key topics that should be addressed during
requirements gathering are:
• Functional requirements
• Security requirements
• Integration requirements
• Non-functional requirements
There is some overlap among these areas, and requirements in one area often depend on
requirements in other areas as well.
Functional Requirements
Functional requirements are composed of use cases and business rules. Functional requirements
begin with the development of use cases, which are scenarios for how an application may be
used. Use cases include descriptions of how users, known as actors, interact with the system to
carry out specific tasks.
Use cases typically include:
• A use case name and a version, such as “Analyze sales report, version 3”
• A summary briefly describing what the actor does with the system—for example, in the
“Analyze sales report” use case, the actor might authenticate to the application, enter date
and regional parameters, format data in tabular or graphical form, and sort and filter data
as needed
• Preconditions (conditions that must be true for the use case to be relevant)—for example,
a precondition of the “Analyze sales report” use case is that the data warehouse providing
the data has been updated with the relevant data
• Triggers (events that cause the actor to initiate the use case)—for example, the need to
calculate the distribution of inventory to regional warehouses triggers an analysis of the
most recent sales
• Primary and secondary sequences of events within the use case—for example, the
primary events sequence describes the typical steps to retrieving and displaying sales
data, and the secondary sequence describes what occurs when an exceptional event
occurs
• Postconditions (the state of the system after the use case executes)—for example, data may be
updated, other functions may be enabled, and other use cases may be triggered
The purpose of use cases is to describe specific functions at a high level. The finer-grained
details are captured by business rules.
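The elements listed above map naturally onto a structured record. A sketch, using the “Analyze sales report” use case as the example (field names are an assumption, not a formal use-case standard):

```python
from dataclasses import dataclass, field

@dataclass
class UseCase:
    """One record per use case, mirroring the elements listed above."""
    name: str
    version: int
    summary: str
    preconditions: list = field(default_factory=list)
    triggers: list = field(default_factory=list)
    primary_sequence: list = field(default_factory=list)
    secondary_sequences: list = field(default_factory=list)
    postconditions: list = field(default_factory=list)

analyze_sales = UseCase(
    name="Analyze sales report",
    version=3,
    summary="Actor authenticates, enters date and regional parameters, "
            "then formats, sorts, and filters the resulting data.",
    preconditions=["Data warehouse updated with the relevant data"],
    triggers=["Inventory distribution to regional warehouses must be calculated"],
)
print(analyze_sales.name, analyze_sales.version)
```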
An introduction to use cases and related modeling topics can be found at
http://www-128.ibm.com/developerworks/java/library/co-design5.html.
Business rules are formal statements that define several aspects of information processing:
• How functions are calculated
• How data is categorized
• Constraints on operations
Business rules are specified early in the application development process because so much
depends on them. For example, if a sales analysis system is proposed, it must be understood
early on how to calculate key measures such as gross revenue, marginal costs, and related
metrics. It must also be determined whether multiple definitions must be supported. Take
marginal cost calculations, for example. The division responsible for manufacturing a product
might include the cost of materials and equipment in the marginal cost calculation; whereas, the
finance department might include those costs plus the sales commission paid to sell the product.
This is an example of a single term meaning multiple things depending on the context.
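A context-dependent business rule such as this marginal-cost definition might be captured as follows; the cost components and figures are hypothetical:

```python
def marginal_cost(materials, equipment, commission, context):
    """The same term can mean different things: the manufacturing division
    excludes sales commission from marginal cost; finance includes it."""
    if context == "manufacturing":
        return materials + equipment
    if context == "finance":
        return materials + equipment + commission
    raise ValueError(f"no marginal-cost rule defined for {context!r}")

print(marginal_cost(40, 10, 5, "manufacturing"))  # 50
print(marginal_cost(40, 10, 5, "finance"))        # 55
```

Making the context an explicit parameter forces the ambiguity to be resolved during requirements rather than discovered in production.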
As with use cases, formalisms have been developed to standardize the definition of business rules. The
Business Rule Markup Language, http://xml.coverpages.org/brml.html, is an open standard for
incorporating business rules into applications.
Security Requirements
Security requirements for an application should be defined along with functional requirements.
Implicit in every functional requirement are the questions “Who should be able to use this
function?” and “When can this function be used?” These broad questions, in turn, are answered
by more precise, but not detailed, questions such as:
• In what roles will users be categorized? The privileges and rights to use the application
and functions within the application should be based on the roles granted to users.
• How is the data used in the system categorized according to sensitivity classifications? Is
it public data, sensitive information that should not be disclosed outside the organization,
or private information whose distribution is controlled by the person it is associated with?
Is it secret information, such as proprietary information, trade secrets, or comparable
information?
• How will the application be accessed? Will remote users have access? If so, is it through
a public Web site, a restricted Web site requiring a username and password, or through a
VPN?
• What authentication mechanism is required to access the program? Is a
username/password scheme secure enough? If so, what is the password policy? Is multi-
factor authentication required? If so, what types of authentication mechanisms are
required? These could include smart cards, challenge/response devices, or biometric
devices.
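The role question above is typically answered with a role-to-privilege mapping. A minimal sketch; the role and function names are hypothetical:

```python
# Hypothetical mapping from roles to the application functions they may invoke.
ROLE_PERMISSIONS = {
    "analyst": {"view_report"},
    "manager": {"view_report", "approve_report"},
    "admin":   {"view_report", "approve_report", "manage_users"},
}

def can_use(roles, function):
    """A user may invoke a function if any granted role carries that privilege."""
    return any(function in ROLE_PERMISSIONS.get(role, set()) for role in roles)

print(can_use(["analyst"], "approve_report"))             # False
print(can_use(["analyst", "manager"], "approve_report"))  # True
```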
The scope and detail of security requirements are slightly different from functional requirements.
In the case of security, it is common to delve into the “how” instead of addressing only the
“what.” For example, the need for biometric authentication is really an implementation issue that
would not be specified if it were a functional requirement. However, security requirements may
be dictated by constraints outside the scope of the project. A financial services company, for
example, may decide that to remain in compliance with government regulations, biometric
security measures are required for all applications that reference customer account information.
The designers of the application will have no choice in the matter; if the application they are
developing accesses customer data, it is required to use biometric security measures. In cases
such as this, it is important to document these requirements before the analysis and design phase
begins.
Security requirements should also address:
• Compliance requirements
• Any restrictions on the time of access to the application
• How identities are managed—for example, will all users be registered in an organization-
wide Active Directory (AD) or LDAP directory?
• The federation of identities (that is, relying on identity information managed by another
party) if third parties are granted access to the application
• Encryption requirements, including the strength of encryption
• Policies on the transfer of data from the application. For example, can users download
data and store it locally on their workstations? Can they store data on their notebook
computers or other mobile devices?
• Any security policies and procedures that are relevant to the application
Security requirements sometimes have to address how the application will operate with other
applications or data sources.
Integration Requirements
It is difficult to imagine an application that will not integrate with some other application or data
source. Rarely are today’s applications islands unto themselves. For this reason, it is helpful to
understand the ways in which applications share services and data among themselves.
Non-Functional Requirements
The term non-functional requirement is a catch-all used to describe requirements that are not
captured in the other categories. Some designers would include security and integration in the
non-functional category; however, their importance and complexity are often far greater than
those of the other non-functional requirements and therefore warrant a more detailed discussion. The
remaining categories of non-functional requirements include:
• Backup and recovery
• Performance levels
• Service availability
• Service continuity
Figure 7.2: Example integration of application with other servers and services.
Many of these non-functional requirements overlap with systems management responsibilities. See
Chapter 2 for more information about these topics.
Performance Levels
Performance levels define the expected response times to users and the number of users that can
be supported. This information is needed to size hardware appropriately. The number of CPUs,
the amount of memory, and the network bandwidth required will depend, in part, on the expected
performance levels.
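Performance levels feed directly into a back-of-the-envelope sizing calculation. Every figure in this sketch (users, request rates, server capacity, headroom) is an assumption for illustration:

```python
import math

def servers_needed(concurrent_users, requests_per_user_per_sec,
                   server_capacity_rps, headroom=0.3):
    """Size hardware from expected load, leaving headroom for peaks."""
    peak_rps = concurrent_users * requests_per_user_per_sec
    usable = server_capacity_rps * (1 - headroom)
    return math.ceil(peak_rps / usable)

# 2,000 users at 0.5 req/s each, servers rated at 400 req/s, 30% headroom:
print(servers_needed(2_000, 0.5, 400))  # 4
```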
Service Availability
Service availability addresses the extent to which the application will be available. For example,
mission-critical applications may be expected to be up 24 hours a day, 7 days a week. In practice,
except for the most demanding applications, service windows are reserved for outside of peak
operational hours to attend to upgrades, patches, and other maintenance. When true 24 × 7
service is required, servers are configured in a cluster or failover configuration that improves
uptime and allows for a rotating maintenance schedule across the constituent servers.
Service Continuity
Service continuity requirements specify what is expected in the case of service disruption, such
as a natural disaster that shuts down a data center. These requirements are dictated by the need to
have the application available, the duration for which the application can be down without impacting
the organization’s operations, and, of course, the cost of equipping and maintaining an off-site
facility.
Gathering and defining requirements for applications is an essential step in the application life
cycle. Functional requirements define what an application is to do, security requirements specify
the level of confidentiality and integrity protection required, integration requirements deal with
how the application will function within the broader IT infrastructure, and finally the
non-functional requirements define the parameters needed to support several core systems
management services, such as backups and service continuity. Once the application requirements
are defined, the life cycle moves into the analysis and design phase.
Solution Frameworks
A solution framework is a high-level design of an application that describes the major modules
within the application as well as the architecture that encompasses and integrates each of the
modules. Although applications have different architectures, an increasingly common model is
based on three or more tiers:
• Data services
• Application services
• Client services
Figure 7.3 shows a simple example of such a model.
For simplicity, this diagram depicts a three-tier model. However, within the middle tier there may be
multiple levels of application services providing functionality for other modules within the application.
Figure 7.4: In the data warehouse model, data is first integrated in a separate data store and then processed
by an application server.
Other systems, such as order processing systems, may use multiple, independent databases. For
example, a financial services company may allow customers to access their checking accounts,
mortgage statements, and credit card activity all from a single Web application. The data,
however, is stored on three different systems, each one dedicated to managing one type of
account (see Figure 7.5).
Figure 7.5: In many applications, multiple data sources are integrated directly in the application server.
During the framework modeling process, the source systems and how they will function together
are determined. These data service providers are used by application services that occupy the
middle tier of the architecture.
Client Tier
The client tier is responsible for rendering information provided by the data services and
application services tiers. The client tier is becoming more challenging to manage and develop
for as the options for clients expand.
Conventional workstations and notebooks are now complemented with PDAs and smart phones
as application clients. This reality requires systems designers to develop for multiple platforms
using multiple protocols. For example, HTML, the staple of Web application development, will
not necessarily meet the functional requirements of mobile clients such as PDAs and smart
phones; alternative methods are required.
Frameworks are skeleton designs of how an application is organized. It is at this point that
systems managers can start to see how the application will fit into the existing network and
server infrastructure, what additions will be needed to meet hardware requirements, and what
additional loads will be put on network services. This is also the point at which decisions are
made about which components of the application to build and which to buy.
In practice, few organizations outside of software development firms will start with tools and
build from scratch. Similarly, unless the required application provides a common, commodity
service, such as a backup and recovery program, few organizations will avoid at least some
configuration and customization of major applications.
The process of making the buy-vs.-build decision includes determining:
• The functional components of the application
• The communication protocols between the components
• The constraints on the application, such as the types of databases that will provide data to
the application
• The development skills available in house, or readily available
• The time to deliver the application
• Options in the commercial market and open source for components
• Availability of subject matter experts who can support design
Ideally, the end result is a balanced approach that leverages existing components while reserving
custom development for key components that add competitive advantage and cannot be
adequately implemented using existing systems.
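One lightweight way to make such a comparison explicit is a weighted scoring sheet. The criteria, weights, and ratings below are purely illustrative; real ones would come from the requirements and constraints identified above:

```python
# Illustrative weighted scoring for a buy-vs.-build comparison.
# Criteria, weights, and ratings are hypothetical examples.

CRITERIA = {                     # weight: relative importance
    "functional_fit": 3,
    "in_house_skills": 2,
    "time_to_deliver": 2,
    "competitive_advantage": 3,
}

def score(option_ratings):
    """option_ratings: criterion -> rating on a 1-5 scale."""
    return sum(CRITERIA[c] * r for c, r in option_ratings.items())

buy = {"functional_fit": 4, "in_house_skills": 5,
       "time_to_deliver": 5, "competitive_advantage": 2}
build = {"functional_fit": 5, "in_house_skills": 3,
         "time_to_deliver": 2, "competitive_advantage": 5}

decision = "buy" if score(buy) >= score(build) else "build"
```

The point of such a sheet is not the arithmetic but forcing the decision criteria and their relative weights into the open where they can be reviewed.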
With tools and components identified, the detailed design can begin. The more components or
existing packages are used, the less there is to design. At the very least, a detailed configuration
of a turnkey system should be in place before the system is deployed to a production
environment.
Detailed Design
The goal of the detailed design stage is to create a document suitable for programmers and
systems administrators familiar with the selected tools and components to build the application.
At this point, the requirements and overall architecture should be defined and the task is to
identify how the requirements will be met.
In practice, designers will discover elements of the application that were not considered during
requirements or find that requirements have changed (even with short development cycles,
requirements can change before detailed design is complete). These discoveries can trigger
review of functional requirements, non-functional requirements, and architectural design. These
discoveries are so common that they have prompted the creation of several design
methodologies. From a systems management perspective, this demonstrates that the supporting
infrastructure originally planned for a new application deployment may not be what is actually
deployed when the system design is finally completed. Once the design is in place, the
application life cycle can move to the development stage.
Development
Development entails building applications and application components. Many books have been
written on this subject and it is well beyond the scope of this chapter and this guide to try to
address the practice of software engineering. There are, however, three topics relevant to systems
management that are worth addressing:
• Source code management
• System builds
• Regression testing
Each of these entails software artifacts that, like other assets, require a structured management
regimen.
System Builds
A system build is the process of gathering the component modules under development and
creating an executable application. Once enough components have been developed to have even
the most basic functions, system builds are used to ensure development continues in such a way
as to not break (at least not too badly) previous work. A system build is a minimal test of the
code under development. If an application’s modules and libraries can be compiled into an
executable application, the specific functions of the system can be tested.
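As a rough sketch of the idea, the build step below compiles every module and treats any compilation failure as a broken build. Python's `py_compile` stands in for whatever compiler a real project would use, and the file contents are contrived for the demonstration:

```python
# A system build as a minimal test: if every module compiles, the
# build succeeds; any syntax error breaks it. A real build would
# also link/package the result and run smoke tests.
import os
import py_compile
import tempfile

def build(sources):
    """Compile each source file; return the list of files that failed."""
    failures = []
    for path in sources:
        try:
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError:
            failures.append(path)
    return failures

# Demonstration with one good and one deliberately broken module.
tmp = tempfile.mkdtemp()
good = os.path.join(tmp, "good.py")
bad = os.path.join(tmp, "bad.py")
with open(good, "w") as f:
    f.write("def f():\n    return 1\n")
with open(bad, "w") as f:
    f.write("def f(:\n")  # deliberate syntax error

broken = build([good, bad])
```

Running such a build on every check-in catches integration breakage early, before it compounds.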
Regression Testing
Regression testing is the practice of testing applications or modules after small changes to ensure
that previously functional components have not been broken by the introduction of bugs in
new code. Regression tests can be automated and the results compared with previous results.
This type of testing is not the full-scale system testing done prior to releasing a piece of software.
Regression testing is often done automatically after building an executable application. When
software is sufficiently constructed and tested by developers, it moves to the quality control–
focused level of testing, typically carried out by a testing team that does not include developers.
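A minimal automated regression check compares current output against previously recorded "golden" results. The function under test and the baseline values below are illustrative:

```python
# Minimal regression harness: run each case and compare against a
# baseline of results recorded from the previous release.

def discount(total):          # function under test (illustrative)
    return round(total * 0.9, 2) if total >= 100 else total

BASELINE = {50: 50, 100: 90.0, 199.99: 179.99}  # recorded earlier

def regressions():
    """Return inputs whose output no longer matches the baseline."""
    return [x for x, expected in BASELINE.items()
            if discount(x) != expected]

failed = regressions()
```

Because the comparison is mechanical, a harness like this can run automatically after every build, which is exactly how regression testing is typically wired into the build process.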
Software Design Methodologies
Software design methodology is one of those topics that can trigger seemingly incessant debate among
software developers. Over the past decades, a number of methods have been proposed, all with some
variation on top-down or bottom-up design. Although there are a number of minor variations on the major
models, we will focus only on the major ones, which are:
● Waterfall model
● Spiral model
● Agile model
The waterfall model is a linear approach to software development. According to the waterfall model, one
starts by gathering requirements, then develops a high-level design followed by a detailed design, builds
the code according to the design, tests it and corrects bugs, and then deploys it. The advantage of this
model is that it is intellectually simple and easy to understand. The disadvantage is that it does not work
in most software development projects. The world does not proceed in the lock-step fashion assumed in
the waterfall model. Requirements change and this model does not adapt to that. The spiral method was
developed to avoid the fatal flaws of waterfall while maintaining the structured approach that does serve
the goals of software engineering.
Through the spiral approach, developers build software iteratively and assume that requirements will
change and that during the process of developing a system, new information is gleaned that will help in
the development of other parts of the system. Rather than build an application in one pass through the
structured stages, spiral methodologies build a set of functions in each iteration through the structured
stages.
In theory, proponents of waterfall methodology might argue that a skilled requirements gatherer could find
all the requirements early in the development cycle. Even if someone did have the mythic skills to
elucidate all the requirements in the precise detail needed, this does not account for the cost of gathering
those requirements. It is a well-known principle in economics that the cost of producing one more item
may not be the same as the cost of producing the previous item. In the case of gathering requirements,
the marginal cost, as it is known, of getting one more requirement begins to increase at some point. In
some cases, users may not know their requirements until they have had a chance to interact with the
application.
Agile software development methodologies take the spiral approach to an extreme and use very short
software development cycles—as short as several weeks. This allows for almost constant evaluation and
quick adaptation.
Software Testing
Software testing is a quality control measure that is designed to identify bugs in software (similar
to regression testing) and to ensure that all functional requirements defined in the earlier stages
are met by the software. The testing at this stage is integrated testing that exercises the full
functionality of the application. Unlike the testing done by developers, which is referred to as
unit testing, the goal with integrated software testing is to make sure the application’s
components function correctly together.
The artifacts used in integrated software testing are:
• Test plans
• Test scenarios
• Test procedures
• Test scripts
A test plan is a high-level document describing the scope of software testing and usually
includes:
• Functions to be tested
• References to requirements documenting functional requirements
• Known software risks
• Test criteria
• Staffing and resource requirements
• Schedule
The details of how functions are tested are included in test scenarios and test procedures. Test
scenarios describe use cases and specific features within those use cases to test. For example, a
scenario may describe a user retrieving a sales analysis report, entering search criteria for
filtering data, and exporting data to a spreadsheet. The test procedures define the steps carried
out by the tester to test each function. For example, to export the data to a spreadsheet, the tester
will select “Export” from the menu, enter file name “Test123.Xls,” save the file, then open the
file in a spreadsheet program and verify that the table headings, summary data, and formatting
are correct.
Testing can be a time-consuming and tedious task, especially when large numbers of functions
must be tested. Test scripts can be used to automate this process and a number of tools are
available.
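A test script automating the export-and-verify procedure described above might look like the following sketch. CSV stands in for the spreadsheet format, and the headings and file name are illustrative; a real script would drive the application itself through a testing tool:

```python
# Sketch of an automated test procedure: export a report to a file,
# reopen it, and verify the headings -- mirroring the manual steps
# a tester would follow. All data and names are illustrative.
import csv
import os
import tempfile

EXPECTED_HEADINGS = ["Region", "Units", "Revenue"]

def export_report(rows, path):
    """Stand-in for the application's 'Export' function."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(EXPECTED_HEADINGS)
        writer.writerows(rows)

def verify_export(path):
    """Reopen the exported file and check the table headings."""
    with open(path, newline="") as f:
        headings = next(csv.reader(f))
    return headings == EXPECTED_HEADINGS

path = os.path.join(tempfile.mkdtemp(), "test123.csv")
export_report([["West", 42, 1050.0]], path)
ok = verify_export(path)
```

Once scripted, the same procedure can be rerun unchanged against every build, which is where the time savings of test automation come from.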
In addition to testing basic functions, which presumably was done during unit testing in the
development stage, systems integration is also tested. Applications will depend upon other
applications and while application programming interfaces (API) may be well defined and used
properly by client applications, there is more to testing integration than simply making sure a
single API call works correctly. Integration testing should include testing:
• Scalability of calls from the client application
• Consistent security between the applications
• The ability to roll back a transaction across multiple applications
These are the types of non-functional requirements that are not tested in unit testing and must be
explicitly planned for in integration testing.
As the application (or, in the case of large, multi-phase developments, its individual modules)
passes integration testing, it is moved to production through the deployment process.
Software Deployment
The process of software deployment is complex because of the dependencies between so many
aspects of information architectures. Release management, as the practice of controlled software
deployment is known, consists of a number of tasks, including:
• Coordinating with testers to ensure software is ready for deployment to particular
platforms
• Packaging software for installation on target platforms
• Determining dependencies for successful installation of software
• Receiving change control approval to deploy software
• Updating the configuration management database to reflect the new versions of the
software on particular platforms
• Planning the deployment so as not to disrupt operations or, at the very least, to minimize
the impact of the deployment
In addition to coordinating the installation of software, the release management team must
coordinate with developers and trainers to ensure that end users, systems administrators, and
support personnel are all trained on the new software. The deployment phase in many ways
marks the final state of the software development life cycle because after that point the software is actually
in use. It is not truly a terminal state, though, because maintenance is such an important factor in
the life cycle.
Software Maintenance
Software maintenance is the practice of making modifications to applications to ensure that they
continue to meet functional and non-functional requirements and do not present security
vulnerabilities that could compromise the integrity and confidentiality of information or the
availability of the system itself. Software maintenance usually comes in the form of patches and
upgrades.
Patches are usually small changes to code to correct a known problem. They do not provide
additional functionality. Upgrades, in contrast, are designed to enhance the functionality,
performance, or scalability of an application.
Another distinction between patches and upgrades is the timeframe for deploying them.
Patches may be provided by application developers as soon as a problem is discovered,
especially if the flaw results in a security vulnerability. In these situations, systems
administrators may have less time to test and apply a patch. For example, if fast-spreading
malware threatens an application and a newly released patch is available from the application
vendor, the systems administration team may deploy the patch with minimal testing. Upgrades
are usually well planned and both the application developers and application users have time to
properly plan their deployment.
Data Dependencies
Data dependencies occur when one application depends on another to provide specific data at
certain times. There are many aspects of data dependencies to consider, but from a systems
management perspective a key question is: at what point does an application's failure to meet its
requirement to provide data begin to adversely affect operations? Consider some examples:
• The enterprise resource planning system of a retailer with 200 stores nationwide
aggregates sales data from stores each night. All stores are expected to provide data by
midnight (headquarters time zone). If more than 5 percent of the stores fail to provide
data or three or more stores in the same region fail to upload data, summary reports
cannot be generated.
• An order processing system depends on an inventory system to check stock levels before
committing to a delivery date. If the inventory system is offline, the order processing
system estimates the delivery date based on the location of the customer and proceeds
with the order.
• A data warehouse draws data from several systems, integrates the data in an enterprise
data warehouse and then populates a series of data marts targeted to particular analysis
functions. In the event some data sources are down, the data warehouse load continues
but the data warehouse generates only those data marts for which all data has been
received.
Clearly, data dependencies are not “all or nothing” affairs. Well-designed applications degrade
gracefully. If partial data is available, then partial functionality and services should be available.
Systems managers should design and manage infrastructure in such a way to support data
dependencies; to do so they must have insight into not only the requirements but also the
capabilities of applications with respect to data dependencies.
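The retailer example above reduces to a rule that can be checked automatically. The sketch below takes its store count and thresholds from that example; the store and region identifiers are illustrative:

```python
# Decide whether nightly summary reports can be generated, using
# the thresholds from the retailer example: block if more than 5%
# of all stores are missing, or if three or more stores in one
# region failed to upload.
from collections import Counter

TOTAL_STORES = 200

def can_generate_summary(missing):
    """missing: list of (store_id, region) that failed to upload."""
    if len(missing) > 0.05 * TOTAL_STORES:
        return False
    per_region = Counter(region for _, region in missing)
    return all(count < 3 for count in per_region.values())

ok = can_generate_summary([(101, "NE"), (102, "NE")])
blocked = can_generate_summary([(101, "NE"), (102, "NE"), (103, "NE")])
```

Encoding the dependency rule explicitly is what allows the application to degrade gracefully rather than fail outright when partial data arrives.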
Time Dependencies
Time dependencies are an important factor in application management. In some cases, these are
essentially questions of scalability. For example, an online order processing system may be able
to take as many as 1000 orders per minute, but it depends on a service provided by a sales tax
computing Web service that can only process as many as 500 orders per minute. Systems
managers can work with developers to improve on this by dedicating additional servers to the
Web service once the dependency has been identified.
Another type of dependency is more difficult to work around. It is not uncommon for large,
centralized applications to do quite a bit of batch processing outside of business hours. Banks,
for example, will post transactions against accounts and process loan payments during off hours.
Because transaction processing systems carry heavy loads during business hours, off hours are
also the ideal time to accomplish tasks that would otherwise put an inordinate load on the
application. Data extractions, for example, often occur at these times. The problem is that
there are often several data extraction jobs that need to run in a limited time window.
Understanding these requirements is important for systems managers so that they can schedule
jobs and allocate resources appropriately to meet these non-transaction-processing demands.
As with other performance measures, it is important for systems managers to track trends in non-
transaction processing. For example, if batch jobs are taking longer and longer to run, are some
critical processes running over into normal business hours and therefore potentially interfering
with core business operations? If so, how can the current configuration of hardware, software,
and batch jobs be reconfigured to eliminate the problem? The answer to this question requires
detailed information from a variety of sources including system logs and the configuration
management database.
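A simple trend check can flag batch jobs whose runtimes are drifting toward the edge of the off-hours window. The window size, job names, and duration histories below are illustrative, and the projection is deliberately naive:

```python
# Flag batch jobs whose run times are trending upward and are
# projected to exceed the available off-hours window.
# Durations are in minutes; all data is illustrative.

WINDOW_MINUTES = 6 * 60   # e.g., midnight to 6 a.m.

def projected_next(history):
    """Naive linear projection from first to last observation."""
    growth = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + growth

def jobs_at_risk(job_histories):
    return [name for name, hist in job_histories.items()
            if projected_next(hist) > WINDOW_MINUTES]

histories = {
    "post_transactions": [300, 320, 345, 350],   # creeping upward
    "extract_sales":     [60, 62, 61, 63],
}
at_risk = jobs_at_risk(histories)
```

In practice, the duration histories would come from system logs or the job scheduler rather than a hard-coded table, but the trend analysis itself is this simple.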
Software Dependencies
Software dependencies are another type of dependency that should be explicitly managed.
Successful change management procedures depend upon knowing the dependencies between
applications so that a functioning system is not inadvertently disrupted by a change in some
dependent code. Tracking dependencies explicitly in a configuration management database can
help to minimize the chances of that kind of mistake. This is just one of the reasons that software
should be managed like other assets.
Hardware Dependencies
Applications are deployed to particular servers that have specific configurations. The
dependencies between applications and the hardware configuration required to support them
should be explicitly modeled. At any time, a systems administrator or IT manager should be able
to report on the details of which applications are running on the various servers in the
organization.
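With application-to-server dependencies recorded explicitly, the report described above becomes a trivial query. The mapping below is a hypothetical fragment of what a configuration management database would hold:

```python
# Minimal CMDB-style mapping of applications to the servers they
# run on; "what runs on this server?" becomes a simple query.
# Application and server names are illustrative.

DEPLOYMENTS = {
    "order_processing": ["app01", "app02"],
    "inventory":        ["app02"],
    "data_warehouse":   ["dw01"],
}

def apps_on_server(server):
    return sorted(app for app, servers in DEPLOYMENTS.items()
                  if server in servers)

report = apps_on_server("app02")
```

The same structure answers the inverse question (which servers does an application depend on?) with a direct lookup, which is why explicit dependency modeling pays off during change management.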
Acquiring Assets
Acquiring assets and planning for their integration and deployment may depend heavily on the
software development life cycle if the asset is built. Regardless of whether an application is built
or bought, the acquisition process is dominated by:
• Functional requirements
• Compatibility with architecture
• Capacity planning
Functional requirements have been detailed earlier in the chapter. Compatibility with architecture
is another factor that can limit an organization’s options when it comes to acquiring assets.
Although shared standards allow virtually any major platform to interoperate, the cost of
supporting multiple architectural models and platforms may be prohibitive. An architecture
based on J2EE standards, for example, can function with .Net applications but the additional
effort to deploy and maintain multiple architectures may outweigh the benefits.
Capacity planning must also be considered when acquiring assets. Factors influencing the
capacity of an application include:
• Number of users
• Peak load periods
• Time dependencies on other applications and data sources
• Expected growth rates
Availability requirements should also be considered in capacity planning. A clustered
configuration of servers, for example, could improve both availability and capacity.
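A back-of-the-envelope capacity estimate combines those factors: size for peak load at the end of the planning horizon, not for today's average. All of the input figures below are illustrative assumptions:

```python
# Rough capacity-planning arithmetic combining user count, peak
# load factor, and expected growth. All inputs are hypothetical.

def required_capacity(users, requests_per_user_hour,
                      peak_factor, annual_growth, years):
    """Requests per second the system must sustain at peak,
    at the end of the planning horizon."""
    avg_per_sec = users * requests_per_user_hour / 3600
    future = avg_per_sec * (1 + annual_growth) ** years
    return future * peak_factor

per_sec = required_capacity(users=2000, requests_per_user_hour=30,
                            peak_factor=4, annual_growth=0.25, years=2)
```

A clustered configuration would then be sized against this peak figure, with headroom for failover so that availability requirements are met as well.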
Summary
Managing applications is a process with characteristics not found in other areas of asset
management. The dynamics of the software life cycle introduce additional artifacts that must be
managed, such as requirements documents, code libraries, and test cases. Applications
themselves are more dynamic than many other assets and this, in turn, creates more work to keep
configuration management databases up to date and accurately reflecting the state of deployed
applications.
Chapter 8
What Is Governance?
Governance is the process of setting long-term objectives, establishing controls that measure the
progress toward those objectives, and monitoring to ensure controls are followed and objectives
are being met. In short, governance is about deciding what an organization should do, how to
ensure it will get done, and then making sure it does get done. As Figure 8.1 shows, the
governance process encompasses all aspects of service-oriented management (SOM).
Figure 8.1: The governance process defines a framework in which SOM operations are controlled.
Let’s begin with an example that gives an overview of types of governance activities, including:
• Planning and organizing IT operations
• Acquiring and implementing IT solutions
• Ensuring proper delivery and support for IT solutions
• Monitoring services to ensure compliance with policies and procedures
When discussing each activity, let’s explore how to establish goals for each activity and how to
measure progress toward those goals.
The practices described here are industry standards that have evolved over the years. The best
formalization of these types of best practices can be found in the Control Objectives for Information
and related Technology (COBIT) framework established by the Information Systems Audit and
Control Association (ISACA). More details about COBIT and ISACA can be found at
http://www.isaca.org/.
Governance: An Example
Governance, as practiced according to COBIT, is a typical reductionist management practice.
You first identify the parts of SOM, dividing them into logical groups, then continue dividing
those groups into smaller and smaller constituent parts until the resulting units are easily
described in terms of:
• What is to be accomplished?
• What factors influence the success of the objective?
• How can progress on the objective be measured?
For example, consider a company that has a strategic objective to reduce telecommunications
costs by deploying voice over IP (VoIP). Doing so will require substantial investment of time
and money, and the board of directors expects executive management to have a plan in place for
overseeing the deployment of the VoIP system as well as ensuring that ongoing operations are
meeting the organization’s needs within budget and on schedule. The process begins with
planning how to acquire and implement the service. After the planning stage is complete, the
process moves on to the acquisition and implementation processes. COBIT then provides a
framework for delivery and support as well as monitoring and evaluation.
This process is controlled by involving business owners, project managers, and domain experts
who will follow a formal planning process and document their findings, which are then reviewed
and approved by executive management before proceeding to the next stage.
The business analysis and project management group as well as managers from network services
and server and client management would have to be involved in the first step, determining how
the project fits into the IT architecture. This same group would identify the management
processes that will control the execution of the process. Now, ideally, that should be a relatively
easy task. In a mature governance environment, those processes are well established.
See the section “Governance and Maturity Models” later in this chapter for more information about the
different levels of process maturity.
Incorporating the financial planning of the project and conducting the risk analysis are the
responsibilities of the IT managers with assistance from business analysts, project managers, and
domain experts. This group should also handle planning for staffing requirements and training.
The final stage of planning is to formulate a project plan and engage management oversight of
the project.
Each of the activities must be well documented. Common procedures, such as project planning
and risk analysis, often have formal document deliverables that have a well-defined structure.
Project management professionals have formalized their discipline and have developed a body of
knowledge and a set of documents common to project management across domains, not just IT. For
more information, see the Project Management Institute Web site at
http://www.pmi.org/info/default.asp.
The deliverables from the planning stage should include project plans, risk analysis, and
requirements documents. The governance process measures timeliness and quality to ensure that
the planning process is working as expected. For example, key measures might include whether
the documents were prepared on time, if the requirements document addresses the full scope of
business and technical requirements, and whether the project plan met the standards outlined by
the Project Management Institute.
The success of selection phases can be measured by the number of times business owners agree
with feasibility studies and sign off on requirements as sufficiently comprehensive to proceed
with the project. The measures of the deployment phase can include:
• Number and severity of bugs found in testing (reflects on the selection process)
• Number and severity of bugs found after deployment (reflects on the testing process)
• Number of days ahead or behind schedule for key implementation milestones
• Satisfaction of business owners and users with initial deployment
• Number of users trained on the system
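Such measures lend themselves to a simple scorecard. The following sketch uses purely illustrative data and field names:

```python
# Illustrative deployment-phase scorecard built from the measures
# listed above. All data and field names are hypothetical.
from collections import Counter

bugs_in_testing = ["high", "low", "low", "medium"]
bugs_after_deploy = ["low"]
milestone_variance_days = {"pilot": -2, "rollout": 3}  # positive = late

scorecard = {
    "testing_bugs_by_severity": dict(Counter(bugs_in_testing)),
    "post_deploy_bugs": len(bugs_after_deploy),
    "days_behind_schedule": sum(d for d in milestone_variance_days.values()
                                if d > 0),
}
```

The value of the scorecard is comparability: the same measures collected across projects let executive management see which phases of the governance process are working and which are not.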
After implementation of the service, the governance process will continue by controlling the
maintenance and support for the service.
Monitoring Operations
The final stage of the governance process is monitoring the delivery of VoIP services to
determine whether objectives are being met. This could include analyzing summaries of the
management reports generated as part of operational maintenance. The objective isn't just to
know how the service is performing but to know what is being done to correct any problems.
In practice, you do not perform governance over a single project or operation but over all IT
projects and operations. This example illustrated the types of controls and measures that need to
be in place to ensure that projects and services meet management expectations and, if they do
not, that mechanisms are in place to make executive management aware of problems and provide
them with enough information to address the problem. Let’s move from the example to the
formal structure of controls.
Governing IT Services
Governing IT services, according to COBIT practices, is divided into four parts:
• Planning and organization
• Acquisition and implementation
• Delivery and support
• Monitoring
Each of these areas is broken down into a set of control objectives, which in turn, have a
definition, a method for achieving the objective, and suggested measures for determining
whether the objective is being met.
This section is not an attempt to cover all the topics addressed by COBIT. Some planning and
organizational topics, such as controlling quality, are not covered. The purpose of this section is to
describe governance and its relation to SOM. This chapter cannot, and does not attempt to, replace
COBIT documentation.
Defining IT Architecture
The planning and definition of an IT information architecture is one of the first points at which
security emerges as a prominent aspect of planning. The information architecture of an enterprise
includes:
• An organization-wide data model
• A data classification scheme
• Assignment of ownership of elements of the data model
Data Ownership
A role of data owner should be defined for each element of the data model. The business owner
is the person responsible and accountable for the management of that data. The business owner
role is typically filled by an executive or manager; it is not the systems administrator or
database administrator who may be responsible for the day-to-day maintenance of the data and
the infrastructure that supports it. Data owners are responsible for:
• Formulating policies and procedures controlling the use of the data
• Meeting regulatory requirements concerning the data
• Defining security, availability, and business continuity requirements regarding the data
The information architecture is one of the areas of COBIT that has direct impact on systems
management operations; another is defining IT processes and organization.
Managing IT Investments
Managing IT investments can be boiled down to one word: budgeting. Given a set of strategic
directives, IT executives are expected to deliver the services needed with the financial resources
allocated. This process is more than just balancing funding and expenditures; it includes:
• Allocating funds to specific operations and projects
• Creating financial forecasts and optional scenarios
• Establishing criteria for measuring the value of proposed projects
• Monitoring the value of ongoing projects
Managing investments is highly dependent on proper management of human resources and
projects.
Managing IT Risks
Risk management, like project management, should be done within a formal framework. IT, by
nature, has risks not present in other business areas. The potential for system incompatibilities,
security threats, and the disruption of operations can occur on a substantial scale with relatively
little input. For example, a single attacker could breach a database application and steal tens of
thousands of customer records, or a single failure in a critical network device can disrupt
multiple operations.
Managing IT risks includes:
• Defining a risk management framework for determining risks and identifying the
organization’s level of risk tolerance
• Conducting risk assessments
• Formulating risk mitigation plans
By creating a formal management structure that includes all the essential elements outlined, an
IT organization will have a strong foundation for moving to the other areas of IT operations,
such as the acquisition and implementation of IT services.
The evaluation and selection tasks entail what is often called the “build vs. buy” decision. This is
something of a misnomer because complex systems can rarely be reduced to such a simple
dichotomy. In practice, the decision is more akin to selecting a point on a continuum ranging
from buying a turnkey solution to building a custom solution for every aspect of a system.
For example, in the case of the VoIP example from earlier in the chapter, the systems designers
and project sponsors may conclude that no commercially available system meets all needs. The
same group is likely to conclude that “building” a VoIP solution is not feasible. The solution in
such cases involves starting with a commercial application as a base, customizing it as needed,
and integrating it with existing infrastructure to get the functionality
required. This is done during the acquisition and maintenance phases.
The term “death march” has come into software development parlance to describe a project that will
inevitably fail. The failure is often due to a combination of poor planning, poor project management,
insufficient resources, changing requirements, and unrealistic schedules. All these factors can be
avoided, or at least mitigated, by proper governance procedures.
Managing Change
Changes made on an ad hoc basis are more likely to succumb to a common scenario. It begins
with an urgent requirement coming to light or the discovery of a flaw in a program. Due to a
sense of urgency, rather than follow formal analysis, design, development, and testing, it is
decided that a developer can start with a minimal summary of requirements (which are rarely
documented). The developer makes a change that addresses the immediate problem, or at least
corrects the symptom of the problem, with the good intention of going back into the code and
fixing it the right way when he or she has more time. Formal testing procedures are bypassed and
after a few unit tests followed by a minimal integration test, the code is moved to production.
What follows from that point can vary, but some of the outcomes are:
• The patch itself has a bug that was not detected during the minimal testing that was done
• A new bug indicates an unanticipated dependency in another part of the code, which was
thought to be unrelated to the section that was patched
• The patch, while programmed according to the system documentation, fails to work
correctly because of a previous ad hoc patch that changed a function but was unknown
because the follow-on step of updating the documentation wasn’t performed
Security entails a balance between the need to protect information and assets and the need to
keep resources accessible to users without unnecessary burden. To strike a proper balance, the
business requirements, relative to security requirements, should be well documented. This begins
with data classification. Identity management and access controls will build upon the information
classification scheme described earlier. By first partitioning information into different categories,
it is easier for security managers and systems administrators to properly apply access controls.
Other requirements, such as the need to share information with business partners, can extend
beyond the boundaries of the organization.
Government regulations drive a wide array of security requirements and have helped to promote
the practice of IT governance. At the very least, organizations should understand which
regulations apply to the operations and then review policies and procedures in light of those
regulations. In some cases, such as the Sarbanes-Oxley Act, auditors can help formulate
appropriate controls to meet regulatory requirements.
Monitoring and auditing policies are required as well. Governance depends on measures to
assess the effectiveness of controls, so one would expect security management to require
monitoring for that reason alone. More importantly, monitoring is an active part of information
security practice; it serves multiple purposes, ranging from helping detect anomalous events to
providing traces of events that occur during a security breach.
Finally, the governance of security operations should include the establishment of incident
response policies and procedures. Executives, managers, systems administrators, network
managers, and others should all know their roles and responsibilities in the case of a security
breach. Well-defined reporting procedures should be established. Key measures of system
security management include the number and severity of security breaches and the number of
times security requirements are not met.
Thus, if five departments use equal amounts of storage, they each pay the same amount. Suppose
that one of the departments decides to use another storage service or no longer needs as much
storage. With less storage in use, the IT department collects less in charge backs and no longer
recovers its costs. Does the IT department increase charges to compensate for the lost charge
backs? If it does, the other departments will bear the increased costs, leading them to
either reduce their storage use or look elsewhere for storage services. If the remaining
departments reduce their storage use, the cycle continues, and the IT department would have to
increase per-unit charges again to recover costs.
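The arithmetic behind this spiral is easy to see in a small sketch. The cost figures, department names, and units below are purely illustrative assumptions, not taken from the chapter:

```python
# Hypothetical illustration of the charge-back "death spiral" described above.
# Fixed costs must be recovered from whatever storage is still in use, so each
# departure raises the per-unit rate for the departments that remain.

FIXED_COSTS = 100_000  # annual cost of the storage service (assumed figure)

def per_unit_rate(total_units_used):
    """Per-unit charge needed to fully recover fixed costs."""
    return FIXED_COSTS / total_units_used

# Five departments, each using 100 units of storage.
usage = {"A": 100, "B": 100, "C": 100, "D": 100, "E": 100}
rate = per_unit_rate(sum(usage.values()))      # 100000 / 500 = 200 per unit

# Department E moves to another storage service.
del usage["E"]
new_rate = per_unit_rate(sum(usage.values()))  # 100000 / 400 = 250 per unit

# The remaining departments now pay 25% more for the same usage, which may
# prompt further departures -- and the cycle continues.
```
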
Internal charge-back models must be carefully formulated to avoid distorting reasonable
economic incentives. Some balance must be found to meet the objectives of individual
departments while realizing the benefits of economies of scale. Key measures of the success of a
charge-back system are the number of times charge-back costs are disputed and the number of
times service agreements are either terminated or not renewed because of cost disputes.
Measuring Variance
Budgets will vary from actual expenditures; how often this happens and to what degree is
another measure of the budgeting and allocation management process. When measuring
variance, management should determine an appropriate aggregate level.
For example, within a department, a line item for one activity, such as payroll and benefits, may
be over budget but another comparable line item, such as consulting fees, may be sufficiently
under budget to compensate for the difference. This may or may not be a cause for concern.
Consulting fees within the budget may be highly variable, while payroll costs tend to be less so.
If consulting fees are reduced in the next budget cycle, what will offset the ongoing increased
payroll charges?
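The payroll-versus-consulting example can be sketched in a few lines; the dollar amounts and line-item names below are assumptions chosen only to make the offsetting effect visible:

```python
# Hypothetical sketch of variance measurement at two aggregation levels.
# Offsetting line-item variances can hide within a department-level total.

budget = {"payroll_and_benefits": 500_000, "consulting_fees": 200_000}
actual = {"payroll_and_benefits": 530_000, "consulting_fees": 168_000}

def variance(budgeted, spent):
    """Positive = over budget, negative = under budget."""
    return spent - budgeted

line_item_variance = {k: variance(budget[k], actual[k]) for k in budget}
department_variance = variance(sum(budget.values()), sum(actual.values()))

# Payroll is 30,000 over and consulting is 32,000 under; the department as a
# whole looks fine (-2,000), yet the ongoing payroll increase remains hidden.
```
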
Another type of variance that should be monitored by the governance process is the reallocation
of funds to different types of expenditures. For example, a decrease in spending on service
contracts to compensate for overruns in other line items could leave some services vulnerable to
disruption or subject to lower performance levels than defined in SLAs.
Providing Training
Training is a fundamental IT service. For overall governance of training, the following are key
measures:
• Number of users trained
• Rate of service desk calls related to functions addressed in training
• Subjective quality ratings provided by trainees
Effective training should be reflected in reduced demand for service support.
Managing Data
Backup and recovery operations are required to preserve the availability and integrity of data.
Although the topic sounds mundane and rather simple at first, the complexities of backup
become clear quickly. Some of the topics that must be addressed in backup policies include:
• Determining what to back up—Data is frequently duplicated for performance purposes or
for ease of integration. What data source is considered the system of record (that is, the
definitive record)?
• Adequately protecting backup data—For example, what data classifications should be
encrypted when backed up?
• Determining retention periods—When data is removed from operational systems
according to records retention policies, how will backup copies of the data be deleted?
• Testing backup media—Backup media is subject to failure like any other device; how
much testing of backup and archive material is required?
This service may be measured by the number of backups performed successfully and the
percentage of backups completed in the time allotted to the backup process.
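Both measures can be computed from backup-job records; the record fields and figures below are assumptions for illustration:

```python
# Hypothetical sketch of the two backup measures mentioned above, computed
# from a list of backup-job records (field names are assumptions).

jobs = [
    {"succeeded": True,  "minutes": 110, "window_minutes": 120},
    {"succeeded": True,  "minutes": 130, "window_minutes": 120},  # ran long
    {"succeeded": False, "minutes": 45,  "window_minutes": 120},  # failed
    {"succeeded": True,  "minutes": 95,  "window_minutes": 120},
]

# Share of jobs that completed successfully.
success_rate = sum(j["succeeded"] for j in jobs) / len(jobs)

# Share of jobs that both succeeded and finished inside the backup window.
on_time = [j for j in jobs if j["succeeded"] and j["minutes"] <= j["window_minutes"]]
on_time_rate = len(on_time) / len(jobs)

# 3 of 4 jobs succeeded; only 2 of 4 also finished within the window.
```
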
Inadequate resources hamper the development team in the third example. The team is forced to
conduct two distinct activities, testing and development, in the same environment. This can lead
to conflicts in the use of resources, introduce dependencies that would not exist if separate
environments were used, and can delay deliverables as tasks are scheduled around the limited
resources of the development environment. In addition, testing sizable software development
efforts requires a formal methodology; many developers are not trained in those methodologies.
The result is, good intentions aside, inadequate testing that leads to higher risk of failure during
deployment.
The final scenario depicts a lack of emphasis on operational support. Without proper training and
support, systems administrators will not be able to effectively manage and tune an application.
Users may not receive or may be delayed in receiving the services they need. These kinds of
scenarios show that adopting sound management frameworks and development methodologies is
not black and white (as in, either you do it all or you do not do it at all); rather, there is a
continuum with many processes within many organizations somewhere between the best and
worst extremes.
This framework uses a set of key goal indicators and key performance indicators to guide the
implementation of the COBIT objectives. Key goal indicators measure how effectively an
organization achieves its goals. Key performance indicators measure specific operations and
processes and are leading indicators of the organization’s trend toward reaching its goals. Key
goal indicators measure overall performance with respect to a goal after the fact. In contrast, key
performance indicators are gathered during the observation period and thus allow time
for management to adjust practices and make corrections as needed.
For more information about maturity models and management, see the SEI documentation at
http://www.sei.cmu.edu/managing/managing.html.
Organizations that start to implement formal governance procedures will do so at some point in
the maturity model. If, for example, executive management has decided to improve the software
development process, which is currently at some point between Level 1 and Level 2, then one of
the first objectives will be to formalize documentation and training. Governance measures should
also allow management to track progress toward those goals. Implementing
governance procedures must be done with recognition for the relative capability maturity of the
IT organization.
Summary
Governance is the process of directing and controlling operations to ensure that long-term
objectives are met. COBIT is a deep and broad framework for implementing governance best
practices. The IT field is mature enough that management and governance practices need not be
an exercise in reinventing the wheel; rather, the goal of executives and IT management should be
to find frameworks that serve the needs of the organization and work well together. SOM, for
example, is highly amenable to governance because of the logical organization of operations and
resources and the focus on measuring performance. It should be understood that governance is an
ongoing process that will change as the maturity level of the IT organization changes. As
systems management, software development, and training procedures improve, it is likely that
the ability to keep those aligned with strategic objectives will improve as well.
Chapter 9
Network Security
Network security requires many security measures around the network perimeter. For example,
common network security devices include:
• Firewalls
• Intrusion detection/prevention systems
• Content filters
• Network access controls
• Messaging boundary gateways
All these are primarily security devices, but they are still information assets that require
management. Furthermore, these devices are becoming more complex, and that complexity
implies more demanding management. Take, for example, the most basic of network security
devices: the firewall.
Firewalls segment networks and control the type of traffic that can pass between segments. For
example, HTTP traffic may be allowed from devices outside of the organization’s network but
FTP traffic is not. Firewalls are a first line of defense but are limited by the amount of
information they analyze. For example, packet filtering firewalls examine only packet header
information, whereas proxy firewalls examine information within the packet. Application
firewalls are increasing in popularity because they can filter traffic based on the needs and
vulnerabilities of a particular application.
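The difference between header-only filtering and deeper inspection can be sketched simply. The packet structure, rule sets, and pattern list below are assumptions invented for illustration, not a real firewall API:

```python
# Hypothetical contrast between the two inspection depths described above.
# A packet-filtering rule looks only at header fields; an application-level
# check can also examine the payload. Field names are assumptions.

def packet_filter(packet, allowed_ports):
    """Header-only decision: allow traffic based on destination port."""
    return packet["dst_port"] in allowed_ports

def application_filter(packet, allowed_ports, banned_patterns):
    """Deeper decision: also inspect the payload for disallowed content."""
    if packet["dst_port"] not in allowed_ports:
        return False
    return not any(p in packet["payload"] for p in banned_patterns)

http = {"dst_port": 80, "payload": "GET /index.html HTTP/1.1"}
ftp  = {"dst_port": 21, "payload": "USER anonymous"}

# The policy from the example above: HTTP is allowed, FTP is not.
assert packet_filter(http, {80, 443})
assert not packet_filter(ftp, {80, 443})

# The same port can still be blocked once the payload is inspected.
suspicious = {"dst_port": 80, "payload": "GET /cmd.exe HTTP/1.1"}
assert not application_filter(suspicious, {80, 443}, ["cmd.exe"])
```
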
For network administrators, the increasing complexity of firewalls and other network security
devices will bring with it greater demand for mature systems management practices. Whereas in
the past one or two packet-filtering firewalls might have protected a network, today there may be
several more complex firewalls within network segments as well. These must be administered,
patched, and maintained, and that means they must fall under systems management operations.
The following activities can generally be expected of network administration and systems
administration groups when supporting network security:
• Assisting with the procurement of network security devices
• Installing and configuring those devices
• Monitoring basic functionality
• Generating alerts and logging events, such as a device going offline
• Maintaining asset information in a configuration management database
• Applying patches
• Assisting with vulnerability assessments
• Participating in risk analysis operations related to network security
Of course, network administrators and systems administrators each have distinct areas of focus;
to maintain and improve security, they should also understand the architectures and processes
that constitute their colleagues’ domains.
As you move further from the perimeter and away from specialized security devices, the roles of
systems administrators as security professionals increase. This is certainly true of host security
operations.
Host Security
Host security measures maintain the integrity, confidentiality, and availability of information and
services provided by servers and client devices. System attacks are those targeted at particular
applications or hosts. The purpose of such attacks may be to disrupt services or steal information.
As economic motives have grown to dominate the reasons behind serious attacks,
we are likely to see more attacks targeted at specific applications and hosts.
Attacks may include:
• DoS attacks attempting to disrupt the operations of an organization
• Database breaches attempting to steal private but profitable customer information
• Application-specific attacks, such as attacks on enterprise management systems that
contain sensitive and confidential information about an organization’s operations
System attacks are often not ends in themselves but rather a means to an end—information and
resource theft. In spite of the range of functions hosts serve, a number of security measures are
common to all, including:
• Personal firewalls
• Anti-malware
These are some of the most common elements of security support within the practice of systems
management.
Personal Firewalls
Personal firewalls serve the same purpose as network firewalls but the function is localized to a
single host. The personal firewall controls traffic into and out of a device. Controlling inbound
traffic on a host is a basic perimeter type defense with obvious benefits.
Outbound traffic can also be blocked. This can reduce the impact of a compromised device,
which might, for example, be part of a spam-generating botnet. The malware infecting the device
may generate spam but the personal firewall can block the transmission of the unwanted email.
The challenge for managing personal firewalls is the number of devices that must be deployed
and the varying requirements. Consider some examples:
• A Web server will require traffic on HTTP, HTTPS, and related ports
• A database server will require inbound and outbound traffic on ports dedicated to the
database listener
• A salesperson’s notebook should have outbound email (SMTP) traffic blocked
• All hosts may, by default, require FTP ports blocked
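These varying requirements lend themselves to role-based baselines. The sketch below is a hypothetical policy structure; the role names and policy layout are assumptions, though the port numbers (FTP 20/21, HTTP 80, HTTPS 443, SMTP 25, Oracle listener 1521) are standard:

```python
# Hypothetical sketch of role-based personal firewall baselines.

BASELINE_BLOCKED = {20, 21}          # FTP blocked on all hosts by default

ROLE_POLICIES = {
    "web_server":      {"allow_in": {80, 443}, "allow_out": {80, 443}},
    "database_server": {"allow_in": {1521},    "allow_out": {1521}},  # listener port
    "sales_notebook":  {"allow_in": set(),     "allow_out": set()},   # SMTP 25 stays blocked
}

def is_outbound_allowed(role, port):
    """Allow a port outbound only if the role permits it and it is not baseline-blocked."""
    if port in BASELINE_BLOCKED:
        return False
    return port in ROLE_POLICIES[role]["allow_out"]
```

With such a structure, security staff define `ROLE_POLICIES` once, and systems administrators can verify that each deployed device matches the policy for its role.
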
Determining the proper configuration of a personal firewall is the responsibility of security staff.
Once the configuration is defined, though, ensuring that the proper configuration is in place and
the software is up to date and activated on the device is the responsibility of systems
administrators. One of the areas in which the systems management and security staff will have
shared responsibilities is in managing anti-malware systems.
Anti-Malware
The malware threat has evolved from disruptive and annoying viruses written to demonstrate a
hacker’s ability to circumvent normal operating system (OS) operations to financially motivated,
sophisticated blended threats designed to steal information and compromise hosts. There are
several distinct types of malware:
• Viruses and worms
• Keyloggers and video frame grabbers
• Trojan horses
• Botnets
• Rootkits
These different types of malware are used for carrying out different aspects of an attack and may
be blended together to create a more serious threat than posed by any single type of malware on
its own. Understanding the differences among malware types is important to understanding how
they can impact IT operations.
Trojan Horses
Trojan horses are programs that appear to serve one purpose but actually contain malware.
Trojan horses may be found in:
• Browser add-ons
• Utility programs, such as clock synchronizers
• File-sharing utilities
• Programs and files sent through email and instant messaging
Trojan horses are a mechanism for distributing malicious code. They are often used with
multiple forms of malware, known as blended threats, which can include keyloggers,
communications programs, file transfer programs, and command and control programs that allow
remote control or remote execution of code. The ability to execute programs on compromised
hosts gives attackers the means to create networks of compromised computers, sometimes called
zombies but more commonly known as bots.
Rootkits compromise the OS, so there is not necessarily a trusted computing base. Any
information returned by the OS kernel (for example, processes that are executing or the size of a
particular binary file) may not be trustworthy because the code that executes the requested service may
be compromised.
Some tools have been developed to detect patterns indicative of the presence of a rootkit. For
example, a rootkit detector might compare file system information returned by the OS with
information returned by low-level analysis of the disk system; any discrepancies could indicate
the presence of a rootkit. Another technique is to boot a device from a trusted source, such as an
OS CD, and scan for rootkits.
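The cross-view comparison technique reduces to a set difference. The file names below are fabricated for illustration, and a real detector would compare far more than names (sizes, hashes, process lists), but the core idea is:

```python
# Hypothetical sketch of the cross-view comparison described above: file names
# reported by the (possibly compromised) OS API are compared against a
# low-level scan of the disk. Discrepancies suggest something is being hidden.

def cross_view_diff(os_reported, low_level_scan):
    """Return items visible to the raw disk scan but hidden from the OS view."""
    return sorted(set(low_level_scan) - set(os_reported))

os_view   = ["cmd.exe", "kernel32.dll", "notepad.exe"]
disk_view = ["cmd.exe", "kernel32.dll", "notepad.exe", "rk_driver.sys"]

hidden = cross_view_diff(os_view, disk_view)
# A non-empty difference is an indicator (not proof) of a rootkit.
```
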
Rootkits may become even more difficult to detect, especially if vulnerabilities in BIOS are exploited.
See Robert Lemos’ “Researchers: Rootkits Headed for BIOS” at
http://www.securityfocus.com/news/11372.
The best response to the threat of malware attacks is to use a defense-in-depth strategy. This
approach recognizes that no one countermeasure or policy will fully mitigate the risks of an
attack. It also recognizes that anti-malware programs and related systems are themselves
complex programs with their own limits and vulnerabilities. A defense-in-depth approach to
malware protection will include:
• Antivirus and personal firewalls on client devices
• Network-based content filtering to block malicious content before it reaches the client
• Intrusion prevention monitoring to detect unusual network activity, such as large volumes
of network traffic outside of normal patterns
• Host-based intrusion prevention that detects changes to OS files
• Regular monitoring of logs and audits of security measures
• End-user training, especially on the threat of social engineering techniques
• Comprehensive set of policies that define an organization’s strategy for managing the
risks of malware attacks
An emerging technique for addressing the threat of unwanted applications—such as malware,
bots, and other unintentionally downloaded software—is application control. Application control
mechanisms allow administrators to define policies about the programs that may run in an
environment. For example, a policy may categorize applications based on an administered
security rating, digital signing, date of discovery, or other attribute. Measures such as application
controls are an increasingly important addition to defense-in-depth strategies.
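One common form of application control is an allowlist keyed on binary fingerprints. The following is a minimal sketch of that idea, assuming a SHA-256 hash as the fingerprint; the byte strings stand in for real binaries:

```python
# Hypothetical sketch of an application-control check: only binaries whose
# hash appears on an administrator-approved allowlist may run.

import hashlib

def fingerprint(binary_bytes):
    """Content-based identity for a program, independent of its file name."""
    return hashlib.sha256(binary_bytes).hexdigest()

approved_app = b"approved binary contents (illustrative)"
unknown_app  = b"unintentionally downloaded contents (illustrative)"

allowlist = {fingerprint(approved_app)}

def may_run(binary_bytes, allowlist):
    """Policy decision: execute only if the binary's hash is approved."""
    return fingerprint(binary_bytes) in allowlist
```

Hashing rather than file names matters here: renaming malware to `notepad.exe` does not change its fingerprint, so it still fails the check.
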
Another area of security that is dependent on systems management services is vulnerability
management.
Configuration Management
Configuration management entails tracking the software and configuration of devices. This is of
importance to security management in a number of scenarios. First, if a vulnerability is
discovered in a particular version of an application, a configuration management reporting
system can identify which hosts are running that version. This is especially important when
client devices may have one of several different versions. Although an organization may make it
a policy to standardize on one or two versions of office suites, it may have half a dozen or more
versions of client-side database drivers.
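That reporting step can be as simple as a filtered query over the configuration records. The CMDB structure, host names, and version strings below are assumptions for illustration:

```python
# Hypothetical sketch of a configuration-management lookup: given a CMDB as a
# list of records, find the hosts running a vulnerable version of a component.

cmdb = [
    {"host": "ws-001", "software": "db-driver",    "version": "9.2"},
    {"host": "ws-002", "software": "db-driver",    "version": "10.1"},
    {"host": "ws-003", "software": "db-driver",    "version": "9.2"},
    {"host": "srv-01", "software": "office-suite", "version": "2007"},
]

def hosts_running(cmdb, software, vulnerable_versions):
    """Identify hosts that need attention for a newly announced vulnerability."""
    return sorted(
        r["host"] for r in cmdb
        if r["software"] == software and r["version"] in vulnerable_versions
    )

affected = hosts_running(cmdb, "db-driver", {"9.2"})
```
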
Another case in which configuration management supports security operation is with risk
analysis and incident response. The existence of a vulnerability is one factor that determines how
to respond; another factor is the importance of the device with the vulnerability. High-priority
devices, such as customer-facing servers, should be patched immediately if the vulnerability
could seriously disrupt operations. However, a lower-priority device, such as a server running a
database tracking a training schedule, can be queued for patching at a later time after critical
systems have been addressed.
Another area configuration management can help with security is in the planning process. For
example, if antivirus software will be upgraded, the configuration management database can help
to determine the number of licenses required. Much of host-based security, though, is based on
countermeasures, such as personal firewalls and anti-malware systems.
Patch Management
Patch management is the process of updating software and configuration (“patching”) to improve
security, functionality, or performance. There are several components in patch management:
• Being aware of patch releases
• Testing patches
• Deploying patches
• Maintaining configuration information
All of these require systems management support. Patches are released on regular schedules by
many vendors, so those can be incorporated into maintenance schedules. These patches often
address minor or moderate impact bugs or provide performance improvements. Unscheduled
patches, such as fixes for security vulnerabilities, may come at any time. Assessing the impact of
vulnerabilities and the benefit of patching is a process that should be done by both security and
systems administration teams.
The SQL Slammer incident is one of those cases that did not have to happen. Microsoft had patched
the vulnerability exploited by SQL Slammer months before the worm struck. Part of the problem was
that database administrators had not patched SQL Server instances, and part of the problem was due
to users not knowing they were running a desktop version of SQL Server that had been embedded in
some applications. This is one of the reasons asset management is so important to information
security—you must know what software you are running and how it is patched.
Controlling Access
One of the most challenging aspects of both security and systems management is access controls.
The number of users, roles, and privileges is growing so rapidly that for many organizations, the
only way to keep up is to leverage automation. This, in turn, requires a policy framework for
driving the automated tools.
Controlling access is not one-dimensional; it must be examined from both a user and a resource
perspective. With regard to users, the key issues are identity management, authentication,
and authorization. From the resource management perspective, it is important to address topics
such as file and disk encryption as well as secure remote access.
For example, if a new employee joins the finance department, a policy can define the
authorizations to use all the financial systems common to members of the department as well as
common to all employees. More specific details, such as the person’s role in the department, can
provide further authorizations. This model provides for a centralized method for access control in
contrast to a commonly found alternative.
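The layered, policy-driven model can be sketched as a union of entitlements. The policy names and entitlements below are assumptions chosen to mirror the finance example:

```python
# Hypothetical sketch of policy-driven provisioning: a new hire's
# authorizations are derived from company-wide, department, and role policies
# rather than being granted system by system.

POLICIES = {
    "all_employees":   {"email", "intranet"},
    "finance":         {"general_ledger", "financial_planning"},
    "finance:analyst": {"budget_model"},
}

def authorizations(department, role):
    """Union of entitlements from each applicable policy layer."""
    grants = set(POLICIES["all_employees"])
    grants |= POLICIES.get(department, set())
    grants |= POLICIES.get(f"{department}:{role}", set())
    return grants

new_hire = authorizations("finance", "analyst")
```

When the person changes departments or leaves, only the policy inputs change; every dependent authorization follows from the central model instead of being chased across individual systems.
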
Without identity management, a user’s accounts and authorizations are distributed across systems
and applications. If a person needs access to the financial planning system, an account is created
on that application for them. If they need access to a network file server, authorizations are
established on the network server for them. When the person changes positions or leaves the
company, systems and application administrators around the company have to update access
controls. Deploying identity management is clearly beneficial for both operational and security
reasons and it is another area that crosses boundaries between systems management and security
management.
For more information about the problem of lost and stolen notebooks, see the Realtime IT
Compliance Community at http://www.realtime-itcompliance.com/lost_stolen_laptops/.
A logical solution for security professionals is to use file, or ideally full disk, encryption. If a
mobile device is stolen or lost, no one else will be able to access the information on the device
assuming sufficiently strong encryption is used.
Systems administrators may not see it as such a black-and-white situation. Yes, encryption will
protect data when the device is stolen, but what about operations when the device is not stolen?
Consider:
• What happens if there is a problem with the disk drive, the encryption key cannot be
recovered, and the disk must be reformatted?
• What happens if the encryption key is lost?
• What is the performance penalty for encryption?
• How will devices be administered once they are encrypted?
• How will full disk encryption configurations vary by hardware model and feature?
Full disk encryption is growing in popularity, and more organizations are likely to adopt it.
Administrators will have to understand how this functionality will affect end-user support,
recovery efforts, and device management tools and procedures.
File encryption also helps to address the growing problem of data in motion. Files are easily
transferred to removable media, such as USB memory devices, iPods, and removable disk drives.
Encryption can help to protect data copied to such devices; a better solution is controlling access
to such media based on policy, or in some cases, blocking access to them completely.
Security Policies
Security policies are the foundation of an information security program. Policies are high-level
descriptions of what is permitted and what is expected with regard to security. Organizations will
typically have several security policies, covering:
• Acceptable use of IT infrastructure
• Access control
• Anti-malware policy
• Content-filtering policy
• Encryption policy
• Document and email retention
• Notebook and mobile device security
• Server and workstation security policy
• Wireless network access policy
Policies are generally written to clearly define the scope of the policy, the reason for the policy,
and the details of the policy as well as provide the definition of technical terms if needed. An
encryption policy, for example, might contain:
• A scope statement that defines the business units, employees, contractors, and business
partners that need to adhere to the policy.
• An explanation for the need for the policy, such as protecting the confidentiality of
customer information and proprietary company information.
• Policy details, such as a list of the categories of information that must be encrypted (for
example, confidential, private, and sensitive information), the algorithms that may be
used, and minimum key lengths.
• Definitions for terms such as digital signatures and public key cryptography.
Policies, such as encryption, can apply to multiple services or they may be specific to a particular
service, such as email policies. In either case, policies should be aligned with the service-
oriented model.
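A policy written this way can be checked mechanically. The sketch below encodes a hypothetical encryption policy; the classification names match those used later in this chapter, but the approved algorithms and minimum key lengths are assumed values, not a recommendation:

```python
# Hypothetical sketch of enforcing the policy details listed above: given a
# document's classification and a proposed cipher/key length, decide whether
# the combination satisfies the encryption policy.

MUST_ENCRYPT = {"confidential", "private", "sensitive"}
APPROVED = {"AES": 128, "RSA": 2048}   # algorithm -> minimum key length (bits)

def complies(classification, algorithm=None, key_bits=0):
    """True if the proposed protection meets the policy for this classification."""
    if classification not in MUST_ENCRYPT:
        return True                    # e.g., public data need not be encrypted
    if algorithm not in APPROVED:
        return False                   # unapproved or missing algorithm
    return key_bits >= APPROVED[algorithm]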
Compliance
Adequate protection of private and confidential information plays a role in many government
regulations. Some of the most well known include:
• Sarbanes-Oxley Act—publicly traded companies
• Gramm-Leach-Bliley Act—financial service firms
• Health Insurance Portability and Accountability Act (HIPAA)—health care firms
• BASEL II—financial services
• 21 CFR Part 11—pharmaceutical companies
• Federal Information Security Management Act (FISMA)—federal government
• California State Bill (SB) 1386—businesses with customers in California
• EU Directives on Privacy—companies doing business in the EU
• Personal Information Protection and Electronic Documents Act (PIPEDA)—companies
doing business in Canada
Responsibility for complying with the array of regulations in existence is likely spread across a
number of departments. Fortunately for IT practitioners, sound security management practices
often contribute significantly to meeting compliance requirements. With proper controls, such as
information classification, access controls, network and host defenses, and proper monitoring
and auditing, IT departments can meet the requirements of many regulations by continuing their
security best practices. The organization of information security addresses the need for
governance and management of security services and functions.
With regard to governance, executive management should have well-defined controls and
measures in place to allow them to monitor and, if necessary, correct security operations. The
governance model detailed in the Control Objectives for Information and Related Technologies
(COBIT) framework provides a sound foundation for governance practices in general. The
controls and measures described in COBIT are useful across the spectrum of service-oriented
management, not just security management.
For more information about COBIT, see the Information Systems Audit and Control Association’s
Web site at https://www.isaca.org/.
Information Classification
Information classification is the process of labeling different types of information and
establishing appropriate controls for each type. Commercial and military institutions use
different classification schemes; the most common categories in commercial classifications are:
• Public
• Sensitive
• Private
• Confidential
By categorizing information, appropriate controls can be placed on information without having
to apply a most-restrictive policy that protects all information as if it were equally important.
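The classification scheme maps naturally to graduated control baselines. The control names in this sketch are assumptions for illustration; the classifications are the four listed above:

```python
# Hypothetical sketch of mapping each classification to a baseline set of
# controls, so protection effort scales with sensitivity instead of applying
# the most restrictive policy everywhere.

CONTROLS = {
    "public":       set(),
    "sensitive":    {"access_control"},
    "private":      {"access_control", "encryption", "audit_logging"},
    "confidential": {"access_control", "encryption", "audit_logging",
                     "restricted_distribution"},
}

def required_controls(classification):
    """Baseline controls for a given information classification."""
    return CONTROLS[classification]
```
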
Public Information
The public classification is reserved for information that, if disclosed publicly, would not have an
adverse effect on the organization. For example, information provided in press releases would
not contain information that requires any unusual level of protection.
Sensitive Information
Sensitive information should not be publicly disclosed, but if it were, the disclosure would not
have serious adverse effects on the organization. Information about project plans, work
schedules, orders, inventory levels, and other operational data by itself could not be used against
an organization. However, it is conceivable that a competitor could piece together competitive
intelligence about a firm by examining large amounts of such operational information.
Private Information
Private information is about customers, clients, patients, employees, and other persons who have
dealings with an organization. The disclosure of private information could adversely affect those
individuals; organizations may be subject to fines or other legal proceedings for violating
regulations regarding the protection of private information. Examples of private information
include:
• Employee records
• Protected healthcare information
• Financial records
• Social Security numbers, driver’s license numbers, and other identifying information
Depending on the industry, organizations could be subject to a range of regulations governing
the protection of private information. The health care and financial services industries are subject
to comprehensive regulations in the United States; the European Union (EU) has established
broad privacy protections that apply to all businesses.
Confidential Information
Confidential information requires significant controls because the disclosure of this information
could have a significant impact on an organization. Some of the typical types of confidential
information include:
• Trade secrets
• Negotiation details
• Strategic plans
• Intellectual property, such as algorithms and product designs
Like private information, confidential information should be protected with well-defined access
controls and clear lines of responsibility.
Although many of the same measures may be used to protect confidential and private
information, they are fundamentally different and should not be linked with regard to security
policies and procedures. Private information, for example, may be subject to specific audit
requirements that are not relevant to protecting confidential information. Similarly, some
confidential information may be protected with stronger, and more costly, measures than
required for private information. These two categories should always be managed as separate
entities.
Audit Controls
Auditing begins with policies. Policies may be defined by an organization on its own or as part of
compliance with regulations. Regardless of the motivation for policies, the role of auditing is to
ensure that they are appropriate for the objective and sufficiently implemented. Some of the most
important areas that should be verified in audits include:
• Information classification
• Access controls appropriate for information classifications
• Adequate perimeter and network defenses
• Adequate host defenses
• Adequate review of content, both entering and leaving the network
• Sufficient training on security measures
• Backup and recovery procedures
• Appropriate security management practices, such as separation of duties and rotation of
duties
Auditing is an in-depth review of security policies and procedures. Auditing may be regular but
is still infrequent; day-to-day monitoring is also required.
Security Monitoring
Monitoring can be time consuming unless tools are used to help sift through the volumes of log
data that can be generated in even a moderate-sized network. The difficulties arise from the
range of events that should be monitored, including system events, application events, and user
events. Some of the most common are:
• System performance metrics, such as number of processes, CPU utilization, storage
utilization
• Login attempts and failures
• Applications executed and functions executed within enterprise applications
• Changes to OS configurations
• Errors generated by applications
• Files read, modified, and deleted
• Attempts to access unauthorized resources
In isolation, any one of these events may not be indicative of a serious breach. However, in
conjunction with other events, these may warrant closer examination and may indicate a breach.
One of the greatest challenges in information security today is integrating data from the variety
of security mechanisms already in place. Firewalls, routers, intrusion prevention devices, access
control systems, OSs, anti-malware solutions, and content-filtering applications can all generate
large quantities of data, some of which can be quite useful if it is identified and integrated with
other information in a timely manner.
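As a minimal sketch of this kind of integration (the log records, field names, and threshold below are all invented for illustration, not any particular product's format), events that are individually benign can be correlated by host so that clusters stand out:

```python
from collections import Counter

# Hypothetical, pre-parsed log records from different sources
# (firewall, OS, application); field names are illustrative only.
events = [
    {"source": "os",  "type": "login_failure", "user": "jdoe",   "host": "db01"},
    {"source": "app", "type": "login_failure", "user": "jdoe",   "host": "db01"},
    {"source": "fw",  "type": "port_scan",     "user": None,     "host": "db01"},
    {"source": "os",  "type": "login_failure", "user": "asmith", "host": "web02"},
]

def hosts_needing_review(events, threshold=2):
    """Flag hosts that appear in multiple suspicious events; any single
    event may be innocuous, but a cluster warrants closer examination."""
    counts = Counter(e["host"] for e in events)
    return sorted(h for h, n in counts.items() if n >= threshold)

print(hosts_needing_review(events))  # ['db01']
```

Real security information and event management tools do far more (normalization, time windows, severity weighting), but the core idea is the same: bring events from disparate sources into one view before judging them.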
Mitigation strategies within service-oriented environments should address the full service, and this often entails
detailed mitigation strategies based on the particulars of an implementation. For example,
standby servers in a different location can be used to mitigate the risk of a compromised email
server shutting down communications services. If the primary email server were to fail, the mail
exchanger (MX) records within the domain’s DNS entries could be updated and email re-routed to the alternative
server. Controlling risks is closely aligned with another security management function: business
continuity management.
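The DNS-based email failover described above can be sketched as follows. The host names and preference values are hypothetical; the logic follows standard MX semantics, in which sending mail servers try the record with the lowest preference value first and fall back to higher values:

```python
# Each tuple is (preference, mail server); lower preference is tried first.
# Host names are hypothetical.
mx_records = [(10, "mail.example.com"), (20, "standby.example.org")]

def select_mail_server(mx_records, reachable):
    """Return the most-preferred mail server that is currently reachable,
    mimicking how a sending mail server falls back through MX records."""
    for _, host in sorted(mx_records):
        if host in reachable:
            return host
    return None

# Primary up: mail flows to the primary server.
print(select_mail_server(mx_records, {"mail.example.com", "standby.example.org"}))
# Primary down: mail is re-routed to the standby site.
print(select_mail_server(mx_records, {"standby.example.org"}))
```

In practice, publishing the standby server as a higher-preference MX record in advance avoids even the DNS update step, at the cost of keeping the standby ready to receive mail at all times.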
Incident Response
An incident response plan is like an insurance policy: no one wants to have to use it, but
everyone is glad to have one when it is needed. A security incident can take on many forms,
including:
• A virus infection of multiple devices or critical servers
• The discovery of a significant number of Trojan horse programs
• Infections with keyloggers
• A DoS attack on a network device
• An attempt to break into a server
• An attempt to steal information from a database
• The discovery of a botnet within an organization’s network
• Loss of a notebook or other mobile device containing sensitive, private, or confidential
information
Incident response planning has two dimensions—one addresses procedures and the other
addresses the human resources element of the problem.
Technical staff, especially front-line service desk support and systems administrators, should be
trained on how to respond according to the severity of an incident. For example, minor incidents,
such as a virus infection on a single device, might call for a basic response using a procedure
defined for relatively predictable incidents. For major incidents, such as a DoS attack that is
blocking access to critical servers, front-line technical staff should know how to enlist additional
help to deal with the problem.
Executives and managers should understand the implications of various types of attacks with
regard to the impact on business operations as well as legal responsibilities with regard to
reporting the incident and complying with government regulations.
Separation of Duties
There is something strange about the fact that it is more prudent to trust two or more individuals
than it is to trust one, but that is the idea behind separation of duties. This is especially important
when responding to security incidents. One of the activities of incident response is to collect and
preserve evidence. It is not unheard of for someone working for an organization to be involved
with crimes against that organization. If an employee or contractor perpetrated an incident, that
person could end up being part of the incident response, which is precisely the conflict that separation of duties guards against.
For example, a database administrator is someone with the keys to the proverbial kingdom when
it comes to large volumes of business information. If someone were stealing customer credit card
data from a database and a security monitor on the network detected unusual activity on a
database server, the first person to call would be the database administrator. The potential
problem is clear; the solution is to have at least two knowledgeable individuals respond to an
incident.
Response Evaluation
Security breaches are disruptive and potentially costly, but they are also opportunities to improve
security measures. A post-incident evaluation can provide valuable information about:
• How attackers breached security mechanisms
• Which security mechanisms worked and which did not
• Whether attack techniques were anticipated
• Whether monitoring and logging were adequate to diagnose the incident
• Vulnerabilities in applications, OSs, or network devices
• Vulnerabilities in policies and procedures
The goal of the post-incident evaluation is to improve the quality of security, not simply to place
blame. Managing information security is difficult and a breach does not necessarily imply
negligence or disregard for policies and procedures.
Summary
Security management is one of the most multi-faceted areas of systems management. It ranges
from the broad issues of managing security information down to the detailed practice of threat
and vulnerability assessment. In addition to day-to-day activities such as monitoring systems,
applications, and users, systems administrators and security professionals must manage an array
of security mechanisms deployed in such a way as to provide multiple layers of defense.
201
Chapter 10
Figure 10.1: Threats are associated with virtually every part of an IT infrastructure.
Network Devices
Network devices such as routers, firewalls, intrusion prevention systems (IPSs), content filters,
and other security appliances are subject to the risks of network attacks. The Denial of Service
(DoS) attack is relatively simple but highly effective, especially when launched from multiple,
distributed devices.
For an example of just how disruptive a distributed DoS (DDoS) attack can be, see Scott Berinato’s
“Attack of the Bots” in Wired Magazine at http://www.wired.com/wired/archive/14.11/botnet.html.
Figure 10.2: A man-in-the-middle attack entails intercepting and tampering with communications that two
parties presume to be secure.
Databases
Databases are an obvious target of attack. These are the repositories of a wide range of
information, including:
• Personal information about customers
• Employee information
• Financial records
• Operational information
All databases with a user interface are subject to SQL injection attacks, which can result in data
theft. In this type of attack, an attacker crafts input that exploits vulnerabilities in the
interface’s query processing code. There are multiple techniques for preventing SQL injection
attacks, most of which are based on sound coding practices.
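The most widely recommended of those coding practices is to pass user input as bound parameters rather than splicing it into the SQL text. A minimal sketch with Python's standard sqlite3 module (the table, column, and input values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

malicious = "alice' OR '1'='1"

# Vulnerable: attacker-controlled input is spliced into the SQL string,
# so the injected OR clause matches every row.
vulnerable = conn.execute(
    "SELECT name FROM users WHERE name = '%s'" % malicious).fetchall()

# Safe: the driver binds the value as data, never as SQL syntax.
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (malicious,)).fetchall()

print(vulnerable)  # [('alice',)] -- injection succeeded
print(safe)        # []           -- injection defeated
```

The same parameter-binding facility exists in virtually every database API; input validation and least-privilege database accounts provide additional layers of defense.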
In addition to SQL injection attacks, vulnerabilities in database components, such as the listener
(the application that listens on a specified port for requests to the database), can be used to
compromise a database. Once an attacker has gained access to a database, the attacker can also
tamper with the data as well as steal it. Relatively small changes to data can be difficult to detect
unless auditing and monitoring policies are well established. Database servers are also subject to
disruption from DoS attacks.
Application Code
The importance of application code can range from the mundane, such as scripts for cleaning up
temporary directories, to mission-critical systems, such as enterprise resource planning (ERP)
systems. These assets are subject to several risks, including:
• Flaws in logic
• Insufficient error-handling code
• Dependency on flawed library or other shared code
• Insufficient CPU, storage, or network resources
Flaws in application logic will be found in any sufficiently complex system. Vendors and
developers routinely patch applications to correct known problems.
Insufficient error-handling code is a problem because applications will encounter conditions that
will disrupt normal operations. When a storage device is full, the application will not be able to
save data. How does the application respond? Graceful degradation of services requires that the
application provide alternative means for systems administrators or users to respond to the
problem.
Complex applications are modular and layered. Lower levels provide services for upper levels.
For example, a customer management system uses databases for persistent storage; database
systems depend on file systems or, in some cases, low-level I/O routines provided by the
operating system (OS). Client applications, such as office productivity programs, depend on
graphical interface components provided by the OS. Vulnerabilities in any of these lower-level
systems can create risks for any application that uses them.
In addition to the long-term risks associated with flawed code, there are transient risks such as
insufficient resources. An error in an application or a mistake by an operator can consume large
portions of available bandwidth on the network, for example, by unnecessarily transferring a set
of large files. Similarly, a poorly formed database query can easily consume available CPU
cycles and I/O operations.
Figure 10.4: Layered applications introduce dependencies that pose potential risks.
Systems Documentation
Documentation rarely makes it on any top-ten list of information assets, but it should.
Organizations that depend on the knowledge of their staff, contractors, consultants, and business
partners without formally documenting processes and procedures are at risk. Employees leave,
contractors move on to the next assignment, and partners go out of business. The need to
formalize and capture information about IT systems is obvious.
Capability maturity models, such as the Carnegie Mellon Software Engineering Institute’s model,
define levels of capability that range from ad hoc management to optimized management.
Moving from ad hoc through more capable stages requires, among other things, formalized and
documented processes. Without this, organizations are subject to a number of risks, including:
• Disruption of services
• Additional expenses associated with reverse engineering
• Delayed deployment of applications and services
• Increased need for training
• Poor quality control
For more information about capability maturity models, see the Software Engineering Institute’s (SEI)
Web site at http://www.sei.cmu.edu/cmm/.
Intellectual Property
Intellectual property comprises intangible assets based on the creativity of an organization or
individual; such assets provide some type of competitive advantage or can be
sold. Intellectual property includes:
• Patents
• Trade secrets
• Designs and art work
• Processes
• Copyrighted material
The more knowledge-based a business, the more important the intellectual property. This type of
asset is subject to a number of risks, but the most important, and most threatening, is theft. Headlines
from the U.S. Department of Justice (DoJ) press releases depict the range of crimes related to
intellectual property theft:
• “Former Vancouver Area Man Sentenced to Five Years in Prison for Conspiracy
Involving Counterfeit Software and Money Laundering: Web of Companies Sold up to
$20 million of Microsoft Software with Altered Licenses”
• “Pharmaceutical Distributor Pleads Guilty to Selling Counterfeit Drugs”
• “Local Business Owner Sentenced to Year In Jail for Copyright Infringement Conspiracy
Related to the Sales of Counterfeit Goods”
• “California Man Sentenced for Electronically Stealing Trade Secrets from his Former
Employer, a Construction Contractor”
For more examples, see the Computer Crime & Intellectual Property Section of the U.S. Department
of Justice Web site at http://www.usdoj.gov/criminal/cybercrime/ip.html.
Mitigating the risks to intellectual property is challenging. Unlike tangible assets that can be
locked down and monitored, intellectual property pervades an organization, is embedded in
software that is distributed to customers, and may be remembered by employees and other
insiders long after they leave an organization.
The types of information assets and risks to those assets are wide ranging. From the most
mundane PC to the valuable intellectual property, the relative impact of risks must be assessed in
order to properly manage risk.
Types of Costs
The impact of threats is a function of the value of the asset damaged by a threat, the cost of
restoring the asset, and the cost of not having the functional asset. For example, if an application
server is destroyed in a fire, the cost to the organization includes:
• Replacing the server
• Restoring data to the replacement server
• Configuring the replacement server
• Testing the replacement server
• Lost revenue or productivity during downtime
• Cost of switching to and from failover servers, if used
These costs apply to other types of assets as well. In addition to these, consider the following
when determining the impact:
• The value of intellectual property to competitors
• The potential for penalties for violating regulations, such as failure to comply with
privacy regulations if a customer database is compromised
• The costs to brand value due to the public disclosure of a security breach
• Contractual penalties for not meeting service level agreements (SLAs)
Identifying the types of costs is followed by steps to quantify those costs.
Determine Costs
Quantifying costs related to risks is far from straightforward. To begin, let’s examine the
simplest of cases and then move on to the more challenging areas.
Qualitative Evaluations
Qualitative techniques are used when quantitative measures cannot be used. These techniques
typically depend on the reasoned opinions of experts or others knowledgeable about a particular
area. For example, if a bank is trying to assess the cost of a security breach in which 10,000
customer records are compromised, it might consult with:
• Attorneys regarding disclosure regulations
• Marketing executives for an assessment of the negative publicity
• Industry consultants who have worked with competitors in similar situations
• Customer focus groups
The outcome of the evaluations may be ordered sets of risks with relative measures—such as
high, moderate, and low—assigned to each. Although these assignments are not as precise as
quantitative measures, they can provide enough guidance to allocate resources to protect these
assets. The value of assets is one component of risk analysis calculations; another is the
likelihood of threats.
The CSI/FBI report is available from the CSI Web site at http://gocsi.com/. For more information
about the limited usefulness of self assessments, see Jeffrey Gangemi’s article “Cybercriminals
Target Small Biz” in BusinessWeek online. According to the article "approximately 70% of small
businesses consider information security a high priority, and more than 80% have confidence in their
existing protective measures" yet almost 20% do not use antivirus scanning and 60% do not use
encryption on their wireless networks.
With asset values and the likelihood of experiencing particular threats calculated, you can move
on to the next stage of risk assessment, calculating risk measures.
Exposure Factors
EF is the percentage of the value of an asset lost in one occurrence of a threat. For example, if a
server is completely destroyed by a flood, the EF is 100 percent; if one-fifth of the data in a
database is stolen or otherwise compromised, the EF is 20 percent. Note that each threat will
have a distinct EF.
SLE is calculated with the formula:
SLE = asset value × exposure factor
For example, if the value of a database is $500,000 and the EF is 20 percent, the SLE is
$100,000 (500,000 x 0.2).
An asset may be subject to multiple threats, so there can be multiple SLEs for a single asset. A
notebook, for example, is exposed to theft, malware attack, hardware failure, and, in some cases,
fire due to battery overheating. SLE would need to be calculated for each of these possible
events.
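The SLE arithmetic above can be written out directly. The database figures are the chapter's own; the notebook value and its per-threat exposure factors are illustrative assumptions:

```python
def single_loss_expectancy(asset_value, exposure_factor):
    """SLE = asset value x exposure factor (EF expressed as a fraction)."""
    return asset_value * exposure_factor

# The database example from the text: $500,000 asset, 20 percent EF.
print(single_loss_expectancy(500_000, 0.20))  # 100000.0

# A notebook faces several distinct threats, so each threat gets its own
# EF and its own SLE (the EF values here are invented for illustration).
notebook_value = 2_000
for threat, ef in [("theft", 1.0), ("malware", 0.25), ("fire", 1.0)]:
    print(threat, single_loss_expectancy(notebook_value, ef))
```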
ALE is calculated for each threat to each asset to determine the overall loss expectancy. Table
10.1 shows an example of calculating the total loss expectancy for a single asset.
Laptop
Value | Threat | EF | SLE | ARO | ALE
[per-threat rows not reproduced in this extract]
Total Loss Expectancy: $775
Table 10.1: Total loss expectancy for a notebook over a one-year period.
From the calculations in Table 10.1, you can see that it would be reasonable to spend as much as
$500 per year on anti-theft devices but no more than $50 on anti-fire measures. Ideally, the
outcome of risk analysis is a plan to minimize the cost of countermeasures while maximizing the
reduction in the overall level of exposure to assets.
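The calculation behind a table like 10.1 is the standard one: ALE = SLE × ARO, where ARO (annualized rate of occurrence) is the expected number of occurrences of the threat per year. The figures below are invented for illustration, chosen only to be consistent with the $500 (theft) and $50 (fire) annual exposures discussed in the text:

```python
def annualized_loss_expectancy(sle, aro):
    """ALE = SLE x ARO, where ARO is expected occurrences per year."""
    return sle * aro

# Illustrative per-threat figures for a notebook: (SLE, ARO).
threats = {"theft": (2_500, 0.2), "fire": (2_500, 0.02)}
for name, (sle, aro) in threats.items():
    print(name, annualized_loss_expectancy(sle, aro))
# theft 500.0
# fire 50.0
```

Summing the per-threat ALEs gives the asset's total loss expectancy, which is what justifies capping annual countermeasure spending at roughly the corresponding ALE for each threat.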
For an in-depth look at risk assessment, see the Risk Management Guide for Information Technology
published by the U.S. National Institute for Standards and Technology, available at
http://csrc.nist.gov/publications/nistpubs/800-30/sp800-30.pdf.
Of course, if you do not have the detailed cost information for this type of calculation, such a
formal method is not useful. An alternative method, which is especially useful when qualitative
risk assessments are used, is the risk-level matrix.
Table 10.2: A risk matrix that combines likelihood and impact to assess the overall importance of risks.
Some of the combinations of likelihood and impact yield obvious overall risk levels. For
example, a high likelihood risk—such as malware infected emails—combined with high
impact—such as infecting a large number of clients or consuming network resources, a la SQL
Slammer—yields a high risk threat. Low likelihood threats with low impacts present little risk to
an organization and should not be the focus of attention during risk analysis.
Asymmetric combinations of likelihood and impact are more difficult to judge. For example,
how much effort should be made to mitigate a low likelihood but high impact threat? For an
organization with a conservative perspective, such a risk should be categorized as medium;
however, a more risk-tolerant company may categorize it as a low risk.
The risk levels in Table 10.2 are suggestive but by no means definitive. The risk tolerance of an
organization should dictate the risk levels when different combinations of impact and likelihood
are in question.
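A risk matrix of this kind reduces to a simple lookup once the organization has settled its cell values. The assignments below are illustrative, not Table 10.2's actual values; as noted above, the asymmetric cells in particular should reflect the organization's own risk tolerance:

```python
# Likelihood x impact lookup, modeled on a matrix like Table 10.2.
# Cell values are illustrative; a risk-tolerant organization might
# downgrade the (low likelihood, high impact) cell to "low".
RISK_MATRIX = {
    ("high", "high"): "high",
    ("high", "low"):  "medium",
    ("low",  "high"): "medium",
    ("low",  "low"):  "low",
}

def risk_level(likelihood, impact):
    """Map a qualitative likelihood/impact pair to an overall risk level."""
    return RISK_MATRIX[(likelihood, impact)]

print(risk_level("high", "high"))  # e.g., malware-infected email: high
print(risk_level("low", "low"))    # low
```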
It should be noted that these categories are not mutually exclusive. In fact, impacts of threats can be
measured in more than one of these categories at a time. A high-profile security breach can have an
impact on customer relations as well as lead to fines and penalties for violations of privacy
regulations. A failed server at a retailer can impact both operations and customer relations, especially
if the failure occurs during the high-volume holiday season.
As Figure 10.6 shows, impacts can be thought of as affecting multiple categories at the same
time.
Figure 10.6: Threats can have impacts along multiple business dimensions simultaneously.
Operational Impacts
Operational impacts are those that challenge the ability of an organization to carry out its
workflows. Some common operational workflows are:
• Receiving and processing customer orders
• Fulfilling orders
• Conducting marketing and advertising
• Managing customer relations
• Providing service desk support to internal IT users
• Performing maintenance
• Executing projects
• Managing operations
Within the set of operational impacts, the timeframe within which the impact of a threat is
realized can vary significantly. For example, consider the difference in impact between the failure of a
point-of-sale system and the failure of a data warehouse database server.
When a point-of-sale system fails, revenues from sales stop. Merchandise is not sold, customers
may turn to other providers for the products they need, and financial reporting and reconciliation
operations are blocked. In short, critical, time-sensitive operations are disrupted and losses may
be permanent. This is not the case when a decision support system is offline.
Consider how a data warehouse or other business intelligence application is used. Managers
receive reports about sales, revenues, expenses, and other measures of the financial state of their
department and lines of business. Often, one of the main purposes of a business intelligence
application is to provide a comprehensive view of the state of operations that is not available
from traditional transaction reporting systems, such as accounts receivable and accounts payable
systems. These transaction-oriented systems have been designed to keep financial records and
ensure accurate and comprehensive accounting. Business intelligence systems supplement those
with reports designed for analyzing longer-term trends and patterns of activity that require
consolidating data from multiple systems.
Now imagine a data warehouse database server is down for a day. What is the impact on
business? In the short term, the impact of such a disruption is minimal. Managers and executives
can presumably continue to manage day-to-day operations and can perform planning and
strategic analysis later when the data warehouse is back online. There is not a threat of lost
revenues, the disruption is not obvious to customers or business partners, and presumably this
type of management system is not directly subject to compliance regulations, at least in terms of
availability.
Figure 10.7: Threats to time-critical operations, such as sales, have more significant impacts than threats to
less time-sensitive operations, such as business intelligence reporting.
Compliance Impact
Operational and security risks can have an impact on regulatory compliance. The past several
years have witnessed heightened awareness about regulations as well as the advent of new, high-
profile regulations. The list of regulations that affect business and government agencies and
departments is long and spans multiple jurisdictions. Some of the most well-known and broadly
applicable are:
• The Sarbanes-Oxley Act (SOX), which governs financial reporting and other aspects of
management in publicly traded companies in the United States
• The Gramm-Leach-Bliley Act, another U.S. regulation, provides for the protection of
personal financial information
• The Health Insurance Portability and Accountability Act (HIPAA), which regulates the
use and disclosure of protected health care information in the United States
• The Australian Federal Privacy Act and the Canadian Personal Information Protection
and Electronics Documents Act (PIPEDA), which establish privacy protections in their
respective countries
• The European Union Data Privacy Directive and Directive on Privacy and Electronic
Communications provide protections for those living in EU member countries
• California Senate Bill (SB) 1386 requires business and government agencies to notify
victims living in California when personal private information is disclosed
• The Bank for International Settlements’ Basel II requirements cover reporting and
disclosures by financial institutions
• The U.S. Food and Drug Administration (FDA) 21 CFR Part 11 regulations govern
operations of pharmaceutical companies
• Federal Financial Institutions Examination Council (FFIEC) guidelines on business
continuity planning in financial institutions
• Federal Information Security Management Act (FISMA), established by the U.S. federal
government, to establish standards for information security within departments and
agencies of the U.S. federal government.
A number of conclusions can be drawn from examining this list:
• Regulations are defined by a range of governing bodies, from state-level governments,
such as California, to trans-national institutions, such as the EU
• Regulations are targeting both the integrity of business operations, as seen in SOX and
Basel II, and the protection of individuals’ privacy, seen in California SB 1386 and the
EU’s privacy directives
• Regulations apply to a broad range of industries and governments; in some cases,
regulations are directed at specific industries (the FDA’s 21 CFR Part 11 regulations of
pharmaceuticals); in other cases, regulations are broadly applicable (such as SOX, which
applies to all public companies in the U.S. and FISMA, which is broadly applicable
across the U.S. federal government)
Clearly, compliance is a significant category when assessing the impact of risks; however, it is
not just the government that you must be concerned with when considering the impact of risks on
operations.
Figure 10.8: Supply chains now expose organizations to the impact of operation disruptions from other
businesses.
Negative feedback can spread across the supply chain as the impact of unfilled orders, missed
opportunities, and long-term customer dissatisfaction becomes known.
For a list of privacy breaches, see the Privacy Rights Clearinghouse Chronology of Data Breaches at
http://www.privacyrights.org/ar/chrondatabreaches.htm.
Quantitative measurements of the impact of such breaches and the resulting negative publicity are difficult,
if not impossible, to obtain; qualitative measures are the best that can be expected in such cases.
The impact of risks should be understood along several dimensions, including operational
impact, compliance impact, business relationship impact, and customer relations impact. This is
a fundamental aspect of risk analysis and without a thorough understanding of the range of
effects of different threats, you cannot accurately gauge and mitigate threats to the organization.
Summary
Risks are a constant in the realm of IT infrastructure management. Security risks are well
publicized and a wide range of countermeasures have been deployed to mitigate security risks.
Other types of threats to business continuity and integrity, such as natural disasters and
disruptions to supply chains, can also present risks to both short-term operations and long-term
strategic goals. The practice of risk management has evolved, providing the tools and techniques
to effectively and efficiently understand these threats. Of course, the ultimate goal is to mitigate
these threats, and risk analysis enables this goal even with limited resources.
221
Chapter 11
Controlling IT Costs
“Do more with less” is something of a popular mantra in management circles, though it is rather less
popular with IT operations staff. Unpopular as it is with some, that four-word sentence captures the
driving business factors that are shaping how we implement and manage information services.
Consider how it translates into day-to-day operations:
• As employees and contractors leave, the remaining staff is expected to assume their
responsibilities
• Strategic plans—driven by market conditions, perceived opportunities, government
regulation, and other factors—create new requirements for IT services but not additional
funding for meeting those needs
• Internal customers’ expectations are increasing because they are exposed to rich
applications in other external environments, such as the Web
The outcome of these pressures includes the need for IT managers to deftly reallocate resources,
leverage technologies in innovative ways, and constantly plan for change. To succeed, managers
need to focus on business fundamentals while adapting to the dynamics of information
technologies.
The fundamentals of controlling costs are the same in IT as any other part of an organization;
economics textbooks will tell you that there are labor costs and there are capital costs. What
those textbooks do not always tell you is what to do with those costs. To fill this knowledge gap,
let’s first divide the world of IT costs slightly differently than the most basic branch and consider
three types of costs:
• Labor
• Capital expenditure
• Operating costs
Let’s examine how mature systems management processes benefit each of these.
Labor Costs
Labor costs can make up a significant portion of an organization’s IT budget, and controlling
those costs while maintaining quality service levels can be a challenge. Of course, any manager
can cut staff and reduce bottom-line costs, but organizations need to maintain services, adapt to
new opportunities, and expand the range of services offered. Blindly cutting staff is a short-term
solution to a long-term problem. IT managers succeed when they consider the full range of issues
in staffing their operations:
• Cutting costs can mean reducing quality if reductions are not based on reorganization that
includes quality measures in decisions
• Restructuring often requires improved communications and reporting to support a
geographically dispersed workforce
• Automation can reduce labor costs and maintain quality of service (QoS) if workflows
are well understood and systems are implemented to accommodate those workflows
The SOM model described throughout this guide can help reduce labor costs by making systems
management more efficient while maintaining and improving QoS. In particular, the SOM model
can support:
• Automation of manual processes
• Cross-functional skills and reallocation of resources
• Improved support services
With the exception of the initial request and the manager approval, the rest of the process is driven by
an established and automated workflow that is controlled by policies for authentication and
authorization.
Automation in this process provides several advantages. First, although the time required to
create user accounts may be relatively small, a large volume of transactions can result in
significant costs over time. With support for password resets, a provisioning system can further
reduce the cost of user access management.
A second benefit is improved quality control. The business rules governing user authorizations
can be lengthy and in some cases complex. For example, authorizations may be granted based on
employees’ department, roles in the organization, and the projects they work on. It is more
efficient to have an automated process querying a directory for user attributes and applying a
policy using those attributes than having a system administrator manually checking detailed
request tickets for specific details about system access. Consider: If a user had to specify which
systems he/she needed access to, the list might include:
• Local PC
• Shared network drives
• An email account
• Group calendar
• Employee self-service portal
• Department-specific applications
• Project-specific applications
• Position-specific applications
Each of these may have different authorizations. For example, employees in IT may be given
administrator or power user privileges for their workstations while others are not. Managers may
have access to a project management application that other staff do not. By defining policies
that specify authorization rules and applying them consistently with a workflow process, you
reduce the likelihood of errors.
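The attribute-based approach described above can be sketched in a few lines of code. The following is a hypothetical illustration; the rule predicates, departments, and entitlement names are invented for the example and do not come from any particular provisioning product.

```python
# Sketch: attribute-based provisioning rules (all names are illustrative).
# Each rule pairs a predicate over user attributes with a set of entitlements;
# a user's access is the union of every matching rule's grants.

RULES = [
    (lambda u: True,                      {"local_pc", "email", "group_calendar", "self_service_portal"}),
    (lambda u: u["department"] == "IT",   {"workstation_admin"}),
    (lambda u: u["role"] == "manager",    {"project_mgmt_app"}),
    (lambda u: "apollo" in u["projects"], {"apollo_project_app"}),
]

def entitlements_for(user):
    """Apply every matching rule; the union is the user's access set."""
    granted = set()
    for predicate, access in RULES:
        if predicate(user):
            granted |= access
    return granted

alice = {"department": "IT", "role": "manager", "projects": ["apollo"]}
print(sorted(entitlements_for(alice)))
```

Because the rules are evaluated mechanically against directory attributes, every request is judged by the same criteria, which is exactly the consistency a manual ticket review struggles to guarantee.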
Inventorying Assets
Tracking which devices are online is a fundamental operation; without accurate inventory data,
other operations, such as patch management, lease management, license allocation, and security
management, will produce suboptimal results, at best, and fail, at worst. In fact, inventory
management is the foundation of asset management and begins with discovery of both software
and hardware assets to populate the inventory.
As the number of devices in an inventory grows, the problem of tracking them obviously
becomes more difficult. But quantity is not the only problem.
Configurations can change quickly. New software may be installed on client devices, OS
configurations may change, and peripheral devices may be added to PCs and workstations. In
addition, reorganizations, mergers, and divestitures can create an inventory management
challenge because of the short time and large number of changes that can occur. Updating
inventory with a large number of changes in a short period of time while maintaining sufficient
quality controls is a task that can place a significant burden on IT staff. Again, automation can
result in significant cost savings by reducing the number of staff required to manage inventory.
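The core of automated inventory maintenance is reconciling what discovery finds against what the inventory records. The sketch below is purely illustrative (the device names and package sets are invented) but shows the two questions such a pass answers: which devices are unknown, and which devices have drifted from their recorded configuration.

```python
# Sketch: reconciling a discovery scan against the recorded inventory
# (device names and software lists are illustrative).

recorded = {
    "pc-001": {"office-suite", "antivirus"},
    "pc-002": {"office-suite", "antivirus", "cad-tool"},
}
discovered = {
    "pc-001": {"office-suite", "antivirus", "media-player"},  # unapproved install
    "pc-003": {"antivirus"},                                   # device not in inventory
}

def reconcile(recorded, discovered):
    """Return devices missing from inventory and per-device software drift."""
    unknown_devices = sorted(discovered.keys() - recorded.keys())
    drift = {
        dev: sorted(discovered[dev] - recorded[dev])
        for dev in discovered.keys() & recorded.keys()
        if discovered[dev] - recorded[dev]
    }
    return unknown_devices, drift

unknown, drift = reconcile(recorded, discovered)
print(unknown)  # ['pc-003']
print(drift)    # {'pc-001': ['media-player']}
```

Run on a schedule, a reconciliation pass like this absorbs the burst of changes a merger or reorganization produces without adding headcount.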
Troubleshooting
Troubleshooting is harder to automate than other operations, but its supporting services can be
automated, reducing labor costs. Some troubleshooting problems are isolated to a
single device. For example, a user may notice an increase in the time required to open local files,
start desktop applications, and perform routine tasks. A review of the current configuration may
determine that recently added applications are taxing the device’s resources and that additional
memory is required. A service desk technician may also notice differences in the configuration
from the standard configuration, which leads the technician to investigate the possibility of a
spyware or botnet infection. Having hardware and software configuration information from a
configuration management database (CMDB) can reduce troubleshooting times in such cases.
Other situations are more difficult to diagnose. For example, users of an enterprise application
may report slow performance. The application is a multi-layered system that includes:
• A Web client application
• A Web server
• A J2EE application server
• A messaging service
• A relational database
The slowdown could be caused by a problem in one of these components or in a combination.
Troubleshooting multi-layered applications requires coordination of developers, database
administrators, application administrators, and network support staff. This coordination is
facilitated if a configuration database is available that tracks information across platforms.
Consider a potential problem with a critical system, such as a financial services application.
If a single configuration item fails, what is the impact on system availability? Will an
essential business operation complete on time? Since the CMDB tracks configuration item
relationships that define the service, technicians can quickly evaluate the impact of a potential
failure and assess alternative solutions to work around the failure.
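The impact assessment described above is, at bottom, a traversal of the CMDB’s relationship data. The sketch below uses an invented dependency graph (the item names are illustrative) to show how "what is affected if this item fails?" can be answered by inverting the depends-on edges and walking them transitively.

```python
# Sketch of CMDB impact analysis; the "depends_on" edges are illustrative.
from collections import defaultdict

depends_on = {
    "payments-service": ["app-server-1"],
    "app-server-1":     ["db-server-1", "router-7"],
    "reporting":        ["db-server-1"],
}

# Invert the edges so we can ask "what is affected if X fails?"
supports = defaultdict(list)
for item, deps in depends_on.items():
    for dep in deps:
        supports[dep].append(item)

def impacted_by(failed_item):
    """Transitively collect every configuration item affected by a failure."""
    affected, stack = set(), [failed_item]
    while stack:
        for parent in supports[stack.pop()]:
            if parent not in affected:
                affected.add(parent)
                stack.append(parent)
    return affected

print(sorted(impacted_by("db-server-1")))  # ['app-server-1', 'payments-service', 'reporting']
```

With relationships recorded once in the CMDB, the same traversal answers impact questions for any component, rather than relying on a technician’s memory of how the layers fit together.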
The potentially labor-intensive tasks—provisioning user access, patching and upgrading devices,
inventorying devices and troubleshooting—are examples of common IT operations that can
realize reduced costs if automated processes are in place. Another way the SOM model, coupled
with automation, can reduce labor costs is through the facilitation of the development of cross-
functional skills.
Cross-Functional Skills
IT professionals have come to expect frequent reallocation of staff as strategic initiatives
change. Along with reallocations comes the understanding that more and more tasks are being
aggregated into fewer staff positions. This is part of the logic of improved productivity that is so
important to remaining competitive. An important corollary to the idea of consolidating
responsibilities is the need for cross-functional training.
Consider a systems administrator who had been responsible for managing a number of Linux
servers that supported Web servers and application servers. The administrator is then assigned
responsibility for a set of Windows servers used for network file shares. If this person is out
sick, on vacation, or quits, who will run these servers? It is not practical to have another person
on staff as backup. It is practical to cross-train others for the job.
Figure 11.3: Without cross-functional skills, dependencies develop on single individuals or small groups.
The idea behind cross-training is that there is no dependency on a single individual to provide an
essential service. If one systems administrator is away, another should be able to fill in. The
problem is that the complexity of systems management makes it difficult to understand the depth
and breadth of a wide array of systems. A Linux administrator may be able to pick up UNIX
administrators’ duties pretty quickly, but the same might not be said for a Windows
administrator. Similarly, a Windows administrator familiar with supporting desktop devices may
not be familiar with the intricacies of managing Windows servers running SQL Server or
Microsoft Exchange.
The problem of maintaining adequate skill levels across multiple employees is reduced if low-
level, tedious, platform-specific tasks are automated, leaving the higher-level analysis and
management tasks to staff. For example, monitoring disk usage requires different commands
under Windows than under UNIX and, depending on the reporting requirements, can require
knowledge about specific parameters to command-line utilities. Rather than spending time
scanning UNIX manual pages for the right parameter, a systems administrator’s time is better
spent addressing the core tasks and ensuring adequate disk space. Both Windows and UNIX
administrators could perform basic monitoring tasks using a centralized management console
with information about the status of various servers.
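The disk-monitoring example lends itself to a brief sketch. Python’s standard library already abstracts the platform-specific commands (`df` on UNIX, the equivalent utilities on Windows) behind a single call, so the same check runs unchanged on either platform; the threshold value below is an arbitrary illustration.

```python
# Sketch: one platform-neutral disk check instead of per-OS command-line
# utilities. shutil.disk_usage works on both Windows and UNIX-like systems.
import shutil

def disk_report(path="/", threshold=0.90):
    """Return (fraction_used, over_threshold) for the filesystem at path."""
    usage = shutil.disk_usage(path)
    fraction = usage.used / usage.total
    return fraction, fraction >= threshold

used, alarm = disk_report("/")
print(f"{used:.1%} used, alarm={alarm}")
```

A centralized console can run checks like this against every managed server, so an administrator covering for a colleague never needs to recall which flag `df` wants on which flavor of UNIX.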
Figure 11.4: Using asset and configuration management tools can facilitate cross-training by alleviating the
need to learn low-level, platform-specific details.
Using tools to perform low-level information gathering tasks is just one example of how systems
management support tools can facilitate cross-training, which, in turn, can improve the overall
quality of systems management and allow for consolidation of tasks across a smaller workforce.
Capital Expenditures
Capital expenditures are expenses to acquire or improve long-term assets. These can include:
• Disk arrays
• High-end servers
• Enterprise applications
• Network devices
In budgeting, capital expenditures are often treated separately from operational expenses. Capital
expenditures warrant detailed analysis because they are costly and commit the organization to a
long-term investment. A formal, mature systems management model can help with capital
expenditures in two ways:
• Improved asset management
• Decision support reporting
Operating Costs
The last type of cost in IT is operating costs, or the cost of running the IT department on a day-
to-day basis. Labor costs are typically considered part of operating costs, but in this discussion,
labor costs have been treated separately. This section deals primarily with the remaining types of
operating costs. Specifically, this section examines how a mature systems management
framework can improve cost controls by improving several areas:
• Management reporting
• Allocation of resources
• Predictability of operations
• License management
• Security posture
Figure 11.5: Silos of management have advantages but can make management reporting more difficult than it
needs to be.
A restructuring is likely to lead to different silos without actually solving the management
reporting problem (something of a “rearranging the deck chairs on the Titanic” solution). A
better option is to use a centralized configuration database that can collect and manage
information about assets across organizational boundaries. This option has several advantages:
• It is independent of organizational and management structure
• It allows for consolidated reporting
• Reports are consistent across management domains
• More in-depth analysis, such as dependencies between systems and resources, is possible
Figure 11.6 shows an example of the types of information that can be collected and managed
within a consolidated centralized management database.
Figure 11.6: A centralized configuration management database can collect and maintain information about
assets across organizational boundaries and support improved management reporting.
Figure 11.7: Knowledge of the types of assets in the inventory is the first step to optimizing the allocation of
resources.
A centralized management model that includes inventory and cost information can greatly
facilitate the financial analysis that must be done to optimally allocate resources. Much of the
same data that is used for optimizing the allocation of resources is also useful for predicting time
requirements and levels of effort required for systems management operations.
Figure 11.8: Software license management can track usage against licenses and help administrators remain
in compliance with contractual agreements.
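At its simplest, the license tracking shown in Figure 11.8 reduces to comparing installed seat counts against purchased entitlements. The following sketch is illustrative only; the product names and counts are invented.

```python
# Sketch: comparing installed seat counts against purchased entitlements
# (product names and counts are illustrative).

entitlements = {"office-suite": 500, "cad-tool": 25}
installed    = {"office-suite": 480, "cad-tool": 31, "media-player": 12}

def compliance_report(entitlements, installed):
    """Flag products installed beyond (or without) purchased licenses."""
    report = {}
    for product, count in installed.items():
        owned = entitlements.get(product, 0)
        if count > owned:
            report[product] = count - owned
    return report

print(compliance_report(entitlements, installed))
# {'cad-tool': 6, 'media-player': 12}
```

Fed from the same inventory data used elsewhere, a report like this surfaces both over-deployment (more seats than licenses) and unlicensed software before an audit does.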
These management areas are important to security because, as is well known, information
security requires multiple layers of defense to mitigate the potential for a “weakest link”
problem. For example, if a network worm is exploiting a known OS vulnerability and the only
defense is the antivirus software running on desktops, any problems with that antivirus program
could result in infection. It is not difficult to imagine, for example, a notebook without updated
antivirus signatures that would miss detecting the worm and leave the notebook vulnerable. A
comprehensive systems management program can mitigate this type of potential problem by
• Ensuring OS patches are up to date
• Allowing systems administrators to quickly identify devices that do not have up-to-date
antivirus programs
• Providing better reporting on system accounts, increasing the chances of detecting
unauthorized accounts
• Supporting the enforcement of least privileges, so if a system is compromised, the
processes running on that device cannot do widespread damage
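Identifying devices with out-of-date antivirus programs, the second bullet above, is a straightforward query once signature dates are recorded in the inventory. The sketch below uses invented device records to illustrate the idea.

```python
# Sketch: flagging devices whose antivirus signatures are older than a cutoff
# (the device records stand in for rows from an inventory database).
from datetime import date, timedelta

devices = [
    {"name": "pc-001", "av_signature_date": date(2024, 6, 1)},
    {"name": "nb-007", "av_signature_date": date(2024, 5, 1)},
]

def stale_signatures(devices, today, max_age_days=7):
    """Return device names whose signatures are older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [d["name"] for d in devices if d["av_signature_date"] < cutoff]

print(stale_signatures(devices, today=date(2024, 6, 5)))  # ['nb-007']
```

The same pattern applies to patch levels, unauthorized accounts, or any other attribute the inventory captures: record it once, then query for outliers rather than inspecting devices one by one.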
Let’s examine the problem of systemic vulnerabilities. Vulnerabilities are weaknesses in systems
that can be exploited to compromise the integrity, confidentiality, or availability of a system.
Vulnerabilities are created by
• Errors in applications
• Incorrect configurations
• Deficiencies in procedures
All these potential sources of vulnerabilities can be compensated for by proper systems
management (at least to some degree). Patch management is especially important with the first
problem, errors in applications. Applications today are increasingly complex; they are deployed
on a variety of platforms, and they are often designed and developed under tight deadlines that
leave too little time for comprehensive testing. The result is that serious flaws creep into the
software and are eventually deployed across enterprise IT systems.
Vendors regularly patch software. Microsoft, for example, has regular monthly updates. Other
large vendors, such as Oracle, use a quarterly schedule. Of course, high-risk vulnerabilities may
be corrected outside of this schedule. These types of regular updates allow systems
administrators to plan for updates so that patching does not have to be an ad hoc, disruptive
process.
Zero-day vulnerabilities are particularly problematic because they are unknown to vendors and
customers until attackers or malware developers exploit them. By definition, there are no patches for
zero-day vulnerabilities when they are exploited. This is one of the reasons that defense-in-depth
security strategies are so important. No one security method, such as patching, is effective all the
time against all threats. Only by combining multiple countermeasures can an organization achieve
reasonable levels of security.
Security professionals often advocate defense-in-depth strategies. This advocacy should not be
misconstrued as a call to simply implement more security applications, such as firewalls,
antivirus solutions, content filters, intrusion prevention systems (IPSs), and a host of other tools.
You certainly need those, no question—but you also need sound systems management.
A network fully loaded with the latest security countermeasures will not be secure if the network
devices and servers are misconfigured, if client devices are not patched, if applications are not
using authentication and authorization mechanisms, or if tested backup and recovery procedures
are not in place.
The benefits of a methodical systems management approach touch numerous parts of IT
management, from the allocation of resources and the predictability of operations to improved
software license management and systems security. Just as any coin has two sides, so does the
story of IT costs and systems management. The other side is the cost of not properly managing
systems operations.
Compliance
Regulatory compliance is something we have all come to expect and live with. Some regulations
are broadly applicable to a large number of organizations. The Sarbanes-Oxley Act, for example,
requires adequate controls on IT operations to ensure the integrity of financial reporting of all
companies publicly traded in the United States. Businesses are not the only ones subject to
regulation: governments establish regulations for themselves as well. The Federal Information
Security Management Act (FISMA) defines security requirements for U.S. federal agencies and departments.
Some regulations targeting particular industries worth noting are:
• Health Insurance Portability and Accountability Act (HIPAA)—health care
• 21 CFR Part 11—pharmaceuticals
• FISMA—U. S. federal government
• Gramm-Leach-Bliley Act—financial services
When considering the impact of compliance, consider (at least) two parts: the initial fines and
other costs of a violation and the cost of cascading violations. For example, a violation of
HIPAA can result in stiff fines when protected health care information is disclosed. However, a
violation of a state’s privacy statute can result in fines and may also trigger the violation of another
federal regulation, such as the Gramm-Leach-Bliley Act, resulting in additional fines.
Effective systems management practices will not guarantee that an organization is in compliance
but can provide the tools and management reporting necessary to get into compliance and
demonstrate that compliance.
Business Disruption
Yet another factor to consider when determining the cost of systems management is the potential
for business disruption. When information systems are down, the impact can be widespread,
shutting down day-to-day operations as well as adversely impacting management operations.
Often businesses will invest in backup solutions, offsite facilities, and other measures in case of
disaster. The transition from primary to backup systems can be difficult in the best situations;
without proper planning and management, it may be impossible to complete without adverse
consequences.
Clearly, there are costs associated with implementing comprehensive systems management
models such as the SOM model. There are, however, even greater potential costs for not
implementing such models.
Summary
The benefits of mature systems management practices are well known. Labor, capital
expenditure, and operational costs all benefit from such practices. In the case of labor, the
automation of manual tasks, improved cross-functional training, and improved service support
follow. Capital expenditures benefit from better reporting and decision support. Day-to-day
operations benefit in several ways ranging from better allocation of resources and license
management to improved security and operational predictability. Finally, consider the cost of not
leveraging systems management best practices, which, in addition to the lost opportunity for
improvement, brings costs all its own.
Chapter 12
Figure 12.1: CMDBs provide the means to track virtually all IT assets.
Ownership information tracks the organizational dimension of a configuration item. This can
include the business unit responsible for the services provided by a configuration item. For
example, the finance department may be the owner of a server and enterprise resource planning
(ERP) application. Ownership may be distinct from responsibility, which should also be tracked
in a CMDB. A group within the IT department may have responsibility for the application server
owned by the finance department in the previous example.
Relationship data describes how configuration items depend on one another or are used together.
For example, a UNIX server may depend on a particular router in the network.
Take the case of a newly discovered piece of malware that exploits a vulnerability in a
commonly used code library. Which devices in the organization are using that library? Of all the
vulnerable devices, which are running mission-critical operations? Which are on mobile devices
that may not be connected to the network and may not receive the patch when pushed to devices
by the patch management system?
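The three questions above map directly onto queries against CMDB records. The sketch below uses invented rows (device names, library versions, and attributes are all illustrative) to show how a single exposure report answers them together.

```python
# Sketch: answering "which devices use the vulnerable library?" from CMDB
# records (all rows, names, and versions are illustrative).

cmdb = [
    {"device": "srv-01", "libraries": {"libxml2-2.9.4"}, "mission_critical": True,  "mobile": False},
    {"device": "nb-042", "libraries": {"libxml2-2.9.4"}, "mission_critical": False, "mobile": True},
    {"device": "srv-02", "libraries": {"zlib-1.2.13"},   "mission_critical": True,  "mobile": False},
]

def exposure(cmdb, vulnerable_lib):
    """Partition affected devices by criticality and by patch-push reachability."""
    affected = [d for d in cmdb if vulnerable_lib in d["libraries"]]
    return {
        "all":      [d["device"] for d in affected],
        "critical": [d["device"] for d in affected if d["mission_critical"]],
        "mobile":   [d["device"] for d in affected if d["mobile"]],  # may miss a pushed patch
    }

print(exposure(cmdb, "libxml2-2.9.4"))
```

Without a populated CMDB, each of these questions becomes a manual survey; with one, the exposure report is available in the minutes after a vulnerability announcement, when it matters most.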
Compliance is forcing a new regimen on IT operations. More controls are now required to ensure
that devices are configured properly and patched appropriately. It is not uncommon to establish
minimum security requirements for any device connecting to the network. Notebooks, for
example, may be required to run anti-malware and personal firewalls. If these services are not
available, the device is not granted access to the network. How is this enforced?
Policy management solutions and access control devices have to be coordinated to ensure any
device accessing the network is in compliance. A single policy can apply to multiple devices and
devices may be subject to multiple policies. The responsibilities of systems managers are
growing rapidly, and automation is essential to keeping up with these changes. If automated
systems management solutions are not in place, or are only partially in place, the first step is to assess the
state of IT management practices.
Figure 12.2: Risk tolerance is part of the background in which all business and technical decisions are made.
For more on Vista migration, see the Microsoft Desktop Deployment Center and the Altiris Vista
Resource Center.
A formal risk analysis can identify the theoretical cost and benefits of risks and risk mitigation
strategies. Of course, executives and managers will try to mitigate these risks but there are limits
to these efforts:
• Financial constraints
• Time constraints
• Unknown factors
• Unknown frequencies of risks
• Technical limitations
• Resource constraints
Understanding these limitations and working around them must be guided by the organization’s
tolerance for risk.
Financial Constraints
Managers have limited resources for dealing with risks and choices will often have to be made
between mitigation strategies. For example, should funds be invested in a new higher-capacity
backup system or should those same funds be used to upgrade network security? Both are
arguably essential to maintaining business operations, but there may be funds for only one.
Time Constraints
Time constraints are also a factor. One may have the funds and staff with the technical skills to
address a problem but not the time. If a company acquires another firm with poorly designed
network architecture, should the new resources be redeployed following the company’s
architecture? Ideally yes, but it may require pulling senior systems administrators and network
managers away from other high-priority projects.
Unknown Factors
There is little one can do about unknown factors except to plan in terms of broad generalities.
Natural disasters, security breaches, and systems failures are broad risks but you will never be
able to plan in detail for all types or understand the impact of all possible instances of these risks.
Another class of unknowns is the impact of risks. A fire that destroys a computer center is easily
quantified. The cost of a data loss incident is not so clear, but key factors include:
• Diminished brand value
• Loss of customer loyalty
• Fines and other compliance costs
In addition to these kinds of unknowns, another group of unknowns adds to risk assessment
difficulties.
Technical Limitations
For some risks, you simply do not have adequate mitigating solutions. Information security has
always been a matter of responding to emerging threats crafted to circumvent existing
countermeasures. Some of the best methods for dealing with known risks impose unacceptable
limitations. For example, Windows Vista has been designed with improved security measures
but some existing software will not function under these new security measures. Users have a
choice: do not run these applications, or run them with elevated privileges similar to those granted
by earlier versions of the Windows OS. As Figure 12.3 shows, what is desired and what is
achievable can be vastly different because of the constraints facing the organization.
Figure 12.3: A variety of constraints limit an organization’s ability to reach the ideal level of risk mitigation.
Resource Constraints
Another constraint is the availability of resources, especially staff with sufficient skill sets.
Again, planning can mitigate some of these risks but there is always the potential for a key
person to leave a project at a critical time.
Responding to Risks
Once risks have been identified, an organization can respond to those risks in one of three ways:
• Accept the risk
• Mitigate the risk
• Transfer the risk
Accepting the risk means the organization understands the risk and has evaluated its potential
costs, as well as the costs and benefits of deploying countermeasures, but has decided not to take
any steps to reduce the risk. At first glance, this may sound somewhat irresponsible, but it is often
a reasonable strategy. For example, if a data center is in a 100-year flood plain, a company may
decide that the cost of moving the operation or deploying flood controls outweighs the benefit;
accepting the risk is then a reasonable strategy.
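The flood-plain decision can be made quantitative with an annualized loss expectancy (ALE) comparison. The dollar figures below are invented for illustration; only the 1-in-100 annual probability follows from the "100-year flood" framing.

```python
# Sketch: annualized loss expectancy for the flood-plain example.
# All dollar figures are illustrative assumptions, not real data.

annual_probability = 1 / 100        # a "100-year" flood
loss_if_flooded    = 2_000_000      # assumed cost of losing the data center
ale = annual_probability * loss_if_flooded

mitigation_annual_cost = 150_000    # assumed cost of flood controls, per year

print(f"ALE = ${ale:,.0f}; mitigation = ${mitigation_annual_cost:,.0f}")
# When the mitigation cost exceeds the expected annual loss, accepting
# the risk can be the rational choice.
print("accept risk" if mitigation_annual_cost > ale else "mitigate")
```

With these (hypothetical) numbers the expected annual loss is $20,000 against $150,000 per year of mitigation, which is exactly the arithmetic behind a defensible risk-acceptance decision.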
Mitigating the risk means that countermeasures are taken to reduce the risk. You use risk
mitigation strategies constantly, although you may not think of them as such. Consider the
following:
• Deploying anti-malware on PCs
• Implementing content filtering on network traffic
• Establishing acceptable use policies for IT equipment
• Using clusters of computers instead of a single server for mission-critical applications
• Conducting code reviews on custom-developed applications
• Using project management best practices
These are all examples of risk mitigation measures. Some of these, such as deploying anti-
malware programs, are obviously done to reduce a well-known risk. Others, such as project
management best practices, are not solely risk mitigation measures, although they are key
proactive risk mitigation techniques. In the case of project management, the best practices reduce
the risk of cost overruns and delay of deliverables. Risk mitigation does not eliminate risks; that
is not possible. Instead, the goal is to reduce the risks as much as possible using reasonable resources.
The final option for dealing with risk is to transfer it. This means an organization purchases
insurance so that in the event the risk is realized, the insurance company bears the cost of the
risk. Like risk mitigation, risk transfer is appropriate in a variety of circumstances, and its use
will depend on the balance of costs and benefits.
At the conclusion of step one, an organization should have an understanding of business
objectives and how IT can serve those objectives. At the same time, these steps provide some
perspective on risks and the ability to mitigate those risks. The next step is specific planning for a
move to a service-oriented model.
Prioritizing Needs
Service-oriented management and systems management in general encompass a wide range of
operations and services. The first part of the planning process is to understand which of these
operations and services are the most important; common among top priorities are:
• Acquiring devices and applications
• Deploying devices and applications
• Providing service desk support
• Ensuring asset management
• Maintaining systems availability
• Monitoring systems
• Auditing and compliance reporting
• Developing applications
• Securing databases and hosts
• Enforcing policies
• Improving quality controls on IT procedures
• Ensuring application compatibility
Each of these could justifiably be considered a top priority depending on the circumstances.
There is no single right answer to the question, “Where should we start?” Rather than try to force
a one-size-fits-all answer to that question, it may be more useful to examine a few scenarios to
see how varying circumstances shift priorities. These will include:
• A new business without an existing systems management structure
• A company that has recently acquired another firm
• A company in a highly regulated market
Again, the goal is not to provide a black-and-white decision-making procedure for how to
proceed with prioritizing needs but to show some examples of the kinds of questions and issues
that may influence the prioritization process.
New Business
Consider a new business that is started to provide online services to manufacturers. The services
are delivered through a combination of onsite consulting and online support through a customer
portal. (The details of the service are not important at this point.) The characteristics of the
market are:
• Relatively few compliance requirements because the customers are not in financial
services, healthcare, or another highly regulated area
• The company is privately held so the Sarbanes-Oxley Act (SOX) does not apply
• The market is competitive and customers can easily switch providers, so developing
customer loyalty is important
• Consultants and sales staff will need full access to IT resources from remote locations
• Customers will need access to the customer portal application but customer data should
be segregated so that customers can access only their own data
• Customers expect high availability of the customer portal
Given this set of requirements, high-priority operations include:
• Acquiring devices and applications
• Providing service desk support
• Maintaining systems availability
• Developing applications
• Securing databases and hosts
A new business will of course need to acquire devices and applications, so managing that process
well from the beginning is important. Also, as customer loyalty is so important in this market,
service desk support will be a top priority as well. Supporting application development, system
availability, and database and host security are the kinds of operations customers will not see
directly but are fundamental to delivering services that are at the front lines of the business.
Post-Merger Organization
When two organizations merge, there are often plenty of technical issues to resolve. Integrating
network architectures, databases, and applications requires knowledge of low-level details and
careful planning. Once the systems are integrated, systems managers will have to apply
management procedures consistently across all devices regardless of how they were managed in
the past. In this scenario, some of the most important operations are:
• Asset management
• Service desk support
• Databases and host security
• Policy enforcement
• Improved quality controls on IT procedures
• Configuration management
One of the first challenges to address in a post-merger situation is compiling an accurate
inventory. You cannot manage a device if you do not know you have it or if you do not know
where it is or what kinds of applications are running on it. Asset management is one of the top
few priorities in a post-merger environment.
Mergers can be disruptive to existing operations, so service desk support can be critical to
maintaining operations and efficiency. Disruptions and changes in network architecture can
introduce new and unforeseen security vulnerabilities. There is also the chance that existing
patch management operations are disrupted during a merger, which in turn perpetuates existing
vulnerabilities. Another key area is to ensure policies are enforced and procedures continue to be
carried out across newly acquired assets. The priorities in organizations going through less-
disruptive changes are somewhat different.
Asset management is a fundamental service that enables several others. Having detailed
information about the location, configuration, and status of all devices in the organization is the
basis for reporting on them and demonstrating that they are in compliance. Asset management
covers the full life cycle of hardware management, from procurement to disposal. Monitoring
systems is another part of maintaining compliance because it is an early warning procedure that
can help detect and control security breaches as well as other problems that can disrupt
operations.
Auditing and compliance reporting are obviously necessary in this situation. Auditing involves
more than the annual review by external auditors. Continuous monitoring and auditing of key
events, such as failed access attempts, changes to deployed code, and configuration
modifications should be logged and reviewed regularly.
Some of the most high-profile security breaches have involved the theft of information from
databases. Part of database security is maintained with the database system itself, but much of
that depends upon a secure host. Systems managers play a key role in securing databases by
hardening host OSs and regularly monitoring the device for signs of security problems.
Establishing policies that meet auditor expectations can be challenging enough, but ensuring those
policies are enforced at all times in all applicable cases brings its own host of difficulties.
Policies, for example, are platform neutral and must be enforced regardless of the device
performing an operation. Consider the process of accessing a customer financial record; this
could occur from:
• A desktop PC used by a customer support representative
• A batch process that runs from a central server updating account information on a regular
basis
• A notebook used by an analyst investigating a problem with the account
• A smartphone, which combines cell phone and PDA functionality, used by the customer
to transfer funds between accounts while traveling
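A platform-neutral rule covering the access paths above might be sketched as a single function that is evaluated the same way for every device type. The rule, roles, and parameters here are assumptions made for illustration, not an actual policy:

```python
# Illustrative only: one rule, applied identically whether the request
# comes from a desktop, a batch server, a notebook, or a smartphone.
def may_access_financial_record(role, device_encrypted, authenticated):
    """A customer financial record may be accessed only by an authorized
    role, from an authenticated session, on a device with encrypted
    storage -- regardless of the device type."""
    return (role in {"support_rep", "batch_process", "analyst", "customer"}
            and device_encrypted
            and authenticated)

# The same check governs every access path:
desktop_ok = may_access_financial_record("support_rep", True, True)
phone_denied = may_access_financial_record("customer", False, True)
```

Centralizing the rule this way is what makes automated enforcement possible: no device type gets its own, subtly different, copy of the policy.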
Effective policy enforcement requires a combination of thorough planning and automation to
ensure that all use cases are accommodated. This relates to the final high-priority need,
improving quality controls on IT procedures. Monitoring operations is necessary but it may
disclose weaknesses in some areas. For this reason, it is important to be able to measure the
performance of IT operations, especially as it relates to compliance-oriented policies and
procedures. Management reporting on operational procedures can help isolate problem areas and
measure the effectiveness of various remediation plans so that procedures eventually meet
expectations.
There is no absolute ordering of priorities that applies equally well to all organizations. Priorities
will largely be driven by the business strategies of the organization (which are assessed in the
first step of the roadmap process) and the current state of the organization. Although the
priorities will vary, two themes are common across organizations making the move to service-
oriented management: the need for a centralized repository of information and reporting and the
benefits of optimizing policies and procedures.
In such an extreme case as the one just described, one could install a CMDB, collect information
about configuration items, and even keep it up to date with regular refreshes. The problem is that
this would do some good but not as much as it could. Potential benefits include:
• A single reporting system for determining which applications and OSs are running on each device
• A single reporting system for determining the patch level of each device
• A rudimentary asset-tracking system that could at least catalog basic information about
devices on the network
What this approach would miss are the benefits that come from a combination of well-
formulated management policies and automated services:
• Prioritizing devices in terms of mission-critical functions
• Linking documentation to configuration items
• Integrating asset management information with other management tools, such as patch
management and deployment systems
• Enforcing policies based on attributes of devices and their users
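The difference between a bare inventory and the policy-driven combination above can be sketched with a toy configuration-item model. This is a minimal in-memory illustration under assumed attribute names; real CMDBs are backed by databases and discovery tools:

```python
from dataclasses import dataclass, field

# Hypothetical configuration-item record; attribute names are invented.
@dataclass
class ConfigurationItem:
    name: str
    os: str
    patch_level: str
    mission_critical: bool = False          # enables prioritization
    docs: list = field(default_factory=list)  # linked documentation

cmdb = [
    ConfigurationItem("db-01", "Linux", "2024-05",
                      mission_critical=True, docs=["runbook-db.pdf"]),
    ConfigurationItem("kiosk-7", "Windows", "2023-11"),
]

# With the mission_critical attribute, patch reporting can be
# prioritized rather than merely listed:
prioritized = sorted(cmdb, key=lambda ci: not ci.mission_critical)
for ci in prioritized:
    print(ci.name, ci.os, ci.patch_level)
```

The bare inventory answers "what is out there"; the extra attributes (priority, linked documents) are what let patch management and policy enforcement act on that answer.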
Optimizing policies and procedures requires:
• An understanding of business goals and strategies
• An assessment of the overall risk tolerance of the organization
• An understanding of regulations and other constraints on the organization
• An understanding of the existing IT infrastructure and plans for future changes
• A commitment to follow established procedures when carrying out IT management tasks
The last bullet point is one of the most important. The specific details of how one manages
patches, deployments, or testing are often less important than the fact that one is following an
established set of procedures. This is the topic addressed in the final step of the roadmap.
Measuring Operations
There is an old saying that you cannot manage what you do not measure. This is certainly true in
IT. One does not need to measure every aspect of every procedure and operation. Rather, it is
better to find representative measures for key services, such as:
• In service support, the number of service desk calls, the duration of calls, and the number
of calls escalated
• In patch management, the number of patches applied, the number of failed patch
operations, and the time required to apply patches
• In deployment management, the number of devices updated, the number of failed
deployment attempts, and the staff hours required
• In change management, the number of changes, the time to approve changes, and the
number of emergency changes
Like best-practice frameworks, these examples are starting points for formulating sets of
measures that reflect the state of IT infrastructure and operations.
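As a sketch of how such representative measures might be computed, the service support example above can be reduced to a few summary statistics over raw call records. The record fields are invented for illustration; a real service desk tool would supply these from its own database:

```python
# Hypothetical call records from a service desk system.
calls = [
    {"duration_min": 4, "escalated": False},
    {"duration_min": 12, "escalated": True},
    {"duration_min": 7, "escalated": False},
]

# Representative measures: call volume, average duration, escalation rate.
total_calls = len(calls)
avg_duration = sum(c["duration_min"] for c in calls) / total_calls
escalation_rate = sum(c["escalated"] for c in calls) / total_calls

print(total_calls, round(avg_duration, 1), escalation_rate)
```

The same pattern (count, average, rate) applies to the patch, deployment, and change management measures listed above; what changes is only which records are counted.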
Summary
Management models that worked in the past, in more slowly changing business
environments, are no longer sufficient for the dynamics of today's IT operations. To enable an
adaptable operation, you must assess the current status of IT operations; plan, if necessary, a
transition to a mature service model based on frameworks such as service-oriented management,
ITIL, and COBIT; and finally implement and maintain the practices outlined there. IT
management is demanding, but the tools and practices are established to help you bring direct
value to the organization.
Throughout this guide, service-oriented management has been presented as a means to address
the key challenges facing IT operations, including:
• Business objectives and IT alignment
• Planning and risk management
• Business continuity and operational integrity
• Security and compliance
• Capacity planning
• Asset management
• Service delivery
The key features of a service-oriented management strategy that serve this goal include:
• Modularity of services
• Comprehensive management of configuration items in a centralized repository—the
CMDB
• Ability to report on assets and dependencies between assets
• Support for maintaining adequate security in the information infrastructure
• Support for asset management
• Support for the delivery of new IT services and applications
There is no single process or methodology that will guarantee the success of an IT operation.
There are, however, well-developed best practices that provide ideal starting points and detailed
guidance on managing a significant part of any information management operation. Those
practices, in conjunction with the ability to adapt them to the particular needs of your own
organization, are the best approach to meeting your organization's long-term goals and objectives.