
MAJOR INCIDENT PROCESS

Overview

Version 2.2
June 28, 2018
Matthew Wollman

Page 2 HUIT Major Incident Process


Document Change Control

Version #  Date of Issue  Author(s)          Brief Description

0.1        8/3/2012       Matthew Wollman    Start of Document

0.2        8/21/2012      Matthew Wollman    Incorporated feedback from Courtney Harwood,
                                             Richard Ohlsten and Steve Martino

0.3        8/28/2012      Matthew Wollman    Made major modifications to Responsibilities and Workflow.
                                             • Added definitions for critical, core, and non-core service

0.4        9/10/2012      Matthew Wollman    Incorporated feedback from Dennis Ravenelle
                                             • Drafted water mark
                                             • Reordered Objectives and Policy by importance
                                             • Further clarified definition
                                             • Added additional Role responsibilities
                                             • Expanded Process activities
                                             • Made grammatical changes

1.0        9/24/2012      Matthew Wollman    First release of document after core team approval
                                             • Removed P1 and P2 differences

1.1        11/27/2012     Matthew Wollman    Separated Incident Commander and Incident
                                             Communications roles; added text about criteria for
                                             hierarchical escalations

2.0        2/13/2013      Matthew Wollman &  Combined Purpose and Scope, and objectives and policies.
                          Janet Crystal      Reorganized roles and responsibilities by order of role
                                             involvement in process. Reorganized and reduced Process
                                             activities section to a high-level overview. Process
                                             activities will be detailed in separate documentation

2.1        8/15/2014      Matthew Wollman    Change to RACI, Service Owner is Accountable for External
                                             Communications, Removed C-Cure from Critical Services

2.2        11/2/2014      Matthew Wollman



Table of Contents
Document Change Control
Purpose and Scope
Policies
Process Roles and Responsibilities
    Incident Commander
        Incident Commander Escalation
    Incident Communicator
    Service Desk
    SOC Operations
    Technical Resources (Infrastructure, Development, DevOps, etc.)
    Technical Line Manager
    Service Owner / Practice (or Product) Manager
Process Activities
    Major Incident Identification
    Initial Communication and Escalation
    Incident Coordination
    Conference Bridge
    External Communication
    Internal Communication
    Investigation
    Resolution
    Incident Documentation
    Appendix A: Process Flowchart for a Major Incident
    Appendix B: RACI Matrix
    Appendix C: Critical Services
    Appendix D: Major Incident Process Timeframes (Estimated)
Glossary



Purpose and Scope
The Harvard University Information Technology (HUIT) Major Incident process provides a unified system
for resolving Major Incidents as quickly as possible through proper identification, predefined escalation
paths, and prompt communication procedures across all HUIT services.

A Major Incident is the interruption or degradation of a core production service (any centralized
HUIT-provided service that serves multiple customers and users) that disrupts its customers' ability
to carry out University teaching, learning, research and/or administration.
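
The definition above can be read as a simple check. The following is a minimal sketch only: the `Service` type, its field names, and the idea of recording a count of customer groups are assumptions made for illustration and are not part of the HUIT process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical description of a service (all field names are assumptions)."""
    name: str
    centralized: bool        # provided centrally by HUIT
    customer_groups: int     # number of distinct customer groups served
    production: bool = True  # production service, as opposed to test or development


def is_core_service(service: Service) -> bool:
    # Core service: centralized and serving multiple customers and users.
    return service.centralized and service.customer_groups > 1


def is_major_incident(service: Service, disrupts_university_work: bool) -> bool:
    # Major Incident: interruption or degradation of a core production service
    # that disrupts teaching, learning, research and/or administration.
    return (
        is_core_service(service)
        and service.production
        and disrupts_university_work
    )
```

For example, an interruption of a centralized e-mail service used by many customer groups would qualify, while the same interruption of a service hosted for a single group would follow the normal incident process. Whether work is actually disrupted remains a human judgment that the sketch takes as an input.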

The scope of this document is to provide an overview of the processes that apply to every Major
Incident for all HUIT services and that all HUIT employees must follow. Once trained, all HUIT employees
will be able to identify a Major Incident and to escalate it to the appropriate technical group for
resolution.

Policies
1. HUIT’s focus is to alert the community to the occurrence of a Major Incident as quickly as possible.
Early notification of a potential issue is more important than an accurate description of the problem.
2. HUIT will use standardized methods and procedures to enable an efficient and prompt response,
analysis, documentation, ongoing coordination and ownership, communication, and reporting.
3. Escalation in a Major Incident will start with the Incident Commander and move to the HUIT
employees most responsible for each service.
4. HUIT will communicate with affected end-users regularly throughout the lifecycle of a Major
Incident.
5. HUIT will maintain a consistent and regular presence through open communications among HUIT
staff and will provide consistent updates to the Service Desk, Service Owner, Incident Commander, and
HUIT leadership.
6. HUIT will log and document all details of Major Incidents throughout the lifetime of each event.



Process Roles and Responsibilities

Incident Commander
The Incident Commander has the highest level of responsibility during a Major Incident and is
accountable for its lifecycle through coordination, documentation, and communication. The roles of
HUIT Incident Commander and HUIT Incident Communicator may be combined in one person for
incidents that are of short duration or that are deemed less critical. For incidents of longer duration or
those with greater impact, the responsibility of the Incident Commander can be escalated to a Manager
or Director in HUIT.

The Incident Commander is responsible for the following activities:

• Facilitating and participating in a conference bridge
• Maintaining communication with Technical Resources and Service Owners for status updates and
  additional information
• Coordinating resources needed to troubleshoot, communicate, and/or make decisions to resolve a
  Major Incident
• Ensuring that internal and external communications about a Major Incident are completed in a
  timely manner
• Creating and completing a Major Incident Report

Incident Commander Escalation


If the scale of the event requires escalation to a HUIT Manager or Director, the responsibilities for the
Incident Communicator role will remain with the original Incident Commander. The following conditions,
whether individual or in combination, will guide the need for escalation of Incident Commander
responsibilities to a higher level of HUIT management:

1. A Major Incident affects one of the Critical Services listed in Appendix C of this document.
2. A Major Incident affects over 1,000 users of one or more services.
3. A Major Incident is not or cannot be resolved within four hours.
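
The three conditions above can be sketched as a single check. This is an illustration rather than a prescribed tool: the function and parameter names are hypothetical, and treating "cannot be resolved within four hours" as an optional estimate of remaining time is an assumption. The conditions guide the decision; they do not replace the judgment of the Incident Commander.

```python
from datetime import timedelta
from typing import Optional

# Thresholds taken from conditions 2 and 3 above.
USER_THRESHOLD = 1_000
RESOLUTION_LIMIT = timedelta(hours=4)


def should_escalate_commander(
    affects_critical_service: bool,
    affected_users: int,
    elapsed: timedelta,
    resolved: bool = False,
    estimated_remaining: Optional[timedelta] = None,
) -> bool:
    """Return True if any one of the three escalation conditions applies."""
    over_user_threshold = affected_users > USER_THRESHOLD

    # "Is not resolved within four hours": still open once the limit is reached.
    not_resolved_in_time = (not resolved) and elapsed >= RESOLUTION_LIMIT

    # "Cannot be resolved within four hours" (assumption): the elapsed time plus
    # the estimated remaining time exceeds the limit.
    cannot_resolve_in_time = (
        not resolved
        and estimated_remaining is not None
        and elapsed + estimated_remaining > RESOLUTION_LIMIT
    )

    return (
        affects_critical_service
        or over_user_threshold
        or not_resolved_in_time
        or cannot_resolve_in_time
    )
```

Because the conditions apply "whether individual or in combination", the sketch combines them with a logical OR: any single condition is enough to suggest escalation.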



Incident Communicator
The Incident Communicator is responsible for the documentation and communication during a Major
Incident, both internally to HUIT and externally to customers and end-users. The roles of HUIT Incident
Commander and HUIT Incident Communicator may be combined in one person for incidents that are of
short duration or that are deemed less critical. For incidents of longer duration or those with greater
impact, the responsibility of the Incident Commander can be escalated to a Manager or Director in HUIT.

The Incident Communicator is responsible for the following activities:

• Participating in a conference bridge
• Communicating internally to HUIT staff and externally to the customers of the service, end-users,
  and other non-HUIT parties
• Maintaining a record of events throughout a Major Incident
• Notifying HUIT staff and any external parties of the resolution
• Updating the HUIT website, Twitter, Facebook, and email distribution lists with notifications of
  incidents, updates, and resolution

Service Desk
The Service Desk is responsible for the following activities:

• Identifying a Major Incident
• Escalating a Major Incident to the HUIT Incident Commander
• Logging Major Incident tickets for end-users
• Participating in a “servicedesk” Jabber chat room or the conference bridge
• Placing a generic Major Incident message on the ACD system

SOC Operations
The SOC Operations group is responsible for the following activities:

• Identifying a Major Incident
• Escalating a Major Incident to the HUIT Incident Commander
• Logging Major Incident tickets for end-users
• Participating in a conference bridge
• Notifying the Service Desk during business hours of any Major Incident



Technical Resources (Infrastructure, Development, DevOps, etc.)
Any HUIT Technical Resource who receives alerts or escalations and/or has a role in restoring HUIT
services to normal operation is responsible for the following activities:

• Identifying a Major Incident
• Escalating a Major Incident to the Technical Line Manager
• Troubleshooting and working to resolve the incident in accordance with internal procedures for
  handling Major Incidents
• Documenting incident details and steps taken to resolve the underlying problem
• Providing regular updates to the Technical Line Manager and/or the Incident Commander on the
  status of an investigation and the resolution of the incident

Technical Line Manager

Any HUIT manager who manages technical resources and their performance is responsible for the
following activities:

• Identifying and escalating a Major Incident to the HUIT Incident Commander
• Identifying the scope of the problem and identifying additional services that may be affected by a
  Major Incident
• Notifying respective service areas and providing updates throughout the lifecycle of a Major Incident
• Participating in a conference bridge
• Facilitating communication among Technical Resources, the HUIT Incident Commander, and the
  Service Owner
• Recording and tracking progress throughout the lifecycle of a Major Incident and providing updates
  to the Incident Commander and Service Owner
• Estimating the service recovery time
• Managing the activities of the Technical Resources

Service Owner / Practice (or Product) Manager


Any HUIT employee or their proxy who is responsible for the overall quality of a service and has the
most comprehensive knowledge of its components is responsible for the following activities:

• Identifying a Major Incident
• Participating in a conference bridge
• Notifying the Service Desk during business hours of a Major Incident
• Identifying the business impact of a Major Incident
• Communicating externally to the customers of the service, end-users and other non-HUIT parties
• Maintaining a record of events throughout a Major Incident
• Confirming that resolution of a Major Incident is in place
• Notifying HUIT staff and any external parties of the resolution after confirmation



Process Activities
HUIT maintains detailed descriptions of the following activities in separate documents. They are listed
below in this document for high-level reference and overview.

Major Incident Identification
• Major Incidents can be initiated by customers, reports from users, observations, monitoring, Event
  Management, and/or Change Management.

Initial Communication and Escalation
• As soon as HUIT staff members have identified a suspected Major Incident, they must escalate it
  immediately to the HUIT Incident Commander.
• The Incident Commander will declare the event as a Major Incident and set its priority.
• The Incident Commander will escalate it to the appropriate technical groups and service owners.
• After declaration of a Major Incident, the Incident Communicator will email the Service Desk with
  appropriate information and place a service alert on the HUIT website.

Incident Coordination
• The Incident Commander will involve and consult with all necessary parties to resolve the incident
  as quickly as possible.
• The Incident Commander will facilitate conference bridges to ensure that information is
  disseminated in a timely manner, that time spent on the bridge is focused, and that troubleshooting
  can continue.
• The Incident Commander will escalate the incident to additional resources, including hierarchical
  escalations as necessary.

Conference Bridge
• Once notified of a Major Incident, the Incident Commander will use a conference bridge that
  includes all affected groups to maintain communication between the technical resources and the
  service owner(s).
• The Incident Commander will determine the appropriate schedule for calling a conference bridge
  and its duration after the initial assessment.

External Communication
• Throughout the Incident, HUIT will use its website as the primary location for information updates.
• HUIT will distribute Incident notification(s) to external customers, add an outgoing message to the
  Service Desk ACD system (as necessary), and send a tweet whose content will also appear on HUIT's
  Facebook page and in Harvard’s Yammer community.



Internal Communication
• The Incident Communicator will create a Major Incident ticket to be available in Remedy.
• The Incident Communicator will send a notification containing internal details of the Major Incident.
• The Incident Communicator will notify the Operational Managing Directors, as necessary.

Investigation
• HUIT will investigate continuously throughout a Major Incident and coordinate updates with
  vendors, developers, and end-users.

Resolution
• Service Owners have final sign-off authority on the resolution of a Major Incident and ensure
  end-user notification.

Incident Documentation
• The Incident Communicator will document the initial assessment of the incident's root cause (if
  known), create a timeline, and establish the steps taken for investigation and resolution.
• The Service Owner(s) and Technical Line Manager(s) will forward any notes or timelines that they
  have maintained throughout the incident to the Incident Commander.



Appendix A: Process Flowchart for a Major Incident
[Figure: flowchart "Major Incident Process", dated 8/29/2012.
Swimlanes: Service Desk; Operations; Service Owner / Product Manager; Technical Line Manager;
Technical Resources.
Entry points: Phone or Emails; Users or Monitoring; Customer; Monitoring.
Decision in each lane: "Is this a Major Incident?" with Yes and No branches; the No branch is
"Assume Normal Process".
Other steps shown: Update ACD; Escalate to ITSM (6-2831); Join Conference Bridge; Assume Incident
Commander Role; Notify SD; Notify Service Owner / Product Manager; Notify Appropriate Technical
Resources; Notify Huit-inf-alerts; Communicate to External Customers / Users; Provide Business
Impact Details / Log for Incident Report; Provide Technical Details / Log for Incident Report;
Update to ITSM; Coordinate add'l tech resources; Escalate to Line Manager; Investigate and
Restore; Confirm Incident Resolution; Communicate Externally; Resolve Incident; Communicate
Internally; Open Incident Report; End.
Legend: Incident Ownership Path; Coordination and Communication Tasks; Technical Resolution
Tasks; Ownership Tasks.]



Appendix B: RACI Matrix
A = Accountable, R = Responsible, C = Consulted, I = Informed

Incident Communicator

Technical Line Manager


Incident Commander

Technical Resource
SOC Operations

Service Owner
Service Desk
Activity
Incident Identification A R R R R R
Initial Communications A,R R R C C
Escalation A,R R R R R R
Incident Coordination A,R C C
Conference Bridge A,R R I C R
External Communication R R I I C A,R
Internal Communication C,I R A C,I
Investigation I I I R C,I A
Resolution A,R R R R
Incident Documentation A,R R C C C C C



Appendix C: Critical Services
1. Central Networking Services
   a. DNS, DHCP, Infoblox
   b. Core / Data Center Routers
   c. Core / Data Center Firewalls
   d. Load Balancer
2. Data Center
   a. Facilities / Power
   b. Shared Storage
   c. Virtualization
3. E-mail
4. PIN / LDAP
5. University website
6. College website
7. Phone System / Voicemail / i3
8. PeopleSoft
9. Oracle Financials
10. HarvIE
11. CAADS
12. iSites / Canvas



Appendix D: Major Incident Process Timeframes (Estimated)

T0          Major Incident Identified
            • Major Incident Escalated to Incident Commander
            • Service Desk Informed
            • Service Owner Informed
            • Technical Resources Informed
            • Call Bridge Opened

T+30        Initial Communications
            • Service Desk Updates ACD System
            • Service Desk Logs Remedy Ticket
            • Incident Commander Sends HUIT Alert
            • Incident Commander Updates Website
            • Service Owner or Incident Commander Sends External Notification

T+45        First Update
            • Initial Diagnosis?
            • Estimated Time to Resolution?
            • Additional Communications Need to be Sent?
            • Agree upon Update Times and Intervals (e.g., every 30 minutes)

T+Interval  Regular Updates
            • Update on Progress?
            • Additional Resources?
            • Updated Communications?

Resolution  Service Owner Confirms Service is Restored to Acceptable Levels
            • Incident Commander Notifies HUIT Alert
            • Incident Commander Updates Website
            • Service Owner Sends External Communication
            • Incident Commander Resolves Major Incident
            • Incident Commander Begins Incident Report
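
The estimated timeframes above can be turned into clock times for a given incident. The sketch below is illustrative only. It assumes that the T+30 and T+45 offsets are minutes after T0 (the appendix does not state the unit) and that regular updates start one interval after the first update. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: the T+30 and T+45 offsets in the appendix are minutes after T0.
INITIAL_COMMUNICATIONS_OFFSET = timedelta(minutes=30)
FIRST_UPDATE_OFFSET = timedelta(minutes=45)


def checkpoint_schedule(t0, update_interval=timedelta(minutes=30), regular_updates=3):
    """Return (label, due time) pairs for the estimated checkpoints.

    update_interval is the interval agreed at the first update; the default
    follows the "every 30 minutes" example in the appendix.
    """
    first_update = t0 + FIRST_UPDATE_OFFSET
    schedule = [
        ("Major Incident Identified", t0),
        ("Initial Communications", t0 + INITIAL_COMMUNICATIONS_OFFSET),
        ("First Update", first_update),
    ]
    # Regular updates follow the first update at the agreed interval.
    for n in range(1, regular_updates + 1):
        schedule.append((f"Regular Update {n}", first_update + n * update_interval))
    return schedule
```

Calling `checkpoint_schedule(datetime(2018, 6, 28, 9, 0))` yields checkpoints at 09:00, 09:30 and 09:45, followed by regular updates every 30 minutes from 10:15. Resolution has no fixed offset, so it is not part of the schedule.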



Glossary
Core Service—Any HUIT-provided service that serves multiple customer groups and end-users, and is a
centralized service. See non-core service.

Critical Service—Any service whose failure or degradation creates an immediate and large-scale impact.
See Appendix C.

Incident Commander—The Incident Commander is responsible for the lifecycle of the Major Incident,
including coordination, documentation, and communication, and is the owner of the incident.

Major Incident—A Major Incident occurs when a core production service is interrupted or degraded,
resulting in a noticeable disruption of the customers’ ability to carry out University teaching,
learning, research and administration.

Non-Core Service—Any HUIT service that is hosted or provided to one specific customer or group of
users for a non-centralized purpose.

Service Owner—In the context of the Major Incident process, the service owner is a HUIT staff member
who has a comprehensive view of the service including but not limited to customer and user
relationships, a broad understanding of the components required to deliver that service, and the
expectations for the quality set for that service.

Utility—The functionality offered by a service to meet a particular need. Utility can be summarized as
‘what a service does’, and can be used to determine whether a service is able to meet its
required outcomes or is ‘fit for purpose’. The business value of an IT service is created by a
combination of utility and warranty.

Warranty – Assurance that a product or service will meet agreed requirements. This may be a formal
agreement such as a service level agreement or contract, or it may be implied through ad-hoc
messages or agreements. Warranty refers to the ability of a service to be available when
needed, to provide the required capacity, and to provide the required reliability in terms of
continuity and security. Warranty can be summarized as 'how the service is delivered', and can
be used to determine whether a service is 'fit for use'. The business value of an IT service is
created by the combination of utility and warranty. See also service validation and testing.
