An Effective Method to Manage Incident Response SLA
Effective incident management provides recurring value. Learn about managing incident response in your ITIL team's operations.
ITIL defines an incident as an unplanned interruption to, or reduction in the quality of, an IT service. Service level agreements (SLAs) define the agreed-upon service level between the provider and the customer.
An incident interrupts normal service, such as when a user’s computer breaks, when the VPN won’t connect, or when the performance of a service degrades. These are unplanned events that require help from the service provider to restore normal function; incident management restores IT services to normal working levels.
Incident management focuses solely on handling and escalating incidents as they occur to restore defined service levels. The main goal is to take user incidents from a reported stage to a closed stage.
Once established, effective incident management provides recurring value for the business. It allows incidents to be resolved in shorter time frames than before. Incident management also involves creating incident models, which give support staff a defined process for incident handling so that recurring issues can be resolved quickly and efficiently. Because incident management is highly visible, and its value is evident to users at all levels of the organization, it is an important process to implement and to get buy-in for.
Operational incident management requires the following key pieces:
A service level agreement between the provider and the customer that defines incident priorities, escalation paths, and response/resolution timeframes
Incident models, or templates, that allow incidents to be resolved efficiently
Categorization of incident types for better data gathering and management
Agreement on incident statuses, categories, and priorities
Agreement on incident management role assignment
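As a rough illustration of the first item, the response and resolution timeframes in an SLA can be held as a small lookup table keyed by priority. This is a minimal sketch: the names `SlaTarget`, `SLA_TARGETS`, `resolution_deadline`, and `is_resolution_breached`, and every duration shown, are assumptions made for the example; real values come from the agreement between provider and customer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SlaTarget:
    response: timedelta    # time allowed until the first response
    resolution: timedelta  # time allowed until the incident is resolved


# Hypothetical targets for illustration only; not taken from ITIL or any real SLA.
SLA_TARGETS = {
    "Critical": SlaTarget(timedelta(minutes=15), timedelta(hours=4)),
    "High": SlaTarget(timedelta(hours=1), timedelta(hours=8)),
    "Medium": SlaTarget(timedelta(hours=4), timedelta(days=2)),
    "Low": SlaTarget(timedelta(hours=8), timedelta(days=5)),
}


def resolution_deadline(priority: str, reported_at: datetime) -> datetime:
    """Return the latest time by which the incident should be resolved."""
    return reported_at + SLA_TARGETS[priority].resolution


def is_resolution_breached(priority: str, reported_at: datetime, now: datetime) -> bool:
    """True if the resolution target for this priority has already passed."""
    return now > resolution_deadline(priority, reported_at)
```

With these assumed targets, a Critical incident reported at 09:00 would have a resolution deadline of 13:00 the same day. The sketch uses plain elapsed time; a real SLA clock may also need to account for business hours and paused states.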
The Incident Management Process
In ITIL, incidents go through a structured workflow that encourages efficiency and best results for both providers and customers. ITIL recommends that the incident management process follow these steps:
Incident Response (Diagnosis, escalation, investigation, resolution, recovery, and closure)
Incident Identification and Logging
The first step in the life of an incident is incident identification. In an IM/AM case, incidents come from automated notices such as monitoring software alerts, emails, support chat, and so on.
Once an incident is identified, it is logged as a ticket. The ticket should include information such as the user’s name, the incident description, and the date and time of the incident report. The logging process can also include categorization, prioritization, and so on.
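The logging step can be sketched as a small record type. The fields for user name, description, and report time come from the text above; the class name `IncidentTicket`, the helper `log_incident`, the `source` and `status` fields, and the in-memory ID counter are assumptions for the example, not the schema of any particular ITSM tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count
from typing import Optional

# Hypothetical in-memory ID generator; a real system would use its database.
_ticket_ids = count(1)


@dataclass
class IncidentTicket:
    user_name: str
    description: str
    reported_at: datetime
    source: str = "email"   # e.g. monitoring alert, email, support chat
    status: str = "New"     # assumed initial status
    ticket_id: int = field(default_factory=lambda: next(_ticket_ids))


def log_incident(user_name: str, description: str, source: str = "email",
                 reported_at: Optional[datetime] = None) -> IncidentTicket:
    """Create a ticket, stamping it with the current UTC time if none is given."""
    if reported_at is None:
        reported_at = datetime.now(timezone.utc)
    return IncidentTicket(user_name, description, reported_at, source)


ticket = log_incident("Dana", "VPN won't connect", source="support chat")
```

Recording the report time at logging matters because the SLA response and resolution clocks are usually measured from it.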
Incident categorization is a vital step in the incident management process.
Categorization structures in IT Service Management are divided into two distinct components: Operational Categorization and Product Categorization.
Operational categorization is a three-tier structure that helps you to define the work that is being done for a particular incident. This structure is also used to qualify reporting in the system, qualify how groups and support staff are assigned, and route approvals.
Product categorization is a three-tier structure that helps you to define a description of the object or service on which you are performing the work (for example, Hardware, Peripheral Device, Monitor).
The structure of the Operational Categorization template is Action -> Object -> Subject. Stated from the perspective of the user or customer reporting the outage, the classification should read: “I (the user) need you (support) to <Op Cat1> the <Op Cat2> on my <Op Cat3>”. These values should be in four sections, differentiated by the value in the incident type field.
User Service Restoration: Related to existing products broken or service interrupted.
Fix/Repair -> Connectivity -> Network
Fix/Repair -> Hardware -> Laptop
Infrastructure Event: Created exclusively from Event Management application(s) feeding ITSM. These incidents can use any of the User Service Restoration OpCats, in addition to the following exclusive ones, which are likely to be generated by an automated alarm but do not result in any action.
Check/Verify -> Service -> Server
Check/Verify -> Alarm -> Server
Infrastructure Restoration: These are exclusively for Events that are not only created by an automated rule but resolved by preset automated action without the need for human intervention. They would use the User Service Restoration OpCats.
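The three-tier structure and its per-incident-type grouping can be sketched as follows. The category values are the ones listed above; the names `OpCat`, `EXCLUSIVE_OP_CATS`, and `allowed_op_cats` are hypothetical, and the rule that the infrastructure types may reuse the User Service Restoration values is one reading of the descriptions above.

```python
from typing import List, NamedTuple


class OpCat(NamedTuple):
    action: str   # Op Cat1
    obj: str      # Op Cat2
    subject: str  # Op Cat3

    def as_request(self) -> str:
        """Render the categorization from the reporting user's perspective."""
        return f"I need you to {self.action} the {self.obj} on my {self.subject}"


# Values that belong to each incident type on its own.
EXCLUSIVE_OP_CATS = {
    "User Service Restoration": [
        OpCat("Fix/Repair", "Connectivity", "Network"),
        OpCat("Fix/Repair", "Hardware", "Laptop"),
    ],
    "Infrastructure Event": [
        OpCat("Check/Verify", "Service", "Server"),
        OpCat("Check/Verify", "Alarm", "Server"),
    ],
    "Infrastructure Restoration": [],
}


def allowed_op_cats(incident_type: str) -> List[OpCat]:
    """Return the categorizations selectable for a given incident type."""
    own = list(EXCLUSIVE_OP_CATS.get(incident_type, []))
    if incident_type in ("Infrastructure Event", "Infrastructure Restoration"):
        # Both infrastructure types may also use the user restoration values.
        return own + EXCLUSIVE_OP_CATS["User Service Restoration"]
    return own
```

Reading a value back through `as_request()` is a quick sanity check: if the sentence does not make sense, the tiers are probably in the wrong order.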
The details of the exact software and hardware in play are defined in the Product Categorizations. Note that these values are for Incident Types that are Incident Related.
Operational Categorizations tell what needs to happen but are deliberately vague, or generic, in terms of exactly which objects will be affected. The categorization of a specific laptop or server is more appropriate to mention in the ProdCats. The ProdCat of a CI is closely related to that mentioned in OpCat3, since the server is what is being changed. Then the relationship between the Incident and the actual CI(s) affected can be drawn, using the same ProdCats. The principle is that the Cat3 is the object being most affected, and should be detailed in the ProdCats. Optionally (and probably optimally) both the Cat2 and Cat3 CIs should also be related to the Incident.
Product Categorizations are not application-specific. They should reflect a combination of sources, and use the subset of the aggregate appropriate to the installation. A potential source of this data is the CMDB.
Priority reflects the organizational response required for an Incident. Establishing a priority coding system requires two major parts:
1. Definition of organizational response, for example, Critical, High, Medium, Low or Platinum, Gold, Silver, Bronze
2. A method for determining which response to apply to any given incident
ITIL presents an example of a 2-part priority coding system with five priority levels or tiers: 1-Critical, 2-High, 3-Medium, 4-Low, and 5-Planning.
It then offers a simple matrix, with impact on the top and urgency on the side, to select the priority. Thus, establishing priority is mostly a matter of two things: impact and urgency.
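A matrix of this kind can be sketched in a few lines. The five priority labels are the ones named above; the three-level High/Medium/Low scale for impact and urgency, and the rule of adding the two ranks, are assumptions chosen because they produce exactly five tiers. An organization may fill in its matrix differently.

```python
# Priority labels from the five-tier example.
PRIORITY_LABELS = {
    1: "1-Critical",
    2: "2-High",
    3: "3-Medium",
    4: "4-Low",
    5: "5-Planning",
}

# Assumed three-level scale for both impact and urgency (1 is the most severe).
LEVELS = {"High": 1, "Medium": 2, "Low": 3}


def priority(impact: str, urgency: str) -> str:
    """Look up the priority for an impact/urgency pair.

    Assumed rule: priority rank = impact rank + urgency rank - 1,
    which maps the 3x3 matrix onto the ranks 1 through 5.
    """
    return PRIORITY_LABELS[LEVELS[impact] + LEVELS[urgency] - 1]


# Print the matrix with impact across the top and urgency down the side.
print("urgency \\ impact".ljust(18) + "".join(level.ljust(12) for level in LEVELS))
for urgency in LEVELS:
    row = "".join(priority(impact, urgency).ljust(12) for impact in LEVELS)
    print(urgency.ljust(18) + row)
```

Under this assumed rule, high impact with high urgency gives 1-Critical, and low impact with low urgency gives 5-Planning.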
You can configure assignment routing so that the system automatically assigns records, such as investigations or change requests, to the appropriate support group.
When an ITSM application uses the routing order, which is a feature of many of the main ticketing forms, it uses information from the form to find an assignment entry and select the support group for assignment.
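The lookup described above can be sketched as a match between form fields and a list of assignment entries. This illustrates the idea only and is not the algorithm of any specific ITSM product: the names `AssignmentEntry` and `route`, the field names, the support group names, and the "most specific match wins" rule are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AssignmentEntry:
    support_group: str
    # Form fields this entry requires; an empty dict matches every record.
    criteria: Dict[str, str] = field(default_factory=dict)

    def matches(self, form: Dict[str, str]) -> bool:
        """True if every required field has the required value on the form."""
        return all(form.get(key) == value for key, value in self.criteria.items())


def route(form: Dict[str, str], entries: List[AssignmentEntry]) -> Optional[str]:
    """Pick the support group of the most specific matching entry, if any."""
    candidates = [entry for entry in entries if entry.matches(form)]
    if not candidates:
        return None
    # Assumed tie-break: the entry with the most criteria wins; earlier entries win ties.
    return max(candidates, key=lambda entry: len(entry.criteria)).support_group


# Hypothetical assignment entries and support groups.
ENTRIES = [
    AssignmentEntry("Service Desk"),  # catch-all fallback
    AssignmentEntry("Network Support", {"op_cat2": "Connectivity"}),
    AssignmentEntry("Hardware Support", {"op_cat2": "Hardware", "prod_cat1": "Hardware"}),
]

group = route({"op_cat1": "Fix/Repair", "op_cat2": "Connectivity", "op_cat3": "Network"}, ENTRIES)
```

Keeping a catch-all entry means a record with unexpected categorization values still lands with a group rather than staying unassigned.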