Saturday, November 17, 2018

Incident Management

An incident is any unplanned disruption to an organization's IT services, or to other services, that affects a single user, a team, or the entire business operation. It hampers normal business continuity and can happen at any time without prior notice.

Incidents can broadly be divided into two types: system-generated incidents and user-logged incidents. System-generated incidents are raised automatically.

Example of a system-generated incident

Virus detection by antivirus software.

Example of a user-logged incident

A user notices unauthorized access to a system, or sees someone force their way into the office premises, and immediately informs the internal incident response team about the matter.

Incidents can range from civil repair work to cyber attacks. Common types include:

  • UPS failure
  • Air conditioning failure
  • Network devices failure
  • Hard disk failure
  • Access control failure
  • Internet failure
  • Software issue
  • DDoS attack
  • Cyber attack
  • Backup failure
  • System failure
  • Virus attack
  • Printer issue
  • Natural calamities such as fire outbreak, cyclone, flood, or earthquake

Incident management comprises a set of processes and principles adopted to return a service to normal operation after an incident occurs, ideally within the SLA (Service Level Agreement) time frame.

So an incident needs to be identified, prioritized based on its impact and urgency, and then assessed and resolved. Incident management covers every aspect of an incident throughout its life cycle.

Process brief

The entire incident management process should be documented prior to implementation. Key roles and responsibilities must be defined for the core team members, such as Incident Manager, Incident Process Owner, Process Operator, etc.

An incident management process has several steps, each with its own approach. The general steps are below.

Incident identification

Identify the incident, its type, and so on.

Incident Logging or incident recording

Log or record the incident, including the timestamp, the place of occurrence, the affected area, and any suspected root cause. Many automation tools are available for this task; earlier, people recorded incidents manually in registers or Excel sheets. Assign a ticket/token with an incident number to each incident.
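To make the logging step concrete, here is a minimal sketch of what an incident record could look like. The class name, field names, and the "INC" numbering format are my own assumptions for illustration, not the format of any particular ticketing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count
from typing import Optional

# Hypothetical in-memory counter; a real tool would use its own numbering scheme.
_ticket_numbers = count(1)


@dataclass
class IncidentRecord:
    description: str
    location: str                          # place where the incident occurred
    affected_area: str
    suspected_cause: Optional[str] = None  # may be unknown at logging time
    # Timestamp is taken automatically when the record is created (UTC).
    logged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Ticket/token number assigned to the incident.
    number: str = field(
        default_factory=lambda: f"INC{next(_ticket_numbers):06d}"
    )


incident = IncidentRecord(
    description="UPS failure in server room",
    location="Data centre, floor 2",
    affected_area="Core network rack",
)
print(incident.number, incident.logged_at.isoformat())
```

The record captures the same details listed above (timestamp, place, affected area, possible root cause), and the ticket number is generated at the moment of logging.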

Incident Categorization and prioritization

Categorize the incident based on risk and severity level, such as High, Medium, or Low. Prioritize the incident based on how urgently it needs to be resolved. In other words, prioritization helps the incident management team understand the importance of the incident and the timeline within which it needs to be resolved.

An incident priority matrix must be established. It is a documented guide in which the incident priority levels are set in advance, such as critical, high, medium, low, and no impact.

In the incident priority matrix, the urgency to resolve the issue, response time, resolution time, and MAT (maximum allowable time) need to be defined as per the agreed SLA (Service Level Agreement) or MSA (Master Service Agreement).
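A priority matrix can be sketched as a simple lookup from impact and urgency to a priority level, with response and resolution targets attached to each level. The mapping and all the time values below are placeholder assumptions; the real values must come from your own SLA or MSA.

```python
from datetime import timedelta

# Illustrative matrix: (impact, urgency) -> priority level.
PRIORITY_MATRIX = {
    ("high", "high"): "critical",
    ("high", "medium"): "high",
    ("high", "low"): "medium",
    ("medium", "high"): "high",
    ("medium", "medium"): "medium",
    ("medium", "low"): "low",
    ("low", "high"): "medium",
    ("low", "medium"): "low",
    ("low", "low"): "low",
}

# Placeholder targets per priority: (response time, resolution time).
# These numbers are assumptions, not values from any real agreement.
SLA_TARGETS = {
    "critical": (timedelta(minutes=15), timedelta(hours=4)),
    "high": (timedelta(hours=1), timedelta(hours=8)),
    "medium": (timedelta(hours=4), timedelta(days=2)),
    "low": (timedelta(hours=8), timedelta(days=5)),
}


def prioritize(impact: str, urgency: str) -> dict:
    """Look up the priority and its SLA targets for an incident."""
    key = (impact.lower(), urgency.lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"unknown impact/urgency combination: {key}")
    priority = PRIORITY_MATRIX[key]
    response, resolution = SLA_TARGETS[priority]
    return {
        "priority": priority,
        "response_time": response,
        "resolution_time": resolution,
    }


result = prioritize("High", "Medium")
print(result["priority"], result["resolution_time"])
```

Keeping the matrix as data rather than as nested if/else statements makes it easy to review against the documented guide and to update when the agreement changes.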

Incident assignment to team or owner

Create a task to resolve the incident and provide a resolution.

The technical team or subject matter experts (SMEs) work on the incident: they research, diagnose, and investigate the recorded issue.

SLA adherence during resolution

Follow the SLA timelines throughout the resolution process.

Escalation if further investigation is required or resolution fails at L1 level

Escalate if it is a major incident and L1 support cannot handle it.
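As a small sketch of SLA adherence, the check below computes the resolution deadline from the logging time and tells whether it has been breached. The function name and the four-hour target are assumptions chosen only for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def sla_status(
    logged_at: datetime,
    resolution_target: timedelta,
    now: Optional[datetime] = None,
) -> dict:
    """Return the SLA deadline, the time remaining, and a breach flag."""
    if now is None:
        now = datetime.now(timezone.utc)
    deadline = logged_at + resolution_target
    remaining = deadline - now
    return {
        "deadline": deadline,
        "remaining": remaining,
        "breached": remaining < timedelta(0),
    }


# Example: incident logged at 09:00 UTC with an assumed 4-hour target,
# checked at 12:30 UTC on the same day.
logged = datetime(2018, 11, 17, 9, 0, tzinfo=timezone.utc)
checked = datetime(2018, 11, 17, 12, 30, tzinfo=timezone.utc)
status = sla_status(logged, timedelta(hours=4), now=checked)
print(status["remaining"], status["breached"])  # 0:30:00 False
```

A check like this can run periodically so that the team is warned before the deadline passes rather than after.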

Provide resolution

The goal of incident management is to restore the business or services as soon as possible, hence resolution is very important. The resolution provided could be a temporary fix while work is still in progress, or a permanent resolution.

Close the incident

Close the incident once the resolution is carried out. If no feedback or further escalation is received from the customer within 24 hours of the resolution time, the issue is deemed closed.

Take user feedback on the closure and document the whole process for the incident

Check with the customer or client how satisfied they are with the resolution process, and collect any feedback they have.

Finally, add the incident resolution process to the KEDB.
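The 24-hour closure rule can be sketched as a small check. The function and parameter names are hypothetical; only the 24-hour window comes from the rule described above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

AUTO_CLOSE_AFTER = timedelta(hours=24)  # window stated in the closure rule


def can_auto_close(
    resolved_at: datetime,
    last_customer_reply: Optional[datetime],
    now: datetime,
) -> bool:
    """True if the incident can be deemed closed without further input."""
    # Any feedback or escalation after the resolution keeps the incident open.
    if last_customer_reply is not None and last_customer_reply > resolved_at:
        return False
    return now - resolved_at >= AUTO_CLOSE_AFTER


resolved = datetime(2018, 11, 17, 10, 0, tzinfo=timezone.utc)

# 25 hours later with no reply: the incident can be closed.
print(can_auto_close(resolved, None, resolved + timedelta(hours=25)))  # True

# The customer replied two hours after the resolution: keep it open.
reply = resolved + timedelta(hours=2)
print(can_auto_close(resolved, reply, resolved + timedelta(hours=25)))  # False
```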

KEDB:

KEDB stands for Known Error Database, and it is very useful in incident management.

When an incident occurs, the KEDB is consulted first to check whether a similar incident occurred in the past and whether the process to resolve it is available in the database. This minimizes the resolution time. Hence the KEDB should be updated at regular intervals to record all incidents and problems. The KEDB can further be categorized by domain.

Similarly, there is a problem management process. I can provide a brief overview of it:

A problem is an unresolved incident, or an incident that happens frequently in the system or organization. Problems can be identified from major incidents or from a combination of multiple incidents.

The problem management process is similar to the incident management process. However, the change management process is an additional process linked to problem management.

For example, during problem investigation and root cause analysis (RCA), it may be concluded that the problem cannot be solved without a change. This may involve buying new hardware or software, hiring new resources, or changing the scope. Here the change management process governs how the change is made.

So this is a basic understanding of incidents and problems. If you have further questions or comments, please share them.
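A KEDB lookup can be sketched as a keyword search over past entries, optionally filtered by domain. The entries, field names, and the simple word-overlap matching below are all made up for illustration; a real KEDB would normally live in a ticketing tool or a database with better search.

```python
# Hypothetical KEDB entries, categorized by domain.
KEDB = [
    {
        "domain": "network",
        "symptom": "internet link down at branch office",
        "resolution": "Restart the edge router and verify the ISP link.",
    },
    {
        "domain": "hardware",
        "symptom": "hard disk failure on file server",
        "resolution": "Replace the disk and rebuild the RAID array.",
    },
    {
        "domain": "network",
        "symptom": "switch port flapping",
        "resolution": "Replace the patch cable and check port settings.",
    },
]


def search_kedb(description: str, domain: str = "") -> list:
    """Return KEDB entries sharing words with the description, best match first."""
    words = set(description.lower().split())
    matches = []
    for entry in KEDB:
        if domain and entry["domain"] != domain:
            continue
        overlap = len(words & set(entry["symptom"].lower().split()))
        if overlap:
            matches.append((overlap, entry))
    # Sort by number of shared words only, highest first.
    matches.sort(key=lambda match: match[0], reverse=True)
    return [entry for _, entry in matches]


for hit in search_kedb("internet down", domain="network"):
    print(hit["symptom"], "->", hit["resolution"])
```

If the search returns a match, the team can apply the recorded resolution straight away; if not, the incident is investigated as new and its resolution is added to the KEDB afterwards.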

Thanks!

-DR
