notes

Incident Management

Definitions

  1. Incident - unplanned disruption of service functionality

Before an incident

  1. Agree on a clear definition of an incident.
  2. Define what counts as a minor and what counts as a major incident.
  3. Major incident qualities:
    • Timing is a surprise
    • Timing is important
    • Situation is not well understood at the beginning
    • Involves many people with different skills
  4. Define incident severity levels.
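
One way to write severity levels down so that everyone uses the same scale is a small enum. This is only a sketch: the level names (SEV1 to SEV3), their meanings, and the threshold for "major" are assumptions for illustration, not something these notes prescribe.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale: a lower number means a more severe incident."""
    SEV1 = 1  # assumed meaning: critical, customer-facing outage
    SEV2 = 2  # assumed meaning: major degradation
    SEV3 = 3  # assumed meaning: minor issue


# Assumption: SEV1 and SEV2 are treated as major incidents.
MAJOR_THRESHOLD = Severity.SEV2


def is_major(severity: Severity) -> bool:
    """Return True when the incident should be handled as a major incident."""
    return severity <= MAJOR_THRESHOLD
```

With a shared definition like this, "is this major?" becomes a lookup instead of a debate during the incident.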

Timeline

Incident Response Timeline

Challenges

  1. No one responding
  2. Not the right people
  3. Solving the wrong problem
  4. People making things worse
  5. Too many people involved
  6. Stakeholders left in the dark
  7. Everyone stepping on each other’s toes.

    Boys playing football

Best Practices

  1. We should have two modes of operation with a clear distinction between them
    • Normal Operations
    • Emergency Operations
  2. Have different severities for incidents
  3. Practice, practice, practice, then practice some more.

Incident Command System Principles

  1. Common terminology - everyone in the team should use the same terms.
  2. Accountability - everyone participating in an incident takes responsibility for resolving it as a first priority.
  3. Unity of command - there is a single chain of command leading to an incident commander (IC).
  4. Explicit transfers of responsibility - get an explicit confirmation when transferring a role.
  5. Modular organization - separate responsibilities into roles.
  6. Integrated communication - have defined, common communication streams.

If you’re just learning incident response, you can go far by guessing the right thing to do based on ‘what would the fire department do in this situation?’ (c) J. Paul Reed

  1. What would their on-call schedule be?
  2. Would they stop everything for a press conference in the middle of an incident?
  3. Would the mayor come in and take over the incident and tell the team exactly what to do?

Incident org chart

Definitions:

Important: Focus on roles, not individuals.

People involved as the incident grows:

  1. Start of an incident

    Incident Response Step 1

    • Usually the incident commander (IC) is the on-call person
    • At the start, the IC fulfills all the roles, including communication and scribe
    • The IC can explicitly transfer any of these roles later
  2. Escalate and involve more firefighters

    Incident Response Step 2

  3. Add more visibility to the users

    Incident Response Step 3

  4. Add more firefighters

    Incident Response Step 4
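
The progression above (the on-call person starts out holding every role and hands roles over explicitly as more people join) can be sketched as a small data structure. The role identifiers, the `Incident` class, and the `transfer` method are hypothetical names chosen for this illustration; the name "Dana" in the usage line is made up as well.

```python
from dataclasses import dataclass, field

# Hypothetical role identifiers; the notes mention IC, communication and scribe.
ROLES = ("incident_commander", "communications", "scribe")


@dataclass
class Incident:
    on_call: str
    roles: dict = field(default_factory=dict)

    def __post_init__(self):
        # At the start the on-call person fills every role.
        self.roles = {role: self.on_call for role in ROLES}

    def transfer(self, role: str, new_owner: str, acknowledged: bool) -> None:
        """Hand a role over only after the new owner explicitly confirms."""
        if role not in self.roles:
            raise KeyError(f"unknown role: {role}")
        if not acknowledged:
            raise ValueError(f"{new_owner} has not confirmed taking over {role}")
        self.roles[role] = new_owner


incident = Incident(on_call="Max")
incident.transfer("scribe", "Dana", acknowledged=True)
```

Refusing the transfer without an acknowledgment mirrors the explicit-transfer principle: a role is never silently dropped or silently assumed.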

Steps in the incident

  1. Don’t panic or, at least, do not show that you panic.
  2. Introduce yourself
    • Hello, I’m Max. I’m from the private cloud team.
    • Clear communication is essential.
    • Clear is better than concise.
  3. Is there an IC on the call?
    • the most obvious IC is the on-call person
    • if there is no IC, take ownership yourself.
  4. The main task is to stabilize
    • ask for status
    • Decide action by gaining consensus (ask: are there any strong objections?)
      • what risks are involved?
      • making a wrong decision is better than making no decision
    • Assign the task
    • follow up on task completion
  5. Clear ownership
    • do not ask: Can someone do this?
    • better to call someone by name and ask them to take this action
  6. Time box every task
  7. Get explicit acknowledgment
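
Steps 5 to 7 (a named owner, a time box, an explicit acknowledgment) can be captured in a tiny task record that tells the IC when to follow up. This is a minimal sketch: the `Task` class, its fields, and the rule that an unacknowledged task needs an immediate follow-up are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Task:
    description: str
    owner: str            # a named person, never "someone"
    time_box: timedelta   # how long before the IC checks back
    assigned_at: datetime
    acknowledged: bool = False

    def acknowledge(self) -> None:
        """Record the owner's explicit confirmation that they took the task."""
        self.acknowledged = True

    def needs_follow_up(self, now: datetime) -> bool:
        """True if the task was never acknowledged or its time box has expired."""
        return not self.acknowledged or now >= self.assigned_at + self.time_box
```

For example, a task assigned to Max with a 10-minute time box needs a follow-up right away if Max never confirms it, and again once the 10 minutes are up even if he did.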

Crisis Patterns

  1. Thematic Vagabonding (butterfly minds) - you jump from one hypothesis to another very quickly without spending enough time on any one of them.
  2. Goal Fixation - you spend too much time trying to confirm a single hypothesis. You rationalize away signals of other contributing factors because you think you already know the cause (gut feeling).
  3. Heroism - someone goes off and tries to fix the problem silently. If they fail, they have wasted precious time; if they succeed, it promotes a culture of relying on smart individuals, which is not a scalable solution.
  4. Distraction - irrelevant noise in communication channels, e.g. starting the postmortem during the incident.