Incident Management
Definitions
- Incident - unplanned disruption of service functionality
Before an incident
- Agree on a clear definition of an incident.
- Define what counts as an incident and what counts as a major incident.
- Major incident qualities:
- Timing is a surprise
- Timing is important
- Situation not well understood at the beginning
- Involves many people with different skills
- Define incident severity levels.
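
One way to make severity levels concrete is to write them down as data, so nobody has to argue about them mid-incident. A minimal sketch in Python, assuming a hypothetical three-level scale; the names `Severity`, `classify` and `is_major`, and the percentage thresholds, are illustrative assumptions and not part of any standard:

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; a lower number means more severe."""
    SEV1 = 1  # major incident
    SEV2 = 2  # significant degradation
    SEV3 = 3  # minor issue


def classify(affected_users_pct: float, core_function_down: bool) -> Severity:
    """Map observed impact to a severity level (thresholds are assumptions)."""
    if core_function_down or affected_users_pct >= 50:
        return Severity.SEV1
    if affected_users_pct >= 5:
        return Severity.SEV2
    return Severity.SEV3


def is_major(severity: Severity) -> bool:
    """Only the top level counts as a major incident in this sketch."""
    return severity == Severity.SEV1
```

Each team would pick its own criteria; the point is only that the mapping from impact to severity is agreed on before an incident starts.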
Timeline
Challenges
- No one responding
- Not the right people
- Solving the wrong problem
- People making things worse
- Too many people involved
- Stakeholders left in the dark
- Everyone stepping on each other’s toes
Best Practices
- Have two modes of operation with a clear distinction between them
- Normal Operations
- Emergency Operations
- Have different severities for incidents
- Practice, practice, practice, then practice some more.
Incident Command System Principles
- Common terminology - everyone in the team should use the same terms.
- Accountability - everyone participating in an incident takes responsibility for resolving it as a first priority.
- Unity of command - there is a single chain of command leading to an incident commander.
- Explicit transfers of responsibility - get an explicit confirmation whenever a role is transferred.
- Modular organization - separate responsibilities into roles.
- Integrated communication - have defined, common communication channels.
If you’re just learning incident response, you can go far by guessing the right thing to do based on ‘what would the fire department do in this situation?’ (c) J. Paul Reed
- What would their on-call schedule be?
- Would they stop everything for a press conference in the middle of an incident?
- Would the mayor come in and take over the incident and tell the team exactly what to do?
Incident org chart
Definitions:
- IC - Incident Commander
- no action is taken without the explicit approval of the IC
- does not do the work to resolve the incident (neither gathering data nor applying fixes)
- every decision should follow consensus (ask: “are there any strong objections?”)
- Deputy
- keeps the IC focused
- takes on any and all additional tasks as necessary
- follows up on reminders and ensures tasks aren’t missed
- acts as a ‘hot standby’ for IC
- Scribe - documents what is going on
- document the incident management timeline and important events as they occur
- key actions as they are taken
- good to document it publicly (e.g. in slack)
- document all follow up tasks
- Communication Liaison - Communication Lead
- communicate with users, stakeholders and executives
- update every 20-30 minutes
- track customer tickets
- report on customer impact
- SME - Subject Matter Expert (firefighter)
- TL - Tech Lead (coordinate SME teams to aggregate data for IC in case of having too many SMEs)
Important:
* Focus on roles, not individuals.
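
The idea of roles that are held by whoever is available, and handed over only with explicit confirmation, can be sketched as a small data structure. This is an illustrative Python sketch, not a prescribed tool: the `Incident` class, its `start` and `transfer` methods, and the `confirmed` flag are all hypothetical names.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    IC = "Incident Commander"
    DEPUTY = "Deputy"
    SCRIBE = "Scribe"
    COMMS = "Communication Lead"


@dataclass
class Incident:
    """Tracks who currently holds each role (hypothetical structure)."""
    holders: dict = field(default_factory=dict)
    timeline: list = field(default_factory=list)

    def start(self, on_call: str) -> None:
        # At the start, the on-call person holds every role.
        for role in Role:
            self.holders[role] = on_call
        self.timeline.append(f"{on_call} opened the incident as IC")

    def transfer(self, role: Role, to: str, confirmed: bool) -> None:
        # A handoff counts only after the receiver explicitly confirms it.
        if not confirmed:
            raise ValueError(f"{to} has not confirmed taking {role.value}")
        previous = self.holders.get(role, "nobody")
        self.holders[role] = to
        # Record the handoff the way a scribe would.
        self.timeline.append(f"{role.value}: {previous} -> {to}")
```

For example, `incident.start("Max")` followed by `incident.transfer(Role.SCRIBE, to="Dana", confirmed=True)` leaves Max as IC and Dana as scribe, with both events in the timeline. The names are placeholders; the structure tracks roles, not individuals.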
People involved as the incident grows:
- Start of an incident
- Usually the IC is the on-call person
- At the start the IC fills all the roles, including communication and scribe
- The IC can explicitly transfer roles later
- Escalate and involve more firefighters
- Add more visibility to the users
- Add more firefighters
Steps in the incident
- Don’t panic or, at least, do not show that you panic.
- Introduce yourself
- Hello, I’m Max. I’m from private cloud team.
- Clear communication is essential.
- Clear is better than concise.
- Is there an IC on the call?
- the most obvious IC is the on-call person
- if there is no IC, take ownership yourself.
- The main task is to stabilize
- ask for status
- Decide action by gaining consensus (ask: are there any strong objections?)
- what risks are involved?
- making a wrong decision is better than making no decisions
- Assign task
- follow up on task completion
- Clear ownership
- do not ask: Can someone do this?
- better to call someone by name and ask them to do this action
- Time box every task
- Get explicit acknowledgment
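
The assignment rules above (a named owner, a time box, explicit acknowledgment, follow-up) can be sketched in a few lines of Python. This is a hedged illustration only: `Task`, `assign` and `needs_follow_up` are hypothetical names, and the rule for when to follow up is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Task:
    description: str
    owner: str                 # a named person, never "someone"
    due: datetime              # end of the time box
    acknowledged: bool = False
    done: bool = False


def assign(description: str, owner: str, minutes: int, now: datetime) -> Task:
    """Create a time-boxed task for a named owner."""
    if not owner.strip():
        raise ValueError("every task needs a named owner")
    return Task(description, owner, due=now + timedelta(minutes=minutes))


def needs_follow_up(task: Task, now: datetime) -> bool:
    """True when the IC should check in: no acknowledgment yet, or time box expired."""
    if task.done:
        return False
    return not task.acknowledged or now >= task.due
```

In this sketch a task with no owner is rejected outright, and an unacknowledged task is flagged immediately, which mirrors asking a specific person and waiting for an explicit "yes, I have it".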
Crisis Patterns
- Thematic Vagabonding (butterfly minds) - you jump from one hypothesis to another very quickly without spending enough time on any one of them.
- Goal Fixation - you spend too much time trying to confirm a single hypothesis. You rationalize away signals of other contributing factors because you think you already know the cause (gut feeling).
- Heroism - someone goes off and fixes the problem silently. If they fail, they have wasted precious time; if they succeed, it promotes a culture that depends on smart individuals, which does not scale.
- Distraction - irrelevant noise in comm channels, e.g. doing postmortems during the incident.