
Postmortem

Downtime is like a present - it’s good until you get the same one twice.

The goal of a postmortem is to understand how your system REALLY works.

Safety

  1. Safety - the system quality that is necessary and sufficient to ensure that the number of events that can be harmful to workers, the public, or the environment is acceptably low.
  2. Safety 1:
    • absence of incidents;
    • people and their activities are considered a threat to safety;
    • safety activity is centered around creating barriers and removing causes.
  3. Safety 2:
    • errors are a routine part of any complex system;
    • there is no clear division between the system working and being broken;
    • people create safety by their adaptability.

Basics

  1. Real incidents
    • Real incidents are not unusual.
    • Real incidents are not all catastrophic issues - conduct postmortems for minor issues as well.
    • Real incidents do not have a single, simple root cause. Contributing Factors!
    • Postmortems are to learn, not to fix.
  2. Postmortem - taking the opportunity to learn how your system really does work, after your system acts differently from how you expect it to, in order to increase your system and team’s resilience.
  3. Contributing Factor - any factor contributing to the occurrence of a failure, influencing its course, or hindering its resolution.

Biases

  1. Fundamental Attribution Error - when we explain other people’s actions by their personality, not the context they find themselves in, but we explain our own actions by our context, not our personality.
    • Remember everyone has good intent and did their best at the time.
  2. Confirmation Bias - when we seek information that reinforces existing positions, and ignore alternative explanations. We interpret ambiguous information in favour of our existing assumptions.
    • Gather many viewpoints; let the facts guide your conclusions and not vice versa.
  3. Hindsight Bias - when we recall events to form a judgment and contextualize them with knowledge of the outcome - often making ourselves look better in the process.
    • Put yourself into the viewpoint of the practitioner at the time.
  4. Negativity Bias - even when of equal intensity, things of a more negative nature have a greater effect on one’s psychological state than neutral or positive things (we magnify small errors if they lead to an incident).
    • Review what went right; don’t rush to fix, learn instead.
  5. Curse of Knowledge - when an individual communicating with other individuals unknowingly assumes the others have the background and knowledge to understand it (experts often have trouble explaining things to novices).
    • Explain or ask others to explain basics; don’t assume people know things.
  6. Outcome Bias - judging a past decision based on its outcome.

Language

  1. ~Why~ - do not use this word, as it forces people to justify their actions.
  2. How - helps to distance people from the actions they took, but also limits the scope of the inquiry, as we focus on mechanics, not the relations at play in the larger system.
  3. What - uncovers reasoning, which is important for building empathy with people in complex systems.
    • What did you think was happening?
    • What did you do next?

Remember: People make what they consider to be the best decision given the information available to them at the time

Research

  1. Stack traces.
  2. Forensic logs (see the timeline sketch after this list).
  3. Images (cores, dumps, etc.)
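
A minimal way to use the forensic logs during research is to merge everything into a single incident timeline. Below is a sketch in Python, assuming plain-text logs whose lines start with an ISO-8601 timestamp; the file names are hypothetical.

```python
# A minimal sketch: merge several logs into one timeline sorted by timestamp.
# Assumes lines start with an ISO-8601 timestamp like "2024-05-01T12:00:00";
# the file names below are hypothetical.
from datetime import datetime
from pathlib import Path

def parse_line(source, line):
    """Return (timestamp, source, message), or None if the line has no timestamp."""
    try:
        ts = datetime.fromisoformat(line[:19])
    except ValueError:
        return None
    return ts, source, line[19:].strip()

def build_timeline(paths):
    """Collect timestamped lines from every log and sort them into one timeline."""
    events = []
    for path in paths:
        for line in Path(path).read_text().splitlines():
            parsed = parse_line(path, line)
            if parsed is not None:
                events.append(parsed)
    return sorted(events)  # tuples sort by their first element: the timestamp

for ts, source, message in build_timeline(["app.log", "lb.log"]):
    print(f"{ts.isoformat()}  [{source}]  {message}")
```

Interleaving sources by timestamp makes it easier to reconstruct what the practitioner saw at each moment, instead of reading each log in isolation.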

Content

Postmortem Template

Analysis

What went well

  1. A way to fight Negativity Bias and Fundamental Attribution Error.
  2. General questions:
    • What aspects of our system and team contributed to our success here?
    • During this incident and the events leading up to it, how did we actively create and sustain success?
    • How are we already monitoring, responding, anticipating, and learning?
    • How was the problem fixed - how did the responder figure out what was wrong and how to remediate it?
  3. Question examples
    • What safeguards were in place?
    • What went according to plan?
    • What was effective about detection, analysis and remediation?

Contributing Factors

There is no single root cause in complex systems.

  1. How did it happen?
  2. What hindered detection?
  3. What hindered diagnosis?
  4. What hindered resolution?
  5. Was the incident response effective?
  6. What could have gone terribly wrong?
  7. Look at the incident response itself, as well as the other contributing factors.

E.g. an engineer updated the config wrong. Instead of stopping at “human error”, ask what allowed it: no automated config validation, a review that did not catch the mistake, a rollout that hit everywhere at once, slow detection. Each of these is a contributing factor and a candidate for a corrective action.

Corrective Actions

  1. Ask:
    • How can we eliminate this class of outages? (a config-validation sketch follows after this list)
    • How can we reduce the chance of recurrence of this outage?
    • How can we improve detection, diagnosis, and resolution?
    • How can we reduce impact?
  2. Be specific.
  3. Prepare a list of suggestions.
    • Quick Fixes
      • bug fixes
      • workarounds
      • band-aids
      • adding a monitor
    • Design Changes
      • deeper reworking of one or more technical components or processes
    • Process Changes
      • performing tasks differently
  4. Determine the feasibility and level of effort needed to implement each proposal.
    • Just because you can do something doesn’t mean you should do it!
  5. Don’t forget Efficiency-Thoroughness Trade-Off Principle
    • The ETTO principle is the principle that there is a trade-off between efficiency or effectiveness on one hand, and thoroughness (such as safety assurance and human reliability) on the other.
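
To make “eliminate this class of outages” concrete for the wrong-config example above, one quick fix is a pre-deploy validation step. A minimal sketch in Python, assuming a JSON config; the required keys and the type check are hypothetical.

```python
# A minimal sketch of a pre-deploy config check; the required keys and the
# timeout type check are hypothetical - adapt them to your own config schema.
import json
import sys

REQUIRED_KEYS = {"listen_port", "upstream_url", "timeout_seconds"}

def validate(path):
    """Return a list of problems; an empty list means the config looks sane."""
    try:
        with open(path) as handle:
            config = json.load(handle)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot parse {path}: {exc}"]
    if not isinstance(config, dict):
        return [f"{path}: top-level value must be an object"]
    errors = []
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    timeout = config.get("timeout_seconds")
    if timeout is not None and not isinstance(timeout, (int, float)):
        errors.append("timeout_seconds must be a number")
    return errors

if __name__ == "__main__":
    problems = validate(sys.argv[1])
    for problem in problems:
        print(f"config error: {problem}")
    sys.exit(1 if problems else 0)  # a non-zero exit blocks the deploy in CI
```

Wired into CI, the non-zero exit turns “engineer updated the config wrong” from an outage into a failed build - a barrier aimed at the contributing factor rather than the person.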

Postmortem Meeting

Debriefing Facilitation Guide

Sharing

Downtime reduces users’ confidence in and loyalty to your product.

So we need to be transparent about what happened and how we are going to prevent it from happening again.

Transparency:

  1. Builds trust - these guys know what they are doing.
  2. Increases the perception of reliability.
  3. Reduces support costs.
  4. Controls the message.

Sharing publicly template

  1. Summary of what happened
  2. What happened in detail
    • What went well
    • What didn’t go well
    • How and when it was restored
    • What is the user impact (e.g. lost transactions, loss of PII)
  3. What We’re Doing About It

Analyzing and reducing the number of incidents

  1. Keep a History of Outages
  2. Do Retrospectives
    • Focus on facts
    • Be positive; amplify the good
    • Look below the surface for the second story of how things happen
    • Think about it; don’t jump to solutions.
  3. Analyze near misses (when things almost went wrong)
    • Where did safety work?
    • Where were there issues?
    • Are there recurring minor issues? (a small history-analysis sketch follows below)
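
To make the outage history and near-miss analysis concrete, here is a minimal sketch in Python; the record fields and the sample data are hypothetical.

```python
# A minimal sketch of an outage history; fields and sample data are hypothetical.
from collections import Counter
from dataclasses import dataclass
from datetime import date

@dataclass
class Outage:
    day: date
    summary: str
    contributing_factors: list
    near_miss: bool = False  # near misses are recorded alongside real incidents

history = [
    Outage(date(2024, 3, 2), "checkout 5xx spike", ["bad config push", "slow alerting"]),
    Outage(date(2024, 4, 9), "stale cache served", ["bad config push"], near_miss=True),
    Outage(date(2024, 5, 17), "db failover hung", ["untested runbook"]),
]

# Contributing factors that recur across incidents and near misses
# are the recurring minor issues worth fixing first.
factor_counts = Counter(f for outage in history for f in outage.contributing_factors)
for factor, count in factor_counts.most_common():
    if count > 1:
        print(f"recurring factor: {factor} ({count} occurrences)")
```

Keeping near misses in the same history as real incidents lets one pass over the data answer all three questions above.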