What do you do after an incident happens? Find someone to blame? A better idea is to create an environment where failures are accepted and appreciated: an environment of learning that reinforces failing fast and learning from mistakes as something to strive for.
The points described here are mostly based on The DevOps Handbook.
Post-mortems should help us examine “mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure.” (John Allspaw)
Schedule the post-mortem as soon as possible after the incident occurs, while the links between cause and effect are still fresh in memory and before they fade or circumstances change.
Basics of a Post-Mortem
- Construct a detailed incident timeline
- Gather details from many different perspectives
- Ensure that nobody is punished for making mistakes
- Empower all engineers to feel safe by allowing them to give detailed accounts of their contributions to failures
- Enable and encourage people who do make mistakes to be the experts who educate the rest of the organization on how not to make them again in the future
- Accept that there is always a discretionary space where humans can decide to take action or not, and that the judgment of those decisions lies in hindsight
- Propose countermeasures to prevent a similar accident from happening in the future and ensure these countermeasures are recorded with a target date and an owner for follow-up
Whom to invite to a Post-Mortem - Stakeholders
Invite people to the post-mortem who…
- have been involved in decisions that may have contributed to the problem
- identified the problem
- responded to the problem
- diagnosed the problem
- were affected by the problem
- are genuinely interested in attending the meeting to learn
What ToDos need to be done for a Post-Mortem Meeting
Engineers should focus on:
Why did it make sense to me when I took that action?
- record the timeline of relevant events as they occurred (to best knowledge)
- what actions have been taken
- at what time
- with what effect
- which investigation paths have been considered
- which resolution steps have been taken
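The timeline items above could be captured in a small data structure. This is only a minimal sketch: the `TimelineEvent` class, its field names and the `kind` labels are hypothetical and not taken from any tool or from The DevOps Handbook.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    """One entry in the incident timeline (hypothetical structure)."""
    at: datetime          # at what time the action was taken
    action: str           # what action was taken
    effect: str           # with what effect
    kind: str = "action"  # assumed labels: "action", "investigation", "resolution"


def build_timeline(events):
    """Return the events in chronological order, regardless of entry order."""
    return sorted(events, key=lambda e: e.at)


def render_timeline(events):
    """Format the timeline as a bullet list with timestamps."""
    return "\n".join(
        f"- {e.at:%H:%M} [{e.kind}] {e.action} -> {e.effect}"
        for e in build_timeline(events)
    )
```

Sorting on entry means people can add what they remember in any order, and the chronological view is produced afterwards.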
Post-Mortem Time Schedule
Reserve enough time to find the root cause. Use the 5 Whys method to find it. Leave room for brainstorming and for deciding which countermeasures to implement.
Countermeasures should:
- be prioritized
- be assigned to an owner
- have an implementation timeline
Doing this demonstrates that continuous improvement of daily work is more important than doing the daily work itself.
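One way to keep countermeasures prioritized, owned and on a timeline is to record them in a structured form. The sketch below is an assumption about how that might look; the `Countermeasure` class, the "1 = highest" priority convention and the helper names are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Countermeasure:
    """A follow-up action agreed on in the post-mortem (hypothetical structure)."""
    description: str
    priority: int     # assumed convention: 1 = highest priority
    owner: str        # the person responsible for follow-up
    due: date         # target date for implementation
    done: bool = False


def by_priority(countermeasures):
    """Order by priority first, then by target date."""
    return sorted(countermeasures, key=lambda c: (c.priority, c.due))


def overdue(countermeasures, today):
    """Return open countermeasures whose target date has passed."""
    return [c for c in countermeasures if not c.done and c.due < today]
```

A list like this is also easy to check again later, when a post-mortem is revisited.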
Post-Mortem Documentation
What should be documented to create a good post-mortem record that fosters learning? Create an easy-to-use environment for documenting. There are three key things to document: meeting details, timeline and incident description.
Document basic meeting metadata, such as the title, the date, the contact person and who created and facilitated the post-mortem.
- Post-Mortem Title
- Post-Mortem Meeting Date
- Post-Mortem Contact Person
- Post-Mortem Created by
- Post-Mortem Facilitated by
Document when the incident happened and how long it took until the incident was resolved. Also document additional times to be able to measure and create KPIs, e.g. MTTR.
- Incident Start Time
- Incident End Time
- Incident Detect Time
- Additional Times if necessary
- e.g. First User Impact, First User Report, Hotfix
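With the start, detect and end times recorded, simple KPIs can be derived. The sketch below assumes that time to resolve is measured from incident start to incident end and that MTTR is the mean of that value across incidents. Definitions of MTTR vary between teams, so treat the class and function names as illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimes:
    """The timestamps recorded for one incident (hypothetical structure)."""
    start: datetime   # Incident Start Time
    detect: datetime  # Incident Detect Time
    end: datetime     # Incident End Time

    @property
    def time_to_detect(self) -> timedelta:
        return self.detect - self.start

    @property
    def time_to_resolve(self) -> timedelta:
        # Assumption: measured from incident start, not from detection.
        return self.end - self.start


def mean_time_to_resolve(incidents) -> timedelta:
    """Average time to resolve over a non-empty list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.time_to_resolve for i in incidents), timedelta())
    return total / len(incidents)
```

Recording the raw timestamps rather than precomputed durations keeps the data reusable if the team later settles on a different KPI definition.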
Describe the incident in short and crisp sentences. Avoid long and detailed descriptions. Add time stamps to make the chronological order clear. Avoid attaching log files or stack traces - be precise.
- Incident Severity
- What happened (Bullet point list with timestamps)
- Additional Info (e.g. images, ! no logs, no stack traces)
- Bug Tickets associated
- Short Summary
# <Title>

| Meeting        | Value          |
| -------------- | -------------- |
| Date           | <Meeting Date> |
| Contact        | <Name>         |
| Created by     | <Name>         |
| Facilitated by | <Name>         |

## Timeline

| Incident    | Value |
| ----------- | ----- |
| Start Time  |       |
| End Time    |       |
| Detect Time |       |
| Additional  |       |

## Description

> Severity: <Severity>

-

## What happened

_(Bullet point list with timestamps)_

-

## Additional Info

_(e.g. images, ! no logs, no stack traces)_

-

## Remediation

-

## Bug Tickets

<Ticket Nr/Link>

## Short Summary

## Tags
Foster documentation and ease of use
Make it easy to document post-mortems. You can use dedicated tools or any other easy-to-use system. The easier the system is to use, the more people will record and detail the outcomes. This enables more organizational learning by building a good knowledge base.
Things NOT to do
Avoid using “would have” or “could have” in statements. Be specific: it’s not a guessing game or an excuse. Describe the system as it actually exists and what actually happened instead. What did you do, or not do, that led to the issue?
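A small check can flag such phrases in a draft before the notes are published. This is a rough sketch, not a complete linter: the pattern and the function name are made up, and including “should have” next to the two phrases named above is an added assumption.

```python
import re

# Matches "would have", "could have", "should have" and their negated forms,
# e.g. "wouldn't have" or "could not have". "should" is an added assumption.
COUNTERFACTUAL = re.compile(
    r"\b(?:would|could|should)(?:\s+not|n['’]t)?\s+have\b",
    re.IGNORECASE,
)


def find_counterfactuals(text):
    """Return (line_number, phrase) pairs for each counterfactual phrase found."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in COUNTERFACTUAL.finditer(line):
            hits.append((number, match.group(0)))
    return hits
```

The check only points at wording. Rewriting the sentence in terms of what actually happened is still up to the author.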
Things to do
After a Post-Mortem we should
- widely announce the availability of the meeting notes and any associated artifacts
- place information on a centralized location where the entire organization can access it and learn from the incident
- encourage others in the organization to read them to increase organizational learning
- this increases transparency with internal and external customers, which in turn increases trust
- revisit post-mortems from time to time
- make sure countermeasures have been implemented and are still taking effect