Blameless Post-Mortems

What do you do after an incident happens? Look for someone to blame? A better idea is to create an environment where failures are accepted and appreciated: an environment of learning, which reinforces that failing fast and learning from mistakes is something to strive for.

The points described here are mostly based on The DevOps Handbook.

Introduction

Post-mortems should help us examine “mistakes in a way that focuses on the situational aspects of a failure’s mechanism and the decision-making process of individuals proximate to the failure.” (John Allspaw)

Schedule the post-mortem as soon as possible after the accident occurs, while the links between cause and effect are still fresh in memory and before they fade or circumstances change.

Basics of a Post-Mortem

Construct a detailed incident timeline
Gather details from many different perspectives
Ensure that nobody is punished for making mistakes
Empower all engineers to feel safe by allowing them to give detailed accounts of their contributions to failures
Enable and encourage people who make mistakes to be the experts who educate the rest of the organization on how not to make them again in the future
Accept that there is always a discretionary space where humans can decide to take action or not, and that the judgment of those decisions lies in hindsight
Propose countermeasures to prevent a similar accident from happening in the future and ensure these countermeasures are recorded with a target date and an owner for follow-up

Whom to invite to a Post-Mortem - Stakeholders

Invite people to the post-mortem who…

have been involved in decisions that may have contributed to the problem
identified the problem
responded to the problem
diagnosed the problem
were affected by the problem
are genuinely interested in attending the meeting to learn

What Needs to Be Done for a Post-Mortem Meeting

Engineers should focus on:

Why did it make sense to me when I took that action?

record the timeline of relevant events as they occurred (to the best of your knowledge)
what actions have been taken
at what time
with what effect
which investigation paths have been considered
which resolution steps have been taken

Post-Mortem Time Schedule

Reserve enough time to find the root cause, for example by using the 5 Whys method. Leave room for brainstorming and for deciding which countermeasures to implement.

Countermeasures should:

be prioritized
be assigned to an owner
have an implementation timeline

Doing this demonstrates that continuous improvement of daily work is more important than doing daily work itself.
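One way to keep countermeasures actionable is to record their priority, owner, and target date as structured data, so that open and overdue items are easy to list. The following is only a minimal Python sketch: the `Countermeasure` class, its field names, the convention that priority 1 is highest, and the sample entries are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Countermeasure:
    """Hypothetical record for one follow-up action from a post-mortem."""
    description: str
    owner: str
    priority: int       # assumed convention: 1 = highest priority
    target_date: date   # end of the implementation timeline
    done: bool = False


def open_by_priority(items):
    """Return unfinished countermeasures, highest priority and earliest date first."""
    return sorted(
        (c for c in items if not c.done),
        key=lambda c: (c.priority, c.target_date),
    )


def overdue(items, today):
    """Return unfinished countermeasures whose target date has passed."""
    return [c for c in items if not c.done and c.target_date < today]


# Hypothetical sample data
items = [
    Countermeasure("Add alert on queue depth", "alice", 2, date(2024, 4, 15)),
    Countermeasure("Document rollback procedure", "bob", 1, date(2024, 3, 20)),
    Countermeasure("Raise connection pool limit", "carol", 1, date(2024, 3, 10), done=True),
]
today = date(2024, 4, 1)

for c in open_by_priority(items):
    print(c.priority, c.owner, c.target_date, c.description)
print([c.owner for c in overdue(items, today)])  # ['bob']
```

A list like this can also be reused later, when post-mortems are revisited, to check that countermeasures were actually implemented.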

Post-Mortem Documentation

Decide what to document so that each post-mortem is useful and can be tracked over time to foster learning. Make documenting as easy as possible. The three key parts to document are the meeting details, the incident timeline, and the incident description.

See Template

Meeting Details

Document basic meeting metadata such as the title, the date, the contact person, and who created and facilitated the post-mortem.

Post-Mortem Title
Post-Mortem Meeting Date
Post-Mortem Contact Person
Post-Mortem Created by
Post-Mortem Facilitated by

Incident Timeline

Document when the incident happened and how long it took until the incident was resolved. Also document additional times to be able to measure and create KPIs, e.g. MTTR.

Incident Start Time
Incident End Time
Incident Detect Time
Additional Times if necessary
- e.g. First User Impact, First User Report, Hotfix
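The recorded timestamps are enough to derive simple durations that can feed KPIs. The sketch below is a minimal Python example under stated assumptions: timestamps are ISO 8601 strings, the function and key names are hypothetical, and "time to resolve" is measured from incident start to incident end, which is only one of several common ways to define MTTR.

```python
from datetime import datetime, timedelta


def incident_durations(start: str, detect: str, end: str) -> dict:
    """Derive basic durations from ISO 8601 incident timestamps."""
    t_start = datetime.fromisoformat(start)
    t_detect = datetime.fromisoformat(detect)
    t_end = datetime.fromisoformat(end)
    if not (t_start <= t_detect <= t_end):
        raise ValueError("expected start <= detect <= end")
    return {
        "time_to_detect": t_detect - t_start,
        "time_to_resolve": t_end - t_start,
    }


def mean_time_to_resolve(incidents: list) -> timedelta:
    """Average time_to_resolve over several incidents (assumed MTTR definition)."""
    total = sum((i["time_to_resolve"] for i in incidents), timedelta())
    return total / len(incidents)


# Hypothetical sample incidents
first = incident_durations(
    "2024-03-01T10:00:00", "2024-03-01T10:12:00", "2024-03-01T11:30:00"
)
second = incident_durations(
    "2024-03-09T22:05:00", "2024-03-09T22:06:00", "2024-03-09T22:35:00"
)

print(first["time_to_detect"])                 # 0:12:00
print(mean_time_to_resolve([first, second]))   # 1:00:00
```

Whichever definition is chosen, it should be applied consistently across post-mortems so that the numbers stay comparable.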

Incident Description

Describe the incident in short and crisp sentences. Avoid long and detailed descriptions. Add time stamps to make the chronological order clear. Avoid attaching log files or stack traces - be precise.

Incident Severity
What happened (Bullet point list with timestamps)
Additional Info (e.g. images, ! no logs, no stack traces)
Remediation
- Bug Tickets associated
Short Summary
Tags

Template

# <Title>

| Meeting        | Value          |
| -------------- | -------------- |
| Date           | <Meeting Date> |
| Contact        | <Name>         |
| Created by     | <Name>         |
| Facilitated by | <Name>         |

## Timeline

| Incident    | Value |
| ----------- | ----- |
| Start Time  |       |
| End Time    |       |
| Detect Time |       |
| Additional  |       |

## Description

> Severity: <Severity>

- 


## What happened 
_(Bullet point list with timestamps)_

-

## Additional Info 
_(e.g. images, ! no logs, no stack traces)_

-

## Remediation
-

## Bug Tickets

<Ticket Nr/Link>

## Short Summary

## Tags

Foster documentation and ease of use

Make it easy to document post-mortems. You can use dedicated tools or any other easy-to-use system. The easier the system is to use, the more people will record and detail the outcomes. This enables more organizational learning by building up a good knowledge base.
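As an illustration of lowering the barrier, a small script can pre-fill the template so that authors start from a skeleton rather than a blank page. This is a minimal Python sketch, not a reference to any existing tool: the function name `render_post_mortem`, its parameters, and the sample values are assumptions; only the section names come from the template above.

```python
def render_post_mortem(title: str, meeting_date: str, contact: str,
                       created_by: str, facilitated_by: str,
                       severity: str) -> str:
    """Return a Markdown skeleton with the meeting details filled in."""
    meeting_rows = [
        ("Date", meeting_date),
        ("Contact", contact),
        ("Created by", created_by),
        ("Facilitated by", facilitated_by),
    ]
    timeline_rows = ["Start Time", "End Time", "Detect Time", "Additional"]
    empty_sections = ["What happened", "Additional Info", "Remediation",
                      "Bug Tickets", "Short Summary", "Tags"]

    lines = [f"# {title}", "", "| Meeting | Value |", "| --- | --- |"]
    lines += [f"| {name} | {value} |" for name, value in meeting_rows]
    lines += ["", "## Timeline", "", "| Incident | Value |", "| --- | --- |"]
    lines += [f"| {name} |  |" for name in timeline_rows]
    lines += ["", "## Description", "", f"> Severity: {severity}", "", "-"]
    # Remaining sections start empty and are filled in during the meeting.
    for section in empty_sections:
        lines += ["", f"## {section}", "", "-"]
    return "\n".join(lines) + "\n"


# Hypothetical sample values
doc = render_post_mortem(
    title="Checkout outage",
    meeting_date="2024-03-04",
    contact="A. Example",
    created_by="A. Example",
    facilitated_by="B. Example",
    severity="High",
)
print(doc)
```

The returned string can be written to a file or pasted into whatever system the organization uses for post-mortems.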

Things NOT to do

Don’t Blame!

Don’t punish!

Avoid using “would have” or “could have” in statements. Be specific; it’s not a guessing game or an excuse. Describe the system as it actually existed and what actually happened: what did you do, or not do, that led to the issue?

Things to do

Foster Learning!

Post-Post-Mortem

After a Post-Mortem we should

widely announce the availability of the meeting notes and any associated artifacts
place the information in a centralized location where the entire organization can access it and learn from the incident
encourage others in the organization to read them to increase organizational learning
increase transparency with internal and external customers, which in turn increases trust
revisit post-mortems from time to time
make sure countermeasures have been implemented and are still taking effect

Video

Morgue: Helping Better Understand Events by Building a Post Mortem Tool - Bethany Macri

Resources

The DevOps Handbook
