Checklist - Incident Post-Mortem

What is an Incident Post-Mortem?

A postmortem (or post-mortem) is a process intended to help you learn from past incidents. It typically involves an analysis or discussion soon after an event has taken place.

As your systems scale and become more complex, failure becomes inevitable, assessment and remediation become more involved and time-consuming, and repeating the same mistakes becomes increasingly painful. Not having data when you need it is expensive.

Streamlining the postmortem process is key to helping your team get the most from their postmortem time investment: spending less time conducting the postmortem, while extracting more effective learnings, is a faster path to increased operational maturity. In fact, the true value of postmortems comes from helping institutionalize a positive culture around frequent and iterative improvement.

If you want to read more:

  • https://www.pagerduty.com/resources/learn/post-mortem-incident-report/

  • https://codeascraft.com/2012/05/22/blameless-postmortems/

  • https://www.pagerduty.com/resources/ebook/post-mortem-handbook/

To create an Incident Post-Mortem:

  • Go to Simbiose Ventures drive > Internal Facing > Incident Post-Mortem

  • Right-click and create a new document

  • In the document, change the name to: [SEGMENT OF THE INCIDENT]Name of the Incident

    • Example: [INFRA]Name of the incident

  • Follow the instructions below
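
As a small illustration of the naming convention above, the title can be built mechanically from the segment and the incident name. This is only a sketch: the function name postmortem_title and the choice to upper-case the segment are assumptions made for this example, not part of any existing tooling.

```python
def postmortem_title(segment: str, incident_name: str) -> str:
    """Build a title of the form [SEGMENT]Name of the incident.

    Hypothetical helper: upper-casing the segment is an assumption
    based on the [INFRA] example in this checklist.
    """
    segment = segment.strip().upper()
    incident_name = incident_name.strip()
    if not segment or not incident_name:
        raise ValueError("segment and incident name are both required")
    return f"[{segment}]{incident_name}"


print(postmortem_title("infra", "Name of the incident"))
# prints: [INFRA]Name of the incident
```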

In general, an effective postmortem report tells a story. Incident postmortem reports should include the following:

  • A high-level summary of what happened:

    • Which services and customers were affected?

    • How long did the issue last, and how severe was it?

    • Who was involved in the response?

    • How did we ultimately fix the problem?

  • A root cause analysis:

    • What were the origins of failure?

    • Why do we think this happened?

  • Steps taken to diagnose, assess, and resolve:

    • What actions were taken?

    • Which were effective?

    • Which were detrimental?

  • A timeline of significant activity:

    • Centralize key activities from chat conversations, incident details, and more.

  • Learnings and next steps:

    • What went well?

    • What didn’t go well?

    • How do we prevent this issue from happening again?
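
The report outline above can also be sketched as a small data structure, which makes it easy to check that no section was left empty and to merge timeline entries from several sources into chronological order. This is a minimal sketch under stated assumptions: the class names, field names, and sample events are hypothetical and chosen only to mirror the bullets in this checklist. It assumes Python 3.9 or later.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEvent:
    when: datetime
    source: str        # where the entry came from, e.g. "chat"
    description: str


@dataclass
class PostMortem:
    # One field per section of the outline (names are illustrative).
    summary: str = ""
    root_cause: str = ""
    actions_taken: list[str] = field(default_factory=list)
    timeline: list[TimelineEvent] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    went_poorly: list[str] = field(default_factory=list)
    prevention: list[str] = field(default_factory=list)

    def sorted_timeline(self) -> list[TimelineEvent]:
        """Return timeline entries from all sources in chronological order."""
        return sorted(self.timeline, key=lambda event: event.when)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "summary": self.summary,
            "root_cause": self.root_cause,
            "actions_taken": self.actions_taken,
            "timeline": self.timeline,
            "went_well": self.went_well,
            "went_poorly": self.went_poorly,
            "prevention": self.prevention,
        }
        missing = []
        for name, value in sections.items():
            # Treat whitespace-only text the same as an empty section.
            content = value.strip() if isinstance(value, str) else value
            if not content:
                missing.append(name)
        return missing


# Hypothetical sample data, for illustration only.
report = PostMortem(summary="Checkout service was unavailable.")
report.timeline.append(
    TimelineEvent(datetime(2024, 1, 1, 10, 30), "chat", "Fix deployed")
)
report.timeline.append(
    TimelineEvent(datetime(2024, 1, 1, 10, 0), "incident details", "Alert fired")
)

for event in report.sorted_timeline():
    print(event.when.isoformat(), event.source, event.description)
print("Still to fill in:", report.missing_sections())
```

Keeping the timeline as a flat list of dated entries means notes copied from chat and from incident details can be added in any order and sorted once at the end.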