After the incident#
After an incident is resolved, there are a few important steps to take so that we learn from the incident and reduce the chance of it happening again.
1. Create post-incident action items#
After the incident is over, we must prioritize any action items that would prevent this kind of incident at the same level as a contract deliverable, and attempt to bring them into the next sprint. While we can’t guarantee there will be no outages, we must do everything we can to prevent known causes of outages from recurring.
Responsibility#
It’s the Technical Lead’s responsibility to shape an absolute minimum-sized task that mitigates the issue that caused this incident, and the responsibility of the Technical Lead & Engineering Manager to advocate for bringing it into the next sprint.
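To make this concrete, here is a rough sketch of opening a follow-up issue for that mitigation task with the GitHub CLI. The repository, title, label, and body below are placeholders rather than a required convention; use whatever tracker and naming your team actually uses.

```bash
# Hypothetical follow-up issue for the minimum-sized mitigation task.
# The repository, title, label, and body are all placeholders.
gh issue create \
  --repo <org>/<infrastructure-repo> \
  --title "Mitigate root cause of <incident>: <short description>" \
  --label "incident-followup" \
  --body "Action item from the incident report (<link>). Scope: the smallest change that prevents this known cause from recurring."
```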
2. Create an Incident Report#
Once the incident is resolved, we must create an Incident Report. Our incidents are all public, so others can learn from them as well.
We practice a blameless culture around incident reports. Incidents are always caused by systemic issues, and hence solutions must be systemic too. Go out of your way to make sure there is no finger-pointing.
Responsibility#
The Communication Liaison is responsible for starting the incident report process, and making sure the Incident Report is completed. They are not required to fill out all of the information in the report, though they may do so if they wish.
Steps#
We use PagerDuty’s postmortem feature to create the Incident Report. This lets us easily pull notes and status updates from PagerDuty, as well as messages from Slack, into the incident report!
1. Open the incident in the PagerDuty web interface, and click the “New Postmortem Report” button on top.
   - “Owner of the Review Process” should be set to the person writing the incident report.
   - “Impact Start Time” is our best guess for when the incident started (not when the report came in).
   - “Impact End Time” is when service was restored. Best guesses will do!
2. Add Data Sources.
   - Link to the Slack channel we created for this incident, with an appropriate time range to cover all the messages.
3. Fill out the timeline.
   - The goal is to be concise but make it possible for someone reading it to answer “what happened, and when?”.
   - See Tips for writing an incident timeline for more information.
4. Fill out the “Analysis” section to the extent possible.
   - Perfection is the enemy of the good here. Save as you go.
   - In particular, the “Action Items” should be a list with items linked out to GitHub issues created for follow-up.
5. Click “Save & View Report” when you are done.
6. Ask other members of the incident response team to review the incident report.
   - In particular, the Technical Lead should review and approve the report before it is marked as “Reviewed”.
7. After sufficient review, and once the Technical Lead is happy with its completeness, edit the report again, set the Status dropdown to “Reviewed”, and click “Save & View Report” again.
8. Download the PDF, and add it to the 2i2c/incident-reports repository (a minimal git sketch follows these steps).
   - Since review is already completed in the PagerDuty interface, you don’t need to wait for review to add the report here.
9. Email a link to the incident report to the community representative, ideally via the Freshdesk ticket used to communicate with them during the incident itself.
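For the “download the PDF” step above, one possible way to add the report to the repository from the command line is sketched below. The repository URL, file name, and branch name are placeholders; check the actual repository for its preferred workflow (for example, whether to push directly or open a pull request).

```bash
# Illustrative only -- repository URL, file name, and branch name are placeholders.
git clone git@github.com:<org>/incident-reports.git
cd incident-reports
cp ~/Downloads/<postmortem>.pdf .
git checkout -b add-report-<date>
git add <postmortem>.pdf
git commit -m "Add incident report for <incident>"
git push origin add-report-<date>
# Open a pull request if the repository requires review before merging.
```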
Tips for writing an incident timeline#
Below are some tips and crucial information that is needed for a useful and thorough incident timeline. You can see examples of previous incident reports at the 2i2c-org/incident-reports repository.
The timeline should include:
The beginning of the impact.
When the incident was brought to our attention, with a link to the source (Freshdesk ticket, slack message, etc).
When we responded to the incident. This would coincide with the creation of the PagerDuty incident.
Various debugging actions performed to ascertain the cause of the issue. Narrating what you are doing in the Slack channel helps a lot here: it communicates your methods to others on the team and makes it easier to improve our processes in the future. For example:
- Looked at hub logs with “kubectl logs -n temple -l component=hub” and found <this>
- Opened the cloud console and discovered notifications about quota.
Pasting in commands is very helpful! This is an important way for team members to learn from each other - what you take for granted is perhaps news to someone else, or you might learn alternate ways of doing things!
Actions taken to attempt to fix the issue, and their outcome. Paste commands executed if possible, as well as any GitHub PRs made (see the sketch after this list). If you’ve already done this in the incident Slack channel, you may simply copy/paste the text here.
Any extra communication from the community affected that helped.
When the incident was fixed, and how that was verified.
Whatever else you think would be helpful to someone who finds this incident report a few months from now, trying to fix a similar incident.
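To make the “paste your commands” advice concrete, below is a sketch of the kind of snippet worth recording verbatim in the timeline. The namespace (“temple”) and label selector mirror the example above and are hypothetical; they will differ for each hub.

```bash
# Hypothetical debugging session, pasted into the timeline as-is.
kubectl get pods -n temple                           # anything pending or crash-looping?
kubectl describe pod -n temple -l component=hub      # events, restarts, scheduling issues
kubectl logs -n temple -l component=hub --tail=200   # recent hub logs around the impact window
```

Even a few lines like this let a future responder reproduce your investigation exactly, rather than guessing at what was checked.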
Key terms#
- Incident Report#
- Incident Reports#
A document that describes what went wrong during an incident and what we’ll do to avoid it in the future. When we have an Incident, we create an Incident Report issue.
This helps us understand what went wrong, and how we can improve our systems to prevent a recurrence. Its goal is to identify improvements to process, technology, and team dynamics that can avoid incidents like this in the future. It is not meant to point fingers at anybody, and care should be taken to avoid making it seem like any one person is at fault.
This is a very important part of making our infrastructure and human processes more stable and stress-free over time, so we should do this after each incident [1].