Incident response process#
We prioritize the resolution of incidents above all other kinds of work, and have a special process we follow.
Incident sources#
Steps#
When an incident is identified via any of the above sources, the following steps must be taken:
1. Validate that we are dealing with an outage#
If the incident came via an automated PagerDuty alert and has a "take immediate action" tag, then it is definitely an outage. If it doesn't have this tag, follow the Manage Alerts guide for the alert's type and manually test the infrastructure to determine whether it matches the definition of an outage.
If the incident report was triggered via a Freshdesk ticket, then try to reproduce the issue.
If you cannot reproduce it, ask the community representative for more information via Freshdesk. If the issue is reproducible and matches the definition of an outage, then proceed with the next steps.
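The validation rules above can be summarized as a small decision function. This is only an illustrative sketch: the names `Source`, `Verdict`, and `validate_outage` are hypothetical, and the outcome for a reproducible issue that does not match the outage definition is an assumption, since the text does not state it.

```python
from enum import Enum


class Source(Enum):
    PAGERDUTY_ALERT = "pagerduty_alert"
    FRESHDESK_TICKET = "freshdesk_ticket"


class Verdict(Enum):
    OUTAGE = "outage"
    NOT_AN_OUTAGE = "not an outage"
    NEED_MORE_INFO = "ask the community representative for more information"


def validate_outage(source, *, has_immediate_action_tag=False,
                    reproducible=False, matches_outage_definition=False):
    """Return a Verdict for a reported incident (hypothetical helper)."""
    if source is Source.PAGERDUTY_ALERT:
        # A "take immediate action" tag means it is definitely an outage.
        if has_immediate_action_tag:
            return Verdict.OUTAGE
        # Otherwise the result of the manual test decides.
        if matches_outage_definition:
            return Verdict.OUTAGE
        return Verdict.NOT_AN_OUTAGE

    # Freshdesk ticket: try to reproduce the issue first.
    if not reproducible:
        return Verdict.NEED_MORE_INFO
    if matches_outage_definition:
        return Verdict.OUTAGE
    # Assumption: reproducible but outside the outage definition
    # is handled as a non-outage.
    return Verdict.NOT_AN_OUTAGE
```

Only a verdict of `Verdict.OUTAGE` leads on to the next steps.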
2. Officially mark the beginning of the incident#
Important
An incident officially starts when:
A PagerDuty P1 incident exists
Someone has acknowledged the incident in PagerDuty
A separate Slack channel for this incident exists
The community has been informed via Freshdesk of the ongoing incident
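As a rough illustration, the four conditions can be tracked as a checklist that reports what is still missing. The class and field names below are hypothetical and not part of any tool we use.

```python
from dataclasses import dataclass, fields


@dataclass
class IncidentStartChecklist:
    """The four conditions for an incident to have officially started."""
    pagerduty_p1_exists: bool = False
    acknowledged_in_pagerduty: bool = False
    slack_channel_exists: bool = False
    community_informed_via_freshdesk: bool = False

    def missing(self):
        """Names of the conditions that are not yet met."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def officially_started(self):
        """True only when every condition is met."""
        return not self.missing()


checklist = IncidentStartChecklist(pagerduty_p1_exists=True,
                                   acknowledged_in_pagerduty=True)
print(checklist.officially_started())  # False
print(checklist.missing())
```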
If the incident came from an automated PagerDuty alert, it already exists in PagerDuty, so make sure the remaining conditions above are met.
If the incident came from a Freshdesk ticket, you first need to create a P1 incident in PagerDuty. The fields are self-explanatory, but the most important ones are:
Setting the priority to P1.
Choosing the Impacted Service that best matches the affected hub service. If not sure, choose "Misc alerts from Prometheus Alert Manager".
Creating a new Slack channel by checking the box for "Create a dedicated Public Slack channel for this incident". Use this channel for all conversations about the incident.
You can create a PagerDuty incident in two ways:
Using Slack by typing /pd trigger and hitting enter in #pagerduty-notifications
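For reference only, and not as one of the documented ways above, the same key fields map onto the request body of PagerDuty's REST API "create an incident" endpoint (`POST /incidents`). The sketch below only builds that body and makes no network call. The function name and the IDs in the usage example are hypothetical placeholders; real service and priority IDs must be looked up in your PagerDuty account. The dedicated Slack channel option is not covered here.

```python
import json


def build_p1_incident_payload(title, service_id, priority_id, details=""):
    """Build a request body for creating a PagerDuty incident.

    service_id:  ID of the Impacted Service that best matches the hub service.
    priority_id: ID of the account's P1 priority (IDs are account-specific).
    """
    incident = {
        "type": "incident",
        "title": title,
        "service": {"id": service_id, "type": "service_reference"},
        "priority": {"id": priority_id, "type": "priority_reference"},
    }
    if details:
        incident["body"] = {"type": "incident_body", "details": details}
    return {"incident": incident}


if __name__ == "__main__":
    # Placeholder IDs, for illustration only.
    payload = build_p1_incident_payload(
        title="Hub unreachable",
        service_id="SERVICE_ID_PLACEHOLDER",
        priority_id="P1_PRIORITY_ID_PLACEHOLDER",
        details="Reported via Freshdesk; reproduced by the on-call engineer.",
    )
    print(json.dumps(payload, indent=2))
```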
Confirm to the Community Representative, via the Freshdesk ticket, that there is indeed an incident happening. If you wish, use this canned response as a starting point for your reply:
3. Try resolving the issue#
At all times, try to communicate on the incident-specific channel while you gather information and perform actions - even if only to mark these as notes to yourself.
Do not use threaded Slack messages
Do NOT use threads when communicating in this Slack channel. When coming to write the incident report after the event, PagerDuty can import messages from the Slack channel in order to construct a timeline. However, it cannot import threaded messages, only those that are sent directly to the channel. Hence if the cause of an incident was established in a thread, this cannot be reflected automatically in the incident report.
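If a thread was started anyway, it helps to find it before writing the incident report so its content can be added to the timeline by hand. A minimal sketch, assuming the messages are dictionaries in the shape the Slack Web API returns (`ts`, and for threads `thread_ts` and `reply_count`); the helper names are hypothetical, and how the messages are fetched is left out.

```python
def is_thread_reply(message):
    """A reply carries a thread_ts that differs from its own ts."""
    thread_ts = message.get("thread_ts")
    return thread_ts is not None and thread_ts != message.get("ts")


def threads_with_replies(messages):
    """Top-level messages whose threaded replies would not be imported."""
    return [
        m for m in messages
        if m.get("thread_ts") == m.get("ts") and m.get("reply_count", 0) > 0
    ]


channel_messages = [
    {"ts": "1700000000.000100", "text": "Hub pods are crash-looping"},
    {"ts": "1700000050.000200", "thread_ts": "1700000050.000200",
     "reply_count": 2, "text": "Could this be the node pool?"},
]
for parent in threads_with_replies(channel_messages):
    print("Thread needs manual copying:", parent["text"])
```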
4. Get all hands on deck#
If there are other Infrastructure Engineering team members available, pull them in as Subject Matter Experts in order to investigate and resolve the incident quickly. When in doubt, delegate to the Tech Lead.[1]
5. Communicate our status every few hours#
The Communication Liaison is expected to communicate the incident status and plan to the Community Representatives.
They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:
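To show the three parts an update should contain, here is a minimal sketch that assembles them into plain text. It is not the team's canned response: the function name, labels, and layout are assumptions for illustration.

```python
def format_status_update(current_state, tried, next_steps):
    """Assemble a plain-text status update from its three parts."""
    lines = ["Current state: " + current_state, "", "What we have tried:"]
    if tried:
        lines.extend("- " + item for item in tried)
    else:
        lines.append("- Nothing yet, we are still investigating.")
    lines.extend(["", "Intended next steps:"])
    lines.extend("- " + step for step in next_steps)
    return "\n".join(lines)


print(format_status_update(
    current_state="Users cannot start servers on the hub.",
    tried=["Restarted the hub pod"],
    next_steps=["Inspect the node pool autoscaler logs"],
))
```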
6. Make sure the incident is resolved#
The Tech Lead should be pulled in to review and validate the actions already taken and those proposed as next steps.
7. Communicate when the incident is resolved#
When we believe the incident is resolved, communicate with the Community Representative that things should be back to normal.
Mark the incident as Resolved in PagerDuty.
Mark the Freshdesk ticket as Resolved.
8. Take follow-up actions#
See After the incident for more information.
Handing off Incident Responder status#
During an incident, it may be necessary to designate another person to be the Incident Responder. For example, it may be getting late in the current Incident Responder's time zone, they may feel burnt out from leading the incident response, or someone else may have better visibility or experience. Handing off is encouraged and expected, especially for more complex or longer incidents!
To designate another team member as the Incident Responder, follow these steps:
Confirm with them that they are able and willing to serve as the Incident Responder
Reassign the incident on PagerDuty to the new Responder