Incident response process#

We prioritize resolving incidents above all other kinds of work and follow a dedicated process for doing so.

Incident sources#

  1. Automated PagerDuty alerts

  2. Support Freshdesk tickets

Steps#

When an incident is identified via any of the above sources, the following steps must be taken:

1. Validate that we are dealing with an outage#

  • If the incident came via an automated PagerDuty alert and has a take immediate action tag, then it is definitely an outage.

  • If it doesn’t have this tag, then based on the alert’s type follow the Manage Alerts guide and manually test the infrastructure to determine whether it matches the definition of an outage.

  • If the incident report was triggered via a Freshdesk ticket, then try to reproduce the issue.

  • If you cannot reproduce it, ask the community representative for more information via Freshdesk. If the issue is reproducible and matches the definition of an outage, then proceed with the next steps.
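The validation rules above can be read as a small decision procedure. The following is a minimal sketch of that reading, not a tool the team uses; the names `Source`, `Verdict`, and `triage` are hypothetical, and the "matches the definition of an outage" judgment is passed in as a plain boolean because it comes from manual testing.

```python
from enum import Enum


class Source(Enum):
    PAGERDUTY_ALERT = "pagerduty_alert"
    FRESHDESK_TICKET = "freshdesk_ticket"


class Verdict(Enum):
    OUTAGE = "outage"
    NOT_AN_OUTAGE = "not_an_outage"
    NEED_MORE_INFO = "need_more_info"


def triage(source, *, has_immediate_action_tag=False,
           reproduced=False, matches_outage_definition=False):
    """Return a Verdict following the validation rules of step 1."""
    if source is Source.PAGERDUTY_ALERT:
        # A tagged alert is treated as an outage without further checks.
        if has_immediate_action_tag:
            return Verdict.OUTAGE
        # Otherwise the result of the manual infrastructure test decides.
        if matches_outage_definition:
            return Verdict.OUTAGE
        return Verdict.NOT_AN_OUTAGE

    # Freshdesk ticket: the issue must be reproduced first.
    if not reproduced:
        return Verdict.NEED_MORE_INFO
    if matches_outage_definition:
        return Verdict.OUTAGE
    return Verdict.NOT_AN_OUTAGE
```

Only a `Verdict.OUTAGE` result leads on to the next steps; `Verdict.NEED_MORE_INFO` corresponds to asking the community representative for details via Freshdesk.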

2. Officially mark the beginning of the incident#

Important

An incident officially starts when:

  1. A PagerDuty P1 incident exists

  2. Someone has acknowledged the incident in PagerDuty

  3. A separate Slack channel for this incident exists

  4. The community has been informed via Freshdesk of the ongoing incident
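As an illustration, the four conditions can be tracked as a simple checklist. This is only a sketch; the class and field names are hypothetical and nothing here talks to PagerDuty, Slack, or Freshdesk.

```python
from dataclasses import dataclass


@dataclass
class IncidentStartChecklist:
    """The four conditions that mark the official start of an incident."""
    p1_incident_exists: bool = False
    acknowledged_in_pagerduty: bool = False
    slack_channel_exists: bool = False
    community_informed_via_freshdesk: bool = False

    def missing(self) -> list[str]:
        # Field names of every condition that is not yet satisfied,
        # in the order they are declared above.
        return [name for name, done in vars(self).items() if not done]

    def officially_started(self) -> bool:
        return not self.missing()
```

For example, `IncidentStartChecklist(p1_incident_exists=True).missing()` lists the three conditions still to be met.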

If the incident already exists in PagerDuty, make sure the conditions above are met. Otherwise:

  1. Create a P1 incident in PagerDuty.

    The fields are self-explanatory, but the most important ones are:

    • Setting the priority to P1

    • Choosing the Impacted Service that best matches the affected hub service. If not sure, choose Misc alerts from Prometheus Alert Manager.

    • Creating a new Slack channel by checking the box for Create a dedicated Public Slack channel for this incident. Use this channel for all conversations about the incident.

    You can create a PagerDuty incident in two ways:

    • Using the UI

    • Using Slack by typing /pd trigger and hitting enter in #pagerduty-notifications

  2. Confirm to the Community Representative, via the Freshdesk ticket, that an incident is indeed happening. If you wish, use this canned response as a starting point:

    Incident first response template
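Besides the UI and the Slack command, an incident can in principle also be created through the PagerDuty REST API. The sketch below only builds the request, assuming the v2 "create an incident" endpoint and its reference-object body format; the function name and all IDs are hypothetical. The service ID and the ID of the P1 priority are account-specific and would need to be looked up first. The dedicated Slack channel checkbox is a UI option that this sketch does not cover.

```python
import json
import urllib.request

PAGERDUTY_INCIDENTS_URL = "https://api.pagerduty.com/incidents"


def build_p1_incident_request(api_token, from_email, title,
                              service_id, p1_priority_id):
    """Build (but do not send) a request that creates a P1 incident."""
    body = {
        "incident": {
            "type": "incident",
            "title": title,
            # The impacted service that best matches the affected hub service.
            "service": {"id": service_id, "type": "service_reference"},
            # Assumption: this ID refers to the account's P1 priority.
            "priority": {"id": p1_priority_id, "type": "priority_reference"},
        }
    }
    return urllib.request.Request(
        PAGERDUTY_INCIDENTS_URL,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Token token={api_token}",
            # Email address of the PagerDuty user creating the incident.
            "From": from_email,
            "Accept": "application/vnd.pagerduty+json;version=2",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Check the current PagerDuty API reference before relying on this shape.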

3. Try resolving the issue#

At all times, try to communicate on the incident-specific channel while you gather information and perform actions - even if only to mark these as notes to yourself.

Do not use threaded Slack messages

Do NOT use threads when communicating in this Slack channel. When you write the incident report after the event, PagerDuty can import messages from the Slack channel to construct a timeline. However, it can only import messages sent directly to the channel, not threaded replies. If the cause of an incident was established in a thread, it will therefore not be reflected automatically in the incident report.
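To see what would be lost, the sketch below separates top-level messages from threaded replies. It assumes messages shaped like those returned by the Slack Web API, where a reply carries a `thread_ts` that differs from its own `ts`. How PagerDuty performs its import is not described here; the function names are hypothetical.

```python
def is_threaded_reply(message: dict) -> bool:
    """True if the message is a reply inside a thread.

    A thread's parent message has thread_ts equal to its own ts,
    so it still counts as a top-level message.
    """
    thread_ts = message.get("thread_ts")
    return thread_ts is not None and thread_ts != message.get("ts")


def split_timeline(messages):
    """Return (top_level, replies) from a list of Slack message dicts."""
    top_level = [m for m in messages if not is_threaded_reply(m)]
    replies = [m for m in messages if is_threaded_reply(m)]
    return top_level, replies
```

Anything that ends up in `replies` is the kind of message that would be missing from an automatically constructed timeline.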

4. Get all hands on deck#

If there are other Infrastructure Engineering team members available, pull them in as Subject Matter Experts in order to investigate and resolve the incident quickly. When in doubt, delegate to the Tech Lead.[1]

5. Communicate our status every few hours#

The Communication Liaison is expected to communicate the incident status and plan to the Community Representatives.

They should provide periodic updates that describe the current state of the incident, what we have tried, and our intended next steps. Here is a canned response to get started:

Incident update template
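The three parts of a periodic update (current state, what has been tried, intended next steps) can be assembled with a small helper. This is a hypothetical sketch and its wording is not the canned response referenced above.

```python
def format_incident_update(current_state, tried, next_steps):
    """Assemble a plain-text status update from its three parts."""
    lines = [f"Current state: {current_state}", "What we have tried:"]
    lines += [f"- {item}" for item in tried]
    lines.append("Intended next steps:")
    lines += [f"- {item}" for item in next_steps]
    return "\n".join(lines)
```

Keeping the three parts as separate arguments makes it harder to send an update that omits one of them.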

6. Make sure the incident is resolved#

The Technology Lead should be pulled in to review and validate the actions already taken and those proposed as next steps.

7. Communicate when the incident is resolved#

When we believe the incident is resolved, tell the Community Representative that things should be back to normal.

  • Mark the incident as Resolved in PagerDuty.

  • Mark the Freshdesk ticket as Resolved.

8. Take follow-up actions#

See After the incident for more information.

Handing off Incident Responder status#

During an incident, it may be necessary to designate another person to be the Incident Responder. This can happen, for example, if it is getting late in the current Incident Responder’s time zone, if they feel burnt out from leading the incident response, or if someone else has better visibility or experience. This is encouraged and expected, especially for more complex or longer incidents!

To designate another team member as the Incident Responder, follow these steps:

  1. Confirm with them that they are able and willing to serve as the Incident Responder

  2. Reassign the incident on PagerDuty to the new Responder
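The two handoff steps can be sketched as a guard followed by a reassignment request. The request shape assumes the PagerDuty REST API v2 "update an incident" endpoint with an `assignments` list; the function name and the `confirmed_by_new_responder` flag are hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request


def build_reassign_request(api_token, from_email, incident_id,
                           new_responder_user_id, confirmed_by_new_responder):
    """Build (but do not send) a request reassigning an incident."""
    # Step 1: never reassign before the new responder has agreed.
    if not confirmed_by_new_responder:
        raise ValueError(
            "Confirm with the new Incident Responder before reassigning"
        )
    # Step 2: reassign the incident to the new responder.
    body = {
        "incident": {
            "type": "incident_reference",
            "assignments": [
                {"assignee": {"id": new_responder_user_id,
                              "type": "user_reference"}}
            ],
        }
    }
    return urllib.request.Request(
        f"https://api.pagerduty.com/incidents/{incident_id}",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Token token={api_token}",
            "From": from_email,
            "Accept": "application/vnd.pagerduty+json;version=2",
            "Content-Type": "application/json",
        },
    )
```

Reassigning through the PagerDuty UI achieves the same result; verify the body format against the current API reference before using this.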