Commit 347fe07

Add hamilton's blog post about our PagerDuty rotation

1 parent f930768 commit 347fe07

1 file changed: 109 additions & 0 deletions

---
layout: post
title: "Assigning pager duty to developers"
author: hamiltonh
tags:
- oncall
- pagerduty
- incident response
team: Core Platform
---

Nobody likes to be woken up in the middle of the night, but if you've got to do
it, make sure you pick the right person. Scribd has long used
[PagerDuty](https://pagerduty.com) for managing on-call rotations, but only
within the "Core Infrastructure" team. All production incidents were routed to
a single group of infrastructure engineers. Clearly not a good idea. To help
with our migration to AWS, we recognized the need to move to a more
_distributed_ model of incident response, and the Core Platform team ended up
being a suitable test subject.

The idea of transitioning from "nobody is on-call" to "everybody is on-call"
originally seemed too harsh, but we needed to ensure that dreaded production
alerts would end up going to the developers who would be best suited to resolve
the problem. We decided on a compromise: a "day-shift" for our on-call
rotations which would route directly to developers unfamiliar with the rigors
of production incident response. All the while, we still planned on relying on
the Core Infrastructure team's existing rotation to fill in the gaps, covering
the "night shift."

## Trying it out

Getting everyone on board with the day/night shifts was the easy part;
implementing the shifts in PagerDuty turned out to be far more difficult. To
begin, I created the `Core-Platform` schedule, adding all of the team members.
The schedule was built using PagerDuty's "Restrict On-Call Times" feature to
restrict the schedule's activation to 7:00-17:00 PST.

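For illustration, here is roughly how the same schedule could be expressed
through the PagerDuty REST API. This is a minimal sketch rather than our exact
configuration: the API key, dates, and user IDs are placeholders, and the
7:00-17:00 window maps to a ten-hour `daily_restriction` on the schedule layer.

```python
import requests

PAGERDUTY_API = "https://api.pagerduty.com"
HEADERS = {
    "Authorization": "Token token=EXAMPLE_API_KEY",  # placeholder API key
    "Content-Type": "application/json",
    "Accept": "application/vnd.pagerduty+json;version=2",
}

# One layer containing the Core Platform developers, restricted so the layer is
# only active 7:00-17:00 (10 hours = 36000 seconds) in Pacific time.
schedule = {
    "schedule": {
        "name": "Core-Platform",
        "type": "schedule",
        "time_zone": "America/Los_Angeles",
        "schedule_layers": [
            {
                "name": "Day shift",
                "start": "2019-10-01T07:00:00-07:00",
                "rotation_virtual_start": "2019-10-01T07:00:00-07:00",
                "rotation_turn_length_seconds": 7 * 24 * 60 * 60,  # weekly hand-off
                "users": [
                    {"user": {"id": "PABC123", "type": "user_reference"}},  # placeholder IDs
                    {"user": {"id": "PDEF456", "type": "user_reference"}},
                ],
                "restrictions": [
                    {
                        "type": "daily_restriction",
                        "start_time_of_day": "07:00:00",
                        "duration_seconds": 10 * 60 * 60,
                    }
                ],
            }
        ],
    }
}

response = requests.post(f"{PAGERDUTY_API}/schedules", headers=HEADERS, json=schedule)
response.raise_for_status()
core_platform_schedule_id = response.json()["schedule"]["id"]
```

Note that a restriction only controls when the layer is _active_; outside of
those hours the schedule simply has nobody on call.
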
Next I created an "Escalation Policy" with Core Platform as the first level,
and then configured the existing Core Infrastructure primary schedule as the
next level of escalation. In essence, incidents not handled by the Core
Platform team would escalate after a timeout, such as 30 minutes, to the Core
Infrastructure team. Then, _hopefully_, somebody would act on the alert and
resolve the incident.

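The escalation policy looks something like the sketch below, again with
placeholder schedule IDs rather than our exact settings: Core Platform first,
then the Core Infrastructure primary schedule after the 30-minute timeout.

```python
import requests

PAGERDUTY_API = "https://api.pagerduty.com"
HEADERS = {
    "Authorization": "Token token=EXAMPLE_API_KEY",  # placeholder API key
    "Content-Type": "application/json",
    "Accept": "application/vnd.pagerduty+json;version=2",
}

# Level 1 pages the Core-Platform schedule; if nobody acknowledges within 30
# minutes, level 2 pages the existing Core Infrastructure primary schedule.
escalation_policy = {
    "escalation_policy": {
        "name": "Core Platform",
        "type": "escalation_policy",
        "escalation_rules": [
            {
                "escalation_delay_in_minutes": 30,
                "targets": [
                    {"id": "CORE_PLATFORM_SCHEDULE_ID", "type": "schedule_reference"}
                ],
            },
            {
                "escalation_delay_in_minutes": 30,
                "targets": [
                    {"id": "CORE_INFRA_SCHEDULE_ID", "type": "schedule_reference"}
                ],
            },
        ],
    }
}

response = requests.post(
    f"{PAGERDUTY_API}/escalation_policies", headers=HEADERS, json=escalation_policy
)
response.raise_for_status()
```
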
## Bumpy roads

Having wired the settings together for Core Platform's services as a prototype,
I shared the work with a developer we were working with from PagerDuty, and it
went _okay_. I explained the desired end-goal, and walked through what I
expected to happen. Considering our settings, he explained that what would
_actually_ happen was:

* During the day, Core Platform developers would receive the alerts, as
  intended.
* Outside of the day shift, there would **always** be a 30 minute delay, dead
  air, before anybody would be notified. After that 30 minute delay, the Core
  Infrastructure team would receive the alert.

Definitely not ideal.

## Hack-arounds

The PagerDuty developer and I switched gears and tried to find ways in which we
could arrive at something as close as possible to our desired end-state. We
figured out a couple of options:

1. In the Core-Platform Schedule, add a Secondary Layer built with the members
   of the Core-Infra Team.
   * We would get the desired effect of skipping the Core-Platform developers
     when outside of business hours.
   * This option would also put the management of part of the Core-Infra Team's
     rotation in Core-Platform's hands, including managing overrides.
1. In the Core-Platform Escalation Policy, add the Core-Infra Schedule to the
   first notification in addition to Core-Platform (see the sketch after this
   list).
   * This would require Core-Infra to manually follow a policy of only picking
     up the escalation outside of business hours, with no way to know whether
     they should pick up or not.
1. Create duplicated Services for daytime vs. nighttime, with different
   Escalation Policies to route to different groups. Then, with event rules,
   route alerts to the different services based on time of day.
   * This would create a lot of manual configuration bloat. Additionally, if we
     ever needed to change anything, we'd have to change it on ALL services
     running this style of escalation.
1. Keep the baked-in delayed response for night shift alerts.
   * Obviously not a good choice for situations where every minute counts!
1. Switch the Core Platform schedule to 24/7 by removing the restriction.
   * Pushes developers into new and uncomfortable positions of being on-call
     all the time, making team-based escalations less appealing for adoption
     across the company.

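For concreteness, the mechanical half of the second option is trivial: it is
just a first escalation rule with two targets, something like the fragment
below (placeholder IDs again, replacing the `escalation_rules` shown earlier).
The hard part is the human policy wrapped around it.

```python
# Option 2 sketch: a single first-level rule that pages both schedules at once.
escalation_rules = [
    {
        "escalation_delay_in_minutes": 30,
        "targets": [
            {"id": "CORE_PLATFORM_SCHEDULE_ID", "type": "schedule_reference"},
            {"id": "CORE_INFRA_SCHEDULE_ID", "type": "schedule_reference"},
        ],
    },
]
```
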
## Back to the start

Things **will** go wrong in production. The goal for our incident response and
escalation process is to make sure that we connect problems (incidents) with
solutions (people) as quickly as possible. When we discussed the various
options with the entire team, the only clear path forward was to adopt the last
option: switch to a 24/7 schedule for the developers on Core Platform. We
shared the entire process and our conclusion with our friends at PagerDuty. I
hope that we eventually see a feature which allows Schedules to be built from
both users _and_ other Schedules. That level of composition would give us the
flexibility to accomplish our _ideal_ end-state for developer incident
response.

Until that feature arrives, we'll just be extra motivated to ensure the
stability and availability of the services we deliver on Core Platform!
