| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "Assigning pager duty to developers" |
| 4 | +author: hamiltonh |
| 5 | +tags: |
| 6 | +- oncall |
| 7 | +- pagerduty |
| 8 | +- incident response |
| 9 | +team: Core Platform |
| 10 | +--- |
| 11 | + |
| 12 | +Nobody likes to be woken up in the middle of the night, but if you've got to do |
| 13 | +it, make sure you pick the right person. Scribd has long used |
| 14 | +[PagerDuty](https://pagerduty.com) for managing on-call rotations, but only |
| 15 | +within the "Core Infrastructure" team. All production incidents were routed to |
| 16 | +a single group of infrastructure engineers. Clearly not a good idea. To help |
| 17 | +with our migration to AWS, we recognized the need to move to a more |
| 18 | +_distributed_ model of incident response, and the Core Platform team ended up |
| 19 | +being a suitable test subject. |
| 20 | + |
| 21 | + |
| 22 | +The idea of transitioning from "nobody is on-call" to "everybody is on-call" |
| 23 | +originally seemed too harsh, but we needed to ensure that dreaded production |
| 24 | +alerts would end up going to the developers who would be best suited to resolve |
| 25 | +the problem. We decided on a compromise: a "day-shift" for our on-call |
| 26 | +rotations which would route directly to developers unfamiliar with the rigors |
| 27 | +of production incident response. All the while, we still planned on relying on |
| 28 | +the Core Infrastructure team's existing rotation to fill in the gaps, covering |
| 29 | +the "night shift." |
| 30 | + |
| 31 | + |
| 32 | +## Trying it out |
| 33 | + |
| 34 | +Getting everyone on board with the day/night shifts was the easy part; |
| 35 | +implementing the shifts in PagerDuty turned out to be far more difficult. To begin, I created the |
| 36 | +`Core-Platform` schedule, adding all of the team members. The schedule |
| 37 | +was built using PagerDuty's "Restrict On-Call Times" option to limit the |
| 38 | +schedule's activation to 7:00-17:00 PST. |
| 39 | + |
| 40 | +Next I created an "Escalation Policy" with Core Platform as the first level, |
| 41 | +and then configured the existing Core Infrastructure primary schedule as the |
| 42 | +next level of escalation. In essence, incidents not handled by the Core Platform |
| 43 | +team would escalate after a timeout, such as 30 minutes, to the Core |
| 44 | +Infrastructure team. Then, _hopefully_, somebody would act on the alert and |
| 45 | +resolve the incident. |
| 46 | + |
| 47 | + |
| 48 | +## Bumpy roads |
| 49 | + |
| 50 | +Having wired the settings together for Core Platform's services as a prototype, |
| 51 | +I shared the work with a developer we were working with from PagerDuty. It went |
| 52 | +_okay_. I explained the desired end-goal, and walked through what I expected to |
| 53 | +happen. Considering our settings, he explained that what would _actually_ |
| 54 | +happen was: |
| 55 | + |
| 56 | +* During the day, Core Platform developers would receive the alert first. |
| 57 | +* Outside of the day shift, there would **always** be a 30-minute delay of dead |
| 58 | + air before anybody would be notified. After that 30-minute delay, the Core |
| 59 | + Infrastructure team would receive the alert. |
| 60 | + |
| 61 | +Definitely not ideal. |
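
To make the timing concrete, here is a minimal Python sketch of the behavior described above. It is a toy model and does not use PagerDuty's API: the `first_notified` function, the team labels, and the constants are hypothetical stand-ins for the 7:00-17:00 restriction and the 30-minute escalation timeout. It also assumes the incident time is already in the schedule's time zone and only models who is paged first, not what happens if that person fails to acknowledge.

```python
from datetime import datetime, timedelta
from typing import Tuple

# Assumed values taken from the post: day shift 7:00-17:00, 30-minute timeout.
DAY_START_HOUR = 7
DAY_END_HOUR = 17
ESCALATION_TIMEOUT = timedelta(minutes=30)


def first_notified(triggered_at: datetime) -> Tuple[str, datetime]:
    """Return (team, time) for the first person paged about an incident.

    Level 1 is the restricted Core-Platform schedule; level 2 is the
    Core Infrastructure schedule, reached only after the timeout.
    """
    on_day_shift = DAY_START_HOUR <= triggered_at.hour < DAY_END_HOUR
    if on_day_shift:
        # Somebody is on call at level 1, so they are paged immediately.
        return "Core Platform", triggered_at
    # Nobody is on call at level 1, but the policy still waits out the
    # timeout before escalating: this is the "dead air".
    return "Core Infrastructure", triggered_at + ESCALATION_TIMEOUT


for when in (datetime(2020, 1, 6, 10, 0), datetime(2020, 1, 6, 23, 0)):
    team, notified_at = first_notified(when)
    delay_minutes = int((notified_at - when).total_seconds() // 60)
    print(f"{when:%H:%M} -> {team}, delay {delay_minutes} min")
```

Under these assumptions, an incident at 10:00 pages Core Platform with no delay, while one at 23:00 reaches Core Infrastructure only after the full 30 minutes.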
| 62 | + |
| 63 | + |
| 64 | +## Hack-arounds |
| 65 | + |
| 66 | +The PagerDuty developer and I switched gears and tried to find ways in which we could |
| 67 | +arrive at something as close as possible to our desired end-state. We figured |
| 68 | +out a couple of options: |
| 69 | + |
| 70 | + |
| 71 | +1. In the Core-Platform Schedule, add a Secondary Layer built with the members of the Core-Infra Team |
| 72 | + * We would get the desired effect of skipping the Core-Platform |
| 73 | +Developers when outside of business hours. |
| 74 | + * This option would also put the management of part of Core-Infra Team's |
| 75 | +rotation in Core-Platform's hands, including managing overrides. |
| 76 | +1. In the Core-Platform Escalation Policy, add the Core-Infra Schedule to the first notification in addition to Core-Platform |
| 77 | + * This would require Core-Infra to manually follow a policy of only |
| 78 | + picking up the escalation outside of business hours, with no way to |
| 79 | + know whether they should pick up or not. |
| 80 | +1. Create duplicated Services for daytime vs. nighttime, with different |
| 81 | + Escalation Policies to route to different groups. Then, with event rules, |
| 82 | + route alerts to the different services based on time of day. |
| 83 | + * This would create a lot of manual configuration bloat. Additionally, if we |
| 84 | + ever needed to change anything, we'd have to change it on ALL services |
| 85 | + running this style of escalation. |
| 86 | +1. Keep the baked-in delayed response for night shift alerts. |
| 87 | + * Obviously not a good choice for situations where every minute counts! |
| 88 | +1. Switch the Core Platform schedule to 24/7 by removing the restriction. |
| 89 | + * This pushes developers into the new and uncomfortable position of being on-call |
| 90 | + all the time, making team-based escalations less appealing for adoption |
| 91 | + across the company. |
| 92 | + |
| 93 | + |
| 94 | +## Back to the start |
| 95 | + |
| 96 | + |
| 97 | +Things **will** go wrong in production. The goal for our incident response and |
| 98 | +escalation process is to make sure that we connect problems (incidents) with |
| 99 | +solutions (people) as quickly as possible. When we discussed the various |
| 100 | +options with the entire team, the only clear path forward was to adopt the last |
| 101 | +option: switch to a 24/7 schedule for the developers on Core Platform. We |
| 102 | +shared the entire process and our conclusion with our friends at PagerDuty. I hope that we |
| 103 | +see a feature in the future which allows Schedules to be built from both users |
| 104 | +_and_ other Schedules. That level of composition would give us |
| 105 | +the flexibility to accomplish our _ideal_ end-state for developer incident |
| 106 | +response. |
| 107 | + |
| 108 | +Until that feature arrives, we'll just be extra motivated to ensure the |
| 109 | +stability and availability of the services we deliver on Core Platform! |