@@ -32,10 +32,10 @@ the "night shift."
## Trying it out

Getting everyone on board with the day/night shifts was the easy part;
- implementing the shifts in PagerDuty turned out far more difficult. TO begin, I created the
- `Core-Platform` schedule, adding all of the team members. The schedule
- was built using Pagerduty's "Restrict On-Call Times" in order to restrict the
- schedule's activation, limiting it to 7:00-17:00 PST.
+ implementing the shifts in PagerDuty turned out far more difficult. To begin, I
+ created the `Core-Platform` schedule, adding all of the team members. The
+ schedule was built using PagerDuty's "Restrict On-Call Times" in order to
+ restrict the schedule's activation, limiting it to 7:00-17:00 PST.

Next I created an "Escalation Policy" with Core Platform as the first level,
and then configured the existing Core Infrastructure primary schedule as the
@@ -48,12 +48,13 @@ resolve the incident.
## Bumpy roads

Having wired the settings together for Core Platform's services as a prototype,
- I shared the work with a developer we were working with from PagerDuty, it went
- _okay_. I explained the desired end-goal, and walked through what I expected to
+ I shared my progress with a developer from PagerDuty; it went
+ _okay_. I explained the desired end-goal, and we walked through what I expected to
happen. Considering our settings, he explained that what would _actually_
- happen was:
+ happen:

- * During the day, Core Platform developers
+ * During the day, Core Platform developers would be notified when incidents
+ happened.
* Outside of the day shift, there would **always** be a 30 minute delay, dead
air, before anybody would be notified. After that 30 minute delay, the Core
Infrastructure team would receive the alert.
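
The behavior described above can be modelled in a few lines. This is only a sketch for reasoning about the timing, not PagerDuty's actual logic: the function and constant names are hypothetical, timestamps are assumed to already be in the schedule's local time (PST), and it ignores the edge case where the day shift begins during the 30 minute wait.

```python
from datetime import datetime, time, timedelta

# Values taken from the post: day shift 7:00-17:00, 30 minute escalation delay.
DAY_START = time(7, 0)
DAY_END = time(17, 0)
ESCALATION_DELAY = timedelta(minutes=30)


def day_shift_active(at: datetime) -> bool:
    """True while the restricted Core-Platform schedule has someone on call."""
    return DAY_START <= at.time() < DAY_END


def first_notification(incident_at: datetime):
    """Return (team, time) of the first page sent for an incident.

    Level 1 is the restricted Core-Platform schedule. When nobody is on call
    there, the policy still waits out the escalation delay before moving on
    to level 2, which is the "dead air" described above.
    """
    if day_shift_active(incident_at):
        return ("Core Platform", incident_at)
    return ("Core Infrastructure", incident_at + ESCALATION_DELAY)


if __name__ == "__main__":
    for hour in (10, 2):
        team, when = first_notification(datetime(2020, 1, 6, hour, 0))
        print(f"incident at {hour:02d}:00 -> {team} paged at {when:%H:%M}")
```

An incident at 02:00 therefore reaches nobody until 02:30, which is the gap the options in the next section try to close.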
@@ -63,28 +64,30 @@ Definitely not ideal.

## Hack-arounds

- The PagerDuty and I switched gears and tried to find ways in which we could
- arrive at something as close as possible to our desired end-state. We figured
- out a couple options:
-
-
- 1. In the Core-Platform Schedule, Add a Secondary Layer built with the members of the Core-Infra Team
- * We would get the desired effect of skipping the Core-Platform
- Developers when outside of business hours.
- * This option would also put the management of part of Core-Infra Team's
- rotation in Core-Platform's hands, including managing overrides.
- 1. In the Core-Platform Escalation Policy, add the Core-Infra Schedule to the first notification in addition to Core-Platform
- * This would require policy for Core-Infra to manually follow of only
- picking up the escalation outside of business hours, with know way to
- know if they should pick up or not.
+ The PagerDuty developer and I switched gears and tried to find ways in which we
+ could arrive at something as close as possible to our desired end-state. We
+ figured out a couple of options:
+
+
+ 1. In the `Core-Platform` Schedule, add a Secondary Layer built with the
+ members of the `Core-Infra` Team
+ * We would get the desired effect of skipping Core Platform developers when outside of business hours.
+ * This option would also put the management of part of Core
+ Infrastructure's rotation into Core Platform's hands, including the
+ management of explicit overrides.
+ 1. In the `Core-Platform` Escalation Policy, add the `Core-Infra` Schedule to
+ the first notification in addition to the existing `Core-Platform` Schedule.
+ * This would require documented policy for engineers in Core Infrastructure
+ to only respond to incidents outside of the day shift, with no automated
+ way for them to know whether they can ignore an alert until they receive it.
1. Create duplicated Services for day time vs night time, with different
Escalation Policies to route to different groups. Then with event rules,
route alerts to the different services based on time of day.
* This would create a lot of manual configuration bloat. Additionally, if we
ever needed to change anything we'd have to change it on ALL services
running this style of escalations.
1. Keep the baked-in delayed response for night shift alerts.
- * Obviously not a good choice for situations where every minute counts!
+ * Obviously not a good choice for situations where every minute counts!
1. Switch the Core Platform schedule to 24/7 by removing the restriction.
* Pushes developers into new and uncomfortable positions of being on-call
all the time, making team-based escalations less appealing for adoption
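
To make the third option (duplicated day and night Services) more concrete, here is a rough sketch of what time-of-day routing amounts to and why it bloats configuration. It does not use PagerDuty's event rules or API; the `-day`/`-night` naming scheme and both function names are hypothetical, and timestamps are assumed to be in the schedule's local time.

```python
from datetime import datetime, time

# Day shift window from the post; everything else counts as night shift.
DAY_START = time(7, 0)
DAY_END = time(17, 0)


def pick_service(base_name: str, alert_at: datetime) -> str:
    """Choose the day or night copy of a service for an incoming alert."""
    shift = "day" if DAY_START <= alert_at.time() < DAY_END else "night"
    return f"{base_name}-{shift}"


def duplicated_services(base_names):
    """List every service that has to exist and be kept in sync."""
    return [f"{name}-{shift}" for name in base_names for shift in ("day", "night")]


if __name__ == "__main__":
    print(pick_service("core-platform-api", datetime(2020, 1, 6, 3, 15)))
    print(duplicated_services(["core-platform-api", "core-platform-db"]))
```

Each real service turns into two entries, each with its own escalation policy, so any later change has to be repeated across all of them.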