Commit 6588ecb

I'm going to visit a school for kids that can't write good
1 parent 347fe07 commit 6588ecb

2 files changed: +29 -23 lines changed


_data/authors.yml

Lines changed: 3 additions & 0 deletions
@@ -51,3 +51,6 @@ dfeldman:
 
 ugi:
   name: Ugi Kutluoglu
+
+hamiltonh:
+  name: Hamilton Hord

_posts/2019-12-03-managing-pagerduty-rotations.md

Lines changed: 26 additions & 23 deletions
@@ -32,10 +32,10 @@ the "night shift."
 ## Trying it out
 
 Getting everyone on board with the day/night shifts was the easy part,
-implementing the shifts in PagerDuty turned out far more difficult. TO begin, I created the
-`Core-Platform` schedule, adding all of the team members. The schedule
-was built using Pagerduty's "Restrict On-Call Times" in order to restrict the
-schedule's activation, limiting it to 7:00-17:00 PST.
+implementing the shifts in PagerDuty turned out far more difficult. To begin, I
+created the `Core-Platform` schedule, adding all of the team members. The
+schedule was built using Pagerduty's "Restrict On-Call Times" in order to
+restrict the schedule's activation, limiting it to 7:00-17:00 PST.
 
 Next I created an "Escalation Policy" with Core Platform as the first level,
 and then configuring the existing Core Infrastructure primary schedule as the
@@ -48,12 +48,13 @@ resolve the incident.
 ## Bumpy roads
 
 Having wired the settings together for Core Platform's services as a prototype,
-I shared the work with a developer we were working with from PagerDuty, it went
-_okay_. I explained the desired end-goal, and walked through what I expected to
+I shared my progress with a developer from PagerDuty; it went
+_okay_. I explained the desired end-goal, and we walked through what I expected to
 happen. Considering our settings, he explained that what would _actually_
-happen was:
+happen:
 
-* During the day, Core Platform developers
+* During the day, Core Platform developers would be notified when incidents
+happened.
 * Outside of the day shift, there would **always** be a 30 minute delay, dead
 air, before anybody would be notified. After that 30 minute delay, the Core
 Infrastructure team would receive the alert.
@@ -63,28 +64,30 @@ Definitely not ideal.
 
 ## Hack-arounds
 
-The PagerDuty and I switched gears and tried to find ways in which we could
-arrive at something as close as possible to our desired end-state. We figured
-out a couple options:
-
-
-1. In the Core-Platform Schedule, Add a Secondary Layer built with the members of the Core-Infra Team
-* We would get the desired effect of skipping the Core-Platform
-Developers when outside of business hours.
-* This option would also put the management of part of Core-Infra Team's
-rotation in Core-Platform's hands, including managing overrides.
-1. In the Core-Platform Escalation Policy, add the Core-Infra Schedule to the first notification in addition to Core-Platform
-* This would require policy for Core-Infra to manually follow of only
-picking up the escalation outside of business hours, with know way to
-know if they should pick up or not.
+The PagerDuty developer and I switched gears and tried to find ways in which we
+could arrive at something as close as possible to our desired end-state. We
+figured out a couple options:
+
+
+1. In the `Core-Platform` Schedule, add a Secondary Layer built with the
+members of the `Core-Infra` Team
+* We would get the desired effect of skipping Core Platform developers when outside of business hours.
+* This option would also put the management of part of Core
+Infrastructure's rotation into Core Platform's hands, including the
+management of explicit overrides.
+1. In the `Core-Platform` Escalation Policy, add the `Core-Infra` Schedule to
+the first notification in addition to the existing `Core-Platform` Schedule.
+* This would require documented policy for engineers in Core Infrastructure
+to only respond to incidents outside of the day shift, with no automamted
+way for them to know whether they can ignore an alert until they receive it.
 1. Create duplicated Services for day time vs night time, with different
 Escalation Policies to rout to different groups. Then with event rules,
 route alerts to the different services based on time of day.
 * This would create a lot of manual configuration bloat. Additionally if we
 ever needed to change anything we'd have to change it on ALL services
 running this style of escalations.
 1. Keep the baked-in delayed response for night shift alerts.
-* Obviously not a good choice for situations where every minute counts!
+* Obviously not a good choice for situations where every minute counts!
 1. Switch the Core Platform schedule to 24/7 by removing the restriction.
 * Pushes developers into new and uncomfortable positions of being on-call
 all the time, making team based escalations less appealing for adoption
