I'm going to visit a school for kids that can't write good

rtyler · rtyler · commit 6588ecb886ab · 2019-12-02T18:55:07.000-08:00
diff --git a/_data/authors.yml b/_data/authors.yml
@@ -51,3 +51,6 @@ dfeldman:
 
 ugi:
   name: Ugi Kutluoglu
+
+hamiltonh:
+  name: Hamilton Hord
diff --git a/_posts/2019-12-03-managing-pagerduty-rotations.md b/_posts/2019-12-03-managing-pagerduty-rotations.md
@@ -32,10 +32,10 @@ the "night shift."
 ## Trying it out
 
 Getting everyone on board with the day/night shifts was the easy part,
-implementing the shifts in PagerDuty turned out far more difficult. TO begin, I created the
-`Core-Platform` schedule, adding all of the team members. The schedule
-was built using Pagerduty's "Restrict On-Call Times" in order to restrict the
-schedule's activation, limiting it to 7:00-17:00 PST. 
+implementing the shifts in PagerDuty turned out far more difficult. To begin, I
+created the `Core-Platform` schedule, adding all of the team members. The
+schedule was built using PagerDuty's "Restrict On-Call Times" in order to
+restrict the schedule's activation, limiting it to 7:00-17:00 PST.
 
 Next I created an "Escalation Policy" with Core Platform as the first level,
 and then configuring the existing Core Infrastructure primary schedule as the
@@ -48,12 +48,13 @@ resolve the incident.
 ## Bumpy roads
 
 Having wired the settings together for Core Platform's services as a prototype,
-I shared the work with a developer we were working with from PagerDuty, it went
-_okay_. I explained the desired end-goal, and walked through what I expected to
+I shared my progress with a developer from PagerDuty; it went
+_okay_. I explained the desired end-goal, and we walked through what I expected to
 happen. Considering our settings, he explained that what would _actually_
-happen was:
+happen:
 
-* During the day, Core Platform developers
+* During the day, Core Platform developers would be notified when incidents
+  happened.
 * Outside of the day shift, there would **always** be a 30 minute delay, dead
   air, before anybody would be notified. After that 30 minute delay, the Core
   Infrastructure team would receive the alert.
@@ -63,28 +64,30 @@ Definitely not ideal.
 
 ## Hack-arounds
 
-The PagerDuty and I switched gears and tried to find ways in which we could
-arrive at something as close as possible to our desired end-state. We figured
-out a couple options:
-
-
-1. In the Core-Platform Schedule, Add a Secondary Layer built with the members of the Core-Infra Team
-    * We would get the desired effect of skipping the Core-Platform
-Developers when outside of business hours.
-    * This option would also put the management of part of Core-Infra Team's
-rotation in Core-Platform's hands, including managing overrides. 
-1. In the Core-Platform Escalation Policy, add the Core-Infra Schedule to the first notification in addition to Core-Platform
-    * This would require policy for Core-Infra to manually follow of only
-      picking up the escalation outside of business hours, with know way to
-      know if they should pick up or not.
+The PagerDuty developer and I switched gears and tried to find ways in which we
+could arrive at something as close as possible to our desired end-state. We
+figured out a couple of options:
+
+
+1. In the `Core-Platform` Schedule, add a Secondary Layer built with the
+   members of the `Core-Infra` Team.
+    * We would get the desired effect of skipping Core Platform developers when outside of business hours.
+    * This option would also put the management of part of Core
+      Infrastructure's rotation into Core Platform's hands, including the
+      management of explicit overrides.
+1. In the `Core-Platform` Escalation Policy, add the `Core-Infra` Schedule to
+   the first notification in addition to the existing `Core-Platform` Schedule.
+    * This would require a documented policy for engineers in Core Infrastructure
+      to only respond to incidents outside of the day shift, with no automated
+      way for them to know whether they can ignore an alert until they receive it.
 1. Create duplicated Services for day time vs night time, with different
    Escalation Policies to rout to different groups. Then with event rules,
    route alerts to the different services based on time of day.
     * This would create a lot of manual configuration bloat. Additionally if we
       ever needed to change anything we'd have to change it on ALL services
       running this style of escalations.
 1. Keep the baked-in delayed response for night shift alerts.
-    * Obviously not a good choice for situations where every minute counts! 
+    * Obviously not a good choice for situations where every minute counts!
 1. Switch the Core Platform schedule to 24/7 by removing the restriction.
     * Pushes developers into new and uncomfortable positions of being on-call
       all the time, making team based escalations less appealing for adoption