View profile

123dev #25: Integration Test Email #25

123dev #25: Integration Test Email #25
By Justin Garrison • Issue #25 • View online

A building floor collapses as a construction crew lays concrete
A building floor collapses as a construction crew lays concrete
Who’s to blame?
Many of us probably heard about the HBOMax email that was sent out by mistake last week. It wasn’t a major outage, but a lot of people were talking about it. I especially loved the outpouring of support I saw online for whomever triggered the integration job that sent the email. 💖
Just like the gif, no one person is to blame. The person who poured cement isn’t any more to blame than the person who architected the building. With any incident, if a single person can cause an outage it is almost certainly not their fault.
My biggest outage
There was a CVE for Google Chrome and I was asked to update it on all the computers. We had tooling to run arbitrary commands on sets of machines so obviously I ran yum update -y google-chrome-stable. In that moment I had forgotten Chrome had special repos which did not point to our internal mirrors. I also forgot that Chrome was installed on servers. Instead of updating less than 800 computers from internal repos I triggered 4000+ systems to simultaneously download Chrome directly from Google.
Our ISP connection couldn’t handle that throughput (~225 GB) so systems started to back-off and retry as the bandwidth was used and systems re-connected. With that many connections coming at once Google automatically blocked our public IP which cause more systems to back-off and retry. It also blocked us from all of our GSuite tools so no one could check email or use docs.
I had caused a 3 hour delay to 1000+ employees on an already tight schedule. I learned a very important lesson and we took action to make sure an outage like that was not as easy to cause in the future.
Blameless isn’t enough to make postmortems useful. I like the wording this article uses with “actionable retrospectives” because that aligns with the goals. It also has tips that go a lot further than avoiding blame.
Blameless postmortems don't work. Be blame-aware but don't go negative
Reading postmortems is a great way to learn systems and how they break without needing to live it. Of course, living it helps cement the paranoia that many sysadmins have learned over years from on the job experience. Well written postmortem can teach a lot, and this one stuck out in my memory as one I learned from recently.
A Terrible, Horrible, No-Good, Very Bad Day at Slack - Slack Engineering
If you need some more guidance on conducting your own postmortems this post has more details on what you should and shouldn’t do and some tools that can help. I hope everyone can learn from their mistakes openly, and not have to fear retribution for their actions.
A Project Postmortem Toolkit: Apps and Approaches that Help You Learn More from Retrospectives
Did you enjoy this issue?
Justin Garrison

1 gif, 2 comments, and 3 links to make you a better developer and person

In order to unsubscribe, click here.
If you were forwarded this newsletter and you like it, you can subscribe here.
Powered by Revue
Los Angeles, CA