My biggest outage
There was a CVE for Google Chrome and I was asked to update it on all the computers. We had tooling to run arbitrary commands on sets of machines, so obviously I ran:
yum update -y google-chrome-stable. In that moment I forgot that Chrome used its own repos, which did not point to our internal mirrors. I also forgot that Chrome was installed on servers. Instead of updating fewer than 800 computers from internal repos, I triggered 4000+ systems to simultaneously download Chrome directly from Google.
Our ISP connection couldn’t handle that volume of traffic (~225 GB), so systems started to back off and retry as the bandwidth was used up and systems reconnected. With that many connections coming in at once, Google automatically blocked our public IP, which caused more systems to back off and retry. It also blocked us from all of our G Suite tools, so no one could check email or use Docs.
I had caused a 3-hour delay for 1000+ employees on an already tight schedule. I learned a very important lesson, and we took action to make sure an outage like that was not as easy to cause in the future.
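
To illustrate the kind of guard rail that makes this harder to do, here is a minimal Python sketch of a staged rollout: the command goes to a small batch of hosts, waits, and only continues if the batch succeeded. This is not a description of the tooling we actually had or the fix we made; `run_on_hosts`, the batch size and the pause length are hypothetical placeholders.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(hosts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `hosts`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def staged_rollout(
    hosts: Sequence[str],
    run_on_hosts: Callable[[List[str]], bool],  # hypothetical: returns True on success
    size: int = 50,                 # assumed batch size, tune to your link capacity
    pause_seconds: float = 60.0,    # assumed pause between batches
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the command one batch at a time and stop at the first failed batch.

    Returns the number of hosts the command was attempted on.
    """
    attempted = 0
    for index, batch in enumerate(batches(hosts, size)):
        if index > 0:
            sleep(pause_seconds)  # give the network time to drain before the next batch
        attempted += len(batch)
        if not run_on_hosts(batch):
            break  # a failed batch halts the rollout instead of reaching every machine
    return attempted
```

With something like this in front of the command runner, a bad command would have hit the first batch and stopped there, rather than reaching 4000+ machines at the same instant.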