Roblox had a 3-day outage and I feel really sorry for the team responsible for fixing it. I don’t know any details, but from a case study I read, they use HashiCorp tools (Consul, Nomad, Vault, Terraform) with a 4-person SRE team serving 100,000,000 monthly players (might be more now).
One thing I found interesting was that people online seemed surprised at the small team size relative to the number of players. It’s not the scale that is the problem; the scope of work and the company’s dependence on reliability are what cause more burnout.
I was on a team of 4 at Disney Animation with a smaller scale and a larger scope, and it was hard to manage all our responsibilities. However, at Disney Animation we had more leeway on reliability because we had no public-facing services where an outage would cause direct revenue loss.
My time as an SRE at Disney+ was on a team of 4 with direct customer impact if we had an outage, and with similar scale and scope to Roblox. It was very stressful, but not entirely uncommon from what I’ve seen in the industry.