High-profile, high-impact outages
High-profile or high-impact outages and incidents are just a fact of life on the modern internet and in the public cloud. The key is to recover quickly and to learn from each incident through post-mortem analysis, or, as some places do, root cause analysis (RCA). In my personal opinion, RCA is usually a useless exercise, both because of the political nature of scapegoating in large orgs and because each incident usually has some unique breakpoints.
FB
https://blog.cloudflare.com/october-2021-facebook-outage/ (written by Cloudflare, good)
Roblox
The post below is written by Roblox, and it's good.
AWS
https://www.thousandeyes.com/blog/aws-outage-analysis-december-15-2021
https://aws.amazon.com/message/12721/
Azure (Microsoft)
Slack
Salesforce
Zoom