The Cloudflare 2024 BGP Outage: An SRE Postmortem Walkthrough
A walkthrough framed for SREs whose own services depend on edge providers. The lessons cost Cloudflare hours; we get them for the price of reading.
What happened, in plain terms
A configuration change to internal BGP propagated incorrectly, advertising routes that should have stayed inside Cloudflare's network as if they were public. Routers around the world started preferring those routes; traffic that should have reached customer origins was instead drawn into Cloudflare's network and black-holed.
The incident was global, not regional. That is the nature of BGP problems: they propagate at the speed of routing announcements, which is fast and not bounded by geography.
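One common way a leaked route wins is by being more specific than the legitimate aggregate: BGP forwarding uses longest-prefix match, so a stray /24 beats a covering /20 everywhere the announcement is heard. A minimal Python sketch of that selection logic, with illustrative prefixes rather than the real announcements:

```python
import ipaddress

# Illustrative routing table: a legitimate aggregate and a leaked more-specific.
ROUTES = {
    ipaddress.ip_network("198.51.96.0/20"): "legitimate path toward the customer origin",
    ipaddress.ip_network("198.51.100.0/24"): "leaked internal route (traffic black-holed)",
}

def best_route(dst: str) -> str:
    """Pick the route a router would actually use: the most specific matching prefix."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if addr in net]
    return ROUTES[max(matches, key=lambda net: net.prefixlen)]

print(best_route("198.51.100.7"))  # -> leaked internal route (traffic black-holed)
```

No tie-breaking or policy is modeled here; the point is only that geography never enters the decision.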
How a route leak cascades
- When a CDN provider absorbs traffic that was supposed to go elsewhere, the customer experience is ‘everything is slow or broken.’ There is no error code that says ‘routing is wrong’; users see timeouts, 502s, or random misroutes.
- Worse, the cascade hits dependent services. Authentication services that route through the CDN become unreachable; session-cookie validation fails across a large slice of the web; downstream applications that thought they were independent discover they were not.
Why detection took longer than it should have
Internal alerts caught it; external monitoring (the customer view) caught it slightly later because most external monitors live in or near the affected paths. The lesson: at least one monitoring vantage point must be outside the provider’s network.
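A vantage point outside the provider can be as simple as a scheduled probe running in a different cloud region or office network. A minimal sketch, assuming the requests library is available; the URL is a placeholder, and the cf-ray header (which Cloudflare adds to responses it serves) is used only to record whether the request actually traversed the CDN:

```python
import time
import requests

URL = "https://www.example.com/healthz"  # hypothetical endpoint behind the CDN

def probe() -> dict:
    """One synthetic check: status, latency, and whether the CDN served the response."""
    start = time.monotonic()
    try:
        resp = requests.get(URL, timeout=5)
        return {
            "ok": resp.status_code == 200,
            "status": resp.status_code,
            "latency_s": round(time.monotonic() - start, 3),
            "served_by_cdn": "cf-ray" in resp.headers,  # header lookup is case-insensitive
        }
    except requests.RequestException as exc:
        return {"ok": False, "error": type(exc).__name__,
                "latency_s": round(time.monotonic() - start, 3)}

if __name__ == "__main__":
    print(probe())
```

Ship the result to an alerting pipeline that does not itself depend on the CDN.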
The post-incident review noted that the initial rollback was complicated by the fact that the routing change had also been replicated to backup paths. Backups inherited the bug.
Four mitigations to put in place this quarter
1. Multiple authoritative DNS providers, so a CDN failure does not also take out your name resolution (a quick audit sketch follows this list).
2. Origin shielding only when it pays, paired with direct-to-origin failover for critical paths.
3. External monitoring outside the provider: synthetic checks run from a different network.
4. A documented ‘CDN bypass’ runbook, including a time-to-execute number from a real rehearsal.
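For the first mitigation, a quick audit is to check whether your NS set actually spans more than one provider. A rough sketch, assuming dnspython is installed and approximating ‘provider’ by the last two labels of each NS hostname (a crude heuristic that misgroups some TLDs); the zone name is a placeholder:

```python
import dns.resolver  # pip install dnspython

DOMAIN = "example.com"  # hypothetical zone to audit

def ns_providers(domain: str) -> set:
    """Group the zone's NS hostnames by a rough 'provider' label."""
    answers = dns.resolver.resolve(domain, "NS")
    # "ns1.cloudflare.com." -> "cloudflare.com"; crude, but enough for an audit.
    return {".".join(str(r.target).rstrip(".").split(".")[-2:]) for r in answers}

providers = ns_providers(DOMAIN)
print(providers)
if len(providers) < 2:
    print("WARNING: all authoritative DNS rides on a single provider")
```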
Antipatterns this incident exposed
- Treating one CDN provider as zero risk. They are excellent operators; they still have incidents.
- Monitoring only from inside the affected provider. Self-referential health checks miss the kind of outage that matters.
What to do this week
Three moves. (1) Verify your monitoring includes at least one vantage point outside your CDN. (2) Document the CDN-bypass runbook for at least one critical service. (3) Add a quarterly tabletop exercise, ‘CDN provider is unreachable for 90 minutes’, to your incident-response calendar.
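Move (2) is easier to rehearse, and to time, if the runbook ends with a concrete bypass probe. A minimal sketch, assuming the origin's IP address is recorded in the runbook and the origin presents a valid certificate for the public hostname; both values below are placeholders:

```python
import socket
import ssl

ORIGIN_IP = "203.0.113.10"    # hypothetical origin address from the runbook
HOSTNAME = "www.example.com"  # hypothetical public hostname normally served via the CDN

ctx = ssl.create_default_context()
with socket.create_connection((ORIGIN_IP, 443), timeout=10) as raw:
    # Dial the origin IP directly (no CDN, no public DNS), but present the real
    # hostname for SNI and certificate validation.
    with ctx.wrap_socket(raw, server_hostname=HOSTNAME) as tls:
        tls.sendall(
            f"GET /healthz HTTP/1.1\r\nHost: {HOSTNAME}\r\nConnection: close\r\n\r\n".encode()
        )
        status_line = tls.recv(4096).split(b"\r\n", 1)[0]
        print(status_line.decode())  # expect "HTTP/1.1 200 OK" when the bypass path works
```

Run it during the tabletop in (3) and write the elapsed time into the runbook; that number is the one the next postmortem will ask for.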