Facebook BGP 2021
Total outage.
Overview
The October 2021 Facebook BGP outage took Facebook, Instagram, and WhatsApp offline for roughly six hours. A faulty configuration change caused the BGP routes to Facebook's authoritative DNS servers to be withdrawn from the internet; recovery was slow because the internal tooling that engineers needed to fix the problem depended on the very systems that were down. The case study generalises: when your fix-it tools share dependencies with the broken systems, you have built a recovery dead-end.
- BGP configuration risk. A single command withdrew DNS routes globally, and the audit tooling meant to block such commands did not stop it. Validation has to run before each change ships; a check that runs on a daily schedule comes too late.
- Circular internal-tooling dependency. Internal access depended on the down systems. The fix-it tools shared the broken substrate.
- Recovery required physical access. Engineers needed badge access to data centers because remote management was unreachable. Worst-case recovery, longest path.
- Cascading and transitive failures. The DNS failure cascaded to authentication, internal apps, and the status page. A dependency map could have predicted the cascade.
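
The cascade and the circular tooling dependency described above can both be read off a dependency graph. The sketch below is a minimal illustration, not Facebook's actual topology: the component names, the edges, and the RECOVERY_TOOLS set are hypothetical. It walks the graph outward from one failed component and then checks which recovery tools fall inside the blast radius.

```python
from collections import deque

# Hypothetical dependency edges: each key depends on every item in its list.
DEPENDS_ON = {
    "authoritative-dns": ["backbone"],
    "authentication": ["authoritative-dns"],
    "internal-apps": ["authentication"],
    "status-page": ["authoritative-dns"],
    "remote-management": ["authoritative-dns", "authentication"],
    "console-servers": [],  # out-of-band: no production dependency
}

# Hypothetical set of tools engineers would reach for during recovery.
RECOVERY_TOOLS = {"remote-management", "console-servers"}


def blast_radius(failed, depends_on):
    """Return every component that fails, directly or transitively, when `failed` fails."""
    # Invert the edges so we can walk from a component to its dependents.
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


affected = blast_radius("backbone", DEPENDS_ON)
trapped = sorted(RECOVERY_TOOLS & affected)   # recovery tools that go down too
usable = sorted(RECOVERY_TOOLS - affected)    # recovery tools that survive

print("affected:", sorted(affected))
print("trapped recovery tools:", trapped)
print("usable recovery tools:", usable)
```

Any recovery tool that appears in the trapped list shares a substrate with the failure it is meant to fix, which is the dead-end this case study describes.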
The approach
Five disciplines turn the Facebook lesson into operational practice: break-glass procedures that do not depend on production, out-of-band management network access, configuration validation before deploy, dependency mapping that catches circular reachability, and regular game-day exercises that prove the recovery actually works.
- Break-glass procedures. Emergency access that does not depend on production systems. Tested quarterly, not just documented.
- Out-of-band access. Console servers and a separate management network. Real access when the production network is unreachable.
- Pre-deploy configuration validation. Audit tool catches risky changes before they ship. The validation tool itself does not depend on the systems it validates.
- Dependency mapping plus game-days. Know what depends on what; test recovery procedures regularly. Catches circular reachability before it traps real engineers.
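
As one illustration of pre-deploy validation, the sketch below rejects a change set that would leave a critical prefix with too few announcing sites. Everything in it is an assumption made for the example: the RouteChange type, the validate function, the site names, and the min_sites threshold are hypothetical, and the prefix comes from the address range reserved for documentation. The check is a pure function over a snapshot passed in by the caller, so it has no runtime dependency on the network it validates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteChange:
    """One proposed change (hypothetical shape, for illustration only)."""
    action: str  # "announce" or "withdraw"
    prefix: str
    site: str


def validate(changes, current_announcements, critical_prefixes, min_sites=1):
    """Return a list of violations; an empty list means the change set passes."""
    # Work on a copy so the caller's snapshot is never mutated.
    remaining = {prefix: set(sites) for prefix, sites in current_announcements.items()}

    for change in changes:
        sites = remaining.setdefault(change.prefix, set())
        if change.action == "withdraw":
            sites.discard(change.site)
        elif change.action == "announce":
            sites.add(change.site)
        else:
            raise ValueError(f"unknown action: {change.action!r}")

    violations = []
    for prefix in sorted(critical_prefixes):
        left = len(remaining.get(prefix, ()))
        if left < min_sites:
            violations.append(
                f"{prefix}: {left} announcing site(s) would remain, "
                f"need at least {min_sites}"
            )
    return violations


# Snapshot of which sites announce the (documentation-range) DNS prefix.
current = {"192.0.2.0/24": {"site-a", "site-b"}}
critical = {"192.0.2.0/24"}

risky = [
    RouteChange("withdraw", "192.0.2.0/24", "site-a"),
    RouteChange("withdraw", "192.0.2.0/24", "site-b"),
]
safe = [RouteChange("withdraw", "192.0.2.0/24", "site-a")]

print(validate(risky, current, critical))
print(validate(safe, current, critical))
```

A real validator would need many more rules; the design point is that the check runs before the change ships and can run from a laptop or an out-of-band host with nothing but the snapshot and the proposed change.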
Why this compounds
Each architecture review that applies the Facebook lesson is a chance to catch a circular dependency before it becomes the next outage. Out-of-band access shortens worst-case MTTR. Configuration validation guards against the class of mistake that caused this incident. By year two the team's resilience model is shaped by the lesson rather than by relearning it the hard way.
- Reduced cascading failure. Out-of-band access supports recovery when production is unreachable. Real uptime under worst-case scenarios.
- Better incident response. Break-glass procedures shorten worst-case recovery. MTTR drops on the incidents that matter most.
- Operational maturity. Dependency mapping grows the team's understanding. Resilience becomes a property of the architecture, not a hope.
- Year-one investment, year-two habit. First procedure is the investment; subsequent procedures inherit the patterns and ship faster.