BGP Basics for SREs: What You Need to Know
Most SREs do not need BGP expertise, but all SREs benefit from BGP literacy. Four concepts (AS, prefix, path, policy) cover most of what you need to follow an incident.
Why SREs need BGP basics
Cloud network outages routinely trace back to BGP, and understanding the vocabulary is what lets you parse the postmortem. You do not need to configure BGP to get that benefit.
- Cloud outages. Major incidents (Cloudflare 2020, AWS Direct Connect, Facebook 2021) are BGP stories at root.
- Postmortem fluency. Vendor postmortems use BGP terms; literacy converts vendor jargon into action.
- Faster triage. Without basics, you wait for the network team to translate; the incident grows in the meantime.
- Cross-team conversation. The network team treats you as a peer when you can ask questions in their language.
Four concepts
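- AS (Autonomous System): a network run under one routing policy, identified by an AS number (ASN).
- Prefix: an IP range in CIDR notation (e.g., 10.0.0.0/24).
- Path: the sequence of ASes a route crosses to reach a prefix (the AS path).
- Policy: the rules an AS applies when accepting routes from neighbors and sending routes to them.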
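The four concepts can be tied together in a small model. What follows is a minimal sketch, not how any real router is implemented: the `Route` class, the `accept` filter, and the ASNs and prefix (taken from documentation ranges) are all hypothetical. Real BGP best-path selection weighs several attributes, such as local preference, before AS-path length; this toy compares path length only.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from typing import Tuple

LOCAL_AS = 64500  # hypothetical ASN for "our" network (documentation range)

@dataclass(frozen=True)
class Route:
    prefix: object             # Prefix: the IP range being advertised
    as_path: Tuple[int, ...]   # Path: ASNs crossed, nearest neighbor first, origin last

def accept(route):
    """Policy: a toy import filter (an assumption, not a standard rule set)."""
    # Drop routes that already passed through our AS (loop prevention).
    if LOCAL_AS in route.as_path:
        return False
    # Drop IPv4 prefixes more specific than /24, a common public-internet convention.
    return route.prefix.prefixlen <= 24

def best_route(routes):
    """Among accepted routes, prefer the shortest AS path (simplified)."""
    candidates = [r for r in routes if accept(r)]
    return min(candidates, key=lambda r: len(r.as_path)) if candidates else None

prefix = ip_network("203.0.113.0/24")
routes = [
    Route(prefix, (64501, 64502, 64510)),
    Route(prefix, (64503, 64510)),
    Route(prefix, (64504, 64500, 64510)),  # contains LOCAL_AS, so policy rejects it
]
best = best_route(routes)
print(best.as_path)  # (64503, 64510)
```

Reading a postmortem usually means answering the same questions this sketch encodes: which prefix, whose AS originated it, what path it took, and which policy let it through or blocked it.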
When to escalate
BGP issues cross team boundaries quickly. Knowing when to escalate beats trying to debug a problem that sits outside anything you control.
- Cross-team scope. BGP issues span teams; cloud-provider BGP belongs to the provider's network team, not yours.
- Documented path. Write the escalation route down ahead of time; do not invent it during the incident.
- Tabletop rehearsal. Practice the escalation in chaos drills so the path stays warm.
- Vendor contacts. Establish contacts at the cloud provider's network team in advance; the relationship matters when minutes count.
Anycast (BGP application)
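Anycast is the most visible BGP application most SREs touch. The same IP prefix is advertised from multiple locations, and BGP delivers each client to the location with the best route from where that client sits.
- Mechanism. Each location advertises the same prefix; routers pick the best route by policy and AS-path length, so "closest" means closest in routing terms, which is usually but not always the geographically nearest. BGP has no notion of health: a location must withdraw its advertisement, typically driven by a health check, before traffic moves elsewhere.
- Use cases. Major DNS resolvers (1.1.1.1, 8.8.8.8), CDN edges, and global APIs use anycast routinely.
- Failover speed. When a location withdraws, traffic shifts as routers converge, typically within seconds to a few minutes. That is usually faster than DNS-based failover, which waits on TTLs and resolver caches.
- Operational caveat. Running anycast yourself requires control over BGP advertisements; cloud providers manage it for you on edge services.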
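The anycast mechanism and its failover behavior can be sketched as a toy simulation. This is an illustration under stated assumptions, not a model of real convergence: the `Site` class, the site names, and the AS-path lengths are hypothetical, and route selection is reduced to "shortest AS path among sites still advertising the prefix".

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    as_path_len: int      # AS hops from one client's network to this site (made up)
    healthy: bool = True  # result of the site's own health check

def advertising(sites):
    """A site advertises the anycast prefix only while its health check passes."""
    return [s for s in sites if s.healthy]

def select_site(sites):
    """Return the site this client's traffic reaches, or None if nobody advertises."""
    candidates = advertising(sites)
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.as_path_len)

sites = [Site("fra", 2), Site("iad", 4), Site("sin", 5)]
print(select_site(sites).name)  # fra

sites[0].healthy = False        # health check fails, so fra withdraws the prefix
print(select_site(sites).name)  # iad
```

The point of the sketch is the division of labor: the health check decides whether a site advertises, and routing only ever chooses among the advertisements that remain.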
Antipatterns
- Treating BGP as ‘the network team’s problem.’ Postmortems stay unintelligible to you.
- Configuring BGP without expertise. A single bad advertisement can cause an outage.
- Ignoring ‘route hijacks’ in the news. The same thing could happen to your prefixes.
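To make the route-hijack antipattern concrete, here is a minimal sketch of the check that monitoring for your own prefixes boils down to. Everything named here is hypothetical: the `EXPECTED_ORIGINS` inventory, the `classify` function, and the ASNs and prefixes (documentation ranges). Where the observed announcements come from is left out on purpose.

```python
from ipaddress import ip_network

# Hypothetical inventory: each prefix we announce and the ASN that should originate it.
EXPECTED_ORIGINS = {ip_network("203.0.113.0/24"): 64500}

def classify(observed_prefix, origin_as):
    """Compare one observed announcement against our inventory."""
    observed = ip_network(observed_prefix)
    for ours, expected_as in EXPECTED_ORIGINS.items():
        if observed.version != ours.version:
            continue  # do not compare IPv4 with IPv6
        if observed == ours:
            return "ok" if origin_as == expected_as else "origin-mismatch"
        if observed.subnet_of(ours):
            # A more specific prefix wins longest-prefix matching over ours.
            return "more-specific"
    return "unrelated"

print(classify("203.0.113.0/24", 64500))    # ok
print(classify("203.0.113.0/24", 64511))    # origin-mismatch
print(classify("203.0.113.128/25", 64511))  # more-specific
print(classify("198.51.100.0/24", 64511))   # unrelated
```

Either non-ok result for one of your prefixes is a reason to use the escalation path rather than to start debugging alone.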
What to do this week
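Three moves. (1) Apply these basics to your highest-risk network path: note which prefixes it uses and whose BGP carries them. (2) Measure the failure rate on that path before and after any change you make. (3) Document what you learned, including the escalation route, so the next incident responder inherits the knowledge.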