DNS as Hidden Single Point of Failure: Patterns and Fixes
DNS outages reliably cascade beyond the systems they take down directly. The blast radius is structural; the redundancy patterns are well known but rarely implemented. Here is the playbook.
Why DNS cascades hard
When DNS resolution fails, services that depend on each other cannot find each other. A cache miss elsewhere triggers a fresh lookup; the lookup times out; the calling service times out; the next caller times out. A 30-second DNS blip can read as a 10-minute outage by the time the wave clears.
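A toy model makes the amplification concrete. Everything below is an illustrative assumption (glibc's default resolver timeouts plus a hypothetical retry policy and call depth), not a measurement:

```python
# Back-of-the-envelope model of timeout amplification through a call chain.
DNS_TIMEOUT_S = 5      # glibc resolv.conf default: timeout:5
DNS_ATTEMPTS = 2       # glibc resolv.conf default: attempts:2
RETRIES_PER_HOP = 3    # assumed application-level retry policy
CALL_DEPTH = 4         # assumed call chain: A -> B -> C -> D

# Worst case for one hop: every retry burns the full DNS budget.
per_hop = DNS_TIMEOUT_S * DNS_ATTEMPTS * RETRIES_PER_HOP   # 30s

# Each caller's retries replay the entire downstream wait.
total = per_hop
for _ in range(CALL_DEPTH - 1):
    total = per_hop + RETRIES_PER_HOP * total

print(f"worst-case wait at the top of the chain: {total}s")  # 1200s, ~20 min
```

The point is the multiplication: each layer of retries replays the entire downstream wait, so a 30-second budget at the bottom reads as tens of minutes at the top.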
The half-resolved state is worse than full failure. Stale cached records keep some traffic flowing while new lookups fail. Half the fleet is happy; half is throwing errors. Debugging such an incident from the application logs alone is brutal.
Three resolution-path failure modes
Authoritative DNS down. Your provider (Route 53, Cloudflare DNS, NS1) loses a region. Cached records keep working; new resolutions fail. This is the most publicly visible of the three modes.
Recursive resolver overload. The DNS server your hosts query (often the cloud-provider VPC resolver, or a corporate DNS server) gets overwhelmed. Lookups slow down, time out, and retry, and the retries add yet more load. Often invisible until everything is slow.
Local resolver misconfiguration. A bad /etc/resolv.conf, a stuck systemd-resolved, an Envoy DNS cache that won't refresh. Because the failure is per-host or per-pod, it is hard to spot without per-instance metrics; see the probe sketch below.
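A lightweight per-host probe can localize which mode you are in by comparing the configured resolver path against a known-good public resolver. A sketch assuming dnspython (pip install dnspython); the probe name is a placeholder, and querying 1.1.1.1 directly assumes outbound port 53 is open from the host:

```python
import time
import dns.resolver

PROBE_NAME = "example.com"  # use a public name both paths can answer

def probe(nameserver=None):
    # configure=True reads /etc/resolv.conf; passing a nameserver pins one
    r = dns.resolver.Resolver(configure=nameserver is None)
    if nameserver:
        r.nameservers = [nameserver]
    r.lifetime = 2.0  # fail fast: we are measuring, not serving traffic
    start = time.monotonic()
    try:
        r.resolve(PROBE_NAME, "A")
        return True, time.monotonic() - start
    except Exception:
        return False, time.monotonic() - start

local_ok, local_t = probe()            # the path your apps actually use
public_ok, public_t = probe("1.1.1.1") # bypasses the local resolver path
if not local_ok and public_ok:
    print("local resolver path broken: check resolv.conf / systemd-resolved")
elif not public_ok:
    print("failure is upstream of this host")
else:
    print(f"ok: local {local_t:.3f}s, direct {public_t:.3f}s")
```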
The dual-provider authoritative pattern
The single most impactful DNS reliability investment is two authoritative providers. Configure your domain with NS records pointing to both Route 53 and Cloudflare (or any two independent providers). Resolvers query whichever responds; the failure of one provider does not take you offline.
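You can verify the delegation with one query. A sketch using dnspython; the zone is a placeholder, and inferring the provider from a nameserver's parent domain is only a heuristic:

```python
import dns.resolver

ZONE = "example.com"  # replace with your apex domain
answers = dns.resolver.resolve(ZONE, "NS")
nameservers = sorted(str(rr.target).rstrip(".") for rr in answers)
# parent domains distinguish providers, e.g. awsdns-*.org vs ns.cloudflare.com
parents = {ns.split(".", 1)[1] for ns in nameservers}
print("\n".join(nameservers))
print(f"{len(parents)} distinct nameserver domains; you want at least 2")
```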
The setup is mechanical. Pick two providers; add both providers' nameservers to your registrar's NS list; replicate the zone between them (zone transfers, or a tool like OctoDNS, sketched below). Most teams set this up once and forget it. The dividend is enormous: a region-wide DNS provider outage becomes a non-event for you.
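With OctoDNS the dual-target setup is a few lines of config. A sketch assuming the octodns-route53 and octodns-cloudflare provider packages; class and option names should be checked against the current OctoDNS docs:

```yaml
providers:
  zones:
    class: octodns.provider.yaml.YamlProvider
    directory: ./zones          # zone data lives in git as the source of truth
  route53:
    class: octodns_route53.Route53Provider
    access_key_id: env/AWS_ACCESS_KEY_ID
    secret_access_key: env/AWS_SECRET_ACCESS_KEY
  cloudflare:
    class: octodns_cloudflare.CloudflareProvider
    token: env/CLOUDFLARE_TOKEN

zones:
  example.com.:
    sources:
      - zones
    targets:                    # every change is pushed to both providers
      - route53
      - cloudflare
```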
The cost. ~$30-100/month per provider for typical SaaS workloads. Cheap insurance against an entire class of incident.
Resolver redundancy on the host
Make sure each host has two recursive resolvers configured, in different ASes if possible: the cloud-provider VPC resolver as primary with a public resolver (1.1.1.1 or 8.8.8.8) as secondary, or vice versa. The host then fails over without intervention.
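On a glibc host that is a two-line /etc/resolv.conf change; 10.0.0.2 stands in for your VPC resolver address:

```
nameserver 10.0.0.2
nameserver 1.1.1.1
# give up on a resolver after 1s, try each twice, spread queries across both
options timeout:1 attempts:2 rotate
```

One caveat: glibc fails over per query, only after the timeout, and on many distros DHCP or systemd-resolved owns this file, so make the change in the appropriate source of truth rather than editing it by hand.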
For Kubernetes, configure CoreDNS with multiple upstream forwarders. The default is often just the cloud provider's DNS; that is a single point of failure. Add a public fallback explicitly.
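A Corefile sketch with an explicit fallback; 10.0.0.2 again stands in for the cloud resolver, and the kubernetes stanza matches the common default:

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa
    forward . 10.0.0.2 1.1.1.1 {
        policy sequential   # try upstreams in order rather than randomly
        max_fails 2
    }
    cache 30
}
```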
TTL strategy: the silent risk knob
Short TTLs (30-60s) make failover fast but DNS query load high. Long TTLs (1h+) make load low but failover slow. Most teams default to long TTLs without thinking about it and discover the trade-off during their first DR test.
The pragmatic split. Records that change rarely (www, api) get medium TTLs (300s). Records that flip during failover (active region pointers) get short TTLs (30s). Nothing critical above 300s; nothing hot below 30s.
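In zone-file terms the split looks like this; names, addresses, and targets are placeholders:

```
www.example.com.     300  IN  A      203.0.113.10
api.example.com.     300  IN  CNAME  lb.example.com.
; the active-region pointer flips during failover, so it stays short
active.example.com.   30  IN  CNAME  us-east.example.com.
```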
Test annually by killing the primary and timing the actual recovery. The number rarely matches what the spec says.
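A timing harness is a dozen lines. This sketch assumes dnspython and polls until the flipped value appears; run it from a client-like vantage point while you execute the failover:

```python
import time
import dns.resolver

NAME = "active.example.com"          # placeholder failover pointer
NEW_TARGET = "us-west.example.com."  # value the record should flip to

r = dns.resolver.Resolver()  # same resolver path your clients use
r.lifetime = 2.0
start = time.monotonic()
while True:
    try:
        seen = str(r.resolve(NAME, "CNAME")[0].target)
    except Exception:
        seen = "<resolution failed>"
    elapsed = time.monotonic() - start
    print(f"{elapsed:6.1f}s  {seen}")
    if seen == NEW_TARGET:
        break  # this elapsed time, not the TTL on paper, is your failover time
    time.sleep(5)
```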
Antipatterns
Single authoritative provider. Two is the bare minimum. The cost is rounding error.
No DNS metrics. Resolution success rate, latency, and timeout count should be on a dashboard; most teams instead discover DNS is broken via downstream symptoms. A minimal probe sketch follows this list.
Hardcoded IPs as the ‘DNS workaround.’ Tempting in incidents; long-term technical debt. The IPs change; you forget; the next incident is harder.
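For the metrics gap, a minimal probe that emits the three numbers worth graphing. It assumes dnspython; emit() is a stand-in for your metrics client, and the names are hypothetical:

```python
import time
import dns.exception
import dns.resolver

NAMES = ["api.example.com", "db.example.internal"]  # your hot names

def emit(metric, value):
    print(f"{metric}={value}")  # replace with statsd/Prometheus client calls

r = dns.resolver.Resolver()
r.lifetime = 2.0
for name in NAMES:
    start = time.monotonic()
    try:
        r.resolve(name, "A")
        emit("dns.success", 1)
        emit("dns.latency_ms", round(1000 * (time.monotonic() - start)))
    except dns.exception.Timeout:
        emit("dns.success", 0)
        emit("dns.timeout", 1)
    except Exception:
        emit("dns.success", 0)
```

Run it every 15-30 seconds per host and graph the aggregate; the success-rate panel is what turns a mystery incident into a DNS incident in seconds.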
What to do this week
Three moves. (1) Check whether your apex domain has more than one authoritative provider; if not, add one. (2) Verify each host has two recursive resolvers configured; fix any that do not. (3) Add a DNS-resolution success-rate panel to your platform dashboard so the next DNS issue surfaces in seconds, not from incident reports.