DNS Failure Mode Checklist
DNS is the most common cause of 'suddenly everything is broken' incidents. This checklist covers the seven failure modes and a triage order for working through them.
The seven
DNS failures are particularly painful because DNS is at the foundation of nearly everything. When DNS fails, applications cannot resolve hostnames; service-to-service communication fails; users cannot reach the application. The DNS failure mode checklist is the structured guide to triaging DNS issues quickly under incident pressure.
The seven failure modes are:
- Authoritative server down: The authoritative DNS server (Route 53, Cloud DNS, or similar) is unavailable. Resolvers keep answering from cache until TTLs expire; after that, queries fail with SERVFAIL or time out unless the resolver is configured to serve stale data.
- Resolver down: The local resolver (kube-dns, CoreDNS, or the OS resolver) is failing. The application cannot resolve hostnames even though the authoritative server is healthy.
- Cache poisoning: A cache holds wrong records. The resolver returns wrong IPs and the application connects to the wrong destination; security and operational issues both follow.
- NXDOMAIN cached too long: A negative response (NXDOMAIN) was cached before the hostname was created. The hostname now exists, but the resolver keeps returning NXDOMAIN until the negative-cache entry expires; how long that takes is governed by the zone's SOA record. The application sees the hostname as nonexistent.
- TTL too high for change: A record was changed, but the old value is still cached under a long TTL. The change propagates slowly: some clients see the new value and others see the old one until all caches expire.
- CNAME chain broken: A CNAME points to a record that itself points elsewhere, and somewhere along the chain a record is missing or wrong. Resolution fails partway, and the error does not always show which link broke.
- DNSSEC validation failure: Validating resolvers return SERVFAIL for DNSSEC-signed records that fail validation. The resolver cannot verify the response, so the client gets no answer.
The seven cover most DNS failure modes the team will encounter. Recognizing the pattern is the first step to fixing it.
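As a rough illustration of pattern recognition, the sketch below maps what a lookup looks like from the resolver and from the authoritative server to the failure modes worth checking first. The `Observation` fields and the `candidate_modes` function are hypothetical names, and the mapping is a heuristic under simplifying assumptions (for example, SERVFAIL can have causes beyond the two listed), not a definitive diagnosis.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """One lookup seen from two vantage points (all field names are hypothetical).

    An rcode is a standard DNS response code name such as "NOERROR",
    "NXDOMAIN" or "SERVFAIL", or None when the server never answered.
    """
    resolver_rcode: Optional[str]
    authoritative_rcode: Optional[str]
    resolver_answer: Optional[str] = None
    authoritative_answer: Optional[str] = None
    record_changed_recently: bool = False


def candidate_modes(obs: Observation) -> List[str]:
    """Return the failure modes worth checking first (heuristic, not exhaustive)."""
    if obs.authoritative_rcode is None:
        return ["authoritative server down"]
    if obs.resolver_rcode is None:
        return ["resolver down"]
    if obs.resolver_rcode == "SERVFAIL":
        # Assumption: treat SERVFAIL as validation or chain trouble first.
        return ["DNSSEC validation failure", "CNAME chain broken"]
    if obs.resolver_rcode == "NXDOMAIN":
        if obs.authoritative_rcode == "NOERROR":
            # The name exists at the source, so the resolver's NXDOMAIN is stale.
            return ["NXDOMAIN cached too long"]
        # Assumption: a name missing at the source points at a broken link.
        return ["CNAME chain broken"]
    if obs.resolver_answer != obs.authoritative_answer:
        if obs.record_changed_recently:
            return ["TTL too high for change"]
        return ["cache poisoning"]
    return []


obs = Observation("NXDOMAIN", "NOERROR", authoritative_answer="10.0.0.5")
print(candidate_modes(obs))  # ['NXDOMAIN cached too long']
```

The function returns candidates rather than a verdict, because several modes share the same symptom and the final call still needs a human looking at the records.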
Triage in order
The triage flow walks the failure modes in the order most likely to find the issue. Each step rules out one or more failure modes; the team converges on the cause.
- Start with connectivity: Can the resolver actually reach the authoritative server? If not, the issue is at the network level and DNS itself is fine.
- If no, it is a network issue: Connectivity problems require network-layer investigation. The DNS team coordinates with networking; DNS itself is not the problem.
- If yes, check the records: The resolver can reach the authoritative server, so query the records directly. If the records are wrong, the issue is at the record level, not the infrastructure.
- If the records are wrong, check for recent changes: Recent record changes are the most common cause of wrong records. Check the change log, identify recent changes, and correlate them with the incident timeline.
- Roll back: If a recent change correlates, roll it back. The rollback is fast: the records return to their previous state, and the issue typically resolves once caches pick up the restored values.
The triage flow produces fast resolution for most DNS issues. The team learns the flow; incident response becomes routine.
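The flow above can be sketched as a small decision function. This is a minimal illustration, not a prescribed implementation: the function name `triage` and its boolean inputs are hypothetical, and the two branches the flow does not spell out (records correct, or records wrong with no recent change) are filled in with assumed next steps and marked as such in the comments.

```python
def triage(resolver_reaches_authoritative: bool,
           records_correct: bool,
           changed_recently: bool) -> str:
    """Walk the triage flow and return the next action as a short string."""
    # Step 1: connectivity between the resolver and the authoritative server.
    if not resolver_reaches_authoritative:
        return "network issue: investigate with networking"
    # Step 2: records queried directly at the authoritative server.
    if records_correct:
        # Assumption: the flow does not name this branch; check the
        # remaining failure modes next.
        return "records correct: check resolver, caches and DNSSEC"
    # Step 3: wrong records, so look for a correlating change.
    if changed_recently:
        return "roll back the recent change"
    # Assumption: wrong records with no change on file need a zone audit.
    return "records wrong, no recent change: audit the zone"


print(triage(True, False, True))  # roll back the recent change
```

Each question is asked only after the previous one has been ruled out, which mirrors how the flow narrows the search one step at a time.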
Prevention
Many DNS issues are preventable. The prevention strategies cost little to implement; the avoided incidents pay for the prevention.
- Short TTLs during planned changes: Before a planned record change, lower the TTL well in advance. With a TTL of 60 seconds, caches pick up a new value within about 60 seconds, so the change window is bounded and, if the change goes wrong, rollback is just as fast.
- Plan 4 hours ahead: The lowered TTL must reach all caches before the actual change, and that takes up to the previous TTL. A 4-hour lead time covers typical TTL durations and replication delays; if the previous TTL was longer than 4 hours, extend the lead time to match. The discipline costs only planning; the value is significant.
- Multi-region authoritative DNS: Deploy authoritative DNS across multiple regions. Loss of a single region's DNS does not affect resolution; the redundancy survives single-region failures.
- Health checks with automatic failover: DNS records can be health-checked, and failed endpoints are automatically removed from DNS responses. The failover is automatic; user impact is bounded.
- Test changes in a staging zone: Test major DNS changes in a staging zone first. The staging zone exercises the same patterns, so issues surface before they affect production.
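The TTL arithmetic behind the first two items can be sketched as follows. The function names and the optional safety margin are hypothetical; the only rule encoded is that a cache which fetched the record just before the TTL was lowered may keep the old value for up to the old TTL.

```python
from datetime import datetime, timedelta


def earliest_safe_change(ttl_lowered_at: datetime,
                         old_ttl_seconds: int,
                         margin_seconds: int = 0) -> datetime:
    """Earliest time at which every cache should hold the lowered TTL.

    A cache that fetched the record just before the TTL was lowered can
    keep it for up to old_ttl_seconds, so wait at least that long.
    """
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds + margin_seconds)


def lead_time_is_sufficient(old_ttl_seconds: int,
                            lead_time: timedelta = timedelta(hours=4)) -> bool:
    """True if the lead time (4 hours by default) covers the old TTL."""
    return timedelta(seconds=old_ttl_seconds) <= lead_time


lowered = datetime(2024, 1, 1, 8, 0, 0)
print(earliest_safe_change(lowered, 3600))  # 2024-01-01 09:00:00
print(lead_time_is_sufficient(3600))        # True
print(lead_time_is_sufficient(86400))       # False
```

A record that previously carried a one-day TTL fails the 4-hour check, which is the case where the lead time needs to be extended rather than assumed.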
The DNS failure mode checklist is one of those operational disciplines that pays off in proportion to the team's reliance on DNS. Nova AI Ops integrates with DNS platforms and observability tools, surfaces DNS health and recent changes, and produces the triage view that incident response uses to converge on causes quickly.