DNS Monitoring
Track resolution.
What to monitor
DNS is the kind of dependency that is invisible until it fails. Three signals together cover almost every failure mode: resolution latency, resolution success rate, and cache-hit rate at the resolver. Per-region latency catches the regional failures that aggregate metrics hide.
- Resolution time per query. p99 per query. Should stay under 50ms typical.
- Resolution success rate. Per-resolver the success rate. Below 99.9 percent indicates resolver or authoritative issues.
- Cache-hit rate at the resolver. Per-resolver the hit rate. Below 90 percent suggests TTLs too short or unusual load.
- Per-region latency. Per-region the resolver latency. Matches user-perceived experience.
Authoritative DNS monitoring
Authoritative servers see only the traffic resolvers could not cache. Different signals matter at this layer: query mix, per-zone load, per-server health, NXDOMAIN rate that catches typo storms and scanning attempts.
- Per-zone query volume. Query rate per zone. Spikes indicate traffic surges or cache invalidation.
- Per-record query volume. Per-record query rate. Most-queried records are candidates for longer TTLs.
- Authoritative server health. Per-server response time, packet drop, error rate.
- Per-zone NXDOMAIN rate. Spikes catch typo storms or scanning attempts.
Synthetic DNS probes
Synthetic probes catch failures before customers do. Probe critical records from multiple regions; validate expected values; monitor TTL behaviour for drift that signals misconfiguration.
- Multi-geography probes. Probe per region. Detects regional resolution failures.
- TTL behaviour monitoring. Expected versus observed TTL per record. Drift catches misconfiguration.
- Catches issues before customer reports. Continuous monitoring beats reactive investigation.
- Per-record value validation. Expected-value check per record. Catches accidental changes.
Alerting on DNS
DNS failures cascade fast. Page immediately on real failures; downgrade configuration drift and cache-miss spikes to warnings investigated during business hours.
- Failed resolutions. Immediate page per failure. DNS failure cascades quickly to many systems.
- TTL anomalies. Warning, not page. Often config drift; investigate during business hours.
- Cache-miss spikes. Warning. May indicate scaling issue at resolver tier.
- TLSA record expiry. Cert expiry alerts on DNSSEC records. Matches DNSSEC operations.