CoreDNS Tuning at Scale
Default CoreDNS settings struggle at scale. This guide covers three areas to tune: the cache, the replica count, and ndots.
Cache
CoreDNS tuning means adjusting the cluster's DNS resolver to the workload's needs. Default settings work for small clusters; large clusters benefit from tuning, and the cache is the place to start.
What cache tuning provides:
- Increase cache TTL.: Raising the cache TTL reduces upstream queries. CoreDNS holds responses longer, so fewer queries propagate and the load on upstream resolvers drops. The cache plugin's TTL is a cap: a record is never cached longer than its own TTL.
- Reduces query load.: The load that reaches upstream resolvers is bounded by the cache miss rate. Frequently resolved names hit the cache; only first-time or expired queries go upstream.
- Trade-off: stale records.: A longer TTL means staler records. When a service's IP changes, clients keep using the old IP until the cached entry expires, so the TTL should not exceed the staleness the workload can tolerate.
- Per-zone configuration.: Different zones can have different cache TTLs. Internal cluster names (cluster.local) can have a higher TTL than external names; match each zone's TTL to how often its records actually change.
- Negative caching too.: Cache also covers negative responses (NXDOMAIN). Tuning negative cache TTL reduces noise for non-existent names.
Cache tuning is the foundation. A higher cache hit rate means less query load upstream.
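The relationship between the TTL cap and upstream load can be sketched with a small model. This is a back-of-the-envelope illustration, not CoreDNS code: the function names are hypothetical, and it assumes steady client traffic for a single name and at most one cache miss per TTL window on each CoreDNS replica.

```python
def effective_ttl(record_ttl: int, cache_cap: int) -> int:
    """TTL applied to a cached answer: the record's own TTL, limited by the cap."""
    return min(record_ttl, cache_cap)


def upstream_qps(client_qps: float, ttl_seconds: int) -> float:
    """Rough upstream query rate for one name under steady client load.

    Assumption: at most one cache miss per TTL window, and never more
    upstream queries than client queries.
    """
    if ttl_seconds <= 0:
        return client_qps  # caching disabled: every query goes upstream
    return min(client_qps, 1.0 / ttl_seconds)


if __name__ == "__main__":
    # Compare three cache caps for a record that carries a 60-second TTL.
    for cap in (5, 30, 300):
        ttl = effective_ttl(record_ttl=60, cache_cap=cap)
        rate = upstream_qps(client_qps=200.0, ttl_seconds=ttl)
        print(f"cap={cap:>3}s effective_ttl={ttl:>2}s upstream~{rate:.3f} qps")
```

In this model, raising the cap past the record's own TTL changes nothing, which is why the TTL that the upstream zone publishes matters as much as the cache setting.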
Replicas
CoreDNS has to scale with the cluster. The default replica count is too low for large clusters; the discipline is matching replicas to query load.
- Scale CoreDNS replicas with cluster size.: Larger clusters produce more DNS queries. The CoreDNS replica count must match; otherwise the resolvers become bottlenecks.
- Default 2 is too few for large clusters.: The Kubernetes default of 2 CoreDNS replicas works for small clusters. Large clusters (hundreds of nodes, thousands of pods) need significantly more.
- HPA on CoreDNS.: A Horizontal Pod Autoscaler can scale CoreDNS based on CPU utilization, so the replica count follows load without manual intervention.
- Per-cluster sizing.: Size CoreDNS for each cluster separately. Monitoring shows each cluster's query load, and the replica count is adjusted to match.
- Affinity for nodes.: Some teams place CoreDNS pods on specific nodes for predictable behavior. The placement matches the operational pattern.
Replica count is the operational lever. Right-sized replicas produce reliable DNS resolution.
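To make the HPA behavior concrete, here is a minimal sketch of the scaling rule documented for the Horizontal Pod Autoscaler: desired replicas equal the current count multiplied by the ratio of observed to target metric, rounded up, with no change when the ratio is within a tolerance of 1.0. The function name, the bounds, and the sample numbers are illustrative assumptions, and the real controller also handles stabilization windows and missing metrics, which this sketch ignores.

```python
import math


def desired_replicas(
    current: int,
    current_cpu: float,
    target_cpu: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA scaling decision for a CPU utilization target."""
    if current <= 0 or target_cpu <= 0:
        raise ValueError("current and target_cpu must be positive")

    ratio = current_cpu / target_cpu
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # close enough to target: do not scale
    else:
        desired = math.ceil(current * ratio)

    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Two replicas at 90% CPU against a 50% target.
    print(desired_replicas(2, 90.0, 50.0, min_replicas=2, max_replicas=10))
```

Keeping the minimum at two or more preserves redundancy when load is low; the maximum guards against runaway scaling when the CPU signal is noisy.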
ndots
The ndots configuration affects how DNS queries resolve. The default produces extra queries for many lookups; tuning lower can reduce query volume significantly.
- ndots=5 default causes search-domain queries.: The Kubernetes default is ndots=5 in pods. A name with fewer than five dots is tried against each search domain in order before it is tried as written, so an external name such as api.example.com can produce several failed queries before the lookup that succeeds.
- Tune lower if your apps don't use search domains.: If applications use fully-qualified names, lower ndots reduces query overhead. With ndots=1, only names with no dots at all go through the search domains first; any name containing a dot is tried as written first, so most queries skip the search.
- Per-pod tuning.: The dnsConfig field on pods supports custom ndots. Specific pods can have different settings; the discipline is per-pod where it matters.
- FQDNs eliminate the need.: When apps consistently use FQDNs (names ending with a trailing dot), search domains are not used regardless of ndots. This fix lives in application code and configuration rather than in the cluster.
- Test the change.: Lowering ndots can break applications that depend on search domain resolution. The team tests in non-production; verifies behavior; rolls out carefully.
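The effect of ndots on query volume can be illustrated with a short sketch of the lookup order a typical stub resolver follows. This models the commonly documented resolv.conf behavior and is not taken from any resolver's source; the function names and the sample search list (for a pod in the default namespace) are assumptions, and some resolver libraries differ in the details.

```python
from typing import Dict, List


def candidate_queries(name: str, ndots: int, search: List[str]) -> List[str]:
    """Return the names a resolver would try, in order, until one resolves."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search domains applied

    absolute = name + "."
    expanded = [f"{name}.{domain.rstrip('.')}." for domain in search]

    if name.count(".") >= ndots:
        # Enough dots: try the name as written first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, then try the name as written.
    return expanded + [absolute]


def pod_dns_config(ndots: int) -> Dict[str, list]:
    """Fragment for a pod spec's dnsConfig field; option values are strings."""
    return {"options": [{"name": "ndots", "value": str(ndots)}]}


if __name__ == "__main__":
    search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    for n in (5, 2):
        print(f"ndots={n}")
        for query in candidate_queries("api.example.com", n, search):
            print("  ", query)
    print(pod_dns_config(2))
```

With ndots=5 the external name is tried last, after three search-domain attempts; with ndots=2 it is tried first. Because most resolvers also query A and AAAA records separately, each failed candidate can cost two queries.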
CoreDNS tuning is one of those Kubernetes operational disciplines that pays off proportionally to cluster size. Nova AI Ops integrates with cluster DNS telemetry, surfaces query patterns, and supports the team's tuning decisions.