Cloud & Infrastructure · Intermediate
By Samson Tanimawo, PhD · Published Aug 18, 2026 · 5 min read

DNS as a Deployment Control Plane

DNS isn't just where users find your service. With weighted records, health checks, and short TTLs, DNS becomes a global traffic-shifting tool that's simpler than any service mesh.

DNS beyond resolution

DNS isn't just "name to IP". Modern DNS is an effective control plane, health-aware, weighted, geo-aware. Used right, it deploys, fails over, and routes traffic without touching application code.

The traditional view. DNS resolves domain.com → 1.2.3.4. End of story. TTL controls cache duration. That's accurate for naive use, but it leaves DNS's most powerful capabilities on the table.

The modern view. DNS is a programmable routing layer. Records can be weighted, health-checked, geo-routed, and latency-routed. Changes propagate within TTL (60-300 seconds). For routing decisions that don't need millisecond reaction time, DNS is simpler than load balancers and faster to change than application config.

The control-plane framing. Treat DNS records as code. Manage them in version control via Terraform or Pulumi. Run them through CI like any other config. The "control plane" is your Git repo; the "data plane" is the resolved record. This makes DNS changes auditable and reversible, qualities that DNS-as-an-afterthought lacks.
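A minimal sketch of the idea, assuming Route53 and the Terraform AWS provider; the zone, record name, and IP are placeholders:

```hcl
# A DNS record as reviewable, revertible code: changes to `records`
# or `ttl` go through a pull request like any other config change.
resource "aws_route53_zone" "main" {
  name = "example.com"
}

resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60                 # short TTL so control-plane changes land fast
  records = ["203.0.113.10"]   # placeholder origin IP
}
```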

Weighted records

Multiple A records with weights. 80% point at primary, 20% at canary. Adjust weights to ramp traffic. Cleaner than maintaining feature flags for routing decisions.

The use case. You're rolling out a new backend. Both old and new are healthy; you want to send 5% of traffic to new for 24 hours, watch metrics, then ramp. With weighted DNS: change one Terraform value (5 → 25), apply, wait. With application-level flags: write code, deploy, configure rollout, monitor, repeat.
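A sketch of that workflow with Route53 weighted routing, reusing the zone from the sketch above (IPs and names are placeholders). The entire ramp lives in one variable:

```hcl
variable "canary_weight" {
  type    = number
  default = 5   # ramp by changing this value and re-applying: 5 -> 25 -> 50...
}

resource "aws_route53_record" "api_stable" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  ttl            = 60
  records        = ["203.0.113.10"]   # old backend
  set_identifier = "stable"

  weighted_routing_policy {
    weight = 100 - var.canary_weight
  }
}

resource "aws_route53_record" "api_canary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  ttl            = 60
  records        = ["203.0.113.20"]   # new backend
  set_identifier = "canary"

  weighted_routing_policy {
    weight = var.canary_weight
  }
}
```

Route53 weights are relative rather than strict percentages, but with two records summing to 100 they behave as percentages.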

The TTL implication. Weights take effect within the TTL window. With a 60-second TTL, a weight change is fully effective within 1-2 minutes. For canary ramps measured in hours or days, that's fast enough.

The consistency caveat. Weighted DNS distributes BY query, not BY user. A user resolving once gets one backend; subsequent requests within their resolver's TTL hit the same backend. For session-affinity needs, weighted DNS isn't a substitute for session-aware load balancing; it's complementary.

Health checks

Route53, NS1, and Cloudflare Load Balancing all check origins and stop returning unhealthy IPs. Failover happens within TTL, usually under 60 seconds. No application change required.

What gets health-checked. HTTP endpoints, TCP ports, SSL handshakes, custom probes. The check runs from multiple geographic locations to avoid false positives from regional network issues. Configure aggressive intervals (10-30 seconds) for fast failover; longer intervals (60-120 seconds) reduce false positives.
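In Terraform, a Route53 health check might look like this; the endpoint and thresholds are illustrative, and note that Route53 only accepts request intervals of 10 or 30 seconds:

```hcl
resource "aws_route53_health_check" "origin_a" {
  fqdn              = "origin-a.example.com"   # placeholder origin hostname
  port              = 443
  type              = "HTTPS"
  resource_path     = "/healthz"               # assumed health endpoint
  request_interval  = 30                       # seconds between probes
  failure_threshold = 3                        # consecutive failures before "unhealthy"
}
```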

The cascade pattern. Primary in region A; secondary in region B; tertiary in region C. Health checks remove unhealthy regions from rotation. Customers keep reaching a healthy origin without manual intervention. The pattern works for active-passive; for active-active, latency-based records route each client to the lowest-latency healthy origin.
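A two-tier sketch wiring that health check into failover records (Route53's failover policy natively supports one primary tier and one secondary; a third tier needs chained records, omitted here). IPs are placeholders:

```hcl
resource "aws_route53_record" "api_region_a" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "api.example.com"
  type            = "A"
  ttl             = 60
  records         = ["203.0.113.10"]
  set_identifier  = "region-a"
  health_check_id = aws_route53_health_check.origin_a.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "api_region_b" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  ttl            = 60
  records        = ["203.0.113.20"]
  set_identifier = "region-b"

  failover_routing_policy {
    type = "SECONDARY"   # served only while the primary's health check fails
  }
}
```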

The split-brain risk. If health checks themselves are unreliable (provider outage, misconfigured probe), DNS may incorrectly remove a healthy origin. Always keep a manual override path (Terraform-managed records with a "force-route-all-here" flag) for the case where the automation is wrong.
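One hedged way to build that override in Terraform; the variable and resource names are hypothetical. When the flag is set, a plain record replaces the health-checked pair (which would get the inverse count so the two never coexist):

```hcl
variable "force_origin" {
  description = "When non-empty, bypass health-checked routing and pin all traffic to this IP."
  type        = string
  default     = ""
}

# Created only when the override is active; the failover records above
# would use `count = var.force_origin == "" ? 1 : 0`.
resource "aws_route53_record" "api_forced" {
  count   = var.force_origin == "" ? 0 : 1
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60
  records = [var.force_origin]
}
```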

Three deployment patterns DNS unlocks

Geo-routing. Different IPs by country. Send EU users to EU origins, US to US origins. Single domain; the regional split is invisible to clients. Latency drops for non-US users; compliance with data-residency rules becomes a configuration concern, not an application concern.

Implementation: Route53 geolocation records or Cloudflare's geo-routing rules. Define the region groupings; assign records per region; clients automatically resolve to the record for their region. Useful for serving cached content close to users, segregating data-residency-sensitive traffic, and complying with regional regulations.
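A Route53 sketch: one record per region plus a catch-all default, without which unmatched locations get no answer. Continent codes and IPs are placeholders:

```hcl
resource "aws_route53_record" "api_eu" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  ttl            = 60
  records        = ["198.51.100.10"]   # EU origin (placeholder)
  set_identifier = "eu"

  geolocation_routing_policy {
    continent = "EU"
  }
}

resource "aws_route53_record" "api_default" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  ttl            = 60
  records        = ["203.0.113.10"]    # default origin (placeholder)
  set_identifier = "default"

  geolocation_routing_policy {
    country = "*"                      # catch-all for unmatched locations
  }
}
```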

Blue-green at the DNS layer. Two complete environments, "blue" and "green". DNS points at one; cut over by changing the record. Rollback is reverting the record. The only thing simpler is shutting down green entirely.
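As a sketch, the cutover can be a single Terraform variable; the environment names and IPs are assumptions:

```hcl
variable "live_environment" {
  type    = string
  default = "blue"   # cut over by changing this to "green" and applying
}

locals {
  environment_ips = {
    blue  = ["203.0.113.10"]
    green = ["203.0.113.20"]
  }
}

resource "aws_route53_record" "app" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"
  ttl     = 60
  records = local.environment_ips[var.live_environment]
}
```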

Compared to in-cluster blue-green: simpler (one DNS change vs. ingress reconfiguration), more atomic (the entire cutover is one record change), and easier to test (verify green directly via its own subdomain before cutting). Tradeoff: TTL-bound rollback time vs. instant ingress-level rollback.

Region failover. Health-checked records that automatically remove failed regions. The most basic DR pattern; surprisingly few teams have it set up, because they assume "DNS doesn't do failover". DNS very much does failover.

The TTL trade-off

Long TTLs (3600s+) reduce DNS query load. Short TTLs (60s) enable fast changes. For services that need fast control-plane changes, 60-second TTLs are right. The DNS query load is negligible at modern scales.

The query-load math. A service with 1M MAU and a 60-second TTL generates roughly 100-1000 DNS queries per second to the authoritative server: most clients sit behind shared recursive resolvers, so the authoritative side sees at most one query per resolver per TTL window, not one per client. Cloud DNS providers handle 100k+ QPS easily; the cost is a few dollars/month. The "TTL must be high to reduce load" argument is mostly outdated.

The propagation reality. Even with 60s TTL, full propagation can take 5-10 minutes because of intermediate resolvers (ISP DNS, Google DNS, Cloudflare DNS) caching independently. Plan for "most clients within 2 minutes; long-tail within 10 minutes" rather than instant.

The pre-cutover trick. Before a planned cutover, lower the TTL from 3600 to 60, but do it 60+ minutes ahead of time. Existing caches honour the new TTL only after their old entry expires; lowering the TTL too late means the cutover is still gated by the old TTL.
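In Terraform terms, the trick is two applies separated by at least the old TTL; the timeline lives in the comments (names and IP are placeholders):

```hcl
resource "aws_route53_record" "app" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  # Apply 1 (T minus 60+ minutes): lower the TTL, keep the old origin.
  ttl     = 60                 # was 3600
  records = ["203.0.113.10"]   # old origin, unchanged for now

  # Apply 2 (T zero, once the old 3600s caches have drained): change
  # `records` to the new origin; caches now expire within 60 seconds.
}
```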

Gotchas

Some clients ignore TTLs (looking at you, JVM). They cache DNS for the lifetime of the JVM process. Plan for this: either restart on cutover, or put a service-discovery layer in front of DNS.

The JVM specifically. Java has a DNS cache TTL (`networkaddress.cache.ttl`) defaulting to either -1 (forever) or 30 seconds, depending on whether a security manager is installed. Production JVM apps that don't set it explicitly can default to caching forever, which means DNS changes are invisible until restart. Always set it to something sane (60-300 seconds) at JVM startup, either in the `java.security` configuration file or via `Security.setProperty("networkaddress.cache.ttl", "60")` early in main.

The mobile-app cache. Native iOS/Android apps cache DNS aggressively. App developers frequently don't realise this: a "DNS change" isn't visible to mobile clients until the OS-level cache expires (typically minutes to hours). For mobile-heavy products, treat DNS changes as eventually-consistent at the app layer.

The recursive resolver lie. Some recursive resolvers (small-ISP DNS, broken corporate DNS) ignore TTLs and cache for hours or days. There's nothing the authoritative side can do; the recursive resolver is the broken party. The only mitigation is having clients use known-good resolvers (1.1.1.1, 8.8.8.8) or having an application-level fallback.

The CNAME chain pitfall. DNS resolution time grows linearly with CNAME chain length. A → B → C → D is four resolution steps; each adds 5-30ms. For latency-sensitive workloads, flatten chains to ALIAS records (Route53) or direct A records.
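In Route53, flattening usually means an alias record, which the authoritative server resolves in a single step. The load balancer reference here is an assumption about the rest of the config:

```hcl
resource "aws_route53_record" "www" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "www.example.com"
  type    = "A"

  # Alias records resolve at the authoritative server: clients get a
  # direct A answer, not a CNAME hop. No TTL is set; Route53 manages it.
  alias {
    name                   = aws_lb.app.dns_name   # assumed ALB defined elsewhere
    zone_id                = aws_lb.app.zone_id
    evaluate_target_health = true
  }
}
```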

Common antipatterns

3600s TTL on a record you change weekly. Every change has a 1-hour stale-cache window. Drop to 60-300s for actively-managed records.

DNS records managed in the provider console only. No version control, no audit trail, no review. Move them to Terraform yesterday.

Health-check intervals mismatched to failure-detection requirements. 10-second checks generate false positives during normal load variance; 60-second checks detect outages too slowly for tight recovery targets. 30 seconds is the usual sweet spot.

JVM in production without an explicit DNS TTL setting. A single config line prevents an entire class of failover bugs. Set it at JVM startup.

What to do this week

Three moves. (1) Audit your DNS TTLs. Anything actively-managed should be at 60-300 seconds; anything genuinely static can stay at 3600+. The mismatch is usually obvious. (2) Move your DNS records into Terraform if they aren't already. The audit trail and PR-review workflow pays back in the first incident where someone "doesn't remember changing that". (3) Confirm your JVM (or other long-lived runtime) DNS cache TTL and set it explicitly to 60-300 seconds. The default is almost always wrong.