The TLS Certificate Rotation Automation
Cert expiry incidents are 100% preventable. This article covers the automation that catches expiring certs and rotates them without human action.
Detection
TLS certificate rotation is one of those engineering disciplines that become invisible when automated. Done manually, it is a recurring source of outages: the cert expires, the service goes down, and the on-call engineer gets paged at 3 a.m. Automation takes the human out of the loop and makes the discipline sustainable.
What good detection looks like:
- Daily scan for certs expiring in less than 30 days: A scheduled job iterates over every endpoint and certificate the team owns and flags any that expire within 30 days. The 30-day window is wide enough to allow remediation without panic; narrower windows risk leaving no buffer.
- Alert and queue: Findings produce both an alert and a queue entry. The alert raises awareness; the queue entry is the work item. Without the alert, nobody knows the work exists. Without the queue entry, nothing tracks the fix to completion, so the same alert fires every day until people tune it out.
- Catches manual certs that nobody owns: The scan covers all endpoints, not just those managed by automation. Certificates issued manually years ago by people who no longer work at the company show up too. Finding them early prevents the "no one knows who owns this" outage.
- Inventory across all surfaces: Web servers, load balancers, internal mTLS endpoints, and certificate-authenticated APIs all hold certificates. The inventory must cover every one of them; a certificate on an unscanned surface expires unnoticed.
- External certificate transparency feeds: CT logs record the publicly trusted certificates issued for the team's domains. Cross-referencing them against the team's inventory finds certificates the team did not know existed, which is how shadow IT and forgotten subdomains surface.
Detection is the first line of defense. Without comprehensive detection, even the best rotation automation has gaps.
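The two core detection checks described above, the 30-day expiry window and the cross-reference against certificate transparency data, can be sketched in a few lines. This is a minimal illustration, not a full scanner: the `CertRecord` type and the function names are hypothetical, and it assumes the inventory and the CT serial numbers have already been collected by some other job.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CertRecord:
    """One inventory entry (hypothetical shape, for illustration only)."""
    endpoint: str
    serial: str
    not_after: datetime  # timezone-aware expiry timestamp


def find_expiring(certs, now, window_days=30):
    """Return certs that expire within the window, soonest first.

    Already-expired certs are included: they are the most urgent findings.
    """
    cutoff = now + timedelta(days=window_days)
    expiring = [c for c in certs if c.not_after < cutoff]
    return sorted(expiring, key=lambda c: c.not_after)


def find_untracked(ct_serials, inventory):
    """Return serials seen in CT data that the inventory does not know about."""
    known = {c.serial for c in inventory}
    return set(ct_serials) - known
```

Each result of `find_expiring` would feed both the alert and the queue entry; each serial returned by `find_untracked` is a certificate that needs an owner before it can be rotated.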
Rotation
The rotation itself is now mostly a solved problem. Modern tools handle the issuance and deployment lifecycle automatically. The team's job is to point the right tool at the right surface and let it work.
- ACM and cert-manager handle most cases: AWS Certificate Manager covers AWS-fronted services (CloudFront, ALB, NLB, API Gateway). cert-manager covers services running on Kubernetes. Together they handle most modern infrastructure.
- Auto-renew on validation: Both tools renew the certificates they issue before expiration. The validation (ACME HTTP-01, DNS-01, or AWS DNS validation) happens automatically, and the renewed certificate deploys to the surfaces that need it. Certificates imported into ACM are the exception: ACM does not renew them, so they still need their own rotation process.
- Convert manual certs to ACM/cert-manager: When the team finds a manually managed certificate, the right move is to migrate it to automated management. The migration is a one-time effort; after it, the ongoing rotation cost drops to near zero.
- Internal CA for mTLS: Service-to-service mTLS often uses an internal CA. Tools like cert-manager with an internal issuer or Vault PKI handle the certificate lifecycle. The internal CA can safely issue short-lived certificates because the rotation is automated.
- Test the rotation: Treat the first rotation as a test. The team verifies that the automation produces working certificates and deploys them correctly. Run that first rotation in a non-production environment so that failures are recoverable.
The rotation is the mechanical part. With good tooling, it runs without human attention; with bad tooling or manual processes, it becomes the source of recurring outages.
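To make "renew before expiration" concrete, here is a minimal sketch of the timing rule such tools apply: renew once only a set fraction of the certificate's lifetime remains. The function names are hypothetical. The default of one third remaining is modeled on cert-manager's documented `renewBefore` default and should be treated as an assumption; check the documentation of the tool in use for its actual schedule.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after, remaining_fraction=1 / 3):
    """Return the moment at which renewal should start.

    remaining_fraction is the share of the total lifetime still left
    when renewal begins (assumed default: one third).
    """
    if not 0 < remaining_fraction < 1:
        raise ValueError("remaining_fraction must be between 0 and 1")
    lifetime = not_after - not_before
    return not_after - lifetime * remaining_fraction


def should_renew(now, not_before, not_after, remaining_fraction=1 / 3):
    """True once the renewal point has been reached."""
    return now >= renewal_time(not_before, not_after, remaining_fraction)
```

For a 90-day certificate this rule starts renewal roughly 30 days before expiry, which leaves time for several retries if validation fails. The shorter the certificate lifetime, the less slack there is, which is why short-lived internal certificates are only safe when this step is fully automated.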
Verification
The rotation is not complete until the team verifies that the new certificate is actually serving traffic. A rotation that succeeds in the cert manager but fails to deploy to the endpoint is a silent failure; verification catches it.
- Probe the endpoint after rotation: An automated probe connects to the endpoint after rotation and pulls the certificate it serves. The probe runs from outside the deployment system, so it sees what real clients see.
- Verify the served cert: The probe extracts the certificate's expiration date and serial number and compares them with those of the expected new certificate. A mismatch indicates that the deployment did not complete; the team is alerted and remediates.
- Monitoring continues: The probe runs continuously, not just immediately after rotation. Continuous probing catches deployment issues that would otherwise become outages later. The cost of probing is low; the value of catching issues early is high.
- Catches deployment issues: Common issues include old certificates cached on intermediate proxies, deployment systems that did not pick up the rotation, and certificates that rotated but were never propagated. The probe catches each of them.
- Page on cert mismatch: A served certificate that does not match the expected one is a real incident. The page goes to the on-call, who investigates the deployment chain and remediates. The page severity is high because the consequence (a certificate expiring in production) is operationally severe.
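The probe and comparison steps above can be sketched with Python's standard library. This is an illustrative sketch, not a production monitor: the function names are hypothetical, the expected serial number and expiry are assumed to come from whatever system performed the rotation, and the probe uses default certificate validation, so an already-expired or untrusted certificate raises an `ssl.SSLError` that should itself be treated as a finding.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_served_cert(host, port=443, timeout=5.0):
    """Connect like an ordinary client; return (serial_hex, not_after)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expiry_seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    not_after = datetime.fromtimestamp(expiry_seconds, tz=timezone.utc)
    return cert["serialNumber"], not_after


def normalize_serial(serial):
    """Make serials comparable across tools (colons, case, leading zeros)."""
    return serial.replace(":", "").lower().lstrip("0")


def check_served_cert(served_serial, served_not_after,
                      expected_serial, expected_not_after):
    """Return a list of mismatch descriptions; an empty list means OK."""
    problems = []
    if normalize_serial(served_serial) != normalize_serial(expected_serial):
        problems.append(
            f"serial mismatch: served {served_serial}, expected {expected_serial}"
        )
    if served_not_after != expected_not_after:
        problems.append(
            f"expiry mismatch: served {served_not_after.isoformat()}, "
            f"expected {expected_not_after.isoformat()}"
        )
    return problems
```

A scheduler would call `fetch_served_cert` for each endpoint, pass the result to `check_served_cert`, and page the on-call whenever the returned list is non-empty. Keeping the comparison separate from the network call makes the mismatch logic easy to test without a live endpoint.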
TLS cert rotation automation is the discipline that prevents a class of outages entirely. Nova AI Ops integrates with certificate inventory and verification probes, surfaces upcoming expirations, and produces the audit-ready report that compliance and operations both reference.