Certificate Rotation Automation
cert-manager, ACM.
Overview
Certificate rotation automation replaces TLS certificates before they expire without humans in the loop. Manual rotation produces predictable incidents (someone forgot, the on-call discovered it at 3 AM); automation removes the bottleneck. Five primitives carry most operational use: cert-manager for Kubernetes, ACM for AWS-native, auto-renewal before expiry, in-place service reload, audit trail.
- cert-manager (Kubernetes). Issues and renews certs from ACME (Let's Encrypt), Vault, or internal CAs. Default for K8s workloads.
- AWS Certificate Manager (ACM). Free certs for ELB, CloudFront, API Gateway with automatic renewal. Default for AWS-native workloads.
- Auto-renewal plus zero-downtime reload. Certs renew before expiry; running service picks up the new cert without restart. Removes the human renewal-pressure trap.
- Audit trail. Every issuance and renewal logged. Supports compliance and post-incident investigation.
The approach
Platform-native automation (cert-manager for K8s, ACM for AWS, Vault PKI for internal mesh), expiration monitoring as a backstop for automation failure, non-prod testing before production rotation. The discipline is treating cert hygiene as a property of the platform rather than a per-cert calendar reminder.
- cert-manager for Kubernetes. Declarative cert management via CRDs. Matches K8s patterns.
- ACM for AWS-native plus Vault PKI for internal. Native integration with ELB and CloudFront; Vault as internal CA for service mesh. Both fully managed.
- Monitor expiration as backstop. Alert before any cert reaches the danger zone. Catches automation failures before they become outages.
- Test rotation in non-prod first. Synthetic rotation per environment. Catches surprises before they hit production.
Why this compounds
Each automated cert removes a class of recurring incidents. Engineering time stops being spent on renewal calendar work; short-lived certs (Let's Encrypt 90-day, Vault PKI hours) become practical because automation handles the rotation. By year two, "cert expired" is no longer a category of postmortem.
- Reduced incidents. Cert expiration is no longer a class of outage. Uptime improves.
- Reduced operational toil. Engineers do not chase cert renewals. Engineering time stays on product.
- Better security. Short-lived certs become practical. Modern PKI hygiene by default.
- Year-one investment, year-two habit. First cert pipeline is the investment; subsequent ones run on the patterns.