SSL Certificate Expiry: Detection, Renewal, Rollout
Three problems, three sub-agents, one orchestrator. The split, the integration with cert-manager, and the dry-run output that an SRE can sanity-check.
Three problems, three sub-agents
The cert flow splits into three sub-agents. Detection sub-agent scans certs daily and flags those expiring in less than 30 days. Renewal sub-agent triggers cert-manager (or your renewal flow) for flagged certs. Rollout sub-agent deploys the renewed cert and verifies the endpoint is serving the new chain.
- Detection sub-agent. Daily scan; flag certs expiring within 30 days.
- Renewal sub-agent. Trigger cert-manager or your renewal flow for flagged certs.
- Rollout sub-agent. Deploy the renewed cert; verify the endpoint serves the new chain.
- Per-stage independence. Each sub-agent is bounded; supports correct error attribution.
The orchestrator
The orchestrator is code, not LLM. State machine: detect, renew, rollout, with each transition deterministic; LLM is invoked only for non-routine cases (cert types the renewal sub-agent does not know how to renew, hosts that did not pick up the new cert); the split keeps routine cases cheap and fast.
- Code state machine. Deterministic transitions; the orchestrator is not LLM-based.
- LLM only for non-routine. Unknown cert types, stuck rollouts; the LLM handles edge cases.
- Routine cheap and fast. Most cases take the deterministic path; cost stays low.
- Per-transition logging. Each state change captured; supports investigation when something stalls.
Dry-run output
Dry-run is the safety check. Each step emits a dry-run summary (what cert, what host, what the agent would do, expected duration); the summary is reviewable by SREs before any action and most reviews approve in seconds; dry-run output also serves as documentation describing the cert renewal flow in plain language.
- Per-step dry-run summary. What cert, what host, what action, expected duration.
- SRE review before action. Most reviews approve in seconds; the dry-run is the safety check.
- Documentation as side-effect. Dry-run text describes the renewal flow in plain language.
- Per-cert audit trail. The dry-run record persists; supports later investigation.
Verification after rollout
Verification confirms the new cert reaches users. Probe the endpoint with openssl s_client and confirm the served cert matches the expected one; probe from multiple network paths if relevant (internal vs external, the cert may differ); verify within 5 minutes of rollout because cert mis-rollout breaks customer connections.
- openssl s_client probe. Confirm the served cert matches the expected one.
- Multiple network paths. Internal vs external; the cert may differ; probe both.
- 5-minute verify window. Cert mis-rollout breaks customer connections; verify quickly.
- Per-rollout immediate alert. If verification fails, alert immediately; the failure is high-stakes.
Escalation cases
Three escalation cases need human investigation. Renewal failed (CA rejected, rate-limited, or returned an error; surface the error); rollout failed (new cert issued but not picked up by the server; configuration issue); verification failed (rolled-out cert is not what is being served; possible deployment issue, the agent does not auto-revert).
- Renewal failure. CA rejected, rate-limited, error; human investigates.
- Rollout failure. Cert issued but not picked up; configuration issue; human investigates.
- Verification failure. Rolled-out cert not served; deployment issue; agent does not auto-revert.
- Per-escalation handoff. Each escalation includes context; human picks up where the agent stopped.