SSL Certificate Expiry: Detection, Renewal, Rollout
Three problems, three sub-agents, one orchestrator. The split, the integration with cert-manager, and the dry-run output that an SRE can sanity-check.
Three problems, three sub-agents
Detection sub-agent: scan certs daily, flag those expiring in <30 days.
Renewal sub-agent: trigger cert-manager (or your renewal flow) for flagged certs.
Rollout sub-agent: deploy the renewed cert and verify the endpoint is serving the new chain.
The orchestrator
Code, not LLM. The orchestrator is a state machine: detect → renew → rollout. Each transition is deterministic.
LLM is invoked only for non-routine cases: cert types the renewal sub-agent does not know how to renew, hosts that did not pick up the new cert.
The split keeps the routine cases cheap and fast.
Dry-run output
Each step emits a dry-run summary: what cert, what host, what the agent would do, expected duration.
The summary is reviewable by SREs before any action. Most reviews approve in seconds; the dry-run is the safety check.
Dry-run output also serves as documentation: the cert renewal flow is described, by the agent, in plain language.
Verification after rollout
Probe the endpoint with openssl s_client; confirm the served cert matches the expected one.
Probe from multiple network paths if relevant. Internal vs external; the cert may differ.
Verify within 5 minutes of rollout. If verification fails, alert immediately; cert mis-rollout breaks customer connections.
Escalation cases
Renewal failed: the CA rejected, rate-limited, or returned an error. Surface the error; human investigates.
Rollout failed: the new cert was issued but not picked up by the server. Configuration issue; human investigates.
Verification failed: the rolled-out cert is not what is being served. Possible deployment issue; the agent does not auto-revert this; human investigates.