SSL Certificate Expiry: Detection, Renewal, Rollout

Three problems, three sub-agents, one orchestrator. This piece covers the split, the integration with cert-manager, and the dry-run output that an SRE can sanity-check.

Three problems, three sub-agents

The cert flow splits into three sub-agents. The detection sub-agent scans certs daily and flags those expiring in fewer than 30 days. The renewal sub-agent triggers cert-manager (or your own renewal flow) for the flagged certs. The rollout sub-agent deploys the renewed cert and verifies that the endpoint is serving the new chain.
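
The detection rule can be sketched in a few lines of Python. This is a minimal illustration, not the actual sub-agent: the CertRecord type and its fields are hypothetical stand-ins for whatever cert inventory you scan, and only the 30-day threshold comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

EXPIRY_THRESHOLD = timedelta(days=30)  # threshold from the text


@dataclass(frozen=True)
class CertRecord:
    """Hypothetical inventory entry; field names are assumptions."""
    host: str
    not_after: datetime  # expiry time, timezone-aware


def flag_expiring(certs, now, threshold=EXPIRY_THRESHOLD):
    """Return the certs whose remaining lifetime is below the threshold.

    Already-expired certs have a negative remaining lifetime, so they
    are flagged as well.
    """
    return [c for c in certs if c.not_after - now < threshold]
```

Passing now in explicitly, instead of reading the clock inside the function, keeps the daily scan reproducible in tests and in dry-run.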

The orchestrator

The orchestrator is code, not an LLM. It is a state machine with three states (detect, renew, rollout), and every transition is deterministic. The LLM is invoked only for non-routine cases: cert types the renewal sub-agent does not know how to renew, or hosts that did not pick up the new cert. This split keeps routine cases cheap and fast.
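
One way to picture the state machine is a fixed transition table with a single side exit for non-routine cases. The sketch below is an assumption about shape, not the real orchestrator: the State names, the NEEDS_LLM exit, and the boolean step results are all hypothetical.

```python
from enum import Enum


class State(Enum):
    DETECT = "detect"
    RENEW = "renew"
    ROLLOUT = "rollout"
    DONE = "done"
    NEEDS_LLM = "needs_llm"  # hypothetical exit for non-routine cases


# Deterministic happy path: each state has exactly one successor.
NEXT_STATE = {
    State.DETECT: State.RENEW,
    State.RENEW: State.ROLLOUT,
    State.ROLLOUT: State.DONE,
}


def advance(state, routine):
    """Move to the next state, or hand off when the step was non-routine."""
    if state not in NEXT_STATE:
        raise ValueError(f"no transition out of {state}")
    return NEXT_STATE[state] if routine else State.NEEDS_LLM


def run(step_results):
    """Walk the machine; step_results maps State -> bool (routine or not).

    A step with no recorded result is treated as non-routine.
    """
    state = State.DETECT
    visited = [state]
    while state in NEXT_STATE:
        state = advance(state, step_results.get(state, False))
        visited.append(state)
    return visited
```

Because the table is plain data, the routine path never touches the LLM; only the NEEDS_LLM branch would.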

Dry-run output

Dry-run is the safety check. Each step emits a dry-run summary: which cert, which host, what the agent would do, and the expected duration. SREs can review the summary before any action is taken, and most reviews approve in seconds. The dry-run output also serves as documentation, describing the cert renewal flow in plain language.
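
A dry-run summary could look roughly like the following. The four fields mirror the ones listed above (cert, host, action, expected duration); the DryRunSummary type, the step field, and the output layout are illustrative assumptions, not the real format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DryRunSummary:
    """Hypothetical record of what one step would do; nothing is executed."""
    step: str              # "detect", "renew", or "rollout"
    cert: str              # identifier of the cert the step would touch
    host: str              # host the step would act on
    action: str            # plain-language description of the action
    expected_seconds: int  # expected duration of the step

    def render(self) -> str:
        """Format the summary as short plain text for an SRE to review."""
        return (
            f"[DRY-RUN] step={self.step} cert={self.cert} host={self.host}\n"
            f"  would: {self.action}\n"
            f"  expected duration: {self.expected_seconds}s"
        )
```

Keeping the action as a plain-language string is what lets the same output double as documentation of the flow.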

Verification after rollout

Verification confirms the new cert reaches users. Probe the endpoint with openssl s_client and confirm that the served cert matches the expected one. Where relevant, probe from multiple network paths, since internal and external paths may serve different certs. Verify within 5 minutes of rollout, because a mis-rolled cert breaks customer connections.
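
The same check can be sketched with Python's standard ssl module instead of the openssl s_client command. This is a swapped-in technique for illustration: the function names are hypothetical, and it compares only the leaf cert by SHA-256 fingerprint, not the full chain.

```python
import hashlib
import socket
import ssl


def fingerprint(der_bytes: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded cert, as lowercase hex."""
    return hashlib.sha256(der_bytes).hexdigest()


def fetch_served_cert(host: str, port: int = 443, timeout: float = 5.0) -> bytes:
    """Connect to the endpoint and return the served leaf cert in DER form.

    The default context validates the chain and hostname, so an invalid
    cert raises ssl.SSLCertVerificationError instead of returning bytes.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)


def served_matches(served_der: bytes, expected_der: bytes) -> bool:
    """True when the served cert is byte-for-byte the expected one."""
    return fingerprint(served_der) == fingerprint(expected_der)
```

To cover multiple network paths, fetch_served_cert would be run from each vantage point (for example, one internal and one external prober) and each result compared against the cert expected on that path.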

Escalation cases

Three escalation cases need human investigation. Renewal failed: the CA rejected the request, rate-limited it, or returned an error; surface the error. Rollout failed: the new cert was issued but the server did not pick it up, which points to a configuration issue. Verification failed: the rolled-out cert is not the one being served, which may indicate a deployment issue; the agent does not auto-revert.
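
The three cases map naturally onto a small classifier that turns step outcomes into a record for a human. The sketch below is illustrative only: the Escalation and Ticket types, the boolean inputs, and the detail field are assumptions, while the fixed auto_revert=False reflects the rule that the agent does not auto-revert.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Escalation(Enum):
    RENEWAL_FAILED = "renewal_failed"            # CA rejected, rate-limited, or errored
    ROLLOUT_FAILED = "rollout_failed"            # cert issued but not picked up
    VERIFICATION_FAILED = "verification_failed"  # served cert is not the rolled-out one


@dataclass(frozen=True)
class Ticket:
    """Hypothetical escalation record handed to a human."""
    kind: Escalation
    host: str
    detail: str                # e.g. the raw CA error, surfaced as-is
    auto_revert: bool = False  # the agent never reverts on its own


def classify(host: str, renewed: bool, picked_up: bool, verified: bool,
             detail: str = "") -> Optional[Ticket]:
    """Return a Ticket for the first failing step, or None if all passed."""
    if not renewed:
        return Ticket(Escalation.RENEWAL_FAILED, host, detail)
    if not picked_up:
        return Ticket(Escalation.ROLLOUT_FAILED, host, detail)
    if not verified:
        return Ticket(Escalation.VERIFICATION_FAILED, host, detail)
    return None
```

Checking the steps in flow order means a ticket always names the earliest failure, which is where the investigation should start.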