Real Outage: An Expired Intermediate TLS Cert
A payments processor watched its leaf certs renew on schedule for years. Nobody was watching the intermediate. It expired at 04:00 and, over an incident window that ultimately ran eleven hours, broke three downstream vendors before anyone understood why.
Timeline
Anonymised composite drawing on TLS-intermediate-expiry incidents at payments-related vendors. Times in UTC.
04:00, The intermediate CA cert in the chain served by a payments processor’s API expires. The server keeps serving the now-expired chain. Most clients tolerate this: as long as path validation can still reach a trusted root, typically by fetching a fresh intermediate from the AIA URL in the leaf, they never notice the stale cert on the wire.
04:00 – 09:00, Quiet. Browser-based traffic is fine. Most server-to-server clients (their downstream vendors) are also fine because they’ve cached the previous valid intermediate from the AIA URL.
09:14, First downstream vendor reports payment failures. Their TLS library is configured to validate the entire served chain end-to-end, with no AIA fetching. The expired intermediate produces “certificate has expired” errors. About 8% of API traffic is now failing.
09:31, Second vendor reports. They run on a different language stack with a stricter validator. Their failure rate climbs from 0% to 100% over 12 minutes as their connection pool drains and reconnections start failing.
09:48, Third vendor reports. Their integration uses cert pinning and is pinned, by accident, to the intermediate rather than the leaf. They have zero working connections.
10:02, The processor’s on-call team gets paged after the third report comes through customer support. Detection: 48 minutes from the first vendor report, and just over six hours from the expiry itself.
10:18, Engineering finds the expired intermediate. The processor has a renewed intermediate sitting in their cert vault from a CA-mandated rotation that happened 6 weeks earlier. They’d staged the renewal and never deployed it.
10:34, New chain deployed to production. Most clients recover within seconds. The cert-pinned vendor needs a manual code change on their side; that takes another 4 hours.
10:58, All vendors except the cert-pinned one have recovered. Total impact for the bulk of traffic: roughly six and a half hours from intermediate expiry to recovery, and 1 hour 44 minutes of meaningful customer-facing failures.
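The failing state in this timeline, a still-valid leaf served above an expired intermediate, is easy to check for offline. A minimal sketch in Python, using the notAfter string format that ssl.getpeercert() returns; the chain data and timestamps here are invented for illustration:

```python
import ssl
import time

def chain_expiry_report(chain, now=None):
    """Given (name, notAfter-string) pairs for EVERY cert in a served
    chain, return (name, days_remaining, expired) for each. The
    notAfter strings use the format ssl.getpeercert() emits,
    e.g. 'May  9 04:00:00 2024 GMT'."""
    now = time.time() if now is None else now
    report = []
    for name, not_after in chain:
        remaining = ssl.cert_time_to_seconds(not_after) - now
        report.append((name, remaining / 86400, remaining <= 0))
    return report

# The incident in miniature: fresh leaf, expired intermediate.
chain = [
    ("leaf", "Nov  1 00:00:00 2024 GMT"),
    ("intermediate", "May  9 04:00:00 2024 GMT"),
]
# Pretend "now" is one hour after the intermediate expired.
now = ssl.cert_time_to_seconds("May  9 05:00:00 2024 GMT")
for name, days, expired in chain_expiry_report(chain, now):
    print(name, round(days, 1), "EXPIRED" if expired else "ok")
```

Monitoring only the first entry is exactly the leaf-only blind spot this incident exposed; the point of the loop is that it walks the whole chain.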
The detection lag
Six hours of serving a broken chain before anyone noticed, then 48 minutes from the first vendor report to acknowledgement. The detection failure was structural: the processor monitored leaf-cert expiry rigorously (alerts at 30, 14, 7, and 1 days) and didn’t monitor intermediate expiry at all. Intermediates were treated as “the CA’s problem”. They are not.
The deeper failure: the team didn’t have a synthetic monitor that simulated a strict TLS client. Their internal monitors used the system’s default trust store with AIA fetching enabled, which masked the expired intermediate because AIA pulled a valid one. A strict-validator probe (Java’s PKIX validator with AIA chasing left at its off-by-default setting, or plain openssl verify against the served chain, since OpenSSL performs no AIA fetching) would have alarmed at 04:00.
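Such a probe doesn’t need exotic tooling. CPython’s ssl module, for instance, performs no AIA fetching, so an ordinary Python client already behaves like the strict validator described above. A sketch; the hostname and CA-bundle path are placeholders, and the commented-out usage shows shape only:

```python
import ssl

def strict_probe_context(ca_bundle_path=None):
    """Build a TLS context that validates ONLY the chain the server
    actually sends. CPython's ssl module never chases AIA URLs, so a
    missing or expired intermediate in the served chain fails the
    handshake -- the strict-validator behaviour the monitors lacked."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_bundle_path:
        # Trust only these roots (no intermediates), so nothing cached
        # elsewhere on the probe host can mask a bad served chain.
        ctx.load_verify_locations(cafile=ca_bundle_path)
    return ctx

# Usage sketch (network call, illustrative names):
# import socket
# with socket.create_connection(("api.example.com", 443)) as sock:
#     with strict_probe_context().wrap_socket(
#             sock, server_hostname="api.example.com") as tls:
#         tls.getpeercert()  # handshake already validated the chain
```

Run on a schedule, a failed handshake here is the 04:00 alarm the team never got.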
The cascade
Three layers of vendor pain, each manifesting differently. Vendor A (Python with requests) saw partial failures because their connection pool mixed AIA-capable and non-AIA workers. Vendor B (Go, no AIA) saw a slow ramp from 0% to 100% as their long-lived connections expired and had to re-handshake. Vendor C (cert-pinned Java) saw immediate 100% failure, but only on new connections; existing TCP sessions stayed up.
The cascade was about the variety of TLS validator behaviour in the wild. Even with one bad chain on the server, three vendors saw three different failure patterns, on three different timelines. From the processor’s side it looked like scattered vendors complaining at staggered times, not a single root cause.
This is the worst kind of incident: the symptoms don’t obviously correlate, the dashboards stay green, and the only signal is a slow drip of customer reports.
What the runbook said
The TLS runbook had three sections: leaf-cert renewal (well-documented), CA migration (well-documented), and emergency cert revocation (well-documented). It had nothing on intermediate-cert lifecycle. The implicit assumption was that intermediates are stable for the lifetime of the trust relationship; that’s often true for legacy enterprise CAs and increasingly false for modern public CAs, which rotate intermediates on their own schedule.
The on-call engineer who eventually got paged spent 16 minutes confirming the leaf was valid (it was), the chain was complete (it appeared to be), and DNS was correct (it was). They didn’t check the intermediate’s validity until a seasoned engineer joined the bridge and asked “but is the intermediate still valid?”
What actually fixed it
Deploy the renewed intermediate. The team had it in their vault; rotation just hadn’t been done, because the leaf was still valid and no process tied intermediate rotation to leaf rotation. The actual deploy took 4 minutes: push the new chain and reload nginx across the fleet.
The cert-pinning vendor was a separate problem. They’d pinned the intermediate’s SHA-256 in their app config three years earlier, and that hash was now mismatched. The fix was a config push on their side; coordinating that took several hours of joint debugging.
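Why the pin broke is mechanical: the vendor stored a SHA-256 of the old intermediate’s DER bytes, and the renewed intermediate hashes to something else. A toy sketch, where short byte strings stand in for real DER certificates:

```python
import hashlib

def pin_matches(cert_der: bytes, pinned_hex: str) -> bool:
    """Compare a cert presented on the wire (DER bytes) against a
    pinned SHA-256 fingerprint -- the scheme the third vendor used."""
    return hashlib.sha256(cert_der).hexdigest() == pinned_hex.lower()

# Stand-ins for DER blobs (real ones come off the TLS handshake).
old_intermediate = b"intermediate-issued-2021"
new_intermediate = b"intermediate-issued-2024"

# The hash pinned in the vendor's app config three years earlier.
pin = hashlib.sha256(old_intermediate).hexdigest()

pin_matches(old_intermediate, pin)   # worked for years
pin_matches(new_intermediate, pin)   # every new handshake now fails
```

A leaf pin breaks on every leaf renewal and a root pin survives intermediate rotation; an intermediate pin fails exactly when someone else’s rotation schedule says so, which is why it’s the worst of the three choices.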
Action items
- Intermediate-expiry monitoring. Same alert ladder as leaf certs (30/14/7/1 days). Now monitored separately for every public-facing endpoint.
- Strict-validator synthetic probe. New synthetic monitor that hits the public API with AIA fetching disabled and full chain validation. Would have alarmed at 04:00 in this incident.
- Vendor-integration test on every cert rotation. Rotation now goes to staging, gets validated by a fleet of test clients matching each vendor’s TLS stack, then promotes to prod. Catches strict-validator and pinning issues before they hit production.
- Cert-pinning audit with major vendors. The processor reached out to all 40+ vendor integrations to ask “do you pin?” If yes, “what are you pinning, leaf, intermediate, or root?” Three more vendors discovered they’d been pinning intermediates by mistake; all migrated to leaf-pin or root-pin.
- Renewal-deploy SLA. Renewed intermediate certs in the vault must be deployed within 7 days of receipt, regardless of leaf status. The 6-week gap that bit this incident is no longer possible.
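The first action item, the same alert ladder for every cert in the chain, reduces to a small pure function. A sketch using the 30/14/7/1-day rungs from the review; dates are invented:

```python
from datetime import datetime, timedelta, timezone

ALERT_LADDER_DAYS = (30, 14, 7, 1)  # same ladder for leaf AND intermediate

def due_alerts(not_after: datetime, now: datetime) -> list:
    """Return the ladder rungs currently firing for a cert with the
    given notAfter. Every cert in the chain goes through the same
    function -- intermediates are no longer 'the CA's problem'."""
    remaining = not_after - now
    return [d for d in ALERT_LADDER_DAYS if remaining <= timedelta(days=d)]

now = datetime(2024, 5, 1, tzinfo=timezone.utc)
leaf_exp = now + timedelta(days=60)          # comfortably far out
intermediate_exp = now + timedelta(days=5)   # the case that was missed

due_alerts(leaf_exp, now)          # nothing due
due_alerts(intermediate_exp, now)  # 30-, 14-, and 7-day rungs firing
```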
The architectural change
The architectural answer was: the cert chain is a single artefact, not three independent certs. The team built a “chain-as-config” model where every leaf has an explicit chain manifest, and any change to any cert in the chain triggers the same rotation pipeline. An ageing intermediate now triggers rotation exactly as an ageing leaf does; the pipeline doesn’t distinguish between them.
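One way to make “the chain is a single artefact” concrete is to fingerprint the whole manifest, so a change to any cert, leaf, intermediate, or root, invalidates it and re-triggers the pipeline. A sketch; the manifest shape and fingerprint values are placeholders, not the team’s actual format:

```python
import hashlib
import json

def chain_fingerprint(manifest: dict) -> str:
    """One fingerprint over the WHOLE chain manifest. If any cert in
    the chain changes, the fingerprint changes, and the (hypothetical)
    rotation pipeline re-runs end to end."""
    canonical = json.dumps(manifest, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

manifest = {
    "leaf": "fingerprint-of-leaf",               # placeholder values
    "chain": ["fingerprint-of-intermediate",
              "fingerprint-of-root"],
}
before = chain_fingerprint(manifest)

# A CA-mandated intermediate rotation lands:
manifest["chain"][0] = "fingerprint-of-new-intermediate"
after = chain_fingerprint(manifest)

before != after  # intermediate change triggers the same pipeline as a leaf change
```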
The deeper change was about vendor coordination. The processor now publishes a “TLS roadmap” quarterly listing upcoming intermediate changes, root migrations, and any other chain-affecting events. Vendors get 90 days notice on anything that might break a strict validator or a pinning configuration. The 4 hours it took the cert-pinned vendor to recover was 4 hours of preventable damage; communication eliminates that class of failure.
The wiki line: “You don’t own the certificate. You own the trust relationship. The certificate is just one expression of it.”