Postmortem | Advanced
By Samson Tanimawo, PhD
Published Aug 26, 2026 | 11 min read

Real Outage: An Expired Intermediate TLS Cert

A payments processor watched its leaf certs renew on schedule for years. Nobody was watching the intermediate. It expired at 04:00, opened an incident window that ultimately ran about 11 hours, and broke three downstream vendors before anyone understood why.

Timeline

Anonymised composite drawing on TLS-intermediate-expiry incidents at payments-related vendors. Times in UTC.

04:00, The intermediate CA cert in the chain served by the payments processor’s API expires. The server keeps serving the now-expired chain. Most clients don’t notice, because path validation can still reach a trusted root: they fetch a current intermediate from the AIA URL (sketched just after the timeline) or already have one cached.

04:00 – 09:00, Quiet. Browser-based traffic is fine. Most server-to-server clients (their downstream vendors) are also fine because they’ve cached the previous valid intermediate from the AIA URL.

09:14, First downstream vendor reports payment failures. Their TLS library is configured to validate the entire chain end-to-end without AIA. The expired intermediate produces “certificate has expired” errors. About 8% of API traffic is now failing.

09:31, Second vendor reports. They run on a different language stack with a stricter validator. Their failure rate climbs from 0% to 100% over 12 minutes as their connection pool drains and reconnections start failing.

09:48, Third vendor reports. Their integration uses certificate pinning, and it was pinned, by accident, to the intermediate rather than the leaf. They have zero working connections.

10:02, The processor’s on-call team gets paged after the third report comes through customer support. Detection: 48 minutes from the first vendor report, and just over six hours from the expiry itself.

10:18, Engineering finds the expired intermediate. The processor has a renewed intermediate sitting in their cert vault from a CA-mandated rotation that happened 6 weeks earlier. They’d staged the renewal and never deployed it.

10:34, New chain deployed to production. Most clients recover within seconds. The cert-pinned vendor needs a manual code change on their side; that takes another 4 hours.

10:58, All vendors except the cert-pinned one recovered. Total impact for the bulk of traffic: roughly six and a half hours from intermediate expiry to recovery; 1 hour 44 minutes of meaningful customer-facing failures.
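
For anyone who hasn’t met AIA: the fetch that kept most clients alive during the quiet period is driven by a URL embedded in the leaf certificate’s Authority Information Access extension. A minimal sketch of pulling that URL out with the cryptography package; the leaf.pem path is a placeholder:

```python
# Sketch: read the AIA "CA Issuers" URL from a leaf certificate. This is the
# URL a lenient client fetches a fresh intermediate from, which is why most
# traffic survived the expired intermediate for hours. "leaf.pem" is a
# placeholder path; assumes the `cryptography` package.
from cryptography import x509
from cryptography.x509.oid import AuthorityInformationAccessOID, ExtensionOID

with open("leaf.pem", "rb") as f:
    leaf = x509.load_pem_x509_certificate(f.read())

aia = leaf.extensions.get_extension_for_oid(
    ExtensionOID.AUTHORITY_INFORMATION_ACCESS
).value

ca_issuer_urls = [
    desc.access_location.value
    for desc in aia
    if desc.access_method == AuthorityInformationAccessOID.CA_ISSUERS
]
print("Lenient clients fetch the intermediate from:", ca_issuer_urls)
```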

The detection lag

Nearly an hour from the first vendor report to a page, and six hours from the break itself. The detection failure was structural: the processor monitored leaf-cert expiry rigorously (alerts at 30, 14, 7, and 1 day out) and didn’t monitor intermediate expiry at all. Intermediates were treated as “the CA’s problem”. They are not.
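
One concrete version of the fix: run the same expiry thresholds against every certificate the server actually serves, not just the leaf. A rough sketch using pyOpenSSL, which exposes the full served chain; the host name and threshold are illustrative, not the processor’s real values:

```python
# Sketch: check expiry of every cert in the chain the server serves.
# Assumes the pyOpenSSL package; a pyOpenSSL Context does no verification by
# default, which is what we want here -- we need the chain even when it's broken.
import socket
from datetime import datetime, timezone

from OpenSSL import SSL

HOST, PORT = "api.example-processor.com", 443   # placeholder endpoint
WARN_WITHIN_DAYS = 30                           # first of the 30/14/7/1-day alerts

sock = socket.create_connection((HOST, PORT))
conn = SSL.Connection(SSL.Context(SSL.TLS_METHOD), sock)
conn.set_tlsext_host_name(HOST.encode())
conn.set_connect_state()
conn.do_handshake()

now = datetime.now(timezone.utc)
for cert in conn.get_peer_cert_chain():
    not_after = datetime.strptime(
        cert.get_notAfter().decode("ascii"), "%Y%m%d%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    days_left = (not_after - now).days
    name = cert.get_subject().CN
    if days_left < 0:
        print(f"ALERT: {name} has EXPIRED")
    elif days_left <= WARN_WITHIN_DAYS:
        print(f"WARN: {name} expires in {days_left} days")

sock.close()
```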

The deeper failure: the team didn’t have a synthetic monitor that simulated a strict TLS client. Their internal monitors used the system’s default trust store with AIA fetching enabled, which masked the expired-intermediate problem because AIA pulled a valid one. A strict-validator probe, one that verifies exactly the chain the server sends and never fetches a replacement intermediate (OpenSSL-based clients behave this way, and Java’s PKIX validator does too unless AIA chasing is explicitly enabled), would have alarmed at 04:00.
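
Such a probe doesn’t need an exotic TLS stack. CPython’s ssl module sits on OpenSSL, which never fetches intermediates over AIA, so a plain verified handshake already behaves like a strict client; something like the sketch below (host name and the exit-as-alert are placeholders) would have started failing, and paging, at 04:00:

```python
# Sketch: strict-validator synthetic probe -- verifies exactly the chain the
# server sends, with no AIA fetching, and treats any verification error as a page.
import socket
import ssl

def probe(host: str, port: int = 443) -> None:
    ctx = ssl.create_default_context()  # full verification, hostname check, no AIA
    try:
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                pass  # handshake completed: the served chain verified cleanly
    except ssl.SSLCertVerificationError as exc:
        # e.g. verify_message == "certificate has expired" -- page immediately
        raise SystemExit(f"TLS chain verification failed for {host}: {exc.verify_message}")

probe("api.example-processor.com")
```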

The cascade

Three layers of vendor pain, each manifesting differently. Vendor A (Python with requests) saw partial failures because only part of their worker fleet could still build a valid path; the rest choked on the served chain. Vendor B (Go, no AIA) saw a slow ramp from 0% to 100% as their long-lived connections naturally re-handshook. Vendor C (cert-pinned Java) saw immediate 100% failure, but only on new connections; existing TCP sessions stayed up.

The cascade came down to the variety of TLS validator behaviour in the wild. Even with one bad chain on the server, three vendors saw three different failure patterns on three different timelines. From the processor’s side it looked like “a third of vendors complaining at staggered times”, not like a single root cause.

This is the worst kind of incident: the symptoms don’t obviously correlate, the dashboards stay green, and the only signal is a slow drip of customer reports.

What the runbook said

The TLS runbook had three sections: leaf-cert renewal (well-documented), CA migration (well-documented), and emergency cert revocation (well-documented). It had nothing on intermediate-cert lifecycle. The implicit assumption was that intermediates are stable for the lifetime of the trust relationship; that’s often true for legacy enterprise CAs and essentially never true for modern public CAs, which rotate intermediates on their own schedule.

The on-call who eventually got paged spent 16 minutes confirming the leaf was valid (it was), the chain was complete (it appeared to be), and DNS was correct (it was). They didn’t check the intermediate’s validity until a seasoned engineer joined the bridge and asked “but is the intermediate still valid?”

What actually fixed it

Deploy the renewed intermediate. The team had it in their vault; rotation just hadn’t been done because the leaf was still valid and they had no process tying intermediate rotation to leaf rotation. The actual deploy took 4 minutes: reload nginx with the new chain across the fleet.
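
A cheap guard that makes that four-minute deploy safer: sanity-check the assembled fullchain before pointing nginx at it. A sketch using a recent version of the cryptography package; the file path is a placeholder:

```python
# Sketch: verify a fullchain bundle (leaf first, then intermediates) links up
# and contains nothing expired before reloading nginx against it.
import sys
from datetime import datetime, timezone

from cryptography import x509

def check_fullchain(path: str) -> None:
    with open(path, "rb") as f:
        chain = x509.load_pem_x509_certificates(f.read())
    now = datetime.now(timezone.utc)
    for i, cert in enumerate(chain):
        name = cert.subject.rfc4514_string()
        if cert.not_valid_after_utc < now:
            sys.exit(f"cert {i} ({name}) is expired")
        if i + 1 < len(chain) and cert.issuer != chain[i + 1].subject:
            sys.exit(f"cert {i} ({name}) is not issued by cert {i + 1}")
    print(f"{path}: {len(chain)} certs, chain links up, nothing expired")

check_fullchain("/etc/nginx/ssl/fullchain.pem")  # placeholder path
```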

The cert-pinning vendor was a separate problem. They’d pinned the intermediate’s SHA-256 in their app config three years earlier, and that hash was now mismatched. The fix was a config push on their side; coordinating that took several hours of joint debugging.
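
When a client insists on pinning, the usual advice is to pin a key the operator can actually plan around, typically the leaf’s SPKI hash plus a backup pin, rather than the hash of a certificate deep in the chain. A hypothetical sketch of computing an HPKP-style SPKI pin (the host name is a placeholder):

```python
# Sketch: compute the SPKI SHA-256 pin of a server's leaf certificate.
# Pinning the leaf's public key (with a backup pin for the next key) avoids
# the "pinned the intermediate by accident" failure mode.
import base64
import hashlib
import ssl

from cryptography import x509
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

pem = ssl.get_server_certificate(("api.example-processor.com", 443))
leaf = x509.load_pem_x509_certificate(pem.encode())

spki = leaf.public_key().public_bytes(Encoding.DER, PublicFormat.SubjectPublicKeyInfo)
pin = base64.b64encode(hashlib.sha256(spki).digest()).decode()
print(f'pin-sha256="{pin}"')
```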

Action items

The architectural change

The architectural answer was: the cert chain is a single artefact, not three independent certs. The team built a “chain-as-config” model where every leaf has an explicit chain manifest, and any change to any cert in the chain triggers the same rotation pipeline. An ageing intermediate now triggers rotation exactly as an ageing leaf does; the pipeline doesn’t distinguish between them.
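
A toy illustration of the idea rather than the processor’s actual pipeline: the rotation decision reads a chain manifest and keys off the soonest expiry anywhere in it, so an ageing intermediate trips exactly the same wire as an ageing leaf (the paths and threshold are invented):

```python
# Sketch of "chain-as-config": rotate when ANY cert in the manifest is close
# to expiry, so intermediates and leaves go through one pipeline.
from datetime import datetime, timezone

from cryptography import x509

CHAIN_MANIFEST = [
    "/etc/pki/payments-api/leaf.pem",          # invented paths
    "/etc/pki/payments-api/intermediate.pem",
]
ROTATE_WITHIN_DAYS = 30

def days_to_expiry(path: str) -> int:
    with open(path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    return (cert.not_valid_after_utc - datetime.now(timezone.utc)).days

def needs_rotation() -> bool:
    # The chain rotates as one artefact: the soonest expiry anywhere wins.
    return min(days_to_expiry(p) for p in CHAIN_MANIFEST) <= ROTATE_WITHIN_DAYS

if needs_rotation():
    print("trigger rotation pipeline for the whole chain")
```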

The deeper change was about vendor coordination. The processor now publishes a “TLS roadmap” quarterly listing upcoming intermediate changes, root migrations, and any other chain-affecting events. Vendors get 90 days notice on anything that might break a strict validator or a pinning configuration. The 4 hours it took the cert-pinned vendor to recover was 4 hours of preventable damage; communication eliminates that class of failure.

The wiki line: “You don’t own the certificate. You own the trust relationship. The certificate is just one expression of it.”