HTTP Status Codes (Incident Edition)
Not every code; just the ones that wake you up. What each one usually means, what's actually broken upstream, and the first thing to check before you escalate.
4xx, client errors that aren't really
4xx blames the caller. In incidents, the caller is almost always one of your own services. The first move is always: was this working an hour ago?
- 400 Bad Request, malformed input. Check first: recent client deploy. New schema validation rejecting the old payload shape is the classic.
- 404 Not Found, route doesn't exist. Check first: ingress / load balancer rules. Path-prefix mismatch after a routing change beats "the app is broken" by 10:1.
- 405 Method Not Allowed, route exists, method doesn't. Check first: CORS preflight; browsers send OPTIONS and your handler may not respond (sketch after this list).
- 409 Conflict, concurrent write or duplicate key. Check first: retries hitting the same idempotency key from two processes; bad clock sync.
- 410 Gone, resource was deliberately removed. Different from 404; the server knows it used to exist. Watch for these on deprecation rollouts.
- 422 Unprocessable Entity, syntactically fine, semantically rejected. Check first: schema validation (e.g., a required field that's now nullable upstream).
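The preflight failure mode is easy to reproduce: a handler that implements GET but not OPTIONS rejects the browser's preflight before your real request ever runs. A minimal sketch with Python's standard library; the port, path, and allowed origin are illustrative, not anyone's production config:

```python
# Minimal sketch: answer the CORS preflight so the browser's OPTIONS
# request doesn't come back 405/501. Origin and port are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_OPTIONS(self):
        # Without this method, BaseHTTPRequestHandler rejects the
        # preflight outright (most frameworks answer 405 instead).
        self.send_response(204)
        self.send_header("Access-Control-Allow-Origin", "https://app.example.com")
        self.send_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
        self.send_header("Access-Control-Allow-Headers", "Content-Type, Authorization")
        self.end_headers()

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Access-Control-Allow-Origin", "https://app.example.com")
        self.end_headers()
        self.wfile.write(b'{"ok": true}')

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```

Many frameworks answer OPTIONS for you; the 405s appear when a raw handler or a strict method router sits in front.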
4xx, auth & permissions
Auth incidents look like 401 or 403 spikes. Two distinct codes; they mean different things; treat them differently.
- 401 Unauthorized, "I don't know who you are." Check first: token issuer health (auth service, IdP). Check JWT expiry config didn't shrink in a recent deploy.
- 403 Forbidden, "I know who you are; you're not allowed." Check first: RBAC / IAM policy change, role rebinding, recent permission revoke. 403 spikes after deploys are often missing-scope bugs.
- 407 Proxy Authentication Required, corporate proxy missing creds. Check first: outbound HTTP proxy rotated credentials. Common in restrictive networks.
- 418 I'm a Teapot, born in an April Fools RFC but real in the wild; some rate-limit middleware uses it as a "polite reject". Read the response body to disambiguate.
- If 401 affects only one path/service, suspect a token audience mismatch (decode sketch after this list). If it's everything, the IdP is sick.
- If 403 jumped to 100% on a single endpoint, you shipped a permission check bug. Roll back, file the bug.
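To test the audience-mismatch theory in seconds, decode the failing token's claims without verifying it. This is for diagnosis only, never for auth decisions. A standard-library sketch; pass the bearer token as the first argument:

```python
# Decode a JWT's payload to inspect aud/iss/exp during a 401 spike.
# Diagnosis only: this does NOT verify the signature.
import base64, json, sys, time

def jwt_claims(token: str) -> dict:
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))

claims = jwt_claims(sys.argv[1])
print("aud:", claims.get("aud"))  # should match the service you're calling
print("iss:", claims.get("iss"))
print("expired:", claims.get("exp", 0) < time.time())
```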
4xx, rate limits & oversize
- 413 Payload Too Large, upload exceeded server limit. Check first: nginx client_max_body_size, ingress annotations, app-level body limits. Often a recent default change.
- 414 URI Too Long, rare but real. Caller is putting a giant blob in the query string. Move it to a POST body.
- 429 Too Many Requests, rate limit. Check first: caller deploy that increased fanout, missing backoff, dropped Redis cache forcing recompute.
- 431 Request Header Fields Too Large, massive cookie or header. Check first: session bloat, oversized cookies from a new feature.
- The Retry-After header on 429 is the contract. If your client ignores it, you're the problem (client sketch after this list).
- For 429 floods, look at the rate limit scope (per-IP, per-user, per-route, global). The global limiter is the one that makes prod look down to everyone.
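If you own the caller in a 429 flood, honoring Retry-After is the fix before any cleverer backoff. A sketch using the requests library; the URL, attempt cap, and fallback delays are illustrative:

```python
# Retry loop that respects Retry-After on 429 instead of hammering.
# URL, attempt cap, and fallback delay are illustrative choices.
import time
import requests

def get_with_retry(url: str, attempts: int = 5) -> requests.Response:
    for attempt in range(attempts):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429:
            return resp
        # Retry-After is usually delay-seconds; this sketch handles that
        # form and falls back to exponential backoff when the header is
        # absent or unparseable (it can also be an HTTP-date).
        header = resp.headers.get("Retry-After")
        try:
            delay = float(header)
        except (TypeError, ValueError):
            delay = 2 ** attempt
        time.sleep(delay)
    return resp

r = get_with_retry("https://api.example.com/v1/things")
print(r.status_code)
```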
5xx, the obvious ones
5xx is your fault. The differential between 500, 502, 503, and 504 is what tells you which layer is failing.
- 500 Internal Server Error, unhandled exception in your code. Check first: logs of the failing pod for the stack trace; kubectl logs --previous if it crashed.
- 501 Not Implemented, rare. The route is stubbed. Usually a deploy that exposed a route the new code hasn't built yet.
- 503 Service Unavailable, "I exist but I can't serve right now." Check first: circuit breaker open, readiness probes failing, health check failing. Not the app being down (that's 502 from the proxy).
- 507 Insufficient Storage, disk full. WebDAV-flavored but real. Check first: the volume the app writes to. Logs eating the disk is the classic.
- 500s with mixed shapes (some succeed, some fail) point to a bad pod in the rollout. Check rollout status (per-pod tally sketch after this list).
- 500s aligned with a deploy timestamp = you broke prod. Roll back first, debug second.
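To find the bad pod, tally 5xx counts per pod from your access logs. A sketch that assumes JSON log lines with pod and status fields on stdin; adapt the field names to your log schema:

```python
# Tally 5xx counts per pod to find a bad instance in a rollout.
# Assumes JSON log lines with "pod" and "status" fields; adapt
# the field names to whatever your logs actually emit.
import json
import sys
from collections import Counter

errors = Counter()
totals = Counter()
for line in sys.stdin:
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip non-JSON lines
    pod, status = rec.get("pod"), int(rec.get("status", 0))
    totals[pod] += 1
    if status >= 500:
        errors[pod] += 1

for pod, n in errors.most_common():
    print(f"{pod}: {n}/{totals[pod]} 5xx")
```

One pod carrying nearly all the 5xx while its siblings are clean is the rollout-victim signature; all pods failing evenly points back at a shared dependency instead.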
5xx, proxy & gateway
If you're behind nginx, ELB, Cloudflare, or any service mesh, the 5xx codes mean very specific things about where the failure is.
- 502 Bad Gateway, proxy reached upstream and got garbage. Check first: upstream pod is alive but returning malformed responses, or upstream crashed mid-response.
- 503 Service Unavailable from the proxy, no upstreams available. Check first: all upstream pods unhealthy, no endpoints registered, all backends drained.
- 504 Gateway Timeout, upstream took too long. Check first: upstream latency, recent timeout config change, a slow DB query behind the upstream.
- 508 Loop Detected, WebDAV-defined but repurposed by some stacks for a redirect or proxy loop. Check first: ingress rule that points back at itself; an X-Forwarded-* header check that's pinging the wrong service.
- 502 vs 504, 502 is "bad data". 504 is "no data in time". Treat them with different runbooks.
- Mass 502s after a deploy = pods are starting but your readiness probe lies. Tighten the readiness check.
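Tightening the readiness check means the probe exercises a real dependency instead of returning 200 unconditionally. A minimal sketch; check_db is a hypothetical stand-in for whatever your service actually needs before taking traffic:

```python
# Readiness endpoint that only reports ready once dependencies work.
# check_db() is a hypothetical placeholder: swap in a real connection
# ping (DB, cache, downstream service) for your stack.
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_db() -> bool:
    # Stand-in: replace with a real dependency check.
    return True

class Probe(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/readyz" and check_db():
            self.send_response(200)
        else:
            # 503 tells the proxy/kubelet to keep traffic away.
            self.send_response(503)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8081), Probe).serve_forever()
```

Point the load balancer's or kubelet's readiness probe at /readyz so a pod only receives traffic once it can really serve.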
The weird ones you'll still see
- 0 / connection reset, not really an HTTP code. The TCP connection died before headers came back. Check first: intermediate firewall, idle-connection killer, long requests crossing a 60-second LB timeout (classification sketch at the end of this list).
- 525 SSL Handshake Failed (Cloudflare), backend cert expired or chain broken. Check first: cert renewal job.
- 526 Invalid SSL Certificate (Cloudflare), backend cert isn't trusted. Often cert + intermediate not concatenated correctly.
- 521 Web Server Down (Cloudflare), CF can't open a connection at all. Origin is hard-down or firewall changed.
- 499 Client Closed Request (nginx), client gave up before we responded. Indicates slow upstream; not really a "client error" in the usual sense.
- Any code that doesn't appear in RFCs is a vendor extension. Look it up in that vendor's docs, not in the IETF spec.
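At the client, it helps to classify "no status at all" separately from real HTTP errors before anyone pattern-matches on the wrong runbook. A standard-library sketch; the URL and timeout are placeholders:

```python
# Distinguish "the server answered with an error" from "the connection
# died before headers came back" (the status-0 case). URL is a placeholder.
import sys
import urllib.error
import urllib.request

url = "https://api.example.com/slow"
try:
    with urllib.request.urlopen(url, timeout=70) as resp:
        print("HTTP", resp.status)
except urllib.error.HTTPError as e:
    print("HTTP", e.code)  # a real status code: use the runbook for it
except (urllib.error.URLError, TimeoutError) as e:
    # No status at all: the connection died before headers came back.
    # Suspect LB idle timeouts, firewalls, or the upstream dying mid-response.
    print("connection died:", getattr(e, "reason", e), file=sys.stderr)
```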