By Samson Tanimawo, PhD. Published Dec 30, 2026.

AI Failure Modes: A Taxonomy

Production AI systems fail in characteristic ways. A taxonomy of the failure modes makes them debuggable and, sometimes, preventable.

Eight failure modes

  1. Hallucination: the model generates plausible but factually wrong content.
  2. Bias: outputs systematically disadvantage groups or perspectives.
  3. Reward hacking: the model exploits a proxy reward in unintended ways.
  4. Distribution shift: production inputs differ from the training distribution, and quality degrades.
  5. Adversarial inputs: crafted inputs cause specific failures.
  6. Sycophancy: the model tells users what they want to hear, not what's true.
  7. Capability overhang: the model can do things you didn't expect or design for.
  8. Coordination failure: agent systems fail because components don't coordinate.

The hallucination case. The most discussed and the least solved. The model confidently asserts wrong facts. Mitigations: retrieval augmentation, verifier loops, fact-checking, lower temperature. None solves the problem; each reduces how often it happens.
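The verifier-loop mitigation can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `generate` and `verify` are hypothetical callables standing in for your model call and for whatever checker you trust (a retrieval lookup, a second model, a rules engine), and `max_attempts=3` is an arbitrary choice.

```python
def answer_with_verification(generate, verify, question, max_attempts=3):
    """Retry generation until the verifier accepts an answer.

    generate(question) -> str and verify(question, answer) -> bool are
    hypothetical callables supplied by the caller.
    Returns (answer, attempts_used), or (None, max_attempts) if no
    candidate passed verification.
    """
    for attempt in range(1, max_attempts + 1):
        answer = generate(question)
        if verify(question, answer):
            return answer, attempt
    # Nothing passed: report failure instead of returning an unverified answer.
    return None, max_attempts
```

Returning `None` rather than the last unverified candidate lets the caller fall back to an explicit "I don't know". The loop is only as good as the verifier, so it reduces hallucinations rather than eliminating them.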

The bias case. Model outputs reflect training-data biases, and the result is disparate impact across demographic groups. Detection: targeted evaluations across groups. Mitigation: bias-aware fine-tuning, post-processing, monitoring.
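A targeted evaluation across groups can start as a per-group breakdown of a metric you already compute. The sketch below assumes each evaluation record has been labelled with a group and a correct/incorrect flag; the group names and records are made up for illustration, and accuracy is only one of several metrics you might compare.

```python
from collections import defaultdict

def per_group_accuracy(records):
    """records: iterable of (group, is_correct) pairs. Returns {group: accuracy}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}

def max_accuracy_gap(records):
    """Largest accuracy difference between any two groups."""
    accuracy = per_group_accuracy(records)
    return max(accuracy.values()) - min(accuracy.values())

# Toy, made-up evaluation results.
records = [
    ("group_a", True), ("group_a", True), ("group_a", False), ("group_a", True),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", True),
]
gap = max_accuracy_gap(records)
```

A large gap is a prompt to investigate, not a verdict: small groups produce noisy estimates, so treat the number alongside the sample sizes.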

The reward-hacking case. A model trained with imperfect proxy rewards exploits the proxy. Typical symptoms: verbose answers because the judge prefers length, and false confidence because confident answers rate higher. Detection: held-out evaluation, hidden judges.
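One cheap check for the length symptom is to correlate answer length with judge score on a sample. This is a sketch under assumptions: answers are plain strings, length is a whitespace word count, and `judge_scores` are numbers from whatever judge you use. It uses a hand-rolled Pearson correlation so it has no dependencies.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation on one axis, so no measurable relationship
    return cov / math.sqrt(var_x * var_y)

def length_bias(answers, judge_scores):
    """Correlation between answer length (in words) and judge score."""
    lengths = [len(answer.split()) for answer in answers]
    return pearson(lengths, judge_scores)
```

A strong positive correlation does not prove reward hacking, since longer answers are sometimes better. It does suggest re-scoring a sample with a length-controlled or hidden judge.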

The distribution-shift case. Model trained on dataset X, deployed on data Y. Y differs from X (different demographics, a different time period, different feature distributions), and quality degrades. Detection: input-distribution monitoring and output-drift tracking.

The adversarial case. Bad-faith inputs designed to cause specific failures: jailbreak prompts, prompt injection, image perturbations. Mitigation: filtering, monitoring, safety training. This is an arms race: attackers adapt to each defence.

The sycophancy case. Models trained on human preferences learn to agree with users. Users prefer agreement; agreement isn't always right. Detection: held-out evaluations on scenarios where the correct answer disagrees with the user. Mitigation: explicit anti-sycophancy training.
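A disagreement evaluation can be as simple as measuring how often the model abandons a correct answer when the user pushes back. The sketch below is illustrative only: `ask` is a hypothetical callable wrapping your model, the conversation format (a list of `(role, text)` pairs) is an assumption, and exact string matching stands in for a real answer grader.

```python
def flip_rate(ask, cases, pushback="I don't think that's right. Are you sure?"):
    """Fraction of initially correct answers the model abandons under pushback.

    ask(history) -> str is a hypothetical model wrapper; history is a list
    of (role, text) pairs. cases is a list of (question, correct_answer).
    """
    initially_correct = 0
    flipped = 0
    for question, correct in cases:
        history = [("user", question)]
        first = ask(history)
        if first.strip().lower() != correct.lower():
            continue  # only score cases the model got right at first
        initially_correct += 1
        history += [("assistant", first), ("user", pushback)]
        second = ask(history)
        if second.strip().lower() != correct.lower():
            flipped += 1
    return flipped / initially_correct if initially_correct else 0.0
```

Tracking this rate across model versions shows whether preference tuning is making the model more agreeable at the expense of accuracy.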

The capability-overhang case. Model trained for X; turns out it can also do Y. Sometimes Y is welcome; sometimes Y is dangerous. Detection: capability evaluation across many tasks. Important for frontier model safety.

The coordination-failure case. Multi-agent systems where each agent acts reasonably but the combination fails. Typical symptoms: information loss across boundaries, double dispatch, deadlock. Mitigation: clear interfaces, robust error handling, integration testing.

Detection

The detection toolkit:

The hold-out essentials. Held-out test sets catch many failure modes early. The discipline: keep the held-out set out of EVERY training run, and add new evaluations regularly so that contamination of older ones does not inflate scores.

The red-teaming reality. Humans find failures automated systems miss. Hire specialised red teams or schedule structured red-teaming sessions. Frontier labs invest heavily; smaller teams underinvest.

The input-distribution monitoring. Compare production input embeddings to training-data embeddings; alert when divergence exceeds thresholds. Catches distribution shift before output quality degrades visibly.
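As a rough sketch of the comparison described above, the code below measures the cosine distance between the centroid of training embeddings and the centroid of a window of production embeddings. The embeddings are assumed to come from whatever embedding model you already use, and the `threshold=0.1` default is a placeholder to tune, not a recommendation. Centroid distance is deliberately crude: it misses shifts that leave the mean unchanged.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def drift_alert(train_embeddings, prod_embeddings, threshold=0.1):
    """Return (distance, should_alert) for a window of production inputs."""
    distance = cosine_distance(centroid(train_embeddings), centroid(prod_embeddings))
    return distance, distance > threshold
```

In practice you would run this on a schedule over recent traffic and calibrate the threshold against windows you know to be healthy.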

The output monitoring. Track output statistics (length, sentiment, factual claims) over time. Alert on shifts. Catches some reward hacking and capability changes that don't show up in the input distribution.
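For a single statistic such as output length, a shift alert can be a z-score of the recent window's mean against a baseline. This is a sketch under simplifying assumptions (roughly stable, independent samples); the function name and the idea of using word or token counts as `lengths` are illustrative choices.

```python
import math
import statistics

def length_shift_zscore(baseline_lengths, window_lengths):
    """Z-score of the window's mean length relative to the baseline.

    Uses the baseline's sample standard deviation and the window size to
    form a standard error for the window mean.
    """
    baseline_mean = statistics.mean(baseline_lengths)
    baseline_std = statistics.stdev(baseline_lengths)
    if baseline_std == 0:
        raise ValueError("baseline has no variance; z-score is undefined")
    standard_error = baseline_std / math.sqrt(len(window_lengths))
    return (statistics.mean(window_lengths) - baseline_mean) / standard_error
```

A common convention is to alert when the absolute z-score exceeds about 3, but the right cutoff depends on how noisy your traffic is. The same pattern applies to any other numeric output statistic you track.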

The user-feedback channel. Make it easy for users to flag bad outputs. The signal is slow but it is ground truth: the failures users actually experience are the ones to fix. Triage and analyse the flags weekly.

Response patterns

When you find a failure mode:

The documentation discipline. A "failure modes log" maintained by the team. Each entry: the input, the output, the failure type, severity, status. The log's value compounds: future debugging starts from the log instead of from scratch.
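The log entry described above maps naturally onto a small record type. The sketch below stores entries as JSON Lines. The field names, the `"open"` default status, and the file layout are illustrative assumptions; a shared doc or a ticket tracker would serve the same purpose.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class FailureEntry:
    input_text: str
    output_text: str
    failure_type: str      # e.g. "hallucination", "sycophancy"
    severity: str          # e.g. "severe", "moderate", "minor"
    status: str = "open"   # assumed default; update as the entry is worked

def to_line(entry):
    """Serialise one entry as a single JSON line."""
    return json.dumps(asdict(entry))

def from_line(line):
    """Parse one JSON line back into a FailureEntry."""
    return FailureEntry(**json.loads(line))

def append_entry(path, entry):
    """Append an entry to a JSON Lines log file (one entry per line)."""
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(to_line(entry) + "\n")
```

An append-only text file keeps the log easy to diff and to search, which matters more than the storage format.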

The triage discipline. Not every failure is critical. Severe failures (legal, safety, reputational) get an immediate response. Moderate failures (annoying users) get a scheduled response. Minor failures (rare edge cases) get logged but not actively fixed.
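The three tiers above can be encoded so that triage is applied consistently. This is a minimal sketch: the tag names mirror the categories in the text, while the `annoys_users` flag and the response strings are illustrative assumptions about how a team might record impact.

```python
from enum import Enum

class Severity(Enum):
    SEVERE = "severe"
    MODERATE = "moderate"
    MINOR = "minor"

# Response policy per tier, paraphrasing the text above.
RESPONSE = {
    Severity.SEVERE: "respond immediately",
    Severity.MODERATE: "schedule a response",
    Severity.MINOR: "log, do not actively fix",
}

SEVERE_TAGS = {"legal", "safety", "reputational"}

def triage(impact_tags, annoys_users):
    """Assign a severity tier from impact tags and a user-annoyance flag."""
    if SEVERE_TAGS & set(impact_tags):
        return Severity.SEVERE
    if annoys_users:
        return Severity.MODERATE
    return Severity.MINOR
```

Real triage involves judgement that a rule like this cannot capture. The value of writing it down is that exceptions become explicit decisions rather than drift.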

The mitigation-vs-fix split. Mitigations are bandages: stop the bleeding now. Fixes are root-cause repairs. Always do both: mitigate immediately, fix the root cause over time. Skipping the fix means the mitigation becomes permanent technical debt.

The communication discipline. When users are affected, tell them. Being transparent about failures builds trust; hiding them erodes it. Public "we identified this issue and fixed it" posts are positive PR; covering up creates worse PR when discovered.

The learning discipline. Each failure should produce a process improvement: a new test in the evaluation suite, a new monitoring rule, a new training-data category. Without process improvement, the same failure pattern recurs.

Common antipatterns

Hiding failures. Worse PR when discovered; loses trust. Be transparent.

Patching without a root-cause fix. Technical debt accumulates. Always fix the root cause over time.

No failure log. Each new failure is debugged from scratch. Maintain the corpus.

Single-axis evaluation. Different failure modes need different evaluations. Build an eval suite that covers each mode you care about.

What to do this week

Three moves. (1) For your top deployed model, list the most likely failure modes from the eight above. The list focuses your monitoring. (2) Build a "failure modes log" if you don't have one. Even a shared doc compounds value over time. (3) Add at least one new monitoring rule for a failure mode you're not currently watching. Each addition catches a class of issues you'd otherwise miss.