AI & ML Advanced By Samson Tanimawo, PhD Published Aug 6, 2026 6 min read

Emergent Capabilities: Real or Mirage?

Some abilities seem to appear suddenly as models scale up. Are they really emergent or is the metric fooling us? The honest 2026 answer.

The original claim

In 2022, the "emergent abilities" paper argued that large language models gain qualitatively new capabilities at certain scales, capabilities that aren't present at smaller scales and appear suddenly. The headline examples: arithmetic, multi-step reasoning, instruction following. The implication was profound: scale doesn't just make models better at the same tasks; it unlocks new tasks.

Why the claim landed. Before 2020, the trajectory of AI was incremental: models got better at translation, classification, and summarisation by predictable margins as scale grew. The 2020-2022 generation broke that pattern. GPT-3 could write code; PaLM could explain jokes; Chinchilla could reason about counterfactuals. None of these capabilities was trained for directly. They emerged from scale alone.

The implications for forecasting. If capabilities emerge at unpredictable scales, then progress becomes very hard to plan. A team estimating "we need scale X for capability Y" might be wrong by 10x. Safety researchers can't anticipate dangerous capabilities; product teams can't time launches. The unpredictability itself was the disturbing part of the original claim.

The cultural moment. The emergence framing matched what users were experiencing: ChatGPT felt qualitatively different from earlier chatbots, not just quantitatively better. The paper gave academic legitimacy to the felt sense of discontinuous progress. Whether or not the science was airtight, the framing took hold.

The measurement-artefact rebuttal

A 2023 follow-up showed that many "emergent" capabilities looked emergent only because of harsh metrics (exact-match accuracy). Switch to softer metrics (token-level probability, partial credit) and the curves smooth out. Capabilities improve gradually with scale; they only LOOK sudden when measured by all-or-nothing checks.

The intuition. Imagine a multi-step arithmetic problem. A small model gets some intermediate steps right but errors propagate; the final answer is wrong. Exact-match scores it 0. As the model scales, intermediate-step accuracy improves smoothly. Eventually intermediate accuracy is high enough that error doesn't propagate; the final answer becomes right. Exact-match flips from 0 to 1 in a narrow scale range. The underlying capability grew smoothly; the metric made the growth look sudden.
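A toy calculation makes the intuition concrete. The sketch below assumes, purely for illustration, that a problem has 10 steps and that each step succeeds independently with the same probability. Neither assumption comes from the 2023 paper, and the per-step accuracies are invented stand-ins for models of increasing scale.

```python
def exact_match_rate(step_accuracy: float, num_steps: int) -> float:
    """Chance that every step is right, assuming steps succeed independently."""
    return step_accuracy ** num_steps


# Hypothetical per-step accuracies, one per model size (smallest first).
STEP_ACCURACIES = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
NUM_STEPS = 10  # assumed problem length

for p in STEP_ACCURACIES:
    rate = exact_match_rate(p, NUM_STEPS)
    print(f"per-step accuracy {p:.2f} -> exact-match rate {rate:.3f}")
```

Under these assumptions the per-step column climbs in even increments while the exact-match column stays near zero for most of the range and then rises steeply, which is the shape the rebuttal describes.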

The implications for the original framing. If the discontinuity is in the metric, not the capability, then forecasting is more tractable than the emergence framing suggested. Capabilities improve smoothly with scale; they're predictable from training-loss curves. The "we don't know what scale unlocks dangerous capabilities" narrative weakens.

The unresolved cases. Not every emergent capability is metric-induced. Some capabilities (in-context learning quality, chain-of-thought benefit) really do appear with phase-transition character even on smooth metrics. The artefact rebuttal explains many but not all examples; the truth is probably "some emergent, some artefact" rather than "all artefact" or "all emergent".

Capabilities that look genuinely emergent

Three capabilities stand out: in-context learning at non-trivial complexity, chain-of-thought (CoT) benefit (smaller models gain little from CoT; larger models gain a lot), and the use of tools when offered. These look phase-transition-like even on smooth metrics: at small scale the capability isn't there at all, at large scale it works well, and the transition zone between the two is narrow.

The in-context-learning case. Below ~1B parameters, providing few-shot examples in the prompt barely helps. Above ~10B, examples work well. In between is a transition zone where examples help inconsistently. The capability isn't improving smoothly; it appears in a narrow scale window.

The CoT case. CoT prompting helps large models substantially but doesn't help small models at all (sometimes hurts). The capability is "use intermediate reasoning to get a better answer". Small models lack the capability; CoT prompting gives them more rope but they hang themselves with it. Above ~50B parameters, CoT becomes consistently helpful. Below, it's noise or worse.

The tool-use case. Below a threshold scale, providing tool-use examples doesn't translate into the model using tools effectively. Above the threshold, the model uses tools robustly across novel scenarios. The transition is fast, narrower than typical capability ramps.

The honest framing. "Some capabilities show smooth scale curves; others show sharp transitions even on smooth metrics." Both are true. The original "everything emerges suddenly" claim was too strong; the rebuttal "everything is smooth" is also too strong.

Why this matters for forecasting

For safety: if capabilities can emerge sharply, you must red-team capability bands you haven't yet trained. For product: don't expect specific capabilities until you've measured them in your size class. The honest framing, "some capabilities are smooth, some are step-shaped, all are surprising", is more useful than picking a side.

The safety implication. A frontier lab training a model 3x larger than the previous generation can't assume "capability X scales linearly from where we were". It must red-team for capabilities that might emerge in the new size class. The protocol: train; benchmark a wide capability suite (not just the capabilities you care about); look for unexpected sharp gains; address them before deployment.
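The "look for unexpected sharp gains" step can be roughed out in a few lines. This is a minimal sketch, not a real red-teaming tool: the function name, the benchmark names, the scores, and the 0.10 tolerance are all hypothetical, and the straight-line extrapolation from the last two generations is a deliberately naive baseline.

```python
def flag_sharp_gains(history, new_scores, tolerance=0.10):
    """Return benchmarks where the new model beats a naive linear forecast.

    history: benchmark name -> scores of previous generations, oldest first.
    new_scores: benchmark name -> score of the newly trained model.
    tolerance: how far above the forecast counts as a surprise (assumed value).
    """
    flagged = []
    for name, past in history.items():
        if len(past) < 2 or name not in new_scores:
            continue  # not enough history to extrapolate
        expected = past[-1] + (past[-1] - past[-2])  # repeat the last gain
        surprise = new_scores[name] - expected
        if surprise > tolerance:
            flagged.append((name, round(surprise, 3)))
    # Largest surprises first, so reviewers see them at the top.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Invented scores for two hypothetical benchmarks.
history = {"summarisation": [0.60, 0.65], "tool_use": [0.02, 0.03]}
new_scores = {"summarisation": 0.70, "tool_use": 0.45}
flagged = flag_sharp_gains(history, new_scores)
print(flagged)
```

With these invented numbers only the tool-use benchmark is flagged, because summarisation lands where the previous trend pointed. Anything flagged is a candidate for closer red-teaming, not proof of emergence.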

The product implication. A team picking a model size for their use case shouldn't assume "next-size-up will have capability Y" without testing. Capabilities in your specific use case might be in the smooth regime (where size matters but each step gives small gains) or the step regime (where you need to cross a threshold). Test in the size class you intend to use.

The forecasting implication. Long-term capability forecasting is hard because we can't model phase transitions in advance. Short-term forecasting (next training run) is more tractable; you can extrapolate from training loss curves and benchmark suites. The middle ground (3-12 months out) is the genuinely hard zone where capability emergence creates uncertainty.
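For the short-term case, extrapolating a loss curve can be sketched as a straight-line fit in log-log space. The example assumes loss follows a pure power law in compute, which is a simplification (published scaling laws usually add an irreducible-loss term), and the data points are synthetic. It forecasts loss only; it says nothing about when a downstream capability will cross a threshold.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss ~ a * compute**(-b) by least squares on log(compute), log(loss)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(a, b, compute):
    """Loss predicted by the fitted power law at a given compute budget."""
    return a * compute ** (-b)


# Synthetic runs generated from a known power law (a=10, b=0.05).
past_compute = [1e18, 1e19, 1e20, 1e21]
past_loss = [10 * c ** -0.05 for c in past_compute]

a, b = fit_power_law(past_compute, past_loss)
next_run_loss = predict_loss(a, b, 1e22)
print(f"fitted a={a:.3f}, b={b:.3f}, forecast loss at 1e22: {next_run_loss:.3f}")
```

On real runs the fit will be noisier, and the further the target compute sits from the fitted points, the less the forecast deserves trust.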

Common antipatterns

Treating "emergent" as magic. The capabilities are emergent in the sense of "appearing in larger models", not in the sense of "literally unexplainable". They have causes; the causes are findable; treating them as ineffable slows progress.

Picking a model size based on a paper's claims about emergence. Different evaluations show different transition points. Test in YOUR specific use case before committing to a size; the literature's transitions may not be yours.

Assuming smooth progress. Some capabilities really do emerge sharply. If your product depends on a capability at a specific size, validate experimentally; don't extrapolate from prior generations.

Ignoring the metric definition. "Accuracy" can mean exact-match, F1, BLEU, partial credit, expert judgment. Each metric has different smoothness properties. The same model can look emergent on one metric and smooth on another.
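To see how much the metric choice matters, compare exact-match with a token-overlap F1 on the same answer. This is a minimal sketch with whitespace tokenisation and no text normalisation; both helper functions and the example strings are illustrative, not taken from any particular benchmark.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """1.0 only if the strings are identical after trimming whitespace."""
    return float(prediction.strip() == reference.strip())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the answer is 42"
prediction = "the answer is 41"
print(exact_match(prediction, reference))  # 0.0
print(token_f1(prediction, reference))     # 0.75
```

The near-miss scores zero on one metric and 0.75 on the other. Neither is wrong; they answer different questions, so state which one a reported "accuracy" refers to.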

What to do this week

Three moves. (1) For your highest-stakes use case, test 2-3 model sizes (smaller, current, larger) on your real eval set. The shape of the size-vs-quality curve tells you whether you're in a smooth or step regime. (2) Define your eval metric carefully. Exact-match is harsh; partial-credit metrics surface smooth improvements that matter for product UX. (3) If you're depending on a capability that's near the emergence threshold for your model size, plan for the threshold to move: set up A/B testing infrastructure that lets you compare models without redeploying app code.
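For move (3), one common pattern is deterministic traffic splitting driven by configuration. The sketch below is an assumption about how that could look, not a prescribed design: `assign_model`, the model names, and the traffic shares are hypothetical, and in practice the `arms` mapping would be loaded from a config file or feature-flag service so that changing the split needs no code deploy.

```python
import hashlib
from typing import Mapping


def assign_model(user_id: str, arms: Mapping[str, float]) -> str:
    """Map a user to a model arm, stably, according to traffic shares.

    arms: model name -> share of traffic; shares are expected to sum to 1.
    """
    if not arms:
        raise ValueError("at least one model arm is required")
    # Hash the user id into a number in [0, 1) so the same user always
    # lands in the same bucket.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    chosen = ""
    for model, share in arms.items():
        chosen = model
        cumulative += share
        if bucket < cumulative:
            break
    return chosen  # falls back to the last arm if shares sum to slightly under 1


# Hypothetical split; stands in for values read from configuration.
arms = {"model-current": 0.9, "model-larger": 0.1}
print(assign_model("user-123", arms))
```

Because assignment depends only on the user id and the configured shares, each user sees a consistent model across sessions, and eval results can be compared per arm.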