The Bitter Lesson Applied in 2026
Rich Sutton’s 2019 essay argued that scale beats clever in AI. Seven years on, the lesson keeps being relearned. Here is what it has and hasn’t predicted, and where the next turn might surprise us.
The thesis, exactly
From Sutton’s 2019 essay: “The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective.”
The implication: clever, hand-engineered approaches that bake in human insight tend to lose, over time, to simple methods that scale with available compute. Search and learning are general; everything else has a ceiling.
The supporting history
Three canonical examples Sutton cited:
- Chess: hand-crafted heuristics ruled until brute-force search won by leveraging compute; Deep Blue’s 1997 defeat of Kasparov came from deep minimax search with alpha-beta pruning (a minimal sketch follows this list).
- Go: hand-crafted patterns and life-and-death tactics lost to AlphaGo’s Monte Carlo tree search plus neural networks, then to AlphaZero, trained purely on self-play.
- Speech recognition: phoneme dictionaries and hand-built linguistic features lost first to statistical methods, then to deep networks trained end-to-end on audio.
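To make the chess point concrete, here is a minimal sketch of minimax with alpha-beta pruning over a toy game tree. The tree encoding (leaves are static evaluations, internal nodes are lists of children) is illustrative, not any real engine’s; the only lever is search depth, which is to say compute.

```python
# Minimal sketch of minimax with alpha-beta pruning over a toy game tree.
# A leaf is a number (static evaluation); an internal node is a list of children.
def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    if isinstance(node, (int, float)):    # leaf: return the static evaluation
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:             # prune: the minimizer will avoid this branch
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:                 # prune: the maximizer will avoid this branch
            break
    return value

# Searching deeper trees costs only compute, not extra chess knowledge.
print(alphabeta([[3, 5], [6, [9, 1]], [2, 0]]))   # -> 6
```

The evaluation at the leaves still encodes some human knowledge, but the playing strength comes from how deep the search can go, and that is exactly the quantity that grows with compute.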
The pattern: simple methods plus more compute eventually beat clever methods plus less compute. Always.
Where it still holds in 2026
The most striking recent case: language models. Researchers spent decades on hand-crafted parsers, semantic networks, and knowledge graphs; all of it lost to next-token prediction at scale.
Even in 2026, the pattern recurs:
- Hand-tuned RAG pipelines lose ground to longer-context models that can simply take in more of the source material directly.
- Carefully engineered chain-of-thought prompts lose to dedicated reasoning models trained on reasoning chains.
- Specialist code models lose ground to general models trained on more code at scale.
Counter-evidence
The Bitter Lesson isn’t always immediate. Several places where cleverness still helps:
- Tool use and agents. The right scaffolding around an LLM beats raw generation by a wide margin. This is probably temporary: future models will likely internalise these patterns.
- Data quality. Curating great training data matters more, not less, in 2026. Compute hasn’t replaced the need for thoughtful data work.
- Inference efficiency. Speculative decoding (sketched below), quantisation, and MoE routing are all clever optimisations that compute alone wouldn’t produce.
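As an example of that kind of cleverness, here is a minimal sketch of the greedy-acceptance variant of speculative decoding. The names `target_next` and `draft_next` are hypothetical stand-ins for a large model and a cheap draft model, and the token-by-token verification loop is only for clarity; a real implementation scores all draft positions in one batched forward pass and uses rejection sampling rather than an equality check.

```python
# Minimal sketch of speculative decoding (greedy-acceptance variant).
# target_next and draft_next are hypothetical stand-ins: each maps a
# token list to that model's next token.
def speculative_decode(prompt, target_next, draft_next, k=4, max_new=16):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. Draft: the small model cheaply proposes k candidate tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Verify: accept draft tokens while they match the target's
        #    greedy choice. (A real system checks all k positions in a
        #    single batched forward pass of the large model.)
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)   # first mismatch: take the target's token, stop
                break
            accepted.append(tok)            # draft agreed with the target: nearly free token
        else:
            accepted.append(target_next(tokens + accepted))  # all k matched: bonus token
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new]
```

When the draft model agrees with the target most of the time, each round emits several tokens for roughly one large-model step; the full algorithm replaces the equality check with rejection sampling so the output distribution matches the target model exactly.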
The honest reading: scale wins on the central challenges. Cleverness still matters at the edges and on engineering economics.
The next turn
Two ways the lesson could stop holding over the next two years:
- Compute walls. If energy or chip supply caps the scaling curve, the assumption underneath the lesson, that more compute keeps arriving, stops holding.
- Architectural breakthroughs. A new architecture could shift the constants enough that “more of the same” gets dethroned. State-space models, retentive networks, and similar 2024-2025 candidates haven’t displaced transformers yet, but the field is open.
Best bet for 2026-2028: the lesson keeps holding for the central capabilities. The interesting work is at the edges: the small clever things that compound at scale.