Mechanistic Interpretability: Reading Attention Heads
Models are mostly opaque. Mechanistic interpretability is the project of opening one up and tracing how specific neurons and attention heads compute specific things.
The interpretability goal
Make a transformer’s computation legible. “The model says X because it was trained on data that says X” is not interpretation. Real interpretation means identifying the specific weights and intermediate activations that turn the input into the output, in a way a human can follow.
Induction heads
The most-celebrated early result. In a transformer, certain attention heads learn a specific algorithm: when token X follows token Y earlier in the context, predict X again the next time Y appears. This implements simple in-context pattern completion.
Researchers traced this circuit precisely: which heads in which layers, how they communicate, and what the weights look like. It was the first concrete demonstration that transformers learn discrete algorithms, not just surface statistics.
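The algorithm itself, separated from the learned weights, fits in a few lines. A toy sketch in plain Python (illustrative token lists, not a model):

```python
def induction_predict(context, current):
    """Toy version of the induction-head algorithm: find the most recent
    earlier occurrence of `current` in the context and predict the token
    that followed it. Returns None if there is no earlier occurrence."""
    for i in range(len(context) - 1, -1, -1):
        if context[i] == current and i + 1 < len(context):
            return context[i + 1]
    return None

# "A" was last followed by "D", so seeing "A" again predicts "D"
print(induction_predict(["A", "B", "C", "A", "D"], "A"))  # → D
print(induction_predict(["A", "B", "C", "A", "D"], "B"))  # → C
```

What the real circuit learns is a weight-level implementation of this lookup: one head copies "what came before me" information forward, and a later head attends to the matching position and copies the successor token.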
Circuits
The follow-up: identify the small set of components (attention heads, MLP neurons) responsible for a specific behaviour. The Indirect Object Identification circuit, modular arithmetic circuits, and grokking circuits are the classic early findings.
Tools: activation patching (replace one component’s activations with activations cached from a different run and see whether the behaviour changes), causal tracing, and attribution patching. All of these identify which parts of the model are necessary or sufficient for a behaviour.
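Activation patching is simple to state in code. A minimal sketch on a hypothetical two-layer "model" (just composed functions, not a real transformer; in practice this is done with forward hooks on real layers):

```python
def run_with_patch(layers, x, patch_layer=None, patch_value=None):
    """Run a toy feed-forward 'model' (a list of functions), optionally
    replacing the activation after one layer with a cached value."""
    acts = []
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == patch_layer:
            x = patch_value  # the patch: swap in the cached activation
        acts.append(x)
    return x, acts

# toy two-layer model
layers = [lambda v: v * 2, lambda v: v + 1]

clean_out, clean_acts = run_with_patch(layers, 3)    # clean run: 7
corrupt_out, _ = run_with_patch(layers, 5)           # corrupted run: 11
# patch layer 0's clean activation into the corrupted run
patched_out, _ = run_with_patch(layers, 5,
                                patch_layer=0,
                                patch_value=clean_acts[0])
print(patched_out)  # → 7: the clean behaviour is restored
```

The logic of the experiment: if patching a single component's clean activation into the corrupted run restores the clean output, that component carries the causally relevant information.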
Sparse autoencoders
The 2023-2024 frontier technique. Train an autoencoder on residual stream activations with a sparsity penalty. The autoencoder’s features turn out to be far more interpretable than raw neurons: a feature might fire on “legal jargon” or “Python error messages” or “flattery.”
Anthropic’s 2024 work scaled SAEs to Claude-class models and identified millions of features. This is the most concrete progress on interpretability so far.
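The training objective is just reconstruction error plus a sparsity penalty on the feature activations. A minimal NumPy sketch of the forward pass and loss (the dimensions, initialisation, and coefficient are illustrative assumptions, not any lab's actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat = 16, 64   # features overcomplete relative to residual stream
W_enc = rng.normal(0, 0.1, (d_model, d_feat))
W_dec = rng.normal(0, 0.1, (d_feat, d_model))
b_enc = np.zeros(d_feat)
l1_coef = 1e-3             # weight on the sparsity penalty

def sae_loss(x):
    """Reconstruction + L1 sparsity loss for a batch of activations x."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU feature activations
    x_hat = f @ W_dec                        # reconstruction
    recon = ((x - x_hat) ** 2).mean()
    sparsity = l1_coef * np.abs(f).mean()    # pushes most features to zero
    return recon + sparsity, f

x = rng.normal(size=(8, d_model))  # stand-in for residual stream activations
loss, feats = sae_loss(x)
```

The L1 term is what does the work: it forces each input to be explained by a handful of active features, and those sparse features are what turn out to be human-interpretable.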
The safety case
If you can find the “deception” feature or the “follow user instructions even if dangerous” circuit, you can monitor for it. Or ablate it. Or use it as a training signal.
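Once a feature is found, monitoring reduces to thresholding one column of the SAE feature activations. A sketch in which everything is hypothetical: the feature index, the threshold, and the toy activations (finding the right feature is the hard part):

```python
# toy SAE feature activations for 4 inputs over 6 features
features = [
    [0.0, 0.1, 0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.2, 0.0],
    [0.3, 0.0, 0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.0, 0.0],
]
DECEPTION_FEATURE = 3  # hypothetical index of a flagged feature
THRESHOLD = 0.5        # hypothetical firing threshold

# flag inputs where the monitored feature fires strongly
flagged = [i for i, f in enumerate(features)
           if f[DECEPTION_FEATURE] > THRESHOLD]
print(flagged)  # → [0, 2]
```

Ablation is the same operation in reverse: zero that feature's activation before decoding back into the residual stream, and see whether the behaviour disappears.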
This is why interpretability is funded heavily by labs concerned with AI safety. As of 2026, mechanistic interpretability is the strongest empirical line of evidence about what’s actually happening inside frontier models. It’s slow work, and it lags capabilities, but it’s producing concrete results that nothing else does.