Mechanistic Interpretability: Reading Attention Heads
Models are mostly opaque. Mechanistic interpretability is the project of opening one up and tracing how specific neurons and attention heads compute specific things.
The interpretability goal
Mechanistic interpretability is the project of reverse-engineering trained neural networks to understand WHY they produce specific outputs. Not just "this token had high probability"; the actual computation: which neurons fire, which attention heads attend, why. The goal is the same kind of understanding circuit designers have for chips, knowing each part's role.
The motivation. Trained models work; we don't know why. For most software, "works" is enough. For models making consequential decisions, "we don't know why it works" is uncomfortable. Mechanistic interpretability aims to close the gap between black-box behaviour and white-box understanding.
The scale problem. A 70B-parameter model has 70 billion knobs. Manually inspecting each is infeasible. Mechanistic interpretability needs methods that scale, techniques that find structure (heads, circuits, features) that compose into explanations rather than treating each parameter independently.
The open bet. As of 2026, mech interp has produced concrete findings for specific small-scale phenomena (induction heads, the IOI circuit in GPT-2 small). Whether the techniques scale to frontier models is the open question. The bet is that they will: that there is structure at every scale, just larger structure.
Induction heads
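The most studied finding. Certain attention heads in trained transformers implement a specific algorithm: "find an earlier occurrence of the current token, and copy what came after it". Induction heads are a large part of why models can learn from in-context examples; they are the leading candidate for the mechanical substrate of in-context learning.
The discovery. Anthropic's "A Mathematical Framework for Transformer Circuits" first described induction heads. The follow-up paper, "In-context Learning and Induction Heads", showed that they emerge at a specific point in training, and that the model's in-context-learning ability jumps at the same point. The correlation is too tight to look like coincidence. The evidence is strongest in small attention-only models and more correlational in larger ones, but the working hypothesis is that induction heads account for much of in-context learning, mechanically.
The structure. An induction circuit is a pair of heads in different layers: a "previous token head" and the induction head proper. The previous token head, in an earlier layer, writes into each position information about the token that preceded it. The induction head, in a later layer, uses that information to attend to the position just after an earlier occurrence of the current token, and copies the token it finds there into the next-token prediction. The pair implements a primitive lookup: given [A][B] ... [A], predict [B].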
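The lookup itself is simple enough to write down directly. The sketch below is not how a transformer computes it (a real induction head works through learned attention weights, not string comparison); it only illustrates the input-output rule the head approximates. The function name and the example prompt are made up for illustration.

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Toy version of the induction rule: find the most recent earlier
    occurrence of the current (last) token and return the token that
    followed it. Returns None if the current token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, skipping the current one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


# Illustrative prompt: "Mr" appeared earlier, followed by "Dursley".
prompt = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
print(induction_predict(prompt))  # Dursley
```

The rule needs no knowledge of what the tokens mean, which is why it works on sequences the model has never seen before.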
Why the finding matters. It was among the first concrete pieces of evidence that transformers contain identifiable, nameable computations rather than amorphous "neural goo". Other circuits exist; finding them is now a research methodology, not just a hope.
The generalisation question. Induction heads are well-established in small models (GPT-2 size). Whether the same heads exist in 70B+ models, or whether they're replaced by more sophisticated structures, is only partially answered. Induction-like heads have been found in larger models too, but there they appear to be more numerous and to combine in more complex ways.
Circuits
An attention head is one component. A circuit is multiple components composed to do a thing. The IOI circuit (Indirect Object Identification: given "When Mary and John went to the store, John gave a drink to", predict "Mary") involves 26 heads in GPT-2 small, grouped into 7 functional classes, working in concert. Each class's role has been mapped.
The IOI methodology. Researchers patched activations from one run of the model into another, head by head, and observed which heads influenced which downstream computations. The patching revealed a directed graph of "head A's output feeds head B's input for this computation". The graph is the circuit.
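Activation patching can be shown on a toy model. The sketch below assumes a hypothetical two-component "model" with hand-picked weights (nothing here comes from GPT-2 or the IOI paper): run it on a clean and a corrupted input, then overwrite one component's activation in the corrupted run with its clean value and measure how much of the output gap is restored.

```python
import numpy as np

# Hypothetical toy model; the weights are hand-picked for illustration.
W_A = np.array([1.0, 0.0])      # component A reads input dimension 0
W_B = np.array([0.0, 1.0])      # component B reads input dimension 1
READOUT = np.array([2.0, 0.0])  # the output only uses component A


def run(x, patch=None):
    """Run the toy model. `patch` maps a component name to an activation
    that overrides whatever that component would have computed."""
    acts = {"A": float(W_A @ x), "B": float(W_B @ x)}
    acts.update(patch or {})
    logit = READOUT[0] * acts["A"] + READOUT[1] * acts["B"]
    return logit, acts


clean = np.array([1.0, 1.0])
corrupted = np.array([0.0, 0.0])

clean_logit, clean_acts = run(clean)
corrupt_logit, _ = run(corrupted)


def recovery(name):
    """Fraction of the clean-vs-corrupted gap restored by patching one component."""
    patched_logit, _ = run(corrupted, patch={name: clean_acts[name]})
    return (patched_logit - corrupt_logit) / (clean_logit - corrupt_logit)


effects = {name: recovery(name) for name in ("A", "B")}
print(effects)  # {'A': 1.0, 'B': 0.0}
```

Patching A restores the whole gap and patching B restores none of it, so only A is on the path to the output. In a real model the same loop runs over every head, and the recovery scores give the edges of the graph.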
Why circuits, not heads alone. Many capabilities require multi-step computation: identify the subject, identify the verb, identify the indirect object, output the answer. Each step might be a head; the steps must connect. The circuit is the connection structure.
The labour problem. Mapping a circuit takes researcher-months. There are arbitrarily many capabilities; each has its own circuit. Manual circuit-mapping doesn't scale. Automated circuit discovery is the next frontier, algorithms that find circuits given a behaviour to explain.
Sparse autoencoders
The newer method. Train a separate, shallow network (a sparse autoencoder, or SAE) to encode model activations as sparse combinations of "features". Each feature, ideally, corresponds to a human-meaningful concept. SAEs scale further than circuit-finding because they don't require behavioural targets; they just decompose activations.
The dictionary metaphor. An SAE learns a dictionary of features. Any activation at the layer it was trained on is approximated as a sparse combination of dictionary entries. Each entry, when interpreted, often turns out to mean something: "Golden Gate Bridge", "code that uses Python decorators", "negation in the next clause".
The scaling property. SAEs have been applied to models far larger than those where circuits have been mapped by hand. Anthropic published SAE features for Claude 3 Sonnet in 2024, not just toy GPT-2 demonstrations. The technique generalises.
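In code, the dictionary is a decoder matrix and the sparse combination is a vector of non-negative feature activations. The sketch below uses one common formulation (ReLU encoder, linear decoder, reconstruction loss plus an L1 penalty); the sizes, random weights, and penalty coefficient are arbitrary assumptions, and no training loop is shown.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """x: (batch, d_model). Returns feature activations and the reconstruction."""
    # ReLU keeps feature activations non-negative; most end up at zero after training.
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Each row of W_dec is one dictionary entry (a direction in activation space).
    x_hat = f @ W_dec + b_dec
    return f, x_hat


def sae_loss(x, f, x_hat, l1_coeff):
    """Reconstruction error plus an L1 penalty that pushes features towards sparsity."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity


# Arbitrary illustrative sizes: 4-dimensional activations, 16 dictionary entries.
rng = np.random.default_rng(0)
d_model, n_features = 4, 16
x = rng.normal(size=(8, d_model))
W_enc = rng.normal(size=(d_model, n_features)) * 0.1
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm dictionary entries
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=1e-3)
```

The dictionary is deliberately wider than the activation (16 entries for 4 dimensions here). The sparsity penalty is what stops the extra entries from being redundant.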
The interpretation challenge. Discovering a feature is one step; labelling it (with a human-meaningful name) is another. Auto-labelling using LLMs has had partial success; manual labelling is the fallback. The labelling cost is the bottleneck for using SAE outputs in safety reviews.
The intervention possibility. SAE features can be artificially activated or suppressed to steer model behaviour. "Make it more polite" maps to upweighting politeness-related features. The intervention is a research tool now; it's plausibly a deployment tool in the future for fine-grained behavioural control.
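Mechanically, activating or suppressing a feature is vector arithmetic on the activation. The sketch below assumes the feature is represented by a single direction (for an SAE, the feature's decoder vector); the "politeness" direction and the numbers are invented for illustration.

```python
import numpy as np


def unit(v):
    """Normalise a vector to length 1."""
    return v / np.linalg.norm(v)


def steer(activation, feature_direction, strength):
    """Upweight a feature: add `strength` units of its direction to the activation."""
    return activation + strength * unit(feature_direction)


def ablate(activation, feature_direction):
    """Suppress a feature: remove the activation's component along its direction."""
    d = unit(feature_direction)
    return activation - (activation @ d) * d


activation = np.array([0.5, -1.0, 2.0])
politeness_direction = np.array([0.0, 3.0, 4.0])  # hypothetical feature direction

steered = steer(activation, politeness_direction, strength=5.0)
ablated = ablate(activation, politeness_direction)

d = unit(politeness_direction)
print(activation @ d, steered @ d, ablated @ d)
```

Steering raises the activation's projection onto the feature direction by exactly `strength`; ablation sets that projection to zero and leaves the orthogonal part untouched. In a real model the modified activation is written back into the forward pass so that later layers see it.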
The safety case
If we can read the circuits, we can verify "the model isn't doing X" before deployment. If we can detect deceptive cognition mechanically, we can catch it before customers do. Mechanistic interpretability is the most ambitious safety bet: turn AI from black box to verifiable.
The deception-detection version. A model trained to be helpful might have a circuit for "what humans want to hear" and a separate circuit for "what's actually true". When the two diverge, we'd want the model to output truth; deception is when it outputs the former despite knowing the latter. If the two circuits are mechanically separable, we can detect deception by reading both.
The current state. Detection is possible in toy settings; not yet in frontier models. The techniques are improving fast; whether they outpace capability growth is uncertain. The bet is that interpretability stays close enough to capability that we can red-team frontier models before deploying them.
The dual-use concern. Tools that find circuits can also help adversaries find ways to manipulate models. The mitigation is the same as for cybersecurity: defenders need to get the tools sooner and use them better. Open research helps the defender side; weaponising specific findings is a different question.
Common antipatterns
Cherry-picking explanations. Finding ONE circuit and claiming the model "works this way". Most behaviours have many overlapping circuits; one explanation is one perspective.
Confusing description with cause. A feature that fires for "Golden Gate Bridge" describes a correlation; whether it CAUSES the behaviour can only be established by intervention experiments.
Ignoring scale. Findings on GPT-2 may or may not transfer to GPT-5. Test on the model you actually care about, not the small model that's tractable.
Treating SAE features as ground truth. SAE features are an approximation. Different SAE training runs find different feature dictionaries; the "true" feature decomposition isn't unique.
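One simple way to check how far two dictionaries agree is to match each feature direction in one run to its nearest neighbour in the other by cosine similarity. This is a sketch under the assumption that each dictionary is a matrix with one feature direction per row; the two tiny dictionaries are invented for illustration.

```python
import numpy as np


def best_match_similarity(dict_a, dict_b):
    """For each row of dict_a, the highest cosine similarity to any row of dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)


# Hypothetical dictionaries from two training runs, two features each.
run_1 = np.array([[1.0, 0.0], [0.0, 1.0]])
run_2 = np.array([[2.0, 0.0], [1.0, 1.0]])

scores = best_match_similarity(run_1, run_2)
print(scores)  # approximately [1.0, 0.707]
```

A score near 1 means the feature was recovered in both runs. A lower score, like the second one here, means the other run carved up that part of activation space differently, so the feature deserves less trust.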
What to do this week
Three moves. (1) Read Anthropic's "Mathematical Framework" and "Towards Monosemanticity" papers. They're the canonical introductions and the field has converged on their vocabulary. (2) Try TransformerLens (the open-source library) on a small model. Even an hour of hands-on patching builds intuition that papers don't. (3) If you're at a frontier lab or building on top of one, ask whether interpretability tools are part of your safety review. If not, pushing for them is high-leverage advocacy.