AI & ML Advanced By Samson Tanimawo, PhD Published Aug 8, 2026 7 min read

Sparse Autoencoders for Feature Discovery

A model has billions of neurons, but they don't map cleanly to concepts. Sparse autoencoders pull out cleaner features, ideally one per discrete idea. They are the most concrete progress in interpretability so far.

The core idea

Train a sparse autoencoder on top of a transformer's activations. The autoencoder's hidden layer has many more dimensions than the transformer's hidden state, but only a few are active for any input. Each dimension, ideally, corresponds to a human-meaningful concept: "Golden Gate Bridge", "code that uses recursion", "negation".

The setup. Take frozen transformer activations at some layer. Train an autoencoder: the encoder expands the activation into a high-dimensional sparse code; the decoder reconstructs the original activation from the code. An L1 sparsity penalty on the code pushes most code dimensions to zero for any input. The dimensions that activate are the "features".
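The setup above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any lab's actual implementation: the sizes, the `l1_coeff` value, the random weights, and the function names are all assumptions made for the example, and a real SAE would learn its weights by gradient descent on this loss.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16      # width of the (hypothetical) transformer activation
n_features = 128  # dictionary size: an 8x expansion
l1_coeff = 1e-3   # weight on the sparsity penalty (illustrative value)

# Randomly initialised parameters; training would update these.
W_enc = rng.normal(0.0, 0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0.0, 0.1, size=(d_model, n_features))
b_dec = np.zeros(d_model)

def encode(x):
    """Map activations (batch, d_model) to non-negative codes (batch, n_features)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc.T + b_enc)

def decode(code):
    """Reconstruct activations (batch, d_model) from codes."""
    return code @ W_dec.T + b_dec

def sae_loss(x):
    """Reconstruction error plus an L1 penalty on the code."""
    code = encode(x)
    recon = decode(code)
    mse = np.mean(np.sum((x - recon) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(code), axis=1))
    return mse + l1_coeff * l1, mse, l1

x = rng.normal(size=(4, d_model))  # stand-in for frozen transformer activations
total, mse, l1 = sae_loss(x)
print(f"total={total:.4f} mse={mse:.4f} l1={l1:.4f}")
```

With untrained weights the codes are not yet sparse; sparsity only emerges as training trades reconstruction error against the L1 term.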

The learned dictionary. After training, the decoder's weight matrix is a dictionary of feature vectors, one direction in activation space per feature (the encoder's matching weights detect how strongly each feature is present). When an input activates feature 1042, the activation has a strong component along feature 1042's direction. The dictionary is the SAE's contribution; the features are its interpretable products.
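To make "a strong component along a feature direction" concrete, here is a small sketch. The dictionary is random rather than learned, its sizes are arbitrary, and storing one unit-norm direction per row is a layout assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 2048   # illustrative sizes, not from any real model

# Hypothetical dictionary: one unit-norm feature direction per row.
dictionary = rng.normal(size=(n_features, d_model))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

# Build an activation that mostly points along feature 1042, plus a little noise.
activation = 3.0 * dictionary[1042] + 0.05 * rng.normal(size=d_model)

# Component of the activation along every feature direction.
components = dictionary @ activation
strongest = int(np.argmax(components))
print(strongest, round(float(components[strongest]), 2))
```

The dot product with row 1042 dominates every other row, which is what "feature 1042 is active" means geometrically.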

The intuition for why it should work. Trained transformers represent concepts in superposition: many concepts share the same neurons, which makes individual neurons polysemantic. The SAE hypothesis is that the right basis (the SAE dictionary) decomposes the superposition into individual concepts. Whether such a basis exists is the empirical question; SAEs let us test it.

Why it works

Transformers naturally represent many concepts in superposition, which is why individual neurons look polysemantic. A single neuron might activate for both "France" and "music", not because the model is confused, but because it efficiently packs concepts into limited width. SAEs find a basis that disentangles the superposition. The features they discover are often clean: one feature for "France", another for "music", co-activating only when both concepts are present.

The polysemanticity story. Suppose a model has 2,048 hidden dimensions but needs to represent 10,000 concepts. Some neurons must encode multiple concepts. The model uses sparse codes: each input activates only a small subset of "concept directions" in activation space. The directions don't have to be neuron-aligned; they can point anywhere in activation space.
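A toy version of this packing argument, scaled down to 256 dimensions and 1,280 concepts to keep the arithmetic cheap. It assumes concept directions are random unit vectors, which is a simplification; a trained model's directions are learned, not random.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_concepts = 256, 1280   # scaled-down stand-in for 2,048 dims vs 10,000 concepts

# Assumption: concept directions are random unit vectors.
directions = rng.normal(size=(n_concepts, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Interference between distinct concepts = |cosine similarity| off the diagonal.
cosines = directions @ directions.T
off_diag = np.abs(cosines[~np.eye(n_concepts, dtype=bool)])
print(f"mean |cos| = {off_diag.mean():.3f}, max |cos| = {off_diag.max():.3f}")

# A sparse input: only two concepts are active at once.
active = [7, 42]
activation = directions[active].sum(axis=0)

# Reading out every concept by dot product still ranks the active ones on top.
readout = directions @ activation
recovered = sorted(np.argsort(readout)[-2:].tolist())
print(recovered)
```

Five times more directions than dimensions cannot all be orthogonal, but the interference stays small on average, and as long as few concepts are active at once the active ones remain readable.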

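The basis problem. The model's natural basis (the neurons) is not aligned to concepts. The concept directions are rotated relative to the neurons and, under superposition, outnumber them, so the target is an overcomplete set of directions rather than a pure rotation. SAEs find these directions by training for sparse codes: the feature directions that produce sparse codes are taken to be the concept directions.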

The empirical confirmation. Trained SAE features are, in many cases, individually interpretable. Researchers have inspected thousands of features and found human-meaningful labels for most. The technique works in practice, which supports the superposition hypothesis and suggests that a useful basis can be found.

What it doesn't explain. Why are concepts roughly orthogonal in activation space? Why does L1 sparsity find the right basis (rather than other sparse bases)? These are open questions; the technique works empirically before the theory fully explains why.

Dictionary size

How many features should the SAE have? Too few: features are still polysemantic. Too many: features are noisy or redundant. Empirically, dictionaries 8-32x larger than the transformer's hidden dimension work well. For a model with a 4K hidden dimension, that means dictionaries of 32K-128K features.
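The arithmetic behind those numbers, as a short sketch. The parameter count assumes a plain SAE with separate encoder and decoder weight matrices plus biases, and the function names are made up for the example.

```python
def dictionary_size(d_model: int, expansion: int) -> int:
    """Number of SAE features for a given hidden dimension and expansion factor."""
    return d_model * expansion

def sae_param_count(d_model: int, expansion: int) -> int:
    """Parameters in an untied SAE: encoder and decoder weights plus biases."""
    n = dictionary_size(d_model, expansion)
    encoder = d_model * n + n        # weights + one bias per feature
    decoder = n * d_model + d_model  # weights + one bias per activation dim
    return encoder + decoder

d_model = 4096
for expansion in (8, 16, 32):
    n = dictionary_size(d_model, expansion)
    params = sae_param_count(d_model, expansion)
    print(f"{expansion:>2}x -> {n:>7,} features, {params / 1e6:,.0f}M parameters")
```

Parameter count grows linearly with the expansion factor, so each doubling of the dictionary roughly doubles the SAE's size for a single layer.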

The too-small failure mode. With dictionary size equal to hidden dim, the SAE can't disentangle superposition: it is at best a rotation, not an expansion. Features stay polysemantic. The "8x minimum" rule comes from the empirical observation that significant superposition requires significant expansion.

The over-complete diminishing returns. Beyond 32x expansion, features start to split into near-duplicates: "Golden Gate Bridge" might split into "Golden Gate Bridge in summer" and "Golden Gate Bridge in fog", which are interpretable but redundant. The optimal point depends on the application; for general interpretability, 8-32x is the sweet spot.

The compute cost. SAE training is non-trivial. For frontier models, the activation dataset alone can run to petabytes, and training the SAE is a serious compute investment in its own right. Anthropic has reportedly spent a significant fraction of Claude's training compute on SAE training. The cost is justified by safety value, not by direct product impact.

Production uses

Steering: turn up an "honesty" feature to make a model more honest. Detection: flag activations on a "deceptive plan" feature. Debugging: answer "why did the model output this?" by reading which features were active. It is early: the techniques work in research and are starting to show up in production safety pipelines.

The steering use case. Activate or suppress specific features at inference time to bias model behaviour. "More polite" might be one feature; "less verbose" might be another. Steering is fine-grained control beyond what fine-tuning provides; it's mechanism-level instead of behaviour-level.
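One simple way to picture steering is adding a multiple of a feature's direction to the activation before the model continues its forward pass. The sketch below assumes a random dictionary with unit-norm rows and a made-up feature index; real systems may instead clamp the feature's code value, and hooking this into an actual model is out of scope here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 32, 256   # illustrative sizes

# Hypothetical dictionary: one unit-norm feature direction per row.
dictionary = rng.normal(size=(n_features, d_model))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

def steer(activation, feature_idx, strength):
    """Add `strength` units of one feature direction (negative values suppress it)."""
    return activation + strength * dictionary[feature_idx]

POLITE = 17  # hypothetical index of a "more polite" feature
activation = rng.normal(size=d_model)
steered = steer(activation, POLITE, strength=4.0)

before = float(dictionary[POLITE] @ activation)
after = float(dictionary[POLITE] @ steered)
print(f"component along feature {POLITE}: {before:.2f} -> {after:.2f}")
```

Because the direction has unit norm, the component along it shifts by exactly `strength`, while the rest of the activation is left unchanged.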

The detection use case. Set up monitors that fire when specific features activate. "Deception planning" features could trigger review pipelines. "Self-harm assistance" features could trigger refusal escalation. Detection lets safety teams catch behaviours before they reach users.
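A monitor of this kind can be as simple as a threshold check on selected code dimensions. In this sketch the feature indices, thresholds, concept names, and the `flag_features` helper are all hypothetical; choosing real thresholds would require calibration on labelled data.

```python
import numpy as np

# Hypothetical watchlist: feature index and firing threshold per monitored concept.
WATCHLIST = {
    "deception_planning": (311, 2.0),
    "self_harm_assistance": (902, 1.5),
}

def flag_features(codes, watchlist=WATCHLIST):
    """Return {name: [token positions]} where a watched feature exceeds its threshold."""
    alerts = {}
    for name, (idx, threshold) in watchlist.items():
        positions = np.flatnonzero(codes[:, idx] > threshold)
        if positions.size:
            alerts[name] = positions.tolist()
    return alerts

# Toy SAE codes for a 5-token sequence over a 1,024-feature dictionary.
codes = np.zeros((5, 1024))
codes[3, 311] = 2.7   # crosses the "deception_planning" threshold at token 3
codes[1, 902] = 0.4   # below threshold, so it should not fire

alerts = flag_features(codes)
print(alerts)
```

The returned token positions are what a review pipeline would need in order to show a human the exact span that triggered the alert.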

The debugging use case. When a model produces an unexpected output, trace which features were active. The active features explain the output mechanically. Debugging shifts from "guess what the model was thinking" to "read what the model was thinking".
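In its simplest form, "reading what the model was thinking" means listing the most active features for a token alongside their labels. The code vector, label table, and `top_features` helper below are invented for illustration.

```python
import numpy as np

def top_features(code, labels, k=3):
    """Return the k most active features as (index, label, activation), strongest first."""
    order = np.argsort(code)[::-1][:k]
    return [(int(i), labels.get(int(i), "unlabelled"), float(code[i]))
            for i in order if code[i] > 0]

# Hypothetical SAE code for one token, plus hypothetical human-written labels.
code = np.zeros(512)
code[[12, 87, 301]] = [0.9, 3.2, 1.4]
labels = {12: "negation", 87: "Golden Gate Bridge", 301: "code that uses recursion"}

for idx, label, value in top_features(code, labels):
    print(f"feature {idx:>3} ({label}): {value:.1f}")
```

A listing like this is a starting point, not a full explanation: it shows which features fired, not how they combined to produce the output.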

The maturity caveat. As of 2026, these uses are emerging. Production deployments are early; most labs are still building the tooling. A plausible timeline: by 2027, SAE-based monitoring becomes standard at frontier labs; by 2028, it is available as a service for downstream developers.

Common antipatterns

Trusting auto-generated feature labels blindly. Auto-labelling with LLMs is convenient, but the labels are often imprecise. For high-stakes uses (safety review), human verification is required.

Using SAEs trained on a different model. SAE features are model-specific. A dictionary trained on Claude 3 doesn't apply to Claude 4. Retrain when the underlying model changes.

Picking dictionary size by intuition. Validate empirically instead. Train SAEs at 4x, 8x, 16x, 32x; measure interpretability and reconstruction loss; pick the elbow.
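"Pick the elbow" can be made mechanical. The sketch below uses one simple heuristic (the point furthest below the straight line joining the first and last measurements) and considers reconstruction loss only; the loss values are placeholders invented for illustration, not measurements, and an interpretability score would need the same treatment.

```python
import numpy as np

def pick_elbow(expansions, losses):
    """Pick the expansion factor furthest below the line joining the endpoints.

    Assumes `expansions` is sorted in increasing order.
    """
    x = np.log2(np.asarray(expansions, dtype=float))
    y = np.asarray(losses, dtype=float)
    # Normalise both axes to [0, 1] so neither dominates the comparison.
    x = (x - x.min()) / (x.max() - x.min())
    y = (y - y.min()) / (y.max() - y.min())
    chord = y[0] + (y[-1] - y[0]) * (x - x[0]) / (x[-1] - x[0])
    gap = chord - y
    return expansions[int(np.argmax(gap))]

expansions = [4, 8, 16, 32]
# Placeholder reconstruction losses, invented for illustration only.
losses = [0.30, 0.18, 0.15, 0.14]

print(pick_elbow(expansions, losses))
```

With these placeholder numbers the heuristic selects 8x, the point after which additional expansion buys little further reduction in loss.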

Treating one feature as the explanation. Most behaviours have multiple contributing features. Single-feature explanations are simplifications; production safety reviews need to consider feature combinations.

What to do this week

Three moves. (1) Try an open-source SAE for GPT-2 small; OpenAI has released one, and community libraries such as SAELens host others. The hands-on experience is more useful than reading papers. (2) If you're working on safety, identify 3-5 concepts you'd want detection for (deception, jailbreak attempts, harmful content categories). These are the SAE features your future detection pipeline will look for. (3) For applied teams: track when SAE-based steering becomes available as a production API. The capability shift will be substantial when it lands.