AI & ML Advanced By Samson Tanimawo, PhD Published Dec 26, 2026 4 min read

Model Interpretability Tools

Inspect, TransformerLens, Sparse Autoencoders, attention visualisation. The toolkit for opening up an LLM has matured. Here is the 2026 stack.

Research tools

Active research interpretability tools include TransformerLens (a Python library for hooking into transformer activations), Garcon (Anthropic's tooling for probing model internals, described publicly but primarily internal infrastructure), Tuned Lens (trained per-layer probes that decode hidden states into vocabulary predictions), and sparse autoencoders (SAEs, popularised by Anthropic's dictionary-learning research, with open implementations available for some models). These tools enable mechanistic interpretability work; using them is a research skill rather than a production capability.

The TransformerLens case. Open-source library created by Neel Nanda. It loads pretrained weights (typically from Hugging Face) into its own transformer implementation with hook points on the activations, which lets you intercept and modify them. The usual starter tool for mechanistic interpretability research. Provides primitives for activation patching, attention analysis, and circuit identification.
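Activation patching can be illustrated without any library. The sketch below is a toy, not TransformerLens code: the two-layer "model", its weights, and the `run` and `restored_fraction` helpers are all hypothetical. It caches activations from a clean input, overwrites one activation during a corrupted run, and measures how much of the clean output is restored.

```python
# Toy illustration of activation patching; the model and helper names are
# hypothetical and stand in for a real transformer with hook points.

def layer0(x):
    # Mixes both inputs into a 2-d hidden state.
    return [x[0] + x[1], x[0] - x[1]]

def layer1(h):
    # Only the first hidden unit influences the output.
    return [2.0 * h[0], 0.0 * h[1]]

def readout(h):
    return h[0] + h[1]

LAYERS = [layer0, layer1]

def run(x, patch=None):
    """Run the model and return (output, per-layer activation cache).

    patch: optional (layer_index, unit_index, value) that overwrites one
    activation right after that layer, before later layers run.
    """
    cache = []
    h = list(x)
    for i, layer in enumerate(LAYERS):
        h = layer(h)
        if patch is not None and patch[0] == i:
            h[patch[1]] = patch[2]
        cache.append(list(h))
    return readout(h), cache

clean_x, corrupt_x = [3.0, 1.0], [0.0, 1.0]
clean_out, clean_cache = run(clean_x)
corrupt_out, _ = run(corrupt_x)

def restored_fraction(layer_idx, unit_idx):
    # Patch one clean activation into the corrupted run.
    clean_value = clean_cache[layer_idx][unit_idx]
    patched_out, _ = run(corrupt_x, patch=(layer_idx, unit_idx, clean_value))
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)

effects = {(l, u): restored_fraction(l, u) for l in range(2) for u in range(2)}
for site, frac in sorted(effects.items()):
    print(f"layer {site[0]} unit {site[1]}: restores {frac:.0%} of the clean output")
```

Sites that restore most of the clean output are candidates for carrying the relevant information; sites that restore nothing can be deprioritised. In a real model the same loop would run over layers, positions, and heads.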

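The Tuned Lens / logit lens. The logit lens projects hidden-state activations into the output vocabulary at each layer using the model's own unembedding; the Tuned Lens first passes each layer's state through a small trained translator. Both reveal what the model "thinks" at each layer. Useful for understanding gradual computation across layers; production usefulness is limited.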
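The per-layer projection can be sketched in a few lines. This is a minimal logit-lens illustration under stated assumptions: the vocabulary, the unembedding matrix `W_U`, and the per-layer hidden states are made-up numbers, and the final layer norm that a real model would apply before unembedding is omitted. A tuned-lens variant would apply a learned per-layer map to each hidden state before the projection.

```python
import math

VOCAB = ["cat", "dog", "fish"]

# Hypothetical unembedding matrix: one row per vocabulary item,
# one column per residual-stream dimension.
W_U = [
    [1.0, 0.0],    # cat
    [0.0, 1.0],    # dog
    [-1.0, -1.0],  # fish
]

# Hypothetical residual-stream state after each layer, for one token position.
hidden_by_layer = [
    [0.1, 0.0],
    [0.5, 0.9],
    [0.2, 3.0],
]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden):
    # Project the hidden state straight into vocabulary space.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in W_U]
    probs = softmax(logits)
    best = max(range(len(VOCAB)), key=lambda i: probs[i])
    return VOCAB[best], probs[best]

trajectory = [logit_lens(h) for h in hidden_by_layer]
top_tokens = [token for token, _ in trajectory]

for layer, (token, prob) in enumerate(trajectory):
    print(f"layer {layer}: top token {token!r} with probability {prob:.2f}")
```

Reading the trajectory layer by layer shows where the prediction changes and where confidence builds, which is the "gradual computation" view described above.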

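The SAE tools. Sparse autoencoders decompose a model's activations into a larger set of sparsely active features; Anthropic's published research popularised the approach. Implementations and trained dictionaries have been released for some models. Researchers can use them to inspect features of those specific models; training new SAEs requires substantial compute.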
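For readers new to SAEs, the forward pass is small enough to write out. The sketch below assumes the common ReLU formulation with an L1 sparsity penalty; the weights are hand-picked for illustration and do not come from any published dictionary, and `sae_forward` is a hypothetical helper name.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def matvec(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

# Hypothetical hand-picked weights: 2-d activations, 3 dictionary features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # features x d_model
B_ENC = [0.0, 0.0, -1.5]
W_DEC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # d_model x features
B_DEC = [0.0, 0.0]

def sae_forward(activation, l1_coeff=0.01):
    """Encode an activation into sparse features, decode it, and score it."""
    pre = [p + b for p, b in zip(matvec(W_ENC, activation), B_ENC)]
    features = relu(pre)  # non-negative codes, ideally mostly zero
    recon = [r + b for r, b in zip(matvec(W_DEC, features), B_DEC)]
    mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
    l1 = sum(features)    # features are already >= 0 after the ReLU
    return features, recon, mse + l1_coeff * l1

features, recon, loss = sae_forward([2.0, 0.0])
active = [i for i, f in enumerate(features) if f > 0.0]
print("active features:", active, "loss:", round(loss, 4))
```

Inspection work then amounts to asking which inputs make a given feature index fire. The expensive part is training `W_ENC` and `W_DEC` on very large numbers of activations, which this sketch skips entirely.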

The vibe of the research-tool space. Fast-moving; tools come and go. The intent is to enable mechanistic understanding, not deploy production interpretation. Researchers who use these tools are mostly safety teams at frontier labs and academic groups.

Production tools

Production interpretability is much narrower. SHAP and LIME for tabular ML. Attention visualisation for transformer models. Captum for PyTorch. Custom dashboards built on logit-lens or activation patching. The production landscape is far less mature than the research landscape.

The SHAP/LIME case. Mature for tabular ML. Show feature attributions: which features drove the prediction, and by how much. Less commonly applied to LLMs, but used widely for credit decisions, fraud detection, and structured-data classification. Some regulations require this kind of per-decision explanation.
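SHAP's attributions are approximations of Shapley values, which can be computed exactly for a tiny model. The sketch below does not use the `shap` library; it enumerates every feature coalition by brute force. The `score` function and its coefficients are hypothetical, and "missing" features are assumed to take an all-zero baseline.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every coalition of features.

    Features outside a coalition are set to their baseline value. Cost is
    exponential in the number of features, so this only suits tiny inputs.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical credit-style scorer with one interaction term.
def score(features):
    income, debt, late_payments = features
    return 0.5 * income - 0.3 * debt - 2.0 * late_payments + 0.1 * income * debt

x = [4.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(score, x, baseline)

for name, contribution in zip(["income", "debt", "late_payments"], phi):
    print(f"{name}: {contribution:+.2f}")
```

The attributions sum to the difference between the prediction and the baseline prediction, and the interaction term is split evenly between the two features involved. That additivity is a large part of why this family of methods suits per-decision explanations.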

The attention-visualisation case. Show which tokens attended to which other tokens. Useful for sanity checking; less useful for "explanation" in the regulated sense. Many libraries (BertViz, Hugging Face's tools) provide this out of the box.
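What these visualisations plot is the softmax-normalised attention matrix. A minimal sketch of that computation for a single head follows; the token list and the query and key vectors are made-up values, and causal masking is left out for brevity.

```python
import math

def attention_weights(queries, keys):
    """Scaled dot-product attention weights: softmax(Q K^T / sqrt(d)) per row."""
    d = len(keys[0])
    rows = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Hypothetical tokens with hand-picked query and key vectors for one head.
tokens = ["the", "cat", "sat"]
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[2.0, 0.0], [0.0, 3.0], [0.0, 0.0]]

weights = attention_weights(Q, K)
most_attended = [tokens[max(range(len(row)), key=lambda j: row[j])] for row in weights]

# Plain-text stand-in for the usual heatmap.
for token, row in zip(tokens, weights):
    cells = "  ".join(f"{w:.2f}" for w in row)
    print(f"{token:>4} -> {cells}")
```

Each row shows how one query token distributes its attention over the keys, so every row sums to one. A high weight says where the head looked, not that the attended token caused the output.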

The Captum case. PyTorch's interpretability library. Includes integrated gradients, layer attribution, saliency maps. Applies to PyTorch models broadly. Production-ready; battle-tested.
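Integrated gradients, one of the methods listed above, can be sketched without PyTorch. The version below is not Captum's implementation: it approximates gradients with central finite differences and the path integral with a midpoint Riemann sum. The `model` function is a hypothetical stand-in for a differentiable network.

```python
def integrated_gradients(f, x, baseline, steps=200, eps=1e-5):
    """Approximate integrated gradients along the straight path baseline -> x."""
    n = len(x)

    def grad(point):
        # Central finite differences; a real setup would use autograd instead.
        g = []
        for i in range(n):
            up, down = list(point), list(point)
            up[i] += eps
            down[i] -= eps
            g.append((f(up) - f(down)) / (2 * eps))
        return g

    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        avg_grad = [a + gi / steps for a, gi in zip(avg_grad, grad(point))]

    # Scale the average gradient by the input's distance from the baseline.
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]

# Hypothetical model: quadratic in the first feature, linear in the second.
def model(v):
    return v[0] ** 2 + 3.0 * v[1]

x, baseline = [2.0, 1.0], [0.0, 0.0]
attributions = integrated_gradients(model, x, baseline)
print([round(a, 4) for a in attributions])
```

A useful sanity check is completeness: the attributions should sum to `model(x) - model(baseline)`. If they do not, the step count is too low or the baseline is poorly chosen.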

The custom dashboard reality. Most production interpretability is custom dashboards: per-application visualisations of model behaviour. Not generic; specific to use case. Built ad-hoc; useful for specific debugging needs.

The regulated-domain reality. Healthcare, finance, and hiring use interpretability for compliance. Required: feature attribution per decision. SHAP is the workhorse. Documentation of the methodology is itself a compliance artifact.

Practical uses

Debugging: when the model produces unexpected output, interpretability tools help explain why. Compliance: regulated domains require explanation of decisions. Bias auditing: find which features the model relies on for sensitive predictions. Safety review: at frontier labs, interpretability is part of pre-deployment safety reviews.

The debugging use. A production model produces a strange output. What features drove it? Inspecting attributions often reveals the bug: the model relied on a feature it shouldn't have, or missed one it should have used. Thirty minutes spent on this is usually a worthwhile investment.

The compliance use. Regulated decisions (credit, hiring, healthcare) require explanation. Per-decision feature attributions are produced and stored. Auditors can review. The interpretability investment is part of compliance overhead, not optional.
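One way to make "produced and stored" concrete is a small, serialisable record per decision. The sketch below is an assumption about what such a record might contain, not a regulatory schema: every field name, the `ExplanationRecord` class, and the `additivity_gap` check are hypothetical.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass
class ExplanationRecord:
    """Hypothetical per-decision explanation record for audit storage."""
    decision_id: str
    model_version: str
    method: str                 # e.g. "shap"; documents the methodology used
    prediction: float
    baseline_prediction: float  # model output on the reference input
    attributions: dict          # feature name -> contribution
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)

def additivity_gap(record):
    # For additive methods, baseline + contributions should equal the prediction.
    total = record.baseline_prediction + sum(record.attributions.values())
    return abs(total - record.prediction)

record = ExplanationRecord(
    decision_id="loan-0001",
    model_version="credit-model-v3",
    method="shap",
    prediction=0.40,
    baseline_prediction=0.30,
    attributions={"income": 0.25, "debt_ratio": -0.10, "late_payments": -0.05},
)

assert additivity_gap(record) < 1e-9, "attributions do not explain the prediction"
print(record.to_json())
```

Storing the method name and model version alongside the attributions lets an auditor reproduce the explanation later. Running the additivity check at write time catches records whose numbers do not add up before they reach storage.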

The bias-auditing use. Test the model on demographic groups; track feature attribution patterns; find disparate reliance. Common pattern: model relies on a proxy variable (zip code) that correlates with race. Bias audit detects; team mitigates.
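The "disparate reliance" check can be as simple as comparing how heavily one feature is weighted for each group. The sketch below assumes per-decision attributions already exist; the rows, group labels, and the `reliance_ratio` helper are all hypothetical, and the ratio is a screening signal rather than a legal test.

```python
from collections import defaultdict

def mean_abs_attribution_by_group(rows, feature):
    """Average absolute attribution of one feature within each group."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        totals[row["group"]] += abs(row["attributions"][feature])
        counts[row["group"]] += 1
    return {group: totals[group] / counts[group] for group in totals}

def reliance_ratio(by_group):
    """Ratio between the most and least reliant groups (1.0 means equal)."""
    smallest, largest = min(by_group.values()), max(by_group.values())
    return float("inf") if smallest == 0 else largest / smallest

# Hypothetical per-decision attributions, tagged with a demographic group.
rows = [
    {"group": "A", "attributions": {"zip_code": 0.4, "income": 0.2}},
    {"group": "A", "attributions": {"zip_code": -0.6, "income": 0.1}},
    {"group": "B", "attributions": {"zip_code": 0.1, "income": 0.3}},
    {"group": "B", "attributions": {"zip_code": 0.1, "income": 0.2}},
]

by_group = mean_abs_attribution_by_group(rows, "zip_code")
ratio = reliance_ratio(by_group)
print(by_group, f"ratio={ratio:.1f}")
```

A ratio well above one for a proxy feature such as zip code is a prompt to investigate further. What threshold counts as "disparate" is a policy decision this sketch does not make.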

The safety-review use. Frontier labs use interpretability in pre-deployment reviews. Look for circuits that suggest harmful behaviours, deceptive cognition, or capabilities nobody expected. The reviews are not bulletproof, but they catch some issues that other safety measures miss.

The "I want to understand my model" use. Curiosity-driven; surprisingly common; usually less productive than people expect. Models are complex; understanding rarely matches expectations. Interpretability tools satisfy curiosity but rarely produce immediate ROI.

Common antipatterns

Trusting attention attribution as causation. Attention weights show where a head looked, which is correlational evidence, not proof of what caused the output. Use intervention experiments (patching) for causal claims.

Single-method interpretability. Different methods reveal different things. Cross-check with multiple approaches.
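A lightweight cross-check is to compare how two methods rank the same features. The sketch below computes a Spearman rank correlation over absolute attributions; the two attribution vectors are invented for illustration, and the formula used assumes there are no tied magnitudes.

```python
def importance_ranks(attributions):
    """Rank features by absolute attribution, 1 = most important."""
    order = sorted(
        range(len(attributions)),
        key=lambda i: abs(attributions[i]),
        reverse=True,
    )
    ranks = [0] * len(attributions)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

def spearman(a, b):
    """Spearman rank correlation; this closed form assumes no tied ranks."""
    ra, rb = importance_ranks(a), importance_ranks(b)
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical attributions for the same four features from two methods.
method_a = [0.50, -0.30, 0.10, 0.05]
method_b = [0.45, 0.08, -0.25, 0.02]

agreement = spearman(method_a, method_b)
print(f"rank agreement: {agreement:.2f}")
```

High agreement does not prove either method right, but low agreement is a clear signal to dig further before acting on any single explanation.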

Skipping domain-specific eval. Generic interpretability metrics rarely match what your stakeholders care about. Build domain-specific eval criteria.

Over-trusting LLM explanations of their own reasoning. Models confabulate. Self-explanations may not match actual computation.

What to do this week

Three moves. (1) For a regulated decision in your stack, verify you have decision-level explanations. If not, add SHAP or similar before regulators ask. (2) For your highest-impact ML model, run a simple interpretability pass (SHAP for tabular, attention visualisation for transformers). The first pass usually surfaces one or two surprises. (3) If you're at a frontier lab, build interpretability into your safety review process. The best time to add it is before you need it.