Model Interpretability Tools
Inspect, TransformerLens, Sparse Autoencoders, attention visualisation. The toolkit for opening up an LLM has matured. Here is the 2026 stack.
Research tools
- TransformerLens: Python library for activation extraction, patching, and intervention. The de facto standard for circuits research.
- NNsight: comparable intervention capability with a modern, PyTorch-native API; also supports remote interventions on large hosted models.
- Inspect (UK AI Safety Institute): framework for evaluating model behaviour with structured tests.
- OpenAI Evals: similar, focused on capability and safety evaluations.
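The core loop that TransformerLens and NNsight provide, caching activations from one run and patching them into another, can be sketched without either library. The toy two-layer "model" below is a stand-in for a real transformer; the names (`run_with_cache`, `run_with_patch`, hook names) are illustrative, not library API.

```python
# Minimal sketch of activation caching and patching, the pattern that
# TransformerLens / NNsight implement for real transformers.
# The "model" here is a toy stand-in: two layers of simple arithmetic.

def layer0(x):
    return [v * 2 for v in x]

def layer1(x):
    return sum(x)

def run_with_cache(x):
    """Run the toy model, storing every intermediate activation."""
    cache = {"embed": x}
    cache["layer0"] = layer0(cache["embed"])
    cache["layer1"] = layer1(cache["layer0"])
    return cache["layer1"], cache

def run_with_patch(x, hook_name, patched_value):
    """Rerun, overwriting one activation with a value from another run."""
    acts = {"embed": x}
    if hook_name == "embed":
        acts["embed"] = patched_value
    acts["layer0"] = layer0(acts["embed"])
    if hook_name == "layer0":
        acts["layer0"] = patched_value
    return layer1(acts["layer0"])

# Clean vs corrupted inputs, as in causal-tracing experiments.
clean_out, clean_cache = run_with_cache([1, 2, 3])   # -> 12
corrupt_out, _ = run_with_cache([0, 0, 3])           # -> 6
# Patch the clean layer0 activation into the corrupted run:
patched_out = run_with_patch([0, 0, 3], "layer0", clean_cache["layer0"])
# If patching restores the clean output, layer0 carried the causal signal.
```

The same three-step recipe (clean run with cache, corrupted run, patched rerun) is how circuits research localises which component of a model carries a behaviour.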
Production tools
- BertViz-style attention visualisation: see which prompt tokens influenced which output.
- Sparse autoencoder readouts: see which interpretable features fired during a generation.
- LangSmith / Langfuse: trace LLM call chains for debugging and evaluation.
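A sparse-autoencoder readout reduces to one linear map plus a ReLU: feature activations are `ReLU(W_enc @ (x - b_dec) + b_enc)`, and only a handful of features fire on any given activation. The weights below are made-up toy numbers, chosen purely to show the shape of the computation.

```python
# Sketch of an SAE feature readout on a single activation vector.
# W_enc, b_enc, b_dec are toy values for illustration, not trained weights.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_encode(x, W_enc, b_enc, b_dec):
    """Feature activations: f = ReLU(W_enc @ (x - b_dec) + b_enc)."""
    centred = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(W_enc, centred), b_enc)]
    return relu(pre)

# Toy dictionary: 4 features over a 3-dimensional activation.
W_enc = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [1.0, 1.0, 1.0]]
b_enc = [-0.5, -0.5, -0.5, -2.5]
b_dec = [0.0, 0.0, 0.0]

acts = sae_encode([1.0, 0.1, 0.0], W_enc, b_enc, b_dec)
fired = [i for i, a in enumerate(acts) if a > 0]
# Sparse readout: only feature 0 crosses its threshold on this input.
```

In a real deployment the fired feature indices are mapped to human-readable labels, which is what makes the readout interpretable.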
Practical uses
Debug surprising outputs. Audit for prompt-injection signals. Build steering interventions (suppress a "hallucination feature" while generating). Investigate why a model's behaviour changed after a fine-tune.