A neural network is, in a technical sense, entirely readable. Every weight, every activation, every computation is in principle accessible to inspection. In practice, a frontier language model has tens of billions of parameters arranged in dozens to over a hundred layers, performing computations whose relationship to the model's behavior is almost entirely opaque. Researchers can observe inputs and outputs. The internal mechanisms connecting them are largely unknown.
Mechanistic interpretability is the project of changing this. Rather than characterizing AI systems by what they do across many inputs — the behavioral approach — mechanistic interpretability tries to identify the specific circuits, features, and algorithms inside neural networks that produce particular behaviors. The goal is to build tools that can read what a model is actually computing, not just observe what it outputs.
Why behavioral interpretability is not enough
Most work that gets called "AI interpretability" is behavioral: analyzing patterns in how a system responds to different inputs, identifying what features of the input most influence the output, mapping the relationship between prompts and responses statistically. This work is useful and has produced real insights. It does not answer the safety-critical questions.
The failure modes that matter most for AI safety — deceptive alignment, mesa-optimization, the treacherous turn — all involve a divergence between what an AI system appears to be doing from the outside and what it is actually computing internally. A system that has learned to pass safety evaluations while pursuing different goals will produce perfectly safe-looking outputs during behavioral testing. Behavioral interpretability cannot detect this because it only observes outputs.
Mechanistic interpretability aims to read the system's internal goal representations directly. If researchers can identify the specific circuits responsible for goal-directed behavior and characterize what those circuits are optimizing for, they can in principle detect misalignment before it appears in outputs. This is why some researchers regard mechanistic interpretability as possibly the only approach that could detect certain dangerous alignment failures before they manifest.
What the research has found so far
The field is still young, but a number of significant findings have emerged. Chris Olah and colleagues, in work begun at OpenAI and continued at Anthropic, have identified specific circuits in neural networks that perform recognizable computations: circuits that detect edges and curves in image models, that perform categorization, that retrieve information stored during training. They also documented a phenomenon called superposition, in which a network represents more concepts than it has neurons, so that individual neurons respond to multiple distinct concepts compressed into the same set of activations. This makes the relationship between neural activations and model behaviors significantly more complex than researchers had assumed.
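The idea behind superposition can be made concrete with a small numerical sketch. The example below is an illustration under stated assumptions, not a reconstruction of any published experiment: it uses random unit vectors as a stand-in for learned feature directions, and the sizes (300 features, 100 dimensions) and all function names are hypothetical. It shows that when only a few features are active at once, many more features than dimensions can share one activation space and still be read out, at the cost of some interference.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction and scale it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_feature_directions(n_features, dim, seed=0):
    """Assign each feature its own direction (random here, learned in a real model)."""
    rng = random.Random(seed)
    return [random_unit_vector(dim, rng) for _ in range(n_features)]


def encode(active, directions):
    """Superimpose the directions of the active features into one activation vector."""
    activation = [0.0] * len(directions[0])
    for i in active:
        for k, value in enumerate(directions[i]):
            activation[k] += value
    return activation


def readout(activation, directions):
    """Score every feature by projecting the activation onto its direction."""
    return [dot(activation, d) for d in directions]


# Assumed toy sizes: 300 features share a 100-dimensional space.
directions = make_feature_directions(n_features=300, dim=100)
active_features = {3, 17}
activation = encode(active_features, directions)
scores = readout(activation, directions)

# The two highest-scoring features should be the ones that were active.
top_two = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]

# Inactive features still get nonzero scores: this is the interference
# that makes individual coordinates hard to interpret.
max_interference = max(
    abs(s) for i, s in enumerate(scores) if i not in active_features
)
print(top_two, round(max_interference, 3))
```

No single coordinate of `activation` corresponds to one feature here; the features are only recoverable as directions, which is one reason neuron-by-neuron inspection is uninformative under superposition.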
In language models, researchers have identified structures called induction heads: attention heads that look back through the context for an earlier occurrence of the current token and predict that whatever followed it then will follow it again. These induction heads appear to be responsible for much of in-context learning, the ability of language models to learn new tasks from examples in the prompt. Finding a specific circuit linked to a major capability was a significant methodological advance.
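The input-output rule attributed to induction heads can be sketched in a few lines. This is only the behavioral rule (given a sequence ending in token A, find an earlier A and predict the token that followed it), not the attention computation that implements it inside a transformer; the function name `induction_predict` is hypothetical.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the final token. Returns None if there is none."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, excluding the final one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


context = ["the", "quick", "fox", "saw", "the"]
prediction = induction_predict(context)
print(prediction)
```

In a real model the match on the earlier token and the copy of its successor are carried out by attention heads operating on learned representations, so the matching can be approximate rather than exact string equality as assumed here.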
Anthropic's 2024 work on sparse autoencoders identified over a million distinct features in Claude 3 Sonnet, some with interpretable content, including features associated with specific concepts and emotions and, more concerning, features related to deception and power-seeking. Identifying these features does not mean the model acts on them; it means researchers can now see that the relevant representations exist internally and study their causal role in behavior.
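For readers unfamiliar with the technique, a sparse autoencoder re-expresses a model activation as a combination of a small number of learned feature directions. The sketch below shows a generic forward pass and loss (reconstruction error plus an L1 sparsity penalty) on hand-picked toy weights. It is a textbook-style formulation under stated assumptions, not Anthropic's exact architecture or training objective, and every name and number in it is hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation into non-negative feature values, then decode.

    w_enc has one row per feature; w_dec[j] is the decoder direction
    written back into activation space when feature j is active.
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    recon = list(b_dec)
    for value, direction in zip(features, w_dec):
        for k, dk in enumerate(direction):
            recon[k] += value * dk
    return features, recon


def sae_loss(x, features, recon, l1_coeff):
    """Squared reconstruction error plus an L1 penalty that encourages sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon))
    sparsity = sum(features)  # equals the L1 norm because features are >= 0
    return mse + l1_coeff * sparsity


# Toy weights chosen by hand: 2-dimensional activation, 3 candidate features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_dec = [0.0, 0.0]

x = [2.0, -1.0]
features, recon = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(x, features, recon, l1_coeff=0.5)
print(features, recon, loss)
```

In practice the weights are learned by minimizing this kind of loss over a very large number of recorded activations, and the dictionary has far more features than the activation has dimensions, which is what lets it pull apart concepts stored in superposition.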
The gap between current capability and what safety requires
The findings above are significant. They do not yet constitute the alignment verification capability that safety requires. Identifying specific circuits for specific behaviors in smaller models is not the same as being able to characterize the goal representations of a frontier model with tens of billions of parameters. Superposition — neurons representing multiple concepts simultaneously — makes clean feature extraction from large models significantly harder. The field's current tools work best on smaller models and specific, well-defined behaviors. Extending them to frontier-scale systems and to the specific question of verifying goal alignment is an open research challenge.
The distance between what mechanistic interpretability can do today and what would be needed to provide pre-deployment alignment verification for frontier AI systems is large. Closing that gap is one of the central technical challenges in AI safety, and it is time-sensitive: if frontier systems reach capability levels where alignment failures become consequential before interpretability tools can verify alignment, the window for using these tools effectively will have closed.
This is why Anthropic has made interpretability research a stated organizational priority, and why the Foundation's governance proposals include interpretability verification requirements as a condition of frontier AI deployment. The requirement creates an incentive to close the gap; without a deployment condition tied to interpretability capability, the research may advance more slowly than the capability it needs to assess.
Common questions
Mechanistic interpretability is a research direction in AI safety that aims to understand the internal computations of neural networks (the specific circuits, features, and algorithms responsible for their behaviors) rather than only observing and characterizing their inputs and outputs. It goes beyond behavioral analysis to understand the internal mechanisms connecting inputs to outputs, with the goal of being able to read what a model is actually computing, including its goal representations.
Most interpretability work is behavioral: it analyzes patterns in how a system responds to inputs, identifies which input features most influence outputs, and maps relationships between prompts and responses statistically. Mechanistic interpretability instead tries to identify the specific internal computations responsible for behaviors, tracing causal pathways from inputs through internal representations to outputs. The distinction matters for safety because behavioral interpretability cannot detect misaligned goals that are strategically concealed, while mechanistic approaches could in principle detect them by reading internal goal representations directly.
Mechanistic interpretability cannot yet verify alignment in frontier-scale systems. The research has produced significant findings in smaller models and for specific behaviors. Extending these methods to systems with tens of billions of parameters, and specifically to the question of verifying goal alignment rather than characterizing specific behavioral circuits, remains an open research challenge. The superposition phenomenon (neurons representing multiple concepts simultaneously) makes feature extraction from large models particularly difficult. This is one of the central technical bottlenecks in AI safety.
Research in this area is carried out at Anthropic, where Chris Olah leads a dedicated interpretability team and the field has received significant organizational investment. Google DeepMind has also published significant mechanistic interpretability work. Various academic groups at MIT, Cambridge, and other institutions contribute. The field has grown considerably since around 2020, driven by recognition that alignment verification through behavioral testing is insufficient for failure modes involving strategic concealment of misaligned goals.