Mechanistic Interpretability, Explained: From Neurons and Activations to Circuits and Steering

Key Takeaways

  • Neurons are fixed units, activations are their per-input values, and the most useful concepts often live at the feature or circuit level instead of in one neuron.
  • Polysemanticity and superposition are why single-neuron explanations often fail: models pack multiple features into limited dimensions.
  • Sparse autoencoders try to untangle those mixed representations into sparser, more interpretable features, but they are a helpful tool rather than a complete solution.
  • Activation patching is valuable because it tests causality by swapping internal states between runs instead of relying on correlations alone.
  • Steering vectors show that some internal representations can be nudged directly, which links interpretability to practical model control.

Mechanistic interpretability is the effort to reverse-engineer a neural network into testable internal parts: which neurons, activations, features, attention heads, and pathways caused a behavior, and what changes when you intervene on them.

That makes it different from treating a model as a black box. Instead of only measuring inputs and outputs, mechanistic interpretability asks what computation happened inside the model. The field is still early and incomplete, but it is already moving AI analysis from educated guessing toward a more scientific loop of identify, intervene, and verify.

Mechanistic interpretability terms at a glance

Term | Plain-language meaning | Why it matters
Neuron | A single dimension or unit inside the network | Useful, but often too small and too mixed to tell the whole story
Activation | The value a neuron or feature takes on a specific input | Shows what is active right now, not what the unit always means
Feature | A reusable concept or pattern represented inside the model | Often a better unit of analysis than one neuron
Circuit | A connected set of components that together implement a behavior | Gets you from local parts to causal explanations
Polysemanticity | One neuron or direction representing multiple unrelated things | Explains why simple neuron labels are often misleading
Activation patching | Swap in internal activations from one run to another | Tests whether a component causally matters
Steering vector | An internal direction added during inference to nudge behavior | Shows that some model properties can be intervened on directly

The units inside the black box: neurons, activations, features, and circuits

Neurons are fixed slots; activations are their current values

A neuron is a component in the network architecture. An activation is the number that component takes on for one specific token or input. This distinction matters because saying “this neuron is the sarcasm neuron” is usually too simplistic. The same neuron may activate strongly for different reasons in different contexts, and a low activation on one prompt does not mean the underlying capability vanished.

A good first mental model is this: neurons are like the axes of the model’s internal space, while activations are the coordinates of the current example along those axes. Mechanistic interpretability studies both the fixed structure and the changing state.
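To make the distinction concrete, here is a minimal sketch of a toy layer. The weights, the layer size, and the two inputs are all made up for illustration; the only point is that the neurons (rows of weights) stay fixed while their activations change with each input.

```python
def relu(x):
    """Standard ReLU nonlinearity: negative values become zero."""
    return max(0.0, x)

# Hypothetical weights for a layer with 3 neurons reading 2 input values.
# These stay fixed once training is done: they are the "slots".
WEIGHTS = [
    [1.0, -1.0],  # neuron 0
    [0.5, 0.5],   # neuron 1
    [-1.0, 2.0],  # neuron 2
]

def activations(inputs):
    """Return one activation per neuron for a single input vector."""
    return [relu(sum(w * x for w, x in zip(row, inputs))) for row in WEIGHTS]

# Same neurons, two different inputs, two different sets of activations.
acts_a = activations([1.0, 0.0])
acts_b = activations([0.0, 1.0])
print(acts_a)
print(acts_b)
```

In this toy setup, neuron 2 is silent on the first input and strongly active on the second. Nothing about the neuron changed between the two runs; only its activation did.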

Features are often more useful than single neurons

Researchers increasingly treat features as the more useful unit of analysis. A feature is a pattern represented by the model, such as a quotation mark, a city name, a language-independent concept, a type signature in code, or a refusal-related behavior. Sometimes a feature lines up with one neuron, but often it is spread across multiple neurons or mixed with other concepts.

This is why modern mechanistic interpretability often moves from “which neuron fired?” to “which feature was present, and how was it encoded?” That shift makes the field more practical for large language models, where one-neuron stories often break down.

Circuits are the causal pathways behind behavior

A circuit is a small subgraph of the model that collectively performs a job. One component may detect a pattern, another may copy information across positions, and another may push the final logits toward a specific output. When people say mechanistic interpretability is trying to understand how a model actually computes, circuits are usually the destination.

This is the big leap from descriptive interpretability to causal interpretability. A neuron label tells you what correlates with activity. A circuit explanation tries to show how activity flows from input to output.

Why single-neuron stories break: polysemanticity and superposition

The hardest obstacle in mechanistic interpretability is that models often do not store one clean concept per neuron. Instead, they compress many features into limited dimensions. This is where polysemanticity and superposition come in.

Polysemanticity means one neuron or direction responds to several different concepts. A unit might light up for legal language, DNA sequences, and formatting artifacts, not because those things belong together in human language, but because the network found an efficient way to pack them into the same representational space.

Superposition is the broader idea that neural networks can represent more features than they have dimensions by assigning features to overlapping, non-orthogonal directions. This works best when features are sparse, meaning only a few of them are active on any given input. That makes the network efficient, but it makes interpretation harder. If several concepts share representational real estate, inspecting one neuron in isolation can be misleading.
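A small numerical sketch can make this less abstract. The three feature names and their directions below are invented for illustration (they echo the legal language, DNA, and formatting example above), and packing them into two dimensions at 120-degree spacing is just one convenient toy arrangement, not a claim about any real model.

```python
import math

# Three hypothetical feature directions packed into a 2-dimensional space,
# spaced 120 degrees apart so they interfere with each other as little as possible.
FEATURES = {
    "legal": (1.0, 0.0),
    "dna": (math.cos(2 * math.pi / 3), math.sin(2 * math.pi / 3)),
    "format": (math.cos(4 * math.pi / 3), math.sin(4 * math.pi / 3)),
}

def embed(active):
    """Add up the directions of the active features into one 2-d hidden state."""
    return (
        sum(FEATURES[name][0] for name in active),
        sum(FEATURES[name][1] for name in active),
    )

def readout(hidden):
    """Score each feature by projecting the hidden state onto its direction."""
    return {
        name: hidden[0] * d[0] + hidden[1] * d[1]
        for name, d in FEATURES.items()
    }

# Sparse case: one feature active, so its score stands out despite interference.
sparse = readout(embed(["dna"]))

# Dense case: all three active at once, and the directions cancel each other out.
dense = readout(embed(["legal", "dna", "format"]))

# Polysemanticity: "neuron 0" (the first coordinate) moves for every feature.
neuron0 = {name: embed([name])[0] for name in FEATURES}

print({name: round(score, 2) for name, score in sparse.items()})
print({name: round(value, 2) for name, value in neuron0.items()})
```

With only "dna" active, its score is about 1.0 and the other two sit near -0.5, so the feature is still readable. With all three active, every score collapses to roughly zero, which is why sparsity matters. The first coordinate responds to all three features, which is what a polysemantic neuron looks like in miniature.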

This is why mechanistic interpretability now focuses so much on finding better bases for understanding model internals. The question is no longer just “what does this neuron mean?” It is “what hidden features are tangled together here, and can we separate them into something more legible?”

The current toolkit: sparse autoencoders, activation patching, and steering vectors

Sparse autoencoders try to untangle features

Sparse autoencoders, often shortened to SAEs, are one of the most important current tools in mechanistic interpretability. You train an auxiliary model on the activations of the original network. The SAE learns a larger set of sparse latent units that can reconstruct those activations while turning on only a small number of latents at once.

The practical hope is that these sparse latent units behave more like clean features than raw neurons do. Instead of one neuron half-representing five things, the SAE may recover several separate feature directions. That is why SAE papers often talk about monosemanticity: the goal is to recover units that each correspond to a single, more coherent concept.

But SAEs are not magic. They help; they do not solve the problem. They can still miss features, split one concept into several latents, or recover features that are interpretable in some contexts and confusing in others. For teams reading the literature, the right posture is: promising decomposition method, not final microscope.
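The sketch below shows the shape of the computation only: an encoder with a ReLU, a decoder, and a loss that combines reconstruction error with an L1 sparsity penalty. It is not a trained SAE. The directions are hand-picked, the bias terms that real SAEs usually include are omitted, and the names `DIRECTIONS` and `l1_coeff` are assumptions made for this example.

```python
import math

def relu(x):
    """Standard ReLU nonlinearity: negative values become zero."""
    return max(0.0, x)

# Hypothetical dictionary of 3 latent directions for a 2-d activation space.
# A real SAE learns these weights by gradient descent; they are hand-picked here.
DIRECTIONS = [
    (1.0, 0.0),
    (math.cos(2 * math.pi / 3), math.sin(2 * math.pi / 3)),
    (math.cos(4 * math.pi / 3), math.sin(4 * math.pi / 3)),
]

def encode(activation):
    """Encoder: project onto each direction, then ReLU so negative matches go to zero."""
    return [relu(d[0] * activation[0] + d[1] * activation[1]) for d in DIRECTIONS]

def decode(latents):
    """Decoder: rebuild the activation as a weighted sum of the directions."""
    return [
        sum(f * d[0] for f, d in zip(latents, DIRECTIONS)),
        sum(f * d[1] for f, d in zip(latents, DIRECTIONS)),
    ]

def sae_loss(activation, l1_coeff=0.1):
    """Reconstruction error plus an L1 penalty that rewards sparse latents."""
    latents = encode(activation)
    recon = decode(latents)
    mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
    l1 = sum(abs(f) for f in latents)
    return latents, recon, mse + l1_coeff * l1

# An activation that lies exactly along direction 1 turns on only latent 1.
latents, recon, loss = sae_loss(DIRECTIONS[1])
active = [i for i, f in enumerate(latents) if f > 0]
print(active)

# An activation that mixes two directions is reconstructed less faithfully.
mixed = [DIRECTIONS[0][0] + DIRECTIONS[1][0], DIRECTIONS[0][1] + DIRECTIONS[1][1]]
mixed_latents, mixed_recon, mixed_loss = sae_loss(mixed)
```

In this toy dictionary, the single-feature input is reconstructed almost exactly with one active latent, while the mixed input comes back at about half its original magnitude. That gap is a small-scale version of the caveat above: the decomposition is useful, but it is lossy.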

Activation patching tests causal importance

Activation patching is a causal intervention method. You run the model on two inputs: usually one clean case where it behaves correctly and one corrupted case where a crucial piece of information is missing or scrambled. Then you copy a chosen internal activation from the clean run into the corrupted run. If the output recovers, that activation probably mattered.

This is powerful because it moves beyond correlation. Instead of saying “this head seems active when the answer is right,” patching asks “if I restore this internal state, does the behavior come back?” That is the kind of evidence you need to support a circuit claim.

In practice, patching is used to localize where information lives, which components transport it, and whether a hypothesized circuit is actually doing work or just happening to be nearby.
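Here is a minimal sketch of the clean-versus-corrupted procedure on a toy two-layer model. The weights, inputs, and the "fraction recovered" metric are all assumptions chosen for illustration; on a real transformer you would patch attention heads or residual-stream activations through your framework's hooks rather than a Python list.

```python
def relu(x):
    """Standard ReLU nonlinearity: negative values become zero."""
    return max(0.0, x)

# Hypothetical toy model: 2 inputs -> 2 hidden units -> 1 output.
W_HIDDEN = [[1.0, 0.0], [0.0, 1.0]]
W_OUT = [2.0, 0.0]  # only hidden unit 0 actually feeds the output

def hidden_acts(inputs):
    """Compute the hidden-layer activations for one input."""
    return [relu(sum(w * x for w, x in zip(row, inputs))) for row in W_HIDDEN]

def run(inputs, patch=None):
    """Run the model; `patch` maps a hidden-unit index to a value to overwrite."""
    hidden = hidden_acts(inputs)
    for idx, value in (patch or {}).items():
        hidden[idx] = value
    return sum(w * h for w, h in zip(W_OUT, hidden))

clean_input = [1.0, 1.0]      # case where the model behaves correctly
corrupted_input = [0.0, 0.0]  # case where the key information is missing

clean_hidden = hidden_acts(clean_input)
clean_out = run(clean_input)
corrupted_out = run(corrupted_input)

# Patch each hidden unit from the clean run into the corrupted run, one at a
# time, and measure what fraction of the clean output is recovered.
recovery = {}
for idx in range(len(clean_hidden)):
    patched_out = run(corrupted_input, patch={idx: clean_hidden[idx]})
    recovery[idx] = (patched_out - corrupted_out) / (clean_out - corrupted_out)

print(recovery)
```

Both hidden units are active on the clean input, so a correlational reading would flag both. Patching separates them: restoring unit 0 recovers the full output, while restoring unit 1 recovers none of it, because unit 1 has no path to the output in this toy model.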

Steering vectors intervene rather than just observe

Steering vectors are internal directions added to a model’s hidden states during inference to shift behavior without retraining the full model. If a direction corresponds to truthfulness, refusal, sentiment, style, or another property, adding or subtracting that direction can nudge outputs.

This makes steering vectors especially interesting because they sit at the border between interpretability and control. They are not just a measurement tool. They are an intervention on the computation itself.

That said, steerability is not the same as understanding. A steering vector can work even if your explanation of why it works is incomplete. Recent mechanistic work is valuable because it tries to connect steering back to actual internal circuits instead of treating the vector as a mysterious knob.
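As a rough illustration, the sketch below builds a steering vector as the difference between mean hidden states from two sets of examples, then adds a scaled copy of it to a new hidden state. Difference-of-means is one common recipe and is an assumption here, not the only way to find a direction. The hidden states, the `sentiment_logit` readout, and the scale `alpha` are likewise hypothetical.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors, coordinate by coordinate."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical hidden states collected at one layer for two contrasting prompt sets.
positive_states = [[2.0, 1.0], [4.0, 1.0]]
negative_states = [[-2.0, 1.0], [-4.0, 1.0]]

# Difference of means: the direction that separates the two sets on average.
steering = [
    p - n
    for p, n in zip(mean_vector(positive_states), mean_vector(negative_states))
]

def steer(hidden, vector, alpha):
    """Add a scaled steering vector to a hidden state during inference."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

def sentiment_logit(hidden):
    """Hypothetical downstream readout that only looks at the first coordinate."""
    return 1.0 * hidden[0] + 0.0 * hidden[1]

base = [0.0, 1.0]  # a neutral hidden state
pushed_up = sentiment_logit(steer(base, steering, alpha=0.5))
pushed_down = sentiment_logit(steer(base, steering, alpha=-0.5))
print(sentiment_logit(base), pushed_up, pushed_down)
```

The sign and size of `alpha` control the direction and strength of the nudge. Note what the sketch does not show: why the direction works. That is the gap between a working knob and a mechanistic explanation described above.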

What mechanistic interpretability is already showing us

The field is no longer limited to toy diagrams and vague heatmaps. Researchers have shown examples of shared cross-lingual features, internal planning behavior, partial circuits for specific tasks, and interpretable sparse subcircuits for simple algorithmic behaviors. That does not mean we can read a frontier model like source code, but it does mean there are real internal regularities to find, test, and sometimes manipulate.

  • Neuron-level work can surface units that respond to recognizable patterns, which is useful but limited.
  • Feature-level work can recover more coherent concepts than raw neuron inspection often allows.
  • Circuit-level work can trace how information moves through multiple components to produce an answer or failure mode.
  • Intervention methods like patching and steering can distinguish “this pattern correlates with the behavior” from “this mechanism helps cause the behavior.”

That is why people describe mechanistic interpretability as AI moving from black-box guessing toward testable internals. The key word is testable. A useful interpretability claim should let you predict what will happen if you ablate, patch, or steer a component.

How to think about business value without overclaiming

Most companies do not need a dedicated mechanistic interpretability pipeline before deploying their first agent or chatbot. For many workflows, good evals, grounding, guardrails, permissions, and human review will do more immediate work than frontier interpretability research.

But mechanistic interpretability matters for business teams for three reasons.

  1. Debugging harder failures. When output-based testing tells you a system is failing but not why, internal analysis can help localize the failure mechanism.
  2. Governance and assurance. High-stakes systems increasingly need more than “it seems to work on benchmarks.” Internal evidence may become part of how teams justify trust.
  3. Control. If features and circuits can be localized and intervened on, then future AI systems may become easier to audit, constrain, and adapt without full retraining.

The realistic near-term view is that mechanistic interpretability is a complement to operational safety, not a substitute for it. You still need evals, access controls, grounded retrieval, escalation rules, and monitoring. Interpretability adds another layer: insight into what the model appears to be computing internally.

Common mistakes when reading or applying mechanistic interpretability

  • Confusing neurons with meanings. A neuron is not automatically a concept label.
  • Confusing correlation with causality. A component that lights up near a behavior may not be causing it.
  • Treating SAEs as ground truth. They are learned decompositions with tradeoffs, not final ontologies of the model.
  • Assuming steering means full understanding. A working knob is not yet a full mechanistic explanation.
  • Replacing external evaluation with internal stories. Internal analysis is strongest when it is tied back to measurable model behavior.

A practical checklist after reading this guide

  • Define the specific behavior you want to understand before you inspect internals.
  • Separate neurons from activations in your mental model.
  • Look for features and circuits, not just interesting single units.
  • Expect polysemanticity and superposition to make raw neuron labels unreliable.
  • Use sparse autoencoders to propose cleaner feature decompositions when the task justifies it.
  • Use activation patching when you need causal evidence, not just a visualization.
  • Treat steering vectors as interventions to test hypotheses, not proof that you fully understand the mechanism.
  • Combine interpretability with evals, grounding, approvals, and monitoring in real production systems.

The bottom line is simple: mechanistic interpretability is the attempt to turn model internals into something more like engineering and less like folklore. We are not at full transparency yet. But the field now has enough tools to make specific, falsifiable claims about some internal computations, and that is a meaningful step away from black-box AI.

Frequently Asked Questions

Is mechanistic interpretability the same as explainable AI?

No. Explainable AI often focuses on making outputs easier to justify or summarize, while mechanistic interpretability tries to reverse-engineer the internal computation that produced those outputs.

Why are single neurons often not enough to explain model behavior?

Because many neurons are polysemantic, meaning they respond to multiple unrelated patterns. Important concepts are often distributed across several neurons or mixed together through superposition.

What do sparse autoencoders actually do?

They learn a sparse latent representation that reconstructs a model's internal activations. The hope is that these latent units correspond more cleanly to interpretable features than the original neurons do.

What is activation patching used for?

Activation patching is used to test whether a specific internal activation causally matters for an output. Researchers copy activations from one run into another and check whether the behavior changes.

Should business teams wait for mechanistic interpretability before deploying AI?

No. Most teams should still prioritize evals, grounding, permissions, guardrails, and human review. Mechanistic interpretability is best treated as an additional layer of understanding for harder or higher-stakes systems.
