TransformerLens in Practice: Reading Model Circuits with Activation Patching

In the previous post, we treated Logit Lens as a window into the model's intermediate thoughts.
But "reading" alone cannot answer the most important question:
Does the model actually *use* this information?
Just because a hidden state at some layer contains "Paris" does not mean that layer causally contributes to the final answer. Information can be present but unused. A layer might hold the right answer in its representation, yet the model might arrive at its output through entirely different pathways.
To determine what actually matters, we need more than visualization. We need causal intervention: directly manipulating the model's internals and observing how the output changes.
1. TransformerLens: A Surgical Toolkit for Interpretability
TransformerLens is a mechanistic interpretability library created by Neel Nanda. Its core capability is attaching hooks to every internal activation in a Transformer, allowing you to read, modify, and replace activations at will.
pip install transformer_lens
HookedTransformer: A Model Wired with Hooks
The central class in TransformerLens is HookedTransformer. It behaves identically to a standard Transformer, but every important activation site has a HookPoint inserted into it.
from transformer_lens import HookedTransformer
model = HookedTransformer.from_pretrained("gpt2-small")
This loads GPT-2 Small (12 layers, 12 heads, 768-dimensional residual stream) with hook points already wired throughout the architecture. TransformerLens supports many popular models out of the box: GPT-2, GPT-Neo, GPT-J, Llama, Pythia, and others.
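As a quick sanity check that the model loaded as described, you can inspect the config and try the tokenizer helpers (a minimal sketch using only standard HookedTransformer attributes):

```python
# Minimal sketch: confirm the architecture and try the tokenizer helpers.
print(model.cfg.n_layers, model.cfg.n_heads, model.cfg.d_model)  # 12 12 768

tokens = model.to_tokens("Hello interpretability")  # prepends a BOS token by default
print(model.to_str_tokens(tokens))
```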
Hook Point Map: Every Observable Site
For GPT-2 Small, each layer exposes the following hook points:
Residual Stream:
- blocks.{l}.hook_resid_pre -- input to the block
- blocks.{l}.hook_resid_mid -- after attention, before MLP
- blocks.{l}.hook_resid_post -- output of the block
Attention:
- blocks.{l}.attn.hook_q -- Query, shape [batch, pos, n_heads, d_head]
- blocks.{l}.attn.hook_k -- Key
- blocks.{l}.attn.hook_v -- Value
- blocks.{l}.attn.hook_pattern -- attention pattern (after softmax)
- blocks.{l}.attn.hook_result -- per-head output
MLP:
- blocks.{l}.mlp.hook_pre -- before the activation function
- blocks.{l}.mlp.hook_post -- after the activation function
For GPT-2 Small with its 12 layers and 12 heads, this gives over 100 individual hook points. Every one of them is a site where you can observe or intervene on the model's computation.
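To see this map in practice, you can enumerate the hook points of a single layer. A minimal sketch, assuming `model.hook_dict` (the registry of HookPoints by name, as in recent TransformerLens versions):

```python
# Minimal sketch: list every hook point registered for layer 0.
layer0_hooks = [name for name in model.hook_dict if name.startswith("blocks.0.")]
for name in layer0_hooks:
    print(name)
```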
ActivationCache: Capturing Everything in One Forward Pass
Calling run_with_cache() stores every hook point's activation in a single forward pass.
prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)
# Access any activation from the cache
resid = cache["blocks.5.hook_resid_post"] # Layer 5 residual
attn_pattern = cache["blocks.8.attn.hook_pattern"]  # Layer 8 attention pattern
ActivationCache is not a plain dictionary. It provides analysis methods that make interpretability work much more convenient:
- cache.decompose_resid(layer) -- decompose the residual stream into per-component contributions
- cache.accumulated_resid(layer) -- accumulated residual stream (useful for Logit Lens-style analysis)
- cache.logit_attrs(direction) -- compute each component's contribution toward a specific token direction
- cache.stack_head_results(layer) -- separate attention output by individual heads
These methods save you from writing boilerplate tensor manipulation code and let you focus on the analysis itself.
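For example, a Logit-Lens-style pass over the accumulated residual stream takes only a few lines. A minimal sketch, assuming the argument names (incl_mid, apply_ln, return_labels) of recent TransformerLens versions and ignoring the unembedding bias:

```python
# Minimal sketch: project the accumulated residual stream onto the unembedding
# and watch how the prediction at the final position evolves layer by layer.
accumulated, labels = cache.accumulated_resid(
    layer=-1, incl_mid=False, apply_ln=True, return_labels=True
)  # accumulated: [n_components, batch, pos, d_model]

final_pos = accumulated[:, 0, -1, :]      # residual at the last token position
layer_logits = final_pos @ model.W_U      # [n_components, d_vocab], bias omitted
top_tokens = layer_logits.argmax(dim=-1)
for label, tok in zip(labels, model.to_str_tokens(top_tokens)):
    print(f"{label:>12}: {tok!r}")
```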
2. Activation Patching: The Core of Causal Tracing
Why Patching Is Necessary
Logit Lens shows that "Paris" is the top prediction at a given layer. But that is correlation, not causation. The real question is:
"If I change this layer's activation, does the model's final answer change?"
If the answer is yes, then that activation is causally important for producing the output. If the answer is no, then whatever information that activation carries is either redundant or unused.
The Activation Patching Algorithm
Activation patching (also called causal tracing) works in three steps:
Step 1: Clean Run (correct execution)
Feed the model a prompt where it produces the correct answer, and cache all activations.
Clean: "When John and Mary went to the store, John gave a drink to" -> " Mary" (correct)
Step 2: Corrupted Run (baseline execution)
Feed a slightly modified prompt that causes the model to produce the wrong answer. This establishes a baseline.
Corrupted: "When John and Mary went to the store, Mary gave a drink to" -> " John" (wrong)
The corruption here is simple: we swapped which name appears in the "gave" position. This causes the model to predict the opposite name.
Step 3: Patched Run (intervention)
Run the corrupted prompt again, but replace a specific activation with the corresponding one from the clean run. Then observe whether the output recovers.
Corrupted input -> run model, but replace Layer 8's activation with the clean version -> does output recover to "Mary"?
If the output recovers, that activation is causally important for producing the correct answer.

The Patching Metric: Logit Difference
To measure how much the output changes, we use the logit difference between the correct and incorrect tokens:
def get_logit_diff(logits, correct_token, incorrect_token):
"""Difference in logits between the correct and incorrect answer tokens."""
return logits[0, -1, correct_token] - logits[0, -1, incorrect_token]We then normalize this to a 0-1 scale:
$$\text{normalized\_metric} = \frac{\text{patched\_diff} - \text{corrupted\_diff}}{\text{clean\_diff} - \text{corrupted\_diff}}$$
- 0 = same as corrupted (no restoration at all)
- 1 = same as clean (full restoration)
A value of 0.8 means that patching this activation recovered 80% of the clean performance. Values close to 1 indicate that the patched activation is critical for producing the correct answer.
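Putting the two pieces together, the normalization can be wrapped in a small helper. A minimal sketch; the function name is mine, and it assumes get_logit_diff from above plus the clean and corrupted baselines from the formula:

```python
def normalized_patching_score(patched_logits, correct_token, incorrect_token,
                              clean_diff, corrupted_diff):
    """0 = no better than the corrupted run, 1 = fully recovers the clean run."""
    patched_diff = get_logit_diff(patched_logits, correct_token, incorrect_token)
    return (patched_diff - corrupted_diff) / (clean_diff - corrupted_diff)
```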
3. Hands-On: Activation Patching on the IOI Task
IOI (Indirect Object Identification) Task
The IOI task is a standard benchmark for activation patching experiments. The model must identify the indirect object in sentences like:
"When John and Mary went to the store, John gave a drink to ___" -> " Mary"
This task is ideal for several reasons:
- The answer is unambiguous (determined entirely by context)
- Clean/corrupted pairs are easy to construct (just swap names)
- GPT-2 Small solves it with high accuracy, making circuit analysis feasible
Step-by-Step Code
import torch
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name
model = HookedTransformer.from_pretrained("gpt2-small")
model.set_use_attn_result(True) # Enable per-head output access
# 1. Define prompts
clean_prompt = "When John and Mary went to the store, John gave a drink to"
corrupted_prompt = "When John and Mary went to the store, Mary gave a drink to"
clean_tokens = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)
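# Optional sanity check: position-wise patching assumes the two prompts
# tokenize to the same length, so that position i means the same slot in both runs.
assert clean_tokens.shape == corrupted_tokens.shape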
# Correct/incorrect answer tokens
mary_token = model.to_single_token(" Mary")
john_token = model.to_single_token(" John")
# 2. Clean and corrupted runs
clean_logits, clean_cache = model.run_with_cache(clean_tokens)
corrupted_logits, corrupted_cache = model.run_with_cache(corrupted_tokens)
clean_diff = (clean_logits[0, -1, mary_token] - clean_logits[0, -1, john_token]).item()
corrupted_diff = (corrupted_logits[0, -1, mary_token] - corrupted_logits[0, -1, john_token]).item()
print(f"Clean logit diff: {clean_diff:.2f}") # Positive: prefers Mary
print(f"Corrupted logit diff: {corrupted_diff:.2f}") # Negative: prefers John
# 3. Residual stream patching across layers and positions
results = torch.zeros(model.cfg.n_layers, clean_tokens.shape[1])
for layer in range(model.cfg.n_layers):
    for pos in range(clean_tokens.shape[1]):
        hook_name = get_act_name("resid_pre", layer)
        def patch_hook(activation, hook, pos=pos):
            activation[:, pos, :] = clean_cache[hook.name][:, pos, :]
            return activation
        patched_logits = model.run_with_hooks(
            corrupted_tokens,
            fwd_hooks=[(hook_name, patch_hook)]
        )
        patched_diff = (patched_logits[0, -1, mary_token] - patched_logits[0, -1, john_token]).item()
        results[layer, pos] = (patched_diff - corrupted_diff) / (clean_diff - corrupted_diff)
Let us break down what this code does:
- We run the model on both prompts and cache all activations.
- We compute the logit difference baseline for both clean and corrupted runs.
- For every (layer, position) pair, we run the corrupted prompt but swap in the clean activation at that specific point.
- We measure how much the output recovers and store the normalized result.
The get_act_name("resid_pre", layer) helper returns the hook name string "blocks.{layer}.hook_resid_pre". The closure trick with pos=pos in the hook function is important: without it, Python's late binding would cause every hook to use the final value of pos.
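An equivalent, arguably more explicit pattern is to bind the position with functools.partial instead of a default argument. A minimal sketch of the same hook; the layer (8) and position (5) here are arbitrary example values:

```python
from functools import partial

def patch_resid_at_pos(activation, hook, pos):
    """Copy the clean residual stream into the corrupted run at one position."""
    activation[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return activation

patched_logits = model.run_with_hooks(
    corrupted_tokens,
    fwd_hooks=[(get_act_name("resid_pre", 8), partial(patch_resid_at_pos, pos=5))],
)
```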
Interpreting the Results
When you visualize the results tensor as a heatmap (layers on the y-axis, token positions on the x-axis, color intensity indicating restoration), clear patterns emerge:
- Early layers (L0-L4) at the second "John" (subject) position: The highest restoration effect. At this stage, the model encodes the critical information "the name that just appeared is John" into the residual stream. Restoring this information recovers the correct answer.
- Late layers (L9-L11) at the final token "to" position: Another key region. This is where the extracted information is used to determine the final answer "Mary" and write it to the output — the stage where Name Mover Heads operate.
- First "John" and "Mary" positions: Almost no effect. The initial mention in the setup clause ("When John and Mary went to...") does not directly contribute to the circuit. The effect concentrates exclusively at the second "John" (subject) and the final "to" (output position).
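A minimal plotting sketch for the heatmap described above (matplotlib is an arbitrary choice here; the TransformerLens tutorials typically use plotly):

```python
import matplotlib.pyplot as plt

str_tokens = model.to_str_tokens(clean_tokens)
fig, ax = plt.subplots(figsize=(10, 5))
im = ax.imshow(results.cpu().numpy(), cmap="RdBu", vmin=-1, vmax=1)
ax.set_xticks(range(len(str_tokens)))
ax.set_xticklabels(str_tokens, rotation=90)
ax.set_xlabel("Token position")
ax.set_ylabel("Layer")
ax.set_title("resid_pre patching: fraction of clean logit diff recovered")
fig.colorbar(im, label="Normalized logit diff")
plt.tight_layout()
plt.show()
```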

This is a fundamentally different kind of evidence than what Logit Lens provides. Logit Lens might show "Paris" appearing at Layer 8, but that is observational. Activation patching proves that Layer 8 *matters*: if you break it, the answer breaks too.
4. Head-Level Patching
Drilling Down from Layers to Individual Heads
Residual stream patching tells us which layers matter. But each layer contains 12 attention heads and an MLP block. Which specific heads are doing the work? To find out, we patch individual attention head outputs:
# Patch each head's output, one at a time
head_results = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        hook_name = get_act_name("result", layer)
        def head_patch_hook(activation, hook, head=head):
            activation[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
            return activation
        patched_logits = model.run_with_hooks(
            corrupted_tokens,
            fwd_hooks=[(hook_name, head_patch_hook)]
        )
        patched_diff = (patched_logits[0, -1, mary_token] - patched_logits[0, -1, john_token]).item()
        head_results[layer, head] = (patched_diff - corrupted_diff) / (clean_diff - corrupted_diff)
Note that get_act_name("result", layer) returns "blocks.{layer}.attn.hook_result", which has shape [batch, pos, n_heads, d_model] (each head's output after its W_O projection, cached because we called set_use_attn_result(True) earlier). The hook replaces the output of a single head (indexed by the head dimension) with the clean version.
Reading the Heatmap
Visualizing the results as a heatmap (y-axis: layer, x-axis: head, color: restoration score) reveals clear patterns:

- Strong positive (blue): L8.H6, L9.H9, L8.H10 show the strongest positive scores. Among these, L9.H9 is a Name Mover Head that directly writes the correct name to the output position, while L8.H6 and L8.H10 are S-Inhibition Heads that suppress the subject name (the wrong answer), helping the correct answer get selected.
- Strong negative (red): L10.H7, L11.H10, among others. Restoring these heads actually *hurts* performance. These are Negative Name Mover Heads that write the wrong name (the subject) to the output. In the clean run, these heads were writing "John" (the wrong answer), so restoring them in the corrupted context works against recovery.
- L0 row is nearly empty: Duplicate Token Heads (L0.H1, L0.H10) detect that a name has appeared before, but they do not directly affect the final output. Their contribution is indirect — they pass information to later heads. This is why they do not appear in the direct head patching heatmap. Their role is confirmed through Path Patching (Section 6).
Key Discoveries: The IOI Circuit
Combining head-level patching with further analysis (path patching and attention-pattern inspection), the full IOI circuit described by Wang et al. (2022) emerges. The heads identified above form a connected circuit with a clear algorithmic interpretation:
$$\text{Duplicate Token Heads (L0)} \rightarrow \text{S-Inhibition Heads (L7-8)} \rightarrow \text{Name Mover Heads (L9-10)} \rightarrow \text{Output}$$
In plain language, the circuit implements the following algorithm:
- Detect duplication: Early heads notice that one name has appeared twice (once in the setup clause, once as the subject of the action).
- Inhibit the subject: Middle-layer heads suppress the duplicated name so it is less likely to be predicted.
- Move the other name: Late-layer heads copy the *other* name (the indirect object) to the output position.
This is a genuine algorithm, discovered purely through causal intervention on model internals. It was first fully described by Wang et al. (2022) in the IOI paper.
Below is the attention pattern of L9.H9, a Name Mover Head. The final token ("to") attends strongly to the indirect object ("Mary"), directly copying that name to the output.
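If you want to reproduce this view yourself, the pattern is already in the clean cache. A minimal sketch; the cache layout for hook_pattern is [batch, head, dest_pos, src_pos]:

```python
# Attention paid by the final token to every earlier position,
# for Name Mover Head L9.H9, taken from the clean run's cache.
pattern = clean_cache["blocks.9.attn.hook_pattern"][0, 9]  # [dest_pos, src_pos]
str_tokens = model.to_str_tokens(clean_tokens)
for tok, weight in zip(str_tokens, pattern[-1].tolist()):
    print(f"{tok!r:>12}  {weight:.3f}")
```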

5. Built-in Patching Functions
Writing manual loops over layers, positions, and heads works, but it is verbose and error-prone. TransformerLens provides a built-in patching module that handles the boilerplate:
from transformer_lens.patching import generic_activation_patch, layer_pos_patch_setter
def metric_fn(logits):
    diff = logits[0, -1, mary_token] - logits[0, -1, john_token]
    return (diff - corrupted_diff) / (clean_diff - corrupted_diff)
# Residual stream patching (layer x position)
result = generic_activation_patch(
    model=model,
    corrupted_tokens=corrupted_tokens,
    clean_cache=clean_cache,
    patching_metric=metric_fn,
    patch_setter=layer_pos_patch_setter,
    activation_name="resid_pre",
    index_axis_names=["layer", "pos"],
)
# result.shape: [n_layers, seq_len]
The generic_activation_patch function handles the loop, the hook setup, and the metric computation. You just specify what to patch and how to measure the effect.
Available Patch Setters
TransformerLens provides several built-in patch setters for different granularities, including layer_pos_patch_setter (one layer at one position) and layer_head_vector_patch_setter (one head across all positions). For most analyses, you will start with layer_pos_patch_setter to identify important layers and positions, then switch to layer_head_vector_patch_setter to pinpoint specific heads.
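As an example of that drill-down, head-level patching with the built-in machinery looks like the sketch below. It reuses metric_fn and the tokens/cache from above; patching the "z" activation here is my choice for per-head output, mirroring what the module's convenience wrappers do in the versions I have used:

```python
from transformer_lens.patching import generic_activation_patch, layer_head_vector_patch_setter

# Head-level patching: patch one head's output ("z") at all positions,
# for every (layer, head) pair.
head_result = generic_activation_patch(
    model=model,
    corrupted_tokens=corrupted_tokens,
    clean_cache=clean_cache,
    patching_metric=metric_fn,
    patch_setter=layer_head_vector_patch_setter,
    activation_name="z",
    index_axis_names=["layer", "head"],
)
# head_result.shape: [n_layers, n_heads]
```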
6. Path Patching: Tracing Information Flow Between Heads
Beyond "What Is Important" to "How Does Information Flow"
Standard activation patching answers: "Is this component important?" Path patching answers a more refined question:
"When Head A's output flows into Head B's Query, is that specific pathway important?"
This is the difference between knowing that a road exists and knowing which cars are driving on it. Path patching lets us trace the actual information flow between components.
Splitting Q, K, V Inputs
To do path patching, we need to be able to patch the Q, K, and V inputs to a head independently. TransformerLens supports this:
model.set_use_split_qkv_input(True)  # Enable separate Q, K, V input hooks
# Re-run the clean forward pass so the cache contains the new split-input activations
_, clean_cache = model.run_with_cache(clean_tokens)
# Patch only the Query input to Layer 9 Head 6
hook_name = "blocks.9.hook_q_input"
def q_patch_hook(activation, hook):
    activation[:, :, 6, :] = clean_cache[hook.name][:, :, 6, :]
    return activation
patched_logits = model.run_with_hooks(
    corrupted_tokens,
    fwd_hooks=[(hook_name, q_patch_hook)]
)
When set_use_split_qkv_input(True) is called, TransformerLens creates separate hook points for each of the Q, K, and V inputs:
- blocks.{l}.hook_q_input
- blocks.{l}.hook_k_input
- blocks.{l}.hook_v_input
Each has shape [batch, pos, n_heads, d_model] (a per-head copy of the residual stream input to the attention layer), allowing you to target a specific head's specific input channel.
Why Q, K, V Separation Matters
Recall how attention works: the Query determines *what to look for*, the Key determines *what to advertise*, and the Value determines *what to transmit*. Patching these separately reveals different aspects of information flow:
- Patching Q of Head B: Tests whether Head B's "question" (what it is looking for) depends on the clean activation.
- Patching K of Head B: Tests whether what Head B "sees" at certain positions depends on the clean activation.
- Patching V of Head B: Tests whether the "content" that Head B reads from certain positions depends on the clean activation.
By systematically patching Q, K, and V inputs of downstream heads while corrupting upstream heads, you can map out the precise wiring diagram of a circuit.
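A minimal sketch of that idea, comparing how much each input channel of a single head matters. It reuses the split-input setup, re-cached clean run, and baselines from above; the layer/head choice (L9.H9) is just an example:

```python
# Compare the effect of patching the Q, K, and V inputs of one head.
target_layer, target_head = 9, 9   # e.g. Name Mover Head L9.H9
for channel in ["q_input", "k_input", "v_input"]:
    hook_name = f"blocks.{target_layer}.hook_{channel}"

    def channel_patch_hook(activation, hook, head=target_head):
        activation[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
        return activation

    patched_logits = model.run_with_hooks(
        corrupted_tokens, fwd_hooks=[(hook_name, channel_patch_hook)]
    )
    patched_diff = (patched_logits[0, -1, mary_token] - patched_logits[0, -1, john_token]).item()
    score = (patched_diff - corrupted_diff) / (clean_diff - corrupted_diff)
    print(f"{channel}: {score:.2f}")
```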
The IOI Circuit via Path Patching
Path patching on the IOI task confirms the following circuit structure:
$$\text{Duplicate Token Heads (L0)} \xrightarrow{K, V} \text{S-Inhibition Heads (L7-8)} \xrightarrow{Q} \text{Name Mover Heads (L9-10)} \rightarrow \text{Output logits}$$
The information flow is specific:
- Duplicate Token Heads write information into the residual stream that S-Inhibition Heads read via their Keys and Values.
- S-Inhibition Heads write information that Name Mover Heads read via their Queries (the "question" of where to copy the name from).
This level of detail is only possible through path patching. Standard activation patching tells you *which* heads matter; path patching tells you *how* they are connected.
7. Practical Tips
GPU is essential. Patching requires one forward pass per component you test. Layer-by-position patching on GPT-2 Small involves 12 layers x ~15 positions = 180 forward passes. Head-level patching adds 12 layers x 12 heads = 144 more. Path patching multiplies this further. Without a GPU, these experiments are impractical.
Start coarse, then drill down. Follow a systematic top-down approach:
- resid_pre patching to find important layers
- Head-level patching to find important heads within those layers
- Path patching (Q/K/V) to trace connections between the important heads
Each step narrows the search space for the next, keeping the total compute manageable.
Batch your prompts. Results from a single prompt can be noisy. Create 8-16 similar prompts (different names, different sentence structures) and average the patching results. This stabilizes the signal and reveals patterns that generalize.
prompts = [
("When John and Mary went to the store, John gave a drink to", " Mary", " John"),
("When Alice and Bob went to the park, Alice gave a gift to", " Bob", " Alice"),
("When Sarah and Tom went to the office, Sarah handed a file to", " Tom", " Sarah"),
# ... more prompts
]
Always call `model.set_use_attn_result(True)`. Head-level output patching requires this setting to be enabled. Without it, TransformerLens does not cache per-head outputs, and your hooks on hook_result will not work as expected.
Mind the memory. run_with_cache() stores every activation in memory. For larger models, this can exhaust GPU RAM quickly. Use the names_filter parameter to only cache the activations you need:
# Only cache residual stream activations
logits, cache = model.run_with_cache(
    tokens,
    names_filter=lambda name: "resid_pre" in name
)
8. Wrap-up
Three Levels of Interpretability
TransformerLens and activation patching elevate interpretability through three levels of evidence. Reading activations (Logit Lens-style analysis) is correlational: "this layer seems to know the answer." Activation patching is causal: "this component provably contributes to the answer." Path patching is mechanistic: "this is how the components are wired together." This post covered the second and third levels, both built on the same hook-based API.
What Comes Next
But there is still a fundamental limitation. Activations are dense and polysemantic. A single neuron responds to many unrelated concepts simultaneously. When you patch an entire layer or an entire head, you are moving hundreds or thousands of features at once. You cannot isolate individual concepts.
This is where the next generation of tools comes in. Sparse Autoencoders (SAEs) learn to decompose dense activations into interpretable, monosemantic features. Combined with tools like SAELens and TransformerLens, they let you move from "which heads matter" to "which *features* matter."
We will cover SAEs and feature-level interpretability in Part 3 of this series.
References
- TransformerLens GitHub. https://github.com/TransformerLensOrg/TransformerLens
- Neel Nanda. *Activation Patching in TransformerLens* (demo notebook)
- Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). "Locating and Editing Factual Associations in GPT." *NeurIPS*. https://arxiv.org/abs/2202.05262
- Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2022). "Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small." *ICLR 2023*. https://arxiv.org/abs/2211.00593
- Neel Nanda. *Mechanistic Interpretability Intro.*