TransformerLens in Practice: Reading Model Circuits with Activation Patching

Using TransformerLens to directly manipulate model activations, we trace which layers and heads causally produce the answer. A hands-on guide to activation patching.

In the previous post, we treated the lens as a window into the model's intermediate thoughts.

But "reading" alone cannot answer the most important question:

Does the model actually *use* this information?

Just because a hidden state at some layer contains "Paris" does not mean that layer causally contributes to the final answer. Information can be present but unused. A layer might hold the right answer in its representation, yet the model might arrive at its output through entirely different pathways.

To determine what actually matters, we need more than visualization. We need causal intervention: directly manipulating the model's internals and observing how the output changes.
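The core move of such an intervention can be illustrated with a toy model in plain PyTorch (the two-layer network and hook functions below are hypothetical, purely for illustration): record an activation from a "clean" run, then splice it into a "corrupted" run and watch the output change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy network standing in for a transformer layer stack (illustrative only).
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))

clean_input = torch.ones(1, 4)
corrupted_input = torch.zeros(1, 4)

# Step 1: run on the clean input and record the first layer's activation.
stored = {}
def save_hook(module, inputs, output):
    stored["act"] = output.detach()

handle = model[0].register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()

# Step 2: run on the corrupted input, but patch in the clean activation.
def patch_hook(module, inputs, output):
    return stored["act"]  # returning a tensor replaces the layer's output

handle = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupted_input)
handle.remove()

# Because the patched layer feeds everything downstream, the patched run
# reproduces the clean output exactly.
print(torch.allclose(patched_out, clean_out))
```

If restoring a single activation recovers the clean output, that activation is causally sufficient for the behavior; this is the logic that activation patching applies, head by head and layer by layer, inside a real transformer.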

1. TransformerLens: A Surgical Toolkit for Interpretability

TransformerLens is a mechanistic interpretability library created by Neel Nanda. Its core capability is attaching hooks to every internal activation in a Transformer, allowing you to read, modify, and replace activations at will.

```bash
pip install transformer_lens
```

HookedTransformer: A Model Wired with Hooks
