SAE and TensorLens: The Age of Feature Interpretability

In the previous two posts, we:
- Logit/Tuned Lens: Read the model's intermediate predictions
- Activation Patching: Traced which activations are causally responsible for the answer
But here we hit a fundamental problem:
What do the activations we observe and manipulate actually *mean*?
Each dimension of an activation vector corresponds to an individual neuron. But these neurons are polysemantic -- a single neuron fires for academic citations, English dialogue, HTTP requests, and Korean text simultaneously. Clean interpretation at the neuron level is impossible.
This post covers two modern approaches that address this problem:
- Sparse Autoencoder (SAE): Decompose dense activations into sparse, monosemantic features
- TensorLens: Unify the entire Transformer computation into a single high-order tensor
1. Superposition: Why Neurons Are Uninterpretable
The Problem: Polysemantic Neurons
If you observe a single MLP neuron in GPT-2, you see something like this:
- Neuron #4721: Activates for academic citation formats, English dialogue, and base64-encoded strings
- Neuron #1389: Activates simultaneously for mathematical symbols, Python code, and LaTeX expressions
Why does this happen? Is the model "poorly trained"?
No. This is an intentional (or at least inevitable) phenomenon called superposition.
Superposition: Representing More Concepts Than Dimensions
Anthropic's "Toy Models of Superposition" (Elhage et al., 2022) provided the mathematical foundation for this phenomenon.
The key insight:
In high-dimensional spaces, the number of nearly-orthogonal directions far exceeds the number of dimensions.
For example, in a 512-dimensional space:
- You can construct exactly 512 perfectly orthogonal vectors (the basis vectors)
- But you can construct *thousands* of vectors that are nearly orthogonal to each other (cosine similarity < 0.1)
Models exploit this. In a 512-dimensional hidden state, they store not 512 but thousands of features by overlapping them. This is superposition.
$$\mathbf{h} = \sum_i f_i \cdot \mathbf{d}_i$$
($\mathbf{d}_i$ = feature direction, $f_i$ = feature activation)
Each feature $\mathbf{d}_i$ is nearly orthogonal to the others, but not perfectly so. This means there is some interference between features. However, if features are sufficiently sparse (they activate infrequently), the interference cost remains low.
The critical condition: the sparser the features, the more superposition is possible.
$$I_i > \sum_{j \neq i} I_j \cdot p_j \cdot (\mathbf{d}_i \cdot \mathbf{d}_j)^2$$
Here $I_i$ is the importance of feature $i$, and $p_j$ is the probability that feature $j$ is active.
Think about what this means. If feature $j$ is active only 1% of the time ($p_j = 0.01$), then even if $\mathbf{d}_i$ and $\mathbf{d}_j$ are not perfectly orthogonal, the expected interference $I_j \cdot p_j \cdot (\mathbf{d}_i \cdot \mathbf{d}_j)^2$ is very small. The model can afford to pack many more features into the same space as long as they do not all fire at once.
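To make the "many nearly-orthogonal directions" claim concrete, here is a minimal NumPy sketch (dimensions, counts, and values are illustrative, not taken from the paper). It samples far more random unit vectors than the space has dimensions, checks how close to orthogonal they are, and then builds a sparse superposition $\mathbf{h} = \sum_i f_i \mathbf{d}_i$ to show that reading one feature back out with a dot product suffers only modest interference:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 512, 2000

# 2000 random unit vectors in a 512-dimensional space
D = rng.standard_normal((n_features, d_model))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Pairwise cosine similarities between distinct directions
cos = D @ D.T
off_diag = np.abs(cos[~np.eye(n_features, dtype=bool)])
print(f"mean |cos| = {off_diag.mean():.3f}, 95th percentile = {np.percentile(off_diag, 95):.3f}")

# Superposition with sparsity: only a handful of features active at once
active = rng.choice(n_features, size=20, replace=False)
f = np.zeros(n_features)
f[active] = rng.uniform(1.0, 2.0, size=20)
h = f @ D                                   # h = sum_i f_i * d_i

# Reading feature i back out: d_i . h = f_i + interference from the other active features
i = active[0]
print(f"true f_i = {f[i]:.3f}, recovered d_i.h = {D[i] @ h:.3f}")
```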
Result: Neuron ≠ Feature
Because of superposition:
- One neuron = a mixture of several features --> polysemantic
- One feature = distributed across several neurons --> distributed representation
Using neurons as the unit of interpretation is structurally impossible. We need to extract the features.
2. Sparse Autoencoder: The Feature Extractor
The Core Idea
A Sparse Autoencoder (SAE) is a tool that "reverses" superposition. It decomposes dense activations into sparse feature activations.

SAE Architecture
Given a model activation vector $\mathbf{x} \in \mathbb{R}^{d_{\text{model}}}$:
Step 1 -- Encode (extract sparse features):
$$f(\mathbf{x}) = \text{ReLU}(W_{\text{enc}} (\mathbf{x} - \mathbf{b}_{\text{dec}}) + \mathbf{b}_{\text{enc}})$$
- $W_{\text{enc}} \in \mathbb{R}^{d_{\text{sae}} \times d_{\text{model}}}$: encoder weights
- $\mathbf{b}_{\text{enc}} \in \mathbb{R}^{d_{\text{sae}}}$: encoder bias
- $d_{\text{sae}} \gg d_{\text{model}}$: the number of SAE features is much larger than the model dimension (typically 4x to 256x expansion)
- ReLU ensures most $f_i(\mathbf{x})$ are zero --> sparse
Step 2 -- Decode (reconstruct):
$$\hat{\mathbf{x}} = W_{\text{dec}} \cdot f(\mathbf{x}) + \mathbf{b}_{\text{dec}}$$
- $W_{\text{dec}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{sae}}}$: decoder weights
- Each column $\mathbf{d}_i$ is a direction vector for feature $i$ (unit vector)
Writing it out:
$$\hat{\mathbf{x}} = \mathbf{b}_{\text{dec}} + \sum_{i:\, f_i > 0} f_i(\mathbf{x}) \cdot \mathbf{d}_i$$
This is the key insight: the activation is decomposed into a weighted sum of a small number of interpretable features.
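A minimal PyTorch sketch of this encode/decode pair. This is an illustrative implementation, not the code of any particular SAE library; the class name, sizes, and initialization are made up for the example:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: x -> sparse features f(x) -> reconstruction x_hat."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # f(x) = ReLU(W_enc (x - b_dec) + b_enc); ReLU zeroes most entries -> sparse
        return torch.relu((x - self.b_dec) @ self.W_enc.T + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        # x_hat = W_dec f(x) + b_dec = b_dec + sum over active features of f_i * d_i
        return f @ self.W_dec.T + self.b_dec

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        f = self.encode(x)
        return self.decode(f), f
```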
Training Objective
$$\mathcal{L} = \|\mathbf{x} - \hat{\mathbf{x}}\|_2^2 + \lambda \|f(\mathbf{x})\|_1$$
- First term: reconstruction loss -- the original activation must be faithfully reconstructed
- Second term: L1 sparsity penalty -- feature activations must be sparse
- $\lambda$: hyperparameter controlling sparsity strength
Why L1 and not L2? L2 regularization ($\|f\|_2^2$) pushes all features to be uniformly small. L1 regularization ($\|f\|_1$) pushes most features to be exactly zero while letting a few remain large. The mathematical reason: L1's gradient is a constant ($\pm\lambda$), so even tiny values get pushed to zero. L2's gradient is proportional to the value ($2\lambda f_i$), so small values shrink slowly and never quite reach zero. This difference is what gives us the "sparse = mostly zeros" property we need.
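A one-dimensional worked example of that difference (not an SAE, just the penalized scalar problem): with an L1 penalty the closed-form minimizer is soft-thresholding, which snaps small values to exactly zero, while the L2 minimizer only shrinks them.

```python
import numpy as np

# 1-D illustration: minimize (f - a)^2 + penalty(f) for a small target value a
a, lam = 0.3, 1.0

# L1 penalty lam*|f|: minimizer is soft-thresholding -> exactly zero when |a| <= lam/2
f_l1 = np.sign(a) * max(abs(a) - lam / 2, 0.0)

# L2 penalty lam*f^2: minimizer only shrinks toward zero, never reaches it
f_l2 = a / (1.0 + lam)

print(f"L1 solution: {f_l1}, L2 solution: {f_l2:.3f}")   # 0.0 vs 0.150
```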
Additional constraint:
$$\|\mathbf{d}_i\|_2 = 1$$
Without this constraint, the model could increase encoder weights and decrease decoder weights to circumvent the L1 penalty -- making features appear sparse in magnitude without actually reducing the number of active features.
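A sketch of one training step, reusing the hypothetical `SparseAutoencoder` class from the sketch above (again illustrative; the resampling and other tricks used in real SAE training are omitted, and the simple renormalization at the end is just one way to keep decoder columns at unit norm):

```python
import torch

d_model, d_sae, lam = 768, 768 * 16, 5e-4   # illustrative sizes and sparsity coefficient
sae = SparseAutoencoder(d_model, d_sae)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

x = torch.randn(4096, d_model)              # stand-in for a batch of model activations

x_hat, f = sae(x)
recon_loss = (x - x_hat).pow(2).sum(dim=-1).mean()   # ||x - x_hat||_2^2
sparsity_loss = f.abs().sum(dim=-1).mean()           # ||f(x)||_1
loss = recon_loss + lam * sparsity_loss

opt.zero_grad()
loss.backward()
opt.step()

# Enforce ||d_i||_2 = 1 so the L1 penalty cannot be dodged by rescaling
with torch.no_grad():
    sae.W_dec /= sae.W_dec.norm(dim=0, keepdim=True)
```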
Monosemantic Features: It Actually Works
Anthropic's "Towards Monosemanticity" (Bricken et al., 2023) demonstrated that SAE-extracted features are genuinely monosemantic -- each feature corresponds to a single, clear concept:
The clean interpretations that were impossible with individual neurons become possible with individual SAE features. Feature #1847 is a particularly memorable example: Anthropic demonstrated that artificially amplifying this feature causes the model to compulsively bring up the Golden Gate Bridge in every response, regardless of the prompt.
Feature Splitting and Dead Features
Interesting phenomena emerge when you increase the SAE size $d_{\text{sae}}$:
Feature splitting: A "Python code" feature splits into "Python import statements," "Python function definitions," and "Python string manipulation." This suggests a hierarchical structure of concepts. The larger the SAE, the finer-grained the features become -- like adjusting the resolution on a microscope.
Dead features: Features that never activate for any data. These represent wasted capacity. The proportion of dead features is a key diagnostic metric for SAE quality. Common mitigation strategies include resampling dead neurons during training or using auxiliary losses to encourage utilization.
Beyond ReLU: Modern SAE Variants
The original ReLU-based SAE has known weaknesses -- the L1 penalty creates a constant shrinkage on active features, and dead features waste capacity. Several architectural variants address these issues (a sketch of the TopK activation follows the list):
- TopK SAE (Gao et al., 2024): Replaces ReLU with a hard top-K selection: $f(\mathbf{x}) = \text{TopK}(W_{\text{enc}} (\mathbf{x} - \mathbf{b}_{\text{dec}}) + \mathbf{b}_{\text{enc}})$. Only the K largest pre-activations survive. This eliminates the need for the L1 penalty entirely, gives direct control over sparsity (exactly K features are active), and substantially reduces dead features.
- JumpReLU SAE (Rajamanoharan et al., 2024): Uses a learnable threshold $\theta$ per feature: $f(\mathbf{x}) = z \cdot \mathbf{1}[z > \theta]$, where $z = W_{\text{enc}} (\mathbf{x} - \mathbf{b}_{\text{dec}}) + \mathbf{b}_{\text{enc}}$ is the pre-activation. Pre-activations below the threshold are zeroed, while those above it pass through without the shrinkage that ReLU plus L1 causes, improving the reconstruction-sparsity tradeoff. The threshold is trained via straight-through estimators.
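A quick sketch of the TopK activation in PyTorch (illustrative; `k` and the tensor shapes are arbitrary):

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest pre-activations per example; zero the rest."""
    values, indices = torch.topk(pre_acts, k, dim=-1)
    out = torch.zeros_like(pre_acts)
    out.scatter_(-1, indices, values)
    return out

pre_acts = torch.randn(2, 16384)       # batch of SAE pre-activations
f = topk_activation(pre_acts, k=64)
print((f != 0).sum(dim=-1))            # exactly 64 active features per example
```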
Scaling Monosemanticity: SAE on Production Models
Anthropic's "Scaling Monosemanticity" (Templeton et al., 2024) took SAEs from toy models to production scale by applying them to Claude 3 Sonnet.
Key results:
- Extracted 34 million features ($d_{\text{sae}}$ = 34M)
- Features captured multilingual and multimodal concepts: not just "Golden Gate Bridge" but abstract concepts like "safety-related deception" and "code security vulnerabilities"
- Artificially clamping features produced predictable changes in model behavior, confirming that features are not just correlational but causal
- First demonstration that SAE-based interpretability scales to production-grade models
This work established that SAE-based interpretability has reached a level of maturity sufficient for real AI safety research. It transformed SAEs from an interesting research direction into a practical tool.
SAE in Practice: SAELens
SAELens is a dedicated SAE analysis library that was spun out of TransformerLens.
```python
from sae_lens import SAE

# Load a pre-trained SAE for the residual stream entering block 8 of GPT-2 small
sae = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
)

# Decompose activations into features
# (`cache` is an activation cache from a TransformerLens forward pass; see below)
features = sae.encode(cache["blocks.8.hook_resid_pre"])

# Check how many features are active on average
active_features = (features > 0).sum(dim=-1)
print(f"Active features: {active_features.float().mean():.0f} / {sae.cfg.d_sae}")
```

This gives you a sparse vector where each nonzero entry corresponds to a monosemantic feature, along with its activation strength. From here, you can inspect which features fire for specific tokens, track feature activations across layers, or use the features as a basis for circuit analysis.
3. TensorLens: The Entire Transformer as One Tensor
The Evolution of Lens Methods
Let us take stock of the Lens approaches we have covered so far: Logit Lens and Tuned Lens read the model's intermediate predictions, while Activation Patching and Path Patching trace which activations and pathways are causally responsible for the answer.
Every method analyzes the Transformer piecemeal. Attention is analyzed as attention, MLP as MLP, normalization separately.
TensorLens (Atad et al., 2025) fundamentally resolves this limitation:
It represents the entire Transformer computation as a single 4th-order tensor.


Intuition: Upgrading Attention to 4K Resolution
The traditional attention matrix is $L \times L$. It tells you "how much does token A attend to token B?" as a single scalar.
But that is very coarse. Which of token "Paris"'s 768 channels are important? How does that influence arrive at specific channels of the output token? The attention matrix cannot tell you. It is like knowing "a package was sent from Seoul to Busan" but not knowing what was inside or where exactly it was delivered.
TensorLens upgrades this to $L \times D \times L \times D$:
$$\begin{aligned} \text{Attention:} \quad & [\text{token} \rightarrow \text{token}] \rightarrow L \times L \text{ matrix (low resolution)} \\ \text{TensorLens:} \quad & [\text{token} \times \text{channel} \rightarrow \text{token} \times \text{channel}] \rightarrow L \times D \times L \times D \text{ tensor (high resolution)} \end{aligned}$$
This is the difference between "a package was sent" and "electronics were shipped from Warehouse A in Gangnam to Warehouse B in Haeundae." You get full channel-level tracing of information flow.
Mathematical Details: High-Order Attention-Interaction Tensor
This section is for readers comfortable with mathematical notation. The intuitive understanding above is sufficient for the key ideas.
The central object in TensorLens is a 4th-order tensor $T \in \mathbb{R}^{L \times D \times L \times D}$:
$$F(X)[i,:] = \sum_j T[i,:,j,:] \cdot X[j,:]^T + B$$
Where:
- $F$ is the entire Transformer function
- $X$ is the input embedding matrix of shape $[L, D]$
- $T[i,:,j,:]$ is a $D \times D$ matrix that captures the influence of input token $j$ on output token $i$
- $L$ is the sequence length, $D$ is the hidden dimension
With this tensor, the model's final output can be exactly reconstructed as a linear operation on the input.
In intuitive terms: the attention matrix tells you "who looks at whom" ($L \times L$). The TensorLens tensor tells you "who extracts what information and adds it to whose representation" ($L \times D \times L \times D$).
Tensorizing Each Component
The power of TensorLens comes from its ability to unify every sub-component of the Transformer into tensor form:
Self-Attention:
$$\text{vec}[\text{Attn}(X)] = \sum_h \left((W_{v,h} W_{o,h})^T \otimes A_h\right) \text{vec}[X]$$
The Kronecker product $\otimes$ combines token-mixing (the attention matrix $A_h$) with channel-mixing (the value/output projections). This captures both *which tokens interact* and *how their features transform* in a single mathematical object.
What is the Kronecker product ($\otimes$)? Given matrices A ($m \times n$) and B ($p \times q$), their Kronecker product is an $(m \cdot p) \times (n \cdot q)$ matrix formed by replacing each element $a_{ij}$ of A with the block $a_{ij} \cdot B$. Here, attention handles "which tokens connect" (token-mixing) while the value/output projections handle "how connected information transforms" (channel-mixing). The $\otimes$ fuses these two operations into a single matrix that describes both simultaneously.
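A small NumPy sanity check of this identity for a single toy head (shapes are purely illustrative; `A` stands in for the attention matrix and `W` for the combined $W_v W_o$):

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 5, 8                                   # toy sequence length and hidden size

A = rng.standard_normal((L, L))               # attention matrix (token-mixing)
W = rng.standard_normal((D, D))               # W_v @ W_o (channel-mixing)
X = rng.standard_normal((L, D))               # input activations

out = A @ X @ W                               # what this attention head computes

# With column-major vectorization, vec[A X W] = kron(W^T, A) vec[X]
vec = lambda M: M.flatten(order="F")
out_vec = np.kron(W.T, A) @ vec(X)

print(np.allclose(vec(out), out_vec))         # True
```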
LayerNorm:
Linearized by freezing the actual variance statistics from the forward pass. LayerNorm is a nonlinear operation (it divides by the standard deviation), but for a specific input, the variance is a fixed scalar. Fixing it turns LayerNorm into a linear transformation.
FFN/Activation:
The nonlinear activation function is linearized using the ratio $\phi(z)/z$:
- GELU: $H = 0.5 \cdot (1 + \text{erf}(z/\sqrt{2}))$
- SiLU (SwiGLU): $H = \sigma(z)$
This converts each activation function into a diagonal matrix that scales each hidden dimension by the ratio of its post-activation to pre-activation value. The linearization reproduces the output exactly for the specific input being analyzed.
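A tiny NumPy illustration of the $\phi(z)/z$ trick for GELU (illustrative only): freezing the ratio at the observed pre-activations yields a fixed diagonal matrix that is linear in form yet reproduces the nonlinear output exactly at that input.

```python
import numpy as np
from scipy.special import erf

def gelu(z):
    return 0.5 * z * (1.0 + erf(z / np.sqrt(2.0)))

rng = np.random.default_rng(0)
z = rng.standard_normal(6)                    # pre-activations for one position

# Frozen ratio phi(z)/z = 0.5 * (1 + erf(z / sqrt(2))), placed on a diagonal
H = np.diag(0.5 * (1.0 + erf(z / np.sqrt(2.0))))

print(np.allclose(gelu(z), H @ z))            # True: exact at this input, linear in form
```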
Full Transformer Block:
$$T^n = L_2^n (M^n + I) L_1^n (A^n + I)$$
Where $L_1, L_2$ are LayerNorm, $A$ is attention, $M$ is MLP, and $I$ is the residual connection. The $+ I$ terms elegantly handle the skip connections: the information either passes through the sublayer or bypasses it through the residual.
Full Model:
$$\text{vec}[F(X)] = (T^N \cdot T^{N-1} \cdots T^1) \text{vec}[X]$$
Multiplying the tensors from all layers yields the tensor for the entire model. This is the key result: a single mathematical object that describes the complete input-to-output transformation, accounting for every attention head, every MLP, every normalization layer, and every residual connection.
Relevance Scoring: Quantifying Token Influence
Collapsing the 4th-order tensor to an $[L, L]$ matrix gives us the influence of each input token on each output token:
Method 1 -- Norm-based (Eq. 20):
$$T_{\text{norm}}[i] = \|T[\text{out},:,i,:]\|_2$$
Takes the L2 norm of the $D \times D$ block. This measures the "magnitude" of influence regardless of direction. It answers: "How much does input token $i$ affect the output, in total?"
Method 2 -- Input+Output inner product (Eq. 21):
$$T_{\text{IO}}[i] = X_{\text{out}}^T \cdot T[\text{out},:,i,:] \cdot X_i$$
This considers both the input and output hidden states. It has a nice property: the sum of all relevance scores is proportional to the norm of the output. It answers: "How much does input token $i$ contribute to the output *in the direction the model actually uses*?"
Method 3 -- Class-specific:
$$\text{rel}[i] = U[:,\text{class}]^T \cdot T[\text{out},:,i,:] \cdot X_i$$
Uses the specific class weights from the LM head to directly compute each token's contribution to a specific prediction. This answers: "How much does input token $i$ contribute to the model predicting *this particular class*?"
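To make the indexing concrete, here is a hedged NumPy sketch of the three scores on a random toy tensor (the shapes, the `out` position, and the class index are illustrative; a real $T$ would come from the TensorLens tensorization, and the norm is taken as the Frobenius norm of the $D \times D$ block):

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, V = 6, 8, 50                         # toy sequence length, hidden size, vocab size

T = rng.standard_normal((L, D, L, D))      # stand-in for the full-model tensor
X = rng.standard_normal((L, D))            # input embeddings
U = rng.standard_normal((D, V))            # stand-in for the LM head (unembedding)
out, cls = L - 1, 7                        # output position and a target class

X_out = np.einsum("djc,jc->d", T[out], X)  # output hidden state at position `out` (bias omitted)

for i in range(L):
    block = T[out, :, i, :]                      # D x D influence of input token i on `out`
    norm_score = np.linalg.norm(block)           # Method 1: magnitude of influence
    io_score = X_out @ block @ X[i]              # Method 2: input+output inner product
    cls_score = U[:, cls] @ block @ X[i]         # Method 3: class-specific contribution
    print(f"token {i}: norm={norm_score:.2f}  io={io_score:.2f}  class={cls_score:.2f}")
```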
Performance vs. Attention-Only Methods
TensorLens's central claim is that looking at attention alone is insufficient:
In perturbation tests, TensorLens consistently and substantially outperforms existing attention-based methods. The message is clear: ignoring the contributions of FFN layers, normalization, and residual connections leads to incomplete and misleading analysis.
This makes intuitive sense. Attention determines *where* information flows, but the MLP layers determine *what happens* to information after it arrives. Looking at attention alone is like tracking which letters were delivered to which houses, without reading what the letters actually say.
Supported Models
TensorLens supports a range of architectures, handling autoregressive (causal) and bidirectional (masked) language models as well as vision Transformers. The tensorization procedure adapts to each architecture's specific arrangement of components (serial vs. parallel blocks, pre-LN vs. post-LN, standard MLP vs. SwiGLU).
4. SAE + TensorLens: Complementary Approaches
SAE and TensorLens approach the same goal from different angles:
SAE gives you a vocabulary of interpretable features. TensorLens gives you the complete map of information flow. Neither alone provides the full picture.
Combining the two approaches yields a powerful workflow:
- TensorLens identifies which input tokens are important for the output ("Token 5 has the highest relevance score for the model's prediction of 'Paris'")
- SAE decomposes the activation at that token position into monosemantic features ("At token 5, features for 'European capital city,' 'French language,' and 'geographical location' are active")
- Together, you get a complete token-feature-output causal pathway: you know *which* tokens matter, *what concepts* they encode, and *how* those concepts propagate to the output
This combined approach moves us closer to mechanistic interpretability's ultimate goal: a complete, human-understandable explanation of why the model produced a specific output.
5. The Complete Lens Paradigm Map
Across the three posts in this series, we have built up a comprehensive picture of the Lens family of interpretability tools:
$$\begin{aligned} & \textbf{Observation} \\ & \quad \text{Logit Lens: "What does this layer predict?"} \\ & \quad \text{Tuned Lens: "Read it more accurately"} \\[6pt] & \textbf{Causal Analysis} \\ & \quad \text{Activation Patching: "Is this activation the cause?"} \\ & \quad \text{Path Patching: "Through which pathway does information flow?"} \\[6pt] & \textbf{Structural Decomposition} \\ & \quad \text{SAE: "What features are inside this dense activation?"} \\ & \quad \text{TensorLens: "View the entire model as a single linear operator"} \end{aligned}$$
These tools do not compete with each other. They complement each other. Each one analyzes the model at a different resolution and from a different perspective. Together, they form a multi-scale microscope for neural networks.
Wrap-up
Interpretability has now moved beyond simply "looking at what neurons do" into an era of structural and mathematical analysis.
The trajectory is clear. Each new tool adds a higher-resolution view of the model's internals. Logit Lens let us read intermediate predictions. Activation Patching let us test causal claims. SAE let us decompose activations into meaningful units. TensorLens let us see the entire computation as one unified object.
The question all of these tools are ultimately driving toward is a single one:
Can we explain, in complete mathematical terms, *why* the model answered the way it did?
We have not reached that answer yet. But every year, we get one step closer.
References
- Elhage et al. *Toy Models of Superposition* (2022)
https://arxiv.org/abs/2209.10652
- Bricken et al. *Towards Monosemanticity: Decomposing Language Models With Dictionary Learning* (2023)
https://transformer-circuits.pub/2023/monosemantic-features
- Atad et al. *TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors* (2025)
https://arxiv.org/abs/2601.17958
- TensorLens GitHub
https://github.com/idoatad/TensorLens
- SAELens GitHub
https://github.com/jbloomAus/SAELens
- Templeton et al. *Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet* (2024)
https://transformer-circuits.pub/2024/scaling-monosemanticity
- Gao et al. *Scaling and Evaluating Sparse Autoencoders* (2024)
https://arxiv.org/abs/2406.04093
- Rajamanoharan et al. *Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders* (2024)
https://arxiv.org/abs/2407.14435
- Elhage et al. *A Mathematical Framework for Transformer Circuits* (2021)
https://transformer-circuits.pub