AI ResearchKR

SAE and TensorLens: The Age of Feature Interpretability

Individual neurons are uninterpretable. Sparse Autoencoders extract monosemantic features from model internals, and TensorLens analyzes the entire Transformer as a single unified tensor.

In the previous two posts, we:

  • Logit/Tuned Lens: Read the model's intermediate predictions
  • Activation Patching: Traced which activations are causally responsible for the answer

But here we hit a fundamental problem:

What do the activations we observe and manipulate actually *mean*?

Each dimension of an activation vector corresponds to an individual neuron. But these neurons are polysemantic -- a single neuron fires for academic citations, English dialogue, HTTP requests, and Korean text simultaneously. Clean interpretation at the neuron level is impossible.

This post covers two modern approaches that address this problem:

  1. Sparse Autoencoder (SAE): Decompose dense activations into sparse, monosemantic features
  2. TensorLens: Unify the entire Transformer computation into a single high-order tensor
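Before diving in, the core mechanic of an SAE is worth sketching: a dense activation vector is projected into a much larger, overcomplete feature space with a ReLU encoder (so most features are exactly zero), then reconstructed by a linear decoder. The sketch below is a minimal, hypothetical illustration with random weights and toy dimensions, not the trained SAE of any real model:

```python
import numpy as np

# Minimal sketch of an SAE forward pass. All dimensions and weights here
# are hypothetical toy values, chosen only to show the shapes involved.
rng = np.random.default_rng(0)

d_model = 8   # dense activation width (e.g. the residual stream)
d_feat = 32   # overcomplete feature dictionary, d_feat >> d_model

W_enc = rng.normal(size=(d_model, d_feat)) * 0.1
b_enc = np.zeros(d_feat)
W_dec = rng.normal(size=(d_feat, d_model)) * 0.1
b_dec = np.zeros(d_model)

def sae_forward(x):
    # Encoder: ReLU yields sparse, non-negative feature activations
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Decoder: reconstruct the dense activation from the active features
    x_hat = f @ W_dec + b_dec
    return f, x_hat

x = rng.normal(size=d_model)   # stand-in for a captured activation vector
f, x_hat = sae_forward(x)

# Training minimizes reconstruction error plus an L1 sparsity penalty on f
loss = np.sum((x - x_hat) ** 2) + 0.01 * np.sum(np.abs(f))
```

Each coordinate of `f` is a candidate monosemantic feature: because the ReLU zeroes out most of them on any given input, the few that do fire can be inspected and labeled individually, which is exactly what neuron-level analysis could not offer.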