AI ResearchKR

SAE and TensorLens: The Age of Feature Interpretability

Individual neurons are uninterpretable. Sparse Autoencoders extract monosemantic features from model internals, and TensorLens analyzes the entire Transformer as a single unified tensor.

In the previous two posts, we:

  • Logit/Tuned Lens: Read the model's intermediate predictions
  • Activation Patching: Traced which activations are causally responsible for the answer

But here we hit a fundamental problem:

What do the activations we observe and manipulate actually *mean*?

Each dimension of an activation vector corresponds to an individual neuron. But these neurons are polysemantic -- a single neuron fires for academic citations, English dialogue, HTTP requests, and Korean text simultaneously. Clean interpretation at the neuron level is impossible.

This post covers two modern approaches that address this problem:

  1. Sparse Autoencoder (SAE): Decompose dense activations into sparse, monosemantic features
  2. TensorLens: Unify the entire Transformer computation into a single high-order tensor
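Before diving in, the core mechanic of an SAE is worth sketching: a dense activation vector is projected into a much larger, overcomplete feature space with a ReLU encoder (so most features are exactly zero), then reconstructed by a linear decoder. The sketch below is a minimal, hypothetical illustration with random weights and toy dimensions, not the trained SAE of any real model:

```python
import numpy as np

# Minimal sketch of an SAE forward pass. All dimensions and weights here
# are hypothetical toy values, chosen only to show the shapes involved.
rng = np.random.default_rng(0)

d_model = 8   # dense activation width (e.g. the residual stream)
d_feat = 32   # overcomplete feature dictionary, d_feat >> d_model

W_enc = rng.normal(size=(d_model, d_feat)) * 0.1
b_enc = np.zeros(d_feat)
W_dec = rng.normal(size=(d_feat, d_model)) * 0.1
b_dec = np.zeros(d_model)

def sae_forward(x):
    # Encoder: ReLU yields sparse, non-negative feature activations
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Decoder: reconstruct the dense activation from the active features
    x_hat = f @ W_dec + b_dec
    return f, x_hat

x = rng.normal(size=d_model)   # stand-in for a captured activation vector
f, x_hat = sae_forward(x)

# Training minimizes reconstruction error plus an L1 sparsity penalty on f
loss = np.sum((x - x_hat) ** 2) + 0.01 * np.sum(np.abs(f))
```

Each coordinate of `f` is a candidate monosemantic feature: because the ReLU zeroes out most of them on any given input, the few that do fire can be inspected and labeled individually, which is exactly what neuron-level analysis could not offer.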