llm-interpretation

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

arXiv cs.LG · 3d ago

Query Lens extends Logit Lens to interpret sparse autoencoder features by jointly considering encoder-side key features and decoder-side value features, and accounting for indirect effects from downstream modules. The paper also introduces the Subspace Channel Hypothesis, suggesting downstream modules read features through layer-specific subspaces.
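For orientation, the baseline that Query Lens builds on can be sketched in a few lines. This is plain Logit Lens applied to a decoder-side (value) direction, not the paper's method: it covers only the direct effect on the output logits and ignores the encoder-side key features and the indirect effects through downstream modules. The function name `logit_lens_top_k` and the toy unembedding matrix are assumptions made for illustration, not taken from the paper.

```python
import numpy as np

def logit_lens_top_k(direction, unembed, k=3):
    """Return the indices of the k vocabulary tokens whose logits
    the given residual-stream direction raises the most."""
    # (d_model,) @ (d_model, vocab) -> (vocab,) direct logit contribution
    logits = direction @ unembed
    return np.argsort(-logits)[:k].tolist()

# Toy stand-ins (hypothetical): a 4-dimensional model with a 4-token
# vocabulary whose unembedding matrix is the identity.
unembed = np.eye(4)

# Hypothetical decoder (value) vector of one sparse autoencoder feature.
value_direction = np.array([0.1, 0.9, 0.0, 0.3])

top_tokens = logit_lens_top_k(value_direction, unembed, k=2)
print(top_tokens)  # [1, 3]
```

With real weights, `unembed` would be the model's unembedding matrix and `value_direction` a row of the sparse autoencoder's decoder; the returned indices would then be mapped back to tokens with the model's tokenizer.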
