Tag
Query Lens extends Logit Lens to interpret sparse autoencoder features by jointly considering encoder-side key features and decoder-side value features, and accounting for indirect effects from downstream modules. The paper also introduces the Subspace Channel Hypothesis, suggesting downstream modules read features through layer-specific subspaces.