query-lens


Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

arXiv cs.LG · 3d ago

Query Lens extends Logit Lens to interpret sparse autoencoder features by jointly considering encoder-side key features and decoder-side value features, and accounting for indirect effects from downstream modules. The paper also introduces the Subspace Channel Hypothesis, suggesting downstream modules read features through layer-specific subspaces.
