logit-lens

#logit-lens

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

arXiv cs.LG · 3d ago

Query Lens extends Logit Lens to interpret sparse autoencoder features. It considers encoder-side key features and decoder-side value features jointly, and it accounts for indirect effects from downstream modules. The paper also introduces the Subspace Channel Hypothesis, which suggests that downstream modules read features through layer-specific subspaces.
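For readers new to the tag, the plain logit lens that this work builds on can be sketched as below. This is a minimal illustration of the baseline idea only (project an intermediate hidden state through the unembedding matrix and read off the top tokens), not the Query Lens method. The names `logit_lens`, `top_tokens`, and `W_U`, the parameter-free layer norm, and the toy dimensions are all assumptions made for the example; a real model would use its own final norm and unembedding weights.

```python
import numpy as np


def logit_lens(hidden, W_U, ln_eps=1e-5):
    """Project an intermediate hidden state to vocabulary logits.

    hidden: (d_model,) residual-stream vector at some layer (hypothetical input).
    W_U:    (d_model, vocab) unembedding matrix (hypothetical input).
    """
    # Assumption: a parameter-free layer norm stands in for the model's final norm.
    normed = (hidden - hidden.mean()) / np.sqrt(hidden.var() + ln_eps)
    return normed @ W_U


def top_tokens(logits, k=3):
    """Return the indices of the k largest logits, highest first."""
    return np.argsort(logits)[::-1][:k]


# Toy data: random weights in place of a trained model.
rng = np.random.default_rng(0)
d_model, vocab = 16, 50
W_U = rng.normal(size=(d_model, vocab))
hidden = rng.normal(size=d_model)

logits = logit_lens(hidden, W_U)
top = top_tokens(logits)
print("top token ids:", top)
```

Applying the same projection at every layer gives a per-layer view of which tokens the residual stream already favors.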


Predicting Where Steering Vectors Succeed

arXiv cs.CL · 2026-04-20

This paper introduces the Linear Accessibility Profile (LAP), a diagnostic method that uses the logit lens to predict how effective steering vectors will be at each model layer. It reports correlations of ρ = +0.86 to +0.91 on 24 concept families across five models. The work provides a systematic framework for deciding which layers and concepts are suitable for steering interventions, replacing ad hoc trial and error.
