Tag
This paper introduces Rift, a method that uses the residual rank of hidden states to detect deceptive responses in language models. It achieves perfect separation across various deception types, model families, and languages, and demonstrates cross-family zero-shot transfer without retraining.
The paper challenges the assumption that cosine alignment between supervised latents and visual targets improves accuracy in vision-language models, finding instead a strong negative correlation between alignment and accuracy. It introduces PRISM diagnostics, which reveal that answers are decoded downstream of the latents rather than within them, and that the auxiliary loss reshapes the language model through shared parameters.