Tag
SWITCH is a switchable latent reasoning framework that uses explicit boundary tokens to enable trainable and interpretable recurrent hidden-state reasoning via on-policy reinforcement learning, outperforming prior approaches.
This paper presents a systematic mechanistic analysis of six preference optimization methods (PPO, DPO, SimPO, ORPO, GRPO, KTO) across three open-weight model families, using probing and sparse autoencoders to reveal how alignment algorithms reshape internal representations in qualitatively distinct ways.
This paper identifies a failure mode in which LLMs, when synthesizing multiple sources, do not verify the validity of numerical statistics and rely instead on stylistic markers of analytical rigor. The authors term this 'epistemic alignment' and show that it persists across models and domains and resists prompting-based mitigations.
This paper presents a mechanistic analysis of why LLMs hallucinate when reasoning over linearized structured knowledge. It finds that hallucinations stem not from random noise but from systematic internal dynamics, such as attention to shortcut cues and failures of semantic grounding in feed-forward layers.
This paper proposes that in-context learning in LLMs operates through low-dimensional concept subspaces, in which task-relevant information concentrates in a small fraction of the representation space. The claim is supported by experiments on Llama-3-8B and Qwen2.5-7B.
This paper investigates prompt-induced hallucinations (PIH) in vision-language models through mechanistic analysis, identifying specific attention heads responsible for the models' tendency to favor textual prompts over visual evidence. The authors demonstrate that ablating these PIH heads reduces hallucinations by at least 40% without additional training, and that the underlying mechanisms are model-specific.