Tag
This paper introduces a method for monitoring the reasoning process of Large Reasoning Models by analyzing probe trajectories: the evolution of a concept's probability across generated tokens. The approach derives temporal and signal-processing features from the hidden representations to predict future model behavior more accurately, reaching up to 95% AUROC with max-pooling.
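The summary does not list the exact temporal or signal-processing features, so the following is only a minimal sketch under assumed choices. It takes one probe trajectory (a list of per-token probabilities) and computes max-pooling, mean-pooling, the last value, a least-squares slope, mean absolute first difference ("volatility"), and the share of spectral power in the upper half of the DFT frequency bins. The function name `trajectory_features` and every feature besides max-pooling are hypothetical illustrations, not the paper's definitions.

```python
import math
from typing import Dict, List


def trajectory_features(probs: List[float]) -> Dict[str, float]:
    """Summarize one probe trajectory with simple temporal features.

    Hypothetical feature set: probs[t] is a probe's probability that the
    concept is active at generated token t (assumed to lie in [0, 1]).
    """
    n = len(probs)
    if n == 0:
        raise ValueError("trajectory must contain at least one token")
    mean = sum(probs) / n

    # Least-squares slope of probability vs. token index: the overall trend.
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = (
        sum((t - t_mean) * (p - mean) for t, p in enumerate(probs)) / denom
        if denom > 0 else 0.0
    )

    # Mean absolute first difference: how much the trajectory jitters.
    volatility = (
        sum(abs(probs[t] - probs[t - 1]) for t in range(1, n)) / (n - 1)
        if n > 1 else 0.0
    )

    # Naive DFT of the mean-centred signal; share of power (excluding DC)
    # that falls in the upper half of the non-negative frequency bins.
    centred = [p - mean for p in probs]
    powers = []
    for k in range(1, n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(centred))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(centred))
        powers.append(re * re + im * im)
    total = sum(powers)
    high_freq_ratio = sum(powers[len(powers) // 2:]) / total if total > 0 else 0.0

    return {
        "max": max(probs),   # max-pooling over tokens
        "mean": mean,        # mean-pooling over tokens
        "last": probs[-1],   # most recent probe value
        "slope": slope,
        "volatility": volatility,
        "high_freq_ratio": high_freq_ratio,
    }
```

For example, `trajectory_features([0.1, 0.2, 0.3, 0.4])` yields a max of 0.4 and a slope of about 0.1 per token. In a setup like the one summarized, such feature vectors would be fed to a downstream classifier (not shown) whose predictions are scored with AUROC.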
This paper demonstrates that a small post-transformer adapter (786K parameters) can correct suppressed log-probabilities in alignment-tuned language models, particularly on politically sensitive topics. The adapter shows 31-39% generalization to held-out facts across Qwen3 models while maintaining coherent generation when applied at the final prediction position.
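The summary gives only the adapter's size (786K parameters) and says it is applied at the final prediction position, so the architecture below is an assumption. It sketches one plausible form: a residual bottleneck MLP that adjusts the last position's hidden state after the final transformer block and before the unembedding projection, after which next-token log-probabilities are recomputed. The names `BottleneckAdapter` and `next_token_logprobs`, and the toy 2-dimensional sizes, are hypothetical. Real Qwen3 dimensions and the adapter's training objective are not shown.

```python
import math
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]  # row-major: matrix[i] is row i


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def log_softmax(logits: Vector) -> Vector:
    # Subtract the max before exponentiating for numerical stability.
    top = max(logits)
    lse = top + math.log(sum(math.exp(z - top) for z in logits))
    return [z - lse for z in logits]


class BottleneckAdapter:
    """Hypothetical residual adapter: h + up(relu(down(h)))."""

    def __init__(self, down: Matrix, up: Matrix):
        self.down = down  # shape r x d (project hidden state down)
        self.up = up      # shape d x r (project back up)

    def __call__(self, h: Vector) -> Vector:
        z = [max(0.0, a) for a in matvec(self.down, h)]  # ReLU bottleneck
        delta = matvec(self.up, z)
        return [a + b for a, b in zip(h, delta)]  # residual correction

    def num_params(self) -> int:
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)


def next_token_logprobs(hidden: List[Vector], unembed: Matrix,
                        adapter: Optional[BottleneckAdapter] = None) -> Vector:
    """Next-token log-probs; the adapter (if any) touches only the final position."""
    h_last = hidden[-1]
    if adapter is not None:
        h_last = adapter(h_last)
    return log_softmax(matvec(unembed, h_last))
```

With an identity unembedding and final hidden state `[1.0, 0.0]`, an adapter with `down=[[1.0, 0.0]]` and `up=[[0.0], [2.0]]` shifts the state to `[1.0, 2.0]`, so token 1 overtakes token 0. Because only `hidden[-1]` passes through the adapter, earlier positions are left unchanged in this sketch.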