Tag
This paper introduces DASH, a method that uses intermediate answer commitments within reasoning traces to assign segment-level credit, reducing overthinking behaviors and improving accuracy on competition-level math benchmarks.
This paper proposes Trajectory-Augmented Policy Optimization (TAPO), which constructs micro-reflective correction trajectories using the model's own correct and incorrect rollouts to improve reasoning in large language models, outperforming standard self-distillation methods on math benchmarks.
This paper formalizes reasoning redundancy in LLMs as the fraction of trailing steps that can be truncated without affecting correctness, quantifying 61-93% redundancy across frontier models and proving that redundancy is a structural consequence of length-agnostic outcome rewards.
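To make the redundancy definition in the last summary concrete, here is a minimal sketch. It is not the paper's implementation: the function name `trailing_redundancy`, the `is_correct` judge, and the rule of removing trailing steps one at a time until correctness first breaks are all assumptions for illustration.

```python
from typing import Callable, Sequence


def trailing_redundancy(
    steps: Sequence[str],
    is_correct: Callable[[Sequence[str]], bool],
) -> float:
    """Fraction of trailing steps that can be dropped while the
    truncated trace is still judged correct.

    `is_correct` is a hypothetical judge: it takes a prefix of the
    reasoning steps and returns True if the answer obtained from that
    prefix is correct.
    """
    n = len(steps)
    # An empty or already-incorrect trace has no removable tail by this definition.
    if n == 0 or not is_correct(steps):
        return 0.0

    keep = n
    # Assumption: drop trailing steps one at a time and stop as soon as
    # removing one more step would make the trace incorrect.
    while keep > 0 and is_correct(steps[: keep - 1]):
        keep -= 1
    return (n - keep) / n


# Toy trace: the answer is reached at step 2; steps 3 and 4 only re-check it.
steps = [
    "set up equation",
    "solve: x = 4",
    "re-check: x = 4",
    "re-check again: x = 4",
]


def judge(prefix: Sequence[str]) -> bool:
    # Toy judge: a prefix counts as correct once it contains the answer.
    return any("x = 4" in s for s in prefix)


redundancy = trailing_redundancy(steps, judge)
print(redundancy)
```

In this toy trace the last two of four steps can be removed without changing the answer, so half of the trace counts as redundant under the stated assumptions.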