Tag
This paper introduces Adaptive-Horizon and Selective-Advantage variants of GRPO that use entropy-based token-level discounting to stabilize training and improve performance on math reasoning tasks, achieving stronger results with lower variance.