entropy-adaptive

Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

arXiv cs.LG · 2026-06-05

This paper introduces Adaptive-Horizon and Selective-Advantage variants of GRPO that use entropy-based token-level discounting to stabilize training and improve performance on math reasoning tasks, achieving stronger results with lower variance.
