entropy-adaptive

Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

arXiv cs.LG · 2026-06-05

This paper introduces Adaptive-Horizon and Selective-Advantage variants of Group Relative Policy Optimization (GRPO). Both variants use entropy-based, token-level discounting to stabilize training and improve performance on math reasoning tasks, and they achieve stronger results with lower variance.
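The sketch below is an illustration only; it is not the paper's method. It shows where entropy-based, token-level discounting could plug into GRPO. The first function is the standard GRPO outcome advantage, which normalizes each rollout's reward by its group's mean and standard deviation. The discounting rule in `entropy_discounted_token_advantages` is a hypothetical stand-in. It treats the total entropy of later tokens as an adaptive "horizon" and multiplies the advantage by `gamma ** horizon`. Applying the discount only to negative advantages is an assumed form of asymmetry. The function names, `gamma`, and the `asymmetric` flag are all invented for this example.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standard GRPO outcome advantage: (r_i - group mean) / (group std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_discounted_token_advantages(
    advantage: float,
    entropies: Sequence[float],
    gamma: float = 0.9,
    asymmetric: bool = True,
) -> List[float]:
    """HYPOTHETICAL entropy-adaptive discounting (an assumption, not the paper's rule).

    Token t receives advantage * gamma ** D_t, where D_t is the summed entropy
    of all tokens after t. The distance to the final reward therefore grows
    with downstream uncertainty rather than with the raw token count.
    If asymmetric is True, only negative advantages are discounted (assumed).
    """
    T = len(entropies)
    out = [0.0] * T
    downstream = 0.0  # entropy accumulated over tokens after position t
    for t in range(T - 1, -1, -1):
        if asymmetric and advantage >= 0:
            out[t] = advantage  # positive advantages pass through undiscounted
        else:
            out[t] = advantage * gamma ** downstream
        downstream += entropies[t]
    return out


if __name__ == "__main__":
    adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # roughly [1, -1, -1, 1]
    per_token = entropy_discounted_token_advantages(-1.0, [0.5, 0.0, 1.0], gamma=0.9)
    print(adv, per_token)
```

In this sketch, the tokens of a failed rollout that come before more high-entropy decisions are penalized less than tokens near the end. A real implementation would operate on batched tensors and would take per-token entropies from the policy's logits.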