Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
Summary
Bebop proposes entropy-aware multi-token prediction with rejection sampling and a novel TV loss to accelerate RL training of LLMs, achieving up to 1.8x speedup. The method addresses the degradation of acceptance rates during RL by optimizing training objectives.
Cached at: 06/11/26, 01:41 PM
Source: https://huggingface.co/papers/2606.12370
Abstract
Bebop addresses the efficiency bottleneck in reinforcement learning training of large language models by optimizing multi-token prediction techniques through entropy-aware sampling and novel training objectives that improve acceptance rates and inference throughput.
Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, leading to limited speedup performance. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes to integrate MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by the fluctuation of model entropy, which demonstrates a clear negative linear relationship with the rise of entropy in the RL stage. Second, we show that probabilistic rejection sampling largely alleviates the disturbance introduced by entropy in RL compared to greedy draft sampling. We further identify that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in such settings, and therefore we propose a novel end-to-end TV loss that directly optimizes multi-step rejection sampling acceptance rate, yielding ~10% acceptance rate improvements, achieving up to 95% acceptance rates and up to 25% extra inference throughput gains across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with e2e TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL, eliminating the need for costly online MTP updating. We provide extensive experiments and analysis that validate our findings. Experimental results show our method achieves up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
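The link the abstract draws between rejection sampling and a TV objective can be illustrated with a small sketch. It assumes "TV" stands for total variation distance and uses the standard speculative-decoding acceptance rule (accept a draft token x with probability min(1, p(x)/q(x))), under which the single-step acceptance rate equals 1 - TV(p, q). The function names, the toy distributions, the reading of "greedy draft sampling" as always proposing the draft's argmax token, and the i.i.d. acceptance assumption in the last helper are all illustrative assumptions, not the paper's implementation.

```python
import random


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def rejection_acceptance_rate(p, q):
    """Single-step acceptance rate when the draft token is sampled from q.

    Accepting x ~ q with probability min(1, p[x] / q[x]) gives
    sum_x min(p[x], q[x]), which equals 1 - TV(p, q).
    """
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def greedy_draft_acceptance_rate(p, q):
    """Acceptance rate if the draft always proposes argmax(q).

    Assumption: the proposal is treated as a point mass, so it is
    accepted with probability p[argmax q].
    """
    best = max(range(len(q)), key=lambda i: q[i])
    return p[best]


def speculative_step(p, q, rng):
    """One draft-then-verify step; returns (token, accepted)."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the normalized residual max(p - q, 0),
    # which keeps the output distributed exactly as the target p.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False


def expected_tokens_per_step(alpha, k):
    """Expected tokens produced per verification pass with k draft tokens.

    Assumption: each draft token is accepted independently with rate alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    target = [0.5, 0.3, 0.2]  # toy target (policy) distribution p
    draft = [0.6, 0.3, 0.1]   # toy draft (MTP head) distribution q
    print("TV distance:", tv_distance(target, draft))
    print("rejection sampling:", rejection_acceptance_rate(target, draft))
    print("greedy draft:", greedy_draft_acceptance_rate(target, draft))
    print("tokens per step:", expected_tokens_per_step(0.9, 3))
```

On these toy distributions the sampled draft is accepted about 90% of the time, while the argmax proposal is accepted only about 50% of the time, because the target spreads half of its mass over other tokens. Under this reading, a loss that minimizes TV(p, q) raises the rejection-sampling acceptance rate directly, whereas cross-entropy or KL only do so indirectly.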
Get this paper in your agent:
hf papers read 2606.12370
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control
This paper introduces Entrocraft, a rejection-sampling method for RL that controls entropy schedules to prevent performance saturation in LLMs. It demonstrates improved generalization and training longevity, allowing smaller models to outperform larger baselines.
Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models
This paper introduces Adaptive-Horizon and Selective-Advantage variants of GRPO that use entropy-based token-level discounting to stabilize training and improve performance on math reasoning tasks, achieving stronger results with lower variance.
Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
This paper proposes Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation in LLM reinforcement learning by addressing policy entropy collapse through difficulty-aware coefficient allocation and initial-anchored target entropy. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements in both accuracy and exploration capability.
How Maximum Entropy makes Reinforcement Learning Robust
This article explains how incorporating Shannon entropy into reinforcement learning objectives creates more robust agents capable of handling unexpected or adversarial changes in rewards and dynamics.
AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
This paper introduces AEM, a supervision-free method for agentic reinforcement learning that adapts entropy dynamics at the response level to improve exploration-exploitation trade-offs. It demonstrates performance gains on benchmarks like ALFWorld and SWE-bench by aligning uncertainty estimation with action granularity.