EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts
Summary
EfficientRollout is a system-aware self-speculative decoding framework that accelerates reinforcement learning rollouts for LLMs by adapting drafters to evolving policies and optimizing speculative decoding regimes, reducing rollout latency by up to 19.6% over an accelerated autoregressive rollout baseline.
Cached at: 06/18/26, 11:57 AM
Source: https://huggingface.co/papers/2606.18967
Abstract
EfficientRollout is a system-aware self-speculative decoding framework that accelerates reinforcement learning rollouts by adapting drafters to evolving policies and optimizing speculative decoding regimes.
Reinforcement learning (RL) has become a representative post-training paradigm for LLMs, enabling strong reasoning and agentic capabilities. However, rollout generation remains a dominant latency bottleneck because autoregressive sampling decodes responses sequentially and a small number of long-tailed generations often determine completion time. Speculative decoding (SD) offers a natural way to address this bottleneck, as it is a well-established technique for serving fixed LLMs that reduces latency by rapidly drafting tokens and accepting them through parallel verification while preserving the target-model distribution. However, its practical speedups do not directly carry over to RL rollouts: (i) the evolving target policy makes any fixed drafter increasingly mismatched with the policy’s output distribution; and (ii) active batch sizes shrink throughout rollout decoding, shifting decoding from compute-bound to memory-bound regimes where parallel verification can exploit underutilized compute. Therefore, accelerating RL rollouts requires both a drafter that remains effective under long, high-temperature generations from an evolving policy and system-aware use of SD that avoids compute-bound regimes. We present EfficientRollout, a system-aware self-SD framework designed to address this gap for RL rollouts. EfficientRollout induces a quantized drafter from the target model (i.e., self-speculative decoding), keeping it coupled to the evolving policy without separate drafter pretraining or online adaptation. It further coordinates a system-aware SD toggle policy with acceptance-aware draft-length adaptation, enabling speculation only in beneficial regimes while matching the drafting budget to evolving drafter quality. EfficientRollout reduces rollout and end-to-end latency by up to 19.6% and 12.7%, respectively, over an accelerated AR rollout baseline, while preserving final model quality.
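To make the interplay between the SD toggle and acceptance-aware draft-length adaptation concrete, here is a minimal Python sketch. It is not the paper's algorithm: the class `SpecController`, the batch-size threshold `max_sd_batch`, the relative drafter cost `draft_cost`, and the moving-average update are all hypothetical choices made for illustration. The speedup estimate uses the standard speculative-decoding approximation in which each drafted token is accepted independently with probability alpha and verifying a draft costs about one target-model step, an assumption that is only plausible in the memory-bound regime the abstract describes.

```python
from dataclasses import dataclass


def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step, assuming each of the
    gamma drafted tokens is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


@dataclass
class SpecController:
    """Hypothetical controller; every default below is an illustrative assumption."""
    max_sd_batch: int = 16    # above this active batch size, treat decoding as compute-bound
    max_len: int = 8          # largest draft length considered
    draft_cost: float = 0.3   # cost of one drafter step relative to one target step
    ema_decay: float = 0.9    # smoothing factor for the acceptance estimate
    acceptance: float = 0.7   # running estimate of the per-token acceptance rate

    def update(self, accepted: int, drafted: int) -> None:
        """Fold the latest verification outcome into the acceptance estimate."""
        if drafted > 0:
            rate = accepted / drafted
            self.acceptance = (
                self.ema_decay * self.acceptance + (1.0 - self.ema_decay) * rate
            )

    def best_draft_len(self) -> int:
        """Pick the draft length with the highest estimated speedup over plain
        autoregressive decoding; return 0 if no length is estimated to help."""
        best_len, best_speedup = 0, 1.0
        for gamma in range(1, self.max_len + 1):
            # One speculative step costs gamma drafter steps plus one target step.
            step_cost = gamma * self.draft_cost + 1.0
            speedup = expected_tokens_per_step(self.acceptance, gamma) / step_cost
            if speedup > best_speedup:
                best_len, best_speedup = gamma, speedup
        return best_len

    def plan(self, active_batch: int) -> int:
        """Return the draft length for the next step; 0 means speculation is off."""
        if active_batch > self.max_sd_batch:
            return 0  # large batch: verification would compete for compute
        return self.best_draft_len()
```

In this sketch a rollout loop would call `plan(active_batch)` before each decode step and `update(accepted, drafted)` after each verification. As the active batch shrinks toward the long-tailed generations, speculation switches on, and the draft length shrinks or falls to zero if the drafter's acceptance rate degrades.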
Similar Articles
Experience-Driven Dynamic Exits for LLMs with Reinforcement Learning
Introduces LEDE, a framework using offline reinforcement learning to dynamically select exit layers and speculation lengths for self-speculative decoding in LLMs, achieving up to 2.7x speedup over autoregressive decoding.
PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding
This paper introduces Parallel Speculative Decoding (PSD), a training-free framework that accelerates diffusion LLM inference by jointly improving spatial and temporal efficiency, achieving up to 5.5× tokens per forward pass with comparable quality to greedy decoding.
Cross-Epoch Adaptive Rollout Optimization for RL Post-Training
This paper presents CERO, a cross-epoch adaptive rollout optimization method for RL post-training of LLMs, which allocates a fixed rollout budget across prompts and epochs using Bayesian posterior variance to maximize sample efficiency, achieving theoretical regret bounds and outperforming GRPO on mathematical reasoning tasks.
@SOURADIPCHAKR18: Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts…
This work proposes using privileged information to actively sample rollouts in reinforcement learning, improving on typical blind sampling methods.
Draft-OPD: On-Policy Distillation for Speculative Draft Models
Draft-OPD introduces on-policy distillation with target-assisted rollouts and error replay to overcome the offline-to-inference mismatch in training draft models for speculative decoding, achieving over 5x lossless acceleration and improving upon EAGLE-3 and DFlash by 23% and 13% respectively.