Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

arXiv cs.CL Papers

Summary

Proposes PPOW, a reinforcement learning framework for optimizing draft models in speculative decoding using window-level objectives and adaptive windowing, achieving significant speedups across multiple benchmarks.

arXiv:2605.14978v1 Announce Type: new Abstract: Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36$\times$ across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.

# Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing
Source: [https://arxiv.org/html/2605.14978](https://arxiv.org/html/2605.14978)
###### Abstract

Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft–target divergence. PPOW achieves average acceptance lengths of 6.29–6.52 and speedups of 3.39–4.36× across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.

## 1 Introduction

Speculative decoding (Leviathan et al., [2023](https://arxiv.org/html/2605.14978#bib.bib1); Chen et al., [2023](https://arxiv.org/html/2605.14978#bib.bib2)) accelerates Large Language Model (LLM) generation while preserving the output distribution of the target model. At each speculative step, a lightweight draft model (drafter) proposes a *speculative window* of candidate tokens, which the target model then verifies in parallel. In practice, the realized speedup is often limited by hard-to-draft positions. Even when most tokens within a speculative window are well modeled by the drafter, a single early mismatch can truncate the accepted prefix and invalidate the rest of the window, limiting overall throughput.

This behavior exposes a mismatch between the inference-time goal of speculative decoding and the training objectives commonly used for drafters. Recent methods, including MEDUSA (Cai et al., [2024](https://arxiv.org/html/2605.14978#bib.bib6)), Hydra (Ankner et al., [2024](https://arxiv.org/html/2605.14978#bib.bib35)), EAGLE (Li et al., [2024b](https://arxiv.org/html/2605.14978#bib.bib8), [2024a](https://arxiv.org/html/2605.14978#bib.bib9), [2025](https://arxiv.org/html/2605.14978#bib.bib10)), HASS (Zhang et al., [2024](https://arxiv.org/html/2605.14978#bib.bib14)), and GRIFFIN (Hu et al., [2025](https://arxiv.org/html/2605.14978#bib.bib36)), improve drafter quality through architectural design and supervised training, while distillation-based approaches (Zhou et al., [2023](https://arxiv.org/html/2605.14978#bib.bib13); Liu et al., [2023](https://arxiv.org/html/2605.14978#bib.bib26); Zafrir et al., [2025](https://arxiv.org/html/2605.14978#bib.bib52)) further improve distributional alignment with the target model. However, these methods are still primarily based on token-level supervised objectives. Such objectives improve local next-token prediction, but speculative utility is inherently *window-level* and *prefix-sensitive*: once the accepted prefix is truncated due to an early mismatch, the remaining drafted tokens in the speculative window are invalidated. As a result, improvements in token-level imitation may not consistently lead to longer accepted prefixes or higher speculative efficiency, especially when end-to-end performance is governed by a few acceptance bottlenecks.

Motivated by this mismatch, we propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter training from token-level supervision to window-level optimization over speculative windows (Figure [2](https://arxiv.org/html/2605.14978#S1.F2)). This window-level formulation aligns optimization with the prefix-sensitive structure induced by speculative verification. Concretely, PPOW combines a *Cost-Aware Speedup Reward*, which encourages longer accepted prefixes while accounting for relative drafter cost, with a *Distribution-Based Proximity Reward*, an auxiliary signal that preserves partial credit for speculative windows that remain close to the target model's preferences even when the accepted prefix is truncated early.

Although PPOW optimizes with window-level rewards, not all speculative windows are equally informative during training. Treating all windows uniformly can disperse optimization effort across non-bottleneck windows, limiting focus on the acceptance bottlenecks most critical to speculative efficiency. PPOW therefore introduces *Adaptive Divergence-Aware Windowing* (ADAW), which prioritizes windows with large confidence-weighted draft–target divergence. This criterion is grounded in our analysis of potential acceptance bottlenecks in Appendix [B](https://arxiv.org/html/2605.14978#A2), and our experiments show that prioritizing such windows improves speculative performance.

Our contributions are summarized as follows:

- **A Window-Level RL Framework for Drafter Optimization.** We formulate drafter training for speculative decoding as a reinforcement learning problem over speculative windows, which better matches inference-time acceptance behavior.
- **Performance-Driven Reward Design.** PPOW combines a Cost-Aware Speedup Reward with an auxiliary Distribution-Based Proximity Reward to better align training with speculative decoding. The latter provides additional credit for speculative windows that remain close to the target distribution, even when the accepted prefix is truncated early, as illustrated in Figure [1](https://arxiv.org/html/2605.14978#S1.F1).
- **Adaptive Divergence-Aware Windowing.** We introduce a window prioritization strategy based on confidence-weighted draft–target divergence that focuses training on more informative windows associated with acceptance bottlenecks. Our analysis motivates this criterion, and our experiments show that prioritizing such windows improves speculative performance.

*(Figure 1 panel contents: (a) Cost-Aware Speedup Reward, illustrated with an accepted prefix of length $k=3$ and $R_{\text{speedup}}=k/(k\gamma+1)$; (b) Distribution-Based Proximity Reward, illustrated with an early-truncated window where $k=0$, $R_{\text{speedup}}=0$, and $R_{\text{dist}}=\eta$ if $\Delta<\epsilon$.)*

Figure 1: PPOW uses a Cost-Aware Speedup Reward together with a Distribution-Based Proximity Reward. (a) The Cost-Aware Speedup Reward increases with accepted prefix length and directly encourages speculative decoding efficiency. (b) When verification is truncated early, resulting in $k=0$, the Distribution-Based Proximity Reward still provides auxiliary credit if the speculative window remains close to the target-preferred window under cumulative target log-likelihood.

![Refer to caption](https://arxiv.org/html/2605.14978v1/figures/ppow_frame.png)

Figure 2: Overview of PPOW. PPOW performs policy optimization at the window level for speculative decoding. Left: Adaptive windowing uses confidence-weighted draft–target divergence scores to prioritize informative training windows. Right: The drafter samples a rollout group of speculative windows for policy optimization with performance-driven rewards and KL regularization.
## 2 Related Work

Recent work has improved speculative decoding through advances in drafter design, draft–target alignment, and inference-time decoding optimization. One line of work improves the drafter itself. Head-based methods, such as MEDUSA (Cai et al., [2024](https://arxiv.org/html/2605.14978#bib.bib6)) and Hydra (Ankner et al., [2024](https://arxiv.org/html/2605.14978#bib.bib35)), augment the model with auxiliary draft heads for multi-token prediction. Feature-based methods, including EAGLE (Li et al., [2024b](https://arxiv.org/html/2605.14978#bib.bib8), [2024a](https://arxiv.org/html/2605.14978#bib.bib9), [2025](https://arxiv.org/html/2605.14978#bib.bib10)), HASS (Zhang et al., [2024](https://arxiv.org/html/2605.14978#bib.bib14)), and GRIFFIN (Hu et al., [2025](https://arxiv.org/html/2605.14978#bib.bib36)), improve drafting by leveraging hidden-state representations from the target model. Another line of work improves draft–target alignment through distillation or online adaptation, as in DistillSpec (Zhou et al., [2023](https://arxiv.org/html/2605.14978#bib.bib13)), OSD (Liu et al., [2023](https://arxiv.org/html/2605.14978#bib.bib26)), and FastDraft (Zafrir et al., [2025](https://arxiv.org/html/2605.14978#bib.bib52)). Lookahead (Fu et al., [2024](https://arxiv.org/html/2605.14978#bib.bib7)) provides a training-free alternative by constructing future candidates through Jacobi-style parallel decoding. While these methods substantially improve proposal quality, their objectives are typically defined at the token or local distribution level, leaving the window-level and prefix-sensitive nature of speculative verification less explicitly optimized.

Beyond drafter modeling, prior work also optimizes the speculative decoding pipeline at inference time, including multi-candidate selection and verification algorithms (Sun et al., [2023](https://arxiv.org/html/2605.14978#bib.bib29), [2024](https://arxiv.org/html/2605.14978#bib.bib39)), tree-based speculative inference and verification (Miao et al., [2024](https://arxiv.org/html/2605.14978#bib.bib30)), tree-structured and hardware-aware speculation (Chen et al., [2024a](https://arxiv.org/html/2605.14978#bib.bib32)), cascaded or adaptive speculation strategies (Chen et al., [2024b](https://arxiv.org/html/2605.14978#bib.bib31); Mamou et al., [2024](https://arxiv.org/html/2605.14978#bib.bib38); Liu et al., [2025b](https://arxiv.org/html/2605.14978#bib.bib37)), and parallel draft–verify execution mechanisms (Svirschevski et al., [2024](https://arxiv.org/html/2605.14978#bib.bib33); Liu et al., [2024](https://arxiv.org/html/2605.14978#bib.bib44)). These approaches mainly operate on the inference procedure, such as how draft candidates are constructed, organized, or verified, and are complementary to drafter-training methods.

Recent studies have also explored reward signals or reinforcement learning in speculative decoding. RSD (Liao et al., [2025](https://arxiv.org/html/2605.14978#bib.bib16)) uses an inference-time process reward model to guide draft acceptance, providing richer semantic guidance than standard token-level verification, without aiming to preserve the target-model distribution as in standard speculative verification. Another related direction includes Spec-RL (Liu et al., [2025a](https://arxiv.org/html/2605.14978#bib.bib41)), RLHFSpec (Wang et al., [2025](https://arxiv.org/html/2605.14978#bib.bib42)), and ReSpec (Chen et al., [2025](https://arxiv.org/html/2605.14978#bib.bib43)), which adapt speculative decoding to improve the efficiency of reinforcement learning rollout generation, whereas PPOW optimizes the drafter for standard speculative decoding inference.

Across these directions, PPOW differs by formulating drafter training as a reinforcement learning problem with performance-driven window-level objectives and adaptive prioritization of informative windows, targeting the window-level and prefix-sensitive utility of speculative verification while preserving the target-model distribution under the standard speculative decoding protocol.

## 3 Preliminaries

We define the notation for speculative decoding used in this paper. Let $\pi_{\text{target}}$ and $\pi_{\theta}$ denote the target model and the draft model (drafter), respectively, where $\theta$ parameterizes the drafter. Given a current prefix $\boldsymbol{x}=(x_{1},\dots,x_{n_{1}})$, the drafter proposes a speculative window of $K$ candidate tokens,

$$\hat{\boldsymbol{y}}=(\hat{y}_{1},\dots,\hat{y}_{K}).$$

The target model then performs parallel verification with rejection sampling under the standard speculative decoding procedure, resulting in an accepted prefix of length $k\in\{0,\dots,K\}$.
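For concreteness, the verification step can be sketched as follows. This is a minimal illustration of standard rejection-sampling verification (the function name, argument layout, and array shapes are assumptions for illustration, not the authors' implementation); it returns the accepted prefix length $k$ used throughout the paper.

```python
import numpy as np

def verify_window(draft_tokens, draft_probs, target_probs, rng=np.random.default_rng()):
    """Standard speculative verification for one window.

    draft_tokens:    list of K proposed token ids.
    draft_probs[t]:  drafter distribution Q_t over the vocabulary at position t.
    target_probs[t]: target distribution P_t over the vocabulary at position t.
    Returns the accepted prefix length k in {0, ..., K}.
    """
    k = 0
    for t, y in enumerate(draft_tokens):
        p, q = target_probs[t][y], draft_probs[t][y]
        # Accept token y with probability min(1, P_t(y) / Q_t(y)).
        if rng.random() < min(1.0, p / max(q, 1e-12)):
            k += 1
        else:
            break  # an early rejection truncates the rest of the window
    return k
```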

### 3.1 Feature-Based Drafting

In this paper, we use feature-based drafters from the EAGLE family (Li et al., [2024b](https://arxiv.org/html/2605.14978#bib.bib8), [2024a](https://arxiv.org/html/2605.14978#bib.bib9), [2025](https://arxiv.org/html/2605.14978#bib.bib10)) as the base drafters in our experiments. Feature-based drafters augment token-based drafting with hidden-state features from the target model. Specifically, at drafting step $t$, the drafter predicts the next-token distribution as

$$\pi_{\theta}(\hat{y}_{t}\mid\boldsymbol{x},\hat{\boldsymbol{y}}_{<t},\mathbf{H}),$$

where

$$\mathbf{H}=\left[\mathbf{H}_{1:n_{1}}^{\text{tgt}},\,\mathbf{H}_{<t}^{d}\right]$$

denotes the concatenation of the target model's prefix hidden states $\mathbf{H}_{1:n_{1}}^{\text{tgt}}$ and the drafter's hidden states $\mathbf{H}_{<t}^{d}$ within the current speculative window. Further details of the speculative verification procedure are provided in Appendix [C.1](https://arxiv.org/html/2605.14978#A3.SS1).
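A schematic sketch of how this conditioning input could be assembled (tensor names and shapes here are illustrative assumptions, not the EAGLE implementation):

```python
import torch

def drafter_step_inputs(h_tgt_prefix: torch.Tensor,
                        h_draft_window: torch.Tensor) -> torch.Tensor:
    """Assemble the conditioning features H = [H_tgt_prefix ; H_draft_window].

    h_tgt_prefix:   (n1, d)   target-model hidden states over the prefix x.
    h_draft_window: (t-1, d)  drafter hidden states for already-drafted tokens y_<t.
    Returns an (n1 + t - 1, d) feature sequence fed to the drafter at step t.
    """
    return torch.cat([h_tgt_prefix, h_draft_window], dim=0)
```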

### 3.2 Policy Optimization over Speculative Windows

In contrast to standard policy optimization formulations that operate on full responses, we formulate drafter training over fixed-length speculative windows. For each input prefix $\boldsymbol{x}$, the drafter defines a policy over speculative windows $\hat{\boldsymbol{y}}=(\hat{y}_{1},\dots,\hat{y}_{K})$, and a scalar reward $R(\boldsymbol{x},\hat{\boldsymbol{y}})$ evaluates the window as a whole. The objective is

$$J(\theta)=\mathbb{E}_{\hat{\boldsymbol{y}}\sim\pi_{\theta}}\!\left[R(\boldsymbol{x},\hat{\boldsymbol{y}})\right].$$

This formulation matches the speculative decoding procedure, in which a speculative window is verified as a whole and its utility is determined by the resulting accepted prefix. Appendix [C](https://arxiv.org/html/2605.14978#A3) provides further discussion of window-level policy optimization versus token-level training for speculative decoding.

## 4 PPOW

PPOW is a performance-driven training framework for speculative drafters that optimizes window-level behavior for speculative decoding. Concretely, PPOW uses Adaptive Divergence-Aware Windowing to prioritize informative speculative windows, assigns them rewards using the Cost-Aware Speedup Reward based on the accepted prefix length, complemented by the Distribution-Based Proximity Reward when accepted-prefix-based signals are sparse, and updates the drafter through window-level reinforcement learning. The framework is compatible with different drafter architectures. In our experiments, we instantiate PPOW with feature-based drafters from EAGLE-3 (Li et al., [2025](https://arxiv.org/html/2605.14978#bib.bib10)). Figure [2](https://arxiv.org/html/2605.14978#S1.F2) gives an overview of PPOW, and Algorithm [1](https://arxiv.org/html/2605.14978#alg1) summarizes the full training procedure.

### 4.1 Window-Level Reinforcement Learning for Speculative Decoding

PPOW formulates drafter training as reinforcement learning over speculative windows, so as to better align optimization with the inference mechanism of speculative decoding. In speculative decoding, the drafter produces speculative windows and the target model verifies them in parallel, yielding accepted-prefix outcomes for different speculative windows under the same prefix. Since these outcomes determine speculative efficiency, PPOW optimizes at the window level and learns from the relative speculative utility of speculative windows.

For a given input prefix $\boldsymbol{x}$, PPOW samples $G_{\mathrm{roll}}$ speculative windows $\hat{\boldsymbol{y}}=(\hat{y}_{1},\dots,\hat{y}_{K})$ from the drafter policy, forming a rollout group for the same prefix. PPOW assigns each speculative window a scalar reward and normalizes rewards within the rollout group to produce group-relative advantages, thereby learning from the relative speculative utility of multiple windows sampled under the same context. Each speculative window is thus treated as a single training unit: it receives one window-level scalar reward, which is converted into a group-relative advantage. This grouped formulation is motivated by multi-candidate speculative decoding settings used in practice.

In addition, PPOW incorporates a KL regularization term into the objective. We treat the drafter as the policy $\pi_{\theta}$ and the target model $\pi_{\text{target}}$ as a cross-scale distributional anchor, and use the KL divergence $D_{\mathrm{KL}}(\pi_{\theta}\,\|\,\pi_{\text{target}})$ to encourage alignment during RL exploration. This target-anchored regularization is intended to stabilize training while keeping policy updates aligned with the target distribution used for speculative verification.

The resulting objective can be written as

$$
\begin{split}
J(\theta)=\;&\frac{1}{G_{\mathrm{roll}}}\sum_{i=1}^{G_{\mathrm{roll}}}\frac{1}{K}\sum_{t=1}^{K}\Big[\min\Big(r_{i,t}(\theta)\hat{A}_{i},\,\mathrm{clip}\big(r_{i,t}(\theta),\,1-\epsilon_{\mathrm{clip}},\,1+\epsilon_{\mathrm{clip}}\big)\hat{A}_{i}\Big)\\
&\qquad-\beta\,D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot\mid\boldsymbol{x},\hat{\boldsymbol{y}}_{i,<t},\mathbf{H})\,\big\|\,\pi_{\text{target}}(\cdot\mid\boldsymbol{x},\hat{\boldsymbol{y}}_{i,<t})\big)\Big],
\end{split}
$$

where

$$r_{i,t}(\theta)=\frac{\pi_{\theta}(\hat{y}_{i,t}\mid\boldsymbol{x},\hat{\boldsymbol{y}}_{i,<t},\mathbf{H})}{\pi_{\mathrm{old}}(\hat{y}_{i,t}\mid\boldsymbol{x},\hat{\boldsymbol{y}}_{i,<t},\mathbf{H})},$$

and $\hat{A}_{i}$ denotes the normalized group-relative advantage of the $i$-th speculative window, capturing its relative speculative utility within the rollout group. Specifically, for a rollout group of speculative windows sampled under the same prefix, we compute

$$\hat{A}_{i}=\frac{R_{i}-\mu_{R}}{\sigma_{R}+\delta},$$

where $R_{i}$ is the scalar reward of the $i$-th speculative window, $\mu_{R}$ and $\sigma_{R}$ are the mean and standard deviation of rewards within the rollout group, and $\delta$ is a small positive constant for numerical stability.

Although the policy is autoregressively factorized over tokens, optimization is performed at the speculative-window level, with all tokens within the same window sharing a common advantage signal.
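To make the grouped update concrete, the following is a minimal sketch, not the authors' code, of how group-relative advantages and the clipped, KL-regularized window-level loss could be computed; tensor names, shapes, and default values are illustrative assumptions.

```python
import torch

def ppow_window_loss(logp_new, logp_old, rewards, kl_per_token,
                     eps_clip=0.2, beta=0.01, delta=1e-6):
    """Clipped group-relative policy loss over one rollout group.

    logp_new, logp_old: (G, K) log-probs of drafted tokens under current / old drafter.
    rewards:            (G,)   window-level scalar rewards R_i.
    kl_per_token:       (G, K) KL(pi_theta || pi_target) at each drafted position.
    Returns a scalar loss (negative of the objective J).
    """
    # Normalize rewards within the rollout group -> group-relative advantages.
    adv = (rewards - rewards.mean()) / (rewards.std() + delta)      # (G,)
    adv = adv.unsqueeze(1)                                          # shared by all K tokens

    ratio = torch.exp(logp_new - logp_old)                          # r_{i,t}(theta)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv
    objective = torch.min(unclipped, clipped) - beta * kl_per_token # (G, K)
    return -objective.mean()
```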

### 4.2 Adaptive Divergence-Aware Windowing

Window-level training better matches speculative decoding at inference time, but it also leads to substantial sample redundancy. A full response can be decomposed into many overlapping speculative windows, and many of them are already adequately modeled by a supervised-initialized drafter. At the same time, speculative performance is often constrained by certain hard-to-draft positions, where draft–target mismatch can more strongly affect the accepted prefix. PPOW therefore prioritizes speculative windows that are both more informative for training and more consequential for speculative acceptance length through Adaptive Divergence-Aware Windowing (ADAW).

Let $P_{t}$ and $Q_{t}$ denote the target and drafter distributions at position $t$,

$$P_{t}=\pi_{\text{target}}(\cdot\mid\boldsymbol{x},\boldsymbol{y}_{<t}),\qquad Q_{t}=\pi_{\theta}(\cdot\mid\boldsymbol{x},\boldsymbol{y}_{<t},\mathbf{H}),$$

and define the token-level *criticality score* as

$$v_{t}=C(P_{t})\cdot D_{\mathrm{KL}}(P_{t}\,\|\,Q_{t}),$$

where

$$C(P_{t})=1-\frac{H(P_{t})}{\log|\mathcal{V}|},$$

and $\mathcal{V}$ denotes the vocabulary.

The criticality score $v_{t}$ is a confidence-weighted measure of draft–target divergence. Here, $D_{\mathrm{KL}}(P_{t}\,\|\,Q_{t})$ captures mismatch between the drafter and target distributions, while $C(P_{t})$ emphasizes contexts in which the target distribution is more concentrated and speculative verification is more sensitive to such mismatch. In this way, ADAW emphasizes contexts that are both difficult for the drafter and more consequential for speculative acceptance.

Because speculative decoding operates on complete windows, PPOW aggregates token-level criticality over each speculative window:

$$s_{j}=\frac{1}{K}\sum_{t=j}^{j+K-1}v_{t},$$

and prioritizes windows with larger $s_{j}$ during training. This adaptive windowing strategy reduces redundancy while focusing optimization on windows more directly associated with acceptance bottlenecks. Appendix [B](https://arxiv.org/html/2605.14978#A2) analyzes the connection between draft–target divergence and speculative acceptance behavior, and further discusses the role of confidence weighting in prioritizing training windows.
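As an illustration, the scoring and selection step could look like the sketch below. It is a simplified interpretation (array shapes, clipping constants, and the proportional sampling scheme are assumptions): compute $v_t$ per position, average over each length-$K$ window, and sample a window start with probability proportional to its score.

```python
import numpy as np

def adaw_scores(target_probs, draft_probs, K, eps=1e-12):
    """Confidence-weighted draft-target divergence and per-window scores.

    target_probs, draft_probs: (T, V) arrays whose rows are P_t and Q_t.
    Returns per-token criticality v (T,) and per-window scores s (T-K+1,).
    """
    P = np.clip(target_probs, eps, 1.0)
    Q = np.clip(draft_probs, eps, 1.0)
    kl = np.sum(P * (np.log(P) - np.log(Q)), axis=1)       # D_KL(P_t || Q_t)
    entropy = -np.sum(P * np.log(P), axis=1)
    confidence = 1.0 - entropy / np.log(P.shape[1])        # C(P_t)
    v = confidence * kl                                    # criticality score v_t
    # Mean criticality over each window of length K starting at position j.
    s = np.convolve(v, np.ones(K) / K, mode="valid")
    return v, s

def sample_window_start(s, rng=np.random.default_rng()):
    """Sample a training window start index with probability proportional to s_j."""
    return rng.choice(len(s), p=s / s.sum())
```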

### 4.3 Performance-Driven Rewards

PPOW uses a window-level reward with two components. The primary component is a Cost-Aware Speedup Reward that encourages longer accepted prefixes while accounting for both drafting and verification cost. The second component is a complementary Distribution-Based Proximity Reward that provides auxiliary credit when exact verification terminates early but the speculative window still achieves a similar cumulative target-model log-likelihood.

#### 4.3.1 Cost-Aware Speedup Reward

The primary goal of speculative decoding is to improve inference efficiency. PPOW therefore uses a cost-aware reward based on the accepted prefix length $k$ of a speculative window and the relative cost $\gamma$ of the drafter:

$$R_{\text{speedup}}=\frac{k}{k\gamma+1},$$

where $k\in\{0,\dots,K\}$ is the accepted length and $\gamma$ denotes the relative computational cost of $\pi_{\theta}$ with respect to $\pi_{\text{target}}$, estimated in our implementation by the ratio of non-embedding parameters.

This formulation accounts for the computational structure of speculative decoding by balancing accepted length against relative drafting cost. The term $k\gamma+1$ combines the drafting cost and the target-model verification cost within a speculative step. Compared with using the raw accepted prefix length $k$ alone, $R_{\text{speedup}}$ provides a cost-aware objective intended to better reflect the efficiency trade-off in speculative decoding.

We use this formulation during training instead of rewards derived from directly measured speedup, since measured speedup depends on the execution environment and is therefore less suitable as a general speculative-window-level training reward. Appendix [D](https://arxiv.org/html/2605.14978#A4) compares the Cost-Aware Speedup Reward with a measured-speedup-based alternative and shows that it preserves the same acceptance-related trend while remaining effective for optimization.
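A small worked sketch of this reward follows; the value of $\gamma$ below is purely illustrative and not taken from the paper.

```python
def speedup_reward(k: int, gamma: float) -> float:
    """Cost-Aware Speedup Reward: accepted length discounted by relative drafter cost."""
    return k / (k * gamma + 1)

# Illustrative values only: a drafter costing ~1/10 of the target model.
gamma = 0.1
print(speedup_reward(0, gamma))  # 0.0   -- early rejection yields no speedup reward
print(speedup_reward(3, gamma))  # ~2.31
print(speedup_reward(6, gamma))  # 3.75
```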

#### 4.3.2 Distribution-Based Proximity Reward

The accepted-prefix reward alone can become sparse when verification terminates early, even if the drafted window remains broadly compatible with the target model's preferences. To provide auxiliary partial credit in such cases, PPOW introduces the Distribution-Based Proximity Reward, which compares the drafted window with a target-preferred window under the target model's cumulative log-likelihood. Figure [1](https://arxiv.org/html/2605.14978#S1.F1)(b) illustrates an example of such early truncation, where verification yields $k=0$ because the drafter and target model diverge at the token level.

For a drafted speculative window $\hat{\boldsymbol{y}}$, we construct a target-preferred reference window $\boldsymbol{y}$ autoregressively from the target model under the same context, and compare the cumulative target log-probabilities of the two windows:

$$\Delta=\sum_{t=1}^{K}\Big[\log\pi_{\text{target}}(y_{t}\mid\boldsymbol{x},\boldsymbol{y}_{<t})-\log\pi_{\text{target}}(\hat{y}_{t}\mid\boldsymbol{x},\hat{\boldsymbol{y}}_{<t})\Big],$$

where

$$y_{t}=\arg\max_{y}\,\pi_{\text{target}}(y\mid\boldsymbol{x},\boldsymbol{y}_{<t}).$$

We then define

$$R_{\text{dist}}=\eta\cdot\mathbf{1}[\Delta<\epsilon],$$

where $\epsilon$ is a tolerance threshold and $\eta$ scales the contribution of this reward.

This auxiliary reward is designed to capture training signal that may be missed by the accepted-prefix-based reward alone. In PPOW, $R_{\text{dist}}$ is activated only when verification yields no accepted token, providing bounded partial credit to drafted windows whose cumulative target log-probabilities remain close to those of the target-preferred reference window.
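A minimal sketch of this auxiliary signal (function and variable names are assumptions for illustration; the reference window is greedily decoded from the target model as in the definition above, and the default values of $\eta$ and $\epsilon$ are illustrative):

```python
import numpy as np

def proximity_reward(logp_target_ref, logp_target_draft, k, eta=0.1, eps=1.0):
    """Distribution-Based Proximity Reward R_dist.

    logp_target_ref[t]:   log pi_target(y_t | x, y_<t) for the greedy reference window.
    logp_target_draft[t]: log pi_target(y_hat_t | x, y_hat_<t) for the drafted window.
    k: accepted prefix length from verification; the reward is only active when k == 0.
    """
    if k != 0:
        return 0.0
    delta = float(np.sum(logp_target_ref) - np.sum(logp_target_draft))
    return eta if delta < eps else 0.0
```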

As a result, $R_{\text{dist}}$ complements the accepted-prefix reward with a softer window-level signal and can provide denser feedback during training. Section [5.5](https://arxiv.org/html/2605.14978#S5.SS5) evaluates its overall contribution in ablations, and Appendix [E](https://arxiv.org/html/2605.14978#A5) further examines its effect on easy and hard windows.

## 5 Experiments

We evaluate PPOW from four perspectives: (1) speculative decoding performance across model families and tasks under a unified decoding protocol, (2) practical efficiency under different inference candidate group sizes, (3) whether PPOW outperforms continued supervised training with the same number of post-training steps, and (4) ablations and stability studies of the reward design and Adaptive Divergence-Aware Windowing. Additional results on the speedup proxy, easy/hard-window behavior, and broader baseline comparisons are provided in the appendix.

### 5.1 Experimental Setup

##### Models and Tasks.

We evaluate PPOW on two representative model families: LLaMA-3 (Grattafiori et al., [2024](https://arxiv.org/html/2605.14978#bib.bib22)) with 8B and 70B target models, and Qwen3 (Yang et al., [2025](https://arxiv.org/html/2605.14978#bib.bib4)) with 8B and 32B target models. PPOW is applied to feature-based drafters from the EAGLE family (Li et al., [2024b](https://arxiv.org/html/2605.14978#bib.bib8), [2024a](https://arxiv.org/html/2605.14978#bib.bib9), [2025](https://arxiv.org/html/2605.14978#bib.bib10)) and initialized from supervised checkpoints before RL optimization. We evaluate on MT-Bench (Zheng et al., [2023](https://arxiv.org/html/2605.14978#bib.bib19)) for multi-turn dialogue, HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.14978#bib.bib25)) for code generation, and GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.14978#bib.bib23)) for mathematical reasoning.

##### Metrics.

We evaluate speculative decoding performance using the following two metrics:

- **Speedup Ratio:** The measured end-to-end speedup relative to vanilla autoregressive decoding.
- **Average Acceptance Length ($\tau$):** The average number of tokens accepted from the drafter in each speculative verification step.

Unless otherwise specified, all reported results are obtained with decoding temperature set to 0.0.

##### Baselines.

Our main comparisons are with learned drafters, including EAGLE-3 (Li et al., [2025](https://arxiv.org/html/2605.14978#bib.bib10)) and GRIFFIN (Hu et al., [2025](https://arxiv.org/html/2605.14978#bib.bib36)). Additional results on broader natural-language tasks and supplementary comparisons with OSD (Liu et al., [2023](https://arxiv.org/html/2605.14978#bib.bib26)), Lookahead (Fu et al., [2024](https://arxiv.org/html/2605.14978#bib.bib7)), and FastDraft (Zafrir et al., [2025](https://arxiv.org/html/2605.14978#bib.bib52)) are provided in Appendix [F](https://arxiv.org/html/2605.14978#A6). Detailed optimization hyperparameters and the unified decoding protocol are provided in Appendix [A](https://arxiv.org/html/2605.14978#A1).

### 5.2 Main Results

Table [1](https://arxiv.org/html/2605.14978#S5.T1) summarizes the main results. PPOW achieves the best mean performance over the evaluated benchmarks for each model family under both temperature settings.

Table 1: Main speculative decoding results. Average acceptance length ($\tau$) and speedup across models and benchmarks under a unified decoding protocol. L31, L33, and Q3 refer to LLaMA-3.1-Instruct, LLaMA-3.3-Instruct, and Qwen3, respectively.

**Temperature = 0.0**

| Model | Method | MT-Bench τ | MT-Bench Speedup | HumanEval τ | HumanEval Speedup | GSM8K τ | GSM8K Speedup | Mean τ | Mean Speedup |
|---|---|---|---|---|---|---|---|---|---|
| L31-8B | GRIFFIN | 5.14 | 2.58× | 6.72 | 3.96× | 5.98 | 3.38× | 5.95 | 3.31× |
| L31-8B | EAGLE-3 | 5.53 | 2.91× | 6.63 | 3.92× | 6.12 | 3.41× | 6.09 | 3.41× |
| L31-8B | PPOW | 5.47 | 2.72× | 7.23 | 4.14× | 6.50 | 3.52× | 6.40 | 3.46× |
| L33-70B | GRIFFIN | 5.08 | 3.60× | 6.69 | 4.78× | 5.90 | 3.99× | 5.89 | 4.12× |
| L33-70B | EAGLE-3 | 5.12 | 3.63× | 6.78 | 4.80× | 5.93 | 4.02× | 5.94 | 4.15× |
| L33-70B | PPOW | 5.45 | 3.73× | 6.96 | 4.82× | 6.47 | 4.54× | 6.29 | 4.36× |
| Q3-8B | EAGLE-3 | 4.95 | 2.64× | 6.68 | 3.40× | 6.86 | 3.47× | 6.16 | 3.17× |
| Q3-8B | PPOW | 5.58 | 3.02× | 7.01 | 3.62× | 6.97 | 3.54× | 6.52 | 3.39× |
| Q3-32B | EAGLE-3 | 5.25 | 3.47× | 6.52 | 4.02× | 6.21 | 3.94× | 5.99 | 3.81× |
| Q3-32B | PPOW | 5.78 | 3.54× | 6.91 | 4.16× | 6.62 | 4.05× | 6.44 | 3.92× |

**Temperature = 1.0**

| Model | Method | MT-Bench τ | MT-Bench Speedup | HumanEval τ | HumanEval Speedup | GSM8K τ | GSM8K Speedup | Mean τ | Mean Speedup |
|---|---|---|---|---|---|---|---|---|---|
| L31-8B | GRIFFIN | 4.03 | 2.24× | 6.01 | 3.29× | 5.13 | 2.89× | 5.06 | 2.81× |
| L31-8B | EAGLE-3 | 4.17 | 2.35× | 5.94 | 3.25× | 5.16 | 2.89× | 5.09 | 2.83× |
| L31-8B | PPOW | 4.12 | 2.29× | 6.33 | 3.37× | 5.63 | 3.13× | 5.36 | 2.93× |
| L33-70B | GRIFFIN | 4.70 | 3.38× | 6.31 | 4.46× | 5.54 | 3.87× | 5.52 | 3.90× |
| L33-70B | EAGLE-3 | 4.82 | 3.40× | 6.28 | 4.43× | 5.98 | 4.16× | 5.69 | 4.00× |
| L33-70B | PPOW | 5.14 | 3.52× | 6.57 | 4.62× | 6.29 | 4.49× | 6.00 | 4.21× |
| Q3-8B | EAGLE-3 | 4.00 | 2.41× | 5.71 | 2.11× | 5.52 | 2.89× | 5.08 | 2.47× |
| Q3-8B | PPOW | 4.72 | 2.82× | 5.96 | 2.33× | 5.76 | 3.01× | 5.48 | 2.72× |
| Q3-32B | EAGLE-3 | 4.51 | 2.60× | 5.53 | 2.59× | 5.23 | 2.81× | 5.09 | 2.67× |
| Q3-32B | PPOW | 4.95 | 2.81× | 6.17 | 2.77× | 5.61 | 3.01× | 5.58 | 2.86× |

Across model families, PPOW consistently improves average acceptance length and wall-clock speedup over strong learned-drafter baselines under the unified decoding protocol. The gains are particularly clear on HumanEval and GSM8K, where PPOW delivers the strongest and most consistent improvements across both LLaMA and Qwen models, suggesting that it is especially effective when speculative success depends on relatively structured decoding decisions. By contrast, the trend is less pronounced on MT-Bench, likely because open-ended dialogue admits a broader range of valid continuations, making speculative acceptance less sensitive to improvements in the drafter policy.

### 5.3 Inference Candidate Group-Size Trade-offs

Table 2: Average acceptance length ($\tau$) under different inference candidate group sizes. PPOW shows consistently higher acceptance length than the supervised baseline with smaller candidate group sizes on LLaMA-3.1-8B / GSM8K.

Table 3: Average acceptance length ($\tau$) after continued training. PPOW achieves higher final acceptance length than EAGLE-3 with continued supervised training (CST) across learning rates on GSM8K with LLaMA-3.1-8B.

Table[3](https://arxiv.org/html/2605.14978#S5.T3)compares PPOW with the corresponding supervised baseline under different inference candidate group sizes\. Here, the supervised baseline refers to the EAGLE\-3Liet al\.\([2025](https://arxiv.org/html/2605.14978#bib.bib10)\)initialization without PPOW RL post\-training\. In practical speculative decoding, acceptance can often be improved by generating multiple candidate token sequences—for example through branching or tree\-style drafting—and then verifying them in parallel\. Larger candidate groups therefore tend to improve acceptance, but they also increase verification overhead\. PPOW reaches a high acceptance length with much smaller candidate groups: at inference candidate group size 4, it achievesτ=6\.33\\tau=6\.33, whereas the supervised baseline reaches onlyτ=5\.03\\tau=5\.03under the same setting and requires a much larger candidate group size of 16 to approach PPOW’s performance\. This result suggests that PPOW uses the candidate budget more effectively under the same verification budget\.

![Refer to caption](https://arxiv.org/html/2605.14978v1/x1.png)

Figure 3: PPOW versus continued supervised training under matched training steps. On GSM8K with LLaMA-3.1-8B, the supervised baseline initially improves average acceptance length but later degrades, whereas PPOW continues to improve and achieves a higher final acceptance length. CST denotes continued supervised training from the EAGLE-3 checkpoint.

![Refer to caption](https://arxiv.org/html/2605.14978v1/x2.png)

Figure 4: Training dynamics after enabling ADAW. Switching to ADAW at 44k training steps causes an immediate drop in training reward and a corresponding rise in KL divergence, indicating that the sampled windows have become more challenging. As training continues, the policy adapts to these harder windows and gradually recovers performance.

### 5.4 PPOW versus Continued Supervised Training

We next compare PPOW against a continued supervised training baseline to determine whether its gains can be explained by additional supervised training alone. Both methods are initialized from the same EAGLE-3 checkpoint (Li et al., [2025](https://arxiv.org/html/2605.14978#bib.bib10)) and trained for an identical number of additional steps. Figure [3](https://arxiv.org/html/2605.14978#S5.F3) shows that, on GSM8K with LLaMA-3.1-8B (Grattafiori et al., [2024](https://arxiv.org/html/2605.14978#bib.bib22)), the supervised baseline initially improves average acceptance length but later degrades, whereas PPOW continues to improve throughout training and reaches a higher final value. Table [3](https://arxiv.org/html/2605.14978#S5.T3) shows that the same trend holds across learning rates after 30k additional training steps. These results suggest that simply extending imitation-style training does not reliably optimize speculative acceptance behavior. The objective-level difference between PPOW and the supervised baseline is further discussed in Appendix [C](https://arxiv.org/html/2605.14978#A3). We observe the same qualitative trend on natural-language tasks, as shown in Appendix [F.2](https://arxiv.org/html/2605.14978#A6.SS2).

### 5.5 Ablation Studies

We examine the contribution of PPOW's two key components beyond the base speedup reward: the Distribution-Based Proximity Reward and Adaptive Divergence-Aware Windowing. Table [4](https://arxiv.org/html/2605.14978#S5.T4) reports component ablations on LLaMA-3.1-8B.

Table 4: Ablation of PPOW components on LLaMA-3.1-8B. "w/o $R_{\text{dist}}$" removes the Distribution-Based Proximity Reward, "w/o ADAW" replaces Adaptive Divergence-Aware Windowing with uniform window sampling, and "w/o both" removes $R_{\text{dist}}$ and uses uniform window sampling.

Removing either $R_{\text{dist}}$ or ADAW reduces both acceptance length and end-to-end speedup, and removing both leads to the largest drop on both benchmarks. On MT-Bench, removing both reduces $\tau$ from 5.47 to 4.38 and speedup from 2.72× to 2.39×. This pattern suggests that the two components make complementary contributions in practice: $R_{\text{dist}}$ provides a denser learning signal when exact verification is sparse, while ADAW improves training efficiency by focusing optimization on more informative windows.

Figure[4](https://arxiv.org/html/2605.14978#S5.F4)provides further evidence for the effectiveness of ADAW\. When training switches to ADAW at 44k steps, reward drops immediately while KL divergence rises, indicating that training shifts toward harder windows with larger draft–target divergence\. As optimization continues, the policy adapts to this more difficult window distribution and recovers performance\. This behavior is consistent with ADAW prioritizing harder\-to\-draft windows rather than repeatedly sampling easy or redundant ones\. Additional analysis of ADAW is provided in Appendix[B](https://arxiv.org/html/2605.14978#A2), which gives analytical support for weighting draft–target divergence by target confidence\. Appendix[E](https://arxiv.org/html/2605.14978#A5)provides additional ablations and an easy/hard\-window analysis of the Distribution\-Based Proximity Reward\.

## 6 Conclusion

We presented PPOW, a performance-driven reinforcement learning framework for speculative decoding that optimizes the drafter over speculative windows. PPOW combines the Cost-Aware Speedup Reward, the Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing to better align drafter optimization with inference-time acceptance behavior. Empirically, PPOW improves average acceptance length and speedup across diverse settings. It also remains effective with substantially smaller inference candidate group sizes, suggesting practical value under constrained verification budgets. We further provide analytical support for the confidence-weighted draft–target divergence criterion used in ADAW. Overall, our results provide strong evidence for the value of performance-driven drafter optimization in improving speculative decoding efficiency.

## References

- [1] Z. Ankner, R. Parthasarathy, A. Nrusimha, C. Rinard, J. Ragan-Kelley, and W. Brandon (2024). Hydra: sequentially-dependent draft heads for medusa decoding. arXiv preprint arXiv:2402.05109.
- [2] O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, et al. (2014). Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 12–58.
- [3] T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao (2024). Medusa: simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774.
- [4] C. Chen, S. Borgeaud, G. Irving, J. Lespiau, L. Sifre, and J. Jumper (2023). Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318.
- [5] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- [6] Q. Chen, Z. Liu, P. Sun, S. Li, G. Wang, Z. Liu, Y. Wen, S. Feng, and T. Zhang (2025). ReSpec: towards optimizing speculative decoding in reinforcement learning systems. arXiv preprint arXiv:2510.26475.
- [7] Z. Chen, A. May, R. Svirschevski, Y. Huang, M. Ryabinin, Z. Jia, and B. Chen (2024). Sequoia: scalable, robust, and hardware-aware speculative decoding. arXiv preprint arXiv:2402.12374.
- [8] Z. Chen, X. Yang, J. Lin, C. Sun, K. C. Chang, and J. Huang (2024). Cascade speculative drafting for even faster llm inference. Advances in Neural Information Processing Systems 37, pp. 86226–86242.
- [9] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [10] N. Ding, Y. Chen, B. Xu, Y. Qin, S. Hu, Z. Liu, M. Sun, and B. Zhou (2023). Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3029–3051.
- [11] Y. Fu, P. Bailis, I. Stoica, and H. Zhang (2024). Break the sequential dependency of llm inference using lookahead decoding. arXiv preprint arXiv:2402.02057.
- [12] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [13] S. Hu, J. Li, X. Xie, Z. Lu, K. Toh, and P. Zhou (2025). Griffin: effective token alignment for faster speculative decoding. arXiv preprint arXiv:2502.11018.
- [14] Y. Leviathan, M. Kalman, and Y. Matias (2023). Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286.
- [15] Y. Li, F. Wei, C. Zhang, and H. Zhang (2024). Eagle-2: faster inference of language models with dynamic draft trees. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 7421–7432.
- [16] Y. Li, F. Wei, C. Zhang, and H. Zhang (2024). Eagle: speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077.
- [17] Y. Li, F. Wei, C. Zhang, and H. Zhang (2025). Eagle-3: scaling up inference acceleration of large language models via training-time test. arXiv preprint arXiv:2503.01840.
- [18] B. Liao, Y. Xu, H. Dong, J. Li, C. Monz, S. Savarese, D. Sahoo, and C. Xiong (2025). Reward-guided speculative decoding for efficient llm reasoning. arXiv preprint arXiv:2501.19324.
- [19] B. Liu, A. Wang, Z. Min, L. Yao, H. Zhang, Y. Liu, A. Zeng, and J. Su (2025). SPEC-RL: accelerating on-policy reinforcement learning via speculative rollouts. arXiv preprint arXiv:2509.23232.
- [20] T. Liu, Y. Li, Q. Lv, K. Liu, J. Zhu, W. Hu, and X. Sun (2024). Pearl: parallel speculative decoding with adaptive draft length. arXiv preprint arXiv:2408.11850.
- [21] X. Liu, L. Hu, P. Bailis, A. Cheung, Z. Deng, I. Stoica, and H. Zhang (2023). Online speculative decoding. arXiv preprint arXiv:2310.07177.
- [22] X. Liu, B. Lei, R. Zhang, and D. D. Xu (2025). Adaptive draft-verification for efficient large language model decoding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 24668–24676.
- [23] J. Mamou, O. Pereg, D. Korat, M. Berchansky, N. Timor, M. Wasserblat, and R. Schwartz (2024). Dynamic speculation lookahead accelerates speculative decoding of large language models. arXiv preprint arXiv:2405.04304.
- [24] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, Z. Zhang, R. Y. Y. Wong, A. Zhu, L. Yang, X. Shi, et al. (2024). Specinfer: accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 932–949.
- [25] S. Narayan, S. B. Cohen, and M. Lapata (2018). Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1797–1807.
- [26] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- [27] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024). Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- [28] Z. Sun, U. Mendlovic, Y. Leviathan, A. Aharoni, J. H. Ro, A. Beirami, and A. T. Suresh (2024). Block verification accelerates speculative decoding. arXiv preprint arXiv:2403.10444.
- [29] Z. Sun, A. T. Suresh, J. H. Ro, A. Beirami, H. Jain, and F. Yu (2023). Spectr: fast speculative decoding via optimal transport. Advances in Neural Information Processing Systems 36, pp. 30222–30242.
- [30] R. Svirschevski, A. May, Z. Chen, B. Chen, Z. Jia, and M. Ryabinin (2024). Specexec: massively parallel speculative decoding for interactive llm inference on consumer devices. Advances in Neural Information Processing Systems 37, pp. 16342–16368.
- [31] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023). Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- [32] S. Wang, H. Yang, J. Zhu, X. Wang, Y. Xu, and D. Qian (2025). RLHFSpec: breaking the efficiency bottleneck in rlhf training via adaptive drafting. arXiv preprint arXiv:2512.04752.
- [33] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [34] O. Zafrir, I. Margulis, D. Shteyman, S. Guskin, and G. Boudoukh (2025). Fastdraft: how to train your draft. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 22488–22505.
- [35] L. Zhang, X. Wang, Y. Huang, and R. Xu (2024). Learning harmonized representations for speculative sampling. arXiv preprint arXiv:2408.15766.
- [36] Y. Zhao, A. Gu, R. Varma, L. Luo, C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer, et al. (2023). Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277.
- [37] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2023). Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems 36, pp. 46595–46623.
- [38] Y. Zhou, K. Lyu, A. S. Rawat, A. K. Menon, A. Rostamizadeh, S. Kumar, J. Kagy, and R. Agarwal (2023). Distillspec: improving speculative decoding via knowledge distillation. arXiv preprint arXiv:2310.08461.

## Appendix A Optimization and Implementation Details

This appendix summarizes the training setup, optimization procedure, hyperparameter settings, and full PPOW training algorithm used in our experiments.

### A.1 Training Setup

Our training uses a two-stage pipeline: supervised initialization followed by PPOW-based reinforcement learning. For LLaMA-3, we initialize from the official EAGLE-3 checkpoints [[17](https://arxiv.org/html/2605.14978#bib.bib10)]. For Qwen, we first train an EAGLE-3 drafter on ShareGPT and UltraChat-200k [[10](https://arxiv.org/html/2605.14978#bib.bib53)], and then further optimize it with PPOW on the same data mixture. All experiments are run on NVIDIA H100 (80GB) GPUs using PyTorch with FSDP [[36](https://arxiv.org/html/2605.14978#bib.bib51)] to support a frozen target model and a trainable drafter. PPOW training costs about 50, 100, and 200 GPU-hours for 8B, 32B, and 70B targets, respectively.

### A.2 Policy Optimization and Target-Anchored KL Regularization

PPOW uses a group-relative clipped policy objective [[26](https://arxiv.org/html/2605.14978#bib.bib12), [27](https://arxiv.org/html/2605.14978#bib.bib11)]. For each selected speculative window, we sample a rollout group of $G_{\mathrm{roll}}$ drafts, compute a scalar reward for each draft, and normalize rewards within the group to obtain group-relative advantages.

A key difference from standard RL training setups lies in the KL regularization. Rather than constraining the current policy toward an initialization policy or an auxiliary reference model, PPOW computes KL regularization between the drafter policy and the frozen target model. Specifically, we compute the token-wise KL divergence between the draft and target distributions at each drafted position and average it over the speculative window. This keeps exploration anchored to the distribution that ultimately governs speculative verification.
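A minimal sketch of this target-anchored term, read directly from the description above (tensor names and shapes are assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def target_anchored_kl(drafter_logits: torch.Tensor,
                       target_logits: torch.Tensor) -> torch.Tensor:
    """Token-wise KL(pi_theta || pi_target), averaged over the speculative window.

    drafter_logits, target_logits: (K, V) logits at the K drafted positions.
    The target model is frozen, so its logits carry no gradient.
    """
    log_q = F.log_softmax(drafter_logits, dim=-1)           # drafter log-probs
    log_p = F.log_softmax(target_logits.detach(), dim=-1)   # frozen target log-probs
    kl_per_token = (log_q.exp() * (log_q - log_p)).sum(dim=-1)  # KL(Q_t || P_t)
    return kl_per_token.mean()
```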

### A.3 Training and Decoding Configurations

Table [5](https://arxiv.org/html/2605.14978#A1.SS3) summarizes the default training hyperparameters, hard-window curriculum schedule, and decoding settings used throughout our experiments. Under hard-window curriculum scheduling, PPOW begins with a smaller proportion of ADAW-selected hard windows and gradually increases this proportion over the course of training.

Table 5: Training hyperparameters, curriculum schedule, and decoding configurations used in our experiments.

### A.4 Full PPOW Training Algorithm

Algorithm 1: PPOW Training Procedure

1: **Inputs:** drafter $\pi_{\theta}$, target model $\pi_{\text{target}}$, speculative window size $K$, rollout group size $G_{\mathrm{roll}}$, clip ratio $\epsilon_{\mathrm{clip}}$, KL coefficient $\beta$, relative cost $\gamma$, tolerance threshold $\epsilon$, scaling factor $\eta$
2: **for** each training prefix $\boldsymbol{x}$ **do**
3:   $\triangleright$ Generate target-side outputs and draft distributions
4:   $(\boldsymbol{y}, \mathbf{H}, \{P_t\}_{t=1}^{T}) \leftarrow \pi_{\text{target}}(\boldsymbol{x})$
5:   $\{Q_t\}_{t=1}^{T} \leftarrow \pi_{\theta}(\boldsymbol{x}, \boldsymbol{y}, \mathbf{H})$
6:   $\triangleright$ Compute criticality score
7:   **for** $t = 1, \dots, T$ **do**
8:     $v_t \leftarrow \big(1 - \frac{H(P_t)}{\log|\mathcal{V}|}\big) \cdot D_{\mathrm{KL}}(P_t \,\|\, Q_t)$
9:   **end for**
10:   $\triangleright$ ADAW-based window selection
11:   **for** each window start index $j \in \{1, \dots, T-K+1\}$ **do**
12:     $s_j \leftarrow \frac{1}{K} \sum_{t=j}^{j+K-1} v_t$
13:   **end for**
14:   Sample a training window start $j^{\star}$ from the normalized weights proportional to $\{s_j\}$
15:   $\triangleright$ Grouped rollout and reward computation
16:   **for** $g = 1, \dots, G_{\mathrm{roll}}$ **do**
17:     Roll out a speculative window $\hat{\boldsymbol{y}}_g \sim \pi_{\theta}(\cdot \mid \boldsymbol{x}, \boldsymbol{y}_{<j^{\star}}, \mathbf{H}_{<j^{\star}})$
18:     $k_g \leftarrow \textsc{Verify}(\hat{\boldsymbol{y}}_g, \pi_{\text{target}})$
19:     Compute the proximity gap $\Delta_g$ for $\hat{\boldsymbol{y}}_g$ under the same base context $(\boldsymbol{x}, \boldsymbol{y}_{<j^{\star}})$ as defined in Section [4.3.2](https://arxiv.org/html/2605.14978#S4.SS3.SSS2)
20:     $r_g \leftarrow \dfrac{k_g}{k_g \gamma + 1} + \eta \cdot \mathbf{1}[k_g = 0]\,\mathbf{1}[\Delta_g < \epsilon]$
21:   **end for**
22:   Compute normalized group-relative advantages $\{\hat{A}_g\}_{g=1}^{G_{\mathrm{roll}}}$ from rewards $\{r_g\}_{g=1}^{G_{\mathrm{roll}}}$
23:   $\triangleright$ Policy update
24:   Update $\theta$ using the clipped group-relative policy objective with target-anchored KL regularization
25: **end for**
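For concreteness, the per-group reward and advantage computation in lines 20 and 22 of Algorithm 1 can be sketched as follows. This is an illustrative PyTorch-style sketch; the mean and standard-deviation normalization of rewards is an assumption, since the algorithm only states that advantages are normalized within the group.

```python
import torch

def ppow_rewards_and_advantages(accepted_lens: torch.Tensor,
                                proximity_gaps: torch.Tensor,
                                gamma: float, eta: float, eps: float):
    """Rewards and group-relative advantages for one rollout group.

    accepted_lens:  [G_roll] accepted prefix lengths k_g from verification
    proximity_gaps: [G_roll] proximity gaps Delta_g
    """
    k = accepted_lens.float()
    # Cost-Aware Speedup Reward plus the proximity bonus for fully rejected drafts.
    rewards = k / (k * gamma + 1.0) \
        + eta * (accepted_lens == 0).float() * (proximity_gaps < eps).float()
    # Group-relative advantages (normalization scheme assumed, not specified).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return rewards, advantages
```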

## Appendix B Analysis of Confidence-Weighted Draft–Target Divergence

In Section [4.2](https://arxiv.org/html/2605.14978#S4.SS2), we define the token-level *criticality score*

$$v_t = C(P_t)\, D_{\mathrm{KL}}(P_t \,\|\, Q_t),$$

where $P_t$ and $Q_t$ denote the target and drafter distributions at position $t$, and $C(P_t)$ is an entropy-normalized confidence score of the target distribution. In the main text, this score serves as a prioritization signal for training windows. This appendix provides analytical support for this design by relating draft–target divergence to speculative acceptance behavior and clarifying the role of confidence weighting.

### B.1 Acceptance Probability and Draft–Target Divergence

Let $\alpha_t$ denote the expected acceptance probability at position $t$ under speculative decoding. Following [[14](https://arxiv.org/html/2605.14978#bib.bib1)], when a draft token is sampled from $Q_t$ and verified against $P_t$, the expected acceptance probability is

$$\alpha_t = \mathbb{E}_{y \sim Q_t}\!\left[\min\!\left(1, \frac{P_t(y)}{Q_t(y)}\right)\right] = \sum_{y \in \mathcal{V}} \min\big(P_t(y), Q_t(y)\big).$$

Using the identity $\min(a, b) = \tfrac{1}{2}(a + b - |a - b|)$, we obtain

$$\alpha_t = \frac{1}{2} \sum_{y \in \mathcal{V}} \big(P_t(y) + Q_t(y) - |P_t(y) - Q_t(y)|\big) = 1 - \frac{1}{2} \sum_{y \in \mathcal{V}} |P_t(y) - Q_t(y)| = 1 - \delta(P_t, Q_t),$$

where $\delta(P_t, Q_t)$ denotes the total variation distance between $P_t$ and $Q_t$.

Applying Pinsker’s inequality,

$$\delta(P_t, Q_t) \leq \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(P_t \,\|\, Q_t)},$$

yields

$$\alpha_t \geq 1 - \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(P_t \,\|\, Q_t)}.$$

This bound shows that a smaller draft–target KL divergence implies a higher lower bound on the acceptance probability, and therefore supports the use of draft–target divergence as a proxy for speculative difficulty.
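As a quick sanity check, the identity $\alpha_t = 1 - \delta(P_t, Q_t)$ and the Pinsker lower bound can be verified numerically on a toy vocabulary; the distributions below are arbitrary illustrative values, not quantities from the paper.

```python
import numpy as np

# Toy check of the identities above: acceptance probability equals one minus
# the total variation distance, and Pinsker's inequality gives the stated
# lower bound in terms of KL(P || Q).
P = np.array([0.7, 0.2, 0.1])          # target distribution P_t (toy values)
Q = np.array([0.5, 0.3, 0.2])          # drafter distribution Q_t (toy values)

alpha = np.minimum(P, Q).sum()         # sum_y min(P(y), Q(y))
tv = 0.5 * np.abs(P - Q).sum()         # total variation distance delta(P, Q)
kl = (P * np.log(P / Q)).sum()         # D_KL(P || Q)

assert np.isclose(alpha, 1.0 - tv)
assert alpha >= 1.0 - np.sqrt(0.5 * kl)
print(alpha, 1.0 - np.sqrt(0.5 * kl))  # acceptance vs. its Pinsker lower bound
```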

### B.2 Confidence Modulation of Draft–Target Divergence

The analysis above shows that draft–target divergence is informative about acceptance behavior. However, the practical significance of a given divergence also depends on the uncertainty of the target distribution. For the same level of draft–target divergence, discrepancies are typically more consequential when the target distribution is concentrated than when the target itself is uncertain.

To account for this effect, we introduce the confidence factor

$$C(P_t) = 1 - \frac{H(P_t)}{\log|\mathcal{V}|} \in [0, 1],$$

which assigns larger values to lower-entropy target distributions and smaller values to higher-entropy ones.

The resulting criticality score,

$$v_t = C(P_t)\, D_{\mathrm{KL}}(P_t \,\|\, Q_t),$$

can be viewed as a confidence-modulated measure of draft–target divergence. It raises the priority of positions where the drafter diverges from a confident target distribution, while reducing the influence of divergence in higher-entropy contexts. Aggregating $v_t$ over a speculative window therefore yields a window-level signal that emphasizes regions more likely to constrain speculative acceptance in practice.
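A minimal NumPy sketch of the criticality score and its window-level aggregation (the ADAW scores $s_j$ from Algorithm 1) is given below; the array shapes and the epsilon guard inside the logarithms are implementation assumptions of this sketch.

```python
import numpy as np

def criticality_scores(P: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Confidence-weighted draft-target divergence v_t = C(P_t) * KL(P_t || Q_t).

    P, Q: [T, vocab] target and drafter distributions (rows sum to 1).
    """
    eps = 1e-12
    vocab = P.shape[-1]
    entropy = -(P * np.log(P + eps)).sum(-1)                  # H(P_t)
    confidence = 1.0 - entropy / np.log(vocab)                # C(P_t) in [0, 1]
    kl = (P * (np.log(P + eps) - np.log(Q + eps))).sum(-1)    # D_KL(P_t || Q_t)
    return confidence * kl

def window_scores(v: np.ndarray, K: int) -> np.ndarray:
    """ADAW window scores s_j: mean criticality over each length-K window."""
    return np.array([v[j:j + K].mean() for j in range(len(v) - K + 1)])
```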

## Appendix C Optimization Differences Between PPOW and Continued Supervised Training

PPOW and continued supervised training target different optimization objectives in speculative decoding\. Continued supervised training improves token\-level imitation of the target model, whereas PPOW directly optimizes the window\-level utility aligned with speculative decoding efficiency at inference time\.

### C.1 Speculative Verification and Window-Level Utility

Speculative decoding commonly uses rejection\-sampling verification to preserve the target\-model distribution during inference\. Under this verification rule, the practical gain of a speculative window depends on how many of its tokens are accepted by the target model, making the accepted prefix length, rather than token\-level likelihood alone, the relevant measure of speculative utility\.

For a speculative window $\hat{\boldsymbol{y}} = (\hat{y}_1, \dots, \hat{y}_K)$, the acceptance probability of the token at position $t$ under rejection-sampling verification is

$$\alpha_t = \min\left(1,\, \frac{\pi_{\text{target}}(\hat{y}_t \mid \boldsymbol{x}, \hat{\boldsymbol{y}}_{<t})}{\pi_{\theta}(\hat{y}_t \mid \boldsymbol{x}, \hat{\boldsymbol{y}}_{<t})}\right).$$

The acceptance length $\tau$ is then determined by the first position at which verification fails:

$$\tau = \sum_{n=1}^{K} \prod_{t=1}^{n} \mathbf{1}[u_t \leq \alpha_t], \qquad u_t \sim U(0, 1).$$

Because $\tau$ is defined by consecutive acceptance from the beginning of the speculative window, early truncation under verification immediately eliminates the contribution of all subsequent tokens in that window. In this sense, speculative utility is inherently window-level, and positions that cause early verification truncation can become bottlenecks with a disproportionate impact on inference-time efficiency. This creates a mismatch with token-level training objectives and motivates the window-level optimization used in PPOW.
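The definition of $\tau$ can be made concrete with a small simulation: $\tau$ counts consecutive acceptances from the start of the window, so a single weak early position caps the expected accepted prefix even if later positions are easy to draft. The sketch below is illustrative; the acceptance probabilities are arbitrary values, not measurements from the paper.

```python
import numpy as np

def acceptance_length(alphas: np.ndarray, rng: np.random.Generator) -> int:
    """Simulate tau: consecutively accepted tokens from the window start.

    alphas: [K] per-position acceptance probabilities alpha_t.
    """
    u = rng.uniform(size=len(alphas))
    accepted = u <= alphas
    tau = 0
    for ok in accepted:
        if not ok:
            break
        tau += 1
    return tau

rng = np.random.default_rng(0)
alphas = np.array([0.95, 0.4, 0.95, 0.95, 0.95])   # one hard-to-draft position
taus = [acceptance_length(alphas, rng) for _ in range(10_000)]
# E[tau] equals the sum of prefix products of the alphas (about 2.36 here),
# far below K = 5, because the weak second position truncates most windows.
print(np.mean(taus))
```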

### C.2 Limitations of Continued Supervised Training for Speculative Decoding

Continued supervised training retains the token-level cross-entropy objective of standard supervised training, which is not fully aligned with the window-level speculative utility that matters at inference time:

$$\mathcal{L}_{\text{sup}} = \mathbb{E}_{\boldsymbol{y} \sim \pi_{\text{target}}}\left[-\log \pi_{\theta}(\boldsymbol{y} \mid \boldsymbol{x})\right] = \mathbb{E}_{\boldsymbol{y} \sim \pi_{\text{target}}}\left[\sum_{t=1}^{K} -\log \pi_{\theta}(y_t \mid \boldsymbol{x}, \boldsymbol{y}_{<t})\right].$$

This objective improves draft–target imitation. However, its effect on speculative decoding is indirect, because the optimized quantity is token-level likelihood rather than the realized utility of a speculative window. In other words, matching the target model token by token does not necessarily maximize the number of tokens that survive verification within a speculative window.

As a result, continued supervised training does not directly optimize accepted prefix length, focus explicitly on bottleneck positions that cause early verification truncation, or optimize across multiple sampled speculative windows for the same context\. These mismatches suggest that continued supervised training may provide only limited additional gains for inference\-time speculative efficiency, especially once token\-level imitation has largely saturated, and further motivate the window\-level optimization used in PPOW\.
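In code, the continued supervised objective above reduces to an ordinary per-token cross-entropy over the window. The PyTorch-style sketch below is illustrative only; it makes explicit that every position is weighted equally, regardless of whether an earlier position would already have truncated the window under verification.

```python
import torch
import torch.nn.functional as F

def continued_supervised_loss(draft_logits: torch.Tensor,
                              target_tokens: torch.Tensor) -> torch.Tensor:
    """Token-level cross-entropy over a window of target-sampled tokens.

    draft_logits: [K, vocab]; target_tokens: [K] tokens y_t drawn from the target.
    Each position contributes independently, so no extra weight is placed on
    the early positions whose rejection would truncate the whole window.
    """
    return F.cross_entropy(draft_logits, target_tokens)
```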

### C.3 PPOW as Window-Level Policy Optimization

PPOW formulates speculative decoding as a policy optimization problem, where the drafter is optimized over speculative windows using window-level rewards aligned with inference-time speculative performance (Section [4.3](https://arxiv.org/html/2605.14978#S4.SS3)):

$$J(\theta) = \mathbb{E}_{\hat{\boldsymbol{y}} \sim \pi_{\theta}}\big[R(\boldsymbol{x}, \hat{\boldsymbol{y}})\big].$$
Here, optimization is driven by the realized reward of sampled speculative windows rather than by token\-level supervision on a fixed target token sequence\. Since a full response can be decomposed into many overlapping speculative windows, not all windows contribute equally informative training signals\. Window\-level optimization therefore makes it possible to focus learning on more informative speculative windows, especially those containing positions that are more critical to early truncation under verification\.

PPOW also samples multiple speculative windows for the same context and computes group\-relative advantages among them\. This grouped optimization reflects practical settings that evaluate multiple candidate token sequences during decoding and provides a stronger relative learning signal toward speculative windows with better acceptance behavior\.
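A minimal sketch of the clipped group-relative update described above is shown below; broadcasting each window's advantage to all of its drafted tokens, and the exact loss reduction, are assumptions of this sketch rather than details specified in the paper.

```python
import torch

def clipped_policy_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        advantages: torch.Tensor,
                        eps_clip: float) -> torch.Tensor:
    """Clipped group-relative policy objective over drafted tokens.

    logp_new / logp_old: [G_roll, K] log-probs of drafted tokens under the
    current and rollout-time drafter; advantages: [G_roll] window-level
    group-relative advantages, broadcast to every token in the window.
    """
    ratio = (logp_new - logp_old).exp()
    adv = advantages.unsqueeze(-1)                            # [G_roll, 1]
    unclipped = ratio * adv
    clipped = ratio.clamp(1.0 - eps_clip, 1.0 + eps_clip) * adv
    return -torch.minimum(unclipped, clipped).mean()

# Full PPOW loss (beta is the KL coefficient): clipped objective plus the
# target-anchored KL term sketched in Appendix A.2.
# loss = clipped_policy_loss(...) + beta * target_anchored_kl(...)
```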

## Appendix D Cost-Aware Speedup Reward versus Measured Speedup Reward

A natural alternative to the Cost\-Aware Speedup Reward in Section[4\.3](https://arxiv.org/html/2605.14978#S4.SS3)is to define training rewards directly from measured inference speedup\. In practice, however, measured speedup depends strongly on the serving environment, including hardware, backend implementation, cache behavior, batching, and verification configuration\. It can also provide noisy or low\-resolution feedback at the speculative\-window level\. This appendix compares the Cost\-Aware Speedup Reward and a Measured\-Speedup\-Based Reward alternative from two perspectives: whether they exhibit similar reward trends as accepted length increases, and whether training with measured speedup yields meaningful gains in a fixed setup\.

Table 6: Reward values induced by accepted prefix length. Comparison between a Measured-Speedup-Based Reward and PPOW’s Cost-Aware Speedup Reward as accepted length increases. The two rewards are not numerically identical, but they preserve the same monotonic trend.

Table 7: Training with PPOW’s Cost-Aware Speedup Reward vs. Measured-Speedup-Based Reward. Final average acceptance length ($\tau$) on GSM8K with LLaMA-3.1-8B under the same training budget.

Table [6](https://arxiv.org/html/2605.14978#A4.T6) compares the reward values induced by accepted prefix length under the two formulations. Although the Cost-Aware Speedup Reward does not numerically match the Measured-Speedup-Based Reward, the two exhibit the same monotonic trend: longer accepted prefixes receive larger reward under both formulations. The two rewards differ in scale because the measured reward is system-dependent, but they preserve the same ordering over accepted-prefix outcomes. This shared trend is important for training, since it preserves the relative preference for speculative windows with better efficiency characteristics.

Table[7](https://arxiv.org/html/2605.14978#A4.T7)compares training with the Cost\-Aware Speedup Reward against training with a Measured\-Speedup\-Based Reward under the same budget\. In this fixed setup, the measured\-speedup\-based reward yields only a small improvement in final acceptance length\. However, such a reward remains tightly coupled to the specific serving stack used during training and may provide noisy or low\-resolution feedback at the window level\. The cost\-aware formulation therefore offers a more portable and practical default objective\.

Overall, these results support the use of the Cost\-Aware Speedup Reward during training: it tracks the same acceptance\-related trend as measured speedup and remains effective for optimization, while avoiding the system dependence of direct runtime\-based rewards\.

## Appendix E Effect of the Distribution-Based Proximity Reward Across Easy and Hard Windows

To characterize the effect of the Distribution-Based Proximity Reward $R_{\text{dist}}$, we partition speculative windows according to the acceptance behavior of the supervised baseline. We refer to windows with full acceptance ($k = K$) as *easy*, and to those for which verification terminates before the end of the speculative window ($k < K$) as *hard*.

Table 8: Effect of the Distribution-Based Proximity Reward on easy and hard windows. Easy windows are fully accepted by the supervised baseline ($k = K$), whereas hard windows terminate early within the speculative window ($k < K$). Adding $R_{\text{dist}}$ preserves performance on easy windows and further improves acceptance length while reducing the alignment divergence metric $\nabla$ on hard windows.

Under this partition, we compare the supervised baseline, PPOW without $R_{\text{dist}}$, and full PPOW. Table [8](https://arxiv.org/html/2605.14978#A5.T8) reports both the average acceptance length $\tau$ and an alignment divergence metric $\nabla$, defined token-wise as

$$\nabla = \exp(\delta) - \delta - 1, \qquad \delta = \log \pi_{\text{target}}(\hat{y}_t \mid \cdot) - \log \pi_{\theta}(\hat{y}_t \mid \cdot).$$

This metric measures draft–target alignment on drafted tokens through token-level log-probability differences, with lower values indicating better alignment.
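A direct translation of this metric into code is straightforward; the sketch below assumes the log-probabilities of the drafted tokens under both models are already available as tensors.

```python
import torch

def alignment_divergence(target_logp: torch.Tensor,
                         draft_logp: torch.Tensor) -> torch.Tensor:
    """Token-wise alignment divergence nabla = exp(delta) - delta - 1,
    with delta = log pi_target(y_t | .) - log pi_theta(y_t | .).

    Inputs are log-probabilities of the drafted tokens under the two models;
    lower values indicate tighter draft-target alignment.
    """
    delta = target_logp - draft_logp
    return (delta.exp() - delta - 1.0).mean()
```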

On easy windows, the supervised baseline, PPOW without $R_{\text{dist}}$, and full PPOW achieve identical acceptance lengths, and their $\nabla$ values differ only slightly. On hard windows, however, full PPOW improves acceptance length (4.58 $\rightarrow$ 4.86) while further reducing $\nabla$ (3.644 $\rightarrow$ 3.275) relative to PPOW without $R_{\text{dist}}$. These results support the inclusion of $R_{\text{dist}}$, which preserves performance on easy windows while yielding additional gains on hard windows.

## Appendix F Additional Baseline Comparisons

### F.1 Comparisons with OSD, Lookahead, and FastDraft

Table [9](https://arxiv.org/html/2605.14978#A6.T9) provides additional comparisons of PPOW against OSD [[21](https://arxiv.org/html/2605.14978#bib.bib26)], Lookahead [[11](https://arxiv.org/html/2605.14978#bib.bib7)], and FastDraft [[34](https://arxiv.org/html/2605.14978#bib.bib52)]. PPOW substantially outperforms OSD on Vicuna-7B [[37](https://arxiv.org/html/2605.14978#bib.bib19)] and consistently exceeds Lookahead on LLaMA-2-7B [[31](https://arxiv.org/html/2605.14978#bib.bib21)] over both GSM8K [[9](https://arxiv.org/html/2605.14978#bib.bib23)] and HumanEval [[5](https://arxiv.org/html/2605.14978#bib.bib25)]. Relative to FastDraft on LLaMA-3.1-8B-Instruct [[12](https://arxiv.org/html/2605.14978#bib.bib22)], PPOW achieves markedly larger gains on GSM8K while remaining competitive on HumanEval. These results further confirm the effectiveness of PPOW.

Table 9: Additional baseline comparisons. PPOW substantially improves over OSD, Lookahead, and FastDraft.

### F.2 Additional Results on Natural-Language Tasks

Table [10](https://arxiv.org/html/2605.14978#A6.T10) reports PPOW results on X-SUM [[25](https://arxiv.org/html/2605.14978#bib.bib55)] and WMT14 [[2](https://arxiv.org/html/2605.14978#bib.bib56)]. Compared with the larger improvements observed on GSM8K, the gains on X-SUM and WMT14 are less pronounced. This may be because these tasks are more open-ended, allowing a broader set of continuations to be accepted during speculative decoding and thereby weakening the acceptance-aware group-relative reward signal used by PPOW. This effect appears weaker on X-SUM, where outputs remain partially anchored to the source, and more pronounced on WMT14, where greater lexical and syntactic variation broadens the space of acceptable continuations. This is consistent with our broader claim that PPOW is most effective when speculative performance is governed by relatively structured bottleneck decisions.

Table 10: Additional natural-language results with LLaMA-3.1-8B-Instruct. Average acceptance length ($\tau$) and speedup on X-SUM and WMT14 for EAGLE-3, EAGLE-3 (CST), and PPOW. CST denotes continued supervised training from the EAGLE-3 checkpoint.

## Appendix G Limitations and Broader Impacts

### G.1 Limitations

PPOW improves speculative decoding by aligning training with inference\-time acceptance behavior, but this stronger alignment comes with additional training overhead\. Compared with supervised drafter training, PPOW requires grouped rollouts, speculative verification, and target\-anchored KL regularization, and our current implementation uses both a frozen target model and a trainable drafter\. This makes PPOW more resource\-intensive and more complex to implement than continued supervised training\.

This alignment objective also introduces several hyperparameters beyond standard reinforcement learning, including speculative-decoding-specific settings such as the speculative window size $K$, the relative cost $\gamma$, the proximity reward weight $\eta$, the proximity threshold $\epsilon$, and the hard-window curriculum in ADAW. While we use a unified configuration across experiments and observe stable gains, reducing this dependence through more adaptive or self-tuning variants is a natural direction for future work.

### G.2 Broader Impacts

PPOW suggests an additional algorithmic perspective on LLM inference: some components of the inference pipeline may be optimized directly with performance\-driven objectives rather than only with token\-level supervision\. In this setting, reinforcement learning provides a practical way to align training with system\-level inference behavior when the utility of a model component is determined by structured interactions during decoding rather than by local prediction quality alone\.

Framing speculative decoding as a learning problem may also relate to other inference\-time decisions, such as candidate allocation, request routing, scheduling, and load balancing\. These components are often optimized separately, but in practice they jointly affect end\-to\-end serving efficiency\. The formulation in PPOW may offer a useful perspective for future work that treats such inference\-time modules as part of a unified decision\-making environment\.
