BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding
Summary
BlockPilot proposes an instance-adaptive policy that predicts the optimal block size for diffusion-based speculative decoding, achieving significant speedup with minimal overhead.
Source: https://huggingface.co/papers/2606.31315
Abstract
Speculative decoding with adaptive block size selection improves inference efficiency by predicting optimal block sizes from prefilling representations, achieving significant speedup with minimal overhead.
Speculative decoding accelerates inference by using a lightweight draft model to generate candidate tokens in parallel, which are then verified by the target model, enabling lossless acceleration. Recently, diffusion-based speculative decoding has further improved parallelism by generating multiple tokens per forward pass via block-level diffusion, achieving state-of-the-art (SOTA) performance. However, existing methods adopt a fixed inference block size and assume a uniform optimal decoding strategy across all inputs. In this paper, we show that this assumption is suboptimal, as the optimal block size varies across samples and plays a critical role in speculative decoding performance. Moreover, these values exhibit a clear local structure, concentrating around the training block size, which reduces the problem to a low-dimensional and structured decision space. Based on these insights, we propose BlockPilot, a sample-adaptive policy that predicts the optimal block size from the prefilling representation. Specifically, we formulate block size selection as a lightweight policy learning problem and propose an instance-adaptive decision mechanism that predicts the optimal block size based on the representation of the prefilling stage. The prediction is performed only once after prefilling, allowing for seamless integration. Extensive experiments demonstrate that our method is plug-and-play, introduces minimal overhead, and consistently improves efficiency, achieving an acceptance length of 5.92 and a 4.20× speedup on Qwen3-4B under temperature T=1.
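To make the idea concrete, here is a minimal sketch of a one-shot block size selector of the kind the abstract describes. It is not the authors' implementation: the abstract does not specify the policy architecture, the pooling, or the candidate set. The sketch assumes a linear softmax head over mean-pooled prefill hidden states and a small candidate set centred on the training block size. All names (`candidate_block_sizes`, `mean_pool`, `predict_block_size`), the `radius` and `step` parameters, and the toy weights are hypothetical.

```python
import math
from typing import List, Sequence


def candidate_block_sizes(train_block_size: int, radius: int = 2, step: int = 2) -> List[int]:
    """Hypothetical candidate set: a few block sizes centred on the training block size."""
    sizes = [train_block_size + k * step for k in range(-radius, radius + 1)]
    return [s for s in sizes if s >= 1]


def mean_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token prefill hidden states into one vector (an assumed pooling choice)."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(row[d] for row in hidden_states) / n for d in range(dim)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_block_size(
    hidden_states: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    candidates: Sequence[int],
) -> int:
    """Score each candidate with a linear head and return the most probable block size.

    Intended to be called once, right after prefilling; the chosen size is then
    used for every drafting step of that request.
    """
    pooled = mean_pool(hidden_states)
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[best]


# Toy usage: two prefill tokens, hidden size 2, training block size 16.
candidates = candidate_block_sizes(16)
toy_hidden = [[1.0, 0.0], [0.0, 1.0]]
toy_weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
toy_biases = [0.0] * len(candidates)
chosen = predict_block_size(toy_hidden, toy_weights, toy_biases, candidates)
```

Because the selector runs only once per request, its cost is a single small matrix-vector product on top of prefilling, which is consistent with the minimal overhead the abstract reports. In practice the weights would be learned from per-sample measurements of which block size performed best.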
Similar Articles
Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing
Proposes PPOW, a reinforcement learning framework for optimizing draft models in speculative decoding using window-level objectives and adaptive windowing, achieving significant speedups across multiple benchmarks.
SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting
This paper introduces SpecBlock, a block-iterative speculative decoding method that combines path dependence with efficient drafting to accelerate LLM inference. It demonstrates improved speedup over existing methods like EAGLE-3 while maintaining lower drafting costs.
DFlash: Block Diffusion for Flash Speculative Decoding
DFlash is a new speculative decoding framework that uses a lightweight block diffusion model for parallel token drafting, achieving over 6x acceleration compared to autoregressive methods. It significantly outperforms existing state-of-the-art methods like EAGLE-3 while maintaining high output quality.
PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding
This paper introduces Parallel Speculative Decoding (PSD), a training-free framework that accelerates diffusion LLM inference by jointly improving spatial and temporal efficiency, achieving up to 5.5× tokens per forward pass with comparable quality to greedy decoding.
TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding
TAPS proposes a target-aware prefix tree selection method for diffusion-drafted speculative decoding, achieving up to 7.9x lossless end-to-end speedup by improving the acceptance-cost tradeoff over prior methods.