# ConFu: Contemplate the Future for Better Speculative Sampling
Source: https://arxiv.org/html/2603.08899
###### Abstract
Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedups, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over successive decoding steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) *contemplate tokens* and *soft prompts* that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a *dynamic contemplate token mechanism with MoE* to enable context-aware future prediction, and (iii) a training framework with *anchor token sampling* and *future prediction replication* that trains the draft model to make robust future predictions. ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8–11% on Llama-3 3B/8B and by approximately 20% on Qwen-3 4B across downstream tasks. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.
## 1 Introduction
Large language models (LLMs) have achieved remarkable performance across a wide range of natural language processing tasks, yet their inference remains prohibitively expensive due to the autoregressive nature of text generation. Each decoding step requires a forward pass through the full model, resulting in high latency and computational cost. To mitigate this issue, a growing body of work has explored *speculative decoding* (Leviathan et al., 2023; Miao et al., 2024; Qin et al., 2024, 2025; Li et al., 2024b, a, 2025), an inference paradigm that employs a lightweight *draft model* to propose candidate tokens which are subsequently verified by the target model. By amortizing multiple draft tokens within a single verification pass of the target model, speculative decoding can accelerate generation without compromising the quality of outputs.
A central factor determining the effectiveness of speculative decoding is the quality of the draft model. Recent advances have led to a series of draft models with increasingly strong predictive capabilities. Notably, the EAGLE family (Li et al., 2024b, a, 2025) represents the state of the art in speculative decoding. EAGLE-1 (Li et al., 2024b) first demonstrated the effectiveness of training a single-layer transformer that exploits the hidden states of the target model to generate draft tokens autoregressively. EAGLE-2 (Li et al., 2024a) introduced a context-aware dynamic draft tree into the drafting process. EAGLE-3 (Li et al., 2025) further enhanced both the architecture and the training framework, setting new benchmarks in speculative decoding speed. Across diverse benchmarks, the EAGLE models consistently deliver superior speedups compared to prior draft models (Cai et al., 2024; Zhang et al., 2024) and are recognized as the current best-in-class approach.
Despite these successes, existing draft models, including the EAGLE series, share a common drawback: they generate draft tokens by conditioning solely on the current prefix. This design is prone to error accumulation: as decoding proceeds, small prediction errors compound, the draft distribution drifts from the target distribution, and token acceptance rates decline. This misalignment undermines the potential efficiency gains of speculative decoding.
In this work, we argue that draft models should not merely focus on predicting the immediate next token, but should also anticipate the *future direction* of generation. Intuitively, before committing to specific token choices, a draft model can benefit from understanding what the target model is planning to generate next at a higher level, namely, the target model's current "thought". If the draft model is provided with information about the target model's current "thought" and is encouraged to draft tokens that follow this direction, it becomes more likely to propose candidates that stay on the same semantic trajectory as planned by the target model. As a result, the draft tokens are more accurate, and therefore less likely to be rejected during the verification stage.
We instantiate this idea in ConFu (Contemplate the Future), a novel speculative decoding framework. ConFu introduces three key innovations. First, we introduce *contemplate tokens* and *soft prompts* that encourage the target model to expose signals of its intermediate reasoning with minimal additional inference cost. These signals are then provided to the draft model as auxiliary inputs, enabling more accurate and reliable token drafting. Second, we propose a *dynamic contemplate token mechanism based on Mixture-of-Experts (MoE)*, which allows contemplate tokens to adapt to diverse contexts and achieve greater expressive capacity. Third, we develop a training framework based on *anchor token sampling* and *future prediction replication*, which efficiently and effectively trains the model to learn robust future predictions.
Experiments on SpecBench (Xia et al., 2024) demonstrate that ConFu consistently improves both token acceptance rates and decoding speed over the state-of-the-art speculative decoding baseline, EAGLE-3 (Li et al., 2025). Across a wide range of downstream tasks, including writing, question answering, summarization, translation, coding, and mathematical reasoning, ConFu achieves substantial gains under diverse decoding conditions. On average, ConFu improves token acceptance rates and generation speed by 8–11% with Llama-3 3B and 8B models. These improvements are consistent across all task categories, sampling temperatures, and computation budgets.
More broadly, our results suggest that speculative decoding can be significantly strengthened by equipping draft models with the ability to *contemplate the future*. By conditioning draft generation on the target model's predicted semantic trajectory, ConFu produces draft tokens that align more closely with the target distribution, thereby reducing rejection rates during verification and improving overall throughput. EAGLE (Li et al., 2024b) introduced a method for adding target-biased guidance to draft models, and subsequent works have focused on mitigating training and inference mismatch (Li et al., 2025; Zhang et al., 2024; Hu et al., 2025). In this work, we provide a new direction for improving draft generation by additionally conditioning the draft model with contemplate tokens and future tokens. We view ConFu as an important step toward integrating speculative decoding with latent reasoning paradigms (Hao et al., 2024; Cheng and Van Durme, 2024; Shen et al., 2025). To the best of our knowledge, this is the first work to explicitly bridge speculative decoding with continuous latent "thought" representations, opening a new direction for accelerating LLM inference through future-aware generation.
## 2 Preliminaries
Speculative decoding utilizes a small, fast *draft model* (M_d) to generate a sequence of candidate tokens, which are then verified in a single, parallel forward pass by the large, powerful *target model* (M_t) (Leviathan et al., 2023; Miao et al., 2024).
In its standard form, the process works as follows:
1. **Drafting**: Given a prompt or a previously generated sequence x_{1:n}, the draft model M_d autoregressively generates a short sequence of K draft tokens, x̃_{n+1},..., x̃_{n+K}.
2. **Verification**: The target model M_t takes the combined sequence x_{1:n}, x̃_{n+1},..., x̃_{n+K} as input and performs a single forward pass to compute the probability distributions for the next token at each position.
3. **Acceptance/Rejection**: The draft tokens are checked sequentially. For each position i from 1 to K, the draft token x̃_{n+i} is accepted with probability min(1, p_t(x̃_{n+i} | x_{1:n}, x̃_{n+1},..., x̃_{n+i-1}) / p_d(x̃_{n+i} | x_{1:n}, x̃_{n+1},..., x̃_{n+i-1})); under greedy decoding this reduces to checking whether the draft token matches the target model's prediction. If a token is accepted, the process continues to the next one. If a token is rejected, it and all subsequent draft tokens are discarded.
4. **Correction**: The first rejected token is replaced by a new token sampled from the residual distribution at that position, which is proportional to max(0, p_t(·) − p_d(·)); this guarantees that the overall output distribution is identical to that of the target model. The final accepted sequence becomes the input for the next drafting step.
The speedup comes from accepting multiple draft tokens in a single verification step, effectively replacing several sequential forward passes of the target model with one.
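For concreteness, the following is a minimal NumPy sketch of this draft-then-verify loop, using the standard rejection-sampling acceptance rule of Leviathan et al. (2023); the function name and the toy array interface are ours for illustration, not part of any library.

```python
import numpy as np

def speculative_verify(draft_tokens, draft_probs, target_probs, rng):
    """Verify K draft tokens against the target distributions.

    draft_tokens : list of K token ids proposed by the draft model.
    draft_probs  : (K, V) array, draft distribution at each drafted position.
    target_probs : (K + 1, V) array, target distributions computed in one
                   parallel forward pass over the prefix plus all drafts.
    Returns the accepted tokens plus one correction or bonus token.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p_t, p_d = target_probs[i, tok], draft_probs[i, tok]
        # Accept with probability min(1, p_t / p_d); this keeps the output
        # distribution identical to sampling from the target model alone.
        if rng.random() < min(1.0, p_t / max(p_d, 1e-12)):
            accepted.append(tok)
        else:
            # Rejected: resample from the residual distribution proportional
            # to max(0, p_t - p_d) and discard all later draft tokens.
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(len(residual), p=residual))
            return accepted
    # All drafts accepted: take one extra "bonus" token from the target.
    bonus = rng.choice(target_probs.shape[1], p=target_probs[len(draft_tokens)])
    accepted.append(bonus)
    return accepted
```

With toy inputs, `speculative_verify(draft_tokens, q, p, np.random.default_rng(0))` returns the accepted prefix followed by either a corrected token (step 4) or a bonus token when all drafts pass.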
To improve the acceptance rate, the drafting process can be extended to generate a *tree of candidate tokens* instead of a single linear sequence (Miao et al., 2024; Sun et al., 2023). The draft model proposes multiple potential tokens at each step, creating a tree of draft tokens. The target model then validates all paths in this tree in parallel using a tree attention mechanism. The longest path that is consistent with the target model's predictions is accepted. This approach increases the likelihood that at least one drafted sequence will be correct, leading to a higher average number of accepted tokens per step.
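The tree is verified with a custom attention mask in which each draft node attends only to itself and its ancestors, so every root-to-leaf path behaves like an independent linear draft. A small, hypothetical helper that builds such a mask from parent pointers might look as follows; the interface is illustrative and not taken from any specific implementation.

```python
import torch

def tree_attention_mask(parents):
    """Build an attention mask for a flattened draft-token tree.

    parents[i] is the index of node i's parent (-1 for nodes that attach
    directly to the verified prefix). Node i may attend to itself and to
    every ancestor on its root path, so each root-to-leaf path is verified
    exactly as if it were a linear draft sequence.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parents[j]
    return mask

# Example: two candidate branches after the prefix.
#   0 ── 1 ── 3
#    └── 2
print(tree_attention_mask([-1, 0, 0, 1]).int())
```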
EAGLE (Li et al., 2024b, a, 2025) is an advanced speculative decoding framework that addresses the core challenge of low acceptance rates by eliminating the need for a separate, misaligned draft model. Instead, it integrates the drafting mechanism directly into the target model itself.
The key innovation in EAGLE is the use of lightweight draft heads. The draft model can be seen as a single-layer transformer that exploits the hidden states of the target model. By reusing the target model's hidden representations, the EAGLE draft model achieves high acceptance rates for its draft tokens, and thanks to its lightweight architecture, the cost of generating draft tokens is much smaller than that of running an independent draft model. EAGLE-3 further improves the architecture by utilizing the hidden states of the target model from multiple layers. Specifically, EAGLE-3 concatenates target hidden states from an initial, a middle, and the final layer as h_t^{M_t,cat} ∈ ℝ^{3d}, which is then down-projected to obtain h_t^{M_d} = W_{proj} h_t^{M_t,cat} ∈ ℝ^d. The draft model then uses the hidden state h_t^{M_d} to generate draft tokens autoregressively.
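As an illustration of this fusion step, the sketch below implements the concatenate-and-project operation in PyTorch. The module name and the choice of which three layers to tap are assumptions made for the example; the actual draft model would then feed h_t^{M_d} into its single transformer layer, which this sketch omits.

```python
import torch
import torch.nn as nn

class Eagle3StyleFusion(nn.Module):
    """Illustrative EAGLE-3-style feature fusion: concatenate target hidden
    states from three layers (3d) and down-project to d for the drafter."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(3 * d_model, d_model, bias=False)  # W_proj

    def forward(self, h_low, h_mid, h_final):
        # h_* : (batch, seq, d_model) hidden states from the target model.
        h_cat = torch.cat([h_low, h_mid, h_final], dim=-1)  # (batch, seq, 3d)
        return self.proj(h_cat)                              # (batch, seq, d)

d = 64
fuse = Eagle3StyleFusion(d)
hidden_states = [torch.randn(1, 5, d) for _ in range(3)]
print(fuse(*hidden_states).shape)  # torch.Size([1, 5, 64])
```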
## 3 ConFu: The Methodology
In this section, we introduce our model architecture and describe how the draft model is trained. Specifically, Section 3.1 introduces the overall architecture of ConFu and the inference framework with contemplate tokens. Section 3.2 then describes how we use MoE to realize dynamic contemplate tokens. Finally, Section 3.3 describes how ConFu is trained.
### 3.1 Capture Future with Contemplate Tokens
The goal of future prediction is to generate a continuous embedding that captures the current "thought" of the target model, which can then guide the draft model in sampling more accurate future tokens. Two key requirements must be satisfied: (1) the future prediction module must have *sufficient capacity* to approximate the target model's internal reasoning, and (2) it should incur *minimal additional cost* during inference.
Recent studies on latent reasoning demonstrate that, after post-training, LLMs can generate continuous "thought tokens" that serve as intermediate reasoning states (Hao et al., 2024; Cheng and Van Durme, 2024; Shen et al., 2025). While effective, generating such tokens requires an autoregressive process with multiple forward passes of the target model, which is prohibitively expensive. Instead, we propose to exploit *contemplate tokens*, also known as *pause tokens* (Goyal et al., 2023). A pause token is a special token appended to the input prefix that causes the LLM to perform additional computation before producing the next output. Goyal et al. (2023) observed that introducing pause tokens improves reasoning accuracy, and attributed this effect to…