Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving
Summary
Fast-dDrive is a block-diffusion VLA model for end-to-end autonomous driving that achieves state-of-the-art trajectory accuracy while delivering over 12x throughput speedup over autoregressive baselines, addressing the trade-off between high-fidelity planning and efficient inference for edge deployment.
# Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving
Source: [https://arxiv.org/html/2605.23163](https://arxiv.org/html/2605.23163)
Kewei Zhang¹\*, Jin Wang³,²\*, Sensen Gao², Chengyue Wu³,², Yulong Cao², Songyang Han², Boris Ivanovic², Langechuan Liu², Marco Pavone², Song Han⁴,², Daquan Zhou¹†, Enze Xie²†

¹Peking University, ²NVIDIA, ³The University of Hong Kong, ⁴MIT. \*Equal contribution. †Co-lead.
###### Abstract
End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a precarious balance between high-fidelity trajectory planning and efficient inference. Existing paradigms typically fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from “logical leakage” that violates the fundamental perceive-then-plan causality. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking $N$ stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we effectively suppress prediction variance at a fractional computational cost. Empirical results demonstrate that Fast-dDrive redefines the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves SOTA ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces average L2 error to $0.32$ m (a $22\%$ improvement). When integrated with SGLang, our framework delivers a $12\times$ throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.
Links: [GitHub Code](https://github.com/NVlabs/Fast-dLLM) | [Project Page](https://nvlabs.github.io/Fast-dLLM/fast_ddrive/)
## 1 Introduction
End-to-end (E2E) autonomous driving has progressed rapidly by unifying perception, reasoning, and planning within a single trainable system (Hu et al., [2023](https://arxiv.org/html/2605.23163#bib.bib43); Jiang et al., [2023](https://arxiv.org/html/2605.23163#bib.bib44); Xu et al., [2025](https://arxiv.org/html/2605.23163#bib.bib39)). A growing line of work extends this paradigm with Vision-Language Models (VLMs) and Vision-Language-Action (VLA) models (Tian et al., [2024](https://arxiv.org/html/2605.23163#bib.bib10); [Zhou et al.](https://arxiv.org/html/2605.23163#bib.bib3); Rowe et al., [2025](https://arxiv.org/html/2605.23163#bib.bib6); Ma et al., [2025](https://arxiv.org/html/2605.23163#bib.bib4)), which leverage broad world knowledge and natural-language reasoning to handle the long-tail scenarios that dominate real-world driving and to expose interpretable explanations of the agent’s decisions. For any such system to be practically useful, two requirements must be met *simultaneously*: the predicted trajectory must be accurate and globally consistent with the model’s reasoning, and inference must be efficient enough on edge hardware at batch size one to remain competitive with classical planners. Existing VLAs typically satisfy at most one of these criteria.
Figure 1: (a) Comparison of E2E driving VLA paradigms. AR VLAs are memory-bandwidth-bound at batch size one and produce only one token per forward pass; full-sequence diffusion VLAs preclude KV-cache reuse and admit logical leakage across the perceive-then-plan stages. Fast-dDrive overcomes both via section-aligned block diffusion, and further pre-fills template tokens as a frozen scaffold to accelerate inference and inject schema priors. (b) Our combined approach achieves up to $11.8\times$ end-to-end speedup compared to the AR baseline, measured on a single NVIDIA H100 GPU.

Driving VLAs are predominantly built on autoregressive (AR) decoders inherited from general-purpose VLMs (Liu et al., [2023](https://arxiv.org/html/2605.23163#bib.bib41); Bai et al., [2025](https://arxiv.org/html/2605.23163#bib.bib40)), which emit the structured reasoning trace and the trajectory tokens one at a time. Sequential decoding causes a well-known *exposure-bias* effect: each waypoint conditions on previously emitted (and possibly noisy) coordinates, so small errors at the start of a 5 s plan can compound into physically implausible maneuvers (Huang et al., [2025](https://arxiv.org/html/2605.23163#bib.bib42)). In addition, single-token decoding at batch size one is strictly memory-bandwidth-bound on modern GPUs: each new token reloads the full set of model weights while leaving the available parallel compute largely idle, making efficient on-vehicle deployment fundamentally hard (Wu et al., [2026](https://arxiv.org/html/2605.23163#bib.bib29), [2025](https://arxiv.org/html/2605.23163#bib.bib33)).
Recent diffusion-based language models (Nie et al., [2025](https://arxiv.org/html/2605.23163#bib.bib21); You et al., [2025](https://arxiv.org/html/2605.23163#bib.bib26); Yu et al., [2025](https://arxiv.org/html/2605.23163#bib.bib27)), typically formulated as masked-diffusion modeling (MDM) where masked tokens are iteratively unmasked via bidirectional attention, replace AR with iterative denoising that provides global context at every refinement step. Applied to driving, dVLM-AD (Ma et al., [2025](https://arxiv.org/html/2605.23163#bib.bib4)) reformulates the structured driving response as a single bidirectional denoising target and improves reasoning–action consistency over AR baselines, but at two structural costs: (i) full-sequence bidirectional attention precludes KV-cache reuse, keeping end-to-end latency far above AR baselines; and (ii) treating the response as one bidirectional unit ignores its inherent causal structure (perception, explanation, meta-behavior decision, and trajectory in that order), admitting *logical leakage* where the planned trajectory can retroactively influence the model’s stated perception. We instead propose Fast-dDrive (Figure [1](https://arxiv.org/html/2605.23163#S1.F1)), a block-diffusion VLA that decodes the structured driving output section by section under strict causal ordering, with bidirectional refinement confined within each section, directly resolving both costs while preserving the global-context benefit of diffusion.
On top of this paradigm, Fast-dDrive further exploits a structural observation about modern driving VLMs (Ma et al., [2025](https://arxiv.org/html/2605.23163#bib.bib4); Rowe et al., [2025](https://arxiv.org/html/2605.23163#bib.bib6); [Zhou et al.](https://arxiv.org/html/2605.23163#bib.bib3)): their structured outputs bundle perception, chain-of-thought, and trajectory into a schema-defined JSON whose keys and syntax are determined entirely by the schema rather than by the model (Gu et al., [2026](https://arxiv.org/html/2605.23163#bib.bib5)). We treat those deterministic tokens as a frozen *scaffold* and denoise only the value tokens, concentrating model capacity on the few positions that actually require prediction. Building on this scaffold and the Fast-dVLM (Wu et al., [2026](https://arxiv.org/html/2605.23163#bib.bib29)) architecture with a Qwen2.5-VL-3B (Bai et al., [2025](https://arxiv.org/html/2605.23163#bib.bib40)) backbone, our contributions span three axes: a section-weighted, noise-adaptive training scheme that prioritizes safety-critical reasoning; a scaffold-aware self-speculative decoder that auto-accepts structural tokens and verifies an MDM draft with the AR head, delivering AR-quality outputs at substantially lower latency; and a low-overhead test-time inference scaling scheme that, with the deterministic prefix decoded once, samples the AR verifier of Scaffold Speculative Decoding only on the trajectory section and averages a small number of trajectory rollouts forked from a shared KV cache, trading a fraction of additional inference compute for a meaningful accuracy gain. Concretely:
- Section-Aware Structured Diffusion (SASD). A scaffold-based training scheme that aligns block boundaries with semantic sections (ensuring $100\%$ structural validity by construction) and uses section-weighted cross-entropy together with a section-adaptive Beta noise schedule to concentrate capacity on safety-critical sections, at zero inference overhead.
- Scaffold Speculative Decoding and shared-prefix test-time scaling. Scaffold Speculative Decoding (SS) auto-accepts scaffold tokens and lets the AR head verify a parallel MDM draft, producing outputs identical to pure AR at substantially lower latency. We further turn the deterministic SS verifier into a tunable inference-scaling axis: with the prefix decoded once and the verifier sampled at non-zero temperature only on the trajectory section, $N$ trajectory rollouts are forked from a shared KV cache and averaged, trading a fraction of extra inference compute for a meaningful accuracy gain.
- State-of-the-art accuracy at $12\times$ throughput. On the WOD-E2E test set, Fast-dDrive achieves the lowest ADE@3s and ADE@5s among compared methods while maintaining the highest RFS among diffusion-based VLAs. It delivers this SOTA accuracy at over $200$ tokens per second on a single H100, a $6\times$ throughput increase over full-sequence diffusion and $4\times$ over AR baselines. When integrated with SGLang, this efficiency gain scales to a $12\times$ speedup over AR baselines, demonstrating that high-capacity VLAs can effectively bridge the gap toward real-time on-vehicle deployment without accuracy compromises.
These results indicate that block-diffusion VLAs, when paired with structure-aware training and inference, can match or exceed the accuracy of strong AR and full-sequence-diffusion baselines while running at substantially higher throughput, without sacrificing the interpretability of structured CoT outputs.
## 2 Related Work
Vision-Language-Action Models for Autonomous Driving. Vision-Language-Action (VLA) models unify perception, reasoning, and planning within a single multimodal framework. Autoregressive VLAs leverage language-model reasoning to improve trajectory prediction in long-tail scenarios, with recent works further incorporating chain-of-thought reasoning (Wang et al., [2024](https://arxiv.org/html/2605.23163#bib.bib9); Tian et al., [2024](https://arxiv.org/html/2605.23163#bib.bib10); [Zhou et al.](https://arxiv.org/html/2605.23163#bib.bib3)). However, AR decoding is inherently sequential and memory-bandwidth-bound at batch size 1 (Wu et al., [2026](https://arxiv.org/html/2605.23163#bib.bib29)), a critical efficiency bottleneck for latency-sensitive driving deployments, and the autoregressive factorization introduces exposure bias that compounds waypoint errors over longer horizons. To address these issues, diffusion-based VLAs have been explored for driving. dVLM-AD (Ma et al., [2025](https://arxiv.org/html/2605.23163#bib.bib4)) applies discrete masked diffusion to jointly generate structured reasoning and trajectories, improving behavior-trajectory consistency. Concurrent works (Li et al., [2025](https://arxiv.org/html/2605.23163#bib.bib14); Wen et al., [2025](https://arxiv.org/html/2605.23163#bib.bib15)) also adopt discrete diffusion for driving VLAs. However, these methods rely on full-sequence bidirectional diffusion, which precludes KV-cache reuse and incurs high computational overhead. Our work addresses this efficiency gap by adopting block diffusion, enabling parallel generation within blocks while maintaining causal ordering across blocks.
Diffusion Large Language Models. Discrete diffusion for text has progressed from foundational formulations (Austin et al., [2021](https://arxiv.org/html/2605.23163#bib.bib16); Li et al., [2022](https://arxiv.org/html/2605.23163#bib.bib17)) through refined masked diffusion objectives (Lou et al., [2024](https://arxiv.org/html/2605.23163#bib.bib18); Sahoo et al., [2024](https://arxiv.org/html/2605.23163#bib.bib19); Shi et al., [2024](https://arxiv.org/html/2605.23163#bib.bib20)) to large-scale models such as LLaDA (Nie et al., [2025](https://arxiv.org/html/2605.23163#bib.bib21)) and Dream (Ye et al., [2025](https://arxiv.org/html/2605.23163#bib.bib22)) that match autoregressive performance. Post-training methods (Zhu et al., [2025](https://arxiv.org/html/2605.23163#bib.bib23); Wang et al., [2025](https://arxiv.org/html/2605.23163#bib.bib24)) further align diffusion LMs with human preferences, and multimodal extensions (Yang et al., [2025](https://arxiv.org/html/2605.23163#bib.bib25); You et al., [2025](https://arxiv.org/html/2605.23163#bib.bib26); Yu et al., [2025](https://arxiv.org/html/2605.23163#bib.bib27)) integrate visual instruction tuning. A key limitation of full-sequence diffusion LMs is the inability to leverage KV caching. Block Diffusion (Arriola et al., [2025](https://arxiv.org/html/2605.23163#bib.bib28)) addresses this by partitioning the output into fixed-size blocks with bidirectional attention *within* blocks and causal attention *across* blocks, recovering KV-cache compatibility. Fast-dVLM (Wu et al., [2026](https://arxiv.org/html/2605.23163#bib.bib29)) extends this to vision-language models, achieving a significant speedup over AR baselines via direct AR-to-diffusion conversion and self-speculative decoding (Wu et al., [2025](https://arxiv.org/html/2605.23163#bib.bib33)). Our work builds upon Fast-dVLM and introduces structure-aware scaffold diffusion with safety-prioritized training that exploits the structured output format of autonomous driving.
Efficient Decoding and Test-Time Scaling. Speculative decoding (Leviathan et al., [2023](https://arxiv.org/html/2605.23163#bib.bib30); Chen et al., [2023](https://arxiv.org/html/2605.23163#bib.bib31)) accelerates AR generation by drafting multiple tokens for parallel verification. Self-speculative variants (Zhang et al., [2024](https://arxiv.org/html/2605.23163#bib.bib32)) eliminate the separate draft model by reusing the same model for both drafting and verification. Fast-dLLM (Wu et al., [2025](https://arxiv.org/html/2605.23163#bib.bib33)) extends this to block diffusion, where the MDM head drafts tokens via bidirectional attention and an AR pass with causal attention verifies the draft. Medusa (Cai et al., [2024](https://arxiv.org/html/2605.23163#bib.bib46)) and EAGLE (Li et al., [2024a](https://arxiv.org/html/2605.23163#bib.bib47)) propose lightweight draft heads for tree-structured verification, further improving acceptance rates. Our Scaffold Speculative Decoding builds on the self-speculative framework of Fast-dLLM but exploits the known output structure to auto-accept scaffold tokens and skip redundant verification. Test-time compute scaling has been explored through Best-of-N sampling (Cobbe et al., [2021](https://arxiv.org/html/2605.23163#bib.bib34); Lightman et al., [2023](https://arxiv.org/html/2605.23163#bib.bib35)), reward-guided search (Snell et al., [2024](https://arxiv.org/html/2605.23163#bib.bib48)), and multi-modal trajectory selection in diffusion planners (Liao et al., [2025](https://arxiv.org/html/2605.23163#bib.bib36); Yang et al., [2024](https://arxiv.org/html/2605.23163#bib.bib37)). These approaches typically require a separate verifier or a large sample budget. Our shared-prefix rollout scheme instead exploits the deterministic structure of the first three sections to amortize prefix computation, applying stochasticity only to the trajectory section at a fractional per-rollout cost.
## 3 Methodology
We present Fast\-dDrive, a block\-diffusion VLA for end\-to\-end autonomous driving\. We first review the block\-diffusion formulation \(§[3\.1](https://arxiv.org/html/2605.23163#S3.SS1)\), then describe our structure\-aware scaffold diffusion training \(§[3\.2](https://arxiv.org/html/2605.23163#S3.SS2)\), the two inference modes it admits \(Section Diffusion and Scaffold Speculative Decoding, §[3\.3](https://arxiv.org/html/2605.23163#S3.SS3)\), and a low\-overhead test\-time inference scaling scheme that decodes the deterministic prefix once and averages multiple stochastic trajectory\-section rollouts forked from a shared KV cache \(§[3\.4](https://arxiv.org/html/2605.23163#S3.SS4)\)\.
### 3.1 Preliminaries: Block-Causal Masked Diffusion
#### Masked Diffusion Language Models.
Let $\mathbf{x}_0 = (x_1, \ldots, x_L)$ be the target token sequence and $\mathbf{c} = (\mathbf{v}, \mathbf{p})$ the conditioning context (visual features and text prompt). A masked diffusion model (Sahoo et al., [2024](https://arxiv.org/html/2605.23163#bib.bib19)) defines a forward process that randomly replaces tokens with a special $[\mathrm{MASK}]$ token according to a noise schedule $\{\lambda_t\}_{t=1}^{T}$, yielding a corrupted sequence $\mathbf{x}_t$. The reverse process applies a denoising policy $p_\theta$ that predicts replacements for masked positions while keeping visible tokens fixed. Training minimizes the masked cross-entropy loss:

$$\mathcal{L}_{\mathrm{MDM}}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_t}\left[-\frac{1}{|\mathcal{M}_t|}\sum_{i \in \mathcal{M}_t} \log p_\theta(x_0^i \mid \mathbf{x}_t, \mathbf{c})\right], \tag{1}$$

where $\mathcal{M}_t = \{i : x_t^i = [\mathrm{MASK}]\}$ is the set of masked indices at step $t$.
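As a minimal illustration of Eq. (1) for a single sample and a single noise level, the sketch below averages the negative log-likelihood over masked positions only. The helper name `masked_cross_entropy` and the toy probability tables are illustrative assumptions, not part of the Fast-dDrive code.

```python
import math

MASK = "[MASK]"

def masked_cross_entropy(x0, xt, probs):
    """Mean negative log-likelihood over masked positions only.

    x0    : list of clean target tokens
    xt    : corrupted sequence of the same length, MASK at noised positions
    probs : probs[i] maps token -> model probability at position i
    """
    masked = [i for i, tok in enumerate(xt) if tok == MASK]
    if not masked:
        return 0.0  # nothing was masked, so there is nothing to predict
    nll = -sum(math.log(probs[i][x0[i]]) for i in masked)
    return nll / len(masked)

# Toy example (hypothetical values): positions 1 and 3 are masked.
x0 = ["yes", "no", "no", "yes"]
xt = ["yes", MASK, "no", MASK]
probs = [
    {"yes": 0.9, "no": 0.1},
    {"yes": 0.2, "no": 0.8},
    {"yes": 0.5, "no": 0.5},
    {"yes": 0.5, "no": 0.5},
]
loss = masked_cross_entropy(x0, xt, probs)
print(f"masked cross-entropy: {loss:.4f}")
```

Visible positions (0 and 2 here) contribute nothing to the loss, which mirrors the sum over $\mathcal{M}_t$ in Eq. (1).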
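| Section | Value | Scaffold | Blocks |
|---|---|---|---|
| critical\_objects | 12 | 80 | 1 |
| explanation | 192 | 6 | 6 |
| future\_meta\_behavior | 6 | 18 | 1 |
| trajectory | 70 | 20 | 3 |
| Total | 280 | 124 | 11 |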
```json
{
  "critical_objects": {
    "nearby_vehicle": "yes", "pedestrian": "no",
    "cyclist": "no", "traffic_element": "yes",
    "road_hazard": "no", "weather_condition": "no",
    "construction": "no", "emergency_vehicle": "no",
    "animal": "no", "special_vehicle": "no",
    "conflicting_vehicle": "no",
    "door_opening_vehicle": "no"
  },
  "explanation": "This is an example.",
  "future_meta_behavior": {
    "longitudinal": "decelerate",
    "lateral": "keep lane"
  },
  "trajectory": "[[ +003.30,-000.01], [ +006.71,-000.04], [ +010.12,-000.09], [ +013.42,-000.16], [ +016.30,-000.24]]"
}
```
Table 1: Top: per-section value/scaffold token counts in our schema. The scaffold accounts for $\sim 30\%$ of decoded tokens. Bottom: an example structured output of Fast-dDrive. Only value tokens need to be decoded.
#### Block\-Causal Diffusion\.
Full-sequence bidirectional diffusion (Nie et al., [2025](https://arxiv.org/html/2605.23163#bib.bib21)) precludes KV-cache reuse and requires full recomputation at every denoising step. Block Diffusion (Arriola et al., [2025](https://arxiv.org/html/2605.23163#bib.bib28)) addresses this by partitioning the output into $B$ blocks of size $d$: $\mathbf{x}_0 = [\mathbf{b}_1, \ldots, \mathbf{b}_B]$, where blocks are generated left-to-right with *bidirectional* attention within each block and *causal* attention across blocks. Formally, block $\mathbf{b}_j$ attends to the full prompt $\mathbf{c}$ and all preceding blocks $\mathbf{b}_{1:j-1}$ (whose KV cache can be reused), but not to future blocks $\mathbf{b}_{j+1:B}$. This recovers KV-cache compatibility while retaining parallel generation within each block.
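The attention pattern described above can be sketched as a boolean mask. This is a simplified illustration with equal-size blocks; the function name `block_causal_mask` and the choice to let prompt tokens attend causally among themselves are assumptions made for the example.

```python
def block_causal_mask(prompt_len, num_blocks, block_size):
    """Return allowed[q][k] = True if query position q may attend to key k.

    Each output block attends to the full prompt, to all earlier blocks,
    and bidirectionally to every position inside its own block.
    Prompt tokens attend causally among themselves (an assumption here).
    """
    total = prompt_len + num_blocks * block_size

    def block_id(pos):
        # Prompt positions get id -1; output positions get their block index.
        return -1 if pos < prompt_len else (pos - prompt_len) // block_size

    allowed = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            bq, bk = block_id(q), block_id(k)
            if bq == -1:
                allowed[q][k] = bk == -1 and k <= q
            else:
                allowed[q][k] = bk <= bq
    return allowed

# Positions 0-1 are the prompt, 2-3 form block 0, 4-5 form block 1.
mask = block_causal_mask(prompt_len=2, num_blocks=2, block_size=2)
for row in mask:
    print("".join("x" if a else "." for a in row))
```

Because no row of an earlier block ever references a later block, the keys and values of completed blocks never change and can be cached.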
Fast-dVLM (Wu et al., [2026](https://arxiv.org/html/2605.23163#bib.bib29)) extends block diffusion to vision-language models via direct conversion from autoregressive VLMs, and introduces *self-speculative decoding* (Wu et al., [2025](https://arxiv.org/html/2605.23163#bib.bib33)): for each block, the MDM head drafts all tokens in parallel via bidirectional attention, then an AR head with causal attention verifies the draft sequentially, accepting tokens until the first mismatch plus one bonus token. This achieves significant speedup with quality equivalent to pure AR decoding.
### 3.2 Structure-Aware Scaffold Diffusion
Figure 2: Pipeline of Fast-dDrive during training. The structured JSON output is decomposed into four semantic sections; template tokens form a frozen *scaffold* (grey, pre-filled and never masked), while value tokens are denoised in section-aligned blocks with bidirectional attention within each block and strict causal ordering across sections. SASD adds per-section loss weights and Beta noise schedules at training time only.

#### Scaffold Construction and Section-Aligned Blocks.
Following prior work (Ma et al., [2025](https://arxiv.org/html/2605.23163#bib.bib4); Rowe et al., [2025](https://arxiv.org/html/2605.23163#bib.bib6); Gu et al., [2026](https://arxiv.org/html/2605.23163#bib.bib5)), the model outputs a structured JSON with four semantic sections: critical\_objects (12 binary detections), explanation (free-form reasoning), future\_meta\_behavior (categorical actions), and trajectory (5 waypoint coordinates over 5 s). These sections differ dramatically in token count, difficulty, and safety impact.
We exploit the fixed JSON schema by pre-filling all structural tokens (keys, brackets, punctuation) as a frozen scaffold $\hat{\mathbf{x}}_T$, leaving only *value tokens* masked. Let $\mathcal{A}$ denote scaffold (anchor) positions and $\mathcal{E} = \{1{:}L\} \setminus \mathcal{A}$ the editable value positions; the diffusion process operates exclusively on $\mathcal{E}$:

$$\mathcal{L}_{\mathrm{scaffold}}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_t}\left[-\frac{1}{|\mathcal{M}_t|}\sum_{i \in \mathcal{M}_t} \log p_\theta(x_0^i \mid \mathbf{x}_t, \mathbf{c})\right], \quad \mathcal{M}_t \subseteq \mathcal{E}. \tag{2}$$

This guarantees $100\%$ structural correctness and reduces the denoising workload by $\sim 30\%$ (Table [1](https://arxiv.org/html/2605.23163#S3.T1)). We further align block boundaries with section boundaries, partitioning each section $s$ into $n_s = \lceil |\mathcal{E}_s| / d \rceil$ blocks. Sections are denoised in the causal order CO $\to$ Expl $\to$ FMB $\to$ Traj, each block providing complete intra-section bidirectional context. Variable-length sections use a NULL token for padding, stripped at inference time.
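As a worked check of the block-count formula $n_s = \lceil |\mathcal{E}_s| / d \rceil$ and the scaffold share, the sketch below plugs in the per-section counts from Table 1. The block size $d = 32$ is an assumption: it is not stated in this section, but it is a value that reproduces the per-section block counts listed in the table.

```python
import math

# Per-section (value tokens, scaffold tokens) from Table 1.
SECTIONS = {
    "critical_objects": (12, 80),
    "explanation": (192, 6),
    "future_meta_behavior": (6, 18),
    "trajectory": (70, 20),
}

def blocks_per_section(block_size):
    """n_s = ceil(|E_s| / d), computed over value tokens only."""
    return {
        name: math.ceil(value / block_size)
        for name, (value, _) in SECTIONS.items()
    }

def scaffold_fraction():
    """Share of all output tokens that are pre-filled scaffold."""
    value = sum(v for v, _ in SECTIONS.values())
    scaffold = sum(s for _, s in SECTIONS.values())
    return scaffold / (value + scaffold)

blocks = blocks_per_section(block_size=32)  # d = 32 is an assumed value
print(blocks, "total:", sum(blocks.values()))
print(f"scaffold fraction: {scaffold_fraction():.3f}")
```

Under this assumption the sections map to 1, 6, 1, and 3 blocks (11 in total), and the scaffold share is 124/404, consistent with the roughly 30% figure quoted above.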
#### Safety\-Prioritized Training\.
The four sections differ vastly in safety impact: a wrong trajectory coordinate may cause a collision, while a slightly imperfect explanation has no such consequence. We introduce two complementary training-time mechanisms to bias learning capacity toward safety-critical sections. *Section-weighted loss* assigns each section $s$ a positive scalar weight $w_s$ that scales its per-token cross-entropy:

$$\mathcal{L}_{\mathrm{train}}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \mathbf{x}_t}\left[-\sum_{s} \frac{w_s}{|\mathcal{M}_t^s|}\sum_{i \in \mathcal{M}_t^s} \log p_\theta(x_0^i \mid \mathbf{x}_t, \mathbf{c})\right], \tag{3}$$

where $\mathcal{M}_t^s = \mathcal{M}_t \cap \mathcal{E}_s$ is the set of masked positions in section $s$, and larger weights are assigned to safety-critical sections so that gradients on hard, high-impact tokens dominate the update. *Section-adaptive noise* replaces the uniform diffusion schedule with per-section Beta distributions $t_s \sim \mathrm{Beta}(\alpha_s, \beta_s)$, allowing the noise schedule to be tailored to each section’s difficulty profile. Concrete values for $\{w_s\}$ and $\{(\alpha_s, \beta_s)\}$ are reported in §[4.1](https://arxiv.org/html/2605.23163#S4.SS1). Both mechanisms incur zero inference overhead.
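The two mechanisms can be sketched for a single training sample as follows. The weights, Beta parameters, and per-token losses below are hypothetical placeholders chosen for illustration, not the values used in the paper, and the function names are assumptions.

```python
import random

def sample_mask_ratios(beta_params, rng):
    """Draw one masking ratio t_s ~ Beta(alpha_s, beta_s) per section."""
    return {s: rng.betavariate(a, b) for s, (a, b) in beta_params.items()}

def section_weighted_loss(token_nll, section_of, masked, weights):
    """Sum over sections of w_s times the mean NLL of that section's masked tokens."""
    total = 0.0
    for s, w in weights.items():
        idx = [i for i in masked if section_of[i] == s]
        if idx:  # a section with no masked token contributes nothing
            total += w * sum(token_nll[i] for i in idx) / len(idx)
    return total

# Hypothetical per-section settings (NOT the paper's values).
weights = {"explanation": 1.0, "trajectory": 2.0}
beta_params = {"explanation": (1.0, 1.0), "trajectory": (2.0, 1.0)}

rng = random.Random(0)
ratios = sample_mask_ratios(beta_params, rng)

# Toy per-token negative log-likelihoods and section labels.
token_nll = [0.2, 0.4, 1.0, 3.0]
section_of = ["explanation", "explanation", "trajectory", "trajectory"]
masked = [0, 1, 3]

loss = section_weighted_loss(token_nll, section_of, masked, weights)
print(ratios, loss)
```

In this toy case the single masked trajectory token dominates the loss because of its larger weight, which is the intended effect of Eq. (3).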
#### Joint AR and Diffusion Training\.
Following Fast-dVLM (Wu et al., [2026](https://arxiv.org/html/2605.23163#bib.bib29)), we train under a dual-stream objective that combines our section-weighted MDM loss (Eq. [3](https://arxiv.org/html/2605.23163#S3.E3)) with a token-level causal LM loss $\mathcal{L}_{\mathrm{AR}}$ over the same response labels on the clean stream:

$$\mathcal{L} = \alpha\,\mathcal{L}_{\mathrm{train}}(\theta) + \beta\,\mathcal{L}_{\mathrm{AR}}(\theta), \quad \alpha = \beta = 0.5. \tag{4}$$

The diffusion branch learns parallel value denoising under intra-block bidirectional attention, while the causal branch preserves the pretrained AR decoding capability. As shown in §[3.3](https://arxiv.org/html/2605.23163#S3.SS3), this joint objective is what enables a single trained Fast-dDrive to expose both a diffusion-only and a self-speculative decoding mode without further fine-tuning.
### 3.3 Inference: Section Diffusion and Scaffold Spec
Figure 3: Scaffold Speculative Decoding. For each block: (1) scaffold tokens are auto-accepted; (2) the MDM head drafts value tokens via bidirectional attention; (3) the AR head verifies the draft with causal attention, accepting tokens until the first mismatch plus one bonus token.

Because the joint AR + diffusion objective in Eq. ([4](https://arxiv.org/html/2605.23163#S3.E4)) preserves both decoding heads on the same weights, Fast-dDrive supports two complementary inference modes over the same scaffold and section-aligned blocks, mirroring the dual-mode setup of Fast-dVLM (Wu et al., [2026](https://arxiv.org/html/2605.23163#bib.bib29)).
#### Section Diffusion \(SD\)\.
SD reuses the training-time procedure at inference: starting from the pre-filled scaffold $\hat{\mathbf{x}}_T$, the MDM head iteratively unmasks value positions section by section over the section-aligned dynamic blocks of §[3.2](https://arxiv.org/html/2605.23163#S3.SS2), attending to preceding blocks via cached causal context (i.e., *causal context decoding* in the sense of Fast-dVLM (Wu et al., [2026](https://arxiv.org/html/2605.23163#bib.bib29))). KV caches from the scaffold and from earlier sections are reused without recomputation, yielding a diffusion-only baseline that does not invoke the AR head.
#### Scaffold Speculative Decoding \(SS\)\.
The second mode invokes self-speculative decoding (Wu et al., [2025](https://arxiv.org/html/2605.23163#bib.bib33), [2026](https://arxiv.org/html/2605.23163#bib.bib29)), in which the MDM head drafts a block in parallel and the AR head verifies it sequentially. Vanilla self-spec operates on fixed-size blocks without awareness of scaffolds or section structure; we extend it to Scaffold Speculative Decoding (SS), which exploits the scaffold from §[3.2](https://arxiv.org/html/2605.23163#S3.SS2) to further reduce computational overhead while preserving generation quality.
#### Algorithm\.
Given the pre-filled scaffold $\hat{\mathbf{x}}_T$, Scaffold Spec processes each block $\mathbf{b}_j$ in the section-ordered sequence as follows:
1. Auto-accept scaffold: All scaffold positions within $\mathbf{b}_j$ are directly accepted without drafting or verification. Only value positions $\mathcal{E}_j = \mathcal{E} \cap \mathbf{b}_j$ enter the draft-verify cycle.
2. Draft (MDM head): A single forward pass with block-bidirectional attention fills all $|\mathcal{E}_j|$ masked value positions simultaneously, producing draft tokens $\{\tilde{x}_i\}_{i \in \mathcal{E}_j}$.
3. Verify (AR head): A causal forward pass over the entire block computes AR logits. For each value position $i \in \mathcal{E}_j$ in left-to-right order, if $\arg\max p_\theta^{\mathrm{AR}}(\cdot \mid \mathbf{x}_{<i}) = \tilde{x}_i$, the token is accepted; otherwise, the AR token replaces the draft and all subsequent draft tokens are discarded. One bonus token is always accepted at the rejection point.
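The three steps above can be sketched for a single block as follows. This is a toy illustration under stated assumptions: tokens are simplified to whole strings, `ar_next_token` is a hypothetical stand-in for reading the greedy token off the causal verification pass (a real implementation would obtain all AR logits in one forward pass rather than one call per position), and `verify_block` is not a function from the released code.

```python
def verify_block(block, draft, ar_next_token, prefix):
    """Verify one drafted block against the AR head.

    block         : list; scaffold positions hold their fixed token,
                    value positions hold None
    draft         : dict position -> drafted token for each value position
    ar_next_token : callable(tokens) -> greedy AR token given that prefix
    prefix        : tokens committed before this block
    Returns (committed_tokens, fully_accepted).
    """
    committed = []
    for pos, scaffold_tok in enumerate(block):
        if scaffold_tok is not None:
            committed.append(scaffold_tok)  # step 1: auto-accept scaffold
            continue
        ar_tok = ar_next_token(prefix + committed)
        if ar_tok == draft[pos]:
            committed.append(draft[pos])    # draft agrees with AR: accept
        else:
            committed.append(ar_tok)        # AR token replaces the draft
            return committed, False         # later draft tokens are discarded
    return committed, True

# Toy AR head that always continues a fixed reference sequence.
reference = ['"longitudinal":', '"decelerate"', ",", '"lateral":', '"keep lane"']
ar = lambda tokens: reference[len(tokens)]
block = ['"longitudinal":', None, ",", '"lateral":', None]

good, ok_good = verify_block(block, {1: '"decelerate"', 4: '"keep lane"'}, ar, [])
bad, ok_bad = verify_block(block, {1: '"accelerate"', 4: '"keep lane"'}, ar, [])
print(good, ok_good)
print(bad, ok_bad)
```

When the draft matches, the whole block is committed at once; on a mismatch, the committed prefix ends with the AR token, so the output always equals what greedy AR decoding would have produced up to that point.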
#### Efficiency Analysis\.
Each block requires exactly 2 forward passes (draft + verify), regardless of block size. The key speedup over vanilla self-speculative decoding comes from two sources: (1) scaffold tokens are auto-accepted with *zero* forward passes; (2) section-aligned blocks ensure that the MDM draft has complete semantic context, improving draft acceptance rate compared to arbitrary fixed-size blocks. Combined, these two effects yield an additional speedup over standard self-speculative decoding.
### 3.4 Test-Time Inference Scaling via Shared-Prefix Multi-Trajectory Rollouts
Scaffold Spec (§[3.3](https://arxiv.org/html/2605.23163#S3.SS3)) decodes the structured output deterministically: a single SS pass already returns the model’s most-confident trajectory. To convert additional inference compute into additional accuracy, we introduce stochasticity *inside* the AR verifier and average $N$ trajectory rollouts. Two design choices keep this scheme both cheap and quality-preserving.

Figure 4: Shared-prefix multi-trajectory rollouts. (a) On a representative WOD-E2E val sample, $N = 4$ rollouts (light blue) disagree most at the late waypoints, while their mean (dark blue) lies on top of the ground truth (black). (b) Mean ADE@5s on WOD-E2E val decreases monotonically with $N$, confirming the variance-of-the-mean argument.

Trajectory-only stochasticity. The first three sections (critical\_objects, explanation, future\_meta\_behavior) are heavily structured by the schema and have sharply peaked posteriors; sampling them adds no useful diversity and only degrades downstream sections. We therefore keep the AR verifier greedy on the first three sections and only enable softmax sampling once decoding enters the trajectory section.

Shared prefix. Because the first three sections are deterministic, their KV cache is identical across rollouts. We decode them *once*, fork the KV cache $N$ times, and continue Scaffold Spec on the trajectory section $N$ times, each with independent random draws. Since the trajectory section is short relative to the full output, this adds only a fractional cost per extra rollout rather than a full SS pass.
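To make the "fractional cost" concrete, the sketch below compares shared-prefix rollouts with fully independent passes using the block counts from Table 1. Counting decoded blocks as the unit of cost is a simplifying assumption for illustration; it ignores prompt prefill and differences in per-block work, so it should not be read as a latency measurement.

```python
# Block counts per section, taken from Table 1.
PREFIX_BLOCKS = 1 + 6 + 1  # critical_objects, explanation, future_meta_behavior
TRAJ_BLOCKS = 3            # trajectory

def shared_prefix_cost(n):
    """Blocks decoded when the prefix is decoded once and forked n times."""
    return PREFIX_BLOCKS + n * TRAJ_BLOCKS

def independent_cost(n):
    """Blocks decoded when each of the n rollouts reruns the full output."""
    return n * (PREFIX_BLOCKS + TRAJ_BLOCKS)

for n in (1, 2, 4, 8):
    print(n, shared_prefix_cost(n), independent_cost(n))

# Marginal cost of one extra rollout relative to a full pass.
marginal = TRAJ_BLOCKS / (PREFIX_BLOCKS + TRAJ_BLOCKS)
```

Under this proxy, each additional rollout costs 3 of the 11 blocks of a full pass, so four rollouts cost 20 block decodes instead of 44.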
#### Trajectory averaging\.
Let $\{\boldsymbol{\tau}^{(i)}\}_{i=1}^{N}$ be the $N$ rollout trajectories, each interpolated to 20 waypoints via Jerk-Minimizing Trajectory (JMT) fitting. The output is the equal-weight average

$$\boldsymbol{\tau}_{\mathrm{out}} = \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{\tau}^{(i)}.$$

By the variance-of-the-mean argument, averaging $N$ independently sampled rollouts scales the residual variance by a factor of $1/N$ while leaving any deterministic bias unchanged. Each rollout is still produced by Scaffold Spec (only the trajectory-section verifier step is sampled), so per-rollout quality stays close to the deterministic SS baseline, a regime that sampling the verifier on the full output cannot reach.
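The averaging step and the variance-of-the-mean effect can be illustrated with a small simulation. The Gaussian noise model, its parameters, and the helper names are assumptions chosen for the example; the JMT interpolation step is omitted.

```python
import random

def average_trajectories(rollouts):
    """Equal-weight, waypoint-wise mean of N rollouts of equal length."""
    n = len(rollouts)
    num_waypoints = len(rollouts[0])
    assert all(len(r) == num_waypoints for r in rollouts)
    return [
        (sum(r[k][0] for r in rollouts) / n, sum(r[k][1] for r in rollouts) / n)
        for k in range(num_waypoints)
    ]

def empirical_variance(samples):
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

mean_traj = average_trajectories(
    [[(0.0, 0.0), (2.0, 1.0)], [(2.0, 0.0), (4.0, 3.0)]]
)

# Toy noise model (an assumption): a rollout's final x-coordinate equals the
# true value plus independent zero-mean Gaussian noise.
rng = random.Random(0)
TRUE_X, SIGMA, TRIALS, N = 16.3, 0.5, 2000, 4
single = [rng.gauss(TRUE_X, SIGMA) for _ in range(TRIALS)]
averaged = [
    sum(rng.gauss(TRUE_X, SIGMA) for _ in range(N)) / N for _ in range(TRIALS)
]
var_single = empirical_variance(single)
var_mean = empirical_variance(averaged)
print(mean_traj, var_single, var_mean)
```

With independent noise the variance of the averaged coordinate should come out near a quarter of the single-rollout variance for $N = 4$; any bias shared by all rollouts would pass through the average unchanged.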
Figure [4](https://arxiv.org/html/2605.23163#S3.F4) illustrates the effect: individual rollouts disagree mostly at the late waypoints, their mean lies closer to the ground truth than any single rollout, and mean ADE@5s decreases monotonically with $N$.
## 4 Experiments
### 4.1 Experimental Setup
#### Datasets\.
We evaluate in open-loop settings on two established benchmarks. nuScenes (Caesar et al., [2020](https://arxiv.org/html/2605.23163#bib.bib38)) contains 1,000 urban driving scenes split 700/150/150 for train/val/test, with annotated keyframes sampled at 2 Hz. Waymo Open Dataset End-to-End (WOD-E2E) (Xu et al., [2025](https://arxiv.org/html/2605.23163#bib.bib39)) comprises 4,021 long-tail driving segments of 20 s each, split 2,037/479/1,505 for train/val/test; for test evaluation, only the first 12 s of each segment are provided and predictions must be generated from information available up to that point. We adopt the chain-of-thought annotations from dVLM-AD (Ma et al., [2025](https://arxiv.org/html/2605.23163#bib.bib4)) for the four-section structured output. Sensor specifications and capture rates are deferred to Appendix [A](https://arxiv.org/html/2605.23163#A1).
#### Input Modalities\.
The model consumes RGB camera frames, ego state, and a high-level navigation command; we use no LiDAR, radar, or HD map. Following prior open-loop VLA work (Ma et al., [2025](https://arxiv.org/html/2605.23163#bib.bib4); Rowe et al., [2025](https://arxiv.org/html/2605.23163#bib.bib6)), we use 3 past front-camera frames spanning the last 1 s on nuScenes and the 3 front-facing cameras at the current frame on WOD-E2E; each image is resized so that the longer side is at most 512 px before being patchified by Qwen2.5-VL’s vision encoder. Frame timestamps, per-view sizing, and the joint-view WOD-E2E variant we explored are detailed in Appendix [A](https://arxiv.org/html/2605.23163#A1).
#### Evaluation Metrics\.
*Planning accuracy.* For nuScenes, following dVLM-AD (Ma et al., [2025](https://arxiv.org/html/2605.23163#bib.bib4)), we report the L2 distance error at 1/2/3 s horizons. For WOD-E2E, we report Average Displacement Error (ADE) at 3 s and 5 s horizons, and Rater Feedback Score (RFS) (Xu et al., [2025](https://arxiv.org/html/2605.23163#bib.bib39)), a human-aligned trust-region score where higher values indicate better matches to multiple human-rated reference trajectories.

*Inference efficiency.* On a single NVIDIA H100 with batch size 1 we report *Latency* (ms per sample), *TPS* (tokens per second), and *Tok/Step* (tokens committed per model forward pass; AR decoding gives Tok/Step $=1$).
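To make the metric definitions concrete, here is a minimal sketch of how ADE and the two throughput figures could be computed. It assumes that ADE averages the Euclidean error over all waypoints up to the horizon and that waypoints arrive at a fixed rate `hz`; both are assumptions of this example, and the official benchmark evaluation code remains the authoritative definition. All function names are illustrative.

```python
import math


def ade(pred, gt, horizon_s, hz):
    """Mean Euclidean error over the waypoints up to horizon_s, at hz waypoints/s."""
    k = int(round(horizon_s * hz))
    if k < 1 or len(pred) < k or len(gt) < k:
        raise ValueError("not enough waypoints for the requested horizon")
    return sum(math.dist(p, g) for p, g in zip(pred[:k], gt[:k])) / k


def tokens_per_second(n_tokens, latency_ms):
    """TPS: tokens produced divided by wall-clock latency in seconds."""
    return n_tokens / (latency_ms / 1000.0)


def tokens_per_step(n_tokens, n_forward_passes):
    """Tok/Step: tokens committed per model forward pass."""
    return n_tokens / n_forward_passes


pred = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
gt = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
ade_2s = ade(pred, gt, horizon_s=2, hz=1)  # errors 1 and 2, averaged
```

For autoregressive decoding the number of forward passes equals the number of generated tokens, so `tokens_per_step` returns 1, matching the convention stated above.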
#### Implementation Details\.
Fast-dDrive is built on Qwen2.5-VL-3B (Bai et al., [2025](https://arxiv.org/html/2605.23163#bib.bib40)) converted to the Fast-dVLM (Wu et al., [2026](https://arxiv.org/html/2605.23163#bib.bib29)) block-diffusion architecture and outputs a structured JSON of the four sections defined in §[3.2](https://arxiv.org/html/2605.23163#S3.SS2). We train the model on $8\times$ H100 GPUs with block-causal attention, fine-tuning for 3 epochs on the WOD-E2E training set. For WOD-E2E we mix the 30k CoT-annotated samples with an additional 60k trajectory-only samples (no CoT) to improve trajectory coverage; the nuScenes training set (23k samples) is used separately. SASD instantiates Eq. ([3](https://arxiv.org/html/2605.23163#S3.E3)) with section loss weights $\{w_s\}=\{3.0, 2.0, 1.5, 1.0\}$ and Beta noise parameters $\{(\alpha_s,\beta_s)\}=\{(2,1),(1,1.5),(1,2),(1,1)\}$ for `trajectory`, `future_meta_behavior`, `critical_objects`, and `explanation`, respectively. Efficiency benchmarks are measured on a single H100.
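The per-section hyperparameters above can be read as a lookup table keyed by section name. The sketch below shows one plausible way to use them; since Eq. (3) is not reproduced in this section, both the interpretation of the Beta draw as a per-section noise (mask) level and the weight-normalised form of the loss are assumptions of this example, not the paper's exact objective.

```python
import random

# Values quoted in the text, keyed by section name.
SECTION_WEIGHT = {
    "trajectory": 3.0,
    "future_meta_behavior": 2.0,
    "critical_objects": 1.5,
    "explanation": 1.0,
}
SECTION_BETA = {
    "trajectory": (2.0, 1.0),
    "future_meta_behavior": (1.0, 1.5),
    "critical_objects": (1.0, 2.0),
    "explanation": (1.0, 1.0),
}


def sample_noise_level(section, rng):
    """Draw a noise level in [0, 1] from the section's Beta(alpha, beta)."""
    alpha, beta = SECTION_BETA[section]
    return rng.betavariate(alpha, beta)


def weighted_loss(token_sections, token_losses):
    """Section-weighted mean of per-token losses (normalisation is an assumption)."""
    weights = [SECTION_WEIGHT[s] for s in token_sections]
    return sum(w * l for w, l in zip(weights, token_losses)) / sum(weights)


rng = random.Random(0)
traj_levels = [sample_noise_level("trajectory", rng) for _ in range(4000)]
obj_levels = [sample_noise_level("critical_objects", rng) for _ in range(4000)]
example_loss = weighted_loss(["trajectory", "explanation"], [1.0, 3.0])
```

A Beta($\alpha$, $\beta$) variable has mean $\alpha/(\alpha+\beta)$, so under this reading `trajectory` tokens see an average noise level of about 2/3 and `critical_objects` tokens about 1/3, while `explanation` with Beta(1, 1) is uniform.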
### 4.2 Main Results
Table 2: Comparison on the WOD-E2E test set. ∗: zero-shot (no fine-tuning). †: measured by us with the original backbone under the same conditions as our model. TPS and Tok/Step are measured on a single H100.

| Method | Paradigm | RFS ↑ | ADE (5s) ↓ | ADE (3s) ↓ | TPS ↑ | Tok/Step ↑ |
|---|---|---|---|---|---|---|
| OpenEMMA∗ (Xing et al., [2025](https://arxiv.org/html/2605.23163#bib.bib1)) | AR | 5.158 | 12.476 | 6.684 | – | 1 |
| LightEMMA∗ (Qiao et al., [2025](https://arxiv.org/html/2605.23163#bib.bib2)) | AR | 6.517 | 3.740 | 1.705 | – | 1 |
| NaiveEMMA | AR | 7.528 | 3.018 | 1.320 | – | 1 |
| AutoVLA ([Zhou et al.](https://arxiv.org/html/2605.23163#bib.bib3)) | AR | 7.557 | 2.958 | 1.351 | 51.2† | 1 |
| Poutine-Base (Rowe et al., [2025](https://arxiv.org/html/2605.23163#bib.bib6)) | AR | 7.909 | 2.940 | 1.270 | 51.2† | 1 |
| dVLM-AD (Ma et al., [2025](https://arxiv.org/html/2605.23163#bib.bib4)) | Diffusion | 7.633 | 3.022 | 1.285 | 35.2 | 2.82 |
| Fast-dDrive (Scaffold Spec) | Block Diff. | 7.823 | 2.907 | 1.254 | 210.4 | 4.90 |
| + Inference scaling ($N=4$) | Block Diff. | 7.827 | 2.821 | 1.240 | 114.7 | 2.76 |
#### WOD\-E2E Results\.
Table 3: L2 Error (m, ↓) on the nuScenes val set. ∗: zero-shot.

| Method | 1s | 2s | 3s | Avg. |
|---|---|---|---|---|
| *Training-based Policy* | | | | |
| UniAD (Hu et al., [2023](https://arxiv.org/html/2605.23163#bib.bib43)) | 0.20 | 0.42 | 0.75 | 0.46 |
| VAD-Base (Jiang et al., [2023](https://arxiv.org/html/2605.23163#bib.bib44)) | 0.17 | 0.34 | 0.60 | 0.37 |
| BEV-Planner (Li et al., [2024b](https://arxiv.org/html/2605.23163#bib.bib45)) | 0.16 | 0.32 | 0.57 | 0.35 |
| *VLMs / VLAs with Reasoning* | | | | |
| OpenEMMA∗ (Xing et al., [2025](https://arxiv.org/html/2605.23163#bib.bib1)) | 1.45 | 3.21 | 3.76 | 2.81 |
| DriveVLM (Tian et al., [2024](https://arxiv.org/html/2605.23163#bib.bib10)) | 0.18 | 0.34 | 0.68 | 0.40 |
| AutoVLA ([Zhou et al.](https://arxiv.org/html/2605.23163#bib.bib3)) | 0.25 | 0.46 | 0.73 | 0.48 |
| dVLM-AD (Ma et al., [2025](https://arxiv.org/html/2605.23163#bib.bib4)) | 0.15 | 0.40 | 0.68 | 0.41 |
| Fast-dDrive (ours) | 0.12 | 0.33 | 0.50 | 0.32 |
Table [2](https://arxiv.org/html/2605.23163#S4.T2) reports planning accuracy and decoding throughput on the WOD-E2E test set against representative AR baselines and the diffusion baseline dVLM-AD. With a single inference run, Fast-dDrive (Scaffold Spec) attains the lowest ADE@3s and ADE@5s among the compared methods, and an RFS that surpasses the diffusion baseline dVLM-AD by a clear margin (7.823 vs. 7.633) and is competitive with the strongest AR baseline (Poutine-Base, 7.909) despite our model using neither GRPO post-training nor a larger trajectory pool. Adding the shared-prefix multi-trajectory rollout of §[3.4](https://arxiv.org/html/2605.23163#S3.SS4) further reduces both ADE values at sub-$2\times$ wall-clock cost relative to a single Scaffold-Spec pass, since only the trajectory section is rolled out $N$ times from a forked KV cache. On efficiency, Fast-dDrive runs at $4\times$–$6\times$ the decoding throughput of dVLM-AD and the AR baselines while committing $\sim 5$ tokens per model forward pass, giving a clear advantage on the accuracy–efficiency Pareto frontier.
#### nuScenes Results\.
Table [3](https://arxiv.org/html/2605.23163#S4.T3) reports L2 errors on the nuScenes validation set following the dVLM-AD (Ma et al., [2025](https://arxiv.org/html/2605.23163#bib.bib4)) protocol. Fast-dDrive achieves the lowest average L2 (0.32 m) among the listed VLM/VLA systems with reasoning, with consistent gains over the diffusion and AR-with-CoT baselines across all three horizons; its average is also lower than that of the classical training-based driving policies that lack interpretable reasoning, although BEV-Planner remains marginally better at the 2 s horizon (0.32 vs. 0.33 m). Combined with the WOD-E2E results, this indicates that our structure-aware design transfers between predominantly nominal urban driving and long-tail scenarios without per-dataset tuning.
### 4.3 Efficiency & Performance Analysis
Table [4](https://arxiv.org/html/2605.23163#S4.T4) compares Fast-dDrive inference variants against the AR baseline (Qwen2.5-VL-3B trained with the same data and recipe but standard autoregressive decoding) and dVLM-AD on the WOD-E2E validation set (single H100, batch size 1).
Among the Fast-dDrive variants (before SGLang integration), Scaffold Spec achieves the lowest latency and highest throughput, nearly doubling the TPS of vanilla self-speculative decoding (210.4 vs. 109.0) while staying close to its accuracy. The speedup stems from scaffold auto-acceptance, which removes $\sim 30\%$ of tokens from the draft-verify loop. Compared to dVLM-AD, which requires full-sequence recomputation at every denoising step, Scaffold Spec achieves roughly $6\times$ the throughput by combining block-level KV-cache reuse with scaffold-aware speculative acceptance. The AR baseline, despite sharing the same backbone and training data, is slower (limited to one token per forward pass) and has a higher ADE@5s than every block-diffusion variant, suggesting that bidirectional context within blocks produces more globally consistent trajectories than purely sequential decoding. Section Diffusion yields competitive throughput but slightly higher ADE than the speculative variants, indicating that causal AR verification contributes meaningfully to trajectory quality; this accuracy gap motivates the shared-prefix rollout scheme of §[3.4](https://arxiv.org/html/2605.23163#S3.SS4). All Fast-dDrive variants substantially outperform dVLM-AD on both ADE and RFS, confirming that section-aware block diffusion with SASD training is more effective than full-sequence diffusion for structured driving outputs. Finally, integrating Scaffold Spec into SGLang (Zheng et al., [2024](https://arxiv.org/html/2605.23163#bib.bib49)) yields an additional $\sim 3\times$ speedup via optimized kernels and CUDA graphs, demonstrating that the algorithmic gains compose well with system-level optimizations.
Table 4: Inference efficiency and accuracy comparison on the WOD-E2E val set. Latency: average wall-clock time per sample, with the speedup over the AR baseline in parentheses. TPS: tokens per second (including scaffold tokens). Tok/Step: effective tokens committed per model forward pass.

| Method | Decoding | Latency (ms) ↓ | TPS ↑ | Tok/Step ↑ | ADE (3s) ↓ | ADE (5s) ↓ | RFS ↑ |
|---|---|---|---|---|---|---|---|
| AR Baseline (Qwen2.5-VL-3B) | Autoregressive | 7855 | 51.6 | 1 | 0.839 | 2.083 | 7.931 |
| dVLM-AD (Full-seq MDM) | Iterative Denoise | 9575 (0.8×) | 35.2 | 2.82 | 1.119 | 3.024 | 7.187 |
| Fast-dDrive (Self-Spec) | Draft + Verify | 3714 (2.1×) | 109.0 | 2.41 | 0.811 | 1.973 | 7.959 |
| Fast-dDrive (Section Diffusion) | Iterative MDM | 3006 (2.6×) | 134.4 | 3.28 | 0.840 | 2.058 | 7.928 |
| + Scaffold Spec | Scaffold + D&V | 1919 (4.1×) | 210.4 | 4.90 | 0.812 | 1.982 | 7.934 |
| + SGLang | Scaffold + D&V | 665 (11.8×) | 608.5 | 4.93 | 0.816 | 1.995 | 7.914 |
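The parenthesised speedups in Table 4 follow directly from the latency column, as the short check below shows. The numbers are copied from the table; the second quantity, an implied token count per sample, additionally assumes that TPS equals total generated tokens divided by wall-clock latency, which the paper does not state explicitly.

```python
# Latency (ms) and TPS copied from Table 4.
LATENCY_MS = {
    "AR Baseline": 7855,
    "dVLM-AD": 9575,
    "Self-Spec": 3714,
    "Section Diffusion": 3006,
    "Scaffold Spec": 1919,
    "Scaffold Spec + SGLang": 665,
}
TPS = {
    "AR Baseline": 51.6,
    "dVLM-AD": 35.2,
    "Self-Spec": 109.0,
    "Section Diffusion": 134.4,
    "Scaffold Spec": 210.4,
    "Scaffold Spec + SGLang": 608.5,
}

# Speedup relative to the AR baseline, rounded as in the table.
speedup = {
    name: round(LATENCY_MS["AR Baseline"] / ms, 1) for name, ms in LATENCY_MS.items()
}

# Implied tokens per sample under the assumption TPS = tokens / latency.
tokens_per_sample = {
    name: LATENCY_MS[name] / 1000.0 * TPS[name] for name in LATENCY_MS
}
```

Under that assumption the AR baseline and all Fast-dDrive variants come out at roughly 404 to 405 tokens per sample, which suggests the rows are measured on outputs of comparable length; dVLM-AD comes out lower, at about 337.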
### 4.4 Ablation Studies
Table 5: SASD ablation (WOD-E2E val, Scaffold Spec). IWL: Section-Importance-Weighted Loss. SNS: Section-Adaptive Noise Schedule.

| IWL | SNS | ADE (5s) ↓ | RFS ↑ |
|:-:|:-:|---|---|
|  |  | 2.028 | 7.735 |
| ✓ |  | 2.003 | 7.855 |
|  | ✓ | 2.050 | 7.807 |
| ✓ | ✓ | 2.034 | 7.916 |

We ablate the two components of SASD training (Section-Importance-Weighted Loss, IWL; Section-Adaptive Noise Schedule, SNS) by re-training under each of the four on/off combinations while holding all other factors fixed; results are in Table [5](https://arxiv.org/html/2605.23163#S4.T5). IWL is the primary contributor: by up-weighting trajectory and meta-behavior tokens, it directly amplifies the gradient on the positions most critical for planning quality, yielding a clear RFS improvement over the uniform-weight baseline. SNS alone provides a smaller but complementary gain by biasing the noise schedule toward harder denoising configurations for safety-critical sections. When combined, IWL and SNS achieve the best RFS among all configurations, indicating that loss weighting and noise shaping address complementary aspects of the training objective.
Table [4](https://arxiv.org/html/2605.23163#S4.T4) further confirms that Scaffold Spec is the most efficient inference method with accuracy on par with Self-Spec, while Section Diffusion offers a useful alternative when diversity is needed (see §[3.4](https://arxiv.org/html/2605.23163#S3.SS4)). For test-time scaling, Figure [4](https://arxiv.org/html/2605.23163#S3.F4)(b) shows that ADE@5s decreases monotonically with the number of trajectory rollouts $N$; we adopt $N=4$ as the default, which provides a favorable accuracy–latency trade-off.
## 5 Conclusion
We presented Fast-dDrive, a block-diffusion VLA that exploits the inherent structure of driving outputs to simultaneously advance planning accuracy and inference efficiency. By treating deterministic schema tokens as a frozen scaffold, aligning diffusion blocks with semantic sections, and prioritizing safety-critical tokens during training, Fast-dDrive achieves state-of-the-art trajectory accuracy at $6\times$ the throughput of full-sequence diffusion baselines, demonstrating that structured generation and efficient decoding are complementary rather than conflicting objectives. The shared-prefix multi-trajectory rollout scheme further shows that block diffusion naturally admits a low-cost test-time scaling axis unavailable to AR models. We believe these results point toward a broader principle: when model outputs have known structure, encoding that structure into the diffusion process yields compounding gains in both quality and speed.
## Appendix A Input Modalities: Full Details
This appendix gives the complete per\-dataset description of the visual inputs and image\-processing pipeline summarized in §[4\.1](https://arxiv.org/html/2605.23163#S4.SS1)\.
#### nuScenes\.
Following dVLM-AD (Ma et al., [2025](https://arxiv.org/html/2605.23163#bib.bib4)), we use only the front camera (`CAM_FRONT`) and provide three frames covering the past 1 s, at timestamps $t\in\{-1.0,\,-0.5,\,0\}$ s relative to the prediction time. The three frames are presented to the model in chronological order so that short-horizon ego-motion cues (acceleration, yaw rate, immediate-future intent) are recoverable from the visual input alone, complementing the explicit ego-state vector. We do not use the side or rear cameras of nuScenes; the front-camera-only setup matches the dVLM-AD evaluation protocol and keeps the comparison fair.
#### WOD\-E2E\.
Following Poutine (Rowe et al., [2025](https://arxiv.org/html/2605.23163#bib.bib6)), we use the three front-facing cameras (`FRONT_LEFT`, `FRONT`, `FRONT_RIGHT`) of WOD-E2E at the current frame ($t=0$ s), rather than all eight surround views. Although omitting the side and rear views reduces contextual coverage, in our preliminary experiments we found the three forward views sufficient for the open-loop planning task while keeping the visual token budget tractable for our 3B-parameter backbone.
#### Joint\-view variant on WOD\-E2E\.
As an additional variant, we explore a *joint-view* input in which the three WOD-E2E front views are horizontally concatenated (`FRONT_LEFT` || `FRONT` || `FRONT_RIGHT`) into a single panoramic image before resizing. Compared with feeding the three views as three separate images, the joint-view variant trades per-view resolution for a single-image visual context and roughly halves the visual token count, at the cost of a wider but lower-resolution panorama.
#### Image resizing\.
For both datasets and both WOD-E2E variants, we resize each input image so that its longer side is at most 512 pixels, with the aspect ratio preserved, before patchification by Qwen2.5-VL’s native vision encoder. We do not perform any cropping, color jittering, or other photometric augmentation at inference time; at training time we follow the standard Qwen2.5-VL preprocessing pipeline.
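The resizing rule can be written as a small helper. This is a sketch of the rule as stated, with two assumptions made explicit: images whose longer side is already within the limit are left untouched (no upscaling), and sizes are rounded to the nearest pixel. The function name is illustrative, and the vision encoder's own preprocessing may adjust the final size further.

```python
def resized_shape(width, height, max_side=512):
    """Shrink (width, height) so the longer side is at most max_side, keeping aspect ratio."""
    longer = max(width, height)
    if longer <= max_side:
        return width, height  # assumed: already small enough, no upscaling
    scale = max_side / longer
    # Round to whole pixels; never let a side collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


landscape = resized_shape(1920, 1080)
portrait = resized_shape(1080, 1920)
small = resized_shape(400, 300)
```

The example resolutions are arbitrary and are not the native camera resolutions of either dataset.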
#### Non\-visual inputs\.
On both datasets, the model is conditioned on the ego state at the prediction time \(position, velocity, acceleration, yaw, yaw rate, in the ego frame\) and a high\-level natural\-language navigation command \(e\.g\., “go straight,” “turn left at the next intersection”\), serialized into the prompt prior to the visual tokens\. We do not use LiDAR, radar, HD maps, or any other auxiliary sensor modality\.
## Appendix B Qualitative Case Studies
We present five qualitative examples drawn from *five different* Waymo end-to-end driving scenes to illustrate Fast-dDrive’s behaviour across diverse driving regimes: a planned left turn at night, lane-following behind a pickup truck, a right turn into a side street, a green-light cruise through a signalised intersection, and a wet-weather left turn at a stop-controlled intersection (the same scene used in our demo video, §[4.1](https://arxiv.org/html/2605.23163#S4.SS1)). In each figure the left panel is the cylindrical panorama (`FRONT_LEFT` || `FRONT` || `FRONT_RIGHT`) with the *ground-truth* future trajectory overlaid (blue gradient, near → far), and the right panel is the top-down trajectory plot of Fast-dDrive’s predicted waypoints in vehicle frame (green gradient, near → far), matching the demo-video style. Below each figure we list the raw JSON Chain-of-Thought that Fast-dDrive produced for that frame (Scaffold-Spec, single rollout); only the *value tokens* are decoded by the model. The model’s predictions match the GT trajectory direction in every case.

\{
"critical\_objects":\{
"nearby\_vehicle":"no","pedestrian":"no","cyclist":"no",
"construction":"no","traffic\_element":"no","weather\_condition":"no",
"road\_hazard":"no","emergency\_vehicle":"no","animal":"no",
"special\_vehicle":"no","conflicting\_vehicle":"no","door\_opening\_vehicle":"no"
\},
"future\_meta\_behavior":\{"longitudinal":"speed up","lateral":"left turn"\},
"explanation":"There are no visible vehicles, pedestrians, cyclists, construction, or traffic elements in the immediate vicinity that would affect the ego vehicle’s path within the next 3 seconds\. The road ahead appears clear with no obstacles or hazards requiring the ego vehicle to slow down, yield, or change lanes\. The ego vehicle is expected to continue accelerating and make a left turn smoothly without interference\.",
"trajectory":"\[\[\+1\.27, \+0\.03\], \[\+4\.21, \+0\.42\], \[\+8\.20, \+1\.90\], \[\+12\.08, \+5\.38\], \[\+14\.50, \+10\.72\]\]"
\}
Figure 5: Night left turn into a clear cross street. With no critical objects in the scene, Fast-dDrive commits to `speed up / left turn` and emits a smoothly curving trajectory whose 5-second arc tracks the GT (blue) tightly across the `FRONT` and `FRONT_LEFT` cameras.
\{
"critical\_objects":\{
"nearby\_vehicle":"yes","pedestrian":"no","cyclist":"no",
"construction":"no","traffic\_element":"no","weather\_condition":"no",
"road\_hazard":"no","emergency\_vehicle":"no","animal":"no",
"special\_vehicle":"no","conflicting\_vehicle":"no","door\_opening\_vehicle":"no"
\},
"future\_meta\_behavior":\{"longitudinal":"keep speed","lateral":"lane follow"\},
"explanation":"There is a nearby vehicle visible ahead in the front camera, approximately in the ego lane, which may influence the ego vehicle’s speed and lane keeping\. This vehicle’s presence requires the ego vehicle to maintain a safe following distance and possibly adjust speed\. No pedestrians, cyclists, construction, or traffic elements are present that would affect the ego vehicle’s immediate path\. No conflicting vehicles or other hazards are detected that would require yielding or lane changes\.",
"trajectory":"\[\[\+15\.66, \-0\.10\], \[\+31\.65, \-0\.42\], \[\+47\.85, \-1\.02\], \[\+64\.15, \-1\.85\], \[\+80\.46, \-2\.77\]\]"
\}
Figure 6: Lane follow with a pickup truck ahead. The model flags the single lead vehicle as critical, picks `lane follow / keep speed`, and emits a long forward trajectory that maintains the lane while keeping pace with the lead. GT (blue) and the predicted final waypoint agree on the longitudinal range.
\{
"critical\_objects":\{
"nearby\_vehicle":"yes","pedestrian":"no","cyclist":"no",
"construction":"no","traffic\_element":"no","weather\_condition":"no",
"road\_hazard":"no","emergency\_vehicle":"no","animal":"no",
"special\_vehicle":"no","conflicting\_vehicle":"no","door\_opening\_vehicle":"no"
\},
"future\_meta\_behavior":\{"longitudinal":"keep speed","lateral":"right turn"\},
"explanation":"There is a nearby vehicle visible ahead in the front camera, positioned in the ego lane, which may influence the ego vehicle’s speed and lateral positioning\. This vehicle’s presence requires the ego vehicle to adjust longitudinal behavior, likely slowing or maintaining speed to keep a safe distance\. No pedestrians, cyclists, or conflicting vehicles are observed that would require yielding or stopping\. No traffic elements or road hazards are present to affect immediate behavior\. The ego vehicle is expected to execute a right lateral maneuver based on the trajectory, aligning with the right turn, while adjusting speed accordingly\.",
"trajectory":"\[\[\+6\.50, \-1\.77\], \[\+13\.50, \-5\.00\], \[\+20\.88, \-8\.80\], \[\+28\.40, \-12\.88\], \[\+35\.90, \-17\.00\]\]"
\}
Figure 7: Sharp right turn off a multi-lane road. Fast-dDrive flips the lateral channel to `right turn` and emits a deeply curving trajectory whose lateral excursion reaches $-17$ m at 5 s. The predicted arc tracks the GT (blue) closely across the `FRONT` → `FRONT_RIGHT` seam.
\{
"critical\_objects":\{
"nearby\_vehicle":"no","pedestrian":"no","cyclist":"no",
"construction":"no","traffic\_element":"yes","weather\_condition":"no",
"road\_hazard":"no","emergency\_vehicle":"no","animal":"no",
"special\_vehicle":"no","conflicting\_vehicle":"no","door\_opening\_vehicle":"no"
\},
"future\_meta\_behavior":\{"longitudinal":"speed up","lateral":"go straight"\},
"explanation":"The traffic lights visible ahead in the front camera are relevant as they regulate the ego vehicle’s movement through the intersection\. They are currently green, allowing the ego vehicle to proceed\. No nearby vehicles, pedestrians, or other hazards are present that would require yielding or stopping\. The ego vehicle is expected to continue moving forward while monitoring the traffic signal for any changes, adjusting speed accordingly to maintain safe and lawful travel through the intersection\.",
"trajectory":"\[\[\+6\.40, \-0\.02\], \[\+14\.62, \-0\.15\], \[\+23\.52, \-0\.62\], \[\+32\.12, \-1\.76\], \[\+39\.99, \-3\.47\]\]"
\}
Figure 8: Green-light cruise through a signalised intersection. The model correctly reads the green signal, tags only `traffic_element` as critical, and commits to `speed up / go straight`. The forward trajectory matches the GT (blue) heading down the same lane through the junction.
\{
"critical\_objects":\{
"nearby\_vehicle":"no","pedestrian":"no","cyclist":"no",
"construction":"no","traffic\_element":"yes","weather\_condition":"yes",
"road\_hazard":"no","emergency\_vehicle":"no","animal":"no",
"special\_vehicle":"no","conflicting\_vehicle":"no","door\_opening\_vehicle":"no"
\},
"future\_meta\_behavior":\{"longitudinal":"slow down","lateral":"left turn"\},
"explanation":"There is a stop sign clearly visible on the right side of the intersection, which is a critical traffic element requiring the ego vehicle to yield or stop before proceeding\. Additionally, the road surface appears wet from rain, indicating slippery conditions that may affect braking distance and vehicle control\. No nearby vehicles, pedestrians, or other conflicting objects are present that would immediately impact the ego vehicle’s path\. The ego vehicle must account for the stop sign and weather conditions to safely navigate the intersection, potentially slowing down and ensuring safe stopping or yielding before continuing the planned left turn\.",
"trajectory":"\[\[\+2\.12, \+0\.09\], \[\+5\.50, \+0\.88\], \[\+8\.85, \+3\.00\], \[\+11\.38, \+6\.40\], \[\+12\.73, \+10\.77\]\]"
\}
Figure 9: Wet-weather left turn at a stop-controlled intersection (same scene as the demo video). Fast-dDrive jointly tags the stop sign (`traffic_element`) and the wet road (`weather_condition`), commits to `slow down / left turn`, and emits a sharply curving trajectory that swings from `FRONT` into `FRONT_LEFT`. The cylindrical projection (§[A](https://arxiv.org/html/2605.23163#A1)) keeps the arc geometrically continuous across the camera seam, and the predicted curve overlays the GT (blue) closely for the entire 5 s.

These five cases together span the most informative planning regimes encountered in our evaluation: a deliberate steering manoeuvre with a clear road, longitudinal-only following behind another vehicle, a sharp lateral commit at a turn, a green-light cruise that exercises the traffic-element detection head, and a wet-weather left turn that demands joint reasoning over traffic infrastructure and adverse road conditions. In every case the structured Chain-of-Thought emitted by Scaffold-Spec is concise, self-consistent with the visible scene, and the predicted trajectory follows the GT direction.
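In the listings above, the `trajectory` field is a string rather than a nested JSON array, and its numbers carry an explicit leading `+`, which standard JSON number syntax does not allow. A consumer therefore has to parse the waypoints separately after loading the outer object. The sketch below shows one way to do that with a regular expression; `parse_trajectory` is an illustrative helper, not part of any released Fast-dDrive code, and the sample string is abridged from the Figure 5 output.

```python
import json
import re

# One "[x, y]" pair with optional sign and decimals.
_PAIR = re.compile(r"\[\s*([+-]?\d+(?:\.\d+)?)\s*,\s*([+-]?\d+(?:\.\d+)?)\s*\]")


def parse_trajectory(text):
    """Extract (x, y) waypoints from a string such as "[[+1.27, +0.03], ...]"."""
    return [(float(x), float(y)) for x, y in _PAIR.findall(text)]


raw = (
    '{"future_meta_behavior": {"longitudinal": "speed up", "lateral": "left turn"}, '
    '"trajectory": "[[+1.27, +0.03], [+4.21, +0.42], [+8.20, +1.90], '
    '[+12.08, +5.38], [+14.50, +10.72]]"}'
)
output = json.loads(raw)  # the outer object is valid JSON
waypoints = parse_trajectory(output["trajectory"])
```

Judging from the figure captions, the first coordinate appears to be the forward distance and the second the lateral offset in the vehicle frame (negative values in the right-turn example), but the axis convention is not stated explicitly in this appendix.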
## Appendix C Limitations
While Fast-dDrive significantly improves the throughput and planning accuracy of driving VLAs, it has certain limitations. First, our Scaffold Construction relies on a fixed JSON schema; while this covers the majority of current end-to-end driving tasks, it may require manual template adjustment if the task definition (e.g., the number of detected objects or the granularity of reasoning) changes fundamentally. Second, the shared-prefix inference scaling provides accuracy gains at the cost of some additional compute; although this cost is fractional, it may still be unaffordable in extremely low-latency edge environments where even a single additional forward pass is prohibitive. Finally, our current evaluation focuses on open-loop benchmarks. Although these are standard for assessing planning quality against human experts, future work should include closed-loop simulation to further validate the model’s reactive capabilities in dynamic environments.
## References
- Block diffusion: interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p2.1), [§3.1](https://arxiv.org/html/2605.23163#S3.SS1.SSS0.Px2.p1.7).
- J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. Van Den Berg (2021) Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems 34, pp. 17981–17993. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p2.1).
- S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin (2025) Qwen2.5-vl technical report. External Links: 2502.13923, [Link](https://arxiv.org/abs/2502.13923). Cited by: [§1](https://arxiv.org/html/2605.23163#S1.p2.1), [§1](https://arxiv.org/html/2605.23163#S1.p4.1), [§4.1](https://arxiv.org/html/2605.23163#S4.SS1.SSS0.Px4.p1.6).
- H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2020) Nuscenes: a multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11621–11631. Cited by: [§4.1](https://arxiv.org/html/2605.23163#S4.SS1.SSS0.Px1.p1.1).
- T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao (2024) Medusa: simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p3.1).
- C. Chen, S. Borgeaud, G. Irving, J. Lespiau, L. Sifre, and J. Jumper (2023) Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p3.1).
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p3.1).
- Y. Gu, Y. Wang, Y. Chen, Y. You, W. Luo, Y. Wang, W. Ding, B. Li, H. Yang, B. Ivanovic, et al. (2026) Accelerating structured chain-of-thought in autonomous vehicles. arXiv preprint arXiv:2602.02864. Cited by: [§1](https://arxiv.org/html/2605.23163#S1.p4.1), [§3.2](https://arxiv.org/html/2605.23163#S3.SS2.SSS0.Px1.p1.1).
- Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. (2023) Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 17853–17862. Cited by: [§1](https://arxiv.org/html/2605.23163#S1.p1.1), [Table 3](https://arxiv.org/html/2605.23163#S4.T3.4.2.5.1).
- L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al. (2025) A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43(2), pp. 1–55. Cited by: [§1](https://arxiv.org/html/2605.23163#S1.p2.1).
- B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang (2023) Vad: vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340–8350. Cited by: [§1](https://arxiv.org/html/2605.23163#S1.p1.1), [Table 3](https://arxiv.org/html/2605.23163#S4.T3.4.2.6.1).
- Y. Leviathan, M. Kalman, and Y. Matias (2023) Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p3.1).
- P. Li, Y. Zheng, Y. Wang, H. Wang, H. Zhao, J. Liu, X. Zhan, K. Zhan, and X. Lang (2025) Discrete diffusion for reflective vision-language-action models in autonomous driving. arXiv preprint arXiv:2509.20109. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p1.1).
- X. Li, J. Thickstun, I. Gulrajani, P. S. Liang, and T. B. Hashimoto (2022) Diffusion-lm improves controllable text generation. Advances in neural information processing systems 35, pp. 4328–4343. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p2.1).
- Y. Li, F. Wei, C. Zhang, and H. Zhang (2024a) Eagle: speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p3.1).
- Z. Li, Z. Yu, S. Lan, J. Li, J. Kautz, T. Lu, and J. M. Alvarez (2024b) Is ego status all you need for open-loop end-to-end autonomous driving?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14864–14873. Cited by: [Table 3](https://arxiv.org/html/2605.23163#S4.T3.4.2.7.1).
- B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y. Zhang, Q. Zhang, et al. (2025) Diffusiondrive: truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 12037–12047. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p3.1).
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. In The twelfth international conference on learning representations. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p3.1).
- H. Liu, C. Li, Q. Wu, and Y. J. Lee (2023) Visual instruction tuning. Advances in neural information processing systems 36, pp. 34892–34916. Cited by: [§1](https://arxiv.org/html/2605.23163#S1.p2.1).
- A. Lou, C. Meng, and S. Ermon (2024) Discrete diffusion modeling by estimating the ratios of the data distribution. In Proceedings of the 41st International Conference on Machine Learning, pp. 32819–32848. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p2.1).
- Y. Ma, Y. Cao, W. Ding, S. Zhang, Y. Wang, B. Ivanovic, M. Jiang, M. Pavone, and C. Xiao (2025) DVLM-ad: enhance diffusion vision-language-model for driving via controllable reasoning. arXiv preprint arXiv:2512.04459. Cited by: [Appendix A](https://arxiv.org/html/2605.23163#A1.SS0.SSS0.Px1.p1.2), [§1](https://arxiv.org/html/2605.23163#S1.p1.1), [§1](https://arxiv.org/html/2605.23163#S1.p3.1), [§1](https://arxiv.org/html/2605.23163#S1.p4.1), [§2](https://arxiv.org/html/2605.23163#S2.p1.1), [§3.2](https://arxiv.org/html/2605.23163#S3.SS2.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2605.23163#S4.SS1.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2605.23163#S4.SS1.SSS0.Px2.p1.1), [§4.1](https://arxiv.org/html/2605.23163#S4.SS1.SSS0.Px3.p1.1), [§4.2](https://arxiv.org/html/2605.23163#S4.SS2.SSS0.Px2.p1.1), [Table 2](https://arxiv.org/html/2605.23163#S4.T2.13.9.11.1), [Table 3](https://arxiv.org/html/2605.23163#S4.T3.4.2.11.1).
- S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025) Large language diffusion models. arXiv preprint arXiv:2502.09992. Cited by: [§1](https://arxiv.org/html/2605.23163#S1.p3.1), [§2](https://arxiv.org/html/2605.23163#S2.p2.1), [§3.1](https://arxiv.org/html/2605.23163#S3.SS1.SSS0.Px2.p1.7).
- Z. Qiao, H. Li, Z. Cao, and H. X. Liu (2025) Lightemma: lightweight end-to-end multimodal model for autonomous driving. arXiv preprint arXiv:2505.00284. Cited by: [Table 2](https://arxiv.org/html/2605.23163#S4.T2.11.7.7.1).
- L. Rowe, R. de Schaetzen, R. Girgis, C. Pal, and L. Paull (2025) Poutine: vision-language-trajectory pre-training and reinforcement learning post-training enable robust end-to-end autonomous driving. arXiv preprint arXiv:2506.11234. Cited by: [Appendix A](https://arxiv.org/html/2605.23163#A1.SS0.SSS0.Px2.p1.2), [§1](https://arxiv.org/html/2605.23163#S1.p1.1), [§1](https://arxiv.org/html/2605.23163#S1.p4.1), [§3.2](https://arxiv.org/html/2605.23163#S3.SS2.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2605.23163#S4.SS1.SSS0.Px2.p1.1), [Table 2](https://arxiv.org/html/2605.23163#S4.T2.13.9.9.2).
- S. S. Sahoo, M. Arriola, Y. Schiff, A. Gokaslan, E. Marroquin, J. T. Chiu, A. Rush, and V. Kuleshov (2024) Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems 37, pp. 130136–130184. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p2.1), [§3.1](https://arxiv.org/html/2605.23163#S3.SS1.SSS0.Px1.p1.6).
- J. Shi, K. Han, Z. Wang, A. Doucet, and M. Titsias (2024) Simplified and generalized masked diffusion for discrete data. Advances in neural information processing systems 37, pp. 103131–103167. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p2.1).
- C. Snell, J. Lee, K. Xu, and A. Kumar (2024) Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314. Cited by: [§2](https://arxiv.org/html/2605.23163#S2.p3.1).
- X\. Tian, J\. Gu, B\. Li, Y\. Liu, Y\. Wang, Z\. Zhao, K\. Zhan, P\. Jia, X\. Lang, and H\. Zhao \(2024\)Drivevlm: the convergence of autonomous driving and large vision\-language models\.arXiv preprint arXiv:2402\.12289\.Cited by:[§1](https://arxiv.org/html/2605.23163#S1.p1.1),[§2](https://arxiv.org/html/2605.23163#S2.p1.1),[Table 3](https://arxiv.org/html/2605.23163#S4.T3.4.2.9.1)\.
- T\. Wang, E\. Xie, R\. Chu, Z\. Li, and P\. Luo \(2024\)Drivecot: integrating chain\-of\-thought reasoning with end\-to\-end driving\.arXiv preprint arXiv:2403\.16996\.Cited by:[§2](https://arxiv.org/html/2605.23163#S2.p1.1)\.
- Y\. Wang, L\. Yang, B\. Li, Y\. Tian, K\. Shen, and M\. Wang \(2025\)Revolutionizing reinforcement learning framework for diffusion large language models\.arXiv preprint arXiv:2509\.06949\.Cited by:[§2](https://arxiv.org/html/2605.23163#S2.p2.1)\.
- J\. Wen, M\. Zhu, J\. Liu, Z\. Liu, Y\. Yang, L\. Zhang, S\. Zhang, Y\. Zhu, and Y\. Xu \(2025\)Dvla: diffusion vision\-language\-action model with multimodal chain\-of\-thought\.arXiv preprint arXiv:2509\.25681\.Cited by:[§2](https://arxiv.org/html/2605.23163#S2.p1.1)\.
- C\. Wu, S\. Lan, Y\. Fu, S\. Gao, J\. Wang, J\. Yu, J\. M\. Alvarez, P\. Molchanov, P\. Luo, S\. Han,et al\.\(2026\)Fast\-dvlm: efficient block\-diffusion vlm via direct conversion from autoregressive vlm\.arXiv preprint arXiv:2604\.06832\.Cited by:[§1](https://arxiv.org/html/2605.23163#S1.p2.1),[§1](https://arxiv.org/html/2605.23163#S1.p4.1),[§2](https://arxiv.org/html/2605.23163#S2.p1.1),[§2](https://arxiv.org/html/2605.23163#S2.p2.1),[§3\.1](https://arxiv.org/html/2605.23163#S3.SS1.SSS0.Px2.p2.1),[§3\.2](https://arxiv.org/html/2605.23163#S3.SS2.SSS0.Px3.p1.1),[§3\.3](https://arxiv.org/html/2605.23163#S3.SS3.SSS0.Px1.p1.1),[§3\.3](https://arxiv.org/html/2605.23163#S3.SS3.SSS0.Px2.p1.1),[§3\.3](https://arxiv.org/html/2605.23163#S3.SS3.p1.1),[§4\.1](https://arxiv.org/html/2605.23163#S4.SS1.SSS0.Px4.p1.6)\.
- C\. Wu, H\. Zhang, S\. Xue, Z\. Liu, S\. Diao, L\. Zhu, P\. Luo, S\. Han, and E\. Xie \(2025\)Fast\-dllm: training\-free acceleration of diffusion llm by enabling kv cache and parallel decoding\.arXiv preprint arXiv:2505\.22618\.Cited by:[§1](https://arxiv.org/html/2605.23163#S1.p2.1),[§2](https://arxiv.org/html/2605.23163#S2.p2.1),[§2](https://arxiv.org/html/2605.23163#S2.p3.1),[§3\.1](https://arxiv.org/html/2605.23163#S3.SS1.SSS0.Px2.p2.1),[§3\.3](https://arxiv.org/html/2605.23163#S3.SS3.SSS0.Px2.p1.1)\.
- S\. Xing, C\. Qian, Y\. Wang, H\. Hua, K\. Tian, Y\. Zhou, and Z\. Tu \(2025\)Openemma: open\-source multimodal model for end\-to\-end autonomous driving\.InProceedings of the Winter Conference on Applications of Computer Vision,pp\. 1001–1009\.Cited by:[Table 2](https://arxiv.org/html/2605.23163#S4.T2.10.6.6.1),[Table 3](https://arxiv.org/html/2605.23163#S4.T3.4.2.2.1)\.
- R\. Xu, H\. Lin, W\. Jeon, H\. Feng, Y\. Zou, L\. Sun, J\. Gorman, E\. Tolstaya, S\. Tang, B\. White,et al\.\(2025\)Wod\-e2e: waymo open dataset for end\-to\-end driving in challenging long\-tail scenarios\.arXiv preprint arXiv:2510\.26125\.Cited by:[§1](https://arxiv.org/html/2605.23163#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.23163#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.23163#S4.SS1.SSS0.Px3.p1.1)\.
- B\. Yang, H\. Su, N\. Gkanatsios, T\. Ke, A\. Jain, J\. Schneider, and K\. Fragkiadaki \(2024\)Diffusion\-es: gradient\-free planning with diffusion for autonomous and instruction\-guided driving\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 15342–15353\.Cited by:[§2](https://arxiv.org/html/2605.23163#S2.p3.1)\.
- L\. Yang, Y\. Tian, B\. Li, X\. Zhang, K\. Shen, Y\. Tong, and M\. Wang \(2025\)Mmada: multimodal large diffusion language models\.arXiv preprint arXiv:2505\.15809\.Cited by:[§2](https://arxiv.org/html/2605.23163#S2.p2.1)\.
- J\. Ye, Z\. Xie, L\. Zheng, J\. Gao, Z\. Wu, X\. Jiang, Z\. Li, and L\. Kong \(2025\)Dream 7b: diffusion large language models\.arXiv preprint arXiv:2508\.15487\.Cited by:[§2](https://arxiv.org/html/2605.23163#S2.p2.1)\.
- Z\. You, S\. Nie, X\. Zhang, J\. Hu, J\. Zhou, Z\. Lu, J\. Wen, and C\. Li \(2025\)Llada\-v: large language diffusion models with visual instruction tuning\.arXiv preprint arXiv:2505\.16933\.Cited by:[§1](https://arxiv.org/html/2605.23163#S1.p3.1),[§2](https://arxiv.org/html/2605.23163#S2.p2.1)\.
- R\. Yu, X\. Ma, and X\. Wang \(2025\)Dimple: discrete diffusion multimodal large language model with parallel decoding\.arXiv preprint arXiv:2505\.16990\.Cited by:[§1](https://arxiv.org/html/2605.23163#S1.p3.1),[§2](https://arxiv.org/html/2605.23163#S2.p2.1)\.
- J\. Zhang, J\. Wang, H\. Li, L\. Shou, K\. Chen, G\. Chen, and S\. Mehrotra \(2024\)Draft & verify: lossless large language model acceleration via self\-speculative decoding\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 11263–11282\.Cited by:[§2](https://arxiv.org/html/2605.23163#S2.p3.1)\.
- L\. Zheng, L\. Yin, Z\. Xie, C\. Sun, J\. Huang, C\. H\. Yu, S\. Cao, C\. Kozyrakis, I\. Stoica, J\. E\. Gonzalez,et al\.\(2024\)Sglang: efficient execution of structured language model programs\.Advances in neural information processing systems37,pp\. 62557–62583\.Cited by:[§4\.3](https://arxiv.org/html/2605.23163#S4.SS3.p2.3)\.
- Z\. Zhou, T\. Cai, S\. Z\. Zhao, Y\. Zhang, Z\. Huang, B\. Zhou, and J\. Ma \(2025\)AutoVLA: a vision\-language\-action model for end\-to\-end autonomous driving with adaptive reasoning and reinforcement fine\-tuning\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2605.23163#S1.p1.1),[§1](https://arxiv.org/html/2605.23163#S1.p4.1),[§2](https://arxiv.org/html/2605.23163#S2.p1.1),[Table 2](https://arxiv.org/html/2605.23163#S4.T2.12.8.8.2),[Table 3](https://arxiv.org/html/2605.23163#S4.T3.4.2.10.1)\.
- F\. Zhu, R\. Wang, S\. Nie, X\. Zhang, C\. Wu, J\. Hu, J\. Zhou, J\. Chen, Y\. Lin, J\. Wen,et al\.\(2025\)Llada 1\.5: variance\-reduced preference optimization for large language diffusion models\.arXiv preprint arXiv:2505\.19223\.Cited by:[§2](https://arxiv.org/html/2605.23163#S2.p2.1)\.