ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
Summary
ReflectDrive-2 is a new discrete diffusion planner for autonomous driving that uses reinforcement learning to enable self-editing of trajectory tokens, achieving high performance and low latency on the NAVSIM benchmark.
Source: https://huggingface.co/papers/2605.04647
Abstract
ReflectDrive-2 employs a masked discrete diffusion planner with parallel decoding for autonomous driving, enabling in-place trajectory revision through token rewriting and achieving high performance with efficient reflective decoding.
We introduce ReflectDrive-2, a masked discrete diffusion planner with a separate action expert for autonomous driving that represents plans as discrete trajectory tokens and generates them through parallel masked decoding. This discrete token space enables in-place trajectory revision: AutoEdit rewrites selected tokens using the same model, without requiring an auxiliary refinement network. To train this capability, we use a two-stage procedure. First, we construct structure-aware perturbations of expert trajectories along longitudinal progress and lateral heading directions and supervise the model to recover the original expert trajectory. We then fine-tune the full decision-draft-reflect rollout with reinforcement learning (RL), assigning terminal driving reward to the final post-edit trajectory and propagating policy-gradient credit through full-rollout transitions. Full-rollout RL proves crucial for coupling drafting and editing: under supervised training alone, inference-time AutoEdit improves PDMS by at most 0.3, whereas RL increases its gain to 1.9. We also co-design an efficient reflective decoding stack for the decision-draft-reflect pipeline, combining shared-prefix KV reuse, Alternating Step Decode, and fused on-device unmasking. On NAVSIM, ReflectDrive-2 achieves 91.0 PDMS with camera-only input and 94.8 PDMS in a best-of-6 oracle setting, while running at 31.8 ms average latency on NVIDIA Thor.
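The first training stage described in the abstract (perturb an expert trajectory, then supervise recovery of the original) can be illustrated with a minimal sketch. The abstract does not specify how the perturbations or the tokenizer are defined, so everything here is an assumption made for illustration: trajectories are ego-frame (x, y) waypoints, the longitudinal perturbation is a uniform scaling of displacement, the heading perturbation is a rotation about the ego origin, and tokens are uniform grid bins. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Token = Tuple[int, int]


def perturb_longitudinal(traj: List[Point], scale: float) -> List[Point]:
    # Assumption: model a progress error by scaling each waypoint's
    # displacement from the ego origin (scale > 1 overshoots, < 1 undershoots).
    return [(x * scale, y * scale) for x, y in traj]


def perturb_heading(traj: List[Point], delta_rad: float) -> List[Point]:
    # Assumption: model a lateral/heading error by rotating the whole
    # trajectory about the ego origin by delta_rad (counter-clockwise).
    c, s = math.cos(delta_rad), math.sin(delta_rad)
    return [(c * x - s * y, s * x + c * y) for x, y in traj]


def tokenize(traj: List[Point], bin_size: float = 0.5) -> List[Token]:
    # Hypothetical tokenizer: snap each waypoint to the nearest grid cell.
    return [
        (math.floor(x / bin_size + 0.5), math.floor(y / bin_size + 0.5))
        for x, y in traj
    ]


def make_training_pair(
    expert: List[Point], scale: float, delta_rad: float, bin_size: float = 0.5
) -> Tuple[List[Token], List[Token], List[bool]]:
    # Input: tokens of the perturbed trajectory. Target: expert tokens.
    # needs_edit marks the positions an editor would have to rewrite.
    noisy = perturb_heading(perturb_longitudinal(expert, scale), delta_rad)
    src = tokenize(noisy, bin_size)
    tgt = tokenize(expert, bin_size)
    needs_edit = [a != b for a, b in zip(src, tgt)]
    return src, tgt, needs_edit


if __name__ == "__main__":
    expert = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
    src, tgt, needs_edit = make_training_pair(expert, scale=1.5, delta_rad=0.0)
    print(src, tgt, needs_edit)
```

In this toy setup, the (src, tgt) pair would serve as a supervised recovery example, and needs_edit shows which token positions differ after perturbation. The actual perturbation distributions, token vocabulary, and loss used by ReflectDrive-2 are not given in the abstract.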
Similar Articles
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 presents a unified generator-discriminator framework for autonomous driving that combines diffusion-based trajectory generation with RL-optimized reranking, achieving 56% collision rate reduction compared to diffusion-based planners. The approach introduces techniques like Temporally Consistent Group Relative Policy Optimization and BEV-Warp simulation environment for efficient large-scale training.
Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving
Fast-dDrive is a block-diffusion VLA model for end-to-end autonomous driving that achieves state-of-the-art trajectory accuracy while delivering over 12x throughput speedup over autoregressive baselines, addressing the trade-off between high-fidelity planning and efficient inference for edge deployment.
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
This paper introduces a novel adaptive scheduler for steering discrete diffusion language models using sparse autoencoders, demonstrating that targeting interventions based on when specific attributes commit improves control quality and strength over uniform methods.
SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
SafeDiffusion-R1 introduces an online reinforcement learning framework using GRPO and a steering reward mechanism to improve safety in diffusion models without requiring supervised data or reward tuning, achieving state-of-the-art performance on multiple harm categories.
FlowR2A: Learning Reward-to-Action Distribution for Multimodal Driving Planning
FlowR2A proposes a novel method that combines dense reward supervision with dynamic proposal generation using a flow-matching decoder for multimodal driving planning, achieving state-of-the-art results on the NAVSIM benchmarks.