Neuro-Symbolic Drive: Rule-Grounded Faithful Reasoning for Driving VLAs

arXiv cs.AI Papers

Summary

Introduces Neuro-Symbolic Drive, a framework that fine-tunes a driving VLA (Qwen3.5-4B) on rule-grounded reasoning traces extracted from classical rule-based planners. On the authors' simulator-generated benchmark, detailed rule-grounded reasoning lowers ADE@3s from 0.47 to 0.26 and miss rate from 8.30% to 6.40% with three cameras, relative to the same model trained without reasoning.

arXiv:2606.23938v1 Announce Type: new Abstract: Driving VLA models incorporating Chain-of-Thought (CoT) reasoning are attractive because they leverage pretrained VLM representations and expose intermediate decisions in natural language, yet current rationales often lack the step-by-step decision semantics needed to keep the rationale causally connected to the planned motion. We introduce Neuro-Symbolic Drive, a neuro-symbolic driving framework that supervises a driving VLA with rule-grounded reasoning traces extracted directly from classical rule-based planners. Our key observation is that rule-based planners are symbolic AI systems that already function as executable reasoning engines: they reason about active safety constraints, search over candidate maneuvers, and select a final trajectory. We instrument these planners in simulation to capture both the executed trajectory and the internal decision trace at each rule-evaluation step. Each trace is serialized into structured rule-grounded reasoning and paired with the trajectory to fine-tune Qwen3.5-4B as a driving VLA. Because these traces are derived directly from the planner states that determine the action, they ensure reasoning is structurally coupled to motion generation by construction, rather than by post-hoc alignment. On our simulator-generated benchmark, detailed rule-grounded reasoning reduces ADE@3s from 0.47 to 0.26 and miss rate from 8.30% to 6.40% under three-camera perception, and from 0.54 to 0.26 and 10.13% to 5.99% under eight-camera perception. Neuro-Symbolic Drive thus converts neuro-symbolic planning logic into structured supervision. Code base: https://github.com/XiangboGaoBarry/Neural-Symbolic-Drive.

Cached at: 06/24/26, 07:43 AM

# Neuro-Symbolic Drive: Rule-Grounded Faithful Reasoning for Driving VLAs
Source: [https://arxiv.org/html/2606.23938](https://arxiv.org/html/2606.23938)
Xiangbo Gao¹∗, Xiukun Huang²∗†, Boyu Lu³, Junge Zhang⁴, Mengjie Mao⁵, Jiachen Li⁴, Wei Xiong, Zhengzhong Tu¹

¹Texas A&M University; ²Carnegie Mellon University; ³University of Maryland; ⁴University of California, Riverside; ⁵University of Pittsburgh

∗Equal contribution. †Project lead.

###### Abstract

Driving VLA models incorporating Chain\-of\-Thought \(CoT\) reasoning are attractive because they leverage pretrained VLM representations and expose intermediate decisions in natural language, yet current rationales often lack the step\-by\-step decision semantics needed to keep the rationale causally connected to the planned motion\. We introduce Neuro\-Symbolic Drive, a neuro\-symbolic driving framework that supervises a driving VLA with rule\-grounded reasoning traces extracted directly from classical rule\-based planners\. Our key observation is that rule\-based planners are symbolic AI systems that already function as executable reasoning engines: they reason about active safety constraints, search over candidate maneuvers, and select a final trajectory\. We instrument these planners in simulation to capture both the executed trajectory and the internal decision trace at each rule\-evaluation step\. Each trace is serialized into structured rule\-grounded reasoning and paired with the trajectory to fine\-tune Qwen3\.5\-4B as a driving VLA\. Because these traces are derived directly from the planner states that determine the action, they ensure reasoning is structurally coupled to motion generation by construction, rather than by post\-hoc alignment\. On our simulator\-generated benchmark, detailed rule\-grounded reasoning reduces ADE@3s from 0\.47 to 0\.26 and miss rate from 8\.30% to 6\.40% under three\-camera perception, and from 0\.54 to 0\.26 and 10\.13% to 5\.99% under eight\-camera perception\. Neuro\-Symbolic Drive thus converts neuro\-symbolic planning logic into structured supervision\. All source code is publicly available\. Reasoning trace generation: [https://github\.com/zcxgi/nuplan\-reason](https://github.com/zcxgi/nuplan-reason); model training: [https://github\.com/XiangboGaoBarry/Neural\-Symbolic\-Drive](https://github.com/XiangboGaoBarry/Neural-Symbolic-Drive)\.

## 1 Introduction

Classical rule\-based planners and end\-to\-end models represent two complementary paradigms in autonomous driving: the former, which depends on deep perception modules to interpret raw sensor data, offers inspectable decision procedures and explicit safety constraints, but is brittle and limited by perception capabilities; the latter learns flexible representations directly from data but operates as a black box, lacking interpretability and explicit safety constraints\. Before the rise of VLAs, interpretability was largely engineered into the planning stack through structured intermediate representations fed by deep perception, cost volumes, semantic maps, candidate trajectories, and other planning\-oriented abstractions\[[56](https://arxiv.org/html/2606.23938#bib.bib31),[4](https://arxiv.org/html/2606.23938#bib.bib32),[18](https://arxiv.org/html/2606.23938#bib.bib3),[22](https://arxiv.org/html/2606.23938#bib.bib4),[44](https://arxiv.org/html/2606.23938#bib.bib17),[1](https://arxiv.org/html/2606.23938#bib.bib20)\]\. 
VLA systems offer a different path: by exposing reasoning in natural language, they enable driving agents to articulate scene understanding, interaction logic, and planned behaviors in a form that humans can inspect and audit\[[5](https://arxiv.org/html/2606.23938#bib.bib5),[33](https://arxiv.org/html/2606.23938#bib.bib6),[36](https://arxiv.org/html/2606.23938#bib.bib11),[52](https://arxiv.org/html/2606.23938#bib.bib7),[34](https://arxiv.org/html/2606.23938#bib.bib10),[20](https://arxiv.org/html/2606.23938#bib.bib13),[51](https://arxiv.org/html/2606.23938#bib.bib14),[21](https://arxiv.org/html/2606.23938#bib.bib34),[41](https://arxiv.org/html/2606.23938#bib.bib33),[45](https://arxiv.org/html/2606.23938#bib.bib9),[30](https://arxiv.org/html/2606.23938#bib.bib12),[53](https://arxiv.org/html/2606.23938#bib.bib45),[6](https://arxiv.org/html/2606.23938#bib.bib46),[29](https://arxiv.org/html/2606.23938#bib.bib47),[19](https://arxiv.org/html/2606.23938#bib.bib48),[50](https://arxiv.org/html/2606.23938#bib.bib35),[48](https://arxiv.org/html/2606.23938#bib.bib58)\]\.

The central difficulty is that readable reasoning is not necessarily action\-bearing\. Many driving VLAs generate language explanations while predicting trajectories either through a separate regression head or by discretizing continuous actions into tokens appended to the vocabulary\. Such designs can produce accurate motions, but offer no guarantee that the stated rationale controls the final trajectory\. CoT studies show that plausible explanations can be unfaithful\[[43](https://arxiv.org/html/2606.23938#bib.bib41),[25](https://arxiv.org/html/2606.23938#bib.bib42)\], and recent driving work similarly questions whether language reasoning is closely coupled to planning\[[32](https://arxiv.org/html/2606.23938#bib.bib21),[10](https://arxiv.org/html/2606.23938#bib.bib22),[26](https://arxiv.org/html/2606.23938#bib.bib24),[37](https://arxiv.org/html/2606.23938#bib.bib25),[38](https://arxiv.org/html/2606.23938#bib.bib26)\]\. The root cause is a supervision mismatch: current reasoning labels are typically generated post\-hoc by humans, other VLMs, or high\-level causal annotation pipelines, rather than derived from the mechanism that actually determines the motion\. DeepSeek\-R1 suggests a principled fix: strong reasoning emerges when intermediate steps are grounded in verifiable, rule\-based feedback rather than free\-form imitation alone\[[9](https://arxiv.org/html/2606.23938#bib.bib44)\]\. We carry this principle into driving: the verifiable feedback is the execution trace of the rule\-based planner that determined the trajectory\.

Rule\-based planners are a natural fit for this role: their internal execution unfolds as a structured decision procedure, enforcing safety rules to prune the action space, scoring competing candidate trajectories, and committing to the best feasible motion\. Crucially, these traces are causally responsible for the selected motion, not post\-hoc reconstructions\. Deployed L4 systems, which already operate driverless fleets across multiple cities worldwide, run precisely such planner stacks: multiple rule\-based modules specialized per scenario family, whose decisions are explainable, inspectable, and by construction the causal source of every trajectory they select\. Close in spirit, Alpamayo\-R1 takes an important step toward decision\-grounded supervision by constructing structured Chain\-of\-Causation \(CoC\) labels, integrating action generation, and applying post\-training alignment to bridge reasoning and trajectory generation for long\-tail driving\[[47](https://arxiv.org/html/2606.23938#bib.bib49)\]\. We are inspired by its focus on reasoning\-action alignment, but target a different supervision source: rather than relying on human\- or VLM\-generated causal labels, we extract reasoning directly from the executable rule\-based planner\.

![Refer to caption](https://arxiv.org/html/2606.23938v1/x1.png)

Figure 1: Overview of Neuro\-Symbolic Drive\. The left side extracts structured decision traces from rule\-based planners\. For each simulated driving scene, planner traces expose active constraints, relevant agents, candidate maneuvers, safety gates, score components, and final selections\. A scenario\-aware selector chooses the trace best matched to the scene family\. The right side uses these traces to supervise a driving VLA, so the model learns to jointly produce faithful reasoning and the corresponding future trajectory\.

We propose Neuro\-Symbolic Drive, a neuro\-symbolic framework that converts rule\-based planner execution traces into structured supervision for a driving VLA\. Given a simulated scene, we run complementary planners, record their decision traces, select the planner best matched to the scenario family, and convert the chosen trace into rule\-grounded reasoning paired with the trajectory\. Because the reasoning trace and the trajectory arise from the same planner execution, the rationale is action\-bearing by construction, rather than by post\-hoc alignment\. Figure [1](https://arxiv.org/html/2606.23938#S1.F1) summarizes this pipeline\.

Our main contributions are as follows\.

- We introduce Neuro\-Symbolic Drive, a neuro\-symbolic framework that repurposes rule\-based planner execution traces as structured reasoning supervision for vision\-language driving agents\.
- We design a lightweight abstraction that maps heterogeneous raw planner states from multiple teachers to a unified four\-slot reasoning schema, enabling cross\-teacher learning without planner\-specific supervision vocabularies\.
- We demonstrate that rule\-grounded reasoning improves driving performance and produces more behaviorally consistent rationales than trajectory\-only supervision\.

## 2 Related Work

### 2\.1 Language Reasoning in Driving VLAs

Early language\-augmented systems translate object vectors, route context, or traffic descriptions into prompts or planning states\[[5](https://arxiv.org/html/2606.23938#bib.bib5),[33](https://arxiv.org/html/2606.23938#bib.bib6),[46](https://arxiv.org/html/2606.23938#bib.bib8)\], while recent VLA models directly connect multi\-view observations to explanations, decisions, controls, or trajectories\[[36](https://arxiv.org/html/2606.23938#bib.bib11),[52](https://arxiv.org/html/2606.23938#bib.bib7),[34](https://arxiv.org/html/2606.23938#bib.bib10),[20](https://arxiv.org/html/2606.23938#bib.bib13),[51](https://arxiv.org/html/2606.23938#bib.bib14),[21](https://arxiv.org/html/2606.23938#bib.bib34),[41](https://arxiv.org/html/2606.23938#bib.bib33),[30](https://arxiv.org/html/2606.23938#bib.bib12),[45](https://arxiv.org/html/2606.23938#bib.bib9),[55](https://arxiv.org/html/2606.23938#bib.bib37)\]\. A parallel agentic line studies LLM and VLM driving agents that must follow instructions, reason about rules, use experience, and act under interaction\[[53](https://arxiv.org/html/2606.23938#bib.bib45),[6](https://arxiv.org/html/2606.23938#bib.bib46),[12](https://arxiv.org/html/2606.23938#bib.bib59),[29](https://arxiv.org/html/2606.23938#bib.bib47),[19](https://arxiv.org/html/2606.23938#bib.bib48),[13](https://arxiv.org/html/2606.23938#bib.bib56),[15](https://arxiv.org/html/2606.23938#bib.bib57)\], and recent VLA systems further explore adaptive reasoning, RL, instructed action generation, and decoupled high\-level reasoning\[[57](https://arxiv.org/html/2606.23938#bib.bib23),[26](https://arxiv.org/html/2606.23938#bib.bib24),[54](https://arxiv.org/html/2606.23938#bib.bib39),[10](https://arxiv.org/html/2606.23938#bib.bib22),[39](https://arxiv.org/html/2606.23938#bib.bib27)\]\. 
These works show language can make driving agents more interpretable, but many still rely on free\-form rationales or post\-hoc alignment; Neuro\-Symbolic Drive instead derives reasoning directly from the executable planner state that produced the trajectory\.

### 2\.2 Chain\-of\-Thought Reasoning, Faithfulness, and the Supervision Mismatch

Inference\-time scaling, from CoT\[[49](https://arxiv.org/html/2606.23938#bib.bib40)\] to process reward models\[[28](https://arxiv.org/html/2606.23938#bib.bib55)\], consistently shows that step\-level supervision can be more effective than outcome supervision, with OpenAI o1\[[31](https://arxiv.org/html/2606.23938#bib.bib51)\] and DeepSeek\-R1\[[9](https://arxiv.org/html/2606.23938#bib.bib44)\] identifying rule\-based verifiable rewards as a key driver over free\-form imitation\. EMMA\[[20](https://arxiv.org/html/2606.23938#bib.bib13)\] represents continuous control signals as discrete tokens using the existing VLM’s pretrained tokenizer, so driving actions and language reasoning share the same word embeddings and semantic space\. DriveVLM\[[41](https://arxiv.org/html/2606.23938#bib.bib33)\] uses language in a similar way as an intermediate reasoning layer between perception and trajectory generation; DriveLM\[[36](https://arxiv.org/html/2606.23938#bib.bib11)\] structures reasoning as a graph of visual question\-answering steps; DriveCoT\[[45](https://arxiv.org/html/2606.23938#bib.bib9)\] and Reason2Drive\[[30](https://arxiv.org/html/2606.23938#bib.bib12)\] generate supervised thinking\-process traces from expert demonstrations\. Alpamayo\-R1\[[47](https://arxiv.org/html/2606.23938#bib.bib49)\] constructs structured CoC labels and applies post\-training alignment to bridge decision\-grounded reasoning and trajectory generation\. CoT studies warn that plausible explanations can omit or distort the true factors behind a model output\[[43](https://arxiv.org/html/2606.23938#bib.bib41),[25](https://arxiv.org/html/2606.23938#bib.bib42)\], and imitation policies may exploit correlated but non\-causal signals\[[8](https://arxiv.org/html/2606.23938#bib.bib43)\]\.
In autonomous driving, these concerns surface concretely as language\-action mismatch, weak reasoning\-planning coupling, and confounding in end\-to\-end planning\[[32](https://arxiv.org/html/2606.23938#bib.bib21),[10](https://arxiv.org/html/2606.23938#bib.bib22),[26](https://arxiv.org/html/2606.23938#bib.bib24),[37](https://arxiv.org/html/2606.23938#bib.bib25),[38](https://arxiv.org/html/2606.23938#bib.bib26)\]\. The root cause is a supervision mismatch: existing approaches generate reasoning labels post\-hoc, as text narrations, causal annotations, or latent surrogates, rather than from the mechanism that determined the motion\. Neuro\-Symbolic Drive addresses this at the source: reasoning traces are abstracted directly from the planner computation that selected the supervision trajectory, not constructed after the fact\.

### 2\.3 Rule\-Based Planners and Decision\-Grounded Supervision

Interpretable planning precedes current driving VLAs\. Neural planners and map\-perceive\-predict\-plan systems expose detections, semantic maps, and cost volumes as structured intermediate variables\[[56](https://arxiv.org/html/2606.23938#bib.bib31),[4](https://arxiv.org/html/2606.23938#bib.bib32)\]; end\-to\-end planning\-oriented systems retain task decomposition\[[18](https://arxiv.org/html/2606.23938#bib.bib3),[22](https://arxiv.org/html/2606.23938#bib.bib4)\]; and IVMP and QuAD emphasize interpretable factors and candidate evaluation\[[44](https://arxiv.org/html/2606.23938#bib.bib17),[1](https://arxiv.org/html/2606.23938#bib.bib20)\]\. Closed\-loop benchmarks show that rule\-based planners remain competitive and complementary to learned models\[[3](https://arxiv.org/html/2606.23938#bib.bib2),[7](https://arxiv.org/html/2606.23938#bib.bib18),[14](https://arxiv.org/html/2606.23938#bib.bib28)\]\. What this body of work does not exploit is that rule\-based planners produce a rich internal execution state comprising active constraints, candidate proposals, safety checks, and score components, all causally synchronized with the selected trajectory\. On the supervision side, decision\-grounded learning has appeared in several forms: DriveMLM aligns language output with planning states\[[46](https://arxiv.org/html/2606.23938#bib.bib8)\], DriveCoT generates thinking\-process labels from rule\-based experts\[[45](https://arxiv.org/html/2606.23938#bib.bib9)\], Hydra\-MDP distills from human and rule\-based teachers\[[27](https://arxiv.org/html/2606.23938#bib.bib38)\], and DiMA distills LLM knowledge into efficient driving models\[[17](https://arxiv.org/html/2606.23938#bib.bib29)\]\. 
Program\-grounded learning further suggests that executable structure can be serialized and distilled into neural models\[[35](https://arxiv.org/html/2606.23938#bib.bib30)\], while NaviDriveVLM and RAD\-LAD explore decoupled or hybrid language\-planner systems\[[39](https://arxiv.org/html/2606.23938#bib.bib27),[14](https://arxiv.org/html/2606.23938#bib.bib28)\]\. Across all of these lines, planners are treated as baselines, safety modules, or inference\-time collaborators, and reasoning labels are generated or aligned after the trajectory is determined\. Neuro\-Symbolic Drive differs by treating planner execution traces as the primary supervision source: the reasoning and the trajectory arise from the same transparent computation, making the rationale action\-bearing by construction rather than by post\-hoc alignment\.

## 3 Method

![Refer to caption](https://arxiv.org/html/2606.23938v1/x2.png)

Figure 2: Generated reasoning and trajectory visualization\. We show a qualitative prediction from the trained model with the generated rule\-grounded reasoning trace and the corresponding top\-down trajectory comparison\. The example illustrates that the model does not merely output waypoints: it first identifies the active driving context and decision rationale, then predicts a trajectory that is visually aligned with the ground\-truth motion\.

We treat the faithfulness gap, the disconnect between a model’s stated rationale and what actually determined its motion, as a supervision sourcing problem: reasoning labels should be derived from the mechanism that determined the motion, not constructed after the fact\.

### 3\.1 Problem Formulation

At timestep $t$, the driving input is

$$x_t := \left(I_t, h_t, g_t\right), \tag{1}$$

where $I_t$ denotes synchronized multi\-view images, $h_t$ is the recent ego\-state history, and $g_t$ is the route or mission context\. A deterministic planner $p \in \mathcal{P}$ \(the set of available rule\-based planners\) receives $x_t$ and returns both a future trajectory $y_t$ and an internal execution trace $z_t$:

$$(y_t, z_t) = p(x_t). \tag{2}$$

The trace $z_t$ is the raw program state comprising active constraints, route context, candidate proposals, safety checks, score components, and final trajectory choice, not free\-form text\. An abstraction operator $\mathcal{A}$ converts it into a reasoning sequence $r_t$:

$$r_t := \mathcal{A}(z_t). \tag{3}$$

The training target for the VLA is the concatenated sequence

$$u_t := [\texttt{<REASONING>}\; r_t\; \texttt{</REASONING>}\; \texttt{<PLANNING>}\; y_t\; \texttt{</PLANNING>}]. \tag{4}$$

The key design choice is that $r_t$ is derived from the same planner state that produced $y_t$, making the reasoning action\-bearing by construction rather than merely correlated with the trajectory\.
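To make the target layout of Eq. 4 concrete, the following is a minimal sketch of how a reasoning string and a trajectory could be joined into one training sequence and split apart again. Only the `<REASONING>` and `<PLANNING>` delimiters come from the text above; the function names and the `(x, y)` two-decimal waypoint encoding are assumptions made for illustration, not the authors' serialization.

```python
import re
from typing import List, Tuple

Waypoint = Tuple[float, float]

# Delimiters follow Eq. 4; everything else in this sketch is hypothetical.
_TARGET = re.compile(
    r"<REASONING>(.*?)</REASONING>\s*<PLANNING>(.*?)</PLANNING>", re.DOTALL
)


def format_trajectory(waypoints: List[Waypoint]) -> str:
    """Encode waypoints as text (assumed format: "(x, y)" with two decimals)."""
    return " ".join(f"({x:.2f}, {y:.2f})" for x, y in waypoints)


def build_target(reasoning: str, waypoints: List[Waypoint]) -> str:
    """Concatenate reasoning r_t and trajectory y_t into the target u_t."""
    return (
        f"<REASONING>{reasoning}</REASONING>"
        f"<PLANNING>{format_trajectory(waypoints)}</PLANNING>"
    )


def split_target(target: str) -> Tuple[str, str]:
    """Recover the reasoning text and the trajectory text from a target."""
    match = _TARGET.fullmatch(target.strip())
    if match is None:
        raise ValueError("target does not follow the REASONING/PLANNING layout")
    return match.group(1), match.group(2)
```

Because reasoning precedes the trajectory in the sequence, an autoregressive model trained on such targets emits the rationale before any waypoint token.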

### 3\.2 Planner Trace Extraction

We use three complementary rule\-based planners in simulation: Intelligent Driver Model Planner \(IDM\), IDM\-MOBIL Planner, which augments IDM with the MOBIL lane\-change criterion \(Minimizing Overall Braking Induced by Lane Changes\)\[[42](https://arxiv.org/html/2606.23938#bib.bib53)\], and Predictive Driver Model Closed Planner \(PDM\-Closed\)\[[7](https://arxiv.org/html/2606.23938#bib.bib18)\]\. IDM provides stable centerline car\-following traces through lead\-agent, headway, closing\-speed, safety\-gap, and acceleration terms, while IDM\-MOBIL augments IDM with interaction\-aware lane\-change and gap\-selection logic for cut\-in, merge, queueing, and close\-following cases\. PDM\-Closed follows a proposal\-based closed\-loop procedure: it generates candidate trajectories, applies safety and progress checks, scores surviving proposals, and selects the final maneuver\[[16](https://arxiv.org/html/2606.23938#bib.bib52),[7](https://arxiv.org/html/2606.23938#bib.bib18),[23](https://arxiv.org/html/2606.23938#bib.bib19)\]\. These three teachers therefore expose complementary reasoning patterns: gap control, interaction\-aware maneuver selection, and propose\-evaluate\-select planning\.

As Figure [3](https://arxiv.org/html/2606.23938#S3.F3) shows, rather than relying on a single universal teacher, we route each scenario family to its most capable planner: closed\-loop evaluation identifies the most reliable teacher per family, whose trace and trajectory become the default supervision\. This teacher\-selection step is the data\-construction analog of L4 dispatch\. When multiple planners are complementary, we also retain multiple successful rollouts for the same scene, increasing supervision diversity without forcing a single teacher to cover scenarios outside its competence\.
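The routing step can be sketched as a small aggregation: collect closed-loop scores per scenario family and planner, then pick the planner with the best mean score in each family. The score scale (higher is better), the family names, and the function name below are assumptions; the text does not specify which closed-loop metric is used.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# One record: (scenario_family, planner_name, closed_loop_score).
# The score is assumed to be "higher is better"; the actual metric is not
# given in the text.
Result = Tuple[str, str, float]


def select_teachers(results: Iterable[Result]) -> Dict[str, str]:
    """Return the planner with the highest mean score for each family."""
    scores: Dict[str, Dict[str, List[float]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for family, planner, score in results:
        scores[family][planner].append(score)
    return {
        family: max(per_planner, key=lambda name: mean(per_planner[name]))
        for family, per_planner in scores.items()
    }


if __name__ == "__main__":
    demo = [
        ("car_following", "IDM", 0.9),
        ("car_following", "PDM-Closed", 0.8),
        ("lane_change", "IDM", 0.4),
        ("lane_change", "PDM-Closed", 0.7),
    ]
    print(select_teachers(demo))
```

Retaining several successful rollouts per scene, as described above, would amount to keeping every planner whose score clears a chosen bar instead of only the maximum.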

![Refer to caption](https://arxiv.org/html/2606.23938v1/x3.png)

Figure 3: Same\-scene planner comparison\. We visualize the three rule\-based teachers on matched scenarios and timestamps\. The examples show that different rule\-based planners can make distinct closed\-loop choices under the same observation, which motivates selecting the teacher trace according to scenario\-level closed\-loop quality rather than treating any single planner as a universal expert\.

Figure [4](https://arxiv.org/html/2606.23938#S3.F4) summarizes the merged scenario coverage used for planner trace extraction\. The distribution spans 59 scenario types and roughly 330 hours of simulated duration in the dataset, with the largest individual scenario families contributing only a small fraction of the full corpus\. This broad but relatively balanced coverage is important for our setting: the model sees planner traces from routine lane following, longitudinal car\-following, traffic\-light traversal, turning, interaction, and long\-tail maneuver scenarios, rather than learning reasoning from a narrow planner\-specific subset\.

![Refer to caption](https://arxiv.org/html/2606.23938v1/x4.png)

Figure 4: Merged dataset duration by scenario type\. We compute each scenario’s simulated duration as the number of examples multiplied by the average scenario duration\. The chart shows the top 21 scenario types and groups the remaining 38 types as *Other*, illustrating the diverse scenario coverage used to extract rule\-grounded reasoning traces\.

Instrumentation is read\-only: we record the planner program state without changing the trajectory computation itself\. The extracted fields include route context, relevant agents, safety constraints, candidate outcomes, score components, selected proposal identity, and fallback status, when applicable\. Because the trace is captured from the same execution that produces the trajectory, every reasoning target remains synchronized with the exact motion used for supervision\.
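One way to picture read-only instrumentation is a wrapper that calls the planner unchanged and then copies its internal state. The sketch below uses a toy planner written for this example; `ToyPlanner`, its `state` dictionary, and its braking rule are hypothetical stand-ins and do not correspond to any nuPlan class or to the authors' instrumentation code.

```python
import copy
from typing import Any, Dict, List, Tuple


class ToyPlanner:
    """Hypothetical rule-based planner used only to illustrate the idea."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {}

    def plan(self, lead_gap_m: float) -> List[float]:
        brake = lead_gap_m < 10.0  # toy safety rule, not from the paper
        self.state = {
            "lead_gap_m": lead_gap_m,
            "constraint": "safety_gap" if brake else "none",
            "decision": "decelerate" if brake else "keep_speed",
        }
        step = 2.0 if brake else 5.0
        return [step * (k + 1) for k in range(3)]  # longitudinal positions


class TraceRecorder:
    """Wraps a planner and records (trace, trajectory) pairs."""

    def __init__(self, planner: ToyPlanner) -> None:
        self._planner = planner
        self.records: List[Tuple[Dict[str, Any], List[float]]] = []

    def plan(self, lead_gap_m: float) -> List[float]:
        trajectory = self._planner.plan(lead_gap_m)
        # Deep-copy the state so later planner calls cannot alter the record.
        trace = copy.deepcopy(self._planner.state)
        self.records.append((trace, list(trajectory)))
        return trajectory  # returned unchanged: recording is read-only
```

Since the trace is copied right after the call that produced the trajectory, each recorded pair comes from a single planner execution, which is the synchronization property the paragraph above relies on.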

### 3\.3 Reasoning Abstraction and Serialization

Raw program traces are not suitable supervision targets by themselves\. First, the three planners expose different internal vocabularies: IDM is organized around lead vehicles, headway, desired gaps, and acceleration updates; IDM\-MOBIL adds lateral incentives and lane\-change safety checks; PDM\-Closed exposes proposal banks, safety filters, and score components\. A single VLA cannot learn unified decision semantics from these heterogeneous variable names alone\. Second, raw traces mix action\-relevant evidence with implementation artifacts such as loop counters, temporary buffers, and intermediate accumulators\. Directly imitating them would encourage memorization of planner\-specific execution details rather than learning reusable driving logic\. Figure [2](https://arxiv.org/html/2606.23938#S3.F2) provides a concrete example of the resulting serialized reasoning trace together with the corresponding top\-down generated trajectory\. We therefore serialize each trace into a compact schema with four slots:

$$r_t := \left[s_t^{\text{scene}}, s_t^{\text{constraint}}, s_t^{\text{candidate}}, s_t^{\text{decision}}\right]. \tag{5}$$

The *scene* summarizes the route and ego context\. The *constraint* specifies the active safety or rule signal\. The *candidate* describes either the gap update sequence or the proposal\-screening outcomes\. The *decision* states the chosen maneuver and its immediate consequence for future motion\.

We instantiate two text realizations\. The *concise* variant keeps the schema short and quantitative, exposing *what* the planner decided at each step and yielding a stable supervision target with limited token overhead\. The *detailed* variant expands the same slots into a step\-by\-step narrative that exposes *how* each decision was reached through candidate filtering and rejection rationale\. Both variants share the same four\-slot semantics; they differ only in the depth of the decision trace exposed\. This lets us test whether performance gains come from the decision structure or the depth of the explanation\.

The target sequence is serialized as rule\-grounded reasoning first, followed by the future trajectory \(Eq\.[4](https://arxiv.org/html/2606.23938#S3.E4)\)\. Explicit delimiters are used in the implementation, but serialized targets exclude planner\-specific method names and simulator\-specific debug fields\. The reasoning sequence only preserves decision semantics that remain stable between teachers and scenarios\.
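As an illustration of the abstraction step, the sketch below maps two differently shaped planner states onto the same four slots and drops implementation artifacts along the way. All dictionary keys (`ego_speed`, `lead_gap`, `proposals`, and so on), the function names, and the sentence templates are hypothetical; only the four slot names come from Eq. 5.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ReasoningTrace:
    """Unified four-slot schema shared by all teachers (Eq. 5)."""

    scene: str
    constraint: str
    candidate: str
    decision: str

    def serialize(self) -> str:
        # Concise text realization; the exact wording is an assumption.
        return (
            f"Scene: {self.scene}. Constraint: {self.constraint}. "
            f"Candidate: {self.candidate}. Decision: {self.decision}."
        )


def abstract_idm(z: Dict[str, Any]) -> ReasoningTrace:
    """Map a hypothetical gap-control state to the schema."""
    return ReasoningTrace(
        scene=f"lane following at {z['ego_speed']:.1f} m/s",
        constraint=f"lead vehicle {z['lead_gap']:.1f} m ahead",
        candidate=f"desired gap {z['desired_gap']:.1f} m",
        decision="decelerate" if z["accel"] < 0 else "hold or accelerate",
    )


def abstract_pdm(z: Dict[str, Any]) -> ReasoningTrace:
    """Map a hypothetical proposal-based state to the same schema."""
    proposals = z["proposals"]
    safe = [p for p in proposals if p["safe"]]
    rejected = len(proposals) - len(safe)
    if safe:
        best = max(safe, key=lambda p: p["score"])
        decision = f"select {best['name']}"
    else:
        decision = "fallback"  # no proposal survived the safety filter
    return ReasoningTrace(
        scene=f"route following at {z['ego_speed']:.1f} m/s",
        constraint=f"{rejected} of {len(proposals)} proposals rejected by safety checks",
        candidate=f"{len(safe)} proposals scored",
        decision=decision,
    )
```

Keys that are never read, such as a loop counter in the raw state, simply do not reach the serialized text, which mirrors the stated goal of excluding planner-specific debug fields.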

## 4 Experiments

The goal of our experiments is to test whether rule\-grounded reasoning improves autonomous\-driving planning, rather than to rank rule\-based planners as standalone systems\. Because our supervision is extracted from simulator rollouts, all training, inference, and testing are conducted in simulation\. This gives us synchronized observations, planner traces, and target trajectories, and therefore a controlled setting for isolating the effect of reasoning depth and teacher selection\.

### 4\.1 Experimental Setup

We evaluate on the NSD\-Sim dataset, a simulator\-generated corpus built from 520 nuPlan scenarios and comprising 77,558 timestep\-level training examples\. Each example contains synchronized multi\-view images, ego history, mission context, a planner\-generated trajectory, and the corresponding planner trace\. We use a scenario\-level split to avoid timestep leakage, with 70% of scenarios for training, 15% for validation, and 15% for testing\. We evaluate two camera settings: a 3\-camera setting with front, front\-left, and front\-right views, and an 8\-camera setting with the full surrounding camera suite\. Following common planning evaluation protocols in autonomous driving\[[2](https://arxiv.org/html/2606.23938#bib.bib1),[3](https://arxiv.org/html/2606.23938#bib.bib2),[18](https://arxiv.org/html/2606.23938#bib.bib3),[22](https://arxiv.org/html/2606.23938#bib.bib4),[34](https://arxiv.org/html/2606.23938#bib.bib10),[20](https://arxiv.org/html/2606.23938#bib.bib13),[51](https://arxiv.org/html/2606.23938#bib.bib14)\], we evaluate the generated future trajectory against the planner\-generated target trajectory \(which serves as ground truth in our simulation setting\) using Average ADE, ADE/FDE at 1s/2s/3s, AHE/FHE at 3s, and Miss Rate at 3s\. ADE and FDE measure average and final displacement error, AHE \(Average Heading Error\) and FHE \(Final Heading Error\) measure heading error, and Miss Rate measures the fraction of generations that exceed the predefined trajectory error threshold\.
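For readers unfamiliar with these metrics, here is a minimal sketch of the displacement metrics in plain Python. The horizon is expressed as a number of waypoints because the sampling rate is not stated here, and the miss criterion (final displacement above 2.0 m) is an assumed placeholder: the text only says that a predefined threshold is used.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def displacement_errors(pred: Sequence[Point], target: Sequence[Point]) -> List[float]:
    """Per-waypoint Euclidean distance between prediction and target."""
    if len(pred) != len(target):
        raise ValueError("trajectories must have the same number of waypoints")
    return [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, target)]


def ade(pred: Sequence[Point], target: Sequence[Point], steps: int) -> float:
    """Average displacement error over the first `steps` waypoints."""
    if not 1 <= steps <= len(pred):
        raise ValueError("steps must be within the trajectory length")
    errors = displacement_errors(pred, target)[:steps]
    return sum(errors) / len(errors)


def fde(pred: Sequence[Point], target: Sequence[Point], steps: int) -> float:
    """Displacement error at waypoint number `steps` (1-indexed)."""
    if not 1 <= steps <= len(pred):
        raise ValueError("steps must be within the trajectory length")
    return displacement_errors(pred, target)[steps - 1]


def miss_rate(
    preds: Sequence[Sequence[Point]],
    targets: Sequence[Sequence[Point]],
    steps: int,
    threshold_m: float = 2.0,  # assumed value; the paper's threshold is not given
) -> float:
    """Fraction of samples whose final displacement exceeds the threshold."""
    misses = sum(fde(p, t, steps) > threshold_m for p, t in zip(preds, targets))
    return misses / len(preds)
```

Heading errors (AHE and FHE) would follow the same pattern with wrapped angular differences in place of Euclidean distances.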

We instantiate the driving backbone with Qwen3\.5\-4B\[[40](https://arxiv.org/html/2606.23938#bib.bib54)\] and fine\-tune all parameters with standard supervised learning\. Unless otherwise noted, all variants use the same optimizer, tokenizer, image preprocessing, maximum sequence length, and number of epochs\. The controlled variables are the camera setting, the reasoning level, and the planner source used to construct the training data\. We compare three supervision levels \(without reasoning, plus the concise and detailed variants defined in Section [3\.3](https://arxiv.org/html/2606.23938#S3.SS3)\) to test whether richer planner decision exposure improves trajectory generation, and whether gains come from decision structure or explanation depth\. Since all conditions share the same trajectory targets, any improvement reflects the supervisory signal rather than the reference motion\.

### 4\.2 Main Results: Camera Setting and Reasoning Depth

Table 1: Main planning results\. All rows use the full multi\-planner data construction pipeline\. The comparison isolates whether rule\-grounded reasoning depth improves planning performance under 3\-camera and 8\-camera perception settings\. Lower is better for all metrics\.

| Cameras | Reasoning level | Average ADE ↓ | ADE@1s ↓ | ADE@2s ↓ | ADE@3s ↓ | FDE@1s ↓ | FDE@2s ↓ | FDE@3s ↓ | AHE@3s ↓ | FHE@3s ↓ | MR@3s ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Three | Without reasoning | 0.32 | 0.22 | 0.33 | 0.47 | 0.21 | 0.47 | 0.68 | 1.99 | 1.85 | 8.30% |
| Three | Concise reasoning | 0.20 | 0.13 | 0.19 | 0.28 | 0.14 | 0.29 | 0.52 | 1.04 | 1.05 | 6.90% |
| Three | Detailed reasoning | 0.19 | 0.12 | 0.18 | 0.26 | 0.14 | 0.27 | 0.49 | 1.02 | 1.01 | 6.40% |
| Eight | Without reasoning | 0.36 | 0.24 | 0.31 | 0.54 | 0.28 | 0.44 | 0.88 | 1.59 | 1.48 | 10.13% |
| Eight | Concise reasoning | 0.20 | 0.13 | 0.18 | 0.27 | 0.13 | 0.28 | 0.49 | 1.03 | 0.97 | 6.50% |
| Eight | Detailed reasoning | 0.19 | 0.13 | 0.17 | 0.26 | 0.13 | 0.27 | 0.46 | 0.99 | 0.92 | 5.99% |

#### Main results\.

Table [1](https://arxiv.org/html/2606.23938#S4.T1) shows a clear and consistent trend: adding rule\-grounded reasoning improves planning quality under both perception settings, and richer reasoning consistently yields the best or tied results\. In the three\-camera setting, concise reasoning already produces a large gain over the no\-reasoning baseline, reducing Average ADE from 0\.32 to 0\.20, ADE@3s from 0\.47 to 0\.28, and MR@3s from 8\.30% to 6\.90%\. Detailed reasoning improves the same model further, lowering Average ADE to 0\.19, ADE@3s to 0\.26, FDE@3s to 0\.49, and MR@3s to 6\.40%\. This pattern suggests that rule\-grounded reasoning helps not only on long\-horizon failure cases, but also on the overall trajectory quality summarized by Average ADE\.

The eight\-camera setting reinforces the same pattern\. The no\-reasoning baseline degrades to 0\.36 Average ADE and 10\.13% MR@3s, showing that more views do not by themselves produce stable planning\. Once rule\-grounded reasoning is added, performance improves sharply: concise reasoning reduces Average ADE to 0\.20 and MR@3s to 6\.50%, while detailed reasoning reaches the best overall result with 0\.19 Average ADE, 0\.26 ADE@3s, 0\.46 FDE@3s, 0\.99 AHE@3s, 0\.92 FHE@3s, and 5\.99% MR@3s\. The gain does not come from a richer sensor suite; it comes from intermediate decision semantics that organize how perceptual evidence translates into behavior\.
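
The relative size of these gains can be checked with a few lines of arithmetic. The sketch below copies the ADE@3s and MR@3s values from Table 1 (no reasoning versus detailed reasoning); the dictionary layout and the helper name `relative_reduction` are illustrative choices, not part of the released code.

```python
# Values copied from Table 1: (without reasoning, detailed reasoning).
TABLE1 = {
    "three_cameras": {"ADE@3s": (0.47, 0.26), "MR@3s": (8.30, 6.40)},
    "eight_cameras": {"ADE@3s": (0.54, 0.26), "MR@3s": (10.13, 5.99)},
}


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Relative reduction for every (camera setting, metric) pair.
reductions = {
    setting: {metric: relative_reduction(*pair) for metric, pair in metrics.items()}
    for setting, metrics in TABLE1.items()
}

for setting, metrics in reductions.items():
    for metric, value in metrics.items():
        print(f"{setting} {metric}: {value:.1%}")
```

Under three cameras this gives roughly a 45% reduction in ADE@3s and a 23% reduction in MR@3s; under eight cameras the corresponding reductions are roughly 52% and 41%.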

Comparing concise and detailed reasoning is also informative\. Concise reasoning, which exposes *what* the planner decided at each step, already captures most of the improvement, indicating that active constraints and selected maneuvers carry the highest\-value supervisory signal\. Detailed reasoning, which additionally exposes *how* each decision was reached through candidate filtering and rejection rationale, still brings a further, repeatable gain on long\-horizon displacement, heading, and miss\-rate metrics, implying that step\-by\-step decision semantics provide additional information beyond the final maneuver label\. Taken together, these results support the central thesis: reasoning helps not because it makes the model more verbose, but because rule\-grounded reasoning makes the planning target easier to align with and learn from\.

Figure [2](https://arxiv.org/html/2606.23938#S3.F2) gives a qualitative view of the same trend\. The generated reasoning trace exposes the scene\-level constraint and maneuver choice before the model emits future waypoints, and the top\-down overlay makes it possible to inspect whether the planned motion is consistent with that reasoning\.

### 4\.3 Ablation: Planner Sources and Data Organization

Table 2: Planner\-source ablation\. All rows use detailed rule\-grounded reasoning\. The comparison varies only which rule\-based teacher supplies reasoning traces and trajectories, and how multiple teachers are selected or organized\. Lower is better for all metrics\.

| Cameras | Teacher source | Selection policy | Average ADE ↓ | ADE@1s ↓ | ADE@2s ↓ | ADE@3s ↓ | FDE@1s ↓ | FDE@2s ↓ | FDE@3s ↓ | AHE@3s ↓ | FHE@3s ↓ | MR@3s ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Three | IDM | single teacher | 0\.31 | 0\.21 | 0\.25 | 0\.41 | 0\.27 | 0\.49 | 0\.86 | 1\.70 | 1\.72 | 10\.21% |
| Three | PDM\-Closed | single teacher | 0\.28 | 0\.15 | 0\.26 | 0\.40 | 0\.22 | 0\.39 | 0\.57 | 1\.30 | 1\.39 | 8\.47% |
| Three | IDM\-MOBIL | single teacher | 3\.17 | 3\.04 | 3\.45 | 3\.60 | 2\.62 | 3\.61 | 6\.02 | 6\.65 | 4\.81 | 30\.16% |
| Three | All Planners | random teacher per scene | 0\.22 | 0\.15 | 0\.20 | 0\.28 | 0\.15 | 0\.30 | 0\.54 | 1\.02 | 1\.07 | 6\.27% |
| Three | All Planners | scenario\-aware selection | 0\.19 | 0\.12 | 0\.18 | 0\.26 | 0\.14 | 0\.27 | 0\.49 | 1\.02 | 1\.01 | 6\.40% |
| Eight | IDM | single teacher | 0\.29 | 0\.21 | 0\.27 | 0\.40 | 0\.24 | 0\.47 | 0\.78 | 1\.67 | 1\.64 | 10\.32% |
| Eight | PDM\-Closed | single teacher | 0\.24 | 0\.16 | 0\.22 | 0\.34 | 0\.18 | 0\.35 | 0\.59 | 1\.26 | 1\.19 | 7\.39% |
| Eight | IDM\-MOBIL | single teacher | 3\.13 | 2\.62 | 3\.08 | 3\.70 | 2\.76 | 3\.84 | 5\.35 | 6\.12 | 5\.25 | 28\.91% |
| Eight | All Planners | random teacher per scene | 0\.20 | 0\.13 | 0\.19 | 0\.27 | 0\.14 | 0\.28 | 0\.51 | 1\.06 | 0\.97 | 6\.22% |
| Eight | All Planners | scenario\-aware selection | 0\.19 | 0\.13 | 0\.17 | 0\.26 | 0\.13 | 0\.27 | 0\.46 | 0\.99 | 0\.92 | 5\.99% |

#### Planner\-source ablation\.

The single\-teacher rows show that no rule\-based planner is a universally reliable source of supervision\. PDM\-Closed is stronger than IDM on most metrics in both camera settings, while IDM\-MOBIL performs poorly when used as the only teacher\. This result is expected rather than a data bug: IDM\-MOBIL is primarily designed for car\-following with lane\-change incentives and nearby\-agent interaction, but it has limited ability to handle route\-level turning, intersection traversal, and many traffic\-control cases\. When forced to supervise the entire benchmark, it produces very large errors on those out\-of\-scope scenes\. Its value appears in a narrower set of lateral\-maneuver and interaction\-heavy scenarios, which is why we use it as a specialized teacher inside the multi\-planner mixture rather than as a universal teacher\.

Multi\-teacher data is consistently better than any single\-teacher setting: with three cameras, random teacher per scene reduces MR@3s from 8\.47% for the best single teacher to 6\.27%; with eight cameras, scenario\-aware selection reaches the best overall result with 0\.19 Average ADE and 5\.99% MR@3s\. The comparison between the two multi\-planner policies is also informative\. In the random\-teacher setting, each scene still contributes only one retained training example: we first form a pool of valid rollouts from the three planners, then randomly choose one teacher trace and trajectory for that scene\. Thus, the dataset size does not grow relative to the single\-teacher and scenario\-aware settings, and the gain cannot be attributed to a simple increase in supervision volume\. Scenario\-aware selection is the cleaner policy for our main claim because it assigns one default teacher to each scenario family according to closed\-loop reliability, reducing conflicting targets while preserving the same dataset scale\. This supports our design choice of using planner capability estimates to decide which reasoning trace and trajectory should supervise each scenario family\.
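
The two multi\-planner policies can be sketched as follows. This is a minimal illustration, not the released pipeline: the `Scene` record, its field names, and the rule that scenario\-aware selection skips a scene whose default teacher has no valid rollout are all assumptions made for the example. The family names and their default teachers are taken from Table 4.

```python
import random
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Scene:
    """Hypothetical per-scene record; field names are illustrative only."""
    scene_id: str
    family: str                     # scenario family, e.g. "starting_left_turn"
    valid_teachers: FrozenSet[str]  # teachers whose rollout passed the filter


def random_teacher_per_scene(scenes: List[Scene], seed: int = 0) -> List[Tuple[str, str]]:
    """Keep one example per scene, drawn uniformly from its valid-teacher pool."""
    rng = random.Random(seed)
    dataset = []
    for scene in scenes:
        pool = sorted(scene.valid_teachers)  # sorted so the draw is reproducible
        if pool:
            dataset.append((scene.scene_id, rng.choice(pool)))
    return dataset


def scenario_aware_selection(
    scenes: List[Scene], default_teacher: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Keep one example per scene, always from the family's default teacher.

    Assumption: a scene is skipped when the default teacher has no valid rollout.
    """
    dataset = []
    for scene in scenes:
        teacher = default_teacher.get(scene.family)
        if teacher in scene.valid_teachers:
            dataset.append((scene.scene_id, teacher))
    return dataset


scenes = [
    Scene("s1", "starting_left_turn", frozenset({"IDM", "PDM-Closed"})),
    Scene("s2", "following_lane_with_slow_lead", frozenset({"IDM", "IDM-MOBIL", "PDM-Closed"})),
    Scene("s3", "starting_left_turn", frozenset({"PDM-Closed"})),
]
DEFAULT_TEACHER = {
    "starting_left_turn": "PDM-Closed",
    "following_lane_with_slow_lead": "IDM-MOBIL",
}

random_ds = random_teacher_per_scene(scenes, seed=0)
aware_ds = scenario_aware_selection(scenes, DEFAULT_TEACHER)
```

Both functions emit at most one `(scene, teacher)` pair per scene, which is the property the text relies on: neither policy enlarges the dataset relative to a single\-teacher setting.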

### 4\.4 Quality of the Rule\-based Planner Teachers

Figure [5](https://arxiv.org/html/2606.23938#S4.F5) reports the closed\-loop quality of the rule\-based teachers used to construct rule\-grounded supervision over the 59 scenario families observed in our dataset\. We use the Composite Closed\-Loop Score Reactive \(CCLS\-R\) to compare planners across scenario families\. For each rollout, CCLS\-R combines the major closed\-loop safety and progress terms:

$$
\begin{aligned}
\mathrm{CCLS\text{-}R} :={} & 0.35\,C_{\mathrm{collision}} + 0.25\,C_{\mathrm{drivable}} + 0.15\,C_{\mathrm{ttc}} \\
& + 0.10\,C_{\mathrm{direction}} + 0.10\,C_{\mathrm{comfort}} + 0.05\,\min(P_{\mathrm{progress}}, 1).
\end{aligned}
\tag{6}
$$

This metric is intentionally broader than a binary success rate: it captures whether a planner produces usable closed\-loop behavior even when nuPlan’s strict aggregate score is low\. For data construction, CCLS\-R ranks teachers within each scenario family\. We then use the best\-CCLS\-R teacher as the default source of reasoning traces and trajectories, after applying the simulator route\-success filter\. A rollout is retained for supervision only if it completes the full planned trajectory without collision or off\-road violation\. We do not apply an additional global CCLS\-R threshold; CCLS\-R guides teacher selection, while the route\-success filter determines whether a rollout is kept for VLA training\.
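
A minimal sketch of the score and the two\-stage use described here (rank teachers by CCLS\-R, then keep only route\-successful rollouts from the default teacher) might look like the following. The `Rollout` record and its field names are hypothetical, the component terms are assumed to lie in \[0, 1\], and ranking by the *mean* CCLS\-R over a family's rollouts is an assumption; the text only states that CCLS\-R ranks teachers within each family.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# Weights from Eq. (6); they sum to 1.
WEIGHTS = {
    "collision": 0.35, "drivable": 0.25, "ttc": 0.15,
    "direction": 0.10, "comfort": 0.10, "progress": 0.05,
}


@dataclass(frozen=True)
class Rollout:
    """Hypothetical closed-loop rollout summary; field names are illustrative."""
    teacher: str
    family: str
    collision: float
    drivable: float
    ttc: float
    direction: float
    comfort: float
    progress: float
    route_success: bool  # full trajectory, no collision or off-road violation


def ccls_r(r: Rollout) -> float:
    """Weighted sum of Eq. (6); progress is clipped at 1 before weighting."""
    return (
        WEIGHTS["collision"] * r.collision
        + WEIGHTS["drivable"] * r.drivable
        + WEIGHTS["ttc"] * r.ttc
        + WEIGHTS["direction"] * r.direction
        + WEIGHTS["comfort"] * r.comfort
        + WEIGHTS["progress"] * min(r.progress, 1.0)
    )


def best_teacher_per_family(rollouts: List[Rollout]) -> Dict[str, str]:
    """Rank teachers inside each family by mean CCLS-R (aggregation is assumed)."""
    scores: Dict[str, Dict[str, List[float]]] = {}
    for r in rollouts:
        scores.setdefault(r.family, {}).setdefault(r.teacher, []).append(ccls_r(r))
    return {
        family: max(by_teacher, key=lambda t: mean(by_teacher[t]))
        for family, by_teacher in scores.items()
    }


def retained_for_training(rollouts: List[Rollout], default: Dict[str, str]) -> List[Rollout]:
    """Keep rollouts from the default teacher that pass the route-success filter."""
    return [r for r in rollouts if r.route_success and default[r.family] == r.teacher]


rollouts = [
    Rollout("PDM-Closed", "on_intersection", 1, 1, 1, 1, 1, 1.2, True),
    Rollout("IDM", "on_intersection", 1, 1, 0, 1, 0.5, 0.4, True),
    Rollout("PDM-Closed", "on_intersection", 0, 1, 1, 1, 1, 1.0, False),
]
default = best_teacher_per_family(rollouts)
kept = retained_for_training(rollouts, default)
```

Note that no CCLS\-R threshold appears in `retained_for_training`: the score only chooses the teacher, and the boolean route\-success flag decides whether an individual rollout is kept.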

![Refer to caption](https://arxiv.org/html/2606.23938v1/x5.png)

Figure 5: Closed\-loop teacher quality\. Each indexed column is a shuffled scenario family in the dataset, and each row is one teacher planner\. Color indicates CCLS\-R \(higher is better\)\. Red boxes mark the highest\-CCLS\-R teacher per scenario, guiding scenario\-aware selection of the planner trace and trajectory for VLA supervision\. Scenario\-index mappings are listed in Appendix [A\.2](https://arxiv.org/html/2606.23938#A1.SS2)\.

Figures [3](https://arxiv.org/html/2606.23938#S3.F3) and [5](https://arxiv.org/html/2606.23938#S4.F5) support a mixture\-of\-teachers strategy\. PDM\-Closed is the strongest generalist in most intersection, turn, and traffic\-light scenarios; IDM\-MOBIL is more competitive in lateral maneuver and static\-obstacle cases; and IDM remains useful in conservative crawl\-and\-yield settings where partial closed\-loop credit matters\. Thus, the supervising planner is not chosen arbitrarily: closed\-loop evidence determines which teacher is most reliable per scenario family\.

## 5 Conclusion

We introduced Neuro\-Symbolic Drive, a neuro\-symbolic framework that grounds vision\-language driving models in executable planner semantics rather than post\-hoc rationalization\. The core insight is that rule\-based planners already encode rich decision\-making logic: constraint activation, proposal filtering, safety validation, and outcome ranking\. This internal execution trace can be repurposed as faithful reasoning supervision for training driving VLAs\. By instrumenting three complementary planners and routing each scenario to its most capable teacher, mirroring the L4 hierarchy of specialized planners per scenario family, we ensure that reasoning targets stay synchronized with the trajectories they explain\. Empirically, detailed rule\-grounded reasoning reduces ADE@3s by 45% and miss rate by 23% under three\-camera perception, and brings an unstable eight\-camera baseline to parity with the three\-camera setting\. Executable decision semantics are a strong supervision signal; more broadly, any expert system with an executable trace, such as a constraint solver or rule engine, is a candidate supervision source\.

#### Limitations\.

Our study operates in simulation: supervision is synthetic, derived from rule\-based planner rollouts, and the VLA is evaluated in open\-loop rather than deployed in a real\-world closed loop\. This is intentional: rule\-based planner execution traces are not preserved in passively logged sensor data and can only be captured by running the planners in simulation\. The resulting model also inherits the capability boundaries of its teacher planners, although our scenario\-aware teacher selection is designed precisely to reduce the impact of any single planner’s blind spots\. Finally, detailed reasoning increases sequence length and inference cost; in practice, the concise trace already captures most of the gain, suggesting that future systems can trade off interpretability depth and latency depending on deployment needs\.

#### Broader impact\.

Rule\-grounded reasoning improves the *debuggability* and *auditability* of driving VLAs: developers and regulators can trace each reasoning step the model processes rather than relying on opaque outputs\. However, readable reasoning does not guarantee correct behavior \[[11](https://arxiv.org/html/2606.23938#bib.bib60), [24](https://arxiv.org/html/2606.23938#bib.bib61)\], due to rule\-based planner blind spots and simulation\-to\-real gaps\. We position Neuro\-Symbolic Drive as a tool for training, diagnosis, and model selection, to be combined with formal verification and closed\-loop evaluation before safety\-critical deployment\.

## References

- \[1\]\(2024\)QuAD: query\-based interpretable neural motion planning for autonomous driving\.In2024 IEEE International Conference on Robotics and Automation \(ICRA\),pp\. 14236–14243\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1)\.
- \[2\]H\. Caesar, V\. Bankiti, A\. H\. Lang, S\. Vora, V\. E\. Liong, Q\. Xu, A\. Krishnan, Y\. Pan, G\. Baldan, and O\. Beijbom\(2020\)Nuscenes: a multimodal dataset for autonomous driving\.InCVPR,Cited by:[§A\.7](https://arxiv.org/html/2606.23938#A1.SS7.p1.1),[§4\.1](https://arxiv.org/html/2606.23938#S4.SS1.p1.1)\.
- \[3\]H\. Caesar, J\. Kabzan, K\. S\. Tan, W\. K\. Fong, E\. Wolff, A\. Lang, L\. Fletcher, O\. Beijbom, and S\. Omari\(2021\)NuPlan: a closed\-loop ml\-based planning benchmark for autonomous vehicles\.arXiv preprint arXiv:2106\.11810\.Cited by:[§A\.7](https://arxiv.org/html/2606.23938#A1.SS7.p1.1),[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1),[§4\.1](https://arxiv.org/html/2606.23938#S4.SS1.p1.1)\.
- \[4\]S\. Casas, A\. Sadat, and R\. Urtasun\(2021\)MP3: a unified model to map, perceive, predict and plan\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 14403–14412\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1)\.
- \[5\]L\. Chen, O\. Sinavski, J\. Hünermann, A\. Karnsund, A\. J\. Willmott, D\. Birch, D\. Maund, and J\. Shotton\(2023\)Driving with llms: fusing object\-level vector modality for explainable autonomous driving\.arXiv preprint arXiv:2310\.01957\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.
- \[6\]C\. Cui, Y\. Ma, Z\. Yang, Y\. Zhou, P\. Liu, J\. Lu, L\. Li, Y\. Chen, J\. H\. Panchal, A\. Abdelraouf,et al\.\(2024\)Large language models for autonomous driving \(llm4ad\): concept, benchmark, experiments, and challenges\.arXiv preprint arXiv:2410\.15281\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.
- \[7\]D\. Dauner, M\. Hallgarten, A\. Geiger, and K\. Chitta\(2023\)Parting with misconceptions about learning\-based vehicle motion planning\.InProceedings of The 7th Conference on Robot Learning,Proceedings of Machine Learning Research, Vol\.229,pp\. 1268–1281\.Cited by:[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1),[§3\.2](https://arxiv.org/html/2606.23938#S3.SS2.p1.1)\.
- \[8\]P\. de Haan, D\. Jayaraman, and S\. Levine\(2019\)Causal confusion in imitation learning\.InAdvances in Neural Information Processing Systems,Vol\.32\.Cited by:[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[9\]DeepSeek\-AI, D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma,et al\.\(2025\)DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.Nature645,pp\. 633–638\.Note:arXiv:2501\.12948External Links:[Document](https://dx.doi.org/10.1038/s41586-025-09422-z)Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[10\]H\. Fu, D\. Zhang, Z\. Zhao, J\. Cui, D\. Liang, C\. Zhang, D\. Zhang, H\. Xie, B\. Wang, and X\. Bai\(2025\)ORION: a holistic end\-to\-end autonomous driving framework by vision\-language instructed action generation\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 24823–24834\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[11\]X\. Gao, T\. Lin, R\. Song, Y\. Wu, K\. Huang, Z\. Jin, F\. Lin, S\. Liu, and Z\. Tu\(2025\)Safecoop: unravelling full stack safety in agentic collaborative driving\.arXiv preprint arXiv:2510\.18123\.Cited by:[§5](https://arxiv.org/html/2606.23938#S5.SS0.SSS0.Px2.p1.1)\.
- \[12\]X\. Gao, K\. Wu, H\. Zhang, K\. Tian, Y\. Zhou, and Z\. Tu\(2025\)Automated vehicles should be connected with natural language\.arXiv preprint arXiv:2507\.01059\.Cited by:[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.
- \[13\]X\. Gao, Y\. Wu, R\. Wang, C\. Liu, Y\. Zhou, and Z\. Tu\(2025\)Langcoop: collaborative driving with language\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 4226–4237\.Cited by:[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.
- \[14\]A\. Ghosh, S\. Narasimhan, M\. Chandraker, and F\. Pittaluga\(2026\)RAD\-lad: rule and language grounded autonomous driving in real\-time\.arXiv preprint arXiv:2603\.28522\.Cited by:[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1)\.
- \[15\]M\. Godbole, X\. Gao, and Z\. Tu\(2025\)Drama\-x: a fine\-grained intent prediction and risk reasoning benchmark for driving\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 815–820\.Cited by:[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.
- \[16\]M\. Hallgarten, J\. Zapata, M\. Stoll, K\. Renz, and A\. Zell\(2024\)Can vehicle motion planning generalize to realistic long\-tail scenarios?\.In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\),Vol\.,pp\. 5388–5395\.External Links:[Document](https://dx.doi.org/10.1109/IROS58592.2024.10803052)Cited by:[§3\.2](https://arxiv.org/html/2606.23938#S3.SS2.p1.1)\.
- \[17\]D\. Hegde, R\. Yasarla, H\. Cai, S\. Han, A\. Bhattacharyya, S\. Mahajan, L\. Liu, R\. Garrepalli, V\. M\. Patel, and F\. Porikli\(2025\)Distilling multi\-modal large language models for autonomous driving\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 27575–27585\.Cited by:[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1)\.
- \[18\]Y\. Hu, J\. Yang, L\. Chen, K\. Li, C\. Sima, X\. Zhu, S\. Chai, S\. Du, T\. Lin, W\. Wang,et al\.\(2023\)Planning\-oriented autonomous driving\.InCVPR,Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1),[§4\.1](https://arxiv.org/html/2606.23938#S4.SS1.p1.1)\.
- \[19\]Y\. Huang, J\. Sansom, Z\. Ma, F\. Gervits, and J\. Chai\(2024\)DriVLMe: enhancing llm\-based autonomous driving agents with embodied and social experiences\.In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\),pp\. 3153–3160\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.
- \[20\]J\. Hwang, R\. Xu, H\. Lin, W\. Hung, J\. Ji, K\. Choi, D\. Huang, T\. He, P\. Covington, B\. Sapp,et al\.\(2024\)EMMA: end\-to\-end multimodal model for autonomous driving\.arXiv preprint arXiv:2410\.23262\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1),[§4\.1](https://arxiv.org/html/2606.23938#S4.SS1.p1.1)\.
- \[21\]B\. Jiang, S\. Chen, B\. Liao, X\. Zhang, W\. Yin, Q\. Zhang, C\. Huang, W\. Liu, and X\. Wang\(2024\)Senna: bridging large vision\-language models and end\-to\-end autonomous driving\.arXiv preprint arXiv:2410\.22313\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.
- \[22\]B\. Jiang, S\. Chen, Q\. Xu, B\. Liao, J\. Chen, H\. Zhou, Q\. Zhang, W\. Liu, C\. Huang, and X\. Wang\(2023\)VAD: vectorized scene representation for efficient autonomous driving\.InICCV,Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1),[§4\.1](https://arxiv.org/html/2606.23938#S4.SS1.p1.1)\.
- \[23\]N\. Karnchanachari, D\. Geromichalos, K\. S\. Tan, N\. Li, C\. Eriksen, S\. Yaghoubi, N\. Mehdipour, G\. Bernasconi, W\. K\. Fong, Y\. Guo,et al\.\(2024\)Towards learning\-based planning: the nuplan benchmark for real\-world autonomous driving\.In2024 IEEE International Conference on Robotics and Automation \(ICRA\),pp\. 629–636\.Cited by:[§3\.2](https://arxiv.org/html/2606.23938#S3.SS2.p1.1)\.
- \[24\]J\. Kim, W\. Chen, D\. Soleymanzadeh, Y\. Ding, X\. Gao, Z\. Tu, R\. Zhang, F\. Fei, S\. Veer, Y\. Lyu,et al\.\(2026\)Modular safety guardrails are necessary for foundation\-model\-enabled robots in the real world\.arXiv preprint arXiv:2602\.04056\.Cited by:[§5](https://arxiv.org/html/2606.23938#S5.SS0.SSS0.Px2.p1.1)\.
- \[25\]T\. Lanham, A\. Chen, A\. Radhakrishnan, B\. Steiner, C\. Denison, D\. Hernandez, D\. Li, E\. Durmus, E\. Hubinger, J\. Kernion,et al\.\(2023\)Measuring faithfulness in chain\-of\-thought reasoning\.arXiv preprint arXiv:2307\.13702\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[26\]Y\. Li, M\. Tian, D\. Zhu, J\. Zhu, Z\. Lin, Z\. Xiong, and X\. Zhao\(2025\)Drive\-r1: bridging reasoning and planning in vlms for autonomous driving with reinforcement learning\.arXiv preprint arXiv:2506\.18234\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p2.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[27\]Z\. Li, K\. Li, S\. Wang, S\. Lan, Z\. Yu, Y\. Ji, Z\. Li, Z\. Zhu, J\. Kautz, Z\. Wu, Y\. Jiang, and J\. M\. Álvarez\(2024\)Hydra\-mdp: end\-to\-end multimodal planning with multi\-target hydra\-distillation\.arXiv preprint arXiv:2406\.06978\.Cited by:[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1)\.
- \[28\]H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe\(2024\)Let’s verify step by step\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=v8L0pN6EOi)Cited by:[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[29\]Y\. Ma, C\. Cui, X\. Cao, W\. Ye, P\. Liu, J\. Lu, A\. Abdelraouf, R\. Gupta, K\. Han, A\. Bera,et al\.\(2024\)LaMPilot: an open benchmark dataset for autonomous driving with language model programs\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 15141–15151\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.
- \[30\]M\. Nie, R\. Peng, C\. Wang, X\. Cai, J\. Han, H\. Xu, and L\. Zhang\(2024\)Reason2Drive: towards interpretable and chain\-based reasoning for autonomous driving\.InEuropean Conference on Computer Vision,pp\. 292–308\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[31\]OpenAI\(2024\)Introducing openai o1\.Note:[https://openai\.com/o1/](https://openai.com/o1/)Cited by:[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[32\]K\. Renz, L\. Chen, E\. Arani, and O\. Sinavski\(2025\)SimLingo: vision\-only closed\-loop autonomous driving with language\-action alignment\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 11993–12003\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[33\]H\. Sha, Y\. Mu, Y\. Jiang, L\. Chen, C\. Xu, P\. Luo, S\. E\. Li, M\. Tomizuka, W\. Zhan, and M\. Ding\(2023\)LanguageMPC: large language models as decision makers for autonomous driving\.arXiv preprint arXiv:2310\.03026\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.
- \[34\]H\. Shao, Y\. Hu, L\. Wang, G\. Song, S\. L\. Waslander, Y\. Liu, and H\. Li\(2024\)LMDrive: closed\-loop end\-to\-end driving with large language models\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 15120–15130\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.23938#S4.SS1.p1.1)\.
- \[35\]M\. Shlapentokh\-Rothman, Y\. Wang, and D\. Hoiem\(2025\)Visual program distillation with template\-based augmentation\.InFindings of the Association for Computational Linguistics: EMNLP 2025,Cited by:[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1)\.
- \[36\]C\. Sima, K\. Renz, K\. Chitta, L\. Chen, H\. Zhang, C\. Xie, J\. Beißwenger, P\. Luo, A\. Geiger, and H\. Li\(2024\)DriveLM: driving with graph visual question answering\.InEuropean Conference on Computer Vision,pp\. 256–274\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[37\]X\. Song, S\. Huai, J\. Jiang, J\. Kong, and J\. Luo\(2025\)More than meets the eye? uncovering the reasoning\-planning disconnect in training vision\-language driving models\.arXiv preprint arXiv:2510\.04532\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[38\]J\. Tang, Z\. Zhou, Z\. He, J\. Zhang, K\. Zhang, and J\. Pu\(2026\)CausalVAD: de\-confounding end\-to\-end autonomous driving via causal intervention\.arXiv preprint arXiv:2603\.18561\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[39\]X\. Tao, P\. Taghavi, D\. Filev, R\. Langari, and G\. Pandey\(2026\)NaviDriveVLM: decoupling high\-level reasoning and motion planning for autonomous driving\.arXiv preprint arXiv:2603\.07901\.Cited by:[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1),[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1)\.
- \[40\]Q\. Team\(2026\-02\)Qwen3\.5: accelerating productivity with native multimodal agents\.External Links:[Link](https://qwen.ai/blog?id=qwen3.5)Cited by:[§4\.1](https://arxiv.org/html/2606.23938#S4.SS1.p2.1)\.
- \[41\]X\. Tian, J\. Gu, B\. Li, Y\. Liu, Y\. Wang, Z\. Zhao, K\. Zhan, P\. Jia, X\. Lang, and H\. Zhao\(2025\)DriveVLM: the convergence of autonomous driving and large vision\-language models\.InProceedings of The 8th Conference on Robot Learning,Proceedings of Machine Learning Research, Vol\.270,pp\. 4698–4726\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[42\]M\. Treiber and A\. Kesting\(2009\)Modeling lane\-changing decisions with mobil\.InTraffic and Granular Flow ’07,C\. Appert\-Rolland, F\. Chevoir, P\. Gondret, S\. Lassarre, J\. Lebacque, and M\. Schreckenberg \(Eds\.\),Berlin, Heidelberg,pp\. 211–221\.External Links:ISBN 978\-3\-540\-77074\-9Cited by:[§3\.2](https://arxiv.org/html/2606.23938#S3.SS2.p1.1)\.
- \[43\]M\. Turpin, J\. Michael, E\. Perez, and S\. R\. Bowman\(2023\)Language models don’t always say what they think: unfaithful explanations in chain\-of\-thought prompting\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[44\]H\. Wang, P\. Cai, Y\. Sun, L\. Wang, and M\. Liu\(2021\)Learning interpretable end\-to\-end vision\-based motion planning for autonomous driving with optical flow distillation\.In2021 IEEE International Conference on Robotics and Automation \(ICRA\),pp\. 13731–13737\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1)\.
- \[45\]T\. Wang, E\. Xie, R\. Chu, Z\. Li, and P\. Luo\(2024\)DriveCoT: integrating chain\-of\-thought reasoning with end\-to\-end driving\.arXiv preprint arXiv:2403\.16996\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1),[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1)\.
- \[46\]W\. Wang, J\. Xie, C\. Hu, H\. Zou, J\. Fan, W\. Tong, Y\. Wen, S\. Wu, H\. Deng, Z\. Li,et al\.\(2023\)DriveMLM: aligning multi\-modal large language models with behavioral planning states for autonomous driving\.arXiv preprint arXiv:2312\.09245\.Cited by:[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1),[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1)\.
- \[47\]Y\. Wang, W\. Luo, J\. Bai, Y\. Cao, T\. Che, K\. Chen, Y\. Chen, J\. Diamond, Y\. Ding, W\. Ding,et al\.\(2025\)Alpamayo\-r1: bridging reasoning and action prediction for generalizable autonomous driving in the long tail\.arXiv preprint arXiv:2511\.00088\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p3.1),[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[48\]Y\. Wang, S\. Xing, C\. Can, R\. Li, H\. Hua, K\. Tian, Z\. Mo, X\. Gao, K\. Wu, S\. Zhou,et al\.\(2025\)Generative ai for autonomous driving: frontiers and opportunities\.arXiv preprint arXiv:2505\.08854\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1)\.
- \[49\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, b\. ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 24824–24837\.Cited by:[§2\.2](https://arxiv.org/html/2606.23938#S2.SS2.p1.1)\.
- \[50\]S\. Xing, H\. Hua, X\. Gao, S\. Zhu, R\. Li, K\. Tian, X\. Li, H\. Huang, T\. Yang, Z\. Wang,et al\.\(2024\)Autotrust: benchmarking trustworthiness in large vision language models for autonomous driving\.arXiv preprint arXiv:2412\.15206\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1)\.
- \[51\]S\. Xing, C\. Qian, Y\. Wang, H\. Hua, K\. Tian, Y\. Zhou, and Z\. Tu\(2025\)OpenEMMA: open\-source multimodal model for end\-to\-end autonomous driving\.InProceedings of the Winter Conference on Applications of Computer Vision,pp\. 1001–1009\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.23938#S4.SS1.p1.1)\.
- \[52\]Z\. Xu, Y\. Zhang, E\. Xie, Z\. Zhao, Y\. Guo, K\. K\. Y\. Wong, Z\. Li, and H\. Zhao\(2023\)DriveGPT4: interpretable end\-to\-end autonomous driving via large language model\.arXiv preprint arXiv:2310\.01412\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.
- \[53\]Z\. Yang, X\. Jia, H\. Li, and J\. Yan\(2023\)LLM4Drive: a survey of large language models for autonomous driving\.arXiv preprint arXiv:2311\.01043\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.
- \[54\]Y\. Ye, Z\. Zhang, J\. Lin, S\. Sun, C\. Peng, and W\. Gao\(2026\)AutoDrive\-P3: unified chain of perception–prediction–planning thought via reinforcement fine\-tuning\.InInternational Conference on Learning Representations,Cited by:[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.
- \[55\]J\. Yuan, S\. Sun, D\. Omeiza, B\. Zhao, P\. Newman, L\. Kunze, and M\. Gadd\(2024\)RAG\-driver: generalisable driving explanations with retrieval\-augmented in\-context learning in multi\-modal large language model\.InRobotics: Science and Systems,Cited by:[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.
- \[56\]W\. Zeng, W\. Luo, S\. Suo, A\. Sadat, B\. Yang, S\. Casas, and R\. Urtasun\(2019\)End\-to\-end interpretable neural motion planner\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 8660–8669\.Cited by:[§1](https://arxiv.org/html/2606.23938#S1.p1.1),[§2\.3](https://arxiv.org/html/2606.23938#S2.SS3.p1.1)\.
- \[57\]Z\. Zhou, T\. Cai, S\. Z\. Zhao, Y\. Zhang, Z\. Huang, B\. Zhou, and J\. Ma\(2025\)AutoVLA: a vision\-language\-action model for end\-to\-end autonomous driving with adaptive reasoning and reinforcement fine\-tuning\.InAdvances in Neural Information Processing Systems,Cited by:[§2\.1](https://arxiv.org/html/2606.23938#S2.SS1.p1.1)\.

## Appendix A

### A\.1 Trace Schema and Teacher\-Specific Signals

Table [3](https://arxiv.org/html/2606.23938#A1.T3) summarizes the planner\-state fields retained before text serialization\. The goal is not to expose implementation\-specific debug variables, but to preserve the decision evidence that shaped the selected trajectory\.

Table 3: Trace fields retained for reasoning serialization\.

| Teacher | Raw trace fields | Serialized slots |
| --- | --- | --- |
| IDM | route context, lead\-vehicle state, headway, desired gap, relative speed, acceleration update | scene context, active longitudinal constraint, gap status, final speed decision |
| IDM\-MOBIL | IDM state, adjacent\-lane interaction, lane\-change incentive, gap acceptance, nearby\-agent response | scene context, interactive agents, lateral candidate, selected maneuver |
| PDM\-Closed | proposal bank, safety\-gate outcomes, progress and comfort scores, selected proposal, fallback status | candidate summary, rejection reasons, score evidence, selected trajectory |

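
One way to picture the serialization step is as a per\-teacher slot list that fixes which fields appear in the reasoning text and in what order. The sketch below encodes the "Serialized slots" column of Table 3; the `name: value` line format and the function name `serialize_trace` are assumptions for illustration and are not taken from the released code.

```python
from typing import Dict, List

# "Serialized slots" column of Table 3, in the order listed there.
SERIALIZED_SLOTS: Dict[str, List[str]] = {
    "IDM": [
        "scene context", "active longitudinal constraint",
        "gap status", "final speed decision",
    ],
    "IDM-MOBIL": [
        "scene context", "interactive agents",
        "lateral candidate", "selected maneuver",
    ],
    "PDM-Closed": [
        "candidate summary", "rejection reasons",
        "score evidence", "selected trajectory",
    ],
}


def serialize_trace(teacher: str, slots: Dict[str, str]) -> str:
    """Render a teacher's slots as ordered text lines (format is hypothetical)."""
    expected = SERIALIZED_SLOTS[teacher]
    missing = [name for name in expected if name not in slots]
    if missing:
        raise ValueError(f"{teacher} trace is missing slots: {missing}")
    # Slots outside the teacher's schema are dropped rather than serialized.
    return "\n".join(f"{name}: {slots[name]}" for name in expected)


example = serialize_trace(
    "IDM",
    {
        "scene context": "straight lane, lead vehicle ahead",
        "active longitudinal constraint": "maintain desired gap",
        "gap status": "headway below desired gap",
        "final speed decision": "decelerate",
        "debug_counter": "17",  # implementation detail, intentionally not emitted
    },
)
```

Keeping the schema explicit makes it easy to check that implementation\-specific debug variables never leak into the supervision text.
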
### A\.2 Closed\-Loop Scenario Index Mapping

Figure [5](https://arxiv.org/html/2606.23938#S4.F5) uses compact indices to keep the 59\-scenario closed\-loop teacher comparison readable\. Table 4 lists the shuffled index assignment, the scenario family, the selected teacher, and the corresponding best CCLS\-R value\. Missing teacher entries in the source CSV are filled with deterministic low values in \[0, 0\.25\] for visualization only; the mapping table records the teacher selected after that visualization\-time fill\.

Table 4: Scenario\-index mapping for the closed\-loop teacher\-quality heatmap\.

| Index | Scenario family | Selected teacher | CCLS\-R |
| --- | --- | --- | --- |
| S01 | low\_magnitude\_speed | IDM | 0\.378 |
| S02 | near\_trafficcone\_on\_driveable | PDM\-Closed | 0\.926 |
| S03 | accelerating\_at\_traffic\_light\_with\_lead | IDM | 0\.451 |
| S04 | accelerating\_at\_traffic\_light\_without\_lead | PDM\-Closed | 0\.978 |
| S05 | following\_lane\_with\_lead | IDM | 0\.173 |
| S06 | stationary\_in\_traffic | IDM | 0\.324 |
| S07 | following\_lane\_with\_slow\_lead | IDM\-MOBIL | 0\.821 |
| S08 | on\_stopline\_traffic\_light | PDM\-Closed | 0\.781 |
| S09 | changing\_lane\_to\_left | IDM | 0\.611 |
| S10 | changing\_lane | IDM | 0\.498 |
| S11 | medium\_magnitude\_speed | IDM | 0\.537 |
| S12 | stopping\_at\_stop\_sign\_without\_lead | IDM\-MOBIL | 0\.850 |
| S13 | behind\_pedestrian\_on\_driveable | IDM | 0\.313 |
| S14 | starting\_right\_turn | IDM | 0\.434 |
| S15 | traversing\_crosswalk | IDM\-MOBIL | 0\.904 |
| S16 | starting\_protected\_noncross\_turn | IDM | 0\.478 |
| S17 | near\_barrier\_on\_driveable | IDM\-MOBIL | 0\.740 |
| S18 | accelerating\_at\_crosswalk | IDM | 0\.471 |
| S19 | following\_lane\_without\_lead | IDM\-MOBIL | 0\.969 |
| S20 | stationary\_at\_traffic\_light\_without\_lead | PDM\-Closed | 0\.990 |
| S21 | on\_intersection | PDM\-Closed | 0\.967 |
| S22 | near\_pedestrian\_on\_crosswalk\_with\_ego | IDM | 0\.440 |
| S23 | high\_magnitude\_jerk | IDM | 0\.401 |
| S24 | high\_lateral\_acceleration | IDM | 0\.875 |
| S25 | starting\_straight\_stop\_sign\_intersection\_traversal | IDM | 0\.555 |
| S26 | stationary\_at\_crosswalk | IDM | 0\.465 |
| S27 | starting\_protected\_cross\_turn | IDM | 0\.556 |
| S28 | stopping\_at\_crosswalk | IDM | 0\.453 |
| S29 | on\_traffic\_light\_intersection | IDM\-MOBIL | 0\.861 |
| S30 | stopping\_at\_stop\_sign\_with\_lead | IDM\-MOBIL | 0\.277 |
| S31 | high\_magnitude\_speed | IDM | 0\.633 |
| S32 | starting\_straight\_traffic\_light\_intersection\_traversal | IDM | 0\.483 |
| S33 | stationary | PDM\-Closed | 0\.452 |
| S34 | behind\_long\_vehicle | PDM\-Closed | 0\.986 |
| S35 | traversing\_intersection | IDM | 0\.552 |
| S36 | traversing\_narrow\_lane | IDM | 0\.616 |
| S37 | traversing\_traffic\_light\_intersection | IDM | 0\.910 |
| S38 | near\_pedestrian\_on\_crosswalk | IDM | 0\.497 |
| S39 | on\_stopline\_crosswalk | IDM | 0\.453 |
| S40 | near\_multiple\_vehicles | IDM | 0\.591 |
| S41 | starting\_high\_speed\_turn | IDM\-MOBIL | 0\.770 |
| S42 | accelerating\_at\_stop\_sign\_no\_crosswalk | IDM\-MOBIL | 0\.570 |
| S43 | starting\_left\_turn | PDM\-Closed | 0\.974 |
| S44 | starting\_low\_speed\_turn | IDM\-MOBIL | 0\.744 |
| S45 | stopping\_at\_traffic\_light\_without\_lead | PDM\-Closed | 0\.976 |
| S46 | crossed\_by\_vehicle | IDM\-MOBIL | 0\.485 |
| S47 | near\_high\_speed\_vehicle | IDM | 0\.616 |
| S48 | stopping\_at\_traffic\_light\_with\_lead | IDM\-MOBIL | 0\.666 |
| S49 | behind\_bike | IDM | 0\.768 |
| S50 | stopping\_with\_lead | IDM\-MOBIL | 0\.271 |
| S51 | starting\_unprotected\_noncross\_turn | IDM | 0\.507 |
| S52 | near\_long\_vehicle | IDM\-MOBIL | 0\.903 |
| S53 | accelerating\_at\_traffic\_light | IDM\-MOBIL | 0\.420 |
| S54 | stationary\_at\_traffic\_light\_with\_lead | IDM | 0\.302 |
| S55 | changing\_lane\_to\_right | PDM\-Closed | 0\.965 |
| S56 | starting\_unprotected\_cross\_turn | IDM | 0\.554 |
| S57 | on\_stopline\_stop\_sign | PDM\-Closed | 0\.925 |
| S58 | accelerating\_at\_stop\_sign | IDM\-MOBIL | 0\.787 |
| S59 | waiting\_for\_pedestrian\_to\_cross | IDM | 0\.526 |

### A\.3 Prompt Template

The model receives the same user\-side driving prompt in every variant we compare\. Only the assistant target changes across the without\-reasoning, concise\-reasoning, and detailed\-reasoning settings\.

> You are an autonomous driving agent\. Given multi\-view camera images, route context, and the recent ego\-state history, predict the ego trajectory for the next 5 seconds\. If reasoning is requested, explain the active driving constraint, summarize the key candidate evaluations, and then output future waypoints\.

For rule\-grounded variants, the assistant target is serialized as planner reasoning followed by waypoints:

> <REASONING\>\.\.\.</REASONING\><PLANNING\>\[x\_1,y\_1\], …, \[x\_10,y\_10\]</PLANNING\>\.
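The target layout above can be built and parsed with a few lines of Python\. This is a minimal sketch, not the released serialization code: the helper names `build_target` and `parse_target` are hypothetical, and the regular expression assumes waypoints are written as numeric `[x, y]` pairs, so any descriptive text inside the planning block is skipped\.

```python
import re

# Matches numeric pairs such as "[2.17, -0.17]"; a literal "[x, y]" is ignored.
WAYPOINT = re.compile(r"\[\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\]")


def build_target(waypoints, reasoning=None):
    """Serialize an assistant target (hypothetical helper).

    reasoning=None yields the no-reasoning variant with only a planning block.
    """
    plan = ", ".join(f"[{x:.2f}, {y:.2f}]" for x, y in waypoints)
    head = f"<REASONING>{reasoning}</REASONING>" if reasoning is not None else ""
    return f"{head}<PLANNING>{plan}</PLANNING>"


def parse_target(text):
    """Return (reasoning or None, list of (x, y) waypoints) from a target string."""
    reasoning = re.search(r"<REASONING>(.*?)</REASONING>", text, re.DOTALL)
    planning = re.search(r"<PLANNING>(.*?)</PLANNING>", text, re.DOTALL)
    if planning is None:
        raise ValueError("missing <PLANNING> block")
    points = [(float(x), float(y)) for x, y in WAYPOINT.findall(planning.group(1))]
    return (reasoning.group(1).strip() if reasoning else None), points
```

Because the parser only keeps numeric pairs, the same function can read a planning block that carries a sentence of preamble before the waypoint list\.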

### A\.4 Reasoning Data Comparison Examples

To clarify what changes across our training\-data variants, we compare the three supervision styles used for the trajectory\-generation task: *without reasoning*, *concise reasoning*, and *detailed reasoning*\. All three variants share the same scene input and the same target trajectory; they differ only in how much intermediate decision information is exposed to the model\.

Table 5: Assistant\-response length for representative training examples\. The detailed variant is roughly 2\.3 to 3 times longer than the concise variant and about eight to ten times longer than the no\-reasoning baseline, motivating our concise\-vs\.\-detailed comparison in the main paper\.

| Example | No reasoning | Concise | Detailed |
| --- | --- | --- | --- |
| Starting to cruise on a free road | 443 | 1,183 | 3,605 |
| Car\-following in dense traffic | 421 | 1,542 | 3,823 |
| Red\-light response | 427 | 1,567 | 3,597 |
| Stopped in a queue | 423 | 1,678 | 4,042 |

Table [5](https://arxiv.org/html/2606.23938#A1.T5) highlights an important practical difference between the two reasoning variants\. Detailed chain\-of\-thought supervision provides richer descriptions, but it is substantially more expensive in tokens\. Concise rule\-grounded reasoning is therefore attractive: it preserves the action\-relevant structure of the planner trace while remaining much cheaper to train and decode\.
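The length ratios behind this comparison can be recomputed directly from the Table 5 values\. The snippet below is only an arithmetic check; the dictionary name `LENGTHS` and the helper `length_ratios` are illustrative, and the lengths are taken as reported, in whatever unit the table uses\.

```python
# Response lengths from Table 5: (no reasoning, concise, detailed).
LENGTHS = {
    "Starting to cruise on a free road": (443, 1183, 3605),
    "Car-following in dense traffic": (421, 1542, 3823),
    "Red-light response": (427, 1567, 3597),
    "Stopped in a queue": (423, 1678, 4042),
}


def length_ratios(row):
    """Return (detailed / concise, detailed / no-reasoning) for one table row."""
    none, concise, detailed = row
    return detailed / concise, detailed / none


for name, row in LENGTHS.items():
    vs_concise, vs_none = length_ratios(row)
    print(f"{name}: {vs_concise:.2f}x concise, {vs_none:.2f}x no-reasoning")
```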

#### Example: Red\-light stopping\.

Sample id: 115e26ae2da15108\_74\. This example is more representative of the paper’s main claim because the intermediate supervision explicitly names the active traffic constraint and shows how it shapes the resulting motion\.

```
Input prompt:
You are an autonomous driving agent. You have access to multi-view camera
images of a vehicle: (1) front view (which you should focus on with the most
attention) <image>, (2) front right view <image>, and (3) front left view
<image>. Your task is to do your best to predict future waypoints for the
vehicle over the next 10 timesteps, given the vehicle’s intent inferred from
the images. Provided are the previous ego vehicle status. Positive x means
forward direction while positive y means leftwards. The data is presented in
the format [x, y]:(t-2.0s) [-10.26, -0.15], Acceleration: X -1.15, Y -1.13
m/s^2, Velocity: X 6.12, Y 0.11 m/s, (t-1.5s) [-7.39, -0.06], Acceleration:
X -0.84, Y -0.54 m/s^2, Velocity: X 5.70, Y 0.04 m/s, (t-1.0s) [-4.71, -0.02],
Acceleration: X -0.94, Y -0.31 m/s^2, Velocity: X 5.39, Y 0.10 m/s,
(t-0.5s) [-2.22, -0.01], Acceleration: X -1.23, Y -0.11 m/s^2, Velocity:
X 4.88, Y 0.09 m/s, (t-0.0s) [0.0, 0.0], Acceleration: X -1.05, Y -0.63 m/s^2,
Velocity: X 4.34, Y 0.09 m/s

No-reasoning answer:
<PLANNING>
Predicted future movement details for the next 5 seconds (sampled at
0.5-second intervals), including BEV location in x and y directions (in
meters). Positive x means forward direction while positive y means leftwards.
The output is formatted as [x, y]: [2.17, -0.17], [4.27, -0.18], [6.18, -0.18],
[7.78, -0.19], [8.98, -0.20], [9.76, -0.21], [10.20, -0.21], [10.42, -0.21],
[10.52, -0.21], [10.57, -0.21]
</PLANNING>

Concise-reasoning answer:
<REASONING>
[Route] Following planned lane sequence. 10 lane segments, route found. Target
speed: 10.0 m/s.
[Scene] 37 objects within detection range. 1 red light(s) on route. Current
speed: 4.3 m/s.
[t=0.5s] Red light at 11.6m. Gap tight (11.6m vs 12.9m needed). Decelerate
-0.28 m/s^2.
[t=1.0s] Red light at 9.4m. Gap tight (9.4m vs 12.4m needed). Decelerate
-0.75 m/s^2.
[t=1.5s] Red light at 7.3m. Gap tight (7.3m vs 11.0m needed). Decelerate
-1.25 m/s^2.
[t=2.0s] Red light at 5.4m. Gap tight (5.4m vs 8.8m needed). Decelerate
-1.61 m/s^2.
[t=2.5s] Red light at 3.8m. Gap tight (3.8m vs 6.2m needed). Decelerate
-1.66 m/s^2.
[t=3.0s] Red light at 2.6m. Gap tight (2.6m vs 4.1m needed). Decelerate
-1.36 m/s^2.
[t=3.5s] Red light at 1.9m. Gap tight (1.9m vs 2.5m needed). Decelerate
-0.89 m/s^2.
[t=4.0s] Red light at 1.4m. Gap tight (1.4m vs 1.7m needed). Decelerate
-0.47 m/s^2.
[t=4.5s] Red light at 1.2m. Gap tight (1.2m vs 1.3m needed). Decelerate
-0.22 m/s^2.
[t=5.0s] Red light at 1.1m. Gap tight (1.1m vs 1.1m needed). Maintain speed
(-0.10 m/s^2).
[Summary] Responding to red traffic light. Decelerated from 4.3 to 0.0 m/s.
</REASONING>
<PLANNING>
Predicted future movement details for the next 5 seconds (sampled at
0.5-second intervals), including BEV location in x and y directions (in
meters). Positive x means forward direction while positive y means leftwards.
The output is formatted as [x, y]: [2.17, -0.17], [4.27, -0.18], [6.18, -0.18],
[7.78, -0.19], [8.98, -0.20], [9.76, -0.21], [10.20, -0.21], [10.42, -0.21],
[10.52, -0.21], [10.57, -0.21]
</PLANNING>

Detailed-reasoning answer:
<REASONING>
I need to plan my trajectory for the next 5 seconds. Let me analyze the current
driving situation step by step.

**Perception & Route:**
My route consists of 10 lane segments and was successfully planned. The target
speed for this road is 10.0 m/s.
I detect 37 objects in my surroundings. Crucially, 1 red traffic light(s) are
present on my route, which I must respect. My current speed is 4.3 m/s.

**Step-by-step Planning:**

Step 1 (t=0.5s): A red traffic light is 11.6m ahead. I must stop before the
intersection.
The current gap (11.6m) is less than the safe stopping distance (12.9m). I need
to brake.
Applying brake: deceleration = -0.28 m/s^2. Speed after this step: 4.2 m/s.

Step 2 (t=1.0s): A red traffic light is 9.4m ahead. I must stop before the
intersection.
The current gap (9.4m) is less than the safe stopping distance (12.4m). I need
to brake.
Applying brake: deceleration = -0.75 m/s^2. Speed after this step: 3.8 m/s.

Step 3 (t=1.5s): A red traffic light is 7.3m ahead. I must stop before the
intersection.
The current gap (7.3m) is less than the safe stopping distance (11.0m). I need
to brake.
Applying brake: deceleration = -1.25 m/s^2. Speed after this step: 3.2 m/s.

Step 4 (t=2.0s): A red traffic light is 5.4m ahead. I must stop before the
intersection.
The current gap (5.4m) is less than the safe stopping distance (8.8m). I need
to brake.
Applying brake: deceleration = -1.61 m/s^2. Speed after this step: 2.4 m/s.

Step 5 (t=2.5s): A red traffic light is 3.8m ahead. I must stop before the
intersection.
The current gap (3.8m) is less than the safe stopping distance (6.2m). I need
to brake.
Applying brake: deceleration = -1.66 m/s^2. Speed after this step: 1.6 m/s.

Step 6 (t=3.0s): A red traffic light is 2.6m ahead. I must stop before the
intersection.
The current gap (2.6m) is less than the safe stopping distance (4.1m). I need
to brake.
Applying brake: deceleration = -1.36 m/s^2. Speed after this step: 0.9 m/s.

Step 7 (t=3.5s): A red traffic light is 1.9m ahead. I must stop before the
intersection.
The current gap (1.9m) is less than the safe stopping distance (2.5m). I need
to brake.
Applying brake: deceleration = -0.89 m/s^2. Speed after this step: 0.4 m/s.

Step 8 (t=4.0s): A red traffic light is 1.4m ahead. I must stop before the
intersection.
The current gap (1.4m) is less than the safe stopping distance (1.7m). I need
to brake.
Applying brake: deceleration = -0.47 m/s^2. Speed after this step: 0.2 m/s.

Step 9 (t=4.5s): A red traffic light is 1.2m ahead. I must stop before the
intersection.
The current gap (1.2m) is less than the safe stopping distance (1.3m). I need
to brake.
Applying brake: deceleration = -0.22 m/s^2. Speed after this step: 0.1 m/s.

Step 10 (t=5.0s): A red traffic light is 1.1m ahead. I must stop before the
intersection.
The current gap (1.1m) is less than the safe stopping distance (1.1m). I need
to brake.
I am already stopped, waiting for the light to change.
Holding steady: acceleration = -0.10 m/s^2. Speed after this step: 0.0 m/s.

**Decision Summary:**
This is a red traffic light requiring me to stop. I decelerated from 4.3 to
0.0 m/s to maintain safety.
</REASONING>
<PLANNING>Predicted future movement details for the next 5 seconds (sampled at
0.5-second intervals), including BEV location in x and y directions (in
meters). Positive x means forward direction while positive y means leftwards.
The output is formatted as [x, y]: [2.17, -0.17], [4.27, -0.18], [6.18, -0.18],
[7.78, -0.19], [8.98, -0.20], [9.76, -0.21], [10.20, -0.21], [10.42, -0.21],
[10.52, -0.21], [10.57, -0.21]
</PLANNING>
```

These full examples show exactly what changes across the three supervision variants\. The scene input is identical, the final waypoint target is identical, and only the amount of exposed intermediate decision information changes\. This is the contrast we use throughout the paper when analyzing whether richer rule\-grounded reasoning improves planning quality\.
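One way to see that the red\-light trace is coupled to the trajectory is to replay its numbers\. The sketch below assumes a forward\-Euler speed update with a 0\.5 s step; that update rule is our assumption for illustration, not a confirmed description of the planner's integrator\. It checks that the per\-step decelerations reproduce the speeds stated in the detailed trace, and that the forward spacing of the waypoints matches the speed at the start of each step\. The function names are hypothetical, and lateral motion is ignored because it is small in this sample\.

```python
DT = 0.5   # seconds between planning steps in the example
V0 = 4.34  # forward velocity at t=0 from the input prompt (m/s)

# Values copied from the red-light example above.
ACCELS = [-0.28, -0.75, -1.25, -1.61, -1.66, -1.36, -0.89, -0.47, -0.22, -0.10]
TRACE_SPEEDS = [4.2, 3.8, 3.2, 2.4, 1.6, 0.9, 0.4, 0.2, 0.1, 0.0]
WAYPOINTS_X = [2.17, 4.27, 6.18, 7.78, 8.98, 9.76, 10.20, 10.42, 10.52, 10.57]


def rollout_speeds(v0, accels, dt):
    """Forward-Euler speed update (assumed), clamped at zero."""
    speeds, v = [], v0
    for a in accels:
        v = max(0.0, v + a * dt)
        speeds.append(v)
    return speeds


def segment_speeds(xs, dt):
    """Average forward speed over each waypoint interval, starting from x=0."""
    out, prev = [], 0.0
    for x in xs:
        out.append((x - prev) / dt)
        prev = x
    return out


after_step = rollout_speeds(V0, ACCELS, DT)
# Speeds stated in the trace, to one decimal place.
speeds_match = [round(v, 1) for v in after_step] == TRACE_SPEEDS

# Speed at the start of each step versus the waypoint spacing for that step.
start_of_step = [V0] + after_step[:-1]
gaps = [abs(a - b) for a, b in zip(start_of_step, segment_speeds(WAYPOINTS_X, DT))]
print(speeds_match, max(gaps))
```

For this sample both checks agree to within the rounding of the printed values, which is the kind of consistency that planner\-derived traces are meant to provide\.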

### A\.5 Planner Variables and Motion Summaries

Auxiliary motion labels are derived from the planner output trajectory rather than from hand\-written annotations\. We use three label families for analysis: longitudinal mode \(stop / slow / maintain / accelerate\), lateral mode \(lane follow / offset / lane change\), and interaction outcome \(yield / proceed / fallback\)\. These labels are not the main target metric; the main experiments evaluate trajectory quality with ADE, FDE, heading error, and miss rate\.
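As an illustration of how a longitudinal\-mode label could be derived from a planner trajectory, the sketch below compares the speed over the first and last waypoint intervals\. The function name `longitudinal_mode` and both thresholds \(`stop_speed`, `delta`\) are assumptions chosen for the example; the labeling rules actually used are not specified in this appendix\.

```python
import math


def longitudinal_mode(waypoints, dt=0.5, stop_speed=0.3, delta=0.5):
    """Label a trajectory as stop / slow / maintain / accelerate.

    waypoints: list of (x, y) in the ego frame, starting after the origin.
    stop_speed and delta (m/s) are illustrative thresholds, not paper values.
    """
    pts = [(0.0, 0.0)] + list(waypoints)
    speeds = [
        math.hypot(x1 - x0, y1 - y0) / dt
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    ]
    first, last = speeds[0], speeds[-1]
    if last < stop_speed:
        return "stop"
    if last < first - delta:
        return "slow"
    if last > first + delta:
        return "accelerate"
    return "maintain"


# Waypoints from the red-light example in A.4.
red_light = [
    (2.17, -0.17), (4.27, -0.18), (6.18, -0.18), (7.78, -0.19), (8.98, -0.20),
    (9.76, -0.21), (10.20, -0.21), (10.42, -0.21), (10.52, -0.21), (10.57, -0.21),
]
print(longitudinal_mode(red_light))
```

Lateral mode and interaction outcome would need lane geometry and agent information that a waypoint list alone does not carry, so they are left out of this sketch\.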

### A\.6 Additional Ablations and Compute Budget

The default training setup uses Qwen3\.5\-4B as the driving backbone, bf16 precision, full\-parameter supervised fine\-tuning, and an effective batch size of 16 on 8 GPUs\. All reported variants use the same data split, preprocessing pipeline, optimization recipe, and trajectory decoder unless explicitly stated otherwise\. The anonymized supplementary material contains the experiment configuration templates, data\-construction entry points, evaluation scripts, and table\-generation protocol needed to reproduce the reported comparisons\. Per\-run wall\-clock time depends on the local simulator and storage backend, but the reported experiments are designed to be reproduced from the released configuration files without changing model architecture, data split, or metric definitions\.
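An effective batch size of 16 on 8 GPUs leaves two samples per GPU per optimizer step, which can be split between per\-device batch size and gradient accumulation\. The helper below enumerates the splits consistent with those totals; the function name and the split itself are assumptions, since the text does not say which combination was used\.

```python
def batch_layouts(effective_batch=16, num_gpus=8):
    """List (per_device_batch, grad_accum_steps) pairs matching the totals.

    effective_batch = num_gpus * per_device_batch * grad_accum_steps
    """
    per_gpu, remainder = divmod(effective_batch, num_gpus)
    if remainder:
        raise ValueError("effective batch must be divisible by the GPU count")
    return [(b, per_gpu // b) for b in range(1, per_gpu + 1) if per_gpu % b == 0]


for per_device, accum in batch_layouts():
    print(f"per-device batch {per_device}, gradient accumulation {accum}")
```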

### A\.7 Asset Provenance, Release Scope, and Safeguards

Our experiments build on public driving benchmarks and simulator\-generated planner traces derived from them\. nuPlan and nuScenes provide the underlying scenario corpora and benchmark interfaces\[[3](https://arxiv.org/html/2606.23938#bib.bib2),[2](https://arxiv.org/html/2606.23938#bib.bib1)\], and the VLM backbone follows the public Qwen model family\. The anonymized supplementary material documents the preprocessing pipeline, prompt templates, trace serialization schema, evaluation scripts, and reproduction instructions; any released derived assets are intended for research use under the terms of the underlying datasets and models\. We do not claim that rule\-grounded reasoning alone makes a driving system deployable\. Dataset artifacts, prompts, and checkpoints should be used with simulator\-only caveats, license information for the underlying assets, and a clear statement that generated rationales are intended for analysis and research rather than direct safety guarantees\.
