@dair_ai: // Self-play with a pinch of human data // Really cool paper combining human demonstrations and self-play RL. 30 minute…

X AI KOLs Following 06/20/26, 10:10 PM Papers

Summary

A research paper that combines a small amount of human demonstrations as a regularization objective with self-play reinforcement learning, enabling human-compatible driving policies using far less human data (30 minutes vs thousands of hours) and training in 15 hours on a single consumer GPU.

// Self-play with a pinch of human data // Really cool paper combining human demonstrations and self-play RL. 30 minutes of human data, 2500x less than imitation learning, is enough to make self-play policies coordinate with real people. Pure self-play learns effective but alien conventions that humans cannot drive alongside. The usual fix is brittle reward engineering and domain randomization. This work instead treats a small set of human demonstrations as a regularization objective on top of a minimal safe goal-reaching reward. Why does it matter? The resulting policies coordinate with held-out human trajectories and finish training in 15 hours on a single consumer GPU. The lesson travels well past driving. A small demonstration regularizer may be the cheapest alignment knob we have for self-play. Paper: https://arxiv.org/abs/2606.19370 Learn to build effective AI agents in our academy: https://academy.dair.ai

Original Article


Cached at: 06/22/26, 07:33 AM


Human-like autonomy emerges from self-play and a pinch of human data

Source: https://arxiv.org/html/2606.19370

Daphne Cornelisse¹, Julian Hunt², Zixu Zhang³, Waël Doulazmi⁴,⁵, Kevin Joseph², Jaime Fernández Fisac³, Eugene Vinitsky¹

¹NYU Tandon School of Engineering, ²NYU Courant, ³Princeton University, ⁴Centre for Robotics, Mines Paris, ⁵Valeo

Abstract

Self-play reinforcement learning has recently emerged as a way to train driving policies without any human data. It uses cheap, large-scale simulations to substitute expensive, large-scale human driving demonstrations. A key limitation of this approach is that policies trained through pure self-play can learn effective but alien driving conventions incompatible with people. Previous works attempt to mitigate such behavioral misalignments through extensive reward engineering and domain randomization, which are brittle and labor-intensive. Instead of completely discarding human demonstrations, our method treats them as a regularization objective on top of a minimal safe goal-reaching reward. Like the spice in a good stew, we find that a little human data goes a long way: our method uses only 30 minutes of human demonstrations, 2500× fewer than comparable imitation learning approaches. Resulting policies coordinate with held-out human trajectories and complete training in 15 hours on a single consumer-grade GPU. Videos and full source code are available at https://spiced-self-play.com/.

Keywords: Self-play Reinforcement Learning, Imitation Learning, Autonomous Driving

1 Introduction

Self-play reinforcement learning (RL) has produced superhuman agents in strategic games [1,2,3] and, more recently, has shown promise in real-world domains, such as autonomous driving [4,5,6,7]. The approach elegantly sidesteps a central difficulty in multi-agent learning - how to model the opponent - through the following idea: the agent's opponent is a copy of itself. The appeal here is that as the agent improves, so does its co-player. This gives rise to an automatically evolving curriculum [8] that takes the policy from random play to skilled behavior entirely through synthetic simulated experience.

In zero-sum games, this mechanism, with a sparse measure for success (e.g., +1 when winning a game of chess), is enough to produce strong play against arbitrary opponents. Many real-world settings, however, are not zero-sum. Driving, for instance, can be viewed as a mixed-motive game: each player has individual objectives (reaching a destination safely) but must also coordinate with other road users by adhering to shared norms, expectations, and conventions. Self-play RL with only a high-level objective for success provides no guarantees of such alignment; policies may converge to effective but "alien" strategies that are incompatible with human partners [9]. Concretely, an agent trained to "reach a destination safely" may very well learn to do so in reverse, sideways, or on the wrong side of the road if such constraints are not specified in the reward.

Figure 1: Spiced self-play RL achieves human-like coordination from 30 minutes of human data and 60 years of simulated experience. Left: Safe task completion (task completion rate − at-fault collision rate) against human driving data, evaluated against human-replay proxies. With ∼30 min of human driving data as a behavioral anchor (ours; 0.994), our method outperforms unregularized self-play (0.979) and SMART-tiny CLSFT [10] (0.830), an IL-based approach trained on the full Waymo dataset. Beige arrows show improvement over each baseline. Center: Total training transitions used per method. Both self-play variants consume 20B transitions (∼63 years at 10 Hz) of cheap synthetic experience; SMART uses 45M–225M human logged transitions (∼52 days–7 months; see Appendix E). Right: Example rollout (see videos). The self-play policy drives aggressively and threads the needle when there are gaps; the regularized policy waits patiently for other agents. The dark-blue vehicle is the controlled agent, which is goal-conditioned on the green target destination. Grey agents follow log replay.

Previous works have addressed such misalignment in two ways. One line of work involves manual reward engineering, where reward terms are added iteratively until the desired behavior and conventions emerge [5,11]. While effective, this strategy is labor-intensive by nature, domain-specific, and brittle since it is not trivial to figure out what reward will produce the desired human-like behavior [12]. A case in point is GIGAFLOW [5], which required nine individually tuned reward terms and several other domain randomization techniques to produce naturalistic and cautious driving policies. On the other side of the spectrum, we have Imitation Learning [13,14,15, IL]. In IL, the policy is optimized to directly imitate human driving data, avoiding the need for defining a reward function altogether. However, robustness requires wide state coverage, so these approaches typically need large quantities of human demonstrations [16].

We take a different approach, grounded in a practical observation about the changing cost structure of experience generation. Modern RL frameworks and simulation infrastructure can generate between 300K and 20M environment steps per second on a single consumer-grade GPU [17,18], making synthetic experience generation effectively limitless. Human driving data, by contrast, requires manual collection and remains slow to scale. This suggests a natural role for human data in coordination games: not as the primary source of training signal, but as a lightweight anchor that steers the policy away from effective yet behaviorally alien strategies. Indeed, regularizing self-play RL toward such an anchor has shown promise in producing human-compatible agents in Diplomacy [19,20] and driving [21,22,7], yet how much data is required to reach human compatibility remains, to our knowledge, unexamined.

We measure it. Anchoring self-play RL to human driving data from the Waymo Open Motion Dataset [23, WOMD], we find that a surprisingly small amount of demonstration data improves coordination with human proxies. Paired with roughly 60 years of self-play experience, 30 minutes of human driving data (0.04% of the full WOMD training set) yields a marked improvement, without doing any reward engineering or domain randomization. The effect mirrors an analogy already present in the literature: it is well documented that injecting a small fraction of detrimental data can cause catastrophic model degradation, a phenomenon known as data poisoning [24,25,26]. To our knowledge, we are the first to report a comparable effect in the opposite direction within self-play RL; a small fraction of beneficial data disproportionately improves behavior. Much like a pinch of cayenne changes the flavor of an entire dish, a small amount of human data appears to alter the behavior of a self-play policy. Reflective of this effect, we call this data spicing, and name our method spiced self-play.
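As a rough sanity check on the scales quoted above, the sketch below converts transition counts into driving time. It assumes one transition per 10 Hz simulation step and takes the 45M-transition (roughly 52-day) figure from the Figure 1 caption as the size of the full WOMD training set; the variable and function names are illustrative and do not come from the paper's codebase.

```python
# Back-of-the-envelope check of the data-scale figures quoted in the text.
# Assumption: every transition corresponds to one step at 10 Hz.
STEP_HZ = 10
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def transitions_to_seconds(n_transitions: float, hz: int = STEP_HZ) -> float:
    """Convert a number of transitions into seconds of experience."""
    return n_transitions / hz


# 20B self-play transitions expressed in years of simulated driving.
self_play_years = transitions_to_seconds(20e9) / SECONDS_PER_YEAR

# 45M logged transitions (assumed here to be the full WOMD training set).
womd_days = transitions_to_seconds(45e6) / SECONDS_PER_DAY
womd_hours = womd_days * 24

# The 30-minute "spice" relative to the full human dataset.
spice_hours = 0.5
data_ratio = womd_hours / spice_hours
spice_fraction = spice_hours / womd_hours

print(f"self-play experience: {self_play_years:.1f} years")
print(f"full human dataset:   {womd_days:.1f} days")
print(f"data ratio:           {data_ratio:.0f}x")
print(f"spice fraction:       {spice_fraction:.2%}")
```

Under these assumptions the numbers come out to roughly 63 years of self-play, about 52 days of logged driving, a ratio of about 2500, and a fraction of 0.04%, consistent with the figures quoted in the text.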

Concretely, we train a PPO policy [27] under a sparse reward for safe goal reaching, while regularizing it toward a behavioral cloning anchor fit to a small amount of human driving data. We observe that:

• 30 minutes to 3 hours of human driving data, combined with self-play at scale, is sufficient to improve coordination with human proxies without reward engineering or domain randomization (Figure 1; Sections 4.1, 4.3).
• Spiced policies not only have lower collision rates, they also display more human-like behavior in terms of distributional realism [28] and collision severity profiles [29] (Section 4.2).
• To make it easy to reproduce and build on the current results, we open-source the full codebase. Policies can be trained end-to-end in 15 hours on a single consumer-class GPU.

2 Related Work

Imitation learning for autonomous driving.

The generation of driving policies is a fundamental challenge across end-to-end autonomous driving [30,31,32,33], multi-agent trajectory prediction [34], and reactive traffic simulation [28,35]. Driven by the widespread availability of large-scale human driving datasets [36,23,37], imitation learning has become the dominant approach across all these domains [38]. Under this imitation learning paradigm, a broad spectrum of methodologies has emerged to fit models to historical data, ranging from marginal [39,40,41] and joint [42,43,44,45] forecasting to autoregressive sequence modeling of tokenized trajectories [46,15,10] and continuous distribution learning via diffusion and promptable world models [47,48,49,50,51]. While these generative approaches yield diverse open-loop behaviors, they are fundamentally constrained by the scale of human data required and frequently suffer from compounding covariate shift in closed-loop deployment [16]. To mitigate these shifts, recent hybrid approaches integrate reinforcement learning [52,53,54], yet they typically still rely on extensive human driving data as their primary optimization signal. Our approach systematically inverts this balance: rather than depending on human driving data as the core supervisor, we utilize synthetic, multi-agent RL self-play as the primary engine for discovering robust interactive behaviors, retaining a remarkably small human dataset strictly as a behavioral anchor to ensure conformity to realistic traffic norms.

Self-play reinforcement learning in games.

Self-play reinforcement learning has produced superhuman agents in games from Go and Chess [1,55] to StarCraft II [56] and Stratego [2], all without human data. Superhuman play is not the same as human-compatible play, however. Many games admit multiple equilibria, and self-play need not converge to equilibria that are compatible with human partners [9,57]. The failure has been shown in cooperative games such as Hanabi [58] and Diplomacy [9], where self-play agents develop internally consistent conventions that transfer poorly to human partners. The cause is reward underspecification: when the reward is defined as a score to maximize, there are often many ways to achieve it. In other words, the solution space is large. Previous work attempts to resolve this by designing the reward by hand [5,11]. For instance, GIGAFLOW [5] demonstrates that reward engineering and domain randomization can produce naturalistic behavior at scale, at the cost of nine individually tuned reward terms. We avoid reward engineering entirely. A small amount of human data serves as a behavioral anchor, and self-play does the rest. This reduces a labor-intensive design problem to a one-hour data collection procedure.

Human-regularized self-play reinforcement learning and search.

One alternative to reward engineering is to regularize self-play toward a human anchor policy. This idea has been explored in Diplomacy, where KL regularization toward a human prior during both search and learning produced agents that coordinate more effectively with human partners [19,20]. Jacob et al. [59] study KL-regularized search more broadly and show that it recovers human-like play across several games. In autonomous driving, the idea has been explored at a limited scale [21,7]. Previous work showed improved human-likeness and coordination with log-replays through regularized self-play RL in autonomous driving [21]. However, the authors were bottlenecked by experience-generation speed: their simulator ran at 2,000 steps per second [35]. As a result, the policies were trained on only 140 million self-play transitions across 200 scenarios, which required five days of wall-clock time and left little room to study data scaling. More recently, Chang et al. [7] demonstrated that KL-regularized self-play can yield human-like driving policies using SMART [10] as the behavioral anchor. Notable differences from their setup include: 1) Vulnerable road users (VRUs; pedestrians and cyclists) were replayed from human data during training, which conflates the anchor's contribution with that of the mixed-in human trajectories and precludes a clean analysis of where the impact comes from; 2) Their behavioral anchor is a large tokenized model trained on the full 500,000-scenario Waymo dataset; 3) Policies were trained on 1 billion training transitions, particularly due to the high cost of running inference on SMART. We scale self-play to 20 billion steps, control all agents during self-play training to preclude human contamination of collected human data, and systematically study how much human anchor data is needed to improve human compatibility.

3 Method

Problem setup.

A human-compatible agent should blend in with human drivers. We approximate interaction with human road users by replaying logged human trajectories in simulation. We evaluate in three settings, illustrated in Figure 2:

• Self-play. All agents are controlled by the same policy in a decentralized manner.
• Human-replay. Only the self-driving car (SDC) is controlled by the policy; all other agents follow their logged trajectories.
• IDM. The SDC is controlled by the policy; all other agents follow the Intelligent Driver Model [60], following a precomputed lane-center path for lateral control and using longitudinal accelerations of IDM to maintain a safe gap to the lead vehicle [61].

An effective and human-compatible agent should reach its goal without collisions or off-road events across all three settings, each of which probes a distinct failure mode. Human-replay tests whether the policy has internalized human driving conventions against non-reactive co-players. IDM introduces closed-loop dynamics with reactive rule-based co-players. Self-play tests internal consistency and additionally serves as a convergence sanity check.

Figure 2: Evaluation settings. Self-play (left) and human-replay (center, right). Red arrows mark collisions. Rectangles are vehicles; squares are pedestrians. In human-replay, some collisions are effectively unavoidable: replay agents follow their logged trajectories and can drive into the controlled SDC from behind. We therefore distinguish between collisions (any contact) and at-fault collisions (contact caused by the controlled agent, following the NAVSIM benchmark [62]).

Metrics.

We report several metrics that capture task performance. The score is an aggregate metric; an agent scores 1 if it completes the task of driving to a goal destination before the end of the episode without colliding or going off-road, and 0 otherwise. To diagnose failure modes, we separately report collision rate, at-fault collision rate, off-road rate, and route progress. An ideal agent should score well with its own population as well as the human-replay population. Score-based metrics capture whether agents complete their task safely, but not whether their behavior looks human. We therefore also report distributional realism using the Waymo Open Sim Agent Challenge [28] to compare their behavior to logged trajectories. Finally, we also analyze the severity of the at-fault collisions [29]. Metrics are reported on held-out test scenarios unless stated otherwise; see full definitions and details in Appendix D.2.

3.1 Simulation Environment

World initialization.

We use PufferDrive 2.0 [18] for simulation and training. Environments are initialized from the Waymo Open Motion Dataset [23, WOMD]: each 9-second scenario provides a roadgraph, a variable set of agents (cars, cyclists, pedestrians) up to $N=32$, and per-agent initial poses and goals drawn from the logs. Each agent is goal-conditioned on a target destination ($x,y$ position) and receives a partial, decentralized, ego-frame observation consisting of its own state, the $N-1$ closest neighbors within 50 m, and up to 128 nearby road segments (road edges, lanes and lines). World initialization and observation space details are provided in Appendix A.1 and A.2, respectively.

Reward function.

To isolate the effect of human driving data, we avoid reward engineering and use a sparse reward: $+1$ for reaching the goal, $-1$ for collision or off-road events, and $0$ otherwise. Any differences in human-like behavior, therefore, stem from BC regularization rather than a hand-tuned reward. Episodes terminate once all agents reach their destinations, and we filter out transitions from agents that reach their goals early.
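A minimal sketch of this sparse reward as a per-step function. The function name, the boolean event flags, and the choice to let a failure event take precedence when it coincides with goal arrival are assumptions made for illustration; the actual simulator implementation may differ.

```python
def sparse_reward(reached_goal: bool, collided: bool, went_off_road: bool) -> float:
    """Sparse goal-reaching reward: +1 goal, -1 collision or off-road, 0 otherwise.

    Assumption: if a failure event and goal arrival happen on the same step,
    the failure takes precedence. The source does not specify this tie-break.
    """
    if collided or went_off_road:
        return -1.0
    if reached_goal:
        return 1.0
    return 0.0


# Hypothetical per-step events for one agent: cruising, cruising, goal reached.
episode_events = [
    (False, False, False),
    (False, False, False),
    (True, False, False),
]
episode_return = sum(sparse_reward(*events) for events in episode_events)
print(episode_return)
```

Because the reward carries no terms for speed, lane keeping, or comfort, any such behavior has to come from elsewhere, which is the role the BC regularizer plays in this method.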

3.2 Spiced Self-Play Reinforcement Learning

Spiced self-play is regularized self-play RL anchored to a small amount of human demonstration data (here driving logs). The anchor is a behavioral cloning policy fit to this data, which regularizes self-play through a KL penalty. We train policies in two stages: a behavioral cloning (BC) anchor is first fit to human data, then frozen and used as a regularizer during self-play RL.

Step 1: Train the anchor policy.

To study how the amount of human data affects downstream performance, we train anchor policies on subsets of the full dataset $\mathcal{D}=\{(o_{t}^{i},a_{t}^{i})\}_{i=1}^{T\cdot K}$. We sample subsets $\mathcal{D}_{n}$ corresponding to $n$ scenarios, yielding roughly $\{10\text{ min}, 30\text{ min}, 3\text{ h}, 30\text{ h}\}$ of human driving data, and fit each anchor $\tau_{\phi^{n}}$ by minimizing negative log-likelihood:

$$\phi^{n}=\arg\min_{\phi}\sum_{(o^{i}_{t},\,a^{i}_{t})\,\in\,\mathcal{D}_{n}}-\log\tau_{\phi}(a_{t}^{i}\mid o^{i}_{t}). \tag{1}$$

We use only the self-driving car (SDC) trajectory from each scenario to generate our imitation data, as it is typically the highest-quality trajectory. Each anchor $\tau_{\phi^{n}}$ is then frozen for the subsequent self-play stage. Full details are in Appendix A.4.
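The objective in Eq. (1) can be sketched in a few lines of NumPy. This is an illustration only: it assumes a discrete action space with a categorical anchor policy parameterized by logits, which this excerpt does not confirm, and the function names are hypothetical.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def bc_nll(anchor_logits: np.ndarray, demo_actions: np.ndarray) -> float:
    """Summed negative log-likelihood of demonstrated actions, as in Eq. (1).

    anchor_logits: (B, A) anchor outputs for B observations, A discrete actions.
    demo_actions:  (B,) integer indices of the actions the human took.
    """
    log_probs = log_softmax(anchor_logits)
    picked = log_probs[np.arange(len(demo_actions)), demo_actions]
    return float(-picked.sum())


# Toy batch: an untrained anchor with uniform logits over 3 actions.
toy_logits = np.zeros((4, 3))
toy_actions = np.array([0, 1, 2, 1])
toy_loss = bc_nll(toy_logits, toy_actions)
print(toy_loss)
```

For the uniform toy anchor each demonstrated action has probability 1/3, so the loss equals 4·log 3; gradient descent on the anchor parameters would lower it by placing more mass on the human actions.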

Step 2: Regularized self-play RL.

We train $\pi_{\theta}$ from scratch using Proximal Policy Optimization [27, PPO]. The policy $\pi_{\theta}$ is represented by a 650k-parameter neural network. Each anchor $\tau_{\phi^{n}}$ serves as a behavioral regularizer via a KL penalty:

$$\mathcal{L}(\theta)=\mathcal{L}_{\mathrm{PPO}}(\theta)+\lambda\,\mathbb{E}_{o\sim\rho_{\pi_{\theta}}}\!\left[D_{\mathrm{KL}}\!\left(\tau_{\phi^{n}}(\cdot\mid o)\,\Big\|\,\pi_{\theta}(\cdot\mid o)\right)\right], \tag{2}$$

where $\rho_{\pi_{\theta}}$ is the on-policy state distribution and $\lambda\geq 0$ controls regularization strength. The KL term pulls $\pi_{\theta}$ toward the anchor on states the policy actually visits, rather than on the offline distribution of $\mathcal{D}_{n}$. Hyperparameters and training details are in Appendices A.1 and B.
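To make the direction of the KL term in Eq. (2) concrete, here is a small NumPy sketch. It again assumes categorical action distributions given as logits, approximates the expectation over on-policy observations by a batch mean, and treats the PPO loss as an already computed scalar; all names are illustrative rather than taken from the released code.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def kl_anchor_to_policy(anchor_logits: np.ndarray, policy_logits: np.ndarray) -> float:
    """Batch mean of KL(tau || pi), with the frozen anchor tau as first argument."""
    log_tau = log_softmax(anchor_logits)
    log_pi = log_softmax(policy_logits)
    per_observation = (np.exp(log_tau) * (log_tau - log_pi)).sum(axis=-1)
    return float(per_observation.mean())


def spiced_loss(ppo_loss: float, anchor_logits: np.ndarray,
                policy_logits: np.ndarray, lam: float) -> float:
    """PPO loss plus lambda times the anchor KL penalty, mirroring Eq. (2)."""
    return ppo_loss + lam * kl_anchor_to_policy(anchor_logits, policy_logits)


# Toy example: the anchor is indifferent between two actions,
# while the policy strongly prefers the first one.
anchor_logits = np.log(np.array([[0.5, 0.5]]))
policy_logits = np.log(np.array([[0.9, 0.1]]))
kl_value = kl_anchor_to_policy(anchor_logits, policy_logits)
total_loss = spiced_loss(1.0, anchor_logits, policy_logits, lam=0.1)
print(kl_value, total_loss)
```

With the anchor as the first argument, the penalty grows quickly when the policy assigns very low probability to actions the anchor considers plausible, so the policy is discouraged from ruling out human-like actions. In the paper the observations come from the policy's own rollouts, not from the demonstration set.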

4 Experiments

This section summarizes the key results. Additional details and analyses are reported in the appendices. We structure the sections to answer the following questions:

1. Scaling human driving data for regularized self-play RL: How much human data is needed for strong performance in both self-play and human-replay evaluations? (Section 4.1)
2. Behavior and safety analysis: How does a small amount of human demonstration data shape policy behavior beyond task performance? We analyze the effect on distributional realism, collision severity, and driving style (Section 4.2).
3. The role of metadata and scenario diversity: Driving datasets such as WOMD and NuPlan provide scenario metadata (road graphs and initial agent positions) that ground simulation at a fraction of the cost of collecting human driving data. How does the number of training scenarios (maps) used for self-play influence agent performance? (Section 4.3)

4.1 Scaling Human Driving Data for Regularized Self-Play RL

How much collected human driving data does regularized self-play need, and how does this compare to approaches based on imitation learning alone? The second question matters because any apparent data efficiency on our side could simply reflect the homogeneity of the Waymo Open Dataset rather than an actual property of the method. We benchmark against unregularized self-play RL: a goal-conditioned RL policy that is trained to reach a goal without colliding with other agents or going off-road (Section 3.1). This provides a human-data-free lower bound. We also benchmark against SMART-tiny-CLSFT [10,54], the state-of-the-art IL approach in this domain. SMART is trained on the same nested driving data subsets; we additionally include the open-sourced SMART-tiny-CATK checkpoint [54], trained on all 500k WOMD training scenarios, as an IL upper bound (Appendix B.3).

Figure 3: Scaling human driving data for spiced self-play reinforcement learning. Top: Performance of Spiced self-play RL and SMART with CAT-K closed-loop fine-tuning as a function of total human log data used for training, evaluated in self-play and against human replays. Policies are evaluated on the same random 10k held-out WOMD validation split [23]. Unregularized self-play RL is shown as a horizontal line, since it uses no human driving data. The horizontal axis is semi-logarithmic. Bottom: Relative improvement to IL baseline.

Table 1: Performance versus amount of human demonstrations for the best trained policies on 10k held-out randomly sampled scenarios. For SMART, we report the best-performing variant at each data scale (details in Appendix G.1). Top-3 values per column are marked (* best, † 2nd, ‡ 3rd). The unregularized self-play row uses no human driving data. SP = Self-play (test); HR = Human-replay (test).

| Human demos used | Method | SP Coll. (%) ↓ | SP Off-road (%) ↓ | SP Route prog. (%) ↑ | SP Score ↑ | HR Coll. (%) ↓ | HR At-fault (%) ↓ | HR Off-road (%) ↓ | HR Route prog. (%) ↑ |
| 10 min | SMART | 11.9 | 55.8 | 84.5 | 0.246 | 32.0 | 25.0 | 18.6 | 57.7 |
| 30 min | SMART | 9.5 | 55.4 | 85.8 | 0.379 | 17.9 | 12.5 | 16.8 | 76.9 |
| 3 hours | SMART | 8.0 | 53.6 | 86.2 | 0.518 | 11.4 | 6.9 | 4.5 | 81.5 |
| 30 hours | SMART | 7.7 | 53.3 | 86.5 | 0.601 | 6.8 | 3.3 | 1.6 | 85.4 |
| 52 days | SMART | 6.1 | 53.5 | 91.7 | 0.654 | 4.4 | 1.6 | 1.1 † | 88.5 |
| — | unreg. self-play | 1.0 ± 0.4 | 0.2 ± 0.2 * | 99.9 ± 0.1 * | 0.967 ± 0.006 | 2.7 ± 0.5 | 2.1 ± 0.5 | 0.6 ± 0.2 * | 100.0 ± 0.0 * |
| 10 min | reg. self-play (ours) | 1.0 ± 0.7 | 0.3 ± 0.2 ‡ | 99.0 ± 0.4 | 0.941 ± 0.007 | 3.9 ± 0.6 | 1.4 ± 0.4 | 1.4 ± 0.4 | 99.6 ± 0.2 |
| 30 min | reg. self-play (ours) | 0.2 ± 0.1 † | 0.5 ± 0.2 | 99.3 ± 0.3 | 0.968 ± 0.006 ‡ | 2.0 ± 0.4 ‡ | 0.7 ± 0.3 † | 1.4 ± 0.4 | 99.8 ± 0.1 |
| 3 hours | reg. self-play (ours) | 0.2 ± 0.1 ‡ | 0.6 ± 0.4 | 99.6 ± 0.2 ‡ | 0.973 ± 0.005 † | 1.6 ± 0.4 † | 0.6 ± 0.2 * | 1.2 ± 0.3 | 100.0 ± 0.0 † |
| 30 hours | reg. self-play (ours) | 0.0 ± 0.0 * | 0.3 ± 0.2 † | 99.7 ± 0.2 † | 0.976 ± 0.005 * | 1.4 ± 0.4 * | 0.8 ± 0.3 ‡ | 1.1 ± 0.3 ‡ | 99.9 ± 0.0 ‡ |

Spiced self-play RL surpasses IL with a fraction of the human driving data.

As shown in Figure 3 and Table 1, spiced self-play outperforms SMART-tiny-CLSFT [54] across all data regimes and metrics. With as little as 30 minutes to 3 hours of human data, spiced self-play achieves the lowest at-fault collision rate (0.6–0.7%); a 2.5× improvement over SMART-tiny-CLSFT trained on the entire Waymo train dataset (52 days; 1.6%). The advantage is most pronounced at low human data: at 30 minutes, spiced self-play yields an 11× reduction in at-fault collision rate and 46× in self-play collision rate relative to SMART. Against standard self-play RL (at-fault CR: 2.1%), spiced self-play achieves a 3.5× improvement, demonstrating the value of an anchor trained on minimal human data as a regularizer. Regularized self-play RL with the 30-hour anchor leads to similar results.

Self-play exposes agents to a changing population of partners.

The environment of a self-play RL policy is non-stationary: early policies have near-random behavior and become increasingly competent. This is in contrast to a single-agent RL setting, where the partner distribution is fixed. We observe that the self-play setting is associated with an increase in convergence to mutually consistent conventions. Spiced self-play agents achieve low collision rates in both self-play and cross-play with human logs (below 1.5% in each). SMART, trained on 52 days of human data, incurs a 6% self-play collision rate but only 1.6% when paired with logs. Two factors can explain this gap: sample count (20 billion transitions versus 225 million for SMART, Figure 1) and training paradigm (SMART is optimized open-loop for log-likelihood, then finetuned closed-loop to stay near the log distribution, and is never exposed to the partner distribution self-play naturally provides). To test for the role of the partner distribution, we compare self-play agents with agents trained directly against the human-replay population (single-agent RL against static partners). The latter perform well within that population (at-fault collision rate 0.2–0.3%) but do worse in self-play (0.8–1.2%). This is consistent with exposure to reactive, evolving partners contributing to robustness (Figure 19).

4.2Behavior and Safety Analysis

The goal of this section is to understand the behavioral differences between unregularized and regularized self-play policies beyond straightforward performance metrics.

Spiced policies exhibit lower-severity collisions.

Collision rates, as reported in Sections 4.1 and 4.3, measure how often agents fail, but not how bad the failures are. This distinction matters when policies are deployed alongside humans. Following Waymo's most recent safety report [29], we quantify collision severity via the change in velocity at impact ($\Delta v$), a widely studied proxy for occupant injury risk. As shown in Table 9 and Figure 4, regularization reduces both the frequency and the severity of failures. The mean per-event $\Delta v$ drops from 2.09 m/s to 1.71 m/s, and the maximum observed impact velocity falls from 13.71 m/s to 8.09 m/s. The improvement is more apparent when we focus on the tail of collision events: 14.3% of unregularized collisions exceed 15 mph, the threshold above which serious injury risk rises substantially, compared to 7.5% for the regularized model. The survival curve in Figure 4 (right) shows the two groups are nearly indistinguishable at low $\Delta v$, with the gap opening sharply above 5 m/s and widening through the severe range. Regularization thus produces policies that not only collide less often but also cause less damage when they do collide.
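The severity statistics above can be computed from a list of per-event $\Delta v$ values along the following lines. This is a sketch under stated assumptions: the input is taken to be one $\Delta v$ per collision event in m/s, how $\Delta v$ itself is extracted from the simulator is not covered, and the toy values are made up purely to exercise the code (they are not results from the paper).

```python
import numpy as np

MPH_TO_MPS = 0.44704  # exact conversion factor: 1 mph = 0.44704 m/s


def severity_summary(delta_v_mps, threshold_mph: float = 15.0) -> dict:
    """Mean, maximum, and tail fraction of per-event delta-v values (m/s)."""
    dv = np.asarray(delta_v_mps, dtype=float)
    threshold_mps = threshold_mph * MPH_TO_MPS
    return {
        "mean": float(dv.mean()),
        "max": float(dv.max()),
        "frac_above_threshold": float((dv > threshold_mps).mean()),
    }


def exceedance_curve(delta_v_mps, grid_mps) -> np.ndarray:
    """Fraction of events whose delta-v exceeds each grid value (survival curve)."""
    dv = np.asarray(delta_v_mps, dtype=float)
    return np.array([(dv > g).mean() for g in grid_mps])


# Hypothetical delta-v values for four collision events, in m/s.
toy_events = [1.0, 2.0, 7.0, 8.0]
summary = severity_summary(toy_events)
curve = exceedance_curve(toy_events, grid_mps=[0.0, 5.0, 10.0])
print(summary, curve)
```

The 15 mph threshold corresponds to about 6.7 m/s, so in the toy data two of the four events fall in the severe tail. Plotting `exceedance_curve` on a log scale for each policy would give a survival plot of the kind described for Figure 4 (right).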

Figure 4: Analyzing collision event severity. Left: empirical CDF of per-event $\Delta v$. The dashed line marks Waymo's 1 mph (0.45 m/s) reporting threshold. Center: mean $\Delta v$ per collision event, conditional on a collision occurring. Regularized collisions are on average 18% lower in severity (1.71 vs. 2.09 m/s). Right: fraction of collisions exceeding $\Delta v$ (log scale).

Regularized self-play improves realism with minimal data.

Unregularized self-play scores 0.680 on the WOSAC meta-score [28], with the largest deficits in the kinematic and interactive groups. Anchoring to 30 minutes of human data increases this to 0.725; the meta-score does not improve with additional data, suggesting BC anchor quality is the limiting factor. SMART-tiny CLSFT [10,54] achieves the highest realism score (0.755), yet underperforms on collision rate and task completion across every data bin (Section 4.1), confirming that distributional similarity to logged human trajectories does not necessarily imply safety or competence [63]. Additional results and graphs are in Appendix F.3.

Regularized policies display more social driving behavior.

We perform a qualitative analysis with representative videos available at https://spiced-self-play.com/. The most salient difference is that regularized policies are more considerate of surrounding traffic: they maintain greater following distances, avoid cutting in, and yield at intersections relative to unregularized self-play agents. RL policies are trained to maximize the expected cumulative discounted return. An undesirable side-effect of this is that policies tend to achieve their task in the fewest steps possible. This is different from what humans do. A human driver will aim to get to her destination on time, but is not trying to get there as quickly as possible; she is satisficing [64] rather than optimizing. As visible in the videos and supported by the average episode length, regularization partially corrects for this: regularized agents complete their episodes in 64 steps on average (±3.5), compared to 38 (±2.6) steps for unregularized self-play.

This effect is also visible in the displacement errors to the human replays in Table 2, which we decompose into a longitudinal component (along the direction of travel) and a lateral component (perpendicular to it). Lateral error reflects whether the policy follows the human's path through the scene (e.g., lane choice, turns), while longitudinal error reflects whether it travels that path at a human-like pace. A policy that drives at the wrong pace stays on the right route but reaches each point too early or too late. We observe a clear difference: the unregularized longitudinal L2 (13.33 m) is over five times its lateral L2 (2.39 m). Regularization more than halves the longitudinal error (to 5.56 m) and nearly halves the lateral error (to 1.27 m), so the regularized policy follows human-like paths and traverses them at a human-like speed. The videos confirm both effects: the large longitudinal gap comes from unregularized RL policies driving very fast, and the lateral gap usually comes from their swerving around the replayed logs.
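The decomposition described above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the function name `decompose_displacement`, the choice to project onto the logged human heading, and the use of per-step absolute values averaged over the episode are all assumptions made for the example.

```python
import math

def decompose_displacement(policy_xy, human_xy, human_heading):
    """Split time-aligned displacement error into longitudinal and lateral parts.

    Assumption: the error at each step is projected onto the logged human
    heading (radians) at that step; the paper may use a different reference.
    """
    long_errs, lat_errs = [], []
    for (px, py), (hx, hy), psi in zip(policy_xy, human_xy, human_heading):
        dx, dy = px - hx, py - hy
        # Component along the human's direction of travel.
        long_errs.append(abs(math.cos(psi) * dx + math.sin(psi) * dy))
        # Component perpendicular to the direction of travel.
        lat_errs.append(abs(-math.sin(psi) * dx + math.cos(psi) * dy))
    n = len(long_errs)
    return sum(long_errs) / n, sum(lat_errs) / n

# Toy case: the human drives along x at 1 m/step; the policy drives the same
# route at 2 m/step with a constant 0.5 m lateral offset.
human = [(float(t), 0.0) for t in range(5)]
policy = [(2.0 * t, 0.5) for t in range(5)]
heading = [0.0] * 5
long_l2, lat_l2 = decompose_displacement(policy, human, heading)
```

In this toy case the policy stays near the human's path (small lateral error) but runs ahead of it (large longitudinal error), which is the pattern the unregularized agents show in Table 2.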

Table 2: Comparing unregularized and regularized self-play policies on 10k random validation split. Long. L2 and Lat. L2 are the displacement errors from the human trajectory decomposed along the direction of travel and perpendicular to it, and ADE is the average displacement error over the episode time-aligned to the logs (all in meters). Lower is better throughout. Best value per column in bold. All columns are measured in the human-replay (interactive) setting.

| Method | At-fault (%) ↓ | Long. L2 ↓ | Lat. L2 ↓ | Time-aligned ADE ↓ |
|---|---|---|---|---|
| Unregularized | 2.1 ± 0.5 | 13.327 ± 0.129 | 2.390 ± 0.148 | 14.074 ± 0.182 |
| Regularized (ours) | **0.7 ± 0.3** | **5.559 ± 0.077** | **1.274 ± 0.029** | **5.927 ± 0.076** |

4.3The Role of Scenario Metadata

Scenario diversity is essential for learning general policies.

Aside from human driving data, a cheaper source of simulation grounding data is scenario metadata: road graphs, initial positions, and velocities. Recent work has shown that regularized self-play RL grounded by target-city metadata can adapt driving policies to new cities [22]. A natural follow-up question is how much the diversity provided by metadata matters for training generalizable policies, which is what we explore here. We train regularized and unregularized self-play RL agents on subsets $\mathcal{M}_{k}$ with $|\mathcal{M}_{k}| \in \{10, 100, 1{,}000, 10{,}000, 50{,}000\}$ scenarios, holding the BC anchors $\tau^{n}$ and reward function $r$ fixed. This isolates the effect of environment initialization and diversity from that of the agent behaviors.

Figure 5: Scaling scenario metadata. The unregularized self-play baseline is shown in black; shades of blue correspond to regularized policies trained with different BC anchors, with darker shades indicating more anchor data. Left: collision rate in self-play, where all agents are controlled by the same policy on a held-out validation set. Center: at-fault collision rate, the fraction of collisions caused by the controlled agent (see cartoon in Figure 2). Right: gap between self-play and human-replay performance (here referred to as zero-shot coordination; $\Delta_{\mathrm{ZSC}}$). Concretely, it is the difference in the at-fault collision rate between the self-play and human-replay settings.

We find that the number of training scenarios (a proxy for map diversity) is an important ingredient for generalization, both to held-out maps and to the human-replay population. As shown in Figure 5, both unregularized and regularized self-play improve drastically with the amount of metadata. For unregularized self-play, the at-fault collision rate drops from 14% at 10 scenarios to 0.5–1% at 50k scenarios, and the human-replay collision rate falls from 25.2% to 2% over the same range. Regularized self-play follows the same trend and reaches lower absolute values: with a 30-min BC anchor, the human-replay at-fault collision rate drops from 14% at 10 scenarios to 0.7% at 50k scenarios. The gap between the self-play performance (pairing the policy with itself) and the human-replay population approaches 0.2% for regularized policies, and is 1.5% for unregularized self-play.

5Conclusion, Limitations & Discussion

Conclusion.

We consider a series of experiments aimed at putting the mixing of human driving data with synthetic simulated experience on a more scientific footing. Our central finding is that a small amount of human data, roughly 30 minutes to 3 hours of human driving data, can dramatically move the needle towards human-compatible driving agents. This is three orders of magnitude less than SOTA imitation learning baselines and is achieved without reward engineering or domain randomization techniques. The broader implication is that when simulation is cheap, and some clear metrics for desirable behavior are available, human driving data may be best used not as the primary training signal but as a lightweight anchor that steers policies away from effective-but-alien equilibria.

Limitations.

1. Robustness in tight coordination scenarios: We perform an additional analysis to better understand the limitations of the resulting regularized policies. We curate a small dataset consisting of the top 200 most difficult interactive scenarios (see Appendix D.1). Repeating the analysis from Section 4.1 on this set of harder scenarios shows that, while the ranking of the policies holds (reg. self-play RL policies still outperform the SMART and unregularized self-play baselines by the same margins), the absolute at-fault collision rate increases from 0.7% to 2.1–2.8%. This indicates that there is room for improvement in the robustness of the resulting policies. Arguably, not all of these contacts reflect policy failures: some are caused by replay agents cutting abruptly into the SDC's lane, leaving almost no physically feasible avoidance response. What constitutes a fair collision-avoidance benchmark beyond at-fault heuristics is itself a difficult open question in both industry and academia [65]. Nevertheless, an important direction for future work is to improve the robustness of regularized policies. See Appendix G for the results, an in-depth discussion, and ideas to improve along this axis.
2. External validity of evals: Our evaluations use human replays and IDM-controlled agents in simulation as proxies for coordination with humans. The extent to which performance in these settings transfers to on-road deployment remains an open question.
3. Sensitivity to the anchor: Many of the underlying details by which regularizing the RL policy to the pre-trained BC anchor improves human-likeness remain incompletely understood. How do the properties of the anchor distribution, such as its entropy, affect the outcome? Results show that the regularized policies substantially outperform their anchors (see Figure 9, Table 7), indicating that RL corrects for at least some suboptimal behavior in the anchor. It is unclear how sensitive this is to the BC policy's closed-loop quality, or how precisely the correction occurs.

Combining human demonstrations with synthetic simulated experience.

Our key finding raises a deeper question that we have only scratched the surface of, but that is worth exploring further. Given the ability to generate simulated self-play experience on demand, what is the complementary value of a bit of human data? Can we predict how much human data, and of what kind, is worth collecting for a given application X with structure Y? In the present work, we can loosely intuit two effects. First, the resulting regularized self-play RL policies are more human-like because the actor distributions stay close to the anchor distributions (see Section F.2). Second, the resulting policies are more robust because they are exposed to broader coverage of the state space during training: the self-play agents learn from 20B transitions and start from random play, whereas the IL baseline is trained on a fixed dataset of 225 million expert transitions (Figure 1, Center). But transition count is a crude explanation; not all transitions are equally informative. Recent work on epiplexity [66] takes a step toward formalizing this notion of data value, but in its current form it is a theoretical measure that we cannot yet compute or apply to data selection in practice. Developing tools to help determine what kind of human data is needed to learn a given behavior, and to predict how much is needed before collecting it, is a promising direction for future work.

Acknowledgments

We thank the authors of CAT-K [54] for generously sharing the weights of their best SMART-tiny-CLSFT checkpoint, which we use as the imitation learning baseline throughout the paper, and their code, which we use as a baseline for the scaling law experiments. We also thank Luke Rowe, Rodrigue de Schaetzen, and Roger Girgis for feedback on some early results and various interesting discussions on the topic of end-to-end driving and self-play. We thank Momchil Tomov for a helpful discussion on evals and metrics for evaluating human-likeness and compatibility in driving.

This work was also supported in part through the NYU IT High-Performance Computing resources, services, and staff expertise. Daphne Cornelisse is partially supported by the Cooperative AI Foundation and a Chishiki-AI SCIPE Fellowship.

References

Silver et al. [2016] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
Sokota et al. [2025a] S. Sokota, E. Vinitsky, H. Hu, J. Z. Kolter, and G. Farina. Superhuman AI for Stratego using self-play reinforcement learning and test-time search. CoRR, abs/2511.07312, 2025a. doi:10.48550/ARXIV.2511.07312. URL https://doi.org/10.48550/arXiv.2511.07312.
Sokota et al. [2025b] S. Sokota, E. Vinitsky, H. Hu, J. Z. Kolter, and G. Farina. Superhuman AI for Stratego using self-play reinforcement learning and test-time search. arXiv preprint arXiv:2511.07312, 2025b.
Kazemkhani et al. [2024] S. Kazemkhani, A. Pandya, D. Cornelisse, B. Shacklett, and E. Vinitsky. GPUDrive: Data-driven, multi-agent driving simulation at 1 million FPS. arXiv preprint arXiv:2408.01584, 2024.
Cusumano-Towner et al. [2025] M. Cusumano-Towner, D. Hafner, A. Hertzberg, B. Huval, A. Petrenko, E. Vinitsky, E. Wijmans, T. Killian, S. Bowers, O. Sener, P. Krähenbühl, and V. Koltun. Robust autonomy emerges from self-play. arXiv preprint arXiv:2502.03349, 2025.
Cornelisse et al. [2025] D. Cornelisse, A. Pandya, K. Joseph, J. Suárez, and E. Vinitsky. Building reliable sim driving agents by scaling self-play. arXiv preprint arXiv:2502.14706, 2025.
Chang et al. [2025] W.-J. Chang, A. Rangesh, K. Joseph, M. Strong, M. Tomizuka, Y. Hu, and W. Zhan. SPACeR: Self-play anchoring with centralized reference models. arXiv preprint arXiv:2510.18060, 2025.
Leibo et al. [2019] J. Z. Leibo, E. Hughes, M. Lanctot, and T. Graepel. Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research. arXiv preprint arXiv:1903.00742, 2019.
Bakhtin et al. [2021] A. Bakhtin, D. Wu, A. Lerer, and N. Brown. No-press Diplomacy from scratch. Advances in Neural Information Processing Systems, 34:18063–18074, 2021.
Wu et al. [2024] W. Wu, X. Feng, Z. Gao, and Y. Kan. SMART: Scalable multi-agent real-time motion generation via next-token prediction. Advances in Neural Information Processing Systems, 37:114048–114071, 2024.
Qiu et al. [2026] J. Qiu, A. Saviolo, C. Wang, M. Wang, and X. Huang. Heterogeneous self-play for realistic highway traffic simulation. 2026. URL https://arxiv.org/abs/2604.16406.
Knox et al. [2023] W. B. Knox, A. Allievi, H. Banzhaf, F. Schmitt, and P. Stone. Reward (mis)design for autonomous driving. Artificial Intelligence, 316:103829, 2023.
Pomerleau [1988] D. A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. Advances in Neural Information Processing Systems, 1, 1988.
Bojarski et al. [2016] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
Philion et al. [2023] J. Philion, X. B. Peng, and S. Fidler. Trajeglish: Traffic modeling as next-token prediction. arXiv preprint arXiv:2312.04535, 2023.
Baniodeh et al. [2025] M. Baniodeh, K. Goel, S. Ettinger, C. Fuertes, A. Seff, T. Shen, C. Gulino, C. Yang, G. Jerfel, D. Choe, R. Wang, V. Kallem, S. Casas, R. Al-Rfou, B. Sapp, and D. Anguelov. Scaling laws of motion forecasting and planning: A technical report. arXiv preprint arXiv:2506.08228, 2025.
Suarez [2024] J. Suarez. PufferLib: Making reinforcement learning libraries and environments play nice. arXiv preprint arXiv:2406.12905, 2024.
Cornelisse et al. [2025] D. Cornelisse, S. Cheng, P. Mandavilli, J. Hunt, K. Joseph, W. Doulazmi, V. Charraut, A. Gupta, J. Suarez, and E. Vinitsky. PufferDrive: A fast and friendly driving simulator for training and evaluating RL agents, 2025. URL https://github.com/Emerge-Lab/PufferDrive.
Hu et al. [2022] H. Hu, D. J. Wu, A. Lerer, J. Foerster, and N. Brown. Human-AI coordination via human-regularized search and learning. arXiv preprint arXiv:2210.05125, 2022.
Bakhtin et al. [2023] A. Bakhtin, D. J. Wu, A. Lerer, J. Gray, A. P. Jacob, G. Farina, A. H. Miller, and N. Brown. Mastering the game of no-press Diplomacy via human-regularized reinforcement learning and planning. In International Conference on Learning Representations, 2023. arXiv:2210.05492.
Cornelisse and Vinitsky [2024] D. Cornelisse and E. Vinitsky. Human-compatible driving partners through data-regularized self-play reinforcement learning. In Reinforcement Learning Journal, 2024. arXiv:2403.19648.
Wang et al. [2026] Z. Wang, S. Rahmani, D. Cornelisse, B. Sarkar, A. D. Goldie, J. N. Foerster, and S. Whiteson. Learning to drive in new cities without human demonstrations. arXiv preprint arXiv:2602.15891, 2026.
Ettinger et al. [2021] S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. R. Qi, Y. Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The Waymo Open Motion Dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9710–9719, 2021.
Wan et al. [2023] A. Wan, E. Wallace, S. Shen, and D. Klein. Poisoning language models during instruction tuning. In International Conference on Machine Learning, pages 35413–35425. PMLR, 2023.
Zhang et al. [2025] Y. Zhang, J. Rando, I. Evtimov, J. Chi, E. M. Smith, N. Carlini, F. Tramèr, and D. Ippolito. Persistent pre-training poisoning of LLMs. In International Conference on Learning Representations, volume 2025, pages 31323–31340, 2025.
Souly et al. [2025] A. Souly, J. Rando, E. Chapman, X. Davies, B. Hasircioglu, E. Shereen, C. Mougan, V. Mavroudis, E. Jones, C. Hicks, et al. Poisoning attacks on LLMs require a near-constant number of poison samples. arXiv preprint arXiv:2510.07192, 2025.
Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Montali et al. [2023] N. Montali, J. Lambert, P. Mougin, A. Kuefler, N. Rhinehart, M. Li, C. Gulino, T. Emrich, Z. Yang, S. Whiteson, et al. The Waymo Open Sim Agents Challenge. Advances in Neural Information Processing Systems, 36:59151–59171, 2023.
Waymo LLC [2025] Waymo LLC. Waymo safety impact. https://waymo.com/safety/impact/, 2025. Accessed: 2026-05-06.
Chen et al. [2024] L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li. End-to-end autonomous driving: Challenges and frontiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024.
Jia et al. [2024] X. Jia, Z. Yang, Q. Li, Z. Zhang, and J. Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. Advances in Neural Information Processing Systems, 37:819–844, 2024.
Hu et al. [2023] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
Jiang et al. [2023] B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. VAD: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023.
Huang et al. [2022] Y. Huang, J. Du, Z. Yang, Z. Zhou, L. Zhang, and H. Chen. A survey on trajectory-prediction methods for autonomous driving. IEEE Transactions on Intelligent Vehicles, 7(3):652–674, 2022.
Vinitsky et al. [2022] E. Vinitsky, N. Lichtlé, S. Kanaa, et al. Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
Caesar et al. [2019] H. Caesar, V. Bankiti, A. Lang, S. Vora, V. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv, 2019.
Wilson et al. [2023] B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493, 2023.
Bansal et al. [2018] M. Bansal, A. Krizhevsky, and A. Ogale. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018.
Salzmann et al. [2020] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, pages 683–700. Springer, 2020.
Gu et al. [2021] J. Gu, C. Sun, and H. Zhao. DenseTNT: End-to-end trajectory prediction from dense goal sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15303–15312, 2021.
Nayakanti et al. [2023] N. Nayakanti, R. Al-Rfou, A. Zhou, K. Goel, K. S. Refaat, and B. Sapp. Wayformer: Motion forecasting via simple & efficient attention networks. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2980–2987. IEEE, 2023.
Ngiam et al. [2021] J. Ngiam, B. Caine, V. Vasudevan, Z. Zhang, H.-T. L. Chiang, J. Ling, R. Roelofs, A. Bewley, C. Liu, A. Venugopal, et al. Scene Transformer: A unified architecture for predicting multiple agent trajectories. arXiv preprint arXiv:2106.08417, 2021.
Zhou et al. [2022] Z. Zhou, L. Ye, J. Wang, K. Wu, and K. Lu. HiVT: Hierarchical vector transformer for multi-agent motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8823–8833, 2022.
Shi et al. [2022] S. Shi, L. Jiang, D. Dai, and B. Schiele. Motion transformer with global intention localization and local movement refinement. Advances in Neural Information Processing Systems, 35:6531–6543, 2022.
Zhou et al. [2023] Z. Zhou, J. Wang, Y.-H. Li, and Y.-K. Huang. Query-centric trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17863–17873, 2023.
Seff et al. [2023] A. Seff, B. Cera, D. Chen, M. Ng, A. Zhou, N. Nayakanti, K. S. Refaat, R. Al-Rfou, and B. Sapp. MotionLM: Multi-agent motion forecasting as language modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8579–8590, 2023.
Zhong et al. [2023] Z. Zhong, D. Rempe, D. Xu, Y. Chen, S. Veer, T. Che, B. Ray, and M. Pavone. Guided conditional diffusion for controllable traffic simulation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 3560–3566. IEEE, 2023.
Jiang et al. [2024] C. M. Jiang, Y. Bai, A. Cornman, C. Davis, X. Huang, H. Jeon, S. Kulshrestha, J. Lambert, S. Li, X. Zhou, et al. SceneDiffuser: Efficient and controllable driving simulation initialization and rollout. Advances in Neural Information Processing Systems, 37:55729–55760, 2024.
Huang et al. [2024] Z. Huang, Z. Zhang, A. Vaidya, Y. Chen, C. Lv, and J. F. Fisac. Versatile behavior diffusion for generalized traffic agent simulation. arXiv preprint arXiv:2404.02524, 2024.
Tan et al. [2025] S. Tan, J. Lambert, H. Jeon, S. Kulshrestha, Y. Bai, J. Luo, D. Anguelov, M. Tan, and C. M. Jiang. SceneDiffuser++: City-scale traffic simulation via a generative world model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1570–1580, 2025.
Liao et al. [2025] B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y. Zhang, Q. Zhang, et al. DiffusionDrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025.
Lu et al. [2023] Y. Lu, J. Fu, G. Tucker, X. Pan, E. Bronstein, R. Roelofs, B. Sapp, B. White, A. Faust, S. Whiteson, et al. Imitation is not enough: Robustifying imitation with reinforcement learning for challenging driving scenarios. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7553–7560. IEEE, 2023.
Peng et al. [2024] Z. Peng, W. Luo, Y. Lu, T. Shen, C. Gulino, A. Seff, and J. Fu. Improving agent behaviors with RL fine-tuning for autonomous driving. In Computer Vision - ECCV 2024 - 18th European Conference, volume 15083 of Lecture Notes in Computer Science, pages 165–181. Springer, 2024.
Zhang et al. [2025] Z. Zhang, P. Karkus, M. Igl, W. Ding, Y. Chen, B. Ivanovic, and M. Pavone. Closed-loop supervised fine-tuning of tokenized traffic models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
Silver et al. [2018] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018.
Vinyals et al. [2019] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
Hu et al. [2020] H. Hu, A. Lerer, A. Peysakhovich, and J. Foerster. "Other-play" for zero-shot coordination. In International Conference on Machine Learning, pages 4399–4410. PMLR, 2020.
Bard et al. [2020] N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V. Dumoulin, S. Moitra, E. Hughes, et al. The Hanabi challenge: A new frontier for AI research. Artificial Intelligence, 280:103216, 2020.
Jacob et al. [2022] A. P. Jacob, D. J. Wu, G. Farina, A. Lerer, H. Hu, A. Bakhtin, J. Andreas, and N. Brown. Modeling strong and human-like gameplay with KL-regularized search. In International Conference on Machine Learning, pages 9695–9728. PMLR, 2022.
Treiber et al. [2000] M. Treiber, A. Hennecke, and D. Helbing. Congested traffic states in empirical observations and microscopic simulations. Physical Review E, 62(2):1805–1824, Aug. 2000. ISSN 1063-651X, 1095-3787. doi:10.1103/PhysRevE.62.1805. URL https://link.aps.org/doi/10.1103/PhysRevE.62.1805.
Charraut et al. [2025] V. Charraut, W. Doulazmi, T. Tournaire, and T. Buhet. V-Max: A RL framework for autonomous driving. Reinforcement Learning Journal, 6:2427–2451, 2025.
Dauner et al. [2024] D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, et al. NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems, 37:28706–28719, 2024.
Cornelisse [2025] D. Cornelisse. Human-likeness metrics for autonomous agents: are we measuring the right thing? Substack, 2025. Blog post analyzing the Waymo Open Sim Agent Challenge (WOSAC) realism benchmark.
Arumugam et al. [2024] D. Arumugam, S. Kumar, R. Gummadi, and B. Van Roy. Satisficing exploration for deep reinforcement learning. arXiv preprint arXiv:2407.12185, 2024.
Scanlon et al. [2026] J. M. Scanlon, K. D. Kusano, J. Engstrom, and T. Victor. Collision avoidance effectiveness of an automated driving system using a human driver behavior reference model in reconstructed fatal collisions. In WCX SAE World Congress Experience. SAE Technical Paper, 2026.
Finzi et al. [2026] M. Finzi, S. Qiu, Y. Jiang, P. Izmailov, J. Z. Kolter, and A. G. Wilson. From entropy to epiplexity: Rethinking information for computationally bounded intelligence. arXiv preprint arXiv:2601.03220, 2026.
Distelzweig et al. [2026] A. Distelzweig, F. Janjoš, A. Look, A. Rothenhäusler, D. Jost, O. Scheel, R. Rajan, D. Cornelisse, E. Vinitsky, and J. Boedecker. Beyond self-play and scale: A behavior benchmark for generalization in autonomous driving. arXiv preprint arXiv:2605.10034, 2026.
Gulino et al. [2023] C. Gulino, J. Fu, W. Luo, G. Tucker, E. Bronstein, Y. Lu, J. Harb, X. Pan, Y. Wang, X. Chen, et al. Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research. Advances in Neural Information Processing Systems, 36:7730–7742, 2023.

Appendix A Simulation Environment and Design

A.1 World Initialization from Scenario Metadata

We use PufferDrive 2.0 for simulation and training [18]. PufferDrive is a batched simulator that runs many environments in parallel, reaching 390k steps per second (SPS) on an NVIDIA RTX 5090 GPU. We initialize environments using the Waymo Open Motion Dataset (WOMD) [23], which provides a large set of multi-agent traffic scenarios. Each scenario supplies the metadata we need: the roadgraph, a variable number of agents (cars, cyclists, and pedestrians), and other objects in the scene. This information is the output of a perception stack, so we operate directly on these clean features (in bounding-box world).

Each scenario is 9 seconds long and discretized into 90 steps. We take each logged agent's initial position ($t=0$) as its starting position in the scene, and its last valid logged position ($t=T$) as its goal, which lets us goal-condition the agents. The full Waymo training dataset contains 500k scenarios, but in this paper we use at most 50k of the randomly sampled scenarios. When constructing the environments, we randomly sample scenarios from WOMD until we hit a target number of agents (e.g., on an NVIDIA RTX 4080 with 16GB of memory, we keep adding environments until we reach 1024 agents).
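The two initialization rules in this paragraph can be sketched as follows. This is an illustrative sketch only: the function names `start_and_goal` and `fill_agent_budget`, sampling with replacement, and allowing the final scenario to overshoot the budget are assumptions, not details confirmed by the paper or the PufferDrive API.

```python
import random

def start_and_goal(positions, valid):
    """Start = logged position at t=0; goal = last valid logged position.

    positions: list of (x, y) per timestep; valid: list of bools per timestep.
    """
    last_valid = max(t for t, ok in enumerate(valid) if ok)
    return positions[0], positions[last_valid]

def fill_agent_budget(agents_per_scenario, target_agents, seed=0):
    """Sample scenario indices until the total agent count reaches the target.

    Assumptions: sampling is uniform with replacement, every scenario has at
    least one agent, and the last sampled scenario may overshoot the target.
    """
    rng = random.Random(seed)
    chosen, total = [], 0
    while total < target_agents:
        idx = rng.randrange(len(agents_per_scenario))
        chosen.append(idx)
        total += agents_per_scenario[idx]
    return chosen, total

# Toy log: the agent is tracked for three steps, then drops out of view.
start, goal = start_and_goal(
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
    [True, True, True, False],
)
chosen, total = fill_agent_budget([8, 16, 32], target_agents=1024)
```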

A.2 Observation Space

We take a decentralized approach and provide every agent with a partial view of the environment in a local coordinate frame. This is similar to the observation space of prior related works, such as GIGAFLOW [5] and GPUDrive [4]. At each timestep, an agent receives the combination of three feature blocks: an ego block describing its own state, a partner block describing the $N_{p}=31$ closest other agents within a 50 m radius, and a road block describing up to $N_{r}=128$ nearby road segments drawn from a $21\times 21$ grid of $5\,\text{m}\times 5\,\text{m}$ cells centered on the agent. Missing slots (fewer partners or road segments than the maximum) are zero-padded. Tables 3, 4, and 5 list the features in each block. All positions and headings are expressed in the agent's local frame, so the observation is invariant to the global pose of the scene. The total observation vector has dimension $11+7\times 31+7\times 128=1{,}124$.

Table 3: Ego features (11 values) for the delta-local dynamics model. Features 0–3 expose the sampled conditioning variables to the policy so it can modulate its behavior as a function of $\lambda$ and the reward weights (Section 4.3). We did not use conditioning in the paper and set all values to fixed values: $\lambda=0.075$; $r_{\text{coll}}, r_{\text{off}}=-1$ and $r_{\text{goal}}=+1$.

| Idx | Feature | Normalization | Description |
|---|---|---|---|
| 0 | $\lambda$ | none | Human-regularization coefficient |
| 1 | $r_{\text{coll}}$ | none | Sampled collision reward |
| 2 | $r_{\text{off}}$ | none | Sampled off-road reward |
| 3 | $r_{\text{goal}}$ | none | Sampled goal reward |
| 4 | $\Delta x_{\text{goal}}$ | $\times 0.005$ | Goal position (ego frame), longitudinal |
| 5 | $\Delta y_{\text{goal}}$ | $\times 0.005$ | Goal position (ego frame), lateral |
| 6 | signed speed | $/\,100\,\text{m/s}$ | Speed projected onto heading |
| 7 | vehicle width | $/\,15\,\text{m}$ | Ego bounding-box width |
| 8 | vehicle length | $/\,30\,\text{m}$ | Ego bounding-box length |
| 9 | collision flag | $\{0,1\}$ | 1 if currently colliding |
| 10 | entity type | $/\,3$ | Vehicle (1), pedestrian (2), cyclist (3) |

Table 4: Partner features (7 values $\times$ 31 partners = 217 values). Partners are ordered by index and filtered to those within $50\,\text{m}$ of the ego agent. All positions and headings are in the ego frame.

| Idx | Feature | Normalization | Description |
|---|---|---|---|
| 0 | $\Delta x$ | $\times 0.02$ | Partner position, longitudinal |
| 1 | $\Delta y$ | $\times 0.02$ | Partner position, lateral |
| 2 | partner width | $/\,15\,\text{m}$ | Partner bounding-box width |
| 3 | partner length | $/\,30\,\text{m}$ | Partner bounding-box length |
| 4 | $\cos(\Delta\psi)$ | none | Relative heading, cosine component |
| 5 | $\sin(\Delta\psi)$ | none | Relative heading, sine component |
| 6 | partner signed speed | $/\,100\,\text{m/s}$ | Signed speed along partner's heading |

Table 5: Road-segment features (7 values $\times$ 128 segments = 896 values). Segments are drawn from a $21\times 21$ grid of $5\,\text{m}$ cells centered on the ego agent, and include road lanes, road lines, and road edges. Each segment is described by the midpoint, length, and orientation of a single polyline segment.

| Idx | Feature | Normalization | Description |
|---|---|---|---|
| 0 | midpoint $x$ | $\times 0.02$ | Segment midpoint, longitudinal (ego frame) |
| 1 | midpoint $y$ | $\times 0.02$ | Segment midpoint, lateral (ego frame) |
| 2 | segment length | $/\,100\,\text{m}$ | Length of the polyline segment |
| 3 | segment width | $/\,100\,\text{m}$ | Fixed nominal width (0.1 m) |
| 4 | $\cos(\theta)$ | none | Segment orientation in ego frame |
| 5 | $\sin(\theta)$ | none | Segment orientation in ego frame |
| 6 | segment type | $\{0,1,2\}$ | Road lane (0), road line (1), road edge (2) |
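To make the layout of the observation vector concrete, the sketch below assembles the three blocks with zero-padding and computes one partner row using the normalizations from the partner-feature table. It is a hedged illustration: the helper names (`pad_block`, `partner_row`, `build_observation`) and the flat concatenation order ego, partners, road are assumptions, not the simulator's actual interface.

```python
import math

EGO_DIM, PARTNER_DIM, ROAD_DIM = 11, 7, 7
MAX_PARTNERS, MAX_ROAD = 31, 128

def pad_block(rows, max_rows, row_dim):
    """Flatten up to max_rows feature rows and zero-pad the missing slots."""
    rows = rows[:max_rows]
    flat = [v for row in rows for v in row]
    return flat + [0.0] * ((max_rows - len(rows)) * row_dim)

def partner_row(ego, partner):
    """One partner row in the ego frame; inputs are (x, y, psi, width, length, speed)."""
    ex, ey, epsi = ego[:3]
    px, py, ppsi, width, length, speed = partner
    dx = math.cos(epsi) * (px - ex) + math.sin(epsi) * (py - ey)
    dy = -math.sin(epsi) * (px - ex) + math.cos(epsi) * (py - ey)
    dpsi = ppsi - epsi
    return [dx * 0.02, dy * 0.02, width / 15.0, length / 30.0,
            math.cos(dpsi), math.sin(dpsi), speed / 100.0]

def build_observation(ego_features, partner_rows, road_rows):
    """Concatenate ego, partner, and road blocks (assumed order)."""
    assert len(ego_features) == EGO_DIM
    return (list(ego_features)
            + pad_block(partner_rows, MAX_PARTNERS, PARTNER_DIM)
            + pad_block(road_rows, MAX_ROAD, ROAD_DIM))

# Ego at the origin heading "north"; one partner 10 m straight ahead.
row = partner_row((0.0, 0.0, math.pi / 2), (0.0, 10.0, math.pi / 2, 2.0, 5.0, 10.0))
obs = build_observation([0.0] * EGO_DIM, [row], [[0.0] * ROAD_DIM] * 10)
```

With one partner and ten road segments present, the remaining 30 partner slots and 118 road slots are zeros, and the vector length matches the total stated in the text.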

A.3 Actions and Dynamics

We use a single dynamics model with a discretized action space for both the unregularized and regularized agents.

Delta-local dynamics with kinematic constraints.

The action is a triple $(\Delta x,\Delta y,\Delta\psi)$ in the agent's local frame at time $t$. Translation is rotated into the world frame and added to the position; heading is updated directly:

\begin{align}
x_{t+1} &= x_{t}+\cos(\psi_{t})\,\Delta x-\sin(\psi_{t})\,\Delta y, \tag{3}\\
y_{t+1} &= y_{t}+\sin(\psi_{t})\,\Delta x+\cos(\psi_{t})\,\Delta y, \tag{4}\\
\psi_{t+1} &= \mathrm{wrap}(\psi_{t}+\Delta\psi). \tag{5}
\end{align}

Velocity is reported as the world-frame displacement divided by $\Delta t=0.1$ s. We bound each component roughly based on realistic actions present in the human data, as shown in Figure 6; specifically, we define $\Delta x\in[-3.5,3.5]$ m, $\Delta y\in[-0.1,0.1]$ m, and $\Delta\psi\in[-\pi/6,\pi/6]$. Each of the three dimensions is binned independently into 51, 51, and 127 values, respectively. Figure 6 shows that the distributions for $\Delta y$ and $\Delta\psi$ are roughly symmetric, whereas the distribution for $\Delta x$ is strongly asymmetric. This is expected, since most vehicles move forward and only a small number of agents in the scenes drive in reverse (e.g., when parking).
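As a rough illustration of the update in Eqs. (3)–(5), the sketch below applies one delta-local step and maps a discrete bin index back to a continuous action value. The uniform bin spacing with endpoints included, the wrap range $[-\pi, \pi)$, and the names `step`, `wrap`, and `bin_value` are assumptions made for this example; the bin counts and bounds are the ones stated in the text.

```python
import math

DT = 0.1  # seconds per simulator step
BOUNDS = {"dx": (-3.5, 3.5), "dy": (-0.1, 0.1), "dpsi": (-math.pi / 6, math.pi / 6)}
NUM_BINS = {"dx": 51, "dy": 51, "dpsi": 127}

def wrap(angle):
    """Wrap an angle to [-pi, pi) (assumed convention)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def bin_value(name, index):
    """Continuous value of a bin, assuming uniformly spaced bins over the bounds."""
    lo, hi = BOUNDS[name]
    return lo + (hi - lo) * index / (NUM_BINS[name] - 1)

def step(x, y, psi, dx, dy, dpsi):
    """Rotate the local displacement into the world frame and update the pose."""
    x_next = x + math.cos(psi) * dx - math.sin(psi) * dy
    y_next = y + math.sin(psi) * dx + math.cos(psi) * dy
    return x_next, y_next, wrap(psi + dpsi)

# An agent heading "north" that moves 1 m forward in its own frame.
x1, y1, psi1 = step(0.0, 0.0, math.pi / 2, 1.0, 0.0, 0.0)
speed = math.hypot(x1 - 0.0, y1 - 0.0) / DT  # world-frame displacement / dt
```

Under the uniform-spacing assumption, odd bin counts place a bin at exactly zero for every component, so "no motion" is representable.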

Delta-local dynamics are kinematically unconstrained by default: the agent can translate laterally without rotating, pivot in place, or instantaneously reverse its heading rate. To prevent impossible behaviors, we apply two physics-based constraints to the action at each step. Each constraint clips the action after the previous one has been applied, with the previously executed (post-constraint) values used as the reference. The constraints are:

1. Longitudinal acceleration bound. The change in implied forward speed is clipped to $\pm A_{\text{long,max}}\cdot\Delta t$, where $A_{\text{long,max}}=8$ m/s$^2$. This caps acceleration and braking.
2. Lateral motion envelope. Lateral displacement is bounded by $|\Delta y|\leq|\Delta x|\cdot\tan(\delta_{\max})$, where $\delta_{\max}=0.7$ rad is the maximum effective steering angle. This eliminates lateral sliding and side-shimmy at low forward speed.

These physical constraints prevent kinematically implausible actions; they do not encode any preference over driving style and are independent of the human anchor.
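The two constraints above can be sketched as a sequential clipping function. This is a hedged sketch under stated assumptions: the function name `constrain` is hypothetical, and "implied forward speed" is taken to be $\Delta x / \Delta t$, so the acceleration bound becomes a bound of $A_{\text{long,max}} \cdot \Delta t^{2}$ on the change in $\Delta x$ relative to the previously executed value.

```python
import math

DT = 0.1           # s, timestep from the text
A_LONG_MAX = 8.0   # m/s^2, longitudinal acceleration bound from the text
DELTA_MAX = 0.7    # rad, maximum effective steering angle from the text


def constrain(dx, dy, prev_dx):
    """Clip a proposed (dx, dy) given the previously executed dx.

    Assumption: implied forward speed is dx / DT, so a speed change of
    A_LONG_MAX * DT corresponds to a change of A_LONG_MAX * DT**2 in dx.
    """
    # 1. Longitudinal acceleration bound, relative to the last executed action.
    max_change = A_LONG_MAX * DT * DT
    dx = min(max(dx, prev_dx - max_change), prev_dx + max_change)

    # 2. Lateral motion envelope, applied after the longitudinal clip.
    max_dy = abs(dx) * math.tan(DELTA_MAX)
    dy = min(max(dy, -max_dy), max_dy)
    return dx, dy


accel_dx, accel_dy = constrain(1.0, 0.1, prev_dx=0.5)    # hard acceleration is capped
still_dx, still_dy = constrain(0.0, 0.05, prev_dx=0.0)   # no side-slip at standstill
```

Note that the second constraint uses the already-clipped $\Delta x$, matching the statement that each constraint is applied after the previous one.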

Figure 6: Discretized delta-local action space for each component ($\Delta x$, $\Delta y$, $\Delta\psi$). Histograms show the empirical density (blue) of 10,996,751 valid action timesteps recovered from expert trajectories across 10,000 maps. Yellow lines mark the 1st and 99th percentiles of the data; red lines mark the action-space bounds ($\pm 3.5$ m, $\pm 0.1$ m, $\pm\pi/6$ rad). Each dimension is binned independently into 512 values. The bounds were chosen to respect natural movements in the data: $0.00\%$ of $\Delta x$ and $\Delta y$ samples fall outside their bounds, and $0.71\%$ of $\Delta\psi$ samples fall outside $\pm\pi/6$.

A.4Collecting Human Driving Data

The behavioral cloning (BC) anchor is trained on observation–action pairs $(o_{t}, a_{t})$. We therefore need actions that (i) live in the simulator’s action space and (ii) reproduce the logged motion when applied through the simulator’s dynamics. We construct the dataset in two steps. Figure 8 shows three examples of this process in the simulator.

Step 1: Inferring actions from the data.

For each timestep $t$, we invert the delta-local dynamics to recover the action that produced the next logged state. Projecting the world-frame displacement into the agent’s local frame at $t$ gives:

$$\Delta x_{t} = \cos(\psi_{t})(x_{t+1} - x_{t}) + \sin(\psi_{t})(y_{t+1} - y_{t}), \tag{6}$$

$$\Delta y_{t} = -\sin(\psi_{t})(x_{t+1} - x_{t}) + \cos(\psi_{t})(y_{t+1} - y_{t}), \tag{7}$$

$$\Delta\psi_{t} = \mathrm{wrap}(\psi_{t+1} - \psi_{t}). \tag{8}$$

Each triple is clipped to the action bounds and snapped to the nearest discrete bin. Timesteps where either $t$ or $t+1$ is flagged invalid in the log are marked as missing and excluded from training.
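The inversion in Eqs. (6)-(8), followed by clipping and snapping, might look as follows. This is a sketch, not the authors' code: `infer_action` and `snap` are hypothetical names, the bin count is left as a parameter, and uniformly spaced bin centers that include both bounds are an assumption.

```python
import math


def wrap(angle):
    """Wrap an angle to [-pi, pi). The interval is an assumption."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def infer_action(x, y, psi, x_next, y_next, psi_next):
    """Recover the delta-local action between two logged states, Eqs. (6)-(8)."""
    ex, ey = x_next - x, y_next - y
    c, s = math.cos(psi), math.sin(psi)
    dx = c * ex + s * ey     # Eq. (6): project displacement onto the heading
    dy = -s * ex + c * ey    # Eq. (7): project onto the lateral axis
    dpsi = wrap(psi_next - psi)  # Eq. (8)
    return dx, dy, dpsi


def snap(value, low, high, n_bins):
    """Clip to [low, high] and snap to the nearest of n_bins uniform bin centers.

    Assumption: bin centers are evenly spaced and include both bounds.
    Returns (bin index, snapped value).
    """
    value = min(max(value, low), high)
    width = (high - low) / (n_bins - 1)
    index = round((value - low) / width)
    return index, low + index * width


# A logged step of 1 m along +y while facing +y is a pure forward action.
dx, dy, dpsi = infer_action(0.0, 0.0, math.pi / 2, 0.0, 1.0, math.pi / 2)
idx, snapped_dx = snap(dx, -3.5, 3.5, n_bins=51)
```

The gap between `dx` and `snapped_dx` is the per-step quantization error that the discretization analysis below measures in aggregate.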

Step 2: Replaying actions through the simulator.

To produce observations, we replay the inferred action sequence through the simulator and record the observation at every resulting state. The BC anchor is then trained on the resulting (simulator observation, inferred action) pairs. Discretization introduces a small error that shrinks as the number of bins grows (details below); to prevent this error from accumulating, we teleport agents to each successive ground-truth state instead of stepping them forward with the inferred actions. We note that stepping agents directly is also viable with larger action spaces, where the discretization error is smaller.

Effect of discretization on performance.

Figure 7 and Table 6 quantify the cost of discretization. Continuous actions reproduce the logged trajectory almost exactly (ADE $0.001$ m), confirming that the delta-local dynamics and kinematic constraints are themselves well-posed. Discretizing into 512 bins per dimension introduces a quantization floor of ADE $0.097$ m, which is roughly two orders of magnitude larger but still very close to the original trajectory. Off-road and collision rates increase modestly under discretization ($1.2\%$ vs. $0.8\%$ off-road, $0.4\%$ vs. $0.0\%$ collision), reflecting the rare cases where snapping to the nearest bin pushes the SDC just outside a road edge or into a static neighbor; both representations complete the route in $100\%$ of scenarios.

Table 6: Inferred-expert-action quality for the delta-local dynamics model. Comparison of discrete (bin-quantized) vs. continuous (direct float) expert actions. Aggregated over 10,240 pooled samples. Values are mean $\pm$ SE.

| Action type | Route prog. (%) ↑ | Coll. (%) ↓ | Off-road (%) ↓ | ADE (m) ↓ | Lateral L2 (m) ↓ | Longitudinal L2 (m) ↓ |
|---|---|---|---|---|---|---|
| discrete | 100.0 | 0.4 ± 0.1 | 1.2 ± 0.2 | 0.097 ± 0.002 | 0.096 ± 0.002 | 0.004 ± 0.000 |
| continuous | 100.0 | 0.0 | 0.8 ± 0.1 | 0.001 ± 0.000 | 0.001 ± 0.000 | 0.001 ± 0.000 |

Figure 7: Effect of action discretization on inferred-expert-action quality. We replay each agent’s logged trajectory through the simulator using actions inferred from the logs, comparing discrete (bin-quantized, blue) and continuous (direct float, green) action representations. Left: SDC rates aggregated across 10,240 pooled samples; both representations complete the route in 100% of scenarios, but discretization induces modestly higher off-road and collision rates. Center, right: distributions of per-trajectory lateral and longitudinal L2 error to the logged pose. Continuous actions reproduce the log almost exactly (errors concentrated near zero), while discrete actions exhibit a small but consistent quantization floor of $\sim 0.1$ m laterally. Error bars on the bar plot denote standard error.

Figure 8: Three annotated example scenarios illustrating the human data collection process. The self-driving car (SDC), marked in cyan, is the Waymo vehicle whose human-driven trajectory we use as the driving log. Logged trajectories are shown in green; purple trajectories show the result of stepping each agent through the simulator under the inferred delta-local actions. We select only the SDC trajectory because it is typically the cleanest data in the scene; the visualized step-wise displacement illustrates a few low-quality (high-ADE) log trajectories that would otherwise contaminate the anchor.

A.5Reward Function

We use a sparse reward: $r^{i} = +1$ if agent $i$ reaches its goal within $\delta = 2$ meters before the episode ends, $-1$ on collision or going off-road, and $0$ otherwise. We deliberately omit dense shaping terms so that safe and human-compatible behaviors can emerge from regularization.
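For concreteness, the reward rule can be written as a small function. This is an illustrative sketch: the name `sparse_reward` is hypothetical, and giving the penalty priority when a failure and a goal arrival coincide in the same step is an assumption the text does not settle.

```python
GOAL_RADIUS = 2.0  # meters (delta in the text)


def sparse_reward(dist_to_goal, collided, off_road):
    """Sparse per-agent reward: +1 at the goal, -1 on failure, 0 otherwise.

    Assumption: a collision or off-road event takes priority over reaching
    the goal if both happen in the same step.
    """
    if collided or off_road:
        return -1.0
    if dist_to_goal <= GOAL_RADIUS:
        return 1.0
    return 0.0


r_goal = sparse_reward(1.5, collided=False, off_road=False)
r_crash = sparse_reward(1.5, collided=True, off_road=False)
r_cruise = sparse_reward(20.0, collided=False, off_road=False)
```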

Appendix BTraining

B.1Behavioral Cloning Anchor Policies

Each anchor $\tau_{n}$ is trained by minimizing the negative log-likelihood of the logged actions under the factorized discrete action distribution described in Appendix A.3. We extract (observation, action) tuples through the procedure described in Appendix A.4. Note that we use only the SDC trajectory from each scenario for training, as it is the highest-quality data source. Since other agents are reconstructed from the perception stack, they exhibit more noise. Moreover, we have no guarantees about the driving quality of the surrounding humans. Since we obtain one trajectory per scene, each scenario contributes roughly 9 seconds of human data. Although these trajectories were collected in Waymo vehicles, they reflect manual human driving by an expert driver behind the wheel [23].

We train with Adam at a learning rate of $10^{-4}$ and a batch size of 2048 for up to 5000 epochs, with early stopping on the held-out validation loss after 100 epochs without improvement. Table 7 reports open- and closed-loop metrics for each anchor on 10,000 held-out validation scenarios. Figure 10 shows the 5-bin validation accuracy for each action head over training; from only 30 minutes of data, validation accuracy converges to between 80% and 90%. We use the 5-bin metric instead of top-1 as there are 256 bins per action head, so the step sizes between bins are very small.
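The 5-bin metric can be illustrated with a short helper. This sketch assumes the metric counts a prediction as correct when the predicted bin index lies within 5 bins of the logged bin index (consistent with the "Acc. ±5 bins" column in Table 7); the function name and the toy indices are hypothetical.

```python
def within_k_accuracy(pred_bins, true_bins, k=5):
    """Fraction of predictions whose bin index is within k bins of the label.

    With k=0 this reduces to ordinary top-1 accuracy.
    """
    hits = sum(1 for p, t in zip(pred_bins, true_bins) if abs(p - t) <= k)
    return hits / len(true_bins)


# Toy example: bin-index errors of 2, 6, 0, and 5.
pred = [10, 20, 30, 40]
true = [12, 26, 30, 35]
acc_5 = within_k_accuracy(pred, true, k=5)   # 3 of 4 predictions are within 5 bins
acc_top1 = within_k_accuracy(pred, true, k=0)  # only the exact match counts
```

With hundreds of bins per head, adjacent bins differ by very small displacements, which is why the tolerant metric is more informative than top-1.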

Figures 12 and 11 compare the learned action distributions against the empirical distribution of the logged actions, for anchors trained on 30 minutes and 30 hours of data, respectively; in both cases, the learned distributions match the data reasonably well.

Table 7: BC anchor evaluation. Open-loop metrics on the held-out validation set; closed-loop metrics averaged over validation scenes. Within-5-bin accuracy is the average of the $\Delta x$, $\Delta y$, and $\Delta\mathrm{yaw}$ accuracies at the final training step. Columns are grouped as open-loop (Acc., Acc. ±5 bins, Loss), closed-loop self-play, and closed-loop human-replay (SDC only).

| Human data (h) | Acc. (%) | Acc. ±5 bins (%) | Loss | Self-play route prog. | Self-play score | Human-replay route prog. | Human-replay score |
|---|---|---|---|---|---|---|---|
| 0.2 | 23.4 | 72.4 | 15.677 | 0.720 ± 0.012 | 0.215 ± 0.013 | 0.765 ± 0.007 | 0.242 ± 0.009 |
| 0.5 | 36.1 | 87.3 | 5.269 | 0.719 ± 0.011 | 0.277 ± 0.014 | 0.800 ± 0.006 | 0.371 ± 0.011 |
| 3.0 | 48.2 | 92.6 | 1.641 | 0.835 ± 0.010 | 0.502 ± 0.017 | 0.842 ± 0.006 | 0.538 ± 0.011 |
| 30.0 | 52.8 | 94.9 | 1.266 | **0.932 ± 0.007** | **0.685 ± 0.016** | **0.873 ± 0.006** | **0.666 ± 0.010** |

Figure 9: Open- and closed-loop performance of the anchor BC policies as a function of human driving data. Left: final exact (top-1, blue) and within-5-bin (purple) accuracy on 10,000 held-out validation scenarios. Remaining panels: final validation loss, route progress, and score.

Figure 10: Training curves for the anchor policies. Each panel shows within-5-bin validation accuracy on a held-out set of scenarios for one action component ($\Delta x$, $\Delta y$, $\Delta\psi$). Curves terminate at different step counts because training stops once validation accuracy plateaus (no improvement for 100 consecutive epochs).

Figure 11: Example of actual vs. learned distributions for 12k maps (30 hours).

Figure 12: Example of actual vs. learned distributions for 200 maps (30 min).

B.2Self-Play Reinforcement Learning

Both self-play variants run for 20 billion steps.

B.2.1Regularization

Let $\pi_{\theta}$ denote the RL policy and $\tau_{n}$ the fixed BC anchor trained on $n$ scenarios. We regularize $\pi_{\theta}$ toward $\tau_{n}$ by adding a KL penalty on states visited during the rollout:

$$\mathcal{L}_{\mathrm{reg}}(\theta) = \frac{\lambda}{M}\sum_{j=1}^{M} D_{\mathrm{KL}}\!\left(\tau_{n}(\cdot\mid o_{j})\,\middle\|\,\pi_{\theta}(\cdot\mid o_{j})\right), \tag{9}$$

where $\lambda = 0.075$ is fixed throughout training and inference and $M$ is the minibatch size. The full objective augments standard PPO with this penalty:

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{pg}} + c_{v}\,\mathcal{L}_{\mathrm{v}} - c_{H}\,H + \mathcal{L}_{\mathrm{reg}}, \tag{10}$$

where $\mathcal{L}_{\mathrm{pg}}$ is the clipped surrogate policy-gradient loss, $\mathcal{L}_{\mathrm{v}}$ the value-function loss, $H$ the entropy bonus, and $c_{v}$, $c_{H}$ their respective coefficients. The KL term pulls $\pi_{\theta}$ toward the anchor on states the policy actually visits, rather than on the offline logged data distribution. Setting $\lambda = 0$ recovers unregularized self-play.
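A minimal numerical sketch of Eq. (9) is given below, written with plain Python lists rather than a deep-learning framework. It assumes the KL divergence of the factorized action distribution is the sum of the per-head KL divergences (which holds for independent heads); the function names and the toy distributions are hypothetical, and in practice the penalty would be computed on logits inside the PPO update.

```python
import math

LAMBDA = 0.075  # regularization weight from the text


def kl_categorical(p, q, eps=1e-12):
    """D_KL(p || q) for two categorical distributions given as probability lists."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def reg_loss(anchor_probs, policy_probs, lam=LAMBDA):
    """Eq. (9): lam / M * sum_j KL(anchor(.|o_j) || policy(.|o_j)).

    anchor_probs[j][h] is the anchor's distribution for sample j and action
    head h. Assumption: per-head KL terms are summed for the factorized action.
    """
    total = 0.0
    for anchor_heads, policy_heads in zip(anchor_probs, policy_probs):
        total += sum(kl_categorical(p, q) for p, q in zip(anchor_heads, policy_heads))
    return lam * total / len(anchor_probs)


# Toy minibatch of M=2 samples with a single two-way action head.
anchor = [[[0.5, 0.5]], [[0.9, 0.1]]]
policy_same = [[[0.5, 0.5]], [[0.9, 0.1]]]
policy_drift = [[[0.25, 0.75]], [[0.9, 0.1]]]

loss_same = reg_loss(anchor, policy_same)    # no penalty when the policy matches the anchor
loss_drift = reg_loss(anchor, policy_drift)  # positive penalty once the policy drifts
```

The direction of the divergence matters: with the anchor as the first argument, the policy is penalized most for assigning low probability to actions the anchor considers likely.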

B.2.2Hyperparameters

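Table 8 lists the hyperparameters. We use the same parameters for regularized self-play RL and the baseline.

Table 8: PPO training hyperparameters.

| Group | Hyperparameter | Value |
|---|---|---|
| Architecture | Input size | 64 |
| Architecture | Hidden size | 256 |
| Architecture | RNN type | LSTM |
| Architecture | RNN input size | 256 |
| Architecture | RNN hidden size | 256 |
| Training | Total timesteps | 20B |
| Training | Batch size | 524,288 |
| Training | Minibatch size | 32,768 |
| Training | Rollout horizon | 32 |
| Training | Update epochs | 1 |
| Training | Learning rate | $4.26\times 10^{-3}$ |
| Training | LR schedule | Linear annealing |
| Training | Adam $\beta_{1}$ | 0.9 |
| Training | Adam $\beta_{2}$ | 0.999 |
| Training | Adam $\epsilon$ | $10^{-8}$ |
| Training | Clip coefficient | 0.2 |
| Training | Entropy coefficient | 0.001 |
| Training | VF coefficient | 2.0 |
| Training | VF clip | 0.2 |
| Training | GAE $\lambda$ | 0.95 |
| Training | Discount $\gamma$ | 0.99 |
| Training | Max gradient norm | 1.0 |
| Training | Priority $\alpha$ | 0.85 |
| Training | Priority $\beta_{0}$ | 0.85 |
| Training | V-trace $c$ clip | 1.0 |
| Training | V-trace $\rho$ clip | 1.0 |
| Training | Optimizer | Adam |
| Training | Seed | 42 |
| Environment & Rewards | Number of agents | 1,024 |
| Environment & Rewards | Number of workers | 16 |
| Environment & Rewards | Episode length | 150 steps |
| Environment & Rewards | Timestep $\Delta t$ | 0.1 s |
| Environment & Rewards | Goal radius | 2.0 m |
| Environment & Rewards | Action space | Discrete |
| Environment & Rewards | Dynamics model | Delta-local |
| Environment & Rewards | Goal reward | $+1.0$ |
| Environment & Rewards | Collision penalty | $-1.0$ |
| Environment & Rewards | Off-road penalty | $-1.0$ |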

B.3SMART Model Training and CATK finetuning

IL data scaling baseline experiments.

We trained SMART models via CATK [54] using the open-sourced codebase https://github.com/NVlabs/catk at commit d23886761fc5b5628c5973148c40284452745745. For the data scaling experiments, we used subsets of the Waymo Open Motion Dataset (WOMD). WOMD motion shards were preprocessed into CATK’s per-scenario cached format, and all training subsets were constructed from these cached scenario files.

Our final local runs used the smart_mini_3M model with vehicle-only supervision on deterministic subsets of 67, 200, 1200, and 12000 scenarios. In the subset construction scripts, scenarios are sorted by cached scenario filename in lexicographic order before selecting subsets. Vehicle-only supervision means that only vehicle agents contribute to the training loss, while pedestrians and cyclists remain present in the scene and are available as contextual inputs to the model. The local models were trained with CATK’s pre_bc configuration on a single GPU for 64 epochs with batch size 8. Results for both the SMART behavioral cloning checkpoints and the CAT-K / CLSFT fine-tuned checkpoints are reported in Table 11.

Open-sourced checkpoints.

We additionally compare against two author-provided checkpoints: a behavioral cloning checkpoint (pre_bc_E31.ckpt) and a closed-loop supervised fine-tuning checkpoint (clsft_E9.ckpt). For downstream evaluation, we exported predictions as .pkl files on the same 10k random validation split (data available at https://huggingface.co/datasets/daphne-cornelisse/pufferdrive_womd_val). We use two export modes: an all-agents mode, where the model controls all agents, and a planning mode, where only the SDC is controlled by the model while all other agents are replayed from ground truth. We re-exported all combinations of models and export modes with 32 rollouts for multimodal evaluation. We verified that there is zero scenario-ID overlap between each local training subset (67, 200, 1200, and 12000 scenarios) and the evaluation set.

Appendix CNeural Network Architecture

Both the BC anchor and the RL policy share the same multi-modal encoder structure. The flattened environment observation vector is first unpacked into its modalities: ego state, partner agents, and road segments. Each modality is processed by a two-layer MLP with ReLU activation and layer normalisation between the two linear layers. Partner and road embeddings are then aggregated across objects via max-pooling, producing one vector per stream. The three pooled vectors are concatenated and passed through a shared two-layer MLP (Linear $\to$ ReLU $\to$ Linear) to produce the final embedding. Separate linear heads decode this embedding into logits over each action dimension; a separate linear head with unit output produces the value estimate. The two architectures differ in width and in the presence of recurrence:

• BC anchor. Per-stream MLP width 128, shared MLP $3{\times}128 \to 512 \to 512$. No recurrence. Actor heads are linear projections from the 512-dimensional embedding. It has 776,190 trainable parameters.
• RL policy. Per-stream MLP width 64, shared MLP $3{\times}64 \to 256 \to 256$. The 256-dimensional embedding is passed through a single-layer LSTM with input size 256 and hidden size 256 (PufferLib LSTMWrapper). Actor and critic heads are linear projections from the 256-dimensional LSTM output. It has 650k trainable parameters.

Road segment features include a categorical type field that is replaced by a 7-class one-hot vector before encoding, expanding the road feature dimension from $d_{\text{road}}$ to $d_{\text{road}} + 6$.

Appendix DEvaluation

D.1Filtering the Waymo Dataset for Interactive SDC Scenarios

As pointed out in earlier works[67,21], many scenarios in the Waymo Open Motion Dataset (WOMD) involve the self-driving car (SDC) traveling without meaningful interaction with other agents—the SDC reaches its destination without requiring coordination or yielding. To increase the signal in our human-replay evaluation, we filter the dataset for scenarios in which the SDC trajectory intersects with other agents’ trajectories, indicating situations that require coordination, such as merging, yielding, or navigating busy intersections.

We score each scenario by counting the number of segment-level intersections between the SDC trajectory and all other agent trajectories, optionally filtering crossings that meet a minimum acute-angle threshold (to exclude near-parallel overlaps, such as lane changes). From a pool of 10,000 held-out validation scenarios, we rank by intersection count and select the top 200 most interactive scenes. Figure 13 shows the resulting intersection count distributions across the full dataset and the selected subset, and Figure 14 shows nine representative examples from the selected set.
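The scoring step can be sketched as a pairwise segment test between two polylines. This is an illustrative sketch, not the filtering script used in the paper: the function names are hypothetical, only strict (proper) crossings are counted, and the 30° threshold in the usage example is an arbitrary illustrative value.

```python
import math


def _cross(o, a, b):
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def segments_cross(p1, p2, q1, q2):
    """True if segment p1-p2 strictly crosses q1-q2 (touching endpoints excluded)."""
    d1, d2 = _cross(q1, q2, p1), _cross(q1, q2, p2)
    d3, d4 = _cross(p1, p2, q1), _cross(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def acute_angle(p1, p2, q1, q2):
    """Acute angle in radians between the directions of two segments."""
    a = math.atan2(p2[1] - p1[1], p2[0] - p1[0])
    b = math.atan2(q2[1] - q1[1], q2[0] - q1[0])
    diff = abs(a - b) % math.pi
    return min(diff, math.pi - diff)


def count_intersections(sdc, other, min_angle=0.0):
    """Count segment-level crossings between two trajectories (lists of (x, y))."""
    count = 0
    for i in range(len(sdc) - 1):
        for j in range(len(other) - 1):
            a, b, c, d = sdc[i], sdc[i + 1], other[j], other[j + 1]
            if segments_cross(a, b, c, d) and acute_angle(a, b, c, d) >= min_angle:
                count += 1
    return count


sdc_path = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
crossing_path = [(1.0, -1.0), (1.0, 1.0)]   # perpendicular crossing
merging_path = [(0.0, -0.1), (3.0, 0.2)]    # near-parallel, lane-change-like overlap

n_cross = count_intersections(sdc_path, crossing_path, min_angle=math.radians(30))
n_merge_raw = count_intersections(sdc_path, merging_path)
n_merge_angled = count_intersections(sdc_path, merging_path, min_angle=math.radians(30))
```

The near-parallel path counts as a raw intersection but is dropped once the angle threshold is applied, which is the distinction between the raw and angled counts.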

Figure 13: Distribution of SDC trajectory intersection counts. Left: raw intersection counts across all 50k scenarios. Center: angled intersections (non-zero only). Right: distribution within the selected top-200 subset.

Figure 14:Nine example scenarios from the selected interactive subset. The SDC trajectory is shown in green, other agents in blue, and trajectory intersection points with other logs in red.

D.2Metrics

We report the following metrics across all experiments. Unless noted, all metrics are computed per active (i.e., controlled) agent per episode and averaged across agents and scenarios.

Score.

An agent scores 1 if it reaches its goal without any collision or off-road event during the episode, and 0 otherwise. It jointly captures all failure modes and is a useful aggregate metric.

Completion rate.

The fraction of agents that reach their goal position (within $\delta = 2$ meters) before episode end, regardless of whether a collision or off-road event occurred.

Collision rate.

The fraction of episodes in which the agent is involved in at least one collision with another vehicle.

At-fault collision rate.

A subset of the collision criteria taken from NAVSIM [62]. A collision is attributed to an agent if (i) the other vehicle is in front of the agent at the time of impact, and (ii) the agent’s velocity vector points toward the other vehicle. This filters out collisions in which the agent was rear-ended or struck laterally by an inattentive partner.
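The two attribution conditions can be expressed as sign checks on dot products. This is a sketch under stated assumptions rather than the NAVSIM criteria themselves: "in front" is taken to mean a positive projection of the center-to-center offset onto the agent's heading, and "points toward" a positive projection of the agent's velocity onto that offset; the function name is hypothetical.

```python
import math


def is_at_fault(pos_i, heading_i, vel_i, pos_j):
    """Attribute a collision to agent i if partner j is ahead and i moves toward j.

    Assumptions: 'in front' means the offset to j has a positive component
    along i's heading; 'points toward' means i's velocity has a positive
    component along that offset.
    """
    rx, ry = pos_j[0] - pos_i[0], pos_j[1] - pos_i[1]
    in_front = math.cos(heading_i) * rx + math.sin(heading_i) * ry > 0.0
    moving_toward = vel_i[0] * rx + vel_i[1] * ry > 0.0
    return in_front and moving_toward


# Agent drives into a vehicle ahead of it: at fault.
rear_ends_partner = is_at_fault((0.0, 0.0), 0.0, (5.0, 0.0), (4.0, 0.0))
# Agent is hit from behind while driving forward: not at fault.
was_rear_ended = is_at_fault((0.0, 0.0), 0.0, (5.0, 0.0), (-4.0, 0.0))
# Agent is stationary when struck: not at fault.
was_stationary = is_at_fault((0.0, 0.0), 0.0, (0.0, 0.0), (4.0, 0.0))
```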

Collision severity (Δv\Delta v).

Beyond the binary collision indicator, we measure the severity of each at-fault collision event using the change in velocity ($\Delta v$) imparted to the agent at impact. Following the impulse-momentum formulation used in [29], the Delta-V of agent $i$ in a collision with partner $j$ is

$$\Delta v_{i} = \frac{m_{j}}{m_{i}+m_{j}}\,(1+e)\,\bigl(\vec{v}_{j}-\vec{v}_{i}\bigr)\cdot\hat{n}, \tag{11}$$

where $\hat{n}$ is the unit collision normal (taken as the vector from agent $i$’s center to agent $j$’s center at impact), $e = 0.1$ is the coefficient of restitution for vehicle-to-vehicle crashes, and the dot product is clipped at zero to ignore separating velocities. Masses are proxied from bounding-box footprint for vehicles (anchored at $1500\,\mathrm{kg}$ for a $4.5\,\mathrm{m}\times 1.8\,\mathrm{m}$ reference sedan) and fixed for vulnerable road users ($75\,\mathrm{kg}$ for pedestrians, $90\,\mathrm{kg}$ for cyclists). $\Delta v$ is one of the strongest predictors of injury risk in vehicle-to-vehicle crashes [29] and lets us distinguish low-impact contacts (e.g. parking-lot taps) from high-energy collisions even when the binary collision rate is identical.
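A worked sketch of Eq. (11) follows. Two interpretive assumptions are made and should be read as such: with $\hat{n}$ pointing from $i$ to $j$, approaching agents give a negative value of $(\vec{v}_{j}-\vec{v}_{i})\cdot\hat{n}$, so the sketch reports the magnitude of $\Delta v$ using the closing speed and sets separating cases to zero; and vehicle mass is assumed to scale linearly with footprint area relative to the reference sedan. The function names are hypothetical.

```python
import math

E_RESTITUTION = 0.1        # coefficient of restitution from the text
REF_MASS_KG = 1500.0       # reference sedan mass from the text
REF_AREA_M2 = 4.5 * 1.8    # reference sedan footprint from the text


def vehicle_mass(length, width):
    """Mass proxy from footprint. Linear scaling with area is an assumption."""
    return REF_MASS_KG * ((length * width) / REF_AREA_M2)


def delta_v(pos_i, vel_i, m_i, pos_j, vel_j, m_j, e=E_RESTITUTION):
    """Magnitude of the velocity change of agent i, following Eq. (11).

    Assumption: separating velocities contribute zero, and the reported value
    is the magnitude (closing speed along the collision normal).
    """
    nx, ny = pos_j[0] - pos_i[0], pos_j[1] - pos_i[1]
    norm = math.hypot(nx, ny)
    if norm == 0.0:
        return 0.0  # coincident centers: normal undefined, treated as no impact
    nx, ny = nx / norm, ny / norm
    rel = (vel_j[0] - vel_i[0]) * nx + (vel_j[1] - vel_i[1]) * ny
    closing = max(0.0, -rel)  # positive when the agents approach each other
    return m_j / (m_i + m_j) * (1.0 + e) * closing


m = vehicle_mass(4.5, 1.8)
# Agent i at 10 m/s hits a stationary partner of equal mass directly ahead.
dv_hit = delta_v((0.0, 0.0), (10.0, 0.0), m, (5.0, 0.0), (0.0, 0.0), m)
# Agent i moving away from the partner: separating, so no severity is recorded.
dv_apart = delta_v((0.0, 0.0), (-10.0, 0.0), m, (5.0, 0.0), (0.0, 0.0), m)
```

For equal masses the agent absorbs half of the closing speed, scaled by $(1+e)$, which gives $0.5 \times 1.1 \times 10 = 5.5$ m/s in the first case.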

Off-road rate.

The fraction of episodes in which the agent crosses a road edge boundary, detected by checking for intersection between the agent bounding box and any road edge polyline.

Route progress ratio.

Following [68], we measure how far along its expert reference trajectory each agent travels. At each timestep $t$, we find the closest point $x(t)$ on the agent’s logged trajectory and compute its arc-length distance $d_{x(t)}$ from the start of the path. The route progress ratio is

$$\rho = \frac{d_{x(t)} - d_{p}}{d_{q} - d_{p}}, \tag{12}$$

where $d_{p}$ and $d_{q}$ are the arc-length distances to the initial and final positions of the logged trajectory, respectively. A value of $\rho = 1$ means the agent reached its destination; $\rho > 1$ is possible if the agent overshoots. For agents that reach their goal under goal_remove termination, we set $\rho = 1$ directly, since their position is invalidated upon removal. For all other agents, $\rho$ is computed from the agent’s final position at episode end.
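Eq. (12) can be sketched on a polyline as follows. The sketch simplifies the closest-point search to the nearest logged vertex, which is an assumption, and does not handle a stationary logged trajectory with zero length; the function names are hypothetical.

```python
import math


def arc_lengths(path):
    """Cumulative arc length at each vertex of a polyline of (x, y) points."""
    d = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        d.append(d[-1] + math.hypot(x1 - x0, y1 - y0))
    return d


def route_progress(agent_pos, path, reached_goal=False):
    """Route progress ratio, Eq. (12), using the nearest logged vertex.

    Assumption: the closest point is approximated by the closest vertex.
    Agents removed at the goal are assigned a ratio of 1 directly.
    """
    if reached_goal:
        return 1.0
    d = arc_lengths(path)
    nearest = min(
        range(len(path)),
        key=lambda k: math.hypot(agent_pos[0] - path[k][0], agent_pos[1] - path[k][1]),
    )
    d_p, d_q = d[0], d[-1]
    return (d[nearest] - d_p) / (d_q - d_p)


logged = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
rho_half = route_progress((2.1, 0.5), logged)               # halfway along the log
rho_done = route_progress((9.9, 9.9), logged, reached_goal=True)
```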

Lateral deviation.

At each timestep $t$ for which the agent is alive, we find the nearest valid point on the agent’s expert reference trajectory,

$$k^{*}(t) = \arg\min_{k}\|p_{t} - q_{k}\|_{2}, \tag{13}$$

where $p_{t}$ is the agent’s simulated position and $q_{k}$ is the expert position at reference index $k$. The lateral deviation is

$$\ell_{t} = \|p_{t} - q_{k^{*}(t)}\|_{2}. \tag{14}$$

We report the mean of $\ell_{t}$ over alive timesteps. This metric is geometry-aligned rather than time-aligned: it measures cross-track drift from the reference path, independent of whether the agent is early or late along that path.

Longitudinal deviation.

We also decompose path-following error along the expert trajectory. Let $d_{k}$ denote the cumulative arc length of the expert trajectory up to reference index $k$, so that $d_{t}$ is the arc length reached by the expert at the current timestep. At timestep $t$, using the same nearest reference point $k^{*}(t)$ as above, the signed longitudinal deviation is

$$r_{t} = d_{k^{*}(t)} - d_{t}. \tag{15}$$

Positive values indicate that the agent is ahead of the time-aligned expert along the route, while negative values indicate that it is behind. We report the mean absolute longitudinal deviation, $\mathbb{E}_{t}[|r_{t}|]$, over alive timesteps. Like lateral deviation, this metric is route-aligned rather than strictly time-aligned.

Average displacement error (ADE).

Finally, we report the standard time-aligned displacement error. At each timestep $t$, we compare the agent’s simulated position directly to the expert position at the same timestep:

$$\mathrm{ADE}_{t} = \|p_{t} - q_{t}\|_{2}. \tag{16}$$

We average this quantity over all alive timesteps with a valid expert reference state. Unlike the lateral and longitudinal deviations, ADE is strictly time-aligned and therefore penalizes both spatial deviation and timing error.
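The three path-following metrics in Eqs. (13)-(16) can be computed together, which makes their differences easy to see. This is a simplified sketch: it assumes every timestep is alive and valid, that the agent and expert sequences have the same length, and that $d_{t}$ in Eq. (15) is the expert's cumulative arc length at index $t$; the function name is hypothetical.

```python
import math


def path_metrics(agent, expert):
    """Return (mean lateral deviation, mean |longitudinal deviation|, ADE).

    agent[t] and expert[t] are (x, y) positions at the same timestep t.
    Assumption: all timesteps are alive and valid, and len(agent) == len(expert).
    """
    # Cumulative arc length d_k of the expert trajectory.
    d = [0.0]
    for (x0, y0), (x1, y1) in zip(expert, expert[1:]):
        d.append(d[-1] + math.hypot(x1 - x0, y1 - y0))

    lateral = longitudinal = ade = 0.0
    for t, (px, py) in enumerate(agent):
        dists = [math.hypot(px - qx, py - qy) for qx, qy in expert]
        k_star = min(range(len(expert)), key=dists.__getitem__)  # Eq. (13)
        lateral += dists[k_star]                                  # Eq. (14)
        longitudinal += abs(d[k_star] - d[t])                     # |r_t|, Eq. (15)
        ade += dists[t]                                           # Eq. (16)
    n = len(agent)
    return lateral / n, longitudinal / n, ade / n


expert_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
# Same path, but the agent waits one step before moving: a pure timing error.
late_agent = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
# Same timing, but the agent drives 0.5 m to the side: a pure lateral error.
offset_agent = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.5), (3.0, 0.5)]

late_metrics = path_metrics(late_agent, expert_path)
offset_metrics = path_metrics(offset_agent, expert_path)
```

The late agent has zero lateral deviation but nonzero longitudinal deviation and ADE, while the offset agent shows the opposite pattern in the first two metrics, illustrating why ADE alone cannot separate spatial from timing error.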

Appendix EMapping Agent Experience to Human Time

We train self-play RL agents on 20 billion transitions. Since Waymo scenarios are discretized at 10 Hz, each transition $(o_{t}, a_{t})$ corresponds to 0.1 seconds of real time, placing the total training experience at approximately 63 years of driving.

For comparison, SMART [10] was trained on the full Waymo Open Motion Dataset, which contains 500,000 training scenarios. Each scenario contributes roughly 90 transitions of SDC trajectory data at 10 Hz, amounting to approximately 45 million transitions in total. The open-sourced SMART-CLSFT checkpoint was trained on all agents in each scene rather than the SDC alone; assuming an average of 5 agents per scenario, this corresponds to roughly 225 million transitions.

Our own checkpoints are trained on subsets of 67, 200, 1,200, and 12,000 maps, each contributing approximately 90 transitions per scenario.

The 2,500× claim in the abstract follows from this accounting: 200 scenarios × 9 seconds each = 30 minutes, while 500,000 scenarios × 9 seconds = 75,000 minutes (about 52 days). The ratio is 30 minutes / 75,000 minutes = 0.0004, i.e., 2,500 times less human data.
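The conversions in this appendix can be checked with a few lines of arithmetic. The variable names are illustrative, and a year is taken as 365.25 days, which is an assumption about the convention used.

```python
DT = 0.1                                # seconds per transition (10 Hz)
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: Julian year

# Self-play experience: 20 billion transitions.
rl_years = 20e9 * DT / SECONDS_PER_YEAR

# Human data: 9 seconds of SDC driving per scenario.
anchor_minutes = 200 * 9 / 60           # 200 scenarios for the anchor
full_minutes = 500_000 * 9 / 60         # full WOMD training set
full_days = full_minutes / (60 * 24)

# Transition counts for the imitation-learning baseline.
sdc_transitions = 500_000 * 90          # SDC only
all_agent_transitions = sdc_transitions * 5  # assuming 5 agents per scenario

reduction_factor = full_minutes / anchor_minutes
```

Here `rl_years` evaluates to roughly 63.4, `full_days` to roughly 52.1, and `reduction_factor` to 2,500, matching the figures quoted in the text.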

Appendix FAdditional Results

F.1Human driving data

Figure 15: Scaling human driving data for regularized self-play RL. Same as Figure 3 but with the collision rates on a log scale.

F.2Regularization keeps RL policies close to human anchors

Figure 16 shows task completion and KL divergence to the anchor policy over training. Both regularized and unregularized agents converge to comparably effective strategies in terms of goal completion and collision avoidance, yet the underlying action distributions diverge substantially. Without regularization, the agent drifts freely through the space of competent policies, converging far from human behavior; KL divergence increases monotonically throughout training. Regularization constrains the trajectory through policy space without restricting the set of achievable outcomes: the agent remains free to discover effective strategies, but the penalty keeps those strategies within the behavioral distribution of human driving. The result is an agent that is both capable and closer to the distribution of human driving.

Figure 16: Regularized self-play remains close to the anchor distribution while unregularized self-play diverges. Both agents converge to effective driving strategies (left), but their action distributions differ, as measured by KL divergence between observation-conditioned action distributions (right). Regularized policies stay near the anchor; unregularized policies diverge monotonically.

F.3Distributional Realism: Waymo Open Sim Agent Challenge

Figure 17 reports the WOSAC [28] realism meta-score alongside its three group metrics (kinematic, interactive, and map-based); Figure 18 breaks down all nine submetrics that together make up the meta-score.

Figure 17: WOSAC meta-scores and group metrics.

Unregularized self-play.

Unregularized self-play achieves a WOSAC meta-score of 0.68, with the largest deficits in the kinematic (0.22) and interactive groups. As shown in Figure 18, these policies produce low likelihoods particularly in linear speed, acceleration, and distance to nearest object.

Regularized self-play.

Adding regularization improves the meta-score to 0.725, with gains over unregularized self-play across every metric. The score is largely insensitive to additional data.

SMART-tiny CLSFT.

SMART trained on 52 days of human data achieves the highest meta-score of 0.755, despite a worse collision rate and task completion across all data bins (Table 1). This result is consistent with the SMART-tiny CLSFT results reported on the CATK GitHub repository.

Figure 18: WOSAC submetrics.

F.4Safety analysis

Table 9: Collision severity tail breakdown with human-replays in interactive held-out scenarios. Events shows the count and share of all collision events attributed to each group. The remaining columns give per-event $\Delta v$ statistics and the fraction of events exceeding three injury-risk thresholds (1 mph: cosmetic; 5 mph: airbag-deployment floor; 15 mph: elevated serious-injury risk). Best value per column in bold; lower is better throughout.

| Method | Events (at-fault coll. rate) | Mean $\Delta v$ (m/s) ↓ | Max $\Delta v$ (m/s) ↓ | >1 mph (%) ↓ | >5 mph (%) ↓ | >15 mph (%) ↓ |
|---|---|---|---|---|---|---|
| unregularized | 91 (5.0%) | 2.09 | 13.71 | **89.0** | 54.9 | 14.3 |
| regularized | 53 (2.8%) | **1.71** | **8.09** | 90.6 | **54.7** | **7.5** |

F.5Single and multi-agent RL

Figure 19: Single vs. multi-agent experiments. Purple bars show the performance of policies trained in a single-agent setting; red bars show policies trained in a multi-agent (self-play) setting.

Appendix GExtended limitations

Failure modes and directions for improvement.

We perform an additional analysis to better understand the limitations of the resulting regularized policies. To improve the signal of the analysis, we evaluate on a curated set of interactive scenarios within the held-out set; that is, we filter for scenarios that contain dense multi-agent interactions such as merges, unprotected turns, and yielding (details in Appendix D.1).

Table 10 shows that (at-fault) collision rates increase noticeably in these interactive scenarios, even for the best regularized policy (2.1-2.8%) and SMART-tiny-CLSFT trained on 52 days of data (2.7%). We also share several representative failure modes on the webpage https://spiced-self-play.com/ (see failure modes).

A likely reason for the increased collision rates for the self-play policies is that the Waymo scenarios that we train in during self-play are small (since they are constructed from a 9-second log), and agent interactions are relatively sparse (see Figure 13 for the distribution of intersections between agent logs), so the RL agent only occasionally trains on transitions that help it improve in difficult coordination situations.

We outline several directions for future work that could improve robustness:

1. Curriculum learning based on advantage. Each scenario can be treated as a level whose difficulty is measured by the agent’s average advantage. Upsampling scenarios proportionally to their advantage would concentrate training signal on cases the agent finds difficult, naturally increasing exposure to rare but safety-critical situations such as sudden cut-ins and stationary obstacles.
2. Domain randomization. Masking out the observation of a ratio of agents within each scenario (“blind” agents [5]) and adding noise to the dynamics or partner features provides a targeted form of domain randomization that could make policy behavior more cautious.
3. Adversarial fine-tuning. A third training stage that fine-tunes on a curated set of adversarial human data would expose the policy to scenarios where the other agents in the scene do not respond to it.
4. Human-like opponents. Occasionally replacing the self-play opponent with the BC anchor rather than a copy of the RL policy would expose the agent to more human-like partner behavior throughout training.
5. Stronger anchor policy. The BC anchor is itself a limiting factor: our best anchor achieves a closed-loop score of 0.66 (Table 7), and a stronger anchor, whether through architectural improvements or additional data, would give the KL regularizer a more reliable behavioral target.

Table 10: Interactive evaluation across all scaling checkpoints. All metrics are computed on the interactive validation subset; policies are rolled out in each of the 200 scenarios 10 times. Top-3 values per column are marked (best in bold, then (2nd) and (3rd)); † marks the best unregularized self-play value per column. IDM results are not available for SMART (indicated by —). The first two metric columns are scores; the remaining four are collision rates.

| Self-play maps (metadata) | Anchor data (human demos) | HR Score ↑ | IDM Score ↑ | IDM At-fault (%) ↓ | HR At-fault (%) ↓ | IDM Coll. (%) ↓ | HR Coll. (%) ↓ |
|---|---|---|---|---|---|---|---|
| 10 | 0 (unreg.) | 0.312 ± 0.010 | 0.296 ± 0.010 | 42.8 ± 1.1 | 46.2 ± 1.1 | 46.6 ± 1.1 | 50.1 ± 1.1 |
| 100 | 0 (unreg.) | 0.598 ± 0.011 | 0.577 ± 0.011 | 28.9 ± 1.0 | 29.9 ± 1.0 | 34.6 ± 1.1 | 34.3 ± 1.0 |
| 1k | 0 (unreg.) | 0.868 ± 0.007 | 0.842 ± 0.008 | 5.8 ± 0.5 | 7.6 ± 0.6 | 10.1 ± 0.7 | 12.2 ± 0.7 |
| 10k | 0 (unreg.) | 0.891 ± 0.007 | 0.876 ± 0.007 | 3.2 ± 0.4 † | 4.1 ± 0.4 † | 9.0 ± 0.6 | 10.2 ± 0.7 |
| 50k | 0 (unreg.) | 0.908 ± 0.006 † | 0.893 ± 0.007 † | 3.8 ± 0.4 | 4.9 ± 0.5 | 7.6 ± 0.6 † | 8.7 ± 0.6 † |
| 10 | 30 minutes | 0.425 ± 0.011 | 0.432 ± 0.011 | 33.1 ± 1.0 | 34.6 ± 1.1 | 36.6 ± 1.1 | 37.6 ± 1.1 |
| 10 | 3 hours | 0.361 ± 0.011 | 0.371 ± 0.011 | 37.3 ± 1.1 | 39.6 ± 1.1 | 39.8 ± 1.1 | 43.2 ± 1.1 |
| 100 | 30 minutes | 0.722 ± 0.010 | 0.661 ± 0.010 | 16.8 ± 0.8 | 18.0 ± 0.8 | 22.4 ± 0.9 | 23.6 ± 0.9 |
| 100 | 3 hours | 0.658 ± 0.010 | 0.629 ± 0.011 | 21.8 ± 0.9 | 24.0 ± 0.9 | 25.5 ± 1.0 | 28.2 ± 1.0 |
| 1k | 30 minutes | 0.897 ± 0.007 | 0.858 ± 0.008 | 4.4 ± 0.5 | 5.9 ± 0.5 | 8.4 ± 0.6 | 9.2 ± 0.6 |
| 1k | 3 hours | 0.886 ± 0.007 | 0.866 ± 0.008 | 5.3 ± 0.5 | 7.0 ± 0.6 | 9.3 ± 0.6 | 10.2 ± 0.7 |
| 10k | 10 minutes | 0.916 ± 0.006 | 0.858 ± 0.008 | 3.1 ± 0.4 | 3.0 ± 0.4 | 8.3 ± 0.6 | 6.8 ± 0.6 |
| 10k | 30 minutes | 0.926 ± 0.006 | 0.892 ± 0.007 | 3.5 ± 0.4 | 2.4 ± 0.3 (2nd) | 7.9 ± 0.6 | 7.1 ± 0.6 |
| 10k | 3 hours | 0.906 ± 0.006 | 0.873 ± 0.007 | 3.0 ± 0.4 | 3.5 ± 0.4 | 7.7 ± 0.6 | 7.9 ± 0.6 |
| 10k | 30 hours | 0.925 ± 0.006 | 0.904 ± 0.007 (2nd) | 2.6 ± 0.4 (2nd) | 3.5 ± 0.4 | 5.9 ± 0.5 (3rd) | 6.0 ± 0.5 |
| 50k | 10 minutes | 0.923 ± 0.006 | 0.883 ± 0.007 | 3.1 ± 0.4 | 3.0 ± 0.4 | 7.4 ± 0.6 | 6.9 ± 0.6 |
| 50k | 30 minutes | 0.931 ± 0.006 (3rd) | 0.890 ± 0.007 | 2.8 ± 0.4 (3rd) | 2.6 ± 0.4 (3rd) | 5.6 ± 0.5 (2nd) | 6.0 ± 0.5 |
| 50k | 3 hours | 0.935 ± 0.005 (2nd) | 0.890 ± 0.007 | 3.6 ± 0.4 | 2.8 ± 0.4 | 6.5 ± 0.5 | 5.2 ± 0.5 (2nd) |
| 50k | 30 hours | **0.949 ± 0.005** | **0.908 ± 0.006** | **2.2 ± 0.3** | **2.1 ± 0.3** | **5.2 ± 0.5** | **4.2 ± 0.4** |
| — | 10 min (SMART) | 0.048 ± 0.005 | — | — | 35.0 ± 1.1 | — | 43.9 ± 1.1 |
| — | 30 min (SMART) | 0.148 ± 0.008 | — | — | 24.5 ± 1.0 | — | 30.9 ± 1.0 |
| — | 3 hours (SMART) | 0.319 ± 0.010 | — | — | 15.3 ± 0.8 | — | 21.2 ± 0.9 |
| — | 30 hours (SMART) | 0.376 ± 0.011 | — | — | 6.4 ± 0.5 | — | 11.6 ± 0.7 |
| — | 52 days BC (SMART) | 0.383 ± 0.011 | — | — | 4.5 ± 0.5 | — | 7.9 ± 0.6 |
| — | 52 days CLSFT (SMART) | 0.433 ± 0.011 | — | — | 2.7 ± 0.4 | — | 5.4 ± 0.5 (3rd) |

G.1SMART model performance with and without finetuning

The 52-day IL baseline results in Table 1 are obtained from the CAT-K fine-tuned SMART model trained on the full 500k-scenario Waymo training set [54], which achieves the strongest imitation-learning performance (training details in Appendix B.3). For completeness, Table 11 reports both raw and fine-tuned SMART checkpoints trained on subsets of the Waymo dataset. Although fine-tuning generally improves route completion, the pre-finetuning SMART checkpoints consistently achieve lower collision and off-road rates. We therefore report the raw checkpoints in the main paper, as they yield the strongest overall baseline performance.

Table 11: SMART performance with and without CATK [54] fine-tuning on 10,000 held-out validation scenarios. The main paper reports the strongest-performing variant at each data scale; for SMART, these correspond to the non-fine-tuned checkpoints shown here. Fine-tuned rows denote the same checkpoints after closed-loop supervised fine-tuning. The first four metric columns are self-play / all-agents (test); the last four are human-replay / SDC-only (test).

| Human demos used | Fine-tuned | Coll. (%) ↓ | Off-road (%) ↓ | Route prog. (%) ↑ | Score ↑ | Coll. (%) ↓ | At-fault (%) ↓ | Off-road (%) ↓ | Route prog. (%) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| 10 min | No | 11.9 | 55.8 | 84.5 | 0.246 | 32.0 | 25.0 | 18.6 | 57.7 |
| 10 min | Yes | 19.2 | 57.7 | 85.1 | 0.216 | 33.3 | 26.9 | 27.3 | 68.5 |
| 30 min | No | 9.5 | 55.4 | 85.8 | 0.379 | 17.9 | 12.5 | 16.8 | 76.9 |
| 30 min | Yes | 13.4 | 56.9 | 87.0 | 0.311 | 23.3 | 18.3 | 26.4 | 80.4 |
| 3 hours | No | 8.0 | 53.6 | 86.2 | 0.518 | 11.4 | 6.9 | 4.5 | 81.5 |
| 3 hours | Yes | 10.5 | 54.3 | 87.2 | 0.481 | 14.1 | 10.1 | 8.9 | 85.7 |
| 30 hours | No | 7.7 | 53.3 | 86.5 | 0.601 | 6.8 | 3.3 | 1.6 | 85.4 |
| 30 hours | Yes | 9.1 | 53.6 | 87.8 | 0.586 | 6.9 | 4.0 | 2.8 | 89.4 |