Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

arXiv cs.LG Papers

Summary

This paper proposes a multi-agent reinforcement learning framework that co-trains an autonomous vehicle and pedestrians with personality-driven jaywalking behavior, achieving a 30% reduction in collisions compared to single-agent approaches and demonstrating more realistic interaction scenarios.

arXiv:2605.20255v1 Announce Type: new

Cached at: 05/21/26, 06:20 AM

# Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty ††thanks: This work is supported by the SNSF SwarmOps project (No. 200021_219732).
Source: [https://arxiv.org/html/2605.20255](https://arxiv.org/html/2605.20255)
###### Abstract

Simulation\-based testing of self\-driving cars \(SDCs\) typically relies on scripted or simplified pedestrian models that do not capture the heterogeneity and uncertainty of real human crossing behavior\. This limits the realism of safety assessments, especially in scenarios involving jaywalking, which is governed by latent personality traits that the vehicle cannot observe\. We hypothesize that jointly training pedestrians and the SDC with multi\-agent reinforcement learning \(MARL\) produces more realistic interaction scenarios than training the SDC against fixed pedestrian policies, and that the resulting behavior gap between predictable and unpredictable crossings can be measured directly from trajectories\. This paper describes a MARL environment in which an SDC and 12 pedestrians are co\-trained using Multi\-Agent Proximal Policy Optimization \(MAPPO\)\. Pedestrian locomotion follows scripted Dijkstra pathfinding, while an RL policy controls high\-level go/wait decisions\. Jaywalking probability depends on a per\-pedestrian personality trait sampled at episode start and hidden from the SDC\. In 500\-episode evaluations, the co\-trained SDC reached 78% of goals with a 14% collision rate, compared to 35% goals and 33% collisions for the best rule\-based baseline\. A speed differential metric shows that the SDC traveled 2\.65 m/s faster near jaywalkers than near crosswalk users at close range \(0–3 m\), indicating that jaywalking encounters were not anticipated\. Jaywalking accounted for 13% of crossing events but was associated with 62% of collisions\. Co\-training with MARL pedestrians reduced collisions by 30% relative to single\-agent RL, as pedestrians learned to wait when the SDC approached at speed\.

## I Introduction

Pedestrian\-vehicle interactions are a recurring source of urban collisions, and self\-driving cars \(SDCs\) must account for them during training and evaluation\. Existing simulation\-based testing of SDCs typically relies on scripted pedestrian motion or simplified crossing rules that do not reflect the heterogeneity of real human behavior\[[16](https://arxiv.org/html/2605.20255#bib.bib21),[2](https://arxiv.org/html/2605.20255#bib.bib37)\], raising a question for SDC assessment: are we training and testing SDCs against pedestrian models realistic enough to expose latent\-intent situations such as jaywalking? We hypothesize that co\-training pedestrians and the SDC with multi\-agent reinforcement learning \(MARL\), with a personality\-driven jaywalking mechanism invisible to the vehicle, yields more realistic interaction scenarios than single\-agent training, and that the gap between predictable and unpredictable crossings can be measured from trajectories\. We instantiate this idea with Multi\-Agent Proximal Policy Optimization \(MAPPO\)\[[26](https://arxiv.org/html/2605.20255#bib.bib1)\]\.

Prior work on pedestrian\-vehicle interaction has examined risk\-aware RL for jaywalker encounters\[[29](https://arxiv.org/html/2605.20255#bib.bib18)\], social force based simulation\[[7](https://arxiv.org/html/2605.20255#bib.bib13),[16](https://arxiv.org/html/2605.20255#bib.bib21)\], and survey evidence on jaywalking near automated vehicles\[[4](https://arxiv.org/html/2605.20255#bib.bib19)\]\. MARL has been applied to cooperative driving at intersections\[[27](https://arxiv.org/html/2605.20255#bib.bib23),[28](https://arxiv.org/html/2605.20255#bib.bib22)\]\. However, these works either fix the pedestrian model or train the SDC in isolation; none co\-train pedestrians and the SDC with explicit, trait\-driven behavioral uncertainty\.

![Refer to caption](https://arxiv.org/html/2605.20255v1/scenario_collision_1.png) \(a\) Collision with jaywalker
![Refer to caption](https://arxiv.org/html/2605.20255v1/scenario_jw_avoid_1.png) \(b\) Successful avoidance

Figure 1: Two outcomes in our environment\. The blue rectangle denotes the SDC, coloured dots denote pedestrians \(colour encoding personality traits, with yellow segments marking active jaywalking\), grey regions are roads, and green strips are sidewalks\. The goal is shown as a red crosshair\. \(a\) The SDC collides with a jaywalking pedestrian\. \(b\) The SDC steers around a jaywalker and reaches its goal\. A collision is registered when the SDC centre comes within 1\.5 m of a pedestrian centre, a safety envelope intended to account for vehicle footprint and pedestrian body size\.

This paper makes the following contributions\. *First*, we describe a MARL environment for pedestrian\-SDC co\-training in which jaywalking is driven by a latent personality trait hidden from the SDC, extending prior MARL intersection work that uses at most a handful of scripted pedestrians per scene\[[27](https://arxiv.org/html/2605.20255#bib.bib23)\]\. *Second*, we introduce a speed differential metric that quantifies, directly from trajectories, how the SDC responds differently to predictable crosswalk encounters and unpredictable jaywalking encounters\. *Third*, we show empirically that co\-training produces emergent cooperative waiting behavior in pedestrians and reduces collisions by 30% relative to single\-agent RL, even when the single\-agent SDC is later paired with the MARL pedestrian policy\.

## II Related Work

Multi\-agent RL\. Cooperative MARL methods include MADDPG\[[13](https://arxiv.org/html/2605.20255#bib.bib2)\], QMIX\[[17](https://arxiv.org/html/2605.20255#bib.bib3)\], COMA\[[6](https://arxiv.org/html/2605.20255#bib.bib4)\], and IPPO\[[24](https://arxiv.org/html/2605.20255#bib.bib5)\]\. MAPPO\[[26](https://arxiv.org/html/2605.20255#bib.bib1)\], built on PPO\[[21](https://arxiv.org/html/2605.20255#bib.bib7)\] with generalized advantage estimation \(GAE\)\[[20](https://arxiv.org/html/2605.20255#bib.bib8)\], uses the centralized\-training, decentralized\-execution \(CTDE\) paradigm\[[14](https://arxiv.org/html/2605.20255#bib.bib6)\]: a centralized critic during training, decentralized actors at execution\. We adopt MAPPO with a shared centralized critic\.

Pedestrian behavior\. The social force model\[[7](https://arxiv.org/html/2605.20255#bib.bib13)\] remains foundational for pedestrian simulation\. Personality traits have been incorporated into heterogeneous pedestrian models\[[25](https://arxiv.org/html/2605.20255#bib.bib16)\], and recent work integrates sensory\-motor constraints into RL\-based pedestrian policies\[[23](https://arxiv.org/html/2605.20255#bib.bib14)\]\. Trajectory prediction methods such as Social LSTM\[[1](https://arxiv.org/html/2605.20255#bib.bib30)\] and Trajectron\+\+\[[19](https://arxiv.org/html/2605.20255#bib.bib32)\] forecast pedestrian paths but do not model crossing decisions\. Khuzam et al\.\[[11](https://arxiv.org/html/2605.20255#bib.bib20)\] studied jaywalking through Markov games, while Zhang et al\.\[[29](https://arxiv.org/html/2605.20255#bib.bib18)\] proposed risk\-aware RL for jaywalker interactions with the SDC trained in isolation\.

Uncertainty and simulation testing\. Kendall and Gal\[[10](https://arxiv.org/html/2605.20255#bib.bib26)\] distinguish aleatoric and epistemic uncertainty, Hoel et al\.\[[8](https://arxiv.org/html/2605.20255#bib.bib27)\] apply uncertainty estimation to tactical driving, and Wang et al\.\[[22](https://arxiv.org/html/2605.20255#bib.bib29)\] survey uncertainty quantification \(UQ\) methods for autonomous vehicles\. Standard SDC testbeds include CARLA\[[5](https://arxiv.org/html/2605.20255#bib.bib24)\] and SUMO\[[12](https://arxiv.org/html/2605.20255#bib.bib38)\]\. Birchler et al\.\[[2](https://arxiv.org/html/2605.20255#bib.bib37)\] show that simulation\-based SDC testing does not always align with human perception of safety and realism, motivating richer pedestrian models in SDC assessment\. GPU\-accelerated RL via JAX\[[3](https://arxiv.org/html/2605.20255#bib.bib40)\] has been demonstrated by JaxMARL\[[18](https://arxiv.org/html/2605.20255#bib.bib11)\] and CleanRL\[[9](https://arxiv.org/html/2605.20255#bib.bib12)\]\. Table [I](https://arxiv.org/html/2605.20255#S2.T1) summarizes the comparison\.

TABLE I: Comparison with Related Systems

![Refer to caption](https://arxiv.org/html/2605.20255v1/x1.png)

Figure 2: System architecture\. \(a\) CTDE: the centralized critic uses global state during training and is discarded at execution\. \(b\) Network architecture for pedestrian actor, SDC actor, and shared critic\. \(c\) Hierarchical decomposition: RL controls go/wait and accel/steer; scripted Dijkstra handles locomotion \(SDC bypasses this layer\); physics handles motion\. \(d\) Agent\-environment loop with personality traits feeding into the environment\.

## III System Design

Fig\. [2](https://arxiv.org/html/2605.20255#S2.F2) shows the system architecture\. The environment, training, and inference all run in JAX on a single GPU via jax\.vmap \(512 parallel environments\) and jax\.lax\.scan \(rollout collection\)\.

### III\-A Environment

The environment covers a $120 \times 120$ m urban map containing a four\-way intersection and a T\-junction, with 3 road segments, 20 sidewalk segments, and 6 crosswalks\. The scene is designed to contain two distinct intersection topologies so that a single 50 s episode covers both a four\-way and a T\-junction crossing\. Simulation runs at $dt = 0.1$ s \(10 Hz\) for 500 steps \(50 s\) per episode\.

Pedestrians\. The environment contains 12 pedestrian agents, substantially more than the up to 3 background pedestrians used in the recent CARLA\-based MARL intersection study of Yu et al\.\[[27](https://arxiv.org/html/2605.20255#bib.bib23)\], chosen to sustain multiple concurrent pedestrian\-SDC interactions per episode rather than a single encounter\. Pedestrians navigate via Dijkstra shortest\-path on a 40\-node navigation graph \(34 sidewalk waypoints plus 6 crosswalk midpoints\)\. The RL policy outputs a binary go/wait decision\. When the policy selects “go,” a personality\-driven roll determines whether the pedestrian crosses via a designated crosswalk or jaywalks across the road:

$$P(\text{jaywalk} \mid \text{go}) = \tau_j \times 0.25 \tag{1}$$

where $\tau_j \in [0, 1]$ is the jaywalking tendency, sampled uniformly at episode start and *not observable* by the SDC\. This trait\-based parameterization of pedestrian heterogeneity is in the same spirit as the risk\-taking, cautious, and distracted pedestrian types used in prior social\-force simulations of AV\-pedestrian interaction\[[16](https://arxiv.org/html/2605.20255#bib.bib21)\]\. Walking speed ranges from 1\.0 to 2\.0 m/s\.
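The pedestrian pipeline described above (scripted shortest-path routing, a binary go/wait output, then the personality roll of Eq. (1)) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the toy graph, the node names, and the helper names `shortest_path`, `jaywalk_probability`, and `crossing_mode` are all hypothetical, and the learned go/wait policy is replaced by a plain boolean.

```python
import heapq
import random

JAYWALK_SCALE = 0.25  # multiplier from Eq. (1)


def shortest_path(graph, source, target):
    """Dijkstra shortest path on a graph given as {node: {neighbor: cost}}."""
    dist = {source: 0.0}
    prev = {}
    frontier = [(0.0, source)]
    visited = set()
    while frontier:
        d, node = heapq.heappop(frontier)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, cost in graph.get(node, {}).items():
            new_d = d + cost
            if new_d < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_d
                prev[neighbor] = node
                heapq.heappush(frontier, (new_d, neighbor))
    if target not in dist:
        return None  # target unreachable
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def jaywalk_probability(tau_j):
    """Eq. (1): P(jaywalk | go) = tau_j * 0.25 for tau_j in [0, 1]."""
    if not 0.0 <= tau_j <= 1.0:
        raise ValueError("tau_j must lie in [0, 1]")
    return tau_j * JAYWALK_SCALE


def crossing_mode(go, tau_j, rng):
    """Map a go/wait decision to 'wait', 'crosswalk', or 'jaywalk'."""
    if not go:
        return "wait"
    return "jaywalk" if rng.random() < jaywalk_probability(tau_j) else "crosswalk"


# Hypothetical toy graph: sidewalk waypoints s0..s2 and one crosswalk midpoint cw0.
# Edge costs are distances in metres.
toy_graph = {
    "s0": {"cw0": 4.0, "s1": 30.0},
    "cw0": {"s0": 4.0, "s2": 4.0},
    "s1": {"s0": 30.0, "s2": 30.0},
    "s2": {"cw0": 4.0, "s1": 30.0},
}
route = shortest_path(toy_graph, "s0", "s2")

rng = random.Random(0)
tau_j = rng.uniform(0.0, 1.0)  # drawn once per episode, never shown to the SDC
mode = crossing_mode(True, tau_j, rng)
```

Keeping the trait inside `crossing_mode` rather than in any observation mirrors the paper's design choice: the SDC only ever sees the resulting motion, not `tau_j`.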

SDC\. The SDC uses a kinematic bicycle model\[[15](https://arxiv.org/html/2605.20255#bib.bib39)\] with wheelbase 2\.5 m, maximum speed 8\.33 m/s \(30 km/h\), and maximum steering angle 0\.52 rad\. A hard constraint keeps the SDC on the road with a 0\.5 m margin\. An episode terminates on collision \(SDC\-pedestrian centre distance below 1\.5 m\), on goal \(distance below 3\.0 m\), or on timeout\.
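To make the vehicle dynamics and termination rules concrete, here is a small sketch using the constants quoted above. Several details are assumptions, since the paper does not spell them out: the rear-axle form of the kinematic bicycle model, explicit Euler integration, a speed floor of zero, and checking collision before goal. The on-road margin constraint is not modeled, and `bicycle_step` and `episode_status` are hypothetical names.

```python
import math

WHEELBASE = 2.5         # m
MAX_SPEED = 8.33        # m/s (30 km/h)
MAX_STEER = 0.52        # rad
DT = 0.1                # s per simulation step
COLLISION_RADIUS = 1.5  # m, SDC centre to pedestrian centre
GOAL_RADIUS = 3.0       # m
MAX_STEPS = 500         # 50 s episode


def bicycle_step(x, y, heading, speed, accel, steer, dt=DT):
    """One Euler step of a rear-axle kinematic bicycle model (assumed variant)."""
    steer = max(-MAX_STEER, min(MAX_STEER, steer))
    speed = max(0.0, min(MAX_SPEED, speed + accel * dt))
    x += speed * math.cos(heading) * dt
    y += speed * math.sin(heading) * dt
    heading += speed / WHEELBASE * math.tan(steer) * dt
    return x, y, heading, speed


def episode_status(sdc_xy, goal_xy, pedestrians_xy, step):
    """Return 'collision', 'goal', 'timeout', or 'running'."""
    if any(math.dist(sdc_xy, p) < COLLISION_RADIUS for p in pedestrians_xy):
        return "collision"
    if math.dist(sdc_xy, goal_xy) < GOAL_RADIUS:
        return "goal"
    if step >= MAX_STEPS:
        return "timeout"
    return "running"
```

With zero steering the heading stays fixed and the car advances `speed * dt` per step, which is a quick sanity check for any reimplementation.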

### III\-B MAPPO Training

Algorithm [1](https://arxiv.org/html/2605.20255#alg1) outlines the training loop\. We use clipped PPO objectives with GAE and a shared centralized critic, following the common MAPPO design\[[26](https://arxiv.org/html/2605.20255#bib.bib1)\]\. The number of updates \(5,000\) was chosen empirically by monitoring the learning curves of all three networks until they plateaued on held\-out seeds; combined with 512 parallel environments and 256\-step rollouts, this corresponds to $6.55 \times 10^{8}$ environment steps in total\. The centralized critic observes a 58\-dim global state \(all pedestrian positions, speeds, jaywalking tendencies, and SDC state\)\. The 20\-dim pedestrian observation comprises own state, traits, waypoint direction, surface type, and SDC relative state\. The 34\-dim SDC observation comprises own state, goal, road and lane information, and the 6 nearest pedestrians\. All 12 pedestrians share actor parameters, as is standard for homogeneous MARL agents\[[26](https://arxiv.org/html/2605.20255#bib.bib1)\]\. Table [II](https://arxiv.org/html/2605.20255#S3.T2) lists the full configuration\.
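The training budget quoted above follows directly from the three counts in the text. The short check below reproduces it and converts it to simulated time using the 0.1 s step from Section III-A; the variable names are illustrative only.

```python
NUM_UPDATES = 5_000
NUM_ENVS = 512
ROLLOUT_STEPS = 256
DT = 0.1  # seconds of simulated time per environment step

steps_per_update = NUM_ENVS * ROLLOUT_STEPS        # steps collected per update
total_env_steps = NUM_UPDATES * steps_per_update   # whole training run
simulated_hours = total_env_steps * DT / 3600.0

print(f"{steps_per_update:,} steps per update")
print(f"{total_env_steps:,} environment steps in total")
print(f"about {simulated_hours:,.0f} hours of simulated driving")
```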

Algorithm 1 MAPPO Co\-Training

1: Initialize ped actor $\pi_{\theta}^{p}$, SDC actor $\pi_{\phi}^{s}$, shared critic $V_{\psi}$

2: for update $= 1$ to $5{,}000$ do

3: Collect rollouts across 512 parallel envs for 256 steps

4: for each env step do

5: Ped obs $\leftarrow$ local \(20\-dim\); SDC obs $\leftarrow$ local \(34\-dim\)

6: Ped actions $\sim \pi_{\theta}^{p}$; SDC action $\sim \pi_{\phi}^{s}$

7: Step env; record $(o, a, r, d)$

8: end for

9: Compute $V(s)$ using critic with global state \(58\-dim\)

10: Compute GAE advantages \($\gamma = 0.995$, $\lambda = 0.95$\)

11: Update $\pi_{\theta}^{p}$, $\pi_{\phi}^{s}$, $V_{\psi}$ with clipped PPO \($\epsilon = 0.2$, 4 epochs, 8 minibatches\)

12: end for
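Steps 10 and 11 of Algorithm 1 rely on two standard building blocks, generalized advantage estimation and the PPO clipped surrogate. The sketch below shows both in plain Python for a single trajectory, using the stated hyperparameters. It is a textbook rendering, not the paper's JAX code: batching, advantage normalization, the value loss, and how the blended reward feeds each actor are left out because the paper does not specify them here.

```python
import math

GAMMA, LAMBDA, CLIP_EPS = 0.995, 0.95, 0.2


def gae_advantages(rewards, values, dones, last_value, gamma=GAMMA, lam=LAMBDA):
    """Generalized advantage estimates for one rollout, computed back to front.

    values[t] is the critic estimate V(s_t); last_value is V of the state
    that follows the final step; dones[t] marks episode termination at step t.
    """
    advantages = [0.0] * len(rewards)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        next_adv = delta + gamma * lam * not_done * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages


def clipped_surrogate(log_prob_new, log_prob_old, advantage, eps=CLIP_EPS):
    """Per-sample PPO clipped objective (a quantity to maximize)."""
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Two-step toy rollout that terminates at the second step.
adv = gae_advantages([1.0, 1.0], [0.5, 0.5], [False, True], last_value=9.0)
```

Because the second step is terminal, `last_value` is masked out and does not leak into the estimates, which is the behavior one wants when episodes end on collision or goal.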

TABLE II: System Configuration and Hyperparameters

The critic trains on a blended reward \(50% mean pedestrian plus 50% SDC\) reflecting the cooperative objective\. Pedestrian rewards include waypoint progress \($+2.0 \cdot \Delta d$\), waypoint reached \($+5.0$\), collision \($-25.0$\), and a smart\-waiting bonus \($+0.3$ when waiting within 8 m of a fast\-approaching SDC\)\. SDC rewards include goal progress \($+2.0 \cdot \Delta d$\), goal reached \($+50.0$\), collision \($-50.0$\), speeding penalties near occupied crosswalks and jaywalkers, lane centering, and heading alignment\. No positive reward is given for stopping, which prevents the SDC from accumulating yielding rewards by waiting indefinitely\.
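A rough sketch of how the pedestrian reward terms and the blended critic reward might be combined is given below. The coefficients come from the text; everything else is assumed for illustration, in particular the `FAST_SPEED` threshold (the paper does not say what counts as "fast-approaching"), the inclusive 8 m test, and the function names. SDC shaping terms whose magnitudes are not reported are omitted.

```python
W_PED, W_SDC = 0.5, 0.5    # blend weights for the shared critic
SMART_WAIT_BONUS = 0.3
SMART_WAIT_RADIUS = 8.0    # m
FAST_SPEED = 4.0           # m/s, hypothetical threshold (not given in the paper)


def pedestrian_reward(progress_m, reached_waypoint, collided,
                      waiting, sdc_distance_m, sdc_closing_speed):
    """Sum of the pedestrian reward terms listed in the text."""
    reward = 2.0 * progress_m
    if reached_waypoint:
        reward += 5.0
    if collided:
        reward -= 25.0
    if (waiting and sdc_distance_m <= SMART_WAIT_RADIUS
            and sdc_closing_speed >= FAST_SPEED):
        reward += SMART_WAIT_BONUS
    return reward


def blended_critic_reward(ped_rewards, sdc_reward):
    """50% mean pedestrian reward plus 50% SDC reward."""
    mean_ped = sum(ped_rewards) / len(ped_rewards)
    return W_PED * mean_ped + W_SDC * sdc_reward
```

Averaging over pedestrians before blending keeps the critic target on the same scale regardless of how many pedestrians are in the scene.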

### III\-C Uncertainty Quantification

We distinguish two encounter types from the SDC’s perspective: *predictable* \(crosswalk crossings, anticipatable from crosswalk proximity and trajectory\) and *uncertain* \(jaywalking, governed by the latent trait $\tau_j$ invisible to the SDC\)\. We quantify uncertainty via a Speed Differential metric\. For each timestep where a pedestrian of type $c \in \{\text{cw}, \text{jw}\}$ is within distance bin $[d_1, d_2]$ of the SDC, we record the SDC speed $v_{\text{sdc}}$:

$$\bar{v}_c(d_1, d_2) = \frac{1}{|T_c|} \sum_{t \in T_c} v_{\text{sdc}}(t), \quad T_c = \{t : d_t^c \in [d_1, d_2]\} \tag{2}$$

The gap $\Delta v = \bar{v}_{\text{jw}} - \bar{v}_{\text{cw}}$ measures how much faster the SDC travels near jaywalkers than near crosswalk users at a given distance\. A positive $\Delta v$ indicates the SDC was not anticipating the jaywalking encounter\. We additionally measure collision attribution, the fraction of collisions caused by each pedestrian type, normalized by encounter frequency\.
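Eq. (2) and the gap it feeds can be computed from logged trajectories with a few lines of code. The sketch below assumes a flat log of `(encounter_type, distance_m, sdc_speed)` tuples, one per qualifying timestep; that log format, the sample values, and the function names are illustrative and not taken from the paper.

```python
def mean_speed_in_bin(samples, encounter_type, d_min, d_max):
    """Eq. (2): mean SDC speed over timesteps with a pedestrian of the
    given type at a distance inside [d_min, d_max]."""
    speeds = [v for kind, dist, v in samples
              if kind == encounter_type and d_min <= dist <= d_max]
    return sum(speeds) / len(speeds) if speeds else None


def speed_differential(samples, d_min, d_max):
    """Delta v = mean speed near jaywalkers minus mean speed near crosswalk users."""
    v_jw = mean_speed_in_bin(samples, "jw", d_min, d_max)
    v_cw = mean_speed_in_bin(samples, "cw", d_min, d_max)
    if v_jw is None or v_cw is None:
        return None  # bin is empty for at least one encounter type
    return v_jw - v_cw


# Hypothetical log entries: (type, distance in m, SDC speed in m/s).
samples = [
    ("jw", 1.0, 8.0), ("jw", 2.5, 8.2),
    ("cw", 2.0, 5.0), ("cw", 0.5, 6.0), ("cw", 7.0, 7.5),
]
delta_v = speed_differential(samples, 0.0, 3.0)
```

Returning `None` for an empty bin avoids silently reporting a zero gap where no jaywalking encounter was observed.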

## IV Experiments

All results use 500\-episode evaluations with a fixed seed set shared across methods\. We report goal rate and collision rate, standard safety and success metrics for RL\-based autonomous driving evaluation \(CARLA uses similar metrics\[[5](https://arxiv.org/html/2605.20255#bib.bib24)\]\)\. Scenario seeds in Fig\. [1](https://arxiv.org/html/2605.20255#S1.F1) were selected by fixed programmatic criteria on encounter proximity rather than hand\-picked\.

### IV\-A Baseline Comparison

Fig\. [3](https://arxiv.org/html/2605.20255#S4.F3) compares the MARL SDC against four non\-learning baselines and a single\-agent RL SDC over 500 episodes\. The co\-trained SDC reached 78%/14% \(goals/collisions\)\. The best rule\-based method \(full throttle\) reached 35%/33%; reactive braking did not help because it can brake but not steer\. The single\-agent RL SDC reached 65%/20%\.

![Refer to caption](https://arxiv.org/html/2605.20255v1/x2.png)

Figure 3: Goal and collision rates across methods \(500 episodes each\)\. Single\-agent RL was trained with scripted pedestrians\.

### IV\-B Uncertainty Results

Fig\. [4](https://arxiv.org/html/2605.20255#S4.F4) shows the speed differential from Eq\. \([2](https://arxiv.org/html/2605.20255#S3.E2)\)\. At 0–3 m, the SDC was 2\.65 m/s faster near jaywalkers \(8\.08 vs\. 5\.43 m/s\), and the gap persists across all distance bins \(0\.80–1\.60 m/s at 3–12 m\)\. Near crosswalk users the SDC had already decelerated, while near jaywalkers it maintained a higher speed, consistent with the crossing being unanticipated\.

![Refer to caption](https://arxiv.org/html/2605.20255v1/x3.png)

Figure 4: SDC speed vs\. distance to nearest pedestrian, separated by encounter type\. The shaded region indicates the speed gap between jaywalker and crosswalk encounters\.

Collision attribution\. Jaywalking accounted for 13% of crossing events but was associated with 62% of collisions\. The 5th\-percentile minimum approach distance was 3\.25 m for jaywalker encounters and 3\.49 m for crosswalk encounters\.
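One way to read the attribution figures is as an over-representation ratio: a crossing type's share of collisions divided by its share of crossing events. The sketch below computes that ratio under two assumptions that the paper does not state explicitly: every collision is attributed to exactly one of the two crossing types, and the remaining 87% of crossings and 38% of collisions therefore belong to crosswalk users. The function name is hypothetical, and the outputs are relative indices, not probabilities.

```python
def over_representation(jw_share_of_crossings, jw_share_of_collisions):
    """Return (jaywalk index, crosswalk index, ratio of the two).

    Each index is a type's share of collisions divided by its share of
    crossing events; an index of 1.0 would mean proportional involvement.
    """
    for share in (jw_share_of_crossings, jw_share_of_collisions):
        if not 0.0 < share < 1.0:
            raise ValueError("shares must lie strictly between 0 and 1")
    jw_index = jw_share_of_collisions / jw_share_of_crossings
    cw_index = (1.0 - jw_share_of_collisions) / (1.0 - jw_share_of_crossings)
    return jw_index, cw_index, jw_index / cw_index


jw_index, cw_index, ratio = over_representation(0.13, 0.62)
print(f"jaywalk index {jw_index:.2f}, crosswalk index {cw_index:.2f}, ratio {ratio:.1f}")
```

Under these assumptions a jaywalking crossing is involved in collisions roughly an order of magnitude more often per event than a crosswalk crossing.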

Personality mapping\. Jaywalking rate increases monotonically with the trait: Q1 \($\tau_j < 0.25$\) = 3\.2%, Q2 = 9\.4%, Q3 = 15\.7%, Q4 \($\tau_j > 0.75$\) = 21\.6%, consistent with Eq\. \([1](https://arxiv.org/html/2605.20255#S3.E1)\)\.
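The consistency claim can be checked by comparing the reported quartile rates with the expectation of Eq. (1) over each trait quartile. The sketch assumes the quartiles split the trait range at 0.25, 0.5, and 0.75 (the paper gives only the Q1 and Q4 bounds) and that the reported rates are measured per "go" decision; `expected_rate` is a hypothetical helper.

```python
JAYWALK_SCALE = 0.25  # multiplier from Eq. (1)


def expected_rate(tau_low, tau_high, scale=JAYWALK_SCALE):
    """Mean of tau * scale when tau is uniform on [tau_low, tau_high]."""
    return scale * (tau_low + tau_high) / 2.0


bounds = {"Q1": (0.0, 0.25), "Q2": (0.25, 0.5), "Q3": (0.5, 0.75), "Q4": (0.75, 1.0)}
reported = {"Q1": 0.032, "Q2": 0.094, "Q3": 0.157, "Q4": 0.216}

predicted = {q: expected_rate(lo, hi) for q, (lo, hi) in bounds.items()}
gaps = {q: reported[q] - predicted[q] for q in bounds}
overall = expected_rate(0.0, 1.0)  # population-wide expectation

for q in bounds:
    print(f"{q}: predicted {predicted[q]:.4f}, reported {reported[q]:.3f}")
```

The population-wide expectation of 12.5% also lines up with the 13% share of crossing events reported for jaywalking.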

### IV\-C MARL vs\. Single\-Agent RL

Table [III](https://arxiv.org/html/2605.20255#S4.T3) presents a $2 \times 2$ comparison\. Co\-training reduces the collision rate from 20% to 14%, a 30% relative reduction\. The pedestrian RL policy learns to wait when the SDC is nearby and moving fast, whereas scripted pedestrians never wait\. The co\-trained SDC also generalizes well against scripted pedestrians, reaching 76% of goals versus 65% for the single\-agent SDC\.
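The headline figure is a relative reduction rather than a difference in percentage points, which the following check makes explicit. The second call uses the 16% collision rate that the Discussion reports for the single-agent SDC paired with MARL pedestrians; `relative_reduction` is an illustrative helper.

```python
def relative_reduction(baseline_rate, new_rate):
    """Fractional drop of new_rate relative to baseline_rate."""
    if baseline_rate <= 0.0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


co_training = relative_reduction(0.20, 0.14)          # single-agent RL vs. co-trained
marl_pedestrians_only = relative_reduction(0.20, 0.16)  # single-agent SDC, MARL pedestrians
percentage_points = (0.20 - 0.14) * 100.0

print(f"co-training: {co_training:.0%} relative, {percentage_points:.0f} points absolute")
print(f"MARL pedestrians only: {marl_pedestrians_only:.0%} relative")
```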

TABLE III: MARL vs\. Single\-Agent RL \(500 episodes each\)

### IV\-D Ablation: Jaywalking Rate

We varied the jaywalking probability by scaling the 0\.25 multiplier in Eq\. \([1](https://arxiv.org/html/2605.20255#S3.E1)\), producing five settings with effective jaywalking rates of 0%, 5%, 13% \(default\), 30%, and 50%\. Performance degrades gracefully in the low\-to\-moderate range and then sharply once jaywalking dominates: 0% → 77%/15% \(goals/collisions\), 5% → 77%/15%, 13% → 76%/16%, 30% → 73%/18%, 50% → 64%/28%\. Between 0% and 30% jaywalking the collision rate grows from 15% to 18%, while from 30% to 50% it jumps from 18% to 28%\. This suggests that the SDC’s learned behavior tolerates a realistic amount of unpredictable crossings but degrades non\-linearly once a large fraction of pedestrians bypass the crosswalks, consistent with jaywalking acting as a tunable source of aleatoric uncertainty\.
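The non-linearity can be made explicit by comparing how fast the collision rate rises per unit of jaywalking rate in the two ranges. The sketch also back-computes the multiplier that each setting would need, assuming the effective rate equals the mean of Eq. (1) over a uniform trait (so rate = 0.5 × multiplier). The paper does not state how the effective rates were obtained, so those multipliers are an inference, and the names are illustrative.

```python
MEAN_TAU = 0.5  # mean of tau_j when sampled uniformly on [0, 1]

# Effective jaywalking rate -> (goal rate, collision rate), as reported.
ablation = {
    0.00: (0.77, 0.15),
    0.05: (0.77, 0.15),
    0.13: (0.76, 0.16),
    0.30: (0.73, 0.18),
    0.50: (0.64, 0.28),
}


def multiplier_for_rate(target_rate):
    """Multiplier in Eq. (1) that yields target_rate on average (assumed relation)."""
    return target_rate / MEAN_TAU


def collision_slope(rate_a, rate_b):
    """Change in collision rate per unit change in jaywalking rate."""
    return (ablation[rate_b][1] - ablation[rate_a][1]) / (rate_b - rate_a)


multipliers = {rate: multiplier_for_rate(rate) for rate in ablation}
low_slope = collision_slope(0.00, 0.30)
high_slope = collision_slope(0.30, 0.50)

print(f"slope 0-30%: {low_slope:.2f}, slope 30-50%: {high_slope:.2f}")
```

Under this reading the collision rate climbs about five times faster per unit of jaywalking above 30% than below it, and the 50% setting corresponds to a multiplier of 1.0, where the jaywalking probability equals the trait itself.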

## V Discussion and Conclusion

Beyond numerical results, the main qualitative contribution of this work is richer pedestrian behavior during SDC training and evaluation: pedestrians learned cooperative waiting without explicit programming, and even the single\-agent SDC saw its collision rate drop from 20% to 16% when paired with MARL pedestrians\. The speed differential metric provides a trajectory\-level measure of uncertainty that complements human\-perception\-based SDC assessment\[[2](https://arxiv.org/html/2605.20255#bib.bib37)\]\.

Relating this to simulation\-based SDC testing: the realism of the pedestrian model shapes which situations an SDC is exposed to during assessment\. A scripted\-pedestrian testbed underrepresents jaywalking and its surprise \(62% of collisions with only 13% of crossings\), a gap hidden if pedestrians never cross outside crosswalks\. Co\-trained, trait\-driven pedestrians expose the SDC to more latent\-intent situations and yield a metric \(speed differential\) that can be computed for any SDC policy under test\.

Future work includes a speed\-based collision criterion, per\-agent\-type critics, human\-in\-the\-loop scenario validation, and scaling to larger heterogeneous traffic\.

## Acknowledgments

We thank the SNSF for the project entitled “SwarmOps: Human\-sensing based MLOps for Collaborative Cyber\-physical systems” \(Project No\. 200021\_219732\)\.

## References

- \[1\] A\. Alahi, K\. Goel, V\. Ramanathan, A\. Robicquet, L\. Fei\-Fei, and S\. Savarese \(2016\) Social LSTM: Human Trajectory Prediction in Crowded Spaces\. pp\. 961–971\. External Links: [Link](https://openaccess.thecvf.com/content_cvpr_2016/html/Alahi_Social_LSTM_Human_CVPR_2016_paper.html)\.
- \[2\] C\. Birchler, T\. K\. Mohammed, P\. Rani, T\. Nechita, T\. Kehrer, and S\. Panichella \(2024\-07\) How Does Simulation\-Based Testing for Self\-Driving Cars Match Human Perception?\. Proceedings of the ACM on Software Engineering 1\(FSE\), pp\. 42:929–42:950\. External Links: [Link](https://dl.acm.org/doi/10.1145/3643768), [Document](https://dx.doi.org/10.1145/3643768)\.
- \[3\] J\. Bradbury, R\. Frostig, P\. Hawkins, M\. J\. Johnson, Y\. Katariya, C\. Leary, D\. Maclaurin, G\. Necula, A\. Paszke, J\. VanderPlas, S\. Wanderman\-Milne, and Q\. Zhang \(2018\) JAX: composable transformations of Python\+NumPy programs\. External Links: [Link](http://github.com/jax-ml/jax)\.
- \[4\] X\. Dong, E\. Guerra, and R\. A\. Daziano \(2024\-05\) Will automated vehicles encourage more jaywalking? Results from a stated preference survey\. Transportation Research Part F: Traffic Psychology and Behaviour 103, pp\. 217–229\. External Links: ISSN 1369\-8478, [Link](https://www.sciencedirect.com/science/article/pii/S1369847824000858), [Document](https://dx.doi.org/10.1016/j.trf.2024.04.011)\.
- \[5\] A\. Dosovitskiy, G\. Ros, F\. Codevilla, A\. Lopez, and V\. Koltun \(2017\-11\) CARLA: An Open Urban Driving Simulator\. arXiv\. Note: arXiv:1711\.03938 \[cs\]; published at the 1st Conference on Robot Learning \(CoRL\)\. External Links: [Link](http://arxiv.org/abs/1711.03938), [Document](https://dx.doi.org/10.48550/arXiv.1711.03938)\.
- \[6\] J\. N\. Foerster, G\. Farquhar, T\. Afouras, N\. Nardelli, and S\. Whiteson \(2018\-02\) Counterfactual multi\-agent policy gradients\. In Proceedings of the Thirty\-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18, New Orleans, Louisiana, USA, pp\. 2974–2982\. External Links: ISBN 978\-1\-57735\-800\-8, [Link](https://dl.acm.org/doi/10.5555/3504035.3504398)\.
- \[7\] D\. Helbing and P\. Molnar \(1995\-05\) Social Force Model for Pedestrian Dynamics\. Physical Review E 51\(5\), pp\. 4282–4286\. Note: arXiv:cond\-mat/9805244\. External Links: ISSN 1063\-651X, 1095\-3787, [Link](http://arxiv.org/abs/cond-mat/9805244), [Document](https://dx.doi.org/10.1103/PhysRevE.51.4282)\.
- \[8\] C\. Hoel, K\. Wolff, and L\. Laine \(2020\-04\) Tactical Decision\-Making in Autonomous Driving by Reinforcement Learning with Uncertainty Estimation\. arXiv\. Note: arXiv:2004\.10439 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2004.10439), [Document](https://dx.doi.org/10.48550/arXiv.2004.10439)\.
- \[9\] S\. Huang, R\. F\. J\. Dossa, C\. Ye, J\. Braga, D\. Chakraborty, K\. Mehta, and J\. G\. M\. Araújo \(2022\) CleanRL: High\-quality Single\-file Implementations of Deep Reinforcement Learning Algorithms\. Journal of Machine Learning Research 23\(274\), pp\. 1–18\. External Links: ISSN 1533\-7928, [Link](http://jmlr.org/papers/v23/21-1342.html)\.
- \[10\] A\. Kendall and Y\. Gal \(2017\-10\) What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?\. arXiv\. Note: arXiv:1703\.04977 \[cs\]; NIPS 2017\. External Links: [Link](http://arxiv.org/abs/1703.04977), [Document](https://dx.doi.org/10.48550/arXiv.1703.04977)\.
- \[11\] E\. A\. Khuzam, G\. Lanzaro, and T\. Sayed \(2025\-09\) Impact of jaywalking on pedestrian interaction behavior: A multiagent Markov Game\-based analysis\. Accident Analysis & Prevention 220, pp\. 108141\. External Links: ISSN 0001\-4575, [Link](https://www.sciencedirect.com/science/article/pii/S0001457525002271), [Document](https://dx.doi.org/10.1016/j.aap.2025.108141)\.
- \[12\] P\. A\. Lopez, M\. Behrisch, L\. Bieker\-Walz, J\. Erdmann, Y\. Flötteröd, R\. Hilbrich, L\. Lücken, J\. Rummel, P\. Wagner, and E\. Wießner \(2018\-11\) Microscopic Traffic Simulation using SUMO\. In 2018 21st International Conference on Intelligent Transportation Systems \(ITSC\), Maui, HI, USA, pp\. 2575–2582\. External Links: ISBN 978\-1\-7281\-0321\-1, [Link](https://doi.org/10.1109/ITSC.2018.8569938), [Document](https://dx.doi.org/10.1109/ITSC.2018.8569938)\.
- \[13\] R\. Lowe, Y\. Wu, A\. Tamar, J\. Harb, P\. Abbeel, and I\. Mordatch \(2017\-12\) Multi\-agent actor\-critic for mixed cooperative\-competitive environments\. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, pp\. 6382–6393\. External Links: ISBN 978\-1\-5108\-6096\-4, [Link](https://dl.acm.org/doi/10.5555/3295222.3295385)\.
- \[14\] F\. A\. Oliehoek and C\. Amato \(2016\) A Concise Introduction to Decentralized POMDPs\. SpringerBriefs in Intelligent Systems, Springer International Publishing, Cham\. External Links: ISBN 978\-3\-319\-28927\-4 978\-3\-319\-28929\-8, [Link](http://link.springer.com/10.1007/978-3-319-28929-8), [Document](https://dx.doi.org/10.1007/978-3-319-28929-8)\.
- \[15\] P\. Polack, F\. Altché, B\. d’Andréa\-Novel, and A\. de La Fortelle \(2017\-06\) The kinematic bicycle model: A consistent model for planning feasible trajectories for autonomous vehicles?\. In 2017 IEEE Intelligent Vehicles Symposium \(IV\), pp\. 812–818\. External Links: [Document](https://dx.doi.org/10.1109/IVS.2017.7995816)\.
- \[16\] M\. M\. Rashid, M\. Seyedi, and S\. Jung \(2024\-04\) Simulation of pedestrian interaction with autonomous vehicles via social force model\. Simulation Modelling Practice and Theory 132, pp\. 102901\. External Links: ISSN 1569\-190X, [Link](https://www.sciencedirect.com/science/article/pii/S1569190X24000157), [Document](https://dx.doi.org/10.1016/j.simpat.2024.102901)\.
- \[17\] T\. Rashid, M\. Samvelyan, C\. S\. De Witt, G\. Farquhar, J\. Foerster, and S\. Whiteson \(2020\-01\) Monotonic value function factorisation for deep multi\-agent reinforcement learning\. J\. Mach\. Learn\. Res\. 21\(1\), pp\. 178:7234–178:7284\. External Links: ISSN 1532\-4435, [Link](https://dl.acm.org/doi/10.5555/3455716.3455894)\.
- \[18\] A\. Rutherford, B\. Ellis, M\. Gallici, J\. Cook, A\. Lupu, G\. Ingvarsson, T\. Willi, R\. Hammond, A\. Khan, C\. S\. d\. Witt, A\. Souly, S\. Bandyopadhyay, M\. Samvelyan, M\. Jiang, R\. T\. Lange, S\. Whiteson, B\. Lacerda, N\. Hawes, T\. Rocktaschel, C\. Lu, and J\. N\. Foerster \(2024\-11\) JaxMARL: Multi\-Agent RL Environments and Algorithms in JAX\. arXiv\. Note: arXiv:2311\.10090 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2311.10090), [Document](https://dx.doi.org/10.48550/arXiv.2311.10090)\.
- \[19\] T\. Salzmann, B\. Ivanovic, P\. Chakravarty, and M\. Pavone \(2021\-01\) Trajectron\+\+: Dynamically\-Feasible Trajectory Forecasting With Heterogeneous Data\. arXiv\. Note: arXiv:2001\.03093 \[cs\]; European Conference on Computer Vision \(ECCV\) 2020\. External Links: [Link](http://arxiv.org/abs/2001.03093), [Document](https://dx.doi.org/10.48550/arXiv.2001.03093)\.
- \[20\] J\. Schulman, P\. Moritz, S\. Levine, M\. Jordan, and P\. Abbeel \(2018\-10\) High\-Dimensional Continuous Control Using Generalized Advantage Estimation\. arXiv\. Note: arXiv:1506\.02438 \[cs\]\. External Links: [Link](http://arxiv.org/abs/1506.02438), [Document](https://dx.doi.org/10.48550/arXiv.1506.02438)\.
- \[21\] J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\-08\) Proximal Policy Optimization Algorithms\. arXiv\. Note: arXiv:1707\.06347 \[cs\]\. External Links: [Link](http://arxiv.org/abs/1707.06347), [Document](https://dx.doi.org/10.48550/arXiv.1707.06347)\.
- \[22\] K\. Wang, C\. Shen, X\. Li, and J\. Lu \(2025\-03\) Uncertainty Quantification for Safe and Reliable Autonomous Vehicles: A Review of Methods and Applications\. IEEE Transactions on Intelligent Transportation Systems 26\(3\), pp\. 2880–2896\. External Links: ISSN 1558\-0016, [Link](https://ieeexplore.ieee.org/document/10879299), [Document](https://dx.doi.org/10.1109/TITS.2025.3532803)\.
- \[23\] Y\. Wang, A\. R\. Srinivasan, Y\. M\. Lee, and G\. Markkula \(2024\-09\) Modeling Pedestrian Crossing Behavior: A Reinforcement Learning Approach with Sensory Motor Constraints\. arXiv\. Note: arXiv:2409\.14522 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2409.14522), [Document](https://dx.doi.org/10.48550/arXiv.2409.14522)\.
- \[24\] C\. S\. d\. Witt, T\. Gupta, D\. Makoviichuk, V\. Makoviychuk, P\. H\. S\. Torr, M\. Sun, and S\. Whiteson \(2020\-11\) Is Independent Learning All You Need in the StarCraft Multi\-Agent Challenge?\. arXiv\. Note: arXiv:2011\.09533 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2011.09533), [Document](https://dx.doi.org/10.48550/arXiv.2011.09533)\.
- \[25\] Z\. Xue, Q\. Dong, X\. Fan, Q\. Jin, H\. Jian, and J\. Liu \(2017\-10\) Fuzzy Logic\-Based Model That Incorporates Personality Traits for Heterogeneous Pedestrians\. Symmetry 9\(10\), pp\. 239\. External Links: ISSN 2073\-8994, [Link](https://www.mdpi.com/2073-8994/9/10/239), [Document](https://dx.doi.org/10.3390/sym9100239)\.
- \[26\] C\. Yu, A\. Velu, E\. Vinitsky, J\. Gao, Y\. Wang, A\. Bayen, and Y\. Wu \(2022\-11\) The surprising effectiveness of PPO in cooperative multi\-agent games\. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, pp\. 24611–24624\. External Links: ISBN 978\-1\-7138\-7108\-8\.
- \[27\] T\. Yu, K\. Wang, Z\. Li, T\. Yu, and K\. Sakaguchi \(2025\-05\) Multi\-Agent Reinforcement Learning\-based Cooperative Autonomous Driving in Smart Intersections\. arXiv\. Note: arXiv:2505\.04231 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2505.04231), [Document](https://dx.doi.org/10.48550/arXiv.2505.04231)\.
- \[28\]R\. Zhang, J\. Hou, F\. Walter, S\. Gu, J\. Guan, F\. Röhrbein, Y\. Du, P\. Cai, G\. Chen, and A\. Knoll\(2024\-08\)Multi\-Agent Reinforcement Learning for Autonomous Driving: A Survey\.arXiv\.Note:arXiv:2408\.09675 \[cs\]Comment: 23 pages, 6 figures and 2 tables\. Submitted to IEEE JournalExternal Links:[Link](http://arxiv.org/abs/2408.09675),[Document](https://dx.doi.org/10.48550/arXiv.2408.09675)Cited by:[§I](https://arxiv.org/html/2605.20255#S1.p2.1)\.
- \[29\]Z\. Zhang, H\. Li, T\. Chen, N\. N\. Sze, W\. Yang, Y\. Zhang, and G\. Ren\(2025\-02\)Decision\-making of autonomous vehicles in interactions with jaywalkers: A risk\-aware deep reinforcement learning approach\.Accident Analysis & Prevention210,pp\. 107843\.External Links:ISSN 0001\-4575,[Link](https://www.sciencedirect.com/science/article/pii/S0001457524003889),[Document](https://dx.doi.org/10.1016/j.aap.2024.107843)Cited by:[§I](https://arxiv.org/html/2605.20255#S1.p2.1),[TABLE I](https://arxiv.org/html/2605.20255#S2.T1.4.3.1.1),[§II](https://arxiv.org/html/2605.20255#S2.p2.1)\.
