Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving

arXiv cs.AI Papers

Summary

This paper proposes an uncertainty-aware reinforcement learning framework for autonomous driving that uses expert advice guided by adaptive uncertainty thresholds and a commitment-cooldown strategy to improve safety and efficiency. Experiments in the CARLA simulator show a 5-7% success improvement over the IQN baseline.

arXiv:2605.30576v1 Announce Type: new

Cached at: 06/01/26, 09:23 AM

# Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving
Source: [https://arxiv.org/html/2605.30576](https://arxiv.org/html/2605.30576)
Ahmed Abouelazm (1, 2), Felix Klingebiel (2), Philip Schoerner (1, 2), and J. Marius Zöllner (1, 2)

(1) Authors are with the FZI Research Center for Information Technology, Germany, name@fzi.de

(2) Authors are with the Karlsruhe Institute of Technology, Germany

###### Abstract

Exploration in reinforcement learning for autonomous driving is inherently unsafe: agents must experience novel behaviors to learn, yet exploration can lead to collisions or off\-road driving\. We propose an uncertainty\-aware framework that leverages expert advice to guide exploration while avoiding long\-term dependence\. Advice is triggered when epistemic or aleatoric uncertainty exceeds adaptive thresholds derived from rolling buffers, ensuring advice evolves with the agent’s confidence\. A commitment–cooldown strategy with a stochastic early\-stop heuristic regulates the duration and frequency of guidance, exposing the agent to coherent maneuvers without exhausting the advice budget\. Expert and agent experiences are combined in a shared replay buffer within an off\-policy implicit quantile network \(IQN\) backbone, enabling efficient reuse of expert trajectories\. Experiments in CARLA show that our method outperforms the IQN baseline, improving success by 5–7% and reducing failures, demonstrating that risk\-sensitive uncertainty coupled with regulated expert integration enables safer and more efficient exploration for sensor\-based RL policy learning in unsignalized intersection navigation\.

## I Introduction

End-to-end (E2E) learning has emerged as a promising paradigm for autonomous driving (AD), mapping raw sensory inputs directly to control actions and reducing reliance on handcrafted modular pipelines \[[6](https://arxiv.org/html/2605.30576#bib.bib3)\]. Among E2E approaches, reinforcement learning (RL) is particularly appealing: unlike imitation learning (IL), which depends on expert demonstrations, RL enables agents to adapt through direct interaction with the environment \[[32](https://arxiv.org/html/2605.30576#bib.bib7)\]. By exploring different behaviors and receiving feedback via rewards, RL agents can handle complex conditions such as occlusions, erratic drivers, and rare events that are underrepresented in datasets.

However, RL also poses fundamental challenges in sample efficiency and safety \[[14](https://arxiv.org/html/2605.30576#bib.bib5)\]. The learning process is driven by the interplay between exploration, where the agent deliberately samples novel actions to acquire new knowledge about the environment, and exploitation, where it applies its current policy to maximize expected rewards \[[32](https://arxiv.org/html/2605.30576#bib.bib7)\]. Exploration is indispensable for avoiding suboptimal policies and uncovering effective driving strategies, yet it inherently involves unsafe or undesirable behaviors \[[36](https://arxiv.org/html/2605.30576#bib.bib6)\].

At the same time, extensive exploration leads to poor sample efficiency, as agents require a vast number of interactions before converging to a reliable policy\[[12](https://arxiv.org/html/2605.30576#bib.bib8)\]\. These limitations underscore the need for mechanisms that constrain or prune exploration, directing the agent toward informative regions of the state space while preventing unnecessary behaviors\.

Research Gap. Existing work on safe exploration in RL for AD has followed several directions. Formal approaches such as Constrained Markov Decision Processes (CMDPs) \[[17](https://arxiv.org/html/2605.30576#bib.bib17)\] and Lyapunov functions \[[25](https://arxiv.org/html/2605.30576#bib.bib43)\] encode safety explicitly, but they require handcrafted risk definitions, involve complex optimization, and provide no mechanism to guide policies toward safer actions. To address these limitations, expert knowledge has been introduced through demonstrations. Demonstration-based pre-training \[[7](https://arxiv.org/html/2605.30576#bib.bib52)\] accelerates convergence yet leaves agents vulnerable to unsafe exploration and provides no corrective feedback, while human-in-the-loop systems \[[18](https://arxiv.org/html/2605.30576#bib.bib58)\] allow targeted interventions but are not scalable and remain detached from the agent's internal decision process.

More scalable alternatives replace humans with rule-based \[[38](https://arxiv.org/html/2605.30576#bib.bib59)\] or learned \[[41](https://arxiv.org/html/2605.30576#bib.bib63)\] expert policies, which intervene automatically based on predefined heuristics. Although effective at preventing unsafe rollouts, rule-based triggers are overly conservative and context-dependent, while learned experts assume strong generalization and may overwrite valid actions with poor ones when this assumption fails. Recent advances shift control to the agent, allowing it to query expert advice when needed via triggers such as state uncertainty \[[8](https://arxiv.org/html/2605.30576#bib.bib36)\] and state novelty \[[19](https://arxiv.org/html/2605.30576#bib.bib65)\]. However, these approaches remain largely state-centric, neglect the risk induced by the agent's exploratory actions, and lack mechanisms to regulate the frequency and impact of advice.

To address these shortcomings, methods are needed that extend uncertainty estimation to capture action\-related risk, combined with adaptive mechanisms that regulate advice and embed safety into the learned policy\.

Contribution. This work proposes an RL framework for AD that improves exploration safety by extending uncertainty estimation to account for action-related risk and integrating expert input through adaptive mechanisms. The key contributions are:

- Uncertainty-aware expert guidance: the agent requests advice in states with high epistemic or aleatoric uncertainty, storing both expert and agent transitions in a shared buffer for training.
- Uncertainty estimation: aleatoric uncertainty (environmental risk) is derived from return variance, while epistemic uncertainty (limited knowledge) is measured via bootstrapped ensembles, enabling detection of risky and underexplored states.
- Commitment–cooldown strategy: regulates advice use by applying expert actions over short horizons and enforcing cooldowns, ensuring effective guidance without overreliance.

## II Related Work

Safe RL in AD is challenging due to the need to balance efficiency with safety requirements. A common approach is to regulate exploration through constraints or learned objectives. CMDP formulations constrain cumulative risk below predefined thresholds \[[39](https://arxiv.org/html/2605.30576#bib.bib47),[17](https://arxiv.org/html/2605.30576#bib.bib17)\], while Lyapunov-based methods enforce stability by requiring a Lyapunov function to decrease along trajectories \[[25](https://arxiv.org/html/2605.30576#bib.bib43),[11](https://arxiv.org/html/2605.30576#bib.bib48)\]. Control Barrier Functions (CBFs) define forward-invariant safe sets and enforce constraints to keep trajectories within these sets \[[33](https://arxiv.org/html/2605.30576#bib.bib49),[34](https://arxiv.org/html/2605.30576#bib.bib50)\], and distributional RL optimizes risk-sensitive returns such as Conditional Value at Risk (CVaR) \[[3](https://arxiv.org/html/2605.30576#bib.bib51),[21](https://arxiv.org/html/2605.30576#bib.bib41)\].

Despite their differences, these approaches share key limitations: they solve complex optimization problems, depend on manually defined costs or stability conditions, or rely on learned objectives that are especially noisy at the start of training, and crucially, they lack mechanisms to actively guide the policy toward safer actions\. This prompted the use of expert knowledge and demonstrations, which provide more direct and effective signals during exploration\.

Several works utilize demonstrations to pre\-train policies, which accelerates convergence but leaves the agent vulnerable to unsafe exploration and provides no corrective feedback during training\[[7](https://arxiv.org/html/2605.30576#bib.bib52),[31](https://arxiv.org/html/2605.30576#bib.bib53)\]\. Early integration strategies alternate between collecting episodes from the expert and from the agent’s policy\[[13](https://arxiv.org/html/2605.30576#bib.bib54),[5](https://arxiv.org/html/2605.30576#bib.bib21)\], which improves robustness but still limits expert input to entire episodes rather than targeted interventions\. As these episodes predominantly capture nominal driving and seldom illustrate recovery from unsafe states, these methods fail to teach agents how to act in the situations where guidance is most needed\. This limitation has driven interest in more active systems, where experts can intervene, or the agent selectively queries advice\.

A common form of active expert intervention is human\-in\-the\-loop training\[[30](https://arxiv.org/html/2605.30576#bib.bib56),[18](https://arxiv.org/html/2605.30576#bib.bib58)\], where a supervisor monitors the environment and provides corrective actions to guide the agent out of unsafe states\. While effective, this approach is not scalable for long training, introduces bias in when and how interventions occur, and remains detached from the agent’s decision process, as humans cannot fully observe the agent’s intentions and may override potential recovery actions prematurely\. These challenges have shifted attention toward alternative expert policies that can offer more scalable and consistent intervention during training\.

For such alternative policies, explicit intervention signals are needed to decide when to prune unsafe exploration\. One direction relies on rule\-based definitions of risk, where interventions occur if the risk associated with the agent’s behavior exceeds a threshold\[[27](https://arxiv.org/html/2605.30576#bib.bib16),[21](https://arxiv.org/html/2605.30576#bib.bib41)\]or if the behavior is identified as unsafe through safety analysis\[[38](https://arxiv.org/html/2605.30576#bib.bib59),[35](https://arxiv.org/html/2605.30576#bib.bib60)\]\. While effective at avoiding unsafe rollouts, these definitions are often complex, context\-dependent, and can restrict exploration and hinder policy improvement\.

To overcome the rigidity of rule\-based experts, recent works employ learned expert policies\. These experts intervene if the agent’s action has a low likelihood under the expert policy\[[29](https://arxiv.org/html/2605.30576#bib.bib62)\], or the expert’s value function deems it unsafe\[[37](https://arxiv.org/html/2605.30576#bib.bib61)\], or it deviates significantly from the expert’s optimal action\[[41](https://arxiv.org/html/2605.30576#bib.bib63)\]\. Although more flexible, these methods assume that the expert policy generalizes reliably across the state space\. When this assumption fails, interventions may overwrite reasonable agent actions with poor ones, leading to suboptimal behavior\. This has opened the way for approaches where the agent selectively queries expert advice instead of being forced into intervention\.

Expert advice gives the agent more control by allowing it to decide when to query the expert\. Several works trigger advice based on signals derived from the agent’s learning process, such as state uncertainty\[[22](https://arxiv.org/html/2605.30576#bib.bib64),[8](https://arxiv.org/html/2605.30576#bib.bib36)\], state novelty\[[19](https://arxiv.org/html/2605.30576#bib.bib65),[40](https://arxiv.org/html/2605.30576#bib.bib66)\], or similarity to unsafe states\[[4](https://arxiv.org/html/2605.30576#bib.bib67)\]\. While these methods provide a more active framework, they focus primarily on state\-based measures, neglect the risk induced by the agent’s actions, lack mechanisms to regulate the impact or frequency of advice, and often rely on fixed thresholds to decide when to intervene\. These gaps motivate our framework for selective expert advice, which expands uncertainty estimation to explicitly account for action\-related risks and employs adaptive triggering to deliver targeted guidance that is internalized by the policy during training\.

## III Methodology

Our framework, illustrated in Fig. [1](https://arxiv.org/html/2605.30576#S3.F1), introduces an advice mechanism in which the agent requests guidance when its policy is uncertain about the state (epistemic uncertainty) or potentially unsafe due to its actions (aleatoric uncertainty). An adaptive triggering strategy regulates when and for how long advice is applied, ensuring that interventions accelerate learning without fostering long-term dependence.

### III-A Problem Formulation

We formalize policy learning for AD as a Partially Observable Markov Decision Process (POMDP), defined by the tuple $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathcal{P},r,\mathcal{O},\gamma\rangle$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, and $\mathcal{P}(s'\mid s,a)$ is the transition model describing the probability of reaching state $s'$ from state $s$ under action $a$. The reward $r(s,a)$ specifies the immediate feedback for taking action $a$ in state $s$, while $\gamma\in[0,1)$ is the discount factor that balances short- and long-term returns. In contrast to an MDP, the agent cannot directly access the underlying state $s_t$. Instead, after executing an action $a_t$, it receives an observation $o_{t+1}\sim\mathcal{O}(s_{t+1},a_t)$, which is generated from the hidden state $s_{t+1}$ via a sensor model.

![Refer to caption](https://arxiv.org/html/2605.30576v1/x1.png)

Figure 1: Overview of the proposed uncertainty-aware expert guidance framework. An ensemble distributional architecture provides epistemic and aleatoric uncertainty estimates. The agent queries the expert whenever either uncertainty exceeds adaptive thresholds derived from rolling buffers. A commitment–cooldown mechanism regulates the frequency and duration of advice.

Beyond its own policy $\pi_\theta$, the agent has access to an expert policy $\pi_E$ that can be queried for advice, subject to a global budget $B$ that limits the total number of requests and reflects practical constraints such as limited availability of human supervision or computationally expensive expert controllers. We treat the expert as a black-box oracle and make no assumptions about its optimality or consistency; we only require that it can provide an action for a given observation or state. To regulate when advice is requested, we introduce an advice decision policy $\phi:\mathcal{O}\times\mathcal{A}\to\{0,1\}$. For convenience, we denote $\phi_t=\phi(o_t,a_t)$, where $\phi_t=1$ indicates that the expert policy $\pi_E$ is queried at time $t$, and $\phi_t=0$ denotes that the agent follows its own policy $\pi_\theta$.

### III-B Policy Learning Algorithm

Our framework requires a learning algorithm that can integrate expert advice with agent experiences. This motivates an off-policy approach, where experiences are collected under a behavior policy $\pi_b$ and used to optimize the target policy $\pi_\theta$. This decoupling allows expert actions to be inserted into a shared replay buffer alongside agent rollouts, ensuring that advice can be reused across multiple updates. In contrast, on-policy methods couple data collection and optimization, so expert overrides produce biased gradient estimates. Additionally, because these methods discard rollouts after each update, expert corrections cannot be reused, and their effect remains short-lived.

Partial observability further complicates AD, as decisions must be made under both epistemic (limited knowledge) and aleatoric (sensor noise and occlusions) uncertainty. Conventional value-based methods optimize only expected returns, which do not capture these uncertainties. To address this, we adopt a distributional RL formulation that models the entire return distribution.

Specifically, we employ the Implicit Quantile Network (IQN) \[[9](https://arxiv.org/html/2605.30576#bib.bib12)\], which uniformly samples quantile fractions $\tau\sim\mathcal{U}[0,1]$. Each $\tau$ is first embedded as cosine features and combined with the observation latent representation. The network then uses this joint representation to output $Z_\tau(o_t,a)$, which approximates the $\tau$-quantile of the return distribution, i.e., the inverse cumulative distribution function $F^{-1}_{Z(o_t,a)}(\tau)$, as shown in Eq. [1](https://arxiv.org/html/2605.30576#S3.E1). Furthermore, IQN supports risk-sensitive decision-making by conditioning on specific regions of the distribution, for instance, focusing on lower quantiles to prioritize safer driving.

$$Z_\tau(o_t,a)\;\approx\;F^{-1}_{Z(o_t,a)}(\tau),\quad \tau\sim\mathcal{U}[0,1] \tag{1}$$
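The quantile embedding described above can be sketched as follows. This is a minimal illustration, assuming the cosine basis $\cos(\pi i \tau)$ from the original IQN formulation; the feature count `n_features` and the helper name `cosine_features` are hypothetical, and the learned linear layer that normally follows the basis is omitted.

```python
import math
import random


def cosine_features(tau, n_features=8):
    """Cosine basis cos(pi * i * tau), i = 0..n_features-1, for one quantile fraction.

    Assumption: the basis follows the original IQN embedding; n_features is illustrative.
    """
    return [math.cos(math.pi * i * tau) for i in range(n_features)]


rng = random.Random(0)                    # fixed seed so the sketch is repeatable
taus = [rng.random() for _ in range(4)]   # N_tau = 4 fractions sampled uniformly
features = [cosine_features(t) for t in taus]
```

In a full network, each feature vector would pass through a learned layer and be combined with the observation latent before the quantile value $Z_\tau(o_t,a)$ is predicted.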

### III-C Uncertainty Estimation

Uncertainty is central to our proposed approach, as it dictates when the agent should request expert advice, enabling dynamic adaptation in high-risk or ambiguous scenarios. To efficiently estimate uncertainty, we adopt an ensemble framework with multiple heads sharing a latent encoder, as illustrated in Fig. [1](https://arxiv.org/html/2605.30576#S3.F1). Each head is trained on a different subset of the replay buffer \[[28](https://arxiv.org/html/2605.30576#bib.bib32)\], ensuring that variability in underexplored or ambiguous regions of the state space is reflected in the ensemble's estimates.

The model output is structured as $N_H\times N_\tau\times N_{\mathcal{A}}$, where $N_H$ denotes the number of ensemble heads, $N_\tau$ the number of sampled quantiles, and $N_{\mathcal{A}}$ the number of available actions. This output captures variability across ensemble predictions, distributional samples, and actions, and serves as the foundation for the advice mechanism that determines when expert intervention is desired. To ensure reliable decision-making, the agent must account for both epistemic and aleatoric uncertainty, the two principal sources of uncertainty in RL \[[26](https://arxiv.org/html/2605.30576#bib.bib34)\].

#### III-C1 Epistemic Uncertainty

Epistemic uncertainty arises from the agent's limited knowledge about the environment and is most pronounced in regions of the state space that have not been sufficiently explored \[[24](https://arxiv.org/html/2605.30576#bib.bib35)\]. During training, this uncertainty gradually decreases as the agent collects more diverse experiences, enabling it to form more confident estimates in previously unfamiliar situations. For AD, this is particularly relevant when the vehicle encounters uncommon road layouts or atypical maneuvers from other drivers. In such cases, high epistemic uncertainty reflects limited prior exposure but also indicates that the agent has the potential to improve its understanding if similar scenarios are encountered again.

In this work, we propose a state\-based formulation that jointly considers the return distributions of all actions\. This choice provides a more stable estimate and aligns with the definition of epistemic uncertainty as a lack of knowledge about the full state dynamics, rather than about individual actions in isolation\. Concretely, we measure the epistemic uncertainty as the variability of the quantile function across ensemble heads, aggregated over all actions\.

To better leverage the expressive power of distributional RL, we investigate two plausible formulations of epistemic uncertainty, each highlighting different aspects of the return distribution. Our objective is to assess which provides a more reliable signal in practice. The first formulation adopts a distributional perspective by comparing entire return distributions across ensemble heads using the Wasserstein distance $W_1$. For two heads $h$ and $h'$, the distance is computed by averaging absolute differences across quantile samples, as shown in Eq. [2](https://arxiv.org/html/2605.30576#S3.E2), where $Z^{(h)}_{\tau_i}(o_t,a)$ denotes the $\tau_i$-quantile predicted by head $h$.

$$W_1^{(h,h')}(o_t,a)=\frac{1}{N_\tau}\sum_{\tau_i\in\tau}\left|Z^{(h)}_{\tau_i}(o_t,a)-Z^{(h')}_{\tau_i}(o_t,a)\right| \tag{2}$$

Epistemic uncertainty $U^{\text{epi}}_{\text{W}}$ is computed as the mean pairwise Wasserstein distance across ensemble head pairs, capturing variability, and aggregated over actions into a state-based measure, as given in Eq. [3](https://arxiv.org/html/2605.30576#S3.E3).

$$U^{\text{epi}}_{\text{W}}(o_t)=\frac{1}{N_{\mathcal{A}}}\sum_{a\in\mathcal{A}}\frac{1}{\binom{N_H}{2}}\sum_{\substack{h,h'=1\\ h<h'}}^{N_H}W_1^{(h,h')}(o_t,a) \tag{3}$$

The second formulation emphasizes a risk-sensitive perspective. Rather than comparing full distributions, we approximate the expected return using the CVaR \[[9](https://arxiv.org/html/2605.30576#bib.bib12)\], which focuses on the lower quantiles $\widetilde{\tau}$ determined by the risk parameter $\alpha$. For head $h$, the CVaR estimate is defined in Eq. [4](https://arxiv.org/html/2605.30576#S3.E4), where $Z^{(h)}_{\widetilde{\tau}_i}(o_t,a)$ denotes the predicted return at quantile $\widetilde{\tau}_i$.

$$\operatorname{CVaR}^{(h)}_{\alpha}(o_t,a)=\frac{1}{N_\tau}\sum_{\tau_i\in\tau}Z^{(h)}_{\widetilde{\tau}_i}(o_t,a),\quad \widetilde{\tau}_i=\alpha\,\tau_i \tag{4}$$

Epistemic uncertainty $U^{\text{epi}}_{\text{CVaR}}$ is then computed as the variance of these CVaR estimates across ensemble heads, aggregated over actions, as given in Eq. [5](https://arxiv.org/html/2605.30576#S3.E5).

$$U^{\text{epi}}_{\text{CVaR}}(o_t)=\frac{1}{N_{\mathcal{A}}}\sum_{a\in\mathcal{A}}\mathrm{Var}_H\!\big(\operatorname{CVaR}^{(h)}_{\alpha}(o_t,a)\big) \tag{5}$$
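The two epistemic formulations can be sketched in plain Python as below. This is an illustrative sketch, not the authors' implementation: the nested-list layout `z[h][i][a]` (head, quantile sample, action), the function names, and the use of population variance for $\mathrm{Var}_H$ are assumptions. In the paper, the CVaR variant evaluates the heads at the lower-tail fractions $\widetilde{\tau}_i=\alpha\tau_i$, so `z_tail` is assumed to hold those values.

```python
from itertools import combinations
from statistics import fmean, pvariance


def wasserstein_epistemic(z):
    """Eqs. (2)-(3): mean pairwise W1 distance across heads, averaged over actions.

    z[h][i][a] is the quantile value of head h at quantile sample i for action a.
    """
    n_heads, n_tau, n_actions = len(z), len(z[0]), len(z[0][0])
    per_action = []
    for a in range(n_actions):
        pair_dists = [
            fmean(abs(z[h][i][a] - z[g][i][a]) for i in range(n_tau))
            for h, g in combinations(range(n_heads), 2)
        ]
        per_action.append(fmean(pair_dists))
    return fmean(per_action)


def cvar(z_tail_head, a):
    """Eq. (4): mean of one head's lower-tail quantile values for action a."""
    return fmean(row[a] for row in z_tail_head)


def cvar_epistemic(z_tail):
    """Eq. (5): variance of per-head CVaR across heads, averaged over actions.

    Assumption: population variance; the source does not specify the estimator.
    """
    n_actions = len(z_tail[0][0])
    return fmean(
        pvariance([cvar(head, a) for head in z_tail]) for a in range(n_actions)
    )


# Toy ensemble: 2 heads, 2 quantile samples, 2 actions (values are made up).
z_demo = [
    [[0.0, 1.0], [2.0, 3.0]],  # head 0
    [[1.0, 1.0], [4.0, 3.0]],  # head 1
]
u_w = wasserstein_epistemic(z_demo)
u_cvar = cvar_epistemic(z_demo)
```

In this toy case the heads disagree only on action 0, so both measures are driven entirely by that action before being averaged over the action set.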

#### III-C2 Aleatoric Uncertainty

Aleatoric uncertainty reflects the inherent randomness in the environment and cannot be eliminated through training. It arises from factors such as sensor noise or partial observability \[[26](https://arxiv.org/html/2605.30576#bib.bib34)\]. For instance, when a vehicle approaches an occluded intersection, the presence, speed, or intention of unseen vehicles is fundamentally uncertain from the agent's perspective, regardless of the agent's prior experience.

In this work, we follow the distributional interpretation used in \[[16](https://arxiv.org/html/2605.30576#bib.bib30)\], where aleatoric uncertainty is derived from the spread of the predicted return distribution. To align this with our advice mechanism, uncertainty is evaluated only for the action the agent intends to execute, since this action determines whether intervention is required. Accordingly, aleatoric uncertainty is defined as the lower-tail return-distribution variance of the selected risk-sensitive action $a^*$, chosen by the greedy CVaR-based policy in Eq. [6](https://arxiv.org/html/2605.30576#S3.E6).

$$a^*=\operatorname*{arg\,max}_{a\in\mathcal{A}}\frac{1}{N_H}\sum_{h=1}^{N_H}\operatorname{CVaR}^{(h)}_{\alpha}(o_t,a) \tag{6}$$

To quantify uncertainty for $a^*$, we focus on the lower-tail quantiles $\widetilde{\tau}$, which are determined by the risk parameter $\alpha$. For each quantile $\widetilde{\tau}_i$, we average the corresponding predictions across ensemble heads, yielding the ensemble-mean quantile estimate $\mu_{\widetilde{\tau}_i}$, as shown in Eq. [7](https://arxiv.org/html/2605.30576#S3.E7). Aleatoric uncertainty is then computed as the variance of these ensemble-mean estimates over the lower-tail quantiles $\widetilde{\tau}$ (Eq. [8](https://arxiv.org/html/2605.30576#S3.E8)).

$$\mu_{\widetilde{\tau}_i}(o_t,a^*)=\frac{1}{N_H}\sum_{h=1}^{N_H}Z^{(h)}_{\widetilde{\tau}_i}(o_t,a^*) \tag{7}$$

$$U^{\text{ale}}_{\alpha}(o_t,a^*)=\operatorname{Var}_{\widetilde{\tau}_i\in\widetilde{\tau}}\left[\mu_{\widetilde{\tau}_i}(o_t,a^*)\right] \tag{8}$$
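A minimal sketch of Eqs. (6) to (8), under the same assumptions as before: `z_tail[h][i][a]` is a hypothetical nested-list layout holding each head's predictions at the lower-tail fractions, the function names are illustrative, and population variance is an assumed choice of estimator.

```python
from statistics import fmean, pvariance


def greedy_cvar_action(z_tail):
    """Eq. (6): index of the action with the highest ensemble-mean CVaR."""
    n_actions = len(z_tail[0][0])

    def mean_cvar(a):
        # Per-head CVaR is the mean over lower-tail quantile values (Eq. 4).
        return fmean(fmean(row[a] for row in head) for head in z_tail)

    return max(range(n_actions), key=mean_cvar)


def aleatoric_uncertainty(z_tail, a_star):
    """Eqs. (7)-(8): variance over quantiles of the ensemble-mean quantile values."""
    n_tau = len(z_tail[0])
    mu = [fmean(head[i][a_star] for head in z_tail) for i in range(n_tau)]
    return pvariance(mu)  # assumption: population variance


# Toy ensemble: 2 heads, 2 lower-tail quantile samples, 2 actions (made-up values).
z_tail_demo = [
    [[0.0, 1.0], [2.0, 3.0]],  # head 0
    [[1.0, 1.0], [4.0, 3.0]],  # head 1
]
a_star = greedy_cvar_action(z_tail_demo)
u_ale = aleatoric_uncertainty(z_tail_demo, a_star)
```

Averaging across heads before taking the variance over quantiles keeps head disagreement (the epistemic signal) out of this measure, so what remains is the spread of the return distribution itself.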

### III-D Expert Advice Integration

In this work, we argue that expert advice is most beneficial when the agent is uncertain about the state or the risk of its actions. Leveraging the uncertainty measures introduced in Section [III-C](https://arxiv.org/html/2605.30576#S3.SS3), we design a query mechanism that triggers advice under uncertainty, employs a commitment–cooldown strategy to prevent over-reliance, and integrates expert guidance into training to accelerate policy improvement.

#### III-D1 Adaptive Query Mechanism

A central challenge in leveraging uncertainty for advice queries is determining when uncertainty is high enough\. Fixed thresholds are impractical because uncertainty magnitudes are not known in advance and evolve with the agent’s policy: thresholds set too high suppress valuable guidance during early training, while thresholds set too low risk over\-reliance on the expert\. To address this, we introduce an adaptive thresholding mechanism that evolves with training dynamics\. Unlike heuristic or static thresholds used in prior work\[[15](https://arxiv.org/html/2605.30576#bib.bib28),[8](https://arxiv.org/html/2605.30576#bib.bib36)\], our approach adapts to recent uncertainty statistics, enabling consistent triggering of advice across different training phases\.

Epistemic and aleatoric uncertainties are stored in separate buffers that maintain a moving window of past values. At each step $t$, the adaptive threshold for each uncertainty type is the empirical $\beta$-percentile of its buffer, and the current uncertainty is compared against it. Formally, the thresholds $T^{\text{epi}}_t$ and $T^{\text{ale}}_t$ are defined in Eq. [9](https://arxiv.org/html/2605.30576#S3.E9), where $\mathrm{prc}_\beta(\cdot)$ denotes the $\beta$-percentile operator applied to the corresponding buffer of past values.

$$T^{\text{epi}}_t=\mathrm{prc}_\beta\!\left(\{U^{\text{epi}}_{t'}\}_{t'<t}\right),\quad T^{\text{ale}}_t=\mathrm{prc}_\beta\!\left(\{U^{\text{ale}}_{t'}\}_{t'<t}\right) \tag{9}$$

If either epistemic or aleatoric uncertainty exceeds its respective threshold, the state is classified as high-uncertainty and triggers an advice request as defined in Eq. ([10](https://arxiv.org/html/2605.30576#S3.E10)), where $\mathbb{1}[\cdot]$ denotes the indicator function.

$$\phi_t=\mathbb{1}\!\left[U^{\text{epi}}(o_t)>T^{\text{epi}}_t\;\lor\;U^{\text{ale}}(o_t,a^*)>T^{\text{ale}}_t\right] \tag{10}$$

This dynamic mechanism reflects the demands of AD: even if the agent has experienced similar situations, high aleatoric uncertainty (e.g., an occluded intersection) still warrants caution, while high epistemic uncertainty in a less familiar road layout justifies consulting the expert.
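The rolling-buffer thresholds and the trigger in Eqs. (9) and (10) might be sketched as follows. The class name `RollingPercentile`, the nearest-rank percentile rule, and the choice to return an infinite threshold while the buffer is empty (so no advice is requested before any statistics exist) are assumptions for illustration; the source does not specify the percentile estimator or the warm-up behavior.

```python
import math
from collections import deque


class RollingPercentile:
    """Moving window of past uncertainty values with a beta-percentile threshold."""

    def __init__(self, window, beta):
        self.values = deque(maxlen=window)  # oldest values drop out automatically
        self.beta = beta                    # percentile in [0, 100]

    def threshold(self):
        if not self.values:
            return math.inf  # assumption: never trigger on an empty buffer
        ordered = sorted(self.values)
        # Nearest-rank percentile (an assumed estimator).
        rank = max(1, math.ceil(self.beta * len(ordered) / 100))
        return ordered[rank - 1]

    def push(self, value):
        self.values.append(value)


def advice_trigger(u_epi, u_ale, epi_buf, ale_buf):
    """Eq. (10): request advice if either uncertainty exceeds its threshold.

    Thresholds use only past values (t' < t); current values are stored afterwards.
    """
    phi = u_epi > epi_buf.threshold() or u_ale > ale_buf.threshold()
    epi_buf.push(u_epi)
    ale_buf.push(u_ale)
    return phi
```

Because the window slides, a given uncertainty value can trigger advice late in training even though it would not have early on, which matches the intent that thresholds track the agent's recent confidence.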

#### III-D2 Commitment–Cooldown Strategy

Most prior work \[[8](https://arxiv.org/html/2605.30576#bib.bib36),[16](https://arxiv.org/html/2605.30576#bib.bib30)\] neglects the temporal regulation of expert advice, typically restricting advice to a single step. This has two drawbacks: (i) sparse corrections provide little opportunity to internalize the expert's policy (e.g., a one-step lane-change correction does not expose the full maneuver), and (ii) unrestricted queries risk over-reliance and rapid exhaustion of the advice budget.

We address these issues with a commitment–cooldown strategy\. During commitment, once triggered, the agent follows the expert for several consecutive steps, providing richer temporal context and exposure to coherent expert behavior\. The subsequent cooldown blocks new advice requests, enforcing independent operation\.
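The fixed-length variant of this schedule can be sketched as a small state machine. All names (`CommitmentCooldown`, `commit_steps`, `cooldown_steps`) are hypothetical, and the sketch assumes that each executed expert action consumes one unit of the advice budget; the source only states that a global budget limits requests.

```python
class CommitmentCooldown:
    """Decides per step whether the expert action is executed."""

    def __init__(self, commit_steps, cooldown_steps, budget):
        self.commit_steps = commit_steps
        self.cooldown_steps = cooldown_steps
        self.budget = budget      # assumption: one unit per expert action
        self.commit_left = 0      # remaining steps in the current commitment
        self.cooldown_left = 0    # remaining steps with advice blocked

    def step(self, trigger):
        """Return True if the expert acts this step; `trigger` is the advice signal."""
        if self.cooldown_left > 0:
            self.cooldown_left -= 1   # advice requests are blocked
            return False
        if self.commit_left == 0 and trigger and self.budget > 0:
            self.commit_left = self.commit_steps  # start a new commitment
        if self.commit_left > 0 and self.budget > 0:
            self.commit_left -= 1
            self.budget -= 1
            if self.commit_left == 0 or self.budget == 0:
                # Commitment finished (or budget exhausted): enter cooldown.
                self.commit_left = 0
                self.cooldown_left = self.cooldown_steps
            return True
        return False


schedule = CommitmentCooldown(commit_steps=3, cooldown_steps=2, budget=10)
expert_acts = [schedule.step(True) for _ in range(8)]
```

With the trigger held on for eight steps, the expert acts for three steps, is blocked for two, and then acts for three more, so the agent is forced to operate independently between advice segments even when uncertainty stays high.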

In addition to fixed commitment and cooldown periods, we propose a stochastic risk-sensitive heuristic for early termination of the commitment phase, inspired by the policy switching mechanism in \[[23](https://arxiv.org/html/2605.30576#bib.bib37)\]. The agent disengages from the expert once there is sufficient statistical evidence ($P_{\text{imp}}$) that its own policy achieves higher safety-adjusted returns. We quantify this evidence as the probability that the agent's optimal action $a^*$ outperforms the expert's action $a_E$, as defined in Eq. ([11](https://arxiv.org/html/2605.30576#S3.E11)), where $X_a$ denotes a random variable approximating the safety-adjusted return of action $a$.

$$P_{\text{imp}}=P(X_{a^*}>X_{a_E}) \tag{11}$$

For each action $a$, every ensemble head $h$ produces a CVaR estimate $\operatorname{CVaR}^{(h)}_{\alpha}(o_t,a)$. We treat the resulting set $\{\operatorname{CVaR}^{(h)}_{\alpha}(o_t,a)\}_{h=1}^{N_H}$ as samples from the random variable $X_a$ and approximate it by a Gaussian distribution with empirical mean $\mu_{X_a}$ and variance $\sigma^2_{X_a}$. To compare the optimal action $a^*$ with the expert's action $a_E$, we define the difference $\Delta=X_{a^*}-X_{a_E}$. Under the Gaussian approximation of $X_{a^*}$ and $X_{a_E}$, and treating the two as independent so that their variances add, the difference $\Delta$ is also Gaussian, as given in Eq. ([12](https://arxiv.org/html/2605.30576#S3.E12)).

$$\Delta \sim \mathcal{N}\big(\mu_{X_{a^{*}}} - \mu_{X_{a_{E}}},\; \sigma_{X_{a^{*}}}^{2} + \sigma_{X_{a_{E}}}^{2}\big) \tag{12}$$

Accordingly, $P_{\text{imp}} = P(\Delta > 0)$ is evaluated using the Gaussian cumulative distribution function\. Early stopping of the commitment phase is triggered whenever $P_{\text{imp}} > \lambda \cdot \rho^{t_{c}}$, where $\lambda$ is a base confidence, $\rho \in (0,1)$ is a decay factor, and $t_{c}$ is the number of elapsed commitment steps\. Since $\lambda \cdot \rho^{t_{c}}$ decreases monotonically with $t_{c}$, the confidence threshold lowers over time, making disengagement increasingly likely the longer advice is followed\.
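The early\-stop test of Eqs\. \(11\) and \(12\) can be sketched in a few lines\. This is an illustrative reading, not the authors' code: the function names are hypothetical, the sample variance \(`statistics.variance`\) stands in for the unspecified "empirical variance", and the defaults `lam=0.9` and `rho=0.95` are placeholders rather than reported values\. Adding the two variances, as in Eq\. \(12\), corresponds to treating the two Gaussian estimates as independent\.

```python
import math
from statistics import mean, variance


def improvement_probability(cvar_agent, cvar_expert):
    """P(Delta > 0) for Delta = X_agent - X_expert under Gaussian fits.

    Each argument holds one CVaR estimate per ensemble head (at least two).
    """
    mu = mean(cvar_agent) - mean(cvar_expert)
    var = variance(cvar_agent) + variance(cvar_expert)  # sum of variances, Eq. (12)
    if var == 0.0:
        # Degenerate case: all heads agree, so the comparison is deterministic.
        return 1.0 if mu > 0 else 0.0
    z = mu / math.sqrt(var)
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def should_stop_early(p_imp, t_c, lam=0.9, rho=0.95):
    """Stop following the expert once P_imp exceeds the decayed threshold."""
    return p_imp > lam * rho ** t_c
```

For example, a fixed `p_imp` of 0\.8 does not trigger a stop at `t_c=0` under these placeholder values but does at `t_c=5`, since the threshold has decayed below 0\.8 by then\.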

#### III\-D3 Integration in Policy Learning

We integrate expert advice directly into the learning process by maintaining a shared replay buffer that stores both agent experiences and expert advice, leveraging the off\-policy nature of IQN\. This contrasts with approaches that separate expert and agent buffers \[[30](https://arxiv.org/html/2605.30576#bib.bib56)\], which may keep expert samples disproportionately long and risk anchoring the policy to the expert’s behavior\. In our design, the frequency of expert\-generated transitions in the replay buffer decreases naturally as the agent becomes more confident and requests advice less frequently in similar states, better reflecting the agent’s evolving competence\.

Furthermore, since advice queries are triggered by uncertainty, the ratio of expert to agent samples adjusts automatically: when the agent is frequently uncertain, expert actions fill the buffer; when the agent is confident, its own experience dominates\. This dynamic balance emerges naturally without imposing artificial ratios\. Finally, the commitment\-cooldown strategy further enriches the replay buffer with coherent trajectories rather than isolated corrections\. By committing to short horizons of consecutive expert actions, the agent is exposed to complete maneuvers that provide more meaningful training signals\. The subsequent cooldown enforces independence and prevents over\-reliance on the expert\.
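A shared buffer of this kind can be sketched as follows\. The sketch assumes a first\-in\-first\-out buffer with uniform sampling, neither of which is stated in the text; the class name `SharedReplayBuffer` and the `from_expert` flag are hypothetical and used only for bookkeeping\.

```python
import random
from collections import deque


class SharedReplayBuffer:
    """Single FIFO buffer holding both agent and expert transitions."""

    def __init__(self, capacity):
        # Oldest transitions are evicted first, regardless of their origin.
        self.storage = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done, from_expert):
        # Expert and agent transitions carry the same environment reward;
        # from_expert is kept for monitoring only and is ignored by sample().
        self.storage.append((obs, action, reward, next_obs, done, from_expert))

    def sample(self, batch_size, rng=random):
        """Draw a uniform batch without distinguishing expert from agent data."""
        return rng.sample(list(self.storage), batch_size)

    def expert_fraction(self):
        """Share of stored transitions that were generated by the expert."""
        if not self.storage:
            return 0.0
        return sum(t[5] for t in self.storage) / len(self.storage)
```

Because eviction is by age only, the expert share reported by `expert_fraction()` shrinks on its own once the agent stops requesting advice\.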

We also avoid reward forcing\[[5](https://arxiv.org/html/2605.30576#bib.bib21),[30](https://arxiv.org/html/2605.30576#bib.bib56)\], where expert actions are artificially assigned high rewards\. Such strategies implicitly assume that the expert is optimal, which is unrealistic in practice\. Our expert may be suboptimal, and we want the learner to retain the ability to surpass it\. By evaluating expert actions under the same reward function as the agent, the framework benefits from guidance without biasing the policy toward potentially flawed behavior\.

Together, these choices ensure that expert knowledge is integrated in a balanced and principled way: leveraged when uncertainty is high, gradually fading as the agent becomes more competent, and always evaluated under the same reward function\. This makes advice an accelerator for learning rather than a lasting dependency, embedding safety within the policy without limiting long\-term autonomy\.

## IV Experimental Setup

This section details the experimental setup, including the RL agent design, traffic scenarios, and the utilized expert\. We also outline baselines, ablations, and evaluation metrics to enable a fair comparison of performance\.

### IV\-A RL Agent Description

The RL agent operates on a multimodal observation space that integrates complementary sensors and state information\. The primary perception inputs consist of a frontal RGB camera image with a resolution of $128\times 128$ pixels and a LiDAR point cloud projected into a $128\times 128$ bird’s\-eye\-view grid\. To support goal\-directed navigation, the LiDAR grid map is augmented with a reference route toward the target destination\. The observation space also includes vehicle states, namely longitudinal and lateral velocities and their respective accelerations\.

To encode the observation, we employ modality\-specific encoders: a CNN for RGB images, a CNN for LiDAR grid maps, and an MLP for vehicle kinematics\. Their outputs are fused into a shared latent representation that captures spatial and state information\. Based on this representation, the policy head selects a cruise\-control acceleration command from a finite action set uniformly spaced between maximum braking and maximum throttle\. Expert actions are projected onto the same action space by selecting the closest command, while steering is handled by a route\-following controller\.
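The discrete action set and the projection of expert actions onto it can be illustrated as below\. The number of actions and the normalized range from \-1 \(maximum braking\) to 1 \(maximum throttle\) are assumptions for illustration, and both function names are hypothetical\.

```python
def make_action_set(n_actions, a_min=-1.0, a_max=1.0):
    """Uniformly spaced acceleration commands from max braking to max throttle."""
    step = (a_max - a_min) / (n_actions - 1)
    return [a_min + i * step for i in range(n_actions)]


def project_expert_action(expert_accel, action_set):
    """Index of the discrete command closest to the expert's continuous command."""
    return min(range(len(action_set)),
               key=lambda i: abs(action_set[i] - expert_accel))


actions = make_action_set(5)                    # [-1.0, -0.5, 0.0, 0.5, 1.0]
expert_idx = project_expert_action(0.3, actions)  # index 3, i.e. command 0.5
```

Storing the projected index rather than the raw expert command keeps expert and agent transitions in the same action space\.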

### IV\-B Traffic Scenarios

In this work, we focus on urban driving tasks where an autonomous agent must safely approach and traverse unsignalized intersections\. These intersections are among the most safety\-critical components of road networks, as they lack explicit right\-of\-way rules and require implicit negotiation with surrounding traffic \[[2](https://arxiv.org/html/2605.30576#bib.bib10)\]\. Although our framework is applicable to a wide range of driving scenarios, this work evaluates it in unsignalized intersections, a particularly challenging setting for evaluating our expert\-guided RL agents\.

Traffic scenarios are generated using CARLA \[[10](https://arxiv.org/html/2605.30576#bib.bib11)\] across randomized configurations of traffic vehicles placed in multiple T\-junctions and four\-way intersections\. To encourage generalization, we randomize traffic attributes such as vehicle geometry, speed, and lateral offset\. For evaluation, we construct a hold\-out set of one unseen T\-junction and two unseen four\-way intersections, ensuring that performance is assessed on layouts not encountered during training\.

As an expert policy, we employ CARLA’s Traffic Manager \(TM\), a rule\-based system for controlling vehicle behavior\. While the TM has privileged access to the environment state, including road topology and the state of all traffic vehicles, the framework itself is not restricted to this expert and only requires an action proposal from an external policy\. Furthermore, the expert is required only during training and is neither required nor accessible during evaluation\.

### IV\-C Baselines and Evaluation Metrics

We benchmark our proposed RL algorithm against IQN \[[9](https://arxiv.org/html/2605.30576#bib.bib12)\], a well\-established and widely used distributional RL baseline\. In addition, we conduct a series of ablation studies to isolate the contribution of individual components of our approach, specifically evaluating the impact of the commitment–cooldown mechanism, the expert advice budget, and the stochastic early stopping strategy for the commitment mechanism\. These analyses provide a clearer understanding of how each design choice influences the overall learning performance\.

To ensure fair comparison, all agents are trained for an identical number of steps using the same network architectures and hyperparameter configurations\. Each training run is repeated with three independent random seeds, and every trained policy is evaluated over three runs to account for the stochasticity inherent in CARLA\. This setup follows the evaluation methodology outlined in\[[20](https://arxiv.org/html/2605.30576#bib.bib13)\]\.

Evaluation is performed on a hold\-out set of intersection scenarios with varying traffic densities\. Traffic density is the ratio of the number of active traffic vehicles to the environment’s maximum capacity\. Results are reported as mean and standard deviation over all training seeds and evaluation runs\. Performance is measured using cumulative episode reward \($ER$\) as well as driving\-specific metrics: success rate \($SR$\); failure rate \($FR$\), defined as the sum of off\-road, collision, and timeout rates; and route progress \($RP$\)\. To ensure robust evaluation, we use RLiable \[[1](https://arxiv.org/html/2605.30576#bib.bib68)\] to report aggregate metrics such as interquartile mean \(IQM\) and optimality gap, which better capture performance variability\.
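The aggregate metrics can be illustrated with plain Python\. This is a simplified sketch rather than the RLiable library itself: IQM is computed as a 25% trimmed mean over a flat list of per\-run scores, the optimality gap uses an assumed target score of 1\.0, and all function names are hypothetical\.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of scores (25% trimmed from each end)."""
    xs = sorted(scores)
    cut = int(0.25 * len(xs))          # number of runs dropped per tail
    middle = xs[cut:len(xs) - cut]
    return sum(middle) / len(middle)


def optimality_gap(scores, target=1.0):
    """Average shortfall below the target; scores above it contribute zero."""
    return sum(max(target - s, 0.0) for s in scores) / len(scores)


def failure_rate(off_road, collision, timeout):
    """Failure rate as the sum of the three failure-mode rates."""
    return off_road + collision + timeout
```

Because the IQM discards the lowest and highest quarter of runs, a single outlier seed shifts it far less than it shifts the plain mean\.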

![Refer to caption](https://arxiv.org/html/2605.30576v1/x2.png)

Figure 2: Probability of improvement \[[1](https://arxiv.org/html/2605.30576#bib.bib68)\], quantifying the likelihood that algorithm X outperforms algorithm Y\.

## V Evaluation

Table [I](https://arxiv.org/html/2605.30576#S5.T1) reports ablation studies evaluating our methodology across traffic densities\. Introducing *commitment and cooldown periods* proves essential, as they ensure more consistent expert guidance\. The baselines \[[40](https://arxiv.org/html/2605.30576#bib.bib66),[8](https://arxiv.org/html/2605.30576#bib.bib36)\] without this mechanism correspond to \(1,1\), where success reaches 0\.67 at density 0\.75 and 0\.53 at density 1\.0\. A suitable period of \(5,5\) raises success to 0\.74 and 0\.61, respectively, while reducing failures from 0\.33 to 0\.26 at density 0\.75 and from 0\.47 to 0\.39 at density 1\.0\. This demonstrates that temporal consistency during training stabilizes policy learning\.

The expert budget plays a central role, as it represents the fraction of training during which the agent can receive advice\. A limited budget of 25% already enhances performance, while 50% yields the best balance between advice and independent learning\. In contrast, allocating 75% induces excessive reliance on the expert and degrades generalization\. At the balanced 50%, success improves from 0\.67 for IQN to 0\.74 at density 0\.75 and from 0\.55 to 0\.61 at density 1\.0\.

When comparing epistemic uncertainty formulations, both Wasserstein and CVaR follow similar trends across commitment settings and budgets\. However, CVaR consistently achieves higher success \(e\.g\., 0\.61 vs\. 0\.58 at density 1\.0\) by emphasizing risk\-sensitive quantiles and filtering distributional noise\. The stochastic early\-stop mechanism shows a nuanced effect\. At the optimal 50% budget, early stopping causes minor degradation, whereas at larger budgets it mitigates overreliance\. At density 0\.75 with 75% budget, success increases from 0\.64 to 0\.72, indicating that stochastic early\-stop can reduce overreliance during longer training schedules and expert budgets\.

Overall, our best configuration, CVaR\-based uncertainty with \(5,5\) commitment\-cooldown and a 50% expert budget, delivers consistent improvements across all densities\. Compared to IQN, it yields 5–7% gains in success rate, while lowering failures in challenging dense traffic\. These results confirm the effectiveness of combining structured expert advice with risk\-sensitive uncertainty estimation for the evaluated intersection\-navigation setting\.

Furthermore, RLiable analysis confirms these findings\. Fig\. [3](https://arxiv.org/html/2605.30576#S5.F3) shows that our best variants raise IQM from 0\.66 \(IQN\) to 0\.72 \(CVaR\) and reduce the optimality gap from 0\.33 to 0\.27\. Pairwise improvement probabilities \(Fig\. [2](https://arxiv.org/html/2605.30576#S4.F2)\) further highlight consistency: CVaR outperforms IQN with probability above 0\.75, while Wasserstein remains above 0\.6\. Thus, our best configurations deliver not only higher mean performance but also more reliable learning outcomes\.

Having established the effectiveness of our best configurations through both direct metrics and RLiable analysis, we finally evaluate aleatoric uncertainty as a runtime safety guard\. At inference time, the agent triggers a deceleration\-to\-stop maneuver whenever the current uncertainty exceeds the 90th percentile threshold of uncertainty values faced on a validation set\. Under density 0\.75, this mechanism achieves the highest success rates of 0\.76 with CVaR\-based epistemic uncertainty and 0\.71 with the Wasserstein variant, surpassing all previous configurations\. These findings suggest that runtime uncertainty monitoring complements expert advice regulation, providing an additional layer of robustness when the agent encounters rare failure cases\.
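The runtime guard can be sketched as a threshold check against a percentile computed offline\. The sketch makes several assumptions: the percentile uses linear interpolation between ranks, the guard is reduced to a single\-step override instead of a full deceleration\-to\-stop maneuver, and the names `percentile`, `guarded_action`, and `brake_action` are hypothetical\.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def guarded_action(policy_action, uncertainty, threshold, brake_action):
    """Override the policy with a braking command when uncertainty is too high."""
    return brake_action if uncertainty > threshold else policy_action


# Threshold fitted once on validation-set uncertainties (toy values here).
validation_uncertainty = list(range(11))
threshold = percentile(validation_uncertainty, 90)
```

The threshold is fixed before deployment, so the guard adds only a comparison per step at inference time\.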

![Refer to caption](https://arxiv.org/html/2605.30576v1/x3.png)

Figure 3: Interquartile mean \(IQM\) and optimality gap \[[1](https://arxiv.org/html/2605.30576#bib.bib68)\], quantifying the statistical stability of a policy\.

TABLE I: Ablation results in CARLA traffic scenarios across traffic densities\. Results compare IQN with our method under different commitment–cooldown periods, expert budgets, and uncertainty formulations\.

## VI Conclusion

This work addressed the gap in safe exploration by introducing an uncertainty\-aware framework that triggers advice when epistemic or aleatoric uncertainty exceeds adaptive thresholds and regulates its use with a commitment–cooldown strategy, exposing the agent to coherent expert trajectories without inducing long\-term dependence\. Experiments in CARLA traffic scenarios demonstrated consistent gains over IQN, with the best configuration \(CVaR\-based uncertainty and a 50% advice budget with \(5,5\) periods\) achieving 5–7% higher success and lower failures across all traffic densities\. In addition, we showed that aleatoric uncertainty can serve as an inference\-time safety guard, further improving success under dense traffic and providing robustness against rare failure cases\. Future work will explore noisy, partially observable, and multi\-expert settings, where specialized experts can be queried selectively, enhancing flexibility and broadening applicability to diverse and safety\-critical driving contexts\.

## ACKNOWLEDGMENT

The research leading to these results is funded by the German Federal Ministry for Economic Affairs and Energy within the project “Safe AI Engineering – Sicherheitsargumentation befähigendes AI Engineering über den gesamten Lebenszyklus einer KI\-Funktion”\. The authors would like to thank the consortium for the successful cooperation\.

## References

- \[1\] \(2021\) Deep reinforcement learning at the edge of the statistical precipice\. NeurIPS\.
- \[2\] M\. Al\-Sharman, L\. Edes, B\. Sun, V\. Jayakumar, H\. Tahir, M\. A\. Daoud, B\. J\. Emran, D\. Rayside, and W\. Melek \(2026\) Autonomous driving at unsignalized intersections: a review of decision\-making challenges and reinforcement learning\-based solutions\. IEEE Transactions on Automation Science and Engineering\. External Links: [Document](https://dx.doi.org/10.1109/TASE.2025.3646982)\.
- \[3\] J\. Bernhard, S\. Pollok, and A\. Knoll \(2019\) Addressing inherent uncertainty: risk\-sensitive behavior generation for automated driving using distributional reinforcement learning\. In 2019 IEEE Intelligent Vehicles Symposium \(IV\)\.
- \[4\] D\. Bethell, S\. Gerasimou, R\. Calinescu, and C\. Imrie \(2024\) Safe reinforcement learning in black\-box environments via adaptive shielding\. arXiv preprint arXiv:2405\.18180\.
- \[5\] R\. Chekroun, M\. Toromanoff, S\. Hornauer, and F\. Moutarde \(2023\) GRI: general reinforced imitation and its application to vision\-based autonomous driving\. Robotics\.
- \[6\] L\. Chen, P\. Wu, K\. Chitta, B\. Jaeger, A\. Geiger, and H\. Li \(2024\) End\-to\-end autonomous driving: challenges and frontiers\. IEEE Transactions on Pattern Analysis and Machine Intelligence\.
- \[7\] J\. Choi, D\. Kim, J\. Yoo, B\. Kim, and J\. Hwang \(2025\) Enhancing autonomous driving with pre\-trained imitation and reinforcement learning\. In 2025 International Conference on Electronics, Information, and Communication \(ICEIC\)\.
- \[8\] F\. L\. Da Silva, P\. Hernandez\-Leal, B\. Kartal, and M\. E\. Taylor \(2020\) Uncertainty\-aware action advising for deep reinforcement learning agents\. In Proceedings of the AAAI Conference on Artificial Intelligence\.
- \[9\] W\. Dabney, G\. Ostrovski, D\. Silver, and R\. Munos \(2018\) Implicit quantile networks for distributional reinforcement learning\. In International Conference on Machine Learning\.
- \[10\] A\. Dosovitskiy, G\. Ros, F\. Codevilla, A\. Lopez, and V\. Koltun \(2017\) CARLA: an open urban driving simulator\. In Conference on Robot Learning\.
- \[11\] D\. Du, S\. Han, N\. Qi, H\. B\. Ammar, J\. Wang, and W\. Pan \(2023\) Reinforcement learning for safe robot control using control Lyapunov barrier functions\. In 2023 IEEE International Conference on Robotics and Automation \(ICRA\)\. External Links: [Document](https://dx.doi.org/10.1109/ICRA48891.2023.10160991)\.
- \[12\] G\. Dulac\-Arnold, N\. Levine, D\. J\. Mankowitz, J\. Li, C\. Paduraru, S\. Gowal, and T\. Hester \(2021\) Challenges of real\-world reinforcement learning: definitions, benchmarks and analysis\. Machine Learning\.
- \[13\] D\. Gao, H\. Wang, H\. Zhou, N\. Ammar, S\. Mishra, A\. Moradipari, I\. Soltani, and J\. Zhang \(2025\) IN\-RIL: interleaved reinforcement and imitation learning for policy fine\-tuning\. arXiv preprint arXiv:2505\.10442\.
- \[14\] S\. Gu, L\. Yang, Y\. Du, G\. Chen, F\. Walter, J\. Wang, and A\. Knoll \(2024\) A review of safe reinforcement learning: methods, theories and applications\. IEEE Transactions on Pattern Analysis and Machine Intelligence\.
- \[15\] T\. Hester, M\. Vecerik, O\. Pietquin, M\. Lanctot, T\. Schaul, B\. Piot, D\. Horgan, J\. Quan, A\. Sendonaris, I\. Osband, et al\. \(2018\) Deep Q\-learning from demonstrations\. In Proceedings of the AAAI Conference on Artificial Intelligence\.
- \[16\] C\. Hoel, K\. Wolff, and L\. Laine \(2023\) Ensemble quantile networks: uncertainty\-aware reinforcement learning with applications in autonomous driving\. IEEE Transactions on Intelligent Transportation Systems\.
- \[17\] X\. Hu, P\. Chen, Y\. Wen, B\. Tang, and L\. Chen \(2026\) Long\-and short\-term constraint\-driven safe reinforcement learning for autonomous driving\. IEEE Transactions on Systems, Man, and Cybernetics: Systems\.
- \[18\] Z\. Huang, Z\. Sheng, C\. Ma, and S\. Chen \(2024\) Human as AI mentor: enhanced human\-in\-the\-loop reinforcement learning for safe and efficient autonomous driving\. Communications in Transportation Research\.
- \[19\] E\. Ilhan, J\. Gow, and D\. Perez \(2021\) Student\-initiated action advising via advice novelty\. IEEE Transactions on Games\.
- \[20\] B\. Jaeger, D\. Dauner, J\. Beißwenger, S\. Gerstenecker, K\. Chitta, and A\. Geiger \(2025\) CaRL: learning scalable planning policies with simple rewards\. In Proc\. of the Conf\. on Robot Learning \(CoRL\)\.
- \[21\] D\. Kamran, T\. Engelgeh, M\. Busch, J\. Fischer, and C\. Stiller \(2021\) Minimizing safety interference for safe and comfortable automated driving with distributional reinforcement learning\. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\)\.
- \[22\] M\. Kelly, C\. Sidrane, K\. Driggs\-Campbell, and M\. J\. Kochenderfer \(2019\) HG\-DAgger: interactive imitation learning with human experts\. In 2019 International Conference on Robotics and Automation \(ICRA\)\.
- \[23\] A\. Kurenkov, A\. Mandlekar, R\. Martin\-Martin, S\. Savarese, and A\. Garg \(2019\) AC\-Teach: a Bayesian actor\-critic method for policy learning with an ensemble of suboptimal teachers\. arXiv preprint arXiv:1909\.04121\.
- \[24\] B\. Lakshminarayanan, A\. Pritzel, and C\. Blundell \(2017\) Simple and scalable predictive uncertainty estimation using deep ensembles\. Advances in Neural Information Processing Systems 30\.
- \[25\] Y\. Liu and S\. Diao \(2024\) An automatic driving trajectory planning approach in complex traffic scenarios based on integrated driver style inference and deep reinforcement learning\. PLoS ONE\.
- \[26\] O\. Lockwood and M\. Si \(2022\) A review of uncertainty for deep reinforcement learning\. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment\.
- \[27\] S\. Mo, X\. Pei, and C\. Wu \(2021\) Safe reinforcement learning for autonomous vehicle using Monte Carlo tree search\. IEEE Transactions on Intelligent Transportation Systems\.
- \[28\] I\. Osband, C\. Blundell, A\. Pritzel, and B\. Van Roy \(2016\) Deep exploration via bootstrapped DQN\. Advances in Neural Information Processing Systems 29\.
- \[29\] Z\. Peng, Q\. Li, C\. Liu, and B\. Zhou \(2022\) Safe driving via expert guided policy optimization\. In Conference on Robot Learning\.
- \[30\] Z\. M\. Peng, W\. Mo, C\. Duan, Q\. Li, and B\. Zhou \(2023\) Learning from active human involvement through proxy value propagation\. Advances in Neural Information Processing Systems\.
- \[31\] M\. Pfeiffer, S\. Shukla, M\. Turchetta, C\. Cadena, A\. Krause, R\. Siegwart, and J\. Nieto \(2018\) Reinforced imitation: sample efficient deep reinforcement learning for mapless navigation by leveraging prior demonstrations\. IEEE Robotics and Automation Letters\.
- \[32\] R\. S\. Sutton, A\. G\. Barto, et al\. \(1998\) Reinforcement learning: an introduction\. MIT Press, Cambridge\.
- \[33\] D\. C\. Tan, F\. Acero, R\. McCarthy, D\. Kanoulas, and Z\. Li \(2023\) Value functions are control barrier functions: verification of safe policies using control theory\. arXiv preprint arXiv:2306\.04026\.
- \[34\] D\. C\. Tan, R\. McCarthy, F\. Acero, A\. M\. Delfaki, Z\. Li, and D\. Kanoulas \(2024\) Safe value functions: learned critics as hard safety constraints\. In 2024 IEEE 20th International Conference on Automation Science and Engineering \(CASE\)\.
- \[35\] J\. Thumm, G\. Pelat, and M\. Althoff \(2023\) Reducing safety interventions in provably safe reinforcement learning\. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\)\.
- \[36\] A\. Wachi, W\. Hashimoto, X\. Shen, and K\. Hashimoto \(2023\) Safe exploration in reinforcement learning: a generalized formulation and algorithms\. Advances in Neural Information Processing Systems\.
- \[37\] H\. Wang, X\. Yuan, and Q\. Ren \(2023\) Learning to recover for safe reinforcement learning\. arXiv preprint arXiv:2309\.11907\.
- \[38\] X\. Wang and M\. Althoff \(2023\) Safe reinforcement learning for automated vehicles via online reachability analysis\. IEEE Transactions on Intelligent Vehicles\.
- \[39\] X\. Wang, J\. Zhang, D\. Hou, and Y\. Cheng \(2023\) Autonomous driving based on approximate safe action\. IEEE Transactions on Intelligent Transportation Systems\.
- \[40\] Y\. Wei, S\. Liu, J\. Song, T\. Zheng, K\. Chen, and M\. Song \(2025\) Agent\-aware training for agent\-agnostic action advising in deep reinforcement learning\. In Proceedings of the AAAI Conference on Artificial Intelligence\.
- \[41\] Z\. Xue, Z\. Peng, Q\. Li, Z\. Liu, and B\. Zhou \(2023\) Guarded policy optimization with imperfect online demonstrations\. International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=O5rKg7IRQIO)\.
