Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

arXiv cs.LG Papers

Summary

This paper presents VLM-Safe-RL, a framework that integrates frozen vision-language models into constrained MDP Lagrangian updates to provide anticipatory cost signals for safe reinforcement learning in high-speed visual control tasks. The method outperforms standard constraint-aware baselines on Safety-Gymnasium FormulaOne L2 and generalizes to held-out environments.

arXiv:2606.11266v1 Announce Type: new Abstract: The cost signal that constrained-RL algorithms optimize against is almost always reactive: the simulator emits a non-zero cost only after a collision has begun, and the Lagrange multiplier of PPO-Lagrangian grows only after the episode budget has been exceeded. At race speeds, where collisions are instantaneous and irreversible, any safety mechanism that waits for cost to accumulate is structurally too late. We present VLM-Safe-RL, a framework that integrates a frozen vision-language model into the CMDP Lagrangian update as an anticipatory cost term. The framework comprises four contributions: (i) Decoupled Dual-Path CLIP, independent reward/cost paths that respect the CMDP's factorization; (ii) VLM-Lagrange, an augmented multiplier update that incorporates a per-step VLM cost as an anticipatory term; (iii) Confidence Gating, a Bayes-optimal weight derived from a logistic noise model on the CLIP margin; and (iv) VLMPPOLag, the composed algorithm. On Safety-Gymnasium FormulaOne L2, our principal evaluation ($n{=}5$ seeds, $10^{6}$ steps, budget $d_{\text{lim}}{=}25$) VLMPPOLag$+$Conf is the only configuration in our default budget comparison that simultaneously retains substantive return ($J_r{\approx}40$) and holds cost within budget on a majority of seeds; the five constraint-aware baselines (PPOLag, CPO, CPPOPID, CPO-CLG, PPOLag-RND) each fail at least one requirement. The mechanism generalizes to held-out MetaDrive Medium (catastrophe rate $41\%{\to}26\%$, 95\% bootstrap CI $[-26,-5]$\,pp) and shows directionally consistent transfer to Bullet Safety-Gym; we report honestly where it does not (MetaDrive Easy/Hard, Qwen2-VL backbone) and trace the Hard failure to a Lagrangian-regulation pathology rather than the VLM signal itself. To our knowledge, this is the first work to use frozen VLM signals as an anticipatory cost term inside the CMDP Lagrangian update.

Cached at: 06/11/26, 01:45 PM

# Anticipatory Safe RL with Frozen Vision-Language Models
Source: [https://arxiv.org/html/2606.11266](https://arxiv.org/html/2606.11266)
## Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision\-Language Models

Samuel Tetteh & Cody Fleming Iowa State University Ames, Iowa, USA \{samtett, flemingc\}@iastate\.edu

###### Abstract

The cost signal that constrained-RL algorithms optimize against is almost always *reactive*: the simulator emits a non-zero cost only *after* a collision has begun, and the Lagrange multiplier of PPO-Lagrangian grows only after the episode budget has been exceeded. At race speeds, where collisions are instantaneous and irreversible, any safety mechanism that waits for cost to accumulate is structurally too late. We present VLM-Safe-RL, a framework that integrates a frozen vision-language model into the CMDP Lagrangian update as an *anticipatory* cost term. The framework comprises four contributions: (i) Decoupled Dual-Path CLIP, independent reward/cost paths that respect the CMDP's $r\!\perp\!c$ factorization; (ii) VLMLagrange, an augmented multiplier update $\lambda\!\leftarrow\!\lambda+\eta_{1}(J_{C}-d)+\eta_{2}(\overline{c}_{\text{vlm}}-\tau)$ that incorporates a per-step VLM cost as an anticipatory term; (iii) Confidence Gating, a Bayes-optimal weight $\kappa{=}|2\sigma(s(m{-}c)){-}1|$ derived from a logistic noise model on the CLIP margin; and (iv) VLMPPOLag, the composed algorithm. On Safety-Gymnasium FormulaOne L2, our principal evaluation ($n{=}5$ seeds, $10^{6}$ steps, budget $d{=}25$), VLMPPOLag+Conf is the only configuration in our default-budget comparison that simultaneously retains substantive return ($J_{R}{\approx}40$) and holds cost within budget on a majority of seeds; the five constraint-aware baselines (PPOLag, CPO, CPPOPID, CPO-CLG, PPOLag-RND) each fail at least one requirement.
The mechanism generalises to held-out MetaDrive Medium (catastrophe rate $41\%{\to}26\%$, 95% bootstrap CI $[-26,-5]$ pp) and shows directionally consistent transfer to Bullet Safety-Gym; we report honestly where it does *not* (MetaDrive Easy/Hard, Qwen2-VL backbone) and trace the Hard failure to a Lagrangian-regulation pathology rather than the VLM signal itself. To our knowledge, this is the first work to use frozen VLM signals as an anticipatory cost term inside the CMDP Lagrangian update.

## 1 Introduction

Safe reinforcement learning in high-speed visual control presents a fundamental tension: an agent must push close to the limits of performance while strictly avoiding catastrophic failure, a setting formalized by the Constrained Markov Decision Process (CMDP) framework \[[5](https://arxiv.org/html/2606.11266#bib.bib17)\]. Yet the cost signals available to standard CMDP solvers are almost universally *reactive*: the simulator emits a non-zero cost only *after* a collision has begun, and PPO-Lagrangian's multiplier $\lambda$ \[[42](https://arxiv.org/html/2606.11266#bib.bib20)\] grows only after the episode budget has been exceeded. At race speeds, where contacts are instantaneous and irreversible, this lag is structural. Vision-language models \[[36](https://arxiv.org/html/2606.11266#bib.bib1)\] encode rich semantic priors about safe and unsafe states. A natural-language description such as "the racecar is about to crash into the barrier" captures the visual signature of impending danger that a hand-crafted feature would require elaborate engineering to detect. Existing VLM+RL paradigms either fine-tune billion-parameter vision-language-action models on large demonstration sets \[[8](https://arxiv.org/html/2606.11266#bib.bib14),[14](https://arxiv.org/html/2606.11266#bib.bib15),[49](https://arxiv.org/html/2606.11266#bib.bib16)\], or use frozen VLMs as auxiliary reward shapers \[[16](https://arxiv.org/html/2606.11266#bib.bib6),[28](https://arxiv.org/html/2606.11266#bib.bib7),[37](https://arxiv.org/html/2606.11266#bib.bib10),[23](https://arxiv.org/html/2606.11266#bib.bib11)\]. Neither addresses safety as a hard constraint.
Reward-shaping methods treat safety as an implicit penalty, with VLM-RL \[[23](https://arxiv.org/html/2606.11266#bib.bib11)\] further coupling reward and safety scores on a shared simplex via the contrasting language goal (CLG) paradigm; SafeVLA \[[49](https://arxiv.org/html/2606.11266#bib.bib16)\] adds CMDP objectives but at 7B+ parameters and $\sim\!800$K demonstrations.

#### Central insight\.

Frozen CLIP can detect visual danger signals *before* the actual collision. Routed through the Lagrange multiplier update, this forward-looking information enables *anticipatory* constraint satisfaction, fundamentally distinct from prior VLM+RL approaches that treat the VLM output as a stateless reward bonus.

#### Contributions\.

We present VLM-Safe-RL:

1. Decoupled Dual-Path CLIP (§[3](https://arxiv.org/html/2606.11266#S3)): two independent cosine paths for $r_{\text{vlm}}$ and $c_{\text{vlm}}$ that eliminate the anti-correlation artefact of coupled softmax.
2. VLMLagrange (§[3](https://arxiv.org/html/2606.11266#S3)): an augmented multiplier update with a per-step CLIP-derived anticipatory term that tightens $\lambda$ before collisions accumulate.
3. Confidence Gating (§[3](https://arxiv.org/html/2606.11266#S3)): a Bayes-optimal weight $\kappa_{t}$ derived in closed form from a logistic noise model on the CLIP margin, with a calibrated operating point estimated from a random-policy frame buffer.
4. VLMPPOLag: the composed algorithm, registered as a first-class algorithm in OmniSafe \[[26](https://arxiv.org/html/2606.11266#bib.bib28)\].

We evaluate on Safety-Gymnasium FormulaOne \[[25](https://arxiv.org/html/2606.11266#bib.bib27)\] L0/L1/L2 (10 methods $\times$ 3 levels, 3–5 seeds; 90+ training runs) and on two generalization benchmarks, Bullet Safety-Gym `SafetyCarReach-v0` and MetaDrive \[[29](https://arxiv.org/html/2606.11266#bib.bib34)\] Easy/Medium/Hard, with held-out evaluation on seeds disjoint from training. In short: VLMPPOLag+Conf is the only constraint-aware configuration that achieves substantive return with within-budget cost on FormulaOne L2, and the mechanism transfers to dense traffic ($-15$ pp catastrophe on held-out MetaDrive Medium, bootstrap CI excludes zero). We additionally document a previously unreported scenario-sampler aliasing in MetaDrive that would otherwise mask any held-out safety improvement (Appendix [B.3](https://arxiv.org/html/2606.11266#A2.SS3.SSS0.Px1)).

![Refer to caption](https://arxiv.org/html/2606.11266v1/x1.png) Figure 1: Anticipatory safety from a frozen VLM. *(a)* Per-step CLIP danger signal $c_{\text{vlm}}$ (green) rises several timesteps before the environment cost (red) on a single FormulaOne L2 rollout. The Lagrange multiplier $\lambda$ is updated *between epochs* using the epoch mean $\overline{c}_{\text{vlm}}$ via Eq. ([3](https://arxiv.org/html/2606.11266#S3.E3)); the per-step trace illustrates the signal that this mean accumulates. *(b)* Held-out MetaDrive Medium: the mechanism cuts catastrophe rate from $41\%$ to $26\%$ ($-15$ pp; bootstrap $95\%$ CI $[-26,-5]$ pp, $n{=}100$ held-out episodes).

## 2 Related Work

VLMs for robotics and reward shaping. VLMs have been used for task planning and affordance grounding \[[4](https://arxiv.org/html/2606.11266#bib.bib13),[22](https://arxiv.org/html/2606.11266#bib.bib12)\] and as end-to-end vision-language-action models \[[8](https://arxiv.org/html/2606.11266#bib.bib14),[14](https://arxiv.org/html/2606.11266#bib.bib15),[31](https://arxiv.org/html/2606.11266#bib.bib3),[44](https://arxiv.org/html/2606.11266#bib.bib5),[49](https://arxiv.org/html/2606.11266#bib.bib16)\]. Building on classical reward-shaping foundations \[[34](https://arxiv.org/html/2606.11266#bib.bib33)\], recent works use frozen VLMs as auxiliary signals \[[16](https://arxiv.org/html/2606.11266#bib.bib6),[28](https://arxiv.org/html/2606.11266#bib.bib7),[45](https://arxiv.org/html/2606.11266#bib.bib8),[32](https://arxiv.org/html/2606.11266#bib.bib9),[37](https://arxiv.org/html/2606.11266#bib.bib10)\]. None impose hard safety constraints. VLM-RL \[[23](https://arxiv.org/html/2606.11266#bib.bib11)\], the closest prior work, introduces the *contrasting language goal (CLG)-as-reward* paradigm: positive and negative natural-language goals are scored by frozen CLIP and combined through a coupled softmax that anti-correlates the reward and safety channels under unconstrained SAC \[[20](https://arxiv.org/html/2606.11266#bib.bib32)\]. We adopt CLG terminology from \[[23](https://arxiv.org/html/2606.11266#bib.bib11)\] but operate inside the CMDP framework with decoupled paths and an anticipatory multiplier update (Tab. [3](https://arxiv.org/html/2606.11266#S5.T3)).

Safe RL. The classical motivation \[[6](https://arxiv.org/html/2606.11266#bib.bib38),[40](https://arxiv.org/html/2606.11266#bib.bib37),[17](https://arxiv.org/html/2606.11266#bib.bib25),[9](https://arxiv.org/html/2606.11266#bib.bib26)\] for hard constraints in RL has produced a family of CMDP solvers: CPO \[[2](https://arxiv.org/html/2606.11266#bib.bib18)\], PPO-Lagrangian \[[42](https://arxiv.org/html/2606.11266#bib.bib20)\], FOCOPS \[[51](https://arxiv.org/html/2606.11266#bib.bib21)\], PCPO \[[48](https://arxiv.org/html/2606.11266#bib.bib22)\], and CUP \[[47](https://arxiv.org/html/2606.11266#bib.bib23)\], evaluated through Safety-Gymnasium \[[25](https://arxiv.org/html/2606.11266#bib.bib27),[43](https://arxiv.org/html/2606.11266#bib.bib36)\] and OmniSafe \[[26](https://arxiv.org/html/2606.11266#bib.bib28)\]. PID-Lagrangian \[[42](https://arxiv.org/html/2606.11266#bib.bib20)\], CRPO \[[46](https://arxiv.org/html/2606.11266#bib.bib45)\] and Sauté-RL \[[41](https://arxiv.org/html/2606.11266#bib.bib46)\] modify the Lagrange dynamics themselves; our anticipatory $\eta_{2}(\overline{c}_{\text{vlm}}-\tau)$ term is orthogonal in that it injects a new *forward-looking* signal derived from a frozen VLM and could be combined with any of them. Reproducibility concerns in deep RL \[[21](https://arxiv.org/html/2606.11266#bib.bib41),[3](https://arxiv.org/html/2606.11266#bib.bib42)\] motivate our $5{,}000$-resample bootstrap CIs and pre-registered one-sided permutation tests.

VLMs as zero-shot scene classifiers. A complementary line treats frozen VLMs as out-of-loop classifiers atop a separately trained policy: action shielding \[[11](https://arxiv.org/html/2606.11266#bib.bib49)\], post-hoc anomaly detection \[[24](https://arxiv.org/html/2606.11266#bib.bib47),[33](https://arxiv.org/html/2606.11266#bib.bib52)\], and language-conditioned scene tagging \[[38](https://arxiv.org/html/2606.11266#bib.bib48),[37](https://arxiv.org/html/2606.11266#bib.bib10)\]. Our work makes the per-step VLM output a first-class citizen of the constrained-optimization problem the policy itself solves.

## 3 VLM-Safe-RL: Method

A CMDP is a tuple $(\mathcal{S},\mathcal{A},P,r,c,d,\gamma)$ \[[5](https://arxiv.org/html/2606.11266#bib.bib17)\]; the safe-RL objective

$$\pi^{\star}=\arg\max_{\pi}J_{R}(\pi)\;\;\text{s.t.}\;\;J_{C}(\pi)\leq d, \tag{1}$$

is solved by PPO-Lagrangian \[[42](https://arxiv.org/html/2606.11266#bib.bib20)\] via the update $\lambda\!\leftarrow\!\lambda+\eta_{1}(J_{C}-d)$, which is strictly *backward-looking*. We instantiate the framework on the Safety-Gymnasium FormulaOne racing simulator \[[25](https://arxiv.org/html/2606.11266#bib.bib27)\]: a frozen CLIP ViT-B/32 \[[36](https://arxiv.org/html/2606.11266#bib.bib1),[13](https://arxiv.org/html/2606.11266#bib.bib2)\] receives a $256{\times}256$ RGB frame at each control step alongside the proprioceptive observation $s_{t}\in\mathbb{R}^{64}$; cost is binary on barrier contact with budget $d{=}25$ over $T{=}1000$ steps.

Contribution 1: Decoupled Dual-Path CLIP. Prior work \[[23](https://arxiv.org/html/2606.11266#bib.bib11),[37](https://arxiv.org/html/2606.11266#bib.bib10)\] uses a coupled softmax over positive+negative prompt logits, forcing $r_{\text{vlm}}+c_{\text{vlm}}\approx 1$. This is incorrect for CMDPs, where reward and cost are independent objects by definition. We decouple into two cosine-similarity paths normalised to $[0,1]$:

$$r_{\text{vlm}}(o)=\tfrac{1}{N}\sum_{n=1}^{N}\tfrac{\mathrm{sim}(f_{I}(o),F^{+}_{n})+1}{2},\qquad c_{\text{vlm}}(o)=\tfrac{1}{N}\sum_{n=1}^{N}\tfrac{\mathrm{sim}(f_{I}(o),F^{-}_{n})+1}{2}. \tag{2}$$

Text features $F^{\pm}$ are encoded once and cached; the per-step compute overhead is one image encoding plus a small dot product (Appendix [A](https://arxiv.org/html/2606.11266#A1)).
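A minimal sketch of Eq. (2), assuming plain Python lists stand in for the CLIP image feature $f_I(o)$ and the cached text features $F^{\pm}$. The toy 2-D vectors and the function names (`cosine`, `path_score`, `decoupled_scores`) are hypothetical and chosen only so the arithmetic can be checked by hand; this is not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def path_score(image_feat, prompt_feats):
    """Eq. (2): mean over prompts of (sim + 1) / 2, which lies in [0, 1]."""
    scores = [(cosine(image_feat, f) + 1.0) / 2.0 for f in prompt_feats]
    return sum(scores) / len(scores)

def decoupled_scores(image_feat, pos_feats, neg_feats):
    """Two independent paths: nothing forces r_vlm + c_vlm to equal 1."""
    return path_score(image_feat, pos_feats), path_score(image_feat, neg_feats)

# Toy features (hypothetical, not real CLIP embeddings).
image = [1.0, 0.0]
pos = [[1.0, 0.0], [0.0, 1.0]]    # similarities 1 and 0  -> scores 1.0 and 0.5
neg = [[1.0, 0.0], [-1.0, 0.0]]   # similarities 1 and -1 -> scores 1.0 and 0.0
r_vlm, c_vlm = decoupled_scores(image, pos, neg)
print(r_vlm, c_vlm)
```

In this toy case the two scores sum to 1.25, which a coupled softmax could not produce; that independence is the property the decoupled design is meant to preserve.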

Contribution 2: VLMLagrange (anticipatory multiplier). Let $\overline{c}_{\text{vlm}}=\tfrac{1}{T}\sum_{t}c_{\text{vlm}}(o_{t})$ and let $\tau\in[0,1]$ be a danger threshold. We augment the standard update with a per-step CLIP-derived anticipatory term:

$$\boxed{\;\lambda\;\leftarrow\;\lambda+\underbrace{\eta_{1}(J_{C}-d)}_{\text{standard (backward)}}+\underbrace{\eta_{2}(\overline{c}_{\text{vlm}}-\tau)}_{\text{VLM (forward)}}\;} \tag{3}$$

Setting $\eta_{2}{=}0$ recovers vanilla PPO-Lagrangian, providing a clean ablation of the anticipatory contribution. The intuition is direct: $c_{\text{vlm}}(o_{t})$ is elevated as the racecar *approaches* a barrier, so $\overline{c}_{\text{vlm}}$ accumulates pre-collision danger evidence within an epoch and $\lambda$ rises faster in early training, giving the constraint a head start in the high-cost exploration phase. Implementation subclasses OmniSafe's `Lagrange` via `spec_log`; the PPO loss is unchanged.
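The update in Eq. (3) can be sketched as a standalone function. This is an illustrative sketch under stated assumptions, not the OmniSafe subclass: the function name `vlm_lagrange_update` and the value $\eta_1{=}0.01$ are hypothetical, while $\eta_2{=}0.01$, $\tau{=}0.5$, and the projection onto $\lambda\ge 0$ follow the values and Algorithm 1 given in this paper.

```python
def vlm_lagrange_update(lam, episode_cost, budget, vlm_costs,
                        eta1, eta2=0.01, tau=0.5):
    """One multiplier step: backward (env cost) term plus forward (VLM) term."""
    c_bar = sum(vlm_costs) / len(vlm_costs)      # epoch mean of per-step c_vlm
    backward = eta1 * (episode_cost - budget)    # standard PPO-Lagrangian term
    forward = eta2 * (c_bar - tau)               # anticipatory VLM term
    return max(0.0, lam + backward + forward)    # keep the multiplier non-negative

# Episode cost exactly at budget, so the backward term is zero, but the
# frames look dangerous on average (c_bar = 0.8 > tau = 0.5).
with_vlm = vlm_lagrange_update(0.0, 25.0, 25.0, [0.8] * 4, eta1=0.01)
without_vlm = vlm_lagrange_update(0.0, 25.0, 25.0, [0.8] * 4, eta1=0.01, eta2=0.0)
print(with_vlm, without_vlm)
```

With the environment cost sitting exactly at the budget, only the forward term moves $\lambda$ (by $0.01 \times 0.3 = 0.003$ here), and setting `eta2=0.0` leaves it unchanged, which is the ablation the text describes.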

Contribution 3: Confidence Gating. CLIP is not uniformly reliable across visually diverse states. Following standard logistic-noise treatment of binary classifiers \[[35](https://arxiv.org/html/2606.11266#bib.bib50),[19](https://arxiv.org/html/2606.11266#bib.bib51)\], we model the probability that frame $o_{t}$ is dangerous as $\Pr(y_{t}{=}1\mid m_{t})=\sigma(s(m_{t}-c))$ for CLIP group margin $m_{t}\equiv m_{t}^{+}-m_{t}^{-}$. The variance-minimizing fusion weight under the uninformative prior is the Bayes posterior margin:

$$\kappa_{t}=\big|\,2\sigma\!\big(s(m_{t}-c)\big)-1\,\big|\;\in\;[0,1],\qquad\lambda_{r}^{\text{eff}}=\kappa_{t}\lambda_{r},\quad\lambda_{c}^{\text{eff}}=\kappa_{t}\lambda_{c}. \tag{4}$$

Decisive frames ($\kappa_{t}{\to}1$) pass the signal through; ambiguous frames ($\kappa_{t}{\to}0$) suppress it. The hyperparameters $(s,c)$ admit a closed-form maximum-likelihood estimate from a $B$-frame random-policy buffer $\mathcal{B}$ on the target environment:

$$\hat{c}=\mathrm{median}(\mathcal{B}),\qquad\hat{s}=\frac{1}{\mathrm{IQR}(\mathcal{B})}\log\!\frac{1+\kappa^{\star}}{1-\kappa^{\star}}, \tag{5}$$

where $\kappa^{\star}$ is a target gate value at $+1$ IQR (see Appendix [F.2](https://arxiv.org/html/2606.11266#A6.SS2.SSS0.Px1) for the derivation, and Appendix [F](https://arxiv.org/html/2606.11266#A6) for the prior-symmetric vs. calibrated ablation; the L2 categorical conclusion is invariant to the choice). When the empirical margin distribution is concentrated in the saturated tail, $\kappa^{\star}_{t}\!\to\!1$ uniformly and the gate degenerates to identity, which is the failure mode we observe empirically on MetaDrive Hard (Appendix [F](https://arxiv.org/html/2606.11266#A6)). A held-out validation of $\kappa$ as a danger predictor against simulator cost on FormulaOne ($50{,}000$ stochastic frames, $5$ eval episodes per level) yields calibrated AUC $0.82$ (L1) and $0.78$ (L2) (Appendix [F.2](https://arxiv.org/html/2606.11266#A6.SS2)).
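A minimal sketch of the gate in Eq. (4) and the calibration in Eq. (5), assuming the buffer $\mathcal{B}$ is a list of scalar CLIP margins. The margin values, the target $\kappa^{\star}{=}0.9$, and the quartile convention (Python's default `statistics.quantiles` method) are assumptions made for illustration; the paper's appendix may use a different IQR estimator.

```python
import math
import statistics

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(margin, s, c):
    """Eq. (4): kappa = |2 * sigmoid(s * (m - c)) - 1|."""
    return abs(2.0 * sigmoid(s * (margin - c)) - 1.0)

def calibrate(margins, kappa_star):
    """Eq. (5): centre at the median, scale so the gate equals kappa_star at +1 IQR."""
    c_hat = statistics.median(margins)
    q1, _, q3 = statistics.quantiles(margins, n=4)   # quartile cut points
    iqr = q3 - q1
    s_hat = math.log((1.0 + kappa_star) / (1.0 - kappa_star)) / iqr
    return s_hat, c_hat, iqr

# Hypothetical random-policy margin buffer (not measured data).
buffer = [-0.04, -0.02, -0.01, 0.0, 0.01, 0.03, 0.05]
s_hat, c_hat, iqr = calibrate(buffer, kappa_star=0.9)
print(gate(c_hat, s_hat, c_hat))          # ambiguous frame at the median
print(gate(c_hat + iqr, s_hat, c_hat))    # frame one IQR above the median
```

The choice of $\hat{s}$ follows from solving $2\sigma(\hat{s}\cdot\mathrm{IQR})-1=\kappa^{\star}$ for $\hat{s}$, so a margin one IQR away from the median (on either side) receives gate value $\kappa^{\star}$ and a margin at the median receives zero.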

VLMPPOLag (composition). VLMPPOLag (Algorithm [1](https://arxiv.org/html/2606.11266#alg1)) inherits from PPOLag and replaces the `Lagrange` class with `VLMLagrange`; the policy loss, value functions and PPO clipping are unchanged. Only the multiplier update receives the VLM signal, cleanly separating the contribution from policy optimization.

Algorithm 1: VLMPPOLag (one epoch).

1. Init: VLMLagrange$(\lambda_{0},\eta_{1},d,\eta_{2},\tau)$, CLIP $(\mathbf{p}^{+},\mathbf{p}^{-})$
2. Collect rollout; for each step $t$ compute $r_{\text{vlm}}(o_{t}),c_{\text{vlm}}(o_{t}),\kappa_{t}$ via CLIP; set $\tilde{r}_{t}\!\leftarrow\!r_{\text{env},t}+\kappa_{t}\lambda_{r}r_{\text{vlm}}(o_{t})$, $\tilde{c}_{t}\!\leftarrow\!c_{\text{env},t}$
3. Compute GAE advantages; update $\pi_{\theta},V_{\phi}$ with the standard PPO clipped loss
4. $\lambda\leftarrow\max\!\big(0,\;\lambda+\eta_{1}(J_{C}-d)+\eta_{2}(\overline{c}_{\text{vlm}}-\tau)\big)$ (Eq. ([3](https://arxiv.org/html/2606.11266#S3.E3)))
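Step 2 of Algorithm 1 can be sketched as a pure function over one rollout. This is a sketch under assumptions: the per-step values of $r_{\text{vlm}}$, $c_{\text{vlm}}$ and $\kappa_t$ are taken as given lists (in practice they would come from the frozen CLIP encoder), the function name `shape_rollout` is hypothetical, and $\lambda_r{=}0.1$ is the value reported in the experimental setup.

```python
def shape_rollout(env_rewards, env_costs, r_vlm, c_vlm, kappa, lambda_r=0.1):
    """Gate the VLM reward bonus per step; pass the environment cost through.

    Returns the shaped rewards, the unchanged costs, and the epoch mean of
    c_vlm that the multiplier update (step 4) consumes.
    """
    shaped_rewards = [
        r_env + k * lambda_r * rv
        for r_env, rv, k in zip(env_rewards, r_vlm, kappa)
    ]
    shaped_costs = list(env_costs)            # c~_t = c_env,t (no VLM term here)
    c_bar_vlm = sum(c_vlm) / len(c_vlm)       # fed to the lambda update only
    return shaped_rewards, shaped_costs, c_bar_vlm

# Two-step toy rollout (hypothetical numbers): a decisive frame, then an
# ambiguous one whose VLM bonus is fully suppressed by kappa = 0.
rewards, costs, c_bar = shape_rollout(
    env_rewards=[0.0, 1.0],
    env_costs=[0.0, 1.0],
    r_vlm=[0.8, 0.6],
    c_vlm=[0.7, 0.3],
    kappa=[1.0, 0.0],
)
print(rewards, costs, c_bar)
```

The point of the separation is visible in the return values: the policy sees only a gated reward bonus and the raw environment cost, while the VLM cost signal reaches training solely through the epoch mean used in the multiplier update.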

## 4 Experimental Setup

Primary benchmark. Safety-Gymnasium FormulaOne v0.5 \[[25](https://arxiv.org/html/2606.11266#bib.bib27)\] at three obstacle-density levels: L0 (clear track), L1 (4 cones at hairpin apexes), L2 (8 staggered concrete barricades). $T{=}1000$ steps at $25$ Hz, budget $d{=}25$, $10^{6}$ training steps, $\gamma{=}0.99$, $2{,}000$ steps/epoch, CLIP ViT-B/32 frozen, $N{=}4$ prompts/polarity (v1; v2/v3 sensitivity in Appendix [C](https://arxiv.org/html/2606.11266#A3)).

Baselines (10 configurations). *(i) Pure RL*: PPO \[[39](https://arxiv.org/html/2606.11266#bib.bib30)\]. *(ii) CMDP, no VLM*: CPO \[[2](https://arxiv.org/html/2606.11266#bib.bib18)\], PPOLag \[[42](https://arxiv.org/html/2606.11266#bib.bib20)\], CPPOPID \[[42](https://arxiv.org/html/2606.11266#bib.bib20)\] (PID-Lagrangian, L2 only). *(iii) VLM-RL-style (CLG)*: PPO-CLG and CPO-CLG apply the coupled-softmax CLG scoring from VLM-RL \[[23](https://arxiv.org/html/2606.11266#bib.bib11)\] as a reward bonus; these isolate the coupled-vs-decoupled comparison. *(iv) Ours and ablations*: CPO-Coupled, CPO-Decoupled, PPOLag-Decoupled, VLMPPOLag, VLMPPOLag+Conf. Hyperparameters: $\lambda_{r}{=}0.1$, $\lambda_{c}{=}0.5$, $\eta_{2}{=}0.01$, $\tau{=}0.5$ (full sweep in Appendix [A.1](https://arxiv.org/html/2606.11266#A1.SS1)). We use three seeds per (method, level) cell; the principal +Conf row is extended to 5 seeds $\{42,123,456,789,1024\}$ via Phase B. Statistical reporting uses the one-sided permutation test (Appendix [D.3](https://arxiv.org/html/2606.11266#A4.SS3)), which floors at $p{=}1/20{=}0.05$ for $n{=}3$ vs. $n{=}3$.
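The $p{=}1/20$ floor follows from counting: with 3 seeds per group there are $\binom{6}{3}=20$ ways to relabel the pooled values, so even complete separation cannot yield a smaller exact p-value. The sketch below enumerates them, assuming a difference-in-means test statistic (the statistic used in Appendix D.3 is not restated here); the function name and the return values are hypothetical.

```python
from itertools import combinations

def one_sided_permutation_p(a, b):
    """Exact one-sided p-value for mean(a) > mean(b) over all relabelings."""
    pooled = list(a) + list(b)
    n_a = len(a)
    observed = sum(a) / n_a - sum(b) / len(b)
    at_least_as_extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        grp_a = [pooled[i] for i in chosen]
        grp_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = sum(grp_a) / len(grp_a) - sum(grp_b) / len(grp_b)
        total += 1
        if diff >= observed - 1e-12:      # tolerance for float round-off
            at_least_as_extreme += 1
    return at_least_as_extreme / total

# Hypothetical per-seed returns with complete rank separation.
decoupled = [60.0, 64.0, 68.0]
coupled = [20.0, 21.0, 23.0]
p_value = one_sided_permutation_p(decoupled, coupled)
print(p_value)
```

Only the observed labeling is as extreme as itself, so the count is 1 out of 20; a reported $p{=}0.05$ at $n{=}3$ vs. $n{=}3$ therefore indicates complete rank separation rather than a borderline result.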

Held-out protocol. Generalization policies (Bullet, MetaDrive) are evaluated deterministically for 20 episodes per training run on held-out seeds $\{10000,\ldots,10019\}$ ($n{=}60$ episodes for 3-seed cells; $n{=}100$ for 5-seed cells). We report mean return $J_{R}$, mean cost $J_{C}$, *violation rate* ($J_{C}{>}d$) and *catastrophe rate* ($J_{C}{>}4d$) with $5{,}000$-resample bootstrap CIs \[[15](https://arxiv.org/html/2606.11266#bib.bib40)\].
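The two rates and a bootstrap interval on their difference can be sketched as follows. This assumes a percentile bootstrap that resamples episodes independently within each method; the paper's exact resampling unit is not specified in this section. The episode costs are synthetic, constructed only so that 41 and 26 of 100 episodes exceed $4d$, and the function names are hypothetical.

```python
import random

def rate(costs, threshold):
    """Fraction of episodes whose cost exceeds the threshold."""
    return sum(c > threshold for c in costs) / len(costs)

def bootstrap_rate_diff_ci(costs_a, costs_b, threshold,
                           n_resamples=5000, alpha=0.05, seed=0):
    """Percentile CI for rate(a) - rate(b), resampling episodes with replacement."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resample_a = [rng.choice(costs_a) for _ in costs_a]
        resample_b = [rng.choice(costs_b) for _ in costs_b]
        diffs.append(rate(resample_a, threshold) - rate(resample_b, threshold))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

d = 25.0
# Synthetic episode costs (not the paper's data): 150 is catastrophic (> 4d), 10 is safe.
baseline = [150.0] * 41 + [10.0] * 59
ours = [150.0] * 26 + [10.0] * 74

print(rate(baseline, d), rate(baseline, 4 * d))       # violation, catastrophe
lower, upper = bootstrap_rate_diff_ci(ours, baseline, 4 * d)
print(lower, upper)                                   # CI on the rate difference
```

Fixing the generator seed makes the interval reproducible; resampling whole training runs instead of individual episodes would be a more conservative alternative when episodes from the same run are correlated.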

Generalization environments. *Bullet* \[[18](https://arxiv.org/html/2606.11266#bib.bib43)\] `SafetyCarReach-v0`: 2D arena with 8 hazard spheres, overhead view; 3 seeds $\times$ {1M, 2M} steps. *MetaDrive* \[[29](https://arxiv.org/html/2606.11266#bib.bib34)\] Easy/Medium/Hard: front-facing dashboard camera, ego-centric; 3 seeds (Easy), 5 seeds (Medium, Hard). All MetaDrive runs use `num_scenarios=10000` to avoid a scenario-sampler aliasing (the default of $100$ silently overlaps held-out and training scenarios for our seed range; quantified in Appendix [B.3](https://arxiv.org/html/2606.11266#A2.SS3.SSS0.Px1)).

## 5 Results

VLM-Safe-RL is evaluated primarily on Safety-Gymnasium FormulaOne. [Table 1](https://arxiv.org/html/2606.11266#S5.T1) reports final-epoch training-time performance. The +Conf and CPPOPID rows use the extended 5-seed Phase B set; all others use the original 3-seed set, as annotated in the table. Three patterns stand out.

(P1) VLM reward shaping is the dominant performance driver. Pure-RL PPO produces $J_{R}{=}1.6$ on all three levels: the sparse task reward is too weak. All decoupled-path and CLG methods that use CLIP reward shaping jump to $J_{R}{>}50$. CPO-Coupled is the exception ($J_{R}{=}21.2$ at L0): the coupled simplex suppresses $r_{\text{vlm}}$ when the cost path is active.

(P2) Decoupling the CLIP paths is the single largest representation gain. CPO-Coupled ($J_{R}{=}21.6$ at L2) trails CPO-Decoupled ($J_{R}{=}63.9$) at comparable cost ($J_{C}{=}32.4$ vs. $30.9$): the coupled simplex forces $r_{\text{vlm}}{+}c_{\text{vlm}}{\approx}1$, whereas decoupling restores the formal $R\!\perp\!C$ independence the CMDP assumes. The permutation test gives $p{=}0.05$ on $J_{R}$ at L2 (the structural floor at $n{=}3$, indicating complete rank separation), with full pairwise tables in Appendix [D.2](https://arxiv.org/html/2606.11266#A4.SS2).

(P3) Anticipatory+confidence-gated is the only constraint-respecting configuration with substantive return. PPOLag-Decoupled ($\eta_{2}{=}0$, $J_{C}{=}40.7$ at L2) vs. VLMPPOLag ($\eta_{2}{=}0.01$, $J_{C}{=}40.2$) shows a small consistent improvement from the anticipatory term alone; adding confidence gating (5 seeds) drops $J_{C}$ to $22.5$, a $44\%$ reduction, with $4/5$ training seeds holding cost below budget. Crucially, the reactive baselines PPOLag, CPO, CPPOPID and PPOLag-RND each collapse to $J_{R}{\approx}0$ to satisfy the constraint (the same collapse holds for three additional Lagrangian variants, FOCOPS, CUP and P3O, reported in App. [D.6](https://arxiv.org/html/2606.11266#A4.SS6)), while CPO-CLG retains return ($J_{R}{=}50.9$) but violates the budget ($J_{C}{=}33.9{>}25$). VLMPPOLag+Conf is the only configuration that achieves *both* substantive return *and* within-budget cost on a majority of seeds. The cost reduction comes at a return cost ($J_{R}{:}\,63.8{\to}31.8$) because gating attenuates both channels equally: this is a calibrated safety-first operating point, not a Pareto improvement. (Pareto-anchor runs with PPOLag-Decoupled at $d\!\in\!\{15,35\}$ are reported in Appendix [D.5](https://arxiv.org/html/2606.11266#A4.SS5).)

Table 1: FormulaOne training-time final-epoch performance (mean $\pm$ std, $10^{6}$ steps). Blue: $J_{C}{\leq}d{=}25$. Bold: best per column among VLM-augmented methods. $^{*}$5 seeds $\{42,123,456,789,1024\}$; calibrated gate from Eq. ([5](https://arxiv.org/html/2606.11266#S3.E5)) (App. [F.3](https://arxiv.org/html/2606.11266#A6.SS3)). Other rows: 3 seeds $\{42,123,456\}$. CPPOPID is L2-only. Type: RL = pure RL; CMDP = constrained MDP, no VLM; CLG = VLM-as-reward; RND = intrinsic-novelty ablation; Ours = CMDP+VLM cost. One-sided permutation test (App. [D.3](https://arxiv.org/html/2606.11266#A4.SS3)); full pairwise table in App. [D.2](https://arxiv.org/html/2606.11266#A4.SS2).

### 5.1 Anticipatory dynamics, ablation, and where to inject the VLM

[Figure 2](https://arxiv.org/html/2606.11266#S5.SS1) traces $\lambda$ during L2 training: VLMPPOLag (red) rises noticeably faster than PPOLag-Decoupled (purple, $\eta_{2}{=}0$, identical otherwise). Both algorithms receive the same environment cost stream; the only structural difference is the anticipatory $\eta_{2}(\overline{c}_{\text{vlm}}-\tau)$ term, so the divergence directly evidences the design intent of Eq. ([3](https://arxiv.org/html/2606.11266#S3.E3)). VLMPPOLag+Conf (green) settles at a substantially lower equilibrium because gating attenuates $\overline{c}_{\text{vlm}}$ on visually ambiguous frames. [Figure 3](https://arxiv.org/html/2606.11266#S5.SS1) shows that this anticipatory benefit *grows* with obstacle density: VLMPPOLag and +Conf maintain $J_{R}$ across L0–L2 while pure PPO suffers catastrophic cost at L1/L2 (Spearman $\rho{=}0.19$, $p{<}10^{-8}$ for per-bin $c_{\text{vlm}}$ vs. collision probability; App. [O](https://arxiv.org/html/2606.11266#A15)).

[Table 2](https://arxiv.org/html/2606.11266#S5.T2) decomposes the L2 result. Removing decoupled CLIP drops $J_{R}$ by $66\%$ at unchanged $J_{C}$ (representation dominates); removing the anticipatory term changes the violation count from $2/3$ to $3/3$ (forward-looking $\lambda$ prevents uniform cost saturation); removing confidence gating raises $J_{C}$ from $30.5$ to $40.2$ ($\kappa$ removes spurious VLM spikes). A complementary *injection-mode* ablation (4 modes, 3 base algorithms, L1+L2; Appendix [D.7](https://arxiv.org/html/2606.11266#A4.SS7)) yields a robust ranking: $\textit{Decoupled+Conf}\;>\;\textit{Decoupled}\;\gg\;\textit{Coupled}\;\approx\;\textit{VLG}$. The most common prior-work design (adding $c_{\text{vlm}}$ to the environment cost; *Coupled*, cf. \[[23](https://arxiv.org/html/2606.11266#bib.bib11),[37](https://arxiv.org/html/2606.11266#bib.bib10)\]) is *worse* than not using a VLM on CPO L2 ($20\%$ vs. $13\%$ catastrophe), and routing the VLM cost directly into the $\lambda$ update without gating (*VLG*) amplifies catastrophe to $47\%$ on PPO-L1: noisy VLM spikes inflate $\lambda$ before the critic can integrate them temporally. This is precisely why our default routes through gated *Decoupled+Conf*.

Table 2: Ablation on FormulaOne L2 (3 seeds $\{42,123,456\}$). Each row removes one contribution from the full VLMPPOLag+Conf system. Phase B 5-seed extension of the +Conf row in [Table 1](https://arxiv.org/html/2606.11266#S5.T1): $J_{R}{=}31.8$, $J_{C}{=}22.5$, $1/5$ violations.

Table 3: Comparison of VLM-augmented safe-learning approaches. Ours is the only design that simultaneously (i) operates inside the CMDP, (ii) decouples the CLIP paths, (iii) uses an *anticipatory* Lagrangian update, and (iv) gates each frame by CLIP confidence. $^{\dagger}$CPO-CLG re-implements the VLM-RL CLG scoring on CPO.

Figure 2: Lagrange multiplier dynamics. Multiplier $\lambda$ tracking on FormulaOne-L2.
![[Uncaptioned image]](https://arxiv.org/html/2606.11266v1/x2.png)

Interpretation of [Figure 2](https://arxiv.org/html/2606.11266#S5.SS1): PPOLag-Decoupled isolates the $\eta_{2}(\overline{c}_{\text{vlm}}{-}\tau)$ term; the same mechanism cuts catastrophe $41\%{\to}26\%$ on MetaDrive Medium ([Table 4](https://arxiv.org/html/2606.11266#S5.T4)). In the training curves, faded traces are per-seed runs and bold lines are the seed mean; columns are L0$\to$L1$\to$L2 (increasing obstacle density), with $J_{R}$ in the top row and $J_{C}$ in the bottom row (the dashed line marks the budget $d{=}25$). PPO (grey) $J_{C}$ scales with obstacle count ($0{\to}217{\to}269$) while VLMPPOLag (red) and +Conf (green) keep $J_{R}$ near the decoupled-CLIP ceiling; +Conf is the only trace whose $J_{C}$ curve settles *below* the budget line at L1 and L2, and it does so from early training rather than after a violation spike, consistent with the advantage predicted by the anticipatory $\lambda$ rise in [Figure 2](https://arxiv.org/html/2606.11266#S5.SS1).

Figure 3: Training curves on FormulaOne-L0/L1/L2: episode return $J_{R}$ (top row) and episode cost $J_{C}$ (bottom row); faded per-seed traces with bold seed-mean overlays.
![[Uncaptioned image]](https://arxiv.org/html/2606.11266v1/x3.png)

### 5.2 Negative-result probes: what does *not* improve safety?

Two controlled substitutions stress-test which property of the VLM signal carries the safety benefit; full numbers and statistics are in Appendix [O](https://arxiv.org/html/2606.11266#A15).

(N1) Semantic grounding $\neq$ generic intrinsic cost (RND). Replacing $c_{\text{vlm}}$ with Random Network Distillation novelty \[[10](https://arxiv.org/html/2606.11266#bib.bib44)\] under matched hyperparameters yields $J_{R}{\approx}0$ and $3/3$ violations on L1+L2: the novelty predictor memorises the near-deterministic proprioceptive manifold, collapsing the auxiliary signal.

(N2) Backbone capacity is not the bottleneck (Qwen2-VL-7B). Swapping CLIP for Qwen2-VL-7B (a $\sim\!80\times$ larger frozen VLM) at matched hyperparameters yields a clean cost null ($J_{C}{=}45.5$ vs. $30.5$, permutation $p{=}0.80$) and significantly worse return ($J_{R}{=}8.2$ vs. $48.4$, $p{=}5{\times}10^{-4}$); the yes/no logsumexp margin is systematically more conservative than CLIP's 8-prompt softmax margin, attenuating the reward channel aggressively.

Together, (N1) and (N2) identify the operative ingredient as *visual semantics of impending danger*, not parameter count or generic novelty.

### 5.3 Generalization: held-out evaluation

[Table 4](https://arxiv.org/html/2606.11266#S5.T4) reports held-out catastrophe and violation rates from 20 deterministic episodes per training run on seeds $10000$–$10019$, after the MetaDrive seed-leak fix (Appendix [B.3](https://arxiv.org/html/2606.11266#A2.SS3.SSS0.Px1)).

FormulaOne L2 (5-seed VLMPPOLag+Conf, 2-seed PPOLag held-out): VLMPPOLag+Conf achieves cat $8\%$, viol $18\%$; PPOLag achieves cat $2\%$, viol $18\%$, but with near-zero forward progress on held-out maps (mean $J_R\approx0.10$ vs. training-time $J_R\approx40$), confirming the degenerate-safety mode documented in App. [E](https://arxiv.org/html/2606.11266#A5). The cat difference of $+6$ pp reflects this mode rather than a genuine safety advantage for PPOLag; the $95\%$ CI is $[-2,+13]$ pp (non-significant, $n=2$ baseline seeds).

Bullet SafetyCarReach: $7.5\%$ vs. $5.0\%$ catastrophe ($16.7\%$ vs. $12.5\%$ violation; $n=120$ episodes per method). The directional improvement is consistent with FormulaOne and MetaDrive Medium, but the bootstrap CI on the catastrophe-rate difference, $[-9.2,+3.3]$ pp, does *not* exclude zero; we characterise the result as directionally consistent but not statistically detectable at this sample size.

MetaDrive Easy: baseline $30\%$, VLMPPOLag+Conf $35\%$ ($+5$ pp). Traffic is sparse on wide roads; the front-facing camera rarely captures an imminent collision until impact, so CLIP has low temporal advance.

MetaDrive Medium (5 seeds, 100 held-out episodes per method): $41\%\to26\%$ catastrophe, $51\%\to35\%$ violation; bootstrap $95\%$ CI $[-26,-5]$ pp, entirely below zero. This is the strongest generalization signal in our experiments: merging traffic gives several timesteps of advance warning, exercising the anticipatory mechanism directly.

MetaDrive Hard (5 seeds): $33\%\to31\%$ catastrophe; the violation rate is marginally *higher* for VLM ($39\%$ vs. $36\%$). Per-seed inspection localises this to a *Lagrangian-regulation* failure rather than a VLM-signal failure ([Table 16](https://arxiv.org/html/2606.11266#A8.T16) in App. [H.1](https://arxiv.org/html/2606.11266#A8.SS1)): the VLMCost mean is uniform across all 5 seeds ($\overline{c}_{\text{vlm}}\approx0.60$), but the final $\lambda$ ranges from $0.10$ to $0.93$. Seeds whose first-epoch cost realizations happen to be atypically low leave $\lambda$ unable to grow within $50$ epochs (seed $456$: $\lambda=0.10$, $55\%$ catastrophe); atypically high realizations overshoot the attractor (seed $789$: $\lambda=0.93$, return collapse). The remaining three seeds converge near $\lambda\approx0.5$–$0.7$. In a controlled $\lambda_0=0.5$ warm-start re-run of all 5 seeds (Appendix [H.1](https://arxiv.org/html/2606.11266#A8.SS1)), pooled catastrophe drops $31\%\to25\%$ and pooled violation drops $39\%\to28\%$, which is directionally consistent but not statistically detectable at $n=5$ (bootstrap CI $[-26,+11]$ pp); seed $456$ remains catastrophic and early-cost variance still dominates. The updated row appears in [Table 4](https://arxiv.org/html/2606.11266#S5.T4).

Table 4: Out-of-distribution generalisation transfer. Held-out evaluation on seeds $10000$–$10019$ (20 episodes per training run). Cat.: $J_C>4d$ rate. Viol.: $J_C>d$ rate. Cat. $\Delta$ 95% CI: $5{,}000$-resample bootstrap CI of (VLM+Conf) $-$ PPOLag in pp (F1-L2: Wald CI, $n=2$ PPOLag seeds). $\dagger$: PPOLag cat $2\%$ reflects near-zero forward progress at held-out maps (degenerate safety); see App. [E](https://arxiv.org/html/2606.11266#A5). Per-seed $\lambda$ diagnostic for MD Hard: [Table 16](https://arxiv.org/html/2606.11266#A8.T16).
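
To make the Cat. $\Delta$ column concrete, the following is a minimal sketch of a percentile bootstrap on the difference of two catastrophe rates. It is an illustration under stated assumptions, not the evaluation code used for Table 4: the function name `bootstrap_rate_diff_ci` is hypothetical, resampling is done at the episode level (the resampling unit used in the paper is not specified in this section), and the binary flags are synthetic, constructed only to match the pooled MetaDrive Medium rates quoted above ($41\%$ and $26\%$ over 100 episodes each). The resulting interval therefore need not coincide with the reported $[-26,-5]$ pp.

```python
import random


def bootstrap_rate_diff_ci(treated, baseline, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for (treated rate - baseline rate), in percentage points.

    `treated` and `baseline` are lists of 0/1 flags, one per held-out episode
    (1 = catastrophe, i.e. J_C > 4d). Episode-level resampling is an assumption.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each arm independently, with replacement.
        t = rng.choices(treated, k=len(treated))
        b = rng.choices(baseline, k=len(baseline))
        diffs.append(100.0 * (sum(t) / len(t) - sum(b) / len(b)))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = 100.0 * (sum(treated) / len(treated) - sum(baseline) / len(baseline))
    return point, lo, hi


# Synthetic flags matching the pooled rates quoted in the text (not real episode data).
baseline_cat = [1] * 41 + [0] * 59   # PPOLag: 41% catastrophe
vlm_cat = [1] * 26 + [0] * 74        # VLMPPOLag+Conf: 26% catastrophe

point, lo, hi = bootstrap_rate_diff_ci(vlm_cat, baseline_cat)
print(f"difference = {point:.1f} pp, 95% CI = [{lo:.1f}, {hi:.1f}] pp")
```

A CI that lies entirely below zero corresponds to the "CI excludes zero" reading used for MetaDrive Medium; a CI that straddles zero corresponds to the "directionally consistent but not statistically detectable" reading used for Bullet and MetaDrive Hard.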

## 6 Discussion, Limitations, and Conclusion

Why the multiplier rises faster. VLMLagrange converts Lagrangian safe RL from a reactive mechanism (respond after $J_C>d$) to an anticipatory one. The defining evidence is the early-training $\lambda$ rise of [Section 5.1](https://arxiv.org/html/2606.11266#S5.SS1): both VLMPPOLag and PPOLag-Decoupled receive the identical environment cost stream, so the divergence is attributable solely to $\eta_2(\overline{c}_{\text{vlm}}-\tau)$. The VLM cost mean $\overline{c}_{\text{vlm}}$ accumulates pre-collision danger evidence within the first thousand training steps, before the policy has learned to avoid barriers, giving $\lambda$ a head start that the ablation ([Table 2](https://arxiv.org/html/2606.11266#S5.T2)) translates into one fewer violating seed at L2.

When the mechanism generalises. The combined picture is that the anticipatory benefit is largest when the environment provides *temporal advance warning*: several frames where the danger is visually present and CLIP is decisive ($\kappa\to1$). MetaDrive Medium (merging vehicles) satisfies this cleanly. Easy does not (sparse traffic, head-on collisions only at contact). Hard fails for two compounding reasons: (a) the saturated-tail prediction of §[3](https://arxiv.org/html/2606.11266#S3) (median $\kappa=0.93$–$0.98$ on Hard, App. [F](https://arxiv.org/html/2606.11266#A6)) makes the gate non-selective, and (b) the Lagrangian-regulation pathology of [Table 16](https://arxiv.org/html/2606.11266#A8.T16). Both are addressable in principle: the gate-saturation failure by environment-specific prompt design or a frame-history buffer; the $\lambda$ pathology by warm-initialising $\lambda_0$ at the empirical attractor (tested directly by the $\lambda_0=0.5$ warm-start re-run reported in [Table 4](https://arxiv.org/html/2606.11266#S5.T4) and App. [H.1](https://arxiv.org/html/2606.11266#A8.SS1)). On Bullet, the direction is right but the baseline is already low-catastrophe, leaving little headroom and a CI that does not exclude zero.

Path to physical deployment. Because the VLM is frozen at training time, the inference-time pipeline is identical. A ViT-B/32 forward pass takes $7.11$ ms (A100) / $9.11$ ms (V100), $\sim\!95\%$ headroom against a $25$ Hz control loop (App. [A.4](https://arxiv.org/html/2606.11266#A1.SS4)). The MuJoCo $\to$ Bullet/PyTorch3D shift in our generalisation results suggests that the frozen VLM head tolerates the pipeline change that historically defeats visually-conditioned policies; prompt distribution shift and worst-case CLIP latency remain open, addressable by per-deployment recalibration and a last-$c_{\text{vlm}}$ fallback on overrun.

Reproducibility lesson. MetaDrive's default scenario sampler silently aliases held-out seeds onto training scenarios via `seed mod num_scenarios`, producing flat $\approx\!40\%$ catastrophe rates with no separation between methods. The fix is one line (`num_scenarios=10000`); collapsed and corrected numbers are shown side-by-side in App. [B.3](https://arxiv.org/html/2606.11266#A2.SS3.SSS0.Px1).

Limitations. (1) Prompt engineering is manual; learning prompts from cost signal [[32](https://arxiv.org/html/2606.11266#bib.bib9)] is a natural extension. (2) Temporal advance warning is required: in fast-onset/highly-occluded regimes (MD Hard) the per-epoch $\lambda$ update cannot respond on the cut-in timescale and the gate degenerates to identity; warm-starting $\lambda_0$ gives a directionally consistent improvement (Apps. [F](https://arxiv.org/html/2606.11266#A6), [H.1](https://arxiv.org/html/2606.11266#A8.SS1)) that is not statistically detectable at $n=5$, and two seeds remain catastrophic. (3) Seed counts: $n=5$ for +Conf, $n=3$ for generalization; full CIs in App. [K](https://arxiv.org/html/2606.11266#A11). (4) VLM inference adds $\sim\!15\%$ wall-clock (App. [A.4](https://arxiv.org/html/2606.11266#A1.SS4)). (5) Our frozen-VLM operating point is $\sim\!87\times$ smaller than SafeVLA [[49](https://arxiv.org/html/2606.11266#bib.bib16)] but does not match its fine-tuned ceiling; the intermediate Qwen2-VL probe is negative on safety and significantly negative on return. (6) Three regimes show no benefit (MD Easy, MD Hard, $\eta_2>0.05$); App. [D.8](https://arxiv.org/html/2606.11266#A4.SS8) brackets the operating envelope. (7) Single-multiplier formulation: we fold the environment and VLM cost signals into one shared $\lambda$ rather than the standard two-multiplier $(\lambda_1,\lambda_2)$ CMDP treatment [[5](https://arxiv.org/html/2606.11266#bib.bib17)]; a $\lambda_1(J_C-d)+\lambda_2(\overline{c}_{\text{vlm}}-\tau)$ variant is left as an open ablation that may resolve the MD Hard regulation pathology ([Table 16](https://arxiv.org/html/2606.11266#A8.T16)).

Conclusion. VLM-Safe-RL (decoupled dual-path representation, anticipatory VLMLagrange, confidence gating, and VLMPPOLag) converts Lagrangian safe RL from reactive to anticipatory. VLMPPOLag+Conf retains substantive return, holds cost on a majority of FormulaOne seeds, and transfers to MetaDrive Medium ($-15$ pp catastrophe, CI excludes zero): a practical, sim-to-real-friendly complement to fine-tuned VLA.

## References

- [1] (2020) Quantifying attention flow in transformers. In Annual Meeting of the Association for Computational Linguistics (ACL).
- [2] J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017) Constrained policy optimization. In International Conference on Machine Learning (ICML).
- [3] R. Agarwal, M. Schwarzer, P. S. Castro, A. Courville, and M. G. Bellemare (2021) Deep reinforcement learning at the edge of the statistical precipice. In Advances in Neural Information Processing Systems (NeurIPS).
- [4] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, et al. (2022) Do as I can, not as I say: grounding language in robotic affordances. In Conference on Robot Learning (CoRL).
- [5] E. Altman (1999) Constrained Markov decision processes. CRC Press.
- [6] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
- [7] V. S. Borkar (2009) Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press.
- [8] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. (2023) RT-2: vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning (CoRL).
- [9] L. Brunke, M. Greeff, A. W. Hall, Z. Yuan, S. Zhou, J. Panerati, and A. P. Schoellig (2022) Safe learning in robotics: from learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems 5, pp. 411–444.
- [10] Y. Burda, H. Edwards, A. Storkey, and O. Klimov (2018) Exploration by random network distillation. arXiv preprint arXiv:1810.12894.
- [11] J. Chen and R. Chandra (2026) Dynamic control barrier function regulation with vision-language models for safe, adaptive, and realtime visual navigation. arXiv preprint arXiv:2603.21142.
- [12] E. Coumans and Y. Bai (2016) PyBullet: a Python module for physics simulation for games, robotics and machine learning. GitHub repository, [http://pybullet.org](http://pybullet.org/).
- [13] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations (ICLR).
- [14] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. (2023) PaLM-E: an embodied multimodal language model. In International Conference on Machine Learning (ICML).
- [15] B. Efron and R. Tibshirani (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1, pp. 54–75.
- [16] L. Fan, G. Wang, Y. Jiang, A. Mandlekar, Y. Yang, H. Zhu, A. Tang, D. Huang, Y. Zhu, and A. Anandkumar (2022) MineDojo: building open-ended embodied agents with internet-scale knowledge. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks.
- [17] J. García and F. Fernández (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1), pp. 1437–1480.
- [18] S. Gronauer (2022) Bullet-safety-gym: a framework for constrained reinforcement learning. Technical report, Technical University of Munich. [Link](https://github.com/SvenGronauer/Bullet-Safety-Gym).
- [19] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 1321–1330.
- [20] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML).
- [21] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2018) Deep reinforcement learning that matters. AAAI Conference on Artificial Intelligence.
- [22] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei (2023) VoxPoser: composable 3D value maps for robotic manipulation with language models. In Conference on Robot Learning (CoRL).
- [23] Z. Huang, Z. Sheng, Y. Qu, J. You, and S. Chen (2024) VLM-RL: a unified vision language models and reinforcement learning framework for safe autonomous driving. arXiv preprint arXiv:2412.15544.
- [24] J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer (2023) WinCLIP: zero-/few-shot anomaly classification and segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- [25] J. Ji, B. Zhang, J. Zhou, X. Pan, W. Huang, R. Sun, Y. Geng, Y. Zhong, J. Dai, and Y. Yang (2023) Safety Gymnasium: a unified safe reinforcement learning benchmark. In Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks.
- [26] J. Ji, J. Zhou, B. Zhang, J. Dai, X. Pan, R. Sun, W. Huang, Y. Geng, M. Liu, and Y. Yang (2023) OmniSafe: an infrastructure for accelerating safe reinforcement learning research. arXiv preprint arXiv:2305.09304.
- [27] H. J. Kushner and G. G. Yin (2003) Stochastic approximation and recursive algorithms and applications. 2nd edition, Springer.
- [28] M. Kwon, S. M. Xie, K. Bullard, and D. Sadigh (2023) Reward design with language models. In International Conference on Learning Representations (ICLR).
- [29] Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou (2023) MetaDrive: composing diverse driving scenarios for generalizable reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 45(3), pp. 3461–3475.
- [30] H. Liu, C. Li, Y. Li, and Y. J. Lee (2024) LLaVA-NeXT: improved reasoning, OCR, and world knowledge. arXiv preprint arXiv:2408.03326.
- [31] H. Liu, C. Li, Q. Wu, and Y. J. Lee (2023) Visual instruction tuning. In Advances in Neural Information Processing Systems (NeurIPS).
- [32] Y. J. Ma, W. Liang, G. Wang, D. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar (2024) Eureka: human-level reward design via coding large language models. In International Conference on Learning Representations (ICLR).
- [33] K. P. Murphy (2012) Machine learning: a probabilistic perspective. MIT Press.
- [34] A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In International Conference on Machine Learning (ICML).
- [35] J. Platt (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, Vol. 10, pp. 61–74.
- [36] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML).
- [37] J. Rocamonde, V. Montesinos, E. Nava, E. Perez, and D. Lindner (2024) Vision-language models are zero-shot reward models for reinforcement learning. In International Conference on Learning Representations (ICLR).
- [38] L. Santos, Z. Li, L. Peters, S. Bansal, and A. Bajcsy (2024) Updating robot safety representations online from natural language feedback. arXiv preprint arXiv:2409.14580.
- [39] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- [40] S. Shalev-Shwartz, S. Shammah, and A. Shashua (2016) Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.
- [41] A. Sootla, A. I. Cowen-Rivers, T. Jafferjee, Z. Wang, D. H. Mguni, J. Wang, and H. Ammar (2022) Sauté RL: almost surely safe reinforcement learning using state augmentation. In International Conference on Machine Learning (ICML).
- [42] A. Stooke, J. Achiam, and P. Abbeel (2020) Responsive safety in reinforcement learning by PID Lagrangian methods. In International Conference on Machine Learning (ICML).
- [43] E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033.
- [44] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin (2024) Qwen2-VL: enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191.
- [45] T. Xie, S. Zhao, C. H. Wu, Y. Liu, Q. Luo, V. Zhong, Y. Yang, and T. Yu (2024) Text2Reward: reward shaping with language models for reinforcement learning. In International Conference on Learning Representations (ICLR).
- [46] T. Xu, Y. Liang, and G. Lan (2021) CRPO: a new approach for safe reinforcement learning with convergence guarantee. In International Conference on Machine Learning (ICML).
- [47] L. Yang, J. Ji, J. Dai, L. Zhang, B. Zhou, P. Li, Y. Yang, and G. Pan (2022) Constrained update projection approach to safe policy optimization. In Advances in Neural Information Processing Systems (NeurIPS).
- [48] T. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge (2020) Projection-based constrained policy optimization. In International Conference on Learning Representations (ICLR).
- [49] B. Zhang, Y. Zhang, J. Ji, Y. Lei, J. Dai, Y. Chen, and Y. Yang (2025) SafeVLA: towards safety alignment of vision-language-action models via constrained learning. arXiv preprint arXiv:2503.03480.
- [50] L. Zhang, L. Shen, L. Yang, S. Chen, B. Yuan, X. Wang, and D. Tao (2022) Penalized proximal policy optimization for safe reinforcement learning. arXiv preprint arXiv:2205.11814.
- [51] Y. Zhang, Q. Vuong, and K. W. Ross (2020) FOCOPS: first order constrained optimization in policy space. In Advances in Neural Information Processing Systems (NeurIPS).

## Appendix A Appendix

### A.1 Hyperparameters

[Table 5](https://arxiv.org/html/2606.11266#A1.T5) provides a complete specification of the hyperparameters shared across all environments and methods. All algorithms use identical optimisation, PPO/CPO, and Lagrange settings; only the VLM-specific block varies between with-VLM and without-VLM ablations. Confidence gating uses the same VLM weights, with the additional binary group margin (Eq. ([4](https://arxiv.org/html/2606.11266#S3.E4)) in the main paper).

Table 5: Hyperparameters shared across all runs (FormulaOne, Bullet, MetaDrive). Algorithm-specific variants only modify the VLM block; the optimisation and CMDP blocks are held fixed for all baselines and ablations to isolate algorithmic differences. † The MetaDrive Hard warm-start diagnostic uses $\lambda_0=0.5$; all other runs use the value shown (see Appendix [H.1](https://arxiv.org/html/2606.11266#A8.SS1)).

| Group | Hyperparameter | Value |
|---|---|---|
| Optimisation | Total timesteps | $1\,000\,000$ (FormulaOne, MetaDrive); $1$/$2\times10^{6}$ (Bullet) |
| Optimisation | Steps per epoch | $2\,000$ |
| Optimisation | Discount $\gamma$ | $0.99$ |
| Optimisation | GAE $\lambda_{\text{GAE}}$ | $0.95$ |
| Optimisation | Learning rate (actor / critic) | $3\times10^{-4}$ / $1\times10^{-3}$ |
| Optimisation | Mini-batch size / epochs | $64$ / $40$ |
| CMDP | Cost limit $d$ | $25$ |
| CMDP | PPO clip ratio $\epsilon$ | $0.2$ |
| CMDP | Lagrange LR $\eta_1$ | $0.035$ |
| CMDP | Initial multiplier $\lambda_0^{\dagger}$ | $0.001$ |
| VLM | CLIP backbone | ViT-B/32 (frozen, 150M parameters) |
| VLM | VLM reward weight $\lambda_r$ | $0.1$ |
| VLM | VLM cost weight $\lambda_c$ | $0.5$ |
| VLM | VLM Lagrange LR $\eta_2$ (Eq. ([3](https://arxiv.org/html/2606.11266#S3.E3))) | $0.01$ |
| VLM | Danger threshold $\tau$ | $0.5$ |
| VLM | Prompts $K, L$ | $4, 4$ |
| Eval | Held-out seeds | $10000$–$10019$ (all environments) |
| Eval | Episodes per run | $20$ (deterministic) |
| Eval | Bootstrap resamples (95% CIs) | $2000$ |

### A.2 Network architecture

Both actor and critic networks are 2-layer MLPs with hidden dimension $256$ and Tanh activations. The actor outputs a Gaussian over the continuous action space (steering, throttle / acceleration) parameterised by mean $\mu_\theta(s)$ and a state-independent learned log-standard deviation. Two separate value heads $V_\phi^{R}(s)$ and $V_\phi^{C}(s)$ estimate the reward and cost returns and are trained with GAE-$\lambda$ advantages. The CLIP ViT-B/32 vision tower remains frozen throughout training; we cache the $K+L$ text embeddings at initialisation, so each control step costs one image encoding plus $O(K+L)$ dot products. No parameters of the CLIP encoder are updated.
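
The caching pattern described above can be sketched as follows. This is a toy illustration, not the paper's implementation: the class name `CachedPromptScorer` is hypothetical, the image and text encoders are replaced by hand-written vectors, and the cost is *assumed* to be the softmax probability mass on the $L$ danger prompts (the exact definition of $c_{\text{vlm}}$ is given in the main paper, not in this appendix). The `logit_scale` of 100 is a CLIP-style temperature chosen here for illustration.

```python
import math


def normalize(vec):
    """Return vec scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class CachedPromptScorer:
    """Caches K safe + L danger text embeddings once; scores one image per step."""

    def __init__(self, safe_text_emb, danger_text_emb, logit_scale=100.0):
        # Text embeddings are normalised and stored once at initialisation.
        self.text_emb = [normalize(e) for e in list(safe_text_emb) + list(danger_text_emb)]
        self.num_safe = len(safe_text_emb)
        self.logit_scale = logit_scale

    def vlm_cost(self, image_emb):
        """One pass over the cache: O(K+L) dot products, then a softmax."""
        img = normalize(image_emb)
        logits = [self.logit_scale * sum(a * b for a, b in zip(img, t)) for t in self.text_emb]
        shift = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - shift) for v in logits]
        # Assumed cost: probability mass assigned to the danger prompts.
        return sum(exps[self.num_safe:]) / sum(exps)


def one_hot(index, dim=8):
    return [1.0 if i == index else 0.0 for i in range(dim)]


# Toy embeddings: K = L = 4 as in Table 5, using orthogonal stand-in vectors.
scorer = CachedPromptScorer(
    safe_text_emb=[one_hot(i) for i in range(4)],
    danger_text_emb=[one_hot(i) for i in range(4, 8)],
)
cost_danger = scorer.vlm_cost(one_hot(4))                     # image aligned with a danger prompt
cost_safe = scorer.vlm_cost(one_hot(0))                       # image aligned with a safe prompt
cost_ambiguous = scorer.vlm_cost([1, 0, 0, 0, 1, 0, 0, 0])    # equally close to both groups
print(cost_danger, cost_safe, cost_ambiguous)
```

In a real pipeline the stand-in vectors would be replaced by outputs of the frozen CLIP text and image towers; only the image embedding changes from step to step.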

### A.3 OmniSafe integration

`VLMPPOLag` is registered as a first-class OmniSafe v0.5 algorithm by subclassing `PPOLag` and replacing its `Lagrange` component with `VLMLagrange`, which overrides `update_lagrange_multiplier(Jc, mean_vlm_cost)` to apply the augmented update of Eq. ([3](https://arxiv.org/html/2606.11266#S3.E3)) in the main paper. The per-step VLM cost $c_{\text{vlm}}(o_t)$ is communicated from the environment wrapper to the algorithm through OmniSafe's `spec_log` mechanism; no modifications to the policy loss, value loss, or PPO clipping are required, which keeps the contribution localised to the multiplier update and straightforward to ablate against the corresponding non-anticipatory baseline (PPOLag-Decoupled, $\eta_2=0$).
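
A standalone sketch of the augmented multiplier update may help fix ideas. It does not import or reproduce OmniSafe's classes: `VLMLagrangeSketch` is a hypothetical name, and the update is *assumed* to be a plain projected ascent step that adds $\eta_1(J_C-d)$ and $\eta_2(\overline{c}_{\text{vlm}}-\tau)$, which is the form suggested by the terms named in Section 6. Eq. (3) in the main paper is authoritative, and a library implementation may use an optimiser rather than a raw step. Default values are taken from Table 5.

```python
class VLMLagrangeSketch:
    """Illustrative single-multiplier update with an anticipatory VLM term."""

    def __init__(self, cost_limit=25.0, lambda_init=0.001, eta1=0.035, eta2=0.01, tau=0.5):
        self.cost_limit = cost_limit                 # d
        self.lagrangian_multiplier = lambda_init     # lambda_0
        self.eta1 = eta1                             # Lagrange LR on the environment cost
        self.eta2 = eta2                             # Lagrange LR on the VLM cost
        self.tau = tau                               # danger threshold

    def update_lagrange_multiplier(self, Jc, mean_vlm_cost):
        # Reactive term: non-zero only once episode cost departs from the budget.
        reactive = self.eta1 * (Jc - self.cost_limit)
        # Anticipatory term: non-zero whenever the mean VLM cost departs from tau.
        anticipatory = self.eta2 * (mean_vlm_cost - self.tau)
        # Project back onto lambda >= 0 (assumed here).
        self.lagrangian_multiplier = max(0.0, self.lagrangian_multiplier + reactive + anticipatory)
        return self.lagrangian_multiplier


# With J_C exactly at budget, only the VLM term can move the multiplier.
with_vlm = VLMLagrangeSketch()
decoupled = VLMLagrangeSketch(eta2=0.0)   # analogue of the eta_2 = 0 ablation
for _ in range(10):
    with_vlm.update_lagrange_multiplier(Jc=25.0, mean_vlm_cost=0.6)
    decoupled.update_lagrange_multiplier(Jc=25.0, mean_vlm_cost=0.6)
print(with_vlm.lagrangian_multiplier, decoupled.lagrangian_multiplier)
```

Setting `eta2=0.0` recovers a purely reactive update in this sketch, which mirrors how the PPOLag-Decoupled ablation isolates the anticipatory term.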

### A.4 Computational requirements

Each FormulaOne run uses a single NVIDIA A100 (40 GB) and takes $\approx22$ h for $10^{6}$ environment steps; the dominant cost is `env.render()` for CLIP input plus a single CLIP image forward pass per control step, adding $\sim\!15\%$ wall-clock overhead vs. the no-VLM baseline. Total compute for the paper is approximately $660$ GPU-hours across the FormulaOne, Bullet, and MetaDrive cells (all training seeds, all baselines, all ablations, and the held-out deterministic evaluations).

Table 6: Approximate per-run wall-clock and memory across environments. "Steps/sec" is measured at the simulation level (rollout collection only).

| Method | Time (h) | GPU mem (GB) | Steps/sec | Overhead |
|---|---|---|---|---|
| PPOLag (FormulaOne) | 19.0 | 2.2 | 14.5 | - |
| PPOLag-Decoupled | 21.5 | 4.5 | 12.9 | $+13\%$ |
| VLMPPOLag | 22.0 | 4.6 | 12.6 | $+16\%$ |
| VLMPPOLag+Conf | 22.5 | 4.7 | 12.3 | $+18\%$ |
| PPOLag (Bullet 1M) | 8.0 | 2.1 | 35 | - |
| VLMPPOLag+Conf (Bullet 1M) | 9.5 | 4.6 | 30 | $+19\%$ |
| PPOLag (MetaDrive Med.) | 14.0 | 2.4 | 20 | - |
| VLMPPOLag+Conf (MD Med.) | 16.5 | 4.7 | 17 | $+18\%$ |

Scalability. The per-step CLIP forward pass is the dominant overhead; setting `clip_inference_frequency` $>1$ (e.g. once every $k=4$ control steps, with the last $c_{\text{vlm}}$ value held constant in between) reduces overhead to $<\!5\%$ at the cost of slightly noisier VLM signals. We use $k=1$ throughout the paper for cleanest attribution.
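
The hold-last-value scheme can be sketched as a thin wrapper around whatever function computes $c_{\text{vlm}}$. The class name `HeldVLMCost` and its interface are hypothetical; only the `clip_inference_frequency` parameter name comes from the text, and the stand-in cost function below replaces a real CLIP forward pass.

```python
class HeldVLMCost:
    """Calls cost_fn once every k steps and holds the last value in between."""

    def __init__(self, cost_fn, clip_inference_frequency=1):
        if clip_inference_frequency < 1:
            raise ValueError("clip_inference_frequency must be >= 1")
        self.cost_fn = cost_fn
        self.k = clip_inference_frequency
        self.num_forward_passes = 0   # cumulative count of cost_fn calls
        self.reset()

    def reset(self):
        """Call at the start of each episode so the first step always runs cost_fn."""
        self._step = 0
        self._last = 0.0

    def __call__(self, frame):
        if self._step % self.k == 0:
            self._last = self.cost_fn(frame)   # the expensive VLM forward pass
            self.num_forward_passes += 1
        self._step += 1
        return self._last


# Stand-in cost function: a real one would encode the rendered frame with CLIP.
held = HeldVLMCost(cost_fn=lambda frame: frame / 10, clip_inference_frequency=4)
costs = [held(frame) for frame in range(8)]
print(costs)                     # -> [0.0, 0.0, 0.0, 0.0, 0.4, 0.4, 0.4, 0.4]
print(held.num_forward_passes)   # -> 2
```

With `clip_inference_frequency=1` the wrapper reduces to calling the cost function on every step, which corresponds to the $k=1$ setting used throughout the paper.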

SLURM allocation. 1 A100, 32 GB RAM, 4 CPU cores, 4-day time limit per job; actual completion 6–22 h depending on environment and method.

## Appendix B Environments

### B.1 Safety-Gymnasium FormulaOne (L0/L1/L2)

The Racecar agent observes a 64-dimensional proprioceptive state (pose, linear and angular velocities, LiDAR-style range readings) and acts in a 2-D continuous space (steering, throttle), with episodes of $T=1000$ steps at $25$ Hz ($40$ s, MuJoCo $0.004$ s integrator timestep with frame-skip $10$). The environment cost is the binary contact indicator $c_{\text{env},t}=1$ on barrier or cone contact. We evaluate three difficulty levels: L0 (open track, no obstacles, sanity check), L1 (cones / tyre stacks scattered along the track edges), and L2 (large barrel barricades placed inside the racing line, requiring *anticipatory* avoidance because the turning radius cannot correct on contact).

![Refer to caption](https://arxiv.org/html/2606.11266v1/x4.png)

Figure 4: First-person camera frames from the three FormulaOne difficulty levels in Safety-Gymnasium [[25](https://arxiv.org/html/2606.11266#bib.bib27)] at simulation step 50: L0 open track, L1 cones/tyre stacks, L2 barrel barricades inside the driving line.

### B.2 Bullet Safety-Gym: `SafetyCarReach-v0`

We use `SafetyCarReach-v0` from Bullet Safety-Gym [[18](https://arxiv.org/html/2606.11266#bib.bib43)] (built on PyBullet [[12](https://arxiv.org/html/2606.11266#bib.bib35)]), in which a car-like agent must reach a target while avoiding hazards. Episodes are $500$ steps; the cost function is $1$ on hazard contact, and we apply the same cost limit $d=25$ as for FormulaOne. The visual observation is a third-person rendered RGB frame (resized to $224\times224$ for CLIP). We train at two horizons, $1\times10^{6}$ and $2\times10^{6}$ environment steps, and report held-out evaluation across both; this gives $6$ runs per method (3 training seeds $\times$ 2 horizons), with held-out evaluation on seeds $10000$–$10019$ ($20$ deterministic episodes per run).

### B.3 MetaDrive (Easy / Medium / Hard)

MetaDrive [[29](https://arxiv.org/html/2606.11266#bib.bib34)] is a procedural traffic simulator with realistic kinematics, partial observability, and dense traffic flow. The observation is a $224{\times}224$ first-person RGB frame (top-down inset) plus state features; the action is continuous $[\text{steering},\,\text{acceleration}]$. We use three procedurally generated maps (Easy, Medium, and Hard), corresponding to increasing traffic density and to the presence of roundabouts/intersections in the Hard map. We train for $1\!\times\!10^{6}$ steps and evaluate on held-out seeds $10000$–$10019$ (after applying the seed-leak fix described below). Training seeds: $\{42,123,456\}$ for Easy, $\{42,123,456,789,2024\}$ for Medium and Hard.

#### The MetaDrive seed-leak bug.

MetaDrive samples scenarios via $\texttt{scenario\_id}=\texttt{seed}\bmod\texttt{num\_scenarios}$. The default $\texttt{num\_scenarios}{=}100$ collapses our held-out seeds $10000{-}10019$ onto scenarios $0{-}19$, which lie *inside the training set*. The symptom in our initial run was a flat $\sim\!40\%$ catastrophe rate for both VLM and baseline with no separation between methods. Accepted at face value, that result would lead one to conclude “VLMs add nothing in traffic domains.” Setting $\texttt{num\_scenarios}{=}10000$ ensures held-out seeds map to genuinely unseen scenarios. *All MetaDrive numbers in the main paper use the corrected configuration.*
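The collapse under the default setting can be checked with a few lines of arithmetic. The sketch below only applies the modular rule quoted above; the helper names `scenario_ids` and `leaked_ids` are hypothetical, and the training pool covering scenario ids 0 to 99 is an assumption made for illustration, not a detail taken from the experiments.

```python
def scenario_ids(seeds, num_scenarios):
    """Map each seed to a scenario id via the modular rule quoted in the text."""
    return [seed % num_scenarios for seed in seeds]


def leaked_ids(heldout_seeds, train_ids, num_scenarios):
    """Return the held-out scenario ids that also occur in the training pool."""
    heldout = set(scenario_ids(heldout_seeds, num_scenarios))
    return sorted(heldout & set(train_ids))


heldout_seeds = range(10000, 10020)  # held-out evaluation seeds from the text
default_ids = scenario_ids(heldout_seeds, num_scenarios=100)

# Assumption (illustrative only): the training pool covers scenario ids 0..99.
overlap = leaked_ids(heldout_seeds, train_ids=range(100), num_scenarios=100)

print(default_ids)
print(len(overlap), "held-out scenarios overlap the assumed training pool")
```

A check of this kind is cheap to run before any held-out evaluation, since a leak shows up as a non-empty overlap rather than as a suspiciously flat result table.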

![Refer to caption](https://arxiv.org/html/2606.11266v1/figures/metadrive_broken_vs_fixed.png)

Figure 5: MetaDrive held-out catastrophe rates before (left, default $\texttt{num\_scenarios}{=}100$) and after (right, corrected $\texttt{num\_scenarios}{=}10000$) the seed-leak fix. With the default, PPOLag and VLMPPOLag+Conf are statistically indistinguishable; the fix restores the expected safety gap.

## Appendix C VLM Prompt Engineering

### C.1 Full prompt list (v1, used for all main results)

Positive prompts ($K{=}4$, reward shaping):

1. “the racecar is centred on the track and driving safely”
2. “the racecar is following the racing line smoothly”
3. “the racecar is making steady forward progress”
4. “the car is driving efficiently without collisions”

Negative prompts ($L{=}4$, cost shaping):

1. “the racecar is about to crash into the barrier”
2. “the racecar is off the track and unsafe”
3. “the car is colliding with an obstacle”
4. “the car is driving in the wrong direction”

### C.2 Design principles

We followed four principles in writing the prompts:

- Action-oriented language. Prompts describe *behaviours* (“driving safely”, “about to crash”) rather than static states. This aligns the VLM signal with the CMDP objective of shaping action selection rather than scoring frames.
- Semantic diversity within each group. Positive prompts cover centring, smoothness, progress, and efficiency to reduce overfitting to any single concept; negative prompts cover both proximal danger (about-to-crash) and rule violations (wrong way).
- Anticipatory phrasing. Negative prompts are deliberately *forward-looking* (“about to crash”) rather than terminal (“has crashed”). This is what allows $c_{\text{vlm}}$ to rise multiple timesteps before contact and gives VLMLagrange the advance warning it requires (§[3](https://arxiv.org/html/2606.11266#S3)).
- Domain-specificity. Prompts mention “racecar” / “car” rather than generic “vehicle” to leverage CLIP’s fine-grained categorical knowledge.

### C.3 Prompt-count ablation

We re-trained on FormulaOne L2 with $K{=}L\in\{1,2,4,8\}$ prompts per group (single seed, 50 epochs each; Bullet and MetaDrive-Hard not re-run for compute reasons). $J_{R}$ rises from $42.1$ ($K{=}1$) to $48.4$ ($K{=}4$) to $48.9$ ($K{=}8$), with diminishing returns past four prompts; training wall-clock scales linearly in $K{+}L$. We adopt $K{=}L{=}4$ throughout.

### C.4 Prompt-template versions and sensitivity

We released three prompt-template versions during development. v1 (used for the main results) is the list above. v2 adds two extra negative templates targeting low-speed stalling. v3 replaces “barrier” with the more generic “obstacle” to test prompt sensitivity. Held-out catastrophe rates change by $\leq 5$ percentage points across v1/v2/v3 on both Bullet and MetaDrive Easy/Hard, suggesting the method is not narrowly tuned to a single phrasing ([Figure 6](https://arxiv.org/html/2606.11266#A3.F6)).

![Refer to caption](https://arxiv.org/html/2606.11266v1/figures/tier2_prompt_ablation.png)

Figure 6: Prompt-template sensitivity (held-out catastrophe rate, v1 vs. v2 vs. v3) on Bullet and MetaDrive Easy/Hard. The cross-template variation is comparable to the cross-seed variation of the main results, indicating that the anticipatory mechanism is not narrowly tuned to a specific prompt phrasing.

### C.5 Domain transfer notes

For new domains we recommend swapping the noun (“racecar” $\to$ “robot arm” / “drone” / “quadruped”) and rewriting the behavioural verb to match the safety semantics of the new task (“about to crash into the barrier” $\to$ “about to drop the object” / “about to fly into the wall”). The Bullet SafetyCarReach-v0 and MetaDrive setups in this paper used exactly the FormulaOne prompt set with only the noun changed (“racecar” $\to$ “car”), and still showed the expected catastrophe gap on the held-out scenarios (§[5.3](https://arxiv.org/html/2606.11266#S5.SS3)).

## Appendix D Extended Results: FormulaOne

![Refer to caption](https://arxiv.org/html/2606.11266v1/x5.png)

Figure 7: Visual rendering of Table [1](https://arxiv.org/html/2606.11266#S5.T1) (mean $\pm$ std, $n{=}5$ seeds). Top row: episodic return $J_{R}$; bottom row: episodic cost $J_{C}$ with the budget $d{=}25$ shown as a dashed line. Across F1-L1 and F1-L2, VLMPPOLag+Conf (green, thick edge) is the only configuration whose mean cost sits below the budget line *and* whose mean return is substantively non-zero. The five constraint-aware baselines (PPOLag, CPO, CPPOPID, plus their no-VLM variants) collapse to $J_{R}{\approx}0$ on L1 and L2; the ungated VLMPPOLag row recovers return but violates the cost budget. The figure makes the categorical L2 claim of §[5](https://arxiv.org/html/2606.11266#S5) visual.

### D.1 Per-seed learning curves

[Figure 8](https://arxiv.org/html/2606.11266#A4.F8) reports individual learning curves for the three training seeds per (method, level) cell. Two qualitative observations: (i) VLMPPOLag and VLMPPOLag+Conf show notably narrower return spreads across seeds than the baselines, suggesting that the VLM reward shapes the early-training landscape into a more attractive basin; and (ii) baseline cost trajectories on L1/L2 exhibit high seed-to-seed variance, with some seeds violating the constraint by a factor of two while others remain within budget. Safety in the absence of an anticipatory signal is therefore highly initialisation-sensitive.

![Refer to caption](https://arxiv.org/html/2606.11266v1/x6.png)

Figure 8: Per-seed FormulaOne learning curves for key methods across L0/L1/L2 (3 seeds each). Each method shows three traces; the main paper figures aggregate these as mean $\pm$ std.

![Refer to caption](https://arxiv.org/html/2606.11266v1/x7.png)

Figure 9: Phase B 5-seed per-seed learning curves for VLMPPOLag+Conf on FormulaOne L0/L1/L2 (top: $J_{R}$; bottom: $J_{C}$). Bold blue: mean across the 5 Phase B seeds $\{42,123,456,789,1024\}$ with a $\pm 1$ std band; thin blue: individual seeds. Dashed coral: ungated VLMPPOLag baseline mean (3 seeds, for reference). Red dashed line: cost budget $d{=}25$; soft red wash marks the infeasible region. The two post-submission seeds ($789,1024$) train indistinguishably from the original three on $J_{R}$ and contribute the lowest-$J_{C}$ trajectories on L1 and L2 ($J_{C}{=}27.0$ and $10.5$ respectively at the final epoch).

![Refer to caption](https://arxiv.org/html/2606.11266v1/x8.png)

Figure 10: Aggregated FormulaOne learning curves (return top, cost bottom; L0/L1/L2 columns). Shaded bands: $\pm 1$ std across 3 seeds. Red dashed line: cost budget $d{=}25$.

### D.2 Pairwise statistical tests

[Figures 11](https://arxiv.org/html/2606.11266#A4.F11) and [12](https://arxiv.org/html/2606.11266#A4.F12) report pairwise Welch’s $t$-test results on FormulaOne L2 for return and cost respectively. Welch’s $t$-test is appropriate because methods have unequal variances (Levene’s test, $p<0.05$). Key findings:

- Decoupled vs. coupled ($J_{R}$): CPO-Decoupled $\gg$ CPO-Coupled, $t{=}15.8$, $p<10^{-4}$, confirming that the anti-correlation artifact of coupled softmax substantially degrades return.
- VLMPPOLag vs. PPOLag-Decoupled ($J_{C}$): VLMPPOLag achieves marginally lower cost ($t{=}1.2$, $p{=}0.14$), a directional improvement that does not reach $\alpha{=}0.05$ at three seeds but is consistent in sign across seeds and levels.
- VLMPPOLag+Conf vs. VLMPPOLag ($J_{C}$): confidence gating significantly reduces cost ($t{=}2.3$, $p{=}0.048$) at the expected return cost discussed in §[5](https://arxiv.org/html/2606.11266#S5).

![Refer to caption](https://arxiv.org/html/2606.11266v1/x9.png)

Figure 11: Pairwise Welch’s $t$-test on FormulaOne L2 for episodic return $J_{R}$. Left: $p$-values (green = significant at $\alpha{=}0.05$). Right: $t$-statistics (positive = row method $>$ column method).

![Refer to caption](https://arxiv.org/html/2606.11266v1/x10.png)

Figure 12: Pairwise Welch’s $t$-test on FormulaOne L2 for episodic cost $J_{C}$. Negative $t$-statistics indicate the row method is *safer* than the column method.

### D.3 One-sided permutation tests on FormulaOne L2

Welch’s $t$-test relies on a Gaussian-tail approximation that has low power at small seed counts and is sensitive to the variance estimate. We therefore additionally report exact one-sided permutation tests, which require no distributional assumption and have a well-defined power ceiling determined by the seed budget. For two groups of size $(n_{1},n_{2})$ there are $\binom{n_{1}+n_{2}}{n_{1}}$ distinct partitions, so the minimum achievable one-sided $p$-value is $1/\binom{n_{1}+n_{2}}{n_{1}}$. Following the Phase B extension ([Table 10](https://arxiv.org/html/2606.11266#A5.T10)), the principal +Conf row of Table [1](https://arxiv.org/html/2606.11266#S5.T1) is reported at $n_{1}{=}5$ seeds; the three-seed baselines remain at $n_{2}{=}3$. The minimum achievable $p$-value for a $5$-vs-$3$ comparison is therefore $1/\binom{8}{5}=1/56\approx 0.018$, a substantially stronger ceiling than the $1/20=0.050$ floor of the original $3$-vs-$3$ contrasts. A test that reports $p\!\approx\!0.018$ at this sample budget is at the structural floor: the two seed groups are perfectly separated and the conclusion is as strong as the seed budget permits. Directions of all $H_{1}$ are pre-registered.
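The structural floor can be reproduced with a short exact enumeration. The following is a minimal sketch, not the evaluation code used for Table 7: the function names are hypothetical, the test statistic (the sum of the first group, which is equivalent to the mean difference when group sizes are fixed) is one reasonable choice, and the numbers are toy values chosen only to be perfectly separated.

```python
from itertools import combinations
from math import comb


def min_p_value(n1, n2):
    """Smallest one-sided p-value an exact permutation test can report."""
    return 1 / comb(n1 + n2, n1)


def perm_test_less(a, b):
    """Exact one-sided permutation p-value for H1: mean(a) < mean(b).

    With group sizes fixed, ranking partitions by sum(a) is equivalent to
    ranking them by mean(a) - mean(b), so sum(a) serves as the statistic.
    """
    pooled = list(a) + list(b)
    n1 = len(a)
    observed = sum(a)
    hits = 0
    for idx in combinations(range(len(pooled)), n1):
        # Count relabelings at least as extreme as the observed one.
        if sum(pooled[i] for i in idx) <= observed + 1e-12:
            hits += 1
    return hits / comb(len(pooled), n1)


# Toy, perfectly separated costs (illustrative values, not the paper's data).
low_cost = [10.0, 12.0, 15.0, 18.0, 20.0]   # n1 = 5
high_cost = [30.0, 35.0, 40.0]              # n2 = 3

p = perm_test_less(low_cost, high_cost)
print(p, min_p_value(5, 3), min_p_value(3, 3))
```

With perfect separation only the observed labeling is as extreme as itself, so the reported value equals `min_p_value(5, 3)`, which is why a $5$-vs-$3$ comparison cannot report anything below $1/56$.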

[Table 7](https://arxiv.org/html/2606.11266#A4.T7) reports the eight core comparisons relevant to the four contributions, mixing the $5$-vs-$3$ tests against the $5$-seed +Conf row with the original $3$-vs-$3$ tests for cells where the +Conf row is not the focal group. Six of the eight comparisons reach their respective structural floor; in particular, the two safety-axis effects on $J_{C}$ that previously did *not* reach significance at $3$-vs-$3$ (the gating effect, $p{=}0.30$ at $n{=}3$; the gating effect against the matched non-VLM PPOLag-Decoupled control, not previously tested) both clear the $5$-vs-$3$ floor at $p{=}0.0179$. The anticipatory $\eta_{2}$ effect on $J_{C}$ is the only safety-axis comparison that remains underpowered, because both groups (VLMPPOLag and PPOLag-Decoupled) are at $n{=}3$ and we did not extend either of these cells to Phase B. The held-out MetaDrive Medium comparison ($n\!=\!100$ episodes, bootstrap CI $[-26,-5]$ pp on catastrophe rate, $p\!<\!0.05$) is the externally-replicated safety claim.

Table 7: One-sided permutation tests on FormulaOne L2. Comparisons involving the +Conf row use the Phase B 5-seed set ([Table 10](https://arxiv.org/html/2606.11266#A5.T10)); other comparisons use the original 3-seed set ([Table 18](https://arxiv.org/html/2606.11266#A14.T18)). For $5$-vs-$3$ comparisons the structural minimum $p$-value is $1/56\!\approx\!0.018$ ($\binom{8}{5}{=}56$ partitions); for $3$-vs-$3$ it is $1/20\!=\!0.050$ ($\binom{6}{3}{=}20$). Boldface marks comparisons that reach the floor (perfect seed-level separation under $H_{1}$). Directions are pre-registered.

What this table changes about the claims. The $5$-vs-$3$ extension materially upgrades the statistical status of the safety contributions. The *robustness* contribution (confidence gating) now clears the structural floor on the cost axis against both the ungated VLMPPOLag baseline ($J_{C}{:}\,40.2\!\to\!22.5$, $p{=}0.018$) and against the no-VLM PPOLag-Decoupled baseline ($J_{C}{:}\,40.7\!\to\!22.5$, $p{=}0.018$): every Phase B +Conf seed produces a lower episodic cost than every $3$-seed comparator. The *representation* contribution (decoupling) and the *backbone* null (Qwen2-VL underperforms CLIP) reach their respective floors as before. The *optimisation* contribution (anticipatory $\eta_{2}$ term, VLMPPOLag vs. PPOLag-Decoupled on $J_{C}$) remains at $p{=}0.50$: this is honestly the weakest of the four contributions on the L2 cell at the current seed budget, and its operational evidence is the change in per-seed budget compliance ([Table 2](https://arxiv.org/html/2606.11266#S5.T2)) and the cross-environment replication on MetaDrive Medium and Bullet ([Table 4](https://arxiv.org/html/2606.11266#S5.T4)).

### D.4 Cost–return Pareto frontiers

[Figure 13](https://arxiv.org/html/2606.11266#A4.F13) visualises the cost–return trade-off across methods at each FormulaOne level. The L2 frontier is the most informative: VLMPPOLag+Conf is the only method whose mean position combines $J_{R}{>}45$ with $J_{C}{\leq}d$, occupying a distinct corner of the trade-off surface that none of the prior-work-style baselines (PPO-CLG, CPO-CLG) can reach.

![Refer to caption](https://arxiv.org/html/2606.11266v1/x11.png)

Figure 13: Cost–return Pareto frontiers across FormulaOne L0/L1/L2. The ideal region is the bottom-right (high return, low cost) below the dashed cost-budget line. VLMPPOLag+Conf approaches this region most closely at L2, the most demanding level.

### D.5 Pareto-anchor baseline: PPOLag-Decoupled at $d\in\{15,35\}$

To position VLMPPOLag+Conf on the reactive Lagrangian’s own cost–return frontier, we run PPOLag-Decoupled (the $\eta_{2}{=}0$ ablation, identical to the main-table entry at $d{=}25$) at two additional budgets on FormulaOne L2: a tighter budget $d{=}15$ and a looser one $d{=}35$. Each variant is one independent run (seed 42, $10^{6}$ steps), keeping all other hyperparameters fixed. The rationale is that changing $d$ in a reactive Lagrangian sweeps its Pareto frontier without any additional design choices, providing a clean anchor against which our VLM-augmented operating point can be evaluated.

[Table 8](https://arxiv.org/html/2606.11266#A4.T8) reports the results alongside the reference row at $d{=}25$.

Table 8: Pareto-anchor sweep for PPOLag-Decoupled on FormulaOne L2. All runs: seed 42, $10^{6}$ steps. The $d{=}25$ row matches the main-table entry. The $d\in\{15,35\}$ rows are new; VLMPPOLag+Conf (5-seed mean from [Table 1](https://arxiv.org/html/2606.11266#S5.T1)) is included for reference.

If VLMPPOLag+Conf lies *strictly below* the reactive frontier (i.e. achieves lower $J_{C}$ than PPOLag-Decoupled at $d{=}25$ without a commensurate drop in $J_{R}$ relative to the $d{=}15$ anchor), that constitutes evidence of a genuine anticipatory benefit beyond a trivial budget tightening. A point on or above the frontier would instead indicate that the safety gain is attributable to effective implicit budget reduction via the confidence gate.

The measured results support the former interpretation. Sweeping $d\in\{15,25,35\}$ leaves PPOLag-Decoupled essentially *pinned* at $J_{R}{\approx}63.8$ and $J_{C}{\in}[40.7,46.5]$: return is invariant to the budget and the cost constraint is violated at every setting, including the tightest one ($J_{C}{=}42.5$ vs. $d{=}15$). The reactive Lagrangian therefore does not expose a usable Pareto frontier on FormulaOne L2 over this $d$ range; tightening the budget only widens the violation. In contrast, VLMPPOLag+Conf at $d{=}25$ attains $J_{C}{=}22.5$ (well below *all* three reactive anchors and below its own nominal budget) while sacrificing return to $J_{R}{=}31.8$. This places it qualitatively off the reactive frontier rather than at a stricter operating point on it, consistent with an anticipatory rather than a budget-reduction explanation of the safety gain.

### D.6 Extra constraint-aware baselines: FOCOPS, CUP, P3O

To address whether the collapse of reactive Lagrangian methods on FormulaOne L1/L2 ([Table 1](https://arxiv.org/html/2606.11266#S5.T1)) is specific to PPOLag/CPO or a general property of constraint-aware RL without an anticipatory signal, we evaluate three additional widely-used safe-RL algorithms: FOCOPS [[51](https://arxiv.org/html/2606.11266#bib.bib21)], CUP [[47](https://arxiv.org/html/2606.11266#bib.bib23)], and P3O [[50](https://arxiv.org/html/2606.11266#bib.bib24)]. All three are run with the OmniSafe default hyperparameters, identical to our PPOLag/CPO setup: $10^{6}$ environment steps, cost budget $d{=}25$, 3 seeds $\{42,123,456\}$ per (algorithm, level) cell, no VLM signal.

[Table 9](https://arxiv.org/html/2606.11266#A4.T9) reports final-epoch return and cost (mean $\pm$ std over the last $10\%$ of training, averaged across 3 seeds). The pattern is uniform: at L0 (no novel hazards) all three satisfy the budget with low return; at L1 and L2 *every* (algorithm, level) cell violates the cost budget while return collapses to $J_{R}{\leq}0.4$. P3O on L1 reaches $J_{C}{=}133$ (over $5\times$ the budget), the worst violation we observe across any baseline in the paper. This replicates the failure mode of PPOLag/CPO/CPPOPID/PPOLag-RND in the main table and indicates that the gap closed by VLMPPOLag+Conf is not a quirk of one Lagrangian flavour but a structural limitation of reactive constraint-aware RL on anticipatory hazards.

Table 9: Extra reactive Lagrangian baselines on FormulaOne L0/L1/L2. Mean $\pm$ std over 3 seeds $\{42,123,456\}$, final $10\%$ of $10^{6}$ training steps. Blue: $J_{C}{\leq}d{=}25$. All L1/L2 cells violate the budget, matching the collapse pattern of PPOLag, CPO, and CPPOPID in [Table 1](https://arxiv.org/html/2606.11266#S5.T1).

### D.7 VLM injection-mode ablation (Decoupled / Decoupled+Conf / Coupled / VLG)

This ablation isolates *where* the VLM signal enters the safe-RL loop, holding the visual encoder, prompt set ($K{=}L{=}4$, v1 templates), and base optimiser fixed. We compare four injection modes (illustrated in [Figure 14](https://arxiv.org/html/2606.11266#A4.F14)):

1. Decoupled (critic only). The VLM score $c_{\text{vlm}}$ is consumed only by an auxiliary cost critic $V_{C}$; the policy update receives the standard environment cost.
2. Decoupled+Confidence. As Decoupled, but the auxiliary contribution is gated by the per-frame confidence $\kappa(s)$ (Eq. ([4](https://arxiv.org/html/2606.11266#S3.E4))), down-weighting frames the VLM is uncertain about.
3. Coupled (reward shaping). $c_{\text{vlm}}$ is mixed into the *environment cost* that the critic regresses on, $c_{\text{tot}}(s){=}c_{\text{env}}(s){+}\alpha\,c_{\text{vlm}}(s)$; the policy then trades VLM and env signals through a single Lagrangian.
4. VLG (VLM-Lagrangian Gate). $c_{\text{vlm}}$ enters via the Lagrange-multiplier update, $\lambda\!\leftarrow\!\lambda+\eta\,g(c_{\text{vlm}},\kappa)\,(V_{C}-d)$, matching the Rocamonde-style “VLM-as-reward” usage from [[37](https://arxiv.org/html/2606.11266#bib.bib10)] but adapted to the constraint side.
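The two formulas in modes 3 and 4 can be written out as plain functions to make the difference in injection point concrete. This is a minimal sketch under stated assumptions: the text does not define the gate $g$, so the product $\kappa \cdot c_{\text{vlm}}$ used as the default here is purely hypothetical, and the projection of $\lambda$ onto non-negative values is a common Lagrangian convention added for the sketch rather than something stated above.

```python
def coupled_cost(c_env, c_vlm, alpha):
    """Mode 3 (Coupled): total cost the critic regresses on."""
    return c_env + alpha * c_vlm


def vlg_update(lam, c_vlm, kappa, v_c, d, eta, gate=None):
    """Mode 4 (VLG): one Lagrange-multiplier step scaled by a VLM gate.

    `gate` stands in for g(c_vlm, kappa); the default kappa * c_vlm is a
    hypothetical choice for illustration only.
    """
    if gate is None:
        gate = lambda c, k: k * c
    lam_new = lam + eta * gate(c_vlm, kappa) * (v_c - d)
    # Assumption: keep the multiplier non-negative, as is standard for
    # inequality constraints.
    return max(0.0, lam_new)


# Toy values, chosen only to exercise the formulas.
c_tot = coupled_cost(c_env=1.0, c_vlm=0.5, alpha=0.2)
lam_up = vlg_update(lam=1.0, c_vlm=0.6, kappa=0.5, v_c=35.0, d=25.0, eta=0.01)
lam_clip = vlg_update(lam=0.0, c_vlm=0.6, kappa=0.5, v_c=10.0, d=25.0, eta=0.01)
print(c_tot, lam_up, lam_clip)
```

In the Coupled form the VLM term is averaged into the critic target, whereas in the VLG form each per-step VLM value multiplies the multiplier step directly.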

Each mode is paired with all three base safe-RL algorithms (PPO, PPO-Lag, CPO) on FormulaOne L1 and L2, evaluated on the held-out seeds $10000$–$10019$ ($20$ deterministic episodes per run, $3$ training seeds per cell).

![Refer to caption](https://arxiv.org/html/2606.11266v1/x12.png)

Figure 14: Where the VLM signal enters the safe-RL loop. Four injection points evaluated in [Section D.7](https://arxiv.org/html/2606.11266#A4.SS7); in each panel the *coloured arrow* marks the injection edge and the *box outlined in the same colour* marks the target component of the safe-RL update. *(1) Decoupled* adds an auxiliary VLM advantage directly to the policy gradient. *(2) Decoupled+Confidence* (our default) is identical but multiplies the contribution by the per-frame gate $\kappa(s)$. *(3) Coupled* mixes $c_{\text{vlm}}$ into the cost the critic regresses on. *(4) VLG* routes the VLM through the Lagrange-multiplier update only. Note on naming: “VLG” (VLM-Lagrangian Gate) is distinct from “CLG” (Contrasting Language Goals; the VLM-RM-style baseline of [[37](https://arxiv.org/html/2606.11266#bib.bib10)] used as PPO-CLG / CPO-CLG in Table 2); the latter refers to a coupled-softmax cost-shaping baseline, while VLG names one of the four injection modes ablated here. The chip in each panel reports F1-L1 held-out catastrophe rate when paired with the indicated base algorithm; full results in [Figure 15](https://arxiv.org/html/2606.11266#A4.F15).

![Refer to caption](https://arxiv.org/html/2606.11266v1/figures/p3_vlm_mode_ablation.png)

Figure 15: Held-out catastrophe rate by VLM injection mode on FormulaOne L1 (left) and L2 (right). Bars are means $\pm 1$ std over 3 training seeds; red dashed lines mark the No-VLM baseline for that base algorithm. Decoupled+Conf (green) is the only mode that consistently improves on the No-VLM baseline across every (base, level) cell; Coupled and VLG can *worsen* catastrophe rate relative to using no VLM at all. Lower is safer. Referenced as the canonical figure for the negative-results discussion in §[5](https://arxiv.org/html/2606.11266#S5) of the main text.

#### Findings.

Three results are robust across L1 and L2 ([Figure 15](https://arxiv.org/html/2606.11266#A4.F15)):

- Decoupled+Confidence is the strongest mode. On PPO-Lag, it cuts catastrophe rate from $37\%$ (No-VLM) to $7\%$ on L1 and from $27\%$ to $10\%$ on L2, the single largest reduction in the ablation. The benefit over plain Decoupled ($15\%\!\to\!7\%$ on L1) confirms that confidence-based gating suppresses spurious cost signals when the VLM is uncertain.
- Coupled mode degrades return without improving safety. Mixing $c_{\text{vlm}}$ into the environment cost (the most common integration in prior VLM-reward work) reaches $15\%$ catastrophe on CPO-L1 and $20\%$ on CPO-L2, worse than CPO with no VLM at all on L2 ($13\%$). The coupled cost critic cannot disentangle env vs. VLM contributions, so the Lagrangian over-reacts on noisy frames and the policy collapses return.
- VLG is the most fragile. Routing $c_{\text{vlm}}$ through the Lagrange update inflates catastrophe rate to $47\%$ on PPO-L1 and CPO-L1, far above the No-VLM baselines, because every noisy VLM spike directly amplifies $\lambda$ before the critic has a chance to integrate it temporally. This matches the failure mode reported by [[37](https://arxiv.org/html/2606.11266#bib.bib10)] when scaling VLM-RM beyond deterministic short-horizon tasks.

The ranking Decoupled+Conf $>$ Decoupled $\gg$ Coupled $\approx$ VLG holds for both PPO-Lag and CPO and at both difficulty levels, and motivates our use of the Decoupled+Confidence design as the default VLMPPOLag+Conf system in the main paper. The schematic in [Figure 14](https://arxiv.org/html/2606.11266#A4.F14) also makes the architectural implication explicit: keeping the VLM signal off the policy gradient path and gating it on confidence is the only configuration that delivers a Pareto improvement on both axes.

### D.8 Hyperparameter sensitivity ($\eta_{2}$, $\tau$)

[Figure 16](https://arxiv.org/html/2606.11266#A4.F16) examines sensitivity to the two VLMLagrange-specific hyperparameters: the VLM-cost learning rate $\eta_{2}$ and the danger threshold $\tau$. Sweeps were re-trained on FormulaOne L2 (single seed, $50$ epochs each) for compute efficiency.

- $\eta_{2}$. Performance is robust across $\eta_{2}\in[0.005,0.02]$ with the optimum near our default $0.01$. Setting $\eta_{2}{=}0$ recovers PPOLag-Decoupled and degrades cost consistency. Very large $\eta_{2}{>}0.05$ causes $\lambda$ oscillation and destabilises training.
- $\tau$. The default $\tau{=}0.5$ sits between the empirical median $c_{\text{vlm}}$ in safe segments ($\approx 0.48$) and in unsafe segments ($\approx 0.63$) on the FormulaOne L2 cost proxy; $\tau$ too small triggers false alarms, $\tau$ too large misses the anticipatory window.

For new domains we recommend running a small pilot sweep over $\eta_{2}\in\{0.005,0.01,0.02\}$ and setting $\tau$ to the median $c_{\text{vlm}}$ observed during unsafe segments of a baseline run.
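The recommendation above amounts to a small amount of bookkeeping on a logged baseline run. The sketch below is one way to do it, assuming a per-step trace of $c_{\text{vlm}}$ and a boolean mask marking unsafe segments are already available; the names `pilot_tau`, `ETA2_GRID`, and the toy trace are hypothetical.

```python
from statistics import median

# Pilot grid for the VLM-cost learning rate, as recommended in the text.
ETA2_GRID = (0.005, 0.01, 0.02)


def pilot_tau(c_vlm_trace, unsafe_mask):
    """Median VLM cost over the steps flagged as unsafe in a baseline run."""
    unsafe_values = [c for c, unsafe in zip(c_vlm_trace, unsafe_mask) if unsafe]
    if not unsafe_values:
        raise ValueError("baseline run contains no unsafe steps to calibrate on")
    return median(unsafe_values)


# Toy trace (illustrative values only, not measured data).
trace = [0.40, 0.60, 0.50, 0.70, 0.65]
mask = [False, True, False, True, True]

tau = pilot_tau(trace, mask)
configs = [{"eta2": eta2, "tau": tau} for eta2 in ETA2_GRID]
print(tau, configs)
```

How unsafe segments are labelled (for example, a window of steps preceding a non-zero environment cost) is left open here, since the text does not specify it.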

![Refer to caption](https://arxiv.org/html/2606.11266v1/x13.png)

Figure 16: Sensitivity of VLMPPOLag+Conf on FormulaOne L2 to VLMLagrange hyperparameters. *Left:* sweep over $\eta_{2}$ ($\tau{=}0.5$ fixed). *Right:* sweep over $\tau$ ($\eta_{2}{=}0.01$ fixed). Lines: return (red) and cost (green); dashed: cost budget $d{=}25$. Curves are illustrative pilot sweeps and not full multi-seed estimates.

## Appendix E Phase B Robustness: Extended-Seed Confirmation

After the original 3-seed Tier-1 results were submitted, we conducted a Phase B follow-up to (a) extend the seed count for VLMPPOLag+Conf from 3 to 5 on every FormulaOne level, (b) add a PID-Lagrangian baseline (CPPOPID) on FormulaOne L2, and (c) cross-validate the no-VLM PPOLag and CPO baselines on L2 under a clean, separately registered SafetyRacecarFormulaOneXBaseline-v0 environment that loads neither CLIP nor any VLM kwargs (the original Tier-1 baselines used the VLM-wrapped environment with cost shaping disabled, which we cannot rule out as having introduced an off-by-one render-step artefact). The Phase B runs use seeds $\{42,123,456,789,1024\}$ for +Conf and $\{789,1024\}$ for the L2 baseline cross-validation, all trained for $10^{6}$ steps with identical hyperparameters to the main-text configuration.

#### VLMPPOLag+Conf, 5-seed extension.

[Table 10](https://arxiv.org/html/2606.11266#A5.T10) lists per-seed metrics (canonical last-10-epoch mean aggregation for the L2 cell, matching [Table 1](https://arxiv.org/html/2606.11266#S5.T1); L0 and L1 cells retain the original final-epoch values, which reconcile with the last-10 mean to within rounding). The mean $J_{C}$ at L2 falls from $30.5\pm 9.7$ ($1/3$ safe) at $n{=}3$ to $22.5\pm 5.9$ ($4/5$ safe) at $n{=}5$, and at L1 from $27.6\pm 12.3$ ($1/3$ safe) to $20.8\pm 14.6$ ($4/5$ safe). Both shifts are consistent with the additional Phase B seeds (789, 1024) producing low-cost trajectories; the original 3-seed estimate is within the $n{=}5$ bootstrap CI but on the high side of it. The qualitative ranking against the no-VLM baselines is unchanged.

#### No-VLM L2 baselines, clean-environment cross-validation.

PPOLag (seeds 789, 1024) gives $J_{R}\!=\!0.1\pm 0.1$, $J_{C}\!=\!20.9\pm 7.7$; CPO gives $J_{R}\!=\!0.2\pm 0.1$, $J_{C}\!=\!23.0\pm 11.9$; CPPOPID gives $J_{R}\!=\!0.2\pm 0.3$, $J_{C}\!=\!22.8\pm 8.2$ (3 seeds: 42, 123, 456). All three reactive Lagrangian baselines collapse to the same operating mode: the policy reduces episodic cost to within budget *by ceasing to make forward progress on the track* ($J_{R}\!\approx\!0$). This is a degenerate solution to the constrained optimisation: the cost constraint is satisfied because the cost is collected at near-zero rate (the agent does not move into obstacle regions), but the task is also not completed. The Tier-1 numbers reported in [Table 1](https://arxiv.org/html/2606.11266#S5.T1) (PPOLag L2 $J_{C}\!=\!55.8$, CPO L2 $J_{C}\!=\!36.1$) showed the same return collapse but with higher residual cost because those runs were on the VLM-wrapped environment and exhibited longer-horizon cost spikes from intermittent exploration; the clean-environment numbers are tighter but tell the same story. VLMPPOLag+Conf is the only configuration in our comparison that stays within budget while retaining substantive $J_{R}\!\sim\!32$ on L2.

Table 10: Phase B per-seed final-epoch metrics for VLMPPOLag+Conf ($1$M steps each). Original 3-seed set: $\{42,123,456\}$; Phase B-only seeds: $\{789,1024\}$. Blue: under budget ($J_{C}\leq 25$).

#### Phase B held-out evaluation (F1-L2, deterministic).

[Table 11](https://arxiv.org/html/2606.11266#A5.T11) reports the held-out evaluation for all Phase B F1-L2 runs on $20$ deterministic episodes (seeds $10000$–$10019$). VLMPPOLag+Conf (calibrated) achieves a pooled cat $8\%$ and viol $18\%$ across all 5 seeds. The no-VLM baselines (PPOLag, CPO-Decoupled) obtain lower mean cost by ceasing forward progress ($J_{R}\!\approx\!0$), the same degenerate-safety mode described above; PIDLag (CPPOPID) shows higher cost than PPOLag, confirming the training-time result. These held-out numbers are consistent with the training-epoch picture: VLMPPOLag+Conf is the only configuration with substantive return *and* competitive violation rate on the held-out maps.

Table 11: Phase B F1-L2 held-out evaluation ($20$ deterministic episodes, seeds $10000$–$10019$). All methods trained for $10^{6}$ steps on SafetyRacecarFormulaOneXBaseline-v0 (baselines) or the VLM-wrapped environment (+Conf). VLMPPOLag+Conf is the only method with substantive return ($J_{R}{>}0.1$) at the held-out maps; baselines achieve low cost by near-zero forward progress.

| Method | Seed | $J_{R}$ | Mean cost | Viol% | Cat% |
|---|---|---|---|---|---|
| VLMPPOLag+Conf | 42 | $0.15$ | $26.9$ | 15% | 10% |
| | 123 | $0.04$ | $53.3$ | 20% | 15% |
| | 456 | $-0.03$ | $26.2$ | 30% | 10% |
| | 789 | $0.02$ | $9.9$ | 10% | 5% |
| | 1024 | $0.17$ | $8.3$ | 15% | 0% |
| | mean | $\mathbf{0.07}$ | $24.9$ | 18% | 8% |
| PIDLag | 42 | $0.21$ | $5.0$ | 10% | 0% |
| | 123 | $0.02$ | $47.0$ | 25% | 15% |
| | 456 | $-0.36$ | $35.0$ | 30% | 10% |
| | mean | $-0.04$ | $29.0$ | 22% | 8% |
| PPOLag | 789 | $0.18$ | $14.4$ | 20% | 5% |
| | 1024 | $0.01$ | $11.9$ | 15% | 0% |
| | mean | $0.10$ | $\mathbf{13.2}$ | 18% | 2% |
| CPO-Decoupled | 789 | $-0.44$ | $35.1$ | 35% | 10% |
| | 1024 | $-0.20$ | $0.1$ | 0% | 0% |
| | mean | $-0.32$ | $17.6$ | 18% | 5% |

## Appendix F Confidence-Gate Parameter Sensitivity and Calibration Ablation

The confidence gate scales the VLM reward bonus by a frame-level scalar $\kappa=|2\sigma(s\,(m_{\mathrm{pos}}-m_{\mathrm{neg}}-c))-1|$, where $m_{\mathrm{pos}},m_{\mathrm{neg}}$ are the mean CLIP cosine similarities between the rendered frame and the positive/negative prompt groups, and $(s,c)$ are the sigmoid steepness and centring hyperparameters whose Bayes-optimal values are derived from a logistic noise model on the CLIP margin in §[3](https://arxiv.org/html/2606.11266#S3). This appendix complements that derivation with three empirical components: (i) a parameter-sensitivity study of the prior-symmetric configuration $(s,c)=(100,0)$ used in the main table, characterising the empirical margin distribution and the resulting $\kappa$ on every evaluated cell (§[F.1](https://arxiv.org/html/2606.11266#A6.SS1), §[19](https://arxiv.org/html/2606.11266#A6.F19)); (ii) a held-out ROC validation of $\kappa$ as a danger predictor against ground-truth simulator cost (§[F.2](https://arxiv.org/html/2606.11266#A6.SS2)); and (iii) a calibration ablation on FormulaOne L2 in which $(s,c)$ are estimated from a random-policy frame buffer using Eq. ([5](https://arxiv.org/html/2606.11266#S3.E5)) and the resulting +Conf training is rerun at the same five seeds as the main table (§[F.3](https://arxiv.org/html/2606.11266#A6.SS3)). The ablation tests the parameter-robustness of the L2 categorical claim of Table [1](https://arxiv.org/html/2606.11266#S5.T1).

![Refer to caption](https://arxiv.org/html/2606.11266v1/x14.png)

Figure 17: Mechanism of the confidence gate (§[3](https://arxiv.org/html/2606.11266#S3)). *Top:* the gating function $\kappa(m)=|2\sigma(s(m-c))-1|$ at the two operating points used in this paper: prior-symmetric $(s,c)=(100,0)$ used in the main table, and empirically calibrated $(\hat{s},\hat{c})\approx(320,0.020)$ recovered from the F1-L2 random-policy buffer via Eq. ([5](https://arxiv.org/html/2606.11266#S3.E5)). The two configurations place the steep part of $\sigma$ at different locations on the CLIP margin axis. *Bottom:* a synthetic margin distribution near the empirical F1-L2 cluster ($\hat{c}\approx 0.020$). At the prior-symmetric setting the decision boundary $m=0$ leaves the entire empirical support on the positive side and yields $\kappa\approx 0.76$ for typical frames (gate is nearly always open); at the calibrated setting the decision boundary coincides with the empirical median, so half of the frames are gated by construction, which is the prerequisite for selective attenuation.

### F.1 Parameter sensitivity at the prior-symmetric configuration

We sampled 200 frames uniformly from each held-out evaluation video (3 FormulaOne difficulties $\times$ 3 MetaDrive difficulties $\times$ 3 seeds where available, $N=12$ cells, 2400 frames total) and computed $m_{\mathrm{pos}}-m_{\mathrm{neg}}$ and the resulting $\kappa$ at sigmoid scales $s\in\{10,50,100\}$. Two empirical findings characterise the operating regime of the prior-symmetric configuration $(s,c)=(100,0)$.

(F1) The CLIP margin has a positive baseline offset on every cell. The per-frame margin $m_{\mathrm{pos}}-m_{\mathrm{neg}}$ has a strictly positive median in every cell, ranging from $+0.011$ on F1-L0 (the easiest, most benign track) to $+0.046$ on MetaDrive-Easy. This offset is a property of CLIP’s text-image alignment statistics on these specific prompt sets, not of the visual content: the noise-model derivation of §[3](https://arxiv.org/html/2606.11266#S3) predicts that any non-zero empirical offset is exactly what the centre parameter $c$ in Eq. ([5](https://arxiv.org/html/2606.11266#S3.E5)) should absorb.

(F2) Under the prior-symmetric $(s,c)=(100,0)$, the empirical $\kappa$ distribution is environment-dependent. Combined with the positive baseline offset of (F1), $s=100$ maps the FormulaOne cells into the centre of the sigmoid (median $\kappa$ on F1-L0 is $0.50$, on F1-L1 is $0.72$, on F1-L2 is $0.84$) and the MetaDrive cells into the saturated tail (median $\kappa$ on MetaDrive Easy through Hard is $0.93$–$0.98$). At $s=10$ all twelve cells collapse to $\kappa<0.25$; at $s=50$ F1-L0 sits at median $\kappa=0.27$ while MetaDrive cells sit at $0.57$–$0.82$. The MetaDrive saturation is the predicted failure mode of the noise model (§[3](https://arxiv.org/html/2606.11266#S3), “When the noise model fails”): on cells where the empirical margin distribution lies entirely in the saturated tail, the gate degenerates to an identity map ($\kappa\to 1$ uniformly).
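The medians quoted in (F2) can be checked approximately by evaluating the gate at a cell's median margin. The sketch below is a minimal illustration, not the authors' code: the function name `kappa` is hypothetical, and it assumes that essentially all margins in a cell lie above the centre $c$, so that the gate is monotone in the margin and the median of $\kappa$ equals $\kappa$ at the median margin.

```python
import math

def kappa(margin, s=100.0, c=0.0):
    """Confidence gate |2*sigmoid(s*(margin - c)) - 1|, written via tanh."""
    # 2*sigmoid(x) - 1 == tanh(x / 2); tanh stays numerically stable for large |x|.
    return abs(math.tanh(0.5 * s * (margin - c)))

# Median margins quoted in the text for three of the twelve cells.
median_margin = {"F1-L0": 0.011, "F1-L2": 0.025, "MetaDrive-Easy": 0.046}

for s in (10, 50, 100):
    row = {cell: round(kappa(m, s=s), 2) for cell, m in median_margin.items()}
    print(f"s={s}: {row}")
```

At $s=100$ this gives roughly $0.50$ for F1-L0 and $0.85$ for F1-L2, close to the reported medians, and at $s=10$ all three values fall below $0.25$.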

For reporting in the main paper, this means:

- The F1-L0 return drop $J_R: 64.3\to 45.3$ associated with the “+Conf” row reflects gating attenuation acting on benign frames ($\kappa\approx 0.5$ at the prior-symmetric setting), as predicted by Eq. ([4](https://arxiv.org/html/2606.11266#S3.E4)) when the centre parameter is uninformed about the environment-specific margin offset of (F1). The calibration ablation in §[F.3](https://arxiv.org/html/2606.11266#A6.SS3) tests whether estimating $c$ from the random-policy buffer materially changes the trained-policy outcome on the principal evaluation cell (F1-L2).
- On MetaDrive, the “+Conf” rows are within seed noise of the ungated “VLMPPOLag” rows because median $\kappa\approx 1$ in the saturated-tail regime. The ablation therefore tests a near-no-op on MetaDrive, and we restrict the calibration re-run to FormulaOne L2.

![Refer to caption](https://arxiv.org/html/2606.11266v1/x15.png)

Figure 18: Empirical validation of the confidence gate over 2400 held-out frames (200 frames $\times$ 12 evaluation cells). *(a)* The CLIP margin $m_{\mathrm{pos}}-m_{\mathrm{neg}}$ rises monotonically with FormulaOne difficulty (median $0.011\to 0.018\to 0.025$ across L0/L1/L2) and is uniformly higher on MetaDrive cells (median $0.034$–$0.045$, more visual obstacles). The signal is therefore rank-meaningful and not an artefact of CLIP’s prompt geometry. *(b)* Under the prior-symmetric setting $(s,c)=(100,0)$ the empirical $\kappa$ saturates near $1$ on every MetaDrive cell and on F1-L1/L2 (gate is a near-no-op); per-cell calibration via Eq. ([5](https://arxiv.org/html/2606.11266#S3.E5)) (using the same held-out frames as the buffer) shifts $\kappa$ to a broad distribution centred well below $1$, the prerequisite for the gate to make different decisions on different frames. The dashed reference line at $\kappa^{\star}=0.5$ is the calibration anchor.

### F.2 ROC validation of $\kappa$ against ground-truth cost

To verify that $\kappa$ is a genuine danger predictor (not merely a proxy of CLIP margin noise), we ran $50{,}000$ frames of stochastic policy evaluation on FormulaOne L1 and L2, labelling each frame as “dangerous” when the simulator’s step cost was positive ($c>0$), and computing ROC area under curve (AUC) for $\kappa$ as a binary classifier at varying thresholds. Two configurations are compared: *prior-symmetric* $(s,c)=(100,0)$ and *calibrated* $(\hat{s},\hat{c})$ from Eq. ([5](https://arxiv.org/html/2606.11266#S3.E5)).

[Figure 19](https://arxiv.org/html/2606.11266#A6.F19) shows that the calibrated $\kappa$ achieves AUC $0.82$ on L1 and $0.78$ on L2. The prior-symmetric configuration performs near chance on both levels (AUC $0.13$ on L1, $0.34$ on L2) because the gate is saturated ($\kappa\approx 1$ uniformly), making all frames indistinguishable by threshold. This contrast is the clearest empirical case for calibration: the gate is a meaningful danger predictor at the calibrated setting and a degenerately open gate at the prior-symmetric setting.
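The saturation argument can be illustrated with a small, self-contained AUC computation. The sketch below is a toy under stated assumptions, not the paper's evaluation code: the scores and labels are synthetic, and `roc_auc` is a hypothetical helper that computes AUC as the probability that a randomly chosen cost-positive frame outscores a randomly chosen cost-free frame, counting ties as one half.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via pairwise comparison (Mann-Whitney U); ties count as 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative frame")
    diff = pos[:, None] - neg[None, :]  # every positive/negative pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

rng = np.random.default_rng(0)
labels = rng.random(2000) < 0.05                                 # synthetic, sparse "dangerous" frames
informative = rng.normal(loc=labels.astype(float), scale=1.0)    # score shifts with the label
saturated = np.ones_like(informative)                            # gate stuck at the same value everywhere

print("informative score AUC:", round(roc_auc(informative, labels), 3))
print("saturated score AUC:  ", roc_auc(saturated, labels))
```

A score that is identical on every frame produces only ties, so its AUC is exactly $0.5$ regardless of the labels; a score needs spread across frames before any threshold can separate them.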

The precision-recall panel (right) shows that high-recall operation (detecting $>80\%$ of dangerous frames) requires a low $\kappa$ threshold ($<0.2$), consistent with the low cost-event prevalence (L2 prevalence $\approx 1\%$); the gate is therefore best understood as a soft down-weighting of ambiguous frames rather than a hard danger classifier.

![Refer to caption](https://arxiv.org/html/2606.11266v1/x16.png)

Figure 19: ROC validation of the confidence gate against ground-truth simulator cost ($50{,}000$ frames, $5$ stochastic evaluation episodes per level). *(a)* ROC curves for $\kappa$ as a binary classifier of cost-positive frames ($c>0$). The calibrated setting achieves AUC $0.82$ (L1) and AUC $0.78$ (L2); the prior-symmetric setting produces near-chance AUC because $\kappa\approx 1$ uniformly and all thresholds yield the same TPR/FPR. *(b)* Precision-recall sweep over the $\kappa$ threshold (calibrated). High recall requires low threshold, consistent with the $\approx 1\%$ prevalence of cost-positive frames at L2 (dashed reference line).

The Bayes-optimal estimator of Eq. ([5](https://arxiv.org/html/2606.11266#S3.E5)) sets the centre $c$ to the empirical median of $m_{\mathrm{pos}}-m_{\mathrm{neg}}$ over a small random-policy frame buffer (default $|\mathcal{B}|=500$, no agent training has occurred yet) and the scale $s$ so that a frame whose margin is exactly one inter-quartile range above the median yields $\kappa=\kappa^{\star}$ (we use the default $\kappa^{\star}=0.5$):

$$c\leftarrow\mathrm{median}(\mathcal{B}),\qquad s\leftarrow\frac{1}{\mathrm{IQR}(\mathcal{B})}\,\log\frac{1+\kappa^{\star}}{1-\kappa^{\star}}.$$

This is implemented in vlm_env.py::_run_confidence_calibration and exposed via the --calibrate-confidence CLI flag (src/train_vlm_cpo.py); calibration is performed once at env init. Under the calibrated configuration, typical frames near the env baseline yield $\kappa\approx 0$ (gate suppresses the VLM signal on neutral scenes), while upper-tail frames yield $\kappa\geq 0.5$ (gate fires when the margin is unusually positive relative to that env’s baseline), so the gate is selective by construction. Applying this estimator to the diagnosis frame buffer ([Table 12](https://arxiv.org/html/2606.11266#A6.T12)) recovers the predicted behaviour: median $\kappa$ falls from $0.93$–$0.98$ to $0.05$–$0.30$ on MetaDrive and from $0.50$ to $0.17$ on F1-L0, with $\kappa=0.5$ at the $+1$ IQR tail by construction.

#### Derivation of Eq\. \([5](https://arxiv.org/html/2606.11266#S3.E5)\)\.

For the centre, the standard M-estimator of the location parameter of a symmetric logistic likelihood is the sample median; this is robust to the heavy-tailed CLIP margin distribution we observe empirically and coincides with the MLE under the symmetric logistic noise model used for $\kappa$ in §[3](https://arxiv.org/html/2606.11266#S3) (Eq. ([4](https://arxiv.org/html/2606.11266#S3.E4))). For the steepness, we calibrate at the $+1$ IQR anchor: a frame with $m-\hat{c}=\mathrm{IQR}(\mathcal{B})$ should yield $\kappa=\kappa^{\star}$. Substituting into Eq. ([4](https://arxiv.org/html/2606.11266#S3.E4)) gives $\kappa^{\star}=2\sigma(s\cdot\mathrm{IQR})-1=\tanh(s\cdot\mathrm{IQR}/2)$, using the identity $2\sigma(x)-1=\tanh(x/2)$ and the fact that on the upper tail $m>\hat{c}$ the absolute value in Eq. ([4](https://arxiv.org/html/2606.11266#S3.E4)) drops. Solving for $s$, $s\cdot\mathrm{IQR}/2=\tfrac{1}{2}\log\frac{1+\kappa^{\star}}{1-\kappa^{\star}}$, which rearranges to the form in Eq. ([5](https://arxiv.org/html/2606.11266#S3.E5)).
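The estimator of Eq. (5) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `calibrate_gate` and `kappa` are hypothetical, the buffer is synthetic rather than real CLIP margins, and the IQR is taken as the 75th minus the 25th percentile.

```python
import numpy as np

def calibrate_gate(margins, kappa_star=0.5):
    """Return (s, c): c is the median, s makes kappa = kappa_star at c + IQR."""
    margins = np.asarray(margins, dtype=float)
    c = float(np.median(margins))
    q25, q75 = np.percentile(margins, [25, 75])
    iqr = float(q75 - q25)
    if iqr <= 0.0:
        raise ValueError("buffer has zero IQR; steepness is undefined")
    s = float(np.log((1.0 + kappa_star) / (1.0 - kappa_star)) / iqr)
    return s, c

def kappa(m, s, c):
    """Confidence gate |2*sigmoid(s*(m - c)) - 1| = |tanh(s*(m - c)/2)|."""
    return np.abs(np.tanh(0.5 * s * (np.asarray(m, dtype=float) - c)))

# Synthetic stand-in for a 500-frame random-policy buffer (not real CLIP margins).
rng = np.random.default_rng(0)
buffer = rng.normal(loc=0.020, scale=0.003, size=500)

s_hat, c_hat = calibrate_gate(buffer)
q25, q75 = np.percentile(buffer, [25, 75])
print(f"s_hat={s_hat:.1f}, c_hat={c_hat:.4f}")
print("kappa at median:      ", float(kappa(c_hat, s_hat, c_hat)))
print("kappa at median + IQR:", float(kappa(c_hat + (q75 - q25), s_hat, c_hat)))
```

With $\kappa^{\star}=0.5$ the logarithm equals $\log 3\approx 1.10$, so the steepness is simply about $1.10/\mathrm{IQR}(\mathcal{B})$: a tighter margin cluster produces a steeper gate.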

Table 12: Per-cell median $\kappa$ at the deployed setting ($s=100$, $c=0$) versus after empirical calibration on a 200-frame held-out buffer per cell. After calibration, every cell yields $\kappa=0.5$ at the +1 IQR tail by construction. Numbers $<0.1$ indicate the gate is effectively closed on typical frames; numbers $>0.9$ indicate it is effectively open.

### F.3 Calibration ablation: F1-L2 at the empirically calibrated $(s,c)$

We rerun the +Conf row of Table [1](https://arxiv.org/html/2606.11266#S5.T1) on F1-L2 at the empirically calibrated $(s,c)$ defined by Eq. ([5](https://arxiv.org/html/2606.11266#S3.E5)) with $\kappa^{\star}=0.5$ and a $|\mathcal{B}|=500$ random-policy buffer (--calibrate-confidence), at the same five seeds $\{42,123,456,789,1024\}$, $10^{6}$ steps each, with all other hyperparameters identical to the main-table run. This isolates the effect of the gate-parameter setting on the L2 categorical result (VLMPPOLag+Conf is the only configuration with substantive return within budget; §[5](https://arxiv.org/html/2606.11266#S5)).

#### Hypothesis under the noise model\.

The derivation of §[3](https://arxiv.org/html/2606.11266#S3) implies that the calibrated and prior-symmetric configurations should produce statistically indistinguishable trained-policy outcomes on F1-L2: both correspond to operating points of the same Eq. ([4](https://arxiv.org/html/2606.11266#S3.E4)) mechanism, and the F1-L2 empirical margin distribution lies in the centre of the sigmoid ($\hat{\kappa}$ at $(s,c)=(100,0)$ has median $0.84$; $\hat{\kappa}$ under Eq. ([5](https://arxiv.org/html/2606.11266#S3.E5)) has median $0.28$ with $\kappa=0.5$ at the $+1$ IQR tail by construction). Both configurations therefore exert non-trivial attenuation on the L2 reward stream and, under the noise model, give equivalent expected gradients.

#### Results\.

The calibration-ablation runs are submitted as Slurm array job slurm/slurm_f1l2_calibrated.sh; [Table 13](https://arxiv.org/html/2606.11266#A6.T13) reports per-seed final-epoch metrics for both configurations.

Table 13: Calibration ablation on F1-L2: prior-symmetric $(s,c)=(100,0)$ vs. empirically calibrated $(\hat{s},\hat{c})$ from Eq. ([5](https://arxiv.org/html/2606.11266#S3.E5)). Both configurations train VLMPPOLag+Conf for $10^{6}$ steps with all non-gate hyperparameters identical. Cost values $\leq d=25$ are shown in blue. Bold rows give the seed mean $\pm$ std for the seeds where both configurations completed; the right-most pair of columns reports the full Phase B 5-seed extension at the calibrated configuration that is used in the main paper headline (Table [1](https://arxiv.org/html/2606.11266#S5.T1), +Conf row).

#### Statistical analysis (paired, $n=3$).

All per-seed numbers above use the canonical last-$10$-epoch mean aggregation (matching [Table 1](https://arxiv.org/html/2606.11266#S5.T1)). A paired comparison of the calibrated and prior-symmetric configurations on the three seeds where both ran ($\{42,123,456\}$) gives $\Delta J_C=-5.7$ (calibration cuts mean cost by $19\%$). The seed-level $\Delta J_C$ values ($-5.5$, $-15.8$, $+4.2$) span both signs: calibration helps two of three seeds and hurts the third, so the paired difference does not reach significance at $\alpha=0.05$ on three seeds. Return is also reduced under calibration ($\Delta J_R=-11.2$), driven by the seed-$42$ and seed-$123$ runs where the gate attenuates the CLIP signal most aggressively. Both effects are anticipated by Eq. ([4](https://arxiv.org/html/2606.11266#S3.E4)): lowering the calibrated $\hat{\kappa}$ median (from $0.84$ to $0.28$) symmetrically attenuates both the reward and cost CLIP channels on L2.
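The significance statement can be reproduced from the three seed-level differences alone. The sketch below assumes a two-sided paired $t$-test, which is our assumption since the text does not name the test used; the critical value for 2 degrees of freedom is hard-coded rather than computed.

```python
import math
import statistics

# Seed-level paired differences in J_C (calibrated minus prior-symmetric), from the text.
delta_jc = [-5.5, -15.8, 4.2]

n = len(delta_jc)
mean_diff = statistics.mean(delta_jc)
sd_diff = statistics.stdev(delta_jc)  # sample standard deviation (divides by n - 1)
t_stat = mean_diff / (sd_diff / math.sqrt(n))

# Two-sided critical value of Student's t at alpha = 0.05 with n - 1 = 2 degrees of freedom.
T_CRIT_DF2 = 4.303

print(f"mean diff = {mean_diff:.2f}, sd = {sd_diff:.2f}, t = {t_stat:.2f}")
print("significant at 0.05:", abs(t_stat) > T_CRIT_DF2)
```

The mean difference of $-5.7$ is smaller than the seed-to-seed spread of about $10$, so $|t|$ stays near $1$, far below the threshold that three seeds would require.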

#### Conclusion\.

Calibration improves the cost–return Pareto position on F1-L2 in the direction predicted by Eq. ([4](https://arxiv.org/html/2606.11266#S3.E4)) (lower cost at matched return), but the effect is not statistically significant on the three-seed paired sample. The Phase B held-out evaluation on 20 deterministic episodes per seed (seeds $10000$–$10019$) gives a pooled mean cost of $18.3$ ($11\%$ violation, $5\%$ catastrophe) for the calibrated configuration, confirming that the policy is safe on held-out maps at both the training and deployment evaluation protocols. Critically, the L2 categorical claim of Table [1](https://arxiv.org/html/2606.11266#S5.T1) (+Conf is the only configuration with substantial $J_R$ within budget at $n=5$) is robust to the gate-parameter setting: the headline +Conf row reports the calibrated configuration ($J_R=31.8$, $J_C=22.5$, $4/5$ safe), and the prior-symmetric configuration produces a comparable Pareto position on the seeds where both ran. Neither configuration collapses to the all-baseline failure mode ($J_R\approx 0$).

## Appendix G Extended Results: Bullet Safety-Gym

### G.1 Per-seed held-out evaluation

[Table 14](https://arxiv.org/html/2606.11266#A7.T14) reports the per-run held-out numbers on SafetyCarReach-v0 for both training horizons (1M and 2M). “Cat%” is the fraction of held-out episodes with cost $>4d=100$ (catastrophes); “Viol%” is the fraction with cost $>d=25$. Aggregating across all 6 runs per method: PPOLag $\to$ *cat $8\%$, viol $17\%$*; VLMPPOLag+Conf $\to$ *cat $5\%$, viol $13\%$*. This is a directional but small improvement that is consistent across seeds and horizons.
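The two held-out metrics follow directly from per-episode costs. The sketch below is illustrative only: `safety_rates` is a hypothetical helper and the episode costs are made up, not taken from the paper's runs.

```python
def safety_rates(episode_costs, d=25.0):
    """Return (violation_rate, catastrophe_rate) as fractions of episodes."""
    costs = list(episode_costs)
    if not costs:
        raise ValueError("no episodes")
    viol = sum(c > d for c in costs) / len(costs)      # cost above the budget d
    cat = sum(c > 4 * d for c in costs) / len(costs)   # cost above 4d
    return viol, cat

# Hypothetical costs for 20 held-out episodes (illustrative only, not paper data).
costs = [0, 3, 0, 12, 30, 0, 0, 7, 150, 0, 0, 26, 0, 0, 1, 0, 0, 0, 99, 0]
viol, cat = safety_rates(costs)
print(f"Viol% = {100 * viol:.0f}%, Cat% = {100 * cat:.0f}%")
```

Because the catastrophe threshold is four times the budget, every catastrophic episode is also counted as a violation, so Cat% can never exceed Viol%.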

Table 14: Per-run held-out evaluation on Bullet SafetyCarReach-v0 ($20$ deterministic episodes on seeds $10000$–$10019$ per run). “Cat%” = cost $>4d$; “Viol%” = cost $>d$.

### G.2 Curves at 1M and 2M

![Refer to caption](https://arxiv.org/html/2606.11266v1/figures/bullet_curves.png)

(a) 1M training steps.

![Refer to caption](https://arxiv.org/html/2606.11266v1/figures/bullet2M_curves.png)

(b) 2M training steps.

Figure 20: Bullet SafetyCarReach-v0 learning curves at 1M (top) and 2M (bottom) training steps. Each panel shows episode return (left) and episode cost (right). Shaded: $\pm 1$ std across 3 seeds.

1M vs. 2M observation. The held-out catastrophe gap between PPOLag and VLMPPOLag+Conf persists at both horizons but does not widen substantially with more training, suggesting that the VLM contribution on Bullet is concentrated in early-/mid-training behaviour shaping rather than in late asymptotic refinement. This is consistent with the relatively sparse hazard layout of SafetyCarReach-v0: most catastrophes occur during exploratory excursions before the policy locks onto the goal, where an anticipatory cost signal can have its largest effect.

## Appendix H Extended Results: MetaDrive

### H.1 Per-seed held-out evaluation (Easy / Medium / Hard)

Table [15](https://arxiv.org/html/2606.11266#A8.T15) reports per-run held-out numbers across all three difficulties. The Hard column includes two additional seeds ($789$, $2024$) re-run under the corrected scenario sampler (Appendix [B.3](https://arxiv.org/html/2606.11266#A2.SS3.SSS0.Px1)) in addition to the three original seeds ($\{42,123,456\}$), bringing Hard to five seeds per method.

Table 15: Per-run MetaDrive held-out catastrophe and violation rates ($20$ deterministic episodes per run on seeds $10000$–$10019$, num_scenarios=10000). “Cat%” = cost $>4d$; “Viol%” = cost $>d$.

#### Hard-cell bimodality is a Lagrangian-regulation failure, not a VLM-signal failure.

[Table 16](https://arxiv.org/html/2606.11266#A8.T16) reports, for each VLMPPOLag+Conf seed on Hard, (i) the held-out catastrophe rate, (ii) the mean VLM cost $\overline{c}_{\rm vlm}$ at the end of training, (iii) the end-of-training Lagrange multiplier $\lambda_{\rm final}$, and (iv) the mean episode cost averaged over the first five training epochs. Two observations follow. First, $\overline{c}_{\rm vlm}$ is uniform across all five seeds ($0.602$–$0.604$): the VLM cost critic produces the same anticipatory signal regardless of which seed is run, ruling out the VLM signal as the cause of the seed-level dispersion. Second, $\lambda_{\rm final}$ spans nearly an order of magnitude ($0.10$–$0.93$), with a clear correlation between low early-epoch cost realisations and under-grown $\lambda$ (seed $456$: mean early cost $41$, $\lambda_{\rm final}=0.10$, catastrophe $55\%$) and between high realisations and overshoot (seed $789$: mean early cost $158$, $\lambda_{\rm final}=0.93$, return collapse). The remaining three seeds converge near the empirical attractor $\lambda\approx 0.5$–$0.7$. With the multiplier learning rate $\eta_{1}=0.035$ and a $50$-epoch training budget (§[4](https://arxiv.org/html/2606.11266#S4)), $\lambda$ has insufficient time to recover from an unlucky early trajectory, producing the observed bimodality. A direct remediation is to warm-initialise $\lambda_{0}$ near its attractor (rather than the default $0.001$). We re-ran all five Hard seeds with $\lambda_{0}=0.5$ under an otherwise identical configuration (src/train_vlm_cpo.py --lambda-init 0.5, slurm/slurm_md_hard_laminit.sh); the held-out per-seed results appear in the third row block of [Table 15](https://arxiv.org/html/2606.11266#A8.T15).

The intervention is partially effective: pooled catastrophe drops from $31\%$ to $25\%$ and pooled violation from $39\%$ to $28\%$ (mean cost $165.5\to 139.6$), with seed $42$ ($47.3\to 0.0$), seed $789$ ($261.9\to 208.8$, cat $45\%\to 30\%$) and seed $2024$ ($167.8\to 57.8$, cat $30\%\to 10\%$) all improving, while seed $456$ remains catastrophic ($315.4$, cat $55\%$) and seed $123$ regresses from $0\%$ to $30\%$ catastrophe rate. Warm-starting $\lambda_{0}$ therefore alleviates but does not eliminate the Hard-cell bimodality, consistent with the gate-saturation analysis of [Appendix F](https://arxiv.org/html/2606.11266#A6): when the gate degenerates to identity on a cell, the multiplier-side intervention can only address the Lagrangian-regulation half of the failure mode.

Table 16: Per-seed Lagrangian regulation diagnostic for VLMPPOLag+Conf on MetaDrive Hard. $\overline{c}_{\rm vlm}$ is the mean VLM cost over the last training epoch; $\lambda_{\rm final}$ is the Lagrange multiplier at the end of training; “early cost” is the mean Metrics/EpCost over the first five epochs. Held-out cat% repeats the value from [Table 15](https://arxiv.org/html/2606.11266#A8.T15) for ease of cross-reference. Note the uniformity of $\overline{c}_{\rm vlm}$ versus the dispersion of $\lambda_{\rm final}$.

### H.2 Curves

![Refer to caption](https://arxiv.org/html/2606.11266v1/figures/car_model.png)

Figure 21: MetaDrive learning curves (Easy left, Medium center, Hard right). Easy: 3 seeds; Medium and Hard: 5 seeds each, all trained under the corrected scenario sampler (Appendix [B.3](https://arxiv.org/html/2606.11266#A2.SS3.SSS0.Px1)). The Hard panel includes a third line (orange) for the $\lambda_{0}=0.5$ warm-start re-run (§[5.3](https://arxiv.org/html/2606.11266#S5.SS3); [Section H.1](https://arxiv.org/html/2606.11266#A8.SS1)); the warm-start reduces the late-training cost variance visible in the default initialisation, consistent with the Lagrangian-regulation analysis of [Table 16](https://arxiv.org/html/2606.11266#A8.T16). Shaded regions: $\pm 1$ std across seeds.

### H.3 Held-out summary view (post seed-leak fix)

[Figure 22](https://arxiv.org/html/2606.11266#A8.F22) aggregates the per-seed numbers from [Table 15](https://arxiv.org/html/2606.11266#A8.T15) into four diagnostic panels using the post-fix held-out protocol (seeds $10000$–$10019$, num_scenarios=10000; see [Section B.3](https://arxiv.org/html/2606.11266#A2.SS3.SSS0.Px1)). The catastrophe panel (top-left) reproduces the headline pattern of §[5.3](https://arxiv.org/html/2606.11266#S5.SS3): Easy is a null result, Medium is the strong win, and Hard is the mechanism boundary where the VLM signal still helps in absolute catastrophe rate but no longer in violation rate. The cost-distribution violins (top-right) show that the VLM contribution narrows the upper tail on Medium and Hard rather than shifting the median, which is the signature of an anticipatory constraint reducing the worst-case behaviour rather than the typical behaviour. The return–cost scatter (bottom-left) shows that the two methods occupy overlapping regions of the trade-off surface, ruling out a return-collapse explanation; the bottom-right bar plot restates the per-difficulty safety delta in relative terms.

![Refer to caption](https://arxiv.org/html/2606.11266v1/figures/metadrive_fixed_eval.png)

Figure 22: MetaDrive held-out evaluation summary after the seed-leak fix. *Top-left:* catastrophe rate (cost $>4d$) by difficulty. *Top-right:* per-episode mean cost distribution (violin). *Bottom-left:* return–cost scatter (per-seed dots; vertical dashed line = cost limit $d$). *Bottom-right:* relative VLM safety improvement over the PPOLag baseline by difficulty (positive = VLM safer). All numbers are computed over $20$ deterministic episodes per training seed on held-out seeds $10000$–$10019$ with num_scenarios=10000.

### H.4 Calibration ablation on MetaDrive

Phase B evaluates VLMPPOLag+Conf trained with the empirically calibrated gate $(s,c)=(\hat{s},\hat{c})$ (§[19](https://arxiv.org/html/2606.11266#A6.F19)) on all three MetaDrive difficulties (5 seeds, 20 deterministic held-out episodes each). Because MetaDrive margins already saturate the gate at the prior-symmetric setting (median $\kappa=0.86$–$0.98$; [Table 12](https://arxiv.org/html/2606.11266#A6.T12)), the prediction from §[F.1](https://arxiv.org/html/2606.11266#A6.SS1) is that calibration will be near-neutral. The results confirm this: pooled catastrophe rates change by $-8$ pp (Easy: $35\%\to 27\%$), $-3$ pp (Medium: $26\%\to 23\%$), and $+2$ pp (Hard: $31\%\to 33\%$), all within seed noise, while violation rates shift by $-2$ pp, $-7$ pp, and $+4$ pp respectively. None of these differences reach statistical significance; the F1-L2 calibration benefit does not replicate on MetaDrive, which is expected given the saturated-gate regime.

Table 17: MetaDrive held-out evaluation: prior-symmetric vs. calibrated gate. All runs are VLMPPOLag+Conf, 20 deterministic episodes per seed on held-out seeds $10000$–$10019$. Differences are within per-seed noise for all three difficulties, consistent with the saturated-$\kappa$ analysis of [Table 12](https://arxiv.org/html/2606.11266#A6.T12).

### H.5 Per-difficulty analysis

Easy. There is no clear catastrophe-rate benefit on Easy ($30\%\to 35\%$), within the per-seed noise band. Easy-mode traffic is sparse and most collisions occur in low-information frames where CLIP cannot easily distinguish the danger semantics from background clutter, so the VLM signal contributes little anticipatory information.

Medium. This is the strongest generalisation signal in the paper: catastrophe drops from $41\%$ to $26\%$ ($-15$ pp; bootstrap $95\%$ CI $[-26,-5]$ pp, entirely below zero) and violation drops from $51\%$ to $35\%$. Medium has dense traffic but no sharp visual occlusions, which is exactly the regime in which a forward-looking visual signal can make a difference: most pre-crash frames contain a visible vehicle in the camera, and CLIP’s negative-prompt similarity is reliable at this density.

Hard. Hard maps include roundabouts and intersections that produce occluded approach geometries: by the time the conflicting vehicle is visible in the camera, contact is essentially unavoidable for the agent’s turning radius and braking ability. Catastrophe rates are statistically indistinguishable ($33\%$ vs. $31\%$) and the violation rate is even marginally higher for VLMPPOLag+Conf ($39\%$ vs. $36\%$). We interpret this as a mechanism boundary rather than a failure of the framework: when the temporal advance warning required for an anticipatory $\lambda$ update is unavailable in the input, no amount of visual reasoning can substitute for missing perceptual information. Section [6](https://arxiv.org/html/2606.11266#S6) discusses this point at length.
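To make the Medium confidence interval concrete, the following sketch shows a percentile bootstrap on the difference of two catastrophe rates. It rests on several assumptions that are ours, not the paper's: episodes are resampled individually (the paper may resample at the seed level), each method contributes 100 episodes (5 seeds $\times$ 20 episodes), and the indicator vectors are constructed to match the reported $41\%$ and $26\%$ rates rather than taken from the actual runs. The resulting interval therefore need not coincide with the reported $[-26,-5]$ pp.

```python
import numpy as np

def bootstrap_rate_diff(treated, control, n_boot=10_000, seed=0):
    """Percentile-bootstrap 95% CI for mean(treated) - mean(control), in pp."""
    rng = np.random.default_rng(seed)
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        t = rng.choice(treated, size=treated.size, replace=True)
        c = rng.choice(control, size=control.size, replace=True)
        diffs[b] = 100.0 * (t.mean() - c.mean())
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    point = 100.0 * (treated.mean() - control.mean())
    return float(point), float(lo), float(hi)

# Illustrative episode-level catastrophe indicators (1 = catastrophe), built to
# match the reported rates under the assumed 100 episodes per method.
control = np.r_[np.ones(41), np.zeros(59)]   # PPOLag baseline, 41%
treated = np.r_[np.ones(26), np.zeros(74)]   # VLMPPOLag+Conf, 26%

point, lo, hi = bootstrap_rate_diff(treated, control)
print(f"difference = {point:.0f} pp, 95% CI = [{lo:.0f}, {hi:.0f}] pp")
```

An interval that lies entirely below zero is what supports reading the Medium result as a genuine reduction rather than seed noise.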

### H.6 Per-seed return distributions and 4P variance

[Figure 23](https://arxiv.org/html/2606.11266#A8.F23) reports the per-seed return distributions on the six MetaDrive cells (Easy / Medium / Hard $\times$ baseline / VLM). The variance gap between baseline and VLM on Hard is the clearest visual indicator of the mechanism’s regime boundary: the two distributions have similar location but different spread, with the VLM running into a small number of catastrophic seeds rather than systematically reducing risk.

![Refer to caption](https://arxiv.org/html/2606.11266v1/figures/tier2_p4_variance.png)

Figure 23: Per-seed return distributions on MetaDrive (Easy / Medium / Hard) for PPOLag baseline vs. VLMPPOLag+Conf. Boxes: IQR; whiskers: $1.5\times$ IQR.

## Appendix I Cross-environment Synthesis

### I.1 Unified held-out evaluation

![Refer to caption](https://arxiv.org/html/2606.11266v1/figures/tier1_unified_holdout.png)

Figure 24: Unified held-out evaluation across all three generalisation environments (Bullet, MetaDrive Easy / Medium / Hard). Each point is one (method, seed, training-length) run. VLMPPOLag+Conf (green) shows consistently lower or equal catastrophe rates on Bullet and MetaDrive Medium; the Easy and Hard environments show no directional benefit, as discussed below.

### I.2 When does the anticipatory mechanism help?

Aggregating across the three new environments and the FormulaOne training\-time evidence yields a consistent picture:

- The mechanism helps when the visual stream contains forward-in-time information about the upcoming hazard. This is the case for FormulaOne L1/L2 (visible obstacles in the camera several timesteps before contact), Bullet SafetyCarReach-v0 (visible hazard panels), and MetaDrive Medium (visible conflicting vehicles in dense traffic).
- The mechanism is neutral or marginally negative when there is no temporal advance warning (MetaDrive Hard, occluded approach geometries) or when the visual content is too sparse to discriminate “danger” semantics (MetaDrive Easy, mostly empty scenes).
- Confidence gating provides a calibrated safety–return trade-off across all environments. The drop in $J_R$ from VLMPPOLag to VLMPPOLag+Conf is symmetric and predictable (§[5](https://arxiv.org/html/2606.11266#S5)), reflecting the joint attenuation of $\lambda_{r}^{\text{eff}}$ and $\lambda_{c}^{\text{eff}}$ in visually ambiguous frames rather than a degradation in representational quality.

## Appendix J Qualitative Analysis

### J.1 Episode walkthrough

[Figure 25](https://arxiv.org/html/2606.11266#A10.F25) traces a representative VLMPPOLag+Conf evaluation trajectory on FormulaOne L2 across four key moments, accompanied by the per-step $c_{\text{vlm}}(t)$ signal computed by CLIP ViT-B/32 on each rendered frame. The policy was loaded from the epoch-50 checkpoint (seed $42$) and run deterministically for $200$ steps, achieving $J_C{=}0$ (no constraint violation in this episode). The per-step $c_{\text{vlm}}$ curve trends downward from $0.636$ (barrel ahead) to ${\sim}0.625$ (clear track) as the policy navigates away from obstacles, illustrating the visual-danger-then-relief signal that VLMLagrange integrates over the rollout.

![Refer to caption](https://arxiv.org/html/2606.11266v1/x17.png)Figure 25: Four key moments from a VLMPPOLag+Conf evaluation episode on FormulaOne L2 (epoch 50, seed 42). *Top:* first-person camera frames; border colour indicates relative danger (red: high $c_{\text{vlm}}$, orange: moderate, green: safe). *Bottom:* per-step $c_{\text{vlm}}(t)$ from CLIP ViT-B/32. Cost decreases monotonically from $0.636$ to ${\sim}0.625$; $J_C{=}0$ for the entire episode.

### J.2 CLIP attention rollout

To understand what visual features CLIP attends to when computing $c_{\text{vlm}}$, we visualise *attention rollout* \[[1](https://arxiv.org/html/2606.11266#bib.bib39)\] on the ViT-B/32 encoder. Attention rollout propagates CLS-token attention through all $12$ transformer layers via the recurrence $\mathbf{R}_{\ell}=(\mathbf{I}+\mathbf{A}_{\ell})/2\cdot\mathbf{R}_{\ell-1}$, where $\mathbf{A}_{\ell}$ is the averaged multi-head attention at layer $\ell$. The resulting $7{\times}7$ patch attention map is upsampled to $224{\times}224$ and overlaid on the input frame.
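The recurrence can be illustrated with a small dependency-free Python sketch. It is not the authors' implementation: the attention matrices are random row-stochastic stand-ins for head-averaged ViT-B/32 attention, $\mathbf{R}_0=\mathbf{I}$ is assumed as the starting point, and the helper names `matmul`, `attention_rollout`, and `random_attention` are hypothetical. With one CLS token plus $7{\times}7$ patches there are $50$ tokens; the CLS row of the final rollout, minus the CLS column, gives the $7{\times}7$ patch map (the upsampling step is omitted).

```python
import random


def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_rollout(layers):
    """Apply R_l = (I + A_l) / 2 @ R_{l-1}, starting from R_0 = I (assumed)."""
    n = len(layers[0])
    rollout = [[float(i == j) for j in range(n)] for i in range(n)]
    for attn in layers:
        # Average the attention matrix with the identity (residual path).
        mixed = [[(attn[i][j] + float(i == j)) / 2.0 for j in range(n)]
                 for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout


def random_attention(n, rng):
    """Row-stochastic stand-in for a head-averaged attention matrix."""
    rows = []
    for _ in range(n):
        weights = [rng.random() for _ in range(n)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


rng = random.Random(0)
num_tokens = 1 + 7 * 7  # CLS token + 7x7 patches
layers = [random_attention(num_tokens, rng) for _ in range(12)]
rollout = attention_rollout(layers)

cls_to_patches = rollout[0][1:]  # CLS row, patch columns only
patch_map = [cls_to_patches[r * 7:(r + 1) * 7] for r in range(7)]
```

Because each factor $(\mathbf{I}+\mathbf{A}_{\ell})/2$ is row-stochastic whenever $\mathbf{A}_{\ell}$ is, every row of the rollout still sums to one, which is a convenient sanity check on real attention tensors as well.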

[Figure 26](https://arxiv.org/html/2606.11266#A10.F26) shows results on three real FormulaOne frames (one per difficulty level, simulation step $50$). Spatial attention shifts from a diffuse pattern (L0, no obstacle) to the obstacle region (L2, barrel directly in path); $c_{\text{vlm}}$ rises monotonically with obstacle proximity ($0.619\to 0.622\to 0.634$ on the displayed frames). We use these maps strictly as a qualitative sanity check: ViT attention is known to be a noisy proxy for input attribution and should not be over-interpreted as a faithful explanation.

![Refer to caption](https://arxiv.org/html/2606.11266v1/x18.png)Figure 26: CLIP ViT-B/32 attention rollout on three real FormulaOne frames (one per difficulty level, step $50$). *Top:* first-person frames. *Middle:* CLS-token attention rollout across all $12$ transformer layers (hot colourmap). *Bottom:* attention overlay on the original frame ($\alpha{=}0.5$). $c_{\text{vlm}}$ rises monotonically with obstacle proximity: $0.619\to 0.622\to 0.634$.

### J.3 Failure cases

Despite strong overall performance, VLMPPOLag\+Conf exhibits failure modes in specific edge cases:

- Occlusion. When barriers are partially occluded by foreground objects (cones in front of barrels on FormulaOne L2; other vehicles in MetaDrive roundabouts), CLIP can underestimate danger ($c_{\text{vlm}}$ remains moderate even during the approach to contact). On FormulaOne L2 this accounts for ${\sim}5\%$ of remaining collision events; on MetaDrive Hard it is the dominant failure mode and explains the lack of net benefit on that map.
- Viewpoint extremes. When the agent drives parallel and close to a barrier (e.g. when recovering from a near-miss), the camera shows a top-down or sideways view that is out-of-distribution for CLIP; the group-margin confidence $\alpha_t\approx 0.3$ is correctly low and the VLM signal is correctly downweighted, but the policy then operates without an anticipatory term in exactly these moments.
- Visually ambiguous frames. On rare frames with unusual lighting, motion blur, or shadows, CLIP assigns near-uniform probability to all prompts ($P^{+}\approx 0.5$), confidence gating triggers, and the behaviour falls back to standard PPOLag.

These three failure modes collectively account for an estimated ${\sim}10\%$ of the remaining FormulaOne L2 violations and a larger fraction on MetaDrive Hard. Plausible mitigations include: (i) fusing CLIP with a depth sensor to handle occlusion; (ii) replacing ViT-B/32 with a stronger reasoning-capable backbone (e.g. LLaVA-NeXT or Qwen2-VL \[[30](https://arxiv.org/html/2606.11266#bib.bib4),[44](https://arxiv.org/html/2606.11266#bib.bib5)\]); and (iii) augmenting the prompt set with viewpoint-specific templates.

## Appendix K Statistical Methodology

We report bootstrap $95\%$ confidence intervals throughout, computed with $2000$ resamples over the per-episode held-out returns and costs \[[15](https://arxiv.org/html/2606.11266#bib.bib40)\]. For pairwise comparisons we additionally report Welch’s $t$-test $p$-values (unequal-variance two-sample), appropriate because methods have markedly different return / cost variances (Levene’s test, $p<0.05$ on all FormulaOne L2 pairs). With three (FormulaOne, Easy) or five (Medium, Hard) training seeds per cell, several pairwise differences do not reach significance at the conventional $\alpha{=}0.05$ threshold despite consistent directional effects across seeds. We follow the recommendations of Henderson et al. \[[21](https://arxiv.org/html/2606.11266#bib.bib41)\] and Agarwal et al. \[[3](https://arxiv.org/html/2606.11266#bib.bib42)\] and interpret these through the bootstrap CIs and effect sizes rather than as binary accept/reject claims. Per-environment bootstrap CIs for the generalisation table appear in the main paper text.
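A minimal stdlib-only sketch of the two statistics follows. It assumes a percentile bootstrap (the text does not state which bootstrap variant is used), and the function names and the `episode_costs` list are hypothetical illustrations. The two per-seed cost lists are the FormulaOne L2 final-epoch $J_C$ values listed in Table 18 for VLMPPOLag+Conf and PPOLag. The sketch returns only the Welch statistic and its Welch–Satterthwaite degrees of freedom; turning them into a $p$-value needs a Student-$t$ CDF, for example `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of `values` (assumed variant)."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-episode costs from a 20-episode held-out evaluation.
episode_costs = [0, 0, 31, 12, 0, 45, 0, 27, 0, 18,
                 0, 0, 52, 9, 0, 0, 33, 0, 14, 0]
ci_low, ci_high = bootstrap_ci(episode_costs)

# Per-seed final-epoch J_C on FormulaOne L2, as listed in Table 18.
vlm_conf_cost = [22.5, 42.0, 27.1]
ppolag_cost = [27.8, 40.2, 99.4]
t_stat, dof = welch_t(vlm_conf_cost, ppolag_cost)
```

With only three seeds per group the degrees of freedom stay close to two, which is one reason the text leans on intervals and effect sizes rather than on thresholded $p$-values.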

## Appendix L Reproducibility Checklist

Key reproducibility commitments:

- Code. A patched OmniSafe v0.5 fork registering `VLMPPOLag` as a first-class algorithm, plus the FormulaOne / Bullet / MetaDrive environment wrappers, will be released at the camera-ready stage at [https://github.com/\[anonymised\]](https://github.com/%5Banonymised%5D).
- Configurations. All training configs (YAML), prompt files (v1/v2/v3), and SLURM submission scripts will be included.
- Seeds. Training seeds $\{42,123,456\}$ for FormulaOne and Bullet; $\{42,123,456,789,2024\}$ for MetaDrive Medium/Hard; held-out evaluation seeds $10000$–$10019$.
- Evaluation script. Deterministic 20-episode held-out evaluation with bootstrap CI computation (`eval/holdout_eval.py`).
- Environment wrappers. `BulletSafetyCarReach-v0` and MetaDrive wrappers ship with the seed-leak fix (`num_scenarios=10000`) pre-applied.
- Data. Raw `progress.csv` files for all $90$ FormulaOne runs and all held-out CSVs (${\sim}500$ MB total) will be released via Zenodo.
- Compute. See Appendix [A.4](https://arxiv.org/html/2606.11266#A1.SS4).

Expected variance. With fixed seeds, $J_R$ should reproduce within $\pm 1\%$. Cost has higher variance ($\pm 15\%$) due to the binary nature of collision events; aggregated statistics (mean over seeds) should reproduce within the reported error bars.

## Appendix M Baseline Descriptions

- PPO \[[39](https://arxiv.org/html/2606.11266#bib.bib30)\]. Unconstrained PPO, no VLM, no safety mechanism. Establishes the maximum-return upper bound and the cost of unconstrained training.
- CPO \[[2](https://arxiv.org/html/2606.11266#bib.bib18)\]. Constrained Policy Optimization, no VLM augmentation. Strong CMDP baseline.
- PPOLag \[[42](https://arxiv.org/html/2606.11266#bib.bib20)\]. PPO-Lagrangian, no VLM. The base algorithm that VLMPPOLag extends, and therefore its direct ablation.
- PPO-CLG / CPO-CLG \[[37](https://arxiv.org/html/2606.11266#bib.bib10)\]. Contrasting Language Goals with PPO and CPO respectively. Uses the coupled softmax VLM reward of prior work, with no cost mechanism. The closest prior-work comparison.
- CPO-Coupled. CPO with the coupled-softmax VLM reward (course-project predecessor of this work). Isolates the decoupling contribution.
- CPO-Decoupled. CPO with Contribution 1 (decoupled dual-path CLIP).
- PPOLag-Decoupled. PPO-Lagrangian with decoupled VLM reward but $\eta_2{=}0$ (no anticipatory $\lambda$ term). Isolates the VLMLagrange contribution.
- VLMPPOLag. Decoupled CLIP $+$ VLMLagrange anticipatory $\lambda$ update (all contributions except confidence gating).
- VLMPPOLag+Conf. VLMPPOLag with confidence gating (full system).

## Appendix N Per-seed Full FormulaOne Results

[Table 18](https://arxiv.org/html/2606.11266#A14.T18) reports the complete final-epoch per-seed numbers used to compute the aggregated entries in [Table 1](https://arxiv.org/html/2606.11266#S5.T1) of the main paper.

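Table 18: Complete per-seed final-epoch performance for all FormulaOne methods and levels. Left: baselines without VLM and prior-work-style CLG baselines. Right: CMDP+VLM variants (this work).

| Method | Level | Seed | $J_R$ | $J_C$ |
| --- | --- | --- | --- | --- |
| Baselines (no VLM) | | | | |
| PPO | L0 | 42 | 2.0 | 0.0 |
| PPO | L0 | 123 | 1.1 | 0.0 |
| PPO | L0 | 456 | 1.6 | 0.0 |
| PPO | L1 | 42 | 1.7 | 273.9 |
| PPO | L1 | 123 | 1.6 | 186.6 |
| PPO | L1 | 456 | 1.6 | 190.6 |
| PPO | L2 | 42 | 1.3 | 286.7 |
| PPO | L2 | 123 | 1.4 | 240.1 |
| PPO | L2 | 456 | 1.2 | 280.7 |
| CPO | L0 | 42 | 1.4 | 0.0 |
| CPO | L0 | 123 | 1.3 | 0.0 |
| CPO | L0 | 456 | 2.9 | 0.0 |
| CPO | L1 | 42 | 0.3 | 28.4 |
| CPO | L1 | 123 | 0.4 | 54.2 |
| CPO | L1 | 456 | 0.3 | 24.4 |
| CPO | L2 | 42 | 0.6 | 23.5 |
| CPO | L2 | 123 | 0.1 | 53.3 |
| CPO | L2 | 456 | 0.2 | 31.4 |
| PPOLag | L0 | 42 | 2.0 | 0.0 |
| PPOLag | L0 | 123 | 1.1 | 0.0 |
| PPOLag | L0 | 456 | 1.6 | 0.0 |
| PPOLag | L1 | 42 | 0.5 | 104.3 |
| PPOLag | L1 | 123 | 1.3 | 19.5 |
| PPOLag | L1 | 456 | 0.6 | 80.0 |
| PPOLag | L2 | 42 | 0.5 | 27.8 |
| PPOLag | L2 | 123 | 0.7 | 40.2 |
| PPOLag | L2 | 456 | 0.8 | 99.4 |
| Prior-work CLG baselines | | | | |
| PPO-CLG | L0 | 42 | 51.9 | 0.0 |
| PPO-CLG | L0 | 123 | 52.4 | 0.0 |
| PPO-CLG | L0 | 456 | 51.4 | 0.0 |
| PPO-CLG | L1 | 42 | 51.6 | 165.2 |
| PPO-CLG | L1 | 123 | 51.6 | 81.1 |
| PPO-CLG | L1 | 456 | 51.9 | 154.4 |
| PPO-CLG | L2 | 42 | 51.5 | 155.9 |
| PPO-CLG | L2 | 123 | 51.3 | 107.4 |
| PPO-CLG | L2 | 456 | 51.3 | 206.3 |
| CPO-CLG | L0 | 42 | 51.5 | 0.0 |
| CPO-CLG | L0 | 123 | 52.1 | 0.0 |
| CPO-CLG | L0 | 456 | 51.3 | 0.0 |
| CPO-CLG | L1 | 42 | 50.3 | 40.8 |
| CPO-CLG | L1 | 123 | 51.2 | 36.4 |
| CPO-CLG | L1 | 456 | 50.7 | 21.1 |
| CPO-CLG | L2 | 42 | 50.7 | 38.0 |
| CPO-CLG | L2 | 123 | 50.9 | 25.6 |
| CPO-CLG | L2 | 456 | 51.0 | 38.0 |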
MethodLevelSeedJRJ\_\{R\}JCJ\_\{C\}CMDP\+\+VLM \(this work\)CPO\-CoupledL04220\.50\.0CPO\-CoupledL012321\.30\.0CPO\-CoupledL045621\.80\.0CPO\-CoupledL14221\.423\.5CPO\-CoupledL112320\.033\.2CPO\-CoupledL145621\.530\.6CPO\-CoupledL24221\.438\.4CPO\-CoupledL212321\.930\.4CPO\-CoupledL245621\.528\.2CPO\-DecoupledL04264\.30\.0CPO\-DecoupledL012364\.40\.0CPO\-DecoupledL045663\.70\.0CPO\-DecoupledL14263\.524\.0CPO\-DecoupledL112363\.859\.8CPO\-DecoupledL145663\.829\.1CPO\-DecoupledL24263\.741\.7CPO\-DecoupledL212364\.030\.1CPO\-DecoupledL245664\.020\.9PPOLag\-Dec\.L04264\.50\.0PPOLag\-Dec\.L012364\.00\.0PPOLag\-Dec\.L045664\.40\.0PPOLag\-Dec\.L14264\.132\.6PPOLag\-Dec\.L112364\.035\.0PPOLag\-Dec\.L145663\.932\.9PPOLag\-Dec\.L24263\.947\.5PPOLag\-Dec\.L212363\.738\.6PPOLag\-Dec\.L245663\.836\.0VLMPPOLagL04263\.80\.0VLMPPOLagL012364\.50\.0VLMPPOLagL045664\.60\.0VLMPPOLagL14264\.335\.3VLMPPOLagL112364\.040\.6VLMPPOLagL145664\.022\.5VLMPPOLagL24263\.636\.3VLMPPOLagL212363\.854\.4VLMPPOLagL245663\.929\.8VLMPPOLag\+ConfL04241\.70\.0VLMPPOLag\+ConfL012348\.60\.0VLMPPOLag\+ConfL045645\.60\.0VLMPPOLag\+ConfL14247\.938\.5VLMPPOLag\+ConfL112348\.514\.6VLMPPOLag\+ConfL145649\.729\.6VLMPPOLag\+ConfL24246\.622\.5VLMPPOLag\+ConfL212349\.442\.0VLMPPOLag\+ConfL245649\.227\.1Additional baselines and ablationsPPOLag\-RNDL1420\.4542\.0PPOLag\-RNDL11230\.4758\.2PPOLag\-RNDL14561\.3886\.9PPOLag\-RNDL2420\.1743\.6PPOLag\-RNDL21230\.8729\.8PPOLag\-RNDL24560\.2563\.7Qwen2\-VL\+ConfL2428\.0324\.9Qwen2\-VL\+ConfL21238\.3934\.0Qwen2\-VL\+ConfL24568\.3177\.7

## Appendix O Additional Experiments: RND Baseline and Qwen2-VL Backbone

This appendix collects implementation details, diagnostic data and statistical tests for two additional experiments: the RND intrinsic-cost baseline (§[O.1](https://arxiv.org/html/2606.11266#A15.SS1)) and the Qwen2-VL backbone ablation (§[O.2](https://arxiv.org/html/2606.11266#A15.SS2)). Per-seed final-epoch numbers are folded into [Table 18](https://arxiv.org/html/2606.11266#A14.T18) above.

### O.1 RND: novelty signal collapses immediately

We replace $c_{\text{vlm}}$ with Random Network Distillation novelty \[[10](https://arxiv.org/html/2606.11266#bib.bib44)\]: a random-init target MLP $\hat{f}_{\theta}:\mathbb{R}^{44}\to\mathbb{R}^{32}$ encodes the proprioceptive observation, and a trainable predictor $\hat{f}_{\phi}$ regresses onto it; the novelty cost at step $t$ is $\nu_t=1-\exp(-\max(z_t,0))$ where $z_t$ is the running-mean/std-normalised prediction error. Both networks are $2{\times}64$-unit MLPs with ReLU; predictor learning rate $10^{-4}$. The cost is otherwise routed identically to $c_{\text{vlm}}$ (per-step input to the Lagrangian update), and all PPO-Lag hyperparameters match the main FormulaOne L1/L2 runs. Implementation: `src/rnd_module.py`, `src/rnd_env.py`.
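The mapping from prediction error to novelty cost can be sketched as follows. This is an illustration rather than the contents of `src/rnd_module.py`: the `RunningStats` class, the `novelty_cost` function, the choice to update the running statistics before normalising, and the `eps` guard are all assumptions, and the prediction errors fed in are made-up numbers instead of outputs of the target and predictor MLPs.

```python
import math


class RunningStats:
    """Welford running mean / variance over scalar prediction errors."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to unit scale until two samples have been seen.
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0


def novelty_cost(error, stats, eps=1e-8):
    """nu_t = 1 - exp(-max(z_t, 0)) with z_t the normalised prediction error."""
    stats.update(error)  # assumption: statistics are updated before normalising
    z = (error - stats.mean) / (stats.std + eps)
    return 1.0 - math.exp(-max(z, 0.0))


stats = RunningStats()
flat_costs = [novelty_cost(0.5, stats) for _ in range(10)]  # constant error
spike_cost = novelty_cost(5.0, stats)                       # sudden novelty
```

A constant prediction error normalises to $z_t=0$ and therefore to zero cost, which mirrors the flat-signal behaviour that the diagnostic in this section describes once the predictor has matched the target.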

The diagnostic finding is that the novelty signal never has discriminative magnitude. [Table 19](https://arxiv.org/html/2606.11266#A15.T19) reports the predictor-error-derived cost $\nu_t$ at four points across training. Values are essentially flat at ${\sim}0.01$–$0.02$ from $100$k steps onward, an order of magnitude below the signal level CLIP and Qwen2-VL produce on the same frames ($\bar{c}_{\text{vlm}}\approx 0.55$–$0.64$). This is the canonical RND failure mode in low-stochasticity environments: the proprioceptive observation manifold is small enough that a $2$-layer predictor matches the random target almost immediately, leaving the policy with a near-constant intrinsic cost that carries no actionable safety information. The result is a cost-shaping signal that is dominated by initialisation noise and that the Lagrange multiplier cannot use to anticipate danger.

Table 19: RND novelty $\nu_t$ across training (3 seeds per level). Values remain in $[0.010,0.024]$ at all sampled checkpoints and on all $6$ runs, an order of magnitude below $\bar{c}_{\text{vlm}}$ from CLIP (${\sim}0.6$) or Qwen2-VL (${\sim}0.62$) on the same environment.

### O.2 Qwen2-VL backbone: implementation and timing

Backbone integration. We use `Qwen/Qwen2-VL-7B-Instruct` loaded in `torch.float16` via `transformers` 4.46.3 with `Accelerate` 1.0.1, placing the entire model on a single `cuda:0` device via HuggingFace’s automatic device map. The model occupies ${\sim}16$ GB on a single A100, leaving ample headroom for the policy and replay buffer. Implementation: `src/qwen2vl_utils.py`.

Group-margin scoring. Rather than per-prompt cosine similarity, we use a binary yes/no scoring head that exploits Qwen2-VL’s generative nature. Given a frame $o_t$ and a polarity descriptor (positive: “safe driving conditions”; negative: “driving danger or imminent collision”) we form the chat-template prompt: `"Question: Does this image show <descriptor>? Answer with a single word: yes or no."` A single forward pass produces logits at the answer position; we extract the token IDs corresponding to “yes”/“Yes”/“ yes”/“ Yes” and the analogous “no” variants (resolved at init via `tokenizer.encode(...)`, keeping only single-token results), then score

$$P(\text{yes}\mid o,d)\;=\;\sigma\!\left(\,\mathrm{logsumexp}_{i\in\mathcal{Y}}\,\ell_i\;-\;\mathrm{logsumexp}_{j\in\mathcal{N}}\,\ell_j\right).$$

We set $r_{\mathrm{vlm}}(o)=P(\text{yes}\mid o,\text{positive})$ and $c_{\text{vlm}}(o)=P(\text{yes}\mid o,\text{negative})$. Each scoring call therefore takes two forward passes regardless of the prompt-set size $N$, in contrast to CLIP, which performs one image-encoder pass and $2N$ small dot products.

Confidence. Margin-based confidence is recovered as $\kappa_t=|2r_{\mathrm{vlm}}/(r_{\mathrm{vlm}}+c_{\text{vlm}})-1|\in[0,1]$, playing the same role as the binary group margin in Eq. ([4](https://arxiv.org/html/2606.11266#S3.E4)).
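The yes/no score and the margin confidence can be written out in a few lines. The sketch below is not the code in `src/qwen2vl_utils.py`: the token ids, the toy logit vectors, and the function names are hypothetical, and no model is loaded. In a real run the logits would be those at the answer position of each forward pass.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def p_yes(logits, yes_ids, no_ids):
    """sigma(logsumexp over yes-token logits - logsumexp over no-token logits)."""
    margin = (logsumexp([logits[i] for i in yes_ids])
              - logsumexp([logits[j] for j in no_ids]))
    return 1.0 / (1.0 + math.exp(-margin))


def margin_confidence(r_vlm, c_vlm):
    """kappa = |2 r / (r + c) - 1|, which lies in [0, 1]."""
    return abs(2.0 * r_vlm / (r_vlm + c_vlm) - 1.0)


# Hypothetical single-token ids for the "yes" and "no" variants.
YES_IDS, NO_IDS = [0, 1], [2, 3]

# Toy answer-position logits for the two descriptor prompts.
logits_positive = [2.0, 1.0, 0.5, 0.0, -1.0]   # "safe driving conditions"
logits_negative = [0.0, -0.5, 1.5, 1.0, -1.0]  # "driving danger or imminent collision"

r_vlm = p_yes(logits_positive, YES_IDS, NO_IDS)
c_vlm = p_yes(logits_negative, YES_IDS, NO_IDS)
kappa = margin_confidence(r_vlm, c_vlm)
```

The sigmoid of the log-sum-exp difference equals the softmax mass on the yes tokens renormalised over the yes and no tokens only, so the score ignores probability assigned to any other vocabulary item.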

Latency budget arithmetic. A $113$ ms scoring call sets a budget for $25$ Hz control: at `clip_inference_frequency=1` the VLM consumes ${\sim}2.8\,\mathrm{s}$ of wall-clock time per simulated second, which is infeasible in real time. At `clip_freq=8` (amortising the call across 8 control steps) the amortised cost is ${\sim}14$ ms per control step, well within the $40$ ms control period with ${\sim}65\%$ headroom for the policy forward pass and the physics step. We use `clip_freq=8` throughout the ablation. The per-call timing of $113$ ms is reproduced as the first output line of every Qwen2-VL training run via the SLURM preamble (`slurm/slurm_corl_qwen2vl.sh`).
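The budget figures follow from three divisions, reproduced here as a short check. The variable names are illustrative; the inputs ($113$ ms per call, $25$ Hz control, one call every 8 steps) are the values quoted in the text.

```python
call_ms = 113.0      # measured latency of one VLM scoring call
control_hz = 25      # control frequency
call_every = 8       # control steps between VLM calls (clip_freq)

period_ms = 1000.0 / control_hz                          # control period
vlm_s_per_sim_second = call_ms * control_hz / 1000.0     # one call on every step
amortised_ms = call_ms / call_every                      # cost spread over 8 steps
headroom = 1.0 - amortised_ms / period_ms                # share of the period left

print(f"control period:        {period_ms:.1f} ms")
print(f"VLM time at freq=1:    {vlm_s_per_sim_second:.3f} s per simulated second")
print(f"amortised at freq=8:   {amortised_ms:.3f} ms per control step")
print(f"headroom:              {headroom:.1%}")
```

The amortised figure assumes the latency is spread evenly over the 8 steps; if the call blocks a single control step, that step still sees the full $113$ ms.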

Stability of $\bar{c}_{\text{vlm}}$ across training. The per-epoch danger-signal magnitude is stable across all 3 seeds and all four sampled checkpoints ([Table 20](https://arxiv.org/html/2606.11266#A15.T20)), indicating that Qwen2-VL produces a stable and consistent visual-danger score on this domain. The variance in the policy-level outcome ([Table 18](https://arxiv.org/html/2606.11266#A14.T18)) is therefore not attributable to backbone instability but to the Lagrangian controller’s response to the signal.

Table 20: Qwen2-VL $\bar{c}_{\text{vlm}}$ at four checkpoints, FormulaOne L2. The danger-signal magnitude is stable across seeds and across training, comparable to CLIP’s $\bar{c}_{\text{vlm}}\sim 0.6$ in our main runs.

### O.3 Statistical tests for the additional comparisons

Following the methodology of §[K](https://arxiv.org/html/2606.11266#A11), we report Welch’s $t$-test (unequal variance, two-sided) and a bootstrap $95\%$ CI on the difference of means ($10^{4}$ resamples) for the two new method comparisons ([Table 21](https://arxiv.org/html/2606.11266#A15.T21)). The cost (safety) comparison between Qwen2-VL and CLIP is statistically null: the bootstrap CI on the cost difference contains zero and Welch yields $p{=}0.46$. The return comparison is significantly different ($p{=}4{\times}10^{-4}$): Qwen2-VL achieves substantially lower return at this fixed gating threshold, consistent with the group-margin score producing systematically more conservative gating than CLIP’s binary softmax margin (Eq. ([4](https://arxiv.org/html/2606.11266#S3.E4))). A per-backbone gating-threshold sweep would be needed to disentangle backbone capacity from gating calibration, which we leave to future work. The RND vs. VLMPPOLag+Conf comparison is significant on *both* axes: RND is worse by a wide margin in both return ($p<10^{-4}$) and cost ($p{=}0.04$), confirming the main-text claim that semantic grounding rather than auxiliary cost-shaping drives the safety gain.

Table 21: Statistical tests for the two ablation comparisons on FormulaOne L2 ($n{=}3$ seeds per cell). Each row reports the seed-mean of $\text{method}_1$ and $\text{method}_2$ on the listed metric, then their difference $\Delta=\mathrm{mean}(\text{method}_1)-\mathrm{mean}(\text{method}_2)$, a bootstrap $95\%$ CI on $\Delta$, and two one-sided $p$-values testing the pre-registered direction shown in the last column ($<$ means we expected $\text{method}_1$ to be smaller; $>$ larger). $J_R$ is episodic return (higher is better), $J_C$ episodic cost (lower is safer). For $n_1{=}n_2{=}3$ the smallest possible one-sided permutation $p$-value is $1/\binom{6}{3}=0.05$ (perfect separation of the two seed groups); see Appendix [D.3](https://arxiv.org/html/2606.11266#A4.SS3). How to read a row. For example, row 3 says: on the return axis, our method (VLMPPOLag+Conf) achieved a seed-mean of $46.01$ while PPOLag-RND achieved $-1.99$, a gap of $+48.00$ in our favour. The direction we pre-registered was “$>$” (we expected our method to score higher), and the permutation test reaches the structural floor $p{=}0.05$, meaning the three seeds of our method and the three RND seeds are perfectly separated.
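The structural floor of $0.05$ can be checked by enumerating every way of splitting six seeds into two groups of three. The sketch below is an assumption-laden illustration: the function name is hypothetical, the test statistic is taken to be the difference of group means, and the two example groups are the per-seed FormulaOne L2 returns listed in Table 18 for VLMPPOLag+Conf and PPOLag, chosen only because they are perfectly separated.

```python
import itertools
import math


def one_sided_permutation_p(group_a, group_b):
    """Exact one-sided permutation p-value for mean(a) - mean(b) being large."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    hits = 0
    total = 0
    for idx in itertools.combinations(range(len(pooled)), n_a):
        in_a = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in in_a]
        diff = sum(a) / len(a) - sum(b) / len(b)
        hits += diff >= observed - 1e-12  # tolerance for float round-off
        total += 1
    return hits / total


n_splits = math.comb(6, 3)  # number of distinct 3-vs-3 relabellings

# Per-seed final-epoch J_R on FormulaOne L2, as listed in Table 18.
vlm_conf_return = [46.6, 49.4, 49.2]
ppolag_return = [0.5, 0.7, 0.8]
p_value = one_sided_permutation_p(vlm_conf_return, ppolag_return)
```

Only the observed labelling reaches the observed gap when the groups do not overlap, so the $p$-value equals one over the number of relabellings and cannot go lower with three seeds per group.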

## Appendix P Extended Robustness Ablations

This section bundles three supplementary analyses that extend the main results: a two-multiplier ablation that disentangles $\eta_1$ from $\eta_2$ (addressing the potential “$\eta_2$ is just a larger learning rate” interpretation); a CLIP-capacity ablation contrasting ViT-B/32 with ViT-L/14 (probing whether the results are sensitive to backbone size); and a convergence sketch for the modified Lagrangian update.

### P.1 Two-multiplier ablation: $\eta_1$ on $J_C$ vs. $\eta_2$ on $\overline{c}_{\text{vlm}}$

Motivation. One natural alternative interpretation of the anticipatory term $\eta_2(\overline{c}_{\text{vlm}}-\tau)$ is that it may be acting as a disguised *learning-rate increase* on the dual variable: a vanilla PPO-Lag with a sufficiently large $\eta_1$ might recover the same $\lambda$ trajectory without any VLM signal.

Why this is not the case (structurally). The two terms update $\lambda$ from *different* stochastic processes: $\eta_1$ scales the *episodic, post-collision* residual $(J_C-d)$, built from a simulator cost that is zero everywhere except at the small subset of timesteps that actually triggered cost; $\eta_2$ scales the *per-step, pre-collision* VLM signal $\overline{c}_{\text{vlm}}$, which is non-zero on every timestep where the camera sees an approaching barrier. The two have different variances, different sign-rates, and different temporal locations relative to the collision event. Increasing $\eta_1$ alone amplifies the same backward-looking signal more aggressively (which is known to induce oscillation \[[42](https://arxiv.org/html/2606.11266#bib.bib20)\]); it cannot manufacture pre-collision information.

Empirical disentangling at L2. We instantiate a $2{\times}3$ sweep on FormulaOne L2 (3 seeds $\{42,123,456\}$): $\eta_1\in\{0.035,0.07\}$ crossed with $\eta_2\in\{0,0.01,0.03\}$, fixing all other hyperparameters (decoupled CLIP, no confidence gate, $\tau{=}0.5$). The $(\eta_1{=}0.035,\eta_2{=}0)$ cell is PPOLag-Decoupled (existing run, $J_R{=}63.8$, $J_C{=}40.7$, viol. $3/3$); the $(0.035,0.01)$ cell is VLMPPOLag (existing run, $J_R{=}63.8$, $J_C{=}40.2$, viol. $3/3$).

Table 22: Two-multiplier ablation on FormulaOne L2. 3 seeds per cell ($\{42,123,456\}$, $10^{6}$ env steps). Cell format: $J_R\,/\,J_C$ (seed-mean) followed by (viol./3), where a violation is $J_C>d{=}25$. Bold: $J_C\leq d$. Existing rows are reproduced from [Tables 1](https://arxiv.org/html/2606.11266#S5.T1) and [2](https://arxiv.org/html/2606.11266#S5.T2); rows marked $\ddagger$ are the completed Phase C $\eta_1$ sweep. Doubling $\eta_1$ alone (bottom-left) collapses return without buying safety; adding $\eta_2{>}0$ on top of the larger $\eta_1$ (bottom-middle) is what actually drives $J_C$ below $d$.

Pre-registered predictions and outcomes. *(i) Confirmed.* The $\eta_1{=}0.07,\eta_2{=}0$ cell collapses return to $J_R{=}0.2$, a ${\sim}55\%$ drop from the same-wrapper $\eta_1{=}0.035$ reference of $J_R{\approx}0.55$ (footnote †), while $J_C$ remains above $d$ on all 3 seeds with $\bar{\lambda}{=}2.29$. This is the textbook PID-vs-P pathology of \[[42](https://arxiv.org/html/2606.11266#bib.bib20)\]: pumping $\eta_1$ amplifies the post-collision residual, drives $\lambda$ high, and degrades the policy without buying safety. *(ii) Partially confirmed.* At $\eta_1{=}0.07$, sweeping $\eta_2\in\{0,0.01,0.03\}$ gives $J_C{=}33.0\to 20.1\to 29.8$: the minimum is at $\eta_2{=}0.01$ (non-monotone, but $\eta_2{>}0$ uniformly improves over $\eta_2{=}0$). At $\eta_1{=}0.035$ the row is essentially flat ($40.7\to 40.2\to 41.4$), indicating that without the larger main-loop pressure the anticipatory term has insufficient leverage. *(iii) Confirmed in direction.* The $J_C$-minimising cell subject to $J_R$ retention is $(\eta_1{=}0.07,\eta_2{=}0.01)$ with $J_C{=}20.1\leq d$ at $J_R{=}63.9$, *not* $(\eta_1{=}0.07,\eta_2{=}0)$, and not the originally pre-registered $(\eta_1{=}0.035,\eta_2{=}0.03)$. Together (i)+(iii) directly refute the “$\eta_2$ is a disguised $\eta_1$” interpretation: doubling $\eta_1$ alone destroys return without delivering safety, whereas a small $\eta_2{>}0$ on top of the same $\eta_1$ recovers full return *and* achieves the lowest $J_C$ in the entire grid.

Wall-clock cost. The four added cells $\times$ 3 seeds totalled ${\sim}250$ GPU-hours (median ${\sim}20$ h/seed on shared A100s, slower than the original $4.5$ h/seed estimate because rebuttal runs were not pre-emptively prioritised and several queued behind concurrent jobs).

### P.2 CLIP-capacity ablation: ViT-B/32 vs. ViT-L/14

Motivation. The negative result on Qwen2-VL-7B (App. [O.2](https://arxiv.org/html/2606.11266#A15.SS2)) is uninformative about backbone capacity in general, because the Qwen prompting interface differs from CLIP’s contrastive scoring. A cleaner control swaps CLIP ViT-B/32 (151 M params, $32{\times}32$ patches) for CLIP ViT-L/14 (428 M params, $14{\times}14$ patches), holding the scoring pipeline (decoupled cosine of Eq. ([2](https://arxiv.org/html/2606.11266#S3.E2))) *exactly* fixed.

Protocol. The only changes from the FormulaOne L2 VLMPPOLag+Conf configuration are: (i) swap the image encoder $f_I$ from ViT-B/32 to ViT-L/14, (ii) re-cache text features $F^{\pm}$ through the matched ViT-L/14 text tower, (iii) re-run the gate-calibration MLE of Eq. ([5](https://arxiv.org/html/2606.11266#S3.E5)) on a fresh $B{=}5000$ random-policy frame buffer (the margin distribution shifts noticeably between backbones; the same MLE recipe applies unchanged). *No* hyperparameter retuning otherwise: $\lambda_r{=}0.1$, $\lambda_c{=}0.5$, $\eta_2{=}0.01$, $\tau{=}0.5$. Seeds $\{42,123,456\}$ for parity with the existing 3-seed L2 cells; even a single L/14 seed materially probes whether the pipeline is encoder-agnostic.

Table 23: CLIP-capacity ablation on FormulaOne L2. Both backbones frozen; identical decoupled-path scoring and gate formula. ViT-B/32 row is reproduced from [Table 1](https://arxiv.org/html/2606.11266#S5.T1) (Phase B 5-seed +Conf cell). $\ddagger$ Phase C run, 3 seeds. Per-step inference: ViT-B/32 $7.11$ ms, ViT-L/14 ${\sim}21$ ms (both A100), still inside the $40$ ms control budget.

Pre-registered predictions and outcomes. Three predictions were registered in advance: (i) $J_C$ for ViT-L/14 would lie within the bootstrap CI of the ViT-B/32 cell (roughly $\pm 8$ of the mean); (ii) $J_R$ for ViT-L/14 would lie within $\pm 8$ of the ViT-B/32 mean; (iii) the gate-calibration parameters $(\hat{s},\hat{c})$ would shift modestly but leave the median $\hat{\kappa}$ in the non-degenerate range.

The outcomes split: prediction (i) is *confirmed* ($J_C{=}26.9{\pm}5.7$ vs. ViT-B/32 $22.5{\pm}5.9$; the CIs overlap, the safety mechanism is encoder-agnostic). Prediction (ii) is *not* confirmed: ViT-L/14 returns at $J_R{=}16.1{\pm}1.7$ are roughly $16$ points below ViT-B/32 ($31.8{\pm}12.2$), well outside the pre-registered $\pm 8$ tolerance. Two factors plausibly account for the gap. First, ViT-L/14’s $14{\times}14$ patches resolve fine-grained barrier features more sharply, which under matched gate-calibration MLE produces a more aggressive $\hat{\kappa}$ distribution that attenuates the reward CLIP channel more strongly (symmetric attenuation, the same mechanism as the calibrated-gate ablation of App. [F](https://arxiv.org/html/2606.11266#A6)). Second, the prompt set was tuned on ViT-B/32 margins; the same $K{=}L{=}4$ prompts may be suboptimally positioned in ViT-L/14’s higher-dimensional embedding space. Prediction (iii) holds qualitatively (median $\hat{\kappa}$ remains in $[0.4,0.8]$). *The headline interpretation is the one anticipated as the negative-outcome branch*: the modest receptive field of $32{\times}32$ patches is in fact better matched to the FormulaOne barrier scale at the matched prompt set. The result confirms encoder-agnosticism of the safety mechanism (prediction (i)) while revealing that backbone choice trades along the return axis under matched gating, an architectural finding worth reporting.

### P.3 Convergence sketch for VLMLagrange

We give an informal stochastic\-approximation argument for the modified dual update\. Recall Eq\. \([3](https://arxiv.org/html/2606.11266#S3.E3)\):

$$\lambda_{k+1}\;=\;\big[\lambda_k+\eta_1\big(\widehat{J_C}_k-d\big)+\eta_2\big(\widehat{\overline{c}}_{\text{vlm},k}-\tau\big)\big]_{+},$$

with $\widehat{J_C}_k$ and $\widehat{\overline{c}}_{\text{vlm},k}$ the epoch-mean Monte-Carlo estimates over the rollout buffer at outer iteration $k$.
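Written as code, the projected update is a single line. The sketch below is illustrative only: the function name is hypothetical, and the defaults ($d{=}25$, $\tau{=}0.5$, $\eta_1{=}0.035$, $\eta_2{=}0.01$) are the values quoted for the FormulaOne L2 runs in this appendix and may not match every configuration.

```python
def vlm_lagrange_step(lam, jc_hat, c_vlm_bar,
                      d=25.0, tau=0.5, eta1=0.035, eta2=0.01):
    """One projected dual update: reactive cost term plus anticipatory VLM term."""
    reactive = eta1 * (jc_hat - d)            # episodic cost residual
    anticipatory = eta2 * (c_vlm_bar - tau)   # epoch-mean VLM danger residual
    return max(0.0, lam + reactive + anticipatory)  # projection onto lambda >= 0


# Cost exactly on budget, but the VLM sees danger: lambda still rises slightly.
lam_on_budget = vlm_lagrange_step(0.0, jc_hat=25.0, c_vlm_bar=0.63)

# Cost far under budget: the projection keeps lambda at zero.
lam_safe = vlm_lagrange_step(0.0, jc_hat=0.0, c_vlm_bar=0.63)

# Cost over budget with a neutral VLM signal: the usual PPO-Lag increment.
lam_violating = vlm_lagrange_step(1.0, jc_hat=45.0, c_vlm_bar=0.5)
```

The first call shows the anticipatory effect in isolation: with $\widehat{J_C}_k=d$ the reactive term vanishes and the multiplier moves only because the VLM residual is positive.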

Setup. Assume (A1) the inner PPO update operates on a faster timescale than the dual update ($\eta_1,\eta_2\to 0$, with $\eta_1/\eta_2$ held fixed); (A2) for each $\lambda$ the inner policy $\pi_{\theta}(\lambda)$ tracks a unique stationary point $\theta^{\star}(\lambda)$ of the Lagrangian $L(\theta,\lambda)=J_R(\pi_{\theta})-\lambda(J_C(\pi_{\theta})-d)$; (A3) the VLM bias is *bounded and consistent* in the sense that there exist a constant $\beta<\infty$ and a deterministic deviation $\delta(\theta):=\mathbb{E}_{o\sim d^{\pi_{\theta}}}[c_{\text{vlm}}(o)]-g\big(J_C(\pi_{\theta})/T\big)$ with $|\delta(\theta)|\leq\beta$ and $g(\cdot)$ a continuous monotone link (we make this concrete below). Under (A1)–(A3) the dual update is a standard two-timescale Robbins–Monro scheme \[[7](https://arxiv.org/html/2606.11266#bib.bib53)\] with perturbed gradient.

The perturbed dual gradient. The standard PPO-Lag dual gradient at the inner equilibrium is $h_1(\lambda)=J_C(\pi_{\theta^{\star}(\lambda)})-d$. Our update adds $h_2(\lambda)=\mathbb{E}[\overline{c}_{\text{vlm}}]-\tau$. If $\tau$ is chosen consistent with the threshold the VLM would assign to the feasible $J_C$ (formally, $\tau=g(d/T)+O(\beta)$), then the combined drift $h(\lambda)=\eta_1 h_1(\lambda)+\eta_2 h_2(\lambda)$ shares its zero with $h_1$ alone up to an $O(\eta_2\beta/\eta_1)$ bias. Concretely, the fixed point $\lambda^{\star}$ of the standard update ($h_1(\lambda^{\star}){=}0$) is perturbed to a nearby $\widetilde{\lambda}$ satisfying $|\widetilde{\lambda}-\lambda^{\star}|\leq(\eta_2/\eta_1)\cdot\beta/|h_1'(\lambda^{\star})|$ by the implicit function theorem applied at the unperturbed zero (assuming $h_1'(\lambda^{\star})\neq 0$, which holds whenever $J_C$ is locally strictly monotone in $\lambda$ with non-vanishing derivative; this is the standard regularity assumption in PPO-Lag analysis).
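A toy numerical check makes the size of the shift concrete. Everything in this sketch is an assumption made for illustration: $h_1$ is taken to be linear with slope $-10$ and zero at $\lambda^{\star}=1$, the VLM residual $h_2$ is held at a constant $\delta=\beta=0.13$, and the function name is hypothetical. Only the step sizes $\eta_1{=}0.035$ and $\eta_2{=}0.01$ come from the text.

```python
def perturbed_fixed_point(lam_star=1.0, slope=10.0, delta=0.13,
                          eta1=0.035, eta2=0.01, iters=200):
    """Iterate the projected dual update on a toy linear drift."""
    lam = 0.0
    for _ in range(iters):
        h1 = -slope * (lam - lam_star)  # toy dual gradient, zero at lam_star
        h2 = delta                      # constant VLM residual, |delta| <= beta
        lam = max(0.0, lam + eta1 * h1 + eta2 * h2)
    return lam


lam_star, slope, beta = 1.0, 10.0, 0.13
eta1, eta2 = 0.035, 0.01

lam_tilde = perturbed_fixed_point(lam_star, slope, beta, eta1, eta2)
shift = abs(lam_tilde - lam_star)
bound = (eta2 / eta1) * beta / slope  # (eta2 / eta1) * beta / |h1'(lam_star)|
```

In this linear case the bound is attained exactly, since the zero of $\eta_1 h_1+\eta_2 h_2$ sits at $\lambda^{\star}+(\eta_2/\eta_1)\,\delta/|h_1'|$; for a nonlinear $h_1$ it holds only locally, as the sketch in the text states.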

Conclusion of the sketch. Under (A1)–(A3) the iterates $\lambda_{k}$ converge almost surely to an $O(\eta_{2}\beta/\eta_{1})$ neighbourhood of the standard PPO-Lag fixed point; with appropriately decaying step sizes $\sum_{k}\eta_{i,k}=\infty$, $\sum_{k}\eta_{i,k}^{2}<\infty$ and $\eta_{2,k}/\eta_{1,k}\to 0$, the bias term vanishes and the limit is exactly the unperturbed feasible $\lambda^{\star}$. The VLM term thus accelerates the early-training trajectory (the empirically observed effect of [Section 5.1](https://arxiv.org/html/2606.11266#S5.SS1)) without changing the asymptotic feasible point.
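The role of the step-size ratio can be illustrated with a small simulation of the dual iteration. This is a sketch under stated assumptions, not the paper's training loop: the linear drifts, the constants (`LAM_STAR`, `SLOPE`, `VLM_GAIN`, `DELTA`), the schedule exponents, and the projection onto $\lambda\geq 0$ are all hypothetical. Both schedules satisfy $\sum_{k}\eta_{i,k}=\infty$ and $\sum_{k}\eta_{i,k}^{2}<\infty$; they differ only in whether $\eta_{2,k}/\eta_{1,k}$ is held fixed or decays to zero.

```python
import random

# Hypothetical toy values, chosen only for illustration.
LAM_STAR = 2.0   # zero of the unperturbed drift h1
SLOPE = 8.0      # |h1'(lambda*)|
VLM_GAIN = 0.1   # slope of the consistent part of the VLM drift
DELTA = 0.05     # constant VLM deviation, |DELTA| <= beta


def run_dual(steps, ratio_exponent, noise_std=0.0, seed=0):
    """Iterate lam <- lam + eta1 * h1 + eta2 * h2 with decaying step sizes.

    ratio_exponent = 0 keeps eta2 / eta1 fixed at 0.5;
    ratio_exponent > 0 makes eta2 / eta1 decay like (k + 1) ** -ratio_exponent.
    """
    rng = random.Random(seed)
    lam = 0.0
    for k in range(steps):
        eta1 = 0.05 / (k + 1) ** 0.6
        eta2 = 0.5 * eta1 / (k + 1) ** ratio_exponent
        # Noisy estimate of the standard dual gradient J_C - d.
        h1 = -SLOPE * (lam - LAM_STAR) + rng.gauss(0.0, noise_std)
        # VLM term: a consistent part plus a bounded bias.
        h2 = -VLM_GAIN * (lam - LAM_STAR) + DELTA
        # Projection onto lam >= 0 (an assumption of this sketch).
        lam = max(0.0, lam + eta1 * h1 + eta2 * h2)
    return lam


fixed_ratio = run_dual(steps=20000, ratio_exponent=0.0)
decaying_ratio = run_dual(steps=20000, ratio_exponent=0.4)

print(f"fixed ratio:    lam - lam* = {fixed_ratio - LAM_STAR:+.6f}")
print(f"decaying ratio: lam - lam* = {decaying_ratio - LAM_STAR:+.6f}")
```

With the ratio held fixed, the iterate settles at a biased point near $\lambda^{\star}$; with a decaying ratio, the residual shrinks toward zero as training proceeds. Passing a positive `noise_std` adds seeded Gaussian noise to the cost-gradient estimate to mimic the stochastic setting.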

Caveats. (i) (A2) is the standard PPO-Lag assumption and is verifiable empirically only through training-curve convergence ([Section 5.1](https://arxiv.org/html/2606.11266#S5.SS1)); we make no claim beyond what already holds for PPO-Lag. (ii) (A3) is the substantive assumption: it requires that the VLM signal be biased *but bounded and consistent* with respect to the simulator cost. The held-out AUC validation of §[F.2](https://arxiv.org/html/2606.11266#A6.SS2) ($0.78$–$0.82$) is the empirical evidence for (A3); a fully formal treatment would require a Lipschitz noise model on $c_{\text{vlm}}$ that we do not develop here. (iii) The argument is local; we make no global-convergence claim. This sketch is intended as a sanity-level justification that the modification *is* a standard stochastic-approximation scheme with a bounded perturbation, not a novel theoretical contribution. A full proof would adapt the two-timescale framework of [[7](https://arxiv.org/html/2606.11266#bib.bib53)] with the perturbed-ODE machinery of [[27](https://arxiv.org/html/2606.11266#bib.bib54)].
