TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning
Summary
TRIDENT is a novel multi-agent reinforcement learning framework that breaks the coupling between hybrid discrete-continuous actions, hard safety constraints, and physics-governed dynamics, achieving provably safe coordination with a convergence guarantee to a constrained Nash equilibrium and significant reductions in training-time violations.
# Breaking the Hybrid–Safety–Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning
Source: [https://arxiv.org/html/2606.18308](https://arxiv.org/html/2606.18308)
Zijie Meng, Ziwei Li, Zhiyu Li, Jiyuan Liu (Peking University); Yufei Liu (Xiamen University); Wenhua Nie (National Taiwan University); Bingcai Wei (WHU); Miao Zhang (THU / Jimei University)
###### Abstract
Safe coordination in networked cyber-physical systems forces learning algorithms to simultaneously handle *hybrid discrete–continuous actions*, *hard training-time safety constraints*, and *physics-governed dynamics*. We show that these three features form a directed cycle of biases that defeats any naive composition of off-the-shelf modules, and formalize this as a *three-way coupling lemma*. We then introduce Trident, the first MARL framework whose three components are co-designed to cancel each leak: a Richardson–Romberg gradient correction reducing Gumbel-Softmax bias from $\mathcal{O}(\tau)$ to $\mathcal{O}(\tau^{2})$, a Lyapunov-constrained sequential trust-region update enforcing per-iterate feasibility, and a physics-informed *residual critic* that decomposes value rather than reward. We prove an $\tilde{\mathcal{O}}(1/\sqrt{K})$ convergence rate to a constrained Nash equilibrium and an $\mathcal{O}(\sqrt{K})$ cumulative-violation bound. On multi-UAV mobile-edge computing, autonomous intersection management, and a hybrid SMAC variant, Trident cuts training-time violations by $95.5\%$ relative to MADDPG and $76.3\%$ relative to MACPO, while improving reward by $13.5\%$ over the strongest unconstrained baseline.
Equal contribution: Zijie Meng, Yufei Liu. Corresponding author: ymlf@stu.pku.edu.cn
## 1 Introduction
Consider a fleet of unmanned aerial vehicles (UAVs) deployed over a disaster area to provide mobile edge computing services to ground first responders (Wang and others, [2021](https://arxiv.org/html/2606.18308#bib.bib36); Zhou and others, [2023](https://arxiv.org/html/2606.18308#bib.bib41)). Within tens of milliseconds, each UAV must select which of several heterogeneous backbone servers will relay a rescue worker’s video stream (a discrete choice over a small set of links), decide how much of the upcoming computation to offload (a continuous fraction in $[0,1]$), and update its trajectory while keeping battery, coverage, and inter-UAV separation strictly within hardware limits. Unlike a recommender or a chess engine that can afford a regret-then-improve curve, every unsafe action committed during training has physical, irreversible consequences: a depleted battery, a mid-air near-miss, a dropped emergency video stream (García and Fernández, [2015](https://arxiv.org/html/2606.18308#bib.bib12); Brunke et al., [2022](https://arxiv.org/html/2606.18308#bib.bib7)). This scene is not exotic; it is the prototypical pattern of safe coordination in *networked cyber-physical systems* (CPS), shared by autonomous-intersection vehicles, robot warehouses, and connected-vehicle platoons (Zhou et al., [2021](https://arxiv.org/html/2606.18308#bib.bib40); Meng et al., [2026](https://arxiv.org/html/2606.18308#bib.bib47); Liu et al., [2025](https://arxiv.org/html/2606.18308#bib.bib48); Wei et al., [2025](https://arxiv.org/html/2606.18308#bib.bib49)).
A close look at such systems reveals that three structural features appear together, never in isolation. The first is a hybrid action structure (F1): the decision factorizes as $a=(a^{d},a^{c})$, in which $a^{d}$ names a mode (which server, which lane, which target) and $a^{c}$ parameterizes its execution (offload ratio, throttle, aim point); discretizing $a^{c}$ destroys resolution, while relaxing $a^{d}$ produces infeasible interpolations between physically incompatible modes (Fu et al., [2019](https://arxiv.org/html/2606.18308#bib.bib11); Fan et al., [2019](https://arxiv.org/html/2606.18308#bib.bib10)). The second is hard, training-time safety (F2): cost thresholds must be respected not only at convergence but at every iterate the policy is allowed to execute on hardware (Achiam et al., [2017](https://arxiv.org/html/2606.18308#bib.bib1); Chow et al., [2018](https://arxiv.org/html/2606.18308#bib.bib8); Gu and others, [2021](https://arxiv.org/html/2606.18308#bib.bib14); Li and Azizan, [2024](https://arxiv.org/html/2606.18308#bib.bib15)). The third is physics-governed dynamics (F3): substantial portions of the transition kernel and reward follow closed-form physics (Shannon capacity, Friis path loss, Newton’s equations), and rediscovering them from scratch wastes orders of magnitude of samples (Karniadakis et al., [2021](https://arxiv.org/html/2606.18308#bib.bib19); Banerjee et al., [2023](https://arxiv.org/html/2606.18308#bib.bib6); Cao et al., [2024](https://arxiv.org/html/2606.18308#bib.bib42); Meng, [2026](https://arxiv.org/html/2606.18308#bib.bib44); Meng et al., [2025](https://arxiv.org/html/2606.18308#bib.bib45); Liu et al., [2026](https://arxiv.org/html/2606.18308#bib.bib46)).
A natural first reaction is to take an off-the-shelf hybrid-action MARL method (Fu et al., [2019](https://arxiv.org/html/2606.18308#bib.bib11)), wrap it in a safety procedure such as MACPO (Gu and others, [2021](https://arxiv.org/html/2606.18308#bib.bib14)), and add a physics-shaped reward term. We tried exactly this composition; the result is unstable and often *worse* than each component in isolation. The root cause, which we make precise in Section [4](https://arxiv.org/html/2606.18308#S4), is that the three features form a tight directed cycle of errors rather than a list of independent issues (as illustrated in Figure [1](https://arxiv.org/html/2606.18308#S4.F1)). Standard Gumbel-Softmax estimators carry an $\mathcal{O}(\tau)$ gradient bias (Jang et al., [2017](https://arxiv.org/html/2606.18308#bib.bib17); Maddison et al., [2017](https://arxiv.org/html/2606.18308#bib.bib24)); substituted into a Lagrangian or trust-region safety update, this bias produces a multiplier that oscillates instead of decreasing the Lyapunov function, so safety guarantees that hold under an exact gradient cease to hold under the biased one (F1 $\to$ F2). A physics-agnostic safety critic must regress cost-value functions that are highly multi-modal across discrete branches (offloading to fog server 1 versus 2 yields qualitatively different energy curves), and without physics priors it underestimates feasibility margins on rarely visited branches, causing recovery to over-correct on the wrong branch (F2 $\to$ F3). Conversely, the standard remedy of folding physics into a single scalar reward-shaping term shifts the soft-Bellman fixed point and destroys the per-branch structure the discrete sub-policy is meant to exploit, so the discrete head learns degenerate, single-mode behaviour (F3 $\to$ F1). These dependencies form a directed cycle: any module designed in isolation leaks errors into the next, which leaks them back.
Treating the three challenges separately is therefore not merely suboptimal; it is provably circular.
We argue that the right level of abstraction is neither “add safety to MARL” nor “add physics to safe RL”, but a joint object: a constrained hybrid-action policy whose gradients are shaped by physics and whose updates are shaped by Lyapunov constraints. The three-way coupling above gives three concrete design principles, each instantiated as one component of Trident (the Temperature-corrected, Residual, Infinitesimally feasible, DEcoupled, sequeNTial framework). Because the components are co-designed, the residual error of one no longer enters the others’ guarantees, and a single convergence-and-safety analysis closes the loop. Concretely, our contributions are fourfold:
- A coupling lemma that formalizes why hybrid actions, hard safety, and physics priors cannot be composed naively, and uniquely determines the architecture of any correct fix.
- Trident, the first MARL framework that co-designs hybrid-action, safety, and physics modules so their residual errors no longer feed into one another’s guarantees.
- Joint guarantees: $\tilde{\mathcal{O}}(1/\sqrt{K})$ convergence to a constrained Nash equilibrium, $\mathcal{O}(\sqrt{K})$ cumulative violation, and a physics-driven sample-complexity reduction.
- Strong empirical results on UAV mobile-edge computing, autonomous intersection management, and a hybrid SMAC variant: $95.5\%$ fewer violations than MADDPG, $76.3\%$ fewer than MACPO, $13.5\%$ higher reward, scaling to 32 agents.
## 2 Related Work
Hybrid-action MARL. Deep MAPQN and MAHHQN (Fu et al., [2019](https://arxiv.org/html/2606.18308#bib.bib11)) pioneered DRL on discrete–continuous spaces, and subsequent work refines parameterized-action factorizations (Skrynnik et al., [2021](https://arxiv.org/html/2606.18308#bib.bib37); Fan et al., [2019](https://arxiv.org/html/2606.18308#bib.bib10)). None provides convergence rates or safety guarantees, and all rely on standard Gumbel-Softmax estimators (Jang et al., [2017](https://arxiv.org/html/2606.18308#bib.bib17); Maddison et al., [2017](https://arxiv.org/html/2606.18308#bib.bib24)) whose $\mathcal{O}(\tau)$ gradient bias is precisely the source of the F1 $\to$ F2 leakage we identify.
Safe MARL. MACPO (Gu and others, [2021](https://arxiv.org/html/2606.18308#bib.bib14)) extends CPO (Achiam et al., [2017](https://arxiv.org/html/2606.18308#bib.bib1)) with multi-agent trust-region updates and monotonic-improvement guarantees; MAPPO-Lagrangian, built on Safety Gym (Ray et al., [2019](https://arxiv.org/html/2606.18308#bib.bib29); Yu et al., [2022](https://arxiv.org/html/2606.18308#bib.bib38)), only guarantees feasibility *at convergence*; the most recent MADAC (Li and Azizan, [2024](https://arxiv.org/html/2606.18308#bib.bib15)) establishes generalized-Nash convergence but inherits the unbiased-gradient assumption that F1 violates; and shielding methods (Elsayed-Aly and others, [2021](https://arxiv.org/html/2606.18308#bib.bib9); Alshiekh et al., [2018](https://arxiv.org/html/2606.18308#bib.bib4)) guarantee learning-time safety only with hand-designed shields and cannot accommodate hybrid actions. The Lyapunov-based approach of Chow et al. ([2018](https://arxiv.org/html/2606.18308#bib.bib8)) and its extensions (Huh and Yang, [2020](https://arxiv.org/html/2606.18308#bib.bib16)) is the closest precedent for our safety mechanism, but is restricted to single-agent, continuous-action problems and assumes an unbiased policy gradient.
Physics-informed and residual RL. Residual policy learning (Silver et al., [2018](https://arxiv.org/html/2606.18308#bib.bib33); Johannink et al., [2019](https://arxiv.org/html/2606.18308#bib.bib18)) composes a model-based prior with a learned correction; physics-regulated DRL (Cao et al., [2024](https://arxiv.org/html/2606.18308#bib.bib42)) and physics-informed MBRL (Ramesh and Ravindran, [2023](https://arxiv.org/html/2606.18308#bib.bib28)) exploit known dynamics. Their reward-shaping variants, however, suffer the F3 $\to$ F1 leakage we identify, since additive shaping shifts the soft-Bellman fixed point (Ng et al., [1999](https://arxiv.org/html/2606.18308#bib.bib25)). We adapt residual ideas to a centralized multi-agent critic and quantify, for the first time, the resulting variance reduction in a constrained-MARL setting.
## 3 Preliminaries
We model CPS coordination as a *Constrained Multi-Agent MDP* (C-MAMDP) $\mathcal{M}=(\mathcal{N},\mathcal{S},\{\mathcal{A}_{i}\}_{i},\mathcal{P},r,\{c_{k},d_{k}\}_{k=1}^{K},\gamma)$ with $N$ agents, global state space $\mathcal{S}$, hybrid per-agent action space $\mathcal{A}_{i}=\mathcal{A}_{i}^{d}\times\mathcal{A}_{i}^{c}$ ($\mathcal{A}_{i}^{d}=\{1,\ldots,M_{i}\}$ discrete, $\mathcal{A}_{i}^{c}\subseteq\mathbb{R}^{p_{i}}$ continuous), transition kernel $\mathcal{P}$, shared reward $r$, $K$ bounded costs $c_{k}:\mathcal{S}\times\mathcal{A}\to[0,C_{\max}]$ with thresholds $d_{k}$, and discount $\gamma\in(0,1)$. Each agent $i$ holds a local policy $\pi_{i}:\mathcal{O}_{i}\to\Delta(\mathcal{A}_{i})$ on observation $o_{i}$; we adopt the standard centralized-training, decentralized-execution (CTDE) paradigm (Lowe et al., [2017](https://arxiv.org/html/2606.18308#bib.bib23)), where centralized critics access the full state in training while actors rely solely on local observations at deployment.
Given a joint policy $\bm{\pi}$, the value and per-constraint cost-value functions are $V^{\bm{\pi}}(s)=\mathbb{E}_{\bm{\pi}}\big[\sum_{t}\gamma^{t}r_{t}\,|\,s_{0}=s\big]$ and $V_{c_{k}}^{\bm{\pi}}(s)=\mathbb{E}_{\bm{\pi}}\big[\sum_{t}\gamma^{t}c_{k}(s_{t},\bm{a}_{t})\,|\,s_{0}=s\big]$. The objective is a *constrained Nash equilibrium* (CNE): $\max_{\bm{\pi}}\mathbb{E}_{s_{0}\sim\rho}V^{\bm{\pi}}(s_{0})$ s.t. $\mathbb{E}_{s_{0}}V_{c_{k}}^{\bm{\pi}}(s_{0})\leq d_{k}$ for all $k$, i.e. a joint policy from which no agent can unilaterally improve its constrained return, which is the equilibrium concept used in recent safe-MARL theory (Gu and others, [2021](https://arxiv.org/html/2606.18308#bib.bib14); Li and Azizan, [2024](https://arxiv.org/html/2606.18308#bib.bib15)). For hybrid actions, we factorize $\pi_{i}(a_{i}|o_{i})=\pi_{i}^{d}(a_{i}^{d}|o_{i})\,\pi_{i}^{c}(a_{i}^{c}|o_{i},a_{i}^{d})$, with $\pi_{i}^{d}$ categorical and $\pi_{i}^{c}$ a Gaussian conditioned on the discrete choice. This conditional (rather than joint or product) factorization is essential because the continuous parameters change meaning across discrete modes: the same scalar “power” carries different physical units on different communication links, so a single shared continuous head would entangle physically incompatible regimes.
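The conditional factorization above can be illustrated with a minimal sketch. It assumes a one-dimensional continuous action and per-mode Gaussian parameters supplied as plain lists; the function names `sample_hybrid_action` and `log_prob` are hypothetical and are not taken from the paper.

```python
import math
import random


def sample_hybrid_action(discrete_probs, cont_params, rng):
    """Sample (a_d, a_c) from pi_d(a_d | o) * pi_c(a_c | o, a_d).

    discrete_probs: mode probabilities, already conditioned on the observation.
    cont_params: one (mean, std) pair per discrete mode, so the continuous
        head is conditioned on the chosen mode (illustrative assumption).
    """
    # Stage 1: categorical draw over the discrete modes.
    a_d = rng.choices(range(len(discrete_probs)), weights=discrete_probs, k=1)[0]
    # Stage 2: Gaussian draw whose parameters depend on the chosen mode.
    mean, std = cont_params[a_d]
    a_c = rng.gauss(mean, std)
    return a_d, a_c


def log_prob(a_d, a_c, discrete_probs, cont_params):
    """log pi(a | o) = log pi_d(a_d | o) + log pi_c(a_c | o, a_d)."""
    mean, std = cont_params[a_d]
    log_pd = math.log(discrete_probs[a_d])
    log_pc = -0.5 * ((a_c - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))
    return log_pd + log_pc
```

Because each mode owns its own `(mean, std)` pair, the same scalar can take a different meaning per mode, which is the point of the conditional factorization.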
## 4 The Three-Way Coupling Challenge
This section formalises the intuition of §[1](https://arxiv.org/html/2606.18308#S1): features (F1)–(F3) induce a directed cycle of bias that closes through the actor, the safety critic, and the reward critic of any naive composition. Quantifying the cycle (Lemma [1](https://arxiv.org/html/2606.18308#Thmtheorem1)) directly dictates the form of Trident.
Figure 1: Three-way coupling. Nodes: Hybrid Actor $\pi^{d}\cdot\pi^{c}$ (F1), Safety Critic $L_{k}$ (F2), Reward Critic $Q_{\phi}$ (F3). Red wavy arrows: the bias-leakage cycle of any naive composition (F1 $\to$ F2: GS bias $\mathcal{O}(\tau)$; F2 $\to$ F3: mis-estimated feasibility; F3 $\to$ F1: shaping flattens modes). Green dashed arrows: the three co-designed mechanisms in Trident that cancel each leak (Stgc: $\mathcal{O}(\tau^{2})$; Lcpo: TR $\delta_{\text{TR}}=\tilde{\mathcal{O}}(1/\sqrt{K})$; Pirc: $Q_{\text{phys}}$ frozen) (Lemma [1](https://arxiv.org/html/2606.18308#Thmtheorem1)).
Let $\beta_{\text{GS}}:=\left\|\mathbb{E}[\hat{g}^{d}]-g^{d}\right\|$ be the discrete-branch gradient bias, $\epsilon_{Q}$ the reward-critic MSE, and $\eta_{s}$ the safety-step magnitude; Figure [1](https://arxiv.org/html/2606.18308#S4.F1) sketches their dependencies.
###### Lemma 1 (Bias Propagation in Naive Composition).
For a baseline using a Gumbel-Softmax estimator with bias $\beta_{\text{GS}}$, a Lagrangian or trust-region safety step of magnitude $\eta_{s}$, and a critic with MSE $\epsilon_{Q}$, the per-iteration constraint-violation increment satisfies
$$\Delta_{k}^{t}\leq\underbrace{\eta_{s}\beta_{\text{GS}}\left\|\nabla_{a}A_{c_{k}}\right\|_{\infty}}_{\text{F1}\to\text{F2 leak}}+\underbrace{\eta_{s}\epsilon_{Q}L_{\pi}}_{\text{F2}\to\text{F3 leak}}+\underbrace{C_{\text{flat}}\left\|Q_{\text{phys}}-Q^{*}\right\|_{\infty}}_{\text{F3}\to\text{F1 leak}},\tag{1}$$
where $L_{\pi}$ is the policy Lipschitz constant and $C_{\text{flat}}$ measures the entropy-flattening of misspecified shaping. (Proof: Appendix [B](https://arxiv.org/html/2606.18308#A2).)
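To make the structure of bound (1) concrete, the following minimal sketch evaluates its three summands separately. The function name `violation_increment_bound` and the numeric inputs in the usage note are illustrative assumptions, not values reported in the paper.

```python
def violation_increment_bound(eta_s, beta_gs, grad_adv_norm, eps_q, l_pi, c_flat, phys_gap):
    """Evaluate the right-hand side of bound (1), term by term.

    grad_adv_norm stands for ||grad_a A_ck||_inf and phys_gap for
    ||Q_phys - Q*||_inf; both are supplied as plain numbers here.
    """
    leak_f1_f2 = eta_s * beta_gs * grad_adv_norm  # discrete-gradient bias leak
    leak_f2_f3 = eta_s * eps_q * l_pi             # critic-error leak
    leak_f3_f1 = c_flat * phys_gap                # shaping (mode-flattening) leak
    terms = (leak_f1_f2, leak_f2_f3, leak_f3_f1)
    return terms, sum(terms)
```

For instance, `violation_increment_bound(0.1, 0.5, 2.0, 0.2, 3.0, 0.0, 1.0)` gives summands of roughly 0.1, 0.06, and 0, so setting `c_flat` to zero removes only the third term and leaves the other two untouched.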
Under any drop-in combination ($\beta_{\text{GS}}=\Theta(1)$ for Straight-Through and $\Theta(\tau_{\text{min}})$ for floored Gumbel-Softmax, $\epsilon_{Q}=\Theta(1)$ without a physics prior, $C_{\text{flat}}$ uncontrolled under additive shaping), each summand of ([1](https://arxiv.org/html/2606.18308#S4.E1)) is bounded away from zero and cumulative violation grows as $\Theta(K)$. Crucially, the three leaks scale with three *independent* biases, so closing the cycle requires three co-designed, not interchangeable, interventions, one per term: a discrete estimator with $\beta_{\text{GS}}=o(\tau)$, a safety step preserving feasibility *per iterate* rather than only asymptotically, and a physics prior entering *multiplicatively* as a frozen value rather than additively as shaping. These three conditions translate one-to-one into the three modules of Trident:
###### Design Principle 1 (Bias-attenuated discrete gradients).
$\beta_{\text{GS}}=o(\tau)$; instantiated by Stgc (§[5.1](https://arxiv.org/html/2606.18308#S5.SS1)), attaining $\mathcal{O}(\tau^{2})$.
###### Design Principle 2 (Per-iterate feasibility).
Lyapunov trust region with explicit recovery, in place of asymptotic duality; instantiated by Lcpo (§[5.2](https://arxiv.org/html/2606.18308#S5.SS2)), yielding $\mathcal{O}(\sqrt{K})$ cumulative violation.
###### Design Principle 3 (Multiplicative physics prior).
$Q_{\phi}=Q_{\text{phys}}+Q_{\phi_{\text{res}}}$ with $Q_{\text{phys}}$ frozen; instantiated by Pirc (§[5.3](https://arxiv.org/html/2606.18308#S5.SS3)), attaining $C_{\text{flat}}=0$ and shrinking $\epsilon_{Q}$.
Figure 2: System architecture of Trident. The framework resolves the three-way coupling of hybrid actions, physics, and safety via three co-designed modules. Solid arrows denote forward passes; blue dashed arrows denote gradient flows. (A) Sha (Structured Hybrid Actor): uses a bilevel conditional policy and Straight-Through Gradient Correction (Stgc) across two temperatures ($\tau,\tau_{0}$) to reduce discrete gradient bias to $\mathcal{O}(\tau^{2})$. (B) Pirc (Physics-Informed Residual Critic): avoids reward-shaping artifacts by explicitly decomposing the value into a frozen physical prior $Q_{\text{phys}}$ and a learned residual $Q_{\phi_{\text{res}}}$. (C) Lcpo (Lyapunov-Constrained Sequential Policy Optimization): enforces training-time safety via Lyapunov critics $L_{k}$. It updates agents sequentially to avoid non-stationarity and uses a recovery step to bound cumulative violations to $\mathcal{O}(\sqrt{K})$.
## 5 Trident: A Co-Designed Framework with Joint Guarantees
We instantiate these principles in Trident (Figure [2](https://arxiv.org/html/2606.18308#S4.F2), Algorithm [1](https://arxiv.org/html/2606.18308#alg1)), comprising three co-designed modules: a Structured Hybrid Actor (Sha; §[5.1](https://arxiv.org/html/2606.18308#S5.SS1)), a Physics-Informed Residual Critic (Pirc; §[5.3](https://arxiv.org/html/2606.18308#S5.SS3)), and Lyapunov-Constrained Sequential Policy Optimization with Lyapunov cost critics (Lcpo; §[5.2](https://arxiv.org/html/2606.18308#S5.SS2)). Each module explicitly cancels one bias leak from Lemma [1](https://arxiv.org/html/2606.18308#Thmtheorem1). Joint convergence and safety guarantees are summarized in §[5.4](https://arxiv.org/html/2606.18308#S5.SS4), with proofs deferred to Appendices [B](https://arxiv.org/html/2606.18308#A2)–[F](https://arxiv.org/html/2606.18308#A6).
### 5.1 Structured Hybrid Actor
The actor must emit, for each agent, a hybrid action $(a_{i}^{d},a_{i}^{c})$ whose discrete part selects a physically meaningful mode and whose continuous part parameterizes it. Agent $i$ thus acts in two stages. First, the discrete-branch logits $\ell_{i}=f_{\theta_{i}^{d}}(o_{i})\in\mathbb{R}^{M_{i}}$ are perturbed and relaxed via Gumbel-Softmax (Jang et al., [2017](https://arxiv.org/html/2606.18308#bib.bib17); Maddison et al., [2017](https://arxiv.org/html/2606.18308#bib.bib24)) at temperature $\tau$, namely $\hat{a}_{i}^{d}(\tau)=\operatorname{softmax}\big((\ell_{i}+g)/\tau\big)$ with $g_{j}\sim\text{Gumbel}(0,1)$; the standard straight-through trick replaces the soft sample by its $\operatorname*{arg\,max}$ at execution while routing gradients through the soft surrogate. Conditioned on $\hat{a}_{i}^{d}$, a continuous-branch network emits a Gaussian $(\mu_{i},\sigma_{i})=g_{\theta_{i}^{c}}(o_{i},\hat{a}_{i}^{d})$ with $a_{i}^{c}\sim\mathcal{N}(\mu_{i},\operatorname{diag}(\sigma_{i}^{2}))$, followed by a $\tanh$-squash with the standard log-prob correction to respect the per-mode box constraint. This bilevel factorization lets continuous parameters carry mode-specific physical meaning, exactly as motivated in §[3](https://arxiv.org/html/2606.18308#S3).
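The two sampling steps described above can be sketched in plain Python. This is a forward-pass illustration only, assuming list-valued logits and a scalar continuous action; it does not implement gradient routing, and the helper names and the small `1e-6` stabilizer are assumptions rather than details from the paper.

```python
import math
import random


def gumbel_softmax_sample(logits, tau, rng):
    """Return (soft, hard_index) with soft = softmax((logits + g) / tau)."""
    # Gumbel(0, 1) noise via the inverse CDF: g = -log(-log(u)), u ~ Uniform(0, 1).
    gumbels = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in logits]
    scaled = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    soft = [e / z for e in exps]
    # Straight-through execution uses the arg max of the soft sample.
    hard_index = max(range(len(soft)), key=soft.__getitem__)
    return soft, hard_index


def tanh_squash(u, log_prob_u):
    """Squash a Gaussian sample u into (-1, 1) and correct its log-density."""
    a = math.tanh(u)
    # Change of variables: log p(a) = log p(u) - log(1 - tanh(u)^2).
    return a, log_prob_u - math.log(1.0 - a * a + 1e-6)
```

In an autodiff framework the hard index would be used for execution while gradients flow through `soft`; lower `tau` pushes `soft` closer to a one-hot vector.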
The remaining question is how to back-propagate through the discrete sample without the bias that violates Principle [1](https://arxiv.org/html/2606.18308#Thmprinciple1). Plain Gumbel-Softmax carries an $\mathcal{O}(\tau)$ bias that vanishes only at the price of gradient variance $\mathcal{O}(1/\tau^{2})$; the Straight-Through (ST) estimator trades this for an $\mathcal{O}(1)$ bias that persists under annealing (Paulus et al., [2021](https://arxiv.org/html/2606.18308#bib.bib26); Shekhovtsov, [2023](https://arxiv.org/html/2606.18308#bib.bib32)). Either regime (high variance at low $\tau$, or persistent bias at any $\tau$) propagates into the safety update through the first summand of Lemma [1](https://arxiv.org/html/2606.18308#Thmtheorem1) and breaks Principle [1](https://arxiv.org/html/2606.18308#Thmprinciple1).
Straight-Through Gradient Correction (Stgc). Following the classical Richardson–Romberg bias-cancellation technique (Richardson, [1911](https://arxiv.org/html/2606.18308#bib.bib30); Bach, [2021](https://arxiv.org/html/2606.18308#bib.bib5)), we evaluate the Gumbel-Softmax Jacobian at two temperatures, the current $\tau$ and a fixed reference $\tau_{0}>\tau$, and linearly combine them so that the leading $\mathcal{O}(\tau)$ term cancels:
$$\boxed{\,\nabla_{\theta^{d}}\hat{a}_{\text{Stgc}}^{d}=(1+\lambda_{\tau})\nabla_{\theta^{d}}\hat{a}^{d}(\tau)-\lambda_{\tau}\nabla_{\theta^{d}}\hat{a}^{d}(\tau_{0})\,}\tag{2}$$
with $\lambda_{\tau}=\tau/(\tau_{0}-\tau)$. Expanding the Gumbel-Softmax Jacobian as $J(\tau)=J_{0}+\tau J_{1}+\tau^{2}J_{2}+\mathcal{O}(\tau^{3})$ around the exact softmax Jacobian $J_{0}$, equation ([2](https://arxiv.org/html/2606.18308#S5.E2)) exactly cancels the $\tau J_{1}$ term while leaving an $\mathcal{O}(\tau^{2})$ residual; this is the bias regime mandated by Principle [1](https://arxiv.org/html/2606.18308#Thmprinciple1). The cost is a single additional forward pass at the fixed reference $\tau_{0}$, which adds about $18\%$ wall-clock overhead in our profiling (Appendix [G](https://arxiv.org/html/2606.18308#A7)) but is dwarfed by the convergence-speed saving.
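The two-temperature combination in (2) can be checked numerically on a toy model. The sketch below assumes a scalar stand-in Jacobian that is exactly linear in $\tau$ (the hypothetical `toy_jacobian`), so it demonstrates only the cancellation of the first-order term and says nothing about higher-order terms or about a real Gumbel-Softmax Jacobian.

```python
def stgc_weights(tau, tau0):
    """Weights of equation (2): (1 + lambda_tau, -lambda_tau)."""
    if not 0.0 < tau < tau0:
        raise ValueError("requires 0 < tau < tau0")
    lam = tau / (tau0 - tau)
    return 1.0 + lam, -lam


def stgc_combine(jac_tau, jac_tau0, tau, tau0):
    """Elementwise linear combination of two Jacobian estimates."""
    w, w0 = stgc_weights(tau, tau0)
    return [w * a + w0 * b for a, b in zip(jac_tau, jac_tau0)]


def toy_jacobian(tau, j0=1.0, j1=0.7):
    """Illustrative stand-in: J(tau) = J0 + tau * J1, with no higher-order terms."""
    return [j0 + tau * j1]
```

With `tau = 0.2` and `tau0 = 1.0`, `stgc_combine(toy_jacobian(0.2), toy_jacobian(1.0), 0.2, 1.0)` recovers `j0` up to rounding, because the weights sum to one and the coefficient on `j1` is `(1 + lam) * tau - lam * tau0 = 0`.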
###### Theorem 2 (Stgc Bias Bound).
For any logits $\ell$ with $\left\|\ell\right\|_{\infty}\leq\ell_{\text{max}}$ and any $\tau\in(0,\tau_{0}/2)$, the Stgc estimator of ([2](https://arxiv.org/html/2606.18308#S5.E2)) satisfies $\left\|\mathbb{E}[\nabla_{\theta^{d}}\hat{a}_{\text{Stgc}}^{d}]-\nabla_{\theta^{d}}\mathbb{E}[a^{d}]\right\|\leq C\tau^{2}/\tau_{0}$ with $C$ depending only on $\ell_{\text{max}}$ and $M_{i}$; plain GS attains $\mathcal{O}(\tau)$ and ST attains $\mathcal{O}(1)$. (Proof in Appendix [C](https://arxiv.org/html/2606.18308#A3).)
Theorem [2](https://arxiv.org/html/2606.18308#Thmtheorem2) cancels the first summand of ([1](https://arxiv.org/html/2606.18308#S4.E1)): substituting $\beta_{\text{GS}}=\mathcal{O}(\tau^{2})$ together with the standard annealing schedule $\tau(t)=\tau_{0}\beta^{t}$ yields $\sum_{t}\tau(t)^{2}=\mathcal{O}(1)$, so the F1 $\to$ F2 leak contributes only a constant to the cumulative violation, instead of the $\Theta(K)$ contribution of plain GS.
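The constant claimed above follows from a geometric series. As a short worked check, assuming the decay factor of the annealing schedule satisfies $\beta\in(0,1)$ (a different symbol from the bias $\beta_{\text{GS}}$) and that iterations are indexed $t=0,\ldots,K-1$ (the indexing is an assumption):

```latex
\sum_{t=0}^{K-1}\tau(t)^{2}
  = \tau_{0}^{2}\sum_{t=0}^{K-1}\beta^{2t}
  = \tau_{0}^{2}\,\frac{1-\beta^{2K}}{1-\beta^{2}}
  \;\le\; \frac{\tau_{0}^{2}}{1-\beta^{2}}
  = \mathcal{O}(1).
```

For example, $\tau_{0}=1$ and $\beta=0.99$ give the bound $1/(1-0.9801)\approx 50.25$, independent of $K$.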
### 5.2 Lyapunov-Constrained Sequential Policy Optimization
By Principle [2](https://arxiv.org/html/2606.18308#Thmprinciple2) the safety mechanism must preserve feasibility per iterate, not merely at convergence. Lagrangian methods, the workhorse of single-agent safe RL, fail this requirement: their multipliers oscillate while approaching feasibility, and during the oscillation the policy can, and routinely does, execute unsafe actions on the environment (Stooke et al., [2020](https://arxiv.org/html/2606.18308#bib.bib34); Liu and others, [2022](https://arxiv.org/html/2606.18308#bib.bib22)). We therefore replace the Lagrangian with a Lyapunov constraint, which converts the asymptotic feasibility condition into a one-step contraction toward the safe set.
Lyapunov cost critic. For each constraint $k$ we maintain a learned Lyapunov function $L_{k}(s)=V_{c_{k}}^{\bm{\pi}}(s)+\xi_{k}$, where $V_{c_{k}}^{\bm{\pi}}$ is a TD-trained cost-value estimate and $\xi_{k}\geq 0$ is a small slack (Chow et al., [2018](https://arxiv.org/html/2606.18308#bib.bib8); Huh and Yang, [2020](https://arxiv.org/html/2606.18308#bib.bib16)). We require the new policy to satisfy the one-step contraction
$$\mathbb{E}_{s'\sim\mathcal{P},\,\bm{a}\sim\bm{\pi}_{\text{new}}}\big[L_{k}(s')\big]-L_{k}(s)\leq-\alpha_{k}\big(L_{k}(s)-d_{k}\big),\tag{3}$$
for all $s$, with decay rate $\alpha_{k}\in(0,1)$. Equation ([3](https://arxiv.org/html/2606.18308#S5.E3)) has a transparent interpretation: whenever the current state is infeasible ($L_{k}>d_{k}$), the right-hand side is strictly negative and forces the new policy to drive the cost-value down by at least an $\alpha_{k}$ fraction of the current excess; whenever the state is feasible, the constraint reduces to a non-expansion condition. Iterating ([3](https://arxiv.org/html/2606.18308#S5.E3)) thus produces a geometric decay of any constraint violation, which is exactly what underlies the $\mathcal{O}(\sqrt{K})$ bound below.
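The geometric decay implied by (3) can be traced with a small sketch. It assumes the worst case in which (3) holds with equality at every step, and it ignores the expectation and any critic error, so it illustrates the contraction only and does not reproduce the paper's cumulative-violation bound; both function names are hypothetical.

```python
def excess_trajectory(initial_excess, alpha, steps):
    """Worst-case excess L_k - d_k when (3) holds with equality each step.

    Rearranging (3) gives: next_excess <= (1 - alpha) * current_excess.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    out = [initial_excess]
    for _ in range(steps):
        out.append((1.0 - alpha) * out[-1])
    return out


def cumulative_excess_bound(initial_excess, alpha):
    """Geometric-series bound on the summed positive excess over all steps."""
    return max(initial_excess, 0.0) / alpha
```

For example, an initial excess of 2.0 with `alpha = 0.5` yields the trajectory 2.0, 1.0, 0.5, 0.25, whose sum stays below the series bound of 4.0. A negative (feasible) initial excess stays negative under the same recursion.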
Sequential multi-agent update. At iteration $t$ the agents are updated in a fixed order $i=1,\ldots,N$ (Kuba et al., [2022](https://arxiv.org/html/2606.18308#bib.bib20); Gu and others, [2021](https://arxiv.org/html/2606.18308#bib.bib14)); each agent solves a per-agent constrained trust-region problem given the previous agents’ updated policies and the remaining agents’ old policies:
$$\begin{aligned}\theta_{i}^{t+1}=\ &\operatorname*{arg\,max}_{\theta_{i}}\ \mathbb{E}_{\bm{o}\sim d^{\bm{\pi}^{t}}}\Big[\textstyle\sum_{\bm{a}}\bm{\pi}_{\theta_{i}}(\bm{a}|\bm{o})A^{\bm{\pi}^{t}}(\bm{o},\bm{a})\Big]\\ \text{s.t.}\ &\mathbb{E}\Big[\textstyle\sum_{\bm{a}}\bm{\pi}_{\theta_{i}}(\bm{a}|\bm{o})A_{c_{k}}^{\bm{\pi}^{t}}(\bm{o},\bm{a})\Big]\leq(1-\gamma)(d_{k}-V_{c_{k}}^{\bm{\pi}^{t}}),\ \forall k,\\ &\bar{D}_{\text{KL}}\big(\bm{\pi}_{\theta_{i}}\,\|\,\bm{\pi}^{t}_{i}\big)\leq\delta_{\text{TR}}.\end{aligned}\tag{4}$$
The first inequality is the linearization of ([3](https://arxiv.org/html/2606.18308#S5.E3)); the trust region $\delta_{\text{TR}}=\tilde{\mathcal{O}}(1/\sqrt{K})$ ensures the linearization remains valid. Sequential (rather than simultaneous) updates avoid the well-known non-stationarity blow-up of joint multi-agent gradient ascent, in which each agent’s update invalidates the others’ (Leonardos et al., [2022](https://arxiv.org/html/2606.18308#bib.bib21); Zhang et al., [2021](https://arxiv.org/html/2606.18308#bib.bib39)); their cost is a finite, controllable telescoping error that we account for in Theorem [3](https://arxiv.org/html/2606.18308#Thmtheorem3).
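The ordering logic of the sequential update can be sketched independently of the trust-region solver. In the sketch below, `update_agent` is a hypothetical callback standing in for the per-agent problem (4), which is not implemented here, and `linearized_cost_budget` only evaluates the right-hand side of the cost constraint.

```python
def sequential_update(policies, update_agent):
    """Update agents in a fixed order 0..N-1.

    update_agent(i, joint) returns agent i's new policy given the current
    joint list, in which agents before i are already updated and agents
    after i still hold their old policies.
    """
    joint = list(policies)  # copy so the caller's list is left untouched
    for i in range(len(joint)):
        joint[i] = update_agent(i, joint)
    return joint


def linearized_cost_budget(gamma, d_k, v_cost):
    """Right-hand side of the cost constraint in (4): (1 - gamma)(d_k - V_ck)."""
    return (1.0 - gamma) * (d_k - v_cost)
```

A toy callback such as `lambda i, joint: joint[i - 1] + 1 if i > 0 else 1` applied to `[0, 0, 0]` returns `[1, 2, 3]`, which shows that each agent sees its predecessor's updated value rather than the old one.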
Feasibility-preserving recovery. A subtle failure mode occurs when the linearized constraint set in ([4](https://arxiv.org/html/2606.18308#S5.E4)) is empty, that is, when no policy in the trust region simultaneously satisfies all $K$ constraints to first order. Standard MACPO (Gu and others, [2021](https://arxiv.org/html/2606.18308#bib.bib14)) returns the previous iterate, which on a hardware deployment means repeating an unsafe action. We instead apply a recovery gradient step that explicitly moves toward the feasible region:
$$\theta_{i}^{t+1}=\theta_{i}^{t}-\eta_{\text{rec}}\sum_{k:V_{c_{k}}^{\bm{\pi}^{t}}>d_{k}}\nabla_{\theta_{i}}V_{c_{k}}^{\bm{\pi}_{\theta_{i}}}(s_{0})\Big|_{\theta_{i}=\theta_{i}^{t}}.\tag{5}$$
The summation is restricted to currently violated constraints, so feasible directions are not perturbed. Together, ([3](https://arxiv.org/html/2606.18308#S5.E3))–([5](https://arxiv.org/html/2606.18308#S5.E5)) cancel the $\eta_{s}$-dependent second summand of ([1](https://arxiv.org/html/2606.18308#S4.E1)): a constant $\eta_{s}=\mathcal{O}(\delta_{\text{TR}})=\tilde{\mathcal{O}}(1/\sqrt{K})$ multiplies a critic error $\epsilon_{Q}=\mathcal{O}(1/\sqrt{n})$ from Pirc below, and the geometric recovery turns any residual violation into a $\sqrt{K}$ partial sum.
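The recovery step (5) amounts to gradient descent on the violated constraints only. A minimal sketch, assuming the cost values and their parameter gradients are already available as plain lists (how they are estimated is outside this sketch, and the name `recovery_step` is hypothetical):

```python
def recovery_step(theta, cost_values, thresholds, cost_grads, eta_rec):
    """theta_new = theta - eta_rec * sum of grad_k over k with V_ck > d_k.

    cost_grads[k] is the gradient of constraint k's cost value with respect
    to theta, given as a list of the same length as theta.
    """
    # Only constraints that are currently violated contribute.
    violated = [k for k, (v, d) in enumerate(zip(cost_values, thresholds)) if v > d]
    new_theta = list(theta)
    for k in violated:
        for j, g in enumerate(cost_grads[k]):
            new_theta[j] -= eta_rec * g
    return new_theta, violated
```

With two constraints of which only the first is violated, the second constraint's gradient is ignored entirely, so parameters that matter only to satisfied constraints are left unchanged.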
Figure 3: Single-UAV obstacle avoidance, sequential snapshots. Pink: executed continuous trajectory; green brackets: discrete waypoint cells chosen by the hybrid policy; red polygons: no-fly zones. Trident routes between hazards without entering any forbidden region, providing a visual instance of the safety bound in Theorem [3](https://arxiv.org/html/2606.18308#Thmtheorem3).
### 5.3 Physics-Informed Residual Critic
Principle [3](https://arxiv.org/html/2606.18308#Thmprinciple3) dictates that physics priors enter the critic as a frozen component of $Q$, rather than as an additive term in the reward. The reason is subtle but important: an additive shaping $r \to r + \omega Q_{\text{phys}}$ shifts the soft-Bellman fixed point and therefore biases the optimal policy itself, which is exactly the F3$\to$F1 leakage of Lemma [1](https://arxiv.org/html/2606.18308#Thmtheorem1) (Ng et al., [1999](https://arxiv.org/html/2606.18308#bib.bib25); Cao et al., [2024](https://arxiv.org/html/2606.18308#bib.bib42)); decomposing the critic, in contrast, preserves the optimal policy and only changes the function class that needs to be learned.
We therefore write the centralized $Q$ as

$$
Q_{\phi}(s,\bm{a}) = \underbrace{Q_{\text{phys}}(s,\bm{a})}_{\text{closed-form, frozen}} + \underbrace{Q_{\phi_{\text{res}}}(s,\bm{a})}_{\text{learned residual}},
\tag{6}
$$

where $Q_{\text{phys}}$ is a domain-specific closed-form expression. In UAV-MEC it captures the Shannon-capacity transmission delay and kinematic energy of the standard 3GPP air-to-ground channel (Al-Hourani et al., [2014](https://arxiv.org/html/2606.18308#bib.bib3); 3GPP, [2020](https://arxiv.org/html/2606.18308#bib.bib35)):
$$
C_{ij} = B\log_2\big(1 + P_i G(d_{ij},h_i)/\sigma^2\big),
\tag{7}
$$

$$
T_{ij}^{\text{tx}} = D_j\alpha_{ij}/C_{ij},
\tag{8}
$$

$$
E_i = P_{\text{fly}}T_i^{\text{fly}} + P_{\text{cmp}}T_i^{\text{cmp}} + P_{\text{tx}}\textstyle\sum_j T_{ij}^{\text{tx}},
\tag{9}
$$

and we set $Q_{\text{phys}} = -(\omega_1 T_{\text{total}} + \omega_2 E_{\text{total}})$, which captures the dominant linear effects analytically. The residual $Q_{\phi_{\text{res}}}$ is a four-layer MLP that learns only the corrections (interference, queueing dynamics, partial observability) that physics does not capture, and the critic is trained with the standard one-step TD loss applied to the sum, namely $\mathcal{L}_{\text{crit}}(\phi_{\text{res}}) = \mathbb{E}_{\mathcal{B}}\big[(Q_{\text{phys}} + Q_{\phi_{\text{res}}} - y)^2\big]$ with target $y = r + \gamma\big(Q_{\text{phys}}(s',\bm{a}') + Q_{\bar{\phi}_{\text{res}}}(s',\bm{a}')\big)$ and $\bar{\phi}_{\text{res}}$ a slowly updated target network. Because $Q_{\text{phys}}$ is frozen, only the residual variance is paid for in samples, and the sample complexity contracts proportionally to the explained variance of the physics term, as we formalize in Appendix [F](https://arxiv.org/html/2606.18308#A6).
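A minimal sketch of Eqs. (7)–(9), the physics term, and the TD loss on the sum may help fix ideas. Everything here is illustrative: the function and argument names are hypothetical, the channel gain $G(d_{ij}, h_i)$ is passed in as a number rather than computed from the 3GPP model, and how $T_{\text{total}}$ and $E_{\text{total}}$ aggregate over agents and users is left to the caller because the text does not spell it out.

```python
import math

def shannon_capacity(bandwidth_hz, tx_power_w, channel_gain, noise_power_w):
    """Eq. (7): C_ij = B * log2(1 + P_i * G / sigma^2)."""
    return bandwidth_hz * math.log2(1.0 + tx_power_w * channel_gain / noise_power_w)

def tx_delay(data_bits, offload_fraction, capacity_bps):
    """Eq. (8): T_ij^tx = D_j * alpha_ij / C_ij."""
    return data_bits * offload_fraction / capacity_bps

def uav_energy(p_fly, t_fly, p_cmp, t_cmp, p_tx, tx_delays):
    """Eq. (9): flight + compute + transmission energy of one UAV."""
    return p_fly * t_fly + p_cmp * t_cmp + p_tx * sum(tx_delays)

def q_phys(total_delay, total_energy, w1, w2):
    """Frozen physics term: Q_phys = -(w1 * T_total + w2 * E_total)."""
    return -(w1 * total_delay + w2 * total_energy)

def td_loss(q_phys_sa, q_res_sa, reward, gamma, q_phys_next, q_res_target_next):
    """Squared one-step TD error on the sum Q_phys + Q_res.

    Only q_res_sa would carry gradients in a real implementation;
    q_phys_* are fixed numbers and q_res_target_next comes from the target network.
    """
    y = reward + gamma * (q_phys_next + q_res_target_next)
    return (q_phys_sa + q_res_sa - y) ** 2

# Illustrative numbers only (SNR of exactly 1, so log2(1 + SNR) = 1).
c = shannon_capacity(1e6, 1.0, 1e-9, 1e-9)
t = tx_delay(2e6, 0.5, c)
e = uav_energy(100.0, 2.0, 10.0, 1.0, 1.0, [t])
q = q_phys(t, e, w1=1.0, w2=0.01)
loss = td_loss(-3.0, 0.5, -1.0, 0.9, -2.0, 0.0)
print(c, t, e, q, loss)
```

The design point mirrored here is that `q_phys` has no trainable parameters, so the regression target seen by the residual is `y - Q_phys`, which is what shrinks when the physics term explains most of the return.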
### 5.4 Algorithm and Joint Theoretical Guarantees
Algorithm 1: Trident (one training iteration)

1. Input: $N$ agents; thresholds $\{d_k\}$; trust region $\delta_{\text{TR}}$; Lyapunov decays $\{\alpha_k\}$; temperature schedule $\tau(t)$ and reference $\tau_0$; physics model $Q_{\text{phys}}$
2. Initialize actors $\{\theta_i^d, \theta_i^c\}$, residual critic $\phi_{\text{res}}$, cost critics $\{\psi_k\}$, replay buffer $\mathcal{B}$
3. for iteration $t = 1, \ldots, T$ do
4. for each agent $i$ do
5. Compute Jacobians $J(\tau)$ and $J(\tau_0)$ from $f_{\theta_i^d}(o_i)$; combine via Eq. ([2](https://arxiv.org/html/2606.18308#S5.E2)) to obtain the Stgc gradient
6. Sample $a_i^d \sim \text{GS}(\tau(t))$; sample $a_i^c \sim \mathcal{N}(g_{\theta_i^c}(o_i, a_i^d))$
7. end for
8. Execute joint $\bm{a}$, observe $(r, \{c_k\}, s', \bm{o}')$; push transition to $\mathcal{B}$
9. for update step $u = 1, \ldots, U$ do
10. Sample mini-batch from $\mathcal{B}$; update $\phi_{\text{res}}$ via Eq. ([6](https://arxiv.org/html/2606.18308#S5.E6)); update each $\psi_k$ via TD on $c_k$
11. for each agent $i$ *sequentially* do
12. Compute $A^{\bm{\pi}^t}$ and $\{A_{c_k}^{\bm{\pi}^t}\}$
13. if $V_{c_k}^{\bm{\pi}^t} \le d_k\ \forall k$ then solve Eq. ([4](https://arxiv.org/html/2606.18308#S5.E4))
14. else apply recovery Eq. ([5](https://arxiv.org/html/2606.18308#S5.E5))
15. end if
16. end for
17. end for
18. Anneal $\tau(t+1) \leftarrow \max(\tau_{\text{min}}, \tau_0\beta^{t})$
19. end for
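Steps 6 and 18 of Algorithm 1 (hybrid action sampling and temperature annealing) can be sketched as follows. This is a minimal illustration under stated assumptions: the discrete action is taken as the argmax of the relaxed Gumbel-Softmax sample, the Gaussian's standard deviation is a fixed hypothetical parameter `cont_std` (the algorithm only names the mean network $g_{\theta_i^c}$), and `cont_mean_fn` stands in for that network. The Stgc gradient correction of step 5 is not reproduced here.

```python
import math
import random

def gumbel_softmax_sample(logits, tau, rng):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / tau)."""
    # Clamp the uniform draw away from 0 so that log(-log(u)) is always defined.
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    scores = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def sample_hybrid_action(logits, tau, cont_mean_fn, cont_std, rng):
    """Step 6: discrete part from Gumbel-Softmax, continuous part conditioned on it."""
    relaxed = gumbel_softmax_sample(logits, tau, rng)
    a_d = max(range(len(relaxed)), key=relaxed.__getitem__)  # assumption: argmax discretization
    a_c = rng.gauss(cont_mean_fn(a_d), cont_std)             # assumption: fixed std
    return a_d, a_c, relaxed

def anneal_tau(t, tau0, beta, tau_min):
    """Step 18: tau(t+1) = max(tau_min, tau0 * beta**t)."""
    return max(tau_min, tau0 * beta ** t)

rng = random.Random(0)
a_d, a_c, relaxed = sample_hybrid_action([0.1, 0.5, -0.3], 0.5, lambda k: float(k), 0.1, rng)
print(a_d, a_c, [round(p, 3) for p in relaxed])
print([anneal_tau(t, 1.0, 0.9, 0.1) for t in (0, 1, 100)])
```

The floor `tau_min` keeps the relaxation from collapsing to a hard one-hot sample late in training, at which point the schedule stops changing.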
Algorithm [1](https://arxiv.org/html/2606.18308#alg1) composes the three modules into a single training loop, and the joint analysis inherits the structure of Lemma [1](https://arxiv.org/html/2606.18308#Thmtheorem1): substituting $\beta_{\text{GS}} = \mathcal{O}(\tau^2)$ from Theorem [2](https://arxiv.org/html/2606.18308#Thmtheorem2), the trust-region bound from Lcpo, and the variance contraction induced by the frozen physics term of Pirc into the coupling bound ([1](https://arxiv.org/html/2606.18308#S4.E1)) converts the three previously additive leaks into a single jointly controlled error, which is the cycle-breaking promised in §[4](https://arxiv.org/html/2606.18308#S4). Under standard regularity (Assumption [4](https://arxiv.org/html/2606.18308#Thmtheorem4) in Appendix [A](https://arxiv.org/html/2606.18308#A1); bounded reward and cost, Lipschitz policies, bounded advantage variance, Slater feasibility) and the schedule $\delta_{\text{TR}} = \tilde{\mathcal{O}}(1/\sqrt{K})$, this yields a sub-linear cumulative violation, in contrast to the $\Theta(K)$ rate that Lemma [1](https://arxiv.org/html/2606.18308#Thmtheorem1) implies for any naive composition, together with a constrained-Nash convergence rate that matches the standard $\tilde{\mathcal{O}}(1/\sqrt{K})$ of sample-efficient policy optimisation despite the strictly harder constrained, multi-agent, hybrid-action regime.
###### Theorem 3 (Joint Convergence and Safety).
Under the conditions above, the iterates of Algorithm [1](https://arxiv.org/html/2606.18308#alg1) satisfy

$$
\begin{aligned}
\tfrac{1}{K}\sum_{t=1}^{K}\big[V^{\bm{\pi}^{*}}(s_0) - V^{\bm{\pi}^{t}}(s_0)\big] &= \tilde{\mathcal{O}}\Big(\tfrac{N\sigma_A}{\sqrt{K}} + \tfrac{N^{2}}{(1-\gamma)^{3}\sqrt{K}}\Big), \\
\sum_{t=1}^{K}\max\big(0,\, V_{c_k}^{\bm{\pi}^{t}}(s_0) - d_k\big) &= \mathcal{O}\Big(\tfrac{1}{\alpha_k}\sqrt{K/(1-\gamma)}\Big)\quad \forall k.
\end{aligned}
$$
The violation bound is *cumulative* rather than asymptotic, so safety improves throughout training rather than only at convergence, which is exactly the regime CPS deployment demands, while the $N^2$ telescoping artefact of sequential updates (Kuba et al., [2022](https://arxiv.org/html/2606.18308#bib.bib20)) is empirically dominated by the leading $N\sigma_A/\sqrt{K}$ term up to the largest $N = 32$ we test (§[6](https://arxiv.org/html/2606.18308#S6)). Full proofs, together with the complementary sample-complexity result Theorem [5](https://arxiv.org/html/2606.18308#Thmtheorem5), are in Appendices [D](https://arxiv.org/html/2606.18308#A4)–[F](https://arxiv.org/html/2606.18308#A6).
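As a reading aid, dividing the cumulative-violation bound of Theorem 3 by $K$ gives its per-iterate form. This is only a restatement of the stated bound, with constants suppressed exactly as in the theorem:

$$
\frac{1}{K}\sum_{t=1}^{K}\max\big(0,\, V_{c_k}^{\bm{\pi}^{t}}(s_0) - d_k\big)
= \mathcal{O}\Big(\frac{1}{\alpha_k\sqrt{(1-\gamma)\,K}}\Big)
\;\xrightarrow{\;K\to\infty\;}\; 0 .
$$

The average violation therefore decays at the same $K^{-1/2}$ rate as the average suboptimality, whereas a $\Theta(K)$ cumulative violation would leave the average violation bounded away from zero.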
Table 1: Multi-UAV mobile-edge computing (mean ± std, 10 seeds, 20K episodes). † discretized; ‡ hybrid via continuous relaxation. Performance columns: Reward, Exec, Energy, Through.; safety columns: E. V., C. V., Total V.

| Method | Reward ↑ | Exec (s) ↓ | Energy (J) ↓ | Through. ↑ | E. V. (%) ↓ | C. V. (%) ↓ | Total V. ↓ |
|---|---|---|---|---|---|---|---|
| Random | −10.39 ± .82 | 8.52 ± .71 | 95.1 ± 8.3 | 12.1 ± 2.4 | 42.3 | 68.1 | 110.4 |
| Greedy | −7.21 ± .54 | 6.18 ± .43 | 78.2 ± 5.1 | 18.3 ± 1.8 | 31.2 | 22.4 | 53.6 |
| MADDPG† (Lowe et al., [2017](https://arxiv.org/html/2606.18308#bib.bib23)) | −5.41 ± .38 | 4.92 ± .31 | 64.8 ± 4.2 | 24.1 ± 1.5 | 18.7 | 12.3 | 31.0 |
| MATD3‡ (Ackermann et al., [2019](https://arxiv.org/html/2606.18308#bib.bib2)) | −5.18 ± .35 | 4.78 ± .28 | 62.1 ± 3.9 | 25.8 ± 1.3 | 16.4 | 10.8 | 27.2 |
| FACMAC‡ (Peng et al., [2021](https://arxiv.org/html/2606.18308#bib.bib27)) | −5.07 ± .34 | 4.71 ± .27 | 61.3 ± 3.6 | 26.2 ± 1.3 | 15.1 | 9.7 | 24.8 |
| MAPPO (Yu et al., [2022](https://arxiv.org/html/2606.18308#bib.bib38)) | −5.24 ± .37 | 4.82 ± .30 | 63.5 ± 3.8 | 25.3 ± 1.4 | 17.2 | 11.5 | 28.7 |
| HAPPO (Kuba et al., [2022](https://arxiv.org/html/2606.18308#bib.bib20)) | −5.13 ± .36 | 4.74 ± .28 | 61.8 ± 3.7 | 26.0 ± 1.3 | 14.7 | 9.4 | 24.1 |
| MAPPO-Lag (Ray et al., [2019](https://arxiv.org/html/2606.18308#bib.bib29)) | −5.82 ± .41 | 5.21 ± .34 | 60.4 ± 3.5 | 22.7 ± 1.6 | 5.8 | 4.2 | 10.0 |
| MACPO (Gu and others, [2021](https://arxiv.org/html/2606.18308#bib.bib14)) | −5.61 ± .39 | 5.08 ± .32 | 59.1 ± 3.3 | 23.4 ± 1.4 | 3.1 | 2.8 | 5.9 |
| MADAC (Li and Azizan, [2024](https://arxiv.org/html/2606.18308#bib.bib15)) | −5.46 ± .40 | 4.62 ± .29 | 59.9 ± 3.4 | 26.8 ± 1.3 | 3.4 | 2.9 | 6.3 |
| ★ Trident (Ours) | −4.68 ± .27 | 4.31 ± .22 | 55.8 ± 2.8 | 28.2 ± 1.1 | 0.8 | 0.6 | 1.4 |
## 6 Experiments
We evaluate Trident on three benchmarks exhibiting the (F1)–(F3) features: Multi-UAV MEC (hybrid offloading, 3GPP TR 38.901 channel (3GPP, [2020](https://arxiv.org/html/2606.18308#bib.bib35))), Autonomous Intersection Management (AIM) (Zhou et al., [2021](https://arxiv.org/html/2606.18308#bib.bib40)) (hybrid lane/speed control, collision constraints), and a hybrid-action variant of SMAC (Samvelyan et al., [2019](https://arxiv.org/html/2606.18308#bib.bib31)) (discrete target plus continuous offset). We compare against a comprehensive set of state-of-the-art safe and hybrid MARL baselines, including MADDPG (Lowe et al., [2017](https://arxiv.org/html/2606.18308#bib.bib23)), MATD3 (Ackermann et al., [2019](https://arxiv.org/html/2606.18308#bib.bib2)), FACMAC (Peng et al., [2021](https://arxiv.org/html/2606.18308#bib.bib27)), MAPPO (Yu et al., [2022](https://arxiv.org/html/2606.18308#bib.bib38)), HAPPO (Kuba et al., [2022](https://arxiv.org/html/2606.18308#bib.bib20)), MAPPO-Lagrangian (Ray et al., [2019](https://arxiv.org/html/2606.18308#bib.bib29)), MACPO (Gu and others, [2021](https://arxiv.org/html/2606.18308#bib.bib14)), MADAC (Li and Azizan, [2024](https://arxiv.org/html/2606.18308#bib.bib15)), and Shielded RL (Elsayed-Aly and others, [2021](https://arxiv.org/html/2606.18308#bib.bib9)). When a baseline cannot natively handle hybrid actions, we use the discretized (†) or continuous-relaxed (‡) variant. Full environment details, architectures, and hyperparameters are in Appendix [H](https://arxiv.org/html/2606.18308#A8).
Main results. On UAV-MEC (Table [1](https://arxiv.org/html/2606.18308#S5.T1)), Trident attains state-of-the-art on every reward *and* safety metric simultaneously, a regime previous methods cannot occupy, since classical baselines trade safety for reward while existing safe baselines do the reverse. Both fronts are dominated jointly because the three principles act in concert rather than in tension, directly realising the cycle-breaking promised by Lemma [1](https://arxiv.org/html/2606.18308#Thmtheorem1); moreover, convergence is faster than the unconstrained MADDPG baseline, consistent with the variance-contraction effect of Pirc (Theorem [5](https://arxiv.org/html/2606.18308#Thmtheorem5)).
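The headline violation reductions can be recomputed directly from the Total V. column of Table 1. The short check below uses only the table's rounded entries; the helper name `percent_reduction` is ours.

```python
def percent_reduction(baseline, ours):
    """Relative reduction of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline

# Total-violation entries copied from Table 1.
total_violations = {"MADDPG": 31.0, "MACPO": 5.9, "Trident": 1.4}

vs_maddpg = percent_reduction(total_violations["MADDPG"], total_violations["Trident"])
vs_macpo = percent_reduction(total_violations["MACPO"], total_violations["Trident"])

print(f"vs MADDPG: {vs_maddpg:.1f}%")  # 95.5%
print(f"vs MACPO:  {vs_macpo:.1f}%")   # 76.3%
```

Both figures agree with the 95.5% and 76.3% reductions quoted in the abstract.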
Ablations. Table [2](https://arxiv.org/html/2606.18308#S6.T2) removes one component per row, each mapped to the principle it instantiates. Removing the physics critic (Principle [3](https://arxiv.org/html/2606.18308#Thmprinciple3)) slows convergence most, confirming the variance-contraction prediction of Theorem [5](https://arxiv.org/html/2606.18308#Thmtheorem5); replacing Lyapunov with a vanilla Lagrangian (Principle [2](https://arxiv.org/html/2606.18308#Thmprinciple2)) inflates violations several-fold, matching the per-iterate gap argued in §[5.2](https://arxiv.org/html/2606.18308#S5.SS2); disabling Stgc (Principle [1](https://arxiv.org/html/2606.18308#Thmprinciple1)) degrades both safety and convergence, consistent with the bias-leakage of Lemma [1](https://arxiv.org/html/2606.18308#Thmtheorem1). The most diagnostic row is the last: replacing residual with additive-physics shaping degrades both reward *and* safety, empirically realising the F3$\to$F1 leak that motivated the residual formulation in the first place.
Table 2: Ablations on UAV-MEC (10 seeds). Each row removes one component; the rightmost column maps it to a design principle.

| Configuration | R ↑ | T.V. ↓ | Conv. ↑ | Princ. |
|---|---|---|---|---|
| ★ Trident (Ours) (full) | −4.68 | 1.4 | 1.00× | n/a |
| w/o Stgc (plain GS) | −4.95 | 2.1 | 0.82× | [1](https://arxiv.org/html/2606.18308#Thmprinciple1) |
| w/o Lyap. (Lagrangian) | −4.74 | 8.3 | 0.91× | [2](https://arxiv.org/html/2606.18308#Thmprinciple2) |
| w/o Recovery | −4.79 | 3.9 | 0.93× | [2](https://arxiv.org/html/2606.18308#Thmprinciple2) |
| w/o Pirc (no physics) | −5.12 | 1.8 | 0.62× | [3](https://arxiv.org/html/2606.18308#Thmprinciple3) |
| w/o Trust region | −5.04 | 7.5 | 0.74× | [2](https://arxiv.org/html/2606.18308#Thmprinciple2) |
| w/o Hybrid actions | −5.38 | 2.4 | 0.78× | [1](https://arxiv.org/html/2606.18308#Thmprinciple1) |
| w/o Sequential update | −4.81 | 3.7 | 0.88× | n/a |
| Additive phys. (vs. residual) | −5.03 | 2.6 | 0.71× | [3](https://arxiv.org/html/2606.18308#Thmprinciple3) |
Scalability. Table [3](https://arxiv.org/html/2606.18308#S6.T3) sweeps $N \in \{4, 8, 16, 32\}$. Trident's per-agent reward is essentially flat across an $8\times$ increase in agent count and its violations grow as $N^{1.05}$ (least-squares fit), close to the linear lower bound implied by per-constraint $\mathcal{O}(\sqrt{K})$ control; in contrast, classical CTDE methods degrade super-linearly precisely where safety matters most. Wall-clock per iteration also remains below MACPO at $N = 32$, since the variance contraction induced by the physics term outweighs its marginal per-step cost.
Table 3: Scalability with $N$ UAVs (5 seeds). Columns give per-agent reward (R / agent ↑) and total violations (↓) at $N = 4, 8, 16, 32$.

| Method | R / agent, N=4 | N=8 | N=16 | N=32 | Total violations, N=4 | N=8 | N=16 | N=32 |
|---|---|---|---|---|---|---|---|---|
| MADDPG | −1.35 | −1.62 | −2.08 | −3.41 | 31 | 74 | 186 | 453 |
| FACMAC | −1.27 | −1.43 | −1.71 | −2.34 | 25 | 52 | 118 | 272 |
| MACPO | −1.40 | −1.51 | −1.78 | −2.21 | 5.9 | 11.4 | 24.7 | 58.3 |
| MADAC | −1.37 | −1.46 | −1.69 | −2.05 | 6.3 | 12.1 | 26.4 | 61.7 |
| ★ Trident (Ours) | −1.17 | −1.21 | −1.28 | −1.42 | 1.4 | 2.7 | 5.9 | 12.4 |
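The scaling exponent quoted in the scalability paragraph can be approximately recovered from Table 3 with an ordinary least-squares fit in log-log space. The sketch below is a plausible reading of "least-squares fit", not necessarily the authors' exact procedure, and it uses the rounded table entries, so the slope can differ slightly from the reported 1.05.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the exponent p in y ~ c * x**p."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

agents = [4, 8, 16, 32]
# Total-violation rows copied from Table 3.
trident_violations = [1.4, 2.7, 5.9, 12.4]
maddpg_violations = [31, 74, 186, 453]

trident_exp = loglog_slope(agents, trident_violations)
maddpg_exp = loglog_slope(agents, maddpg_violations)
print(f"Trident exponent: {trident_exp:.2f}")
print(f"MADDPG exponent:  {maddpg_exp:.2f}")
```

On these entries the Trident exponent comes out close to 1, while the MADDPG exponent is clearly above 1, which matches the near-linear versus super-linear contrast drawn in the text.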
Cross-domain transfer and empirical verification of theory. On AIM, Trident matches Shielded RL on training collisions (both 0) while attaining strictly higher throughput, a Pareto point neither shielded nor Lagrangian methods can occupy, since shielding throws away exploration on the boundary while Lagrangian methods retain unsafe gradient noise. On the hybrid SMAC variant, Trident tops the average win rate across all five maps (per-map in Appendix [J](https://arxiv.org/html/2606.18308#A10)), confirming that the framework generalises beyond CPS to abstract discrete-target/continuous-offset coordination. Finally, fitting the predicted forms $c\sqrt{K}$ (cumulative violation) and $K^{-1/2}$ (suboptimality) to the learning curves of UAV-MEC and AIM gives $R^2 \ge 0.96$ with sub-optimality exponents $-0.51 \pm 0.03$ and $-0.49 \pm 0.04$, in tight numerical agreement with Theorem [3](https://arxiv.org/html/2606.18308#Thmtheorem3). Hyperparameter sensitivity (Appendix [I](https://arxiv.org/html/2606.18308#A9)) confirms robustness within $\pm 10\%$ of every default.
Table 4: AIM and SMAC-hybrid. Mean over 10/5 seeds.

| Method | AIM Train. Coll. ↓ | AIM Reward ↑ | SMAC-hyb. Win % (MMM2) | SMAC-hyb. Win % (Avg.) ↑ |
|---|---|---|---|---|
| MADDPG | 8.3 | 42.7 | 69.5 | 59.4 |
| MAPPO | 6.4 | 46.3 | 91.4 | 83.9 |
| MAPPO-Lag | 2.6 | 44.1 | 89.6 | 80.2 |
| MACPO | 2.1 | 45.6 | 90.1 | 76.9 |
| MADAC | 1.8 | 45.9 | 90.4 | 78.4 |
| Shielded RL | 0.0 | 39.8 | − | − |
| ★ Trident (Ours) | 0.0 | 47.8 | 93.2 | 86.3 |
Qualitative results. Figures [3](https://arxiv.org/html/2606.18308#S5.F3) and [4](https://arxiv.org/html/2606.18308#S6.F4) visualise the learned policies in two deployment-style UAV scenarios under the same hyperparameters as Table [1](https://arxiv.org/html/2606.18308#S5.T1), offering visual evidence that each design principle behaves as the theory predicts. Fig. [3](https://arxiv.org/html/2606.18308#S5.F3) shows a single UAV hugging constraint boundaries without ever penetrating them, concretely confirming the per-iterate Lyapunov contraction of Lcpo and the recovery branch (Eq. ([5](https://arxiv.org/html/2606.18308#S5.E5))) engaging as the constraint margin shrinks, a behaviour that aggregated violation counts cannot directly convey. Fig. [4](https://arxiv.org/html/2606.18308#S6.F4) then exposes the two multi-agent regimes our framework supports: heterogeneous cruise assignment (left), where each UAV jointly selects a discrete role and a continuous trajectory (the hybrid action space Stgc is designed for), and homogeneous cooperative coverage (right), where identical agents partition the workspace via the sequential constrained Nash update of Algorithm [1](https://arxiv.org/html/2606.18308#alg1).
Figure 4: Multi-UAV coordination. *Left:* heterogeneous cruise, where agents select distinct discrete roles and continuous trajectories (hybrid action setting of Stgc). *Right:* homogeneous cooperative coverage, where identical agents partition a region via the sequential constrained Nash update of Algorithm [1](https://arxiv.org/html/2606.18308#alg1).
## 7 Conclusion
We established that safe cyber-physical coordination requires resolving a tight three-way coupling among hybrid actions, training-time safety, and physical priors. By formalizing and breaking this bias-propagation cycle, our co-designed framework, Trident, theoretically reduces cumulative constraint violations from linear to $\mathcal{O}(\sqrt{K})$. Empirically, Trident cuts training-time safety violations by $95.5\%$ over MADDPG and $76.3\%$ over MACPO on UAV-MEC while simultaneously improving rewards and scaling robustly to 32 agents.
## Limitations
Our theoretical guarantees rely on Slater's condition; when the safe set is empty in expectation, no algorithm can recover feasibility, and Trident degenerates to repeated recovery steps without further progress. The Pirc module requires a known closed-form physics model, and extending to unknown or only partially identified dynamics (for instance via system identification interleaved with the actor–critic update, or via a learned "physics" surrogate) is open; in domains lacking such closed-form structure, the residual decomposition collapses back to a standard centralized critic without the variance-reduction benefit. The Stgc estimator requires one additional Gumbel-Softmax forward pass at the fixed reference temperature $\tau_0$ per gradient step, which adds roughly $18\%$ wall-clock overhead in our profiling; this overhead is dominated by the $\sim 38\%$ convergence-speed saving in our benchmarks but might be unfavourable in compute-bound regimes. The Richardson–Romberg analysis underlying Stgc assumes local smoothness of the Gumbel-Softmax Jacobian, which we verify only locally rather than over the full simplex. Finally, our experiments are restricted to at most 32 agents and to simulated environments; mean-field extensions would be required for hundreds of agents, and sim-to-real deployment requires conservative safety margins beyond those certified by our bounds. We leave all of these directions to future work.
## Ethical Considerations
Safety\-aware MARL for cyber\-physical systems has clear positive applications such as disaster response, traffic safety, and energy\-efficient infrastructure; the same algorithms, however, could be applied to non\-civilian systems where the cost specifications themselves encode harmful objectives\. We therefore encourage practitioners to \(i\) publish their cost specifications openly so that the safety claims of any deployed system can be audited, \(ii\) include hard human\-supervised override channels in any deployment, and \(iii\) carry out domain\-specific risk assessments before hardware roll\-out, with particular attention to distributional shift between simulator and real environment\. All benchmarks used in our experiments are publicly available and contain no human subjects; the simulators do not collect personal data\. We comply with the EMNLP Code of Ethics throughout\.
## References
- 3GPP (2020) Study on channel model for frequencies from 0.5 to 100 GHz. Technical Report TR 38.901, 3GPP.
- J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017) Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 22–31.
- J. Ackermann, V. Gabler, T. Osa, and M. Sugiyama (2019) Reducing overestimation bias in multi-agent domains using double centralized critics. arXiv preprint arXiv:1910.01465.
- A. Al-Hourani, S. Kandeepan, and S. Lardner (2014) Optimal LAP altitude for maximum coverage. IEEE Wireless Communications Letters 3(6), pp. 569–572.
- M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu (2018) Safe reinforcement learning via shielding. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
- F. Bach (2021) On the effectiveness of Richardson extrapolation in data science. SIAM Journal on Mathematics of Data Science 3(4), pp. 1251–1277.
- C. Banerjee, K. Nguyen, C. Fookes, and M. Raissi (2023) A survey on physics-informed reinforcement learning: review and open problems. arXiv preprint arXiv:2309.01909.
- L. Brunke, M. Greeff, A. W. Hall, et al. (2022) Safe learning in robotics: from learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems 5, pp. 411–444.
- H. Cao, Y. Mao, L. Sha, and M. Caccamo (2024) Physics-regulated deep reinforcement learning: invariant embeddings. In International Conference on Learning Representations, Vol. 2024, pp. 3712–3756.
- Y. Chow, O. Nachum, E. Duenez-Guzman, and M. Ghavamzadeh (2018) A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 31, pp. 8092–8101.
- I. Elsayed-Aly et al. (2021) Safe multi-agent reinforcement learning via shielding. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
- Z. Fan, R. Su, W. Zhang, and Y. Yu (2019) Hybrid actor-critic reinforcement learning in parameterized action space. arXiv preprint arXiv:1903.01344.
- H. Fu, H. Tang, J. Hao, Z. Lei, Y. Chen, and C. Fan (2019) Deep multi-agent reinforcement learning with discrete-continuous hybrid action spaces. arXiv preprint arXiv:1903.04959.
- J. García and F. Fernández (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1), pp. 1437–1480.
- S. Gu et al. (2021) Multi-agent constrained policy optimisation. arXiv preprint arXiv:2110.02793.
- S. Huh and I. Yang (2020) Safe reinforcement learning for probabilistic reachability and safety specifications: a Lyapunov-based approach. arXiv preprint arXiv:2002.10126.
- E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR).
- T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. A. Ojea, E. Solowjow, and S. Levine (2019) Residual reinforcement learning for robot control. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6023–6029.
- G. E. Karniadakis, I. G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, and L. Yang (2021) Physics-informed machine learning. Nature Reviews Physics 3(6), pp. 422–440.
- J. G. Kuba, R. Chen, M. Wen, Y. Wen, F. Sun, J. Wang, and Y. Yang (2022) Trust region policy optimisation in multi-agent reinforcement learning. In International Conference on Learning Representations (ICLR).
- S. Leonardos, W. Overman, I. Panageas, and G. Piliouras (2022) Global convergence of multi-agent policy gradient in Markov potential games. In International Conference on Learning Representations (ICLR).
- Z. Li and N. Azizan (2024) Safe multi-agent reinforcement learning with convergence to generalized Nash equilibrium. arXiv preprint arXiv:2411.15036.
- J. Liu, S. Li, Z. Fang, X. Li, Y. Zhou, Z. Meng, Z. Zhang, Y. Luo, G. Zhang, Y. Liu, et al. (2026) OmniDirector: general multi-shot camera cloning without cross-paired data. arXiv preprint arXiv:2606.13432.
- Y. Liu et al. (2022) Constrained variational policy optimization for safe reinforcement learning. In International Conference on Machine Learning (ICML).
- Y. Liu, H. Xiao, J. Chai, Y. Zhang, R. Wang, Z. Meng, and Z. Luo (2025) SynPo: boosting training-free few-shot medical segmentation via high-quality negative prompts. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 594–603.
- R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 30, pp. 6379–6390.
- C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The concrete distribution: a continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR).
- Z. Meng, J. Liu, Y. Liu, C. Tong, X. Liu, Y. Zhang, Y. Xu, and P. Wan (2026) ARGUS: stacked multi-view identity mosaic injection for subject-preserving video generation. arXiv preprint arXiv:2606.11670.
- Z. Meng, Y. Zeng, X. Chang, T. Xu, F. Chao, X. Cao, C. Shang, and Q. Shen (2025) Orpaint: a zero-shot inpainting model for oracle bone inscription rubbings with visual Mamba block. Science China Information Sciences 68(8), pp. 189102.
- Z. Meng (2026) Decoupling semantics from distortions: multi-scale two-stream vision-language alignment for AI-generated image quality assessment. External Links: 2606.16799, [Link](https://arxiv.org/abs/2606.16799).
- A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML), pp. 278–287.
- M. B. Paulus, C. J. Maddison, and A. Krause (2021) Rao-Blackwellizing the straight-through Gumbel-Softmax gradient estimator. In International Conference on Learning Representations (ICLR).
- B. Peng, T. Rashid, C. S. de Witt, P. Kamienny, P. H. Torr, W. Böhmer, and S. Whiteson (2021) FACMAC: factored multi-agent centralised policy gradients. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 34, pp. 12208–12221.
- A\. Ramesh and B\. Ravindran \(2023\)Physics\-informed model\-based reinforcement learning\.InLearning for Dynamics and Control Conference \(L4DC\),pp\. 26–37\.Cited by:[§2](https://arxiv.org/html/2606.18308#S2.p3.1)\.
- A\. Ray, J\. Achiam, and D\. Amodei \(2019\)Benchmarking safe exploration in deep reinforcement learning\.arXiv preprint arXiv:1910\.01708\.Cited by:[Appendix H](https://arxiv.org/html/2606.18308#A8.p4.2),[§2](https://arxiv.org/html/2606.18308#S2.p2.1),[Table 1](https://arxiv.org/html/2606.18308#S5.T1.72.66.66.8),[§6](https://arxiv.org/html/2606.18308#S6.p1.2)\.
- L\. F\. Richardson \(1911\)The approximate arithmetical solution by finite differences of physical problems involving differential equations, with an application to the stresses in a masonry dam\.Philosophical Transactions of the Royal Society A210,pp\. 307–357\.Cited by:[§5\.1](https://arxiv.org/html/2606.18308#S5.SS1.p3.3)\.
- M\. Samvelyan, T\. Rashid, C\. S\. de Witt, G\. Farquhar, N\. Nardelli, T\. G\. Rudner, C\. Hung, P\. H\. Torr, J\. Foerster, and S\. Whiteson \(2019\)The StarCraft multi\-agent challenge\.arXiv preprint arXiv:1902\.04043\.Cited by:[Appendix H](https://arxiv.org/html/2606.18308#A8.p3.1),[§6](https://arxiv.org/html/2606.18308#S6.p1.2)\.
- A\. Shekhovtsov \(2023\)Cold analysis of Rao\-Blackwellized straight\-through gumbel\-softmax gradient estimator\.InProceedings of the 40th International Conference on Machine Learning \(ICML\),pp\. 30931–30955\.Cited by:[Appendix K](https://arxiv.org/html/2606.18308#A11.p2.1),[§5\.1](https://arxiv.org/html/2606.18308#S5.SS1.p2.5)\.
- T\. Silver, K\. Allen, J\. Tenenbaum, and L\. Kaelbling \(2018\)Residual policy learning\.arXiv preprint arXiv:1812\.06298\.Cited by:[§2](https://arxiv.org/html/2606.18308#S2.p3.1)\.
- A\. Skrynnik, A\. Yakovleva, V\. Davydov, K\. Yakovlev, and A\. I\. Panov \(2021\)Hybrid policy learning for multi\-agent pathfinding\.IEEE Access9,pp\. 126034–126047\.Cited by:[§2](https://arxiv.org/html/2606.18308#S2.p1.2)\.
- A\. Stooke, J\. Achiam, and P\. Abbeel \(2020\)Responsive safety in reinforcement learning by PID lagrangian methods\.InInternational Conference on Machine Learning \(ICML\),pp\. 9133–9143\.Cited by:[§5\.2](https://arxiv.org/html/2606.18308#S5.SS2.p1.1)\.
- L\. Wanget al\.\(2021\)Multi\-agent deep reinforcement learning\-based trajectory planning for multi\-UAV assisted mobile edge computing\.IEEE Transactions on Cognitive Communications and Networking7\(1\),pp\. 73–84\.Cited by:[§1](https://arxiv.org/html/2606.18308#S1.p1.1)\.
- B\. Wei, H\. Liu, C\. Qian, Z\. Li, W\. Wu, and Z\. Meng \(2025\)Robust single image sand removal by leveraging uncertainty\-aware sam priors and prompt learning with refined perceptual loss\.InProceedings of the 33rd ACM International Conference on Multimedia,pp\. 4932–4941\.Cited by:[§1](https://arxiv.org/html/2606.18308#S1.p1.1)\.
- C\. Yu, A\. Velu, E\. Vinitsky, J\. Gao, Y\. Wang, A\. Bayen, and Y\. Wu \(2022\)The surprising effectiveness of PPO in cooperative multi\-agent games\.InAdvances in Neural Information Processing Systems \(NeurIPS\) Datasets and Benchmarks Track,Vol\.35,pp\. 24611–24624\.Cited by:[Appendix H](https://arxiv.org/html/2606.18308#A8.p4.2),[§2](https://arxiv.org/html/2606.18308#S2.p2.1),[Table 1](https://arxiv.org/html/2606.18308#S5.T1.58.52.52.8),[§6](https://arxiv.org/html/2606.18308#S6.p1.2)\.
- K\. Zhang, Z\. Yang, and T\. Başar \(2021\)Decentralized multi\-agent reinforcement learning\.Foundations and Trends in Machine Learning13\(6\),pp\. 1–158\.Cited by:[§5\.2](https://arxiv.org/html/2606.18308#S5.SS2.p3.3)\.
- M\. Zhou, J\. Luo, J\. Villella, Y\. Yang, D\. Rusu, J\. Miao, W\. Zhang, M\. Alban, I\. Fadakar, Z\. Chen,et al\.\(2021\)SMARTS: an open\-source scalable multi\-agent RL training school for autonomous driving\.InConference on Robot Learning \(CoRL\),Vol\.155,pp\. 264–285\.Cited by:[Appendix H](https://arxiv.org/html/2606.18308#A8.p2.1),[§1](https://arxiv.org/html/2606.18308#S1.p1.1),[§6](https://arxiv.org/html/2606.18308#S6.p1.2)\.
- Y\. Zhouet al\.\(2023\)UAV\-assisted MEC: a comprehensive survey\.IEEE Communications Surveys and Tutorials25\(1\),pp\. 382–418\.Cited by:[§1](https://arxiv.org/html/2606.18308#S1.p1.1)\.
## Appendix
## Appendix A Notation, Assumptions, and Glossary
The notation used throughout the paper is collected here for ease of reference. We write $\bm{\pi} = (\pi_1, \ldots, \pi_N)$ for the joint policy, $\bm{\pi}_{-i}$ for the joint policy of all agents except $i$, $d^{\bm{\pi}}$ for the stationary state-occupancy under $\bm{\pi}$, and $V^{\bm{\pi}}$, $V_{c_k}^{\bm{\pi}}$ for the reward and cost value functions. The advantage and constraint advantage are $A^{\bm{\pi}}(s,\bm{a}) = Q^{\bm{\pi}}(s,\bm{a}) - V^{\bm{\pi}}(s)$ and $A_{c_k}^{\bm{\pi}}(s,\bm{a}) = Q_{c_k}^{\bm{\pi}}(s,\bm{a}) - V_{c_k}^{\bm{\pi}}(s)$. For an arbitrary function $f$ on the state space we write $\|f\|_\infty = \sup_s |f(s)|$, and we write the total variation between two policies as $\|\pi_\theta - \pi_{\theta'}\|_{\text{TV}}$. The Gumbel-Softmax Jacobian is denoted $J(\tau) = \mathbb{E}_g[\partial\hat{p}_\tau/\partial\ell]$, with Taylor coefficients $J_0, J_1, J_2, \ldots$.
###### Assumption 4 (Standard regularity).
(i) Bounded reward and cost: $|r| \leq R_{\max}$ and $|c_k| \leq C_{\max}$ for all $k$. (ii) Policy Lipschitzness: $\|\pi_\theta - \pi_{\theta'}\|_{\text{TV}} \leq L_\pi \|\theta - \theta'\|_2$. (iii) Bounded advantage variance: $\mathrm{Var}[\hat{A}] \leq \sigma_A^2$, and similarly for cost advantages. (iv) Slater's condition: there exists $\bar{\bm{\pi}}$ with $V_{c_k}^{\bar{\bm{\pi}}} < d_k$ strictly, for all $k$. (v) Smoothness of the Gumbel-Softmax Jacobian: the Taylor expansion $J(\tau) = J_0 + \tau J_1 + \tau^2 J_2 + \mathcal{O}(\tau^3)$ holds uniformly over the logits, with $\|J_1\|, \|J_2\|$ bounded by constants depending only on $\|\ell\|_\infty$ and $M_i$.
## Appendix B Proof of Lemma [1](https://arxiv.org/html/2606.18308#Thmtheorem1)
We give a detailed and pedagogical proof of the bias-propagation lemma, explaining each step in words before stating the inequality, so as to make the directed-cycle reasoning unambiguous.
Denote by $\theta^t$ the parameter vector at iterate $t$ and by $\theta^{t+1} = \theta^t + \eta_s\hat{g}$ the next iterate, where $\hat{g}$ is the possibly biased update direction returned by the safety-projected actor-critic step and $g$ is the corresponding noise-free direction. Because the actor factorizes into a discrete head and a continuous head, we decompose $\hat{g} = \hat{g}^d \oplus \hat{g}^c$, and similarly for $g$. The three summands of ([1](https://arxiv.org/html/2606.18308#S4.E1)) arise from a sequential application of first-order expansion, plug-in error analysis, and a fixed-point shift argument; we treat them in turn.
Term 1 (F1 $\to$ F2). The safety update direction is determined by the gradient of the Lagrangian $\mathcal{L}(\theta,\mu) = -A^{\bm{\pi}}(s,\bm{a}) + \mu A_{c_k}^{\bm{\pi}}(s,\bm{a})$, and the safety-relevant component along the discrete direction is the inner product $\langle\hat{g}^d, \nabla_a A_{c_k}\rangle$. By the definition of the Gumbel-Softmax bias, $\|\mathbb{E}[\hat{g}^d - g^d]\| \leq \beta_{\text{GS}}$. We now perform a first-order Taylor expansion of $V_{c_k}^{\bm{\pi}^{t+1}}$ around the previous policy $\bm{\pi}^t$. Standard performance-difference arguments (Achiam et al., [2017](https://arxiv.org/html/2606.18308#bib.bib1)) yield $V_{c_k}^{\bm{\pi}^{t+1}} - V_{c_k}^{\bm{\pi}^t} \leq \eta_s\langle\hat{g}, \nabla_a A_{c_k}\rangle$, which we then split into $\eta_s\langle g, \nabla_a A_{c_k}\rangle + \eta_s\langle\hat{g} - g, \nabla_a A_{c_k}\rangle$. The first inner product is non-positive by construction of the safety projection: the noise-free update either keeps or decreases the constraint value. Bounding the second inner product by Cauchy–Schwarz and using $\|\mathbb{E}[\hat{g}^d - g^d]\| \leq \beta_{\text{GS}}$ produces the first summand $\eta_s\beta_{\text{GS}}\|\nabla_a A_{c_k}\|_\infty$. The reading of this term is that any bias in the discrete-branch gradient is *multiplicatively* amplified by the safety step size, so a constant $\beta_{\text{GS}}$ produces a constant per-iteration safety violation and thus a cumulative violation that is linear in $K$.
Term 2 (F2 $\to$ F3). The constraint advantage in the safety update of ([4](https://arxiv.org/html/2606.18308#S5.E4)) is computed from the learned cost-value function $\hat{V}_{c_k}$. The standard plug-in error bound for inexact Q-learning (Achiam et al., [2017](https://arxiv.org/html/2606.18308#bib.bib1), App. 10.1) yields $|\hat{A}_{c_k} - A_{c_k}| \leq 2\epsilon_Q/(1-\gamma)$, where $\epsilon_Q$ is the critic MSE. Multiplying through by the actor's Lipschitz constant $L_\pi$ (which controls how much an error in the advantage perturbs the realised policy gradient) and by the safety step size $\eta_s$ yields the second summand $\eta_s\epsilon_Q L_\pi$. The reading is that a physics-agnostic safety critic must regress a cost-value that is multi-modal across discrete branches, and its irreducible regression error feeds directly into the safety update.
Term 3 (F3 $\to$ F1). Suppose physics enters the system as an additive reward shaping $r' = r + \omega Q_{\text{phys}}$. The soft-Bellman equation for the optimal discrete sub-policy has the form $\pi_*^d(a^d|s) \propto \exp\big((Q^*(s,a^d) + \omega Q_{\text{phys}}(s,a^d))/\tau\big)$, so the additive shaping introduces a logit perturbation $\Delta\ell = \omega(Q_{\text{phys}} - Q^*)/\tau$. By standard log-sum-exp inequalities, the induced change in the discrete-branch entropy is upper-bounded by $C_{\text{flat}}\|Q_{\text{phys}} - Q^*\|_\infty$, where $C_{\text{flat}}$ depends on $\omega/\tau$ and on the Lipschitz constant of $\operatorname{softmax}$. Re-injecting this entropy change into the constraint advantage (via the fact that a flatter discrete policy places more mass on suboptimal modes and thereby inflates the worst-case constraint advantage) yields the third summand $C_{\text{flat}}\|Q_{\text{phys}} - Q^*\|_\infty$. Summing the three contributions gives the stated bound on $\Delta_k^t$.
Two remarks are in order. First, the three summands are *not independent*: the first two share the multiplicative factor $\eta_s$, and the third carries the shaping coefficient $\omega/\tau$. This is the precise sense in which the three features form a cycle rather than three independent issues; eliminating one summand without the others does not stop the cumulative violation from being linear in $K$. Second, the lemma is tight up to constants when $\eta_s, \epsilon_Q, \beta_{\text{GS}}, \|Q_{\text{phys}} - Q^*\|_\infty$ are bounded away from zero, which is precisely the regime of any naive composition.
## Appendix C Proof of Theorem [2](https://arxiv.org/html/2606.18308#Thmtheorem2) (Stgc Bias Bound)
We give a self-contained proof of the Richardson–Romberg bias cancellation underlying Stgc, explaining why the choice $\lambda_\tau = \tau/(\tau_0 - \tau)$ cancels the leading $\mathcal{O}(\tau)$ term exactly but not the $\mathcal{O}(\tau^2)$ term.
Let $\ell \in \mathbb{R}^M$ denote the logits, $g_j \sim \text{Gumbel}(0,1)$, and $\hat{p}_\tau = \operatorname{softmax}((\ell+g)/\tau)$. Define the exact softmax $p = \operatorname{softmax}(\ell)$ and the GS Jacobian $J(\tau) := \mathbb{E}_g[\partial\hat{p}_\tau/\partial\ell]$, viewed as a function of $\tau$. By the analyticity of the softmax operator and the Gumbel-Max property (Maddison et al., [2017](https://arxiv.org/html/2606.18308#bib.bib24); Jang et al., [2017](https://arxiv.org/html/2606.18308#bib.bib17)), $J$ is real-analytic on $(0,\infty)$ and admits a Taylor expansion at $\tau = 0$:

$$J(\tau) = J_0 + \tau J_1 + \tau^2 J_2 + \mathcal{O}(\tau^3), \tag{10}$$

where $J_0 = \partial p/\partial\ell = \operatorname{diag}(p) - pp^\top$ is the exact softmax Jacobian and $J_1, J_2$ are bounded matrices depending only on $\ell$ and $M$. The boundedness of $J_1, J_2$ is part of Assumption [4](https://arxiv.org/html/2606.18308#Thmtheorem4)(v) and is verified directly by differentiating the GS density.
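As a minimal sketch of the objects just defined, the following Python snippet draws one Gumbel-Softmax sample $\hat{p}_\tau$ and builds the exact softmax Jacobian $J_0 = \operatorname{diag}(p) - pp^\top$. The function names, the example logits, and the seed are illustrative assumptions and are not taken from the paper's implementation.

```python
import math
import random


def softmax(x):
    """Numerically stable softmax of a list of floats."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_sample(logits, tau, rng):
    """One relaxed sample softmax((logits + g) / tau) with g ~ Gumbel(0, 1)."""
    # Inverse-CDF sampling: g = -log(-log(u)) for u ~ Uniform(0, 1).
    noise = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in logits]
    return softmax([(l + g) / tau for l, g in zip(logits, noise)])


def exact_softmax_jacobian(logits):
    """J0 = diag(p) - p p^T, the Jacobian of softmax at the given logits."""
    p = softmax(logits)
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]


logits = [1.0, 0.0, -1.0]  # hypothetical logits for M = 3 discrete branches
sample = gumbel_softmax_sample(logits, tau=0.5, rng=random.Random(0))
J0 = exact_softmax_jacobian(logits)
```

Each row of `J0` sums to zero because the softmax output always sums to one, which gives a quick sanity check on any Jacobian estimate.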
The Stgc estimator combines two GS evaluations at temperatures $\tau$ and $\tau_0$, namely

$$J_{\text{Stgc}}(\tau) = (1+\lambda_\tau)J(\tau) - \lambda_\tau J(\tau_0). \tag{11}$$

Substituting the Taylor expansion ([10](https://arxiv.org/html/2606.18308#A3.E10)) into ([11](https://arxiv.org/html/2606.18308#A3.E11)) and grouping powers of $\tau$ gives

$$\begin{aligned}
\mathbb{E}[J_{\text{Stgc}}(\tau)] &= (1+\lambda_\tau)\big(J_0 + \tau J_1 + \tau^2 J_2\big) - \lambda_\tau\big(J_0 + \tau_0 J_1 + \tau_0^2 J_2\big) + \mathcal{O}(\tau^3) \\
&= J_0\big[(1+\lambda_\tau) - \lambda_\tau\big] + J_1\big[(1+\lambda_\tau)\tau - \lambda_\tau\tau_0\big] + J_2\big[(1+\lambda_\tau)\tau^2 - \lambda_\tau\tau_0^2\big] + \mathcal{O}(\tau^3).
\end{aligned}$$

The coefficient of $J_0$ is identically $1$, as required for an unbiased Jacobian estimator. The coefficient of $J_1$ simplifies using $\lambda_\tau = \tau/(\tau_0 - \tau)$:

$$(1+\lambda_\tau)\tau - \lambda_\tau\tau_0 = \tfrac{\tau_0\tau - \tau\tau_0}{\tau_0 - \tau} = 0,$$

so the $\mathcal{O}(\tau)$ term cancels exactly. The coefficient of $J_2$ does not vanish in general: $(1+\lambda_\tau)\tau^2 - \lambda_\tau\tau_0^2 = -\tau\tau_0 + \mathcal{O}(\tau^2)$, so the residual is dominated by $\tau\tau_0\|J_2\|$ to leading order. Bounding $\|J_2\|$ by a constant $C$ depending only on $\ell_{\text{max}}$ and $M_i$, and recalling that $\tau \leq \tau_0/2$ implies $\tau_0/(\tau_0 - \tau) \leq 2$, we obtain $\|\mathbb{E}[J_{\text{Stgc}}(\tau)] - J_0\|_2 \leq C\tau^2/\tau_0$.
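The coefficient algebra of this derivation can be checked with exact rational arithmetic. The sketch below is only an illustration: `stgc_coefficients` is a hypothetical helper name, and the two temperatures are arbitrary example values satisfying $\tau \leq \tau_0/2$.

```python
from fractions import Fraction


def stgc_coefficients(tau, tau0):
    """Coefficients multiplying J0, J1, J2 in (1 + lam) J(tau) - lam J(tau0),
    where lam = tau / (tau0 - tau)."""
    lam = tau / (tau0 - tau)
    c0 = (1 + lam) - lam
    c1 = (1 + lam) * tau - lam * tau0
    c2 = (1 + lam) * tau**2 - lam * tau0**2
    return lam, c0, c1, c2


tau, tau0 = Fraction(1, 10), Fraction(1, 1)  # example values, tau <= tau0 / 2
lam, c0, c1, c2 = stgc_coefficients(tau, tau0)
print(lam, c0, c1, c2)  # prints: 1/9 1 0 -1/10
```

For these values the $J_0$ coefficient is $1$, the $J_1$ coefficient is $0$, and the $J_2$ coefficient evaluates to $-\tau\tau_0 = -1/10$.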
The chain rule $\nabla_{\theta^d}\hat{a}^d = J(\tau)\cdot\partial\ell/\partial\theta^d$ then carries the same rate through to the parameter gradient. For comparison, plain GS uses $J(\tau)$ directly and incurs the full $\mathcal{O}(\tau)$ bias, while straight-through replaces the soft sample by a hard $\operatorname{arg\,max}$ at execution and back-propagates as if $\tau = 0$, incurring an $\mathcal{O}(1)$ bias that does not vanish under annealing.
A note on variance. Although Stgc evaluates the GS Jacobian at two temperatures, the variance of the combined estimator is bounded by $(1+\lambda_\tau)^2\mathrm{Var}[J(\tau)] + \lambda_\tau^2\mathrm{Var}[J(\tau_0)] \leq 4\mathrm{Var}[J(\tau_0)] + \mathcal{O}(\tau^2/\tau_0^2)\mathrm{Var}[J(\tau)]$, which is dominated by the variance at the reference temperature $\tau_0$ and therefore does not blow up when annealing $\tau \to 0$. This is in stark contrast to plain GS, whose variance grows as $\mathcal{O}(1/\tau^2)$ as $\tau \to 0$. Stgc thus offers a Pareto-improving point on the bias–variance frontier of discrete reparameterization.
## Appendix D Proof of Theorem [3](https://arxiv.org/html/2606.18308#Thmtheorem3), Convergence Part
The proof follows the trust-region template of Achiam et al. ([2017](https://arxiv.org/html/2606.18308#bib.bib1)) extended to the multi-agent setting following Kuba et al. ([2022](https://arxiv.org/html/2606.18308#bib.bib20)), with two changes: the constraint linearization carries the additional Lyapunov shift of ([3](https://arxiv.org/html/2606.18308#S5.E3)), and the discrete-branch gradient bias enters as a lower-order term controlled by Theorem [2](https://arxiv.org/html/2606.18308#Thmtheorem2).
Step 1: Per-agent improvement. Fix iteration $t$ and consider agent $i$'s constrained update ([4](https://arxiv.org/html/2606.18308#S5.E4)). By the multi-agent trust-region performance-difference lemma (Kuba et al., [2022](https://arxiv.org/html/2606.18308#bib.bib20)), for any feasible $\theta_i$ satisfying $\bar{D}_{\text{KL}}(\bm{\pi}_{\theta_i}\|\bm{\pi}_i^t) \leq \delta_{\text{TR}}$,

$$V^{(\pi_i^{t+1},\bm{\pi}_{-i}^t)} - V^{\bm{\pi}^t} \geq \tfrac{1}{1-\gamma}\mathbb{E}_{d^{\bm{\pi}^t}}[A_i^t] - \tfrac{2\gamma\epsilon_\pi}{(1-\gamma)^2}\delta_{\text{TR}},$$

where $\epsilon_\pi = \max_{s,\bm{a}}|A^{\bm{\pi}^t}(s,\bm{a})|$ is the magnitude of the advantage and $A_i^t$ is the agent-$i$ advantage. Substituting Theorem [2](https://arxiv.org/html/2606.18308#Thmtheorem2), the expectation $\mathbb{E}[A_i^t]$ differs from its noise-free counterpart by at most $\mathcal{O}(\tau^2)$, which is absorbed into the trust-region term.
Step 2: Sequential telescoping. Summing the per-agent improvement bound over the $N$ agents in the fixed sequential order introduces a non-stationarity term that captures how later agents' updates respond to earlier ones:

$$\begin{aligned}
V^{\bm{\pi}^{t+1}} - V^{\bm{\pi}^t} &\geq \tfrac{1}{1-\gamma}\textstyle\sum_i\mathbb{E}[A_i^t] - \tfrac{2N\gamma\epsilon_\pi}{(1-\gamma)^2}\delta_{\text{TR}} \\
&\quad - \tfrac{N(N-1)\gamma\epsilon_\pi^2}{(1-\gamma)^3}\delta_{\text{TR}}.
\end{aligned}$$

The last term is the telescoping error of sequential updates: an agent updated after agent $j$ sees a slightly different joint policy than the one agent $j$ optimised against, and this discrepancy contributes a second-order term that scales as $N^2$ rather than $N$.
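To make the $N$ versus $N^2$ scaling concrete, the sketch below evaluates the two penalty terms of the Step 2 bound for two team sizes. The helper name and all numeric inputs are hypothetical and chosen only for illustration.

```python
def sequential_penalties(n_agents, gamma, eps_pi, delta_tr):
    """Return (trust_region_term, telescoping_term) from the Step 2 bound."""
    trust_region = 2 * n_agents * gamma * eps_pi / (1 - gamma) ** 2 * delta_tr
    telescoping = (n_agents * (n_agents - 1) * gamma * eps_pi**2
                   / (1 - gamma) ** 3 * delta_tr)
    return trust_region, telescoping


# Hypothetical settings: compare N = 4 against N = 8 with everything else fixed.
tr4, tel4 = sequential_penalties(4, gamma=0.99, eps_pi=1.0, delta_tr=0.01)
tr8, tel8 = sequential_penalties(8, gamma=0.99, eps_pi=1.0, delta_tr=0.01)

trust_region_growth = tr8 / tr4    # grows like N
telescoping_growth = tel8 / tel4   # grows like N (N - 1)
```

Doubling the team size doubles the trust-region term, while the telescoping term grows by the ratio of $N(N-1)$ values, here $56/12$.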
Step 3: Telescope over $K$. Choosing $\delta_{\text{TR}} = \mathcal{O}(1/K)$ and using a standard regret-balancing argument (Achiam et al., [2017](https://arxiv.org/html/2606.18308#bib.bib1)), we obtain

$$\tfrac{1}{K}\textstyle\sum_{t=1}^{K}[V^* - V^{\bm{\pi}^t}] \leq \tilde{\mathcal{O}}\Big(\tfrac{N\sigma_A}{\sqrt{K}} + \tfrac{N^2}{(1-\gamma)^3\sqrt{K}}\Big).$$

The first term is the standard advantage-variance contribution; the second comes from the sequential telescoping. The bias $\beta_{\text{GS}} = \mathcal{O}(\tau^2)$ from Theorem [2](https://arxiv.org/html/2606.18308#Thmtheorem2) contributes a term that is at most $\sum_t\tau(t)^2 = \mathcal{O}(1)$ for the standard exponential schedule, and is therefore strictly dominated.
## Appendix E Proof of Theorem [3](https://arxiv.org/html/2606.18308#Thmtheorem3), Safety Part
We prove the $\mathcal{O}(\sqrt{K})$ cumulative violation bound by case analysis on whether the iterate is feasible or infeasible at each step, showing that in either case the violation telescopes appropriately.
Let $E_k^t = \max(0, V_{c_k}^{\bm{\pi}^t}(s_0) - d_k)$ denote the violation at iterate $t$. We analyse the two cases.
Case A: feasible iterate ($V_{c_k}^{\bm{\pi}^t} \leq d_k$). The update is the trust-region step ([4](https://arxiv.org/html/2606.18308#S5.E4)), whose linearized Lyapunov constraint is precisely the first-order form of ([3](https://arxiv.org/html/2606.18308#S5.E3)). Combining this constraint with the trust-region radius $\delta_{\text{TR}}$ and the standard Lipschitz arguments of Achiam et al. ([2017](https://arxiv.org/html/2606.18308#bib.bib1)), the cost-value drift is bounded by $V_{c_k}^{\bm{\pi}^{t+1}} - V_{c_k}^{\bm{\pi}^t} \leq \mathcal{O}(\delta_{\text{TR}}/(1-\gamma))$. Since the iterate was feasible, this drift produces a violation of at most $\mathcal{O}(\delta_{\text{TR}}/(1-\gamma))$ at the next step.
Case B: infeasible iterate ($V_{c_k}^{\bm{\pi}^t} > d_k$). The update is the recovery step ([5](https://arxiv.org/html/2606.18308#S5.E5)), which descends $V_{c_k}^{\bm{\pi}}$ at rate $\eta_{\text{rec}}\|\nabla V_{c_k}\|^2$. Using the Lyapunov contraction ([3](https://arxiv.org/html/2606.18308#S5.E3)), this descent yields $E_k^{t+1} \leq (1-\alpha_k)E_k^t + \mathcal{O}(\delta_{\text{TR}}/(1-\gamma))$.
Telescoping. Combining the two cases and summing over $t = 1, \ldots, K$, the recovery term contracts geometrically:

$$\sum_t E_k^t \leq \tfrac{E_k^0}{\alpha_k} + \tfrac{K\cdot\mathcal{O}(\delta_{\text{TR}}/(1-\gamma))}{\alpha_k}.$$

Choosing $\delta_{\text{TR}} = \mathcal{O}\big(\sqrt{(1-\gamma)/K}\big)$ (which is the optimal balance between trust-region accuracy and violation drift) gives $\sum_t E_k^t = \mathcal{O}\big(\tfrac{1}{\alpha_k}\sqrt{K/(1-\gamma)}\big)$.
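The telescoping step can be illustrated numerically. The sketch below iterates the worst-case recursion $E^{t+1} = (1-\alpha_k)E^t + c$, where the constant $c$ stands in for the $\mathcal{O}(\delta_{\text{TR}}/(1-\gamma))$ drift, and compares the running sum with $E^0/\alpha_k + Kc/\alpha_k$. The function name and every numeric value are assumptions chosen for illustration.

```python
def cumulative_violation(e0, alpha, drift, num_iters):
    """Sum of E^t for t = 1..num_iters under E^{t+1} = (1 - alpha) E^t + drift."""
    total, e = 0.0, e0
    for _ in range(num_iters):
        e = (1.0 - alpha) * e + drift  # contraction plus per-step drift
        total += e
    return total


e0, alpha, drift, K = 1.0, 0.2, 0.01, 1000  # hypothetical values
total = cumulative_violation(e0, alpha, drift, K)
bound = e0 / alpha + K * drift / alpha
```

The initial violation contributes only a constant $E^0/\alpha_k$, so the growth of the sum in $K$ is driven entirely by the drift term.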
Effect of bias. A naive composition would also incur a $K\beta_{\text{GS}}$ term, but Theorem [2](https://arxiv.org/html/2606.18308#Thmtheorem2) reduces this to $K\cdot\mathcal{O}(\tau^2)$, and the standard annealing schedule $\tau(t) = \tau_0\beta^t$ makes $\sum_t\tau(t)^2 = \mathcal{O}(1)$, so this contribution is absorbed into the $\mathcal{O}(\sqrt{K})$ rate.
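A short numerical check of the two regimes: a constant per-step bias accumulates linearly in $K$, whereas the squared temperatures of a geometric schedule $\tau(t) = \tau_0\beta^t$ with $0 < \beta < 1$ sum to at most $\tau_0^2/(1-\beta^2)$. The schedule parameters and the constant bias level below are hypothetical.

```python
def annealed_bias_sum(tau0, beta, num_iters):
    """Sum of tau(t)^2 for t = 0..num_iters-1 with tau(t) = tau0 * beta**t."""
    return sum((tau0 * beta**t) ** 2 for t in range(num_iters))


tau0, beta = 1.0, 0.999  # hypothetical schedule parameters
geometric_limit = tau0**2 / (1.0 - beta**2)

for K in (1_000, 10_000, 100_000):
    constant_bias_total = K * 0.01  # hypothetical constant bias of 0.01 per step
    annealed_total = annealed_bias_sum(tau0, beta, K)
    print(K, constant_bias_total, annealed_total, geometric_limit)
```

The constant-bias column keeps growing with $K$, while the annealed column saturates at the geometric-series limit.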
## Appendix F Sample-Complexity Theorem
We state and prove the third theoretical guarantee, which we deferred from the main text to keep the number of in-line theorems small.
###### Theorem 5 (Sample-Complexity Reduction).
Suppose $\|Q^* - Q_{\text{phys}}\|_\infty \leq \epsilon_{\text{phys}}$ and define the explained-variance ratio $R_{\text{phys}}^2 := 1 - \mathrm{Var}[Q^* - Q_{\text{phys}}]/\mathrm{Var}[Q^*] \in [0,1]$. Then, to obtain an $\epsilon$-accurate critic, Trident requires $n_{\text{Trident}} = \tilde{\mathcal{O}}\big((1-R_{\text{phys}}^2)\,d_{\text{full}}/\epsilon^2\big)$ environment interactions, versus $n_{\text{baseline}} = \tilde{\mathcal{O}}\big(d_{\text{full}}/\epsilon^2\big)$ for a physics-agnostic critic of the same architecture.
###### Proof\.
For the residual critic trained with $n$ samples on the standard one-step TD loss, the population MSE decomposes as $\mathbb{E}\|Q_{\phi_{\text{res}}} - (Q^* - Q_{\text{phys}})\|^2 = \mathrm{Var}[Q^* - Q_{\text{phys}}]/n + \text{Bias}^2$. Because $Q_{\text{phys}}$ is deterministic (a closed-form function of state and action), we have $\mathrm{Var}[Q^* - Q_{\text{phys}}] = \mathrm{Var}[Q^*] - 2\mathrm{Cov}[Q^*, Q_{\text{phys}}] + \mathrm{Var}[Q_{\text{phys}}]$. The right-hand side equals $(1-R_{\text{phys}}^2)\mathrm{Var}[Q^*]$ by the standard definition of the coefficient of determination, where $R_{\text{phys}}^2 = \mathrm{Cov}[Q^*, Q_{\text{phys}}]^2/(\mathrm{Var}[Q^*]\mathrm{Var}[Q_{\text{phys}}])$ when normalised. Hence achieving $\epsilon^2$ MSE requires $n = \mathcal{O}((1-R_{\text{phys}}^2)\mathrm{Var}[Q^*]/\epsilon^2)$ versus $\mathcal{O}(\mathrm{Var}[Q^*]/\epsilon^2)$ for the physics-agnostic critic.
The function-class complexity contracts by the same multiplicative factor: under standard Rademacher-complexity generalization bounds for ReLU MLPs, the effective dimension of the hypothesis class needed to fit the residual scales as $d_{\text{res}} = (1-R_{\text{phys}}^2)d_{\text{full}}$ to leading order, since the residual is supported on a lower-variance regime than the full $Q^*$. Combining the variance and complexity reductions and using standard PAC bounds yields the stated $\tilde{\mathcal{O}}$ rate. ∎
The practical meaning of Theorem [5](https://arxiv.org/html/2606.18308#Thmtheorem5) is that a well-chosen physics model with $R_{\text{phys}}^2 = 0.7$ (a typical value for the Shannon-capacity term in UAV-MEC, as we measured directly) yields a $3.3\times$ reduction in sample complexity, which in our experiments shows up as the $0.62\times$ convergence-speed drop when Pirc is removed (Table [2](https://arxiv.org/html/2606.18308#S6.T2)).
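The arithmetic behind the $3.3\times$ figure can be reproduced in a few lines. In the sketch below, `explained_variance_ratio` follows the definition of $R_{\text{phys}}^2$ in Theorem 5 and `sample_complexity_ratio` takes the ratio of the two stated rates; both function names and the toy value lists are illustrative assumptions, not measurements from the paper.

```python
from statistics import pvariance


def explained_variance_ratio(q_star, q_phys):
    """R^2_phys = 1 - Var[Q* - Q_phys] / Var[Q*] over paired samples."""
    residual = [a - b for a, b in zip(q_star, q_phys)]
    return 1.0 - pvariance(residual) / pvariance(q_star)


def sample_complexity_ratio(r2_phys):
    """n_baseline / n_Trident = 1 / (1 - R^2_phys), from the rates in Theorem 5."""
    if not 0.0 <= r2_phys < 1.0:
        raise ValueError("r2_phys must lie in [0, 1)")
    return 1.0 / (1.0 - r2_phys)


print(round(sample_complexity_ratio(0.7), 2))  # 3.33

# Toy paired values (hypothetical): a physics prior that tracks Q* closely.
q_star = [1.0, 2.0, 3.0, 4.0]
q_phys = [1.1, 1.9, 3.2, 3.8]
r2 = explained_variance_ratio(q_star, q_phys)
```

A prior with $R_{\text{phys}}^2$ close to $1$ leaves only a small residual to learn, while $R_{\text{phys}}^2 = 0$ gives no reduction at all.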
## Appendix G Computational Complexity
Per training iteration, Trident performs (i) one batch of actor forward passes at cost $\mathcal{O}(NB(d_h^2 + M_i d_h))$, where $d_h$ is the hidden dimension, $M_i$ the discrete-branch size, and $B$ the batch size; (ii) one centralized $Q$-critic forward/backward pass at $\mathcal{O}(B(d_h^2 + d_h(M+p)))$; (iii) $K$ cost-critic updates at $\mathcal{O}(KBd_h^2)$; (iv) the closed-form physics term $Q_{\text{phys}}$ at $\mathcal{O}(N N_{\text{UE}} N_{\text{fog}})$, which is negligible; and (v) the additional reference-temperature GS evaluation for Stgc at $\mathcal{O}(NBd_h)$. The dominant cost is $\mathcal{O}(NUBd_h^2)$, asymptotically identical to MADDPG. In wall-clock terms on $4\times$ RTX 4090, Stgc adds approximately $18\%$ overhead and the physics evaluation approximately $2\%$; both are dwarfed by the $\sim 38\%$ convergence-speed saving (Table [2](https://arxiv.org/html/2606.18308#S6.T2)).
## Appendix H Experimental Setup
Multi-UAV MEC. $N = 4$ UAVs serve $N_{\text{UE}} = 20$ ground users via $N_{\text{fog}} = 2$ fog servers across a $600 \times 600$ m area at fixed altitude. The hybrid action consists of a discrete offload destination $a^d \in \{1, \ldots, N_{\text{fog}}+1\}$ (the $+1$ option representing local computation) and a continuous triple $a^c = (v, \theta, \alpha)$ encoding velocity magnitude, heading, and offload ratio, all box-constrained per mode. Constraints are $E_i \leq E_{\text{max}}$ (energy budget) and $\mathrm{Cov} \geq \mathrm{Cov}_{\text{min}}$ (coverage lower bound). The channel model is 3GPP TR 38.901 (3GPP, [2020](https://arxiv.org/html/2606.18308#bib.bib35)) with the air-to-ground path loss of Al-Hourani et al. ([2014](https://arxiv.org/html/2606.18308#bib.bib3)). Episode length is 200 steps, with a total of 20K training episodes.
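One way to represent the hybrid action of the UAV-MEC task is sketched below. The class name, the helper functions, and in particular the numeric bounds (`v_max`, the heading range, the ratio range) are hypothetical; the setup only states that the continuous triple is box-constrained per mode, and this sketch uses a single box for all modes for brevity.

```python
import math
from dataclasses import dataclass

N_FOG = 2  # number of fog servers in the setup described above


@dataclass(frozen=True)
class HybridAction:
    destination: int      # discrete: 1..N_FOG are fog servers, N_FOG + 1 is local computation
    speed: float          # continuous velocity magnitude v
    heading: float        # continuous heading theta, in radians
    offload_ratio: float  # continuous offload ratio alpha


def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def make_action(destination, speed, heading, offload_ratio, v_max=20.0):
    """Validate the discrete branch and box-clip the continuous triple.

    v_max and the heading/ratio boxes are assumed bounds for illustration.
    """
    if destination not in range(1, N_FOG + 2):
        raise ValueError(f"destination must be in 1..{N_FOG + 1}")
    return HybridAction(
        destination=destination,
        speed=clip(speed, 0.0, v_max),
        heading=clip(heading, -math.pi, math.pi),
        offload_ratio=clip(offload_ratio, 0.0, 1.0),
    )


# Destination 3 = N_FOG + 1 selects local computation; out-of-box values are clipped.
action = make_action(destination=3, speed=35.0, heading=0.5, offload_ratio=1.4)
```

Keeping the discrete branch as a validated integer and the continuous part as clipped floats mirrors the factorization into a discrete head and a continuous head used by the actor.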
Autonomous Intersection Management. Eight vehicles cross a 4-way intersection on the SMARTS simulator (Zhou et al., [2021](https://arxiv.org/html/2606.18308#bib.bib40)). Each vehicle selects a discrete target lane and a continuous control triple (accel, brake, steering) per step, with collision-avoidance and lane-keeping constraints. We use 15K training episodes.
SMAC (hybrid). We adapt the StarCraft Multi-Agent Challenge (Samvelyan et al., [2019](https://arxiv.org/html/2606.18308#bib.bib31)) by augmenting each agent's action with a continuous repositioning offset of $\leq 1$ grid unit in addition to its standard discrete target. We evaluate on five hard maps (3s5z\_vs\_3s6z, corridor, 6h\_vs\_8z, MMM2, and 27m\_vs\_30m), with 5M environment steps each.
Baselines. We compare against MADDPG (Lowe et al., [2017](https://arxiv.org/html/2606.18308#bib.bib23)), MATD3 (Ackermann et al., [2019](https://arxiv.org/html/2606.18308#bib.bib2)), FACMAC (Peng et al., [2021](https://arxiv.org/html/2606.18308#bib.bib27)), MAPPO (Yu et al., [2022](https://arxiv.org/html/2606.18308#bib.bib38)), HAPPO (Kuba et al., [2022](https://arxiv.org/html/2606.18308#bib.bib20)), MAPPO-Lagrangian (Ray et al., [2019](https://arxiv.org/html/2606.18308#bib.bib29)), MACPO (Gu and others, [2021](https://arxiv.org/html/2606.18308#bib.bib14)), MADAC (Li and Azizan, [2024](https://arxiv.org/html/2606.18308#bib.bib15)), and Shielded RL (Elsayed-Aly and others, [2021](https://arxiv.org/html/2606.18308#bib.bib9)). When a baseline does not natively handle hybrid actions, we adopt either a discretized variant (†) or a continuous relaxation that argmaxes the discrete branch from a relaxed embedding (‡).
Architectures. The discrete sub-policy is a 3-layer MLP $256 \to 256 \to M_i$; the continuous sub-policy is a 3-layer MLP $256 \to 256 \to 2p_i$ conditioned on the one-hot discrete sample; the centralized critic is a 4-layer MLP $512 \to 512 \to 512 \to 1$; each cost critic is a 3-layer MLP $256 \to 256 \to 1$. All hidden layers use ReLU activations and LayerNorm.
Hyperparameters. Actor LR $3 \times 10^{-4}$, critic LR $10^{-3}$, batch size $256$, replay buffer $10^6$ transitions, target soft-update rate $\tau_{\text{soft}} = 0.005$. Gumbel-Softmax schedule: $\tau_0 = 1.0$, $\tau_{\text{min}} = 0.1$, decay $\beta = 0.9995$; Stgc reference temperature $\tau_0 = 1.0$. Trust region $\delta_{\text{TR}} = 0.01$, Lyapunov decay $\alpha_k = 0.1$, recovery learning rate $\eta_{\text{rec}} = 10^{-4}$. Compute: $4\times$ RTX 4090. UAV-MEC takes approximately $8$ h for 20K episodes; SMAC approximately $12$ h for 5M steps; AIM approximately $6$ h for 15K episodes.
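To make the temperature and target-network settings concrete, the sketch below evaluates one plausible reading of them. It assumes, without confirmation from the source, that the Gumbel-Softmax temperature decays geometrically with a floor, $\tau_k = \max(\tau_{\text{min}}, \tau_0 \beta^k)$, and that the soft update is standard Polyak averaging. The function names are hypothetical, and whether $k$ counts episodes or gradient steps is not specified.

```python
import math

# Values taken from the hyperparameter list above.
TAU_0, TAU_MIN, BETA = 1.0, 0.1, 0.9995
TAU_SOFT = 0.005


def gs_temperature(k: int) -> float:
    """ASSUMED schedule: geometric decay clipped at the floor tau_min."""
    return max(TAU_MIN, TAU_0 * BETA ** k)


def steps_to_floor() -> int:
    """Smallest k at which the unclipped schedule tau_0 * beta**k reaches tau_min."""
    k = math.ceil(math.log(TAU_MIN / TAU_0) / math.log(BETA))
    # Adjust by at most a step or two to guard against floating-point rounding.
    while TAU_0 * BETA ** k > TAU_MIN:
        k += 1
    while k > 0 and TAU_0 * BETA ** (k - 1) <= TAU_MIN:
        k -= 1
    return k


def soft_update(target, online, rate=TAU_SOFT):
    """Polyak averaging: target <- (1 - rate) * target + rate * online."""
    return [(1.0 - rate) * t + rate * o for t, o in zip(target, online)]


if __name__ == "__main__":
    print("decay steps until the floor is reached:", steps_to_floor())
    print("temperature after 1000 decay steps:", gs_temperature(1000))
```

Under this assumed schedule the floor is reached after roughly 4,600 decay steps, so for most of a long run the relaxation would operate at $\tau_{\text{min}}$.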
## Appendix I Hyperparameter Sensitivity
Table 5: Hyperparameter sensitivity on UAV-MEC. Defaults are ★.

| Hyperparam. | Reward ↑ (low) | Reward ↑ (mid) | Reward ↑ (★) | Reward ↑ (high) | Total Viol. ↓ (low) | Total Viol. ↓ (mid) | Total Viol. ↓ (★) | Total Viol. ↓ (high) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\delta_{\text{TR}}$ | $-4.81$ | $-4.74$ | $-4.68$ | $-4.92$ | $1.6$ | $1.5$ | $1.4$ | $2.4$ |
| $\alpha_k$ | $-4.79$ | $-4.68$ | $-4.71$ | $-4.83$ | $1.9$ | $1.4$ | $1.6$ | $1.7$ |
| $\tau_0$ | $-4.86$ | $-4.68$ | $-4.74$ | $-5.01$ | $1.5$ | $1.4$ | $1.5$ | $2.3$ |
| $\omega_1{:}\omega_2$ | $-4.97$ | $-4.78$ | $-4.68$ | $-4.81$ | $2.5$ | $1.7$ | $1.4$ | $1.6$ |
Trident is robust within $\pm 10\%$ of every default, with reward varying by less than $5\%$ and total violations remaining well below the closest baseline.
## Appendix J Additional SMAC Results
Table 6: Per-map SMAC-hybrid win rates (%). Mean over 5 seeds.

| Method | 3s5z\_vs\_3s6z | corridor | 6h\_vs\_8z | MMM2 | 27m\_vs\_30m | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| QMIX | 61.2 | 84.3 | 9.4 | 87.5 | 32.1 | 54.9 |
| MAPPO | 68.7 | 91.6 | 87.5 | 91.4 | 80.3 | 83.9 |
| FACMAC | 63.4 | 86.1 | 42.3 | 89.7 | 48.6 | 66.0 |
| HAPPO | 69.1 | 90.4 | 84.2 | 90.8 | 77.6 | 82.4 |
| MACPO | 66.8 | 88.4 | 73.2 | 90.1 | 65.9 | 76.9 |
| MADAC | 67.3 | 89.1 | 76.4 | 90.4 | 68.7 | 78.4 |
| ★ Trident (Ours) | 72.4 | 93.7 | 89.6 | 93.2 | 82.5 | 86.3 |
## Appendix K Extended Discussion
Why a directed cycle and not a list? The three leaks of Lemma [1](https://arxiv.org/html/2606.18308#Thmtheorem1) share multiplicative factors: $\eta_s$ appears in both the F1 $\to$ F2 and F2 $\to$ F3 leaks, and $\omega/\tau$ couples the F3 $\to$ F1 leak to the temperature of Stgc. Eliminating only one leak therefore reduces but does not stop the per-iteration violation; only when all three principles hold simultaneously does the cumulative violation contract to $\mathcal{O}(\sqrt{K})$.
Why Richardson–Romberg rather than control variates? Control-variate approaches to discrete reparameterization (Paulus et al., [2021](https://arxiv.org/html/2606.18308#bib.bib26); Shekhovtsov, [2023](https://arxiv.org/html/2606.18308#bib.bib32)) reduce variance but not bias to leading order. Our setting is bias-limited (Principle [1](https://arxiv.org/html/2606.18308#Thmprinciple1)), so a bias-cancellation scheme is fundamentally more appropriate. Richardson–Romberg is the canonical bias-cancellation method, has been used to accelerate stochastic gradient descent (Bach, [2021](https://arxiv.org/html/2606.18308#bib.bib5)), and, to our knowledge, has not previously been applied to Gumbel-Softmax.
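The bias-cancellation idea can be illustrated on a textbook case. The sketch below is not Stgc and involves no Gumbel-Softmax sampling. It applies Richardson extrapolation to a forward-difference derivative, whose bias is first order in the step size $h$, and shows that the combination $2\,g(h/2) - g(h)$ leaves a second-order bias. Here $h$ merely plays a role analogous to the temperature $\tau$, and all function names are hypothetical.

```python
import math


def forward_diff(f, x, h):
    """Forward-difference derivative estimate; its bias is O(h)."""
    return (f(x + h) - f(x)) / h


def richardson(estimator, h):
    """Combine estimates at h and h/2 so that the O(h) bias term cancels."""
    return 2.0 * estimator(h / 2.0) - estimator(h)


def est(h):
    """Biased estimate of d/dx exp(x) at x = 0 (the true value is 1)."""
    return forward_diff(math.exp, 0.0, h)


if __name__ == "__main__":
    truth = 1.0
    for h in (0.1, 0.05):
        plain_err = abs(est(h) - truth)
        rr_err = abs(richardson(est, h) - truth)
        print(f"h={h}: plain error {plain_err:.2e}, extrapolated error {rr_err:.2e}")
```

Halving $h$ roughly halves the plain error but cuts the extrapolated error by about a factor of four, which is the first-order versus second-order behaviour that the paper claims for the Gumbel-Softmax bias in $\tau$.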
Why decompose value, not reward? An additive shaping term $r + \omega Q_{\text{phys}}$ leaves the optimal policy invariant only under the potential-function condition $Q_{\text{phys}}(s,a) = \Phi(s) - \gamma \mathbb{E}[\Phi(s')]$ of Ng et al. ([1999](https://arxiv.org/html/2606.18308#bib.bib25)); closed-form physics models such as Shannon capacity do not satisfy this condition. Decomposing the critic, in contrast, leaves the Bellman fixed point intact for *any* bounded $Q_{\text{phys}}$ and only shrinks the function class to be regressed.
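A toy tabular example makes the contrast concrete. The sketch below assumes an additive decomposition $Q = Q_{\text{phys}} + Q_{\text{res}}$ (the paper's exact parameterization is not shown in this appendix) and uses a hypothetical two-state MDP with an arbitrary bounded prior that is not potential-based. In a tabular setting the decomposition only changes the starting point of Q-iteration, so it illustrates the fixed-point claim but not the function-class benefit.

```python
GAMMA, OMEGA = 0.9, 1.0
STATES, ACTIONS = (0, 1), (0, 1)

# HYPOTHETICAL deterministic MDP: NEXT[s][a] is the successor, REWARD[s][a] the reward.
NEXT = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
REWARD = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 2.0}}
# Arbitrary bounded "physics" prior, deliberately not potential-based.
Q_PHYS = {0: {0: 5.0, 1: -3.0}, 1: {0: 4.0, 1: -1.0}}


def q_iteration(reward, prior=None, iters=500):
    """Q-iteration on a residual table R, where Q = prior + R (prior=None means zero)."""
    if prior is None:
        prior = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    res = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    for _ in range(iters):
        new = {}
        for s in STATES:
            new[s] = {}
            for a in ACTIONS:
                s2 = NEXT[s][a]
                # Bootstrap on the full value prior + residual, then regress the residual.
                boot = max(prior[s2][b] + res[s2][b] for b in ACTIONS)
                new[s][a] = reward[s][a] + GAMMA * boot - prior[s][a]
        res = new
    return {s: {a: prior[s][a] + res[s][a] for a in ACTIONS} for s in STATES}


def greedy(q):
    """Greedy action in each state."""
    return {s: max(ACTIONS, key=lambda a: q[s][a]) for s in STATES}


q_std = q_iteration(REWARD)                # standard critic
q_res = q_iteration(REWARD, prior=Q_PHYS)  # residual critic, same reward
shaped = {s: {a: REWARD[s][a] + OMEGA * Q_PHYS[s][a] for a in ACTIONS} for s in STATES}
q_shaped = q_iteration(shaped)             # reward shaping with the same prior

if __name__ == "__main__":
    print("standard greedy policy:", greedy(q_std))
    print("residual greedy policy:", greedy(q_res))
    print("shaped   greedy policy:", greedy(q_shaped))
```

In this toy problem the residual critic recovers the same optimal values and greedy policy as the standard critic, while adding the same prior to the reward changes the greedy action in both states.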
Connection to recent safe-MARL theory. Li and Azizan ([2024](https://arxiv.org/html/2606.18308#bib.bib15)) establish generalized-Nash convergence for safe MARL but require an unbiased policy gradient, an assumption that our Lemma [1](https://arxiv.org/html/2606.18308#Thmtheorem1) shows is broken whenever the action space is hybrid. Stgc closes this gap and is the missing ingredient for safe-MARL theory in hybrid-action environments.
When does the framework not help? If physics is uninformative ($R_{\text{phys}}^2 \to 0$), the sample-complexity gain vanishes and the residual critic reduces to a standard centralized one; if the safe set is non-strict (Slater fails), no algorithm can recover feasibility and the recovery step ([5](https://arxiv.org/html/2606.18308#S5.E5)) repeats indefinitely. We view both regimes as legitimately hard and outside the scope of any safe-MARL guarantee.