Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates
Summary
This paper introduces Trust Region Inverse Reinforcement Learning (TRIRL), a method that combines monotonic dual improvement with efficient local policy updates to outperform state-of-the-art imitation learning methods. It addresses the trade-off between stability and computational cost in IRL by using trust-region constraints.
# Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates
Source: [https://arxiv.org/html/2605.11020](https://arxiv.org/html/2605.11020)
Davide Tateo, Christopher E. Mower, Haitham Bou Ammar, Jan Peters, Oleg Arenz
###### Abstract
Inverse reinforcement learning (IRL) is typically formulated as maximizing entropy subject to matching the distribution of expert trajectories. Classical (dual-ascent) IRL guarantees monotonic performance improvement but requires fully solving an RL problem each iteration to compute dual gradients. More recent adversarial methods avoid this cost at the expense of stability and monotonic dual improvement, by directly optimizing the primal problem and using a discriminator to provide rewards. In this work, we bridge the gap between these approaches by enabling monotonic improvement of the reward function and policy without having to fully solve an RL problem at every iteration. Our key theoretical insight is that a trust-region-optimal policy for a reward function update can be globally optimal for a smaller update in the same direction. This smaller update allows us to explicitly optimize the dual objective while only relying on a local search around the current policy. In doing so, our approach avoids the training instabilities of adversarial methods, offers monotonic performance improvement, and learns a reward function in the traditional sense of IRL: one that can be globally optimized to match expert demonstrations. Our proposed algorithm, Trust Region Inverse Reinforcement Learning (TRIRL), outperforms state-of-the-art imitation learning methods across multiple challenging tasks by a factor of 2.4x in terms of aggregate inter-quartile mean, while recovering reward functions that generalize to system dynamics shifts.
Inverse Reinforcement Learning, Learning from Demonstrations, Imitation Learning, Reward Learning, Reinforcement Learning, Robotics
## 1 Introduction
As autonomous agents become prevalent in everyday settings, empowering these systems with human-like behaviour becomes an important challenge. This challenge can be addressed with Inverse Reinforcement Learning (IRL) (Ng and Russell, [2000](https://arxiv.org/html/2605.11020#bib.bib10); Russell, [1998](https://arxiv.org/html/2605.11020#bib.bib11)), a machine learning framework where agents infer the underlying intentions of human demonstrations in the form of a reward function. By optimizing this learnt reward with reinforcement learning, IRL methods can recover robust imitation policies with the additional benefit of transferring the behaviour to new environments. However, inferring an informative reward function from demonstrations is challenging, and therefore, many methods focus on Imitation Learning (IL) (Osa et al., [2018](https://arxiv.org/html/2605.11020#bib.bib55)), the problem of directly learning a policy that matches the demonstrations in some given environment.
Figure 1: A comparison of TRIRL (ours) and a MaxCausalEnt-IRL style update. The Lagrangian dual to be optimized is indicated by the curve $\mathcal{L}(\pi_r, r)$. MCE-IRL performs a full RL optimization after updating the reward function. In contrast, TRIRL only optimizes the policy within a trust region of the previous MCE policy and accounts for this by correcting the updated reward function. Trust region policy updates are much cheaper to compute and reward correction ensures that the new policy-reward pair gets closer to the saddle point of $\mathcal{L}(\pi, r)$. Our algorithm hence has the same monotonic improvement guarantees as MCE-IRL, while being able to converge in complex, high-dimensional settings.

IL only addresses the policy-side problem in IRL and does not extract the underlying reward function. It is generally solved using adversarial optimization, with methods such as GAIL (Ho and Ermon, [2016](https://arxiv.org/html/2605.11020#bib.bib6)) and its several variants (Peng et al., [2018](https://arxiv.org/html/2605.11020#bib.bib16); Kostrikov et al., [2019](https://arxiv.org/html/2605.11020#bib.bib19); Ghasemipour et al., [2020](https://arxiv.org/html/2605.11020#bib.bib20); Peng et al., [2021](https://arxiv.org/html/2605.11020#bib.bib12)) being the standard approach. These methods formulate IL as a two-player min-max game, where one player (called the discriminator) assigns local rewards based on how well the RL policy imitates the expert, while the policy updates itself to maximise this local signal given by the discriminator. However, adversarial IL is inherently unstable and noisy because the discriminator reward provides only a local correction signal. Practically speaking, these methods are challenging to tune and their performance is highly task-dependent.
This raises a key question: how can we learn informative reward functions and effective policies in a scalable, principled way, while avoiding the instability of adversarial IL? We approach this question by retracing modern adversarial IL back to its theoretical roots in the Maximum Causal Entropy IRL (MCE-IRL) framework (Ziebart et al., [2008](https://arxiv.org/html/2605.11020#bib.bib5), [2010](https://arxiv.org/html/2605.11020#bib.bib4)). Our insights motivate a scalable approach for imitation learning and inverse reinforcement learning based on explicit dual ascent.
MCE-IRL interprets imitation as having similar occupancy over the underlying Markov Decision Process. (Footnote 1: Originally, Ziebart et al. ([2010](https://arxiv.org/html/2605.11020#bib.bib4)) formalize MCE-IRL as a state-visitation matching problem, but this is equivalent to occupancy matching (Ho and Ermon, [2016](https://arxiv.org/html/2605.11020#bib.bib6); Arenz et al., [2016](https://arxiv.org/html/2605.11020#bib.bib1)).) An imitation policy is considered MCE optimal if it has the maximum causal entropy among all candidate policies that induce the same occupancy as the expert. This problem is typically solved by optimizing its Lagrangian $\mathcal{L}(\pi, r)$, where $\mathcal{G}(r) = \mathcal{L}(\pi_r, r)$ is its dual, and $\pi_r$ is the optimal policy for $r$. The original MCE-IRL method (Ziebart et al., [2010](https://arxiv.org/html/2605.11020#bib.bib4)) is an iterative saddle point algorithm that alternates between policy optimization and reward updates. At every iteration, the reward function’s (dual variable) gradients are given by the difference between expert and agent feature counts and the policy (primal variable) is learnt by solving the maximum entropy RL problem till convergence. Modern adversarial IL algorithms reformulate this dual-ascent procedure as a two-player min-max game. However, in doing so, the global intermediate reward function from MCE-IRL is replaced by a local reward based on the previous policy’s rollouts (provided by a discriminator). Consequently, the policy update is not MCE optimal; instead, the policy only takes a few gradient steps toward maximizing the entropy-augmented reward. While this adversarial procedure has the same saddle point as MCE-IRL (Ho and Ermon, [2016](https://arxiv.org/html/2605.11020#bib.bib6), Proposition 3.2), it relies on per-step local optimization to obtain a solution and does not optimize the reward function corresponding to the dual. Ultimately, the local discriminator rewards and suboptimal policies render it practically unstable and difficult to train reliably and consistently.
This paper proposes a new, non-adversarial IRL algorithm that solves the problem of local rewards and suboptimal intermediate policies. We present Trust Region Inverse Reinforcement Learning (TRIRL), an algorithm that performs dual ascent on the original MCE-IRL problem, leading to monotonic performance improvement and stable learning. Crucially, our method avoids both the need to run expensive, full RL solutions at every iteration (like MCE-IRL) and the reliance on approximate local optimization (like GAIL).
Our work builds on prior work by Arenz et al. ([2016](https://arxiv.org/html/2605.11020#bib.bib1)) that showed that instead of descending $\mathcal{G}(r)$ along its parametric gradient, it is significantly more efficient to perform a reward update in function space. Given an initial max-ent pair $(\pi_r, r)$ and their function-space reward update on $r$, our main result is to show that policy optimization within a reverse KL trust-region around $\pi_r$ is sufficient to find a policy $\pi^{\text{mce}}$ that is max-ent optimal for a new, corrected reward function computed by taking a smaller update step along the same function-space update direction. Hence, we can leverage a novel mechanic for IRL: instead of finding a max-ent optimal policy for the updated reward, we find a trust region optimal policy for this reward, and correct the reward function to account for the fact that our policy was only optimized locally. This results in a valid maximum entropy pair $(\pi^{\text{mce}}_{r_{\text{corrected}}}, r_{\text{corrected}})$. This means that the policy we learn at each iteration (i) is a global optimizer of the corrected reward function and (ii) gets closer to the expert in terms of the reverse KL divergence. By repeating this procedure we recover an IRL algorithm with the same monotonic performance improvement as MCE-IRL. We illustrate our method in [Figure 1](https://arxiv.org/html/2605.11020#S1.F1).
Our algorithm outperforms prior works like GAIL, AIRL, AMP, LSIQ, NEAR, and SFM (Ho and Ermon, [2016](https://arxiv.org/html/2605.11020#bib.bib6); Fu et al., [2018](https://arxiv.org/html/2605.11020#bib.bib2); Peng et al., [2021](https://arxiv.org/html/2605.11020#bib.bib12); Al-Hafez et al., [2023a](https://arxiv.org/html/2605.11020#bib.bib7); Diwan et al., [2025](https://arxiv.org/html/2605.11020#bib.bib22); Jain et al., [2025](https://arxiv.org/html/2605.11020#bib.bib51)) on a variety of challenging continuous control tasks.
#### Related Work\.
Behavioural Cloning (BC) is arguably the most straightforward approach to Imitation Learning. It formulates IL as a supervised learning problem to find a policy that closely matches the demonstrated actions. Although BC methods such as those of Pomerleau ([1991](https://arxiv.org/html/2605.11020#bib.bib17)) and Reddy et al. ([2019](https://arxiv.org/html/2605.11020#bib.bib18)) have previously shown successful imitation capabilities, the supervised fitting has theoretical limitations (namely covariate shift, poor generalization, and the need for large datasets) that degrade their performance in real-world environments. On the other hand, Inverse RL methods like MCE-IRL (Ziebart et al., [2008](https://arxiv.org/html/2605.11020#bib.bib5), [2010](https://arxiv.org/html/2605.11020#bib.bib4)) learn imitation policies using reinforcement learning and are hence more robust than BC. This formulation can also be used for deriving direct IL methods, such as GAIL (Ho and Ermon, [2016](https://arxiv.org/html/2605.11020#bib.bib6)), which reformulates the dual-ascent formulation of Ziebart et al. ([2010](https://arxiv.org/html/2605.11020#bib.bib4)) into an adversarial min-max optimization procedure that minimizes the Jensen-Shannon divergence between the agent and expert occupancies. Several other works build on this adversarial formulation. For instance, Fu et al. ([2018](https://arxiv.org/html/2605.11020#bib.bib2)) modify the GAIL procedure into a state-based algorithm focusing on reward recovery. Ghasemipour et al. ([2020](https://arxiv.org/html/2605.11020#bib.bib20)) reformulate it for general $f$-divergences and Peng et al. ([2018](https://arxiv.org/html/2605.11020#bib.bib16)) leverage the empirical benefits of $\chi^2$-divergence GANs (Mao et al., [2017](https://arxiv.org/html/2605.11020#bib.bib23)) for adversarial IL. Kostrikov et al. ([2020](https://arxiv.org/html/2605.11020#bib.bib13)) present ValueDICE, a method that leverages the inverse Bellman operator to reformulate GAIL into a value-function based, off-policy, distribution matching approach. Several other works (Kostrikov et al., [2019](https://arxiv.org/html/2605.11020#bib.bib19); Orsini et al., [2021](https://arxiv.org/html/2605.11020#bib.bib21); Diwan et al., [2025](https://arxiv.org/html/2605.11020#bib.bib22)) shed light on important factors like survival/termination bias induced by learnt reward functions, empirical training dynamics of adversarial IL, and theoretical reasoning for their instability. While the adversarial IL prior works listed here are diverse in terms of their contributions, all of them are prone to instabilities rooted in the local rewards and suboptimal policies arising from adversarial learning.
The work of Ziebart et al. ([2010](https://arxiv.org/html/2605.11020#bib.bib4)) can also be formulated into non-adversarial techniques. Arenz and Neumann ([2020](https://arxiv.org/html/2605.11020#bib.bib14)) extend ValueDICE by formulating a lower-bound reward function for reverse KL distribution matching and using soft actor critic (SAC) (Haarnoja et al., [2018](https://arxiv.org/html/2605.11020#bib.bib24)) to learn a $Q$-function and policy. Garg et al. ([2021](https://arxiv.org/html/2605.11020#bib.bib15)) propose IQ-Learn, a similar method that generalizes to a variety of divergences between the agent and expert occupancies. Instead of optimizing the dual and primal variables separately, IQ-Learn learns a $Q$-function via distribution matching and leverages the fact that, in discrete settings, the MCE-optimal policy depends on the $Q$-function in closed form. However, IQ-Learn requires a dynamics model to recover the reward function from this learnt $Q$-function. Recently, Al-Hafez et al. ([2023a](https://arxiv.org/html/2605.11020#bib.bib7)) introduced LSIQ, an extension of IQ-Learn that leverages the benefits of $\chi^2$-divergence minimization and uses mixture distributions for improved performance. Several of these non-adversarial IRL methods still use off-policy RL (SAC) for policy learning. This makes it challenging to apply them to larger, highly parallelized environments, where SAC faces scaling challenges. Finally, Boularias et al. ([2011](https://arxiv.org/html/2605.11020#bib.bib52)) previously explored the idea of constraining policy updates in IRL to a relative-entropy-based trust region around a baseline policy. However, their method does not address reward updates under local policy optimization, and inherits the scaling challenges of Ziebart et al. ([2010](https://arxiv.org/html/2605.11020#bib.bib4)), requiring trajectory-level sampling and hand-specified features.
## 2 Background
#### Preliminaries\.
Similarly to previous works, we model the environment as a Markov Decision Process (MDP) defined by a tuple $(\mathcal{S}, \mathcal{A}, \mu_0, \mathcal{P}, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mu_0$ is the initial state distribution, $\mathcal{P}(\mathbf{s}'|\mathbf{s},\mathbf{a})$ represents the transition dynamics, $r(\mathbf{s},\mathbf{a}) \in \mathbb{R}$ is the (unknown) reward function, and $\gamma$ is the discount factor. $\Pi$ is the set of all stationary stochastic policies mapping states in $\mathcal{S}$ to actions in $\mathcal{A}$. We define the occupancy measure $\rho_{\pi}(\mathbf{s},\mathbf{a}) = \pi(\mathbf{a}|\mathbf{s}) \sum_{t=0}^{\infty} \gamma^{t} \mu^{\pi}_{t}(\mathbf{s})$, where $\mu_{t}^{\pi}(\mathbf{s}') = \sum_{\mathbf{s}} \mu^{\pi}_{t-1}(\mathbf{s}) \sum_{\mathbf{a}} \pi(\mathbf{a}|\mathbf{s}) \mathcal{P}(\mathbf{s}'|\mathbf{s},\mathbf{a})$ is the state distribution for $t > 0$, with $\mu^{\pi}_{0}(\mathbf{s}) = \mu_{0}(\mathbf{s})$. We work in the $\gamma$-discounted infinite horizon setting and use an expectation with respect to a policy $\pi \in \Pi$ to denote an expectation with respect to the trajectory it generates: $\mathbb{E}_{\pi}[r(\mathbf{s},\mathbf{a})] \triangleq \mathbb{E}_{(\mathbf{s}_{0} \sim \mu_{0},\, \mathbf{a}_{t} \sim \pi,\, \mathbf{s}_{t+1} \sim \mathcal{P})}\left[\sum_{t=0}^{\infty} \gamma^{t} r(\mathbf{s}_{t},\mathbf{a}_{t})\right]$. We refer to the (unknown) expert policy as $\pi_{E}$, its occupancy measure as $\rho_{E}(\mathbf{s},\mathbf{a})$, and the dataset of expert demonstrations as $\mathcal{D}$. Following prior work (Ziebart et al., [2010](https://arxiv.org/html/2605.11020#bib.bib4); Haarnoja et al., [2018](https://arxiv.org/html/2605.11020#bib.bib24)), we define the entropy-regularized (“soft”) value functions $V_{\pi}(\mathbf{s}) = \mathbb{E}_{\mathbf{a} \sim \pi}\left[Q_{\pi}(\mathbf{s},\mathbf{a}) - \log \pi(\mathbf{a}|\mathbf{s})\right]$ and $Q_{\pi}(\mathbf{s},\mathbf{a}) = r(\mathbf{s},\mathbf{a}) + \gamma \mathbb{E}_{\mathbf{s}' \sim \mathcal{P}}\left[V_{\pi}(\mathbf{s}')\right]$.
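To make the occupancy measure concrete, the following minimal sketch computes $\rho_\pi$ for a small tabular MDP by truncating the discounted sum. The two-state MDP, the policy, and the function name `occupancy_measure` are illustrative assumptions and do not come from the paper.

```python
def occupancy_measure(P, pi, mu0, gamma, horizon=1000):
    """Truncated discounted occupancy rho[s][a] = pi(a|s) * sum_t gamma^t mu_t(s).

    P[s][a][s2] is a transition probability, pi[s][a] the policy and
    mu0[s] the initial state distribution (all hypothetical toy inputs).
    """
    n_states, n_actions = len(pi), len(pi[0])
    mu = list(mu0)                  # mu_t, starting at t = 0
    state_occ = [0.0] * n_states    # running sum_t gamma^t mu_t(s)
    discount = 1.0
    for _ in range(horizon):
        for s in range(n_states):
            state_occ[s] += discount * mu[s]
        # One-step propagation of the state distribution under pi and P.
        nxt = [0.0] * n_states
        for s in range(n_states):
            for a in range(n_actions):
                for s2 in range(n_states):
                    nxt[s2] += mu[s] * pi[s][a] * P[s][a][s2]
        mu = nxt
        discount *= gamma
    return [[pi[s][a] * state_occ[s] for a in range(n_actions)]
            for s in range(n_states)]


# Toy two-state, two-action MDP (made up for illustration).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
pi = [[0.7, 0.3], [0.4, 0.6]]
mu0 = [1.0, 0.0]
gamma = 0.9

rho = occupancy_measure(P, pi, mu0, gamma)
total_mass = sum(sum(row) for row in rho)
```

With this unnormalized definition the entries of `rho` sum to $1/(1-\gamma)$, which gives a quick sanity check on any implementation.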
#### Max Causal Entropy IRL/RL\.
The maximum entropy principle (Jaynes, [1957](https://arxiv.org/html/2605.11020#bib.bib25)) states that given several probability distributions consistent with the observed data, the best approach is to choose the least committal one (i.e., one that has the maximum entropy). When applied to the problem of (inverse) reinforcement learning (Ziebart et al., [2010](https://arxiv.org/html/2605.11020#bib.bib4); Eysenbach and Levine, [2022](https://arxiv.org/html/2605.11020#bib.bib44)), the maximum entropy principle offers improved robustness and an elegant solution to the problem of encouraging exploration. Given a set of demonstration trajectories $\mathcal{D} = \{(\mathbf{s}_{i},\mathbf{a}_{i})\}_{i=1}^{N}$ sampled from an expert, MCE-IRL aims to find a reward function $r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ and a policy $\pi \in \Pi$ that solve the optimization problem $\max_{\pi} \min_{r} \left(\mathbb{E}_{\rho_{\pi}}[r(\mathbf{s},\mathbf{a})] + H(\pi)\right) - \mathbb{E}_{\rho_{E}}[r(\mathbf{s},\mathbf{a})]$. In contrast, given a reward function $r$, Maximum Entropy RL aims to find a policy $\pi$ that maximises the expected reward plus entropy: $\max_{\pi} \mathbb{E}_{\rho_{\pi}}[r(\mathbf{s},\mathbf{a})] + H(\pi)$. Here $H(\pi) \triangleq \mathbb{E}_{\pi}[-\log \pi(\mathbf{a}|\mathbf{s})]$ is the discounted causal entropy of the policy. On solving the Lagrangian, the optimal policy satisfies the Boltzmann distribution $\pi^{\star}(\mathbf{a}|\mathbf{s}) = \frac{1}{Z_{\mathbf{s}}} \exp\left(Q^{\star}(\mathbf{s},\mathbf{a})\right)$, where

$$
\begin{aligned}
Q^{\star}(\mathbf{s},\mathbf{a}) &= r(\mathbf{s},\mathbf{a}) + \gamma \mathbb{E}_{\mathbf{s}' \sim \mathcal{P}}\left[\log \sum_{\mathbf{a}'} \exp\left(Q^{\star}(\mathbf{s}',\mathbf{a}')\right)\right], \\
V^{\star}(\mathbf{s}) &= \log \sum_{\mathbf{a}'} \exp\left(Q^{\star}(\mathbf{s},\mathbf{a}')\right)
\end{aligned}
$$

are the optimal soft value functions and $Z_{\mathbf{s}}$ is a normalization term. Notice that maximum entropy RL is a subroutine of the MCE-IRL procedure.
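The soft Bellman equations above can be solved by fixed-point iteration in small discrete problems. The sketch below is a minimal tabular version under that assumption; the function name `soft_value_iteration` and the toy MDP are hypothetical, and it stands in for the deep maximum entropy RL solvers needed in continuous settings.

```python
import math


def soft_value_iteration(P, r, gamma, iters=2000):
    """Iterate the soft Bellman backup on a tabular MDP.

    P[s][a][s2]: transition probabilities, r[s][a]: rewards (toy inputs).
    Returns (Q, V, pi) with pi(a|s) = exp(Q(s,a) - V(s)).
    """
    n_states, n_actions = len(r), len(r[0])
    V = [0.0] * n_states
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        for s in range(n_states):
            for a in range(n_actions):
                expected_next = sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
                Q[s][a] = r[s][a] + gamma * expected_next
        for s in range(n_states):
            m = max(Q[s])  # subtract the max for a stable log-sum-exp
            V[s] = m + math.log(sum(math.exp(q - m) for q in Q[s]))
    pi = [[math.exp(Q[s][a] - V[s]) for a in range(n_actions)]
          for s in range(n_states)]
    return Q, V, pi


# Toy two-state, two-action MDP (made up for illustration).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
r = [[1.0, 0.0], [0.0, 0.5]]
Q, V, pi = soft_value_iteration(P, r, gamma=0.9)
```

Because $V^\star(\mathbf{s})$ is the log-sum-exp of $Q^\star(\mathbf{s},\cdot)$, the normalizer is $Z_{\mathbf{s}} = \exp(V^\star(\mathbf{s}))$ and each row of `pi` sums to one. With zero rewards and two actions, the iteration converges to $V = \log 2 / (1-\gamma)$ in every state.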
#### IRL by Distribution Matching\.
Our work is based on IRL by reverse KL-divergence based distribution matching. This problem is deeply rooted in prior work (Arenz et al., [2016](https://arxiv.org/html/2605.11020#bib.bib1); Fu et al., [2018](https://arxiv.org/html/2605.11020#bib.bib2); Kostrikov et al., [2020](https://arxiv.org/html/2605.11020#bib.bib13); Arenz and Neumann, [2020](https://arxiv.org/html/2605.11020#bib.bib14); Ghasemipour et al., [2020](https://arxiv.org/html/2605.11020#bib.bib20); Garg et al., [2021](https://arxiv.org/html/2605.11020#bib.bib15)), and is expressed by the optimization problem

$$
\underset{\pi(\mathbf{a}|\mathbf{s})}{\max}\; \mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[H(\pi(\mathbf{a}|\mathbf{s}))\right] - \beta\, \mathbb{E}_{\rho_{\pi}(\mathbf{s},\mathbf{a})}\left[\log \frac{\rho_{\pi}(\mathbf{s},\mathbf{a})}{\rho_{E}(\mathbf{s},\mathbf{a})}\right] \tag{1}
$$

where $\beta$ trades off between entropy regularization and the objective of matching the expert’s occupancy. Arenz et al. ([2016](https://arxiv.org/html/2605.11020#bib.bib1)) show that under this problem setting, in the limit $\beta \rightarrow \infty$, the optimal policy and value functions are the same as for MCE-IRL, and the reward function for matching the expert’s distribution is $r^{\star} = \beta \log\left(\rho_{E}(\mathbf{s},\mathbf{a}) / \rho_{\pi^{\star}}(\mathbf{s},\mathbf{a})\right)$, where $(\pi^{\star}, r^{\star})$ is the same saddle point as in MCE-IRL. The optimal reward function depends on the state-action distribution induced by the optimal policy of [Equation 1](https://arxiv.org/html/2605.11020#S2.E1), resulting in a cyclic dependency between the optimal policy and the reward function. However, Arenz et al. ([2016](https://arxiv.org/html/2605.11020#bib.bib1)) show that the reward function can also be learnt by iteratively applying the function-space update operator $\mathcal{U}_{\epsilon}^{\rho^{(i)}}$,

$$
r^{(i+1)} = \left(\mathcal{U}_{\epsilon}^{\rho^{(i)}}\right) r^{(i)} = r^{(i)} - \epsilon \underbrace{\left(r^{(i)} - \beta \log \frac{\rho_{E}(\mathbf{s},\mathbf{a})}{\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})}\right)}_{\delta^{(i)}} \tag{2}
$$

to the current estimate of the reward function, $r^{(i)}$, where $\pi^{(i)}$ is the optimal maximum entropy policy for reward function $r^{(i)}$ (we use $\rho^{(i)}$ as shorthand for $\rho_{\pi^{(i)}}$). It can be shown that this update aligns with the gradient of the dual of [Equation 1](https://arxiv.org/html/2605.11020#S2.E1) (proof in [Appendix A](https://arxiv.org/html/2605.11020#A1)) and is empirically several orders of magnitude more efficient than using the actual gradient. Hence, successive applications of [Equation 2](https://arxiv.org/html/2605.11020#S2.E2) lead to monotonic improvement in the objective. Such prior work was restricted to linear approximations of the system dynamics and assumed the expert distribution to be Gaussian. Under these relaxations, the optimal policy can be computed using dynamic programming and the log density ratio, $\log\left(\rho_{E}(\mathbf{s},\mathbf{a}) / \rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})\right)$, computed in closed form. However, such relaxations do not work in most real-world applications.
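As a small illustration of the function-space update in Equation 2, the sketch below applies it entrywise to a tabular reward. It assumes exact occupancies are available, which only holds in toy problems, and the function name `function_space_update` and all numbers are hypothetical.

```python
import math


def function_space_update(r, rho_expert, rho_policy, epsilon, beta):
    """Apply r <- r - epsilon * (r - beta * log(rho_E / rho_pi)) entrywise.

    All arguments are flat lists indexed by (state, action) pairs; exact
    densities are an assumption of this sketch.
    """
    updated = []
    for r_sa, rho_e, rho_p in zip(r, rho_expert, rho_policy):
        target = beta * math.log(rho_e / rho_p)  # beta * log density ratio
        delta = r_sa - target                    # update direction delta^(i)
        updated.append(r_sa - epsilon * delta)
    return updated


r = [0.0, 0.0, 0.0]
rho_expert = [0.7, 0.2, 0.1]
rho_policy = [1 / 3, 1 / 3, 1 / 3]
r_next = function_space_update(r, rho_expert, rho_policy, epsilon=0.5, beta=1.0)
```

The update is an interpolation, $(1-\epsilon)\, r^{(i)} + \epsilon \beta \log(\rho_E / \rho_{\pi^{(i)}})$: a step size of 1 jumps straight to the scaled log density ratio, while a step size of 0 leaves the reward unchanged.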
## 3 Trust Region Inverse Reinforcement Learning


Figure 2: We demonstrate TRIRL in a grid-world experiment and compare policies and normalized rewards. We also show the monotonically improving reverse KL divergence and dual objective. TRIRL exactly recovers the expert’s policy and recovers a reward function that matches the expert’s reward (except for ambiguity due to the temporal credit assignment problem).

In real-world, continuous settings, the densities $\rho_{\pi}(\mathbf{s},\mathbf{a})$ and $\rho_{E}(\mathbf{s},\mathbf{a})$ cannot be computed in closed form. Instead, the log density ratio $\log\left(\rho_{E}(\mathbf{s},\mathbf{a}) / \rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})\right)$ needs to be approximated from the dataset of expert demonstrations and policy rollouts. In such a setting, the distribution matching approach yields a reward formulation that is very similar to modern adversarial techniques like GAIL (Ho and Ermon, [2016](https://arxiv.org/html/2605.11020#bib.bib6)). However, as highlighted in [Section 1](https://arxiv.org/html/2605.11020#S1), the local reward approximations and suboptimal policy updates often lead to unstable convergence for such adversarial methods.
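One common way to approximate a log density ratio from samples is the classifier trick: with equally many samples per class, the logit of a well-fit binary classifier approximates $\log(\rho_E/\rho_\pi)$. The sketch below illustrates this on made-up one-dimensional Gaussian data; it is an assumption for illustration and not necessarily the estimator used by the authors, and `fit_logistic_1d` is a hypothetical helper.

```python
import math
import random


def fit_logistic_1d(expert_x, policy_x, lr=0.5, steps=400):
    """Fit sigmoid(w * x + b) to separate expert (label 1) from policy (label 0).

    Plain full-batch gradient descent on the mean logistic loss.
    """
    data = [(x, 1.0) for x in expert_x] + [(x, 0.0) for x in policy_x]
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


random.seed(0)
# Hypothetical 1-D "occupancies": expert ~ N(+1, 1), policy ~ N(-1, 1),
# for which the true log ratio is exactly 2 * x.
expert_x = [random.gauss(1.0, 1.0) for _ in range(1000)]
policy_x = [random.gauss(-1.0, 1.0) for _ in range(1000)]
w, b = fit_logistic_1d(expert_x, policy_x)


def log_ratio_estimate(x):
    # Classifier logit used as the estimate of log(rho_E(x) / rho_pi(x)).
    return w * x + b
```

For these two unit-variance Gaussians the fitted slope should land near 2 and the intercept near 0, up to sampling noise. In TRIRL such an estimate plays the role of $D^{(i)}$ inside a non-adversarial reward update rather than serving directly as a discriminator reward.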
Instead, we propose non-adversarial reward function updates similar to [Equation 2](https://arxiv.org/html/2605.11020#S2.E2) that ensure monotonic improvement on the objective function ([Equation 1](https://arxiv.org/html/2605.11020#S2.E1)). The main challenge in applying [Equation 2](https://arxiv.org/html/2605.11020#S2.E2) in real-world settings is that this update, just like the gradient-based update, depends on the optimal policy $\pi^{(i)}(\mathbf{a}|\mathbf{s})$ for the current reward function estimate $r^{(i)}(\mathbf{s},\mathbf{a})$. Vanilla MCE-IRL finds this optimal policy by running maximum entropy reinforcement learning until convergence. However, solving the full deep reinforcement learning problem at every iteration of inverse reinforcement learning is prohibitively expensive for most real-world applications. Ideally, we would like a procedure that obtains the current iteration’s optimal policy after only a few updates to the previous policy, so that we can iteratively obtain the MCE-optimal policy and the corresponding updated reward function. In this paper, we derive reward function updates similar to [Equation 2](https://arxiv.org/html/2605.11020#S2.E2) and show that the MCE-optimal policy can be computed based on trust-region optimal policies

$$
\begin{aligned}
\pi^{(i+1)}_{\text{tr}} = \underset{\pi(\mathbf{a}|\mathbf{s})}{\arg\max}\;& \mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[H(\pi(\mathbf{a}|\mathbf{s}))\right] + \mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[\mathbb{E}_{\pi}\left[r^{(i)}(\mathbf{s},\mathbf{a})\right]\right], \\
\text{s.t.}\;& \mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[\mathrm{KL}\left(\pi \,\|\, \pi^{(i)}_{\text{tr}}\right)\right] \leq \zeta.
\end{aligned} \tag{3}
$$
Our main contribution is to show that a trust region policy $\pi_{\text{tr}}^{(i+1)}(\mathbf{a}|\mathbf{s})$ for a reward function $\tilde{r}^{(i+1)} = \left(\mathcal{U}_{\epsilon}^{\rho^{(i)}}\right) r^{(i)}$ is the maximum-entropy optimal policy with respect to a different reward function $r^{(i+1)} = \left(\mathcal{U}_{\epsilon_{\text{tr}}}^{\rho^{(i)}}\right) r^{(i)}$, where reward updates are computed as per [Equation 2](https://arxiv.org/html/2605.11020#S2.E2) and $\epsilon_{\text{tr}} \leq \epsilon$. This concretely solves the problem of finding an optimal policy in the inner loop of IRL since finding a trust-region optimal policy is easier and only requires a local search. Further, the rewards learnt using [Equation 2](https://arxiv.org/html/2605.11020#S2.E2) are guaranteed to improve the distribution matching objective’s dual. Maximizing this reward function guarantees that the updated policy’s occupancy is closer to the expert’s occupancy than the occupancy of the policy used to learn this reward (in terms of reverse KL divergence).
###### Lemma 3\.1\.
The trust-region optimal maximum causal entropy policy for reward function $r(\mathbf{s},\mathbf{a})$ corresponds to the optimal maximum causal entropy policy for a reward function that takes the form $r_{\eta}(\mathbf{s},\mathbf{a}) = \tfrac{1}{(1+\eta)} r(\mathbf{s},\mathbf{a}) + \tfrac{\eta}{(1+\eta)} \log \pi^{(i)}(\mathbf{a}|\mathbf{s})$, where $\eta$ is the Lagrangian multiplier corresponding to the trust-region constraint.
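A brief sketch of why a reward of this form appears, under the simplifying assumption of a single state with $\gamma = 0$ (so that $Q = r$). This is an illustration only; the general proof is the one in Appendix A. The multiplier $\lambda$ for the normalization constraint and the name $\mathcal{J}$ are local to this sketch.

```latex
\begin{aligned}
\mathcal{J}(\pi) &= \sum_{\mathbf{a}} \pi(\mathbf{a})\, r(\mathbf{a})
  - \sum_{\mathbf{a}} \pi(\mathbf{a}) \log \pi(\mathbf{a})
  - \eta \Big( \sum_{\mathbf{a}} \pi(\mathbf{a}) \log \frac{\pi(\mathbf{a})}{\pi^{(i)}(\mathbf{a})} - \zeta \Big)
  - \lambda \Big( \sum_{\mathbf{a}} \pi(\mathbf{a}) - 1 \Big), \\
0 &= \frac{\partial \mathcal{J}}{\partial \pi(\mathbf{a})}
  = r(\mathbf{a}) - (1+\eta)\big(\log \pi(\mathbf{a}) + 1\big) + \eta \log \pi^{(i)}(\mathbf{a}) - \lambda, \\
\pi(\mathbf{a}) &\propto \exp\Big( \tfrac{1}{1+\eta}\, r(\mathbf{a}) + \tfrac{\eta}{1+\eta} \log \pi^{(i)}(\mathbf{a}) \Big)
  = \exp\big( r_{\eta}(\mathbf{a}) \big).
\end{aligned}
```

In this special case the MCE-optimal policy for any reward is its softmax, so the trust-region solution is exactly the MCE-optimal policy for $r_\eta$.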
###### Theorem 3\.2\.
Let $\pi_{\text{tr}}^{(i+1)}(\mathbf{a}|\mathbf{s})$ denote a trust-region optimal policy for a reward function $\tilde{r}^{(i+1)} = \left(\mathcal{U}_{\epsilon}^{\rho^{(i)}}\right) r^{(i)}$ with stepsize $\epsilon$. There exists a positive stepsize $\epsilon_{\text{tr}} \leq \epsilon$, such that $\pi_{\text{tr}}^{(i+1)}(\mathbf{a}|\mathbf{s})$ is an optimal maximum causal entropy policy with respect to the reward function $r^{(i+1)} = \left(\mathcal{U}_{\epsilon_{\text{tr}}}^{\rho^{(i)}}\right) r^{(i)}$.
Proofs for[Lemma3\.1](https://arxiv.org/html/2605.11020#S3.Thmtheorem1)and[Theorem3\.2](https://arxiv.org/html/2605.11020#S3.Thmtheorem2)are in[AppendixA](https://arxiv.org/html/2605.11020#A1)\. Assuming we know an optimal policyπ\(i\)\(𝐚\|𝐬\)\\pi^\{\(i\)\}\(\\mathbf\{a\}\|\\mathbf\{s\}\)for the reward functionr\(i\)\(𝐬,𝐚\)r^\{\(i\)\}\(\\mathbf\{s\},\\mathbf\{a\}\),[Theorem3\.2](https://arxiv.org/html/2605.11020#S3.Thmtheorem2)shows thatπtron\(𝒰ϵρ\(i\)\)r\(i\)≡πMCEon\(𝒰ϵtrρ\(i\)\)r\(i\)\\pi\_\{\\text\{tr\}\}\\text\{ on \}\\left\(\\mathcal\{U\}\_\{\\epsilon\}^\{\\rho^\{\(i\)\}\}\\right\)r^\{\(i\)\}\\equiv\\pi\_\{\\text\{MCE\}\}\\text\{ on \}\\left\(\\mathcal\{U\}\_\{\\epsilon\_\{\\text\{tr\}\}\}^\{\\rho^\{\(i\)\}\}\\right\)r^\{\(i\)\}\. Hereϵtr=ϵ/\(1\+η\)\\epsilon\_\{\\text\{tr\}\}=\\nicefrac\{\{\\epsilon\}\}\{\{\\left\(1\+\\eta\\right\)\}\}is a step size smaller than the initial step size on which the trust\-region optimal policy was computed, andη≥0\\eta\\geq 0is the Lagrangian multiplier associated with the trust region constraint in[Equation3](https://arxiv.org/html/2605.11020#S3.E3)\.[Theorem3\.2](https://arxiv.org/html/2605.11020#S3.Thmtheorem2)can be used to construct an IRL procedure where we first use a large step size to get the reward functionr~\(i\+1\)\\tilde\{r\}^\{\(i\+1\)\}, then compute a trust region optimal policy using this, and subsequently correct the updated reward function using a new step sizeϵtr\\epsilon\_\{\\text\{tr\}\}\. The corrected reward function is such that the previously computed trust region optimal policy is MCE\-optimal on it\.[Figure1](https://arxiv.org/html/2605.11020#S1.F1)illustrates this procedure and compares it with maximum causal entropy IRL\. 
Starting from constant rewards and a uniform policy ($r^{(0)}$; $\pi^{(0)}$), this subroutine can be iterated to monotonically improve on the dual objective solely based on trust-region updates on the policy. [Algorithm 1](https://arxiv.org/html/2605.11020#alg1) shows an overview of Trust Region Inverse Reinforcement Learning (TRIRL) and [Figure 2](https://arxiv.org/html/2605.11020#S3.F2) demonstrates it in a discrete setting.
Algorithm 1: Trust Region Inverse RL

1. Initialize: $\epsilon$; $r^{(0)}=0.0$; $\pi^{(0)}=\text{unif.}$
2. Output: $r^{\star}$ and $\pi^{\star}$
3. repeat
4. roll out $\pi^{(i)}$; learn $D^{(i)}\approx\log\left(\rho_{E}/\rho_{\pi^{(i)}}\right)$
5. $\tilde{r}^{(i+1)}=(1-\epsilon)r^{(i)}+\epsilon\beta D^{(i)}$ ([Equation 2](https://arxiv.org/html/2605.11020#S2.E2))
6. $\pi_{\text{tr}}^{(i+1)}$ & $\eta^{(i+1)}\leftarrow$ trust region policy update
7. $\epsilon_{\text{tr}}=\epsilon/\left(1+\eta^{(i+1)}\right)$
8. $r^{(i+1)}=(1-\epsilon_{\text{tr}})r^{(i)}+\epsilon_{\text{tr}}\beta D^{(i)}$
9. $\pi_{\text{tr}}\text{ on }\tilde{r}^{(i+1)}\equiv\pi_{\text{MCE}}\text{ on }r^{(i+1)}$
10. $r^{(i)}\leftarrow r^{(i+1)}$; $\pi^{(i)}\leftarrow\pi_{\text{tr}}^{(i+1)}$
11. until converged
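Lines 5 to 9 of Algorithm 1 can be exercised numerically in a toy setting. The following is a minimal sketch, not the authors' implementation, under assumptions of ours: a single state (a bandit), a KL penalty weighted by $\beta\eta$ with $\eta$ supplied by hand instead of being returned by a trust-region solver, and made-up log density ratios. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def mce_policy(reward, beta):
    """Single-state MaxEnt-optimal policy: pi(a) proportional to exp(r(a) / beta)."""
    return softmax([r / beta for r in reward])


def reward_update(reward, log_ratio, beta, step):
    """Update used in lines 5 and 8: (1 - step) * r + step * beta * D."""
    return [(1.0 - step) * r + step * beta * d for r, d in zip(reward, log_ratio)]


def penalized_tr_policy(reward_tilde, pi_old, beta, eta):
    """Maximizer of E[r~] + beta * H(pi) - beta * eta * KL(pi || pi_old).

    Assumption: one state and a KL penalty weighted by beta * eta, which
    gives log pi = (r~ / beta + eta * log pi_old) / (1 + eta) + const.
    """
    logits = [
        (rt / beta + eta * math.log(p)) / (1.0 + eta)
        for rt, p in zip(reward_tilde, pi_old)
    ]
    return softmax(logits)


def trirl_step(reward, log_ratio, beta, eps, eta):
    """One pass through lines 5-8 for a single state and a given multiplier."""
    pi_old = mce_policy(reward, beta)
    r_tilde = reward_update(reward, log_ratio, beta, eps)     # line 5
    pi_tr = penalized_tr_policy(r_tilde, pi_old, beta, eta)   # line 6 (eta given)
    eps_tr = eps / (1.0 + eta)                                # line 7
    r_next = reward_update(reward, log_ratio, beta, eps_tr)   # line 8
    return pi_tr, r_next, eps_tr


if __name__ == "__main__":
    r = [0.0, 0.5, -1.0]
    D = [0.3, -0.7, 1.2]  # made-up log density ratios, for illustration only
    pi_tr, r_next, eps_tr = trirl_step(r, D, beta=0.5, eps=0.8, eta=3.0)
    # Line 9: pi_tr should coincide with the MaxEnt policy on the corrected reward.
    gap = max(abs(a - b) for a, b in zip(pi_tr, mce_policy(r_next, 0.5)))
    print(f"eps_tr = {eps_tr:.3f}, max |pi_tr - pi_MCE| = {gap:.2e}")
```

The check only covers the identity on line 9 for one state; it says nothing about convergence of the full loop or about the multi-state case, where the policy depends on soft values and not only on immediate rewards.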
## 4Practical Considerations
In this section, we discuss practical considerations needed for applying TRIRL to real\-world problems\. These involve estimating the log density ratio used in the reward update \([Equation2](https://arxiv.org/html/2605.11020#S2.E2)\), realizing function\-space updates on parametric reward functions, and performing the trust\-region policy updates\. Furthermore, we discuss important properties, namely the ability to learn from observations and to re\-optimize the reward function after changes to the system dynamics\.
### 4\.1Density Ratio Estimation
In continuous state spaces, the log density ratio $\log\left(\rho_{E}(\mathbf{s},\mathbf{a})/\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})\right)$ cannot be computed analytically and must be approximated using a neural network. Following previous works, we train a binary classifier $D$ to distinguish between expert and agent samples. Given infinite data and an optimally trained $D$, the log density ratio is equivalent to the logits of $D$ (Menon and Ong, [2016](https://arxiv.org/html/2605.11020#bib.bib45)). While adversarial methods like GAIL directly use $\log\sigma(D(\cdot))$ as a reward function, we use the logits to perform a function-space reward update. Consequently, unlike adversarial methods, we do not require a fine-tuned balance between the classifier and policy updates. Where adversarial IRL methods typically fail with a perfect discriminator (Diwan et al., [2025](https://arxiv.org/html/2605.11020#bib.bib22)), our method always benefits from having a highly accurate classifier. An accurate classifier further improves the practical accuracy of the monotonic improvement guarantees of [Equation 2](https://arxiv.org/html/2605.11020#S2.E2), wherein the reward is a global solution of the MCE-IRL Lagrangian and the policy is a global optimizer of this reward.
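The link between classifier logits and the log density ratio can be seen on a one-dimensional toy problem. The sketch below is illustrative only and rests on assumptions of ours: "expert" samples from $\mathcal{N}(1,1)$, "agent" samples from $\mathcal{N}(0,1)$, equally many of each, and a linear logistic classifier in place of a neural network. For these two densities the true log ratio is $x-0.5$, so the fitted slope and bias should land near $1$ and $-0.5$. The function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_1d(xs, ys, iters=20):
    """Fit logit(x) = w * x + b by Newton's method on the logistic loss."""
    w, b = 0.0, 0.0
    for _ in range(iters):
        gw = gb = hww = hwb = hbb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            gw += (p - y) * x          # gradient w.r.t. w
            gb += p - y                # gradient w.r.t. b
            s = p * (1.0 - p)
            hww += s * x * x           # entries of the 2x2 Hessian
            hwb += s * x
            hbb += s
        det = hww * hbb - hwb * hwb
        w -= (hbb * gw - hwb * gb) / det
        b -= (hww * gb - hwb * gw) / det
    return w, b


def sample_and_fit(n=4000, seed=0):
    """Draw n expert and n agent samples and return the fitted (slope, bias)."""
    rng = random.Random(seed)
    expert = [rng.gauss(1.0, 1.0) for _ in range(n)]  # label 1
    agent = [rng.gauss(0.0, 1.0) for _ in range(n)]   # label 0
    return fit_logistic_1d(expert + agent, [1.0] * n + [0.0] * n)


if __name__ == "__main__":
    w, b = sample_and_fit()
    print(f"fitted logit: {w:.3f} * x + {b:.3f}  (true log ratio: 1.0 * x - 0.5)")
```

With unequal numbers of expert and agent samples the logits are shifted by the log of the class ratio, which is why the sketch keeps the two sets the same size.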
### 4\.2Circular Buffer of Discriminators
An additional challenge arises from the fact that we perform updates in function space rather than parameter space\. While functional updates have been shown to be more efficient than parametric gradient descent\(Arenzet al\.,[2016](https://arxiv.org/html/2605.11020#bib.bib1)\), the recursive application of the update in[Equation2](https://arxiv.org/html/2605.11020#S2.E2)results in a reward function that explicitly depends on the entire history of trained discriminators\. Maintaining this history is computationally prohibitive, particularly regarding the memory footprint\. Furthermore, imposing inductive biases \(e\.g\., action invariance\) on the aggregate reward function would require enforcing them on every individual discriminator\.
To circumvent these limitations, one could project the reward function onto a parametric model at every iteration via a regression problem. However, such projections introduce approximation errors that may affect training stability. Instead, we propose a middle ground: we maintain a fixed buffer of the $k$ most recent discriminators along with a parametric reward function $R_{\text{fit}}^{(i-k)}$ that was fitted at iteration $i-k$. Due to the exponential decay of past coefficients caused by the recursive updates, the approximation errors of the fitted reward function have limited effect on the overall rewards, allowing us to impose structural priors on the parametric model without sacrificing the precision of the most recent functional updates.
We illustrate the discriminator buffer in[Figure3](https://arxiv.org/html/2605.11020#S4.F3)\. While this does entail added computational effort, these operations are highly parallelizable and are completed relatively quickly by leveraging jit\-compilation\. An evaluation of the effects on runtime and memory consumption can be found in[SectionB\.1](https://arxiv.org/html/2605.11020#A2.SS1)\.
Figure 3: An illustration of a discriminator buffer of size $k=2$. Given a fitted reward $R_{\text{fit}}^{(i-k)}$ and discriminators $\{D^{(i)}\}_{i-k}^{i}$, intermediate uncorrected rewards $\tilde{R}^{(i+1)}$ are computed by repeated application of line 8 in [Algorithm 1](https://arxiv.org/html/2605.11020#alg1). Then, a trust region policy $\pi^{(i+1)}$ is learnt, and the final corrected reward $R^{(i+1)}$ is computed using line 5 in [Algorithm 1](https://arxiv.org/html/2605.11020#alg1).
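The buffered reward evaluation and the decay of old coefficients can be sketched as follows. This is a simplified illustration under assumptions of ours, not the paper's implementation: discriminators and the fitted reward are plain Python callables, each buffer entry stores the step size it was applied with, and the refitting of $R_{\text{fit}}$ when an entry leaves the buffer is not modeled. The names `buffered_reward` and `unrolled_weights` are hypothetical.

```python
from collections import deque


def buffered_reward(x, r_fit, buffer, beta):
    """Replay r <- (1 - step) * r + step * beta * D(x) over the buffer, oldest first.

    buffer holds (discriminator, step_size) pairs; r_fit stands in for the
    parametric reward fitted k iterations ago.
    """
    r = r_fit(x)
    for disc, step in buffer:
        r = (1.0 - step) * r + step * beta * disc(x)
    return r


def unrolled_weights(steps):
    """Coefficient of R_fit and of each beta * D_j after unrolling the recursion."""
    w_fit = 1.0
    for step in steps:
        w_fit *= 1.0 - step
    w_disc = []
    for j, step in enumerate(steps):
        w = step
        for later in steps[j + 1:]:
            w *= 1.0 - later  # every later update shrinks older contributions
        w_disc.append(w)
    return w_fit, w_disc


if __name__ == "__main__":
    buffer = deque(maxlen=2)  # circular buffer of size k = 2
    for disc, step in [(lambda x: x, 0.5), (lambda x: 2.0 * x, 0.25), (lambda x: -x, 0.5)]:
        buffer.append((disc, step))  # the oldest entry is dropped automatically
    w_fit, w_disc = unrolled_weights([step for _, step in buffer])
    print("weight on R_fit:", w_fit, "weights on discriminators:", w_disc)
    print("reward at x = 3:", buffered_reward(3.0, lambda x: 1.0, buffer, beta=2.0))
```

The weight on $R_{\text{fit}}$ is the product of the factors $(1-\epsilon_{\text{tr}})$ of all buffered updates, which is the exponential decay the text relies on to keep fitting errors small.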
### 4\.3Trust\-Region Optimal Policy Updates
Notice that [Theorem 3.2](https://arxiv.org/html/2605.11020#S3.Thmtheorem2) only necessitates reward correction based on multipliers $\eta$ and a policy that is optimal w.r.t. the trust region corresponding to $\eta$. However, there is no additional constraint on the actual trust region bound to which $\eta$ corresponds. Hence, [Theorem 3.2](https://arxiv.org/html/2605.11020#S3.Thmtheorem2) can be applied naively by treating $\eta$ as a hyperparameter and enforcing the trust region constraint under expectation through an auxiliary loss $\mathcal{L}_{\textrm{tr}}\triangleq\eta\;\textrm{KL}\Big(\pi\,||\,\pi^{(i)}_{\text{tr}}\Big)$. We note that it is theoretically sound to optimize such an expected KL penalty (instead of a hard constraint).
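For Gaussian policies the auxiliary loss has a closed form. The snippet below is a minimal sketch under assumptions of ours: diagonal Gaussian policies described by per-state lists of means and standard deviations, and the expectation approximated by a batch average. How this term is combined with the PPO objective is not shown, and the function names are hypothetical.

```python
import math


def kl_diag_gauss(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal Gaussians given as lists of means and stds."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += math.log(sq / sp) + (sp * sp + (mp - mq) ** 2) / (2.0 * sq * sq) - 0.5
    return total


def tr_loss(eta, new_params, old_params):
    """eta times the batch-averaged KL(pi_new(.|s) || pi_old(.|s)).

    new_params and old_params are lists of (means, stds) pairs, one per state.
    """
    kls = [
        kl_diag_gauss(mu_n, std_n, mu_o, std_o)
        for (mu_n, std_n), (mu_o, std_o) in zip(new_params, old_params)
    ]
    return eta * sum(kls) / len(kls)


if __name__ == "__main__":
    old = [([0.0], [1.0]), ([0.0], [1.0])]
    new = [([1.0], [1.0]), ([0.0], [1.0])]  # the policy moved in the first state only
    print("TR loss:", tr_loss(2.0, new, old))
```

Because the penalty is an average, a large deviation in a few states can be offset by small deviations elsewhere, which is the sense in which the constraint holds only in expectation.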
However, specifying $\eta$ arbitrarily is difficult since its effect depends on the scale of the reward function. Instead, it is more convenient to explicitly specify a trust region and strictly enforce it for better training stability. To ensure such explicit trust region satisfaction, and an $\eta$ corresponding to the current optimization landscape, we propose to leverage differentiable trust-region projection layers (Otto et al., [2021](https://arxiv.org/html/2605.11020#bib.bib29), TRPL). TRPL parametrizes the policy as a state-dependent Gaussian distribution $\pi(\cdot|\mathbf{s})=\mathcal{N}(\mu_{\theta}(\mathbf{s}),\Sigma_{\phi})$ and exactly enforces the analytical reverse KL trust region by the application of a projection layer during policy optimization. The projection layer maps every violating policy prediction back into the trust region such that the projected policy parameters $\left(\tilde{\mu}_{\mathbf{s}},\tilde{\Sigma}\right)$ satisfy a trust-region constraint around the previous policy while simultaneously minimizing the distance to policy predictions $\left(\mu_{\theta}(\mathbf{s}),\Sigma_{\phi}\right)$. TRPL uses separate bounds for the mean and covariance and obtains Lagrangian multipliers $\eta_{\mu}(\mathbf{s})\geq 0$ and $\eta_{\Sigma}\geq 0$. However, as we require a single, scalar multiplier $\eta$, we compute the maximum over the state-dependent multipliers for mean and covariance, $\eta=\underset{\textrm{batch}}{\max}\{\eta_{\mu}(\mathbf{s}),\eta_{\Sigma}\}$, resulting in a more conservative step size.
While these design choices enable us to strictly enforce trust-region satisfaction, we note that this variant slightly deviates from our theory, in that the trust-region constraint is satisfied for every state individually, instead of being enforced in expectation as in [Equation 3](https://arxiv.org/html/2605.11020#S3.E3). For further details on TRPL, we refer to [Section C.3](https://arxiv.org/html/2605.11020#A3.SS3) and the original work (Otto et al., [2021](https://arxiv.org/html/2605.11020#bib.bib29)). Both variants, (i) treating $\eta$ as a penalty coefficient (TR loss version) and (ii) computing it using TRPL ($\max\,\eta$ version), use PPO for policy optimization.
We further evaluate a variant based on TRPL with a different update mechanism: instead of using the maximum $\eta$ over all states, we use state-dependent step sizes. If a larger $\eta$ is required to satisfy the trust region in a given state, this variant will result in smaller changes to the rewards of that state. In combination with the circular buffer of discriminators, we can even use this mechanism to adapt step sizes in hindsight. Here, we need to additionally keep track of past policies to recompute past step sizes based on the given state. We will refer to this variant as retrospective-$\eta$.
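The difference between the scalar and the state-dependent step sizes can be made concrete with a few lines of code. This is a hedged sketch: the multipliers are taken as given numbers instead of being produced by TRPL, and combining $\eta_{\mu}(\mathbf{s})$ with $\eta_{\Sigma}$ per state through a maximum is our assumption, since the text does not spell out that detail. The function names are hypothetical.

```python
def scalar_step_size(eps, eta_mu, eta_sigma):
    """max-eta variant: one step size from the largest multiplier in the batch."""
    eta = max(max(eta_mu), eta_sigma)
    return eps / (1.0 + eta)


def state_dependent_step_sizes(eps, eta_mu, eta_sigma):
    """State-dependent variant: eps_tr(s) = eps / (1 + eta(s)).

    Assumption: eta(s) = max(eta_mu(s), eta_sigma).
    """
    return [eps / (1.0 + max(e, eta_sigma)) for e in eta_mu]


if __name__ == "__main__":
    eta_mu = [0.0, 1.0, 3.0]  # made-up per-state mean multipliers
    eta_sigma = 0.5           # made-up covariance multiplier
    print("scalar eps_tr:", scalar_step_size(0.5, eta_mu, eta_sigma))
    print("per-state eps_tr:", state_dependent_step_sizes(0.5, eta_mu, eta_sigma))
```

Under this assumption the scalar step size equals the smallest of the per-state step sizes, so the $\max\,\eta$ variant is never less conservative than the state-dependent one.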
### 4\.4Learning from Observations & Transfer Learning
Finally, we highlight that TRIRL is a general framework and can accommodate a variety of discriminator architectures. TRIRL can be used in observation-based imitation settings, where we do not have access to the states and actions of the expert, but only to observed features. As a special case, it can be used for state-only observations, potentially using a discriminator based on state transitions, $D(\mathbf{s},\mathbf{s}^{\prime})$ (Torabi et al., [2018](https://arxiv.org/html/2605.11020#bib.bib46)). Further, TRIRL directly learns reward functions that can be globally optimized, enabling us to re-optimize the policy to adapt to changes in the system dynamics. This distinguishes it from AIRL (Fu et al., [2018](https://arxiv.org/html/2605.11020#bib.bib2)) and related methods, where the reward function is “entangled” with the log-density of the learnt policy (the advantage function). We demonstrate learning from observations through motion-capture imitation experiments on humanoid robots, and global, transferable reward learning in [Section 5.1](https://arxiv.org/html/2605.11020#S5.SS1).
Figure 4: Imitation learning results on Mujoco benchmarks and robotics tasks. † The G1 tasks use mocap demonstrations where only the expert’s observations are available.
## 5Experiments
Through our experiments and ablation studies, we aim to answer the following questions:
1. How does TRIRL compare to prominent prior works in complex imitation learning settings?
2. Is there any advantage to computing Lagrangian multipliers retrospectively? What is the impact of reward fitting, the TR loss, TRPL, and the discriminator buffer on performance?
3. Can TRIRL learn a global reward function? Is this reward also transferable?
Figure 5: An ablation study comparing the relative performance of TRIRL’s variants on Mujoco benchmarks.

We conduct experiments on continuous control tasks on the following Mujoco benchmarks: Half-Cheetah, Ant, Walker, Hopper, Humanoid; as well as more complex robotics settings: Unitree G1 Walking/Running, Unitree Go2 Locomotion. A PPO policy trained to convergence is used as the expert and 30 demonstration trajectories are collected per task. In the Unitree G1 environments, we use the LocoMujoco (Al-Hafez et al., [2023b](https://arxiv.org/html/2605.11020#bib.bib33)) motion capture dataset and train using a state-based discriminator $D(\mathbf{s},\mathbf{s}^{\prime})$. The motion capture datasets for walking and running contain approximately 35 and 9 trajectories, respectively (1000 step horizon). We compare TRIRL with the following prior works: GAIL (Ho and Ermon, [2016](https://arxiv.org/html/2605.11020#bib.bib6)), AIRL (Fu et al., [2018](https://arxiv.org/html/2605.11020#bib.bib2)), AMP (Peng et al., [2021](https://arxiv.org/html/2605.11020#bib.bib12)), LSIQ (Al-Hafez et al., [2023a](https://arxiv.org/html/2605.11020#bib.bib7)), NEAR (Diwan et al., [2025](https://arxiv.org/html/2605.11020#bib.bib22)), and SFM (Jain et al., [2025](https://arxiv.org/html/2605.11020#bib.bib51)). Together, these baselines enable a systematic comparison of our method against adversarial and non-adversarial approaches, state-based and state-only variants, as well as methods focused on reward recovery. All methods are tuned separately for each task and we report the mean performance across 20 independent seeds. Due to space constraints, we defer secondary results (ablations, runtime/memory comparisons, data scaling) to [Appendices B](https://arxiv.org/html/2605.11020#A2) and [B.1](https://arxiv.org/html/2605.11020#A2.SS1), and provide experimental details and baseline descriptions in [Appendix C](https://arxiv.org/html/2605.11020#A3).
[Figure 4](https://arxiv.org/html/2605.11020#S4.F4) shows that TRIRL matches or, in most cases, outperforms the baselines in all environments. Further, the TR loss version of TRIRL, which is theoretically sound, also often outperforms most baselines, although it performs poorly on harder robotics tasks like the G1 due to its weaker trust region enforcement.
Table 1: A comparison of normalized performance during training, retraining from scratch with the learnt reward, and retraining from scratch with the learnt reward on an environment with changed dynamics. For both baselines, we find that only a fraction of the seeds converged with any meaningful performance (notice the high variance). This is likely because of different training/retraining initialization, and their reward being only local. The global reward learnt by our method, however, ensures convergence much more reliably. † For NEAR, training and re-training performance are equivalent. Bold marks the best method per task and setting; (W) and (MG) denote the Windy and Mars Gravity transfer tasks.

| Task | Training: TRIRL | Training: AIRL | Training: NEAR | Retraining: TRIRL | Retraining: AIRL | Retraining: NEAR | Transfer: TRIRL | Transfer: AIRL | Transfer: NEAR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Point Maze | **1.03 ± 0.01** | 0.45 ± 0.12 | 0.28 ± 0.09 | **0.98 ± 0.01** | 0.35 ± 0.07 | 0.28 ± 0.09 | **0.96 ± 0.001** | 0.06 ± 0.64 | 0.29 ± 0.13 |
| Ant | **0.91 ± 0.17** | 0.59 ± 0.25 | 0.46 ± 0.29 | **0.63 ± 0.09** | 0.10 ± 0.13 | 0.46 ± 0.29 | **0.89 ± 0.12** | 0.42 ± 0.25 | 0.33 ± 0.18 |
| Half Cheetah | **0.83 ± 0.19** | 0.39 ± 0.14 | 0.09 ± 0.28 | **0.70 ± 0.24** | 0.08 ± 0.28 | 0.09 ± 0.28 | (W) **0.63 ± 0.29**; (MG) **0.30 ± 0.13** | (W) 0.16 ± 0.25; (MG) −0.10 ± 0.06 | (W) 0.10 ± 0.18; (MG) −0.06 ± 0.12 |
| Hopper | 0.49 ± 0.16 | **0.68 ± 0.11** | 0.22 ± 0.09 | **0.36 ± 0.13** | 0.12 ± 0.11 | 0.22 ± 0.09 | — | — | — |
Next, we evaluate the relative significance of the components of TRIRL by modifying its policy optimization and reward correction steps\. We carry out ablation experiments on the same Mujoco benchmarks and compare the following ablated configurations:
- (i) $\max\,\eta$
- (ii) TR loss
- (iii) retrospective $\eta$
- (iv) retrospective $\eta$ w/o reward fitting
- (v) w/o disc. buffer
- (vi) w/o TRPL & disc. buffer
- (vii) GAIL w/ TRPL

[Figure 5](https://arxiv.org/html/2605.11020#S5.F5) leads to several interesting observations. First, we find that in most cases, the $\max\,\eta$ variant of TRIRL outperforms all other variants. While this variant slightly deviates from our theory, the strict trust region enforcement of TRPL and the associated non-arbitrary Lagrangian multipliers indeed offer increased stability and performance. Moreover, we see that retrospectively computing state-dependent Lagrangian multipliers usually does not yield significant performance improvements. It turns out that taking a $\max$ over previously computed Lagrangian multipliers is typically conservative enough to ensure trust region satisfaction on a new rollout batch. In this ablation, we also examine how fitting the reward function onto a parametric model affects performance. As expected, such fitting has a small effect on imitation learning performance. Further, we observe that the TR loss variant of our method also performs well in Mujoco benchmarks. This version, being less computationally demanding, is a reasonable alternative in such simpler settings.
Finally, ablated configurations (v), (vi), and (vii) show the contribution of the discriminator buffer and the trust region projection layer. Configuration (v) removes reward correction by replacing TRIRL’s reward function with an uncorrected interpolation of a buffer of discriminators (with $\eta=0$). Configuration (vi) just boils down to GAIL with the same uncorrected rewards. Configuration (vii) is GAIL with TRPL for policy optimization. The poor performance of these ablated configurations shows that our method’s improvements are not simply rooted in TRPL (Otto et al., [2021](https://arxiv.org/html/2605.11020#bib.bib29)) or the buffer of discriminators. Instead, we attribute TRIRL’s improvements to the explicit optimization of the dual objective.
### 5\.1Global & Transferable Rewards
Here, we show that TRIRL learns a global reward function, i.e., one that can be re-optimized from scratch, starting from a new random agent initialization. [Figure 2](https://arxiv.org/html/2605.11020#S3.F2) demonstrates this in a discrete setting. In continuous settings, we rely on function approximation to learn rewards. In these environments, we demonstrate global reward learning by freezing a trained reward network and training a new policy on it from scratch with PPO. Crucially, this retraining is done on a new set of seeds to ensure that the agent is initialized differently than during training. The same setup is also used to evaluate the transferability of the learnt reward. Ideally, a global reward function captures the expert’s intrinsic motivations (e.g. moving forward), rather than rewarding the agent just for duplicating the specific state transitions executed by the expert. Such a reward should also transfer to a different environment with similar goals. We test this by re-optimizing the learnt reward on an environment with changed dynamics. Specifically, we use the Point Maze Flipped and Ant Disabled environments from Fu et al. ([2018](https://arxiv.org/html/2605.11020#bib.bib2)), where the maze is flipped and the dynamics of the Ant task are changed by shortening and disabling the agent’s front legs. We further add two new transfer tasks, Half Cheetah Windy (W) and Half Cheetah Mars Gravity (MG), where a constant $5$ m/s wind blows against the agent and the gravitational constant is changed to $3.73\,m/s^{2}$, respectively. We evaluate a feature-based variant of our method with a feature encoder, and a reward function that is linear in these features. [Table 1](https://arxiv.org/html/2605.11020#S5.T1) shows comparisons against AIRL (Fu et al., [2018](https://arxiv.org/html/2605.11020#bib.bib2)) and NEAR (Diwan et al., [2025](https://arxiv.org/html/2605.11020#bib.bib22)), both of which claim to learn a global reward function.
TRIRL outperforms the baselines in both retraining and transfer settings, while also achieving the best training performance on every task except Hopper, where AIRL trains better.
## 6Discussion and Outlook
We show that trust region policy updates (primal optimization), followed by reward (dual) correction, result in monotonic improvement of the distribution matching IRL objective. Given this, we present Trust Region Inverse Reinforcement Learning (TRIRL), a non-adversarial IRL method that achieves monotonic improvement of the reward function and policy, without having to solve a full RL problem at each iteration of IRL (as classical IRL methods do). Practically, our method offers stable training characteristics and is capable of recovering global reward functions that are robust to changes in system dynamics.
While our method outperforms strong baselines, there are a few limitations and failure modes to consider. First, we note that the theoretical guarantees discussed in our work do not perfectly translate to real-world settings with function approximation. Discriminator-based density ratio estimation can introduce approximation errors due to data and sampling limitations, and using a discriminator buffer further increases VRAM requirements, though its impact is minimal as shown in [Appendix B](https://arxiv.org/html/2605.11020#A2). We also clarify that, although our experiments use TRPL to learn trust-region policies, our method is not limited to Gaussian policies or to TRPL’s computationally expensive trust-region update. In practice, solving the MaxEnt RL problem in [Algorithm 1](https://arxiv.org/html/2605.11020#alg1) only requires a likelihood-based policy so that the entropy can be evaluated. Our theory ([Theorem 3.2](https://arxiv.org/html/2605.11020#S3.Thmtheorem2)) provides a general framework for learning MCE-optimal IRL policies, and we also present a trust-region loss variant of our method that often outperforms the baselines. Finally, the IRL problem is ill-posed due to temporal credit assignment: different reward functions may induce the same optimal policy in a given environment, but lead to different behavior under different dynamics (Ng et al., [1999](https://arxiv.org/html/2605.11020#bib.bib3)). Like prior work, our method is also subject to this limitation. However, in [Appendix B](https://arxiv.org/html/2605.11020#A2), we briefly highlight that this issue can be addressed by instilling inductive biases into the problem, which prior work such as (Ho and Ermon, [2016](https://arxiv.org/html/2605.11020#bib.bib6); Fu et al., [2018](https://arxiv.org/html/2605.11020#bib.bib2)) cannot accommodate because it requires access to specific expert states/actions.
In terms of failure modes, our method could fail to converge with very lax trust region bounds and take prohibitively long with very conservative ones. In the TR loss version, where $\eta$ is a hyperparameter, this trade-off is also not intuitive to tune. Future work on such trust-region-based IRL methods could focus on extending this procedure to general $f$-divergences and on studying the properties of function-space reward updates in detail.
## Software and Data
To aid reproducibility, the full algorithm and ablation studies are given in[AppendixB](https://arxiv.org/html/2605.11020#A2), hyperparameters are listed in[SectionC\.5](https://arxiv.org/html/2605.11020#A3.SS5), and specific implementation details are provided in[SectionsC\.2](https://arxiv.org/html/2605.11020#A3.SS2),[C\.3](https://arxiv.org/html/2605.11020#A3.SS3)and[C\.4](https://arxiv.org/html/2605.11020#A3.SS4)\. Code and datasets will be open\-sourced by the camera\-ready date\.
## Acknowledgements
This project has been supported by a hardware donation by NVIDIA through the Academic Grant Program\. Calculations for this research were conducted on the Lichtenberg high\-performance computer of the TU Darmstadt\. Further, this work has been partially supported by the German Federal Ministry of Research, Technology and Space \(BMFTR\) under the Robotics Institute Germany \(RIG\)\.
## Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.
## References
- J\. Achiam, D\. Held, A\. Tamar, and P\. Abbeel \(2017\)Constrained policy optimization\.InInternational Conference on Machine Learning,pp\. 22–31\.Cited by:[Fact A\.3](https://arxiv.org/html/2605.11020#A1.Thmtheorem3.p1.2)\.
- F\. Al\-Hafez, D\. Tateo, O\. Arenz, G\. Zhao, and J\. Peters \(2023a\)LS\-iq: implicit reward regularization for inverse reinforcement learning\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[Appendix B](https://arxiv.org/html/2605.11020#A2.p1.2),[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px4.p1.5),[§C\.4](https://arxiv.org/html/2605.11020#A3.SS4.p1.4),[§C\.4](https://arxiv.org/html/2605.11020#A3.SS4.p2.2),[§C\.5](https://arxiv.org/html/2605.11020#A3.SS5.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6),[§1](https://arxiv.org/html/2605.11020#S1.p5.6),[§5](https://arxiv.org/html/2605.11020#S5.p2.1)\.
- F\. Al\-Hafez, G\. Zhao, J\. Peters, and D\. Tateo \(2023b\)LocoMuJoCo: a comprehensive imitation learning benchmark for locomotion\.In6th Robot Learning Workshop, NeurIPS,Cited by:[§5](https://arxiv.org/html/2605.11020#S5.p2.1)\.
- O\. Arenz, H\. Abdulsamad, and G\. Neumann \(2016\)Optimal control and inverse optimal control by distribution matching\.In2016 IEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\),pp\. 4046–4053\.Note:[PDF](https://www.ias.informatik.tu-darmstadt.de/uploads/Team/OlegArenz/OC%20and%20IOC%20By%20Matching%20Distributions_withSupplements.pdf)Cited by:[Appendix A](https://arxiv.org/html/2605.11020#A1.1.p1.1),[Appendix A](https://arxiv.org/html/2605.11020#A1.2.p2.10),[Appendix A](https://arxiv.org/html/2605.11020#A1.2.p2.13),[Appendix A](https://arxiv.org/html/2605.11020#A1.2.p2.8),[§1](https://arxiv.org/html/2605.11020#S1.p5.6),[§2](https://arxiv.org/html/2605.11020#S2.SS0.SSS0.Px3.p1.12),[§2](https://arxiv.org/html/2605.11020#S2.SS0.SSS0.Px3.p1.5),[§4\.2](https://arxiv.org/html/2605.11020#S4.SS2.p1.1),[footnote 1](https://arxiv.org/html/2605.11020#footnote1)\.
- O\. Arenz and G\. Neumann \(2020\)Non\-adversarial imitation learning and its connections to adversarial methods\.arXiv preprint arXiv:2008\.03525\.Cited by:[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6),[§2](https://arxiv.org/html/2605.11020#S2.SS0.SSS0.Px3.p1.12)\.
- M\. Blondel, Q\. Berthet, M\. Cuturi, R\. Frostig, S\. Hoyer, F\. Llinares\-Lopez, F\. Pedregosa, and J\. Vert \(2022\)Efficient and modular implicit differentiation\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 5230–5242\.Cited by:[§C\.3](https://arxiv.org/html/2605.11020#A3.SS3.p1.12)\.
- N\. Bohlinger and K\. Dorer \(2023\)RL\-x: a deep reinforcement learning library \(not only\) for robocup\.InRobot World Cup,pp\. 228–239\.Cited by:[§C\.2](https://arxiv.org/html/2605.11020#A3.SS2.p2.1)\.
- A\. Boularias, J\. Kober, and J\. Peters \(2011\)Relative entropy inverse reinforcement learning\.InProceedings of the fourteenth international conference on artificial intelligence and statistics,pp\. 182–189\.Cited by:[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6)\.
- J\. Bradbury, R\. Frostig, P\. Hawkins, M\. J\. Johnson, and C\. Leary \(2022\)JAX: autograd and xla\.\.\.External Links:[Link](https://github.com/google/jax)Cited by:[§C\.2](https://arxiv.org/html/2605.11020#A3.SS2.p2.1)\.
- A\. A\. Diwan, J\. Urain, J\. Kober, and J\. Peters \(2025\)Noise\-conditioned energy\-based annealed rewards \(near\): a generative framework for imitation learning from observation\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[Appendix B](https://arxiv.org/html/2605.11020#A2.p1.2),[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px5.p1.3),[§C\.4](https://arxiv.org/html/2605.11020#A3.SS4.p2.2),[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6),[§1](https://arxiv.org/html/2605.11020#S1.p5.6),[§4\.1](https://arxiv.org/html/2605.11020#S4.SS1.p1.5),[§5\.1](https://arxiv.org/html/2605.11020#S5.SS1.p1.2),[§5](https://arxiv.org/html/2605.11020#S5.p2.1)\.
- B\. Eysenbach and S\. Levine \(2022\)Maximum entropy RL \(provably\) solves some robust RL problems\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.11020#S2.SS0.SSS0.Px2.p1.10)\.
- J\. Fu, K\. Luo, and S\. Levine \(2018\)Learning robust rewards with adverserial inverse reinforcement learning\.InInternational Conference on Learning Representations,Cited by:[Appendix B](https://arxiv.org/html/2605.11020#A2.SS0.SSS0.Px2.p1.1),[Appendix B](https://arxiv.org/html/2605.11020#A2.p1.2),[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px2.p1.3),[§C\.4](https://arxiv.org/html/2605.11020#A3.SS4.p1.4),[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6),[§1](https://arxiv.org/html/2605.11020#S1.p5.6),[§2](https://arxiv.org/html/2605.11020#S2.SS0.SSS0.Px3.p1.12),[§4\.4](https://arxiv.org/html/2605.11020#S4.SS4.p1.1),[§5\.1](https://arxiv.org/html/2605.11020#S5.SS1.p1.2),[§5](https://arxiv.org/html/2605.11020#S5.p2.1),[§6](https://arxiv.org/html/2605.11020#S6.p2.2)\.
- D\. Garg, S\. Chakraborty, C\. Cundy, J\. Song, and S\. Ermon \(2021\)Iq\-learn: inverse soft\-q learning for imitation\.Advances in Neural Information Processing Systems34,pp\. 4028–4039\.Cited by:[Appendix B](https://arxiv.org/html/2605.11020#A2.SS0.SSS0.Px2.p1.1),[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px4.p1.2),[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px4.p1.5),[§C\.4](https://arxiv.org/html/2605.11020#A3.SS4.p1.4),[§C\.5](https://arxiv.org/html/2605.11020#A3.SS5.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6),[§2](https://arxiv.org/html/2605.11020#S2.SS0.SSS0.Px3.p1.12)\.
- M\. Geist, B\. Scherrer, and O\. Pietquin \(2019\)A theory of regularized markov decision processes\.InInternational Conference on Machine Learning,pp\. 2160–2169\.Cited by:[Fact A\.3](https://arxiv.org/html/2605.11020#A1.Thmtheorem3.p1.2)\.
- S\. K\. S\. Ghasemipour, R\. Zemel, and S\. Gu \(2020\)A divergence minimization perspective on imitation learning methods\.InConference on Robot Learning,pp\. 1259–1277\.Cited by:[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6),[§1](https://arxiv.org/html/2605.11020#S1.p2.1),[§2](https://arxiv.org/html/2605.11020#S2.SS0.SSS0.Px3.p1.12)\.
- T\. Haarnoja, A\. Zhou, P\. Abbeel, and S\. Levine \(2018\)Soft actor\-critic: off\-policy maximum entropy deep reinforcement learning with a stochastic actor\.InInternational conference on machine learning,pp\. 1861–1870\.Cited by:[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px4.p1.5),[§C\.5](https://arxiv.org/html/2605.11020#A3.SS5.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6),[§2](https://arxiv.org/html/2605.11020#S2.SS0.SSS0.Px1.p1.22)\.
- J\. Heek, A\. Levskaya, A\. Oliver, M\. Ritter, B\. Rondepierre, A\. Steiner, and M\. van Zee \(2024\)Flax: a neural network library and ecosystem for jax\.\.\.External Links:[Link](http://github.com/google/flax)Cited by:[§C\.2](https://arxiv.org/html/2605.11020#A3.SS2.p2.1)\.
- J\. Ho and S\. Ermon \(2016\)Generative adversarial imitation learning\.Advances in neural information processing systems29\.Cited by:[Fact A\.1](https://arxiv.org/html/2605.11020#A1.Thmtheorem1.p1.3),[Fact A\.2](https://arxiv.org/html/2605.11020#A1.Thmtheorem2.p1.2),[Appendix B](https://arxiv.org/html/2605.11020#A2.p1.2),[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px1.p1.2),[§C\.4](https://arxiv.org/html/2605.11020#A3.SS4.p1.4),[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6),[§1](https://arxiv.org/html/2605.11020#S1.p2.1),[§1](https://arxiv.org/html/2605.11020#S1.p3.4),[§1](https://arxiv.org/html/2605.11020#S1.p5.6),[§3](https://arxiv.org/html/2605.11020#S3.p1.3),[§5](https://arxiv.org/html/2605.11020#S5.p2.1),[§6](https://arxiv.org/html/2605.11020#S6.p2.2),[footnote 1](https://arxiv.org/html/2605.11020#footnote1)\.
- A\. K\. Jain, H\. Wiltzer, J\. Farebrother, I\. Rish, G\. Berseth, and S\. Choudhury \(2025\)Non\-adversarial inverse reinforcement learning via successor feature matching\.InThe Thirteenth International Conference on Learning Representations,Cited by:[Table 2](https://arxiv.org/html/2605.11020#A2.T2),[Table 2](https://arxiv.org/html/2605.11020#A2.T2.19.2),[Appendix B](https://arxiv.org/html/2605.11020#A2.p1.2),[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px6.p1.5),[§1](https://arxiv.org/html/2605.11020#S1.p5.6),[§5](https://arxiv.org/html/2605.11020#S5.p2.1)\.
- E\. T\. Jaynes \(1957\)Information theory and statistical mechanics\.Physical review106\(4\),pp\. 620\.Cited by:[§2](https://arxiv.org/html/2605.11020#S2.SS0.SSS0.Px2.p1.10)\.
- I\. Kostrikov, K\. K\. Agrawal, D\. Dwibedi, S\. Levine, and J\. Tompson \(2019\)Discriminator\-actor\-critic: addressing sample inefficiency and reward bias in adversarial imitation learning\.InInternational Conference on Learning Representations,Cited by:[§C\.4](https://arxiv.org/html/2605.11020#A3.SS4.p1.4),[§C\.4](https://arxiv.org/html/2605.11020#A3.SS4.p2.1),[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6),[§1](https://arxiv.org/html/2605.11020#S1.p2.1)\.
- I\. Kostrikov, O\. Nachum, and J\. Tompson \(2020\)Imitation learning via off\-policy distribution matching\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6),[§2](https://arxiv.org/html/2605.11020#S2.SS0.SSS0.Px3.p1.12)\.
- Z\. Li, T\. Chen, Z\. Hong, A\. Ajay, and P\. Agrawal \(2023\)Parallel Q\-learning: scaling off\-policy reinforcement learning under massively parallel simulation\.InInternational Conference on Machine Learning,pp\. 19440–19459\.Cited by:[§C\.5](https://arxiv.org/html/2605.11020#A3.SS5.SSS0.Px1.p1.1)\.
- D\. C\. Liu and J\. Nocedal \(1989\)On the limited memory bfgs method for large scale optimization\.Mathematical programming45\(1\),pp\. 503–528\.Cited by:[§C\.3](https://arxiv.org/html/2605.11020#A3.SS3.p1.12)\.
- X\. Mao, Q\. Li, H\. Xie, R\. Y\. Lau, Z\. Wang, and S\. Paul Smolley \(2017\)Least squares generative adversarial networks\.InProceedings of the IEEE international conference on computer vision,pp\. 2794–2802\.Cited by:[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px3.p1.3),[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6)\.
- A\. Menon and C\. S\. Ong \(2016\)Linking losses for density ratio and class\-probability estimation\.InProceedings of The 33rd International Conference on Machine Learning,M\. F\. Balcan and K\. Q\. Weinberger \(Eds\.\),Proceedings of Machine Learning Research, Vol\.48,pp\. 304–313\.Cited by:[§4\.1](https://arxiv.org/html/2605.11020#S4.SS1.p1.5)\.
- G\. Neu, A\. Jonsson, and V\. Gómez \(2017\)A unified view of entropy\-regularized markov decision processes\.arXiv preprint arXiv:1705\.07798\.Cited by:[Fact A\.2](https://arxiv.org/html/2605.11020#A1.Thmtheorem2.p1.2),[Fact A\.3](https://arxiv.org/html/2605.11020#A1.Thmtheorem3.p1.2)\.
- A\. Y\. Ng, D\. Harada, and S\. J\. Russell \(1999\)Policy invariance under reward transformations: theory and application to reward shaping\.InProceedings of the Sixteenth International Conference on Machine Learning,pp\. 278–287\.Cited by:[Appendix A](https://arxiv.org/html/2605.11020#A1.Ex47.2.1),[Appendix A](https://arxiv.org/html/2605.11020#A1.Ex47.4.3.2.1),[§6](https://arxiv.org/html/2605.11020#S6.p2.2)\.
- A\. Y\. Ng and S\. J\. Russell \(2000\)Algorithms for inverse reinforcement learning\.InProceedings of the Seventeenth International Conference on Machine Learning,pp\. 663–670\.Cited by:[§1](https://arxiv.org/html/2605.11020#S1.p1.1)\.
- M\. Orsini, A\. Raichuk, L\. Hussenot, D\. Vincent, R\. Dadashi, S\. Girgin, M\. Geist, O\. Bachem, O\. Pietquin, and M\. Andrychowicz \(2021\)What matters for adversarial imitation learning?\.Advances in Neural Information Processing Systems34,pp\. 14656–14668\.Cited by:[Table 5](https://arxiv.org/html/2605.11020#A2.T5),[Table 5](https://arxiv.org/html/2605.11020#A2.T5.4.2),[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6)\.
- T\. Osa, J\. Pajarinen, G\. Neumann, J\. A\. Bagnell, P\. Abbeel, and J\. Peters \(2018\)An algorithmic perspective on imitation learning\.Foundations and Trends® in Robotics7\(1\-2\),pp\. 1–179\.Cited by:[§1](https://arxiv.org/html/2605.11020#S1.p1.1)\.
- F\. Otto, P\. Becker, V\. A\. Ngo, H\. C\. M\. Ziesche, and G\. Neumann \(2021\)Differentiable trust region layers for deep reinforcement learning\.InInternational Conference on Learning Representations,Cited by:[§C\.3](https://arxiv.org/html/2605.11020#A3.SS3.p1.10),[§C\.3](https://arxiv.org/html/2605.11020#A3.SS3.p1.12),[§C\.3](https://arxiv.org/html/2605.11020#A3.SS3.p1.2),[Table 7](https://arxiv.org/html/2605.11020#A3.T7.2.5.1),[§4\.3](https://arxiv.org/html/2605.11020#S4.SS3.p2.11),[§5](https://arxiv.org/html/2605.11020#S5.p4.3)\.
- D\. Palenicek, F\. Vogt, J\. Watson, I\. Posner, and J\. Peters \(2026\)XQC: well\-conditioned optimization accelerates deep reinforcement learning\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§C\.5](https://arxiv.org/html/2605.11020#A3.SS5.SSS0.Px1.p1.1)\.
- X\. B\. Peng, A\. Kanazawa, S\. Toyer, P\. Abbeel, and S\. Levine \(2018\)Variational discriminator bottleneck: improving imitation learning, inverse rl, and gans by constraining information flow\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6),[§1](https://arxiv.org/html/2605.11020#S1.p2.1)\.
- X\. B\. Peng, Z\. Ma, P\. Abbeel, S\. Levine, and A\. Kanazawa \(2021\)Amp: adversarial motion priors for stylized physics\-based character control\.ACM Transactions on Graphics \(ToG\)40\(4\),pp\. 1–20\.Cited by:[Appendix B](https://arxiv.org/html/2605.11020#A2.p1.2),[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px3.p1.3),[§C\.4](https://arxiv.org/html/2605.11020#A3.SS4.p2.1),[§1](https://arxiv.org/html/2605.11020#S1.p2.1),[§1](https://arxiv.org/html/2605.11020#S1.p5.6),[§5](https://arxiv.org/html/2605.11020#S5.p2.1)\.
- D\. A\. Pomerleau \(1991\)Efficient training of artificial neural networks for autonomous navigation\.Neural computation3\(1\),pp\. 88–97\.Cited by:[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6)\.
- S\. Reddy, A\. D\. Dragan, and S\. Levine \(2019\)SQIL: imitation learning via reinforcement learning with sparse rewards\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6)\.
- S\. Russell \(1998\)Learning agents for uncertain environments\.InProceedings of the eleventh annual conference on Computational learning theory,pp\. 101–103\.Cited by:[§1](https://arxiv.org/html/2605.11020#S1.p1.1)\.
- J\. Schulman, S\. Levine, P\. Abbeel, M\. Jordan, and P\. Moritz \(2015\)Trust region policy optimization\.InInternational conference on machine learning,pp\. 1889–1897\.Cited by:[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px1.p1.2)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px1.p1.2),[§C\.5](https://arxiv.org/html/2605.11020#A3.SS5.SSS0.Px1.p1.1)\.
- Y\. Song and S\. Ermon \(2019\)Generative modeling by estimating gradients of the data distribution\.Advances in neural information processing systems32\.Cited by:[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px5.p1.3)\.
- U\. Syed, M\. Bowling, and R\. E\. Schapire \(2008\)Apprenticeship learning using linear programming\.InProceedings of the 25th international conference on Machine learning,pp\. 1032–1039\.Cited by:[Fact A\.1](https://arxiv.org/html/2605.11020#A1.Thmtheorem1.p1.3)\.
- E\. Todorov, T\. Erez, and Y\. Tassa \(2012\)MuJoCo: a physics engine for model\-based control\.In2012 IEEE/RSJ International Conference on Intelligent Robots and Systems,pp\. 5026–5033\.External Links:[Document](https://dx.doi.org/10.1109/IROS.2012.6386109)Cited by:[§C\.2](https://arxiv.org/html/2605.11020#A3.SS2.p2.1)\.
- F\. Torabi, G\. Warnell, and P\. Stone \(2018\)Generative adversarial imitation from observation\.arXiv preprint arXiv:1807\.06158\.Cited by:[§4\.4](https://arxiv.org/html/2605.11020#S4.SS4.p1.1)\.
- B\. D\. Ziebart, J\. A\. Bagnell, and A\. K\. Dey \(2010\)Modeling interaction via the principle of maximum causal entropy\.\.\.Cited by:[Appendix A](https://arxiv.org/html/2605.11020#A1.2.p2.8),[§C\.1](https://arxiv.org/html/2605.11020#A3.SS1.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6),[§1](https://arxiv.org/html/2605.11020#S1.p2.1),[§1](https://arxiv.org/html/2605.11020#S1.p3.4),[§2](https://arxiv.org/html/2605.11020#S2.SS0.SSS0.Px1.p1.22),[§2](https://arxiv.org/html/2605.11020#S2.SS0.SSS0.Px2.p1.10),[footnote 1](https://arxiv.org/html/2605.11020#footnote1)\.
- B\. D\. Ziebart, A\. L\. Maas, J\. A\. Bagnell, A\. K\. Dey,et al\.\(2008\)Maximum entropy inverse reinforcement learning\.InAAAI,Vol\.8,pp\. 1433–1438\.Cited by:[§1](https://arxiv.org/html/2605.11020#S1.SS0.SSS0.Px1.p1.6),[§1](https://arxiv.org/html/2605.11020#S1.p2.1)\.
## Appendix A Proofs
We first list some fundamental properties of our optimization problem that are necessary for our analysis\.
###### Fact A.1 (Policy-Occupancy Relationship).
There is a one-to-one correspondence between a policy $\pi(\mathbf{a}|\mathbf{s})$ and its state-action occupancy $\rho_{\pi}(\mathbf{s},\mathbf{a})$, wherein $\rho_{\pi}(\mathbf{s},\mathbf{a})=\pi(\mathbf{a}|\mathbf{s})\,\rho_{\pi}(\mathbf{s})$. Proof in (Ho and Ermon, [2016](https://arxiv.org/html/2605.11020#bib.bib6), Lemmas 3.2, 3.3) and (Syed et al., [2008](https://arxiv.org/html/2605.11020#bib.bib50), Theorem 2).
###### Fact A.2 (Entropy Concavity).
The expected entropy $\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[H(\pi(\mathbf{a}|\mathbf{s}))\right]$ is concave in $\rho_{\pi}(\mathbf{s},\mathbf{a})$. Proof in (Ho and Ermon, [2016](https://arxiv.org/html/2605.11020#bib.bib6); Neu et al., [2017](https://arxiv.org/html/2605.11020#bib.bib49), Lemma 3.1, Appendix A.1).
###### Fact A.3 (Strong Duality).
The optimization problem in [Equation 3](https://arxiv.org/html/2605.11020#S3.E3) has strong duality. This follows from [Fact A.2](https://arxiv.org/html/2605.11020#A1.Thmtheorem2) and the strong duality of entropy-regularized RL (Achiam et al., [2017](https://arxiv.org/html/2605.11020#bib.bib48); Neu et al., [2017](https://arxiv.org/html/2605.11020#bib.bib49); Geist et al., [2019](https://arxiv.org/html/2605.11020#bib.bib47)). The expected entropy term is concave, the reward term is linear, the KL constraint is convex, and the Bellman flow constraints are linear (all in $\rho_{\pi}(\mathbf{s},\mathbf{a})$). The problem is therefore convex, and the current policy $\pi^{(i)}$ is strictly feasible whenever $\zeta>0$, because its KL divergence to itself is zero. Hence Slater's condition holds and strong duality applies. Finally, from [Fact A.1](https://arxiv.org/html/2605.11020#A1.Thmtheorem1), the policy is equivalently obtained by optimizing over $\rho_{\pi}(\mathbf{s},\mathbf{a})$ followed by marginalization.
###### Proof of reward update soundness\.
Here, we prove that the reward update $r^{(i+1)}=\left(\mathcal{U}_{\epsilon}^{\rho^{(i)}}\right)r^{(i)}=r^{(i)}-\epsilon\delta^{(i)}$ ([Equation 2](https://arxiv.org/html/2605.11020#S2.E2)) corresponds to an ascent direction with respect to the dual of [Equation 1](https://arxiv.org/html/2605.11020#S2.E1). This proof is a reworking of the results from Arenz et al. ([2016](https://arxiv.org/html/2605.11020#bib.bib1)). We first rewrite the reward update in terms of its parameters and arrive at an equivalent form of the original update direction. Then, we show that this direction aligns with the gradient of the dual of [Equation 1](https://arxiv.org/html/2605.11020#S2.E1).
From (Ziebart et al., [2010](https://arxiv.org/html/2605.11020#bib.bib4); Arenz et al., [2016](https://arxiv.org/html/2605.11020#bib.bib1)), we note that the learnt reward is linear in its features. Let $\rho_{E}(\mathbf{s},\mathbf{a})=Z^{-1}_{E}\exp\left(\phi_{E}\;\psi(\mathbf{s},\mathbf{a})\right)$ and $\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})=Z^{-1}_{\rho^{(i)}}\exp\left(\phi_{\rho^{(i)}}\;\psi(\mathbf{s},\mathbf{a})\right)$ be Boltzmann distributions with energy functions that are, without loss of generality, linear in features $\psi(\mathbf{s},\mathbf{a})$ that are not assumed to be known. In our analysis, we denote reward parameters by $\theta$, occupancy parameters by $\phi$, and a general feature function by $\psi(\mathbf{s},\mathbf{a})$. Estimates of the optimal reward and of the optimal policy occupancy are termed $\hat{r}$ and $\rho_{\hat{\pi}}$. Owing to the reward's linearity, we replace the above update with an equivalent version in parameter space:
$$
\theta^{(i+1)}=\theta^{(i)}-\epsilon\delta^{(i)}\quad\text{where}\quad\delta^{(i)}=\theta^{(i)}-\beta\log\frac{\rho_{E}(\mathbf{s},\mathbf{a})}{\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})}
$$

and $\hat{r}=\beta\log\frac{\rho_{E}(\mathbf{s},\mathbf{a})}{\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})}$ is an estimate of the optimal reward function $r^{\star}$. Arenz et al. ([2016](https://arxiv.org/html/2605.11020#bib.bib1)) also show that this can be used to obtain an estimate of the optimal policy's occupancy:

$$
\begin{aligned}
\rho_{\hat{\pi}}(\mathbf{s},\mathbf{a})
&\propto\exp\left(\log\rho_{E}(\mathbf{s},\mathbf{a})-\frac{1}{\beta}\hat{r}\right)\\
&\propto\exp\left(\left(\phi_{E}-\frac{\theta^{(i)}}{\beta}\right)^{\intercal}\psi(\mathbf{s},\mathbf{a})\right)\\
&\propto\exp\left(\hat{\phi}^{\intercal}\psi(\mathbf{s},\mathbf{a})\right)
\end{aligned}
\tag{4}
$$

Then, the estimate of the optimal reward is given by

$$
\begin{aligned}
\beta\log\frac{\rho_{E}(\mathbf{s},\mathbf{a})}{\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})}
&=\beta\left(\log Z^{-1}_{E}\exp\left(\phi_{E}\;\psi(\mathbf{s},\mathbf{a})\right)-\log Z^{-1}_{\rho^{i}}\exp\left(\phi_{\rho^{i}}\;\psi(\mathbf{s},\mathbf{a})\right)\right)\\
&=\beta\left(\phi_{E}-\phi_{\rho^{i}}\right)+\textrm{const.}
\end{aligned}
$$

The update direction then translates to:

$$
\begin{aligned}
\delta^{(i)}
&=\theta^{(i)}-\beta\left(\phi_{E}-\phi_{\rho^{i}}\right)\\
&=\beta\left(\frac{\theta^{(i)}}{\beta}-\phi_{E}+\phi_{\rho^{i}}\right)\\
&=\beta\left(\phi_{\rho^{i}}-\hat{\phi}\right)\quad\text{(from Equation 4)}.
\end{aligned}
\tag{5}
$$

Given [Equation 5](https://arxiv.org/html/2605.11020#A1.Ex9), together with the definition of $\hat{\phi}$ in [Equation 4](https://arxiv.org/html/2605.11020#A1.Ex5), we now show that $\delta^{\prime}=\nicefrac{1}{\beta}\;\delta^{(i)}=\phi_{\rho^{i}}-\hat{\phi}$ aligns with the gradient of the Lagrangian dual $\mathcal{G}$, where
$$
\begin{aligned}
\mathcal{G}&=\mathbb{E}_{\mu_{0}}\left[V^{\textrm{soft}}_{\pi}(\mathbf{s}_{0})\right]+\beta\log\sum_{(\mathbf{s},\mathbf{a})}\exp\left(\log\rho_{E}(\mathbf{s},\mathbf{a})-\frac{1}{\beta}r^{\star}(\mathbf{s},\mathbf{a})\right)\\
\frac{\partial\mathcal{G}}{\partial\theta}&=\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[\psi(\mathbf{s},\mathbf{a})\right]-\mathbb{E}_{\rho_{\hat{\pi}}}\left[\psi(\mathbf{s},\mathbf{a})\right]
\end{aligned}
$$

We refer the reader to Arenz et al. ([2016](https://arxiv.org/html/2605.11020#bib.bib1)) for the derivations of the Lagrangian dual and its gradient. Let $\langle\cdot\;,\;\cdot\rangle$ denote the inner product of two vectors.

$$
\begin{aligned}
\left\langle\delta^{\prime}\;,\;\frac{\partial\mathcal{G}}{\partial\theta}\right\rangle
&=\left\langle\phi_{\rho^{i}}-\hat{\phi}\;,\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[\psi(\mathbf{s},\mathbf{a})\right]-\mathbb{E}_{\rho_{\hat{\pi}}}\left[\psi(\mathbf{s},\mathbf{a})\right]\right\rangle\\
&=\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[\left(\phi_{\rho^{i}}-\hat{\phi}\right)^{\intercal}\psi(\mathbf{s},\mathbf{a})\right]-\mathbb{E}_{\rho_{\hat{\pi}}}\left[\left(\phi_{\rho^{i}}-\hat{\phi}\right)^{\intercal}\psi(\mathbf{s},\mathbf{a})\right]\\
&=\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[\log\frac{\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})}{\rho_{\hat{\pi}}(\mathbf{s},\mathbf{a})}+\log\frac{Z_{\rho^{i}}}{Z_{\hat{\rho}}}\right]+\mathbb{E}_{\rho_{\hat{\pi}}}\left[\log\frac{\rho_{\hat{\pi}}(\mathbf{s},\mathbf{a})}{\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})}+\log\frac{Z_{\hat{\rho}}}{Z_{\rho^{i}}}\right]\\
&=D_{\textrm{KL}}\left[\rho_{\pi^{(i)}}\;\|\;\hat{\rho}\right]+D_{\textrm{KL}}\left[\hat{\rho}\;\|\;\rho_{\pi^{(i)}}\right]\\
&\geq 0\quad\text{(sum of KLs)}
\end{aligned}
$$

The two log-partition terms cancel because each expectation is taken under a normalized distribution. The update direction $\delta^{(i)}$ therefore aligns with the gradient: their inner product is non-negative, and it is zero only when $\rho_{\pi^{(i)}}=\hat{\rho}$.
In other words, repeated applications of [Equation 2](https://arxiv.org/html/2605.11020#S2.E2) lead to monotonic improvement in the objective. ∎
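The identity at the end of this proof can be checked numerically. The following is a minimal sketch in plain Python, assuming a small finite set of state-action pairs with randomly drawn features. The helper names (`boltzmann`, `expected_features`, `check_alignment`) and the sizes are illustrative assumptions and do not come from the paper. The sketch builds two Boltzmann distributions with parameters $\phi_{\rho}$ and $\hat{\phi}$ and compares the inner product of the parameter difference and the feature-expectation difference with the symmetric KL divergence.

```python
import math
import random


def boltzmann(phi, features):
    """Normalized distribution p(x) proportional to exp(phi . psi(x))."""
    logits = [sum(p * f for p, f in zip(phi, psi)) for psi in features]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_features(p, features):
    """Feature expectation E_p[psi] as a list of length dim."""
    dim = len(features[0])
    return [sum(px * psi[k] for px, psi in zip(p, features)) for k in range(dim)]


def kl(p, q):
    """KL divergence between two strictly positive discrete distributions."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))


def check_alignment(seed, n_outcomes=6, dim=3):
    """Return (inner product, symmetric KL) for random Boltzmann parameters."""
    rng = random.Random(seed)
    # Hypothetical features psi(s, a), one row per state-action pair.
    features = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_outcomes)]
    phi_rho = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # current occupancy parameters
    phi_hat = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # estimated optimal parameters
    rho = boltzmann(phi_rho, features)
    rho_hat = boltzmann(phi_hat, features)
    delta = [a - b for a, b in zip(phi_rho, phi_hat)]
    grad = [a - b for a, b in zip(expected_features(rho, features),
                                  expected_features(rho_hat, features))]
    inner = sum(d * g for d, g in zip(delta, grad))
    sym_kl = kl(rho, rho_hat) + kl(rho_hat, rho)
    return inner, sym_kl
```

Calling `check_alignment(seed)` for a few seeds should return two values that agree up to floating-point error. This only illustrates the algebra of the last three lines of the derivation; it does not model an MDP or the soft value function.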
###### Proof of [Lemma 3.1](https://arxiv.org/html/2605.11020#S3.Thmtheorem1).
From [Fact A.3](https://arxiv.org/html/2605.11020#A1.Thmtheorem3), we note the existence of an optimal Lagrangian multiplier. We further clarify that we do not require exact, per-state satisfaction of the constraint in [Equation 3](https://arxiv.org/html/2605.11020#S3.E3). Instead, it is sufficient to optimize an expected KL constraint using an $\eta$-weighted penalty term (as shown below). We start with the Lagrangian of optimization problem [3](https://arxiv.org/html/2605.11020#S3.E3), given by
$$
\mathcal{L}(\pi,\eta)=\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[H(\pi(\mathbf{a}|\mathbf{s}))\right]+\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[\mathbb{E}_{\pi}\left[r(\mathbf{s},\mathbf{a})\right]\right]+\eta\zeta-\eta\,\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[\textrm{KL}\Big(\pi(\mathbf{a}|\mathbf{s})\,\|\,\pi^{(i)}(\mathbf{a}|\mathbf{s})\Big)\right]
$$

For a given Lagrangian multiplier $\eta$, the optimal policy can be computed as

$$
\begin{aligned}
\pi^{\eta}(\mathbf{a}|\mathbf{s})
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathcal{L}(\pi,\eta)\\
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[H(\pi(\mathbf{a}|\mathbf{s}))\right]+\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[\mathbb{E}_{\pi}\left[r(\mathbf{s},\mathbf{a})\right]\right]+\eta\zeta-\eta\,\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[\textrm{KL}\Big(\pi(\mathbf{a}|\mathbf{s})\,\|\,\pi^{(i)}(\mathbf{a}|\mathbf{s})\Big)\right]\\
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\left[r(\mathbf{s},\mathbf{a})\right]-\eta\,\textrm{KL}\Big(\pi(\mathbf{a}|\mathbf{s})\,\|\,\pi^{(i)}(\mathbf{a}|\mathbf{s})\Big)\right]+\eta\zeta\\
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\left[r(\mathbf{s},\mathbf{a})\right]-\eta\,\mathbb{E}_{\pi}\left[\log\frac{\pi(\mathbf{a}|\mathbf{s})}{\pi^{(i)}(\mathbf{a}|\mathbf{s})}\right]\right]+\eta\zeta\\
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\left[r(\mathbf{s},\mathbf{a})\right]-\eta\,\mathbb{E}_{\pi}\left[\log\pi(\mathbf{a}|\mathbf{s})-\log\pi^{(i)}(\mathbf{a}|\mathbf{s})\right]\right]+\eta\zeta\\
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\left[r(\mathbf{s},\mathbf{a})\right]-\eta\,\mathbb{E}_{\pi}\left[\log\pi(\mathbf{a}|\mathbf{s})\right]+\eta\,\mathbb{E}_{\pi}\left[\log\pi^{(i)}(\mathbf{a}|\mathbf{s})\right]\right]+\eta\zeta\\
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[(1+\eta)H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\left[r(\mathbf{s},\mathbf{a})+\eta\log\pi^{(i)}(\mathbf{a}|\mathbf{s})\right]\right]+\eta\zeta\\
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;(1+\eta)\,\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\left[\frac{r(\mathbf{s},\mathbf{a})}{(1+\eta)}+\frac{\eta}{(1+\eta)}\log\pi^{(i)}(\mathbf{a}|\mathbf{s})\right]\right]+\eta\zeta\\
&=\pi^{\text{MCE}}_{r_{\eta}}
\end{aligned}
$$

∎
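The final step of this derivation can be sanity-checked in the simplest possible setting. The sketch below is plain Python and assumes a single state (a bandit), so the state occupancy plays no role and the maximum-entropy optimum is a softmax. All function names and the values of `eta` and the action count are hypothetical choices for illustration. It compares the penalized objective $H(\pi)+\mathbb{E}_{\pi}[r]-\eta\,\textrm{KL}(\pi\,\|\,\pi^{(i)})$ at the closed-form policy $\pi\propto\exp\big((r+\eta\log\pi^{(i)})/(1+\eta)\big)$ against randomly sampled policies.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def penalized_objective(pi, r, pi_old, eta):
    """Single-state objective: H(pi) + E_pi[r] - eta * KL(pi || pi_old)."""
    entropy = -sum(p * math.log(p) for p in pi)
    reward = sum(p * x for p, x in zip(pi, r))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_old))
    return entropy + reward - eta * kl


def closed_form_policy(r, pi_old, eta):
    """MaxEnt-optimal policy for the reward (r + eta * log pi_old) / (1 + eta)."""
    r_eta = [(x + eta * math.log(q)) / (1.0 + eta) for x, q in zip(r, pi_old)]
    return softmax(r_eta)


def random_policy(rng, n_actions):
    """Draw a strictly positive random policy over n_actions actions."""
    return softmax([rng.gauss(0.0, 1.0) for _ in range(n_actions)])


def best_random_gap(seed, n_actions=4, eta=0.7, n_trials=200):
    """Largest amount by which any sampled policy beats the closed form."""
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
    pi_old = random_policy(rng, n_actions)
    best = penalized_objective(closed_form_policy(r, pi_old, eta), r, pi_old, eta)
    return max(penalized_objective(random_policy(rng, n_actions), r, pi_old, eta) - best
               for _ in range(n_trials))
```

Because the single-state objective is strictly concave in $\pi$, `best_random_gap` should never be positive beyond rounding error. The multi-state case additionally involves the soft value function and is not covered by this sketch.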
###### Proof of [Theorem 3.2](https://arxiv.org/html/2605.11020#S3.Thmtheorem2).
$$
\begin{aligned}
\pi_{\text{tr}}^{(i+1)}(\mathbf{a}|\mathbf{s})
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[H(\pi(\mathbf{a}|\mathbf{s}))\right]+\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[\mathbb{E}_{\pi}\left[\tilde{r}^{(i+1)}(\mathbf{s},\mathbf{a})\right]\right]\\
&\qquad-\eta\,\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[\textrm{KL}\Big(\pi\,\|\,\pi^{(i)}_{\text{tr}}\Big)\right]+\eta\zeta
\end{aligned}
$$

Apply [Lemma 3.1](https://arxiv.org/html/2605.11020#S3.Thmtheorem1):

$$
\begin{aligned}
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\left[\frac{\tilde{r}^{(i+1)}(\mathbf{s},\mathbf{a})}{(1+\eta)}+\frac{\eta}{(1+\eta)}\log\pi^{(i)}(\mathbf{a}|\mathbf{s})\right]\right]
\end{aligned}
$$

Substitute $\tilde{r}^{(i+1)}=\left(\mathcal{U}_{\epsilon}^{\rho^{(i)}}\right)r^{(i)}$:

$$
\begin{aligned}
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\Biggl[H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\Biggl[\frac{1}{(1+\eta)}\left((1-\epsilon)r^{(i)}(\mathbf{s},\mathbf{a})+\epsilon\beta\log\frac{\rho_{E}(\mathbf{s},\mathbf{a})}{\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})}\right)\\
&\qquad+\frac{\eta}{(1+\eta)}\log\pi^{(i)}(\mathbf{a}|\mathbf{s})\Biggr]\Biggr]
\end{aligned}
$$

$\pi^{(i)}(\mathbf{a}|\mathbf{s})$ is Boltzmann:

$$
\begin{aligned}
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\Biggl[H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\Biggl[\frac{1}{(1+\eta)}\left((1-\epsilon)r^{(i)}(\mathbf{s},\mathbf{a})+\epsilon\beta\log\frac{\rho_{E}(\mathbf{s},\mathbf{a})}{\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})}\right)\\
&\qquad+\frac{\eta}{(1+\eta)}\left(Q^{(i)}(\mathbf{s},\mathbf{a})-V^{(i)}(\mathbf{s})\right)\Biggr]\Biggr]\\
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\Biggl[H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\Biggl[\frac{1}{(1+\eta)}\left((1-\epsilon)r^{(i)}(\mathbf{s},\mathbf{a})+\epsilon\beta\log\frac{\rho_{E}(\mathbf{s},\mathbf{a})}{\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})}\right)\\
&\qquad+\frac{\eta}{(1+\eta)}\left(r^{(i)}(\mathbf{s},\mathbf{a})+\gamma\,\mathbb{E}_{\mathbf{s}^{\prime}}\left[V^{(i)}(\mathbf{s}^{\prime})\right]-V^{(i)}(\mathbf{s})\right)\Biggr]\Biggr]\\
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\Biggl[H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\Biggl[\frac{(1-\epsilon)}{(1+\eta)}r^{(i)}(\mathbf{s},\mathbf{a})+\frac{\epsilon}{(1+\eta)}\beta\log\frac{\rho_{E}(\mathbf{s},\mathbf{a})}{\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})}\\
&\qquad+\frac{\eta}{(1+\eta)}r^{(i)}(\mathbf{s},\mathbf{a})+\frac{\eta}{(1+\eta)}\left(\gamma\,\mathbb{E}_{\mathbf{s}^{\prime}}\left[V^{(i)}(\mathbf{s}^{\prime})\right]-V^{(i)}(\mathbf{s})\right)\Biggr]\Biggr]
\end{aligned}
$$

Policy invariance under potential shaping (Ng et al., [1999](https://arxiv.org/html/2605.11020#bib.bib3)):

$$
\begin{aligned}
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\Biggl[H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\Biggl[\frac{(1-\epsilon)}{(1+\eta)}r^{(i)}(\mathbf{s},\mathbf{a})+\frac{\epsilon}{(1+\eta)}\beta\log\frac{\rho_{E}(\mathbf{s},\mathbf{a})}{\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})}+\frac{\eta}{(1+\eta)}r^{(i)}(\mathbf{s},\mathbf{a})\Biggr]\Biggr]\\
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\Biggl[H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\Biggl[\left(1-\frac{\epsilon}{(1+\eta)}\right)r^{(i)}(\mathbf{s},\mathbf{a})+\frac{\epsilon}{(1+\eta)}\beta\log\frac{\rho_{E}(\mathbf{s},\mathbf{a})}{\rho_{\pi^{(i)}}(\mathbf{s},\mathbf{a})}\Biggr]\Biggr]\\
&=\underset{\pi(\mathbf{a}|\mathbf{s})}{\operatorname{arg\,max}}\;\mathbb{E}_{\rho_{\pi}(\mathbf{s})}\left[H(\pi(\mathbf{a}|\mathbf{s}))+\mathbb{E}_{\pi}\left[\left(\mathcal{U}_{\left(\frac{\epsilon}{(1+\eta)}\right)}^{\rho^{(i)}}\right)r^{(i)}\right]\right]\\
&=\pi^{\text{MCE}}_{r^{(i+1)}}
\end{aligned}
$$

∎
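The step-size rescaling in this proof, from $\epsilon$ to $\epsilon/(1+\eta)$, can be illustrated with a small numerical sketch. It is written in plain Python and assumes a single state, so there is no shaping term to remove and the Boltzmann policy for a reward $r$ is simply its softmax. The vector `r_hat` stands in for $\beta\log\frac{\rho_{E}}{\rho_{\pi^{(i)}}}$ and is drawn at random here; all names and the default values of `eps` and `eta` are hypothetical. The sketch compares the trust-region-optimal policy for the full update with the globally optimal maximum-entropy policy for the smaller update.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def reward_update(r, r_hat, eps):
    """Update r - eps * (r - r_hat), i.e. (1 - eps) * r + eps * r_hat."""
    return [(1.0 - eps) * a + eps * b for a, b in zip(r, r_hat)]


def trust_region_policy(r_new, pi_old, eta):
    """Single-state optimum of H + E[r_new] - eta * KL(pi || pi_old) (Lemma 3.1 form)."""
    logits = [(x + eta * math.log(q)) / (1.0 + eta) for x, q in zip(r_new, pi_old)]
    return softmax(logits)


def policy_gap(seed, n_actions=5, eps=0.8, eta=1.5):
    """Max absolute difference between the two policies compared in the proof."""
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
    r_hat = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]  # stand-in reward estimate
    pi_old = softmax(r)  # Boltzmann policy for the current reward
    # Trust-region-optimal policy for the full update with step size eps.
    pi_tr = trust_region_policy(reward_update(r, r_hat, eps), pi_old, eta)
    # Globally optimal MaxEnt policy for the smaller step eps / (1 + eta).
    pi_mce = softmax(reward_update(r, r_hat, eps / (1.0 + eta)))
    return max(abs(a - b) for a, b in zip(pi_tr, pi_mce))
```

In this setting `policy_gap` should be zero up to floating-point error for any `eps` and any non-negative `eta`. With several states, the same conclusion relies on the potential-shaping argument used in the proof, which this sketch does not exercise.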
## Appendix BAdditional Results
In this section, we present auxiliary experimental results that support our claims. [Table 2](https://arxiv.org/html/2605.11020#A2.T2) compares our method with Successor Feature Matching (SFM) (Jain et al., [2025](https://arxiv.org/html/2605.11020#bib.bib51)). [Table 3](https://arxiv.org/html/2605.11020#A2.T3) compares aggregate normalized inter-quartile mean and VRAM usage. [Figure 6](https://arxiv.org/html/2605.11020#A2.F6) compares wall-clock training times between TRIRL and baselines in Mujoco benchmarks and robotics imitation learning settings. As described further in [Section C.2](https://arxiv.org/html/2605.11020#A3.SS2), we reimplement all methods (including baselines) to leverage jit-compilation in Jax and use a parallelized RL simulator to further reduce training times. We find that adversarial IL baselines like GAIL (Ho and Ermon, [2016](https://arxiv.org/html/2605.11020#bib.bib6)), AIRL (Fu et al., [2018](https://arxiv.org/html/2605.11020#bib.bib2)), and AMP (Peng et al., [2021](https://arxiv.org/html/2605.11020#bib.bib12)) train faster than TRIRL. The primary computational costs of our method arise from two components: (i) the reward correction step that uses a buffer of discriminators, and (ii) TRPL for policy optimization (which uses a numerical solver). The runtime impact of the reward correction step is illustrated by comparing adversarial methods with the TR loss variant of TRIRL ([Section 4.3](https://arxiv.org/html/2605.11020#S4.SS3)). Although this variant incurs a modest increase in runtime, it already outperforms most baselines in terms of performance. Among the TRIRL variants, the TR loss version is also the fastest because of its simpler policy update. This is followed by the $\max\;\eta$ and retrospective $\eta$ versions.
The runtime of NEAR (Diwan et al., [2025](https://arxiv.org/html/2605.11020#bib.bib22)) is roughly comparable to that of TRIRL because of the high computational burden of learning an energy-based reward via score matching. In our experiments, LSIQ (Al-Hafez et al., [2023a](https://arxiv.org/html/2605.11020#bib.bib7)) takes much longer to train because we intentionally reduce RL parallelization in favour of improved SAC performance. The limited ability to scale with parallelized environments is a key limitation of SAC-based imitation learning methods and ultimately constrains their applicability to complex environments such as the Unitree G1.
Table 2: We compare our method against Successor Feature Matching (SFM) (Jain et al., [2025](https://arxiv.org/html/2605.11020#bib.bib51)) with a forward dynamics model (FDM) for base features and a TD7 policy (best reported configuration in the paper). We used the official SFM codebase, tuned SFM following the guidelines in the paper, and trained it until convergence.

| Task | TRIRL $\max\,\eta$ | TRIRL TR loss | SFM |
| --- | --- | --- | --- |
| Ant | **1.00 ± 0.05** | 1.03 ± 0.06 | 0.36 ± 0.07 |
| Half Cheetah | **1.05 ± 0.04** | 0.96 ± 0.04 | **1.07 ± 0.07** |
| Walker | **0.73 ± 0.08** | 0.72 ± 0.08 | 0.51 ± 0.21 |
| Hopper | **0.81 ± 0.06** | *0.56 ± 0.29* | *0.56 ± 0.22* |
| Humanoid | **0.92 ± 0.03** | 0.78 ± 0.05 | 0.70 ± 0.08 |

Table 3: Aggregate normalized inter-quartile mean (IQM) with 95% CI (higher is better) and aggregate VRAM usage (lower is better) across all imitation learning tasks. †: LSIQ has higher memory usage in our experiments because of the larger replay buffers.

| Algorithm | IQM | 95% CI low | 95% CI high | Memory (GB) |
| --- | --- | --- | --- | --- |
| TRIRL $\max\;\eta$ (ours) | 0.7881 | 0.7786 | 0.7972 | 7.5774 |
| TRIRL - TR loss (ours) | 0.6293 | 0.5794 | 0.6796 | 7.5774 |
| GAIL | 0.3297 | 0.2725 | 0.3868 | 1.1192 |
| LSIQ | 0.1795 | 0.1494 | 0.2134 | 3.6924† |
| AIRL | 0.1694 | 0.1467 | 0.1928 | 1.2744 |
| AMP | 0.1167 | 0.0985 | 0.1418 | 1.0105 |
| NEAR | 0.0986 | 0.0818 | 0.1184 | 2.5355 |
[Figures 8](https://arxiv.org/html/2605.11020#A2.F8) and [7](https://arxiv.org/html/2605.11020#A2.F7) highlight the improved training stability offered by our method. We find that our method is also much more robust to overtrained discriminators. Finally, [Algorithm 2](https://arxiv.org/html/2605.11020#alg2) shows an extended, practical pseudocode of our method and [Figure 9](https://arxiv.org/html/2605.11020#A2.F9) shows the learnt reward function in the point maze task.
Figure 6: A comparison of the runtimes of all methods. Except for LSIQ on the G1 tasks, all other methods were trained on an RTX 3090 GPU. An RTX PRO 6000 Blackwell was used to accommodate the large replay buffer and batch sizes needed for LSIQ on the complex G1 environment ([Section C.5](https://arxiv.org/html/2605.11020#A3.SS5)).

Figure 7: TRIRL is stable and highly performant across a wide range of hyperparameter values. Here, we show that TRIRL has stable performance, even with a near-perfect discriminator. In contrast, adversarial methods like GAIL are highly sensitive and typically fail due to the sharp decision boundaries induced by perfect discrimination.

Figure 8: To underscore the monotonic performance improvement offered by our method, we plot all seeds from the Ant imitation learning experiment. TRIRL has much more stable and consistent training, and its performance grows approximately monotonically. In contrast, owing to its local rewards and suboptimal policies, GAIL arbitrarily fluctuates in performance during training (example seed in dark green).

#### TRIRL in Discrete Settings:
Here, we briefly explain the procedure used to generate [Figure 2](https://arxiv.org/html/2605.11020#S3.F2). In the discrete case, our implementation closely follows [Algorithm 1](https://arxiv.org/html/2605.11020#alg1). The log density ratio is computed by evaluating the expert’s and the agent’s occupancies analytically from the policy (and transition dynamics). The reward update and reward correction are also computed analytically for the whole grid at once. We use soft value iteration for policy optimization and augment the reward with a reverse KL divergence trust region penalty weighted by a scheduled $\eta$. Soft value iteration returns Boltzmann policies and guarantees convergence to the trust region optimal policy. In the discrete setting, we naturally do not need discriminators or a discriminator buffer either.
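As a rough illustration of the policy-optimization step in the discrete case, the following is a minimal sketch of plain soft value iteration on a tabular MDP. It is not the authors' implementation: the function names, the two-state toy MDP, and the unit entropy temperature are assumptions made for this example, and the reverse KL trust region penalty described above is deliberately omitted.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def soft_value_iteration(reward, transition, gamma=0.9, iters=500):
    """Tabular soft value iteration (hypothetical helper, iters >= 1).

    reward[s][a] is a float and transition[s][a] is a list with P(s' | s, a).
    Returns the soft values V(s) and the Boltzmann policy pi(a | s).
    """
    n_states = len(reward)
    values = [0.0] * n_states
    q = [list(row) for row in reward]
    for _ in range(iters):
        # Soft Bellman backup: Q(s, a) = r(s, a) + gamma * E_{s'}[V(s')]
        q = [
            [
                reward[s][a]
                + gamma * sum(p * values[s2] for s2, p in enumerate(transition[s][a]))
                for a in range(len(reward[s]))
            ]
            for s in range(n_states)
        ]
        # Soft maximum over actions: V(s) = log sum_a exp Q(s, a)
        values = [logsumexp(q_s) for q_s in q]
    # Boltzmann policy: pi(a | s) = exp(Q(s, a) - V(s))
    policy = [[math.exp(q_sa - v_s) for q_sa in q_s] for q_s, v_s in zip(q, values)]
    return values, policy


# Toy MDP (assumed for illustration): action 0 leads to state 0, action 1 to state 1.
reward = [[1.0, 0.0], [0.0, 0.0]]
transition = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
values, policy = soft_value_iteration(reward, transition)
```

Because the policy is computed from the same Q and V of the final backup, each row of `policy` sums to one by construction.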
#### Feature\-Based Variants:
As highlighted in [Section 4.4](https://arxiv.org/html/2605.11020#S4.SS4), our method is a general framework and is not limited to a specific discriminator architecture or data modality. We can hence instill additional inductive biases in our method by doing IRL in the space of features. For example, it is reasonable to assume that the reward function for humanoid locomotion is a function of the floating-base velocity, torso height, and action cost. We can compute such features from the provided expert demonstrations, and learn non-linear extensions into a useful feature-space. TRIRL can directly learn reward functions in such feature spaces. We find that such inductive biases can greatly improve the quality of learnt rewards ([Figure 9](https://arxiv.org/html/2605.11020#A2.F9)) and highlight that this is generally a limitation of prior works like (Fu et al., [2018](https://arxiv.org/html/2605.11020#bib.bib2); Garg et al., [2021](https://arxiv.org/html/2605.11020#bib.bib15)).
Figure 9: We show global reward functions learnt using a feature-based variant of our method, where we first learn a non-linear transformation of known base-features, and then a global reward function in this feature-space by employing a linear discriminator. We note this reward function is also transferable owing to its general depiction of the desired “goal”.

Algorithm 2: Trust Region Inverse RL (continuous state-action spaces)

1. **Initialize:** neural networks $\pi^{(0)}$, $D^{(0)}$, and $R_{\mathrm{fit}}^{(0)}$;
2. **Initialize:** circular buffer $\mathcal{B}=\{(D^{(i)},\pi^{(i)},R_{\mathrm{fit}}^{(i)},\eta^{(i)})\}_{i=0}^{k}$;
3. **Initialize:** trust-region bounds $\zeta_{\mu},\zeta_{\Sigma}$, $\eta_{\mathrm{init}}$, $\epsilon$;
4. **Output:** $r^{\star}$ and $\pi^{\star}$;
5. **repeat**
6. $\triangleright$ ROLLOUT
7. $\{(\mathbf{s}_{t},\mathbf{a}_{t},\mathbf{s}^{\prime}_{t+1})\}_{t=0}^{\text{rollout horizon}}\sim\mathrm{rollout}(\pi^{(i)})$;
8. learn $D^{(i)}\approx\log\left(\frac{\rho_{E}}{\rho_{\pi^{(i)}}}\right)$ by minimizing BCE loss;
9. append $D^{(i)}$ to $\mathcal{B}$;
10. $\triangleright$ RETROSPECTIVE $\boldsymbol{\eta}$
11. **if** retr. $\eta$ **then**
12. $\{\eta^{(i)}\}_{i=0}^{k}\leftarrow\mathrm{TR\ projection}\left(\{\pi^{(i)}\}_{i=0}^{k}\sim\mathcal{B}\right)$;
13. **else**
14. $\{\eta^{(i)}\}_{i=0}^{k}\sim\mathcal{B}$;
15. **end if**
16. $\triangleright$ REWARD CORRECTION
17. $\{D^{(i)}\}_{i=0}^{k}\sim\mathcal{B}$;
18. $\mathrm{logits}\leftarrow\mathrm{jax.vmap}\left(D^{(i)}.\mathrm{predict}\bigl(\{(\mathbf{s}_{t},\mathbf{a}_{t},\mathbf{s}^{\prime}_{t+1})\}\bigr)\right)_{\text{over }i}$;
19. $\mathrm{corrected\ reward}\leftarrow R_{\mathrm{fit}}^{(i-k)}$;
20. **for** $\mathrm{logit}^{(i)},\,\eta^{(i)}$ **in** $\mathrm{zip}(\mathrm{logits},\{\eta^{(i)}\}_{i=0}^{k})$ **do**
21. $\mathrm{step}\leftarrow\epsilon/(1.0+\eta^{(i)})$;
22. $\mathrm{corrected\ reward}\leftarrow(1.0-\mathrm{step})\cdot\mathrm{corrected\ reward}+\mathrm{step}\cdot\beta\cdot\mathrm{logit}^{(i)}$;
23. **end for**
24. $\mathrm{intermediate\ reward}\leftarrow(1.0-\epsilon)\cdot\mathrm{corrected\ reward}+\epsilon\cdot\beta\cdot\mathrm{logit}^{(i=k)}$;
25. learn $R_{\mathrm{fit}}^{(i+1)}\approx\mathrm{corrected\ reward}$;
26. $\triangleright$ POLICY OPTIMIZATION
27. **if** TR loss **then**
28. optimize $\pi^{(i+1)}$ to maximize $\mathrm{intermediate\ reward}+\mathrm{schedule}(\eta_{\mathrm{init}})\cdot\mathrm{TR\ loss}$;
29. **else**
30. $\eta^{(i+1)}\sim$ optimize $\pi^{(i+1)}$ to maximize $\mathrm{intermediate\ reward}$ with TRPL;
31. **end if**
32. append $\eta^{(i+1)}$, $\pi^{(i+1)}$, and $R_{\mathrm{fit}}^{(i+1)}$ to $\mathcal{B}$;
33. **until** converged
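The reward correction loop of Algorithm 2 (steps 19 to 24) can be sketched for a single transition as follows. This is a scalar illustration under stated assumptions, not the authors' code: the function names are hypothetical, the logits stand in for the buffered discriminator outputs, and the real implementation evaluates all discriminators in a batched, vectorized fashion.

```python
def corrected_reward(base_reward, logits, etas, epsilon, beta):
    """Fold buffered log-density-ratio estimates into one reward value.

    base_reward plays the role of the oldest fitted reward, while logits and
    etas hold one discriminator output and one multiplier per buffer entry.
    """
    reward = base_reward
    for logit, eta in zip(logits, etas):
        # Effective step size shrinks as the trust-region multiplier grows.
        step = epsilon / (1.0 + eta)
        reward = (1.0 - step) * reward + step * beta * logit
    return reward


def intermediate_reward(corrected, newest_logit, epsilon, beta):
    """Full-size update with the newest discriminator output (step 24)."""
    return (1.0 - epsilon) * corrected + epsilon * beta * newest_logit


# Hypothetical numbers: one buffered discriminator with multiplier eta = 1.
r_corr = corrected_reward(1.0, logits=[3.0], etas=[1.0], epsilon=0.5, beta=1.0)
r_next = intermediate_reward(r_corr, newest_logit=3.0, epsilon=0.5, beta=1.0)
```

With $\eta^{(i)}=0$ the step equals $\epsilon$, so the correction reduces to the uncorrected interpolation; larger multipliers give smaller steps in the same direction.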
### B\.1Ablation Studies
[Figure 10](https://arxiv.org/html/2605.11020#A2.F10) shows how TRIRL scales with varying amounts of expert demonstrations. In this regard, our method is roughly comparable to the prior works considered in this paper, with the exception of NEAR, which performs much worse in low-data settings (because of the challenges of learning an accurate energy function). [Figure 11](https://arxiv.org/html/2605.11020#A2.F11) and [Table 4](https://arxiv.org/html/2605.11020#A2.T4) present an ablation study comparing the performance, runtime, and memory usage of TRIRL with different values of buffer size ($k$). We find that while a larger buffer size helps improve stability and performance, our method can also perform quite well with small values of buffer size, often outperforming baselines. We also find that the low-$k$ configurations have much lower runtime and memory usage. Finally, [Tables 5](https://arxiv.org/html/2605.11020#A2.T5) and [6](https://arxiv.org/html/2605.11020#A2.T6) show ablation studies for the parameters $\beta$ and $\epsilon$.
Figure 10: A plot showing how TRIRL scales with different amounts of expert demonstration trajectories.

Figure 11: The scaling of performance with buffer size ($k$). TRIRL outperforms baselines even with very low values of buffer size ($k$). While a low $k$ still beats baselines, performance and training stability benefit from larger $k$. This is expected since the contribution of reward fitting (and approximation errors) diminishes exponentially with $k$.

Table 4: The scaling of VRAM usage and runtime with buffer size ($k$) in the Mujoco Ant environment. TRIRL outperforms baselines even with very low values of buffer size ($k$). The low-$k$ configurations have a negligible runtime and VRAM overhead.

| Algorithm | Memory (GB) | Runtime (sec/M steps) |
| --- | --- | --- |
| TRIRL $k=1$ | 0.92 | 70.72 ± 0.96 |
| TRIRL $k=2$ | 1.46 | 71.11 ± 0.98 |
| TRIRL $k=25$ | 2.53 | 76.46 ± 1.32 |
| TRIRL $k=50$ | 2.53 | 78.07 ± 1.11 |
| TRIRL $k=100$ | 2.53 | 81.98 ± 3.33 |
| GAIL | 1.12 | 51.37 ± 2.13 |
| AMP | 3.69 | 61.76 ± 0.93 |
| AIRL | 1.27 | 66.51 ± 8.85 |
| NEAR | 1.01 | 100.41 ± 3.87 |
| LSIQ | 2.54 | 143.25 ± 12.86 |

Table 5: We conduct an ablation study to investigate the impact of the hyperparameter $\beta$ (entropy regularization). We find that TRIRL is generally insensitive to $\beta$ and the optimal entropy coefficient is generally task-dependent. Our results align with (Orsini et al., [2021](https://arxiv.org/html/2605.11020#bib.bib21), Fig. 19).

| ent coeff | $\beta$ (1/ent coeff) | Ant | Half Cheetah | Walker |
| --- | --- | --- | --- | --- |
| 0.1 | 10 | **1.01 ± 0.02** | 0.92 ± 0.09 | 0.67 ± 0.08 |
| 0.01 | 100 | 0.95 ± 0.06 | **0.95 ± 0.09** | **0.73 ± 0.08** |
| 0.001 | 1000 | 1.00 ± 0.08 | 0.91 ± 0.10 | 0.67 ± 0.08 |
| 0.0001 | 10000 | 1.00 ± 0.06 | 0.93 ± 0.08 | 0.67 ± 0.09 |
| 0.00001 | 100000 | 0.94 ± 0.12 | 0.92 ± 0.08 | 0.73 ± 0.08 |

Table 6: We conduct an ablation study to investigate the impact of the hyperparameter $\epsilon$, which controls the ratio of newly fitted log density ratios and the corrected reward from the previous iteration. TRIRL is also not overly sensitive to $\epsilon$. We find that setting this to a low value generally ensures good performance, though higher values can speed up convergence at the cost of performance guarantees.

| $\epsilon$ | Ant | Half Cheetah | Walker |
| --- | --- | --- | --- |
| 0.01 | **1.02 ± 0.03** | 0.94 ± 0.08 | 0.75 ± 0.03 |
| 0.2 | 0.99 ± 0.05 | **0.95 ± 0.09** | **0.76 ± 0.03** |
| 0.4 | 1.01 ± 0.03 | 0.91 ± 0.08 | 0.75 ± 0.06 |
| 0.6 | 1.00 ± 0.06 | 0.91 ± 0.10 | 0.72 ± 0.09 |
| 0.8 | 0.99 ± 0.02 | 0.91 ± 0.08 | 0.69 ± 0.04 |
| 0.99 | 0.95 ± 0.12 | 0.90 ± 0.09 | 0.68 ± 0.10 |
## Appendix CExperiment Details
### C\.1Baselines
#### Generative Adversarial Imitation Learning \(GAIL\):
Ho and Ermon ([2016](https://arxiv.org/html/2605.11020#bib.bib6)) propose GAIL, one of the first methods to reformulate the MCE-IRL dual ascent algorithm (Ziebart et al., [2010](https://arxiv.org/html/2605.11020#bib.bib4)) into a scalable method for imitation learning in complex continuous-domain settings. GAIL achieves the same optimal primal solution as MCE-IRL. However, it relies on per-step local policy optimization and a local reward function learnt using a neural network discriminator. GAIL minimizes the Jensen-Shannon divergence between the expert and agent distributions. Practically, the reward function is defined as $\log\sigma\left(D(\mathbf{s},\mathbf{a})\right)$ where $D(\mathbf{s},\mathbf{a})$ is a binary classifier that distinguishes expert and agent samples. While the paper originally proposes TRPO (Schulman et al., [2015](https://arxiv.org/html/2605.11020#bib.bib27)) for policy optimization, most modern implementations use PPO (Schulman et al., [2017](https://arxiv.org/html/2605.11020#bib.bib28)) because of its better performance and scalability.
#### Adversarial Inverse Reinforcement Learning \(AIRL\):
Fu et al. ([2018](https://arxiv.org/html/2605.11020#bib.bib2)) present AIRL, another adversarial IL algorithm that learns unshaped rewards and focuses on reward recovery. Similarly to our method, AIRL minimizes the reverse KL divergence between the expert and agent samples. The main contribution of the paper is to introduce an unshaped reward function to disentangle the reward from the environment’s dynamics. In doing so, Fu et al. ([2018](https://arxiv.org/html/2605.11020#bib.bib2)) claim to learn a reward function that conveys the expert’s motivations, instead of simply rewarding closeness to the expert’s transitions. In AIRL, the discriminator takes the form:

$$
\begin{aligned}
D(\mathbf{s},\mathbf{a}) &= \frac{\exp(f_{\theta,\phi}(\mathbf{s},\mathbf{a}))}{\exp(f_{\theta,\phi}(\mathbf{s},\mathbf{a})) + \pi(\mathbf{a}|\mathbf{s})} \quad\text{where} \\
f_{\theta,\phi}(\mathbf{s},\mathbf{a}) &= g_{\theta}(\mathbf{s},\mathbf{a}) + \gamma h_{\phi}(\mathbf{s}^{\prime}) - h_{\phi}(\mathbf{s})
\end{aligned}
$$

Here, the unshaped rewards are learnt as $g_{\theta}(\mathbf{s},\mathbf{a})$ or $g_{\theta}(\mathbf{s})$ (in case of state-only rewards). Similarly to GAIL, the authors propose TRPO for policy optimization. However, modern implementations use PPO.
#### Adversarial Motion Priors \(AMP\):
Recently, Peng et al. ([2021](https://arxiv.org/html/2605.11020#bib.bib12)) propose AMP, a method that leverages the improvements from least-squares GANs (Mao et al., [2017](https://arxiv.org/html/2605.11020#bib.bib23)) for adversarial imitation learning. Mainly, AMP differs from GAIL in its minimization of the $\chi^{2}$ divergence between the expert and agent distributions. Peng et al. ([2021](https://arxiv.org/html/2605.11020#bib.bib12)) also engineer the reward function slightly differently to better leverage the squared-error minimization. The AMP reward function is $r(\mathbf{s},\mathbf{a})=\max[0,\,1-0.25(D(\mathbf{s},\mathbf{a})-1)^{2}]$ where $D(\mathbf{s},\mathbf{a})$ is a binary classifier trained to assign a label of 1 to the expert samples and -1 to the agent samples.
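To make the shape of this reward concrete, here is a minimal sketch that evaluates the AMP reward formula quoted above for a few discriminator outputs. The function name and the sample values are assumptions for illustration only.

```python
def amp_reward(d):
    """AMP-style reward for a discriminator output d.

    The reward peaks at d = 1 (the expert label) and is clipped at zero once
    d is two or more units away from that label.
    """
    return max(0.0, 1.0 - 0.25 * (d - 1.0) ** 2)


# Hypothetical discriminator outputs: expert-like, undecided, agent-like.
rewards = [amp_reward(d) for d in (1.0, 0.0, -1.0)]
```

The reward therefore lies in $[0, 1]$, and a sample that the discriminator labels exactly as agent-like ($D=-1$) receives zero reward.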
#### Least Squares Inverse Q\-Learning \(LSIQ\):
Garg et al. ([2021](https://arxiv.org/html/2605.11020#bib.bib15)) introduce Inverse Soft Q-Learning (IQ-Learn), a non-adversarial method that reformulates MCE-IRL in the Q-policy space. IQ-Learn uses the inverse soft Bellman operator $(\mathcal{T}^{\pi}Q)(\mathbf{s},\mathbf{a})=Q(\mathbf{s},\mathbf{a})-\gamma\mathbb{E}_{\mathbf{s}^{\prime}\sim\mathcal{P}}\left[V_{\pi}(\mathbf{s}^{\prime})\right]$ to reformulate the entropy regularized distribution matching objective into an objective that only depends on the Q function. They do this by leveraging the fact that the optimal policy depends on $Q(\mathbf{s},\mathbf{a})$ in closed form. Ultimately, the Q function is learnt by maximizing

$$
\mathcal{J}(\pi,Q)=\mathbb{E}_{\rho_{E}}\left[\phi\left(Q(\mathbf{s},\mathbf{a})-\gamma\mathbb{E}_{\mathbf{s}^{\prime}\sim\mathcal{P}}[V_{\pi}(\mathbf{s}^{\prime})]\right)\right]-(1-\gamma)\mathbb{E}_{\rho_{0}}\left[V_{\pi}(\mathbf{s}_{0})\right]
$$

and the policy is learnt using SAC (Haarnoja et al., [2018](https://arxiv.org/html/2605.11020#bib.bib24)). Al-Hafez et al. ([2023a](https://arxiv.org/html/2605.11020#bib.bib7)) extend IQ-Learn by leveraging a mixture distribution for computing the expectation in $\mathcal{J}(\pi,Q)$. The resulting optimization problem is shown to minimize a bounded $\chi^{2}$ divergence between the expert and the mixture distribution. The resulting method, called LSIQ, also properly handles absorbing states (discussed further in [Section C.4](https://arxiv.org/html/2605.11020#A3.SS4)), leading to a better performing algorithm. In our motion-capture imitation experiments on the G1 humanoid robot, we use the state-only $Q$-function objective from (Garg et al., [2021](https://arxiv.org/html/2605.11020#bib.bib15), Appendix A.1) within LSIQ.
#### Noise Conditioned Energy Based Annealed Rewards \(NEAR\):
Diwan et al. ([2025](https://arxiv.org/html/2605.11020#bib.bib22)) present NEAR, an energy-based imitation learning method that uses score-based models to learn a smooth, stationary reward function that can then directly be optimized by reinforcement learning. NEAR uses noise-conditioned score networks (Song and Ermon, [2019](https://arxiv.org/html/2605.11020#bib.bib40)), a score-based generative model, to approximate the expert distribution as an unnormalized energy function $E_{\theta}(\mathbf{s},\mathbf{a},\sigma)$. Given a perturbation level defined by variance $\sigma^{2}$, the energy function is shown to directly correspond to a reward function. Diwan et al. ([2025](https://arxiv.org/html/2605.11020#bib.bib22)) then show that this energy function can be optimized with PPO to learn imitation policies. The paper also introduces an annealing framework to gradually change the $\sigma$ curriculum to improve performance. One of the primary drawbacks of NEAR is that learning an expressive energy-based reward function requires many more expert demonstrations than most traditional IRL methods. Hence, NEAR often underperforms in our low-data experimental settings (especially on the G1 and Go2 robots).
#### Successor Feature Matching \(SFM\):
Jain et al. ([2025](https://arxiv.org/html/2605.11020#bib.bib51)) introduce SFM, a non-adversarial IL method that directly matches the expert and agent’s successor features instead of learning an explicit discriminator or reward function. SFM assumes a base feature map $\phi(\mathbf{s})$ and represents long-horizon feature occupancy through successor features $\psi^{\pi}(\mathbf{s},\mathbf{a})=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\phi(\mathbf{s}_{t})\mid\mathbf{s}_{0}=\mathbf{s},\mathbf{a}_{0}=\mathbf{a}\right]$. The main observation is that, under a linear reward class, the witness reward that maximally distinguishes the expert and agent can be written in closed form as the difference between their expected successor features. Concretely, SFM estimates $\hat{\mathbf{w}}=\hat{\psi}^{E}-\hat{\psi}^{\pi}$ and uses the induced reward direction $r(\mathbf{s})\propto\phi(\mathbf{s})^{\top}\hat{\mathbf{w}}$ to update the policy. Rather than training a separate reward model, SFM uses the learnt successor features directly in the actor update, making the method a direct policy-search procedure for minimizing the feature-matching imitation gap. In practice, Jain et al. ([2025](https://arxiv.org/html/2605.11020#bib.bib51)) learn $\phi$ jointly with the policy and implement policy optimization with TD3/TD7-style actor-critic updates. The main drawback of SFM is that its performance depends on the quality of the learnt base features, and it offers no method for extracting the learnt reward function.
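The witness-reward construction described above can be sketched with plain lists. This is an illustrative simplification, not the SFM implementation: successor features are approximated here by the discounted feature sum of a single trajectory (SFM learns them with a network), and all function names and feature values are assumptions.

```python
def discounted_feature_sum(features, gamma):
    """Monte Carlo estimate of successor features from one trajectory."""
    dim = len(features[0])
    total = [0.0] * dim
    for t, phi in enumerate(features):
        for j in range(dim):
            total[j] += (gamma ** t) * phi[j]
    return total


def witness_weights(psi_expert, psi_agent):
    """w_hat = psi_E - psi_pi, the closed-form witness direction."""
    return [e - a for e, a in zip(psi_expert, psi_agent)]


def induced_reward(phi_s, w_hat):
    """Unnormalized reward r(s), proportional to phi(s)^T w_hat."""
    return sum(p * w for p, w in zip(phi_s, w_hat))


# Hypothetical two-dimensional base features along two short trajectories.
psi_expert = discounted_feature_sum([[1.0, 0.0], [0.0, 1.0]], gamma=0.5)
psi_agent = discounted_feature_sum([[0.0, 1.0], [0.0, 1.0]], gamma=0.5)
w_hat = witness_weights(psi_expert, psi_agent)
```

States whose features the expert visits more often than the agent receive a positive induced reward, and states the agent over-visits receive a negative one.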
### C\.2Implementation Details
We conduct experiments on Mujoco continuous control benchmarks ([https://gymnasium.farama.org/environments/mujoco/](https://gymnasium.farama.org/environments/mujoco/)) as well as more challenging robotics tasks. The Mujoco environments are described in detail on the official website linked above. For the robotics experiments, we use the Unitree G1 and Unitree Go2 robots. The G1 is a 35 kg, 1.3 m tall humanoid robot capable of dynamic locomotion tasks. It is a 23 degree-of-freedom system with high-torque quasi-direct-drive actuators for greater speed and precise control. The G1 simulation environment has a 256-dimensional observation space composed of base angular velocity and roll/pitch, full joint positions and velocities, the previous action, and a short history stack of recent proprioceptive states. The G1 has a 23-dimensional action space, where each component is a target joint position increment passed through a PD controller (position control). The Go2 is a 15.8 kg, 0.67 m long quadruped robot designed for dynamic legged locomotion. It is a 12 degree-of-freedom system with actuated joints in each of its four legs. The Go2 simulation environment has a 42-dimensional observation space composed of trunk linear and angular velocities, full joint positions and velocities, and the previous action. It has a 12-dimensional action space, where each component is a target joint position passed through a PD controller. [Figure 12](https://arxiv.org/html/2605.11020#A3.F12) shows a snapshot of all environments.

Figure 12: Environments used in our experiments.

We leverage massively parallel reinforcement learning environments for all our experiments (Bohlinger and Dorer, [2023](https://arxiv.org/html/2605.11020#bib.bib32)). To this end, we use the Mujoco XLA simulation framework (Todorov et al., [2012](https://arxiv.org/html/2605.11020#bib.bib35)) to handle all algorithm and environment computations on the GPU. Further, for a fair comparison, we implement all methods using the just-in-time compiled Python frameworks Jax/Flax (Bradbury et al., [2022](https://arxiv.org/html/2605.11020#bib.bib37); Heek et al., [2024](https://arxiv.org/html/2605.11020#bib.bib36)). This allows us to parallelize multiple algorithmic components for all methods efficiently. Such parallelization is especially useful for quickly computing the several intermediate quantities described in [Sections 4.2](https://arxiv.org/html/2605.11020#S4.SS2) and [4.3](https://arxiv.org/html/2605.11020#S4.SS3). Specifically, we rely on parallelization for:
- Trust-region projections: computing Lagrangian multipliers for all samples in parallel.
- Reward interpolations: running inference on all discriminators in the buffer at once.
- Retrospective Lagrangian multipliers: computing Lagrangian multipliers for all samples across all past policies in the buffer.
### C\.3Trust Region Projections
Here, we briefly elaborate on the algorithmic and implementation details of computing trust region policy projections ([Section 4.3](https://arxiv.org/html/2605.11020#S4.SS3)). Given a Gaussian policy $\pi(\cdot|\mathbf{s})=\mathcal{N}(\mu_{\theta}(\mathbf{s}),\Sigma_{\phi})$, Otto et al. ([2021](https://arxiv.org/html/2605.11020#bib.bib29)) propose a method to exactly enforce the analytical reverse KL trust region by the application of a projection layer during policy optimization. The projection layer maps every violating policy prediction back into the trust region such that the projected policy parameters $(\tilde{\mu}_{\mathbf{s}},\tilde{\Sigma})$ solve the following optimization problem:

$$
\begin{aligned}
\arg\min_{\tilde{\mu}_{\mathbf{s}}}\; d_{\textrm{mean}}\left(\tilde{\mu}_{\mathbf{s}},\mu_{\theta}(\mathbf{s})\right), \quad &\text{s.t.}\quad d_{\textrm{mean}}\left(\tilde{\mu}_{\mathbf{s}},\mu_{\textrm{old}}(\mathbf{s})\right)\leq\zeta_{\mu}, \\
\arg\min_{\tilde{\Sigma}}\; d_{\textrm{cov}}\left(\tilde{\Sigma},\Sigma_{\phi}\right), \quad &\text{s.t.}\quad d_{\textrm{cov}}\left(\tilde{\Sigma},\Sigma_{\textrm{old}}\right)\leq\zeta_{\Sigma},
\end{aligned}
\tag{6}
$$

where $d_{\textrm{mean}}=\tfrac{1}{2}\left((\mu_{2}-\mu_{1})^{T}\Sigma_{2}^{-1}(\mu_{2}-\mu_{1})\right)$ and $d_{\textrm{cov}}=\tfrac{1}{2}\left(\log\frac{|\Sigma_{2}|}{|\Sigma_{1}|}+\text{tr}\{\Sigma_{2}^{-1}\Sigma_{1}\}-d\right)$ are the mean and covariance components of the reverse KL divergence for two Gaussian distributions with means $\mu_{1}$ and $\mu_{2}$, covariances $\Sigma_{1}$ and $\Sigma_{2}$, and dimensionality $d$. The projected policy’s mean is known in closed form as:

$$
\begin{aligned}
\tilde{\mu}_{\mathbf{s}} &= \frac{\mu_{\theta}(\mathbf{s})+\eta_{\mu}(\mathbf{s})\;\mu_{\text{old}}(\mathbf{s})}{1+\eta_{\mu}(\mathbf{s})} \quad\text{with Lagrangian multiplier} \\
\eta_{\mu}(\mathbf{s}) &= \sqrt{\frac{\left(\mu_{\text{old}}(\mathbf{s})-\mu_{\theta}(\mathbf{s})\right)^{T}\Sigma_{\text{old}}^{-1}\left(\mu_{\text{old}}(\mathbf{s})-\mu_{\theta}(\mathbf{s})\right)}{\zeta_{\mu}}}-1
\end{aligned}
$$

For the covariance, Otto et al. ([2021](https://arxiv.org/html/2605.11020#bib.bib29)) compute the projected policy’s precision $\tilde{\Lambda}=\tilde{\Sigma}^{-1}$ as an interpolation between precision matrices:

$$
\tilde{\Lambda}=\frac{\eta_{\Sigma}^{\star}\;\Lambda_{\text{old}}+\Lambda}{\eta_{\Sigma}^{\star}+1},\quad \eta_{\Sigma}^{\star}=\arg\min_{\eta_{\Sigma}}\;g(\eta_{\Sigma}),\;\text{s.t.}\;\eta_{\Sigma}\geq 0
$$

where $\eta_{\Sigma}$ is the covariance Lagrangian multiplier and $g(\eta_{\Sigma})$ the dual function of [Equation 6](https://arxiv.org/html/2605.11020#A3.E6). This dual minimization cannot be solved in closed form. However, Otto et al. ([2021](https://arxiv.org/html/2605.11020#bib.bib29)) formulate a differentiable trust region layer by solving the minimization using L-BFGS (Liu and Nocedal, [1989](https://arxiv.org/html/2605.11020#bib.bib30)) and computing its gradients by taking the differentials of the KKT conditions of the dual. Following (Otto et al., [2021](https://arxiv.org/html/2605.11020#bib.bib29), Appendix B.5), we re-implement the trust region projection layer in Jax, using Blondel et al. ([2022](https://arxiv.org/html/2605.11020#bib.bib38)) for L-BFGS and `jax.custom_vjp` to compute its gradients. Crucially, Jax’s parallelization allows us to compute these multipliers very quickly, even for very large rollout batches (approx. 41k samples at once).
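The closed-form mean projection can be sketched for a single state as follows. This is a simplified illustration rather than the Jax projection layer: it assumes a diagonal old covariance, uses hypothetical function names, and applies the multiplier expression exactly as written above, so it thresholds the squared Mahalanobis distance directly (without the factor $\tfrac{1}{2}$ that appears in $d_{\textrm{mean}}$). The covariance projection, which needs a numerical solver, is not covered.

```python
import math


def mahalanobis_sq(mu_a, mu_b, var_diag):
    """Squared Mahalanobis distance under a diagonal covariance."""
    return sum((a - b) ** 2 / v for a, b, v in zip(mu_a, mu_b, var_diag))


def project_mean(mu_new, mu_old, var_old_diag, zeta_mu):
    """Project a predicted mean back towards the old mean if it is too far.

    Returns the projected mean and the multiplier eta_mu (0.0 if no
    projection was needed).
    """
    dist = mahalanobis_sq(mu_old, mu_new, var_old_diag)
    if dist <= zeta_mu:
        return list(mu_new), 0.0
    eta = math.sqrt(dist / zeta_mu) - 1.0
    # Convex combination of new and old means, weighted by the multiplier.
    projected = [(n + eta * o) / (1.0 + eta) for n, o in zip(mu_new, mu_old)]
    return projected, eta


# Hypothetical example: the new mean is two standard deviations away.
mu_proj, eta_mu = project_mean([2.0, 0.0], [0.0, 0.0], [1.0, 1.0], zeta_mu=1.0)
```

After projection the squared Mahalanobis distance to the old mean equals $\zeta_{\mu}$ in this sketch, which corresponds to $d_{\textrm{mean}}=\zeta_{\mu}/2$ under the definition with the factor $\tfrac{1}{2}$.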
### C\.4Absorbing State Handling
In traditional reinforcement learning settings, when an agent enters an absorbing state $\mathbf{s}_{a}$, it receives zero reward (i.e. $r(\mathbf{s}_{a},\cdot)=0$), and the next state for any next agent action is always $\mathbf{s}_{a}$ (i.e. $\mathcal{P}(\mathbf{s}_{a}|\mathbf{s}_{a},\cdot)=1$). This elegantly handles episode termination without hindering the RL objective of maximizing a known reward function. Kostrikov et al. ([2019](https://arxiv.org/html/2605.11020#bib.bib19)) highlight that this assumption, however, causes problems in inverse reinforcement learning, where the reward is learnt and the goal is instead to learn policies that imitate the expert. Primarily, the issue is that the optimal learnt reward can vary in scale arbitrarily, meaning that absorbing states are either seen as highly rewarding or highly costly. Depending on the task, this induces a survival/termination bias in the agent. Improper handling of absorbing states may cause some methods to over/under-perform relative to their true potential. As shown empirically in Kostrikov et al. ([2019](https://arxiv.org/html/2605.11020#bib.bib19)), adversarial methods like GAIL and AIRL (Ho and Ermon, [2016](https://arxiv.org/html/2605.11020#bib.bib6); Fu et al., [2018](https://arxiv.org/html/2605.11020#bib.bib2)) can get a large performance gain from such biases. Al-Hafez et al. ([2023a](https://arxiv.org/html/2605.11020#bib.bib7)) also highlight that the same is true for non-adversarial methods like IQ-Learn (Garg et al., [2021](https://arxiv.org/html/2605.11020#bib.bib15)).
For a fair comparison, we learn the reward function for absorbing states in all methods shown in this paper. For adversarial methods like GAIL, AIRL, and AMP (Peng et al., [2021](https://arxiv.org/html/2605.11020#bib.bib12)), as well as our proposed algorithm (TRIRL), we handle absorbing states similarly to Kostrikov et al. ([2019](https://arxiv.org/html/2605.11020#bib.bib19)). That is, we fit the discriminator on expert and agent samples in absorbing states, add an indicator variable to the discriminator input to mark the absorbing state, and use the absorbing state value during advantage estimation:
$$\begin{aligned}
V_{\pi}(\mathbf{s}) &= \mathbb{E}_{\pi}\left[ r(\mathbf{s}, \mathbf{a}) + \left(1 - \nu\right)\;\gamma V_{\pi}(\mathbf{s}^{\prime}) + \nu\; V(\mathbf{s}_a) \right] \quad \text{where} \\
V(\mathbf{s}_a) &= \frac{\gamma}{1 - \gamma}\; r(\mathbf{s}_a) \quad;\quad r(\mathbf{s}_a) \text{ is learnt} \quad;\quad \nu = \mathbf{1}_{\{\mathbf{s}^{\prime} \text{ is absorbing}\}}.
\end{aligned}$$

For NEAR (Diwan et al., [2025](https://arxiv.org/html/2605.11020#bib.bib22)), we train the energy-based reward on absorbing states and follow the same procedure for advantage estimation. LSIQ (Al-Hafez et al., [2023a](https://arxiv.org/html/2605.11020#bib.bib7)) rectifies absorbing state handling in IQ-Learn, and we implement it as described in the paper.
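The value recursion above can be illustrated with a single-sample target. The sketch below is only a reading of the two equations, not the authors' code: the function names and the scalar interface are hypothetical, and the expectation over the policy is replaced by one observed transition.

```python
def absorbing_state_value(r_absorbing, gamma):
    """V(s_a) = gamma / (1 - gamma) * r(s_a), with r(s_a) the learnt reward."""
    return gamma / (1.0 - gamma) * r_absorbing


def one_step_target(reward, next_value, next_is_absorbing, r_absorbing, gamma=0.99):
    """Single-sample version of r + (1 - nu) * gamma * V(s') + nu * V(s_a)."""
    # nu is the indicator that the next state s' is absorbing.
    nu = 1.0 if next_is_absorbing else 0.0
    bootstrap = (1.0 - nu) * gamma * next_value
    terminal = nu * absorbing_state_value(r_absorbing, gamma)
    return reward + bootstrap + terminal
```

With a negative learnt absorbing reward the target of a terminating transition drops sharply, which discourages termination, while a positive one makes termination attractive. This is the survival or termination bias discussed above, except that here its sign is learnt from data rather than fixed by a zero-reward convention.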
### C.5 Hyperparameters
The following hyperparameters were fixed across all methods in this paper. For the policy, we use a 3-layer dense network and predict the $\log$ standard deviation to ensure non-negativity. For the simpler MuJoCo tasks we use 256 neurons per layer, and for the harder robotics tasks we use [512, 256, 128] neurons, respectively. We use $\tanh$ activations between all layers and clip actions to the environment's action range. We use the same architecture for the discriminator and critic networks (except for the robotics tasks, where we use [1024, 512] neurons for the critic). We use a rollout horizon of 10 steps and set the following RL hyperparameters: {$\gamma$: 0.99, $\lambda_{\text{gae}}$: 0.95, PPO clipping: 0.2, gradient clipping: 1.0, gradient penalty: 0.005, mini-batch size: 512}. We tune learning rates and the number of gradient updates for each algorithm. In general, we found optimal values in the following ranges: {$\text{LR}_{\pi}$: 4e-5 to 1e-4, $\text{LR}_{\text{disc}}$: 8e-5 to 3e-4, grad steps: 20 to 30}. For adversarial methods, we found that significantly fewer discriminator gradient steps were needed (approx. 3 to 10). We also tune the entropy coefficient ($1/\beta$ in our case) specifically for each environment. In general, we found that values in the range 8e-5 to 5e-3 perform best. [Table 7](https://arxiv.org/html/2605.11020#A3.T7) shows hyperparameters specific to our method. [Table 8](https://arxiv.org/html/2605.11020#A3.T8) reports raw expert and random returns.
Table 7: Hyperparameters specific to our method.

| Hyperparameter | Optimal Range |
| --- | --- |
| $\epsilon$ | 0.45 - 0.75 |
| disc. buffer capacity | 80 - 120 |
| TR regression loss coef. (Otto et al., [2021](https://arxiv.org/html/2605.11020#bib.bib29), Section 4.4) | 0.4 - 0.6 |
| mean bound | 0.0002 - 0.005 |
| cov bound | 0.0001 - 0.004 |
| $\eta_{\text{init}}$ (TR loss version only) | 60 - 100 |

#### Notes on SAC Training:
As mentioned above, we leverage massively parallel simulators in our reinforcement learning setup. These simulators greatly reduce training time and make better use of the available computational resources. In our analysis, all baselines except LSIQ (Al-Hafez et al., [2023a](https://arxiv.org/html/2605.11020#bib.bib7)) use PPO (Schulman et al., [2017](https://arxiv.org/html/2605.11020#bib.bib28)) for policy optimization. Being on-policy, PPO scales very easily with parallelized environments and allows us to greatly scale up training speed for all methods. LSIQ, however, relies on SAC (Haarnoja et al., [2018](https://arxiv.org/html/2605.11020#bib.bib24)) for policy optimization in continuous settings. This is a drawback because SAC is difficult to scale with parallel simulations. Without parallelized environments, we found that LSIQ (as well as IQ-Learn (Garg et al., [2021](https://arxiv.org/html/2605.11020#bib.bib15))) takes roughly 12–18 hours (about 8-10x the mean runtime of the other methods) to reach the same number of training samples as the baselines, while performing roughly equally. To ensure a fair comparison with a similar computational budget for all methods, we tune LSIQ specifically for parallelized environments. In this context, Li et al. ([2023](https://arxiv.org/html/2605.11020#bib.bib34)) identify key challenges in scaling SAC and provide hyperparameter recommendations for improving training performance. Their main suggestions are to greatly increase the replay buffer capacity, the batch size, and the critic–policy update ratio. Following these recommendations, we tuned LSIQ to perform well in parallelized simulators while requiring much less runtime than the non-parallelized version. Because of these scaling issues, we also considered the number of environments as a hyperparameter for LSIQ tuning.
We also note that both the parallel and single-environment versions of LSIQ reached similar maximum performance when using a tuned set of hyperparameters. Sample efficiency could be further improved by considering recent advancements in using batch/weight norms in SAC (Palenicek et al., [2026](https://arxiv.org/html/2605.11020#bib.bib54)).
Table 8: Expert and random policy performance. †The G1 experiments use motion-capture data for the expert dataset. For this task, we normalize with respect to a "perfect imitation" score with the per-step score as $\exp(-(v_{\pi} - v_{\pi_E})^{2})$, where $v_{\pi}$ and $v_{\pi_E}$ are the agent's and expert's upper-body velocities.

| Environment | Expert Performance | Random Performance |
| --- | --- | --- |
| Half Cheetah | 5050.39 | -310.55 |
| Half Cheetah (wind) | 4841.05 | -369.77 |
| Half Cheetah (mars grav.) | 4100.14 | -418.77 |
| Ant | 5390.62 | 94.43 |
| Ant-Disabled | 4377.28 | 76.59 |
| Walker | 5480.83 | 37.53 |
| Hopper | 3531.76 | 24.56 |
| Humanoid | 7404.87 | 236.45 |
| Go2 | 2341.38 | 1.37 |
| G1† | 1000.00 | 2.86 |