$f$-Trajectory Balance: A Loss Family for Tuning GFlowNets, Generative Models, and LLMs with Off- and On-Policy Data

arXiv cs.LG 05/18/26, 04:00 AM Papers
gflownets generative-models llms f-divergences trajectory-balance off-policy on-policy
Summary
This paper introduces a family of loss functions derived from f-divergences for training generative models like GFlowNets and LLMs, which are valid off-policy while matching on-policy gradients of the corresponding f-divergence. Applications include molecule discovery and asynchronous LLM tuning.
arXiv:2605.15417v1 Announce Type: new
Cached at: 05/18/26, 06:41 AM
# A Loss Family for Tuning GFlowNets, Generative Models, and LLMs with Off- and On-Policy Data
Source: [https://arxiv.org/html/2605.15417](https://arxiv.org/html/2605.15417)
###### Abstract

In GFlowNets and variational inference, it has been shown that the mean square error between target and model log probabilities is an effective, low variance, surrogate loss for training generative models. This loss has the property that when evaluated *on-policy* its gradients correspond to those of the KL divergence, while *off-policy* it remains a valid loss with the same global minimizer. In this work, we demonstrate that this construction can be extended to the whole family of $f$-divergences, leading to a family of losses whose on-policy gradients are those of the corresponding $f$-divergence, but which retain the same global minimizer off-policy. Specifically, we show that the on-policy gradients lead to a one-to-one correspondence between translation invariant loss functions on the target and model log probabilities, and $f$-divergences. This equivalence allows us to design new surrogate loss functions for tuning a wide class of generative models that inherit the properties of the corresponding $f$-divergence, such as being more mode covering, whilst being applicable to off-policy data. We apply our losses on a range of tasks, including classic synthetic examples, SynFlowNets for molecule discovery, and asynchronous large language model (LLM) tuning, demonstrating that our models retain their predicted properties on- and off-policy in a wide class of generative models.

Machine Learning, ICML

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.15417v1/x1.png)

Figure 1: $f$-divergences include the $\alpha$-divergence family, which contains both Forward KL ($\alpha=0$) and Reverse KL / standard Trajectory Balance ($\alpha=1$) as special cases. Lower $\alpha$ yields a more mode-covering loss, higher $\alpha$ a more mode-seeking one.

The alignment and fine-tuning of generative models, whether they are large language models (LLMs) generating text or Generative Flow Networks (GFlowNets) generating molecular graphs, relies on minimizing the discrepancy between the model's distribution and a target distribution. In Reinforcement Learning (RL) fine-tuning, this objective is often framed as maximizing expected reward subject to a KL-divergence penalty, or equivalently, minimizing a divergence between the policy and the target distribution.

Traditionally, optimizing these objectives involves gradient estimators with high variance, such as REINFORCE (Williams, [1992](https://arxiv.org/html/2605.15417#bib.bib42)), or sophisticated policy gradient methods (e.g. Schulman et al., [2017](https://arxiv.org/html/2605.15417#bib.bib33)). Recently, however, the GFlowNet literature and large-scale LLM tuning in Kimi (Team et al., [2025a](https://arxiv.org/html/2605.15417#bib.bib38)) have used a simpler approach: minimizing the mean squared error (MSE) between the log-probabilities of the model and the target. This has similarities with the K2 estimator of the $\mathbb{KL}$ divergence proposed in Schulman ([2020](https://arxiv.org/html/2605.15417#bib.bib31)), which we term $\mathbb{KL}_{\mathrm{sq}}$. Tang and Munos ([2025](https://arxiv.org/html/2605.15417#bib.bib36)) note that this "squared KL" loss has a surprising property: whilst it is a biased estimator of the $\mathbb{KL}$, its gradients on on-policy data are unbiased estimates of the true KL gradients. Further, as it is built on the MSE, it remains a valid loss function with the same global minimizer when applied to off-policy data (Bartoldson et al., [2025](https://arxiv.org/html/2605.15417#bib.bib2); Tang et al., [2025](https://arxiv.org/html/2605.15417#bib.bib37)).

While the $\mathbb{KL}_{\mathrm{sq}}$ estimator offers stability and off-policy compatibility, by construction it inherits the properties of the reverse KL divergence. In particular, models trained under this objective will typically exhibit "mode-seeking" behavior, where the model collapses onto relatively few high-probability modes rather than covering the full diversity of the target distribution. In many generative tasks, such as drug discovery or exploratory agents, "mode-covering" behavior is often preferred because generative models are used to explore the space of high reward candidates (e.g. all possible drug targets), rather than to seek out one (or a few) particularly good candidate(s).

In this work, we demonstrate that the effectiveness of the $\mathbb{KL}_{\mathrm{sq}}$ loss is not a unique phenomenon restricted to the KL divergence by deriving analogous losses for the entire family of $f$-divergences. We establish a theoretical equivalence showing that *any* translation-invariant loss function on log-probabilities corresponds to a specific $f$-divergence. This allows us to derive a new family of surrogate losses that inherit the specific properties of their parent divergences (e.g., the mode-covering behavior of the Forward KL or Hellinger distance) while retaining the optimization benefits of the $\mathbb{KL}_{\mathrm{sq}}$ estimator: low-variance gradients on-policy and validity off-policy.

We demonstrate the behaviour and utility of this family of loss functions across a variety of tasks. On the Hypergrid task (Bengio et al., [2021](https://arxiv.org/html/2605.15417#bib.bib3)), a synthetic task explicitly designed to test mode coverage, we show that the Hellinger and forward KL losses discover more modes than the reverse KL loss implied by the trajectory balance loss (Malkin et al., [2022b](https://arxiv.org/html/2605.15417#bib.bib20)). We then apply these losses to molecule generation, by modifying the losses in SynFlowNet (Cretu et al., [2025](https://arxiv.org/html/2605.15417#bib.bib8)), to class-conditional sampling in diffusion models, and to asynchronous reinforcement learning on language models. In these settings, we found that we were able to learn higher-entropy policies with mode-covering instances of our family of surrogate loss functions. In summary, our contributions are as follows:

- We derive a general family of translation-invariant surrogate loss functions, $\mathcal{L}_f$, whose expected auto-differentiated gradients match those of the corresponding $f$-divergence on-policy.
- We prove the inverse relationship: any convex, translation-invariant loss on log-probabilities corresponds to minimizing a specific $f$-divergence.
- We generalize the "Vargrad" (Richter et al., [2020](https://arxiv.org/html/2605.15417#bib.bib28)) and batch-wise normalization techniques to arbitrary $f$-divergences, addressing the intractable partition function problem in RL fine-tuning.
- We apply this framework to GFlowNets, showing that our losses generalize the standard Trajectory Balance objective (Malkin et al., [2022a](https://arxiv.org/html/2605.15417#bib.bib19)), and empirically demonstrate on synthetic grids and SynFlowNet molecular discovery tasks that we can control the exploration-exploitation trade-off (mode-seeking vs. mode-covering) simply by changing the surrogate loss function.

## 2 Related work

##### GFlowNets and GFlowNet losses

Bengio et al. ([2021](https://arxiv.org/html/2605.15417#bib.bib3)) introduce GFlowNets and train them via the flow matching loss, which applies directly to a neural network aiming to learn the optimal Markovian flow for a given GFlowNet and operates on the action level. The detailed balance objective was proposed in Bengio et al. ([2023](https://arxiv.org/html/2605.15417#bib.bib4)), which again applies on the action level but parametrises a forward and backward policy distribution. Malkin et al. ([2022a](https://arxiv.org/html/2605.15417#bib.bib19)) propose trajectory balance, which enforces an equality constraint between the forward policy, backward policy, and normalisation constant on the trajectory level and penalises violations with the mean square error. It was shown in Malkin et al. ([2023](https://arxiv.org/html/2605.15417#bib.bib21)) that on-policy training with this loss is equivalent to a form of hierarchical variational inference (Ranganath et al., [2016](https://arxiv.org/html/2605.15417#bib.bib27)) with the $\mathbb{KL}$ divergence. Our results extend these to all $f$-divergences. The Vargrad objective (Richter et al., [2020](https://arxiv.org/html/2605.15417#bib.bib28)) is often used in GFlowNet training; it uses the same square error form as Malkin et al. ([2022a](https://arxiv.org/html/2605.15417#bib.bib19)) but estimates the normalisation constant in a batch-wise fashion. This Vargrad loss comes from the log variance divergence introduced in Nüsken and Richter ([2021](https://arxiv.org/html/2605.15417#bib.bib26)), who also propose a variance-based estimator built on the Pearson $\chi^{2}$ divergence, which is most similar to our work. However, this estimator would not share the property of matching on-policy gradients: as we show in the paper, the correct way to extend this property is to use generalised deviations (Rockafellar et al., [2006](https://arxiv.org/html/2605.15417#bib.bib30)) instead of the variance.
Finally, and most related to our work, Silva et al. ([2024](https://arxiv.org/html/2605.15417#bib.bib34)) discuss using alternative divergence measures for training GFlowNets; however, their work focuses on the on-policy setting, whereas the main exploratory benefits of GFlowNets come from off-policy training.

##### $f$-Divergences and Generative Models

$f$-divergences (Ali and Silvey, [1966](https://arxiv.org/html/2605.15417#bib.bib1); Morimoto, [1963](https://arxiv.org/html/2605.15417#bib.bib23)) have an extensive history in the tuning of probabilistic models (Minka et al., [2005](https://arxiv.org/html/2605.15417#bib.bib22)). In the context of generative models, they have been applied heavily in variational inference (Wang et al., [2018](https://arxiv.org/html/2605.15417#bib.bib41)) as well as to GANs (Nowozin et al., [2016](https://arxiv.org/html/2605.15417#bib.bib25)), and more recently some work applies $f$-divergences to tuning diffusion models (Tang, [2024](https://arxiv.org/html/2605.15417#bib.bib35); Novello et al., [2025](https://arxiv.org/html/2605.15417#bib.bib24)). $f$-divergences have also been applied in the context of LLMs, either directly as an objective (Go et al., [2023](https://arxiv.org/html/2605.15417#bib.bib10); Han et al., [2024](https://arxiv.org/html/2605.15417#bib.bib11)) or as a regulariser ([Huang et al.,](https://arxiv.org/html/2605.15417#bib.bib13)). However, to the best of our knowledge, none of this work demonstrates off-policy validity, an important aspect as much LLM tuning is done asynchronously (Intellect, [2025](https://arxiv.org/html/2605.15417#bib.bib14)) or completely off-policy. $f$-divergences have also been applied extensively to imitation learning (Ke et al., [2020](https://arxiv.org/html/2605.15417#bib.bib16); Ghasemipour et al., [2020](https://arxiv.org/html/2605.15417#bib.bib9)), suggesting our work could be applied to asynchronous distillation of LLMs (Lu and Lab, [2025](https://arxiv.org/html/2605.15417#bib.bib18)).

## 3 Motivation In RL Tuning LLMs

##### RL tuning LLMs and KL gradients

We use $\pi_{\theta}$ to denote the LLM, which is viewed as a policy over sequences of tokens, $\mathbf{y}$, given a prompt $\mathbf{x}$. Reinforcement learning is then used to tune this model by maximizing the following objective for a distribution $P(\mathbf{x})$ over prompts/states $\mathbf{x}$:

$$
\begin{aligned}
\mathcal{J}(\theta) &= \mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}\left[\mathbb{E}_{\mathbf{y}\sim\pi_{\theta}(\mathbf{y}\mid\mathbf{x})}\left[r(\mathbf{x},\mathbf{y})\right]\right] && \text{(1)}\\
&\quad -\beta\,\mathbb{KL}\left(\pi_{\theta}(\cdot\mid\mathbf{x}),\pi_{\mathrm{ref}}(\cdot\mid\mathbf{x})\right). && \text{(2)}
\end{aligned}
$$

This objective is then generally optimised via PPO (Schulman et al., [2017](https://arxiv.org/html/2605.15417#bib.bib33)) or REINFORCE (Williams, [1992](https://arxiv.org/html/2605.15417#bib.bib42)). However, the gradient of the $\mathbb{KL}$ term cannot be obtained via auto-differentiation through a Monte Carlo estimator of the $\mathbb{KL}$, as Tang and Munos ([2025](https://arxiv.org/html/2605.15417#bib.bib36)) noted many open source implementations did at the time. This is because the $\mathbb{KL}$ depends on $\theta$ both through the distribution being sampled from and through the integrand, whereas auto-differentiation only accounts for the latter. The correct gradient is:

$$
\nabla_{\theta}\mathbb{KL}=\mathbb{E}_{\pi_{\theta}(\mathbf{y}\mid\mathbf{x})}\left[\log\frac{\pi_{\theta}(\mathbf{y}\mid\mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y}\mid\mathbf{x})}\,\nabla_{\theta}\log\pi_{\theta}(\mathbf{y}\mid\mathbf{x})\right],
$$

whereas a naive auto-differentiation would give the gradient as:

$$
\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log\frac{\pi_{\theta}(\mathbf{y}\mid\mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y}\mid\mathbf{x})}\right]=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(\mathbf{y}\mid\mathbf{x})\right],
$$

which is zero in expectation. Tang and Munos ([2025](https://arxiv.org/html/2605.15417#bib.bib36)) suggest using the standard REINFORCE estimator of the gradient, but also discuss the squared $\mathbb{KL}$ estimator of Schulman ([2020](https://arxiv.org/html/2605.15417#bib.bib31)). This is denoted $\mathbb{KL}_{\mathrm{sq}}$ and given by:

$$
\mathbb{KL}_{\mathrm{sq}}(\pi_{\theta}\|\pi_{\mathrm{ref}})(\mathbf{x})=\frac{1}{2}\mathbb{E}_{\mathbf{y}\sim\pi_{\theta}}\left[\left(\log\frac{\pi_{\theta}(\mathbf{y}\mid\mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y}\mid\mathbf{x})}\right)^{2}\right],
$$

where we have written this as a function of the conditioning value $\mathbf{x}$. This is itself a biased estimator of the $\mathbb{KL}$; however, it has the property that the naive auto-differentiated gradient is, in expectation, the gradient of the true $\mathbb{KL}$. Specifically,

$$
\mathbb{E}_{\mathbf{y}\sim\pi_{\theta}}\left[\nabla_{\theta}\frac{1}{2}\left(\log\frac{\pi_{\theta}(\mathbf{y}\mid\mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y}\mid\mathbf{x})}\right)^{2}\right]=\nabla_{\theta}\mathbb{KL}(\pi_{\theta}\|\pi_{\mathrm{ref}})(\mathbf{x}).
$$
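The identity above can be checked numerically on a small categorical policy. The following is a minimal sketch, assuming a softmax parameterisation $\pi_\theta=\mathrm{softmax}(\theta)$ over four outcomes; the helper names and the numerical values are illustrative and not taken from the paper. It computes the exact expectation of the auto-differentiated gradient of the squared log-ratio (with the sampling distribution held fixed), compares it with a finite-difference gradient of the true KL, and also evaluates the naive estimator's expected gradient.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def kl(theta, ref):
    """KL(pi_theta || ref) for a softmax-parameterised categorical policy."""
    pi = softmax(theta)
    return sum(p * math.log(p / q) for p, q in zip(pi, ref))

def expected_surrogate_grad(theta, ref):
    """E_{y ~ pi_theta}[grad_theta 0.5 * log(pi_theta(y) / ref(y))^2].

    The sampling distribution is treated as fixed, as autodiff would do.
    """
    pi = softmax(theta)
    k_dim = len(theta)
    grad = [0.0] * k_dim
    for y in range(k_dim):
        delta = math.log(pi[y] / ref[y])
        for k in range(k_dim):
            dlogp = (1.0 if k == y else 0.0) - pi[k]  # d log pi(y) / d theta_k
            grad[k] += pi[y] * delta * dlogp
    return grad

def expected_naive_grad(theta):
    """E_{y ~ pi_theta}[grad_theta log(pi_theta(y) / ref(y))]; ref drops out."""
    pi = softmax(theta)
    k_dim = len(theta)
    return [sum(pi[y] * ((1.0 if k == y else 0.0) - pi[k]) for y in range(k_dim))
            for k in range(k_dim)]

def finite_diff_grad(fn, theta, eps=1e-6):
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        grad.append((fn(up) - fn(down)) / (2.0 * eps))
    return grad

theta = [0.3, -1.2, 0.8, 0.1]           # illustrative parameters
ref = softmax([0.0, 0.5, -0.5, 1.0])    # illustrative reference policy

g_surrogate = expected_surrogate_grad(theta, ref)
g_true = finite_diff_grad(lambda t: kl(t, ref), theta)
max_gap = max(abs(a - b) for a, b in zip(g_surrogate, g_true))
naive = expected_naive_grad(theta)
```

In this toy setting `max_gap` is at the level of finite-difference error, while every entry of `naive` is numerically zero, which mirrors the two statements in the text.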

##### A single loss for off- and on-policy RL from the $\mathbb{KL}$ divergence

We now describe how the $\mathbb{KL}_{\mathrm{sq}}$ estimator leads to a loss that can be used to optimise the objective in Equations (1) and (2) using either off- or on-policy data, as in Tang et al. ([2025](https://arxiv.org/html/2605.15417#bib.bib37)) and Bartoldson et al. ([2025](https://arxiv.org/html/2605.15417#bib.bib2)). First, we have that $\mathcal{J}(\theta)$ may be written as:

$$
\begin{aligned}
\mathcal{J}(\theta) &= \beta\,\mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}\left[\mathbb{E}_{\mathbf{y}\sim\pi_{\theta}(\mathbf{y}\mid\mathbf{x})}\left[\log e^{\beta^{-1}r(\mathbf{x},\mathbf{y})}\right]\right]
-\beta\,\mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}\left[\mathbb{E}_{\mathbf{y}\sim\pi_{\theta}(\mathbf{y}\mid\mathbf{x})}\left[\log\frac{\pi_{\theta}(\mathbf{y}\mid\mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y}\mid\mathbf{x})}\right]\right]\\
&= \beta\,\mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}\left[-\mathbb{KL}\left(\pi_{\theta}\|\pi_{\star}\right)(\mathbf{x})\right]+C,
\end{aligned}
$$

where $C$ is a constant independent of $\theta$ and $\pi_{\star}$ is defined as:

$$
\pi_{\star}(\mathbf{y}\mid\mathbf{x})=\frac{\pi_{\mathrm{ref}}(\mathbf{y}\mid\mathbf{x})\exp\left(\beta^{-1}r(\mathbf{x},\mathbf{y})\right)}{Z(\mathbf{x})},
$$

where $Z(\mathbf{x})=\int\pi_{\mathrm{ref}}(\mathbf{y}\mid\mathbf{x})\exp\left(\beta^{-1}r(\mathbf{x},\mathbf{y})\right)d\mathbf{y}$ is the partition function. Therefore maximisation of the objective in Equations (1) and (2) corresponds to minimising $\mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}\left[\mathbb{KL}\left(\pi_{\theta}\|\pi_{\star}\right)(\mathbf{x})\right]$. If we assume for now that we know the partition function $Z(\mathbf{x})$, then we can make use of the $\mathbb{KL}_{\mathrm{sq}}$ property to define the loss $\mathcal{L}_{\mathbb{KL}_{\mathrm{sq}}}(\mathbf{x},\mathbf{y},\theta)$ as

$$
\mathcal{L}_{\mathbb{KL}_{\mathrm{sq}}}(\mathbf{x},\mathbf{y},\theta)\coloneqq\frac{1}{2}\left(\log\frac{\pi_{\theta}(\mathbf{y}\mid\mathbf{x})}{\pi_{\star}(\mathbf{y}\mid\mathbf{x})}\right)^{2},
$$

where this loss has the property that $\mathbb{E}_{\mathbf{y}\sim\pi_{\theta}(\mathbf{y}\mid\mathbf{x})}\left[\nabla_{\theta}\mathcal{L}_{\mathbb{KL}_{\mathrm{sq}}}(\mathbf{x},\mathbf{y},\theta)\right]=\nabla_{\theta}\mathbb{KL}(\pi_{\theta}\|\pi_{\star})(\mathbf{x})$, so that the on-policy gradients are equal in expectation to the gradients of the $\mathbb{KL}$. This allows us to define an objective:

$$
\tilde{\mathcal{J}}(\theta)=\mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}\left[\mathbb{E}_{\mathbf{y}\sim\pi_{\theta}(\mathbf{y}\mid\mathbf{x})}\left[\mathcal{L}_{\mathbb{KL}_{\mathrm{sq}}}(\mathbf{x},\mathbf{y},\theta)\right]\right],
$$

where the expected auto-differentiated gradients of $\tilde{\mathcal{J}}(\theta)$ are equal, up to the constant factor $-\beta$, to the REINFORCE gradients of $\mathcal{J}(\theta)$. Since $\mathcal{L}_{\mathbb{KL}_{\mathrm{sq}}}$ is simply the squared loss between the logprobs, this has the additional benefit that we can instead sample the completions off-policy from some distribution $\mu$ and set the objective as:

$$
\tilde{\mathcal{J}}_{\mu}(\theta)=\mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}\left[\mathbb{E}_{\mathbf{y}\sim\mu(\mathbf{y}\mid\mathbf{x})}\left[\mathcal{L}_{\mathbb{KL}_{\mathrm{sq}}}(\mathbf{x},\mathbf{y},\theta)\right]\right];
$$

then, if the support of $\pi_{\theta}(\cdot\mid\mathbf{x})$ is contained in that of $\mu(\cdot\mid\mathbf{x})$ for each $\mathbf{x}$, the two objectives share the same minimizer. This follows from the fact that $\tilde{\mathcal{J}}_{\mu}(\theta)$ is minimised by minimising $\mathbb{E}_{\mathbf{y}\sim\mu(\mathbf{y}\mid\mathbf{x})}\left[\mathcal{L}_{\mathbb{KL}_{\mathrm{sq}}}(\mathbf{x},\mathbf{y},\theta)\right]$ for each $\mathbf{x}$, which occurs exactly when $\pi_{\theta}(\mathbf{y}\mid\mathbf{x})=\pi_{\star}(\mathbf{y}\mid\mathbf{x})$ for all $\mathbf{y}$ in the support of $\mu(\cdot\mid\mathbf{x})$.
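The rewriting of $\mathcal{J}(\theta)$ in terms of the tilted target $\pi_\star$, and the off-policy minimizer claim, can both be illustrated on a toy discrete example for a single prompt. This is a minimal sketch with made-up numbers and hypothetical helper names (`tilted_target`, `squared_log_ratio_loss`); substituting the definition of $\pi_\star$ into the objective gives the constant as $C=\beta\log Z(\mathbf{x})$ in this single-prompt setting, which the code checks numerically.

```python
import math

def tilted_target(pi_ref, rewards, beta):
    """Return (pi_star, Z) with pi_star(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z

def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def squared_log_ratio_loss(pi_theta, pi_star, mu):
    """E_{y ~ mu}[0.5 * (log pi_theta(y) - log pi_star(y))^2] for any sampler mu."""
    return sum(m * 0.5 * math.log(p / s) ** 2
               for m, p, s in zip(mu, pi_theta, pi_star))

# Illustrative values only.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 2.0]
beta = 0.5
pi_theta = [0.2, 0.1, 0.7]
mu = [0.4, 0.4, 0.2]  # off-policy sampling distribution

pi_star, z = tilted_target(pi_ref, rewards, beta)

# Reward minus beta * KL to the reference ...
objective = sum(p * r for p, r in zip(pi_theta, rewards)) - beta * kl(pi_theta, pi_ref)
# ... equals -beta * KL to the tilted target plus the constant beta * log Z.
rewritten = -beta * kl(pi_theta, pi_star) + beta * math.log(z)

loss_at_target = squared_log_ratio_loss(pi_star, pi_star, mu)
loss_elsewhere = squared_log_ratio_loss(pi_theta, pi_star, mu)
```

Here `objective` and `rewritten` agree to floating-point precision, and the off-policy loss is zero at $\pi_\theta=\pi_\star$ and positive at the other policy, even though the samples are weighted by `mu` rather than by the policy.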

##### Unnormalised Target Distribution

If $Z(\mathbf{x})$ is not known, which is commonly the case, we can use a batch-based estimate for the normalisation constant, which recovers the VarGrad estimator (Richter et al., [2020](https://arxiv.org/html/2605.15417#bib.bib28)) of the $\mathbb{KL}$ gradient. That is, for a sampled batch of completions $\mathcal{B}=\{\mathbf{y}_{1},\dots,\mathbf{y}_{B}\}$ for each $\mathbf{x}$, we can estimate $\log Z(\mathbf{x})$ as:

$$
\widehat{\log Z(\mathbf{x})}=\frac{1}{B}\sum_{i=1}^{B}\left(\frac{r(\mathbf{x},\mathbf{y}_{i})}{\beta}+\log\frac{\pi_{\mathrm{ref}}(\mathbf{y}_{i}\mid\mathbf{x})}{\pi_{\theta}(\mathbf{y}_{i}\mid\mathbf{x})}\right)\tag{3}
$$

to form the batch-wise Vargrad loss as:

$$
\mathcal{L}^{\mathrm{VG}}_{\mathrm{KL}}(\mathcal{B},\theta)=\frac{1}{B}\sum_{i=1}^{B}\left[\log\frac{\mathrm{SG}\left[\widehat{Z}(\mathbf{x})\right]\pi_{\theta}(\mathbf{y}_{i}\mid\mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y}_{i}\mid\mathbf{x})\,e^{r(\mathbf{x},\mathbf{y}_{i})/\beta}}\right]^{2},
$$

where $\mathrm{SG}\left[\cdot\right]$ is the stop gradient function.
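The batch-wise estimate and loss above can be sketched in a few lines. This is an illustrative value-only implementation for a single prompt, with a hypothetical function name and made-up log-probabilities; in an autodiff framework the estimate `log_z_hat` would be wrapped in a stop-gradient, which plain Python does not need since no gradients are taken here.

```python
def vargrad_loss(logp_theta, logp_ref, rewards, beta):
    """Batch-wise VarGrad-style loss for one prompt.

    Returns (loss, log_z_hat, residuals). Gradients are not computed here;
    log_z_hat plays the role of the stop-gradient estimate of log Z.
    """
    batch = len(rewards)
    # Equation (3): per-sample terms whose batch mean estimates log Z.
    scores = [r / beta + lr - lt
              for r, lr, lt in zip(rewards, logp_ref, logp_theta)]
    log_z_hat = sum(scores) / batch
    # Log-ratio inside the square:
    # log(Z_hat * pi_theta / (pi_ref * exp(r / beta))) = log_z_hat - score_i.
    residuals = [log_z_hat - s for s in scores]
    loss = sum(res * res for res in residuals) / batch
    return loss, log_z_hat, residuals

# Illustrative batch of four completions.
logp_theta = [-1.2, -0.7, -2.3, -1.6]
logp_ref = [-1.0, -1.1, -2.0, -1.4]
rewards = [0.5, 1.0, 0.0, 2.0]

loss, log_z_hat, residuals = vargrad_loss(logp_theta, logp_ref, rewards, beta=1.0)

# The same value written as a (biased, 1/B) batch variance of the scores.
scores = [r / 1.0 + lr - lt for r, lr, lt in zip(rewards, logp_ref, logp_theta)]
mean_score = sum(scores) / len(scores)
batch_variance = sum((s - mean_score) ** 2 for s in scores) / len(scores)
```

Two properties follow directly from the formulas: the residuals are centred (they sum to zero over the batch), and the loss value equals the batch variance of the per-sample scores, in line with the variance-based origin of VarGrad noted in the related work.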

A version of this loss was applied in Kimi K2 (Team et al., [2025a](https://arxiv.org/html/2605.15417#bib.bib38)) and K1.5 (Team et al., [2025b](https://arxiv.org/html/2605.15417#bib.bib39)), where the generating policy, $\theta_{\mathrm{old}}$, is used as the reference distribution and everything is scaled by $\beta$. This gives the following loss:

$$
\mathcal{L}^{\mathrm{K2}}(\mathcal{B},\theta)=\frac{1}{B}\sum_{i=1}^{B}\left[\beta\log\frac{Z\,\pi_{\theta}(\mathbf{y}_{i}\mid\mathbf{x})}{\pi_{\theta_{\mathrm{old}}}(\mathbf{y}_{i}\mid\mathbf{x})}-r(\mathbf{x},\mathbf{y}_{i})\right]^{2}.
$$

These papers use the mean reward, $\bar{r}(\mathbf{x})=\frac{1}{B}\sum_{i=1}^{B}r(\mathbf{x},\mathbf{y}_{i})$, as an estimate for $\beta\log Z$ instead of Equation (3). This can be viewed as assuming that $0\approx\beta\left(\log\pi_{\theta}(\mathbf{y}_{i}\mid\mathbf{x})-\log\pi_{\theta_{\mathrm{old}}}(\mathbf{y}_{i}\mid\mathbf{x})\right)$, which holds exactly if the baseline is on-policy, i.e. $\theta=\theta_{\mathrm{old}}$, and approximately if the baseline is slightly off-policy or $\beta$ is small. This has the advantage of not needing a forward pass over the whole batch to calculate $\widehat{\log Z}$.
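The relationship between the mean-reward baseline and the batch estimate of Equation (3) can be made concrete with a small sketch. The function names and numbers below are hypothetical; the code only evaluates the formulas in the text, with $\pi_{\theta_{\mathrm{old}}}$ as the reference, and is not taken from the Kimi implementations.

```python
def mean_reward_baseline_loss(logp_theta, logp_old, rewards, beta):
    """Squared loss with the batch-mean reward standing in for beta * log Z."""
    batch = len(rewards)
    r_bar = sum(rewards) / batch
    residuals = [r_bar + beta * (lt - lo) - r
                 for lt, lo, r in zip(logp_theta, logp_old, rewards)]
    return sum(res * res for res in residuals) / batch

def scaled_log_z_estimate(logp_theta, logp_old, rewards, beta):
    """beta times the Equation (3) estimate, with pi_old as the reference."""
    batch = len(rewards)
    return sum(r + beta * (lo - lt)
               for r, lo, lt in zip(rewards, logp_old, logp_theta)) / batch

# Illustrative values only.
rewards = [0.5, 1.0, 0.0, 2.0]
logp_old = [-1.0, -1.1, -2.0, -1.4]
logp_new = [-1.05, -1.0, -2.2, -1.3]   # slightly off-policy parameters
beta = 0.1
r_bar = sum(rewards) / len(rewards)

# On-policy (theta = theta_old) the two estimates coincide.
on_policy_gap = abs(scaled_log_z_estimate(logp_old, logp_old, rewards, beta) - r_bar)
# Off-policy the gap is beta times the mean log-ratio, so it shrinks with beta.
off_policy_gap = abs(scaled_log_z_estimate(logp_new, logp_old, rewards, beta) - r_bar)

on_policy_loss = mean_reward_baseline_loss(logp_old, logp_old, rewards, beta)
```

With these numbers the on-policy gap is zero and the off-policy gap is $\beta$ times the batch-mean log-ratio, which is small when either the policies are close or $\beta$ is small, as described above.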

## 4 Generalising Vargrad to Arbitrary Deviations and $f$-divergences

In this section we show that the approaches of the previous section can be generalised to arbitrary $f$-divergences. Specifically, we can start from any $f$-divergence and construct a surrogate loss $\mathcal{L}_{f}$ whose gradients match those of the $f$-divergence on-policy and which remains a valid loss with the same minimizer off-policy. Moreover, we show the reverse implication, that using any translation invariant loss on the logprobs of the current and target distribution corresponds to minimising a corresponding $f$-divergence, and that these maps are the inverse of each other. First we give the definition of an $f$-divergence. Throughout we drop the conditioning on $\mathbf{x}$ for clarity.

###### Definition 4\.1\.

Let $f$ be a convex function with $f(1)=0$. The associated $f$-divergence is then defined as:

$$
\mathcal{D}_{f}(p\|q)=\mathbb{E}_{\mathbf{y}\sim q(\mathbf{y})}\left[f\left(\frac{p(\mathbf{y})}{q(\mathbf{y})}\right)\right].
$$

We add the additional requirements that $f'(1)=1$ and $f''(1)=1$ so that the scale of all $f$-divergences is standardised. This can always be achieved, as $\tilde{f}(x)=\lambda_{1}f(x)+\lambda_{2}(x-1)$ is convex with $\tilde{f}(1)=0$ for $\lambda_{i}\geq 0$, and $\mathcal{D}_{\tilde{f}}(p\|q)=\lambda_{1}\mathcal{D}_{f}(p\|q)$, so that the rescaling does not change the shape of the divergence.

$f$-divergences are not symmetric, and throughout we will write the model first and the target second, as $\mathcal{D}_{f}(p_{\theta}\|p_{\star})$. For any convex function $f$, the function $\tilde{f}(t)=t\,f\left(\frac{1}{t}\right)$ is also convex, and the reverse divergence $\mathcal{D}_{f}(p_{\star}\|p_{\theta})$ is equal to $\mathcal{D}_{\tilde{f}}(p_{\theta}\|p_{\star})$, so it can always be written in this form. Given this, the $\mathbb{KL}$ discussed in Section 3 is from now on referred to as the reverse $\mathbb{KL}$ and corresponds to choosing $f(t)=t\log t$. The gradient of an $f$-divergence is given by the log derivative trick as:

$$
\nabla_{\theta}\mathcal{D}_{f}(p_{\theta}\|p_{\star})=\mathbb{E}_{\mathbf{y}\sim p_{\theta}}\left[f'\left(\frac{p_{\theta}(\mathbf{y})}{p_{\star}(\mathbf{y})}\right)\nabla_{\theta}\log p_{\theta}(\mathbf{y})\right].
$$
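The gradient identity above follows in a few lines from the definition of the $f$-divergence. The derivation below is a sketch that assumes the gradient and integral can be interchanged and that $p_\star>0$ wherever $p_\theta>0$:

```latex
\begin{aligned}
\nabla_{\theta}\mathcal{D}_{f}(p_{\theta}\|p_{\star})
  &= \nabla_{\theta}\int p_{\star}(\mathbf{y})\,
     f\!\left(\frac{p_{\theta}(\mathbf{y})}{p_{\star}(\mathbf{y})}\right)d\mathbf{y} \\
  &= \int f'\!\left(\frac{p_{\theta}(\mathbf{y})}{p_{\star}(\mathbf{y})}\right)
     \nabla_{\theta}p_{\theta}(\mathbf{y})\,d\mathbf{y} \\
  &= \int p_{\theta}(\mathbf{y})\,
     f'\!\left(\frac{p_{\theta}(\mathbf{y})}{p_{\star}(\mathbf{y})}\right)
     \nabla_{\theta}\log p_{\theta}(\mathbf{y})\,d\mathbf{y} \\
  &= \mathbb{E}_{\mathbf{y}\sim p_{\theta}}\left[
     f'\!\left(\frac{p_{\theta}(\mathbf{y})}{p_{\star}(\mathbf{y})}\right)
     \nabla_{\theta}\log p_{\theta}(\mathbf{y})\right].
\end{aligned}
```

The second line uses the chain rule (the factor $p_\star$ cancels against the $1/p_\star$ from differentiating the ratio), and the third uses $\nabla_\theta p_\theta = p_\theta\nabla_\theta\log p_\theta$.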
Given this, we define the following loss function, which generalises the $\mathbb{KL}_{\mathrm{sq}}$ loss to arbitrary $f$-divergences in the sense that its gradients are equivalent on-policy while it remains a valid off-policy loss:

###### Proposition 4\.2\.

Let $\Delta_{\theta}(\mathbf{y})=\log p_{\theta}(\mathbf{y})-\log p_{\star}(\mathbf{y})$. Then the function $\mathcal{L}_{f}:\mathbb{R}\to\mathbb{R}$ given by:

$$
\boxed{\mathcal{L}_{f}(\Delta_{\theta}(\mathbf{y}))=\int_{0}^{\Delta_{\theta}(\mathbf{y})}\left(f'(\exp(t))-f'(1)\right)dt}
$$

is a translation invariant loss function that generalises the construction in Section 3 in that:

1. The population loss between the log probabilities of the target and current distribution, $\mathcal{J}_{\mu,f}(\theta)=\mathbb{E}_{\mathbf{y}\sim\mu}\left[\mathcal{L}_{f}\left(\Delta_{\theta}(\mathbf{y})\right)\right]$, is minimised when $p_{\theta}=p_{\star}$, so long as the support of $p_{\star}$ is contained in the support of $\mu$.
2. If the samples are generated on-policy ($\mu=p_{\theta}$), the expected auto-differentiated gradients of $\mathcal{J}_{\mu,f}(\theta)$ correspond to the gradients of the associated $f$-divergence: $\nabla_{\theta}\mathcal{D}_{f}(p_{\theta}\|p_{\star})=\mathbb{E}_{\mathbf{y}\sim p_{\theta}}\left[\nabla_{\theta}\mathcal{L}_{f}\left(\Delta_{\theta}(\mathbf{y})\right)\right]$.
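The second property can be checked numerically for generators other than the reverse KL. The sketch below assumes a softmax-parameterised categorical $p_\theta$ and two illustrative generators that satisfy $f(1)=0$, $f'(1)=1$ (forward KL) or $f'(1)=0$ (Hellinger-type), each with $f''(1)=1$; these particular forms are worked out from the definitions in this section and may differ in normalisation from those used in the paper. Since $\frac{d}{d\Delta}\mathcal{L}_f(\Delta)=f'(e^{\Delta})-f'(1)$, the expected auto-differentiated gradient can be computed exactly.

```python
import math

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def f_divergence(theta, target, f):
    """D_f(p_theta || p_star) = E_{y ~ p_star}[f(p_theta(y) / p_star(y))]."""
    p = softmax(theta)
    return sum(q * f(pi / q) for pi, q in zip(p, target))

def expected_surrogate_grad(theta, target, f_prime):
    """E_{y ~ p_theta}[grad_theta L_f(Delta_theta(y))], sampler held fixed."""
    p = softmax(theta)
    k_dim = len(theta)
    grad = [0.0] * k_dim
    for y in range(k_dim):
        delta = math.log(p[y]) - math.log(target[y])
        coeff = f_prime(math.exp(delta)) - f_prime(1.0)  # dL_f / dDelta
        for k in range(k_dim):
            grad[k] += p[y] * coeff * ((1.0 if k == y else 0.0) - p[k])
    return grad

def finite_diff_grad(fn, theta, eps=1e-6):
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        grad.append((fn(up) - fn(down)) / (2.0 * eps))
    return grad

# (f, f') pairs; illustrative standardised generators.
generators = {
    "forward_kl": (lambda u: -math.log(u) + 2.0 * (u - 1.0),
                   lambda u: 2.0 - 1.0 / u),
    "hellinger": (lambda u: 2.0 * (math.sqrt(u) - 1.0) ** 2,
                  lambda u: 2.0 * (1.0 - 1.0 / math.sqrt(u))),
}

theta = [0.3, -1.2, 0.8, 0.1]
target = softmax([1.0, 0.2, -0.4, 0.6])

gaps = {}
for name, (f, f_prime) in generators.items():
    g_surrogate = expected_surrogate_grad(theta, target, f_prime)
    g_div = finite_diff_grad(lambda t: f_divergence(t, target, f), theta)
    gaps[name] = max(abs(a - b) for a, b in zip(g_surrogate, g_div))
```

For both generators the gap between the expected surrogate gradient and the finite-difference gradient of the divergence is at the level of numerical error in this toy setting.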

Applying this construction to the reverse $\mathbb{KL}$ divergence directly recovers the loss in Section 3:

###### Example 4\.3\.

For the reverse $\mathbb{KL}$ divergence we have $f(u)=u\log(u)$. This gives $f'(u)=\log(u)+1$, $f'(1)=1$, and:

$$
\begin{aligned}
\mathcal{L}_{u\log(u)}(\Delta_{\theta}(\mathbf{y})) &= \int_{0}^{\Delta_{\theta}(\mathbf{y})}\log(\exp(t))\,dt\\
&= \frac{1}{2}\left(\Delta_{\theta}(\mathbf{y})\right)^{2}.
\end{aligned}
$$
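The same integral can be evaluated for other generators. The sketch below computes $\mathcal{L}_f$ by numerical quadrature and compares it with closed forms worked out from the boxed formula: $\frac{1}{2}\Delta^2$ for the reverse KL, $e^{-\Delta}+\Delta-1$ for a forward-KL generator with $f'(u)=2-1/u$, and $2\Delta+4e^{-\Delta/2}-4$ for a Hellinger-type generator with $f'(u)=2(1-u^{-1/2})$. The last two generators and the function names are illustrative assumptions rather than forms quoted from the paper.

```python
import math

def surrogate_loss(f_prime, delta, steps=2000):
    """L_f(delta) = integral from 0 to delta of (f'(exp(t)) - f'(1)) dt.

    Uses composite Simpson's rule with a signed step, so negative delta works.
    `steps` must be even.
    """
    if delta == 0.0:
        return 0.0
    h = delta / steps
    integrand = lambda t: f_prime(math.exp(t)) - f_prime(1.0)
    total = integrand(0.0) + integrand(delta)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return total * h / 3.0

f_primes = {
    "reverse_kl": lambda u: math.log(u) + 1.0,
    "forward_kl": lambda u: 2.0 - 1.0 / u,
    "hellinger": lambda u: 2.0 * (1.0 - 1.0 / math.sqrt(u)),
}

closed_forms = {
    "reverse_kl": lambda d: 0.5 * d * d,
    "forward_kl": lambda d: math.exp(-d) + d - 1.0,
    "hellinger": lambda d: 2.0 * d + 4.0 * math.exp(-0.5 * d) - 4.0,
}

deltas = [-3.0, -1.0, -0.1, 0.5, 2.0]
max_err = max(abs(surrogate_loss(f_primes[name], d) - closed_forms[name](d))
              for name in f_primes for d in deltas)

# Asymmetry of the forward-KL surrogate around zero.
under = closed_forms["forward_kl"](-3.0)  # model assigns too little mass
over = closed_forms["forward_kl"](3.0)    # model assigns too much mass
```

Under these closed forms the forward-KL surrogate grows exponentially when $\Delta_\theta$ is very negative (the model under-weights a region of the target) but only linearly when it is very positive, which is in line with the mode-covering behaviour attributed to it earlier; the reverse-KL surrogate is symmetric in $\Delta_\theta$.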

In fact, this result is not specific to the square error loss and works for any *translation invariant loss function*. Specifically, let $\ell:\mathbb{R}\to\mathbb{R}$ be some strictly convex, differentiable function that is minimized at $0$, with $\ell(y-\hat{y})$ giving the loss between the prediction $\hat{y}$ and the target $y$. For all such loss functions we can show the reverse result: using $\ell(\cdot)$ on the difference between target and model logprobs can be associated with a particular $f$-divergence through its on-policy gradients. Moreover, these maps are inverses of each other:

###### Proposition 4\.4\.

Let $\ell:\mathbb{R}\to\mathbb{R}$ be a translation invariant loss function. Then the objective

$$\tilde{\mathcal{J}}_{\mu,\ell}(\theta)=\mathbb{E}_{\mathbf{y}\sim\mu}\Big[\ell\big(\Delta_{\theta}(\mathbf{y})\big)\Big]$$

is minimised when $p_{\theta}=p_{\star}$ so long as the support of $p_{\star}$ is contained in the support of $\mu$. Further, it has the property that its auto-differentiated gradients correspond to the gradients of a corresponding $f$-divergence:

$$\mathbb{E}_{\mathbf{y}\sim p_{\theta}}\left[\nabla_{\theta}\ell(\Delta_{\theta}(\mathbf{y}))\right]=\nabla_{\theta}\mathcal{D}_{f_{\ell}}\left(p_{\theta}\|p_{\star}\right),$$

where $f_{\ell}$ is defined as

$$f_{\ell}(u)=\lambda_{1}\int_{1}^{u}\ell'(\log t)\,dt+\lambda_{2}(u-1)+c,$$

for $\lambda_{i},c\in\mathbb{R}$ chosen to satisfy $f_{\ell}(1)=0$, $f'_{\ell}(1)=1$, and $f''_{\ell}(1)=1$. Moreover, this mapping is the inverse of that in Proposition refprop:loss\_form in the sense that $f_{\mathcal{L}_{f}}=f$ and $\mathcal{L}_{f_{\ell}}=\ell$.

The boundary conditions here ensure that the maps are inverses of each other, but the gradient equivalence holds for all $\lambda_{i},c$. However, this choice of parameter values does ensure that the gradient is equivalently scaled across divergences as the model approaches convergence. Appendix refap:KL\_from\_sq shows that starting from the squared loss we return to the $\mathbb{KL}$.
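As a short worked instance of the map in Proposition 4.4, the following sketch starts from the squared loss $\ell(\Delta)=\frac{1}{2}\Delta^{2}$ and applies the three boundary conditions. It is a reconstruction from the definitions above and may differ in presentation from the appendix derivation.

```latex
\begin{aligned}
\ell(\Delta) &= \tfrac{1}{2}\Delta^{2}, \qquad \ell'(\Delta) = \Delta, \\
f_{\ell}(u) &= \lambda_{1}\int_{1}^{u}\log t\,dt + \lambda_{2}(u-1) + c
             = \lambda_{1}\,(u\log u - u + 1) + \lambda_{2}(u-1) + c, \\
f_{\ell}(1) &= c = 0, \qquad
f'_{\ell}(1) = \lambda_{1}\log 1 + \lambda_{2} = \lambda_{2} = 1, \qquad
f''_{\ell}(1) = \lambda_{1} = 1, \\
f_{\ell}(u) &= (u\log u - u + 1) + (u - 1) = u\log u .
\end{aligned}
```

This is the generator used for the reverse $\mathbb{KL}$ in Example 4.3, so the two maps compose to the identity in this case.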

### 4.1 DevGrad Loss for Batch-Wise Normalisation

We now introduce the generalisation of the Vargrad loss, which can be applied to unnormalised target distributions. We refer to this as the *DevGrad* loss, as it corresponds to replacing the batch-wise variance with the batch-wise *generalised deviation* (Rockafellar et al., [2006](https://arxiv.org/html/2605.15417#bib.bib30); Rockafellar and Uryasev, [2013](https://arxiv.org/html/2605.15417#bib.bib29)) of the difference in logprobs. This also has the property of centring the score function coefficients in the batch, which leads to reduced variance even if we are working with a normalised target distribution.

Let the target density be of the form $p_{\star}(\mathbf{y})=\frac{1}{Z}\exp(\mathcal{R}(\mathbf{y}))$, where $\mathcal{R}(\mathbf{y})$ is the reward function or (negative) energy function and $Z=\int\exp(\mathcal{R}(\mathbf{y}))\,d\mathbf{y}$ is the partition function. The partition function remains intractable, but we can use the same approach as Vargrad and estimate it as:

$$\widehat{\log Z}=\operatorname*{arg\,min}_{C}\tfrac{1}{B}\sum_{i}\mathcal{L}_{f}\Big(\Delta(\mathbf{y}_{i})+C\Big),\tag{4}$$
where $\Delta(\mathbf{y}_{i})=\log p_{\theta}(\mathbf{y}_{i})-\mathcal{R}(\mathbf{y}_{i})$ is the unnormalised difference in logprobs/energy functions. Given this we define the *DevGrad* loss as follows:

###### Definition 4\.5\.

Given a loss function, $\mathcal{L}_{f}$, defined as in Section refsec:f\_div\_loss, we define the batch-wise DevGrad loss for a batch $\mathcal{B}=\{\mathbf{y}_{1},\dots,\mathbf{y}_{B}\}$ as:

$$\mathcal{L}^{\mathrm{DG}}_{f}(\mathcal{B},\theta)=\tfrac{1}{B}\sum_{i}\mathcal{L}_{f}\Big(\Delta(\mathbf{y}_{i})+\mathrm{SG}\left[\widehat{\log Z}\right]\Big),$$
where $\widehat{\log Z}$ satisfies Equation (4) and $\mathrm{SG}$ denotes the stop-gradient operator.

Again, this corresponds to taking the batch-wise *generalised deviation* (Rockafellar et al., [2006](https://arxiv.org/html/2605.15417#bib.bib30); Rockafellar and Uryasev, [2013](https://arxiv.org/html/2605.15417#bib.bib29)) of the batch-wise difference in logprobs. When $\mathcal{L}_{f}(y)=y^{2}$ this recovers the variance, but for other $\mathcal{L}_{f}$ we recover different deviations, such as the mean absolute deviation around the median when $\phi_{f}(y)=\lvert y\rvert$, which in terms of $f$-divergences corresponds to the total variation. As we show in Appendix refap:grad\_var, using $\widehat{\log Z}$ also centres the batch-wise score function coefficients, which leads to variance reduction in the same way as Vargrad.
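A minimal sketch of the batch-wise construction follows, reading $\widehat{\log Z}$ in Equation (4) as the minimising shift $C$. The bisection solver, the bracket `[-50, 50]`, the function names, and the batch values are all assumptions made for illustration. In an autodiff framework the returned shift would additionally be wrapped in a stop-gradient.

```python
import statistics

def minimising_shift(deltas, loss_grad, lo=-50.0, hi=50.0, iters=200):
    """Bisection for the C that minimises mean(loss(delta_i + C)) for a convex loss.

    The mean derivative is non-decreasing in C, so its sign change locates the minimiser.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean_grad = sum(loss_grad(d + mid) for d in deltas) / len(deltas)
        if mean_grad > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def devgrad_loss(deltas, loss, loss_grad):
    # log_z_hat plays the role of SG[log Z-hat]; pure Python has no gradients to stop.
    log_z_hat = minimising_shift(deltas, loss_grad)
    value = sum(loss(d + log_z_hat) for d in deltas) / len(deltas)
    return value, log_z_hat

def sign(y):
    return (y > 0) - (y < 0)

# Hypothetical batch of unnormalised differences log p_theta(y_i) - R(y_i).
deltas = [0.4, -1.0, 2.5, 0.1, 0.5]

# Squared loss: the shift is minus the batch mean and the loss is the batch variance.
sq_value, sq_shift = devgrad_loss(deltas, lambda y: y * y, lambda y: 2.0 * y)

# Absolute loss: the shift is minus the batch median (mean absolute deviation around the median).
abs_value, abs_shift = devgrad_loss(deltas, abs, sign)
```

For losses with a closed-form minimiser, such as the squared loss, the bisection can be replaced by the batch mean directly. The generic solver is shown only to make the role of the deviation explicit.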

### 4.2 Tempered Loss

If the posterior probabilities are defined by the reward tilted distribution as:

$$\pi_{\star}(\mathbf{y}\mid\mathbf{x})=\frac{1}{Z(\mathbf{x})}\pi_{\mathrm{ref}}(\mathbf{y}\mid\mathbf{x})\exp\left(\beta^{-1}r(\mathbf{x},\mathbf{y})\right),\tag{5}$$
for small $\beta$ we can run into problems of exploding loss and therefore gradients due to the $\exp\left(\beta^{-1}r(\mathbf{x},\mathbf{y})\right)$ term. For example, if the reward is bounded in $[0,1]$ and $\beta=0.005$, the unnormalised probabilities can be of magnitude $e^{200}$, causing significant numerical overflow. Alternatively, if we were to use clipping we would end up losing a significant amount of signal. For example, clipping the exponential at values greater than $e^{20}$ would leave us unable to differentiate between samples with rewards greater than $0.1$.
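The arithmetic behind this example can be sketched in a few lines. The reward values and variable names below are made up for illustration. The float32 maximum is included only as a reference point for the overflow claim.

```python
import math

beta = 0.005
max_exponent = 20.0                   # clip exp(.) at e^20, as in the example above
rewards = [0.05, 0.5, 1.0]            # hypothetical rewards in [0, 1]

exponents = [r / beta for r in rewards]                 # beta^{-1} r, up to about 200
clipped = [min(e, max_exponent) for e in exponents]     # what survives clipping

FLOAT32_MAX = 3.4028234663852886e38   # largest finite single-precision value
largest_weight = math.exp(exponents[-1])                # about e^200

# After clipping, rewards 0.5 and 1.0 map to the same exponent and become indistinguishable.
indistinguishable = clipped[1] == clipped[2]
```

Here the largest unnormalised weight is far above the single-precision range, while clipping collapses every reward above $0.1$ onto the same value.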

To resolve this, we first define the tempered distribution:

###### Definition 4\.6\.

For a distribution $p(\mathbf{y}|\mathbf{x})$, we define the tempered distribution $\tilde{p}_{\beta}(\mathbf{y}|\mathbf{x})$ as the distribution that is proportional to $p^{\beta}(\mathbf{y}|\mathbf{x})$. In terms of energy functions we have that if $p(\mathbf{y}|\mathbf{x})=\frac{1}{Z(\mathbf{x})}\exp(\mathcal{R}(\mathbf{y}|\mathbf{x}))$ then:

$$p^{\beta}(\mathbf{y}|\mathbf{x})=\frac{1}{\tilde{Z}(\mathbf{x})}\exp(\beta\mathcal{R}(\mathbf{y}|\mathbf{x})).\tag{6}$$

Now, as mentioned above, for the tilted distribution $\pi_{\star}$ given in Equation (5), we have the problem that the magnitude of the energy function $\mathcal{R}_{\star}(\mathbf{y}|\mathbf{x})$ scales with $\frac{1}{\beta}$, making training with small $\beta$ impractical. However, the energy function of the tempered distribution is $\beta\mathcal{R}_{\star}(\mathbf{y}|\mathbf{x})=\beta\log\pi_{\mathrm{ref}}+r(\mathbf{x},\mathbf{y})$, which is bounded in magnitude as $\beta\to 0$. We also have that if the tempered target and model logprobs are equal, the untempered logprobs must also be equal. Given this, we can now define the tempered loss as follows:

###### Definition 4\.7\.

We define the tempered loss relative to an $f$-divergence with generator $f$ as:

$$\tilde{\mathcal{L}}_{f,\beta}(\Delta_{\theta}(\mathbf{y}))=\frac{1}{\beta}\mathcal{L}_{f}(\beta\Delta_{\theta}(\mathbf{y})),\tag{7}$$
where $\Delta_{\theta}(\mathbf{y})=\log\pi_{\theta}(\mathbf{y})-\mathcal{R}_{\star}(\mathbf{y}|\mathbf{x})$.

The scaling by $\frac{1}{\beta}$ is to ensure the gradient size doesn't vary with $\beta$. The tempered loss formulation also provides a new viewpoint on the KIMI loss (Team et al., [2025a](https://arxiv.org/html/2605.15417#bib.bib38), [b](https://arxiv.org/html/2605.15417#bib.bib39)), as it is the tempered loss of the reverse $\mathbb{KL}$ loss:

###### Example 4\.8\.

Let $f(u)=u\log u$, then we have the tempered loss is equal to:

$$\tilde{\mathcal{L}}_{f,\beta}(\Delta_{\theta}(\mathbf{y}))=\frac{1}{2\beta}\mathbb{E}_{\mathbf{y}\sim\pi_{\theta}}\left[\left(\beta\log\frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\mathcal{R}(\mathbf{y}|\mathbf{x})}-\log\tilde{Z}\right)^{2}\right],$$
which leads to the batch-wise normalisation $\widehat{\log Z}=\bar{r}$ used by KIMI under the assumption that $0\approx\beta\left(\log\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})-\log\pi_{\theta_{\mathrm{ref}}}(\mathbf{y}_{i}|\mathbf{x})\right)$, as discussed in Section refsec:KL\_loss.
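The effect of the tempering in Definition 4.7 can be sketched numerically. The example below uses the loss $e^{\Delta}-\Delta-1$, which is the $\alpha=2$ member of the $\alpha$-divergence losses given in Section 6. The value of `delta` and the helper names are assumptions chosen for illustration.

```python
import math

def pearson_loss(delta):
    # alpha = 2 member of the alpha-divergence losses: e^Delta - Delta - 1
    return math.exp(delta) - delta - 1.0

def pearson_loss_grad(delta):
    return math.exp(delta) - 1.0

def tempered(loss, beta):
    # Definition 4.7: (1 / beta) * loss(beta * Delta)
    return lambda delta: loss(beta * delta) / beta

beta = 0.005
delta = 150.0   # hypothetical untempered log-prob gap, of order 1 / beta

# Untempered: the derivative with respect to Delta is about e^150.
untempered_grad = pearson_loss_grad(delta)

# Tempered: d/dDelta of (1 / beta) * L(beta * Delta) equals L'(beta * Delta), which stays O(1).
tempered_grad = pearson_loss_grad(beta * delta)
tempered_pearson = tempered(pearson_loss, beta)
```

The chain rule cancels the $\frac{1}{\beta}$ prefactor against the inner factor $\beta$, which is why the gradient scale does not depend on $\beta$ beyond the argument $\beta\Delta$.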

## 5 Generalisation to Generative Flow Networks

We now apply the novel loss family detailed in Section refsec:f\_div\_loss to Generative Flow Networks (GFlowNets) (Bengio et al., [2021](https://arxiv.org/html/2605.15417#bib.bib3), [2023](https://arxiv.org/html/2605.15417#bib.bib4)). GFlowNets are a family of probabilistic methods that amortise sampling from spaces with compositional structure. One of the key benefits of GFlowNets is that they allow for both on- and off-policy training, with exploratory benefits from off-policy samples.

In the on-policy case there exist partial equivalences between GFlowNets and hierarchical variational inference (HVI) (Malkin et al., [2023](https://arxiv.org/html/2605.15417#bib.bib21)), in the sense that the expected gradients with certain losses (Malkin et al., [2022a](https://arxiv.org/html/2605.15417#bib.bib19)) match the expected gradients from performing HVI with the KL. We demonstrate that our losses naturally extend these equivalences to all other $f$-divergences and provide a new family of losses for training GFlowNets on- and off-policy. First, we provide a brief background on GFlowNets, sticking to discrete GFlowNets for simplicity, though GFlowNets have been extended to the continuous case (Lahlou et al., [2023](https://arxiv.org/html/2605.15417#bib.bib17)).

### 5.1 Background: GFlowNets, Trajectory Balance, and Variational Inference

Following (Bengio et al., [2021](https://arxiv.org/html/2605.15417#bib.bib3)), we define a DAG $\mathcal{G}=(\mathcal{S},\mathcal{A})$ with states $\mathcal{S}$, actions $\mathcal{A}$, initial state $\mathbf{s}_{0}$, and terminal states $\mathcal{X}$. A complete trajectory $\tau=(\mathbf{s}_{0}\to\cdots\to\mathbf{s}_{n})$ ends at $\mathbf{x}_{\tau}\in\mathcal{X}$. GFlowNets aim to sample $\mathbf{x}\in\mathcal{X}$ proportional to a reward $\mathcal{R}(\mathbf{x})$ by learning a forward policy $\pi_{F}(\cdot\mid\mathbf{s},\theta)$ from one state to the next. This induces a trajectory distribution $\pi_{F}(\tau\mid\theta)=\prod_{(\mathbf{s}_{i-1},\mathbf{s}_{i})\in\tau}\pi_{F}(\mathbf{s}_{i}\mid\mathbf{s}_{i-1})$ and a marginal over terminal states $\pi_{F}(\mathbf{x}\mid\theta)=\sum_{\tau:\mathbf{x}_{\tau}=\mathbf{x}}\pi_{F}(\tau\mid\theta)$.

Direct likelihood maximization is intractable due to the marginal sum. Malkin et al. ([2022a](https://arxiv.org/html/2605.15417#bib.bib19)) address this by introducing a learnable $Z$ and backward policy $\pi_{B}(\tau\mid\mathbf{x}_{\tau},\phi)=\prod\pi_{B}(\mathbf{s}_{i-1}\mid\mathbf{s}_{i},\phi)$ and enforcing the *trajectory balance* constraint:

$$Z\pi_{F}(\tau\mid\theta)=\mathcal{R}(\mathbf{x}_{\tau})\pi_{B}(\tau\mid\mathbf{x}_{\tau},\phi).$$
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/jsd_and_modes_combined.png) (a) Comparison of different losses on the grid task of Bengio et al. ([2021](https://arxiv.org/html/2605.15417#bib.bib3)). The forward KL and Hellinger are more mode covering than standard trajectory balance, meaning they find all 4 modes and fit the distribution quicker. Pearson is more mode seeking than trajectory balance and attaches to the first mode it finds.
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/paper_reward_diversity_unique_beta16.png) (b) Training a SynFlowNet with a variety of losses based on $\alpha$-divergences. In $\alpha$-divergences, $\alpha$ is a controllable parameter, with lower values incentivising mode covering and higher values incentivising mode seeking. Standard trajectory balance corresponds to $\alpha=1$. Annealing $\alpha$ during training allows these to be traded off, leading to the sampling of more diverse high reward molecules.

Here $\pi_{B}(\tau\mid\mathbf{x}_{\tau},\phi)=\prod_{(\mathbf{s}_{i-1},\mathbf{s}_{i})\in\tau}\pi_{B}(\mathbf{s}_{i-1}\mid\mathbf{s}_{i},\phi)$ and $Z=\sum_{\mathbf{x}\in\mathcal{X}}\mathcal{R}(\mathbf{x})$. Malkin et al. ([2022a](https://arxiv.org/html/2605.15417#bib.bib19)) show that if $\theta,\phi$ are such that the trajectory balance constraint is satisfied for all $\tau\in\mathcal{T}$, then $\pi_{F}(\mathbf{x}\mid\theta)\propto\mathcal{R}(\mathbf{x})$. Defining the log trajectory discrepancy as:

$$\Delta(\tau,\theta,\phi)=\log\frac{Z_{\phi}\pi_{F}(\tau\mid\theta)}{\mathcal{R}(\mathbf{x}_{\tau})\pi_{B}(\tau\mid\mathbf{x}_{\tau},\phi)},$$
where $Z_{\phi}$ is now a learnable parameter, the trajectory balance constraint is satisfied if $\Delta(\tau,\theta,\phi)=0$ for all $\tau\in\mathcal{T}$, which leads Malkin et al. ([2022a](https://arxiv.org/html/2605.15417#bib.bib19)) to define the trajectory balance loss as:

$$\mathcal{L}_{\mathrm{TB}}(\tau,\phi,\theta)\coloneqq\left(\Delta(\tau,\theta,\phi)\right)^{2}.\tag{8}$$
The parameters $\theta,\phi$ are then updated using the expected gradients of the loss, $\mathbb{E}_{\mu(\tau)}\left[\nabla_{\theta}\mathcal{L}_{\mathrm{TB}}(\tau,\phi,\theta)\right]$ and $\mathbb{E}_{\mu(\tau)}\left[\nabla_{\phi}\mathcal{L}_{\mathrm{TB}}(\tau,\phi,\theta)\right]$, taken under a behavior policy $\mu(\tau)$, typically a tempered $\pi_{F}$.
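The discrepancy and loss above reduce to a few sums of log-probabilities per trajectory. The sketch below is illustrative only: the function names and the toy two-outcome tree (where $\pi_B=1$) are assumptions, not code from the paper.

```python
import math

def log_trajectory_discrepancy(log_z, log_pf_steps, log_reward, log_pb_steps):
    # Delta = log Z + sum_i log pi_F(s_i | s_{i-1}) - log R(x) - sum_i log pi_B(s_{i-1} | s_i)
    return log_z + sum(log_pf_steps) - log_reward - sum(log_pb_steps)

def trajectory_balance_loss(log_z, log_pf_steps, log_reward, log_pb_steps):
    delta = log_trajectory_discrepancy(log_z, log_pf_steps, log_reward, log_pb_steps)
    return delta ** 2

# Toy tree: s0 leads in one step to terminal states "a" or "b", so pi_B = 1 (log pi_B = 0).
rewards = {"a": 1.0, "b": 3.0}
total = sum(rewards.values())
log_z = math.log(total)                          # Z = sum of rewards

# A forward policy proportional to the reward satisfies the constraint exactly.
balanced = {
    x: trajectory_balance_loss(log_z, [math.log(r / total)], math.log(r), [0.0])
    for x, r in rewards.items()
}

# A uniform forward policy violates it, giving a positive loss on "b".
unbalanced = trajectory_balance_loss(log_z, [math.log(0.5)], math.log(rewards["b"]), [0.0])
```

Replacing the square in `trajectory_balance_loss` with another translation invariant loss applied to the same discrepancy gives the generalisation developed in Section 5.2.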

Malkin et al. ([2023](https://arxiv.org/html/2605.15417#bib.bib21)) relate this to hierarchical variational inference (HVI) (Ranganath et al., [2016](https://arxiv.org/html/2605.15417#bib.bib27)), viewing $\pi_{B}$ as a posterior and $\pi_{F}$ as a generative model minimizing the $f$-divergence:

$$\mathcal{L}_{\mathrm{HVI},f}(\pi_{F},\pi_{B})=\mathcal{D}_{f}\left(\pi_{F}\|\pi_{B}\right),\tag{9}$$
$$\mathcal{D}_{f}\left(\pi_{F}\|\pi_{B}\right)=\mathbb{E}_{\tau\sim\pi_{B}}\left[f\left(\frac{\pi_{F}(\tau)}{\pi_{B}(\tau)}\right)\right].\tag{10}$$
For the reverse KL ($\mathrm{R\text{-}}\mathbb{KL}$) and KL, Malkin et al. ([2022b](https://arxiv.org/html/2605.15417#bib.bib20)) show that there are gradient equivalences between HVI and on-policy GFlowNet training:

$$\nabla_{\theta}\mathcal{D}_{\mathrm{R\text{-}}\mathbb{KL}}(\pi_{B,\phi}\|\pi_{F,\theta})=\mathbb{E}_{\tau\sim\pi_{F,\theta}}\left[\nabla_{\theta}\mathcal{L}_{\mathrm{TB}}(\tau,\phi,\theta)\right],$$
$$\nabla_{\phi}\mathcal{D}_{\mathbb{KL}}(\pi_{B,\phi}\|\pi_{F,\theta})=\mathbb{E}_{\tau\sim\pi_{B,\phi}}\left[\nabla_{\phi}\mathcal{L}_{\mathrm{TB}}(\tau,\phi,\theta)\right].$$
If $\mathcal{G}$ is a tree, $\pi_{B}=1$, meaning we do not have to learn a backward policy and the gradient equivalence is the same as that discussed in Section refsec:KL\_loss.

### 5.2 $f$-Trajectory Balance

We now demonstrate that the loss introduced in Section refsec:f\_div\_loss can be applied to enforce the trajectory balance constraints in GFlowNets and that doing so generalises the result of Malkin et al. ([2022b](https://arxiv.org/html/2605.15417#bib.bib20)) to arbitrary $f$-divergences.

###### Proposition 5\.1\.

For a convex function $f$, we have that $\mathcal{L}_{f}$ applied to the log trajectory discrepancy generalises trajectory balance in the sense that:

$$\nabla_{\theta}\mathcal{D}_{f}(\pi_{F}\|\pi_{B})=\mathbb{E}_{\tau\sim\pi_{F,\theta}}\left[\nabla_{\theta}\mathcal{L}_{f}(\tau,\phi,\theta)\right],$$
$$\nabla_{\phi}\mathcal{D}_{h}(\pi_{B}\|\pi_{F})=\mathbb{E}_{\tau\sim\pi_{B,\phi}}\left[\nabla_{\phi}\mathcal{L}_{f}(\tau,\phi,\theta)\right],$$
where $h$ is defined as:

$$h(u)=\int_{1}^{u}\left(2-f'\left(\frac{1}{t}\right)\right)dt.$$
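As a quick check of this definition, the generator $h$ can be worked out for the reverse $\mathbb{KL}$ generator $f(u)=u\log u$ of Example 4.3. This is a sketch computed from the formula above rather than a derivation taken from the paper.

```latex
\begin{aligned}
f(u) &= u\log u, \qquad f'(u) = \log u + 1, \qquad f'\!\left(\tfrac{1}{t}\right) = 1 - \log t, \\
h(u) &= \int_{1}^{u}\bigl(2 - (1 - \log t)\bigr)\,dt
      = \int_{1}^{u}(1 + \log t)\,dt
      = \bigl[t\log t\bigr]_{1}^{u}
      = u\log u .
\end{aligned}
```

In this case $h=f$, so the backward-policy statement involves the same $u\log u$ divergence with its arguments swapped, in line with the trajectory balance equivalences quoted in Section 5.1.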

We demonstrate in Appendix refap:no\_backwards\_grad that the on-policy gradients of the backwards policy do not correspond to $f$-divergence minimisation, as with trajectory balance. In this case their validity comes from the fact that $\mathcal{L}_{f}$ is a valid loss function. Malkin et al. ([2022a](https://arxiv.org/html/2605.15417#bib.bib19)) also demonstrate that the trajectory balance loss has lower variance than a REINFORCE estimator as the model approaches the optimum, which we show holds for $f$-trajectory balance in Appendix refap:f\_traj\_Balance\_variance.

## 6 Experiments

We demonstrate our loss family on four tasks: the synthetic grid task for GFlowNets (Bengio et al., [2021](https://arxiv.org/html/2605.15417#bib.bib3)), molecule sampling with SynFlowNet (Cretu et al., [2025](https://arxiv.org/html/2605.15417#bib.bib8)), diffusion model tuning (Venkatraman et al., [2024](https://arxiv.org/html/2605.15417#bib.bib40)), and asynchronous LLM fine-tuning on GSM8k (Cobbe et al., [2021](https://arxiv.org/html/2605.15417#bib.bib7)) and Hendrycks MATH ([Hendrycks et al.,](https://arxiv.org/html/2605.15417#bib.bib12)).

We provide a complete list of $f$-divergences used in Appendix refap:f\_traj\_Balance\_variance, but most important for understanding our results is the family of $\alpha$-divergences, generated by $f(u)=\frac{u^{\alpha}-u}{\alpha(\alpha-1)}+\frac{\alpha-1}{\alpha}(u-1)$. The key point about this family is not just that it contains many of the most common $f$-divergences, including the forward and reverse KL when $\alpha$ is 0 and 1 respectively (the value at these points is defined via the limit), but that the $\alpha$ parameter controls the mode covering vs mode seeking behaviour, with lower values of $\alpha$ leading to more mode covering losses. The $\alpha$-divergence loss, $\mathcal{L}_{\alpha}$, is given by:

![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/diffusion_tuning/reward_vs_tv.png) (a) Tradeoff in reward vs divergence from the prior when tuning a pretrained diffusion model for MNIST digits to generate odd and even digits respectively.
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/entropy_reward_tradeoff.png) (b) Entropy-reward tradeoff when tuning LLMs 50 steps asynchronously with verifiable rewards on GSM8k (Cobbe et al., [2021](https://arxiv.org/html/2605.15417#bib.bib7)) and Hendrycks MATH ([Hendrycks et al.,](https://arxiv.org/html/2605.15417#bib.bib12)).

$$\mathcal{L}_{\alpha}(\Delta)=\frac{1}{(\alpha-1)^{2}}e^{(\alpha-1)\Delta}-\frac{\Delta}{\alpha-1}-\frac{1}{(\alpha-1)^{2}},$$
where trajectory balance emerges in the limit as $\alpha\to 1$.
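A small sketch of $\mathcal{L}_{\alpha}$ as a plain function may help in reading the experiments. The function name, the threshold `tol` for switching to the limiting form $\frac{1}{2}\Delta^{2}$, and the use of `math.expm1` are implementation choices assumed here, not details from the paper.

```python
import math

def alpha_loss(delta, alpha, tol=1e-6):
    """alpha-divergence loss L_alpha(Delta); uses the limit 0.5 * Delta^2 near alpha = 1."""
    a = alpha - 1.0
    if abs(a) < tol:
        return 0.5 * delta ** 2
    # (e^{a Delta} - 1) / a^2 - Delta / a, written with expm1 for accuracy at small a * Delta
    return math.expm1(a * delta) / a ** 2 - delta / a

# Named members of the family evaluated in Section 6.1.1:
forward_kl = lambda d: alpha_loss(d, 0.0)   # e^{-Delta} + Delta - 1
hellinger = lambda d: alpha_loss(d, 0.5)    # 4 e^{-Delta/2} + 2 Delta - 4
reverse_kl = lambda d: alpha_loss(d, 1.0)   # 0.5 * Delta^2
pearson = lambda d: alpha_loss(d, 2.0)      # e^{Delta} - Delta - 1
```

With $\Delta=\log p_\theta-\log p_\star$, the $\alpha=0$ loss grows exponentially for negative $\Delta$ (where the model under-covers the target), while the $\alpha=2$ loss grows exponentially for positive $\Delta$, which is one way to read the mode covering vs mode seeking behaviour.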

### 6.1 GFlowNet Experiments

#### 6.1.1 Synthetic Grid

We demonstrated this trade-off using the 2D grid experiment from Bengio et al. ([2021](https://arxiv.org/html/2605.15417#bib.bib3)); Malkin et al. ([2022a](https://arxiv.org/html/2605.15417#bib.bib19)), where an agent navigates a hypergrid to discover four modes in the grid corners. Actions include moving to adjacent coordinates or terminating the trajectory, and low-reward regions between modes simulate the mode-discovery challenges typical of MCMC and sampling algorithms. We evaluated four $\alpha$-divergence losses, $\mathcal{L}_{\alpha}$, on this task: the forward KL ($\alpha=0$), Hellinger ($\alpha=0.5$), reverse KL ($\alpha=1$), and Pearson $\chi^{2}$ ($\alpha=2$). Reverse KL corresponds to training with standard trajectory balance.

Figure reffig:grid\_results demonstrates the mode covering vs mode seeking property, where lower values of $\alpha$ in the forward KL and Hellinger losses lead the agent to find more modes than standard trajectory balance and to find modes quicker. Given that one of the key motivations of GFlowNets is their ability to find modes, this clearly demonstrates the benefits of our loss family. On the other hand, the Pearson $\chi^{2}$ is more mode seeking than trajectory balance, which caused it to get stuck in the first mode it found. This behaviour will in fact be helpful later for getting reward maximising losses for RL.

#### 6.1.2 SynFlowNet

We now apply our losses to tuning SynFlowNets (Cretu et al., [2025](https://arxiv.org/html/2605.15417#bib.bib8)), a class of GFlowNets constrained to sample from the space of synthetically accessible molecules. Again the goal is to sample this space proportional to a given reward model, such as predicted binding energy or bioactivity. For these tasks we find that the $\alpha$-divergence losses, $\mathcal{L}_{\alpha}$, required much less extreme $\alpha$ values for stable training, with $\alpha$ much closer to the trajectory balance value of $\alpha=1$.

We evaluate $\alpha\in\{0.75,1.2\}$ against TB across 3 SynFlowNet tasks, sweeping the inverse temperature. $\alpha=0.75$ yields more diverse molecules than TB while $\alpha=1.2$ leads to mode collapse. Annealing $\alpha$ from $0.75$ to $1.2$ during training achieves the best of both worlds, as shown for DRD2 in Figure reffig:synflownet.
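The text does not specify the shape of the annealing schedule. As a purely hypothetical sketch, a linear interpolation between the two endpoint values could look like the following, where the function name, the linear form, and the step counts are all assumptions.

```python
def annealed_alpha(step, total_steps, alpha_start=0.75, alpha_end=1.2):
    """Linearly interpolate alpha from alpha_start to alpha_end over training (assumed schedule)."""
    frac = step / max(total_steps - 1, 1)
    frac = min(max(frac, 0.0), 1.0)      # clamp so out-of-range steps stay at the endpoints
    return alpha_start + frac * (alpha_end - alpha_start)

# Hypothetical run of 1000 steps: mode covering early, mode seeking late.
schedule = [annealed_alpha(s, 1000) for s in range(1000)]
```

Under such a schedule the loss passes through the trajectory balance value $\alpha=1$ partway through training.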

### 6.2 Generative Model Experiments

#### 6.2.1 Conditionally Sampling Diffusion Models

Whilst we have presented GFlowNets for discrete spaces, they have been extended to continuous ones (Lahlou et al., [2023](https://arxiv.org/html/2605.15417#bib.bib17)), and from this applied to tuning of diffusion models ([Berner et al.,](https://arxiv.org/html/2605.15417#bib.bib5); Venkatraman et al., [2024](https://arxiv.org/html/2605.15417#bib.bib40)). Specifically, Venkatraman et al. ([2024](https://arxiv.org/html/2605.15417#bib.bib40)) view generations as trajectories sampled from a GFlowNet, allowing for a pretrained diffusion model, $p_{\mathrm{prior}}(\mathbf{x})$, to be tuned to an unnormalised posterior distribution $p_{\mathrm{post}}(\mathbf{x})\propto r(\mathbf{x})p_{\mathrm{prior}}(\mathbf{x})$ without evaluation of the final likelihoods, only the intermediary steps.

To demonstrate that $f$-trajectory balance can be applied in this case, we repeated the experiments from Venkatraman et al. ([2024](https://arxiv.org/html/2605.15417#bib.bib40)) by tuning a pre-trained diffusion model for MNIST digits to conditionally sample either odd or even numbers. In this case, the target posterior has multiple modes, one for each target digit, and a correctly fitted posterior would sample them evenly. Figure reffig:tradeoff\_diffusion demonstrates that standard trajectory balance fails to do this, especially for even digits, where Appendix refap:f\_traj\_Balance\_variance shows it oversamples 0s and 6s. Alternative $f$-trajectory balance losses are able to sample the modes more evenly, with an annealed $\alpha$-divergence again getting the best trade-off.

#### 6.2.2 Asynchronously Tuning LLMs with RL

Finally, as a classic example of the entropy-reward trade-off in reinforcement learning with verifiable rewards for large language models, we fine-tune a number of models on a mix of questions from GSM8k (Cobbe et al., [2021](https://arxiv.org/html/2605.15417#bib.bib7)) and Hendrycks MATH ([Hendrycks et al.,](https://arxiv.org/html/2605.15417#bib.bib12)). To demonstrate the off-policy validity, we train asynchronously with a 50 step delay. For models, we use the Qwen 2.5 family at sizes 3B to 14B and OLMo-2-1124-7B, aiming to demonstrate our losses across a variety of model sizes and families. As we find that standard GRPO cannot be applied 50 steps asynchronously, we compare against the optimised PPO loss implementation from Intellect ([2025](https://arxiv.org/html/2605.15417#bib.bib14)), which includes changes such as CISPO clipping (Chen et al., [2025](https://arxiv.org/html/2605.15417#bib.bib6)) and DAPO (Yu et al., [2025](https://arxiv.org/html/2605.15417#bib.bib43)). For our losses, we use the tempered DevGrad losses (Appendix refapp:loss\_derivations) without clipping, importance weighting, or masking.

We train for 300 steps with 3 seeds per configuration. The reverse KL, Pearson, and forward KL again demonstrate the same trade-off (Figure reffig:async\_llm). We also include the Jensen-Shannon divergence (requiring numerical normalisation), confirming applicability to the full family of $f$-divergences. The optimised PPO produces inconsistent results across model classes owing to off-policy instability.

## 7 Conclusion

In this work, we have derived a loss family that naturally extends the Vargrad loss for generative models and the trajectory balance loss for GFlowNets by extending their property of on-policy gradients matching the $\mathbb{KL}$ gradients to the whole family of $f$-divergences. We have demonstrated that doing so allows us to control mode-seeking vs mode-covering behaviour in a variety of generative models, whilst training on- and off-policy.

## Acknowledgments

The authors would like to thank Julien Roy and Emmanuel Bengio for feedback on this work\.

## References

- Ali and Silvey \(1966\)S\. M\. Ali and S\. D\. Silvey\.A general class of coefficients of divergence of one distribution from another\.*Journal of the Royal Statistical Society: Series B \(Methodological\)*, 28\(1\):131–142, 1966\.
- Bartoldson et al\. \(2025\)B\. R\. Bartoldson, S\. Venkatraman, J\. Diffenderfer, M\. Jain, T\. Ben\-Nun, S\. Lee, M\. Kim, J\. Obando\-Ceron, Y\. Bengio, and B\. Kailkhura\.Trajectory balance with asynchrony: Decoupling exploration and learning for fast, scalable llm post\-training\.*CoRR*, 2025\.
- Bengio et al\. \(2021\)E\. Bengio, M\. Jain, M\. Korablyov, D\. Precup, and Y\. Bengio\.Flow network based generative models for non\-iterative diverse candidate generation\.*Advances in neural information processing systems*, 34:27381–27394, 2021\.
- Bengio et al\. \(2023\)Y\. Bengio, S\. Lahlou, T\. Deleu, E\. J\. Hu, M\. Tiwari, and E\. Bengio\.Gflownet foundations\.*Journal of Machine Learning Research*, 24\(210\):1–55, 2023\.
- \(5\)J\. Berner, L\. Richter, M\. Sendera, J\. Rector\-Brooks, and N\. Malkin\.From discrete\-time policies to continuous\-time diffusion samplers: Asymptotic equivalences and faster training\.
- Chen et al\. \(2025\)A\. Chen, A\. Li, B\. Gong, B\. Jiang, B\. Fei, B\. Yang, B\. Shan, C\. Yu, C\. Wang, C\. Zhu, et al\.Minimax\-m1: Scaling test\-time compute efficiently with lightning attention\.*arXiv preprint arXiv:2506\.13585*, 2025\.
- Cobbe et al\. \(2021\)K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, et al\.Training verifiers to solve math word problems\.*arXiv preprint arXiv:2110\.14168*, 2021\.
- Cretu et al\. \(2025\)M\. Cretu, C\. Harris, I\. Igashov, A\. Schneuing, M\. Segler, B\. Correia, J\. Roy, E\. Bengio, and P\. Lio\.Synflownet: Design of diverse and novel molecules with synthesis constraints\.In*The Thirteenth International Conference on Learning Representations*, 2025\.URL[https://openreview\.net/forum?id=uvHmnahyp1](https://openreview.net/forum?id=uvHmnahyp1)\.
- Ghasemipour et al\. \(2020\)S\. K\. S\. Ghasemipour, R\. Zemel, and S\. Gu\.A divergence minimization perspective on imitation learning methods\.In*Conference on robot learning*, pages 1259–1277\. PMLR, 2020\.
- Go et al\. \(2023\)D\. Go, T\. Korbak, G\. Kruszewski, J\. Rozen, N\. Ryu, and M\. Dymetman\.Aligning language models with preferences through f\-divergence minimization\.*arXiv preprint arXiv:2302\.08215*, 2023\.
- Han et al\. \(2024\)J\. Han, M\. Jiang, Y\. Song, S\. Ermon, and M\. Xu\.$f$\-PO: Generalizing preference optimization with $f$\-divergence minimization\.*arXiv preprint arXiv:2410\.21662*, 2024\.
- \(12\)D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt\.Measuring mathematical problem solving with the math dataset\.In*Thirty\-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track \(Round 2\)*\.
- \(13\)A\. Huang, W\. Zhan, T\. Xie, J\. D\. Lee, W\. Sun, A\. Krishnamurthy, and D\. J\. Foster\.Correcting the mythos of kl\-regularization: Direct alignment without overoptimization via chi\-squared preference optimization\.In*The Thirteenth International Conference on Learning Representations*\.
- Intellect \(2025\)P\. Intellect\.Prime\-rl, 2025\.URL[https://github\.com/PrimeIntellect\-ai/prime\-rl](https://github.com/PrimeIntellect-ai/prime-rl)\.
- Jin et al\. \(2019\)Q\. Jin, B\. Dhingra, Z\. Liu, W\. Cohen, and X\. Lu\.Pubmedqa: A dataset for biomedical research question answering\.In*Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\)*, pages 2567–2577, 2019\.
- Ke et al\. \(2020\)L\. Ke, S\. Choudhury, M\. Barnes, W\. Sun, G\. Lee, and S\. Srinivasa\.Imitation learning as f\-divergence minimization\.In*International workshop on the algorithmic foundations of robotics*, pages 313–329\. Springer, 2020\.
- Lahlou et al\. \(2023\)S\. Lahlou, T\. Deleu, P\. Lemos, D\. Zhang, A\. Volokhova, A\. Hernández\-Garcıa, L\. N\. Ezzine, Y\. Bengio, and N\. Malkin\.A theory of continuous generative flow networks\.In*International Conference on Machine Learning*, pages 18269–18300\. PMLR, 2023\.
- Lu and Lab \(2025\)K\. Lu and T\. M\. Lab\.On\-policy distillation\.*Thinking Machines Lab: Connectionism*, 2025\.doi:10\.64434/tml\.20251026\.https://thinkingmachines\.ai/blog/on\-policy\-distillation\.
- Malkin et al\. \(2022a\)N\. Malkin, M\. Jain, E\. Bengio, C\. Sun, and Y\. Bengio\.Trajectory balance: Improved credit assignment in gflownets\.*Advances in Neural Information Processing Systems*, 35:5955–5967, 2022a\.
- Malkin et al\. \(2022b\)N\. Malkin, S\. Lahlou, T\. Deleu, X\. Ji, E\. Hu, K\. Everett, D\. Zhang, and Y\. Bengio\.Gflownets and variational inference\.*arXiv preprint arXiv:2210\.00580*, 2022b\.
- Malkin et al\. \(2023\)N\. Malkin, S\. Lahlou, T\. Deleu, X\. Ji, E\. J\. Hu, K\. E\. Everett, D\. Zhang, and Y\. Bengio\.GFlownets and variational inference\.In*The Eleventh International Conference on Learning Representations*, 2023\.URL[https://openreview\.net/forum?id=uKiE0VIluA\-](https://openreview.net/forum?id=uKiE0VIluA-)\.
- Minka et al\. \(2005\)T\. Minka et al\.Divergence measures and message passing\.2005\.
- Morimoto \(1963\)T\. Morimoto\.Markov processes and the h\-theorem\.*Journal of the Physical Society of Japan*, 18\(3\):328–331, 1963\.
- Novello et al\. \(2025\)N\. Novello, F\. Fontana, L\. Cinque, D\. Gunduz, and A\. M\. Tonello\.A unified framework for diffusion model unlearning with f\-divergence\.*arXiv preprint arXiv:2509\.21167*, 2025\.
- Nowozin et al\. \(2016\)S\. Nowozin, B\. Cseke, and R\. Tomioka\.f\-gan: Training generative neural samplers using variational divergence minimization\.*Advances in neural information processing systems*, 29, 2016\.
- Nüsken and Richter \(2021\)N\. Nüsken and L\. Richter\.Solving high\-dimensional hamilton–jacobi–bellman pdes using neural networks: perspectives from the theory of controlled diffusions and measures on path space\.*Partial differential equations and applications*, 2\(4\):48, 2021\.
- Ranganath et al\. \(2016\)R\. Ranganath, D\. Tran, and D\. Blei\.Hierarchical variational models\.In*International conference on machine learning*, pages 324–333\. PMLR, 2016\.
- Richter et al\. \(2020\)L\. Richter, A\. Boustati, N\. Nüsken, F\. Ruiz, and O\. D\. Akyildiz\.Vargrad: a low\-variance gradient estimator for variational inference\.*Advances in Neural Information Processing Systems*, 33:13481–13492, 2020\.
- Rockafellar and Uryasev \(2013\)R\. T\. Rockafellar and S\. Uryasev\.The fundamental risk quadrangle in risk management, optimization and statistical estimation\.*Surveys in Operations Research and Management Science*, 18\(1\-2\):33–53, 2013\.
- Rockafellar et al\. \(2006\)R\. T\. Rockafellar, S\. Uryasev, and M\. Zabarankin\.Generalized deviations in risk analysis\.*Finance and Stochastics*, 10\(1\):51–74, 2006\.
- Schulman \(2020\)J\. Schulman\.Approximating kl divergence, 2020\.*URL http://joschu\.net/blog/kl\-approx\.html*, 2020\.
- Schulman and Lab \(2025\)J\. Schulman and T\. M\. Lab\.Lora without regret\.*Thinking Machines Lab: Connectionism*, 2025\.doi:10\.64434/tml\.20250929\.https://thinkingmachines\.ai/blog/lora/\.
- Schulman et al\. \(2017\)J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov\.Proximal policy optimization algorithms\.*arXiv preprint arXiv:1707\.06347*, 2017\.
- Silva et al\. \(2024\)T\. Silva, E\. de Souza da Silva, and D\. Mesquita\.On divergence measures for training gflownets\.*Advances in Neural Information Processing Systems*, 37:75883–75913, 2024\.
- Tang \(2024\)W\. Tang\.Fine\-tuning of diffusion models via stochastic control: entropy regularization and beyond\.*CoRR*, 2024\.
- Tang and Munos \(2025\)Y\. Tang and R\. Munos\.On a few pitfalls in kl divergence gradient estimation for rl\.*arXiv preprint arXiv:2506\.09477*, 2025\.
- Tang et al\. \(2025\)Y\. Tang, T\. Cohen, D\. W\. Zhang, M\. Valko, and R\. Munos\.Rl\-finetuning llms from on\-and off\-policy data with a single algorithm\.*CoRR*, 2025\.
- Team et al\. \(2025a\)K\. Team, Y\. Bai, Y\. Bao, G\. Chen, J\. Chen, N\. Chen, R\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, et al\.Kimi k2: Open agentic intelligence\.*arXiv preprint arXiv:2507\.20534*, 2025a\.
- Team et al\. \(2025b\)K\. Team, A\. Du, B\. Gao, B\. Xing, C\. Jiang, C\. Chen, C\. Li, C\. Xiao, C\. Du, C\. Liao, et al\.Kimi k1\.5: Scaling reinforcement learning with llms\.*arXiv preprint arXiv:2501\.12599*, 2025b\.
- Venkatraman et al\. \(2024\)S\. Venkatraman, M\. Jain, L\. Scimeca, M\. Kim, M\. Sendera, M\. Hasan, L\. Rowe, S\. Mittal, P\. Lemos, E\. Bengio, et al\.Amortizing intractable inference in diffusion models for vision, language, and control\.*Advances in neural information processing systems*, 37:76080–76114, 2024\.
- Wang et al\. \(2018\)D\. Wang, H\. Liu, and Q\. Liu\.Variational inference with tail\-adaptive f\-divergence\.*Advances in Neural Information Processing Systems*, 31, 2018\.
- Williams \(1992\)R\. J\. Williams\.Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.*Machine learning*, 8\(3\):229–256, 1992\.
- Yu et al\. \(2025\)Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu, et al\.Dapo: An open\-source llm reinforcement learning system at scale\.*arXiv preprint arXiv:2503\.14476*, 2025\.

## Appendix A. Proofs and Derivations

In this appendix, we provide the full derivations for the loss family introduced in [Section 4](https://arxiv.org/html/2605.15417#S4). We prove that the translation invariant loss function $\mathcal{L}_f$ defined in [Proposition 4.2](https://arxiv.org/html/2605.15417#S4.Thmtheorem2) satisfies two key properties: minimizing the population loss recovers the target distribution (off-policy consistency), and the on-policy auto-differentiated gradients recover the gradients of the associated $f$-divergence. We also establish the inverse mapping described in [Proposition 4.4](https://arxiv.org/html/2605.15417#S4.Thmtheorem4).

### A.1 Proof of Proposition 4.3

Proposition 4.3. Let $\Delta_\theta(y) = \log p_\theta(y) - \log p_\star(y)$. The function

$$\mathcal{L}_f(\Delta_\theta(y)) = \int_0^{\Delta_\theta(y)} \left(f'(\exp(t)) - f'(1)\right) dt$$

is a translation invariant loss function that generalizes the construction in Section 3.2 in that:

1. The population loss $\mathcal{J}_{\mu,f}(\theta) = \mathbb{E}_{y\sim\mu}[\mathcal{L}_f(\Delta_\theta(y))]$ is minimized when $p_\theta = p_\star$ (assuming support coverage).
2. If $\mu = p_\theta$, the expected auto-differentiated gradients match the $f$-divergence gradient: $\nabla_\theta D_f(p_\theta \| p_\star) = \mathbb{E}_{p_\theta}[\nabla_\theta \mathcal{L}_f(\Delta_\theta(y))]$.

#### A.1.1 Proof of Part 1: Convexity and Global Minimizer

Let the scalar loss function with respect to the log-probability difference be denoted by $L(\Delta) = \int_0^{\Delta} \left(f'(\exp(t)) - f'(1)\right) dt$. We determine the properties of $L(\Delta)$ by analyzing its derivatives.

Using the Fundamental Theorem of Calculus, the first derivative is:

$$L'(\Delta) = f'(\exp(\Delta)) - f'(1).$$

Differentiating again with respect to $\Delta$, we obtain the second derivative:

$$L''(\Delta) = f''(\exp(\Delta)) \cdot \exp(\Delta).$$

Since $f$ is a convex function defining an $f$-divergence, $f''(u) \geq 0$ for all $u > 0$. Additionally, the exponential function $\exp(\Delta)$ is strictly positive for all real $\Delta$. Therefore:

$$L''(\Delta) \geq 0 \quad \forall \Delta \in \mathbb{R}.$$

This confirms that $L(\Delta)$ is a convex function. To find the global minimizer, we solve for the stationary point $L'(\Delta) = 0$:

$$\begin{aligned}
f'(\exp(\Delta)) - f'(1) &= 0 \\
f'(\exp(\Delta)) &= f'(1).
\end{aligned}$$

Assuming strict convexity of $f$ (implying $f'$ is strictly monotonic), this equality holds if and only if:

$$\exp(\Delta) = 1 \implies \Delta = 0.$$

Since $L(\Delta)$ is convex and its derivative vanishes at $\Delta = 0$, $\Delta = 0$ is the global minimum. Consequently, the expected population loss $\mathbb{E}_{y\sim\mu}[L(\Delta_\theta(y))]$ is minimized when $\Delta_\theta(y) = 0$ almost everywhere, which implies $\log p_\theta(y) = \log p_\star(y)$, or $p_\theta = p_\star$.
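The convexity argument can be sanity-checked numerically. The sketch below is illustrative only: the helper `surrogate_loss`, the midpoint-rule integration, and the two example generators (the KL generator $f(u) = u\log u$ and a Pearson-type generator with $f'(u) = u - 1$) are assumptions made for this example, not part of the paper's code. For these two choices the integral has the closed forms $\Delta^2/2$ and $e^{\Delta} - 1 - \Delta$, respectively.

```python
import math

def surrogate_loss(f_prime, delta, steps=2000):
    """Approximate L_f(delta) = int_0^delta (f'(e^t) - f'(1)) dt (midpoint rule)."""
    h = delta / steps  # negative when delta < 0, which orients the integral correctly
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += f_prime(math.exp(t)) - f_prime(1.0)
    return total * h

def kl_prime(u):
    # Derivative of the KL generator f(u) = u log u.
    return 1.0 + math.log(u)

def chi2_prime(u):
    # Derivative of the illustrative generator f(u) = (u - 1)^2 / 2.
    return u - 1.0

deltas = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
kl_values = [surrogate_loss(kl_prime, d) for d in deltas]
chi2_values = [surrogate_loss(chi2_prime, d) for d in deltas]

for d, a, b in zip(deltas, kl_values, chi2_values):
    print(f"delta={d:+.1f}  L_KL={a:.6f}  L_chi2={b:.6f}")
```

Both losses are non-negative on this grid and vanish only at $\Delta = 0$, in line with Part 1.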

#### A.1.2 Proof of Part 2: Gradient Matching Condition

We show that the on-policy gradient of the $f$-divergence equals the expected auto-differentiated gradient of the loss.

Gradient of the $f$-divergence: Using the log-derivative trick ($\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$) and defining the likelihood ratio $u(y) = p_\theta(y)/p_\star(y)$, the gradient is:

$$\begin{aligned}
\nabla_\theta D_f(p_\theta \| p_\star) &= \nabla_\theta \mathbb{E}_{p_\star}\left[f(u(y))\right] \\
&= \mathbb{E}_{p_\star}\left[f'(u(y)) \frac{\nabla_\theta p_\theta(y)}{p_\star(y)}\right] = \mathbb{E}_{p_\theta}\left[f'(u(y)) \nabla_\theta \log p_\theta(y)\right].
\end{aligned}$$

Since $\mathbb{E}_{p_\theta}[\nabla_\theta \log p_\theta(y)] = 0$, we can subtract a constant baseline $f'(1)$ without changing the value:

$$\nabla_\theta D_f(p_\theta \| p_\star) = \mathbb{E}_{p_\theta}\left[(f'(u(y)) - f'(1)) \nabla_\theta \log p_\theta(y)\right].$$

Gradient of the Surrogate Loss: The auto-differentiated gradient of the loss sample $\mathcal{L}_f$ with respect to $\theta$ follows from the chain rule, with the derivative of the integral given by its integrand evaluated at the upper limit:

$$\begin{aligned}
\nabla_\theta \mathcal{L}_f(\Delta_\theta(y)) &= \frac{\partial \mathcal{L}_f}{\partial \Delta} \nabla_\theta \left(\log p_\theta(y) - \log p_\star(y)\right) \\
&= \left(f'(\exp(\Delta_\theta(y))) - f'(1)\right) \nabla_\theta \log p_\theta(y).
\end{aligned}$$

Substituting $\exp(\Delta_\theta(y)) = u(y)$, we take the expectation over the on-policy samples $y \sim p_\theta$:

$$\mathbb{E}_{p_\theta}[\nabla_\theta \mathcal{L}_f] = \mathbb{E}_{p_\theta}\left[(f'(u(y)) - f'(1)) \nabla_\theta \log p_\theta(y)\right].$$

This is identical to the gradient of the $f$-divergence derived above.
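The gradient identity can be checked on a toy problem. The following is a minimal sketch under stated assumptions: a four-outcome softmax model, a fixed target, and the KL generator are all choices made for this illustration, and the function names are hypothetical. It compares a central finite-difference gradient of $D_f(p_\theta \| p_\star)$ with the exact on-policy expectation of the surrogate gradient.

```python
import math

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [x / s for x in e]

def f(u):
    # KL generator, used here only as an example.
    return u * math.log(u)

def f_prime(u):
    return 1.0 + math.log(u)

def divergence(theta, target):
    """D_f(p_theta || p_star) = sum_y p_star(y) f(p_theta(y) / p_star(y))."""
    p = softmax(theta)
    return sum(q * f(pi / q) for pi, q in zip(p, target))

def surrogate_gradient(theta, target):
    """E_{y ~ p_theta}[(f'(u(y)) - f'(1)) grad log p_theta(y)], summed exactly."""
    p = softmax(theta)
    grad = [0.0] * len(theta)
    for y, (py, qy) in enumerate(zip(p, target)):
        weight = f_prime(py / qy) - f_prime(1.0)
        for j in range(len(theta)):
            score = (1.0 if j == y else 0.0) - p[j]  # d log p_theta(y) / d theta_j
            grad[j] += py * weight * score
    return grad

def finite_difference_gradient(theta, target, eps=1e-6):
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((divergence(up, target) - divergence(down, target)) / (2 * eps))
    return grad

theta = [0.3, -1.2, 0.8, 0.1]
target = [0.1, 0.4, 0.2, 0.3]
g_surrogate = surrogate_gradient(theta, target)
g_divergence = finite_difference_gradient(theta, target)
print(g_surrogate)
print(g_divergence)
```

The two gradients agree up to finite-difference error, which is what Part 2 predicts when the samples are drawn from $p_\theta$ itself.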

### A.2 Proof of Proposition 4.5: Inverse Mapping

Proposition 4.5. Let $\ell: \mathbb{R} \to \mathbb{R}$ be a strictly convex translation invariant loss function. Then the objective $\tilde{\mathcal{J}}_{\mu,\ell}(\theta)$ is minimized when $p_\theta = p_\star$ and corresponds to minimizing an $f$-divergence defined by the generator:

$$f_\ell(u) = \lambda_1 \int_1^u \ell'(\log t)\,dt + \lambda_2 (u - 1) + c$$

where the constants are chosen to satisfy the standard boundary conditions $f(1) = 0$, $f'(1) = 1$, $f''(1) = 1$.

Proof. We seek a convex generator $f$ such that the expected on-policy gradients of the divergence $D_f$ are proportional to the expected on-policy gradients of the loss $\ell$.

In the proof of Proposition 4.3, we established that the gradient of an $f$-divergence can be written as an expectation over on-policy samples:

$$\nabla_\theta D_f(p_\theta \| p_\star) = \mathbb{E}_{p_\theta}\left[(f'(u) - f'(1)) \nabla_\theta \log p_\theta\right],$$

where $u = p_\theta(y)/p_\star(y) = \exp(y - \hat{y})$.

Now consider the loss $\ell$ acting on the difference in log-probabilities $\Delta = y - \hat{y} = \log u$. The gradient of the loss objective with respect to $\theta$ is:

$$\begin{aligned}
\nabla_\theta \ell(\Delta) &= \frac{\partial \ell}{\partial \Delta} \nabla_\theta \Delta \\
&= \ell'(\log u) \nabla_\theta \log p_\theta.
\end{aligned}$$

For the gradients to match (up to a scaling factor $\lambda_1$ and an additive constant in the score function coefficient, which vanishes in expectation), we equate the coefficients of $\nabla_\theta \log p_\theta$:

$$\lambda_1 \ell'(\log u) = f'(u) - C,$$

where $C$ is a constant. Evaluating at $u = 1$ gives $C = f'(1) - \lambda_1 \ell'(0)$, which reduces to $C = f'(1)$ when $\ell'(0) = 0$. This gives the relationship between the first derivatives. To check convexity, we differentiate with respect to $u$:

$$\begin{aligned}
\lambda_1 \ell''(\log u) \cdot \frac{d}{du}(\log u) &= f''(u) \\
\lambda_1 \frac{\ell''(\log u)}{u} &= f''(u).
\end{aligned}$$

Since $u > 0$ (probability ratio) and $\ell$ is strictly convex ($\ell'' > 0$), it follows that $f''(u) > 0$ (assuming $\lambda_1 > 0$). Thus, the induced generator $f$ is strictly convex.

To recover the functional form of $f(u)$, we integrate the first derivative relationship $f'(t) = \lambda_1 \ell'(\log t) + C$ from $1$ to $u$:

$$f(u) - f(1) = \int_1^u \left(\lambda_1 \ell'(\log t) + C\right) dt.$$

Imposing the standard $f$-divergence constraint $f(1) = 0$, and letting $\lambda_2 = C$, we obtain:

$$f_\ell(u) = \lambda_1 \int_1^u \ell'(\log t)\,dt + \lambda_2 (u - 1).$$

The constants $\lambda_1$ and $\lambda_2$ can be solved for using the standardization constraints $f'(1) = 1$ and $f''(1) = 1$:

$$\begin{aligned}
f''(1) = 1 &\implies \lambda_1 \ell''(0) = 1 \implies \lambda_1 = \frac{1}{\ell''(0)} \\
f'(1) = 1 &\implies \lambda_1 \ell'(0) + \lambda_2 = 1 \implies \lambda_2 = 1 - \frac{\ell'(0)}{\ell''(0)}.
\end{aligned}$$

Thus, any strictly convex translation invariant loss on log-probabilities corresponds to a valid, strictly convex $f$-divergence.

#### A.2.1 $\mathbb{KL}$ from square loss

###### Example A\.1\.

Let $\ell: \mathbb{R} \to \mathbb{R}$ be the squared loss function, so $\ell(x) = x^2$ for $x \in \mathbb{R}$. Then $\ell'(x) = 2x$ and:

$$\begin{aligned}
f_\ell(u) &= \lambda_1 \int_1^u 2 \log t\,dt + \lambda_2 (u - 1) + c \\
&= 2\lambda_1 (u \log u - u + 1) + \lambda_2 (u - 1) + c,
\end{aligned}$$

where the boundary conditions give $c = 0$, $\lambda_1 = 1/2$, and $\lambda_2 = 1$, so that $f_\ell(u) = u \log u$.
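The inverse mapping can also be evaluated numerically. The sketch below is an illustration under assumptions: `induced_generator` is a hypothetical helper that applies the formulas for $\lambda_1$ and $\lambda_2$ from the proof and integrates with the midpoint rule, and the second loss $\ell(x) = e^x - 1 - x$ is an extra example not taken from the paper. For the squared loss it should reproduce $u \log u$; for the second loss the same formulas give $(u^2 - 1)/2$.

```python
import math

def induced_generator(l_prime, l_second_at_zero, u, steps=4000):
    """f_l(u) = lam1 * int_1^u l'(log t) dt + lam2 * (u - 1), via the midpoint rule."""
    lam1 = 1.0 / l_second_at_zero                 # from f''(1) = 1
    lam2 = 1.0 - l_prime(0.0) / l_second_at_zero  # from f'(1) = 1
    h = (u - 1.0) / steps
    integral = h * sum(
        l_prime(math.log(1.0 + (k + 0.5) * h)) for k in range(steps)
    )
    return lam1 * integral + lam2 * (u - 1.0)

ratios = [0.2, 0.5, 1.0, 2.0, 5.0]

# Squared loss l(x) = x^2: l'(x) = 2x and l''(0) = 2.
from_square = [induced_generator(lambda x: 2.0 * x, 2.0, u) for u in ratios]

# Hypothetical second loss l(x) = e^x - 1 - x: l'(x) = e^x - 1 and l''(0) = 1.
from_exp = [induced_generator(lambda x: math.exp(x) - 1.0, 1.0, u) for u in ratios]

for u, a, b in zip(ratios, from_square, from_exp):
    print(f"u={u:.1f}  f_square={a:.6f}  u*log(u)={u * math.log(u):.6f}  f_exp={b:.6f}")
```

Both induced generators vanish at $u = 1$, as required by $f(1) = 0$.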

### A.3 Proof of Proposition 5.1

Proposition 5.1. The following loss:

$$\mathcal{L}_f(\Delta(\tau, \theta, \phi)) = \int_0^{\Delta(\tau, \theta, \phi)} \left(f'(\exp(t)) - f'(1)\right) dt$$

generalizes the trajectory balance in the sense that we have:

$$\begin{aligned}
\nabla_\theta D_f(\pi_F \| \pi_B) &= \mathbb{E}_{\tau \sim \pi_{F,\theta}}[\nabla_\theta \mathcal{L}_f(\tau, \phi, \theta)] \\
\nabla_\phi D_h(\pi_B \| \pi_F) &= \mathbb{E}_{\tau \sim \pi_{B,\phi}}[\nabla_\phi \mathcal{L}_f(\tau, \phi, \theta)]
\end{aligned}$$

where $h$ is defined as:

$$h(u) = \int_1^u \left(2 - f'\left(\frac{1}{t}\right)\right) dt.$$
Proof. Let the trajectory log-probability difference be denoted by $\Delta(\tau) = \log \pi_F(\tau|\theta) - \log \pi_B(\tau|\phi)$ (absorbing $Z$ and $R$ into the definition for clarity). Let $u(\tau) = \exp(\Delta(\tau)) = \frac{\pi_F(\tau)}{\pi_B(\tau)}$. The derivative of the scalar loss with respect to $\Delta$ is $\mathcal{L}'(\Delta) = f'(u) - f'(1)$.

#### A.3.1 Forward Policy Gradient ($\nabla_\theta$)

We examine the expected auto-differentiated gradient of the loss with respect to $\theta$ under the forward policy $\pi_F$:

$$\mathbb{E}_{\tau \sim \pi_F}[\nabla_\theta \mathcal{L}_f] = \mathbb{E}_{\tau \sim \pi_F}\left[\frac{\partial \mathcal{L}_f}{\partial \Delta} \nabla_\theta \Delta(\tau)\right].$$

Since $\pi_B$ does not depend on $\theta$, $\nabla_\theta \Delta(\tau) = \nabla_\theta \log \pi_F(\tau)$. Thus:

$$\mathbb{E}_{\tau \sim \pi_F}[\nabla_\theta \mathcal{L}_f] = \mathbb{E}_{\tau \sim \pi_F}\left[(f'(u) - f'(1)) \nabla_\theta \log \pi_F(\tau)\right].$$

From Proposition 4.3, we know that the gradient of the $f$-divergence $D_f(\pi_F \| \pi_B)$ is given by:

$$\nabla_\theta D_f(\pi_F \| \pi_B) = \mathbb{E}_{\tau \sim \pi_F}\left[(f'(u) - c) \nabla_\theta \log \pi_F(\tau)\right],$$

where $c$ is any constant. Choosing $c = f'(1)$ recovers the loss gradient exactly.

#### A.3.2 Backward Policy Gradient ($\nabla_\phi$)

We examine the expected auto-differentiated gradient of the loss with respect to $\phi$ under the backward policy $\pi_B$:

$$\mathbb{E}_{\tau \sim \pi_B}[\nabla_\phi \mathcal{L}_f] = \mathbb{E}_{\tau \sim \pi_B}\left[\frac{\partial \mathcal{L}_f}{\partial \Delta} \nabla_\phi \Delta(\tau)\right].$$

Since $\pi_F$ does not depend on $\phi$, $\nabla_\phi \Delta(\tau) = -\nabla_\phi \log \pi_B(\tau)$. Thus:

$$\mathbb{E}_{\tau \sim \pi_B}[\nabla_\phi \mathcal{L}_f] = \mathbb{E}_{\tau \sim \pi_B}\left[-(f'(u) - f'(1)) \nabla_\phi \log \pi_B(\tau)\right]. \tag{11}$$

Now consider the divergence $D_h(\pi_B \| \pi_F)$ defined by the convex generator $h$. Its gradient with respect to $\phi$ (where $\pi_B$ is the model and $\pi_F$ is the target) is:

$$\nabla_\phi D_h(\pi_B \| \pi_F) = \nabla_\phi \int \pi_F(\tau)\, h\left(\frac{\pi_B(\tau)}{\pi_F(\tau)}\right) d\tau.$$

Let $v(\tau) = \frac{\pi_B(\tau)}{\pi_F(\tau)} = \frac{1}{u(\tau)}$. Using the derivative rule for $f$-divergences (equivalent to Eq. 54 in the main text but for $h$ and $\pi_B$):

$$\nabla_\phi D_h(\pi_B \| \pi_F) = \mathbb{E}_{\tau \sim \pi_B}\left[(h'(v) - h'(1)) \nabla_\phi \log \pi_B(\tau)\right]. \tag{12}$$

To match Eq. ([11](https://arxiv.org/html/2605.15417#A1.E11)) and Eq. ([12](https://arxiv.org/html/2605.15417#A1.E12)), we require the terms scaling the score function to be equivalent up to a constant shift (which vanishes in expectation). We equate:

$$h'(v) = -f'(u) + C = -f'(1/v) + C.$$

Using the definition of $h$ provided in the proposition:

$$h(v) = \int_1^v \left(2 - f'\left(\frac{1}{t}\right)\right) dt \implies h'(v) = 2 - f'\left(\frac{1}{v}\right).$$

Substituting this into the gradient expectation, with $1/v = u$:

$$\nabla_\phi D_h(\pi_B \| \pi_F) = \mathbb{E}_{\tau \sim \pi_B}\left[(2 - f'(u) - h'(1)) \nabla_\phi \log \pi_B(\tau)\right].$$

Since $h'(1) = 2 - f'(1)$, the coefficient simplifies to $2 - f'(u) - h'(1) = -(f'(u) - f'(1))$, which is exactly the coefficient appearing in the loss gradient. Thus, the gradients are equivalent.
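As a worked instance of the backward generator, consider the KL generator $f(u) = u \log u$ (an example chosen here for illustration). The short derivation below suggests that in this case $h$ coincides with $f$, so the backward update corresponds to $\mathbb{KL}(\pi_B \| \pi_F)$. The last line notes why $h$ inherits convexity from $f$ in general.

```latex
% Worked example (illustrative): KL generator f(u) = u \log u
\begin{aligned}
f'(u) &= 1 + \log u
  \quad\Longrightarrow\quad f'\!\left(\tfrac{1}{t}\right) = 1 - \log t, \\
h'(t) &= 2 - f'\!\left(\tfrac{1}{t}\right) = 1 + \log t, \\
h(u) &= \int_1^u (1 + \log t)\,dt = \big[\, t \log t \,\big]_1^u = u \log u, \\
h''(v) &= \frac{1}{v^{2}}\, f''\!\left(\tfrac{1}{v}\right) \;\geq\; 0
  \quad \text{(for general convex } f \text{, by the chain rule).}
\end{aligned}
```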

#### A.3.3 Non-Existence of $f$-Divergence for On-Policy Backward Gradients

In [Section 5](https://arxiv.org/html/2605.15417#S5), we established that the gradients for the backward policy $\pi_B$, when sampled from the backward policy, correspond to minimizing a valid $f$-divergence $D_h(\pi_B \| \pi_F)$. In this section, we investigate the *on-policy* setting, where samples are drawn from the forward policy $\pi_F$, and we optimize $\pi_B$ to minimize the surrogate loss $\mathcal{L}_f$.

We demonstrate that, unlike the off-policy case, the on-policy gradient update for $\pi_B$ generally *does not* correspond to the descent direction of any valid $f$-divergence. While a generator function $g$ can be derived to match the gradients locally, it fails to satisfy the global convexity requirement ($g'' \geq 0$) for standard choices of $f$, such as the KL divergence.

### A.4 Gradient Matching Setup

We seek a convex function $g: \mathbb{R}_+ \to \mathbb{R}$ such that the gradient of the divergence $D_g(\pi_F \| \pi_B)$ matches the expected gradient of the surrogate loss $\mathcal{L}_f$ when sampling from $\pi_F$. Note that we define the divergence as $D_g(\pi_F \| \pi_B)$ because the expectation is taken over $\pi_F$.

Let $u(\tau) = \frac{\pi_F(\tau)}{\pi_B(\tau)}$.

1\. Gradient of the Surrogate Loss: The gradient of the loss $\mathcal{L}_f$ with respect to the backward parameters $\phi$, estimated on-policy, is:

$$\begin{aligned}
\nabla_\phi \mathcal{J}_{\text{on}} &= \mathbb{E}_{\tau \sim \pi_F}\left[\nabla_\phi \mathcal{L}_f(\log u)\right] \\
&= \mathbb{E}_{\tau \sim \pi_F}\left[(f'(u) - f'(1)) \nabla_\phi(-\log \pi_B)\right] \\
&= \mathbb{E}_{\tau \sim \pi_F}\left[-f'(u) \nabla_\phi \log \pi_B\right]
\end{aligned}$$

(The constant term $f'(1)$ vanishes because $\mathbb{E}_{\pi_F}[\nabla_\phi \log \pi_B] = 0$.)

2\. Gradient of the Candidate Divergence: Consider the generic divergence $D_g(\pi_F \| \pi_B) = \int \pi_B(\tau)\, g\left(\frac{\pi_F(\tau)}{\pi_B(\tau)}\right) d\tau$. Differentiating with respect to $\phi$:

$$\nabla_\phi D_g = \int \left(\nabla_\phi \pi_B \cdot g(u) + \pi_B\, g'(u) \nabla_\phi u\right) d\tau$$

Using the identities $\nabla_\phi u = \pi_F \nabla_\phi(\pi_B^{-1}) = -u \nabla_\phi \log \pi_B$ and $\nabla_\phi \pi_B = \pi_B \nabla_\phi \log \pi_B$:

$$\begin{aligned}
\nabla_\phi D_g &= \int \pi_B \left(g(u) - u g'(u)\right) \nabla_\phi \log \pi_B \, d\tau \\
&= \int \pi_F \frac{1}{u} \left(g(u) - u g'(u)\right) \nabla_\phi \log \pi_B \, d\tau \\
&= \mathbb{E}_{\tau \sim \pi_F}\left[\left(\frac{g(u)}{u} - g'(u)\right) \nabla_\phi \log \pi_B\right]
\end{aligned}$$

### A.5 Derivation of the Generator $g$

Equating the terms inside the expectations gives the condition for the gradients to match:

$$-f'(u) = \frac{g(u)}{u} - g'(u)$$

Rearranging into a linear ordinary differential equation for $g(u)$:

$$g'(u) - \frac{1}{u} g(u) = f'(u)$$

Using the integrating factor $I(u) = \exp\left(\int -\frac{1}{u}\,du\right) = \frac{1}{u}$, we solve:

$$\begin{aligned}
\frac{d}{du}\left[\frac{g(u)}{u}\right] &= \frac{f'(u)}{u} && (13) \\
\frac{g(u)}{u} &= \int_1^u \frac{f'(t)}{t}\,dt + C && (14) \\
g(u) &= u \int_1^u \frac{f'(t)}{t}\,dt + Cu && (15)
\end{aligned}$$

The linear term $Cu$ corresponds to the gradient of a constant expectation and does not affect the convexity analysis.

### A.6 Proof of Non-Convexity

For $D_g$ to be a valid divergence, $g$ must be convex, i.e., $g''(u) \geq 0$ for all $u \in (0, \infty)$. We compute the second derivative of the solution in Eq. (15), dropping the linear term $Cu$, which does not contribute to $g''$.

First derivative:

$$g'(u) = \int_1^u \frac{f'(t)}{t}\,dt + u\left(\frac{f'(u)}{u}\right) = \int_1^u \frac{f'(t)}{t}\,dt + f'(u)$$

Second derivative:

$$g''(u) = \frac{f'(u)}{u} + f''(u) \tag{16}$$
We now test this condition for the most common case: the Trajectory Balance loss, which corresponds to the KL divergence.

Counterexample: KL Divergence (Trajectory Balance). Let $f(u) = u \log u$. Then $f'(u) = 1 + \log u$ and $f''(u) = \frac{1}{u}$. Substituting into Eq. (16):

$$\begin{aligned}
g''(u) &= \frac{1 + \log u}{u} + \frac{1}{u} \\
g''(u) &= \frac{2 + \log u}{u}
\end{aligned}$$

For $g$ to be convex, we require $g''(u) \geq 0$ for all $u > 0$. However:

$$\frac{2 + \log u}{u} < 0 \iff \log u < -2 \iff u < e^{-2} \approx 0.135$$

Since $u = \pi_F(\tau)/\pi_B(\tau)$, the ratio $u$ falls below $e^{-2}$ wherever the backward policy assigns more than $e^{2} \approx 7.4$ times the probability that the forward policy does. In this region, $g$ is locally concave.

Conclusion: Because $g''(u)$ is not non-negative everywhere, $g$ does not define a valid $f$-divergence. Consequently, the on-policy optimization of the backward policy cannot be interpreted as minimizing a distance measure between distributions in the divergence sense. It is more accurately described as minimizing the variance of the log-ratio or satisfying a moment-matching condition specific to the chosen surrogate loss.
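The sign change in the KL counterexample can be reproduced with a few lines of code. This is a minimal sketch that takes the candidate generator of Eq. (15) as given, with $C = 0$ and $f(u) = u \log u$, for which the integral evaluates to $g(u) = u\left(\log u + \tfrac{1}{2}\log^2 u\right)$. The function names and the finite-difference step are assumptions made for this illustration.

```python
import math

def g(u):
    """Candidate generator for f(u) = u log u with C = 0 (closed form of Eq. 15)."""
    log_u = math.log(u)
    return u * (log_u + 0.5 * log_u ** 2)

def second_derivative(func, u, eps=1e-4):
    # Central second difference; eps is small relative to every test point below.
    return (func(u + eps) - 2.0 * func(u) + func(u - eps)) / eps ** 2

def g_second_closed_form(u):
    # Eq. (16) specialised to the KL generator.
    return (2.0 + math.log(u)) / u

threshold = math.exp(-2.0)
points = [0.05, 0.1, 0.2, 1.0, 3.0]
numeric = [second_derivative(g, u) for u in points]
closed = [g_second_closed_form(u) for u in points]

for u, n, c in zip(points, numeric, closed):
    side = "below" if u < threshold else "above"
    print(f"u={u:.2f} ({side} e^-2)  g''~{n:.4f}  closed form={c:.4f}")
```

The second derivative is negative at the two points below $e^{-2}$ and positive at the others, matching the threshold derived above.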

## Appendix B. Gradient Variance Analysis

### B.1 $f$-DevGrad Variance Analysis

In this section, we demonstrate that the gradient estimator derived from the batch-wise DevGrad loss, $\mathcal{L}_f^{DG}$, is equivalent to a REINFORCE estimator equipped with a generalized deviation baseline. We prove that the batch normalization step implicitly enforces a zero-sum constraint on the gradient weights, thereby acting as a variance-reducing control variate.

Gradient Form. Consider the batch-wise loss over a batch $\mathcal{B} = \{\mathbf{y}_1, \dots, \mathbf{y}_B\}$ with deviation $\Delta(\mathbf{y}_i) = \log \pi_\theta(\mathbf{y}_i) - \mathcal{R}(\mathbf{y}_i)$:

$$\mathcal{L}_f^{DG}(\mathcal{B}, \theta) = \min_C \frac{1}{B} \sum_{i=1}^{B} L_f\left(\Delta(\mathbf{y}_i) + C\right).$$

Let $\hat{C}$ be the minimizer of this objective (the batch-wise estimate of $-\log Z$). The gradient of the loss with respect to $\theta$, treating $\hat{C}$ as fixed (via the stop-gradient operator in Eq. 30), is:

$$\hat{\mathbf{g}}^{DG} = \frac{1}{B} \sum_{i=1}^{B} L_f'\left(\Delta(\mathbf{y}_i) + \hat{C}\right) \nabla_\theta \log \pi_\theta(\mathbf{y}_i).$$
Zero-Sum Weight Property. The scalar $\hat{C}$ is defined by the first-order optimality condition of the inner minimization problem:

$$\frac{\partial}{\partial C}\left(\frac{1}{B} \sum_{i=1}^{B} L_f(\Delta(\mathbf{y}_i) + C)\right)\bigg|_{C = \hat{C}} = \frac{1}{B} \sum_{i=1}^{B} L_f'(\Delta(\mathbf{y}_i) + \hat{C}) = 0.$$

Let $w_i = L_f'(\Delta(\mathbf{y}_i) + \hat{C})$ be the effective weight for the $i$-th sample. The optimality condition implies $\sum_{i=1}^{B} w_i = 0$. Consequently, the estimator can be rewritten as a REINFORCE estimator with a sample-dependent baseline:

$$\hat{\mathbf{g}}^{DG} = \frac{1}{B} \sum_{i=1}^{B} (\tilde{w}_i - \hat{b}) \nabla_\theta \log \pi_\theta(\mathbf{y}_i),$$

where $\tilde{w}_i$ represents the uncentered gradient coefficients derived from $f'$, and $\hat{b}$ is the implicit baseline resulting from $\hat{C}$ such that the coefficients sum to zero.
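The zero-sum property is easy to observe on a small batch. The sketch below is not the paper's implementation: it assumes an illustrative generator with $f'(u) = u$, so that $L_f'(x) = e^x - 1$, uses made-up deviations, and solves the inner problem with a plain bisection (the helper `solve_shift` is hypothetical). For this particular choice the optimality condition has the closed form $\hat{C} = -\log\left(\frac{1}{B}\sum_i e^{\Delta_i}\right)$, which the code uses as a cross-check.

```python
import math

def loss_derivative(x):
    """L_f'(x) = f'(e^x) - f'(1) for the illustrative choice f'(u) = u."""
    return math.exp(x) - 1.0

def solve_shift(deltas, lo=-50.0, hi=50.0, iters=200):
    """Bisection on C for sum_i L_f'(delta_i + C) = 0; the sum is increasing in C."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(loss_derivative(d + mid) for d in deltas) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Made-up batch of deviations Delta(y_i) = log pi_theta(y_i) - R(y_i).
deltas = [0.7, -1.3, 2.1, 0.2, -0.4]

c_hat = solve_shift(deltas)
weights = [loss_derivative(d + c_hat) for d in deltas]
closed_form = -math.log(sum(math.exp(d) for d in deltas) / len(deltas))

print("C_hat:", c_hat, "closed form:", closed_form)
print("weights:", weights)
print("sum of weights:", sum(weights))
```

Because the weights sum to zero, adding any constant multiple of $\sum_i \nabla_\theta \log \pi_\theta(\mathbf{y}_i)$ is already accounted for, which is the baseline interpretation given above.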

### B.2 $f$-Trajectory Balance Variance Analysis

We compare the variance of the gradient estimator for the forward policy parameters $\theta$ derived from the $f$-Trajectory Balance loss, $\hat{g}_{f\text{-TB}}$, against the standard score function estimator, $\hat{g}_{\text{SF}}$. Let $\tau$ be a complete trajectory sampled from the forward policy $\pi_{F}(\cdot|\theta)$. We define the trajectory likelihood ratio as $u(\tau)=\frac{Z_{\theta}\pi_{F}(\tau|\theta)}{R(x_{\tau})\pi_{B}(\tau|x_{\tau},\phi)}$.

With the score function $S(\tau)=\nabla_{\theta}\log\pi_{F}(\tau|\theta)$, the estimators are given by:

$$\begin{aligned}\hat{g}_{\text{SF}}&=f^{\prime}(u(\tau))S(\tau),\\ \hat{g}_{f\text{-TB}}&=(f^{\prime}(u(\tau))-f^{\prime}(1))S(\tau)=(f^{\prime}(u(\tau))-1)S(\tau).\end{aligned}$$

Since $\mathbb{E}_{\tau\sim\pi_{F}}[S(\tau)]=0$, both estimators are unbiased. The difference in their variances is determined by the difference in their expected squared norms:

$$\begin{aligned}\text{Var}(\hat{g}_{f\text{-TB}})-\text{Var}(\hat{g}_{\text{SF}})&=\mathbb{E}_{\tau\sim\pi_{F}}\left[\left((f^{\prime}(u(\tau))-1)^{2}-f^{\prime}(u(\tau))^{2}\right)\|S(\tau)\|^{2}\right]\\&=\mathbb{E}_{\tau\sim\pi_{F}}\left[(1-2f^{\prime}(u(\tau)))\|S(\tau)\|^{2}\right].\end{aligned}$$

At the global optimum where the Trajectory Balance constraint is satisfied, we have $Z_{\theta}\pi_{F}(\tau)=R(x_{\tau})\pi_{B}(\tau)$, implying $u(\tau)=1$ for all sampled $\tau$. Given the normalization $f^{\prime}(1)=1$, the term inside the expectation becomes:

$$1-2f^{\prime}(1)=-1.$$

Thus, at the optimum, the variance difference is $-\mathbb{E}[\|S(\tau)\|^{2}]$, which is strictly negative for stochastic policies.

Since $f$ is strictly convex and differentiable, $f^{\prime}$ is continuous. Consequently, by the continuity of expected values, there exists a neighborhood around the optimum (where $u(\tau)\approx 1$) such that $\mathbb{E}_{\tau\sim\pi_{F}}[(1-2f^{\prime}(u(\tau)))\|S(\tau)\|^{2}]<0$. Therefore, the $f$-Trajectory Balance estimator yields strictly lower variance than the standard score function estimator in the vicinity of the optimum.
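As a sanity check of the variance gap, the sketch below evaluates $\mathbb{E}[(1-2f^{\prime}(u))\|S\|^{2}]$ exactly on a toy problem. The setup is an assumption made for illustration only: a single-step "trajectory" drawn from a softmax policy over three outcomes, a normalized target so that $u=\pi/p^{*}$, and the Forward KL derivative $f^{\prime}(u)=2-u^{-1}$. All names and logit values are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_sq_norm(pi, x):
    """||grad_theta log pi(x)||^2 for a softmax policy, i.e. ||e_x - pi||^2."""
    return sum(((1.0 if k == x else 0.0) - p) ** 2 for k, p in enumerate(pi))


def variance_gap(pi, target, f_prime):
    """Exact E_{x ~ pi}[(1 - 2 f'(u(x))) ||S(x)||^2] with u = pi / target."""
    return sum(
        p * (1.0 - 2.0 * f_prime(p / t)) * score_sq_norm(pi, x)
        for x, (p, t) in enumerate(zip(pi, target))
    )


def fkl_prime(u):
    """f'(u) = 2 - 1/u, normalized so that f'(1) = 1."""
    return 2.0 - 1.0 / u


target = softmax([0.5, -0.2, 1.0])                 # toy normalized target
gap_at_optimum = variance_gap(target, target, fkl_prime)
gap_nearby = variance_gap(softmax([0.55, -0.25, 1.0]), target, fkl_prime)
```

At the optimum the gap equals $-\mathbb{E}\|S\|^{2}$, which for a softmax policy is $-(1-\sum_k \pi_k^{2})$, and it stays negative for the slightly perturbed policy.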

## Appendix C Loss Derivations and Properties

In this Appendix, we give the definitions of a variety of standard $f$-divergences and derive the closed form of the $f$-trajectory loss. We also derive the batch-wise DevGrad normalization, the LLM DevGrad formulation, and the tempered LLM DevGrad formulation. We assume the standard normalization $f^{\prime}(1)=1$ and $f^{\prime\prime}(1)=1$. Reviewing notation we have:

- Generic: $\Delta_{\theta}(\mathbf{y}_{i})=\log p_{\theta}(\mathbf{y}_{i})-\mathcal{R}(\mathbf{y}_{i})$.
- LLM Fine-tuning: The target is the Boltzmann distribution defined by reward $r(\mathbf{x},\mathbf{y})$ and reference $\pi_{\text{ref}}$: $\mathcal{R}(\mathbf{y}_{i})=\log\pi_{\text{ref}}(\mathbf{y}_{i})+\frac{r(\mathbf{x},\mathbf{y}_{i})}{\beta}$. The deviation expands to: $\Delta_{\theta}(\mathbf{y}_{i})=\log\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})}-\frac{r(\mathbf{x},\mathbf{y}_{i})}{\beta}$.

The loss function, $\mathcal{L}_{f}$, is given by $\mathcal{L}_{f}(\Delta_{\theta}(\mathbf{y}))=\int_{0}^{\Delta_{\theta}(\mathbf{y})}\left(f^{\prime}(\exp(t))-f^{\prime}(1)\right)dt$. The general batch-wise DevGrad loss is defined as:

$$\begin{aligned}\mathcal{L}^{\mathrm{DG}}_{f}(\mathcal{B},\theta)&=\frac{1}{B}\sum_{i}\mathcal{L}_{f}\Big(\Delta(\mathbf{y}_{i})+\mathrm{SG}\left[\widehat{\log Z}\right]\Big),\\ \text{where: }\widehat{\log Z}&=\operatorname*{arg\,min}_{C}\frac{1}{B}\sum_{i}\mathcal{L}_{f}\Big(\Delta(\mathbf{y}_{i})+C\Big),\\ \text{which must satisfy: }0&=\frac{1}{B}\sum_{i}\mathcal{L}^{\prime}_{f}\Big(\Delta(\mathbf{y}_{i})+\widehat{\log Z}\Big).\end{aligned}$$

Finally, we define the Stop-Gradient Softmax with Temperature $\tau$ over the rewards $\mathbf{r}$ as the normalized importance weights:

$$\tilde{\sigma}_{\tau}(\mathbf{r})_{i}=\frac{\exp\left(r(\mathbf{x},\mathbf{y}_{i})/\tau\right)}{\mathrm{SG}\left[\sum_{j=1}^{B}\exp\left(r(\mathbf{x},\mathbf{y}_{j})/\tau\right)\right]},$$

where $\mathrm{SG}[\cdot]$ denotes the stop-gradient operator.
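The quantity that appears in the LLM forms of the losses is $B\cdot\tilde{\sigma}_{\tau}(\mathbf{r})_{i}$, i.e. $\exp(r_i/\tau)$ divided by the batch mean of $\exp(r_j/\tau)$. A minimal sketch follows; the function name and reward values are hypothetical. Plain Python has no gradients, so the stop-gradient is implicit here; in an autodiff framework the denominator would be wrapped in a stop-gradient (for example `jax.lax.stop_gradient`).

```python
import math


def batch_softmax_weights(rewards, tau):
    """Return B * softmax(r / tau)_i = exp(r_i / tau) / mean_j exp(r_j / tau)."""
    scaled = [r / tau for r in rewards]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    mean_exp = sum(exps) / len(exps)
    return [e / mean_exp for e in exps]


rewards = [1.0, 0.0, 2.5, -0.5]  # hypothetical rewards for one prompt
w_pos = batch_softmax_weights(rewards, tau=0.5)
w_neg = batch_softmax_weights(rewards, tau=-0.5)  # a negative temperature is allowed
```

The weights always sum to $B$. With a positive temperature the highest-reward sample gets the largest weight, while a negative temperature favors the lowest-reward sample.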

### C.1 Reverse KL Divergence

Generator: $f(u)=u\log u$. Derivation: With $f^{\prime}(u)=1+\log u$, the integrand is $t$.

$$\mathcal{L}_{\text{RKL}}(\Delta)=\int_{0}^{\Delta}t\,dt=\frac{1}{2}\Delta^{2}.$$

Batch-wise Normalization: The optimality condition $\sum(\Delta_{i}+\widehat{\log Z})=0$ yields the mean difference:

$$\widehat{\log Z}=-\frac{1}{B}\sum_{i=1}^{B}\Delta(\mathbf{y}_{i})\quad\xrightarrow[\text{LLM}]{\pi_{\theta}\approx\pi_{\text{ref}}}\quad\frac{1}{B}\sum_{i=1}^{B}\frac{r(\mathbf{x},\mathbf{y}_{i})}{\beta}.$$

Batch-wise DevGrad Loss: We substitute the mean difference for the normalization constant.

$$\mathcal{L}^{\mathrm{DG}}_{\text{RKL}}(\mathcal{B},\theta)=\frac{1}{2B}\sum_{i=1}^{B}\left(\Delta(\mathbf{y}_{i})-\mathrm{SG}\left[\frac{1}{B}\sum_{j=1}^{B}\Delta(\mathbf{y}_{j})\right]\right)^{2}.$$

Batch-wise DevGrad Loss (LLM): The Vargrad loss on the reward-adjusted log-ratios:

$$\mathcal{L}^{\mathrm{DG}}_{\text{RKL}}=\frac{1}{2}\mathbb{V}\text{ar}\left[\log\frac{\pi_{\theta}(\mathbf{y})}{\pi_{\text{ref}}(\mathbf{y})}-\frac{r(\mathbf{x},\mathbf{y})}{\beta}\right].$$
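The Reverse KL case reduces to half the batch variance of the deviations. A minimal sketch under stated assumptions: the helper names are hypothetical, the log-probabilities and rewards are made-up numbers for a single prompt, and the stop-gradient on the batch mean is implicit because no gradients are taken.

```python
def deviations(logp_theta, logp_ref, rewards, beta):
    """Delta_i = log pi_theta(y_i|x) - log pi_ref(y_i|x) - r(x, y_i) / beta."""
    return [lt - lr - r / beta for lt, lr, r in zip(logp_theta, logp_ref, rewards)]


def rkl_devgrad(deltas):
    """Return (log_z_hat, loss, weights) for the batch-wise Reverse KL loss."""
    b = len(deltas)
    log_z_hat = -sum(deltas) / b                     # minus the batch mean
    shifted = [d + log_z_hat for d in deltas]        # centered deviations
    loss = sum(s * s for s in shifted) / (2 * b)     # half the batch variance
    return log_z_hat, loss, shifted                  # weights w_i equal the centered deviations


# Hypothetical sequence log-probabilities and rewards.
deltas = deviations(
    logp_theta=[-3.1, -2.4, -4.0],
    logp_ref=[-3.0, -2.9, -3.5],
    rewards=[1.0, 0.0, 0.5],
    beta=0.5,
)
log_z_hat, loss, weights = rkl_devgrad(deltas)
```

Because the derivative of $\frac{1}{2}\delta^{2}$ is $\delta$, the per-sample gradient weights are simply the centered deviations, which sum to zero by construction.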
##### Tempered Reverse KL Divergence \(KIMI Loss\)

From Example 4.10, the tempered loss for the Reverse KL corresponds to the variance of the reward-adjusted log-probabilities scaled by $\beta$. This recovers the standard Kimi setup:

$$\tilde{\mathcal{L}}^{\text{DG}}_{\text{RKL}}=\frac{1}{2\beta}\text{Var}\left[\beta\log\frac{\pi_{\theta}(\mathbf{y}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}|\mathbf{x})}-r(\mathbf{x},\mathbf{y})\right].$$

### C.2 Forward KL Divergence

Generator: $f(u)=2(u-1)-\log u$. Derivation: With $f^{\prime}(u)=2-u^{-1}$, the integrand is $1-e^{-t}$.

$$\mathcal{L}_{\text{FKL}}(\Delta)=\int_{0}^{\Delta}(1-e^{-t})\,dt=\Delta+e^{-\Delta}-1.$$

Batch-wise Normalization: The condition $\sum(1-e^{-(\Delta_{i}+\widehat{\log Z})})=0$ implies $\sum e^{-\Delta_{i}}e^{-\widehat{\log Z}}=B$:

$$\widehat{\log Z}=\log\left(\frac{1}{B}\sum_{i=1}^{B}e^{-\Delta(\mathbf{y}_{i})}\right)\quad\xrightarrow[\text{LLM}]{\pi_{\theta}\approx\pi_{\text{ref}}}\quad\log\left(\frac{1}{B}\sum_{i=1}^{B}e^{r(\mathbf{x},\mathbf{y}_{i})/\beta}\right).$$

Batch-wise DevGrad Loss: Substituting $\widehat{\log Z}$ into the robust form $\Delta+\widehat{\log Z}+e^{-\Delta}e^{-\widehat{\log Z}}-1$:

$$\mathcal{L}^{\mathrm{DG}}_{\text{FKL}}(\mathcal{B},\theta)=\frac{1}{B}\sum_{i=1}^{B}\left[\Delta(\mathbf{y}_{i})+\mathrm{SG}\left[\log\left(\frac{1}{B}\sum_{j=1}^{B}e^{-\Delta(\mathbf{y}_{j})}\right)\right]+\frac{e^{-\Delta(\mathbf{y}_{i})}}{\mathrm{SG}\left[\frac{1}{B}\sum_{j=1}^{B}e^{-\Delta(\mathbf{y}_{j})}\right]}-1\right].$$
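A minimal sketch of the Forward KL batch-wise loss, using the closed-form $\widehat{\log Z}$ above. The function name and deviation values are hypothetical, and the stop-gradient is implicit since this plain-Python version only evaluates the loss and weights.

```python
import math


def fkl_devgrad(deltas):
    """Return (log_z_hat, loss, weights) for the batch-wise Forward KL loss."""
    b = len(deltas)
    log_z_hat = math.log(sum(math.exp(-d) for d in deltas) / b)
    shifted = [d + log_z_hat for d in deltas]            # normalized deviations
    loss = sum(s + math.exp(-s) - 1.0 for s in shifted) / b
    weights = [1.0 - math.exp(-s) for s in shifted]      # L_FKL'(shifted)
    return log_z_hat, loss, weights


deltas = [0.3, -1.2, 0.8, -0.1]  # hypothetical deviations Delta(y_i)
log_z_hat, loss, weights = fkl_devgrad(deltas)
```

Since the normalization forces the batch mean of $e^{-(\Delta_i+\widehat{\log Z})}$ to equal one, the loss value collapses to $\frac{1}{B}\sum_i\Delta_i+\widehat{\log Z}$, which is non-negative by Jensen's inequality.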
Batch-wise DevGrad Loss (LLM): Under the Kimi assumption ($\Delta\approx-r/\beta$), the normalization depends on the rewards with temperature $\beta$.

$$\mathcal{L}^{\mathrm{DG}}_{\text{FKL}}=\frac{1}{B}\sum_{i=1}^{B}\Bigg[\log\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})\left(B\cdot\tilde{\sigma}_{\beta}(\mathbf{r})_{i}\right)}+\frac{\left(B\cdot\tilde{\sigma}_{\beta}(\mathbf{r})_{i}\right)\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})}{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}-1\Bigg].$$

##### Tempered Forward KL Divergence

Applying the tempered transformation to Eq. 110, the normalization temperature shifts from $\beta$ to $1$.

$$\tilde{\mathcal{L}}^{\text{DG}}_{\text{FKL}}=\frac{1}{B\beta}\sum_{i=1}^{B}\left[\log\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})^{\beta}}{\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})^{\beta}(B\cdot\tilde{\sigma}_{1}(\mathbf{r})_{i})}+(B\cdot\tilde{\sigma}_{1}(\mathbf{r})_{i})\left(\frac{\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})}{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}\right)^{\beta}-1\right].$$

### C.3 Pearson $\chi^{2}$ Divergence

Generator: $f(u)=\frac{1}{2}(u-1)^{2}+(u-1)$. Derivation: With $f^{\prime}(u)=u$, the integrand is $e^{t}-1$.

$$\mathcal{L}_{\chi^{2}}(\Delta)=\int_{0}^{\Delta}(e^{t}-1)\,dt=e^{\Delta}-\Delta-1.$$

Batch-wise Normalization: The condition $\sum(e^{\Delta_{i}+\widehat{\log Z}}-1)=0$ implies $e^{\widehat{\log Z}}\sum e^{\Delta_{i}}=B$:

$$\widehat{\log Z}=-\log\left(\frac{1}{B}\sum_{i=1}^{B}e^{\Delta(\mathbf{y}_{i})}\right)\quad\xrightarrow[\text{LLM}]{\pi_{\theta}\approx\pi_{\text{ref}}}\quad-\log\left(\frac{1}{B}\sum_{i=1}^{B}e^{-r(\mathbf{x},\mathbf{y}_{i})/\beta}\right).$$

Batch-wise DevGrad Loss: Substituting $\widehat{\log Z}$ into the robust form $e^{\Delta}e^{\widehat{\log Z}}-(\Delta+\widehat{\log Z})-1$:

$$\mathcal{L}^{\mathrm{DG}}_{\chi^{2}}(\mathcal{B},\theta)=\frac{1}{B}\sum_{i=1}^{B}\left[\frac{e^{\Delta(\mathbf{y}_{i})}}{\mathrm{SG}\left[\frac{1}{B}\sum_{j=1}^{B}e^{\Delta(\mathbf{y}_{j})}\right]}-\Delta(\mathbf{y}_{i})-\mathrm{SG}\left[-\log\left(\frac{1}{B}\sum_{j=1}^{B}e^{\Delta(\mathbf{y}_{j})}\right)\right]-1\right].$$

Batch-wise DevGrad Loss (LLM): Using the assumption $\Delta\approx-r/\beta$, the normalization effectively uses a negative temperature $-\beta$.

$$\mathcal{L}^{\mathrm{DG}}_{\chi^{2}}=\frac{1}{B}\sum_{i=1}^{B}\Bigg[\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}{\left(B\cdot\tilde{\sigma}_{-\beta}(\mathbf{r})_{i}\right)\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})}-\log\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}{\left(B\cdot\tilde{\sigma}_{-\beta}(\mathbf{r})_{i}\right)\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})}-1\Bigg].$$

### C.4 Neyman $\chi^{2}$ Divergence

Generator: $f(u)=\frac{(u-1)^{2}}{2u}+(u-1)$. Properties: Mass-covering. Heavily penalizes under-estimation of the target probability. Derivation: With $f^{\prime}(u)=\frac{3}{2}-\frac{1}{2}u^{-2}$, the integrand is $\frac{1}{2}-\frac{1}{2}e^{-2t}$.

$$\mathcal{L}_{\text{Ney}}(\Delta)=\frac{1}{2}\Delta+\frac{1}{4}e^{-2\Delta}-\frac{1}{4}.$$

Batch-wise Normalization: The condition $\sum(\frac{1}{2}-\frac{1}{2}e^{-2(\Delta_{i}+\widehat{\log Z})})=0$ implies $e^{-2\widehat{\log Z}}\sum e^{-2\Delta_{i}}=B$:

$$\widehat{\log Z}=\frac{1}{2}\log\left(\frac{1}{B}\sum_{i=1}^{B}e^{-2\Delta(\mathbf{y}_{i})}\right)\quad\xrightarrow[\text{LLM}]{\pi_{\theta}\approx\pi_{\text{ref}}}\quad\frac{1}{2}\log\left(\frac{1}{B}\sum_{i=1}^{B}e^{2r(\mathbf{x},\mathbf{y}_{i})/\beta}\right).$$

Batch-wise DevGrad Loss: Substituting $\widehat{\log Z}$ into the robust form $\frac{1}{2}(\Delta+\widehat{\log Z})+\frac{1}{4}e^{-2\Delta}e^{-2\widehat{\log Z}}-\frac{1}{4}$:

$$\mathcal{L}^{\mathrm{DG}}_{\text{Ney}}(\mathcal{B},\theta)=\frac{1}{B}\sum_{i=1}^{B}\left[\frac{1}{2}\Delta(\mathbf{y}_{i})+\mathrm{SG}\left[\frac{1}{4}\log\left(\frac{1}{B}\sum_{j=1}^{B}e^{-2\Delta(\mathbf{y}_{j})}\right)\right]+\frac{\frac{1}{4}e^{-2\Delta(\mathbf{y}_{i})}}{\mathrm{SG}\left[\frac{1}{B}\sum_{j=1}^{B}e^{-2\Delta(\mathbf{y}_{j})}\right]}-\frac{1}{4}\right].$$

##### Tempered Neyman $\chi^{2}$ Divergence

Applying the tempered transformation to Eq. 120, the normalization temperature shifts from $\beta/2$ to $1/2$.

$$\tilde{\mathcal{L}}^{\text{DG}}_{\text{Ney}}=\frac{1}{B\beta}\sum_{i=1}^{B}\left[\frac{1}{2}\log\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})^{\beta}}{\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})^{\beta}\left(B\cdot\tilde{\sigma}_{1/2}(\mathbf{r})_{i}\right)}+\frac{1}{4}\left(\frac{\left(B\cdot\tilde{\sigma}_{1/2}(\mathbf{r})_{i}\right)\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})^{\beta}}{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})^{\beta}}\right)^{2}-\frac{1}{4}\right].$$

### C.5 Squared Hellinger Distance

Generator: $f(u)=2(\sqrt{u}-1)^{2}+(u-1)$. Properties: Mode-covering / Robust. Balances mode coverage with stability. As $\Delta\to-\infty$ the gradient grows as $e^{-\Delta/2}$, more slowly than the $e^{-\Delta}$ growth of Forward KL. Derivation: With $f^{\prime}(u)=3-2u^{-1/2}$, the integrand is $2-2e^{-t/2}$.

$$\mathcal{L}_{\text{H}^{2}}(\Delta)=\int_{0}^{\Delta}(2-2e^{-t/2})\,dt=2\Delta+4e^{-\Delta/2}-4.$$

Batch-wise Normalization: The condition $\sum(2-2e^{-(\Delta_{i}+\widehat{\log Z})/2})=0$ implies $e^{-\widehat{\log Z}/2}\sum e^{-\Delta_{i}/2}=B$:

$$\widehat{\log Z}=2\log\left(\frac{1}{B}\sum_{i=1}^{B}e^{-\frac{1}{2}\Delta(\mathbf{y}_{i})}\right)\quad\xrightarrow[\text{LLM}]{\pi_{\theta}\approx\pi_{\text{ref}}}\quad 2\log\left(\frac{1}{B}\sum_{i=1}^{B}e^{\frac{1}{2\beta}r(\mathbf{x},\mathbf{y}_{i})}\right).$$

Batch-wise DevGrad Loss: Substituting $\widehat{\log Z}$ into the robust form $2(\Delta+\widehat{\log Z})+4e^{-\Delta/2}e^{-\widehat{\log Z}/2}-4$:

$$\mathcal{L}^{\mathrm{DG}}_{\text{H}^{2}}(\mathcal{B},\theta)=\frac{1}{B}\sum_{i=1}^{B}\left[2\Delta(\mathbf{y}_{i})+\mathrm{SG}\left[4\log\left(\frac{1}{B}\sum_{j=1}^{B}e^{-\frac{1}{2}\Delta(\mathbf{y}_{j})}\right)\right]+\frac{4e^{-\frac{1}{2}\Delta(\mathbf{y}_{i})}}{\mathrm{SG}\left[\frac{1}{B}\sum_{j=1}^{B}e^{-\frac{1}{2}\Delta(\mathbf{y}_{j})}\right]}-4\right].$$

Batch-wise DevGrad Loss (LLM): The normalization term corresponds to a softmax with temperature $2\beta$.

$$\mathcal{L}^{\mathrm{DG}}_{\text{H}^{2}}=\frac{1}{B}\sum_{i=1}^{B}\Bigg[2\log\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}{\left(B\cdot\tilde{\sigma}_{2\beta}(\mathbf{r})_{i}\right)\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})}+4\sqrt{\left(B\cdot\tilde{\sigma}_{2\beta}(\mathbf{r})_{i}\right)\frac{\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})}{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}}-4\Bigg].$$

##### Tempered Squared Hellinger Distance

Applying the tempered transformation to Eq. 117, the normalization temperature shifts from $2\beta$ to $2$.

$$\tilde{\mathcal{L}}^{\text{DG}}_{\text{H}^{2}}=\frac{1}{B\beta}\sum_{i=1}^{B}\left[2\log\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})^{\beta}}{(B\cdot\tilde{\sigma}_{2}(\mathbf{r})_{i})\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})^{\beta}}+4\sqrt{(B\cdot\tilde{\sigma}_{2}(\mathbf{r})_{i})\left(\frac{\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})}{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}\right)^{\beta}}-4\right].$$

### C.6 Jensen-Shannon Divergence

Generator: $f(u)=2\left[u\log u-(u+1)\log\frac{u+1}{2}\right]+\text{linear terms}$, where the factor of $2$ enforces $f^{\prime\prime}(1)=1$. Derivation: Using $f^{\prime}(u)=2\log(\frac{2u}{u+1})+1$, the integrand is $2\log\frac{2e^{t}}{e^{t}+1}$; integrating term by term gives an expression involving the dilogarithm function $\text{Li}_{2}$.

$$\mathcal{L}_{\text{JSD}}(\Delta)=\Delta^{2}+2\Delta\log 2+2\text{Li}_{2}(-e^{\Delta})+\frac{\pi^{2}}{6}.$$

Batch-wise Normalization: The condition $\sum\log(\frac{2\exp(\Delta_{i}+\widehat{\log Z})}{\exp(\Delta_{i}+\widehat{\log Z})+1})=0$ requires finding the root $Z=e^{\widehat{\log Z}}$ of:

$$\prod_{i=1}^{B}\frac{2e^{\Delta_{i}}Z}{e^{\Delta_{i}}Z+1}=1.$$

This does not have a closed-form solution and requires numerical root-finding. Batch-wise DevGrad Loss: No closed-form simplification exists. The loss is computed by numerically solving for $\widehat{\log Z}$ using the root-finding equation above and substituting it back into the loss expression:

$$\mathcal{L}^{\mathrm{DG}}_{\text{JSD}}(\mathcal{B},\theta)=\frac{1}{B}\sum_{i=1}^{B}\mathcal{L}_{\text{JSD}}(\Delta(\mathbf{y}_{i})+\mathrm{SG}\left[\widehat{\log Z}\right]).$$
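The root-finding step can be sketched with a simple bisection, since the mean of $\mathcal{L}_{\text{JSD}}^{\prime}(\Delta_i+C)=2\log\frac{2e^{\Delta_i+C}}{e^{\Delta_i+C}+1}$ is increasing in $C$, negative for very small $C$, and tends to $2\log 2>0$ for large $C$. This is only an illustration: the choice of bisection, the bracket $[-50, 50]$, and all names are assumptions, and the sketch returns the normalization and gradient weights rather than the loss value (which would need a dilogarithm routine).

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def jsd_grad(d):
    """L_JSD'(d) = 2 log(2 e^d / (e^d + 1)) = 2 (log 2 - softplus(-d))."""
    return 2.0 * (math.log(2.0) - softplus(-d))


def jsd_log_z(deltas, lo=-50.0, hi=50.0, iters=200):
    """Bisection for log_z_hat such that mean_i jsd_grad(delta_i + log_z_hat) = 0."""
    def mean_grad(c):
        return sum(jsd_grad(d + c) for d in deltas) / len(deltas)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_grad(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


deltas = [0.3, -1.2, 0.8, -0.1]  # hypothetical deviations Delta(y_i)
log_z_hat = jsd_log_z(deltas)
weights = [jsd_grad(d + log_z_hat) for d in deltas]
```

For a batch with a single element the root is $\widehat{\log Z}=-\Delta_1$, because $\mathcal{L}_{\text{JSD}}^{\prime}(0)=0$; this gives a quick correctness check for the solver.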

### C.7 $\alpha$-Divergence

Generator: $f(u)=\frac{u^{\alpha}-u}{\alpha(\alpha-1)}+\frac{\alpha-1}{\alpha}(u-1)$. Derivation: With $f^{\prime}(u)=\frac{u^{\alpha-1}}{\alpha-1}+\frac{\alpha-2}{\alpha-1}$, the integrand is $\frac{1}{\alpha-1}(e^{(\alpha-1)t}-1)$.

$$\mathcal{L}_{\alpha}(\Delta)=\frac{1}{(\alpha-1)^{2}}e^{(\alpha-1)\Delta}-\frac{\Delta}{\alpha-1}-\frac{1}{(\alpha-1)^{2}}.$$

Batch-wise Normalization: The condition $\sum\left(\frac{e^{(\alpha-1)(\Delta_{i}+\widehat{\log Z})}-1}{\alpha-1}\right)=0$ implies $e^{(\alpha-1)\widehat{\log Z}}\sum e^{(\alpha-1)\Delta_{i}}=B$. Solving for $\widehat{\log Z}$:

$$\widehat{\log Z}=\frac{1}{1-\alpha}\log\left(\frac{1}{B}\sum_{i=1}^{B}e^{(\alpha-1)\Delta(\mathbf{y}_{i})}\right)\quad\xrightarrow[\text{LLM}]{\pi_{\theta}\approx\pi_{\text{ref}}}\quad\frac{1}{1-\alpha}\log\left(\frac{1}{B}\sum_{i=1}^{B}e^{-(\alpha-1)r(\mathbf{x},\mathbf{y}_{i})/\beta}\right).$$

Batch-wise DevGrad Loss: Substituting $\widehat{\log Z}$ into the robust form with $\mathcal{C}_{\alpha}=\frac{1}{(\alpha-1)^{2}}$:

$$\mathcal{L}^{\mathrm{DG}}_{\alpha}(\mathcal{B},\theta)=\frac{1}{B}\sum_{i=1}^{B}\left[\frac{\mathcal{C}_{\alpha}e^{(\alpha-1)\Delta(\mathbf{y}_{i})}}{\mathrm{SG}\left[\frac{1}{B}\sum_{j=1}^{B}e^{(\alpha-1)\Delta(\mathbf{y}_{j})}\right]}-\frac{\Delta(\mathbf{y}_{i})+\mathrm{SG}\left[\frac{1}{1-\alpha}\log\left(\frac{1}{B}\sum_{j=1}^{B}e^{(\alpha-1)\Delta(\mathbf{y}_{j})}\right)\right]}{\alpha-1}-\mathcal{C}_{\alpha}\right].$$

Batch-wise DevGrad Loss (LLM): This effectively normalizes with temperature $\frac{\beta}{1-\alpha}$.

$$\mathcal{L}^{\mathrm{DG}}_{\alpha}=\frac{1}{B(\alpha-1)^{2}}\sum_{i=1}^{B}\Bigg[\left(B\cdot\tilde{\sigma}_{\frac{\beta}{1-\alpha}}(\mathbf{r})_{i}\right)\left(\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})}\right)^{\alpha-1}-(\alpha-1)\log\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}{\left(B\cdot\tilde{\sigma}_{\frac{\beta}{1-\alpha}}(\mathbf{r})_{i}\right)\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})}-1\Bigg].$$

This unified form shows that $\alpha$ controls the temperature of the reward softmax used for normalization. For Forward KL ($\alpha\to 0$), the temperature is $\beta$; for Pearson ($\alpha=2$), it is $-\beta$; and for Hellinger ($\alpha=0.5$), it is $2\beta$.
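The unification can be checked numerically: evaluating $\mathcal{L}_{\alpha}$ at $\alpha\in\{0,2,0.5,-1\}$ should reproduce the Forward KL, Pearson, Hellinger, and Neyman closed forms from Sections C.2 to C.5. The sketch below assumes $\alpha\neq 1$ (the Reverse KL case is a limit and is not handled); function names and the value of $\beta$ are hypothetical.

```python
import math


def alpha_loss(delta, alpha):
    """L_alpha(delta) = (exp(a*delta) - 1) / a^2 - delta / a with a = alpha - 1 != 0."""
    a = alpha - 1.0
    return (math.exp(a * delta) - 1.0) / a**2 - delta / a


def alpha_log_z(deltas, alpha):
    """Closed-form log_z_hat = log(mean_i exp((alpha - 1) * delta_i)) / (1 - alpha)."""
    a = alpha - 1.0
    return math.log(sum(math.exp(a * d) for d in deltas) / len(deltas)) / (-a)


def normalization_temperature(alpha, beta):
    """Temperature beta / (1 - alpha) of the reward softmax in the LLM form."""
    return beta / (1.0 - alpha)


d = 0.7  # arbitrary test point
fkl = d + math.exp(-d) - 1.0                        # alpha = 0
pearson = math.exp(d) - d - 1.0                     # alpha = 2
hellinger = 2.0 * d + 4.0 * math.exp(-d / 2) - 4.0  # alpha = 0.5
neyman = 0.5 * d + 0.25 * math.exp(-2 * d) - 0.25   # alpha = -1
```

Comparing `alpha_loss(d, 0.0)` with `fkl`, and likewise for the other three values of $\alpha$, confirms the special cases; `normalization_temperature` returns $\beta$, $-\beta$, and $2\beta$ for $\alpha=0$, $2$, and $0.5$ respectively.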

##### General Tempered $\alpha$-Divergence

The general form for any $\alpha$, where the normalization temperature becomes $\frac{1}{1-\alpha}$:

$$\tilde{\mathcal{L}}^{\text{DG}}_{\alpha}=\frac{1}{B\beta(\alpha-1)^{2}}\sum_{i=1}^{B}\left[\frac{C_{\alpha,\text{temp}}}{(B\cdot\tilde{\sigma}_{\frac{1}{1-\alpha}}(\mathbf{r})_{i})}\left(\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})}\right)^{\beta(\alpha-1)}-(\alpha-1)\log\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})^{\beta}}{(B\cdot\tilde{\sigma}_{\frac{1}{1-\alpha}}(\mathbf{r})_{i})\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})^{\beta}}-1\right].$$

### C.8 Total Variation

Generator: $f(u)=|u-1|$. Derivation: Assuming the subgradient convention to satisfy $f^{\prime}(1)=1$ and symmetry, we use $f^{\prime}(u)=\text{sgn}(u-1)$ for $u\neq 1$. The integrand becomes $\text{sgn}(e^{t}-1)=\text{sgn}(t)$.

$$\mathcal{L}_{\text{TV}}(\Delta)=\int_{0}^{\Delta}\text{sgn}(t)\,dt=|\Delta|.$$

Batch-wise Normalization: Minimizing the batch loss corresponds to minimizing the sum of absolute deviations $\sum_{i}|\Delta(\mathbf{y}_{i})+C|$. The minimizer is the median of the inputs with reversed sign:

$$\widehat{\log Z}=-\text{Median}\left(\{\Delta(\mathbf{y}_{i})\}_{i=1}^{B}\right)\quad\xrightarrow[\text{LLM}]{\pi_{\theta}\approx\pi_{\text{ref}}}\quad\text{Median}\left(\left\{\frac{r(\mathbf{x},\mathbf{y}_{i})}{\beta}\right\}_{i=1}^{B}\right).$$

Batch-wise DevGrad Loss: This results in the Mean Absolute Deviation (MAD) of the log-probability differences:

$$\mathcal{L}^{\mathrm{DG}}_{\text{TV}}(\mathcal{B},\theta)=\frac{1}{B}\sum_{i=1}^{B}\Big|\Delta(\mathbf{y}_{i})-\mathrm{SG}\left[\text{Median}(\{\Delta(\mathbf{y}_{j})\}_{j=1}^{B})\right]\Big|.$$

Batch-wise DevGrad Loss (LLM): This results in the Mean Absolute Deviation of the reward-adjusted log-ratios.

$$\mathcal{L}^{\mathrm{DG}}_{\text{TV}}=\frac{1}{B}\sum_{i=1}^{B}\Bigg|\log\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})}-\frac{r(\mathbf{x},\mathbf{y}_{i})}{\beta}-\mathrm{SG}\left[\text{Median}\left(\left\{\log\frac{\pi_{\theta}(\mathbf{y}_{j}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_{j}|\mathbf{x})}-\frac{r(\mathbf{x},\mathbf{y}_{j})}{\beta}\right\}_{j=1}^{B}\right)\right]\Bigg|.$$
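A minimal sketch of the Total Variation case, where the normalization is the (negated) batch median and the gradient weights are signs. The function name and deviation values are hypothetical; the weight at a sample equal to the median is taken to be zero here, which is one possible subgradient convention.

```python
import statistics


def tv_devgrad(deltas):
    """Return (log_z_hat, loss, weights) for the batch-wise Total Variation loss."""
    med = statistics.median(deltas)
    log_z_hat = -med                                         # median with reversed sign
    loss = sum(abs(d - med) for d in deltas) / len(deltas)   # mean absolute deviation
    weights = [(d > med) - (d < med) for d in deltas]        # sgn(delta_i - median)
    return log_z_hat, loss, weights


def mean_abs(deltas, c):
    """Batch objective (1/B) sum_i |delta_i + c| for an arbitrary shift c."""
    return sum(abs(d + c) for d in deltas) / len(deltas)


deltas = [0.3, -1.2, 0.8, -0.1, 2.0]  # hypothetical deviations, odd batch size
log_z_hat, loss, weights = tv_devgrad(deltas)
```

With an odd batch size the median is one of the samples, so half of the remaining weights are $+1$ and half are $-1$ and the zero-sum property holds exactly. Shifting $C$ away from `log_z_hat` in either direction cannot decrease `mean_abs`.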
##### Tempered Total Variation

Applying the tempered transformation to Eq. 130, the normalization relies on the median of the tempered deviations $\delta\approx-r$, removing the division by $\beta$ inside the absolute difference.

$$\tilde{\mathcal{L}}^{\text{DG}}_{\text{TV}}=\frac{1}{B\beta}\sum_{i=1}^{B}\left|\beta\log\frac{\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_{i}|\mathbf{x})}-r(\mathbf{x},\mathbf{y}_{i})-\text{SG}\left[\text{Median}\left(\left\{\beta\log\frac{\pi_{\theta}(\mathbf{y}_{j}|\mathbf{x})}{\pi_{\text{ref}}(\mathbf{y}_{j}|\mathbf{x})}-r(\mathbf{x},\mathbf{y}_{j})\right\}_{j=1}^{B}\right)\right]\right|.$$

## Appendix D Gradient Analysis of DevGrad Losses

In this section, we derive the closed-form gradients for the batch-wise DevGrad losses ($\mathcal{L}^{DG}_{f}$). Recall from Eq. 91 that the gradient of the DevGrad loss with respect to the parameters $\theta$ is given by:

$$\nabla_{\theta}\mathcal{L}^{DG}_{f}(\mathcal{B})=\frac{1}{B}\sum_{i=1}^{B}w_{i}^{DG}\nabla_{\theta}\log\pi_{\theta}(\mathbf{y}_{i}|\mathbf{x}),$$

where the effective gradient weight for the $i$-th sample is $w_{i}^{DG}=L^{\prime}_{f}(\Delta_{i}+\log Z)$. Due to the optimality condition of $\log Z$ (Eq. 92), these weights always satisfy the zero-sum property $\sum w_{i}^{DG}=0$, acting as a control variate.

Below, we define $\delta_{i}=\Delta(\mathbf{y}_{i})+\log Z$ as the normalized deviation. We denote the batch mean operation as $\mathbb{E}_{\mathcal{B}}[\cdot]=\frac{1}{B}\sum_{j=1}^{B}(\cdot)$.

1. Reverse KL Divergence (Standard VarGrad)

- Normalization: $\log Z=-\mathbb{E}_{\mathcal{B}}[\Delta(\mathbf{y})]$.
- Gradient Weight: $w_{i}^{DG}=\Delta_{i}-\mathbb{E}_{\mathcal{B}}[\Delta(\mathbf{y})]$

2. Forward KL Divergence

- Normalization: $\log Z$ satisfies $\mathbb{E}_{\mathcal{B}}[e^{-(\Delta+\log Z)}]=1\implies e^{\log Z}=\mathbb{E}_{\mathcal{B}}[e^{-\Delta}]$.
- Gradient Weight: $w_{i}^{DG}=1-\frac{e^{-\Delta_{i}}}{\mathbb{E}_{\mathcal{B}}[e^{-\Delta(\mathbf{y})}]}$

3. Pearson $\chi^{2}$ Divergence

- Normalization: $\log Z$ satisfies $\mathbb{E}_{\mathcal{B}}[e^{\Delta+\log Z}]=1\implies e^{-\log Z}=\mathbb{E}_{\mathcal{B}}[e^{\Delta}]$.
- Gradient Weight: $w_{i}^{DG}=\frac{e^{\Delta_{i}}}{\mathbb{E}_{\mathcal{B}}[e^{\Delta(\mathbf{y})}]}-1$

4. Neyman $\chi^{2}$ Divergence

- Normalization: $e^{2\log Z}=\mathbb{E}_{\mathcal{B}}[e^{-2\Delta}]$.
- Gradient Weight: $w_{i}^{DG}=\frac{1}{2}\left(1-\frac{e^{-2\Delta_{i}}}{\mathbb{E}_{\mathcal{B}}[e^{-2\Delta(\mathbf{y})}]}\right)$

5. Squared Hellinger Distance

- Normalization: $e^{\frac{1}{2}\log Z}=\mathbb{E}_{\mathcal{B}}[e^{-\Delta/2}]$.
- Gradient Weight: $w_{i}^{DG}=2\left(1-\frac{e^{-\Delta_{i}/2}}{\mathbb{E}_{\mathcal{B}}[e^{-\Delta(\mathbf{y})/2}]}\right)$

6. Total Variation

- Normalization: $\log Z=-\text{Median}(\{\Delta_{j}\})$.
- Gradient Weight: $w_{i}^{DG}=\text{sgn}\left(\Delta_{i}-\text{Median}(\{\Delta(\mathbf{y})\})\right)$

7. General $\alpha$-Divergence

- Normalization: $e^{(\alpha-1)\log Z}=(\mathbb{E}_{\mathcal{B}}[e^{(\alpha-1)\Delta}])^{-1}$.
- Gradient Weight: $w_{i}^{DG}=\frac{1}{\alpha-1}\left(\frac{e^{(\alpha-1)\Delta_{i}}}{\mathbb{E}_{\mathcal{B}}[e^{(\alpha-1)\Delta(\mathbf{y})}]}-1\right)$
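The closed-form weights listed above can be checked numerically. The following is a minimal pure-Python sketch (no autograd), in which the function name `devgrad_weights`, the helper `exp_ratio`, and the sample deviations are assumptions made for illustration. It evaluates each weight formula on a small batch and confirms the zero-sum property.

```python
import math
from statistics import median


def devgrad_weights(delta, name, alpha=0.5):
    """Closed-form DevGrad gradient weights for a list of deviations."""
    mean = lambda xs: sum(xs) / len(xs)

    def exp_ratio(k):
        # e^{k * delta_i} / E_B[e^{k * delta}], shifted by the max for stability
        m = max(k * d for d in delta)
        e = [math.exp(k * d - m) for d in delta]
        z = mean(e)
        return [x / z for x in e]

    if name == "ReverseKL":
        mu = mean(delta)
        return [d - mu for d in delta]
    if name == "ForwardKL":
        return [1.0 - r for r in exp_ratio(-1.0)]
    if name == "Pearson":
        return [r - 1.0 for r in exp_ratio(1.0)]
    if name == "NeymanChi2":
        return [0.5 * (1.0 - r) for r in exp_ratio(-2.0)]
    if name == "Hellinger":
        return [2.0 * (1.0 - r) for r in exp_ratio(-0.5)]
    if name == "TotalVariation":
        med = median(delta)
        return [float((d > med) - (d < med)) for d in delta]
    if name == "Alpha":
        # Requires alpha != 1; the weight formula is undefined at alpha = 1
        return [(r - 1.0) / (alpha - 1.0) for r in exp_ratio(alpha - 1.0)]
    raise ValueError(name)


delta = [0.3, -1.2, 0.8, 0.1, -0.4]  # hypothetical deviations, odd batch size
for name in ["ReverseKL", "ForwardKL", "Pearson", "NeymanChi2",
             "Hellinger", "TotalVariation", "Alpha"]:
    print(name, sum(devgrad_weights(delta, name)))
```

Comparing the formulas term by term, the general $\alpha$ weight coincides with the Pearson, forward KL, Hellinger, and Neyman weights at $\alpha=2$, $0$, $0.5$, and $-1$ respectively. For total variation, the weights sum exactly to zero only when the batch size is odd and the deviations are distinct; otherwise the sign function gives a subgradient and the sum can be off by a small integer.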

## Appendix EMinimal Implementation

In this section, we provide a minimal implementation of our losses for both the standard and the tempered cases.

    import math

    import torch


    def log_z_estimate(delta, name='ReverseKL', alpha=1):
        # Batch estimate of log Z that makes the gradient weights sum to zero.
        B = delta.size(0)
        # log E_B[exp(k * delta)], computed stably via logsumexp
        logmeanexp = lambda k: torch.logsumexp(k * delta, 0) - math.log(B)
        return {
            'ReverseKL': lambda: -delta.mean(),
            'ForwardKL': lambda: logmeanexp(-1.0),
            'Pearson': lambda: -logmeanexp(1.0),
            'Hellinger': lambda: 2.0 * logmeanexp(-0.5),
            'NeymanChi2': lambda: 0.5 * logmeanexp(-2.0),
            'TotalVariation': lambda: -delta.median(),
            'Alpha': lambda: (1.0 / (1.0 - alpha)) * logmeanexp(alpha - 1.0),
        }[name]()


    def devgrad_loss(delta, name='ReverseKL', alpha=1):
        # The alpha-divergence is singular at alpha = 1; fall back to reverse KL.
        name = 'ReverseKL' if (name == 'Alpha' and abs(alpha - 1) < 1e-4) else name
        # log Z is detached, so it acts as a constant baseline in the gradient.
        d = delta + log_z_estimate(delta, name, alpha).detach()
        return {
            'ReverseKL': lambda: 0.5 * d ** 2,
            'ForwardKL': lambda: d + (-d).exp() - 1,
            'Pearson': lambda: d.exp() - d - 1,
            'Hellinger': lambda: 2 * d + 4 * (-0.5 * d).exp() - 4,
            'NeymanChi2': lambda: 0.5 * d + 0.25 * (-2 * d).exp() - 0.25,
            'TotalVariation': lambda: d.abs(),
            'Alpha': lambda: (1 / (alpha - 1) ** 2) * ((alpha - 1) * d).exp()
            - d / (alpha - 1) - (1 / (alpha - 1) ** 2),
        }[name]().mean()


    def tempered_devgrad_loss(log_pi, log_ref, reward, beta=1.0,
                              name='ReverseKL', alpha=1):
        # Tempered deviation: beta * log(pi / pi_ref) - r
        delta_beta = (beta * (log_pi - log_ref)) - reward
        raw_loss = devgrad_loss(delta_beta, name, alpha)
        return raw_loss / beta

Listing 1: Implementation of Tempered DevGrad Loss
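One property worth checking is that the batch loss is unchanged when every deviation is shifted by the same constant, since the $\log Z$ estimate absorbs the shift. The following is a minimal torch-free sketch of that check for three of the divergences. The names `log_mean_exp`, `log_z`, `POINTWISE`, and `devgrad_value` are assumptions made for this example and mirror the formulas in the listing rather than reuse its code.

```python
import math


def log_mean_exp(xs):
    # Numerically stable log of the batch mean of exp(x)
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))


def log_z(delta, name):
    # Same normalization rules as the listing, for three divergences
    if name == "ReverseKL":
        return -sum(delta) / len(delta)
    if name == "ForwardKL":
        return log_mean_exp([-d for d in delta])
    if name == "Pearson":
        return -log_mean_exp(delta)
    raise ValueError(name)


# Pointwise losses applied to the normalized deviation d = delta + log Z
POINTWISE = {
    "ReverseKL": lambda d: 0.5 * d * d,
    "ForwardKL": lambda d: d + math.exp(-d) - 1.0,
    "Pearson": lambda d: math.exp(d) - d - 1.0,
}


def devgrad_value(delta, name):
    z = log_z(delta, name)
    return sum(POINTWISE[name](d + z) for d in delta) / len(delta)


delta = [0.3, -1.2, 0.8, 0.1]          # hypothetical deviations
shifted = [d + 5.0 for d in delta]     # same batch, shifted by a constant
for name in POINTWISE:
    print(name, devgrad_value(delta, name), devgrad_value(shifted, name))
```

Each pair of printed values should agree up to floating-point error, and a batch with identical deviations gives a loss of zero.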

## Appendix F Experiments

### F.1 Synthetic Grid Experiment

For the synthetic grid experiment, we use the same setup as Malkin et al. \[[2022b](https://arxiv.org/html/2605.15417#bib.bib20)\], with the following parameters:

- Dimension ($D$): The dimensionality of the grid.
- Side Length ($H$): The resolution of the grid per dimension.
- State Space ($\mathcal{S}$): The set of non-terminating states, representing coordinates in the grid, $\mathcal{S}^{o}=\{0,1,\dots,H-1\}^{D}$. The process begins at the initial state $s_{0}=\mathbf{0}=(0,\dots,0)$.
- Exploration Parameter ($R_{0}$): A background reward constant that determines the scarcity of the reward signal.
- Reward Function ($R$): The unnormalized density defined over terminating states $s^{\top}=(s^{1},\dots,s^{D})$. It consists of the background term $R_{0}$ plus two region-based terms that create modes near the corners:
  $$R(s^{\top})=R_{0}+0.5\prod_{d=1}^{D}\mathbb{I}\left[\left|\frac{s^{d}}{H-1}-0.5\right|\in(0.25,0.5]\right]+2\prod_{d=1}^{D}\mathbb{I}\left[\left|\frac{s^{d}}{H-1}-0.5\right|\in(0.3,0.4)\right]$$
  where $\mathbb{I}[\cdot]$ is the indicator function, which is 1 if the condition holds and 0 otherwise.
- Action Space: At any state $s=(s^{1},\dots,s^{D})$, the allowed actions are:
  1. Increment a coordinate $d$ by 1: $s\to s+e_{d}$ (allowed only if $s^{d}<H-1$).
  2. Terminate: $s\to s^{\top}$ (transition to the corresponding terminating state in $\mathcal{X}$).
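To make the reward definition above concrete, the following is a minimal Python sketch of the reward function. The function name `grid_reward` and the values $H=8$, $D=2$, and $R_{0}=10^{-3}$ used in the example are assumptions chosen for illustration, not the settings used in the experiments.

```python
def grid_reward(state, H, R0):
    """Reward of a terminating grid state, following the formula above.

    state is a tuple of D integer coordinates in {0, ..., H - 1}.
    """
    # Normalized distance of each coordinate from the grid centre, in [0, 0.5]
    dist = [abs(s / (H - 1) - 0.5) for s in state]
    # First indicator product: every coordinate lies in the outer band (0.25, 0.5]
    outer = all(0.25 < x <= 0.5 for x in dist)
    # Second indicator product: every coordinate lies in the inner band (0.3, 0.4)
    inner = all(0.3 < x < 0.4 for x in dist)
    return R0 + 0.5 * outer + 2.0 * inner


# Hypothetical 8 x 8 grid: a corner cell, a mode cell, and a central cell
for state in [(0, 0), (1, 6), (3, 3)]:
    print(state, grid_reward(state, H=8, R0=1e-3))
```

Because the inner band is contained in the outer band, a state inside a mode receives both bonus terms, while states near the centre receive only the background reward $R_{0}$.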

We also provide the following plots for on-policy training with the same losses, which show a similar trend:

![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/grid_exp/jsd_vs_trajectories.png) (a) JSD vs. Trajectories
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/grid_exp/modes_found_vs_trajectories.png) (b) Modes Found vs. Trajectories

Figure 4: An on-policy recreation of the synthetic grid experiment in the main text.

### F.2 SynFlowNet Experiments

Here we present our results from the SynFlowNet experiments across three of the tasks from Cretu et al. \[[2025](https://arxiv.org/html/2605.15417#bib.bib8)\]. All three use rewards based on bioactivity or binding affinity:

- sEH (Soluble Epoxide Hydrolase): Uses a pretrained proxy model (a Message Passing Neural Network) trained to predict the AutoDock Vina binding energy.
- GSK3$\beta$ (Glycogen Synthase Kinase-3 Beta): Used as an oracle function from the PMO benchmark to predict bioactivity.
- DRD2 (Dopamine Receptor D2): Used as an oracle function from the PMO benchmark to predict bioactivity.

We present our results across a range of values of the inverse temperature, $\beta$, in order to understand how our $\alpha$ parameter interacts with the natural way to tune mode-seeking versus mode-covering behaviour in GFlowNets. We fix all other hyperparameters to the settings used in the original SynFlowNet experiments. First, we present the reward distributions for the unique molecules generated, which indicate that annealing $\alpha$ during training yields higher-reward molecules. This can also be seen from the additional plots of the CDF of the reward function, which we add afterward. Finally, we add diversity plots across experimental settings, demonstrating that lower $\alpha$ leads to more diverse molecules and that annealing can also lead to more diverse molecules across a range of settings.

For each $\alpha,\beta$ setting we train 5 seeds, where each seed takes 3 H100 hours to train.

![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/synflownet_experiments/violin_reward_unqiue/combined_final_unique_drd2.png) (a) DRD2 reward distributions
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/synflownet_experiments/violin_reward_unqiue/combined_final_unique_gsk.png) (b) GSK3 reward distributions
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/synflownet_experiments/violin_reward_unqiue/combined_final_unique_seh.png) (c) sEH reward distributions

Figure 5: Reward distributions swept over different $\alpha$ and $\beta$ values in SynFlowNet training.

![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/synflownet_experiments/cdf_unique/cdf_scaffold_beta8_drd2.png) (a) DRD2, $\beta=8$
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/synflownet_experiments/cdf_unique/cdf_scaffold_beta8_gsk.png) (b) GSK3, $\beta=8$
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/synflownet_experiments/cdf_unique/cdf_scaffold_beta16_seh.png) (c) sEH, $\beta=16$
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/synflownet_experiments/cdf_unique/cdf_scaffold_beta16_drd2.png) (d) DRD2, $\beta=16$
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/synflownet_experiments/cdf_unique/cdf_scaffold_beta16_gsk.png) (e) GSK3, $\beta=16$
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/synflownet_experiments/cdf_unique/cdf_scaffold_beta32_seh.png) (f) sEH, $\beta=32$

Figure 6: Comparison of scaffold CDFs across different targets (columns) and $\beta$ values (rows) during SynFlowNet training. In five of the six plots, the SynFlowNet trained with annealed $\alpha$ has its reward CDF to the right of the one trained via trajectory balance, indicating that it generates a higher proportion of high-reward molecules. On the other hand, $\alpha=1.2$ leads to clear mode collapse, with the model choosing a few high-value molecules.

![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/synflownet_experiments/diversity_plots/diversity_tanimoto_unique_combined_gsk.png) (a) GSK3 diversity
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/synflownet_experiments/diversity_plots/diversity_tanimoto_unique_combined_seh.png) (b) sEH diversity
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/synflownet_experiments/diversity_plots/diversity_tanimoto_unique_combined_drd2.png) (c) DRD2 diversity

Figure 7: Tanimoto diversity of unique samples across varying $\alpha$ and $\beta$ values for GSK3, sEH, and DRD2 targets during SynFlowNet training. This demonstrates that lower $\alpha$ leads to more diverse molecules and that annealing can also lead to more diverse molecules across a range of settings.

### F.3 Conditional Sampling in Diffusion Models

Here we present additional details and further analysis for the conditional sampling in diffusion models experiment\.

#### F.3.1 Experimental Details

Our experimental setup is taken directly from Venkatraman et al. \[[2024](https://arxiv.org/html/2605.15417#bib.bib40)\], where the goal is to tune a pre-trained diffusion model for sampling MNIST digits so that it samples only odd or only even digits. This task is helpful for illustrating mode-seeking versus mode-covering behaviour because, to sample the posterior effectively, the model must cover several different digits of the target parity. A pre-trained classifier was used to define the reward for generated samples, where the reward is given by:

$$r(\mathbf{x})=\max_{c\in\mathrm{target\_class}}P(c\mid\mathbf{x}),$$

with $P(c\mid\mathbf{x})$ coming from the pre-trained classifier.
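As a small illustration of this reward, the following Python sketch takes a vector of class probabilities and returns the largest probability among the target classes. The function name `parity_reward` and the probability values are hypothetical; in the experiment the probabilities come from the pre-trained classifier.

```python
def parity_reward(class_probs, target_classes):
    """r(x) = max over target classes c of P(c | x)."""
    return max(class_probs[c] for c in target_classes)


EVEN = (0, 2, 4, 6, 8)
ODD = (1, 3, 5, 7, 9)

# Hypothetical classifier output for one generated image (sums to 1)
probs = [0.05, 0.02, 0.70, 0.03, 0.04, 0.01, 0.10, 0.02, 0.02, 0.01]
print(parity_reward(probs, EVEN), parity_reward(probs, ODD))
```

Since the reward takes a maximum rather than a sum, a sample is rewarded for being confidently recognized as any single digit of the target parity.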

##### Hyperparameters

Table 1: Model Training and Optimizer Hyperparameters

| Category | Parameter | Value |
| --- | --- | --- |
| Training | Epochs | 300 |
| | Global Batch Size | 128 |
| | Gradient Accumulation | 1 |
| | Sampling Steps | 200 |
| | Workers | 8 |
| Optimizer (Adam) | Learning Rate ($\eta$) | $6\times 10^{-4}$ |
| | $\beta_{1},\beta_{2}$ | 0.9, 0.999 |
| | $\epsilon$ | $1\times 10^{-8}$ |

#### F.3.2 Results on Posterior Fit

Here we include a variety of results and plots showing how well the different losses capture the posterior. We demonstrate that whilst the relative trajectory balance loss of Venkatraman et al. \[[2024](https://arxiv.org/html/2605.15417#bib.bib40)\] leads to the generation of high-reward samples (i.e., digits with the correct parity), this comes at the cost of sampling unevenly from the posterior digits. This effect is most extreme for the even digits, where relative trajectory balance oversamples 0s, as they are more distinct from odd digits than the other even digits are.

![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/diffusion_tuning/reward_vs_fid.png) (a) Target accuracy (i.e., proportion of target classes sampled) vs. FID to posterior, by divergence.
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/diffusion_tuning/tradeoff_aggregated.png) (b) Reward (i.e., target accuracy) vs. $\mathbb{KL}$ divergence between the fitted posterior and prior over trajectories.

Figure 8: Posterior fit by divergence: (a) target accuracy versus FID to the posterior, and (b) reward versus $\mathbb{KL}$ divergence between the fitted posterior and the prior over trajectories.

#### F.3.3 Generated Samples

![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/MNIST_digits/train_posterior_sweep_forward_kl_classeven_5_1769213383_prior_posterior_comparison_300.png) (a) Forward KL
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/MNIST_digits/train_posterior_sweep_hellinger_classeven_5_1769213383_prior_posterior_comparison_300.png) (b) Hellinger
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/MNIST_digits/train_posterior_sweep_rtb_classeven_5_1769213383_prior_posterior_comparison_300.png) (c) Reverse KL
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/MNIST_digits/train_posterior_sweep_pearson_classeven_5_1769213383_prior_posterior_comparison_300.png) (d) Pearson $\chi^{2}$

Figure 9: Comparison of $f$-divergence losses, with the more mode-covering losses on the top and the more mode-seeking losses on the bottom.

Figure 9 shows samples generated by the diffusion model tuned using each loss. Because the reverse KL and Pearson $\chi^{2}$ are more mode seeking than the Hellinger and forward KL, they show more mode collapse, leading the model to oversample 0's or 6's.

### F.4 Asynchronous RLVR

In this section, we provide details and additional plots for the asynchronous RLVR task\.

#### F.4.1 Experimental Details

To demonstrate our losses, we train on the math-group environment from Intellect \[[2025](https://arxiv.org/html/2605.15417#bib.bib14)\], which consists of a mix of samples from \[Cobbe et al., [2021](https://arxiv.org/html/2605.15417#bib.bib7)\] and Hendrycks MATH \[[Hendrycks et al.,](https://arxiv.org/html/2605.15417#bib.bib12)\]. For hyperparameters, we use all the defaults from Intellect \[[2025](https://arxiv.org/html/2605.15417#bib.bib14)\] with LoRA and a 10x multiplier on the learning rate, as advocated in Schulman and Lab \[[2025](https://arxiv.org/html/2605.15417#bib.bib32)\]. Full hyperparameter details can be found in Table 2.

| Category | Hyperparameter | Value |
| --- | --- | --- |
| General | Max Sequence Length | 4096 |
| | Training Steps | 300 |
| LoRA | Rank ($r$) | 32 |
| | Alpha ($\alpha$) | 64 |
| Optimizer | Optimizer | AdamW |
| | Learning Rate | $1\times 10^{-5}$ |
| Orchestration | Global Batch Size | 512 |
| | Rollouts per Example | 16 |
| | Async Level | 50 |
| Evaluation | Eval Examples | 600 |
| | Temperature | 1.0 |
| Loss | $\beta$ | 0.001 |
| | Tempered | True |
| | Kimi Approximation | True |

Table 2: Model Configuration and Training Hyperparameters. For all hyperparameters not listed, we use prime-rl \[Intellect, [2025](https://arxiv.org/html/2605.15417#bib.bib14)\] defaults.

We repeat this task for 4 LLMs, Qwen2.5-3B, -7B, -14B, and OLMo-2-1124-7B, with 3 seeds per model. Each model is run on 4 H100s, with two for inference and two for training, using prime-rl \[Intellect, [2025](https://arxiv.org/html/2605.15417#bib.bib14)\]. Training time varies from 3 to 5 hours depending on model size. As a comparison loss, we use the PPO implementation from prime-rl with all standard hyperparameters and, specifically, no KL regularisation.

#### F.4.2 Training Curves

![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/async_LLM/reward_curves.png) (a) Reward training curves averaged over 3 runs, smoothed with a 10-step exponential moving average.
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/async_LLM/entropy_curves.png) (b) Entropy training curves averaged over 3 runs, smoothed with a 10-step exponential moving average.

Figure 10: Training curves for reward and entropy across all four models. The large asynchronous delay causes instability in PPO training, whereas all $f$-trajectory balance losses lead to stable training.

#### F.4.3 Entropy on Downstream Tasks

We now show that the entropy tradeoff transfers to additional tasks that the models were not trained on, specifically the American Invitational Mathematics Examination (AIME) 2024 and OpenPubMedQA Jin et al. \[[2019](https://arxiv.org/html/2605.15417#bib.bib15)\].

![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/async_LLM/final_entropy_bar_aime.png) (a) Entropy by loss and model on AIME 2024.
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/async_LLM/final_entropy_bar_openmed.png) (b) Entropy by loss and model on PubMedQA \[Jin et al., [2019](https://arxiv.org/html/2605.15417#bib.bib15)\].

For completeness, we include the reward for the fully trained models; however, all trained models score very poorly on these tasks.

![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/async_LLM/final_reward_bar_aime.png) (a) Final reward by loss and model on AIME 2024.
![Refer to caption](https://arxiv.org/html/2605.15417v1/paper_figures/async_LLM/final_reward_bar_pubmed.png) (b) Final reward by loss and model on PubMedQA \[Jin et al., [2019](https://arxiv.org/html/2605.15417#bib.bib15)\].