What Type of Inference is Active Inference?

Summary

This paper analyzes Active Inference by proving that the Variational Free Energy of an augmented generative model can be decomposed into the predictive model's VFE plus explicit entropy-correction terms, yielding a full variational characterization of Expected Free Energy-based planning. The authors derive a message-passing scheme for EFE-based planning and validate it on grid-world environments.

arXiv:2606.04935v1

# What Type of Inference is Active Inference?
Source: [https://arxiv.org/html/2606.04935](https://arxiv.org/html/2606.04935)
Wouter W. L. Nuijten (Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, the Netherlands; Lazy Dynamics, Utrecht, the Netherlands), Mykola Lukashchuk, Thijs van de Laar (Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, the Netherlands), Bert de Vries (Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, the Netherlands; Lazy Dynamics, Utrecht, the Netherlands)

###### Abstract

Active inference casts decision\-making as inference, with the Expected Free Energy \(EFE\) unifying goal\-directed and information\-seeking behavior\. Recent work showed that EFE minimization can be written as Variational Free Energy \(VFE\) minimization on a generative model augmented with epistemic priors\. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy\-correction terms, making the EFE contribution transparent\. We then show that proper EFE\-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE\-based planning\. This clarifies which corrections are needed for cross\-entropy planning and for full EFE\-based planning\. The same entropy\-corrected formulation leads to a detailed message\-passing scheme for EFE\-based planning together with simpler ablations\. Experiments on three grid\-world environments show that the planning correction already helps when observations are decisive, whereas the additional observation\-side epistemic corrections matter most when observations are merely suggestive\.

## 1 Introduction

Sequential decision-making under uncertainty requires balancing exploitation of current knowledge against exploration to reduce uncertainty. Classical reinforcement learning and optimal control address this through value functions or policy optimization \[Sutton and Barto, 2018, Bertsekas, [2012](https://arxiv.org/html/2606.04935#bib.bib2)\], but typically treat reward maximization and uncertainty reduction as separate objectives.

Planning-as-Inference (PAI) offers an alternative by casting control as probabilistic inference \[Attias, [2003](https://arxiv.org/html/2606.04935#bib.bib1), Toussaint, 2009\], connecting control to variational inference and message passing \[Levine, [2018](https://arxiv.org/html/2606.04935#bib.bib17)\]. Standard PAI methods optimize objectives such as expected utility or cross-entropy to preferences, but do not include an explicit epistemic drive to reduce environmental uncertainty.

Active Inference addresses this by minimizing the Expected Free Energy (EFE), unifying instrumental and epistemic objectives \[Friston et al., [2015](https://arxiv.org/html/2606.04935#bib.bib10), Da Costa et al., [2020](https://arxiv.org/html/2606.04935#bib.bib7)\]. De Vries et al. \[[2025](https://arxiv.org/html/2606.04935#bib.bib8)\] showed that EFE minimization can be reformulated as Variational Free Energy (VFE) minimization on a model augmented with *epistemic priors*. This brings Active Inference into the variational framework, but leaves open a key distinction: obtaining EFE inside a marginal variational objective is not yet the same as planning over policies. Proper planning additionally requires the planning correction of Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16)\]. This paper makes that separation explicit and derives a message-passing scheme for the combined objective.

This paper combines these two lines of work\. Our contributions are:

- We show that proper EFE-based planning requires combining two entropy corrections: the planning correction of Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16)\], which turns the expected-utility variational objective into policy optimization, and the epistemic corrections of Nuijten et al. \[[2026](https://arxiv.org/html/2606.04935#bib.bib21)\], which turn marginal VFE minimization into EFE minimization. Together they yield a full variational characterization of EFE-based planning and clarify the difference between EFE as a marginal objective and EFE-based planning.
- We derive a principled message-passing family for these entropy-corrected objectives. Each added entropy term induces a corresponding channel reparameterization that restores Bethe coordinates, resolves the circularity of posterior-dependent epistemic priors, and recovers both variational belief propagation and full active-inference planning within the same derivation.
- We validate the framework on three grid-world environments that span a hierarchy of epistemic demands along two axes, observation scope (global vs. local) and resolution (decisive vs. suggestive). The experiments show where progressively adding corrections matters: the planning correction already helps under decisive observations, while the additional observation-side epistemic corrections matter most under suggestive observations.

[Section 2](https://arxiv.org/html/2606.04935#S2) reviews the generative model and epistemic priors. [Section 3](https://arxiv.org/html/2606.04935#S3) discusses related work. [Section 4](https://arxiv.org/html/2606.04935#S4) presents the entropy corrections and their cumulative taxonomy. [Section 5](https://arxiv.org/html/2606.04935#S5) derives the resulting message-passing family. [Section 6](https://arxiv.org/html/2606.04935#S6) provides empirical validation, and [Section 7](https://arxiv.org/html/2606.04935#S7) concludes.

## 2 Background

### 2.1 Generative Model for Sequential Decision-Making

We consider an agent that maintains a generative model predicting future observations, states, and the consequences of actions\. Following standard conventions\[Levine,[2018](https://arxiv.org/html/2606.04935#bib.bib17), Lázaro\-Gredilla et al\.,[2024](https://arxiv.org/html/2606.04935#bib.bib16)\], we write this as a rollout model:

$$
p(\bm{y},\bm{x},\bm{u},\theta) = p(\theta)\,p(x_0)\prod_{t=1}^{T} p(y_t|x_t,\theta)\,p(x_t|x_{t-1},u_t,\theta)\,p(u_t)\,, \tag{1}
$$

where $\bm{x}=(x_0,\ldots,x_T)$ are latent states, $\bm{y}=(y_1,\ldots,y_T)$ are observations, $\bm{u}=(u_1,\ldots,u_T)$ are actions, and $\theta$ are unknown model parameters. Here $t=0$ denotes the current time, and the model predicts a rollout into the future over horizon $T$. The dynamics $p(x_t|x_{t-1},u_t,\theta)$ may depend on parameters $\theta$, capturing model uncertainty. Throughout this paper we work in the discrete regime, so all integrals over $(y_t,x_t,\theta)$ in what follows reduce to finite sums.

To encode goals, we augment the model with *preference priors* $\hat{p}(x_t)$ and $\hat{p}(y_t)$ over desired states and observations \[Levine, [2018](https://arxiv.org/html/2606.04935#bib.bib17)\]:

$$
\hat{p}(\bm{y},\bm{x},\bm{u},\theta) \propto p(\theta)\,p(x_0)\prod_{t=1}^{T} p(y_t|x_t,\theta)\,p(x_t|x_{t-1},u_t,\theta)\,p(u_t)\,\hat{p}(x_t)\,\hat{p}(y_t)\,. \tag{2}
$$

These preference priors can be understood as proportional to exponentiated rewards: $\hat{p}(x)\propto\exp(R(x))$, connecting planning-as-inference to reward maximization \[Todorov, 2008\]. Together, the rollout model ([2](https://arxiv.org/html/2606.04935#S2.E2)) with preferences $\hat{p}(x_t)$ and $\hat{p}(y_t)$ defines our planning problem over horizon $T$: find a policy $q(u_t|x_{t-1})$ whose induced predicted trajectory agrees with the preferences. The policy is the decision variable; the rollout supplies predictions; the preferences encode the goal.

### 2.2 Variational Free Energy

Given a generative model, variational inference approximates the posterior by minimizing the Variational Free Energy (VFE) over a family of tractable distributions $q$ \[Blei et al., [2017](https://arxiv.org/html/2606.04935#bib.bib3)\]:

$$
F_{\hat{p}}[q] = \mathbb{D}_{\mathrm{KL}}\left[q(\bm{y},\bm{x},\bm{u},\theta)\,\|\,\hat{p}(\bm{y},\bm{x},\bm{u},\theta)\right]\,. \tag{3}
$$

Since all variables are unobserved in the planning setting (they represent future quantities), minimizing $F_{\hat{p}}[q]$ yields beliefs about future trajectories that are consistent with both the dynamics and the preference priors.
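As a minimal sketch of objective (3), the following toy example builds a one-step preference-augmented model with a known initial state and evaluates the VFE for two candidate beliefs. All numbers, array names, and the helper `vfe` are hypothetical and chosen only for illustration; the theta and observation variables are omitted to keep the example small.

```python
import numpy as np

# One-step toy problem with known x_0: two actions, three next states.
p_u = np.array([0.5, 0.5])                  # action prior p(u_1)
p_x_given_u = np.array([[0.9, 0.1, 0.0],    # p(x_1 | x_0, u_1 = 0)
                        [0.2, 0.2, 0.6]])   # p(x_1 | x_0, u_1 = 1)
R = np.array([0.0, 0.0, 2.0])               # hypothetical reward on x_1
pref_x = np.exp(R)                          # unnormalized preference prior

# Normalized preference-augmented joint p_hat(u_1, x_1).
unnorm = p_u[:, None] * p_x_given_u * pref_x[None, :]
p_hat_joint = unnorm / unnorm.sum()

def vfe(q, p):
    """D_KL[q || p] for discrete arrays, with the convention 0 log 0 = 0."""
    m = q > 0
    return float(np.sum(q[m] * (np.log(q[m]) - np.log(p[m]))))

q_prior = p_u[:, None] * p_x_given_u        # belief that ignores preferences
F_prior = vfe(q_prior, p_hat_joint)         # strictly positive
F_opt = vfe(p_hat_joint, p_hat_joint)       # unconstrained minimizer q = p_hat
q_u = p_hat_joint.sum(axis=1)               # action marginal under q = p_hat
```

In this toy setting the action marginal `q_u` shifts toward the action that can reach the rewarded state, which is the behavior marginal inference on the preference-augmented model is expected to show.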

### 2.3 Factor Graphs and the Bethe Approximation

The generative model ([2](https://arxiv.org/html/2606.04935#S2.E2)) factorizes into local terms, which can be represented as a Forney-style factor graph (FFG) \[Forney, [2001](https://arxiv.org/html/2606.04935#bib.bib9), Loeliger et al., [2007](https://arxiv.org/html/2606.04935#bib.bib19)\]. In an FFG, nodes represent factors (probability distributions) and edges represent variables; an edge connects to a node when the variable appears in that factor's scope. We write $\mathcal{E}(a)$ for the set of edges (variables) adjacent to factor node $a$, and $\mathcal{V}(i)$ for the set of factor nodes adjacent to edge $i$. The variables in the scope of factor $a$ are denoted $\bm{s}_a$.

The *Bethe approximation* \[Yedidia et al., 2005\] exploits this structure by constraining the variational distribution to respect the factorization induced by the graph. Each node $a$ maintains a local belief $q_a(\bm{s}_a)$ over its adjacent variables $\bm{s}_a$, and each edge $i$ maintains a singleton belief $q_i(s_i)$. These beliefs must satisfy local consistency constraints:

$$
\int q_a(\bm{s}_a)\,\mathrm{d}\bm{s}_{a\setminus i} = q_i(s_i) \quad \text{for all } i\in\mathcal{E}(a)\,. \tag{4}
$$

Under these constraints, with entropy corrections that prevent double-counting of shared variables, the VFE reduces to the *Bethe Free Energy*:

$$
F_{\text{Bethe}}[q] = \sum_{a\in\mathcal{V}} \mathbb{D}_{\mathrm{KL}}\left[q_a(\bm{s}_a)\,\|\,f_a(\bm{s}_a)\right] + \sum_{i\in\mathcal{E}} (d_i-1)\,\mathbb{H}\left[q_i(s_i)\right]\,, \tag{5}
$$

where $\mathcal{V}$ is the set of nodes, $\mathcal{E}$ is the set of edges, $f_a$ is the factor at node $a$, and $d_i$ is the degree (number of connected nodes) of edge $i$. Minimizing the Bethe Free Energy via message passing yields the belief propagation algorithm; on tree-structured graphs, this recovers exact marginals \[Pearl, [1982](https://arxiv.org/html/2606.04935#bib.bib27)\]. Details are provided in [Appendix C](https://arxiv.org/html/2606.04935#A3).
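The Bethe Free Energy (5) can be checked numerically on a tiny tree. The sketch below assumes a two-variable graph with a prior factor on $x_0$ and a pairwise factor on $(x_0, x_1)$, where $x_0$ is an edge of degree 2 and $x_1$ a half-edge of degree 1; the factor values are random placeholders. At the exact marginals, the Bethe Free Energy of a tree equals $-\log Z$, with $Z$ the normalizer of the factor product.

```python
import numpy as np

def entropy(q):
    """Shannon entropy of a discrete distribution (natural log)."""
    q = np.asarray(q, float).ravel()
    q = q[q > 0]
    return float(-np.sum(q * np.log(q)))

def kl_to_factor(q, f):
    """D_KL[q || f] where f is a nonnegative, possibly unnormalized factor."""
    q, f = np.asarray(q, float).ravel(), np.asarray(f, float).ravel()
    m = q > 0
    return float(np.sum(q[m] * (np.log(q[m]) - np.log(f[m]))))

rng = np.random.default_rng(0)
f_prior = rng.random(3) + 0.1         # hypothetical factor on x0
f_trans = rng.random((3, 4)) + 0.1    # hypothetical factor on (x0, x1)

joint = f_prior[:, None] * f_trans
Z = joint.sum()
q_trans = joint / Z                   # node belief of the pairwise factor
q_x0 = q_trans.sum(axis=1)            # edge belief x0, also the prior-node belief
q_x1 = q_trans.sum(axis=0)            # edge belief x1

degrees = {"x0": 2, "x1": 1}
bethe = (kl_to_factor(q_x0, f_prior)
         + kl_to_factor(q_trans, f_trans)
         + (degrees["x0"] - 1) * entropy(q_x0)
         + (degrees["x1"] - 1) * entropy(q_x1))
```

The entropy term on $x_0$ cancels the double-counted node entropy, which is why `bethe` matches `-np.log(Z)` here.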

### 2.4 Epistemic Priors

Standard variational inference does not distinguish between variable types: actions, states, observations, and parameters all enter the VFE symmetrically. Nuijten et al. \[[2026](https://arxiv.org/html/2606.04935#bib.bib21)\] clarified the *epistemic priors* $\tilde{p}(u_t)$, $\tilde{p}(x_t)$, and $\tilde{p}(y_t,x_t)$ that encode which variables are controlled, inferred, or observed. These priors augment the generative model:

$$
\tilde{p}(\bm{y},\bm{x},\bm{u},\theta) \propto \hat{p}(\bm{y},\bm{x},\bm{u},\theta)\prod_{t=1}^{T}\tilde{p}(u_t)\,\tilde{p}(x_t)\,\tilde{p}(y_t,x_t)\,. \tag{6}
$$

Each prior is defined in terms of entropies of conditionals of the variational distribution $q$. We write $\mathrm{h}\left[q(y|x)\right]$ for the *entropy of the conditional* $q(y|x)$, a function of $x$, and $\mathbb{H}\left[q(y|x)\right]$ for the *conditional entropy*, a scalar: $\mathbb{H}\left[q(y|x)\right]=\mathbb{E}_{q(x)}\left[\mathrm{h}\left[q(y|x)\right]\right]$. The priors are:

$$
\begin{aligned}
\tilde{p}(u_t) &\propto \exp\bigl(\mathrm{h}\left[q(x_t,x_{t-1}|u_t)\right]-\mathrm{h}\left[q(x_{t-1}|u_t)\right]\bigr)\,, && \text{(7a)}\\
\tilde{p}(x_t) &\propto \exp\bigl(\mathbb{E}_{q(\theta|x_t)}\left[-\mathrm{h}\left[q(y_t|x_t,\theta)\right]\right]\bigr)\,, && \text{(7b)}\\
\tilde{p}(y_t,x_t) &\propto \exp\bigl(\mathbb{D}_{\mathrm{KL}}\left[q(\theta|y_t,x_t)\,\|\,q(\theta|x_t)\right]\bigr)\,. && \text{(7c)}
\end{aligned}
$$

Nuijten et al. \[[2026](https://arxiv.org/html/2606.04935#bib.bib21)\] showed that the VFE of the augmented model $F_{\tilde{p}}[q]$ is an upper bound on the expected EFE. A notable feature is that the epistemic priors depend on the variational distribution $q$ itself, creating a circular dependency that complicates optimization. A central contribution of this paper is to make that circularity explicit as entropy corrections in the objective, rather than leaving it implicit in posterior-dependent priors.
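To make the distinction between the entropy of a conditional $\mathrm{h}$ and the conditional entropy $\mathbb{H}$ concrete, the sketch below evaluates the state-side epistemic prior (7b) for a single time step. The joint belief `q` over $(y_t, x_t, \theta)$ is a random placeholder, and the array names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.random((3, 4, 2))          # hypothetical q(y, x, theta), axes (y, x, theta)
q /= q.sum()

q_xth = q.sum(axis=0)              # q(x, theta)
q_y_given_xth = q / q_xth          # q(y | x, theta)

# Entropy of the conditional: a function of (x, theta).
h_y = -np.sum(q_y_given_xth * np.log(q_y_given_xth), axis=0)
# Conditional entropy: a scalar, the expectation of h_y under q(x, theta).
H_y = float(np.sum(q_xth * h_y))

# State-side epistemic prior (7b): exponent is E_{q(theta|x)}[-h[q(y|x,theta)]].
q_x = q_xth.sum(axis=1)
q_th_given_x = q_xth / q_x[:, None]
log_prior_x = -np.sum(q_th_given_x * h_y, axis=1)
p_tilde_x = np.exp(log_prior_x)
p_tilde_x /= p_tilde_x.sum()
```

Averaging `-log_prior_x` under `q_x` returns `H_y`, which is how this prior contributes a conditional entropy term to the VFE, up to the normalization constant of the prior.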

## 3 Related Work

##### Planning\-as\-Inference\.

The PAI framework casts optimal control as inference in graphical models \[Attias, [2003](https://arxiv.org/html/2606.04935#bib.bib1), Toussaint, 2009\], connecting control to variational methods and message passing \[Levine, [2018](https://arxiv.org/html/2606.04935#bib.bib17)\]. Closely related formulations include linearly-solvable MDPs \[Todorov, 2006\], path-integral control \[Kappen, [2005](https://arxiv.org/html/2606.04935#bib.bib14)\], KL control \[Kappen et al., [2012](https://arxiv.org/html/2606.04935#bib.bib13)\], and stochastic optimal control \[Rawlik et al., [2012](https://arxiv.org/html/2606.04935#bib.bib28)\]. A known challenge is *optimistic inference*: conditioning on goals biases posteriors toward trajectories assuming favorable outcomes \[Levine, [2018](https://arxiv.org/html/2606.04935#bib.bib17)\]. This issue was addressed by Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16)\] with an entropy correction that turns the expected-utility variational objective into a proper control objective by penalizing plans that rely on fortuitous state realizations.

##### Active inference\.

Active inference minimizes the Expected Free Energy (EFE), combining instrumental and epistemic value \[Friston et al., [2015](https://arxiv.org/html/2606.04935#bib.bib10), Da Costa et al., [2020](https://arxiv.org/html/2606.04935#bib.bib7), Parr et al., [2022](https://arxiv.org/html/2606.04935#bib.bib25)\]. Existing methods employ specialized procedures: tree search \[Friston et al., [2021](https://arxiv.org/html/2606.04935#bib.bib11)\], branching \[Champion et al., [2022](https://arxiv.org/html/2606.04935#bib.bib6)\], or dynamic programming \[Paul et al., [2024](https://arxiv.org/html/2606.04935#bib.bib26)\]. Several works have sought to unify EFE with variational inference. In a related direction, Palmieri et al. \[[2022](https://arxiv.org/html/2606.04935#bib.bib23)\] combined estimation and control via belief propagation. Building on the Generalized Free Energy \[Parr and Friston, [2019](https://arxiv.org/html/2606.04935#bib.bib24)\], Koudahl et al. \[[2023](https://arxiv.org/html/2606.04935#bib.bib15)\] and van de Laar et al. \[2024\] modified the VFE to include epistemic terms. Most recently, Nuijten et al. \[[2026](https://arxiv.org/html/2606.04935#bib.bib21)\] showed that EFE minimization can be formulated as VFE minimization with epistemic priors, and Nuijten et al. \[[2025](https://arxiv.org/html/2606.04935#bib.bib20)\] implemented this via message passing with alternating updates between the posterior and epistemic priors. A separate line of work \[O'Donoghue et al., [2020](https://arxiv.org/html/2606.04935#bib.bib22), Tarbouriech et al., 2023\] casts exploration as posterior inference over value functions, targeting uncertainty in the value function itself. This is complementary to the epistemic priors above, which target uncertainty over model parameters $\theta$. Our contribution is to connect these lines: the de Vries construction provides the EFE correction to a marginal objective, the Lázaro-Gredilla construction provides the planning correction, and their combination yields a principled message-passing formulation of EFE-based planning.

## 4 Entropy Corrections for EFE-Based Planning

We now show that the epistemic priors from [Section 2.4](https://arxiv.org/html/2606.04935#S2.SS4) and the planning correction of Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16)\] play different roles. The epistemic priors identify the corrections that transform marginal VFE minimization into EFE minimization. The planning correction turns an expected-utility variational objective into a planning objective over policies. Proper active-inference planning requires both. More broadly, specifying a planning method is a three-way modeling choice: the generative model, the assignment of variable roles (controlled, state, parameter, or observed), and the entropy corrections that select the objective. The AIF-specific commitment lives entirely in the last.

### 4.1 Entropy Corrections in Variational Inference

Recall the generative model with preference priors from [Section 2.1](https://arxiv.org/html/2606.04935#S2.SS1). Standard variational inference minimizes the VFE $F_{\hat{p}}[q]$ of the preference-augmented model without distinguishing variable roles. With no entropy corrections, this is simply marginal inference, or in the control setting, KL control \[Todorov, 2008\]. (Kappen-style tempering $\hat{p}(x)\propto\exp(R(x)/\lambda)$ \[Kappen, [2005](https://arxiv.org/html/2606.04935#bib.bib14), Kappen et al., [2012](https://arxiv.org/html/2606.04935#bib.bib13)\] parametrizes the *generative model* (the preference prior), whereas [Table 1](https://arxiv.org/html/2606.04935#S4.T1) parametrizes the *objective* via entropy corrections; the two axes are orthogonal.) Different objectives arise by adding entropy corrections to this same baseline. The key question is which corrections are needed for proper EFE-based planning.

### 4.2 Cross-Entropy Planning

Marginal variational inference minimizes a cost over the full $q$, which lets the joint commit to favourable state realizations that the policy alone cannot produce. Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16)\] showed that turning this into planning, where the extracted policy $q(u_t|x_{t-1})$ actually attains the cost it appears to minimize, requires an entropy correction that penalizes action uncertainty:

$$
\sum_{t=1}^{T}\Bigl(\mathbb{H}\left[q(x_{t-1},u_t)\right]-\mathbb{H}\left[q(x_{t-1})\right]\Bigr)=\sum_{t=1}^{T}\mathbb{H}\left[q(u_t|x_{t-1})\right]\,. \tag{8}
$$

See [Appendix A.1](https://arxiv.org/html/2606.04935#A1.SS1) for the derivation.
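The identity in (8) is the chain rule for entropy applied per time step. The following sketch checks it numerically for one step on a random joint belief over $(x_{t-1}, u_t)$, and shows that a deterministic policy incurs zero correction. The belief and function names are illustrative assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def policy_entropy(q_xu):
    """Conditional entropy H[q(u | x)] from a joint q(x, u) with axes (x, u)."""
    q_x = q_xu.sum(axis=1, keepdims=True)
    policy = q_xu / q_x
    safe = np.where(policy > 0, policy, 1.0)   # log(1) = 0 handles zero entries
    return float(-np.sum(q_xu * np.log(safe)))

rng = np.random.default_rng(2)
q_xu = rng.random((5, 3))          # hypothetical q(x_{t-1}, u_t)
q_xu /= q_xu.sum()
q_x = q_xu.sum(axis=1)

lhs = entropy(q_xu) - entropy(q_x)  # left-hand side of (8) for one step
rhs = policy_entropy(q_xu)          # right-hand side of (8) for one step

# Same state marginal, but a deterministic state-to-action map.
q_det = q_x[:, None] * np.eye(3)[[0, 1, 2, 0, 1]]
H_det = policy_entropy(q_det)
```

Because the correction is added to a minimization objective, stochastic policies pay a cost while deterministic ones pay none.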

Following the control-as-inference framework \[Levine, [2018](https://arxiv.org/html/2606.04935#bib.bib17)\], rewards can be encoded as preference distributions via $\hat{p}(x)\propto\exp(R(x))$. As shown in [Appendix A.3](https://arxiv.org/html/2606.04935#A1.SS3), adding the entropy correction ([8](https://arxiv.org/html/2606.04935#S4.E8)) to the VFE transforms the objective into minimizing the *cross-entropy* between the state marginals and the preference distribution:

$$
\min_{q}\sum_{t=1}^{T}\mathbb{H}\left[q(x_t),\hat{p}(x_t)\right]+\text{const}\,, \tag{9}
$$

where $\mathbb{H}\left[q,\hat{p}\right]=-\mathbb{E}_{q}\left[\log\hat{p}\right]$ is the cross-entropy. Since $\mathbb{H}\left[q,\hat{p}\right]=-\mathbb{E}_{q}\left[R(x)\right]+\text{const}$, minimizing cross-entropy is equivalent to maximizing expected reward.
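The equivalence between cross-entropy and expected reward can be verified directly: with $\hat{p}(x)=\exp(R(x))/Z$, the constant is $\log Z$. The reward vector and the two candidate state marginals below are hypothetical.

```python
import numpy as np

R = np.array([0.0, 1.0, 3.0])                 # hypothetical reward per state
log_Z = np.log(np.sum(np.exp(R)))
p_hat = np.exp(R - log_Z)                     # preference prior, exp(R) / Z

def cross_entropy(q, p):
    """H[q, p] = -E_q[log p] for discrete distributions."""
    return float(-np.sum(q * np.log(p)))

q_a = np.array([0.6, 0.3, 0.1])               # marginal favoring low reward
q_b = np.array([0.1, 0.2, 0.7])               # marginal favoring high reward

ce_a, ce_b = cross_entropy(q_a, p_hat), cross_entropy(q_b, p_hat)
er_a, er_b = float(q_a @ R), float(q_b @ R)   # expected rewards
```

The marginal with the higher expected reward has the lower cross-entropy, and the gap between the two quantities is the same constant `log_Z` for any `q`.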

We call this *cross-entropy planning*: the agent maximizes expected reward (equivalently, minimizes cross-entropy to preferences) while committing to a policy.

### 4.3 EFE as Entropy Corrections

The epistemic priors introduced in [Section 2](https://arxiv.org/html/2606.04935#S2) augment the generative model with terms that encode variable roles. The VFE of this augmented model can be expressed as the original VFE plus entropy corrections. This rewriting is an exact algebraic identity.

###### Theorem 1\(Entropy\-corrected form of active inference\)\.

The variational objective of Nuijten et al. \[[2026](https://arxiv.org/html/2606.04935#bib.bib21)\] can be written as:

$$
F_{\tilde{p}}[q]=F_{\hat{p}}[q]+\sum_{t=1}^{T}\Bigl(2\,\mathbb{H}\left[q(y_t|x_t,\theta)\right]-\mathbb{H}\left[q(x_t|x_{t-1},u_t)\right]-\mathbb{H}\left[q(y_t|x_t)\right]\Bigr)\,. \tag{10}
$$

###### Proof\.

See[AppendixA\.2](https://arxiv.org/html/2606.04935#A1.SS2)\. ∎

Each prior contributes a specific correction: $\tilde{p}(u_t)$ produces $-\mathbb{H}\left[q(x_t|x_{t-1},u_t)\right]$, $\tilde{p}(x_t)$ produces $+\mathbb{H}\left[q(y_t|x_t,\theta)\right]$, and $\tilde{p}(y_t,x_t)$ contributes a further $+\mathbb{H}\left[q(y_t|x_t,\theta)\right]-\mathbb{H}\left[q(y_t|x_t)\right]$ via the identity $\mathbb{E}_{q}\left[\mathbb{D}_{\mathrm{KL}}\left[q(\theta|y_t,x_t)\,\|\,q(\theta|x_t)\right]\right]=\mathbb{H}\left[q(y_t|x_t)\right]-\mathbb{H}\left[q(y_t|x_t,\theta)\right]$, producing the factor of two. The interplay yields EFE minimization within a marginal variational objective: minimizing the $+2\,\mathbb{H}\left[q(y_t|x_t,\theta)\right]$ term concentrates beliefs on state-parameter configurations under which observations are sharply informative, which is the operational meaning of *epistemic* in AIF. The two channel reparameterizations introduced in [Section 5](https://arxiv.org/html/2606.04935#S5) play opposing roles consistent with this reading: the dynamics channel divides and spreads belief over reachable states, while the observation channel multiplies and concentrates belief toward informative ones. By itself, however, ([10](https://arxiv.org/html/2606.04935#S4.E10)) does not yet yield EFE-based planning, because it lacks the planning correction ([8](https://arxiv.org/html/2606.04935#S4.E8)).
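The identity used for the $\tilde{p}(y_t,x_t)$ contribution states that the expected KL divergence equals the conditional mutual information between $\theta$ and $y_t$ given $x_t$. A numerical check on a random joint belief, with array names chosen for this sketch only, looks as follows.

```python
import numpy as np

rng = np.random.default_rng(3)
q = rng.random((3, 4, 2))            # hypothetical q(y, x, theta), axes (y, x, theta)
q /= q.sum()

q_yx = q.sum(axis=2)                 # q(y, x)
q_xth = q.sum(axis=0)                # q(x, theta)
q_x = q_yx.sum(axis=0)               # q(x)

q_th_given_yx = q / q_yx[:, :, None]     # q(theta | y, x)
q_th_given_x = q_xth / q_x[:, None]      # q(theta | x)

# Left-hand side: E_q[ KL[ q(theta | y, x) || q(theta | x) ] ].
expected_kl = float(np.sum(
    q * (np.log(q_th_given_yx) - np.log(q_th_given_x)[None, :, :])))

# Right-hand side: H[q(y | x)] - H[q(y | x, theta)].
H_y_given_x = float(-np.sum(q_yx * np.log(q_yx / q_x[None, :])))
H_y_given_xth = float(-np.sum(q * np.log(q / q_xth[None, :, :])))
info_gain = H_y_given_x - H_y_given_xth
```

Since this prior enters the VFE through $-\mathbb{E}_q[\log\tilde{p}]$, the sign flips and the contribution is `H_y_given_xth - H_y_given_x`, matching the text above.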

### 4.4 EFE-Based Planning

The missing step is to combine the marginal-EFE corrections of [Theorem 1](https://arxiv.org/html/2606.04935#Thmtheorem1) with the planning correction of [Section 4.2](https://arxiv.org/html/2606.04935#S4.SS2). Adding only the dynamics-side term $-\mathbb{H}\left[q(x_t|x_{t-1},u_t)\right]$ yields an incomplete intermediate objective that is useful as an ablation, but it is not yet full EFE-based planning because it omits the observation-side epistemic corrections.

[Appendix B](https://arxiv.org/html/2606.04935#A2) proves that the resulting EFE-based planning objective is

$$
\min_{q}\; F_{\hat{p}}[q]+\sum_{t=1}^{T}\mathbb{H}\left[q(u_t|x_{t-1})\right]+\sum_{t=1}^{T}\Bigl(2\,\mathbb{H}\left[q(y_t|x_t,\theta)\right]-\mathbb{H}\left[q(x_t|x_{t-1},u_t)\right]-\mathbb{H}\left[q(y_t|x_t)\right]\Bigr)\,. \tag{11}
$$

The first sum is the planning correction; the second sum is the EFE correction. Only their combination yields proper EFE-based planning.
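For a single time step, the correction terms in (11) can be evaluated from any joint belief by differencing marginal entropies. The sketch below assumes a random joint over $(y_1, x_1, x_0, u_1, \theta)$ and uses hypothetical helper names; it computes only the corrections, not $F_{\hat{p}}[q]$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = np.asarray(p, float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def marginal(q, keep):
    """Sum out every axis of q that is not listed in keep."""
    drop = tuple(ax for ax in range(q.ndim) if ax not in keep)
    return q.sum(axis=drop)

def cond_entropy(q, target, given):
    """H[target | given] = H[target, given] - H[given]."""
    both = set(target) | set(given)
    return entropy(marginal(q, both)) - entropy(marginal(q, set(given)))

Y, X1, X0, U, TH = range(5)          # axis labels of the joint belief
rng = np.random.default_rng(4)
q = rng.random((2, 3, 3, 2, 2))      # hypothetical q(y1, x1, x0, u1, theta)
q /= q.sum()

planning = cond_entropy(q, [U], [X0])                 # + H[q(u_1 | x_0)]
H_obs_full = cond_entropy(q, [Y], [X1, TH])           # H[q(y_1 | x_1, theta)]
H_dyn = cond_entropy(q, [X1], [X0, U])                # H[q(x_1 | x_0, u_1)]
H_obs = cond_entropy(q, [Y], [X1])                    # H[q(y_1 | x_1)]
epistemic = 2 * H_obs_full - H_dyn - H_obs
total_correction = planning + epistemic
```

Dropping `planning` gives the marginal-EFE corrections of (10), and dropping `epistemic` gives the cross-entropy planning correction of (8), so the same helper covers each ablation.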

### 4.5 Comparison of Objectives

Table 1: Entropy corrections needed to move from baseline variational inference to proper EFE-based planning.

[Table 1](https://arxiv.org/html/2606.04935#S4.T1) summarizes the progression: the planning correction changes *how* control is posed (cross-entropy planning), the EFE correction changes *what* objective is optimized (marginal EFE), and only their combination yields proper EFE-based planning, the objective implemented in our experiments. The channel reparameterizations required for message passing follow directly from these correction terms ([Section 5.1](https://arxiv.org/html/2606.04935#S5.SS1)); we turn to that next.

## 5 Message Passing for EFE-Based Planning

The full EFE-based planning objective ([11](https://arxiv.org/html/2606.04935#S4.E11)) contains policy, dynamics, and observation conditional entropy terms: the planning correction $+\mathbb{H}\left[q(u_t|x_{t-1})\right]$ together with the three EFE corrections $+2\,\mathbb{H}\left[q(y_t|x_t,\theta)\right]$, $-\mathbb{H}\left[q(x_t|x_{t-1},u_t)\right]$, and $-\mathbb{H}\left[q(y_t|x_t)\right]$. In the Bethe framework, these conditionals are ratios of region beliefs, which are not part of the optimization objective. We resolve this by introducing auxiliary conditional distributions (*channels*) that promote the conditional region beliefs to free variational parameters in the optimization objective, yielding a message-passing family that generalizes standard belief propagation.

### 5.1 Channel Reparameterization

![Refer to caption](https://arxiv.org/html/2606.04935v1/x1.png)

Figure 1: Forney factor graph for the generative model ([1](https://arxiv.org/html/2606.04935#S2.E1)). Square nodes are factors; edges are variables. The time slice between the dashed lines is repeated for $T$ timesteps.

We work throughout with Forney factor graphs (FFGs) \[Forney, [2001](https://arxiv.org/html/2606.04935#bib.bib9), Loeliger, [2004](https://arxiv.org/html/2606.04935#bib.bib18)\], in which factors are nodes and variables are edges (Figure [1](https://arxiv.org/html/2606.04935#S5.F1)); this makes the locality of channel reparameterization visually explicit, since each correction acts on a single kernel node while the remainder of the graph is unchanged from standard sum-product. The key identity is the variational characterization of conditional entropy (Gibbs' inequality):

$$
\mathbb{H}\left[q(y|x)\right]=\min_{r}\mathbb{E}_{q(y,x)}\left[-\log r(y|x)\right]\,, \tag{12}
$$

where the minimum runs over normalized conditional distributions $r$ and is attained at $r(y|x)=q(y|x)$, so the expression is an equality rather than a bound. We introduce four normalized conditional distributions as *channels*: $r_{u|x,t}(u_t|x_{t-1})$, $r_{x|xu,t}(x_t|x_{t-1},u_t)$, $r_{y|x\theta,t}(y_t|x_t,\theta)$, and $r_{y|x,t}(y_t|x_t)$, as free variational parameters for each time step $t$ (see [Appendix D](https://arxiv.org/html/2606.04935#A4) for formal definitions). Substituting ([12](https://arxiv.org/html/2606.04935#S5.E12)) into the corrections of ([11](https://arxiv.org/html/2606.04935#S4.E11)) yields a well-posed optimization. Since $+\mathbb{H}\left[q(u_t|x_{t-1})\right]$ carries a positive sign, $r_{u|x,t}$ enters in the numerator of the dynamics kernel. Since $+2\,\mathbb{H}\left[q(y_t|x_t,\theta)\right]$ carries a positive sign, $r_{y|x\theta,t}$ enters squared in the numerator of the observation kernel. Since $-\mathbb{H}\left[q(y_t|x_t)\right]$ carries a negative sign, $r_{y|x,t}$ appears in the denominator. The dynamics correction $-\mathbb{H}\left[q(x_t|x_{t-1},u_t)\right]$ is also negative, so $r_{x|xu,t}$ divides the dynamics factor. This yields the *kernels*:

$$
\begin{aligned}
\tilde{f}_{\mathrm{obs}_t}(y_t,x_t,\theta) &= \frac{p(y_t|x_t,\theta)\,r_{y|x\theta,t}^{2}(y_t|x_t,\theta)}{r_{y|x,t}(y_t|x_t)}\,, && \text{(13a)}\\
\tilde{f}_{\mathrm{dyn}_t}(x_t,x_{t-1},\theta,u_t) &= \frac{p(x_t|x_{t-1},\theta,u_t)\,r_{u|x,t}(u_t|x_{t-1})}{r_{x|xu,t}(x_t|x_{t-1},u_t)}\,. && \text{(13b)}
\end{aligned}
$$

With these substitutions, the EFE-based planning objective becomes a standard Bethe free energy over the modified factor graph, jointly optimized over beliefs and channels. The kernels ([13](https://arxiv.org/html/2606.04935#S5.E13)) replace the original factor functions in the message-passing equations, making the procedure iterative: the channel beliefs $r$ depend on variational beliefs $q$ and vice versa. The proof for $T=1$ is given in [Appendix D](https://arxiv.org/html/2606.04935#A4). The full scheme follows from the additivity of the Lagrangian and the entropic corrections; see [Appendix D.6](https://arxiv.org/html/2606.04935#A4.SS6) for details.
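The sketch below illustrates the two ingredients of this construction for the observation side: the Gibbs characterization (12), and the observation kernel (13a) evaluated with channels set to the current conditionals of $q$. The joint belief, the likelihood `p_obs`, and all names are placeholders assumed for illustration; this is not the full iterative scheme.

```python
import numpy as np

rng = np.random.default_rng(5)
q = rng.random((3, 4, 2))                 # hypothetical q(y, x, theta), axes (y, x, theta)
q /= q.sum()

q_yx = q.sum(axis=2)                      # q(y, x)
q_y_given_x = q_yx / q_yx.sum(axis=0)     # q(y | x)
H_true = float(-np.sum(q_yx * np.log(q_y_given_x)))

def channel_cost(r):
    """E_{q(y,x)}[-log r(y | x)] for a normalized channel r with axes (y, x)."""
    return float(-np.sum(q_yx * np.log(r)))

r_other = rng.random((3, 4))
r_other /= r_other.sum(axis=0, keepdims=True)   # any other normalized channel
cost_at_q = channel_cost(q_y_given_x)           # equals H[q(y | x)]
cost_other = channel_cost(r_other)              # never smaller, by Gibbs

# Observation kernel (13a) with channels fixed to the conditionals of q.
p_obs = rng.random((3, 4, 2))
p_obs /= p_obs.sum(axis=0, keepdims=True)       # hypothetical p(y | x, theta)
r_y_xth = q / q.sum(axis=0, keepdims=True)      # channel r(y | x, theta)
r_y_x = q_y_given_x                             # channel r(y | x)
f_obs_tilde = p_obs * r_y_xth**2 / r_y_x[:, :, None]
```

In an iterative scheme, one would presumably alternate between recomputing beliefs from kernels such as `f_obs_tilde` and refreshing the channels from the new beliefs.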

### 5\.2 Message\-Passing Equations

![Refer to caption](https://arxiv.org/html/2606.04935v1/x2.png)

Figure 2: Factor graph for a single time slice of the generative model ([1](https://arxiv.org/html/2606.04935#S2.E1)). The observation factor $f_{\mathrm{obs}_t}$ and dynamics factor $f_{\mathrm{dyn}_t}$ receive kernels $\tilde{f}_{\mathrm{obs}_t}$ and $\tilde{f}_{\mathrm{dyn}_t}$ from ([13](https://arxiv.org/html/2606.04935#S5.E13)) under the EFE-based planning objective. Message labels follow the schedule: numbered messages are computed first in a forward pass, then lettered messages are computed in a backward pass. The backward pass starts only after the forward pass has completed for all time slices.

Since the modified objective has a Bethe form, the stationarity conditions yield sum-product-style message updates. The only difference from standard belief propagation is that each factor uses its kernel ([13](https://arxiv.org/html/2606.04935#S5.E13)) in place of the original.

Each factor $a$ sends to a neighboring factor $b$ the integral of its kernel over incoming messages on adjacent edges:

$$\mu_{jb}(s_j)\propto\int\tilde{f}_a(\bm{s}_a)\prod_{i\in\mathcal{E}(a)\setminus j}\mu_{ia}(s_i)\,\mathrm{d}\bm{s}_{a\setminus j}. \tag{14}$$

For unmodified factors (priors, data likelihoods), $\tilde{f}_a=f_a$.
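For discrete variables the integral in (14) becomes a sum over all joint states of the factor. A minimal sketch of that update follows; the function name, the argument layout, and the toy transition table are assumptions for illustration, not the paper's implementation.

```python
import itertools


def factor_message(kernel, incoming, target, sizes):
    """Discrete version of (14).

    kernel:   function mapping a tuple of states (one per edge) to a weight
    incoming: incoming[i][s_i] is the message arriving on edge i
    target:   index of the edge the outgoing message is sent along
    sizes:    number of states on each edge
    """
    out = [0.0] * sizes[target]
    for states in itertools.product(*(range(n) for n in sizes)):
        weight = kernel(states)
        for i, s in enumerate(states):
            if i != target:  # the message on the target edge is excluded
                weight *= incoming[i][s]
        out[states[target]] += weight
    total = sum(out)
    return [v / total for v in out]


# Toy factor p(x_next | x_prev) with two states; edge 0 = x_prev, edge 1 = x_next.
TRANSITION = [[0.9, 0.1], [0.2, 0.8]]


def transition_kernel(states):
    return TRANSITION[states[0]][states[1]]


msg = factor_message(transition_kernel, [[0.5, 0.5], [1.0, 1.0]], target=1, sizes=[2, 2])
```

With a uniform incoming message on `x_prev`, the outgoing message on `x_next` is the average of the two transition rows. Swapping `transition_kernel` for a kernel of the form (13) changes nothing else in the update.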

##### Singleton beliefs\.

Singleton beliefs are computed by normalizing the product of colliding messages on an edge,

$$q^{*}(s_i)\propto\mu_{ia}(s_i)\,\mu_{ib}(s_i), \tag{15}$$

with $\{a,b\}=\mathcal{V}(i)$ the nodes adjacent to edge $i$. The full forward-backward schedule is shown in Fig. [2](https://arxiv.org/html/2606.04935#S5.F2).

##### Region beliefs\.

Region beliefs are computed by multiplying the factor function with all inbound messages and normalizing,

$$q^{*}(\bm{s}_a)\propto\tilde{f}_a(\bm{s}_a)\prod_{i\in\mathcal{E}(a)}\mu_{ia}(s_i). \tag{16}$$
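The region belief (16) and the singleton belief (15) agree on shared variables at a belief-propagation fixed point. The sketch below builds a discrete region belief and marginalizes it onto one edge; the helper names and the toy numbers are assumptions for illustration.

```python
import itertools


def region_belief(kernel, incoming, sizes):
    """Discrete version of (16): kernel times all inbound messages, normalized."""
    belief = {}
    for states in itertools.product(*(range(n) for n in sizes)):
        weight = kernel(states)
        for i, s in enumerate(states):
            weight *= incoming[i][s]
        belief[states] = weight
    total = sum(belief.values())
    return {states: w / total for states, w in belief.items()}


def marginal(belief, edge, size):
    """Sum a region belief down to the variable on a single edge."""
    out = [0.0] * size
    for states, prob in belief.items():
        out[states[edge]] += prob
    return out


# Toy factor p(x_next | x_prev); edge 0 = x_prev, edge 1 = x_next.
TRANSITION = [[0.9, 0.1], [0.2, 0.8]]


def transition_kernel(states):
    return TRANSITION[states[0]][states[1]]


forward_in = [0.5, 0.5]    # message arriving on x_prev
backward_in = [0.2, 0.8]   # message arriving on x_next
belief = region_belief(transition_kernel, [forward_in, backward_in], sizes=[2, 2])
x_next_marginal = marginal(belief, edge=1, size=2)
```

In this example the message the factor sends toward `x_next` is `[0.55, 0.45]`; multiplying it by `backward_in` and normalizing as in (15) gives `[0.11, 0.36] / 0.47`, which matches `x_next_marginal`.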

##### Channel updates\.

At the fixed point, each channel recovers the true conditional under its factor belief:

$$r_{u|x,t}^{*}(u_t|x_{t-1})=q_{u|x,t}^{*}(u_t|x_{t-1}), \tag{17a}$$

$$r_{y|x\theta,t}^{*}(y_t|x_t,\theta)=q_{y|x\theta,t}^{*}(y_t|x_t,\theta), \tag{17b}$$

$$r_{y|x,t}^{*}(y_t|x_t)=q_{y|x,t}^{*}(y_t|x_t), \tag{17c}$$

$$r_{x|xu,t}^{*}(x_t|x_{t-1},u_t)=q_{x|xu,t}^{*}(x_t|x_{t-1},u_t), \tag{17d}$$

where the beliefs $q$ are conditionals derived from the respective region beliefs around factors $f_{\mathrm{obs}_t}$ and $f_{\mathrm{dyn}_t}$ at time $t$. The marginal observation channel $r_{y|x,t}^{*}$ is obtained by marginalizing $\theta$ from the observation factor belief (see [Appendix D](https://arxiv.org/html/2606.04935#A4)).
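As a concrete reading of (17b) and (17c), both observation channels can be read off a discrete region belief $q(y_t,x_t,\theta)$ by conditioning, with $\theta$ summed out first for the marginal channel. This is a sketch under stated assumptions: the dictionary layout, the function name, and the toy belief are hypothetical, and the belief is assumed strictly positive so that no division by zero occurs.

```python
def observation_channels(belief, n_y, n_x, n_theta):
    """Return (r_{y|x,theta}, r_{y|x}) from a belief indexed by (y, x, theta)."""
    r_y_xtheta, r_y_x = {}, {}
    for x in range(n_x):
        # Marginalize theta to obtain q(y, x), then condition on x: (17c).
        q_yx = [sum(belief[(y, x, th)] for th in range(n_theta)) for y in range(n_y)]
        q_x = sum(q_yx)
        for y in range(n_y):
            r_y_x[(y, x)] = q_yx[y] / q_x
        # Condition the full belief on (x, theta): (17b).
        for th in range(n_theta):
            q_xth = sum(belief[(y, x, th)] for y in range(n_y))
            for y in range(n_y):
                r_y_xtheta[(y, x, th)] = belief[(y, x, th)] / q_xth
    return r_y_xtheta, r_y_x


# Toy belief with one state x, two observations y, two parameter values theta.
toy_belief = {
    (0, 0, 0): 0.4, (1, 0, 0): 0.1,
    (0, 0, 1): 0.1, (1, 0, 1): 0.4,
}
r_full, r_marg = observation_channels(toy_belief, n_y=2, n_x=1, n_theta=2)
```

In the toy belief the observation is informative once $\theta$ is known (`r_full[(0, 0, 0)]` is 0.8) but uniform after $\theta$ is marginalized (`r_marg[(0, 0)]` is 0.5), which is exactly the gap between the two observation channels that the kernel (13a) exploits.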

**Algorithm 1** EFE-Based Planning Message Passing

- 1: **Input:** generative model factors $\{f_{\mathrm{obs}_t},f_{\mathrm{dyn}_t}\}_{t=1}^{T}$, priors $p(\theta),p(x_0),p(u_t)$, goal priors $\hat{p}_x(x_t),\hat{p}_y(y_t)$
- 2: **Output:** action beliefs $\{q_{u_t}^{*}(u_t)\}_{t=1}^{T}$
- 3: Initialize all messages $\mu\leftarrow 1$, channels $r_{u|x},r_{y|x\theta},r_{y|x},r_{x|xu}\leftarrow$ uniform
- 4: **repeat**
  - 5: **for** $t=1,\ldots,T$ **do**
    - 6: Compute sum-product messages ([14](https://arxiv.org/html/2606.04935#S5.E14))
    - 7: Update region beliefs ([16](https://arxiv.org/html/2606.04935#S5.E16))
    - 8: Update channels ([17](https://arxiv.org/html/2606.04935#S5.E17)), ([18](https://arxiv.org/html/2606.04935#S5.E18))
    - 9: Update kernels ([13](https://arxiv.org/html/2606.04935#S5.E13))
  - 10: **end for**
  - 11: Update singleton beliefs ([15](https://arxiv.org/html/2606.04935#S5.E15))
- 12: **until** convergence
- 13: **return** $\{q_{u_t}^{*}(u_t)\}_{t=1}^{T}$

The same construction yields a family of algorithms: VBP uses only the policy reparameterization, a dynamics-only ablation additionally uses the dynamics-side reparameterization, and full EFE-based planning further adds the observation-side reparameterizations. The corresponding VBP derivation is given in [Appendix E](https://arxiv.org/html/2606.04935#A5).

### 5\.3 Convergence

The kernels ([13](https://arxiv.org/html/2606.04935#S5.E13)) contain opposing channel corrections ([Section 4](https://arxiv.org/html/2606.04935#S4)): $r_{u|x}$ and $r_{y|x\theta}$ appear in numerators, while $r_{x|xu}$ and $r_{y|x}$ appear in denominators, inducing a min-max structure in the joint optimization. Because each channel reparameterization is a local rewrite of a single kernel, the per-update cost matches standard loopy belief propagation up to the channel updates. Standard BP convergence guarantees, however, do not transfer to this min-max setting, and we apply arithmetic damping to channels for each update $n$:

$$r_c^{n}\propto(1-\lambda)\,r_c^{n-1}+\lambda\,r_c^{*}, \tag{18}$$

for each channel $c\in\{u|x,\;x|xu,\;y|x\theta,\;y|x\}$. Here $\lambda\in[0,1]$ is the damping parameter and $r_c^{*}$ denotes the newly computed channel from ([17](https://arxiv.org/html/2606.04935#S5.E17)). We select $\lambda$ per method and environment from a convergence sweep over $\lambda\in\{0.25,0.4,0.5,0.6,0.75,0.9\}$; at the selected $\lambda$ the channel-based methods reach a stationary VFE plateau within $15$–$150$ iterations (see [Appendix F.5](https://arxiv.org/html/2606.04935#A6.SS5)).
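The damping rule (18) is a convex combination of the previous channel and the newly computed one, followed by renormalization. A minimal sketch for a single conditional distribution (one row of a channel table) is given below; the function name and the example values are assumptions for illustration.

```python
def damp_channel(r_prev, r_new, lam):
    """Arithmetic damping (18) for one conditional r(. | parents)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mixed = [(1.0 - lam) * a + lam * b for a, b in zip(r_prev, r_new)]
    total = sum(mixed)  # equals 1 when both inputs are normalized
    return [v / total for v in mixed]


# Start from the uniform initialization and move a quarter of the way
# toward a freshly computed channel, using lambda = 0.25 from the sweep grid.
damped = damp_channel([0.5, 0.5], [0.9, 0.1], lam=0.25)
```

Here `damped` is approximately `[0.6, 0.4]`. Setting `lam=1.0` recovers the undamped update (17), while `lam=0.0` freezes the channel at its previous value.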

## 6 Experiments

We design experiments to test the behavioral effect of progressively adding the entropy corrections in [Table 1](https://arxiv.org/html/2606.04935#S4.T1). We evaluate on three grid-world environments with distinct uncertainty profiles: one dominated by observation noise, one requiring spatial planning to gather distance-dependent observations, and one requiring joint reasoning over dynamics and observations. Full experimental details are deferred to [Appendix F](https://arxiv.org/html/2606.04935#A6); code is available at [https://github.com/biaslab/UAI-MP-AIF-JAX](https://github.com/biaslab/UAI-MP-AIF-JAX). All environments use discrete state spaces with exact factor evaluations, isolating the effect of the entropy corrections and channel-augmented schemes from errors introduced by approximate message computation.

### 6\.1 Setup

##### Environments\.

We adapt three classic grid-world environments into epistemic planning benchmarks by treating the environment layout as an unknown parameter $\theta$ in the generative model ([1](https://arxiv.org/html/2606.04935#S2.E1)). All environments support cardinal movement but differ in observation structure, which determines which entropy corrections are needed. We characterize each environment along two axes: *scope*, whether a single observation constrains $\theta$ globally or only locally, and *resolution*, whether a precise observation decisively identifies $\theta$ or merely narrows the possibilities (suggestive).

*Frozen Lake* \[Brockman et al., [2016](https://arxiv.org/html/2606.04935#bib.bib5), towers\_gymnasium\_2024a\] (global, decisive): The agent observes binary "hole/safe" sensors for every cell on the grid, with noise that increases with distance from the agent. A low-noise reading directly constrains which configurations are consistent, so a single precise observation decisively reveals $\theta$.

*RockSample $(5,2)$* \[smith\_heuristic\_2012\] (local, decisive): Rocks are placed at known positions but have unknown quality (good or bad), defining $\theta$ ($4$ configurations). The agent passively observes a binary quality reading for the nearest rock, whose accuracy degrades with distance, and can actively CHECK the closest rock to reveal its quality at the cost of a turn. The agent can SAMPLE a rock for a reward or penalty depending on quality, or EXIT for a fixed reward. Observations are local (only the nearest rock is sensed) but decisive: CHECK fully reveals quality, so the epistemic strategy is to approach before sampling.

*Wumpus World* \[Russell and Norvig, [1995](https://arxiv.org/html/2606.04935#bib.bib29)\] (local, suggestive): Pit, wumpus, and gold positions define $\theta$ ($25$ configurations). The classic dynamics are simplified to isolate the epistemic challenge: the agent has no orientation or inventory and navigates by cardinal movement. The agent observes noisy breeze, stench, and glitter adjacency signals and is uncertain about its own position. Observations are local and suggestive: a breeze indicates a nearby pit but not which neighbor, so even precise readings do not decisively identify $\theta$, and the agent must triangulate across multiple positions.

Frozen Lake and Wumpus World include a SCAN action that switches observations to near-deterministic at the cost of one time step, with lower prior preference to indicate a higher prior cost. In RockSample, the agent can additionally CHECK the closest rock to reveal its quality at the cost of one time step; passively, observation accuracy degrades with distance. Full details are in [Appendix F](https://arxiv.org/html/2606.04935#A6).

##### Methods\.

We compare five methods. The first four correspond to message-passing implementations of the entropy-corrected objectives in [Section 4.5](https://arxiv.org/html/2606.04935#S4.SS5), with channel configurations as specified in [Algorithm 1](https://arxiv.org/html/2606.04935#alg1):

1. BP: standard belief propagation, no entropy correction.
2. VBP: cross-entropy planning, implemented as the principled channelized scheme from [Appendix E](https://arxiv.org/html/2606.04935#A5).
3. RM-MP: a dynamics-only ablation, using the planning channel together with the dynamics channel; reduces to VBP under deterministic dynamics.
4. AIF-MP: full EFE-based planning, using the planning, dynamics, and observation channels ([Algorithm 1](https://arxiv.org/html/2606.04935#alg1)).
5. Nuijten-MP \[Nuijten et al., [2025](https://arxiv.org/html/2606.04935#bib.bib20)\]: an alternating approximation to the full active-inference objective that recomputes epistemic priors outside the joint variational optimization over beliefs and channels.

All methods except BP include the planning correction, so the experiments ablate the EFE\-side corrections on top of a fixed planning baseline rather than the planning correction itself\.
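One way to read the channel-based methods is as the same kernels (13) with different subsets of channels switched on. The sketch below encodes that reading for a single kernel evaluation. It is an illustration, not the authors' implementation: the channel labels, function names, and the convention of fixing inactive channels to 1 (so that the factor stays unmodified) are assumptions, and Nuijten-MP is omitted because it is not described as a channel configuration.

```python
# Which channels each method treats as active (labels are hypothetical).
METHOD_CHANNELS = {
    "BP": set(),
    "VBP": {"u|x"},
    "RM-MP": {"u|x", "x|xu"},
    "AIF-MP": {"u|x", "x|xu", "y|x,theta", "y|x"},
}


def dynamics_kernel(p, r_u, r_x, method):
    """Pointwise value of (13b); inactive channels are fixed to 1."""
    active = METHOD_CHANNELS[method]
    numerator = r_u if "u|x" in active else 1.0
    denominator = r_x if "x|xu" in active else 1.0
    return p * numerator / denominator


def observation_kernel(p, r_y_xtheta, r_y_x, method):
    """Pointwise value of (13a); inactive channels are fixed to 1."""
    active = METHOD_CHANNELS[method]
    numerator = r_y_xtheta ** 2 if "y|x,theta" in active else 1.0
    denominator = r_y_x if "y|x" in active else 1.0
    return p * numerator / denominator


# Deterministic dynamics: on the support, p = 1 and the dynamics channel is 1,
# so the dynamics-only ablation and VBP produce the same kernel value.
vbp_value = dynamics_kernel(1.0, r_u=0.3, r_x=1.0, method="VBP")
rm_value = dynamics_kernel(1.0, r_u=0.3, r_x=1.0, method="RM-MP")
```

Under this reading, `vbp_value` and `rm_value` coincide, which mirrors the statement that RM-MP reduces to VBP under deterministic dynamics.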

### 6\.2 Results and Discussion

[Table 2](https://arxiv.org/html/2606.04935#S6.T2) reports performance for all methods across three environments. The results show where the planning correction already matters and where the additional observation-side epistemic corrections become necessary.

Table 2: Performance across three environments with 95% confidence intervals, averaged over 1000 episodes. Best per metric (non-overlapping CIs) in bold.

##### Global, decisive observations \(Frozen Lake\)\.

Both active inference methods dominate (about 96% success): AIF-MP achieves 95.9% and Nuijten-MP 95.6% (overlapping confidence intervals), substantially outperforming all baselines. Both learn to SCAN, which reveals $\theta$ directly. RM-MP performs comparably to BP and VBP (49.8% vs. 51.9% and 54.5%, overlapping confidence intervals), indicating that the dynamics-only ablation is neither beneficial nor harmful when observations are already global and decisive. Here a single precise reading resolves uncertainty, so the observation channel offers little additional benefit over the alternating heuristic.
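The overlap statements above can be reproduced approximately from the reported success rates. The sketch below assumes a normal-approximation (Wald) interval with 1000 episodes per method; the paper does not state how its confidence intervals were computed, so this is an illustrative assumption rather than the authors' procedure.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Approximate 95% interval for a success rate (normal approximation)."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


N_EPISODES = 1000  # from the Table 2 caption
aif = wald_interval(0.959, N_EPISODES)
nuijten = wald_interval(0.956, N_EPISODES)
vbp = wald_interval(0.545, N_EPISODES)
rm = wald_interval(0.498, N_EPISODES)

aif_vs_nuijten = overlaps(aif, nuijten)
aif_vs_vbp = overlaps(aif, vbp)
rm_vs_vbp = overlaps(rm, vbp)
```

Under this assumption the AIF-MP and Nuijten-MP intervals overlap, the RM-MP and VBP intervals overlap, and the AIF-MP interval is clearly separated from VBP, in line with the Frozen Lake discussion.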

##### Local, decisive observations \(RockSample\)\.

The shift from global to local scope means the agent must spatially navigate to gather information. Both active inference methods far outperform baselines: AIF-MP achieves 99.9% retrieval (reward 3.06) and Nuijten-MP 99.5% (reward 3.05), with overlapping confidence intervals. BP exits early (14.2% retrieval), while VBP and RM-MP produce identical results (48.3% retrieval, reward 1.98); this is expected because RockSample has deterministic dynamics, so the dynamics correction reduces to the identity and the dynamics-only ablation collapses to VBP. Both active inference formulations handle this regime comparably because precise local observations are decisive: CHECK fully reveals rock quality, so the additional variational parameters of the observation channel do not improve the epistemic signal.

##### Local, suggestive observations \(Wumpus World\)\.

The critical shift is from decisive to suggestive resolution: even precise SCAN readings do not fully disambiguate $\theta$, since multiple configurations produce the same breeze and stench patterns. AIF-MP is the only method that achieves robust performance (47.7%), clearly outperforming Nuijten-MP (29.2%) and all baselines; representative trajectories are shown in [Appendix F.3](https://arxiv.org/html/2606.04935#A6.SS3.SSS0.Px7). VBP (35.2%) and RM-MP (32.2%) perform comparably (overlapping confidence intervals), both outperforming BP (20.7%) and Nuijten-MP but falling short of AIF-MP. This gap is an objective mismatch, not a scheduling artefact: Nuijten et al. \[[2025](https://arxiv.org/html/2606.04935#bib.bib20)\] recompute epistemic priors *outside* the variational objective between belief-propagation sweeps, so the observation channels are never variational parameters and suggestive-observation information cannot enter the prior updates. AIF-MP instead treats all four channels as variational parameters of a single joint objective with closed-form stationary conditions ([17](https://arxiv.org/html/2606.04935#S5.E17)). Under decisive observations the observation channel is near-deterministic and the mismatch is masked (Frozen Lake, RockSample); under suggestive observations the channel carries non-trivial information that must co-adapt with beliefs, which explains the Wumpus gap.

##### Synthesis\.

The three environments form a hierarchy of epistemic demands along the scope and resolution axes, with each regime cell illustrated by a single environment; the load\-bearing claim is the qualitative pattern across regimes, not the per\-environment numbers\. Across all environments, the planning correction already explains the jump from BP to VBP, while the dynamics\-only ablation is not sufficient for robust epistemic behavior\. The full EFE\-based planning objective separates most clearly not at the global\-to\-local transition, but at the decisive\-to\-suggestive transition, where the observation\-side channels matter most\.

## 7 Conclusion

This paper clarifies the variational structure of active inference planning. [Theorem 1](https://arxiv.org/html/2606.04935#Thmtheorem1) shows that the epistemic-prior construction of Nuijten et al. \[[2026](https://arxiv.org/html/2606.04935#bib.bib21)\] admits an explicit entropy-corrected reformulation: relative to baseline VFE minimization, it adds a specific set of entropy corrections that yields marginal EFE minimization. This makes explicit which terms contribute the epistemic part of the objective and separates marginal EFE minimization from planning over policies. Proper EFE-based planning additionally requires the planning correction of Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16)\], and the combined objective leads directly to a message-passing construction via channel reparameterization. That construction recovers a family of algorithms, including VBP, a dynamics-only ablation, and full EFE-based planning ([Algorithm 1](https://arxiv.org/html/2606.04935#alg1)).

Empirically, the distinction between these objectives matters in a structured way\. Across our exact discrete benchmarks, the planning correction already explains the improvement from BP to VBP, whereas the full observation\-side epistemic corrections matter most when observations are suggestive rather than decisive\. This is the regime in which the joint channelized scheme separates most clearly from both simpler planning objectives and the alternating approximation\.

##### Limitations and future work\.

The opposing signs of the entropy corrections induce a min-max structure in the joint optimization over beliefs and channels, which in practice requires heavy damping for stable convergence ([Section 5.3](https://arxiv.org/html/2606.04935#S5.SS3)). Standard belief propagation convergence guarantees do not transfer to this setting, and developing convergence theory for the channel-augmented scheme is an open problem. The damped channel updates also introduce an additional tuning parameter $\lambda$, the damping parameter in ([18](https://arxiv.org/html/2606.04935#S5.E18)), which currently has to be selected manually per method and environment. We restrict to discrete state spaces where exact factor evaluations are available, isolating the effect of the channel-augmented scheme from errors introduced by approximate message computation; understanding how the channel reparameterization interacts with further factorization constraints on the variational posterior (e.g., mean-field or structured approximations) is an important direction.

###### Author Contributions\.

W.W.L. Nuijten and M. Lukashchuk contributed equally to this work. W.W.L. Nuijten developed the entropy decomposition framework. M. Lukashchuk derived the message-passing scheme. Both authors contributed to writing and experiments. T. van de Laar contributed to the conceptualization of the method and to supervision. B. de Vries had a supervisory and editorial role.

###### Acknowledgements\.

This publication is part of the project ROBUST: Trustworthy AI\-based Systems for Sustainable Growth with project number KICH3\.LTP\.20\.006, which is \(partly\) financed by the Dutch Research Council \(NWO\), GN Hearing, and the Dutch Ministry of Economic Affairs and Climate Policy \(EZK\) under the program LTP KIC 2020\-2023\.

## References

- Attias \[2003\] Hagai Attias. Planning by probabilistic inference. In *International Workshop on Artificial Intelligence and Statistics*, pages 9–16. PMLR, 2003.
- Bertsekas \[2012\] Dimitri Bertsekas. *Dynamic Programming and Optimal Control: Volume I*, volume 4. Athena Scientific, 2012.
- Blei et al. \[2017\] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational Inference: A Review for Statisticians. *Journal of the American Statistical Association*, 112(518):859–877, April 2017. ISSN 0162-1459. [10.1080/01621459.2017.1285773](https://doi.org/10.1080/01621459.2017.1285773).
- Bradbury et al. \[2018\] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: Composable transformations of Python+NumPy programs, 2018.
- Brockman et al. \[2016\] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, June 2016.
- Champion et al. \[2022\] Théophile Champion, Lancelot Da Costa, Howard Bowman, and Marek Grześ. Branching Time Active Inference: The theory and its generality. *Neural Networks*, 151:295–316, July 2022. ISSN 0893-6080. [10.1016/j.neunet.2022.03.036](https://doi.org/10.1016/j.neunet.2022.03.036).
- Da Costa et al. \[2020\] Lancelot Da Costa, Thomas Parr, Noor Sajid, Sebastijan Veselic, Victorita Neacsu, and Karl Friston. Active inference on discrete state-spaces: A synthesis. *Journal of Mathematical Psychology*, 99:102447, December 2020. ISSN 0022-2496. [10.1016/j.jmp.2020.102447](https://doi.org/10.1016/j.jmp.2020.102447).
- De Vries et al. \[2025\] Bert De Vries, Wouter Nuijten, Thijs van de Laar, Wouter Kouw, Sepideh Adamiat, Tim Nisslbeck, Mykola Lukashchuk, Hoang Minh Huu Nguyen, Marco Hidalgo Araya, Raphael Tresor, Thijs Jenneskens, Ivana Nikoloska, Raaja Ganapathy Subramanian, Bart van Erp, Dmitry Bagaev, and Albert Podusenko. Expected Free Energy-based Planning as Variational Inference, April 2025.
- Forney \[2001\] G. David Forney. Codes on graphs: Normal realizations. *IEEE Transactions on Information Theory*, 47(2):520–548, 2001.
- Friston et al. \[2015\] Karl Friston, Francesco Rigoli, Dimitri Ognibene, Christoph Mathys, Thomas Fitzgerald, and Giovanni Pezzulo. Active inference and epistemic value. *Cognitive Neuroscience*, 6(4):187–214, October 2015. ISSN 1758-8928, 1758-8936. [10.1080/17588928.2015.1020053](https://doi.org/10.1080/17588928.2015.1020053).
- Friston et al. \[2021\] Karl Friston, Lancelot Da Costa, Danijar Hafner, Casper Hesp, and Thomas Parr. Sophisticated Inference. *Neural Computation*, 33(3):713–763, March 2021. ISSN 0899-7667. [10.1162/neco\_a\_01351](https://doi.org/10.1162/neco_a_01351).
- Heskes \[2006\] T. Heskes. Convexity Arguments for Efficient Minimization of the Bethe and Kikuchi Free Energies. *Journal of Artificial Intelligence Research*, 26:153–190, June 2006. ISSN 1076-9757. [10.1613/jair.1933](https://doi.org/10.1613/jair.1933).
- Kappen et al. \[2012\] B. Kappen, V. Gomez, and M. Opper. Optimal control as a graphical model inference problem. *Machine Learning*, 87(2):159–182, May 2012. ISSN 0885-6125, 1573-0565. [10.1007/s10994-012-5278-7](https://doi.org/10.1007/s10994-012-5278-7).
- Kappen \[2005\] H. J. Kappen. Path integrals and symmetry breaking for optimal control theory. *Journal of Statistical Mechanics: Theory and Experiment*, 2005(11):P11011, November 2005. ISSN 1742-5468. [10.1088/1742-5468/2005/11/P11011](https://doi.org/10.1088/1742-5468/2005/11/P11011).
- Koudahl et al. \[2023\] Magnus Koudahl, Thijs van de Laar, and Bert de Vries. Realising Synthetic Active Inference Agents, Part I: Epistemic Objectives and Graphical Specification Language, June 2023.
- Lázaro-Gredilla et al. \[2024\] Miguel Lázaro-Gredilla, Li Yang Ku, Kevin P. Murphy, and Dileep George. What type of inference is planning? In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, *Advances in Neural Information Processing Systems*, volume 37, pages 116705–116742. Curran Associates, Inc., 2024. [10.52202/079017-3705](https://doi.org/10.52202/079017-3705).
- Levine \[2018\] Sergey Levine. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review, May 2018.
- Loeliger \[2004\] H.-A. Loeliger. An introduction to factor graphs. *IEEE Signal Processing Magazine*, 21(1):28–41, January 2004. ISSN 1558-0792. [10.1109/MSP.2004.1267047](https://doi.org/10.1109/MSP.2004.1267047).
- Loeliger et al. \[2007\] Hans-Andrea Loeliger, Justin Dauwels, Junli Hu, Sascha Korl, Li Ping, and Frank R. Kschischang. The Factor Graph Approach to Model-Based Signal Processing. *Proceedings of the IEEE*, 95(6):1295–1322, June 2007. ISSN 0018-9219. [10.1109/JPROC.2007.896497](https://doi.org/10.1109/JPROC.2007.896497).
- Nuijten et al. \[2025\] Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, and Bert de Vries. A message passing realization of expected free energy minimization. In *International Workshop on Active Inference*, pages 75–98. Springer, 2025. [10.48550/arXiv.2508.02197](https://doi.org/10.48550/arXiv.2508.02197).
- Nuijten et al. \[2026\] Wouter W. L. Nuijten, Thijs van de Laar, and Bert de Vries. Expected free energy-based planning as variational inference. *Transactions on Machine Learning Research*, 2026. ISSN 2835-8856.
- O'Donoghue et al. \[2020\] Brendan O'Donoghue, Ian Osband, and Catalin Ionescu. Making sense of reinforcement learning and probabilistic inference. In *International Conference on Learning Representations*, 2020.
- Palmieri et al. \[2022\] Francesco A. N. Palmieri, Krishna R. Pattipati, Giovanni Di Gennaro, Giovanni Fioretti, Francesco Verolla, and Amedeo Buonanno. A Unifying View of Estimation and Control Using Belief Propagation With Application to Path Planning. *IEEE Access*, 10:15193–15216, 2022. ISSN 2169-3536. [10.1109/ACCESS.2022.3148127](https://doi.org/10.1109/ACCESS.2022.3148127).
- Parr and Friston \[2019\] Thomas Parr and Karl J. Friston. Generalised free energy and active inference. *Biological Cybernetics*, 113(5):495–513, December 2019. ISSN 1432-0770. [10.1007/s00422-019-00805-w](https://doi.org/10.1007/s00422-019-00805-w).
- Parr et al. \[2022\] Thomas Parr, Giovanni Pezzulo, and Karl J. Friston. *Active Inference: The Free Energy Principle in Mind, Brain, and Behavior*. The MIT Press, March 2022. ISBN 978-0-262-36997-8. [10.7551/mitpress/12441.001.0001](https://doi.org/10.7551/mitpress/12441.001.0001).
- Paul et al. \[2024\] Aswin Paul, Noor Sajid, Lancelot Da Costa, and Adeel Razi. On efficient computation in active inference. *Expert Systems with Applications*, 253:124315, November 2024. ISSN 0957-4174. [10.1016/j.eswa.2024.124315](https://doi.org/10.1016/j.eswa.2024.124315).
- Pearl \[1982\] Judea Pearl. Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach. In *AAAI-82 Proceedings*, pages 133–136, Carnegie Mellon University, Pittsburgh PA, 1982. AAAI Press.
- Rawlik et al. \[2012\] Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. *Proceedings of Robotics: Science and Systems VIII*, 2012.
- Russell and Norvig \[1995\] Stuart Russell and Peter Norvig. *Artificial Intelligence: A Modern Approach*. Prentice Hall, Englewood Cliffs, NJ, 1995.

# What Type of Inference is Active Inference? \(Supplementary Material\)

## Appendix A Entropy Corrections

### A\.1 Entropy Correction for Planning

We derive the entropy correction that distinguishes planning-as-inference from marginal inference. Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16)\] formulate planning-as-inference using a "planning entropy" that excludes action variables from the trajectory entropy. Here we show that this formulation is equivalent to adding an entropy correction to the standard VFE, and derive the form of this correction.

#### A\.1\.1 The Planning Entropy of Lázaro\-Gredilla et al\. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16)\]

Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16)\] define the planning entropy as:

$$\mathbb{H}\left[q(x_0)\right]+\sum_{t=1}^{T}\mathbb{H}_q[x_t|x_{t-1},u_t], \tag{A.1}$$

where $\mathbb{H}_q[x_t|x_{t-1},u_t]=-\int q(x_t,x_{t-1},u_t)\log q(x_t|x_{t-1},u_t)\,\mathrm{d}x_t\,\mathrm{d}x_{t-1}\,\mathrm{d}u_t$ denotes the conditional entropy. This differs from the full trajectory entropy $\mathbb{H}\left[q(\bm{x},\bm{u})\right]$ by excluding the action entropy.

#### A\.1\.2 Derivation of the Entropy Correction

###### Proposition 3 \(Planning entropy decomposition\)\.

The planning entropy ([A.1](https://arxiv.org/html/2606.04935#A1.E1)) equals the full trajectory entropy plus an entropy correction:

$$\mathbb{H}\left[q(x_0)\right]+\sum_{t=1}^{T}\mathbb{H}_q[x_t|x_{t-1},u_t]=\mathbb{H}\left[q(\bm{x},\bm{u})\right]+\sum_{t=1}^{T}\Bigl(\mathbb{H}\left[q(x_{t-1})\right]-\mathbb{H}\left[q(x_{t-1},u_t)\right]\Bigr). \tag{A.2}$$

###### Proof\.

Starting from the planning entropy and expanding the conditional entropy:

$$\begin{aligned}\mathbb{H}\left[q(x_0)\right]&+\sum_{t=1}^{T}\mathbb{H}_q[x_t|x_{t-1},u_t]\\&=\mathbb{H}\left[q(x_0)\right]+\sum_{t=1}^{T}\left(-\iiint q(x_t,x_{t-1},u_t)\log q(x_t|x_{t-1},u_t)\,\mathrm{d}x_t\,\mathrm{d}x_{t-1}\,\mathrm{d}u_t\right)\end{aligned} \tag{A.3}$$

$${}=\mathbb{H}\left[q(x_0)\right]+\sum_{t=1}^{T}\left(-\iiint q(x_t,x_{t-1},u_t)\log\frac{q(x_t,x_{t-1},u_t)}{q(u_t|x_{t-1})\,q(x_{t-1})}\,\mathrm{d}x_t\,\mathrm{d}x_{t-1}\,\mathrm{d}u_t\right) \tag{A.4}$$

Splitting the logarithm:

$$\begin{aligned}{}=\mathbb{H}\left[q(x_0)\right]+\sum_{t=1}^{T}\biggl(&-\iiint q(x_t,x_{t-1},u_t)\log\frac{q(x_t,x_{t-1},u_t)}{q(x_{t-1})}\,\mathrm{d}x_t\,\mathrm{d}x_{t-1}\,\mathrm{d}u_t\\&+\iiint q(x_t,x_{t-1},u_t)\log q(u_t|x_{t-1})\,\mathrm{d}x_t\,\mathrm{d}x_{t-1}\,\mathrm{d}u_t\biggr)\end{aligned} \tag{A.5}$$

$$\begin{aligned}{}=\mathbb{H}\left[q(x_0)\right]&+\sum_{t=1}^{T}\underbrace{\left(-\iiint q(x_t,x_{t-1},u_t)\log q(x_t,u_t|x_{t-1})\,\mathrm{d}x_t\,\mathrm{d}x_{t-1}\,\mathrm{d}u_t\right)}_{\mathbb{H}_q[x_t,u_t|x_{t-1}]}\\&+\sum_{t=1}^{T}\iint q(x_{t-1},u_t)\log\frac{q(x_{t-1},u_t)}{q(x_{t-1})}\,\mathrm{d}x_{t-1}\,\mathrm{d}u_t\end{aligned} \tag{A.6}$$

The first sum gives the trajectory entropy by the chain rule:

$$\mathbb{H}\left[q(x_0)\right]+\sum_{t=1}^{T}\mathbb{H}_q[x_t,u_t|x_{t-1}]=\mathbb{H}\left[q(\bm{x},\bm{u})\right]. \tag{A.7}$$

The second sum expands as:

$$\begin{aligned}\sum_{t=1}^{T}&\iint q(x_{t-1},u_t)\log\frac{q(x_{t-1},u_t)}{q(x_{t-1})}\,\mathrm{d}x_{t-1}\,\mathrm{d}u_t\\&=\sum_{t=1}^{T}\Biggl(\underbrace{\iint q(x_{t-1},u_t)\log q(x_{t-1},u_t)\,\mathrm{d}x_{t-1}\,\mathrm{d}u_t}_{-\mathbb{H}\left[q(x_{t-1},u_t)\right]}-\underbrace{\int q(x_{t-1})\log q(x_{t-1})\,\mathrm{d}x_{t-1}}_{-\mathbb{H}\left[q(x_{t-1})\right]}\Biggr)\end{aligned} \tag{A.8}$$

$${}=\sum_{t=1}^{T}\Bigl(\mathbb{H}\left[q(x_{t-1})\right]-\mathbb{H}\left[q(x_{t-1},u_t)\right]\Bigr). \tag{A.9}$$

Combining these results proves the proposition. ∎
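Proposition 3 can be sanity-checked numerically for $T=1$, where the chain rule (A.7) holds exactly for any joint $q(x_0,u_1,x_1)$. The sketch below draws a random discrete joint and compares both sides of (A.2); the helper names, the random seed, and the binary state spaces are assumptions made for this example.

```python
import itertools
import math
import random


def entropy(dist):
    """Shannon entropy of a distribution stored as {state_tuple: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def marginalize(joint, keep):
    """Sum out every coordinate whose index is not listed in `keep`."""
    out = {}
    for states, p in joint.items():
        key = tuple(states[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


random.seed(0)
# Random strictly positive joint over (x0, u1, x1), each variable binary.
weights = {s: random.random() + 0.1 for s in itertools.product(range(2), repeat=3)}
norm = sum(weights.values())
q = {s: w / norm for s, w in weights.items()}

q_x0 = marginalize(q, [0])
q_x0u1 = marginalize(q, [0, 1])

# Left-hand side of (A.2): H[q(x0)] + H_q[x1 | x0, u1], from the definition.
cond_entropy = -sum(p * math.log(p / q_x0u1[(s[0], s[1])]) for s, p in q.items())
lhs = entropy(q_x0) + cond_entropy

# Right-hand side of (A.2): trajectory entropy plus the entropy correction.
rhs = entropy(q) + entropy(q_x0) - entropy(q_x0u1)

# The correction is minus the action entropy H[q(u1 | x0)], which is nonnegative.
action_entropy = entropy(q_x0u1) - entropy(q_x0)
```

Both sides agree to numerical precision, and because `action_entropy` is nonnegative the planning entropy never exceeds the full trajectory entropy.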

#### A.1.3 Interpretation

Since we minimize the VFE (rather than maximize, as in Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16)\]), the planning entropy is *subtracted* from the objective. This means the entropy correction $\sum_{t}\bigl(\mathbb{H}\left[q(x_{t-1},u_{t})\right]-\mathbb{H}\left[q(x_{t-1})\right]\bigr)=\sum_{t}\mathbb{H}\left[q(u_{t}|x_{t-1})\right]$ is *added* to the VFE.

Adding this correction, which is non-negative for discrete actions, penalizes action uncertainty: since we minimize the objective, a high $\mathbb{H}\left[q(u_{t}|x_{t-1})\right]$ increases the cost, pushing the agent toward a deterministic policy.

### A.2 Proof of Theorem [1](https://arxiv.org/html/2606.04935#Thmtheorem1)

We prove that the VFE of the augmented model ([6](https://arxiv.org/html/2606.04935#S2.E6)) decomposes into the original VFE plus entropy-correction terms. The proof requires three lemmas, each showing how one epistemic prior contributes to the entropy correction.

###### Lemma 4 (State epistemic prior contribution).

Let $q(\bm{y},\bm{x},\theta,\bm{u})$ be a variational distribution over the generative model ([1](https://arxiv.org/html/2606.04935#S2.E1)), and let the state epistemic prior be defined as in ([7b](https://arxiv.org/html/2606.04935#S2.E7.2)):

$$
\tilde{p}(x_{t})=\exp\bigl(\mathbb{E}_{q(\theta|x_{t})}\left[-\mathrm{h}\left[q(y_{t}|x_{t},\theta)\right]\right]\bigr)\,. \tag{A.10}
$$

Then:

$$
-\int q(x_{t})\log\tilde{p}(x_{t})\,\mathrm{d}x_{t}=\mathbb{H}\left[q(y_{t}|x_{t},\theta)\right]\,. \tag{A.11}
$$

###### Proof.

Substituting the definition of $\tilde{p}(x_{t})$ and expanding the conditional entropy:

$$
\begin{aligned}
-\int q(x_{t})\log\tilde{p}(x_{t})\,\mathrm{d}x_{t}
&= -\int q(x_{t})\int q(\theta|x_{t})\int q(y_{t}|x_{t},\theta)\log q(y_{t}|x_{t},\theta)\,\mathrm{d}y_{t}\,\mathrm{d}\theta\,\mathrm{d}x_{t} && \text{(A.12)} \\
&= -\iiint q(y_{t}|x_{t},\theta)\,q(\theta|x_{t})\,q(x_{t})\log q(y_{t}|x_{t},\theta)\,\mathrm{d}y_{t}\,\mathrm{d}\theta\,\mathrm{d}x_{t} && \text{(A.13)} \\
&= -\iiint q(y_{t},x_{t},\theta)\log q(y_{t}|x_{t},\theta)\,\mathrm{d}y_{t}\,\mathrm{d}x_{t}\,\mathrm{d}\theta && \text{(A.14)} \\
&= \mathbb{H}\left[q(y_{t}|x_{t},\theta)\right]\,. && \text{(A.15)}
\end{aligned}
$$

∎
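As a sanity check on Lemma 4, the following sketch evaluates both sides of (A.11) for a randomly drawn discrete joint $q(y_{t},x_{t},\theta)$. The array shapes and the random joint are assumptions made purely for illustration; integrals become sums.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical discrete joint q(y_t, x_t, theta); axes: y (3), x (4), theta (2).
q = rng.random((3, 4, 2))
q /= q.sum()

q_xth = q.sum(axis=0)                # q(x_t, theta)
q_x = q_xth.sum(axis=1)              # q(x_t)
q_y_given = q / q_xth[None]          # q(y_t | x_t, theta)
q_th_given_x = q_xth / q_x[:, None]  # q(theta | x_t)

# h[q(y_t | x_t, theta)] for every (x_t, theta) pair.
h_y = -(q_y_given * np.log(q_y_given)).sum(axis=0)

# log of the state epistemic prior (A.10): E_{q(theta|x_t)}[-h].
log_p_tilde_x = (q_th_given_x * (-h_y)).sum(axis=1)

lhs = float(-(q_x * log_p_tilde_x).sum())    # left-hand side of (A.11)
rhs = float(-(q * np.log(q_y_given)).sum())  # H[q(y_t | x_t, theta)]

print(lhs, rhs)
```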

###### Lemma 5 (Action epistemic prior contribution).

Let $q(\bm{y},\bm{x},\theta,\bm{u})$ be a variational distribution over the generative model ([1](https://arxiv.org/html/2606.04935#S2.E1)), and let the action epistemic prior be defined as in ([7a](https://arxiv.org/html/2606.04935#S2.E7.1)):

$$
\tilde{p}(u_{t})=\exp\bigl(\mathrm{h}\left[q(x_{t},x_{t-1}|u_{t})\right]-\mathrm{h}\left[q(x_{t-1}|u_{t})\right]\bigr)\,. \tag{A.16}
$$

Then:

$$
-\int q(u_{t})\log\tilde{p}(u_{t})\,\mathrm{d}u_{t}=-\mathbb{H}\left[q(x_{t}|x_{t-1},u_{t})\right]\,. \tag{A.17}
$$

###### Proof.

Substituting the definition of $\tilde{p}(u_{t})$:

$$
\begin{aligned}
-\int q(u_{t})\log\tilde{p}(u_{t})\,\mathrm{d}u_{t}
&= \int q(u_{t})\Biggl(\iint q(x_{t},x_{t-1}|u_{t})\log q(x_{t},x_{t-1}|u_{t})\,\mathrm{d}x_{t}\,\mathrm{d}x_{t-1} \\
&\qquad - \int q(x_{t-1}|u_{t})\log q(x_{t-1}|u_{t})\,\mathrm{d}x_{t-1}\Biggr)\mathrm{d}u_{t} && \text{(A.18)} \\
&= \int q(u_{t})\Biggl(\iint\frac{q(x_{t},x_{t-1},u_{t})}{q(u_{t})}\log\frac{q(x_{t},x_{t-1},u_{t})}{q(u_{t})}\,\mathrm{d}x_{t}\,\mathrm{d}x_{t-1} \\
&\qquad - \int\frac{q(x_{t-1},u_{t})}{q(u_{t})}\log\frac{q(x_{t-1},u_{t})}{q(u_{t})}\,\mathrm{d}x_{t-1}\Biggr)\mathrm{d}u_{t} && \text{(A.19)} \\
&= \iiint q(x_{t},x_{t-1},u_{t})\log\frac{q(x_{t},x_{t-1},u_{t})}{q(u_{t})}\,\mathrm{d}x_{t}\,\mathrm{d}x_{t-1}\,\mathrm{d}u_{t} \\
&\qquad - \iint q(x_{t-1},u_{t})\log\frac{q(x_{t-1},u_{t})}{q(u_{t})}\,\mathrm{d}x_{t-1}\,\mathrm{d}u_{t} && \text{(A.20)} \\
&= -\mathbb{H}\left[q(x_{t},x_{t-1},u_{t})\right]+\mathbb{H}\left[q(u_{t})\right]+\mathbb{H}\left[q(x_{t-1},u_{t})\right]-\mathbb{H}\left[q(u_{t})\right] && \text{(A.21)} \\
&= \mathbb{H}\left[q(x_{t-1},u_{t})\right]-\mathbb{H}\left[q(x_{t},x_{t-1},u_{t})\right]=-\mathbb{H}\left[q(x_{t}|x_{t-1},u_{t})\right]\,. && \text{(A.22)}
\end{aligned}
$$

∎
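Lemma 5 can be checked the same way. This sketch assumes a random discrete joint $q(x_{t},x_{t-1},u_{t})$ (an illustrative choice), builds $\log\tilde{p}(u_{t})$ from the two per-action entropies in (A.16), and compares the result with $-\mathbb{H}[q(x_{t}|x_{t-1},u_{t})]$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical joint q(x_t, x_{t-1}, u_t); axes: x_t (3), x_{t-1} (3), u_t (2).
q = rng.random((3, 3, 2))
q /= q.sum()

q_u = q.sum(axis=(0, 1))  # q(u_t)
q_x0u = q.sum(axis=0)     # q(x_{t-1}, u_t)

q_x1x0_given_u = q / q_u      # q(x_t, x_{t-1} | u_t), broadcast over the action axis
q_x0_given_u = q_x0u / q_u    # q(x_{t-1} | u_t)

# Per-action entropies h[.] entering the action epistemic prior (A.16).
h_joint = -(q_x1x0_given_u * np.log(q_x1x0_given_u)).sum(axis=(0, 1))
h_prev = -(q_x0_given_u * np.log(q_x0_given_u)).sum(axis=0)
log_p_tilde_u = h_joint - h_prev

lhs = float(-(q_u * log_p_tilde_u).sum())  # left-hand side of (A.17)

# Right-hand side of (A.17): minus the conditional entropy H[q(x_t | x_{t-1}, u_t)].
q_x1_given = q / q_x0u[None]
rhs = float((q * np.log(q_x1_given)).sum())

print(lhs, rhs)
```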

###### Lemma 6 (Observation epistemic prior contribution).

Let $q(\bm{y},\bm{x},\theta,\bm{u})$ be a variational distribution over the generative model ([1](https://arxiv.org/html/2606.04935#S2.E1)), and let the observation epistemic prior be defined as in ([7c](https://arxiv.org/html/2606.04935#S2.E7.3)):

$$
\tilde{p}(y_{t},x_{t})=\exp\bigl(\mathbb{D}_{\mathrm{KL}}\left[q(\theta|y_{t},x_{t})\,\|\,q(\theta|x_{t})\right]\bigr)\,. \tag{A.23}
$$

Then:

$$
-\iint q(y_{t},x_{t})\log\tilde{p}(y_{t},x_{t})\,\mathrm{d}y_{t}\,\mathrm{d}x_{t}=\mathbb{H}\left[q(y_{t}|x_{t},\theta)\right]-\mathbb{H}\left[q(y_{t}|x_{t})\right]\,. \tag{A.24}
$$

###### Proof.

Substituting the definition of $\tilde{p}(y_{t},x_{t})$:

$$
\begin{aligned}
-\iint q(y_{t},x_{t})&\log\tilde{p}(y_{t},x_{t})\,\mathrm{d}y_{t}\,\mathrm{d}x_{t} \\
&= -\iint q(y_{t},x_{t})\left(\int q(\theta|y_{t},x_{t})\log\frac{q(\theta|y_{t},x_{t})}{q(\theta|x_{t})}\,\mathrm{d}\theta\right)\mathrm{d}y_{t}\,\mathrm{d}x_{t} && \text{(A.25)} \\
&= -\iint q(y_{t},x_{t})\left(\int q(\theta|y_{t},x_{t})\left[\log\frac{q(y_{t},x_{t},\theta)}{q(y_{t},x_{t})}-\log\frac{q(x_{t},\theta)}{q(x_{t})}\right]\mathrm{d}\theta\right)\mathrm{d}y_{t}\,\mathrm{d}x_{t} && \text{(A.26)} \\
&= -\iiint q(y_{t},x_{t},\theta)\bigl(\log q(y_{t},x_{t},\theta)-\log q(y_{t},x_{t})-\log q(x_{t},\theta)+\log q(x_{t})\bigr)\,\mathrm{d}y_{t}\,\mathrm{d}x_{t}\,\mathrm{d}\theta && \text{(A.27)} \\
&= \underbrace{-\iiint q(y_{t},x_{t},\theta)\log q(y_{t},x_{t},\theta)\,\mathrm{d}y_{t}\,\mathrm{d}x_{t}\,\mathrm{d}\theta}_{\mathbb{H}\left[q(y_{t},x_{t},\theta)\right]} \\
&\qquad + \underbrace{\iiint q(y_{t},x_{t},\theta)\log q(y_{t},x_{t})\,\mathrm{d}y_{t}\,\mathrm{d}x_{t}\,\mathrm{d}\theta}_{-\mathbb{H}\left[q(y_{t},x_{t})\right]} \\
&\qquad + \underbrace{\iiint q(y_{t},x_{t},\theta)\log q(x_{t},\theta)\,\mathrm{d}y_{t}\,\mathrm{d}x_{t}\,\mathrm{d}\theta}_{-\mathbb{H}\left[q(x_{t},\theta)\right]} \\
&\qquad \underbrace{{}-\iiint q(y_{t},x_{t},\theta)\log q(x_{t})\,\mathrm{d}y_{t}\,\mathrm{d}x_{t}\,\mathrm{d}\theta}_{\mathbb{H}\left[q(x_{t})\right]} && \text{(A.28)} \\
&= \mathbb{H}\left[q(y_{t},x_{t},\theta)\right]-\mathbb{H}\left[q(y_{t},x_{t})\right]-\mathbb{H}\left[q(x_{t},\theta)\right]+\mathbb{H}\left[q(x_{t})\right] && \text{(A.29)} \\
&= \bigl(\mathbb{H}\left[q(y_{t},x_{t},\theta)\right]-\mathbb{H}\left[q(x_{t},\theta)\right]\bigr)-\bigl(\mathbb{H}\left[q(y_{t},x_{t})\right]-\mathbb{H}\left[q(x_{t})\right]\bigr) && \text{(A.30)} \\
&= \mathbb{H}\left[q(y_{t}|x_{t},\theta)\right]-\mathbb{H}\left[q(y_{t}|x_{t})\right]\,. && \text{(A.31)}
\end{aligned}
$$

∎
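A numerical check of Lemma 6 follows the same pattern. The sketch assumes a random discrete joint $q(y_{t},x_{t},\theta)$ (illustrative only), computes the KL divergence in (A.23) for every $(y_{t},x_{t})$ pair, and compares minus its expectation with the difference of conditional entropies in (A.24).

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical discrete joint q(y_t, x_t, theta); axes: y (3), x (4), theta (2).
q = rng.random((3, 4, 2))
q /= q.sum()

q_yx = q.sum(axis=2)     # q(y_t, x_t)
q_xth = q.sum(axis=0)    # q(x_t, theta)
q_x = q_xth.sum(axis=1)  # q(x_t)

q_th_given_yx = q / q_yx[:, :, None]  # q(theta | y_t, x_t)
q_th_given_x = q_xth / q_x[:, None]   # q(theta | x_t)

# log p~(y_t, x_t) = KL[q(theta | y_t, x_t) || q(theta | x_t)], as in (A.23).
kl = (q_th_given_yx * np.log(q_th_given_yx / q_th_given_x[None])).sum(axis=2)
lhs = float(-(q_yx * kl).sum())  # left-hand side of (A.24)

# Right-hand side of (A.24): H[q(y|x,theta)] - H[q(y|x)].
h_y_given_xth = float(-(q * np.log(q / q_xth[None])).sum())
h_y_given_x = float(-(q_yx * np.log(q_yx / q_x[None])).sum())
rhs = h_y_given_xth - h_y_given_x

print(lhs, rhs)
```

Because the left-hand side is minus an expected KL divergence, it is never positive: conditioning on $\theta$ cannot increase the entropy of $y_{t}$.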

###### Proof of Theorem [1](https://arxiv.org/html/2606.04935#Thmtheorem1).

The VFE of the augmented model ([6](https://arxiv.org/html/2606.04935#S2.E6)) is:

$$
\begin{aligned}
F_{\tilde{p}}[q]
&= \int q(\bm{y},\bm{x},\theta,\bm{u})\log\frac{q(\bm{y},\bm{x},\theta,\bm{u})}{\tilde{p}(\bm{y},\bm{x},\theta,\bm{u})} && \text{(A.32)} \\
&= \int q(\bm{y},\bm{x},\theta,\bm{u})\log\frac{q(\bm{y},\bm{x},\theta,\bm{u})}{p(\bm{y},\bm{x},\theta,\bm{u})\prod_{t=1}^{T}\tilde{p}(x_{t})\tilde{p}(u_{t})\tilde{p}(y_{t},x_{t})} && \text{(A.33)} \\
&= \underbrace{\int q(\bm{y},\bm{x},\theta,\bm{u})\log\frac{q(\bm{y},\bm{x},\theta,\bm{u})}{p(\bm{y},\bm{x},\theta,\bm{u})}}_{F_{\hat{p}}[q]} \\
&\qquad - \sum_{t=1}^{T}\Biggl(\iiiint q(\bm{y},\bm{x},\theta,\bm{u})\log\tilde{p}(x_{t})\,\mathrm{d}\bm{y}\,\mathrm{d}\bm{x}\,\mathrm{d}\theta\,\mathrm{d}\bm{u} \\
&\qquad\qquad + \iiiint q(\bm{y},\bm{x},\theta,\bm{u})\log\tilde{p}(u_{t})\,\mathrm{d}\bm{y}\,\mathrm{d}\bm{x}\,\mathrm{d}\theta\,\mathrm{d}\bm{u} \\
&\qquad\qquad + \iiiint q(\bm{y},\bm{x},\theta,\bm{u})\log\tilde{p}(y_{t},x_{t})\,\mathrm{d}\bm{y}\,\mathrm{d}\bm{x}\,\mathrm{d}\theta\,\mathrm{d}\bm{u}\Biggr) && \text{(A.34)} \\
&= F_{\hat{p}}[q]+\sum_{t=1}^{T}\left(-\int q(x_{t})\log\tilde{p}(x_{t})\,\mathrm{d}x_{t}-\int q(u_{t})\log\tilde{p}(u_{t})\,\mathrm{d}u_{t}-\iint q(y_{t},x_{t})\log\tilde{p}(y_{t},x_{t})\,\mathrm{d}y_{t}\,\mathrm{d}x_{t}\right)\,. && \text{(A.35)}
\end{aligned}
$$

Applying Lemmas [4](https://arxiv.org/html/2606.04935#Thmtheorem4)–[6](https://arxiv.org/html/2606.04935#Thmtheorem6):

$$
\begin{aligned}
F_{\tilde{p}}[q]
&= F_{\hat{p}}[q]+\sum_{t=1}^{T}\Bigl(\underbrace{\mathbb{H}\left[q(y_{t}|x_{t},\theta)\right]}_{\text{Lemma 4}}\underbrace{{}-\mathbb{H}\left[q(x_{t}|x_{t-1},u_{t})\right]}_{\text{Lemma 5}}+\underbrace{\mathbb{H}\left[q(y_{t}|x_{t},\theta)\right]-\mathbb{H}\left[q(y_{t}|x_{t})\right]}_{\text{Lemma 6}}\Bigr) && \text{(A.36)} \\
&= F_{\hat{p}}[q]+\sum_{t=1}^{T}\Bigl(2\,\mathbb{H}\left[q(y_{t}|x_{t},\theta)\right]-\mathbb{H}\left[q(x_{t}|x_{t-1},u_{t})\right]-\mathbb{H}\left[q(y_{t}|x_{t})\right]\Bigr)\,. && \text{(A.37)}
\end{aligned}
$$

∎

### A.3 Cross-Entropy Interpretation

We show that planning-as-inference from Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16)\] is equivalent to minimizing cross-entropy to preference distributions.

#### A.3.1 Reward as Cross-Entropy

###### Proposition 7 (Reward as cross-entropy).

For any $\lambda>0$, define $\lambda$-scaled preference distributions $\hat{p}_{\lambda}(x_{t})\propto\exp(\lambda R_{t}(x_{t}))$. Then for any distribution $q(\bm{x},\bm{u})$:

$$
\mathbb{E}_{q(\bm{x},\bm{u})}\left[\sum_{t=1}^{T}R_{t}(x_{t})\right]=-\frac{1}{\lambda}\sum_{t=1}^{T}\mathbb{H}\left[q(x_{t}),\hat{p}_{\lambda}(x_{t})\right]+\text{const}\,. \tag{A.38}
$$

###### Proof.

With $\log\hat{p}_{\lambda}(x_{t})=\lambda R_{t}(x_{t})-\log Z_{t,\lambda}$, where $Z_{t,\lambda}=\int\exp(\lambda R_{t}(x_{t}))\,\mathrm{d}x_{t}$:

$$
\begin{aligned}
\mathbb{H}\left[q(x_{t}),\hat{p}_{\lambda}(x_{t})\right]
&= -\mathbb{E}_{q(x_{t})}\left[\log\hat{p}_{\lambda}(x_{t})\right] && \text{(A.39)} \\
&= -\mathbb{E}_{q(x_{t})}\left[\lambda R_{t}(x_{t})-\log Z_{t,\lambda}\right] && \text{(A.40)} \\
&= -\lambda\,\mathbb{E}_{q(x_{t})}\left[R_{t}(x_{t})\right]+\log Z_{t,\lambda}\,. && \text{(A.41)}
\end{aligned}
$$

Rearranging: $\mathbb{E}_{q(x_{t})}\left[R_{t}(x_{t})\right]=-\frac{1}{\lambda}\mathbb{H}\left[q(x_{t}),\hat{p}_{\lambda}(x_{t})\right]+\frac{1}{\lambda}\log Z_{t,\lambda}$. Summing over $t$ gives the result, where $\frac{1}{\lambda}\sum_{t}\log Z_{t,\lambda}$ is constant with respect to $q$. ∎
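The rearrangement in Proposition 7 is easy to confirm for a single time step. The sketch below assumes a discrete state space (so the normalizer $Z_{t,\lambda}$ is a sum rather than an integral), a randomly drawn reward vector, and $\lambda=2$; all of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)

lam = 2.0                    # hypothetical scale lambda > 0
reward = rng.normal(size=5)  # hypothetical rewards R_t(x_t) over 5 states
q = rng.random(5)
q /= q.sum()                 # arbitrary marginal q(x_t)

# Preference distribution p_hat_lambda(x_t) proportional to exp(lambda * R_t(x_t)).
log_Z = float(np.log(np.exp(lam * reward).sum()))
log_p_hat = lam * reward - log_Z

cross_entropy = float(-(q * log_p_hat).sum())  # H[q, p_hat_lambda], as in (A.39)
expected_reward = float((q * reward).sum())    # E_q[R_t]

# Rearranged (A.41): E_q[R_t] = -(1/lambda) H[q, p_hat_lambda] + (1/lambda) log Z.
reconstructed = -cross_entropy / lam + log_Z / lam

print(expected_reward, reconstructed)
```

Since `log_Z` does not depend on `q`, maximizing the expected reward over `q` and minimizing the cross-entropy pick out the same distribution.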

#### A.3.2 Connection to Planning-as-Inference

Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16), Theorem 1\] show that the best exponential utility

$$
F_{\lambda}^{\text{planning}}=\frac{1}{\lambda}\log\mathbb{E}_{p(\bm{x},\bm{u})}\left[\exp\left(\lambda\sum_{t=1}^{T}R_{t}\right)\right] \tag{A.42}
$$

can be expressed as the result of a variational optimization problem whose objective includes the expected sum of rewards $\mathbb{E}_{q}\left[\sum_{t}R_{t}\right]$ as one of its terms.

By Proposition [7](https://arxiv.org/html/2606.04935#Thmtheorem7) with $\hat{p}_{\lambda}(x_{t})\propto\exp(\lambda R_{t}(x_{t}))$, this reward term equals (up to a constant) $-\frac{1}{\lambda}\sum_{t}\mathbb{H}\left[q(x_{t}),\hat{p}_{\lambda}(x_{t})\right]$. Since the variational bound in Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16), Theorem 1\] has an overall $\frac{1}{\lambda}$ scaling, this factor cancels, and maximizing $F_{\lambda}^{\text{planning}}$ is equivalent to minimizing $\sum_{t}\mathbb{H}\left[q(x_{t}),\hat{p}_{\lambda}(x_{t})\right]$ along with the dynamics and entropy terms.

This establishes that with $\hat{p}_{\lambda}(x_{t})\propto\exp(\lambda R_{t}(x_{t}))$, the expected utility becomes cross-entropy to preference distributions. Consequently, the inference procedure from Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16)\] minimizes cross-entropy by optimizing its variational objective. Different values of $\lambda$ yield different preference distributions (corresponding to different risk attitudes), but the underlying mechanism remains cross-entropy minimization.
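To make the role of $\lambda$ as a risk attitude concrete, the sketch below evaluates the exponential utility (A.42) for a fixed, hypothetical trajectory distribution: six trajectories with made-up total rewards `returns` and probabilities `probs`. It relies on two standard properties of the exponential utility rather than on anything specific to this paper: the value approaches the expected return as $\lambda\to 0$ and the best-case return as $\lambda\to\infty$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical total rewards of 6 trajectories and their probabilities under p(x, u).
returns = rng.normal(size=6)
probs = rng.random(6)
probs /= probs.sum()


def exp_utility(lam):
    """(1/lam) * log E_p[exp(lam * return)], computed with a max-shift for stability."""
    a = lam * returns
    m = a.max()
    return float((m + np.log((probs * np.exp(a - m)).sum())) / lam)


risk_neutral = float((probs * returns).sum())  # expected return, the lam -> 0 limit
best_case = float(returns.max())               # the lam -> infinity limit

for lam in (1e-6, 0.5, 2.0, 1e3):
    print(lam, exp_utility(lam))
```

Small $\lambda$ therefore behaves like a risk-neutral average over trajectories, while large $\lambda$ weights the most rewarding trajectories ever more heavily.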

## Appendix B EFE-based Planning Inference

###### Theorem 8 (EFE-based planning inference).

Consider the augmented model $\tilde{p}(\bm{y},\bm{x},\bm{u},\theta)$ from ([6](https://arxiv.org/html/2606.04935#S2.E6)) and a set of reactive policies $\bm{\pi}=\{\pi_{t}(u_{t}|x_{t-1})\}_{t=1}^{T}$. Then:

$$
\max_{\bm{\pi}}\log\sum_{\bm{y},\bm{x},\bm{u},\theta}\tilde{p}(\bm{y},\bm{x},\bm{u},\theta)\prod_{t=1}^{T}\pi_{t}(u_{t}|x_{t-1})=\max_{q}\;\Bigl\{\bigl\langle\log\tilde{p}\bigr\rangle_{q}+\mathbb{H}\left[q(\bm{y},\bm{x},\bm{u},\theta)\right]-\sum_{t=1}^{T}\mathbb{H}\left[q(u_{t}|x_{t-1})\right]\Bigr\}\,, \tag{B.1}
$$

where $q(\bm{y},\bm{x},\bm{u},\theta)$ is an arbitrary variational distribution and the optimal policy is $\pi_{t}^{*}(u_{t}|x_{t-1})=q^{*}(u_{t}|x_{t-1})$. Equivalently, up to an additive constant independent of $q$:

$$
\max_{\bm{\pi}}\log\sum_{\bm{y},\bm{x},\bm{u},\theta}\tilde{p}(\bm{y},\bm{x},\bm{u},\theta)\prod_{t=1}^{T}\pi_{t}(u_{t}|x_{t-1})=-\min_{q}\;\Bigl\{F_{\tilde{p}}[q]+\sum_{t=1}^{T}\mathbb{H}\left[q(u_{t}|x_{t-1})\right]\Bigr\}\,, \tag{B.2}
$$

where $F_{\tilde{p}}[q]=\mathbb{D}_{\mathrm{KL}}\left[q\,\|\,\tilde{p}\right]$. The braces make explicit that both the maximization in (B.1) and the minimization in (B.2) range over the full bracketed objective, including the policy-entropy sum.

###### Proof.

This is the variational formulation of planning from Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16), Appendix A\], rewritten in the notation of this paper. To connect our formulation to theirs, note that, up to a normalization constant independent of $\bm{\pi}$,

$$
\begin{aligned}
&\sum_{\bm{y},\bm{x},\bm{u},\theta}\tilde{p}(\bm{y},\bm{x},\bm{u},\theta)\prod_{t=1}^{T}\pi_{t}(u_{t}|x_{t-1}) \\
&\qquad\propto\sum_{\bm{y},\bm{x},\bm{u},\theta}\rho_{\bm{\pi}}(\bm{y},\bm{x},\bm{u},\theta)\prod_{t=1}^{T}\hat{p}(x_{t})\hat{p}(y_{t})\,,
\end{aligned}
\tag{B.3}
$$

where

$$
\rho_{\bm{\pi}}(\bm{y},\bm{x},\bm{u},\theta):=p(\theta)\,p(x_{0})\prod_{t=1}^{T}p(y_{t}|x_{t},\theta)\,p(x_{t}|x_{t-1},u_{t},\theta)\,p(u_{t})\,\tilde{p}(u_{t})\,\tilde{p}(x_{t})\,\tilde{p}(y_{t},x_{t})\,\pi_{t}(u_{t}|x_{t-1})\,. \tag{B.4}
$$

Thus the objective is the log evidence of a preference-weighted rollout model, with $\log\hat{p}(x_{t})$ and $\log\hat{p}(y_{t})$ playing the role of rewards, exactly as in Appendix [A.3](https://arxiv.org/html/2606.04935#A1.SS3). The remaining changes relative to Lázaro-Gredilla et al. \[[2024](https://arxiv.org/html/2606.04935#bib.bib16)\] are that: (i) actions are indexed by the state they lead to, so policies are written as $\pi_{t}(u_{t}|x_{t-1})$; and (ii) the planning entropy differs from the full trajectory entropy by the correction $\sum_{t}\mathbb{H}\left[q(u_{t}|x_{t-1})\right]$ derived in Appendix [A.1](https://arxiv.org/html/2606.04935#A1.SS1). The optimal policy satisfies $\pi_{t}^{*}(u_{t}|x_{t-1})=q^{*}(u_{t}|x_{t-1})$ by the same maximization over policies. ∎

###### Corollary 9 (Combined entropy correction).

Applying Theorem [1](https://arxiv.org/html/2606.04935#Thmtheorem1) to expand $F_{\tilde{p}}[q]$, the objective ([B.2](https://arxiv.org/html/2606.04935#A2.E2)) becomes

$$
\min_{q}\;F_{\hat{p}}[q]+\underbrace{\sum_{t=1}^{T}\Bigl(2\,\mathbb{H}\left[q(y_{t}|x_{t},\theta)\right]-\mathbb{H}\left[q(x_{t}|x_{t-1},u_{t})\right]-\mathbb{H}\left[q(y_{t}|x_{t})\right]\Bigr)}_{\Delta^{\mathrm{AIF}}}+\underbrace{\sum_{t=1}^{T}\mathbb{H}\left[q(u_{t}|x_{t-1})\right]}_{\Delta^{\mathrm{planning}}}\,. \tag{B.5}
$$

###### Proof.

Direct substitution of Theorem [1](https://arxiv.org/html/2606.04935#Thmtheorem1) into ([B.2](https://arxiv.org/html/2606.04935#A2.E2)). ∎

## Appendix C Bethe Free Energy Details

This appendix reviews the standard derivation of the Bethe Free Energy (BFE) from the Variational Free Energy (VFE) on factor graphs, following Yedidia et al. \[2005\] and Wainwright and Jordan \[2008\]. The material is collected here for self-containedness and to establish the notation used in [Section 2.3](https://arxiv.org/html/2606.04935#S2.SS3) and subsequent appendices.

### C.1 Factorized Generative Model on an FFG

![Refer to caption](https://arxiv.org/html/2606.04935v1/x3.png)

Figure 3: A Forney-style factor graph representing the factorization $p(\bm{x})=f_{1}(x_{1})\,f_{2}(x_{1},x_{2},x_{3})\,f_{3}(x_{2},x_{4})\,f_{4}(x_{3})\,f_{5}(x_{4})$. Square nodes denote factors; edges denote variables.

Consider a generative model that factorizes as

$$
p(\bm{s})=\prod_{a\in\mathcal{V}}f_{a}(\bm{s}_{a})\,, \tag{C.1}
$$

where each factor $f_{a}$ depends on a subset of variables $\bm{s}_{a}\subseteq\bm{s}$. This factorization is represented as a Forney-style factor graph (FFG) $\mathcal{G}=(\mathcal{V},\mathcal{E})$ \[Forney, [2001](https://arxiv.org/html/2606.04935#bib.bib9), Loeliger et al., [2007](https://arxiv.org/html/2606.04935#bib.bib19)\] (see [Figure 3](https://arxiv.org/html/2606.04935#A3.F3)), where $\mathcal{V}$ denotes the set of factor nodes and $\mathcal{E}$ the set of edges (variables). An edge $i\in\mathcal{E}$ connects to a node $a\in\mathcal{V}$ whenever variable $s_{i}$ appears in the scope of $f_{a}$. We write $\mathcal{E}(a)$ for the edges adjacent to node $a$ and $\mathcal{V}(i)$ for the nodes adjacent to edge $i$. The *degree* of edge $i$ is $d_{i}=|\mathcal{V}(i)|$, counting the number of factors in which $s_{i}$ participates.

### C.2 The Bethe Variational Family

The Bethe approximation \[Yedidia et al., 2005\] parameterizes the variational distribution in terms of *local beliefs*: a factor belief $q_{a}(\bm{s}_{a})$ for each node $a\in\mathcal{V}$ and a singleton belief $q_{i}(s_{i})$ for each edge $i\in\mathcal{E}$. These beliefs define a pseudo-distribution via the *Bethe factorization*:

$$
q_{\mathrm{Bethe}}(\bm{s})=\frac{\prod_{a\in\mathcal{V}}q_{a}(\bm{s}_{a})}{\prod_{i\in\mathcal{E}}q_{i}(s_{i})^{d_{i}-1}}\,. \tag{C.2}
$$

The denominator corrects the over-counting: since each variable $s_{i}$ appears in $d_{i}$ factor beliefs, naively multiplying all $q_{a}$ would count $q_{i}$ a total of $d_{i}$ times. Dividing by $q_{i}^{d_{i}-1}$ reduces this to a single effective copy. For a variable that appears in only one factor ($d_{i}=1$), no correction is needed.

This parameterization is valid only when the beliefs satisfy *local consistency constraints*:

$$
\begin{aligned}
\int q_{a}(\bm{s}_{a})\,\mathrm{d}\bm{s}_{a\setminus i} &= q_{i}(s_{i})\quad\text{for all }a\in\mathcal{V},\;i\in\mathcal{E}(a)\,, && \text{(C.3a)} \\
\int q_{a}(\bm{s}_{a})\,\mathrm{d}\bm{s}_{a} &= 1\quad\text{for all }a\in\mathcal{V}\,. && \text{(C.3b)}
\end{aligned}
$$

The marginalization constraint ([C.3a](https://arxiv.org/html/2606.04935#A3.E3.1)) requires that each factor belief, when marginalized over all variables except $s_{i}$, agrees with the singleton belief $q_{i}$. Together with normalization ([C.3b](https://arxiv.org/html/2606.04935#A3.E3.2)), these constraints define the *local polytope* $\mathcal{L}_{\mathcal{G}}$.

### C.3 Derivation: VFE to BFE

Substituting the model factorization ([C.1](https://arxiv.org/html/2606.04935#A3.E1)) and the Bethe factorization ([C.2](https://arxiv.org/html/2606.04935#A3.E2)) into the VFE yields the BFE.

##### Step 1: Expand the log terms.

Taking the logarithm of the Bethe factorization:

$$
\log q_{\mathrm{Bethe}}(\bm{s})=\sum_{a\in\mathcal{V}}\log q_{a}(\bm{s}_{a})-\sum_{i\in\mathcal{E}}(d_{i}-1)\log q_{i}(s_{i})\,. \tag{C.4}
$$

Similarly, the log model decomposes as:

$$
\log p(\bm{s})=\sum_{a\in\mathcal{V}}\log f_{a}(\bm{s}_{a})\,. \tag{C.5}
$$

##### Step 2: Substitute into the VFE.

The VFE is $F[q]=\int q(\bm{s})\log\frac{q(\bm{s})}{p(\bm{s})}\,\mathrm{d}\bm{s}$. Substituting ([C.4](https://arxiv.org/html/2606.04935#A3.E4)) and ([C.5](https://arxiv.org/html/2606.04935#A3.E5)):

$$
F[q]=\int q(\bm{s})\left[\sum_{a\in\mathcal{V}}\log q_{a}(\bm{s}_{a})-\sum_{i\in\mathcal{E}}(d_{i}-1)\log q_{i}(s_{i})-\sum_{a\in\mathcal{V}}\log f_{a}(\bm{s}_{a})\right]\mathrm{d}\bm{s}\,. \tag{C.6}
$$

##### Step 3: Localize expectations using consistency.

The key step exploits the local consistency constraints. For any function $g(\bm{s}_{a})$ that depends only on the variables in the scope of factor $a$:

$$
\int q(\bm{s})\,g(\bm{s}_{a})\,\mathrm{d}\bm{s}=\int q_{a}(\bm{s}_{a})\,g(\bm{s}_{a})\,\mathrm{d}\bm{s}_{a}\,, \tag{C.7}
$$

since $q$ marginalizes to $q_{a}$ over $\bm{s}_{a}$ by consistency. Similarly, for any function $h(s_{i})$ of a single variable:

$$
\int q(\bm{s})\,h(s_{i})\,\mathrm{d}\bm{s}=\int q_{i}(s_{i})\,h(s_{i})\,\mathrm{d}s_{i}\,. \tag{C.8}
$$

Applying these identities to ([C.6](https://arxiv.org/html/2606.04935#A3.E6)):

$$
F[q]=\sum_{a\in\mathcal{V}}\int q_{a}(\bm{s}_{a})\log\frac{q_{a}(\bm{s}_{a})}{f_{a}(\bm{s}_{a})}\,\mathrm{d}\bm{s}_{a}-\sum_{i\in\mathcal{E}}(d_{i}-1)\int q_{i}(s_{i})\log q_{i}(s_{i})\,\mathrm{d}s_{i}\,. \tag{C.9}
$$

##### Step 4: Identify the BFE.

Recognizing the KL divergence and entropy, one obtains the *Bethe Free Energy*:

$$
F_{\text{Bethe}}[q]=\sum_{a\in\mathcal{V}}\mathbb{D}_{\mathrm{KL}}\left[q_{a}(\bm{s}_{a})\,\|\,f_{a}(\bm{s}_{a})\right]+\sum_{i\in\mathcal{E}}(d_{i}-1)\,\mathbb{H}\left[q_{i}(s_{i})\right]\,, \tag{C.10}
$$

which is ([5](https://arxiv.org/html/2606.04935#S2.E5)) in the main text. The first sum penalizes each factor belief for deviating from its corresponding factor, while the second sum adds back the entropy of shared variables to correct for the over-counting inherent in the Bethe factorization.

### C.4 Constrained Optimization and Belief Propagation

The BFE gives rise to the constrained optimization problem

$$
\min_{\{q_{a},q_{i}\}\in\mathcal{L}_{\mathcal{G}}}F_{\text{Bethe}}[q]\,, \tag{C.11}
$$

where $\mathcal{L}_{\mathcal{G}}$ is the local polytope defined by the constraints ([C.3](https://arxiv.org/html/2606.04935#A3.E3)). Yedidia et al. \[2005\] showed that the stationary points of this constrained problem correspond exactly to the fixed points of the *belief propagation* (BP) algorithm.

On *tree-structured* graphs, the BFE equals the exact VFE, and BP converges to the exact posterior marginals \[Pearl, [1982](https://arxiv.org/html/2606.04935#bib.bib27)\]. In this case, the Bethe factorization ([C.2](https://arxiv.org/html/2606.04935#A3.E2)) is an exact representation of the global posterior, and the local consistency constraints are sufficient to characterize it.
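The tree-exactness claim can be illustrated on a small chain. The sketch below assumes three hypothetical factors $f_{1}(x_{1})$, $f_{2}(x_{1},x_{2})$, $f_{3}(x_{2},x_{3})$ with random positive entries, so that $x_{1}$ and $x_{2}$ have degree 2 and $x_{3}$ has degree 1. It plugs the exact marginals into (C.2) and (C.10) and checks that the Bethe factorization recovers the normalized joint and that the BFE equals $-\log Z$, the minimum of the exact VFE for an unnormalized model with normalizer $Z$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical positive factors on a chain: f1(x1), f2(x1, x2), f3(x2, x3).
f1 = rng.random(2) + 0.1
f2 = rng.random((2, 3)) + 0.1
f3 = rng.random((3, 2)) + 0.1

joint = f1[:, None, None] * f2[:, :, None] * f3[None, :, :]
Z = joint.sum()
p = joint / Z  # exact normalized distribution over (x1, x2, x3)

# Exact marginals used as factor beliefs and singleton beliefs.
q_f1 = p.sum(axis=(1, 2))  # belief of f1 over x1
q_f2 = p.sum(axis=2)       # belief of f2 over (x1, x2)
q_f3 = p.sum(axis=0)       # belief of f3 over (x2, x3)
q_x1 = q_f1                # singleton belief of x1 (degree 2)
q_x2 = p.sum(axis=(0, 2))  # singleton belief of x2 (degree 2)


def kl(q, f):
    """Sum of q * log(q / f); f may be unnormalized."""
    return float((q * np.log(q / f)).sum())


def entropy(q):
    return float(-(q * np.log(q)).sum())


# Bethe factorization (C.2); x3 has degree 1 and needs no correction.
q_bethe = (q_f1[:, None, None] * q_f2[:, :, None] * q_f3[None, :, :]
           / (q_x1[:, None, None] * q_x2[None, :, None]))

# Bethe Free Energy (C.10) with (d_i - 1) = 1 for x1 and x2, and 0 for x3.
bfe = kl(q_f1, f1) + kl(q_f2, f2) + kl(q_f3, f3) + entropy(q_x1) + entropy(q_x2)

print(bfe, -np.log(Z))
```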

On graphs with *loops*, the BFE is an approximation: the Bethe factorization does not generally correspond to a valid probability distribution, and BP may not converge. Nevertheless, when BP does converge, its fixed points remain stationary points of the BFE \[Yedidia et al., 2005, Heskes, [2006](https://arxiv.org/html/2606.04935#bib.bib12)\].

In models with shared parameters, such as a temporal chain where $\theta$ appears in both the dynamics and observation factors at every time step, the resulting loops make the BFE an approximation rather than an exact decomposition. Entropic corrections to the BFE yield modified objectives whose stationarity conditions are derived in [Appendix D](https://arxiv.org/html/2606.04935#A4).

## Appendix D Detailed Message Passing Derivation for the Combined Objective

This appendix gives a self-contained $T=1$ derivation of the message-passing scheme for the combined objective. The objective includes both the cross-entropy planning correction and the entropy corrections required for planning-as-inference in Active Inference. Its message-passing structure is organized around four channels, with $r_{u|x}(u_{1}|x_{0})$ appearing in the numerator of the dynamics kernel.

### D.1 Model, Coordinates, and Constraints

We consider the biased generative model

$$p(y_{1},x_{1},x_{0},\theta,u_{1})=p(\theta)\,p(x_{0})\,p(u_{1})\,p(x_{1}|x_{0},\theta,u_{1})\,p(y_{1}|x_{1},\theta)\,\hat{p}_{x}(x_{1})\,\hat{p}_{y}(y_{1}). \tag{D.1}$$

The corresponding non-singleton factors are the observation factor $f_{\mathrm{obs}}(y_{1},x_{1},\theta)=p(y_{1}|x_{1},\theta)$ and the dynamics factor $f_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1})=p(x_{1}|x_{0},\theta,u_{1})$, together with singleton prior and goal factors on $\theta$, $x_{0}$, $u_{1}$, $x_{1}$, and $y_{1}$.

The Bethe coordinates consist of the factor beliefs

$$q_{\mathrm{obs}}(y_{1},x_{1},\theta),\qquad q_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1}), \tag{D.2}$$

together with the singleton beliefs

$$q_{\theta}(\theta),\quad q_{x_{0}}(x_{0}),\quad q_{x_{1}}(x_{1}),\quad q_{y_{1}}(y_{1}),\quad q_{u_{1}}(u_{1}). \tag{D.3}$$

The local-polytope constraints consist of factor normalization,

\begin{align}
\int q_{\mathrm{obs}}(y_{1},x_{1},\theta)\,\mathrm{d}y_{1}\,\mathrm{d}x_{1}\,\mathrm{d}\theta &= 1, \tag{D.4a}\\
\int q_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1})\,\mathrm{d}x_{1}\,\mathrm{d}x_{0}\,\mathrm{d}\theta\,\mathrm{d}u_{1} &= 1, \tag{D.4b}
\end{align}

and factor-to-singleton consistency,

\begin{align}
\int q_{\mathrm{obs}}(y_{1},x_{1},\theta)\,\mathrm{d}x_{1}\,\mathrm{d}\theta &= q_{y_{1}}(y_{1}), \tag{D.5a}\\
\int q_{\mathrm{obs}}(y_{1},x_{1},\theta)\,\mathrm{d}y_{1}\,\mathrm{d}\theta &= q_{x_{1}}(x_{1}), \tag{D.5b}\\
\int q_{\mathrm{obs}}(y_{1},x_{1},\theta)\,\mathrm{d}y_{1}\,\mathrm{d}x_{1} &= q_{\theta}(\theta), \tag{D.5c}\\
\int q_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1})\,\mathrm{d}x_{0}\,\mathrm{d}\theta\,\mathrm{d}u_{1} &= q_{x_{1}}(x_{1}), \tag{D.5d}\\
\int q_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1})\,\mathrm{d}x_{1}\,\mathrm{d}\theta\,\mathrm{d}u_{1} &= q_{x_{0}}(x_{0}), \tag{D.5e}\\
\int q_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1})\,\mathrm{d}x_{1}\,\mathrm{d}x_{0}\,\mathrm{d}u_{1} &= q_{\theta}(\theta), \tag{D.5f}\\
\int q_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1})\,\mathrm{d}x_{1}\,\mathrm{d}x_{0}\,\mathrm{d}\theta &= q_{u_{1}}(u_{1}). \tag{D.5g}
\end{align}

The combined objective uses four normalized channels:

$$r_{y|x\theta}(y_{1}|x_{1},\theta),\qquad r_{y|x}(y_{1}|x_{1}),\qquad r_{x|xu}(x_{1}|x_{0},u_{1}),\qquad r_{u|x}(u_{1}|x_{0}), \tag{D.6a}$$

with normalization constraints

\begin{align}
\int r_{y|x\theta}(y_{1}|x_{1},\theta)\,\mathrm{d}y_{1} &= 1\quad\forall(x_{1},\theta), \tag{D.7a}\\
\int r_{y|x}(y_{1}|x_{1})\,\mathrm{d}y_{1} &= 1\quad\forall x_{1}, \tag{D.7b}\\
\int r_{x|xu}(x_{1}|x_{0},u_{1})\,\mathrm{d}x_{1} &= 1\quad\forall(x_{0},u_{1}), \tag{D.7c}\\
\int r_{u|x}(u_{1}|x_{0})\,\mathrm{d}u_{1} &= 1\quad\forall x_{0}. \tag{D.7d}
\end{align}

We also use the derived marginals

\begin{align}
q_{\mathrm{sep}}(x_{1},\theta) &:= \int q_{\mathrm{obs}}(y_{1},x_{1},\theta)\,\mathrm{d}y_{1}, \tag{D.8a}\\
q_{yx}(y_{1},x_{1}) &:= \int q_{\mathrm{obs}}(y_{1},x_{1},\theta)\,\mathrm{d}\theta, \tag{D.8b}\\
q_{\mathrm{trip}}(x_{1},x_{0},u_{1}) &:= \int q_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1})\,\mathrm{d}\theta, \tag{D.8c}\\
q_{\mathrm{pair}}(x_{0},u_{1}) &:= \int q_{\mathrm{trip}}(x_{1},x_{0},u_{1})\,\mathrm{d}x_{1}, \tag{D.8d}\\
q_{ux}(u_{1},x_{0}) &:= \int q_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1})\,\mathrm{d}x_{1}\,\mathrm{d}\theta. \tag{D.8e}
\end{align}

The last quantity is just $q_{ux}(u_{1},x_{0})=q_{\mathrm{pair}}(x_{0},u_{1})$, but it is convenient to name it separately when deriving the policy channel.

### D.2 Combined Objective and Modified Kernels

For $T=1$, the correction added to the Bethe free energy is

$$\Delta F_{\mathrm{comb}}=\underbrace{2\,\mathbb{H}\left[q(y_{1}|x_{1},\theta)\right]-\mathbb{H}\left[q(x_{1}|x_{0},u_{1})\right]-\mathbb{H}\left[q(y_{1}|x_{1})\right]}_{\Delta^{\mathrm{AIF}}}+\underbrace{\mathbb{H}\left[q(u_{1}|x_{0})\right]}_{\Delta^{\mathrm{planning}}}. \tag{D.9}$$

Using the variational characterization of conditional entropy, each conditional entropy is reparameterized by a channel\. The signs matter:

- $+\mathbb{H}\left[q(u_{1}|x_{0})\right]$ contributes $-\mathbb{E}_{q_{\mathrm{dyn}}}\left[\log r_{u|x}(u_{1}|x_{0})\right]$, so $r_{u|x}$ enters in the numerator of the dynamics kernel.
- $-\mathbb{H}\left[q(x_{1}|x_{0},u_{1})\right]$ contributes $+\mathbb{E}_{q_{\mathrm{dyn}}}\left[\log r_{x|xu}(x_{1}|x_{0},u_{1})\right]$, so $r_{x|xu}$ enters in the denominator of the dynamics kernel.
- $+2\,\mathbb{H}\left[q(y_{1}|x_{1},\theta)\right]$ contributes $-2\,\mathbb{E}_{q_{\mathrm{obs}}}\left[\log r_{y|x\theta}(y_{1}|x_{1},\theta)\right]$, so $r_{y|x\theta}^{2}$ enters in the numerator of the observation kernel.
- $-\mathbb{H}\left[q(y_{1}|x_{1})\right]$ contributes $+\mathbb{E}_{q_{yx}}\left[\log r_{y|x}(y_{1}|x_{1})\right]$, so $r_{y|x}$ enters in the denominator of the observation kernel.
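
The reparameterization rests on the identity $\mathbb{H}[q(y|x)]=\min_{r}-\mathbb{E}_{q}[\log r(y|x)]$, with the minimum attained at $r=q(y|x)$. The following sketch checks this for a discrete toy joint; the table `q_yx` and the comparison channel `other` are hypothetical and serve only to illustrate the identity for a positively signed entropy term.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint belief q(y, x), indexed as q_yx[y, x].
q_yx = rng.dirichlet(np.ones(12)).reshape(4, 3)
q_x = q_yx.sum(axis=0)
q_y_given_x = q_yx / q_x  # each column is a distribution over y


def channel_cost(r):
    """-E_q[log r(y|x)] for a normalized channel r[y, x]."""
    return float(-np.sum(q_yx * np.log(r)))


# Conditional entropy H[q(y|x)] computed directly from the joint.
cond_entropy = float(-np.sum(q_yx * np.log(q_y_given_x)))

# The cost equals the conditional entropy at r = q(y|x) ...
cost_at_optimum = channel_cost(q_y_given_x)

# ... and is no smaller for any other normalized channel.
other = rng.dirichlet(np.ones(4), size=3).T  # columns sum to one
cost_elsewhere = channel_cost(other)
print(cond_entropy, cost_at_optimum, cost_elsewhere)
```

The gap between `cost_elsewhere` and `cond_entropy` is the expected KL divergence between $q(y|x)$ and the chosen channel, which is why the channel recovers the conditional at stationarity.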

The resulting objective is

$$\begin{aligned}
F_{\mathrm{comb}}[q,r] &= \int q_{\mathrm{obs}}\log\frac{q_{\mathrm{obs}}}{p(y_{1}|x_{1},\theta)}\,\mathrm{d}y_{1}\,\mathrm{d}x_{1}\,\mathrm{d}\theta-2\int q_{\mathrm{obs}}\log r_{y|x\theta}\,\mathrm{d}y_{1}\,\mathrm{d}x_{1}\,\mathrm{d}\theta\\
&\quad+\int q_{yx}(y_{1},x_{1})\log r_{y|x}(y_{1}|x_{1})\,\mathrm{d}y_{1}\,\mathrm{d}x_{1}\\
&\quad+\int q_{\mathrm{dyn}}\log\frac{q_{\mathrm{dyn}}}{p(x_{1}|x_{0},\theta,u_{1})}\,\mathrm{d}x_{1}\,\mathrm{d}x_{0}\,\mathrm{d}\theta\,\mathrm{d}u_{1}\\
&\quad+\int q_{\mathrm{dyn}}\log r_{x|xu}(x_{1}|x_{0},u_{1})\,\mathrm{d}x_{1}\,\mathrm{d}x_{0}\,\mathrm{d}\theta\,\mathrm{d}u_{1}\\
&\quad-\int q_{\mathrm{dyn}}\log r_{u|x}(u_{1}|x_{0})\,\mathrm{d}x_{1}\,\mathrm{d}x_{0}\,\mathrm{d}\theta\,\mathrm{d}u_{1}\\
&\quad-\int q_{x_{0}}\log p(x_{0})\,\mathrm{d}x_{0}-\int q_{u_{1}}\log p(u_{1})\,\mathrm{d}u_{1}-\int q_{y_{1}}\log\hat{p}_{y}(y_{1})\,\mathrm{d}y_{1}\\
&\quad+(d_{\theta}-1)\,\mathbb{H}\left[q_{\theta}\right]-\int q_{\theta}\log p(\theta)\,\mathrm{d}\theta\\
&\quad+(d_{x_{1}}-1)\,\mathbb{H}\left[q_{x_{1}}\right]-\int q_{x_{1}}\log\hat{p}_{x}(x_{1})\,\mathrm{d}x_{1}.
\end{aligned} \tag{D.10}$$

The variable degrees are

$$d_{\theta}=3,\qquad d_{x_{0}}=2,\qquad d_{x_{1}}=3,\qquad d_{y_{1}}=2,\qquad d_{u_{1}}=2. \tag{D.11}$$
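
These degrees can be reproduced by counting, for each variable, the factors of (D.1) that touch it, singleton prior and goal factors included. The factor labels in the sketch below are hypothetical names chosen for illustration; only the scopes are read off (D.1).

```python
from collections import Counter

# Scopes of the seven factors in (D.1); the keys are illustrative labels.
factors = {
    "prior_theta": ["theta"],
    "prior_x0": ["x0"],
    "prior_u1": ["u1"],
    "dyn": ["x1", "x0", "theta", "u1"],
    "obs": ["y1", "x1", "theta"],
    "goal_x1": ["x1"],
    "goal_y1": ["y1"],
}

# Degree of a variable = number of factors whose scope contains it.
degree = Counter(var for scope in factors.values() for var in scope)
print(dict(degree))
```

Under this counting convention, $\theta$ and $x_{1}$ are the only variables shared by both non-singleton factors, which is why they are the degree-3 variables.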

### D.3 Lagrangian

We introduce Lagrange multipliers for all normalization and marginalization constraints\. The factor and consistency multipliers are

$$\lambda_{\mathrm{obs}},\quad\lambda_{\mathrm{dyn}},\quad\lambda_{y_{1}}(y_{1}),\quad\lambda_{\theta}^{(\mathrm{obs})}(\theta),\quad\lambda_{x_{1}}^{(\mathrm{obs})}(x_{1}),\quad\lambda_{x_{1}}^{(\mathrm{dyn})}(x_{1}),\quad\lambda_{x_{0}}(x_{0}),\quad\lambda_{u_{1}}(u_{1}),\quad\lambda_{\theta}^{(\mathrm{dyn})}(\theta), \tag{D.14}$$

and the channel multipliers are

$$\nu_{\mathrm{obs}}(x_{1},\theta),\qquad\nu_{y|x}(x_{1}),\qquad\nu_{x}(x_{0},u_{1}),\qquad\nu_{u|x}(x_{0}). \tag{D.15}$$

The full Lagrangian is

$$\begin{aligned}
\mathcal{L}_{\mathrm{comb}} &= F_{\mathrm{comb}}[q,r]\\
&\quad+\lambda_{\mathrm{obs}}\left(\int q_{\mathrm{obs}}(y_{1},x_{1},\theta)\,\mathrm{d}y_{1}\,\mathrm{d}x_{1}\,\mathrm{d}\theta-1\right)\\
&\quad+\lambda_{\mathrm{dyn}}\left(\int q_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1})\,\mathrm{d}x_{1}\,\mathrm{d}x_{0}\,\mathrm{d}\theta\,\mathrm{d}u_{1}-1\right)\\
&\quad+\int\lambda_{y_{1}}(y_{1})\left(\int q_{\mathrm{obs}}(y_{1},x_{1},\theta)\,\mathrm{d}x_{1}\,\mathrm{d}\theta-q_{y_{1}}(y_{1})\right)\mathrm{d}y_{1}\\
&\quad+\int\lambda_{\theta}^{(\mathrm{obs})}(\theta)\left(\int q_{\mathrm{obs}}(y_{1},x_{1},\theta)\,\mathrm{d}y_{1}\,\mathrm{d}x_{1}-q_{\theta}(\theta)\right)\mathrm{d}\theta\\
&\quad+\int\lambda_{x_{1}}^{(\mathrm{obs})}(x_{1})\left(\int q_{\mathrm{obs}}(y_{1},x_{1},\theta)\,\mathrm{d}y_{1}\,\mathrm{d}\theta-q_{x_{1}}(x_{1})\right)\mathrm{d}x_{1}\\
&\quad+\int\lambda_{x_{1}}^{(\mathrm{dyn})}(x_{1})\left(\int q_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1})\,\mathrm{d}x_{0}\,\mathrm{d}\theta\,\mathrm{d}u_{1}-q_{x_{1}}(x_{1})\right)\mathrm{d}x_{1}\\
&\quad+\int\lambda_{x_{0}}(x_{0})\left(\int q_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1})\,\mathrm{d}x_{1}\,\mathrm{d}\theta\,\mathrm{d}u_{1}-q_{x_{0}}(x_{0})\right)\mathrm{d}x_{0}\\
&\quad+\int\lambda_{u_{1}}(u_{1})\left(\int q_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1})\,\mathrm{d}x_{1}\,\mathrm{d}x_{0}\,\mathrm{d}\theta-q_{u_{1}}(u_{1})\right)\mathrm{d}u_{1}\\
&\quad+\int\lambda_{\theta}^{(\mathrm{dyn})}(\theta)\left(\int q_{\mathrm{dyn}}(x_{1},x_{0},\theta,u_{1})\,\mathrm{d}x_{1}\,\mathrm{d}x_{0}\,\mathrm{d}u_{1}-q_{\theta}(\theta)\right)\mathrm{d}\theta\\
&\quad+\iint\nu_{\mathrm{obs}}(x_{1},\theta)\left(\int r_{y|x\theta}(y_{1}|x_{1},\theta)\,\mathrm{d}y_{1}-1\right)\mathrm{d}x_{1}\,\mathrm{d}\theta\\
&\quad+\int\nu_{y|x}(x_{1})\left(\int r_{y|x}(y_{1}|x_{1})\,\mathrm{d}y_{1}-1\right)\mathrm{d}x_{1}\\
&\quad+\iint\nu_{x}(x_{0},u_{1})\left(\int r_{x|xu}(x_{1}|x_{0},u_{1})\,\mathrm{d}x_{1}-1\right)\mathrm{d}x_{0}\,\mathrm{d}u_{1}\\
&\quad+\int\nu_{u|x}(x_{0})\left(\int r_{u|x}(u_{1}|x_{0})\,\mathrm{d}u_{1}-1\right)\mathrm{d}x_{0}.
\end{aligned} \tag{D.16}$$

### D.4 Stationarity Conditions

#### D.4.1 Variation with respect to $q_{\mathrm{obs}}$

The stationarity equation for the observation factor gives the following result.

###### Proposition 12 (Observation factor belief for the combined objective).

At stationarity,

$$q_{\mathrm{obs}}^{*}(y_{1},x_{1},\theta)\propto\frac{p(y_{1}|x_{1},\theta)\,r_{y|x\theta}^{2}(y_{1}|x_{1},\theta)}{r_{y|x}(y_{1}|x_{1})}\,e^{-\lambda_{y_{1}}(y_{1})}\,e^{-\lambda_{\theta}^{(\mathrm{obs})}(\theta)}\,e^{-\lambda_{x_{1}}^{(\mathrm{obs})}(x_{1})}. \tag{D.17}$$

#### D.4.2 Variation with respect to $q_{\mathrm{dyn}}$

The dynamics\-side terms now contain both channel contributions:

$$\begin{aligned}
&\int q_{\mathrm{dyn}}\log q_{\mathrm{dyn}}-\int q_{\mathrm{dyn}}\log p(x_{1}|x_{0},\theta,u_{1})+\int q_{\mathrm{dyn}}\log r_{x|xu}(x_{1}|x_{0},u_{1})\\
&\quad-\int q_{\mathrm{dyn}}\log r_{u|x}(u_{1}|x_{0})+\text{(multiplier terms)}.
\end{aligned} \tag{D.18}$$

Taking $\frac{\delta\mathcal{L}_{\mathrm{comb}}}{\delta q_{\mathrm{dyn}}}=0$ gives

$$\log q_{\mathrm{dyn}}+1-\log p(x_{1}|x_{0},\theta,u_{1})+\log r_{x|xu}(x_{1}|x_{0},u_{1})-\log r_{u|x}(u_{1}|x_{0})+\lambda_{\mathrm{dyn}}+\lambda_{x_{1}}^{(\mathrm{dyn})}+\lambda_{x_{0}}+\lambda_{u_{1}}+\lambda_{\theta}^{(\mathrm{dyn})}=0. \tag{D.19}$$

###### Proposition 13 (Combined dynamics factor belief).

At stationarity,

$$q_{\mathrm{dyn}}^{*}(x_{1},x_{0},\theta,u_{1})\propto\frac{p(x_{1}|x_{0},\theta,u_{1})\,r_{u|x}(u_{1}|x_{0})}{r_{x|xu}(x_{1}|x_{0},u_{1})}\,e^{-\lambda_{x_{1}}^{(\mathrm{dyn})}(x_{1})}\,e^{-\lambda_{x_{0}}(x_{0})}\,e^{-\lambda_{u_{1}}(u_{1})}\,e^{-\lambda_{\theta}^{(\mathrm{dyn})}(\theta)}. \tag{D.20}$$

The dynamics factor is therefore reweighted by a policy-dependent numerator together with a predictive denominator.

#### D.4.3 Variation with respect to the observation channels

The observation\-side channels satisfy

\begin{align}
r_{y|x\theta}^{*}(y_{1}|x_{1},\theta) &= \frac{q_{\mathrm{obs}}(y_{1},x_{1},\theta)}{q_{\mathrm{sep}}(x_{1},\theta)}=q(y_{1}|x_{1},\theta), \tag{D.21a}\\
r_{y|x}^{*}(y_{1}|x_{1}) &= \frac{q_{yx}(y_{1},x_{1})}{q_{x_{1}}(x_{1})}=q(y_{1}|x_{1}). \tag{D.21b}
\end{align}

#### D.4.4 Variation with respect to $r_{x|xu}$

The predictive dynamics channel is obtained from

$$\frac{\delta\mathcal{L}_{\mathrm{comb}}}{\delta r_{x|xu}(x_{1}|x_{0},u_{1})}=\frac{q_{\mathrm{trip}}(x_{1},x_{0},u_{1})}{r_{x|xu}(x_{1}|x_{0},u_{1})}+\nu_{x}(x_{0},u_{1})=0. \tag{D.22}$$

###### Proposition 14 (Predictive dynamics channel).

At stationarity,

$$r_{x|xu}^{*}(x_{1}|x_{0},u_{1})=\frac{q_{\mathrm{trip}}(x_{1},x_{0},u_{1})}{q_{\mathrm{pair}}(x_{0},u_{1})}=q(x_{1}|x_{0},u_{1}). \tag{D.23}$$

#### D.4.5 Variation with respect to $r_{u|x}$

Since $r_{u|x}$ does not depend on $x_{1}$ or $\theta$, only the marginal $q_{ux}(u_{1},x_{0})$ matters:

$$-\int q_{ux}(u_{1},x_{0})\log r_{u|x}(u_{1}|x_{0})\,\mathrm{d}u_{1}\,\mathrm{d}x_{0}+\int\nu_{u|x}(x_{0})\left(\int r_{u|x}(u_{1}|x_{0})\,\mathrm{d}u_{1}-1\right)\mathrm{d}x_{0}. \tag{D.24}$$

Taking a pointwise derivative yields

$$-\frac{q_{ux}(u_{1},x_{0})}{r_{u|x}(u_{1}|x_{0})}+\nu_{u|x}(x_{0})=0. \tag{D.25}$$

Normalization implies $\nu_{u|x}(x_{0})=q_{x_{0}}(x_{0})$.

###### Proposition 15 (Policy channel).

At stationarity,

$$r_{u|x}^{*}(u_{1}|x_{0})=\frac{q_{ux}(u_{1},x_{0})}{q_{x_{0}}(x_{0})}=q(u_{1}|x_{0}). \tag{D.26}$$
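
As a numerical sanity check of this step, the sketch below takes a hypothetical discrete marginal `q_ux` (not from the paper), forms the candidate channel from (D.26) with multiplier $\nu_{u|x}(x_{0})=q_{x_{0}}(x_{0})$, and evaluates the left-hand side of the pointwise condition (D.25).

```python
import numpy as np

rng = np.random.default_rng(4)
U, X0 = 3, 4

# Hypothetical strictly positive joint marginal, indexed as q_ux[u1, x0].
q_ux = rng.dirichlet(np.ones(U * X0)).reshape(U, X0)
q_x0 = q_ux.sum(axis=0)

# Candidate solution (D.26) together with the multiplier nu_{u|x}(x0) = q_{x0}(x0).
r_u_given_x = q_ux / q_x0
nu = q_x0

# Left-hand side of (D.25); it should vanish at every (u1, x0).
residual = -q_ux / r_u_given_x + nu[None, :]

# The channel must also satisfy the normalization constraint (D.7d).
column_sums = r_u_given_x.sum(axis=0)
print(np.abs(residual).max(), column_sums)
```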

#### D.4.6 Variation with respect to singleton beliefs

The singleton calculations are elementary because they depend only on degree counting\. Since

$$d_{y_{1}}=d_{x_{0}}=d_{u_{1}}=2,\qquad d_{\theta}=d_{x_{1}}=3, \tag{D.27}$$

the degree-2 and degree-3 variables behave differently. In particular:

- for degree-2 variables $y_{1}$, $x_{0}$, and $u_{1}$, the KL term and the singleton entropy cancel, so

  $$\lambda_{y_{1}}(y_{1})=-\log\hat{p}_{y}(y_{1}),\qquad\lambda_{x_{0}}(x_{0})=-\log p(x_{0}),\qquad\lambda_{u_{1}}(u_{1})=-\log p(u_{1}); \tag{D.28}$$

- for degree-3 variables $\theta$ and $x_{1}$, one obtains only constraints on sums of multipliers,

  $$\lambda_{\theta}^{(\mathrm{obs})}(\theta)+\lambda_{\theta}^{(\mathrm{dyn})}(\theta)=-\log q_{\theta}^{*}(\theta)-1-\log p(\theta), \tag{D.29}$$

  $$\lambda_{x_{1}}^{(\mathrm{obs})}(x_{1})+\lambda_{x_{1}}^{(\mathrm{dyn})}(x_{1})=-\log q_{x_{1}}^{*}(x_{1})-1-\log\hat{p}_{x}(x_{1}). \tag{D.30}$$

The extra policy channel does not alter these facts, because it modifies only the non\-singleton dynamics factor and does not introduce any new local\-consistency constraint involving singletons\.

### D.5 Solving the Stationarity Equations as Message Passing

Once the kernels ([D.12](https://arxiv.org/html/2606.04935#A4.E12)) are identified, the rest of the derivation follows the standard Bethe logic. For a factor $a$ with kernel $\tilde{f}_{a}$ and neighboring variables $\mathcal{E}(a)$, the factor-to-variable update is

$$\mu_{a\to j}(s_{j})\propto\int\tilde{f}_{a}(\bm{s}_{a})\prod_{i\in\mathcal{E}(a)\setminus j}\mu_{i\to a}(s_{i})\,\mathrm{d}\bm{s}_{a\setminus j}. \tag{D.31}$$

#### D.5.1 Observation-factor messages

Using the observation kernel in ([D.12a](https://arxiv.org/html/2606.04935#A4.E12.1)), the observation-factor messages are

\begin{align}
\mu_{\mathrm{obs}\to\theta}(\theta) &= \iint\frac{p(y_{1}|x_{1},\theta)\,r_{y|x\theta}^{2}(y_{1}|x_{1},\theta)}{r_{y|x}(y_{1}|x_{1})}\,\mu_{y_{1}\to\mathrm{obs}}(y_{1})\,\mu_{x_{1}\to\mathrm{obs}}(x_{1})\,\mathrm{d}y_{1}\,\mathrm{d}x_{1}, \tag{D.32a}\\
\mu_{\mathrm{obs}\to x_{1}}(x_{1}) &= \iint\frac{p(y_{1}|x_{1},\theta)\,r_{y|x\theta}^{2}(y_{1}|x_{1},\theta)}{r_{y|x}(y_{1}|x_{1})}\,\mu_{y_{1}\to\mathrm{obs}}(y_{1})\,\mu_{\theta\to\mathrm{obs}}(\theta)\,\mathrm{d}y_{1}\,\mathrm{d}\theta, \tag{D.32b}\\
\mu_{\mathrm{obs}\to y_{1}}(y_{1}) &= \iint\frac{p(y_{1}|x_{1},\theta)\,r_{y|x\theta}^{2}(y_{1}|x_{1},\theta)}{r_{y|x}(y_{1}|x_{1})}\,\mu_{x_{1}\to\mathrm{obs}}(x_{1})\,\mu_{\theta\to\mathrm{obs}}(\theta)\,\mathrm{d}x_{1}\,\mathrm{d}\theta. \tag{D.32c}
\end{align}

#### D.5.2 Dynamics-factor messages

The combined dynamics kernel yields

\begin{align}
\mu_{\mathrm{dyn}\to\theta}(\theta) &= \iiint\frac{p(x_{1}|x_{0},\theta,u_{1})\,r_{u|x}(u_{1}|x_{0})}{r_{x|xu}(x_{1}|x_{0},u_{1})}\,\mu_{x_{1}\to\mathrm{dyn}}(x_{1})\,\mu_{x_{0}\to\mathrm{dyn}}(x_{0})\,\mu_{u_{1}\to\mathrm{dyn}}(u_{1})\,\mathrm{d}x_{1}\,\mathrm{d}x_{0}\,\mathrm{d}u_{1}, \tag{D.33a}\\
\mu_{\mathrm{dyn}\to x_{1}}(x_{1}) &= \iiint\frac{p(x_{1}|x_{0},\theta,u_{1})\,r_{u|x}(u_{1}|x_{0})}{r_{x|xu}(x_{1}|x_{0},u_{1})}\,\mu_{x_{0}\to\mathrm{dyn}}(x_{0})\,\mu_{u_{1}\to\mathrm{dyn}}(u_{1})\,\mu_{\theta\to\mathrm{dyn}}(\theta)\,\mathrm{d}x_{0}\,\mathrm{d}u_{1}\,\mathrm{d}\theta, \tag{D.33b}\\
\mu_{\mathrm{dyn}\to x_{0}}(x_{0}) &= \iiint\frac{p(x_{1}|x_{0},\theta,u_{1})\,r_{u|x}(u_{1}|x_{0})}{r_{x|xu}(x_{1}|x_{0},u_{1})}\,\mu_{x_{1}\to\mathrm{dyn}}(x_{1})\,\mu_{u_{1}\to\mathrm{dyn}}(u_{1})\,\mu_{\theta\to\mathrm{dyn}}(\theta)\,\mathrm{d}x_{1}\,\mathrm{d}u_{1}\,\mathrm{d}\theta, \tag{D.33c}\\
\mu_{\mathrm{dyn}\to u_{1}}(u_{1}) &= \iiint\frac{p(x_{1}|x_{0},\theta,u_{1})\,r_{u|x}(u_{1}|x_{0})}{r_{x|xu}(x_{1}|x_{0},u_{1})}\,\mu_{x_{1}\to\mathrm{dyn}}(x_{1})\,\mu_{x_{0}\to\mathrm{dyn}}(x_{0})\,\mu_{\theta\to\mathrm{dyn}}(\theta)\,\mathrm{d}x_{1}\,\mathrm{d}x_{0}\,\mathrm{d}\theta. \tag{D.33d}
\end{align}
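
For discrete variables the integrals in (D.33) become sums, and each message is a tensor contraction of the combined kernel with the remaining incoming messages. The sketch below is a minimal illustration under assumed conditions: random tables stand in for the model, the channels, and the incoming messages, the axis order `[x1, x0, theta, u1]` is an arbitrary choice, and all array names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
X, TH, U = 3, 2, 2  # sizes of the state, parameter, and action spaces

# Transition model p(x1 | x0, theta, u1), stored as p_dyn[x1, x0, theta, u1].
p_dyn = np.moveaxis(rng.dirichlet(np.ones(X), size=(X, TH, U)), -1, 0)
# Policy channel r_{u|x}, stored as r_u[x0, u1] (rows sum to one).
r_u = rng.dirichlet(np.ones(U), size=X)
# Predictive channel r_{x|xu}, stored as r_x[x1, x0, u1].
r_x = np.moveaxis(rng.dirichlet(np.ones(X), size=(X, U)), -1, 0)

# Combined dynamics kernel: policy channel in the numerator,
# predictive channel in the denominator.
kernel = p_dyn * r_u[None, :, None, :] / r_x[:, :, None, :]

# Arbitrary positive incoming variable-to-factor messages.
m_x1, m_x0, m_th, m_u = (rng.uniform(0.1, 1.0, n) for n in (X, X, TH, U))

# Discrete analogues of (D.33a)-(D.33d); indices a=x1, b=x0, t=theta, c=u1.
mu_to_theta = np.einsum("abtc,a,b,c->t", kernel, m_x1, m_x0, m_u)
mu_to_x1 = np.einsum("abtc,b,c,t->a", kernel, m_x0, m_u, m_th)
mu_to_x0 = np.einsum("abtc,a,c,t->b", kernel, m_x1, m_u, m_th)
mu_to_u1 = np.einsum("abtc,a,b,t->c", kernel, m_x1, m_x0, m_th)
print(mu_to_theta, mu_to_u1)
```

Each outgoing message leaves out the incoming message from its own target variable, so multiplying an outgoing message by that incoming message gives the corresponding unnormalized marginal of kernel times all incoming messages.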

#### D.5.3 Factor beliefs and singleton beliefs

The factor beliefs are kernel times incoming messages:

\begin{align}
q_{\mathrm{obs}}^{*}(y_{1},x_{1},\theta) &\propto \tilde{f}_{\mathrm{obs}}(y_{1},x_{1},\theta)\,\mu_{y_{1}\to\mathrm{obs}}(y_{1})\,\mu_{x_{1}\to\mathrm{obs}}(x_{1})\,\mu_{\theta\to\mathrm{obs}}(\theta), \tag{D.34a}\\
q_{\mathrm{dyn}}^{*}(x_{1},x_{0},\theta,u_{1}) &\propto \tilde{f}_{\mathrm{dyn}}^{\mathrm{comb}}(x_{1},x_{0},\theta,u_{1})\,\mu_{x_{1}\to\mathrm{dyn}}(x_{1})\,\mu_{x_{0}\to\mathrm{dyn}}(x_{0})\,\mu_{\theta\to\mathrm{dyn}}(\theta)\,\mu_{u_{1}\to\mathrm{dyn}}(u_{1}). \tag{D.34b}
\end{align}

Singleton beliefs keep the usual sum-product form:

\begin{align}
q_{\theta}^{*}(\theta) &\propto p(\theta)\,\mu_{\mathrm{obs}\to\theta}(\theta)\,\mu_{\mathrm{dyn}\to\theta}(\theta), \tag{D.35a}\\
q_{x_{1}}^{*}(x_{1}) &\propto \hat{p}_{x}(x_{1})\,\mu_{\mathrm{obs}\to x_{1}}(x_{1})\,\mu_{\mathrm{dyn}\to x_{1}}(x_{1}), \tag{D.35b}\\
q_{x_{0}}^{*}(x_{0}) &\propto p(x_{0})\,\mu_{\mathrm{dyn}\to x_{0}}(x_{0}), \tag{D.35c}\\
q_{u_{1}}^{*}(u_{1}) &\propto p(u_{1})\,\mu_{\mathrm{dyn}\to u_{1}}(u_{1}), \tag{D.35d}\\
q_{y_{1}}^{*}(y_{1}) &\propto \hat{p}_{y}(y_{1})\,\mu_{\mathrm{obs}\to y_{1}}(y_{1}). \tag{D.35e}
\end{align}

#### D.5.4 Channel updates

At a fixed point, all four channels recover the corresponding conditionals under the current factor beliefs:

\begin{align}
r_{y|x\theta}^{*}(y_{1}|x_{1},\theta) &= q(y_{1}|x_{1},\theta), \tag{D.36a}\\
r_{y|x}^{*}(y_{1}|x_{1}) &= q(y_{1}|x_{1}), \tag{D.36b}\\
r_{x|xu}^{*}(x_{1}|x_{0},u_{1}) &= q(x_{1}|x_{0},u_{1}), \tag{D.36c}\\
r_{u|x}^{*}(u_{1}|x_{0}) &= q(u_{1}|x_{0}). \tag{D.36d}
\end{align}
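
In a discrete implementation the four channel updates reduce to marginalizing and renormalizing the factor beliefs. The sketch below assumes tabular beliefs with hypothetical axis orders `q_obs[y1, x1, theta]` and `q_dyn[x1, x0, theta, u1]`, filled with random values for illustration; because these random tables are not locally consistent, the $x_{1}$ marginal used for $q(y_{1}|x_{1})$ is taken from the observation belief.

```python
import numpy as np

rng = np.random.default_rng(2)
Y, X, TH, U = 3, 4, 2, 3

# Hypothetical factor beliefs (random, strictly positive, normalized).
q_obs = rng.dirichlet(np.ones(Y * X * TH)).reshape(Y, X, TH)         # [y1, x1, theta]
q_dyn = rng.dirichlet(np.ones(X * X * TH * U)).reshape(X, X, TH, U)  # [x1, x0, theta, u1]

# Derived marginals, following (D.8a)-(D.8d).
q_sep = q_obs.sum(axis=0)    # [x1, theta]
q_yx = q_obs.sum(axis=2)     # [y1, x1]
q_trip = q_dyn.sum(axis=2)   # [x1, x0, u1]
q_pair = q_trip.sum(axis=0)  # [x0, u1]

# Channel updates (D.36a)-(D.36d): each channel is a conditional of a belief.
r_y_xtheta = q_obs / q_sep[None, :, :]              # q(y1 | x1, theta)
r_y_x = q_yx / q_yx.sum(axis=0, keepdims=True)      # q(y1 | x1)
r_x_xu = q_trip / q_pair[None, :, :]                # q(x1 | x0, u1)
r_u_x = q_pair / q_pair.sum(axis=1, keepdims=True)  # q(u1 | x0), stored as [x0, u1]
print(r_u_x)
```

In an iterative scheme these updates would alternate with the message updates, since the kernels depend on the channels and the channels depend on the beliefs.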

### D.6 Generic scheme for arbitrary $T$

The passage from $T=1$ to arbitrary horizons is immediate because all entropy corrections are additive across time, so each time step receives its own local channels

$$r_{y|x\theta,t}(y_{t}|x_{t},\theta),\quad r_{y|x,t}(y_{t}|x_{t}),\quad r_{x|xu,t}(x_{t}|x_{t-1},u_{t}),\quad r_{u|x,t}(u_{t}|x_{t-1}). \tag{D.37}$$

The per-time-step kernels are

\begin{align}
\tilde{f}_{\mathrm{obs}_{t}}(y_{t},x_{t},\theta) &= \frac{p(y_{t}|x_{t},\theta)\,r_{y|x\theta,t}^{2}(y_{t}|x_{t},\theta)}{r_{y|x,t}(y_{t}|x_{t})}, \tag{D.38a}\\
\tilde{f}_{\mathrm{dyn}_{t}}^{\mathrm{comb}}(x_{t},x_{t-1},\theta,u_{t}) &= \frac{p(x_{t}|x_{t-1},\theta,u_{t})\,r_{u|x,t}(u_{t}|x_{t-1})}{r_{x|xu,t}(x_{t}|x_{t-1},u_{t})}. \tag{D.38b}
\end{align}
The full multi\-step scheme is obtained by applying the usual sum\-product updates with the kernels in \([D\.38](https://arxiv.org/html/2606.04935#A4.E38)\)\.

#### D.6.1 Messages from Observation Factor $f_{\mathrm{obs}_{t}}$

\begin{align}
\mu_{\mathrm{obs}_{t}\to\theta}(\theta) &= \iint\frac{p(y_{t}|x_{t},\theta)\,r_{y|x\theta,t}^{2}(y_{t}|x_{t},\theta)}{r_{y|x,t}(y_{t}|x_{t})}\,\mu_{y_{t}\to\mathrm{obs}_{t}}(y_{t})\,\mu_{x_{t}\to\mathrm{obs}_{t}}(x_{t})\,\mathrm{d}y_{t}\,\mathrm{d}x_{t}, \tag{D.39a}\\
\mu_{\mathrm{obs}_{t}\to x_{t}}(x_{t}) &= \iint\frac{p(y_{t}|x_{t},\theta)\,r_{y|x\theta,t}^{2}(y_{t}|x_{t},\theta)}{r_{y|x,t}(y_{t}|x_{t})}\,\mu_{y_{t}\to\mathrm{obs}_{t}}(y_{t})\,\mu_{\theta\to\mathrm{obs}_{t}}(\theta)\,\mathrm{d}y_{t}\,\mathrm{d}\theta, \tag{D.39b}\\
\mu_{\mathrm{obs}_{t}\to y_{t}}(y_{t}) &= \iint\frac{p(y_{t}|x_{t},\theta)\,r_{y|x\theta,t}^{2}(y_{t}|x_{t},\theta)}{r_{y|x,t}(y_{t}|x_{t})}\,\mu_{x_{t}\to\mathrm{obs}_{t}}(x_{t})\,\mu_{\theta\to\mathrm{obs}_{t}}(\theta)\,\mathrm{d}x_{t}\,\mathrm{d}\theta. \tag{D.39c}
\end{align}

#### D\.6\.2 Messages from Dynamics Factor $f_{\mathrm{dyn}_t}$

$$
\begin{aligned}
\mu_{\mathrm{dyn}_t\to\theta}(\theta) &= \iiint \frac{p(x_t|x_{t-1},\theta,u_t)\,r_{u|x,t}(u_t|x_{t-1})}{r_{x|xu,t}(x_t|x_{t-1},u_t)}\,\mu_{x_t\to\mathrm{dyn}_t}(x_t)\,\mu_{x_{t-1}\to\mathrm{dyn}_t}(x_{t-1})\,\mu_{u_t\to\mathrm{dyn}_t}(u_t)\,\mathrm{d}x_t\,\mathrm{d}x_{t-1}\,\mathrm{d}u_t, &&\text{(D.40a)}\\
\mu_{\mathrm{dyn}_t\to x_t}(x_t) &= \iiint \frac{p(x_t|x_{t-1},\theta,u_t)\,r_{u|x,t}(u_t|x_{t-1})}{r_{x|xu,t}(x_t|x_{t-1},u_t)}\,\mu_{x_{t-1}\to\mathrm{dyn}_t}(x_{t-1})\,\mu_{u_t\to\mathrm{dyn}_t}(u_t)\,\mu_{\theta\to\mathrm{dyn}_t}(\theta)\,\mathrm{d}x_{t-1}\,\mathrm{d}u_t\,\mathrm{d}\theta, &&\text{(D.40b)}\\
\mu_{\mathrm{dyn}_t\to x_{t-1}}(x_{t-1}) &= \iiint \frac{p(x_t|x_{t-1},\theta,u_t)\,r_{u|x,t}(u_t|x_{t-1})}{r_{x|xu,t}(x_t|x_{t-1},u_t)}\,\mu_{x_t\to\mathrm{dyn}_t}(x_t)\,\mu_{u_t\to\mathrm{dyn}_t}(u_t)\,\mu_{\theta\to\mathrm{dyn}_t}(\theta)\,\mathrm{d}x_t\,\mathrm{d}u_t\,\mathrm{d}\theta, &&\text{(D.40c)}\\
\mu_{\mathrm{dyn}_t\to u_t}(u_t) &= \iiint \frac{p(x_t|x_{t-1},\theta,u_t)\,r_{u|x,t}(u_t|x_{t-1})}{r_{x|xu,t}(x_t|x_{t-1},u_t)}\,\mu_{x_t\to\mathrm{dyn}_t}(x_t)\,\mu_{x_{t-1}\to\mathrm{dyn}_t}(x_{t-1})\,\mu_{\theta\to\mathrm{dyn}_t}(\theta)\,\mathrm{d}x_t\,\mathrm{d}x_{t-1}\,\mathrm{d}\theta. &&\text{(D.40d)}
\end{aligned}
$$

#### D\.6\.3 Factor Beliefs

$$
\begin{aligned}
q_{\mathrm{obs},t}^{*}(y_t,x_t,\theta) &\propto \tilde{f}_{\mathrm{obs}_t}(y_t,x_t,\theta)\,\mu_{y_t\to\mathrm{obs}_t}(y_t)\,\mu_{x_t\to\mathrm{obs}_t}(x_t)\,\mu_{\theta\to\mathrm{obs}_t}(\theta), &&\text{(D.41a)}\\
q_{\mathrm{dyn},t}^{*}(x_t,x_{t-1},\theta,u_t) &\propto \tilde{f}_{\mathrm{dyn}_t}^{\mathrm{comb}}(x_t,x_{t-1},\theta,u_t)\,\mu_{x_t\to\mathrm{dyn}_t}(x_t)\,\mu_{x_{t-1}\to\mathrm{dyn}_t}(x_{t-1})\,\mu_{\theta\to\mathrm{dyn}_t}(\theta)\,\mu_{u_t\to\mathrm{dyn}_t}(u_t). &&\text{(D.41b)}
\end{aligned}
$$

#### D\.6\.4 Channel Updates

At a fixed point, each time step has the four local channel updates

$$
\begin{aligned}
r_{y|x\theta,t}^{*}(y_t|x_t,\theta) &= q_t(y_t|x_t,\theta), &&\text{(D.42a)}\\
r_{y|x,t}^{*}(y_t|x_t) &= q_t(y_t|x_t), &&\text{(D.42b)}\\
r_{x|xu,t}^{*}(x_t|x_{t-1},u_t) &= q_t(x_t|x_{t-1},u_t), &&\text{(D.42c)}\\
r_{u|x,t}^{*}(u_t|x_{t-1}) &= q_t(u_t|x_{t-1}). &&\text{(D.42d)}
\end{aligned}
$$

#### D\.6\.5 Singleton Beliefs

$$
\begin{aligned}
q_{x_t}^{*}(x_t) &\propto \hat{p}_x(x_t)\,\mu_{\mathrm{obs}_t\to x_t}(x_t)\,\mu_{\mathrm{dyn}_t\to x_t}(x_t)\,\mu_{\mathrm{dyn}_{t+1}\to x_t}(x_t), &&\text{(D.43a)}\\
q_{\theta}^{*}(\theta) &\propto p(\theta)\prod_{\tau=1}^{T}\mu_{\mathrm{obs}_\tau\to\theta}(\theta)\,\mu_{\mathrm{dyn}_\tau\to\theta}(\theta), &&\text{(D.43b)}\\
q_{u_t}^{*}(u_t) &\propto p(u_t)\,\mu_{\mathrm{dyn}_t\to u_t}(u_t), &&\text{(D.43c)}\\
q_{y_t}^{*}(y_t) &\propto \hat{p}_y(y_t)\,\mu_{\mathrm{obs}_t\to y_t}(y_t). &&\text{(D.43d)}
\end{aligned}
$$

The boundary conditions follow the usual temporal\-edge conventions: at $t=1$, $\mu_{x_0\to\mathrm{dyn}_1}(x_0)=p(x_0)$; at $t=T$, the message $\mu_{\mathrm{dyn}_{T+1}\to x_T}$ is absent\.

### D\.7 Interpretation

At a fixed point,

$$
\tilde{f}_{\mathrm{dyn}_t}^{\mathrm{comb}}(x_t,x_{t-1},\theta,u_t)=\frac{p(x_t|x_{t-1},\theta,u_t)\,q(u_t|x_{t-1})}{q(x_t|x_{t-1},u_t)}. \qquad\text{(D.44)}
$$

This exposes the objective transparently: the numerator implements policy commitment, while the denominator implements the risk\-sensitive correction over predicted futures\. The observation factor simultaneously balances the two observation\-side entropy terms through $r_{y|x\theta}$ and $r_{y|x}$\.
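As an illustration of (D.44), the following minimal sketch builds the fixed-point kernel for a small discrete model. The dictionary layout, the function name `fixed_point_kernel`, and the toy numbers are assumptions made for this example, not part of the paper's implementation; it also assumes strictly positive beliefs so that no division by zero occurs.

```python
from itertools import product


def fixed_point_kernel(p_dyn, q_dyn, n_x, n_theta, n_u):
    """Discrete version of the kernel in (D.44).

    p_dyn[(x1, x0, th, u)] holds p(x_t | x_{t-1}, theta, u_t);
    q_dyn[(x1, x0, th, u)] holds the dynamics factor belief (sums to 1).
    """
    q_xxu, q_xu, q_x = {}, {}, {}
    for (x1, x0, th, u), mass in q_dyn.items():
        q_xxu[(x1, x0, u)] = q_xxu.get((x1, x0, u), 0.0) + mass  # q(x_t, x_{t-1}, u_t)
        q_xu[(x0, u)] = q_xu.get((x0, u), 0.0) + mass            # q(x_{t-1}, u_t)
        q_x[x0] = q_x.get(x0, 0.0) + mass                        # q(x_{t-1})
    kernel = {}
    for x1, x0, th, u in product(range(n_x), range(n_x), range(n_theta), range(n_u)):
        q_u_given_x = q_xu[(x0, u)] / q_x[x0]             # numerator: q(u_t | x_{t-1})
        q_x_given_xu = q_xxu[(x1, x0, u)] / q_xu[(x0, u)]  # denominator: q(x_t | x_{t-1}, u_t)
        kernel[(x1, x0, th, u)] = p_dyn[(x1, x0, th, u)] * q_u_given_x / q_x_given_xu
    return kernel


# Hypothetical toy belief: 2 states, 1 parameter value, 2 actions.
idx = list(product(range(2), range(2), range(1), range(2)))
w = {(x1, x0, th, u): 1.0 + x1 + 2.0 * x0 * u for (x1, x0, th, u) in idx}
z = sum(w.values())
q_dyn = {k: v / z for k, v in w.items()}
p_dyn = {k: 0.5 for k in idx}  # hypothetical transition model
kernel = fixed_point_kernel(p_dyn, q_dyn, n_x=2, n_theta=1, n_u=2)
```

In this toy case the entry `kernel[(1, 1, 0, 1)]` equals `0.5 * 0.7 / (4 / 7)`: the transition probability is scaled up where the action is likely under the belief and scaled down where the predicted next state is already concentrated.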

## Appendix E Message Passing Derivation for VBP

This appendix derives the message\-passing equations for the Variational Belief Propagation \(VBP\) scheme that implements cross\-entropy planning \([Section 4\.2](https://arxiv.org/html/2606.04935#S4.SS2)\)\. The factor graph and Bethe approximation are identical to the combined\-objective derivation \([Appendix D](https://arxiv.org/html/2606.04935#A4)\); only the entropy correction differs\. VBP requires a single channel reparameterization, making the derivation substantially simpler\.

### E\.1 Coordinate System

We use the same generative model, factor graph, Bethe coordinates, and local\-polytope constraints as in [Appendix D](https://arxiv.org/html/2606.04935#A4): factor beliefs $q_{\mathrm{obs}}(y_1,x_1,\theta)$ and $q_{\mathrm{dyn}}(x_1,x_0,\theta,u_1)$, plus singleton beliefs $q_{\theta}, q_{x_0}, q_{x_1}, q_{y_1}, q_{u_1}$\.

The single difference is the channel set\. VBP introduces one channel:

$$
r_{u|x}(u_1|x_0),\quad\text{subject to}\quad\int r_{u|x}(u_1|x_0)\,\mathrm{d}u_1=1\quad\forall\,x_0. \qquad\text{(E.1)}
$$

This channel parameterizes the conditional entropy of actions given states via:

$$
\mathbb{H}\left[q(u_1|x_0)\right]=\min_{r_{u|x}}\mathbb{E}_{q_{\mathrm{pair}}(x_0,u_1)}\left[-\log r_{u|x}(u_1|x_0)\right], \qquad\text{(E.2)}
$$

where the minimum is attained at $r_{u|x}(u_1|x_0)=q(u_1|x_0)$, and

$$
q_{\mathrm{pair}}(x_0,u_1):=\iint q_{\mathrm{dyn}}(x_1,x_0,\theta,u_1)\,\mathrm{d}x_1\,\mathrm{d}\theta \qquad\text{(E.3)}
$$

is the marginal of the dynamics factor belief over $(x_0,u_1)$\.
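The variational identity (E.2) can be checked numerically in the discrete case. The sketch below uses a hypothetical joint `q_pair` over two states and three actions and compares the cross-entropy at $r_{u|x}=q(u_1|x_0)$ against randomly drawn normalized channels. The table values and helper names are illustrative assumptions.

```python
import math
import random


def conditional_cross_entropy(q_pair, r):
    """E_{q_pair(x0,u1)}[-log r(u1|x0)], with q_pair[x0][u1] and r[x0][u1]."""
    return -sum(q_pair[x][u] * math.log(r[x][u])
                for x in range(len(q_pair))
                for u in range(len(q_pair[x]))
                if q_pair[x][u] > 0.0)


def conditional_from_pair(q_pair):
    """q(u1|x0) = q_pair(x0,u1) / sum_u q_pair(x0,u)."""
    return [[p / sum(row) for p in row] for row in q_pair]


# Hypothetical joint belief (rows: x0, columns: u1); entries sum to 1.
q_pair = [[0.10, 0.25, 0.15],
          [0.30, 0.05, 0.15]]

r_star = conditional_from_pair(q_pair)
entropy = conditional_cross_entropy(q_pair, r_star)  # H[q(u1|x0)]

# Every other normalized channel should give a cross-entropy at least as large.
rng = random.Random(0)
gaps = []
for _ in range(1000):
    r = []
    for _x in range(2):
        w = [rng.random() + 1e-3 for _u in range(3)]
        r.append([wi / sum(w) for wi in w])
    gaps.append(conditional_cross_entropy(q_pair, r) - entropy)
min_gap = min(gaps)
```

The gap is a $q_{x_0}$-weighted Kullback-Leibler divergence between $q(u_1|x_0)$ and the trial channel, so it is non-negative by Gibbs' inequality.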

### E\.2 Objective Function

The VBP objective adds the cross\-entropy planning correction \([8](https://arxiv.org/html/2606.04935#S4.E8)\) to the usual Bethe free energy:

$$
\Delta F_{\mathrm{VBP}}=+\mathbb{H}\left[q(u_1|x_0)\right]. \qquad\text{(E.4)}
$$

After channel reparameterization via \([E\.2](https://arxiv.org/html/2606.04935#A5.E2)\):

$$
\begin{aligned}
F_{\mathrm{VBP}}[q,r_{u|x}] &= \int q_{\mathrm{obs}}\log\frac{q_{\mathrm{obs}}}{p(y_1|x_1,\theta)}\,\mathrm{d}y_1\,\mathrm{d}x_1\,\mathrm{d}\theta &&\text{(E.5)}\\
&\quad+\int q_{\mathrm{dyn}}\log\frac{q_{\mathrm{dyn}}}{p(x_1|x_0,\theta,u_1)}\,\mathrm{d}x_1\,\mathrm{d}x_0\,\mathrm{d}\theta\,\mathrm{d}u_1\\
&\quad-\int q_{\mathrm{dyn}}\log r_{u|x}(u_1|x_0)\,\mathrm{d}x_1\,\mathrm{d}x_0\,\mathrm{d}\theta\,\mathrm{d}u_1\\
&\quad+\text{(the usual singleton Bethe terms)}.
\end{aligned}
$$

### E\.3 Stationarity Conditions

We form the Lagrangian with the same normalization and consistency multipliers as in [Appendix D\.3](https://arxiv.org/html/2606.04935#A4.SS3), except that the three observation and predictive\-dynamics channel multipliers are replaced by a single multiplier $\nu_{u|x}(x_0)$ for the channel normalization constraint \([E\.1](https://arxiv.org/html/2606.04935#A5.E1)\)\.

#### E\.3\.1 Variation with respect to $q_{\mathrm{obs}}$

The observation factor receives no channel correction\. The stationarity condition is identical to standard Bethe:

$$
q_{\mathrm{obs}}^{*}(y_1,x_1,\theta)\propto p(y_1|x_1,\theta)\,e^{-\lambda_{y_1}(y_1)}\,e^{-\lambda_{\theta}^{(\mathrm{obs})}(\theta)}\,e^{-\lambda_{x_1}^{(\mathrm{obs})}(x_1)}. \qquad\text{(E.6)}
$$

The observation kernel is simply $p(y_1|x_1,\theta)$, as in standard belief propagation\.

#### E\.3\.2 Variation with respect to $q_{\mathrm{dyn}}$

The $q_{\mathrm{dyn}}$\-dependent terms include the channel correction $-\int q_{\mathrm{dyn}}\log r_{u|x}(u_1|x_0)$\. Taking $\frac{\delta\mathcal{L}}{\delta q_{\mathrm{dyn}}}=0$:

$$
\log q_{\mathrm{dyn}}+1-\log p(x_1|x_0,\theta,u_1)-\log r_{u|x}(u_1|x_0)+\lambda_{\mathrm{dyn}}+\lambda_{x_1}^{(\mathrm{dyn})}+\lambda_{x_0}+\lambda_{u_1}+\lambda_{\theta}^{(\mathrm{dyn})}=0. \qquad\text{(E.7)}
$$

###### Proposition 18 \(VBP dynamics factor belief\)\.

At stationarity:

$$
q_{\mathrm{dyn}}^{*}(x_1,x_0,\theta,u_1)\propto p(x_1|x_0,\theta,u_1)\,r_{u|x}(u_1|x_0)\,e^{-\lambda_{x_1}^{(\mathrm{dyn})}(x_1)}\,e^{-\lambda_{x_0}(x_0)}\,e^{-\lambda_{u_1}(u_1)}\,e^{-\lambda_{\theta}^{(\mathrm{dyn})}(\theta)}. \qquad\text{(E.8)}
$$

The product $\tilde{f}_{\mathrm{dyn}}(x_1,x_0,\theta,u_1):=p(x_1|x_0,\theta,u_1)\,r_{u|x}(u_1|x_0)$ is the VBP dynamics kernel\.

#### E\.3\.3 Variation with respect to $r_{u|x}$

The $r_{u|x}$\-dependent terms in the Lagrangian are:

$$
-\int q_{\mathrm{pair}}(x_0,u_1)\log r_{u|x}(u_1|x_0)\,\mathrm{d}u_1\,\mathrm{d}x_0+\int\nu_{u|x}(x_0)\left(\int r_{u|x}(u_1|x_0)\,\mathrm{d}u_1-1\right)\mathrm{d}x_0. \qquad\text{(E.9)}
$$

Pointwise derivative:

$$
-\frac{q_{\mathrm{pair}}(x_0,u_1)}{r_{u|x}(u_1|x_0)}+\nu_{u|x}(x_0)=0. \qquad\text{(E.10)}
$$

Imposing normalization $\int r_{u|x}\,\mathrm{d}u_1=1$ yields $\nu_{u|x}(x_0)=q_{x_0}(x_0)$, where $q_{x_0}(x_0)=\int q_{\mathrm{pair}}(x_0,u_1)\,\mathrm{d}u_1$\.

###### Proposition 19 \(Action channel\)\.

At stationarity:

$$
r_{u|x}^{*}(u_1|x_0)=\frac{q_{\mathrm{pair}}(x_0,u_1)}{q_{x_0}(x_0)}=q(u_1|x_0), \qquad\text{(E.11)}
$$

the conditional from the dynamics factor belief\.

#### E\.3\.4 Singleton stationarity

The singleton stationarity conditions are identical to the corresponding degree\-counting argument in [Appendix D\.4](https://arxiv.org/html/2606.04935#A4.SS4): degree\-2 multipliers are identified algebraically, while degree\-3 multipliers satisfy the same constraint equations on sums of multipliers\.

### E\.4 Message\-Passing Equations

Using the observation kernel $\tilde{f}_{\mathrm{obs}}=p(y_1|x_1,\theta)$ and the VBP dynamics kernel $\tilde{f}_{\mathrm{dyn}}=p(x_1|x_0,\theta,u_1)\,r_{u|x}(u_1|x_0)$, the sum\-product messages follow from the standard factor\-to\-variable update\. Variable\-to\-factor messages $\mu_{i\to a}$ are the usual products of all incoming factor\-to\-variable messages except the one from the recipient factor\.

#### E\.4\.1 Messages from Observation Factor

Standard sum\-product \(no channel modification\):

$$
\begin{aligned}
\mu_{\mathrm{obs}\to\theta}(\theta) &= \iint p(y_1|x_1,\theta)\,\mu_{y_1\to\mathrm{obs}}(y_1)\,\mu_{x_1\to\mathrm{obs}}(x_1)\,\mathrm{d}y_1\,\mathrm{d}x_1, &&\text{(E.12a)}\\
\mu_{\mathrm{obs}\to x_1}(x_1) &= \iint p(y_1|x_1,\theta)\,\mu_{y_1\to\mathrm{obs}}(y_1)\,\mu_{\theta\to\mathrm{obs}}(\theta)\,\mathrm{d}y_1\,\mathrm{d}\theta, &&\text{(E.12b)}\\
\mu_{\mathrm{obs}\to y_1}(y_1) &= \iint p(y_1|x_1,\theta)\,\mu_{\theta\to\mathrm{obs}}(\theta)\,\mu_{x_1\to\mathrm{obs}}(x_1)\,\mathrm{d}x_1\,\mathrm{d}\theta. &&\text{(E.12c)}
\end{aligned}
$$

#### E\.4\.2 Messages from Dynamics Factor

Using kernel $p(x_1|x_0,\theta,u_1)\,r_{u|x}(u_1|x_0)$:

$$
\begin{aligned}
\mu_{\mathrm{dyn}\to\theta}(\theta) &= \iiint p(x_1|x_0,\theta,u_1)\,r_{u|x}(u_1|x_0)\,\mu_{x_0\to\mathrm{dyn}}(x_0)\,\mu_{u_1\to\mathrm{dyn}}(u_1)\,\mu_{x_1\to\mathrm{dyn}}(x_1)\,\mathrm{d}x_1\,\mathrm{d}x_0\,\mathrm{d}u_1, &&\text{(E.13a)}\\
\mu_{\mathrm{dyn}\to x_1}(x_1) &= \iiint p(x_1|x_0,\theta,u_1)\,r_{u|x}(u_1|x_0)\,\mu_{x_0\to\mathrm{dyn}}(x_0)\,\mu_{u_1\to\mathrm{dyn}}(u_1)\,\mu_{\theta\to\mathrm{dyn}}(\theta)\,\mathrm{d}x_0\,\mathrm{d}u_1\,\mathrm{d}\theta, &&\text{(E.13b)}\\
\mu_{\mathrm{dyn}\to x_0}(x_0) &= \iiint p(x_1|x_0,\theta,u_1)\,r_{u|x}(u_1|x_0)\,\mu_{x_1\to\mathrm{dyn}}(x_1)\,\mu_{u_1\to\mathrm{dyn}}(u_1)\,\mu_{\theta\to\mathrm{dyn}}(\theta)\,\mathrm{d}x_1\,\mathrm{d}u_1\,\mathrm{d}\theta, &&\text{(E.13c)}\\
\mu_{\mathrm{dyn}\to u_1}(u_1) &= \iiint p(x_1|x_0,\theta,u_1)\,r_{u|x}(u_1|x_0)\,\mu_{x_0\to\mathrm{dyn}}(x_0)\,\mu_{\theta\to\mathrm{dyn}}(\theta)\,\mu_{x_1\to\mathrm{dyn}}(x_1)\,\mathrm{d}x_1\,\mathrm{d}x_0\,\mathrm{d}\theta. &&\text{(E.13d)}
\end{aligned}
$$
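For discrete variables the integrals in (E.13) become sums, and all four outgoing messages can be accumulated in one pass over the factor's configurations. The sketch below is a minimal illustration under that assumption; the dictionary layout, the function name `dynamics_messages`, and the deterministic XOR transition used as a toy model are hypothetical choices, not the paper's implementation.

```python
from itertools import product


def dynamics_messages(p_dyn, r, m_x1, m_x0, m_th, m_u):
    """Discrete sum-product messages out of the VBP dynamics factor.

    p_dyn[(x1, x0, th, u)] = p(x1 | x0, theta, u1); r[x0][u] = r_{u|x}(u1 | x0);
    m_x1, m_x0, m_th, m_u are incoming variable-to-factor messages (lists).
    """
    out_th = [0.0] * len(m_th)
    out_x1 = [0.0] * len(m_x1)
    out_x0 = [0.0] * len(m_x0)
    out_u = [0.0] * len(m_u)
    ranges = (range(len(m_x1)), range(len(m_x0)), range(len(m_th)), range(len(m_u)))
    for x1, x0, th, u in product(*ranges):
        k = p_dyn[(x1, x0, th, u)] * r[x0][u]            # VBP dynamics kernel
        out_th[th] += k * m_x0[x0] * m_u[u] * m_x1[x1]   # (E.13a)
        out_x1[x1] += k * m_x0[x0] * m_u[u] * m_th[th]   # (E.13b)
        out_x0[x0] += k * m_x1[x1] * m_u[u] * m_th[th]   # (E.13c)
        out_u[u] += k * m_x0[x0] * m_th[th] * m_x1[x1]   # (E.13d)
    return out_th, out_x1, out_x0, out_u


# Hypothetical toy model: x1 = x0 XOR u, one parameter value, uniform channel.
p_dyn = {(x1, x0, 0, u): 1.0 if x1 == (x0 ^ u) else 0.0
         for x1, x0, u in product(range(2), range(2), range(2))}
r = [[0.5, 0.5], [0.5, 0.5]]
out_th, out_x1, out_x0, out_u = dynamics_messages(
    p_dyn, r, m_x1=[0.0, 1.0], m_x0=[1.0, 0.0], m_th=[1.0], m_u=[1.0, 1.0])
```

With the agent known to start in state 0 and the backward message favouring state 1, the message to the action puts all its mass on the action that flips the state.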

#### E\.4\.3 Factor Beliefs and Channel Update

The factor beliefs are the kernel times all incoming variable\-to\-factor messages:

$$
\begin{aligned}
q_{\mathrm{obs}}^{*}(y_1,x_1,\theta) &\propto p(y_1|x_1,\theta)\,\mu_{y_1\to\mathrm{obs}}(y_1)\,\mu_{x_1\to\mathrm{obs}}(x_1)\,\mu_{\theta\to\mathrm{obs}}(\theta), &&\text{(E.14a)}\\
q_{\mathrm{dyn}}^{*}(x_1,x_0,\theta,u_1) &\propto p(x_1|x_0,\theta,u_1)\,r_{u|x}(u_1|x_0)\,\mu_{x_1\to\mathrm{dyn}}(x_1)\,\mu_{x_0\to\mathrm{dyn}}(x_0)\,\mu_{\theta\to\mathrm{dyn}}(\theta)\,\mu_{u_1\to\mathrm{dyn}}(u_1). &&\text{(E.14b)}
\end{aligned}
$$

The channel update follows from [Proposition 19](https://arxiv.org/html/2606.04935#Thmtheorem19):

$$
r_{u|x}^{*}(u_1|x_0)=\frac{q_{\mathrm{pair}}(x_0,u_1)}{q_{x_0}(x_0)}=q(u_1|x_0),\quad\text{where }q_{\mathrm{pair}}(x_0,u_1)=\iint q_{\mathrm{dyn}}^{*}\,\mathrm{d}x_1\,\mathrm{d}\theta. \qquad\text{(E.15)}
$$
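In a discrete setting the channel update (E.15) reduces to marginalizing the dynamics factor belief over $x_1$ and $\theta$ and normalizing each row. The following sketch assumes the belief is stored as a dictionary keyed by `(x1, x0, theta, u1)`; the function name and the uniform fallback for states with zero mass are assumptions added for this illustration.

```python
from itertools import product


def update_action_channel(q_dyn, n_x, n_u):
    """Return r[x0][u1] = q_pair(x0, u1) / q_x0(x0) from q_dyn[(x1, x0, th, u1)]."""
    q_pair = [[0.0] * n_u for _ in range(n_x)]
    for (x1, x0, th, u1), mass in q_dyn.items():
        q_pair[x0][u1] += mass  # marginalize over x1 and theta
    r = []
    for x0 in range(n_x):
        q_x0 = sum(q_pair[x0])
        # Assumption: use a uniform channel where q_x0(x0) = 0.
        r.append([q / q_x0 if q_x0 > 0.0 else 1.0 / n_u for q in q_pair[x0]])
    return r


# Hypothetical belief over 2 states, 2 parameter values, 2 actions.
w = {(x1, x0, th, u): 1.0 + x1 + th + 2.0 * x0 * u
     for x1, x0, th, u in product(range(2), range(2), range(2), range(2))}
z = sum(w.values())
q_dyn = {k: v / z for k, v in w.items()}
r_star = update_action_channel(q_dyn, n_x=2, n_u=2)
```

Here `r_star[0]` is `[0.5, 0.5]` and `r_star[1]` is `[1/3, 2/3]`, since the toy weights favour action 1 only when the previous state is 1.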

#### E\.4\.4 Singleton Beliefs

Singleton beliefs follow the standard sum\-product rule: the product of all incoming factor\-to\-variable messages with the prior \(or goal prior\):

$$
\begin{aligned}
q_{\theta}^{*}(\theta) &\propto p(\theta)\,\mu_{\mathrm{obs}\to\theta}(\theta)\,\mu_{\mathrm{dyn}\to\theta}(\theta), &&\text{(E.16a)}\\
q_{x_1}^{*}(x_1) &\propto \hat{p}_x(x_1)\,\mu_{\mathrm{obs}\to x_1}(x_1)\,\mu_{\mathrm{dyn}\to x_1}(x_1), &&\text{(E.16b)}\\
q_{x_0}^{*}(x_0) &\propto p(x_0)\,\mu_{\mathrm{dyn}\to x_0}(x_0), &&\text{(E.16c)}\\
q_{u_1}^{*}(u_1) &\propto p(u_1)\,\mu_{\mathrm{dyn}\to u_1}(u_1), &&\text{(E.16d)}\\
q_{y_1}^{*}(y_1) &\propto \hat{p}_y(y_1)\,\mu_{\mathrm{obs}\to y_1}(y_1). &&\text{(E.16e)}
\end{aligned}
$$

### E\.5 Generic Scheme for Arbitrary $T$

The VBP scheme generalizes to arbitrary time horizons by introducing a time\-local channel $r_{u|x,t}(u_t|x_{t-1})$ at each time step $t=1,\ldots,T$\. As in [Appendix D\.6](https://arxiv.org/html/2606.04935#A4.SS6), the per\-timestep additivity of the Lagrangian and the entropy correction $+\sum_{t}\mathbb{H}\left[q(u_t|x_{t-1})\right]$ yields the multi\-step scheme directly from the $T=1$ derivation\.

#### E\.5\.1 Messages from Observation Factor $f_{\mathrm{obs}_t}$

Standard sum\-product \(no channel modification\):

$$
\begin{aligned}
\mu_{\mathrm{obs}_t\to\theta}(\theta) &= \iint p(y_t|x_t,\theta)\,\mu_{y_t\to\mathrm{obs}_t}(y_t)\,\mu_{x_t\to\mathrm{obs}_t}(x_t)\,\mathrm{d}y_t\,\mathrm{d}x_t, &&\text{(E.17a)}\\
\mu_{\mathrm{obs}_t\to x_t}(x_t) &= \iint p(y_t|x_t,\theta)\,\mu_{y_t\to\mathrm{obs}_t}(y_t)\,\mu_{\theta\to\mathrm{obs}_t}(\theta)\,\mathrm{d}y_t\,\mathrm{d}\theta, &&\text{(E.17b)}\\
\mu_{\mathrm{obs}_t\to y_t}(y_t) &= \iint p(y_t|x_t,\theta)\,\mu_{\theta\to\mathrm{obs}_t}(\theta)\,\mu_{x_t\to\mathrm{obs}_t}(x_t)\,\mathrm{d}x_t\,\mathrm{d}\theta. &&\text{(E.17c)}
\end{aligned}
$$

#### E\.5\.2 Messages from Dynamics Factor $f_{\mathrm{dyn}_t}$

Using kernel $p(x_t|x_{t-1},\theta,u_t)\,r_{u|x,t}(u_t|x_{t-1})$:

$$
\begin{aligned}
\mu_{\mathrm{dyn}_t\to\theta}(\theta) &= \iiint p(x_t|x_{t-1},\theta,u_t)\,r_{u|x,t}(u_t|x_{t-1})\,\mu_{x_{t-1}\to\mathrm{dyn}_t}(x_{t-1})\,\mu_{u_t\to\mathrm{dyn}_t}(u_t)\,\mu_{x_t\to\mathrm{dyn}_t}(x_t)\,\mathrm{d}x_t\,\mathrm{d}x_{t-1}\,\mathrm{d}u_t, &&\text{(E.18a)}\\
\mu_{\mathrm{dyn}_t\to x_t}(x_t) &= \iiint p(x_t|x_{t-1},\theta,u_t)\,r_{u|x,t}(u_t|x_{t-1})\,\mu_{x_{t-1}\to\mathrm{dyn}_t}(x_{t-1})\,\mu_{u_t\to\mathrm{dyn}_t}(u_t)\,\mu_{\theta\to\mathrm{dyn}_t}(\theta)\,\mathrm{d}x_{t-1}\,\mathrm{d}u_t\,\mathrm{d}\theta, &&\text{(E.18b)}\\
\mu_{\mathrm{dyn}_t\to x_{t-1}}(x_{t-1}) &= \iiint p(x_t|x_{t-1},\theta,u_t)\,r_{u|x,t}(u_t|x_{t-1})\,\mu_{x_t\to\mathrm{dyn}_t}(x_t)\,\mu_{u_t\to\mathrm{dyn}_t}(u_t)\,\mu_{\theta\to\mathrm{dyn}_t}(\theta)\,\mathrm{d}x_t\,\mathrm{d}u_t\,\mathrm{d}\theta, &&\text{(E.18c)}\\
\mu_{\mathrm{dyn}_t\to u_t}(u_t) &= \iiint p(x_t|x_{t-1},\theta,u_t)\,r_{u|x,t}(u_t|x_{t-1})\,\mu_{x_{t-1}\to\mathrm{dyn}_t}(x_{t-1})\,\mu_{\theta\to\mathrm{dyn}_t}(\theta)\,\mu_{x_t\to\mathrm{dyn}_t}(x_t)\,\mathrm{d}x_t\,\mathrm{d}x_{t-1}\,\mathrm{d}\theta. &&\text{(E.18d)}
\end{aligned}
$$

#### E\.5\.3 Factor Beliefs

$$
\begin{aligned}
q_{\mathrm{obs},t}^{*}(y_t,x_t,\theta) &\propto p(y_t|x_t,\theta)\,\mu_{y_t\to\mathrm{obs}_t}(y_t)\,\mu_{x_t\to\mathrm{obs}_t}(x_t)\,\mu_{\theta\to\mathrm{obs}_t}(\theta), &&\text{(E.19a)}\\
q_{\mathrm{dyn},t}^{*}(x_t,x_{t-1},\theta,u_t) &\propto p(x_t|x_{t-1},\theta,u_t)\,r_{u|x,t}(u_t|x_{t-1})\,\mu_{x_t\to\mathrm{dyn}_t}(x_t)\,\mu_{x_{t-1}\to\mathrm{dyn}_t}(x_{t-1})\,\mu_{\theta\to\mathrm{dyn}_t}(\theta)\,\mu_{u_t\to\mathrm{dyn}_t}(u_t). &&\text{(E.19b)}
\end{aligned}
$$

#### E\.5\.4 Channel Updates

Each time step has its own channel update:

$$
r_{u|x,t}^{*}(u_t|x_{t-1})=q_t(u_t|x_{t-1}),\quad\text{where }q_t(u_t|x_{t-1})=\frac{q_{\mathrm{pair},t}(x_{t-1},u_t)}{q_{x_{t-1}}(x_{t-1})} \qquad\text{(E.20)}
$$

with $q_{\mathrm{pair},t}(x_{t-1},u_t)=\iint q_{\mathrm{dyn},t}^{*}\,\mathrm{d}x_t\,\mathrm{d}\theta$\.

#### E\.5\.5 Singleton Beliefs

$$
\begin{aligned}
q_{x_t}^{*}(x_t) &\propto \hat{p}_x(x_t)\,\mu_{\mathrm{obs}_t\to x_t}(x_t)\,\mu_{\mathrm{dyn}_t\to x_t}(x_t)\,\mu_{\mathrm{dyn}_{t+1}\to x_t}(x_t), &&\text{(E.21a)}\\
q_{\theta}^{*}(\theta) &\propto p(\theta)\prod_{\tau=1}^{T}\mu_{\mathrm{obs}_\tau\to\theta}(\theta)\,\mu_{\mathrm{dyn}_\tau\to\theta}(\theta), &&\text{(E.21b)}\\
q_{u_t}^{*}(u_t) &\propto p(u_t)\,\mu_{\mathrm{dyn}_t\to u_t}(u_t), &&\text{(E.21c)}\\
q_{y_t}^{*}(y_t) &\propto \hat{p}_y(y_t)\,\mu_{\mathrm{obs}_t\to y_t}(y_t). &&\text{(E.21d)}
\end{aligned}
$$

The boundary conditions follow the usual temporal\-edge conventions: at $t=1$, $\mu_{x_0\to\mathrm{dyn}_1}(x_0)=p(x_0)$; at $t=T$, the message $\mu_{\mathrm{dyn}_{T+1}\to x_T}$ is absent\.

### E\.6 Properties

##### No min\-max structure\.

Unlike AIF, where opposing channels create a min\-max optimization over beliefs, VBP has a single channel that enters only the dynamics kernel numerator\. The joint optimization over $(q,r_{u|x})$ is a pure minimization problem, avoiding the convergence difficulties of AIF’s saddle\-point structure\.

##### Convergence\.

The observation factor uses standard sum\-product messages with no channel modification, so standard BP convergence guarantees apply to the observation side\. The dynamics channel provides a single\-sided correction\. While damping may improve convergence in practice, the lack of opposing channels makes it less critical than for AIF\.

##### Reduction\.

Setting $r_{u|x,t}$ to uniform for all $t$ removes the entropy correction and recovers standard belief propagation \(marginal inference\)\. The VBP scheme thus continuously interpolates between marginal inference \(uniform channel\) and full cross\-entropy planning \(converged channel\)\.
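The reduction can be illustrated on a small discrete example: with a uniform channel the VBP kernel equals the transition model times a constant, so every message matches the standard sum-product message after normalization. The sketch below drops $\theta$ for brevity and uses hypothetical numbers; `message_to_action` and `normalize` are illustrative helper names.

```python
def message_to_action(p_dyn, r, m_x0, m_x1):
    """mu_{dyn->u}(u) = sum_{x1,x0} p(x1|x0,u) r(u|x0) m_x0(x0) m_x1(x1).

    p_dyn is indexed as p_dyn[x1][x0][u]; theta is omitted for brevity.
    """
    n_x, n_u = len(m_x0), len(r[0])
    return [sum(p_dyn[x1][x0][u] * r[x0][u] * m_x0[x0] * m_x1[x1]
                for x1 in range(n_x) for x0 in range(n_x))
            for u in range(n_u)]


def normalize(v):
    total = sum(v)
    return [x / total for x in v]


# Hypothetical transition model p(x1 | x0, u) for 2 states and 2 actions.
p_dyn = [[[0.8, 0.1], [0.4, 0.7]],   # x1 = 0
         [[0.2, 0.9], [0.6, 0.3]]]   # x1 = 1
m_x0 = [0.7, 0.3]
m_x1 = [0.1, 0.9]

uniform_channel = [[0.5, 0.5], [0.5, 0.5]]
no_channel = [[1.0, 1.0], [1.0, 1.0]]  # plain sum-product kernel p(x1|x0,u)

vbp_msg = normalize(message_to_action(p_dyn, uniform_channel, m_x0, m_x1))
bp_msg = normalize(message_to_action(p_dyn, no_channel, m_x0, m_x1))
```

The two normalized messages coincide, which is the discrete counterpart of the statement that a uniform channel leaves marginal inference unchanged.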

##### Fixed\-point interpretation\.

At convergence, $r_{u|x,t}^{*}(u_t|x_{t-1})=q(u_t|x_{t-1})$, so the VBP dynamics kernel becomes $p(x_t|x_{t-1},\theta,u_t)\,q(u_t|x_{t-1})$\. This reweights the state transition by the policy conditioned on the previous state, implementing the action commitment that cross\-entropy planning encodes\.

## Appendix F Experiment Details

This appendix provides full details for the experiments in [Section 6](https://arxiv.org/html/2606.04935#S6)\.

### F\.1 Frozen Lake Environment

##### Grid layout\.

We adapt the classic Frozen Lake environment \[Brockman et al\., [2016](https://arxiv.org/html/2606.04935#bib.bib5), towers\_gymnasium\_2024a\] to include epistemic uncertainty by treating the hole layout as unknown\. The environment is a $4\times 4$ grid in which the agent starts at the top\-left cell and must reach the goal at the bottom\-right cell\. A subset of the remaining cells are holes; stepping into a hole terminates the episode with failure\. The ice surface makes transitions stochastic: with probability $1-p_{\text{slip}}$ the agent moves in the intended direction, and with probability $p_{\text{slip}}/3$ it slips to each of the three remaining directions, where $p_{\text{slip}}=0.1$\.
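The slip model above can be written down as a short sketch. The direction names, the `(row, column)` cell encoding, and the rule that a move off the grid leaves the agent in place are assumptions made for illustration; the text does not specify how the grid boundary is handled.

```python
P_SLIP = 0.1
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def direction_distribution(intended, p_slip=P_SLIP):
    """Probability of actually moving in each direction given the intended one."""
    return {d: (1.0 - p_slip) if d == intended else p_slip / 3.0 for d in MOVES}


def next_cell_distribution(cell, intended, size=4, p_slip=P_SLIP):
    """Distribution over next cells; off-grid moves keep the agent in place (assumption)."""
    dist = {}
    for direction, prob in direction_distribution(intended, p_slip).items():
        row = cell[0] + MOVES[direction][0]
        col = cell[1] + MOVES[direction][1]
        nxt = (row, col) if 0 <= row < size and 0 <= col < size else cell
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist


directions = direction_distribution("RIGHT")
from_start = next_cell_distribution((0, 0), "RIGHT")
```

From the top-left cell with intended direction RIGHT, the agent reaches `(0, 1)` with probability 0.9, slips down to `(1, 0)` with probability 0.1/3, and under the boundary assumption stays put with probability 0.2/3.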

##### Configurations\.

The hole layout is the unknown parameter $\theta$\. We sample $15$ hole configurations uniformly at random \(with fixed seeds for reproducibility\), each placing $2$ holes on the grid \(a fraction $0.2$ of the $14$ non\-start/goal cells, truncated to an integer\)\. The agent does not know the hole locations and must infer them from observations\.
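The hole count follows from truncating $0.2\times 14=2.8$ to $2$. A minimal sketch of such a sampling step is given below; the seed value, the function name `sample_configurations`, and the choice not to reject duplicate configurations are assumptions, since the text only states that fixed seeds are used.

```python
import random

GRID = 4
cells = [(row, col) for row in range(GRID) for col in range(GRID)]
start, goal = (0, 0), (GRID - 1, GRID - 1)
candidates = [c for c in cells if c not in (start, goal)]  # 14 non-start/goal cells
n_holes = int(0.2 * len(candidates))                       # truncation of 2.8


def sample_configurations(n_configs=15, seed=0):
    """Draw hole layouts uniformly at random (seed and duplicate handling are assumptions)."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(candidates, n_holes))) for _ in range(n_configs)]


configs = sample_configurations()
```

Each element of `configs` is one candidate value of $\theta$, given as a pair of hole coordinates.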

##### Observation model\.

The observation model has two modalities\. First, $2n_{\text{pos}}$ position channels observe the agent’s position and scan mode with near\-deterministic precision \($0.999/0.001$\), so the agent always knows where it is\. Second, $n_{\text{pos}}$ grid\-cell channels each provide a binary “hole/safe” reading for the corresponding cell\. Unscanned grid\-cell observations are corrupted by distance\-dependent noise: $\text{noise}=\alpha_{\text{base}}+\alpha_{\text{range}}\cdot d/d_{\text{max}}$, where $d$ is the Manhattan distance from the agent to the cell\. A low\-noise reading directly constrains which configurations $\theta$ are consistent, making the observation model approximately unambiguous\. A SCAN action switches all grid\-cell observations to near\-deterministic \($0.999/0.001$\) at the cost of one time step\.
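The distance-dependent noise can be sketched as follows. The values of `alpha_base`, `alpha_range`, and `d_max` are hypothetical placeholders (the text does not report them), and treating the scanned-mode noise as a flip probability of 0.001 is an interpretation of the stated 0.999/0.001 precision.

```python
def manhattan(a, b):
    """Manhattan distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def grid_cell_noise(agent, cell, alpha_base, alpha_range, d_max, scanned=False):
    """Noise level of the hole/safe reading for `cell` as seen from `agent`."""
    if scanned:
        return 0.001  # near-deterministic reading after SCAN (interpretation)
    return alpha_base + alpha_range * manhattan(agent, cell) / d_max


# Hypothetical parameter values for a 4x4 grid (largest Manhattan distance is 6).
ALPHA_BASE, ALPHA_RANGE, D_MAX = 0.05, 0.4, 6

near = grid_cell_noise((0, 0), (0, 0), ALPHA_BASE, ALPHA_RANGE, D_MAX)
far = grid_cell_noise((0, 0), (3, 3), ALPHA_BASE, ALPHA_RANGE, D_MAX)
after_scan = grid_cell_noise((0, 0), (3, 3), ALPHA_BASE, ALPHA_RANGE, D_MAX, scanned=True)
```

Under these placeholder values the noise grows linearly from `alpha_base` at the agent's own cell to `alpha_base + alpha_range` at the far corner, and SCAN removes the distance dependence.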

##### State, action, and observation spaces\.

States encode position and scan mode: $2\times n_{\text{pos}}$ states total \(e\.g\., $32$ for a $4\times 4$ grid\)\. Actions are the four cardinal directions plus SCAN \($|\mathcal{U}|=5$\)\. Observations consist of $2n_{\text{pos}}$ position channels and $n_{\text{pos}}$ grid\-cell channels\.

##### Priors\.

The goal prior $\hat{p}(x_{T})$ is a softmax preference peaking at the bottom-right cell, with penalties for hole positions (varying per $\theta$). The parameter prior is uniform over the $15$ configurations: $p(\theta)=1/15$. The initial state prior $p(x_{0})$ places all mass on the top-left cell in unscanned mode. The action prior assigns weight $1$ to each movement action and weight $c_{\text{scan}}=0.1$ to SCAN, then normalizes: $p(u_{t})=w_{u_{t}}/\sum_{u}w_{u}$, giving $p(\text{move})\approx 0.244$ and $p(\text{SCAN})\approx 0.024$.
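The action-prior normalization can be checked with a few lines of Python. The action labels and the function name `action_prior` are illustrative choices, not taken from the authors' code.

```python
def action_prior(weights):
    """Normalize unnormalized action weights: p(u) = w_u / sum_u w_u."""
    total = sum(weights.values())
    return {action: w / total for action, w in weights.items()}

# Frozen Lake: weight 1 per movement action, c_scan = 0.1 for SCAN.
frozen_lake_weights = {"up": 1.0, "right": 1.0, "down": 1.0, "left": 1.0, "SCAN": 0.1}
prior = action_prior(frozen_lake_weights)

p_move = prior["up"]    # 1 / 4.1
p_scan = prior["SCAN"]  # 0.1 / 4.1
```

Here `p_move` is $1/4.1\approx 0.244$ and `p_scan` is $0.1/4.1\approx 0.024$, matching the values quoted above.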

##### Planning parameters\.

Planning horizon $T=15$, fixed across all decision steps. All methods use $400$ iterations. We run $1000$ episodes per method with a maximum of $15$ steps per episode. Episode $i$ uses seed $i$ for reproducibility.

### F.2 RockSample Environment

##### Grid layout\.

We adapt the classic RockSample environment \[smith\_heuristic\_2012\] to the epistemic planning framework by treating rock quality as the unknown parameter $\theta$. The environment is a $5\times 5$ grid in which the agent starts at the left edge and can exit via the right edge. Two rocks are placed at known grid positions; their quality (good or bad) is unknown.

##### Configurations\.

Rock quality defines the unknown parameter $\theta$. With $2$ rocks each having binary quality, there are $n_{\theta}=4$ configurations. The agent does not know rock quality and must infer it from distance-dependent observations.

##### Observation model\.

The observation model has two components. First, position channels observe the agent’s position with noise parameter $\alpha_{\text{pos}}=0.3$. Second, the agent passively receives a binary “good/bad” quality reading for the nearest rock, whose accuracy depends on the Euclidean distance $d$ from the agent to that rock: $p(\text{correct}\mid d)=\tfrac{1}{2}\bigl(1+2^{-d/d_{1/2}}\bigr)$, where $d_{1/2}=0.5$ is the half-efficiency distance. At $d=0$ the reading is deterministic; at $d=d_{1/2}$ accuracy is $75\%$; as $d\to\infty$ the reading approaches chance. This makes the observation model locally decisive: nearby observations reliably determine rock quality, so the epistemic strategy is spatial, namely to approach a rock before sampling.
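The accuracy curve is simple enough to evaluate directly. The sketch below implements the stated formula; the function names and the `(row, col)` coordinate convention are illustrative assumptions.

```python
import math

def euclidean(agent, rock):
    """Euclidean distance between two (row, col) grid positions."""
    return math.hypot(agent[0] - rock[0], agent[1] - rock[1])

def p_correct(d, d_half=0.5):
    """Accuracy of the passive rock-quality reading at distance d."""
    return 0.5 * (1.0 + 2.0 ** (-d / d_half))

at_rock = p_correct(0.0)          # deterministic reading
at_half = p_correct(0.5)          # half-efficiency distance
one_cell_away = p_correct(euclidean((2, 2), (2, 3)))
two_cells_away = p_correct(2.0)
```

With $d_{1/2}=0.5$, accuracy falls from $1$ at the rock to $0.625$ one cell away and $0.53125$ two cells away, which is why readings are informative only close to a rock.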

##### Actions and rewards\.

The agent has seven actions: four cardinal movements, CHECK, SAMPLE, and EXIT (move off the right edge). CHECK reveals the quality of the nearest rock at the cost of one time step, providing the main explicit epistemic action. SAMPLE collects the rock at the current position, yielding reward $+2$ (good rock) or penalty $-3$ (bad rock). EXIT gives a fixed reward $+1$. Movement is deterministic ($p_{\text{slip}}=0$).

##### State, action, and observation spaces\.

States encode position: $n_{\text{pos}}=25$ states for the $5\times 5$ grid. Actions: four cardinal directions, CHECK, SAMPLE, and EXIT ($|\mathcal{U}|=7$). Observations consist of $n_{\text{pos}}$ position channels and $2$ binary rock-quality channels.

##### Priors\.

The goal prior $\hat{p}(x_{T})$ is a softmax preference with temperature $\tau=0.5$ peaking at EXIT cells, with penalties for remaining on the grid. The parameter prior is uniform over the $4$ configurations: $p(\theta)=1/4$. The action prior assigns weight $1$ to each movement action and weight $c_{\text{exit}}=0.5$ to EXIT, then normalizes.

##### Planning parameters\.

Planning horizon $T=15$, fixed across all decision steps. BP uses $50$ iterations; all other methods use $200$ iterations. We run $1000$ episodes per method with a maximum of $15$ steps per episode. Episode $i$ uses seed $i$ for reproducibility.

##### Full results with confidence intervals\.

[Table 3](https://arxiv.org/html/2606.04935#A6.T3) reports RockSample results with 95% confidence intervals.

Table 3: RockSample results with 95% confidence intervals, averaged over 1000 episodes.

### F.3 Wumpus World Environment

##### Grid layout\.

We adapt the classic Wumpus World environment \[Russell and Norvig, [1995](https://arxiv.org/html/2606.04935#bib.bib29)\] to include epistemic uncertainty by treating the layout as unknown. We simplify the classic dynamics to isolate the epistemic challenge: the agent has no orientation or inventory and navigates by cardinal movement. The agent starts at cell $0$ and must reach the gold cell. The grid contains pits and a wumpus (both absorbing hazards that terminate the episode) and a single gold cell.

##### Configurations\.

The locations of pits, wumpus, and gold define the unknown parameter $\theta$. We use $25$ configurations sampled with fixed seeds. Each configuration places $4$ pits, one wumpus, and one gold on the grid, excluding the agent’s starting cell.

##### Observation model\.

The observation model has two components, both noisy when unscanned. Three binary feature channels detect adjacency to hazards: *breeze* fires if adjacent to a pit, *stench* fires if adjacent to the wumpus, and *glitter* fires if on the gold cell. Unscanned feature channels have true-positive probability $p_{\text{tp}}=1-\alpha_{\text{obs}}$ and false-positive probability $p_{\text{fp}}=0.1\,\alpha_{\text{obs}}$. Additionally, $n_{\text{pos}}$ position channels encode the agent’s position, also noisy when unscanned. Observations are ambiguous: a breeze indicates a nearby pit but not which neighbor, and position uncertainty compounds this ambiguity. The agent must integrate evidence across multiple positions to triangulate hazard locations. A SCAN action switches all channels (feature and position) to near-deterministic ($0.999/0.001$) at the cost of one time step.
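To illustrate why a single breeze reading is only suggestive, the sketch below performs one Bayes update over three toy configurations. It makes several assumptions: the agent's position is treated as known, $\alpha_{\text{obs}}=0.2$ is a placeholder value, $0.999/0.001$ is read as the firing probability when the feature is present or absent, and all function names are hypothetical.

```python
def feature_likelihood(reading, feature_present, alpha_obs, scanned=False):
    """P(reading | whether the feature is truly present) for one binary channel."""
    if scanned:
        p_fire = 0.999 if feature_present else 0.001
    else:
        # p_tp = 1 - alpha_obs, p_fp = 0.1 * alpha_obs
        p_fire = (1.0 - alpha_obs) if feature_present else 0.1 * alpha_obs
    return p_fire if reading else 1.0 - p_fire

def update_posterior(prior, breeze_truth, reading, alpha_obs, scanned=False):
    """One Bayes step over configurations.

    breeze_truth[k] says whether configuration k predicts a breeze at the
    agent's current cell.
    """
    unnorm = [p * feature_likelihood(reading, truth, alpha_obs, scanned)
              for p, truth in zip(prior, breeze_truth)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Toy case: configurations 0 and 1 place a pit next to the agent (on different
# sides), configuration 2 does not.
prior = [1 / 3, 1 / 3, 1 / 3]
posterior = update_posterior(prior, [True, True, False], reading=True, alpha_obs=0.2)
```

The breeze reading nearly rules out the pit-free configuration but leaves the two pit placements equally likely; separating them requires readings from other cells.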

##### State, action, and observation spaces\.

States encode position and scan mode: $2\times n_{\text{pos}}$ states total (e.g., $50$ for a $5\times 5$ grid). Actions are the four cardinal directions plus SCAN ($|\mathcal{U}|=5$). Observations consist of $3$ binary feature channels (breeze, stench, glitter) and $n_{\text{pos}}$ position channels.

##### Priors\.

The goal prior $\hat{p}(x_{T})$ is a softmax preference peaking at the gold cell for each $\theta$, with penalties for pits and the wumpus. The parameter prior is uniform over the $25$ configurations. The initial state prior $p(x_{0})$ places all mass on position $0$ in unscanned mode. The action prior assigns weight $1$ to each movement action and weight $c_{\text{scan}}=0.7$ to SCAN, then normalizes: $p(u_{t})=w_{u_{t}}/\sum_{u}w_{u}$, giving $p(\text{move})\approx 0.213$ and $p(\text{SCAN})\approx 0.149$.

##### Planning parameters\.

Planning horizon $T=7$, fixed across all decision steps. All methods use $400$ iterations. SCAN costs one time step (same as Frozen Lake). We run $1000$ episodes per method with a maximum of $10$ steps per episode. Episode $i$ uses seed $i$ for reproducibility.

##### Representative trajectories\.

[Figures 4](https://arxiv.org/html/2606.04935#A6.F4) and [5](https://arxiv.org/html/2606.04935#A6.F5) show two representative episodes that illustrate the qualitative behavioral separation behind the aggregate Wumpus gap reported in [Section 6.2](https://arxiv.org/html/2606.04935#S6.SS2). Each panel shows one method on a fixed layout, with the agent’s path coloured by step order (light early, dark late). The header below each panel reports the terminal reward, the number of steps taken, the outcome (success or pit), and the position and reading of any SCAN action.

[Figure 4](https://arxiv.org/html/2606.04935#A6.F4) illustrates the suggestive-observation regime that drives the gap. BP, RM-MP, and VBP step into a pit on the first move without scanning. Nuijten-MP scans at the start and acquires the breeze, no-stench, no-glitter reading, but its non-variational prior update fails to translate this evidence into a safe move, so it commits to a doomed step and falls in a pit on step 2. AIF-MP scans, observes the same reading, and then reaches the gold cell in seven further steps.

[Figure 5](https://arxiv.org/html/2606.04935#A6.F5) shows a layout where one column happens to be hazard-free, so methods that move forward without scanning can succeed by luck. BP reaches the gold cell in eight steps and VBP in nine steps, both without scanning; RM-MP and Nuijten-MP still step into a pit within five steps. AIF-MP scans first, then follows the same column to the gold cell in eight steps.

![Refer to caption](https://arxiv.org/html/2606.04935v1/x4.png) Figure 4: Wumpus World trajectories for all five methods on configuration $18$, episode $4$. Symbols: P pit, W wumpus, G gold, A agent start, S SCAN action; arrow shade encodes step order. BP, RM-MP, and VBP step into a pit without scanning; Nuijten-MP scans but still steps into a pit on step $2$; AIF-MP scans first, then reaches the gold cell in seven further steps.

![Refer to caption](https://arxiv.org/html/2606.04935v1/x5.png) Figure 5: Wumpus World trajectories on configuration $11$, episode $1$. A hazard-free column allows BP and VBP to succeed without scanning; RM-MP and Nuijten-MP still step into pits early. AIF-MP scans first and follows the same column to the gold cell.

### F.4 Common Implementation Details

##### Software framework\.

All tensor operations and message-passing routines use JAX \[Bradbury et al., [2018](https://arxiv.org/html/2606.04935#bib.bib4)\] with JIT compilation. All planning and inference functions are decorated with `jax.jit`, with the planning horizon and number of iterations as compile-time constants.

##### Log\-space computation\.

All internal messages, beliefs, and channels are stored as log-probabilities. A sentinel value of $-10^{12}$ replaces $-\infty$ for numerical stability, and a safe logarithm floors its argument at $10^{-30}$ before taking the log. Conversion to probability space occurs only at final output via softmax.
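The log-space conventions can be sketched in plain Python as follows. The paper's implementation uses JAX; this standard-library version only illustrates the sentinel, the floored logarithm, and the final softmax, and the names `NEG_INF`, `safe_log`, and `softmax` are hypothetical.

```python
import math

NEG_INF = -1e12     # finite sentinel used in place of -inf
LOG_FLOOR = 1e-30   # floor applied before taking the log

def safe_log(p):
    """Logarithm that never returns -inf: the argument is floored first."""
    return math.log(max(p, LOG_FLOOR))

def softmax(log_values):
    """Convert log-probabilities to probabilities (max-shifted for stability)."""
    m = max(log_values)
    exps = [math.exp(v - m) for v in log_values]  # sentinel entries underflow to 0.0
    z = sum(exps)
    return [e / z for e in exps]

log_belief = [safe_log(0.25), safe_log(0.75), NEG_INF]
belief = softmax(log_belief)
```

Entries held at the sentinel contribute exactly zero probability after the softmax, while `safe_log(0.0)` returns a large negative but finite value.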

##### Message damping\.

Channel-based methods (VBP, RM-MP, and AIF-MP) apply arithmetic damping ([18](https://arxiv.org/html/2606.04935#S5.E18)) in probability space (implemented via `logaddexp` in log-space). Damping is applied to all three channels (dynamics, observation, and marginal observation) after each BP iteration. Structural zeros are preserved: a position remains at $-10^{12}$ only if both old and new values are $-10^{12}$. For VBP, this reduces to damping the single policy channel; BP and Nuijten-MP do not use damping ($\lambda=1.0$). The damping parameter $\lambda$ is selected per method and environment from the convergence sweep described in [Section F.5](https://arxiv.org/html/2606.04935#A6.SS5). [Table 4](https://arxiv.org/html/2606.04935#A6.T4) reports the selected values.

Table 4: Damping parameter $\lambda$ per method and environment, selected from the convergence sweep ([Section F.5](https://arxiv.org/html/2606.04935#A6.SS5)).
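The damping step can be sketched as below. Equation (18) is not reproduced in this appendix, so the convex-combination form $m \leftarrow (1-\lambda)\,m_{\text{old}} + \lambda\,m_{\text{new}}$ is an assumption, chosen because it makes $\lambda=1.0$ correspond to no damping as stated above. The function names are hypothetical, and the paper's implementation uses JAX rather than plain Python.

```python
import math

NEG_INF = -1e12  # sentinel for structural zeros

def logaddexp(a, b):
    """log(exp(a) + exp(b)), computed stably."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def damp_log_message(log_old, log_new, lam):
    """Assumed arithmetic damping in probability space, evaluated in log-space."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    if lam == 1.0:
        return list(log_new)  # no damping
    damped = []
    for lo, ln in zip(log_old, log_new):
        if lo <= NEG_INF and ln <= NEG_INF:
            damped.append(NEG_INF)  # structural zero in both: keep the sentinel
        else:
            damped.append(logaddexp(math.log1p(-lam) + lo, math.log(lam) + ln))
    return damped

old = [math.log(0.5), math.log(0.5), NEG_INF]
new = [math.log(0.9), math.log(0.1), NEG_INF]
damped = damp_log_message(old, new, lam=0.5)
```

In this toy case the damped message equals $(0.7, 0.3, 0)$ in probability space, and the third entry stays at the sentinel because it is a structural zero in both inputs.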
##### Message initialization\.

All channels (dynamics, observation, and marginal observation) are initialized to uniform distributions over their respective domains. Parameter cavity beliefs are initialized to the prior $p(\theta)$; state beliefs are initialized to uniform over valid states.

##### Action selection\.

Actions are selected by greedy argmax over the action marginal at $t=0$ (deterministic, no sampling).

### F.5 Convergence Behavior

To select damping parameters and characterize convergence, we run a systematic sweep over $\lambda\in\{0.25,0.4,0.5,0.6,0.75,0.9\}$ for each channel-based method (RM-MP, VBP, AIF-MP) and $\lambda=1.0$ for loopy BP, across all three environments. Each configuration is run with $5$ random seeds and $1{,}000$ BP iterations. Convergence is declared when the maximum absolute change in any channel falls below $10^{-4}$.
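The convergence criterion can be written as a short check. The representation of channels as flat lists of numbers is an illustrative assumption, and the text does not state whether the change is measured in log-space or probability space.

```python
LAMBDA_GRID = (0.25, 0.4, 0.5, 0.6, 0.75, 0.9)  # sweep values for channel-based methods
TOLERANCE = 1e-4

def max_abs_change(old_channels, new_channels):
    """Largest absolute entry-wise change across all channels."""
    return max(
        abs(n - o)
        for old, new in zip(old_channels, new_channels)
        for o, n in zip(old, new)
    )

def has_converged(old_channels, new_channels, tol=TOLERANCE):
    """Declare convergence when no channel entry moved by tol or more."""
    return max_abs_change(old_channels, new_channels) < tol

# Toy example with two channels (hypothetical values).
before = [[0.50, 0.50], [0.20, 0.80]]
after_small = [[0.50001, 0.49999], [0.20002, 0.79998]]
after_large = [[0.60, 0.40], [0.20, 0.80]]
```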

##### Convergence rate\.

[Figure 6](https://arxiv.org/html/2606.04935#A6.F6) reports the fraction of seeds that converge (color) and the median number of iterations to convergence (in parentheses) for each method–damping combination. The three environments exhibit qualitatively different convergence profiles.

On RockSample ([Figure 6(b)](https://arxiv.org/html/2606.04935#A6.F6.sf2)), all methods converge at all damping values. The deterministic dynamics make the dynamics channel update exact, removing one source of instability.

On Frozen Lake ([Figure 6(a)](https://arxiv.org/html/2606.04935#A6.F6.sf1)), the dynamics channel (RM-MP) is the most fragile: convergence drops to $0\%$ for $\lambda\geq 0.5$, with large VFE oscillations. This is consistent with the min-max structure identified in [Section 5](https://arxiv.org/html/2606.04935#S5): the stochastic dynamics activate the opposing-sign dynamics channel, which amplifies update steps when damping is insufficient. VBP is robustly stable across all damping values ($80$–$100\%$), since it uses only the planning channel (no opposing signs). AIF-MP converges reliably at higher damping ($100\%$ at $\lambda=0.9$) but slowly at conservative settings ($20\%$ at $\lambda=0.25$).

On Wumpus World ([Figure 6(c)](https://arxiv.org/html/2606.04935#A6.F6.sf3)), most methods converge well at moderate damping. The dynamics channel fails only at $\lambda=0.9$ ($40\%$, with VFE values exceeding $10^{3}$ on divergent seeds). AIF-MP shows non-monotone behavior: $100\%$ convergence at $\lambda\in\{0.25,0.4,0.6\}$ but only $20\%$ at $\lambda=0.5$ and $60\%$ at $\lambda=0.9$. The suggestive observation structure means the observation-side channels are more active, producing a more complex optimization landscape.

![Refer to caption](https://arxiv.org/html/2606.04935v1/x6.png) (a) Frozen Lake
![Refer to caption](https://arxiv.org/html/2606.04935v1/x7.png) (b) RockSample
![Refer to caption](https://arxiv.org/html/2606.04935v1/x8.png) (c) Wumpus World

Figure 6: Convergence rate (color) and median iterations to convergence (in parentheses) for each method and damping value $\lambda$, averaged over $5$ seeds with $1{,}000$ iterations each. Dashes indicate that the method does not use damping at that value.
##### VFE convergence dynamics\.

[Figure 7](https://arxiv.org/html/2606.04935#A6.F7) shows VFE traces for all methods at their best damping on Frozen Lake. All four methods reach a stationary plateau within the iteration budget: loopy BP within ${\sim}5$ iterations, and the channel-augmented methods within $15$–$150$ iterations, depending on the number of active channels. The absolute VFE values at the plateau are not directly comparable across methods because each method optimizes a different functional ([Table 1](https://arxiv.org/html/2606.04935#S4.T1)).

![Refer to caption](https://arxiv.org/html/2606.04935v1/x9.png) Figure 7: VFE traces on Frozen Lake for all methods at their best damping (seed-averaged with $1\sigma$ bands). All four methods reach a stationary plateau within the iteration budget.
##### Damping selection\.

The damping parameter $\lambda$ is selected per method and environment as the value with the highest convergence rate; ties are broken by fewest median iterations. The selected values are reported in [Table 4](https://arxiv.org/html/2606.04935#A6.T4). The pattern is consistent across environments: RM-MP requires conservative to moderate damping ($\lambda=0.25$–$0.5$) due to the opposing dynamics channel, while VBP tolerates higher damping ($\lambda=0.75$–$0.9$). AIF-MP is more variable, ranging from conservative ($\lambda=0.25$ on Wumpus World) to aggressive ($\lambda=0.9$ on Frozen Lake and RockSample), reflecting the environment-dependent optimization landscape. The non-monotone convergence behavior of AIF-MP (e.g., $20\%$ at $\lambda=0.5$ but $100\%$ at $\lambda=0.6$ on Wumpus World) makes per-environment tuning necessary rather than using a single global value. Developing convergence theory for the channel-augmented scheme remains an open problem.
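The selection rule (highest convergence rate, ties broken by fewest median iterations) amounts to a lexicographic comparison. In the sketch below, the function name `select_damping` and the numbers in `toy_sweep` are hypothetical and do not reproduce the paper's sweep results.

```python
def select_damping(sweep):
    """Pick the damping value with the highest convergence rate.

    sweep maps lambda -> (convergence_rate, median_iterations); ties in the
    rate are broken by the smaller median iteration count.
    """
    return min(sweep, key=lambda lam: (-sweep[lam][0], sweep[lam][1]))

# Hypothetical sweep results for one method on one environment.
toy_sweep = {
    0.25: (1.0, 120),
    0.5: (0.2, 300),
    0.6: (1.0, 80),
    0.9: (0.6, 40),
}
best = select_damping(toy_sweep)
```

In this toy sweep, $\lambda=0.25$ and $\lambda=0.6$ tie at a convergence rate of $1.0$, so the rule selects $\lambda=0.6$ because it needs fewer median iterations.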
