A Path-Space Formulation of Prediction in World Models: From a Single Action to Prediction, Planning, and Irreversibility

arXiv cs.LG Papers

Summary

This paper proposes a path-space formulation of prediction in AI world models, treating the distribution over future trajectories as the fundamental predictive object. It shows that prediction, planning, and uncertainty emerge as operations on a single action functional, and demonstrates that attention asymmetry in learned models correlates with irreversibility in the data.

arXiv:2606.28751v1 Announce Type: new

Cached at: 06/30/26, 05:29 AM

# A Path-Space Formulation of Prediction in World Models: From a Single Action to Prediction, Planning, and Irreversibility
Source: [https://arxiv.org/html/2606.28751](https://arxiv.org/html/2606.28751)
###### Abstract

We propose a path\-space formulation of prediction in AI world models\. Rather than sequences of one\-step conditional distributions, we argue that a world model implicitly defines a probability measure over future trajectories\. In the local regime where latent dynamics admit an effective Markovian description, this path measure takes the Onsager–Machlup form\. Within this framework, prediction \(most probable trajectory\), planning \(constrained optimization\), and uncertainty \(fluctuations\) emerge as operations on a single action functional\. We decompose the latent dynamics into reversible and irreversible components and introduce operational measures of entropy production from model rollouts\. In controlled small\-scale attention\-based models, we find that attention asymmetry is acquired during training in proportion to the irreversibility of the data\. Symmetrizing the learned attention suppresses entropy production and selectively degrades long\-horizon prediction of irreversible dynamics while preserving relaxational prediction\. These results suggest that irreversibility may serve as a computational resource for predictive world models\. More generally, the fundamental predictive object is a distribution over future paths rather than states\.

## I Introduction

A world model is a learned internal simulator of an environment: given a context, it generates futures consistent with the data on which it was trained \[[1](https://arxiv.org/html/2606.28751#bib.bib1), [2](https://arxiv.org/html/2606.28751#bib.bib2)\]\. The idea recurs across disciplines\. In neuroscience, the cortex is described as a prediction machine that continually anticipates its sensory stream \[[3](https://arxiv.org/html/2606.28751#bib.bib3)\]; in machine learning, sequence models and latent world models are trained to roll an environment forward \[[4](https://arxiv.org/html/2606.28751#bib.bib4), [2](https://arxiv.org/html/2606.28751#bib.bib2)\]\. In all of these settings, a predictive system is summarized by the same slogan, that it is a “future\-prediction machine,” and that slogan is almost always made precise through the one\-step conditional $p(x_{t+1}\mid x_{t})$, the quantity that autoregressive training directly optimizes\.

We take a different basic object\. Prediction, planning, and uncertainty are all assertions about futures, that is, about whole trajectories rather than single increments, and the natural carrier of all three is the probability measure that the model assigns to future paths\. Writing a future as a latent trajectory $\Gamma=\{\bm{z}(t)\}$, we study the functional measure $P[\Gamma]$ that the model induces on such trajectories; the one\-step conditional, and the next token it produces, are then derived marginals of $P[\Gamma]$\. The questions of what a world model predicts, how it plans, and how confident it is then become questions about the structure of a single path distribution\. That the fundamental predictive object of a world model is a distribution over future paths rather than future states is the conceptual core of this work\.

This shift is useful because it imports a mature physical machinery\. In the regime where the learned latent dynamics is effectively local in time, that is, in the Markovian limit of its Mori–Zwanzig description \[[5](https://arxiv.org/html/2606.28751#bib.bib5), [6](https://arxiv.org/html/2606.28751#bib.bib6), [7](https://arxiv.org/html/2606.28751#bib.bib7)\], the path measure takes the Onsager–Machlup form $P[\Gamma]\propto e^{-\mathcal{A}[\Gamma]}$ \[[8](https://arxiv.org/html/2606.28751#bib.bib8)\], where $\mathcal{A}$ is the action functional of a diffusion\. A single functional then governs three operations that are usually treated as separate modules: prediction is its most\-probable path, planning is its constrained least\-action path, and predictive uncertainty is its curvature\. The same action separates the drift into a reversible, gradient part and an irreversible, circulating part, so that the entropy production of nonequilibrium statistical mechanics \[[9](https://arxiv.org/html/2606.28751#bib.bib9)\] becomes a measurable property of the predicted world rather than an abstract thermodynamic quantity\. The framework thereby provides a common language linking latent dynamics, stochastic thermodynamics, and attention\-based sequence modeling\.

Within this language the architecture acquires a thermodynamic reading\. We show how an attention layer can be read as computing the local action, with the query–key product $W_{Q}^{\top}W_{K}$ playing the role of the metric of the kinetic term and its antisymmetric part that of the irreversible drift\. An explicit calculation of the drift Jacobian then identifies the query–key asymmetry as an architecturally controllable source of irreversibility\. These statements are operational: from model rollouts we estimate the drift, the circulating current, the entropy production, the non\-normality, and the attention asymmetry, and we state a list of falsifiable predictions\. In a small trained attention model these quantities are measurements rather than assumptions\. The attention asymmetry, the entropy production, and the non\-normality are acquired together during learning in proportion to the irreversibility of the data; a causal intervention that symmetrizes the attention collapses all three\. The same intervention selectively destroys long\-horizon prediction of circulating structure while sparing relaxational prediction\. Irreversibility is in this sense not only a by\-product of learning but a measurable resource for prediction\.

To delimit our contribution precisely: our primary claim does not concern the architectural necessity of attention in world models\. Rather, we argue that predictive dynamics are naturally structured as inference on path space, where temporal irreversibility appears as a measurable computational resource\. The drift\-Jacobian calculation identifies and lets us control one channel of irreversibility\. Whether that channel dominates the entropy production of a large pretrained architecture, in which residual connections, feedforward blocks, normalization, value mixing, and depth all contribute to the learned drift, is a separate empirical question that we leave open\. In the tradition where a minimal model is used to exhibit a universal mechanism rather than to simulate a system at full scale, we develop and test the structure in a controlled, small\-scale setting\.

The paper is organized as follows\. Section [II](https://arxiv.org/html/2606.28751#S2) formalizes the path measure and its local Onsager–Machlup action\. Section [III](https://arxiv.org/html/2606.28751#S3) treats prediction as the most\-probable future and identifies where it departs from the deterministic rollout\. Section [IV](https://arxiv.org/html/2606.28751#S4) develops planning and uncertainty as further operations on the same action\. Section [V](https://arxiv.org/html/2606.28751#S5) establishes the correspondence between attention and the local action and derives the antisymmetric Jacobian from the query–key asymmetry\. Section [VI](https://arxiv.org/html/2606.28751#S6) gives operational definitions of the reversible and irreversible drift, the entropy production, and the attention asymmetry, and states falsifiable predictions\. Section [VII](https://arxiv.org/html/2606.28751#S7) presents the controlled experiments and discusses their scope, and the final section concludes\.

## II The predictive object: world models on path space

### II\.1 From next\-token conditionals to future trajectories

Let an encoder map observations to a latent state $\bm{z}\in\mathbb{R}^{d}$, in which the model’s dynamics are defined\. A future is a latent trajectory

$$\Gamma=\{\bm{z}(t)\}_{0\leq t\leq T},\tag{1}$$

and the predictive object we study is the functional probability measure $P[\Gamma]$ that the model induces on such trajectories given a context\. The one\-step conditional, and with it the next token, is recovered as a marginal of $P[\Gamma]$ and is in this sense a derived rather than a fundamental quantity\. The central questions of this paper, what the model predicts, how it plans, and how confident it is, become questions about the structure of $P[\Gamma]$\. Of the three operations on $P[\Gamma]$ that follow, it is prediction, and specifically its irreversible part, that we make operational and test; planning and uncertainty are structural consequences of the same action, developed here for completeness and left to direct measurement elsewhere\.

### II\.2 The latent dynamics and its local regime

We model the latent evolution as a stochastic process

$$\mathrm{d}\bm{z}=\bm{f}(\bm{z})\,\mathrm{d}t+\sqrt{2D}\,\mathrm{d}\bm{W}_{t},\tag{2}$$

with drift $\bm{f}$ and, for clarity, isotropic and constant diffusion $D$; the state\-dependent case $D(\bm{z})$ introduces only the standard multiplicative\-noise corrections and is deferred\. Two points of principle attach to Eq\. \([2](https://arxiv.org/html/2606.28751#S2.E2)\), since a trained world model is in general neither Markovian nor continuous in time\.
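As a minimal sketch of Eq\. \(2\), assuming nothing about any trained model, the local dynamics can be integrated with the Euler–Maruyama scheme\. The function names, the step size `dt`, and the two\-dimensional rotational drift below are illustrative assumptions, not the learned drift of any model discussed here\.

```python
import math
import random


def euler_maruyama(drift, z0, D, dt, n_steps, rng):
    """Integrate dz = f(z) dt + sqrt(2 D) dW with the Euler-Maruyama scheme."""
    path = [list(z0)]
    noise_scale = math.sqrt(2.0 * D * dt)  # std of each noise increment
    for _ in range(n_steps):
        z = path[-1]
        f = drift(z)
        path.append([zi + fi * dt + noise_scale * rng.gauss(0.0, 1.0)
                     for zi, fi in zip(z, f)])
    return path


def rotational_drift(z, k=1.0, omega=1.0):
    """Hypothetical test drift: relaxation -k z plus rotation omega * (-z_y, z_x)."""
    x, y = z
    return [-k * x - omega * y, -k * y + omega * x]


rng = random.Random(0)  # fixed seed so the rollout is reproducible
path = euler_maruyama(rotational_drift, [1.0, 0.0], D=0.1, dt=0.01,
                      n_steps=1000, rng=rng)
```

Setting `D=0.0` removes the noise and recovers the deterministic rollout $\dot{\bm{z}}=\bm{f}$, which is a convenient sanity check\.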

First, the exact reduced dynamics obtained after integrating out the latent coordinates that the model does not expose is, by the Mori–Zwanzig projection \[[5](https://arxiv.org/html/2606.28751#bib.bib5), [6](https://arxiv.org/html/2606.28751#bib.bib6), [7](https://arxiv.org/html/2606.28751#bib.bib7)\], a *generalized* Langevin equation with a memory kernel and colored noise,

$$\dot{\bm{z}}(t)=\bm{f}_{\mathrm{loc}}(\bm{z})-\int_{0}^{t}K(t-t')\,\dot{\bm{z}}(t')\,\mathrm{d}t'+\bm{\xi}(t),\qquad\langle\bm{\xi}(t)\bm{\xi}(t')\rangle\propto K(t-t').\tag{3}$$

The local model \([2](https://arxiv.org/html/2606.28751#S2.E2)\) is the leading term of \([3](https://arxiv.org/html/2606.28751#S2.E3)\) in the Markovian limit $K(t)\to 2\gamma\,\delta(t)$, valid when the memory time $\tau_{\mathrm{mem}}$ \(the width of $K$, set physically by the attention’s effective look\-back\) is short compared with the dynamical time $\tau_{\mathrm{dyn}}$ on which we predict\. We make this restriction explicit and treat the small parameter

$$\epsilon\equiv\tau_{\mathrm{mem}}/\tau_{\mathrm{dyn}}\tag{4}$$

as the control parameter of the expansion: the local theory is its $\epsilon\to 0$ limit, and non\-local, higher\-time\-derivative corrections are organized in powers of $\epsilon$\. This is the working analogue of Hooke’s law\. A real spring has an elastic limit, but the linear regime is a controlled and honest description once its range of validity is stated and, ideally, measured\. We work throughout in this regime and report $\epsilon$ rather than assuming it away\. The memory time itself is architectural: it is bounded above by the context window, $\tau_{\mathrm{mem}}\leq N\Delta t$ for context length $N$, and grows with depth $L$, since each layer composes attention over the already\-mixed history of the previous one, so that $\epsilon$ is set by $N$ and $L$ relative to $\tau_{\mathrm{dyn}}$\. The local regime is the one in which the learned attention is effectively short\-ranged on the dynamical timescale, a condition that is itself measurable \(P1 below\)\.

Second, going local does not mean going to equilibrium\. We retain the full drift, including its non\-gradient part\. Writing

$$\bm{f}=-\nabla U+\bm{v},\tag{5}$$

the gradient part $-\nabla U$ is the reversible, detailed\-balance\-respecting component, while the circulating part $\bm{v}$, defined by $\nabla\!\cdot(\rho_{\mathrm{ss}}\bm{v})=0$ with $\rho_{\mathrm{ss}}$ the stationary density, breaks detailed balance and produces entropy\. The local, linear elastic regime that still carries an antisymmetric, non\-reciprocal response is exactly the setting of odd elasticity, in which the symmetric part of the stiffness plays the role of $-\nabla U$ and the antisymmetric \(odd\) part plays the role of $\bm{v}$\. Restricting to the local regime therefore preserves, rather than discards, the irreversible structure that the rest of the paper depends on\.
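A short worked example may make the split of Eq\. \(5\) concrete\. It assumes a two\-dimensional linear drift chosen purely for illustration; the symbols $k$, $\omega$, and the rotation generator $\varepsilon$ are not taken from the text above\.

```latex
\begin{aligned}
\bm{f}(\bm{z}) &= -k\,\bm{z} + \omega\,\varepsilon\,\bm{z},
\qquad \varepsilon = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix},
\quad k>0,\\
U(\bm{z}) &= \tfrac{k}{2}\,|\bm{z}|^{2},
\qquad \bm{v}(\bm{z}) = \omega\,\varepsilon\,\bm{z},
\qquad \rho_{\mathrm{ss}} \propto e^{-U/D},\\
\nabla\!\cdot(\rho_{\mathrm{ss}}\bm{v})
&= \rho_{\mathrm{ss}}\,\nabla\!\cdot\bm{v}
 - \tfrac{1}{D}\,\rho_{\mathrm{ss}}\,\nabla U\cdot\bm{v}
 = 0 - \tfrac{k\omega}{D}\,\rho_{\mathrm{ss}}\,\bm{z}^{\top}\varepsilon\,\bm{z}
 = 0 .
\end{aligned}
```

Both terms vanish because $\varepsilon$ is antisymmetric and traceless, so $\bm{v}$ circulates along the level sets of $U$ without changing $\rho_{\mathrm{ss}}$\. Setting $\omega=0$ removes the circulation and restores detailed balance\.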

### II\.3 The path measure and its Onsager–Machlup action

For the local dynamics \([2](https://arxiv.org/html/2606.28751#S2.E2)\), the probability that the trajectory lies in an infinitesimal tube around a given path $\Gamma$ is, up to a path\-independent normalization, $P[\Gamma]\propto e^{-\mathcal{A}[\Gamma]}$, with the Onsager–Machlup action \[[8](https://arxiv.org/html/2606.28751#bib.bib8)\]

$$\mathcal{A}[\Gamma]=\int_{0}^{T}\mathrm{d}t\,\mathcal{L}(\bm{z},\dot{\bm{z}}),\qquad\mathcal{L}=\frac{1}{4D}\,\bigl|\dot{\bm{z}}-\bm{f}(\bm{z})\bigr|^{2}+\frac{1}{2}\,\nabla\!\cdot\bm{f}.\tag{6}$$

The first term is the kinetic weight that penalizes deviation of the realized velocity from the drift; the second is the Jacobian term arising from the Stratonovich \(midpoint\) discretization of \([2](https://arxiv.org/html/2606.28751#S2.E2)\), and it is the term that distinguishes the genuine path probability from the naive squared\-residual cost\. The normalization is the path integral

$$P[\Gamma]=\frac{1}{Z}\,e^{-\mathcal{A}[\Gamma]},\qquad Z=\int\mathcal{D}\Gamma\;e^{-\mathcal{A}[\Gamma]}.\tag{7}$$

Equation \([7](https://arxiv.org/html/2606.28751#S2.E7)\) is formally a Gibbs measure on path space, with $\mathcal{A}$ playing the role of an energy and the diffusion $D$ setting the temperature scale\. Two structures we shall use repeatedly follow at once\. The action is additive along time, $\mathcal{A}=\int\mathrm{d}t\,\mathcal{L}$, so that the log\-probability of a future is a sum of local\-in\-time increments, which is the precise content of the locality assumption of Sec\. [II\.2](https://arxiv.org/html/2606.28751#S2.SS2)\. And $\mathcal{A}$ splits into a time\-symmetric and a time\-antisymmetric part under the path reversal $\Gamma=\{\bm{z}(t)\}\mapsto\widetilde{\Gamma}=\{\bm{z}(T-t)\}$,

$$\mathcal{A}=\mathcal{A}_{\mathrm{sym}}+\mathcal{A}_{\mathrm{irr}},\quad\mathcal{A}_{\mathrm{irr}}[\Gamma]=-\frac{1}{2D}\int_{0}^{T}\bm{v}(\bm{z})\cdot\mathrm{d}\bm{z},\tag{8}$$

so that the irreversible part of the action is the line integral of the circulating drift along the path\. The path\-wise entropy production is the log\-ratio of forward and reversed path probabilities,

$$\Sigma[\Gamma]=\ln\frac{P[\Gamma]}{P[\widetilde{\Gamma}]}=\frac{1}{D}\int_{0}^{T}\bm{v}(\bm{z})\cdot\mathrm{d}\bm{z},\tag{9}$$

which equals $-2\mathcal{A}_{\mathrm{irr}}[\Gamma]$ because only the antisymmetric part of the action changes sign under reversal\. It is a quantity that is defined directly from the path measure and its reversal and, as we stress later, does not itself require the local form \([6](https://arxiv.org/html/2606.28751#S2.E6)\)\.
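The line integral in Eq\. \(9\) can be estimated on a sampled path with a midpoint \(Stratonovich\) sum\. The sketch below is illustrative only: the circulating drift `circulating_drift` and the closed circular test path are assumptions chosen so that the exact answer, $2\pi\omega/D$ per counter\-clockwise loop, is known in closed form\.

```python
import math


def entropy_production(path, v, D):
    """Midpoint (Stratonovich) estimate of Sigma = (1/D) * integral of v . dz."""
    total = 0.0
    for z0, z1 in zip(path[:-1], path[1:]):
        mid = [(a + b) / 2.0 for a, b in zip(z0, z1)]  # evaluate v at the midpoint
        dz = [b - a for a, b in zip(z0, z1)]
        total += sum(vi * dzi for vi, dzi in zip(v(mid), dz))
    return total / D


def circulating_drift(z, omega=1.0):
    """Hypothetical circulating drift v = omega * (-z_y, z_x)."""
    return [-omega * z[1], omega * z[0]]


n = 1000
# One counter-clockwise loop around the unit circle.
loop = [[math.cos(2.0 * math.pi * i / n), math.sin(2.0 * math.pi * i / n)]
        for i in range(n + 1)]
sigma_forward = entropy_production(loop, circulating_drift, D=0.5)
sigma_reversed = entropy_production(loop[::-1], circulating_drift, D=0.5)
```

Traversing the same loop backwards flips the sign of the estimate, which is the discrete counterpart of the antisymmetry of $\mathcal{A}_{\mathrm{irr}}$ under path reversal\.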

### II\.4 One functional, three operations

The point of Eq\. \([7](https://arxiv.org/html/2606.28751#S2.E7)\) is that a single functional organizes the three operations a world model must perform\. They are developed in Secs\. [III](https://arxiv.org/html/2606.28751#S3)–[V](https://arxiv.org/html/2606.28751#S5); we state them here to fix the logic of the paper\.

- *Prediction* is the stationary, most\-probable future, $\Gamma^{\ast}=\arg\min_{\Gamma}\mathcal{A}[\Gamma]$, obtained from $\delta\mathcal{A}=0$\. We show in Sec\. [III](https://arxiv.org/html/2606.28751#S3) that $\Gamma^{\ast}$ is generically distinct from the deterministic rollout $\dot{\bm{z}}=\bm{f}$, and that the circulating drift $\bm{v}$ enters it as a Lorentz\-like force\.
- *Planning* is the least\-action future subject to terminal constraints, with the planning value given by the path free energy $-\ln\!\int_{\,\bm{z}_{0}\to\,\mathrm{goal}}\mathcal{D}\Gamma\,e^{-\mathcal{A}}$\. Prediction and planning are thus the free\- and fixed\-endpoint versions of one variational problem\.
- *Uncertainty* is the curvature of the action about $\Gamma^{\ast}$\. Expanding $\mathcal{A}=\mathcal{A}[\Gamma^{\ast}]+\tfrac{1}{2}\langle\eta,\mathcal{H}\eta\rangle+\cdots$ with $\mathcal{H}=\delta^{2}\mathcal{A}/\delta\Gamma^{2}$, the predictive covariance is $\mathcal{H}^{-1}$, the Green’s function of a Schrödinger\-type fluctuation operator along the path\.

The reversible/irreversible split \([8](https://arxiv.org/html/2606.28751#S2.E8)\) threads all three, and Sec\. [VI](https://arxiv.org/html/2606.28751#S6) connects its irreversible part to the query–key asymmetry of attention\.
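The action functional of Eq\. \(6\) that all three operations share can be evaluated on any sampled path\. The following is a minimal sketch using a midpoint discretization; the one\-dimensional relaxational drift, its rate `K_RELAX`, and the two test paths are hypothetical choices for illustration\.

```python
import math


def om_action(path, drift, div_drift, D, dt):
    """Midpoint discretization of the Onsager-Machlup action of Eq. (6)."""
    total = 0.0
    for z0, z1 in zip(path[:-1], path[1:]):
        mid = [(a + b) / 2.0 for a, b in zip(z0, z1)]
        vel = [(b - a) / dt for a, b in zip(z0, z1)]
        f = drift(mid)
        kinetic = sum((v - fi) ** 2 for v, fi in zip(vel, f)) / (4.0 * D)
        total += (kinetic + 0.5 * div_drift(mid)) * dt
    return total


K_RELAX = 1.0  # hypothetical relaxation rate


def ou_drift(z):
    """Hypothetical relaxational drift f = -K_RELAX * z."""
    return [-K_RELAX * z[0]]


def ou_divergence(z):
    """Divergence of ou_drift, constant for a linear drift."""
    return -K_RELAX


n, dt = 100, 0.01
flow_path = [[math.exp(-K_RELAX * i * dt)] for i in range(n + 1)]  # follows dz/dt = f
static_path = [[1.0] for _ in range(n + 1)]                         # ignores the drift
action_flow = om_action(flow_path, ou_drift, ou_divergence, D=0.1, dt=dt)
action_static = om_action(static_path, ou_drift, ou_divergence, D=0.1, dt=dt)
```

With these settings the path that follows the drift has a vanishing kinetic term and an action close to $-0.5$, set by the divergence term alone, while the static path pays a kinetic cost and has action $2.0$, so it is the less probable of the two\.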

### II\.5 Relation to prior work

Each ingredient of the framework has a counterpart in the literature, and we state the correspondences plainly so that the contribution is not mistaken for any one of them\. The static, single\-step reading of \([7](https://arxiv.org/html/2606.28751#S2.E7)\), a Gibbs distribution over candidate continuations with energy equal to the negative attention score, is the modern Hopfield\-network view of attention \[[10](https://arxiv.org/html/2606.28751#bib.bib10)\]\. The fixed\-endpoint, planning reading is control\-as\-inference, in which optimal behavior is a trajectory distribution weighted by return \[[11](https://arxiv.org/html/2606.28751#bib.bib11)\]\. The use of an Onsager–Machlup action for a learned generative process is established for diffusion and flow\-matching models \[[12](https://arxiv.org/html/2606.28751#bib.bib12), [13](https://arxiv.org/html/2606.28751#bib.bib13)\], where the learned score is the drift \[[14](https://arxiv.org/html/2606.28751#bib.bib14)\]\. The locality regime rests on the Mori–Zwanzig projection that produces a generalized Langevin equation \[[6](https://arxiv.org/html/2606.28751#bib.bib6)\], and the odd\-elastic reading of the antisymmetric drift is borrowed from non\-reciprocal continuum mechanics \[[15](https://arxiv.org/html/2606.28751#bib.bib15)\]\.

Two further lines are close enough to require explicit separation\. The dynamical\-systems analysis of trained recurrent networks reverse\-engineers their computation from fixed points and the linearized flow around them \[[16](https://arxiv.org/html/2606.28751#bib.bib16)\], and a parallel line reads the transformer itself as an interacting particle system whose tokens flow and cluster under attention \[[17](https://arxiv.org/html/2606.28751#bib.bib17)\]; these are the nearest precedents to our reading of a world model through its drift field, but they characterize geometry, fixed points and slow manifolds, not thermodynamics, and they have no attention parameter to trace the geometry back to\. Stochastic thermodynamics of neural and brain dynamics measures entropy production and its oscillatory decomposition from recorded activity \[[18](https://arxiv.org/html/2606.28751#bib.bib18), [19](https://arxiv.org/html/2606.28751#bib.bib19)\]; this supplies the irreversibility diagnostic we use, but on data, with no handle on the architecture that produced it\.

What is new is therefore not any single correspondence but their unification and its consequence\. The unification reads prediction, planning, and uncertainty as operations on one action that attention computes\. The consequence is the bridge, absent from all of the above, from an architectural quantity, the query–key asymmetry, to a thermodynamic one, the entropy production of the predicted dynamics, and from there to a functional one: that this entropy production is a resource the model spends to predict irreversible structure over a long horizon \(Sec\. [VII\.4](https://arxiv.org/html/2606.28751#S7.SS4)\)\. The geometric and thermodynamic literatures measure such quantities in recorded dynamics; here they are produced, controlled, and shown to matter for prediction by a single architectural knob\.

## III Prediction: the most\-probable future

Prediction is the operation $\delta\mathcal{A}=0$ on the path action of Sec\. [II\.3](https://arxiv.org/html/2606.28751#S2.SS3)\. Carrying it out gives two facts that are easy to state and easy to forget: the most\-probable future obeys a Newtonian equation in which the irreversible part of the drift acts as a magnetic force, and that future is generically not the deterministic rollout $\dot{\bm{z}}=\bm{f}$\.

### III\.1 The Euler–Lagrange equation

Stationarity of $\mathcal{A}[\Gamma]=\int_{0}^{T}\mathcal{L}\,\mathrm{d}t$ with the Onsager–Machlup Lagrangian \([6](https://arxiv.org/html/2606.28751#S2.E6)\), through $\tfrac{\mathrm{d}}{\mathrm{d}t}\,\partial_{\dot{\bm{z}}}\mathcal{L}=\partial_{\bm{z}}\mathcal{L}$, and using the identities $\partial_{j}f_{i}-\partial_{i}f_{j}=2(J_{A})_{ij}$ and $f_{j}\partial_{i}f_{j}=\tfrac{1}{2}\partial_{i}|\bm{f}|^{2}$, gives for constant isotropic $D$

$$\ddot{\bm{z}}=2\,J_{A}(\bm{z})\,\dot{\bm{z}}\;-\;\nabla V_{\mathrm{eff}}(\bm{z}),\quad V_{\mathrm{eff}}=-\tfrac{1}{2}|\bm{f}|^{2}-D\,\nabla\!\cdot\bm{f},\tag{10}$$

with $J_{ij}=\partial_{j}f_{i}$ the drift Jacobian and $J_{A}=\tfrac{1}{2}(J-J^{\top})$ its antisymmetric part\. Equation \([10](https://arxiv.org/html/2606.28751#S3.E10)\) is the instanton equation of the path measure: the most\-probable future is a Newtonian trajectory in the effective potential $V_{\mathrm{eff}}$ under an additional velocity\-dependent force $2J_{A}\dot{\bm{z}}$\. Because $J_{A}$ is antisymmetric this force does no work, $\dot{\bm{z}}\cdot(2J_{A}\dot{\bm{z}})=0$; it is a magnetic, or gyroscopic, force, with $2J_{A}$ in the role of the field\-strength tensor\. The Lagrangian has no explicit time dependence, so

$$E=\frac{1}{4D}\,|\dot{\bm{z}}|^{2}+\frac{1}{2D}\,V_{\mathrm{eff}}(\bm{z})\tag{11}$$

is conserved along the most\-probable path, a first integral of Eq\. \([10](https://arxiv.org/html/2606.28751#S3.E10)\)\.
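The conservation law of Eq\. \(11\) can be checked numerically\. The sketch below integrates Eq\. \(10\) with a classical fourth\-order Runge–Kutta step for a hypothetical linear drift $\bm{f}=-k\bm{z}+\omega(-z_{y},z_{x})$, for which $J_{A}$ is constant, $\nabla\!\cdot\bm{f}=-2k$, and $|\bm{f}|^{2}=(k^{2}+\omega^{2})|\bm{z}|^{2}$\. The parameter values and the initial condition are assumptions made only for this illustration\.

```python
K_RELAX, OMEGA, D_NOISE = 1.0, 0.7, 0.1  # hypothetical parameters
RATE_SQ = K_RELAX ** 2 + OMEGA ** 2      # |f|^2 = RATE_SQ * |z|^2 for this drift


def acceleration(z, zdot):
    """Right-hand side of Eq. (10): 2 J_A zdot - grad V_eff for the linear drift."""
    gyro = [-2.0 * OMEGA * zdot[1], 2.0 * OMEGA * zdot[0]]  # 2 J_A zdot, does no work
    return [gyro[0] + RATE_SQ * z[0], gyro[1] + RATE_SQ * z[1]]


def energy(z, zdot):
    """First integral of Eq. (11) with V_eff = -|f|^2 / 2 - D div f."""
    v_eff = -0.5 * RATE_SQ * (z[0] ** 2 + z[1] ** 2) + 2.0 * K_RELAX * D_NOISE
    kinetic = (zdot[0] ** 2 + zdot[1] ** 2) / (4.0 * D_NOISE)
    return kinetic + v_eff / (2.0 * D_NOISE)


def deriv(state):
    """Time derivative of the phase-space state [z_x, z_y, zdot_x, zdot_y]."""
    z, zdot = state[:2], state[2:]
    return zdot + acceleration(z, zdot)


def rk4_step(state, dt):
    """One classical Runge-Kutta step for the second-order system."""
    k1 = deriv(state)
    k2 = deriv([s + 0.5 * dt * k for s, k in zip(state, k1)])
    k3 = deriv([s + 0.5 * dt * k for s, k in zip(state, k2)])
    k4 = deriv([s + dt * k for s, k in zip(state, k3)])
    return [s + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]


state = [1.0, 0.0, 0.3, -0.2]
e_start = energy(state[:2], state[2:])
for _ in range(1000):
    state = rk4_step(state, 1e-3)
e_end = energy(state[:2], state[2:])
```

The two energies agree to within the integrator's truncation error even though the gyroscopic term bends the trajectory, consistent with that term doing no work\.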

### III\.2 Why the most\-probable future is not the deterministic rollout

It is tempting to identify the predicted future with the deterministic flow $\dot{\bm{z}}=\bm{f}$, the path that makes the kinetic term of $\mathcal{A}$ vanish\. This is correct only in a special case\. Substituting $\dot{\bm{z}}=\bm{f}$, for which $\ddot{\bm{z}}=J\bm{f}$, into Eq\. \([10](https://arxiv.org/html/2606.28751#S3.E10)\) and using $2J_{A}\bm{f}+\nabla(\tfrac{1}{2}|\bm{f}|^{2})=J\bm{f}$, the deterministic flow solves the Euler–Lagrange equation if and only if

$$D\,\nabla(\nabla\!\cdot\bm{f})=\bm{0}.\tag{12}$$

Whenever the drift has non\-uniform divergence, the most\-probable future departs from the deterministic rollout, and the leading departure is the noise\-induced term $D\,\nabla(\nabla\!\cdot\bm{f})$, of order $D$\. The interpretation is standard but worth stating: integrating the learned drift forward returns the mean field, not the mode of the future distribution, and the two differ wherever probability is being focused or defocused \($\nabla\!\cdot\bm{f}\neq\mathrm{const}$\)\. A world model that reports its deterministic rollout as the single most\-likely future is therefore not reporting $\arg\max_{\Gamma}P[\Gamma]$ unless the divergence of its drift is spatially uniform\.
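Condition \(12\) is easy to probe numerically\. The sketch below, restricted to one dimension for brevity, estimates the noise\-induced term $D\,\partial_{z}(\nabla\!\cdot\bm{f})$ by a central difference for two hypothetical drifts: a linear one, whose divergence is uniform, and a cubic one, whose divergence is not\.

```python
def noise_induced_term(div_f, z, D, h=1e-4):
    """Central-difference estimate of D * d/dz (div f) in one dimension."""
    return D * (div_f(z + h) - div_f(z - h)) / (2.0 * h)


def linear_divergence(z):
    """Divergence of the hypothetical drift f = -z (uniform)."""
    return -1.0


def cubic_divergence(z):
    """Divergence of the hypothetical drift f = -z**3 (non-uniform)."""
    return -3.0 * z ** 2


term_linear = noise_induced_term(linear_divergence, 0.5, D=0.1)
term_cubic = noise_induced_term(cubic_divergence, 0.5, D=0.1)
```

For the linear drift the term vanishes and the deterministic rollout is also the most\-probable path; for the cubic drift it equals $-6Dz$, so the two paths separate at order $D$\.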

### III\.3 The antisymmetric Jacobian threads prediction and irreversibility

The gyroscopic force in Eq\. \([10](https://arxiv.org/html/2606.28751#S3.E10)\) is generated by $J_{A}$, the same antisymmetric Jacobian that in Sec\. [VI](https://arxiv.org/html/2606.28751#S6) sources the circulating drift $\bm{v}$ and the entropy production, and that in Sec\. [V](https://arxiv.org/html/2606.28751#S5) descends from the query–key asymmetry of attention\. One object thus controls both the geometry of prediction and its thermodynamics\. On the forward, free\-endpoint path its effect is mild, entering only through the off\-deterministic corrections above\. It becomes decisive in the two problems treated next: the fixed\-endpoint problem of planning \(Sec\. [IV](https://arxiv.org/html/2606.28751#S4)\), where the instanton must reach a prescribed target and the magnetic force bends it away from any gradient descent of $V_{\mathrm{eff}}$, and the fluctuation spectrum that sets predictive uncertainty, where $J_{A}$ enters the Hessian\. A world model with symmetric attention \($J_{A}=0$\) has the curl\-free instanton $\ddot{\bm{z}}=-\nabla V_{\mathrm{eff}}$ and predicts only gradient relaxation; the circulating, history\-carrying futures require $J_{A}\neq 0$\.
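The split of a Jacobian into its symmetric and antisymmetric parts, $J=J_{S}+J_{A}$, is straightforward to compute\. The sketch below does so for a small matrix and reports a normalized Frobenius ratio\. That ratio is an illustrative assumption introduced here, not the attention\-asymmetry measure defined in Sec\. VI, and the example matrix is hypothetical\.

```python
import math


def split_symmetric(m):
    """Return the symmetric and antisymmetric parts of a square matrix."""
    n = len(m)
    sym = [[0.5 * (m[i][j] + m[j][i]) for j in range(n)] for i in range(n)]
    anti = [[0.5 * (m[i][j] - m[j][i]) for j in range(n)] for i in range(n)]
    return sym, anti


def frobenius(m):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(x * x for row in m for x in row))


def asymmetry_ratio(m):
    """Hypothetical diagnostic ||J_A||_F / ||J||_F, which lies between 0 and 1."""
    _, anti = split_symmetric(m)
    return frobenius(anti) / frobenius(m)


jacobian = [[-1.0, -0.7], [0.7, -1.0]]  # relaxation plus rotation
j_sym, j_anti = split_symmetric(jacobian)
ratio = asymmetry_ratio(jacobian)
```

A ratio of zero corresponds to the symmetric case $J_{A}=0$ discussed above, and a ratio of one to a purely circulating linear drift\.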

## IV Planning and uncertainty from the same action

Prediction fixed the initial condition and left the future free\. Planning fixes a target and asks for the best route to it; uncertainty asks how sharply that route is determined\. Both are read off the same path integral, by imposing a terminal condition and by expanding to second order\.

### IV\.1 Planning is fixed\-endpoint least action

Given a target, a terminal state $\bm{z}_{T}$ or more generally a terminal cost, the optimal plan is the least\-action path that reaches it,

$$\Gamma^{\ast}=\arg\min_{\bm{z}(0)=\bm{z}_{0},\;\bm{z}(T)=\bm{z}_{T}}\mathcal{A}[\Gamma],\tag{13}$$

the instanton \([10](https://arxiv.org/html/2606.28751#S3.E10)\) now solved as a boundary\-value problem\. The gyroscopic force $2J_{A}\dot{\bm{z}}$, mild for free\-endpoint prediction, here bends the plan away from gradient descent of $V_{\mathrm{eff}}$: an asymmetric attention plans along curved, circulating routes, a symmetric one only down the potential\. The value of the target is the path free energy,

$$\mathcal{V}(\bm{z}_{T},T\mid\bm{z}_{0})=-D\ln\!\int_{\bm{z}_{0}\to\bm{z}_{T}}\!\mathcal{D}\Gamma\;e^{-\mathcal{A}[\Gamma]}=-D\ln K(\bm{z}_{T},T\mid\bm{z}_{0},0),\tag{14}$$

with $K$ the propagator \(distinct from the memory kernel of Eq\. \([3](https://arxiv.org/html/2606.28751#S2.E3)\)\)\. In the low\-noise limit $\mathcal{V}\to D\,\mathcal{A}[\Gamma^{\ast}]$, and $\mathcal{V}$ obeys a Hamilton–Jacobi–Bellman equation whose characteristics are the instantons\. This is the control\-as\-inference correspondence \[[11](https://arxiv.org/html/2606.28751#bib.bib11)\], here grounded in the Onsager–Machlup action: the optimal cost\-to\-go is the free energy of the future ensemble, and planning is least action under a terminal constraint\.
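The low\-noise limit can be checked in closed form in the simplest case\. The worked example below assumes a one\-dimensional relaxational drift $f=-kz$ \(an Ornstein–Uhlenbeck process\), which is an illustrative choice and not a model from this paper; its propagator is the standard Gaussian, and its instanton solves $\ddot{z}=k^{2}z$ with fixed endpoints\.

```latex
\begin{aligned}
K(z_T,T\mid z_0,0) &= \frac{1}{\sqrt{2\pi\sigma_T^{2}}}
  \exp\!\left[-\frac{\bigl(z_T - z_0 e^{-kT}\bigr)^{2}}{2\sigma_T^{2}}\right],
  \qquad \sigma_T^{2} = \frac{D}{k}\bigl(1-e^{-2kT}\bigr),\\
\mathcal{V}(z_T,T\mid z_0) &= -D\ln K
  = \frac{k\,\bigl(z_T - z_0 e^{-kT}\bigr)^{2}}{2\,\bigl(1-e^{-2kT}\bigr)}
  + \frac{D}{2}\ln\bigl(2\pi\sigma_T^{2}\bigr),\\
D\,\mathcal{A}[\Gamma^{\ast}] &=
  \frac{k\,\bigl(z_T - z_0 e^{-kT}\bigr)^{2}}{2\,\bigl(1-e^{-2kT}\bigr)}
  - \frac{D\,kT}{2}.
\end{aligned}
```

Both expressions share the same $D$\-independent quadratic cost of reaching $z_{T}$, and their difference vanishes as $D\to 0$, in line with $\mathcal{V}\to D\,\mathcal{A}[\Gamma^{\ast}]$\.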

### IV\.2 Value and uncertainty are consecutive orders of one expansion

Expanding the path integral about $\Gamma^{\ast}$ by the saddle\-point method gives

$$-\ln\!\int\!\mathcal{D}\Gamma\,e^{-\mathcal{A}}=\mathcal{A}[\Gamma^{\ast}]+\tfrac{1}{2}\ln\det\mathcal{H}+O(D),\quad\mathcal{H}=\frac{\delta^{2}\mathcal{A}}{\delta\Gamma^{2}}\bigg|_{\Gamma^{\ast}}.\tag{15}$$

The leading term is the planning value; the next, the fluctuation determinant, is the log\-volume of nearby futures, that is, the predictive uncertainty\. Prediction, planning, and uncertainty are thus not three constructions but three terms read off one functional: the stationary path, its action, and its curvature\.

### IV\.3 The fluctuation operator and the predictive covariance

The second variation is the quadratic form $\tfrac{1}{2}\int_{0}^{T}\eta^{\top}\mathcal{H}\,\eta\,\mathrm{d}t$ with the Jacobi operator

$$\mathcal{H}=\frac{1}{2D}\Big[-\frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}}+2J_{A}(t)\,\frac{\mathrm{d}}{\mathrm{d}t}+\mathsf{\Omega}^{2}(t)\Big],\tag{16}$$

where $\mathsf{\Omega}^{2}(t)$ is a symmetric matrix potential assembled from $J^{\top}J$, $\dot{J}_{S}$, and the curvature $\nabla\nabla V_{\mathrm{eff}}$ along $\Gamma^{\ast}$, and $J_{S}=\tfrac{1}{2}(J+J^{\top})$\. The boundary conditions complete the definition: Dirichlet $\eta(0)=\eta(T)=\bm{0}$ for the fixed\-endpoint planning covariance, and $\eta(0)=\bm{0}$ with the natural condition $\partial_{\dot{\eta}}\mathcal{L}=\bm{0}$ at $t=T$ for forward prediction\. With these conditions $\mathcal{H}$ is self\-adjoint and, absent a conjugate point, invertible, so the predictive covariance is its Green’s function,

$$\begin{split}\langle\eta(t)\,\eta(t')^{\top}\rangle&=\mathcal{H}^{-1}(t,t')\equiv G(t,t'),\\ \mathcal{H}\,G(t,t')&=\mathbb{1}\,\delta(t-t').\end{split}\tag{17}$$

The antisymmetric Jacobian enters the first\-order term $2J_{A}\partial_{t}$, so a world model with asymmetric attention has a non\-normal fluctuation operator: its predictive uncertainty grows transiently and anisotropically rather than diffusing isotropically, the same non\-normality that drives the entropy production of Sec\. [VI](https://arxiv.org/html/2606.28751#S6)\. Symmetric attention \($J_{A}=0$\) returns a normal operator and isotropic, monotone uncertainty growth\.
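As a consistency check on Eqs\. \(16\) and \(17\), consider the simplest case, assumed here only for illustration: one dimension with vanishing drift, so that $J_{A}=0$ and $\mathsf{\Omega}^{2}=0$, and forward\-prediction boundary conditions\.

```latex
\begin{aligned}
\mathcal{H} &= -\frac{1}{2D}\,\frac{\mathrm{d}^{2}}{\mathrm{d}t^{2}},
  \qquad \eta(0)=0,\quad \dot{\eta}(T)=0,\\
G(t,t') &= 2D\,\min(t,t'),
  \qquad \mathcal{H}\,G(t,t') = -\partial_{t}^{2}\min(t,t') = \delta(t-t'),\\
\langle \eta(t)^{2}\rangle &= G(t,t) = 2Dt .
\end{aligned}
```

The Green’s function reproduces the covariance of free diffusion, with variance growing linearly as $2Dt$\. Under the Dirichlet conditions $\eta(0)=\eta(T)=0$ the same operator gives the Brownian\-bridge covariance $2D\,[\min(t,t')-tt'/T]$ instead\.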

## V Attention computes the local action

This section is the architectural bridge. It reads the autoregressive attention computation as a discretization of the path integral of Sec. [II.3](https://arxiv.org/html/2606.28751#S2.SS3), so that the attention logits play the role of local action increments and both terms of the Onsager–Machlup Lagrangian can be read off from the attention operation. We present this as a structural correspondence rather than a derivation from first principles. It holds at the level of a single attention layer in the local regime, where the one-step conditional is Gaussian; a deep stack renormalizes the resulting drift and diffusion, which we return to in Sec. [VII](https://arxiv.org/html/2606.28751#S7).

### V.1 Autoregressive attention is a discretized path measure

An autoregressive world model generates a trajectory by sampling each step from a conditional built by attention\. The joint law of the generated path factorizes,

$$\begin{aligned}-\ln P[\Gamma]&=\sum_{t}\big[-\ln p(\bm{z}_{t+1}\mid\bm{z}_{\leq t})\big]\\ &\xrightarrow[\text{local}]{}\sum_{t}\ell(\bm{z}_{t},\bm{z}_{t+1})\\ &\xrightarrow[\Delta t\to 0]{}\int_{0}^{T}\mathcal{L}(\bm{z},\dot{\bm{z}})\,\mathrm{d}t,\end{aligned}\tag{18}$$

where the middle arrow is the locality (Markov) assumption of Sec. [II.2](https://arxiv.org/html/2606.28751#S2.SS2) and the last is the continuum limit. Because the one-step conditional is Gaussian with mean $\bm{z}_{t}+\Delta t\,\bm{f}$ set by the attention readout, its negative logarithm is exactly the discretized Onsager–Machlup increment $\ell(\bm{z}_{t},\bm{z}_{t+1})=\tfrac{1}{4D\Delta t}\,|\bm{z}_{t+1}-\bm{z}_{t}-\Delta t\,\bm{f}|^{2}+\tfrac{\Delta t}{2}\nabla\!\cdot\bm{f}$, where the factor $\Delta t$ on the divergence term is what makes the increments sum to $\int\mathcal{L}\,\mathrm{d}t$. The attention enters the action through the drift $\bm{f}$, not as the increment itself, and the two terms of the Lagrangian are read off from $\bm{f}$ in the next subsections.
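
The sum in Eq. (18) can be evaluated directly on a sampled path. The following is a minimal sketch, not the paper's implementation: the function name `om_action`, the linear test drift, and the pre-point evaluation of $\bm{f}$ are all assumptions made for illustration, and the divergence term is weighted by $\Delta t$ so that the increments add up to the time integral of the Lagrangian.

```python
import numpy as np

def om_action(path, drift, div_drift, D, dt):
    """Sum discretized Onsager-Machlup increments along a path.

    path has shape (T+1, d); drift maps z -> f(z); div_drift maps z -> div f(z).
    Pre-point evaluation of f is an assumption made for brevity.
    """
    z, z_next = path[:-1], path[1:]
    f = np.array([drift(x) for x in z])
    residual = z_next - z - dt * f
    kinetic = np.sum(residual**2, axis=1) / (4.0 * D * dt)
    jacobian = 0.5 * dt * np.array([div_drift(x) for x in z])
    return float(np.sum(kinetic + jacobian))

# Hypothetical linear drift f(z) = W z: isotropic damping plus rotation.
W = np.array([[-0.5, -1.0], [1.0, -0.5]])
drift = lambda z: W @ z
div_drift = lambda z: np.trace(W)

dt, D, T = 0.01, 0.25, 200
det = [np.array([1.0, 0.0])]
for _ in range(T):
    det.append(det[-1] + dt * drift(det[-1]))  # noiseless Euler rollout
det = np.array(det)

# A perturbed path with the same starting point.
rng = np.random.default_rng(0)
noisy = det + 0.05 * rng.standard_normal(det.shape)
noisy[0] = det[0]

action_det = om_action(det, drift, div_drift, D, dt)
action_noisy = om_action(noisy, drift, div_drift, D, dt)
```

On the noiseless rollout the kinetic term vanishes and only the divergence term contributes, so the perturbed path carries the larger action, which is the sense in which the deterministic flow is the locally most probable increment.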

### V.2 Drift, temperature, and metric from the attention layer

A single attention layer reads out

$$\begin{aligned}\mathrm{Attn}(\bm{z})&=\sum_{s}a_{s}(\bm{z})\,\bm{v}_{s},\qquad a_{s}(\bm{z})=\frac{e^{\,s_{s}(\bm{z})}}{Z(\bm{z})},\\ s_{s}(\bm{z})&=\frac{1}{\sqrt{d}}\,\bm{z}^{\top}M\,\bm{m}_{s},\qquad M=W_{Q}^{\top}W_{K},\end{aligned}\tag{19}$$

over context keys $\bm{m}_{s}$ and values $\bm{v}_{s}$, with denominator $Z(\bm{z})=\sum_{s}e^{s_{s}(\bm{z})}$. The one-step update $\bm{z}_{t+1}=\bm{z}_{t}+\Delta t\,\bm{f}(\bm{z}_{t})+\sqrt{2D}\,\Delta\bm{W}$, with $\bm{f}(\bm{z})=\mathrm{Attn}(\bm{z})-\bm{z}$, then fixes three objects of the action. The inverse temperature weighting the path Gibbs measure is the attention scale $1/\sqrt{d}$, the Lagrange multiplier of Sec. [II.3](https://arxiv.org/html/2606.28751#S2.SS3). The bilinear form $M=W_{Q}^{\top}W_{K}$ is the metric of the kinetic term, so a non-symmetric $M$ is an anisotropic, non-reciprocal metric. Its antisymmetric part $M_{A}$ feeds the antisymmetric drift Jacobian $J_{A}$, derived explicitly in Sec. [V.4](https://arxiv.org/html/2606.28751#S5.SS4), and through Eqs. ([10](https://arxiv.org/html/2606.28751#S3.E10)) and ([27](https://arxiv.org/html/2606.28751#S6.E27)) the same $M_{A}$ supplies both the gyroscopic force of prediction and the entropy production of Sec. [VI](https://arxiv.org/html/2606.28751#S6). In the symmetric, value-tied limit ($W_{Q}=W_{K}$ with $\bm{v}_{s}$ tied to $\bm{m}_{s}$) the readout reduces to a gradient flow, recovering the modern Hopfield energy and a purely relaxational world model.

### V.3 The Jacobian term from the softmax denominator

The second term of the Onsager–Machlup Lagrangian, $\tfrac{1}{2}\nabla\!\cdot\bm{f}$, is the Jacobian of the midpoint (Stratonovich) discretization, and it is what distinguishes the path probability from a bare squared residual. In a transformer it is supplied by the softmax denominator. Differentiating the normalized weights gives $\nabla a_{s}=\tfrac{1}{\sqrt{d}}\,a_{s}\,(M\bm{m}_{s}-\langle M\bm{m}\rangle_{a})$ with $\langle\cdot\rangle_{a}=\sum_{s}a_{s}(\cdot)$, so that

$$\begin{split}\tfrac{1}{2}\,\nabla\!\cdot\bm{f}&=\frac{1}{2\sqrt{d}}\sum_{s}a_{s}\,(M\bm{m}_{s}-\langle M\bm{m}\rangle_{a})\cdot\bm{v}_{s}\\ &=\frac{1}{2\sqrt{d}}\,\mathrm{Cov}_{a}\!\big(M\bm{m},\,\bm{v}\big).\end{split}\tag{20}$$

The Jacobian term is thus the attention-weighted covariance, under the softmax denominator, between the key projection $M\bm{m}_{s}$ and the value $\bm{v}_{s}$. Both pieces of the local action therefore come from one attention layer: the kinetic term from the score, the Jacobian term from the normalization. Key-normalization (LayerNorm) enters by making the score a genuine inner product and fixing the temperature scale, the same role it plays in reducing free diffusion to dot-product attention. Equation ([20](https://arxiv.org/html/2606.28751#S5.E20)) also makes the locality check of Sec. [VI](https://arxiv.org/html/2606.28751#S6) concrete: it holds layer by layer, and the departure from it across a deep stack measures the memory parameter $\epsilon$.
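
The covariance form of Eq. (20) can be checked against a finite-difference divergence. The sketch below is illustrative only: the random keys, values, and score matrix and all function names are assumptions, and it tests the attention part of the divergence alone, since the $-\bm{z}$ term of $\bm{f}=\mathrm{Attn}(\bm{z})-\bm{z}$ adds only the state-independent constant $-d$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_keys = 4, 6
keys = rng.standard_normal((n_keys, d))    # context keys m_s (hypothetical)
values = rng.standard_normal((n_keys, d))  # values v_s (hypothetical)
M = rng.standard_normal((d, d))            # score matrix W_Q^T W_K
z = rng.standard_normal(d)

def weights(z):
    # a_s = softmax_s( z^T M m_s / sqrt(d) )
    s = keys @ (M.T @ z) / np.sqrt(d)
    w = np.exp(s - s.max())
    return w / w.sum()

def attn(z):
    return weights(z) @ values

def divergence_fd(z, eps=1e-6):
    # Central finite differences of sum_i d Attn_i / d z_i.
    total = 0.0
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        total += (attn(z + e)[i] - attn(z - e)[i]) / (2 * eps)
    return total

def divergence_cov(z):
    # (1/sqrt(d)) * sum_s a_s (M m_s - <M m>_a) . v_s
    a = weights(z)
    u = keys @ M.T                 # row s is (M m_s)^T
    u_centered = u - a @ u         # subtract the attention-weighted mean
    return float(np.sum(a * np.sum(u_centered * values, axis=1)) / np.sqrt(d))

div_fd = divergence_fd(z)
div_cov = divergence_cov(z)
```

The two numbers agree to finite-difference accuracy, which confirms that the divergence is produced entirely by differentiating the softmax normalization.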

### V.4 The antisymmetric Jacobian descends from the query–key asymmetry

The paper's causal claim, that the query–key asymmetry is an architecturally controllable source of the irreversible drift, can be made explicit at the level of the drift Jacobian rather than asserted. Using $\nabla a_{s}$ from above, the Jacobian of $\bm{f}=\mathrm{Attn}(\bm{z})-\bm{z}$ is the attention-weighted cross-covariance between the values and the key projections,

$$\begin{split}J&=\frac{1}{\sqrt{d}}\,\mathrm{Cov}_{a}\!\big(\bm{v},\,M\bm{m}\big)-I,\\ \mathrm{Cov}_{a}(\bm{v},\bm{u})&=\langle\bm{v}\bm{u}^{\top}\rangle_{a}-\langle\bm{v}\rangle_{a}\langle\bm{u}\rangle_{a}^{\top}.\end{split}\tag{21}$$

In the value-tied regime $\bm{v}_{s}=\bm{m}_{s}$ of Sec. [V](https://arxiv.org/html/2606.28751#S5), where the query–key product is the only asymmetry channel, this cross-covariance is $C\,M^{\top}$ with $C=\mathrm{Cov}_{a}(\bm{m},\bm{m})$ the symmetric, positive attention-weighted key covariance, and the antisymmetric part of the Jacobian splits into exactly two terms,

$$J_{A}=\frac{1}{2\sqrt{d}}\Big(\underbrace{[\,C,\,M_{S}\,]}_{\text{state-dependent}}\;-\;\underbrace{\{\,C,\,M_{A}\,\}}_{\text{architectural}}\Big).\tag{22}$$

The second term is the anticommutator of the key covariance with the query–key asymmetry $M_{A}$. For a positive-definite $C$, it vanishes if and only if $M_{A}=0$, so $M_{A}$ is exactly the architecturally controlled source of the antisymmetric Jacobian, and hence, through Eq. ([27](https://arxiv.org/html/2606.28751#S6.E27)), of the circulating drift $\bm{v}$ and the entropy production. This is the link the $P_{4}$ intervention severs: symmetrizing $M$ sets $M_{A}=0$ and removes this term at a stroke. The first term, the commutator of the *state-dependent* covariance $C(\bm{z})$ with the symmetric part $M_{S}$, survives symmetrization and is generically nonzero. It is the residual, state-dependent non-integrability noted in Sec. [VII](https://arxiv.org/html/2606.28751#S7), and it is why the intervention collapses the entropy production to a small floor rather than exactly to zero, as seen in Figs. [2](https://arxiv.org/html/2606.28751#S7.F2) and [3](https://arxiv.org/html/2606.28751#S7.F3).
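
The split in Eq. (22) follows from $J=\tfrac{1}{\sqrt{d}}CM^{\top}-I$ and $J_{A}=\tfrac{1}{2}(J-J^{\top})$, and it can be verified numerically. The sketch below assumes a value-tied softmax layer with random keys and a random score matrix; all names and sizes are hypothetical and chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_keys = 3, 8
keys = rng.standard_normal((n_keys, d))  # memory slots m_s, values tied to keys
M = rng.standard_normal((d, d))          # generic, non-symmetric score matrix
z = rng.standard_normal(d)

def weights(z, M):
    s = keys @ (M.T @ z) / np.sqrt(d)
    w = np.exp(s - s.max())
    return w / w.sum()

def drift(z, M):
    # f(z) = Attn(z) - z with v_s = m_s
    return weights(z, M) @ keys - z

def antisym_jacobian_fd(z, M, eps=1e-6):
    # Central finite-difference Jacobian, then its antisymmetric part.
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (drift(z + e, M) - drift(z - e, M)) / (2 * eps)
    return 0.5 * (J - J.T)

def antisym_jacobian_formula(z, M):
    a = weights(z, M)
    mean = a @ keys
    C = (keys * a[:, None]).T @ keys - np.outer(mean, mean)  # Cov_a(m, m)
    M_S, M_A = 0.5 * (M + M.T), 0.5 * (M - M.T)
    commutator = C @ M_S - M_S @ C
    anticommutator = C @ M_A + M_A @ C
    return (commutator - anticommutator) / (2 * np.sqrt(d))

M_sym = 0.5 * (M + M.T)  # the symmetrizing intervention
err_full = np.abs(antisym_jacobian_fd(z, M) - antisym_jacobian_formula(z, M)).max()
err_sym = np.abs(antisym_jacobian_fd(z, M_sym) - antisym_jacobian_formula(z, M_sym)).max()
residual = np.linalg.norm(antisym_jacobian_formula(z, M_sym))
```

With the symmetrized matrix the anticommutator term is identically zero, and `residual` measures what is left of $J_{A}$ from the commutator $[C, M_{S}]$ alone at this particular state.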

### V.5 The odd-elasticity dictionary

The analogy to odd elasticity invoked in Sec. [II.2](https://arxiv.org/html/2606.28751#S2.SS2) can now be made exact. Linearizing the drift about a point gives $\bm{f}(\bm{z})\approx-K(\bm{z}-\bm{z}_{0})$ with stiffness $K=-J$, which splits as $K=K_{S}+K_{A}$ into a reciprocal part $K_{S}=-J_{S}$, derivable from a potential, and a non-reciprocal odd part $K_{A}=-J_{A}$. The work extracted by traversing a closed cycle $C$ in latent space is

$$W_{C}=\oint_{C}\bm{f}\cdot\mathrm{d}\bm{z}=\oint_{C}(J\bm{z})\cdot\mathrm{d}\bm{z}=2\,(J_{A})_{\perp}\,\mathrm{Area}(C),\tag{23}$$

where the symmetric part integrates to zero and, by Stokes, only the component $(J_{A})_{\perp}$ of $J_{A}$ in the plane of $C$ survives, times the enclosed area. This is exactly the odd-elastic work per cycle of a non-reciprocal linear medium \[[15](https://arxiv.org/html/2606.28751#bib.bib15)\]. By Sec. [VI](https://arxiv.org/html/2606.28751#S6) the same cycle integral is the entropy produced, $\Sigma_{C}=\tfrac{1}{D}\oint_{C}\bm{v}\cdot\mathrm{d}\bm{z}$, and by the previous subsection $J_{A}$ descends from the attention asymmetry $M_{A}$. The dictionary is therefore
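
Equation (23) is easy to confirm numerically for a planar linear drift. The sketch below assumes a counterclockwise circle in the $(z_1, z_2)$ plane and reads $(J_{A})_{\perp}$ as the entry $(J_{A})_{21}$; that sign convention and the particular matrix are assumptions made for illustration.

```python
import numpy as np

# Hypothetical 2x2 drift Jacobian with both symmetric and antisymmetric parts.
J = np.array([[-0.5, -1.3],
              [0.7, -0.2]])
J_A = 0.5 * (J - J.T)

radius, n = 0.8, 20000
phi = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)

# Points on a counterclockwise circle and the line element at each point.
z = radius * np.stack([np.cos(phi), np.sin(phi)], axis=1)
dz = radius * np.stack([-np.sin(phi), np.cos(phi)], axis=1) * (2.0 * np.pi / n)

# Line integral of f = J z around the cycle (row i of z @ J.T is J z_i).
work = float(np.sum(np.einsum("ij,ij->i", z @ J.T, dz)))

# Stokes prediction: 2 (J_A)_perp times the enclosed area.
predicted = 2.0 * J_A[1, 0] * np.pi * radius**2
```

Replacing `J` by its symmetric part drives `work` to zero, which is the statement that a reciprocal medium does no net work around a cycle.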

$$\underbrace{K_{A}}_{\text{odd modulus}}\;\longleftrightarrow\;\underbrace{J_{A}}_{\text{antisym. Jacobian}}\;\longleftrightarrow\;\underbrace{\bm{v}}_{\text{circulating drift}}\;\longleftrightarrow\;\underbrace{M_{A}}_{\text{query–key asymmetry}},\tag{24}$$

with the odd-elastic work per cycle equal, up to $1/D$, to the entropy produced per cycle. A reciprocal (symmetric) attention is an ordinary elastic world model that stores no work around cycles; the odd, non-reciprocal part is supplied by $W_{Q}\neq W_{K}$.

## VI Irreversibility and the query–key asymmetry

This section turns the framing of the preceding sections into operational definitions, so that every quantity is something extracted from a trained world model rather than assumed\. This is what makes the central claim falsifiable, and it is the part on which the contribution stands or falls\.

### VI.1 Reversible and irreversible drift from rollouts

Given an ensemble of latent rollouts, estimate the first two Kramers–Moyal coefficients locally in $\bm{z}$, a standard inference problem for stochastic trajectories \[[20](https://arxiv.org/html/2606.28751#bib.bib20)\],

$$\hat{\bm{f}}(\bm{z})=\frac{1}{\Delta t}\,\mathbb{E}\!\left[\bm{z}_{t+\Delta t}-\bm{z}_{t}\,\middle|\,\bm{z}_{t}=\bm{z}\right],\tag{25}$$

$$2\hat{D}(\bm{z})=\frac{1}{\Delta t}\,\mathrm{Cov}\!\left[\bm{z}_{t+\Delta t}-\bm{z}_{t}\,\middle|\,\bm{z}_{t}=\bm{z}\right].\tag{26}$$

The locality (Hooke) assumption of Sec. [II.2](https://arxiv.org/html/2606.28751#S2.SS2) is itself testable here: conditioning additionally on the history $\bm{z}_{<t}$ should change $\hat{\bm{f}}$ by only $O(\epsilon)$, and the size of that change is a direct estimate of $\epsilon$.
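
A minimal sketch of Eqs. (25) and (26) on synthetic rollouts follows. It is not the grid-based estimator used in the paper: because the test process is linear by construction, a global least-squares fit stands in for the local conditional average, and the process parameters, step count, and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a, omega, D, dt, n_steps = 0.5, 1.0, 0.25, 0.01, 200_000
R = np.array([[0.0, -1.0], [1.0, 0.0]])      # planar rotation generator
W_true = -a * np.eye(2) + omega * R          # drift f(z) = W_true z

# Euler-Maruyama rollout of dz = W z dt + sqrt(2D) dW.
noise = np.sqrt(2.0 * D * dt) * rng.standard_normal((n_steps, 2))
z = np.zeros((n_steps + 1, 2))
for t in range(n_steps):
    z[t + 1] = z[t] + dt * (W_true @ z[t]) + noise[t]

dz = z[1:] - z[:-1]

# First Kramers-Moyal coefficient: regress dz/dt on z, so dz/dt ~ z @ W^T.
coef, *_ = np.linalg.lstsq(z[:-1], dz / dt, rcond=None)
W_hat = coef.T

# Second coefficient: 2 D_hat = Cov[increment residual] / dt.
resid = dz - dt * (z[:-1] @ W_hat.T)
D_hat = np.cov(resid.T) / (2.0 * dt)
```

For a nonlinear drift the regression would be replaced by averages within bins or neighbourhoods of $\bm{z}$, which is what the conditional expectations in Eqs. (25) and (26) literally prescribe.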

Estimate the stationary density $\rho_{\mathrm{ss}}(\bm{z})$ from the occupation of long rollouts. The steady-state probability current and the irreversible (circulating) drift are, for constant isotropic $D$,

$$\begin{split}\bm{J}(\bm{z})&=\hat{\bm{f}}\,\rho_{\mathrm{ss}}-\hat{D}\,\nabla\rho_{\mathrm{ss}},\\ \bm{v}(\bm{z})&=\frac{\bm{J}(\bm{z})}{\rho_{\mathrm{ss}}(\bm{z})}=\hat{\bm{f}}(\bm{z})-\hat{D}\,\nabla\ln\rho_{\mathrm{ss}}(\bm{z}),\end{split}\tag{27}$$

while the reversible part is the gradient $\hat{\bm{f}}-\bm{v}=\hat{D}\,\nabla\ln\rho_{\mathrm{ss}}=-\nabla U$ with $U=-\hat{D}\ln\rho_{\mathrm{ss}}$. Detailed balance holds if and only if $\bm{v}\equiv\bm{0}$. Note that $\bm{v}$ and hence the irreversibility are obtained directly from the current; they do not require having first established the local Onsager–Machlup form, only the drift and the stationary density.

### VI.2 Entropy production

The path-wise entropy production was defined in Eq. ([9](https://arxiv.org/html/2606.28751#S2.E9)) as $\Sigma[\Gamma]=\ln P[\Gamma]/P[\widetilde{\Gamma}]$, the central object of stochastic thermodynamics and its fluctuation theorems \[[9](https://arxiv.org/html/2606.28751#bib.bib9),[21](https://arxiv.org/html/2606.28751#bib.bib21),[22](https://arxiv.org/html/2606.28751#bib.bib22)\], which needs only the path measure and its reversal. Its steady-state rate is the measurable functional of $\bm{v}$ and $\rho_{\mathrm{ss}}$,

$$\dot{\Sigma}=\int\mathrm{d}\bm{z}\,\frac{|\bm{J}(\bm{z})|^{2}}{\hat{D}\,\rho_{\mathrm{ss}}(\bm{z})}=\frac{1}{\hat{D}}\,\big\langle|\bm{v}|^{2}\big\rangle_{\mathrm{ss}}\geq 0,\tag{28}$$

which vanishes exactly when the predicted dynamics is reversible. We propose $\dot{\Sigma}$ as the scalar signature of how much genuine temporal structure a world model predicts.

### VI.3 Local circulation and non-normality

The local generator of circulation is the antisymmetric part $J_{A}=\tfrac{1}{2}(J-J^{\top})$ of the drift Jacobian $J$, whose strength we measure by the frame-invariant non-normality

$$\mathcal{N}(\bm{z})=\big\|[J,J^{\top}]\big\|_{F}=2\,\big\|[J_{A},J_{S}]\big\|_{F},\tag{29}$$

with $J_{S}=\tfrac{1}{2}(J+J^{\top})$. A drift Jacobian that is symmetric everywhere ($J_{A}\equiv 0$) forces $\bm{v}=\bm{0}$ and hence $\dot{\Sigma}=0$: such a world model can only relax to attractors and cannot sustain directed or cyclic prediction.
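
The identity in Eq. (29) follows from $[J,J^{\top}]=2[J_{A},J_{S}]$ and can be checked in a few lines. The sketch below uses a random test matrix and a hypothetical helper `non_normality`; the second example, isotropic damping plus a planar rotation, is included as an assumption-free algebraic edge case in which $J_{A}\neq 0$ but commutes with $J_{S}$.

```python
import numpy as np

def non_normality(J):
    # Frobenius norm of the commutator [J, J^T].
    return np.linalg.norm(J @ J.T - J.T @ J, ord="fro")

def split(J):
    # Symmetric and antisymmetric parts of J.
    return 0.5 * (J + J.T), 0.5 * (J - J.T)

rng = np.random.default_rng(0)
J = rng.standard_normal((4, 4))
J_S, J_A = split(J)

lhs = non_normality(J)
rhs = 2.0 * np.linalg.norm(J_A @ J_S - J_S @ J_A, ord="fro")

# Isotropic damping plus rotation: J_A is nonzero but commutes with J_S.
J_rot = np.array([[-0.5, -1.0],
                  [1.0, -0.5]])
n_rot = non_normality(J_rot)
```

Because $\mathcal{N}$ depends on the commutator, it registers circulation that fails to commute with the relaxational part, so it is best read alongside $\|J_{A}\|$ rather than in place of it.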

### VI.4 The attention-asymmetry index

For a transformer world model the attention logit is the bilinear form $q^{\top}k=\bm{z}_{t}^{\top}M\,\bm{z}_{s}$ with $M=W_{Q}^{\top}W_{K}$. The asymmetry under exchange of the two positions is carried by

$$M_{A}=\tfrac{1}{2}\!\left(W_{Q}^{\top}W_{K}-W_{K}^{\top}W_{Q}\right),\qquad\mathcal{Q}=\frac{\|M_{A}\|_{F}}{\|M\|_{F}}\in[0,1],\tag{30}$$

the dimensionless query–key asymmetry. Symmetric attention ($W_{Q}=W_{K}$) gives $\mathcal{Q}=0$; standard attention has $\mathcal{Q}>0$. The bridge proposed here is that $\mathcal{Q}$ feeds $J_{A}$ through Eq. ([27](https://arxiv.org/html/2606.28751#S6.E27)) and hence drives $\dot{\Sigma}$.
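
The index in Eq. (30) is a one-line computation on the weight matrices. The sketch below uses a hypothetical helper `asymmetry_index` and, as a test case, the parameterization $M=\tfrac{1}{2}I+\theta A_{0}$ with $A_{0}$ the planar rotation generator, for which $\mathcal{Q}(\theta)=2\theta/\sqrt{1+4\theta^{2}}$; realizing $M$ through $W_{Q}=I$, $W_{K}=M$ is an assumption made only to exercise the function.

```python
import numpy as np

def asymmetry_index(W_Q, W_K):
    """Return Q = ||M_A||_F / ||M||_F for M = W_Q^T W_K."""
    M = W_Q.T @ W_K
    M_A = 0.5 * (M - M.T)
    return float(np.linalg.norm(M_A, ord="fro") / np.linalg.norm(M, ord="fro"))

A0 = np.array([[0.0, -1.0], [1.0, 0.0]])  # planar rotation generator

def M_theta(theta):
    return 0.5 * np.eye(2) + theta * A0

# Sweep theta and compare with the closed form for this family.
thetas = np.array([0.0, 0.5, 1.0, 2.0])
q_measured = np.array([asymmetry_index(np.eye(2), M_theta(t)) for t in thetas])
q_closed = 2.0 * thetas / np.sqrt(1.0 + 4.0 * thetas**2)

# Tied weights give a symmetric M and hence Q = 0 up to rounding.
W = np.random.default_rng(0).standard_normal((3, 3))
q_tied = asymmetry_index(W, W)
```

The bound $\mathcal{Q}\le 1$ holds because the symmetric and antisymmetric parts are Frobenius-orthogonal, so $\|M\|_{F}^{2}=\|M_{S}\|_{F}^{2}+\|M_{A}\|_{F}^{2}$.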

### VI.5 Falsifiable predictions

The definitions above make the thesis testable rather than asserted\.

- P1 *Locality.* Conditioning the drift on history beyond $\bm{z}_{t}$ changes $\hat{\bm{f}}$ by $O(\epsilon)$; this validates the local regime and measures $\epsilon$.
- P2 *Broken detailed balance.* $\dot{\Sigma}>0$ on tasks that require predicting sustained dynamics (motion, rhythm, sequence), and $\dot{\Sigma}\to 0$ on relaxational tasks (denoising to a fixed point).
- P3 *Correlation.* Across heads, layers, or models, $\dot{\Sigma}$ increases monotonically with the attention asymmetry $\mathcal{Q}$.
- P4 *Intervention.* Symmetrizing the attention maps ($\mathcal{Q}\to 0$) suppresses $\dot{\Sigma}$ and selectively degrades prediction of sustained dynamics while leaving relaxational prediction intact.
- P5 *Instanton versus rollout.* The most-probable path of Sec. [III](https://arxiv.org/html/2606.28751#S3) departs from the deterministic rollout $\dot{\bm{z}}=\hat{\bm{f}}$ where $\bm{v}$ and $\nabla\!\cdot\hat{\bm{f}}$ are large.

P4 is the causal core: it converts a correlation between an architectural quantity and a thermodynamic one into an intervention with a predicted, falsifiable consequence\.

## VII Results and discussion

### VII.1 A controlled demonstration

Before turning to large pretrained world models, we verify both the measurement pipeline and the mechanism in a controlled minimal model where the architectural asymmetry can be swept and intervened upon. We take a faithful two-dimensional attention-driven latent dynamics in which the query–key product $M=W_{Q}^{\top}W_{K}$ is parameterized as $M=S_{0}+\theta A_{0}$, with $S_{0}=\tfrac{1}{2}I$ symmetric and $A_{0}$ the planar rotation generator, so that the asymmetry $\mathcal{Q}(\theta)$ runs from $0$ at $\theta=0$ to near unity. We use both a linear-attention form, for which the latent drift is $\bm{f}(\bm{z})=W\bm{z}$ with $W=-\gamma I+M^{\top}$ and the steady-state entropy production is known exactly from the Lyapunov equation, and a softmax form $\bm{f}(\bm{z})=-\gamma\bm{z}+\beta\sum_{j}\mathrm{softmax}_{j}(\bm{z}^{\top}M\bm{m}_{j})\,\bm{m}_{j}$ over fixed memory slots $\bm{m}_{j}$. We fix $\gamma=1$ and $D=0.25$; the linear drift is stable for all $\theta$ (eigenvalues $-\tfrac{1}{2}\pm i\theta$). The full pipeline of Eqs. ([27](https://arxiv.org/html/2606.28751#S6.E27))–([28](https://arxiv.org/html/2606.28751#S6.E28)) is then applied blindly to sampled rollouts, with the drift and stationary density estimated on a grid and never supplied analytically.

![Refer to caption](https://arxiv.org/html/2606.28751v1/x1.png)

Figure 1: Query–key asymmetry sets the entropy production of the predicted dynamics. (a) In a linear-attention latent dynamics, the blindly measured entropy production $\dot{\Sigma}$ rises with the attention asymmetry $\mathcal{Q}=\|M_{A}\|/\|M\|$ and matches the exact Lyapunov value. (b) The same trend holds for softmax attention; note the smaller vertical scale, as the bounded softmax weights saturate the attainable circulation. (c) Intervention: forcing $W_{Q}=W_{K}$ ($\mathcal{Q}\to 0$) collapses $\dot{\Sigma}$ in both models. (d,e) The measured probability current (dark red streamlines) over the stationary density $\rho_{\mathrm{ss}}$ (gray shading): a coherent global circulation for the asymmetric model (d) and no net rotation after symmetrization (e), where the residual short vectors are numerical fluctuation at the noise floor $\dot{\Sigma}\approx 0.2$ rather than genuine current. (f) The swept asymmetry $\mathcal{Q}(\theta)$; the linear model is stable for all $\theta$.

### VII.2 Results

The predictions of Sec. [VI](https://arxiv.org/html/2606.28751#S6) that the toy model can address are borne out (Fig. [1](https://arxiv.org/html/2606.28751#S7.F1)).

*Pipeline validity and the $\mathcal{Q}$–$\dot{\Sigma}$ relation (P3).* For the linear model, the blindly measured $\dot{\Sigma}$ reproduces the exact Lyapunov value across the whole sweep (panel a): at $\theta=1$ the exact and measured rates are $4.00$ and $4.21$, and at $\theta=2$ they are $16.0$ and $15.9$. The entropy production rises steeply and monotonically with the attention asymmetry, spanning roughly two orders of magnitude as $\mathcal{Q}$ goes from $0$ to $0.97$. The agreement establishes that the grid-based Kramers–Moyal and current estimators recover the true entropy production rather than an artefact of the measurement.
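
The exact values quoted for the linear model can be reproduced from the Lyapunov equation. The sketch below is one way to do it under the stated parameters ($\gamma=1$, $D=0.25$, $M=\tfrac{1}{2}I+\theta A_{0}$); the function name and the Kronecker-product solve are assumptions of this example, not a description of the paper's code.

```python
import numpy as np

def linear_entropy_production(W, D):
    """Steady-state entropy production rate for dz = W z dt + sqrt(2D) dW."""
    d = W.shape[0]
    I = np.eye(d)
    # Stationary covariance S solves the Lyapunov equation W S + S W^T = -2 D I.
    # With row-major flattening, vec(W S) = kron(W, I) vec(S) and
    # vec(S W^T) = kron(I, W) vec(S).
    K = np.kron(W, I) + np.kron(I, W)
    S = np.linalg.solve(K, (-2.0 * D * I).reshape(-1)).reshape(d, d)
    # Gaussian steady state: grad ln rho = -S^{-1} z, so v(z) = (W + D S^{-1}) z.
    V = W + D * np.linalg.inv(S)
    # Sigma_dot = <|v|^2> / D = tr(V S V^T) / D.
    return float(np.trace(V @ S @ V.T) / D)

A0 = np.array([[0.0, -1.0], [1.0, 0.0]])  # planar rotation generator
gamma, D = 1.0, 0.25

def drift_matrix(theta):
    M = 0.5 * np.eye(2) + theta * A0
    return -gamma * np.eye(2) + M.T

sigma_1 = linear_entropy_production(drift_matrix(1.0), D)
sigma_2 = linear_entropy_production(drift_matrix(2.0), D)
```

For this family the rate reduces to $4\theta^{2}$, which gives the exact values $4.00$ at $\theta=1$ and $16.0$ at $\theta=2$ against which the blind measurements are compared.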

*Persistence under nonlinearity.* In the softmax model (panel b) the same monotone rise is observed, from $\dot{\Sigma}\approx 0.3$ at $\mathcal{Q}=0$ to $\dot{\Sigma}\approx 4.8$ at $\mathcal{Q}=0.97$, now saturating at large $\mathcal{Q}$ because the bounded softmax limits the attainable circulation. The effect is therefore not an artefact of the linear construction.

*Intervention (P4).* Symmetrizing the attention map, that is forcing $W_{Q}=W_{K}$ so that $\mathcal{Q}\to 0$, collapses the entropy production by more than an order of magnitude in both models (panel c): from $7.71$ to $0.18$ in the linear case and from $3.82$ to $0.19$ in the softmax case. Correspondingly, the measured probability current changes from a coherent global circulation (panel d) to a field with no net rotation (panel e). Because the intervention acts on the architecture and the consequence is read out thermodynamically, this is a causal test of the bridge, not a correlation.

*Noise floor.* The residual $\dot{\Sigma}\approx 0.2$ that remains at $\mathcal{Q}=0$ is a positive finite-binning bias of the estimator, the standard small-sample bias of entropy-production estimators, and it sets the floor of the measurement. The signal at moderate and large $\mathcal{Q}$ stands well above it.

### VII.3 Spontaneous acquisition in a trained model

The toy of Fig. [1](https://arxiv.org/html/2606.28751#S7.F1) set the attention asymmetry by hand. To test whether the same asymmetry is *acquired* by learning, and in proportion to the irreversibility of the data, we train the attention world model of Sec. [V](https://arxiv.org/html/2606.28751#S5) on a controllable two-dimensional process $\mathrm{d}\bm{z}=(-a\bm{z}+\omega R\bm{z})\,\mathrm{d}t+\sqrt{2D}\,\mathrm{d}\bm{W}$, with $R$ the rotation generator, whose true entropy production is $\dot{\Sigma}_{\mathrm{true}}=2\omega^{2}/a$. The model queries the state against learnable memory slots with values tied to keys, so its only channel for an antisymmetric drift is the score matrix $M=W_{Q}^{\top}W_{K}$; we initialize it symmetric ($\mathcal{Q}=0$) and train by next-step prediction.
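
The value $\dot{\Sigma}_{\mathrm{true}}=2\omega^{2}/a$ follows from Eqs. (27) and (28) in three steps. The derivation below is a sketch that assumes $R$ is the planar rotation generator, so that $R^{\top}R=I$ and the rotational drift is divergence-free and tangent to the level sets of $|\bm{z}|^{2}$, and that $D$ is constant and isotropic.

```latex
\begin{aligned}
\rho_{\mathrm{ss}}(\bm{z}) &\propto \exp\!\Big(-\frac{a\,|\bm{z}|^{2}}{2D}\Big),
\qquad \big\langle |\bm{z}|^{2} \big\rangle_{\mathrm{ss}} = \frac{2D}{a}, \\
\bm{v}(\bm{z}) &= \bm{f}(\bm{z}) - D\,\nabla \ln \rho_{\mathrm{ss}}(\bm{z})
 = (-a\bm{z} + \omega R\bm{z}) + a\bm{z} = \omega R\bm{z}, \\
\dot{\Sigma}_{\mathrm{true}} &= \frac{1}{D}\,\big\langle |\bm{v}|^{2} \big\rangle_{\mathrm{ss}}
 = \frac{\omega^{2}}{D}\,\big\langle |\bm{z}|^{2} \big\rangle_{\mathrm{ss}}
 = \frac{2\omega^{2}}{a}.
\end{aligned}
```

The rate is independent of $D$ because the factor $1/D$ in Eq. (28) cancels the $D$ in the stationary variance.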

Figure [2](https://arxiv.org/html/2606.28751#S7.F2) reports the outcome. Starting from a symmetric initialization, training on circulating data drives the query–key asymmetry $\|M_{A}\|$ and the measured entropy production up together within the first hundred steps (panel a): the asymmetry is acquired, not imposed. Across the sweep the trained model's blind entropy production reproduces the true $2\omega^{2}/a$ (panel b), and the acquired asymmetry, together with the dynamical non-normality $\mathcal{N}$ it induces, is smallest for gradient ($\omega=0$) data, where the drift Jacobian is exactly normal, and grows with the circulating part of the data (panel c). The causal test is decisive: symmetrizing the learned $M$ collapses the entropy production from $5.0$ to $0.3$ (panel d), and with it the non-normality $\mathcal{N}$ from $0.8$ to $0.06$. It also selectively raises the prediction loss on circulating data while leaving relaxational data untouched (panel e). The trained model's measured current is a coherent circulation (panel f). A small attention model thus learns to be irreversible exactly when its data is, and carries that irreversibility in its query–key asymmetry.

![Refer to caption](https://arxiv.org/html/2606.28751v1/x2.png)

Figure 2: Query–key asymmetry is acquired by training in proportion to the data's irreversibility. (a) From a symmetric initialization, training on circulating data drives $\|M_{A}\|$ (dashed) and the measured $\dot{\Sigma}$ (solid) up together. (b) The trained model's blind $\dot{\Sigma}$ reproduces the true $2\omega^{2}/a$ across the data irreversibility $\omega$. (c) The acquired asymmetry $\|M_{A}\|$ (left axis) and the dynamical non-normality $\mathcal{N}=\|[J,J^{\top}]\|_{F}$ it induces (right axis) both vanish for gradient data ($\omega=0$) and grow with circulating data; the ratio $\mathcal{Q}$ saturates and $\|M_{A}\|$ turns over at large $\omega$, where the model uses additional capacity. (d) Symmetrizing the learned $M$ collapses $\dot{\Sigma}$ (and, in the text, $\mathcal{N}$). (e) Excess one-step prediction loss after symmetrization: circulating prediction degrades, relaxational prediction is spared. (f) The trained circulating model's measured probability current (dark red) over its stationary density (gray shading).

### VII.4 Irreversibility is a resource for prediction

The measurements so far show that a trained model is irreversible and that its irreversibility is carried by the query–key asymmetry. The sharper question is whether that irreversibility is *useful*: does removing it cost prediction? We compare multi-step deterministic rollouts of the trained model and of its symmetrized counterpart against the true conditional mean $\mu(\bm{z},h)=e^{W_{\omega}h}\bm{z}$, on circulating and relaxational data.

Figure [3](https://arxiv.org/html/2606.28751#S7.F3) answers it. On circulating data the trained model tracks the true rollout over the full horizon, while the symmetrized model's error grows sharply: unable to rotate, it predicts only radial relaxation and loses the phase (panels a, d). On relaxational data the two are indistinguishable (panel b), so symmetrization costs nothing when there is no circulation to predict. Across the sweep, the prediction lost on symmetrization grows monotonically with the entropy production (panel c). The query–key asymmetry is therefore not a passive by-product but a resource: the entropy it produces is what lets the model predict irreversible structure over a long horizon, and removing it degrades exactly that prediction and nothing else. A world model that must anticipate genuine temporal structure cannot be at equilibrium.
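
The qualitative shape of this comparison can be seen in an idealized linear case. The sketch below is not the trained model: it assumes the true drift $W_{\omega}=-aI+\omega R$ and takes the symmetrized predictor to be pure radial relaxation $-a\bm{z}$, so that the gap between the two conditional means has the closed form $2e^{-ah}\,|\sin(\omega h/2)|\,|\bm{z}|$. The horizon grid and parameter values are arbitrary choices for illustration.

```python
import numpy as np

def rotation(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def true_mean(z, h, a, omega):
    # exp(W h) z for W = -a I + omega R, in closed form for the planar case.
    return np.exp(-a * h) * (rotation(omega * h) @ z)

def symmetrized_mean(z, h, a):
    # Idealized symmetrized predictor: radial relaxation only, no rotation.
    return np.exp(-a * h) * z

z0 = np.array([1.0, 0.0])
a = 0.5
horizons = np.linspace(0.0, 6.0, 121)

def peak_error(omega):
    errors = [np.linalg.norm(true_mean(z0, h, a, omega) - symmetrized_mean(z0, h, a))
              for h in horizons]
    return max(errors)

# Peak rollout error of the symmetrized predictor as circulation increases.
peaks = [peak_error(w) for w in (0.0, 0.5, 2.0)]

# Closed-form check at one horizon.
h, omega = 1.5, 2.0
gap = np.linalg.norm(true_mean(z0, h, a, omega) - symmetrized_mean(z0, h, a))
closed = 2.0 * np.exp(-a * h) * abs(np.sin(omega * h / 2.0))
```

In this idealization the error is exactly zero for relaxational data ($\omega=0$) and its peak grows with $\omega$, and hence with $2\omega^{2}/a$, which mirrors the selective degradation described above.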

![Refer to caption](https://arxiv.org/html/2606.28751v1/x3.png)

Figure 3: Irreversibility is a resource for long-horizon prediction. Multi-step rollout error against the true conditional mean. (a) On circulating data the trained model (solid) tracks the truth while the symmetrized model (dashed, $\mathcal{Q}\to 0$) fails at long horizon. (b) On relaxational data the two coincide, so symmetrization is harmless. (c) The peak prediction error caused by symmetrization grows monotonically with the entropy production $\dot{\Sigma}$, while the trained model stays near zero. (d) An example circulating rollout: the trained model follows the true spiral, the symmetrized model decays radially and loses the rotation.

### VII.5 Discussion

The demonstration confirms, in a controlled setting, the paper's central claim: the irreversibility of a world model's predicted dynamics is set by the asymmetry of its attention maps. The reading is physical. A symmetric attention ($W_{Q}=W_{K}$) yields a gradient drift, the equilibrium, modern-Hopfield limit, in which the latent state merely relaxes to an attractor and the predicted dynamics is time-reversible. Only an asymmetric attention produces the circulating drift $\bm{v}$ that sustains directed or cyclic prediction, and that circulation is exactly the entropy production measured here. A predictive model that must anticipate ongoing temporal structure (motion, rhythm, or sequence) therefore cannot operate at equilibrium: it must break detailed balance, and the resource that lets it do so is supplied architecturally by the query–key asymmetry. This gives a mechanistic counterpart to, and is consistent with, the empirical finding that neural and active biological dynamics break detailed balance, more strongly so under demanding cognitive tasks \[[18](https://arxiv.org/html/2606.28751#bib.bib18),[23](https://arxiv.org/html/2606.28751#bib.bib23),[24](https://arxiv.org/html/2606.28751#bib.bib24)\], and with the view of the brain as a prediction machine that minimizes a free-energy functional \[[3](https://arxiv.org/html/2606.28751#bib.bib3)\]; the difference is that here the irreversibility is traced to an identifiable architectural source rather than only measured in the data. Figure [3](https://arxiv.org/html/2606.28751#S7.F3) turns this from an argument into a measurement: removing the asymmetry destroys precisely the long-horizon, circulating prediction it was claimed to support and leaves relaxational prediction intact, so the entropy production is a resource the predictor uses rather than waste heat it merely emits \[[25](https://arxiv.org/html/2606.28751#bib.bib25)\].

Several limitations bound the claim. The model is a two-dimensional toy rather than a trained world model, chosen so that $\mathcal{Q}$ can be swept and intervened upon and so that an exact cross-check exists. What keeps the demonstration from being a two-dimensional coincidence is that the mechanism behind it is established analytically and independently of dimension: Eq. ([22](https://arxiv.org/html/2606.28751#S5.E22)) expresses the antisymmetric drift Jacobian through the query–key asymmetry $M_{A}$ for any value-tied attention layer in any latent dimension, so the experiment confirms a general algebraic relation rather than a property of the particular system. What remains genuinely scale- and architecture-dependent, and is therefore the proper subject of a large-model study, is whether the query–key channel keeps dominating the value path once values are untied, and whether a deep stack preserves the local-action picture. We are careful to claim that Eq. ([22](https://arxiv.org/html/2606.28751#S5.E22)) *identifies and controls* a channel of irreversibility, not that this channel *dominates* it in a full architecture: residual connections, the feedforward block, layer normalization, value mixing, and depth composition all contribute to the learned drift, and establishing which channel dominates the entropy production of a large pretrained model is the open empirical question, separate from the analytic identification made here. The floor of $\dot{\Sigma}\approx 0.2$ limits the dynamic range at small $\mathcal{Q}$. This floor receives contributions both from finite-binning bias and from the state-dependent residual non-integrability of Sec. [VII](https://arxiv.org/html/2606.28751#S7); separating the two would require comparing floors across binning resolutions, which we do not pursue here. The linear panel makes the $\mathcal{Q}\to\dot{\Sigma}$ link nearly analytic, which is why the softmax panel is included as a guard against a purely linear artefact. Finally, the grid-based current estimator used here does not scale to the high-dimensional latents of real models. Three routes avoid the grid. The drift $\bm{f}$, and hence $\bm{v}$ through Eq. ([27](https://arxiv.org/html/2606.28751#S6.E27)), can be estimated by neural score matching of the one-step transition \[[26](https://arxiv.org/html/2606.28751#bib.bib26),[12](https://arxiv.org/html/2606.28751#bib.bib12),[27](https://arxiv.org/html/2606.28751#bib.bib27)\], since $\nabla\ln\rho_{\mathrm{ss}}$ is a score of the kind such networks already learn. The rate $\dot{\Sigma}$ can be estimated directly with a variational lower bound on the divergence between the forward and reversed path distributions, which needs only samples and no explicit density, in the spirit of recent machine-learning estimators of entropy production \[[28](https://arxiv.org/html/2606.28751#bib.bib28),[29](https://arxiv.org/html/2606.28751#bib.bib29),[30](https://arxiv.org/html/2606.28751#bib.bib30)\]. Or $\Sigma[\Gamma]$ can be accumulated pathwise from the model's own trajectory log-likelihoods. All three reduce to the quantities of Sec. [VI](https://arxiv.org/html/2606.28751#S6) and are what we would use on a trained model.

A further subtlety concerns the intervention. Symmetrizing $M = W_Q^\top W_K$ removes the asymmetry at the level of the weights, but the softmax makes the drift state-dependent, and a state-dependent map can be non-integrable even under a symmetric $M$: the antisymmetric part of the drift Jacobian contains a term $\propto [\langle a\,\bm{m}\bm{m}^\top \rangle(\bm{z}),\, M]$ that need not vanish when $M = M^\top$. Our claim is therefore about the weight-level channel, the resource a designer controls through $W_Q, W_K$, and the empirical collapse of $\dot{\Sigma}$ to the floor under symmetrization (Figs. [2](https://arxiv.org/html/2606.28751#S7.F2), [3](https://arxiv.org/html/2606.28751#S7.F3)) shows that this channel dominates here. A complete account would also measure the residual state-dependent non-integrability under symmetric weights, which we leave to the high-dimensional study above.
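A short linear-algebra aside shows why the commutator term survives symmetrization. The shorthand $C(\bm{z}) = \langle a\,\bm{m}\bm{m}^\top \rangle(\bm{z})$ is introduced only for this note and is not the paper's notation; the argument assumes that $a$ is a scalar weight, so that $C(\bm{z})$ is a weighted average of outer products and hence symmetric. With $M = M^\top$:

$$
[C, M]^\top = (CM - MC)^\top = M^\top C^\top - C^\top M^\top = MC - CM = -[C, M].
$$

The commutator of two symmetric matrices is therefore itself antisymmetric, and it vanishes only when $C(\bm{z})$ and $M$ commute, for example when $C(\bm{z})$ is proportional to the identity. Symmetrizing $M$ removes the weight-level asymmetry but generically leaves this state-dependent contribution in place.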

Figure [2](https://arxiv.org/html/2606.28751#S7.F2) takes the first of these steps on a small trained model. The remaining rung is a large pretrained latent world model, an RSSM or Dreamer-type model or a transformer trained on rich data, where the latent is high-dimensional and the grid estimator gives way to the score-matching and variational estimators above. There the value path is no longer tied, so the analysis must also ask how the irreversibility is shared between the query–key and value paths, and the selective degradation should be read over long-horizon and planning rollouts, where one-step effects compound, rather than the single step measured here. The learned drift is then also shaped by the non-attention channels: the residual connections, feedforward blocks, and normalization. Their contributions can be isolated by comparing the measured drift against one in which those blocks are bypassed or ablated, so that the query–key channel is read against, rather than confounded with, the rest of the architecture. The operational definitions of Sec. [VI](https://arxiv.org/html/2606.28751#S6) carry over unchanged; only the estimators and the scale grow.

Concretely, the protocol on a trained model is the following. Extract $\mathcal{Q} = \|M_A\| / \|M\|$ for each head and layer from its $W_Q, W_K$; measure $\dot{\Sigma}$ of its latent rollouts with the estimators above; and test the correlation (P3) across heads, layers, and training checkpoints. The causal test (P4) then symmetrizes $W_Q^\top W_K \to \tfrac{1}{2}(W_Q^\top W_K + W_K^\top W_Q)$ and predicts three coupled consequences: $\dot{\Sigma}$ falls; long-horizon and periodic-motion prediction, for example physics-engine rollouts, degrades; and planning quality drops; while purely relaxational tasks, such as denoising or completion to a static structure, are spared. This selective degradation, irreversible prediction lost but relaxation preserved, is the signature that would turn the central claim from a hypothesis into a measured result.
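The weight-level steps of this protocol can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: it assumes $M_A = \tfrac{1}{2}(M - M^\top)$, uses the Frobenius norm for $\|\cdot\|$ (the text does not specify a norm), and assumes per-head projection matrices stored with shape `(d_head, d_model)`. The function names are hypothetical.

```python
import numpy as np


def qk_matrix(W_Q, W_K):
    """Bilinear query-key form M = W_Q^T W_K, shape (d_model, d_model)."""
    return W_Q.T @ W_K


def asymmetry_ratio(M):
    """Q = ||M_A|| / ||M|| with M_A = (M - M^T) / 2 and Frobenius norms (assumed)."""
    M_A = 0.5 * (M - M.T)
    total = np.linalg.norm(M)
    return float(np.linalg.norm(M_A) / total) if total > 0 else 0.0


def symmetrize(M):
    """Symmetrization used in the causal test: M -> (M + M^T) / 2."""
    return 0.5 * (M + M.T)


# Example on random weights standing in for one attention head.
rng = np.random.default_rng(0)
W_Q = rng.normal(size=(4, 8))
W_K = rng.normal(size=(4, 8))
M = qk_matrix(W_Q, W_K)
print("Q before symmetrization:", asymmetry_ratio(M))
print("Q after symmetrization: ", asymmetry_ratio(symmetrize(M)))
```

With the Frobenius norm the ratio lies between 0 (symmetric $M$) and 1 (purely antisymmetric $M$). The symmetrized form does not in general factor back into a single pair $W_Q, W_K$ of the same head dimension, so an intervention on a real model would likely substitute the symmetrized $M$ where the attention score $\bm{z}_i^\top M \bm{z}_j$ is computed.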

## VIII Conclusion

In this work, we proposed a path-space formulation of prediction in AI world models. Rather than viewing prediction as the estimation of a future state distribution, we argued that a world model implicitly represents a probability distribution over future trajectories. Within this framework, prediction, planning, and uncertainty emerge as different operations on the same path measure: the most probable trajectory corresponds to prediction, constrained trajectory selection corresponds to planning, and fluctuations around dominant trajectories correspond to uncertainty. Assuming an effective local path description, the trajectory distribution takes the form of an Onsager–Machlup action. This provides a common physical language linking latent dynamics, stochastic thermodynamics, and attention-based sequence modeling. The resulting framework allows reversible and irreversible components of the learned dynamics to be separated, making entropy production a measurable property of the predicted world rather than an abstract thermodynamic quantity. Building on this formulation, we introduced operational measures connecting architectural asymmetry in attention mechanisms to the irreversibility of learned dynamics. In controlled attention-model experiments, asymmetry in the attention structure was associated with circulating probability flow and positive entropy production, while symmetrization interventions selectively degraded long-horizon predictive performance. These observations support the hypothesis that irreversibility is not only a by-product of learning but may constitute a computational resource for maintaining predictive representations of temporally evolving environments.

Future prediction is naturally formulated as inference on path space, within which irreversibility emerges as a measurable property of the predictive dynamics and, potentially, as a computational resource for representing sustained temporal structure. A world model, in this view, does not predict the next state; it weights whole futures, and its irreversibility is what makes the long-horizon structure of those futures predictable. The present work should be viewed as a first step toward a statistical physics of world models. The central object is not a state distribution but a path distribution, and many questions remain open. In particular, it remains to be established to what extent large-scale world models admit an effective local path description, how nonlocal memory effects modify the trajectory measure, and whether the thermodynamic structures identified here persist across architectures and scales. We hope that the path-space perspective developed in this work provides a foundation for addressing these questions and for building a unified physical theory of prediction, planning, and uncertainty in intelligent systems.

## References

- \[1\] D. Ha and J. Schmidhuber, “World Models,” arXiv:1803.10122 (2018).
- \[2\] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, “Mastering Diverse Domains through World Models,” arXiv:2301.04104 (2023).
- \[3\] K. Friston, “The free-energy principle: a unified brain theory?” Nat. Rev. Neurosci. 11, 127 (2010).
- \[4\] A. Vaswani *et al.*, “Attention Is All You Need,” in *Advances in Neural Information Processing Systems* 30 (2017).
- \[5\] H. Mori, “Transport, Collective Motion, and Brownian Motion,” Prog. Theor. Phys. 33, 423 (1965).
- \[6\] R. Zwanzig, *Nonequilibrium Statistical Mechanics* (Oxford Univ. Press, 2001).
- \[7\] A. J. Chorin, O. H. Hald, and R. Kupferman, “Optimal prediction and the Mori–Zwanzig representation of irreversible processes,” Proc. Natl. Acad. Sci. USA 97, 2968 (2000).
- \[8\] L. Onsager and S. Machlup, “Fluctuations and Irreversible Processes,” Phys. Rev. 91, 1505 (1953).
- \[9\] U. Seifert, “Stochastic thermodynamics, fluctuation theorems and molecular machines,” Rep. Prog. Phys. 75, 126001 (2012).
- \[10\] H. Ramsauer *et al.*, “Hopfield Networks is All You Need,” in *Int. Conf. on Learning Representations (ICLR)* (2021).
- \[11\] S. Levine, “Reinforcement Learning and Control as Probabilistic Inference,” arXiv:1805.00909 (2018).
- \[12\] Y. Song *et al.*, “Score-Based Generative Modeling through Stochastic Differential Equations,” in *Int. Conf. on Learning Representations (ICLR)* (2021).
- \[13\] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow Matching for Generative Modeling,” in *Int. Conf. on Learning Representations (ICLR)* (2023).
- \[14\] S. Raja *et al.*, “Action-Minimization Meets Generative Modeling: Efficient Transition Path Sampling with the Onsager–Machlup Functional,” in *Proc. 42nd Int. Conf. on Machine Learning (ICML)*, PMLR 267, 50972 (2025).
- \[15\] C. Scheibner *et al.*, “Odd elasticity,” Nat. Phys. 16, 475 (2020).
- \[16\] D. Sussillo and O. Barak, “Opening the Black Box: Low-Dimensional Dynamics in High-Dimensional Recurrent Neural Networks,” Neural Comput. 25, 626 (2013).
- \[17\] B. Geshkovski, C. Letrouit, Y. Polyanskiy, and P. Rigollet, “A mathematical perspective on Transformers,” arXiv:2312.10794 (2023).
- \[18\] C. W. Lynn *et al.*, “Broken detailed balance and entropy production in the human brain,” Proc. Natl. Acad. Sci. USA 118, e2109889118 (2021).
- \[19\] D. Sekizawa, S. Ito, and M. Oizumi, “Decomposing thermodynamic dissipation of linear Langevin systems via oscillatory modes and its application to neural dynamics,” Phys. Rev. X 14, 041003 (2024).
- \[20\] A. Frishman and P. Ronceray, “Learning Force Fields from Stochastic Trajectories,” Phys. Rev. X 10, 021009 (2020).
- \[21\] C. Jarzynski, “Nonequilibrium Equality for Free Energy Differences,” Phys. Rev. Lett. 78, 2690 (1997).
- \[22\] G. E. Crooks, “Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences,” Phys. Rev. E 60, 2721 (1999).
- \[23\] C. Battle *et al.*, “Broken detailed balance at mesoscopic scales in active biological systems,” Science 352, 604 (2016).
- \[24\] F. S. Gnesotto, F. Mura, J. Gladrow, and C. P. Broedersz, “Broken detailed balance and non-equilibrium dynamics in living systems: a review,” Rep. Prog. Phys. 81, 066601 (2018).
- \[25\] J. M. R. Parrondo, J. M. Horowitz, and T. Sagawa, “Thermodynamics of information,” Nat. Phys. 11, 131 (2015).
- \[26\] A. Hyvärinen, “Estimation of Non-Normalized Statistical Models by Score Matching,” J. Mach. Learn. Res. 6, 695 (2005).
- \[27\] J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models,” in *Advances in Neural Information Processing Systems* 33 (2020).
- \[28\] A. C. Barato and U. Seifert, “Thermodynamic Uncertainty Relation for Biomolecular Processes,” Phys. Rev. Lett. 114, 158101 (2015).
- \[29\] S. Otsubo, S. Ito, A. Dechant, and T. Sagawa, “Estimating entropy production by machine learning of short-time fluctuating currents,” Phys. Rev. E 101, 062106 (2020).
- \[30\] D.-K. Kim, Y. Bae, S. Lee, and H. Jeong, “Learning Entropy Production via Neural Networks,” Phys. Rev. Lett. 125, 140604 (2020).
