Delta-JEPA: Learning Action-Sensitive World Models via Latent Difference Decoding

arXiv cs.AI 07/01/26, 04:00 AM Papers
world-models jepa action-sensitive latent-difference reconstruction-free planning continuous-control
Summary
Delta-JEPA introduces a reconstruction-free world model that augments latent forward prediction with a Latent Difference Action Decoder to prevent collapse and improve action-sensitivity, achieving better planning performance on visual continuous-control tasks.
arXiv:2606.31232v1 Announce Type: new
# Delta-JEPA: Learning Action-Sensitive World Models via Latent Difference Decoding
Source: [https://arxiv.org/html/2606.31232](https://arxiv.org/html/2606.31232)
Zhenghao Zhang1, Yuanxiang Wang1, Zhenyu Guan1, Yujia Yang1, Bingkang Shi2, Tianyu Zong1, Hongzhu Yi1, Guoqing Chao3, Xingchen Chen4, Tiankun Yang1, Chenxi Bao1, Tao Yu5, Jingjing Zhou1, Jungang Xu1

###### Abstract

Learning visual world models for planning requires compact latent dynamics that remain sensitive to actions, yet reconstruction-free joint-embedding objectives can collapse to action-insensitive representations. We propose Delta-JEPA, an end-to-end reconstruction-free world model that augments latent forward prediction with a Latent Difference Action Decoder (LDAD). Unlike inverse decoders that infer actions from concatenated endpoint embeddings, LDAD reconstructs the executed action from the latent displacement between consecutive observations. This displacement-level supervision directly regularizes transition geometry: adjacent embeddings cannot collapse without losing action information, and different actions are encouraged to induce distinguishable latent changes for rollout-based planning. Delta-JEPA uses only latent prediction and action reconstruction, avoiding pixel reconstruction and distribution-matching regularizers. Across four visual continuous-control tasks, Delta-JEPA improves planning over JEPA-based and representation-learning world model baselines. Ablations show that displacement-based action decoding is consistently more effective than endpoint concatenation, and action-sensitivity analyses show clearer action-conditioned latent responses. These results indicate that supervising latent differences is a simple and effective mechanism for collapse-resistant and action-sensitive world model learning.

## Introduction

Building agents that can infer environment dynamics and predict future states directly from raw sensory observations remains a central goal in artificial intelligence (Ha and Schmidhuber [2018a](https://arxiv.org/html/2606.31232#bib.bib4), [b](https://arxiv.org/html/2606.31232#bib.bib5)). World models address this goal by learning an internal “imagination space” in which future outcomes can be forecast under candidate actions, thereby supporting planning and control (Hafner et al. [2019a](https://arxiv.org/html/2606.31232#bib.bib6); Wu et al. [2023](https://arxiv.org/html/2606.31232#bib.bib7)). Early world models often relied on pixel-space reconstruction (Hafner et al. [2019b](https://arxiv.org/html/2606.31232#bib.bib8)), but reconstructing high-dimensional observations is computationally expensive and can waste model capacity on visually detailed but dynamics-irrelevant information (Assran et al. [2023](https://arxiv.org/html/2606.31232#bib.bib9), [2025](https://arxiv.org/html/2606.31232#bib.bib10); Hauri and Zenke [2026](https://arxiv.org/html/2606.31232#bib.bib11)). This makes reconstruction-free latent prediction an attractive alternative.

Joint Embedding Predictive Architectures (JEPA) (Assran et al. [2023](https://arxiv.org/html/2606.31232#bib.bib9)) offer a particularly appealing foundation for latent world modeling because they directly predict compact future representations rather than future pixels. However, this efficiency introduces a major challenge: when trained end-to-end with only latent prediction objectives, JEPA-based world models can easily collapse to trivial constant representations (Maes et al. [2026](https://arxiv.org/html/2606.31232#bib.bib2)). In that case, the model achieves deceptively low prediction loss while destroying the representation structure needed for planning.

Existing approaches typically address collapse through additional training heuristics, though these designs involve different tradeoffs. LeWorldModel (Maes et al. [2026](https://arxiv.org/html/2606.31232#bib.bib2)), for example, uses SigReg (Balestriero and LeCun [2025](https://arxiv.org/html/2606.31232#bib.bib13)) to stabilize end-to-end latent prediction, but it does not explicitly constrain the latent space to be sensitive to executed actions, allowing different actions to induce weakly distinguishable latent transitions. PLDM (Sobal et al. [2026](https://arxiv.org/html/2606.31232#bib.bib14)) instead combines VICReg-style regularization with inverse dynamics, yielding a more complex multi-loss objective that is sensitive to hyperparameter tuning. Moreover, its inverse dynamics module decodes actions from concatenated adjacent latent states $[z_t, z_{t+1}]$. Because the forward predictor is itself conditioned on the executed action, end-to-end optimization can make the next-state representation $z_{t+1}$ absorb action-correlated cues that are easy for the inverse decoder to exploit, without requiring the model to represent the actual transition between the two states.

To address these issues, we propose Delta-JEPA, an end-to-end latent world model built around the Latent Difference Action Decoder (LDAD). Instead of reconstructing actions from concatenated latent states, LDAD predicts the executed action from the latent difference $\Delta z_t = z_{t+1} - z_t$. This displacement-level inverse objective encourages action-sensitive latent dynamics that are crucial for planning: if different actions from the same latent state lead to indistinguishable next embeddings, the world model cannot represent action-controllable next-state transitions, making latent rollouts uninformative for planning. Conversely, a latent representation is more controllable when different actions from the same state induce distinguishable next-state embeddings. By requiring the action to be recovered from $\Delta z_t$, LDAD encourages different actions to induce distinguishable latent displacements and next-state embeddings, while discouraging action prediction from relying on state-specific cues rather than the transition itself.

Delta\-JEPA trains this mechanism with a simple two\-objective scheme: latent forward prediction models future representations under actions, while LDAD makes action\-induced latent displacements predictive of their actions\. This design is particularly important for planning, where candidate action sequences are evaluated through latent rollouts and the model must distinguish how alternative actions drive the environment forward\. Empirically, we show that Delta\-JEPA improves planning performance and learns more action\-sensitive latent transition structure across diverse continuous\-control tasks\.

The main contributions of this work are summarized as follows:

- Action-Sensitive Latent Dynamics: We introduce LDAD, a displacement-based inverse objective that mitigates collapse by enforcing action-distinguishable latent transitions.
- Two-Objective Training Framework: We develop Delta-JEPA, an end-to-end latent world model trained only with latent forward prediction and LDAD-based action reconstruction.
- Empirical Validation: We evaluate Delta-JEPA on diverse continuous-control tasks and show improved planning performance together with stronger action-sensitive latent dynamics.

## Related Work

### Latent World Models

World models learn compact predictive models of environment dynamics that support planning and control from high-dimensional observations (Ha and Schmidhuber [2018b](https://arxiv.org/html/2606.31232#bib.bib5), [a](https://arxiv.org/html/2606.31232#bib.bib4)). A prominent line of work builds latent dynamics models for visual control, including PlaNet (Hafner et al. [2019b](https://arxiv.org/html/2606.31232#bib.bib8)), Dreamer (Hafner et al. [2019a](https://arxiv.org/html/2606.31232#bib.bib6)), and DreamerV3 (Hafner et al. [2023](https://arxiv.org/html/2606.31232#bib.bib15)), which encode pixels into latent states and use imagined rollouts for planning or policy learning. These methods demonstrate the effectiveness of latent imagination, but they commonly rely on reconstruction or reward-driven objectives. This motivates reconstruction-free latent world models that directly predict compact representations and focus model capacity on control-relevant state changes.

### Joint Embedding Predictive Architectures

Joint Embedding Predictive Architectures (JEPA) were proposed as non-generative predictive models that compare predictions in representation space rather than input space (LeCun and others [2022](https://arxiv.org/html/2606.31232#bib.bib16)). I-JEPA instantiates this idea for images by predicting masked target embeddings from context embeddings (Assran et al. [2023](https://arxiv.org/html/2606.31232#bib.bib9)), while V-JEPA extends feature prediction to videos and learns spatiotemporal representations without labels, text supervision, or pixel reconstruction (Bardes et al. [2024](https://arxiv.org/html/2606.31232#bib.bib17)). For world model learning, JEPA is attractive because planning requires accurate predictions of how different actions lead to different future states, rather than photorealistic observation synthesis. However, end-to-end JEPA training with only latent prediction losses can admit trivial constant representations, making collapse prevention a central design issue.

### Collapse Prevention and Inverse Dynamics

Recent JEPA-based world models introduce additional constraints to avoid feature collapse. DINO-WM (Zhou et al. [2025](https://arxiv.org/html/2606.31232#bib.bib1)) stabilizes latent dynamics learning by using frozen DINOv2 visual features (Oquab et al. [2023](https://arxiv.org/html/2606.31232#bib.bib12)), but this limits task-specific adaptation of the representation. LeWorldModel trains end-to-end with a SigReg-style Gaussian regularizer to encourage non-collapsed latent features (Maes et al. [2026](https://arxiv.org/html/2606.31232#bib.bib2); Balestriero and LeCun [2025](https://arxiv.org/html/2606.31232#bib.bib13)). PLDM combines predictive learning with VICReg-style regularization and inverse dynamics (Sobal et al. [2026](https://arxiv.org/html/2606.31232#bib.bib14); Bardes et al. [2021](https://arxiv.org/html/2606.31232#bib.bib18)), but its action decoder operates on concatenated state embeddings, which can allow action-correlated endpoint cues to support inverse prediction without strongly constraining the transition itself. In contrast, Delta-JEPA applies inverse dynamics directly to latent displacements, using action reconstruction to make action-induced latent differences distinguishable while avoiding frozen encoders and complex multi-term regularization.

![Refer to caption](https://arxiv.org/html/2606.31232v1/framework.png)

Figure 1: Overview of the Delta-JEPA framework. Raw observations $o_t$ and $o_{t+1}$ are mapped to latent representations $z_t$ and $z_{t+1}$ via a shared encoder. In the forward path, the dynamics predictor forecasts the subsequent representation $\hat{z}_{t+1}$ from $z_t$ and the action $a_t$, guided by the prediction loss $\mathcal{L}_{\text{pred}}$. Concurrently, the Latent Difference Action Decoder receives the latent displacement $\Delta z_t$ to reconstruct the action $\hat{a}_t$, supervised by the action loss $\mathcal{L}_{\text{action}}$. This displacement-based action supervision encourages action-induced latent differences to be distinguishable, and the entire framework is optimized end-to-end via $\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda \mathcal{L}_{\text{action}}$.

## Method

### Problem Formulation

Following the standard paradigm of unsupervised latent world models, we focus on the problem of world model learning in a reward-free, offline setting (Maes et al. [2026](https://arxiv.org/html/2606.31232#bib.bib2)). We are given an offline dataset $\mathcal{D} = \{(o_1, a_1, \dots, o_T)\}$ consisting of trajectories with alternating high-dimensional raw image observations $o_t \in \mathbb{R}^{C \times H \times W}$ and continuous actions $a_t \in \mathbb{R}^{d_a}$. Crucially, $\mathcal{D}$ contains no task-specific reward signals and is collected by arbitrary, unknown behavior policies.

Our goal is to learn a compact latent representation space $\mathcal{Z} \subseteq \mathbb{R}^d$ with an action-sensitive latent dynamics predictor, without reconstructing pixels or using task rewards.

### Overview of Delta\-JEPA

As illustrated in Figure [1](https://arxiv.org/html/2606.31232#Sx2.F1), Delta-JEPA consists of two coupled objectives. The latent forward dynamics predictor learns to forecast the next representation from the current representation and action, providing the rollout model required for planning. The Latent Difference Action Decoder (LDAD) adds an inverse-dynamics constraint on the displacement between adjacent latent states, requiring this displacement to recover the action that caused the transition. Together, these objectives train an end-to-end reconstruction-free world model that discourages collapse to action-insensitive representations and promotes action-sensitive next-state predictions.

### Latent Forward Dynamics Predictor

The encoder $f_\theta$ maps each observation $o_t$ to a latent representation $z_t = f_\theta(o_t)$. Conditioned on $z_t$ and action $a_t$, the dynamics predictor $P_\phi$ estimates the next latent state:

$$\hat{z}_{t+1} = P_\phi(z_t, a_t), \tag{1}$$

where $\hat{z}_{t+1}$ represents the predicted next latent state.

We train the encoder and predictor with a mean-squared prediction loss in latent space:

$$\mathcal{L}_{\text{pred}} = \left\| \hat{z}_{t+1} - z_{t+1} \right\|_2^2, \tag{2}$$

where $z_{t+1} = f_\theta(o_{t+1})$ is the target representation produced by the same encoder.

Although Eq\. \([2](https://arxiv.org/html/2606.31232#Sx3.E2)\) enables reconstruction\-free dynamics learning, it is degenerate when used alone: the encoder and predictor can reduce the loss by collapsing to nearly constant representations\. Such a solution preserves little information for planning even if the prediction loss is small\. LDAD addresses this failure mode by adding an action\-grounded constraint on the difference between adjacent latent states\.
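
The degeneracy described above can be made concrete with a toy check. The following is a minimal sketch, not the authors' code: `collapsed_encoder` and `lazy_predictor` are hypothetical stand-ins, and the loss is averaged over a batch, which is an assumption about how Eq. (2) is reduced in practice.

```python
import numpy as np

def pred_loss(z_hat_next, z_next):
    """Batch mean of the squared L2 error in Eq. (2)."""
    return float(np.mean(np.sum((z_hat_next - z_next) ** 2, axis=-1)))

def collapsed_encoder(obs, dim=4):
    """Hypothetical degenerate encoder: every observation maps to the zero vector."""
    return np.zeros((obs.shape[0], dim))

def lazy_predictor(z, a):
    """Hypothetical predictor that ignores the action and echoes its input."""
    return z

rng = np.random.default_rng(0)
obs_t = rng.normal(size=(8, 16))     # toy stand-ins for flattened images
obs_next = rng.normal(size=(8, 16))
actions = rng.normal(size=(8, 2))

z_t = collapsed_encoder(obs_t)
z_next = collapsed_encoder(obs_next)
collapsed_loss = pred_loss(lazy_predictor(z_t, actions), z_next)
displacement = z_next - z_t

print(collapsed_loss)                # 0.0
print(np.abs(displacement).max())    # 0.0
```

The prediction loss is exactly zero even though the encoder discards the observations and the predictor discards the action. The displacement is also zero, which is the quantity the next section supervises.
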

### Latent Difference Action Decoder \(LDAD\)

LDAD imposes an inverse-dynamics constraint on the difference between adjacent latent states. Given two encoded observations $z_t$ and $z_{t+1}$, we define the latent displacement as

$$\Delta z_t = z_{t+1} - z_t. \tag{3}$$

The decoder then predicts the executed action from this displacement:

$$\hat{a}_t = D_\Theta(\Delta z_t), \tag{4}$$

where $D_\Theta$ denotes the action decoder and $\hat{a}_t$ denotes the predicted action. The decoder is trained end-to-end with a mean-squared action reconstruction loss:

$$\mathcal{L}_{\text{action}} = \left\| \hat{a}_t - a_t \right\|_2^2. \tag{5}$$

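
Eqs. (3) to (5) can be traced in a few lines. This is a sketch under stated assumptions: the decoder here is a hypothetical linear map `W` (the paper uses a Transformer), the latents are synthetic, and the loss is averaged over the batch.

```python
import numpy as np

def ldad_loss(z_t, z_next, actions, decoder):
    delta_z = z_next - z_t        # Eq. (3): latent displacement
    a_hat = decoder(delta_z)      # Eq. (4): decode the action from the displacement only
    return float(np.mean(np.sum((a_hat - actions) ** 2, axis=-1)))  # Eq. (5), batch mean

rng = np.random.default_rng(0)
batch, d, d_a = 32, 8, 2
W = rng.normal(size=(d, d_a))     # hypothetical linear decoder weights

def linear_decoder(delta_z):
    return delta_z @ W

z_t = rng.normal(size=(batch, d))
actions = rng.normal(size=(batch, d_a))

# Case 1: each displacement linearly encodes its action, so it is recoverable.
z_next_informative = z_t + actions @ np.linalg.pinv(W)
# Case 2: adjacent embeddings collapse, so every displacement is zero.
z_next_collapsed = z_t.copy()

loss_informative = ldad_loss(z_t, z_next_informative, actions, linear_decoder)
loss_collapsed = ldad_loss(z_t, z_next_collapsed, actions, linear_decoder)
print(loss_informative < 1e-10, loss_collapsed > 0.5)
```

With collapsed neighbors the decoder receives a zero vector for every transition, so the loss cannot fall below the mean squared norm of the actions for this bias-free decoder.
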
![Refer to caption](https://arxiv.org/html/2606.31232v1/path.png)

Figure 2: Illustration of LDAD-induced action-sensitive latent geometry. Without displacement-level action supervision (top left), different actions from the same latent state $z_t$ may produce similar next embeddings. LDAD computes each displacement $\Delta z_t^{(i)} = z_{t+1}^{(i)} - z_t$, decodes the action $\hat{a}_t^{(i)}$, and supervises it with $\mathcal{L}_{\text{action}} = \|\hat{a}_t - a_t\|_2^2$ (bottom). This encourages action-conditioned transitions to occupy distinguishable directions and endpoints in latent space (top right).

#### Action-Supervised Displacement Mechanism.

As illustrated in Figure [2](https://arxiv.org/html/2606.31232#Sx3.F2), the top-left panel shows an action-insensitive latent geometry: different actions from the same $z_t$ can produce nearby next embeddings, making the latent transition difficult to distinguish by action. LDAD addresses this failure mode through the decoding pipeline shown at the bottom. For each transition, it computes the displacement $\Delta z_t^{(i)} = z_{t+1}^{(i)} - z_t$, predicts the corresponding action $\hat{a}_t^{(i)}$, and optimizes the reconstruction loss against the executed action $a_t^{(i)}$. Since the decoder observes only the displacement, successful action recovery requires the local transition geometry to encode the executed action, thereby encouraging action-induced displacements to become distinguishable.

The top-right panel depicts the intended effect of this supervision: different actions induce separated transition directions and next embeddings. This geometry is particularly important for planning. When different candidate actions lead to similar latent endpoints, rollouts provide little evidence for comparing their consequences and can therefore cause the planner to select ambiguous or incorrect actions. By contrast, separated action-conditioned transitions make candidate rollouts more action-controllable and more informative for planning. The Two-Room trajectory visualization in Figure [5](https://arxiv.org/html/2606.31232#Sx4.F5) provides empirical evidence consistent with this mechanism, showing trajectories with nearby initial states progressively separating under Delta-JEPA as their action-conditioned rollouts diverge. Complementarily, the action-response PCA in Figure [6](https://arxiv.org/html/2606.31232#Sx4.F6) directly probes the learned predictor by fixing the starting history and varying only the action input, showing that Delta-JEPA produces clearly separated action-wise responses whereas LeWM remains concentrated near the origin.
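
To illustrate why similar latent endpoints leave a planner without a signal, here is a toy sketch. It assumes a random-shooting planner with a squared-distance cost to a goal embedding; this section does not specify the planner or cost, so both are assumptions, and the two predictors are hypothetical.

```python
import numpy as np

def plan_random_shooting(z0, z_goal, predictor, horizon, n_candidates, d_action, rng):
    """Sample action sequences, roll them out in latent space, score by goal distance."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, d_action))
    z = np.repeat(z0[None, :], n_candidates, axis=0)
    for k in range(horizon):
        z = predictor(z, candidates[:, k, :])          # one latent rollout step
    costs = np.sum((z - z_goal[None, :]) ** 2, axis=-1)
    return candidates[int(np.argmin(costs))], costs

def sensitive_predictor(z, a):
    """Toy action-sensitive dynamics: each action shifts the latent state."""
    return z + a

def insensitive_predictor(z, a):
    """Toy action-insensitive dynamics: the action input is ignored."""
    return z

z0, z_goal = np.zeros(2), np.array([1.0, 1.0])
_, costs_sensitive = plan_random_shooting(
    z0, z_goal, sensitive_predictor, 3, 256, 2, np.random.default_rng(0))
_, costs_insensitive = plan_random_shooting(
    z0, z_goal, insensitive_predictor, 3, 256, 2, np.random.default_rng(0))

spread_sensitive = costs_sensitive.max() - costs_sensitive.min()
spread_insensitive = costs_insensitive.max() - costs_insensitive.min()
print(spread_insensitive)   # 0.0
```

Under the insensitive predictor every candidate receives the same cost, so the selected sequence is arbitrary. Under the sensitive predictor the costs differ and the ranking carries information.
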

#### Effects of Displacement\-Based Action Decoding\.

The displacement\-based inverse objective affects the learned representation in three ways:

1. Anti-Collapse Effect. The action reconstruction objective discourages the encoder from mapping consecutive observations to nearly identical latent vectors. If adjacent observations collapse, then $\Delta z_t$ becomes uninformative and $D_\Theta$ cannot recover the executed action.
2. Reducing Dependence on Single-State Cues. A standard inverse dynamics decoder predicts actions from concatenated latent states, $\hat{a}_t = D_\Theta([z_t, z_{t+1}])$. In our setting, this formulation can admit shortcuts: because the forward predictor receives $a_t$ when predicting $\hat{z}_{t+1}$, the learned target representation $z_{t+1}$ may contain action-correlated cues that allow the inverse decoder to recover $a_t$ without strongly modeling the transition itself. LDAD reduces this risk by conditioning the decoder only on the relative displacement $\Delta z_t$, so action reconstruction must be supported by the change between adjacent latent states rather than by state-specific cues.
3. Action-Sensitive Latent Dynamics for Planning. For planning, the latent representation must support action-conditioned latent rollouts. LDAD encourages different actions from the same latent state to produce distinguishable latent displacements and next-state embeddings. As a result, candidate actions can be compared through the distinct latent rollouts they induce, providing more informative predictions for action selection.
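
The shortcut in the second item can be illustrated with a constructed toy, which is not taken from the paper's experiments. It assumes a one-dimensional latent whose only content is the action that led into each state, i.e. $z_{t+1}$ stores $a_t$, and fits linear decoders by least squares on the concatenated input and on the displacement.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
a = rng.normal(size=(T + 1, 1))   # a[k] is the action executed at step k

# Hypothetical "endpoint cue" latent: z[k+1] simply stores a[k].
z = np.zeros((T + 2, 1))
z[1:] = a

# Transitions t = 1..T: z_t holds a[t-1], z_next holds a[t], target is a[t].
z_t, z_next, a_t = z[1:T + 1], z[2:T + 2], a[1:T + 1]

def least_squares_mse(X, Y):
    """Fit Y ~ X with a bias term and return the training mean squared error."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return float(np.mean((X1 @ W - Y) ** 2))

mse_concat = least_squares_mse(np.hstack([z_t, z_next]), a_t)  # input [z_t, z_{t+1}]
mse_delta = least_squares_mse(z_next - z_t, a_t)               # input z_{t+1} - z_t
print(mse_concat < 1e-10, mse_delta > 0.3)
```

The concatenated decoder reads the action straight off $z_{t+1}$ and reaches near-zero error without using $z_t$ at all. The displacement decoder only sees $a_t - a_{t-1}$ in this toy, so the same cue no longer suffices and the latent would have to encode the transition differently to drive the loss down.
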

### Multi\-Step Action Decoding

We implement $D_\Theta$ with a Transformer backbone and extend LDAD to multi-step action decoding to capture longer-horizon temporal structure. Given a horizon $N \geq 1$, the decoder reconstructs the sequence of actions spanning the interval from $t$ to $t+N$ using the long-horizon latent displacement:

$$\{\hat{a}_\tau\}_{\tau=t}^{t+N-1} = D_\Theta(z_{t+N} - z_t). \tag{6}$$

The multi-step LDAD action decoder uses a Transformer with $N$ learnable action queries. The displacement $z_{t+N} - z_t$ is injected into each query through Adaptive Layer Normalization (AdaLN), after which the Transformer layers produce the $N$ reconstructed continuous actions. This multi-step extension imposes an action-grounded dynamics constraint over longer temporal intervals in latent space.
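
The data flow of Eq. (6) can be sketched as follows. This is a shape-level illustration only: the attention and feed-forward layers are omitted, the weights are random rather than learned, and the scale-and-shift form of AdaLN is an assumption, since the text does not spell out the modulation. The class name and all dimensions are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class MultiStepDecoderSketch:
    """N action queries modulated by the displacement z_{t+N} - z_t."""

    def __init__(self, d_latent, d_model, d_action, horizon, seed=0):
        rng = np.random.default_rng(seed)
        self.queries = rng.normal(scale=0.02, size=(horizon, d_model))     # N queries
        self.W_mod = rng.normal(scale=0.02, size=(d_latent, 2 * d_model))  # -> (scale, shift)
        self.W_out = rng.normal(scale=0.02, size=(d_model, d_action))      # action head

    def __call__(self, z_t, z_t_plus_n):
        delta = z_t_plus_n - z_t                                  # (batch, d_latent)
        scale, shift = np.split(delta @ self.W_mod, 2, axis=-1)   # each (batch, d_model)
        q = layer_norm(self.queries)[None, :, :]                  # (1, N, d_model)
        h = q * (1.0 + scale[:, None, :]) + shift[:, None, :]     # assumed AdaLN form
        # Transformer layers over the N queries would go here (omitted in this sketch).
        return h @ self.W_out                                     # (batch, N, d_action)

rng = np.random.default_rng(1)
decoder = MultiStepDecoderSketch(d_latent=16, d_model=32, d_action=2, horizon=5)
z_t = rng.normal(size=(4, 16))
z_t_plus_n = rng.normal(size=(4, 16))
a_hat_seq = decoder(z_t, z_t_plus_n)
print(a_hat_seq.shape)   # (4, 5, 2)
```

The only path from the latents to the output is the displacement, so shifting both endpoints by the same vector leaves the decoded actions unchanged.
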

### Joint Optimization and End\-to\-End Training

The overall training objective combines the forward prediction loss and the action reconstruction loss:

$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda \mathcal{L}_{\text{action}}, \tag{7}$$

where $\lambda > 0$ is a balancing hyperparameter.

Delta\-JEPA uses only two objectives: latent prediction learns action\-conditioned dynamics, and action reconstruction makes local latent transitions action\-sensitive\. It requires no frozen encoders, stop\-gradient branches, or distribution\-matching regularizers\.
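
As a worked instance of Eq. (7), the sketch below combines the two terms on hand-picked values. The function name is hypothetical, the batch-mean reduction is an assumption, and the default weight of 10.0 is the value reported for the planning experiments.

```python
import numpy as np

def delta_jepa_loss(z_hat_next, z_next, a_hat, a, lam=10.0):
    """Return (total, L_pred, L_action) following Eq. (7)."""
    l_pred = float(np.mean(np.sum((z_hat_next - z_next) ** 2, axis=-1)))
    l_action = float(np.mean(np.sum((a_hat - a) ** 2, axis=-1)))
    return l_pred + lam * l_action, l_pred, l_action

# One transition with a 2-d latent and a 1-d action.
z_hat_next = np.array([[1.0, 0.0]])
z_next = np.array([[0.0, 0.0]])
a_hat = np.array([[0.5]])
a = np.array([[0.0]])

total, l_pred, l_action = delta_jepa_loss(z_hat_next, z_next, a_hat, a)
print(total, l_pred, l_action)   # 3.5 1.0 0.25
```

Here the prediction error contributes 1.0 and the action error contributes 10.0 × 0.25 = 2.5, so the action term dominates at this weight even though its raw value is smaller.
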

Table 1:Planning success rate \(%, higher is better\) on four continuous\-control environments\. Bold numbers indicate the best performance in each environment\.

## Experiments

### Experimental Setup

Environments. We evaluate Delta-JEPA on four diverse continuous-control tasks:

- Push-T (Chi et al. [2025](https://arxiv.org/html/2606.31232#bib.bib21)): A 2D non-prehensile manipulation task in which the agent pushes a T-shaped object to a target pose through physical contact.
- Reacher (Tassa et al. [2018](https://arxiv.org/html/2606.31232#bib.bib19)): A continuous-control task in which the agent controls a two-link planar robotic arm to reach a randomly spawned target.
- Cube (Park et al. [2025](https://arxiv.org/html/2606.31232#bib.bib20)): A 3D robotic manipulation task in which the agent controls a gripper to relocate a cube to a target 3D position.
- Two-Room (Zhou et al. [2025](https://arxiv.org/html/2606.31232#bib.bib1)): A 2D continuous-navigation task in which the agent navigates through a two-room maze to a designated target point.

Baselines. We compare Delta-JEPA with several state-of-the-art JEPA-based and representation-learning world models:

- LeWorldModel (LeWM) (Maes et al. [2026](https://arxiv.org/html/2606.31232#bib.bib2)): Our primary baseline and foundation, which combines next-step latent representation prediction with Gaussian latent-space regularization to enable stable end-to-end JEPA training directly from raw pixels.
- Sub-JEPA (Zhao et al. [2026](https://arxiv.org/html/2606.31232#bib.bib22)): An extension of LeWM that introduces subspace Gaussian regularization to further improve training stability and representation quality.
- PLDM (Sobal et al. [2026](https://arxiv.org/html/2606.31232#bib.bib14)): An end-to-end pixel-based world model that relies on a compound objective comprising VICReg, inverse dynamics, and temporal smoothness terms, making hyperparameter tuning highly cumbersome.

Implementation Details. To ensure a fair comparison, we keep the evaluation protocol and the network architectures of our encoder and predictor consistent with those of LeWM. Specifically, the visual encoder $f_\theta$ is instantiated as a randomly initialized ViT-Tiny. The dynamics predictor is parameterized as a 6-layer causal Transformer (16 attention heads, a head dimension of 64, and an MLP hidden dimension of 2048), where action-conditioning features are injected through Adaptive Layer Normalization for state prediction. To minimize the computational overhead of action decoding, we implement the action decoder as a lightweight 3-layer non-causal Transformer with $N = 5$ learnable action queries, 8 attention heads, a head dimension of 64, and an FFN hidden dimension of 512.

### Planning Performance

We first report planning success rates under an evaluation protocol consistent with LeWM. Delta-JEPA, PLDM, Sub-JEPA, and LeWM are all trained from scratch for 50 epochs. Delta-JEPA is optimized with a learning rate of $5 \times 10^{-5}$, and the action reconstruction weight $\lambda$ is set to $10.0$. For PLDM, Sub-JEPA, and LeWM, we follow their respective official training configurations. We randomly sample 50 and 500 trajectories from each environment to construct the validation and test sets, respectively. All methods reported in Table [1](https://arxiv.org/html/2606.31232#Sx3.T1) are independently evaluated over 3 random seeds, and the table summarizes their mean planning success rates on the test set.

Delta\-JEPA achieves the highest mean planning success rate across all four environments\. The improvement is most pronounced on OGB\-Cube, where Delta\-JEPA exceeds the strongest baseline by 15\.14 percentage points, and on Two\-Room, where it improves over PLDM by 6\.27 points\. On Push\-T, Delta\-JEPA improves over LeWM by 4\.54 points, indicating that LDAD benefits contact\-rich manipulation\. On Reacher, where Sub\-JEPA already performs strongly, Delta\-JEPA still obtains the best mean result with a smaller margin\. Overall, these results suggest that action reconstruction from latent displacements helps the predictor distinguish action\-dependent outcomes, leading to stronger planning performance across navigation and manipulation tasks\.

### Ablation Study

#### Action Reconstruction Weight\.

To evaluate the impact of the proposed LDAD, we conduct a sensitivity analysis of the action reconstruction weight $\lambda$ in the Push-T environment. Specifically, we vary $\lambda$ over the candidate set $\{0, 0.1, 1.0, 10.0, 20.0, 50.0, 100.0, 1000.0\}$. As shown in Figure [3](https://arxiv.org/html/2606.31232#Sx4.F3), setting $\lambda = 0$ removes LDAD entirely, and the resulting model nearly collapses, yielding only a negligible planning success rate. When $\lambda = 0.1$, the LDAD signal remains too weak to provide effective regularization, and the model still performs poorly. In contrast, once $\lambda$ falls within a reasonable range, the planning performance becomes substantially higher and remains relatively stable, with the best result obtained at $\lambda = 50.0$. Performance degrades again when the action reconstruction weight is excessively large.

![Refer to caption](https://arxiv.org/html/2606.31232v1/action_weight_ablation.png)

Figure 3: Sensitivity of Push-T planning success to the action reconstruction weight $\lambda$. The curve reports the mean success rate over 3 runs, and the peak performance is highlighted.

#### Displacement-Based Action Decoding.

To evaluate whether displacement-based action decoding improves downstream planning, we compare LDAD with a variant that reconstructs actions from the concatenated endpoint embeddings $[z_t, z_{t+1}]$ instead of the displacement $\Delta z_t = z_{t+1} - z_t$. Both variants use the same training and evaluation protocol and differ only in the action-decoder input, allowing us to isolate how this design choice affects planning success.

Table 2: Ablation of the action-decoder input representation. The concat variant decodes actions from $[z_t, z_{t+1}]$, whereas LDAD decodes actions from $\Delta z_t = z_{t+1} - z_t$. Values are planning success rates (%) over three seeds.

As shown in Table [2](https://arxiv.org/html/2606.31232#Sx4.T2), using $\Delta z_t$ as the action-decoder input improves planning success on all four environments. The gain is largest on Push-T (+12.60 points), followed by Two-Room (+4.07 points), while Reacher and OGB-Cube show smaller but consistent improvements. These results indicate that, under the same planning protocol, reconstructing actions from latent displacements provides a more effective training signal for action-conditioned rollouts than reconstructing actions from concatenated endpoint embeddings.

#### LDAD Decoding Target\.

Table 3: Ablation of LDAD decoding targets on Reacher.

We further ablate the decoding target used by LDAD on Reacher. Besides the raw action $a_t$, we replace the action reconstruction target with state-delta proxies derived from the agent state, including $\Delta$ finger position, $\Delta$ joint position, and their concatenation. As shown in Table [3](https://arxiv.org/html/2606.31232#Sx4.T3), decoding raw actions performs best, while $\Delta$ joint position achieves comparable performance and substantially outperforms $\Delta$ finger position. This suggests that LDAD benefits from targets that are tightly aligned with the controllable transition structure of the agent. Notably, concatenating $\Delta$ finger position with $\Delta$ joint position does not further improve performance, indicating that adding extra state-change signals may introduce redundant or less action-aligned information rather than strengthening the displacement supervision.

### Latent Diversity and Collapse Prevention

![Refer to caption](https://arxiv.org/html/2606.31232v1/pca_vision_feat.png)

Figure 4: Evolution of the learned latent space on Push-T visualized by PCA on 2000 latent representations.

To qualitatively examine the structure of the learned latent space on Push-T, we apply Principal Component Analysis (PCA) (Abdi and Williams [2010](https://arxiv.org/html/2606.31232#bib.bib23)) to 2000 latent representations extracted by the encoder. Figure [4](https://arxiv.org/html/2606.31232#Sx4.F4) presents the resulting projections at epochs 1, 4, 7, and 10. In the early stage of training, the representations are concentrated within a relatively compact region, suggesting limited latent diversity. As training progresses, they gradually expand over a broader region and form more discernible structures. This trend indicates that Delta-JEPA mitigates representation collapse and learns increasingly discriminative features.
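
A projection of this kind can be computed with a plain SVD. The sketch below is an assumption about the procedure rather than the authors' script: the latents are synthetic, the latent dimension of 32 is hypothetical, and only the sample count of 2000 comes from the text.

```python
import numpy as np

def pca_project(latents, n_components=2):
    """Project rows of `latents` onto their top principal components via SVD."""
    centered = latents - latents.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = (s[:n_components] ** 2) / np.sum(s ** 2)   # variance ratio per component
    return projected, explained

rng = np.random.default_rng(0)
# Synthetic stand-in for 2000 encoder outputs with unequal variance per dimension.
latents = rng.normal(size=(2000, 32)) * np.linspace(3.0, 0.1, 32)

projected, explained = pca_project(latents)
spread = projected.std(axis=0)   # small values on every axis would suggest collapse
print(projected.shape)           # (2000, 2)
```

Tracking `spread` across checkpoints gives a numeric companion to the scatter plots: a spread that stays close to zero on all components indicates near-constant representations.
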

### Action\-Sensitive Latent Dynamics

![Refer to caption](https://arxiv.org/html/2606.31232v1/delta-JEPA_lewm_two_trace_stacked.png)

Figure 5: PCA visualization of two Two-Room latent trajectories with nearby initial states but different endpoints, shown across training epochs for Delta-JEPA (top) and LeWM (bottom). Blue and orange denote the two trajectories, and color intensity indicates temporal progression from early states (light) to later states (dark).

We further compare Delta-JEPA and LeWM on two Two-Room trajectories selected to have nearby initial states but different endpoints, as shown in Figure [5](https://arxiv.org/html/2606.31232#Sx4.F5). Each point denotes a latent representation; blue and orange indicate the two trajectories, and darker colors correspond to later timesteps. Delta-JEPA exhibits clear temporal compositionality: the two trajectories start close in latent space and gradually separate as their action-conditioned rollouts diverge. This behavior is consistent with the LDAD mechanism, which encourages latent displacements to preserve action-dependent transition information. LeWM, by contrast, separates some features but produces a less organized geometry, with trajectories that are more scattered and less clearly aligned with temporal progression or action-controllable rollout structure.

![Refer to caption](https://arxiv.org/html/2606.31232v1/mean_delta_pca.png)

Figure 6: PCA visualization of action-conditioned predictor responses on Two-Room. We sample 512 starting histories and keep the history representation $z_t$ fixed while replacing the final action with each candidate action. For each candidate action, we visualize the predicted displacement relative to the zero-action prediction. Each translucent point corresponds to one starting history under one candidate action, and each numbered marker shows the mean response of that candidate action across all 512 histories.

To directly test whether the predictor responds consistently to action changes, we sample 512 Two-Room starting histories and keep each history representation fixed while varying only the final action input. For each candidate action $a$, we compute the predicted next representation $\hat{z}_{t+1}(a)$ and measure its displacement relative to the zero-action prediction, $\hat{z}_{t+1}(a) - \hat{z}_{t+1}(0)$. Figure [6](https://arxiv.org/html/2606.31232#Sx4.F6) shows a zoomed view centered on the zero-action response, making the action-wise mean markers easier to distinguish. Delta-JEPA produces well-separated action-wise mean responses, with larger action magnitudes generally inducing larger predicted shifts. In contrast, LeWM's action-wise means remain concentrated near the origin and substantially overlap, indicating that changing the action does not induce a stable directional change in its prediction. These results show that Delta-JEPA learns predictor dynamics that are more consistently conditioned on the action input.
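The response measurement described above can be sketched as below. This is a hedged illustration rather than the authors' code: `toy_predict` is a hypothetical linear stand-in for the learned predictor, and the candidate actions, latent dimension, and action pathway `W_a` are assumptions chosen so the result can be checked by hand.

```python
import numpy as np

def action_responses(predict, z_hist, candidate_actions):
    """Return per-history responses predict(z, a) - predict(z, 0) and per-action means."""
    n = z_hist.shape[0]
    zero_batch = np.zeros((n, candidate_actions.shape[1]))
    baseline = predict(z_hist, zero_batch)                # zero-action prediction
    responses = np.stack([
        predict(z_hist, np.tile(a, (n, 1))) - baseline    # shift caused by action a
        for a in candidate_actions
    ])                                                    # (num_actions, n, D)
    return responses, responses.mean(axis=1)

rng = np.random.default_rng(0)
D, A = 16, 2
W_a = rng.normal(size=(A, D))            # hypothetical action pathway

def toy_predict(z, a):
    """Stand-in for the learned predictor: history latent plus a linear action effect."""
    return z + a @ W_a

z_hist = rng.normal(size=(512, D))       # 512 fixed history representations
candidates = np.array([[0.5, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
responses, means = action_responses(toy_predict, z_hist, candidates)
print(np.linalg.norm(means, axis=1))
```

For the figure, the per-history responses and the per-action means would then be projected to two dimensions with PCA. With this toy predictor, doubling the action magnitude doubles the mean shift, which is the kind of consistent action dependence the analysis looks for.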

![Refer to caption](https://arxiv.org/html/2606.31232v1/attention_maps_combined.png)

Figure 7: Attention rollout visualizations on Push-T (top) and Two-Room (bottom) using intermediate layers 4–6 of the ViT-Tiny encoder. Warmer colors indicate higher attention weights.

### Physical and State-Delta Probing

To evaluate whether the learned representations preserve underlying environment information, we freeze the encoder and train linear and multi-layer perceptron (MLP) probes to decode task-specific ground-truth physical attributes from latent states, including agent, object, and end-effector states. For each task, we sample 20,000 observations and split train/test data by trajectory. Table [4](https://arxiv.org/html/2606.31232#Sx4.T4) reports the Two-Room results, and the full probe results on the remaining environments are provided in Appendix [A.1](https://arxiv.org/html/2606.31232#A1.SS1). Lower MSE and higher $r$ indicate better representational quality.
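A linear version of this probing protocol can be sketched as follows, assuming frozen latents `z`, ground-truth attributes `x`, and a trajectory id per sample. The synthetic data, the 80/20 split, and averaging Pearson $r$ over output dimensions are assumptions made for illustration; the text does not state how $r$ is aggregated or what split ratio is used.

```python
import numpy as np

def split_by_trajectory(traj_ids, test_fraction=0.2, seed=0):
    """Boolean train/test masks such that no trajectory appears in both splits."""
    rng = np.random.default_rng(seed)
    unique_ids = rng.permutation(np.unique(traj_ids))
    n_test = max(1, int(round(test_fraction * len(unique_ids))))
    test_mask = np.isin(traj_ids, unique_ids[:n_test])
    return ~test_mask, test_mask

def _with_bias(z):
    return np.hstack([z, np.ones((z.shape[0], 1))])

def linear_probe(z_train, x_train, z_test, x_test):
    """Fit a least-squares probe on frozen latents; report test MSE and mean Pearson r."""
    W, *_ = np.linalg.lstsq(_with_bias(z_train), x_train, rcond=None)
    pred = _with_bias(z_test) @ W
    mse = float(np.mean((pred - x_test) ** 2))
    r = float(np.mean([np.corrcoef(pred[:, j], x_test[:, j])[0, 1]
                       for j in range(x_test.shape[1])]))
    return mse, r

rng = np.random.default_rng(0)
traj_ids = np.repeat(np.arange(50), 40)          # 50 toy trajectories of 40 steps
x = rng.normal(size=(traj_ids.size, 2))          # placeholder physical state
z = x @ rng.normal(size=(2, 16)) + 0.01 * rng.normal(size=(traj_ids.size, 16))

train, test = split_by_trajectory(traj_ids)
mse, r = linear_probe(z[train], x[train], z[test], x[test])
print(f"test MSE={mse:.4f}, Pearson r={r:.4f}")
```

Splitting at the trajectory level rather than the sample level keeps temporally adjacent, near-duplicate frames out of the test set, which is presumably why the protocol specifies it.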

Table 4: Physical latent probing results on Two-Room. Lower MSE and higher $r$ indicate better representational quality.

We use the same probing protocol as above to evaluate whether latent displacements encode environment changes. Specifically, instead of decoding physical attributes $x_t$ from a single latent state $z_t$, we train probes to predict state changes $\Delta x_t = x_{t+1} - x_t$ from latent displacements $\Delta z_t = z_{t+1} - z_t$. For each task, we sample 20,000 consecutive timestep pairs, split train/test data by trajectory, and train both linear and MLP probes over three random seeds. Table [5](https://arxiv.org/html/2606.31232#Sx4.T5) reports the Two-Room results, and Appendix [A.2](https://arxiv.org/html/2606.31232#A1.SS2) provides the corresponding results on Push-T, DMC Reacher, and OGB-Cube. Lower MSE and higher $r$ indicate that latent displacements better preserve the direction and magnitude of the corresponding physical or task-state changes.
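One practical detail in building the $(\Delta z_t, \Delta x_t)$ pairs is that differences should not be taken across trajectory boundaries. The helper below is a small sketch of that step, assuming rows are stored in time order with each trajectory contiguous; the function name and data layout are hypothetical.

```python
import numpy as np

def consecutive_deltas(values, traj_ids):
    """Differences between consecutive rows that belong to the same trajectory.

    Assumes rows are time-ordered and each trajectory occupies a contiguous block.
    Returns the deltas and the trajectory id of each pair (for splitting by trajectory).
    """
    same_traj = traj_ids[1:] == traj_ids[:-1]
    deltas = (values[1:] - values[:-1])[same_traj]
    return deltas, traj_ids[:-1][same_traj]

traj_ids = np.array([0, 0, 0, 1, 1])
x = np.array([[0.0], [1.0], [3.0], [10.0], [14.0]])   # toy physical state
z = 2.0 * x                                           # toy latent, for illustration only

dx, pair_ids = consecutive_deltas(x, traj_ids)
dz, _ = consecutive_deltas(z, traj_ids)
print(dx.ravel(), dz.ravel(), pair_ids)
```

The jump from 3.0 to 10.0 between trajectory 0 and trajectory 1 is dropped, leaving three valid pairs whose ids can be reused for the trajectory-level train/test split.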

Table 5: State-delta probing results on Two-Room. The probe predicts $\Delta x_t = x_{t+1} - x_t$ from $\Delta z_t = z_{t+1} - z_t$. Lower MSE and higher $r$ indicate better alignment between latent displacements and physical state changes.

### Task-Relevant Attention Patterns

To further assess the interpretability of the learned latent representations, we visualize the self-attention patterns of the ViT-Tiny encoder on the Push-T and Two-Room tasks. We employ attention rollout on intermediate transformer blocks and report heatmaps from layers 4–6, where object-related cues are expected to be captured before being integrated into higher-level task representations. As shown in Figure [7](https://arxiv.org/html/2606.31232#Sx4.F7), although the model is trained without dense pixel-level reconstruction or explicit object-level supervision, the attention maps concentrate on task-relevant regions, including the agent and the T-shaped block, while assigning relatively low attention to background areas. Additional layer-wise attention visualizations in Appendix [A.3](https://arxiv.org/html/2606.31232#A1.SS3) further show that different encoder layers can emphasize different task-relevant entities. Together, these qualitative results suggest that Delta-JEPA learns compact representations that preserve physically meaningful and object-centric visual structure across environments.
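For readers unfamiliar with attention rollout, a minimal NumPy sketch of the standard procedure is given below: average the heads, mix in the identity to account for residual connections, renormalize rows, and multiply the per-layer matrices. Restricting the product to three layers mirrors the "layers 4–6" description, but the presence of a class token, the 4x4 patch grid, and the random attention weights are assumptions for illustration, not details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_rollout(attn_layers):
    """Roll out attention over a list of (heads, tokens, tokens) attention arrays."""
    n_tokens = attn_layers[0].shape[-1]
    rollout = np.eye(n_tokens)
    for attn in attn_layers:
        a = attn.mean(axis=0)                        # fuse heads by averaging
        a = 0.5 * a + 0.5 * np.eye(n_tokens)         # account for the residual path
        a = a / a.sum(axis=-1, keepdims=True)        # keep rows normalized
        rollout = a @ rollout                        # compose with earlier layers
    return rollout

rng = np.random.default_rng(0)
n_heads, n_tokens = 3, 1 + 16                        # assumed: class token + 4x4 patches
# Placeholder attention for three consecutive layers (standing in for layers 4-6).
attn_layers = [softmax(rng.normal(size=(n_heads, n_tokens, n_tokens)))
               for _ in range(3)]

rollout = attention_rollout(attn_layers)
heatmap = rollout[0, 1:].reshape(4, 4)               # class-token attention over patches
heatmap = heatmap / heatmap.max()                    # scale to [0, 1] for display
print(heatmap.round(2))
```

In practice the heatmap would be upsampled to the image resolution and overlaid on the input frame. If the encoder has no class token, one common alternative is to average the rollout rows over all patch tokens instead.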

## Conclusion

We proposed Delta-JEPA, a reconstruction-free latent world model that uses Latent Difference Action Decoding to supervise action information directly in latent displacements. By reconstructing actions from $\Delta z_t = z_{t+1} - z_t$, Delta-JEPA discourages collapse and encourages different actions to induce distinguishable latent transitions for planning, while retaining a simple objective that combines latent prediction with action reconstruction. Experiments across four continuous-control tasks show that Delta-JEPA improves planning performance over JEPA-based and representation-learning baselines, and ablations confirm the advantage of displacement-based decoding over endpoint concatenation. Additional analyses further indicate that the learned representations preserve action-sensitive and physically meaningful transition structure. These results suggest that supervising latent differences is an effective principle for learning compact, collapse-resistant world models for planning.

## References

- H. Abdi and L. J. Williams (2010) Principal component analysis. Wiley interdisciplinary reviews: computational statistics 2(4), pp. 433–459.
- M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas (2023) Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 15619–15629.
- M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, et al. (2025) V-jepa 2: self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985.
- R. Balestriero and Y. LeCun (2025) Lejepa: provable and scalable self-supervised learning without the heuristics. arXiv preprint arXiv:2511.08544.
- A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y. LeCun, M. Assran, and N. Ballas (2024) Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471.
- A. Bardes, J. Ponce, and Y. LeCun (2021) Vicreg: variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.
- C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song (2025) Diffusion policy: visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44(10-11), pp. 1684–1704.
- D. Ha and J. Schmidhuber (2018a) Recurrent world models facilitate policy evolution. Advances in neural information processing systems 31.
- D. Ha and J. Schmidhuber (2018b) World models. eprint arXiv:1803.10122.
- D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi (2019a) Dream to control: learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603.
- D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2019b) Learning latent dynamics for planning from pixels. In International conference on machine learning, pp. 2555–2565.
- D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap (2023) Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104.
- M. Hauri and F. Zenke (2026) Dreamer-cdp: improving reconstruction-free world models via continuous deterministic representation prediction. arXiv preprint arXiv:2603.07083.
- Y. LeCun et al. (2022) A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review 62(1), pp. 1–62.
- L. Maes, Q. L. Lidec, D. Scieur, Y. LeCun, and R. Balestriero (2026) Leworldmodel: stable end-to-end joint-embedding predictive architecture from pixels. arXiv preprint arXiv:2603.19312.
- M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. (2023) Dinov2: learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
- S. Park, K. Frans, B. Eysenbach, and S. Levine (2025) Ogbench: benchmarking offline goal-conditioned rl. In International Conference on Learning Representations, Vol. 2025, pp. 94937–94982.
- U. Sobal, W. Zhang, K. Cho, R. Balestriero, T. G. Rudner, and Y. LeCun (2026) Learning from reward-free offline data: a case for planning with latent dynamics models. Advances in Neural Information Processing Systems 38, pp. 43905–43941.
- Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. (2018) Deepmind control suite. arXiv preprint arXiv:1801.00690.
- P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg (2023) Daydreamer: world models for physical robot learning. In Conference on robot learning, pp. 2226–2240.
- K. Zhao, D. Nie, Y. Lin, Z. Luo, Y. Gu, D. Fan, and D. Zeng (2026) Sub-jepa: subspace gaussian regularization for stable end-to-end world models. arXiv preprint arXiv:2605.09241.
- G. Zhou, H. Pan, Y. Lecun, and L. Pinto (2025) DINO-wm: world models on pre-trained visual features enable zero-shot planning. In International Conference on Machine Learning, pp. 79115–79135.

## Appendix A Additional Probe and Attention Results

This appendix provides the complete diagnostic results that complement the probing and attention analyses in the main text. It is organized into three parts. Appendix [A.1](https://arxiv.org/html/2606.31232#A1.SS1) reports physical state probing results for Push-T, DMC Reacher, and OGB-Cube. Appendix [A.2](https://arxiv.org/html/2606.31232#A1.SS2) reports state-delta probing results on the same environments. Appendix [A.3](https://arxiv.org/html/2606.31232#A1.SS3) provides an additional attention visualization showing layer-wise specialization in the visual encoder.

### A.1 Physical State Probing

This section extends the physical state probing analysis beyond the Two-Room results reported in the main text. For each environment, we freeze the visual encoder and train linear and MLP probes to predict ground-truth physical quantities from latent states. Tables [6](https://arxiv.org/html/2606.31232#A1.T6)–[8](https://arxiv.org/html/2606.31232#A1.T8) report the results for Push-T, DMC Reacher, and OGB-Cube, covering controllable agent states, robot states, and object states. Lower MSE and higher Pearson correlation $r$ indicate that the learned representation preserves more physical information.

Table 6: Physical latent probing results on Push-T. Lower MSE and higher $r$ indicate better representational quality.

Table 7: Physical latent probing results on DMC Reacher. Lower MSE and higher $r$ indicate better representational quality.

Table 8: Physical latent probing results on OGB-Cube. Lower MSE and higher $r$ indicate better representational quality.

### A.2 State-Delta Probing

This section evaluates whether latent displacements encode physical changes between consecutive observations. Instead of predicting state variables $x_t$ from $z_t$, each probe predicts $\Delta x_t = x_{t+1} - x_t$ from $\Delta z_t = z_{t+1} - z_t$. Tables [9](https://arxiv.org/html/2606.31232#A1.T9)–[11](https://arxiv.org/html/2606.31232#A1.T11) report the results for Push-T, DMC Reacher, and OGB-Cube, spanning agent motion, robot motion, end-effector motion, and object motion. This directly tests whether the transition representation preserves the direction and magnitude of environment changes.

Table 9: State-delta probing results on Push-T. The probe predicts $\Delta x_t = x_{t+1} - x_t$ from $\Delta z_t = z_{t+1} - z_t$. Lower MSE and higher $r$ indicate better alignment between latent displacements and physical state changes.

Table 10: State-delta probing results on DMC Reacher. The probe predicts $\Delta x_t = x_{t+1} - x_t$ from $\Delta z_t = z_{t+1} - z_t$. Lower MSE and higher $r$ indicate better alignment between latent displacements and physical state changes.

Table 11: State-delta probing results on OGB-Cube. The probe predicts $\Delta x_t = x_{t+1} - x_t$ from $\Delta z_t = z_{t+1} - z_t$. Lower MSE and higher $r$ indicate better alignment between latent displacements and physical state changes.

### A.3 Layer-Wise Attention Specialization

This section complements the attention rollout visualizations in the main text by examining whether different encoder layers emphasize different task\-relevant entities\. We visualize OGB\-Cube attention maps from two intermediate layers of the same encoder to compare how attention shifts across the visual hierarchy\.

![Refer to caption](https://arxiv.org/html/2606.31232v1/layer_specialization_ogb_cube.png)

Figure 8: Layer-wise specialization of attention maps on OGB-Cube. Layer 5 highlights the target cube, while layer 7 more prominently attends to the robotic gripper. Warmer colors indicate higher attention weights.

As illustrated in Figure [8](https://arxiv.org/html/2606.31232#A1.F8), different intermediate layers emphasize distinct functional components of the same OGB-Cube scenes. Layer 5 primarily attends to the target cube, whereas layer 7 places stronger emphasis on the robotic gripper. These observations indicate that the encoder progressively organizes task-relevant entities across layers, rather than relying on a single undifferentiated saliency pattern.