Causal Object-Centric Models for Planning with Monte Carlo Tree Search

arXiv cs.AI 06/15/26, 04:00 AM Papers
Summary
COMET is a model-based reinforcement learning algorithm that combines a frozen object-centric encoder with a transformer-based world model and Monte Carlo Tree Search, using causal attention to focus on task-relevant objects, achieving higher scores on visual RL benchmarks.
arXiv:2606.14418v1 Announce Type: new Abstract: We introduce COMET (Causal Object-centric Model for Efficient Tree search), a model-based reinforcement learning algorithm that performs Monte Carlo Tree Search in a slot-structured latent space. COMET pairs a frozen unsupervised object-centric encoder with a transformer-based world model, in which actions are bound to objects through a novel action-slot fusion mechanism that is used in slot transition prediction. Policy and value heads use object-causal attention, modulating token interactions by learned per-slot relevance scores so that decision-making concentrates on task-relevant entities. COMET adds an explicit object-level inductive bias to MuZero-style latent planning. Across eight visually and dynamically diverse tasks from the Object-Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom, COMET achieves a higher mean normalized score during the early stages of training compared to object-centric and monolithic baselines.
Cached at: 06/15/26, 09:12 AM
# Causal Object-Centric Models for Planning with Monte Carlo Tree Search
Source: [https://arxiv.org/html/2606.14418](https://arxiv.org/html/2606.14418)
Rodion Vakhitov MIRAI Moscow, Russia vakhitov\.r@miriai\.org &Leonid Ugadiarov CogAILab & MIRAI Moscow, Russia Alexey Skrynnik CogAILab & MIRAI Moscow, Russia &Aleksandr Panov CogAILab & MIRAI Moscow, Russia

###### Abstract

We introduce COMET \(Causal Object\-centric Model for Efficient Tree search\), a model\-based reinforcement learning algorithm that performs Monte Carlo Tree Search in a slot\-structured latent space\. COMET pairs a frozen unsupervised object\-centric encoder with a transformer\-based world model, in which actions are bound to objects through a novel action–slot fusion mechanism that is used in slot transition prediction\. Policy and value heads use object\-causal attention, modulating token interactions by learned per\-slot relevance scores so that decision\-making concentrates on task\-relevant entities\. COMET adds an explicit object\-level inductive bias to MuZero\-style latent planning\. Across eight visually and dynamically diverse tasks from the Object\-Centric Visual RL benchmark, ManiSkill, Robosuite, and VizDoom, COMET achieves a higher mean normalized score during the early stages of training compared to object\-centric and monolithic baselines\.

## 1 Introduction

Humans can reason about the consequences of their actions before acting by mentally simulating past experiences or possible future outcomes\[[36](https://arxiv.org/html/2606.14418#bib.bib9)\]\. Motivated by this ability, world models have been introduced in reinforcement learning \(RL\) as a way to imitate the environment and improve learning efficiency\[[10](https://arxiv.org/html/2606.14418#bib.bib10)\]\. In model\-based reinforcement learning \(MBRL\), an agent learns a model of the environment dynamics and uses it to generate imagined experiences, thereby reducing the need for real\-world interactions\.

MBRL methods have achieved strong performance across a wide range of tasks\. Notable examples include the Dreamer family of algorithms\[[11](https://arxiv.org/html/2606.14418#bib.bib11),[12](https://arxiv.org/html/2606.14418#bib.bib12),[13](https://arxiv.org/html/2606.14418#bib.bib13)\], which employ latent world models for long\-horizon imagination, approaches based on Model Predictive Path Integral \(MPPI\) control for planning\[[14](https://arxiv.org/html/2606.14418#bib.bib15),[15](https://arxiv.org/html/2606.14418#bib.bib16)\], and methods that integrate Monte\-Carlo Tree Search \(MCTS\)\[[5](https://arxiv.org/html/2606.14418#bib.bib58),[33](https://arxiv.org/html/2606.14418#bib.bib60)\]with learned models\[[33](https://arxiv.org/html/2606.14418#bib.bib60)\]\.

Despite this progress, learning accurate world models remains difficult in environments that are high\-dimensional, non\-stationary, and composed of multiple interacting objects\. One of the challenges for visual environments lies in representing observations effectively\. Most existing approaches rely on convolutional neural network \(CNN\) encoders\[[21](https://arxiv.org/html/2606.14418#bib.bib21)\]that produce a single holistic representation of the input image\. However, such representations may fail to capture object\-level structure and interactions, which are often crucial for decision\-making\[[32](https://arxiv.org/html/2606.14418#bib.bib22)\]\. In complex scenes, small but task\-relevant objects, dynamic backgrounds, or many irrelevant entities can significantly degrade agent performance\[[22](https://arxiv.org/html/2606.14418#bib.bib61)\]\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/x1.png)

Figure 1: Object\-centric representations in COMET\. Observation $O_t$ is transformed into a set of object representations $\bar{s}_t$, for which causality scores $\bar{\alpha}_t$ are estimated\. By focusing on the most relevant objects and their interactions, planning can concentrate on task\-relevant elements of the scene during tree search\.

Humans, by contrast, perceive the world as composed of discrete entities such as objects\[[39](https://arxiv.org/html/2606.14418#bib.bib23)\], which enables efficient reasoning and planning\. Object\-centric RL represents the environment as a set of object\-level components, where each component corresponds to an individual object\. When instance segmentation masks are available, object representations can be extracted using CNN encoders; alternatively, supervised segmentation models\[[4](https://arxiv.org/html/2606.14418#bib.bib24),[20](https://arxiv.org/html/2606.14418#bib.bib25),[31](https://arxiv.org/html/2606.14418#bib.bib26)\] can be used, though they require annotated data\. A large body of work instead focuses on unsupervised object\-centric representation learning\[[19](https://arxiv.org/html/2606.14418#bib.bib27),[23](https://arxiv.org/html/2606.14418#bib.bib28),[24](https://arxiv.org/html/2606.14418#bib.bib35),[8](https://arxiv.org/html/2606.14418#bib.bib37),[37](https://arxiv.org/html/2606.14418#bib.bib29),[38](https://arxiv.org/html/2606.14418#bib.bib38),[35](https://arxiv.org/html/2606.14418#bib.bib30),[48](https://arxiv.org/html/2606.14418#bib.bib31),[25](https://arxiv.org/html/2606.14418#bib.bib54),[7](https://arxiv.org/html/2606.14418#bib.bib32)\], which discovers structured representations directly from raw images, making it suitable for reinforcement learning without external supervision\.

Object\-centric MBRL methods that maintain an object\-level world model can explicitly represent object dynamics and interactions, enabling more focused and interpretable decision\-making\. Many real\-world and simulated environments are inherently object\-oriented: scenes consist of multiple objects whose interactions determine the reward\. However, at any given time step, only a small subset of objects typically participates in interactions relevant to the current decision\. For example, in robotic manipulation tasks, a robot often interacts with only one object at a time\. As a result, actions usually affect the state of only a few objects, while the remaining objects are largely irrelevant for the immediate decision\. Motivated by this observation, we hypothesize that explicitly modeling the importance of individual objects for decision\-making can improve policy learning\. To this end, we propose COMET, an object\-centric MBRL algorithm based on MCTS\. In COMET, the world model maintains disentangled latents for object\-centric representations\. The policy and value models use transformer\-based architectures\[[43](https://arxiv.org/html/2606.14418#bib.bib59)\]over these latents, combined with object causal attention mechanisms\. Each network processes object tokens together with a dedicated target token for action or value prediction, while attention is modulated by learned causality scores to emphasize task\-relevant objects\.

In summary, our main contributions are as follows:

- We introduce COMET, an MCTS\-based object\-centric MBRL algorithm that combines frozen object\-level representations with a transformer\-based world model for planning in an object\-structured latent space\.
- We propose a novel action\-object binding mechanism in which actions are fused with object\-centric slots\. This implements a learned binding between actions and objects within a unified transformer backbone and supports object\-centric world modeling as well as policy/value prediction\.
- We evaluate COMET across a diverse set of object\-oriented visual control tasks, including object\-centric benchmark environments and robotic manipulation tasks\. COMET performs consistently across tasks and, on average, achieves higher sample efficiency than both strong monolithic MCTS\-based MBRL methods and object\-centric RL baselines\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/x2.png)

Figure 2: Overview of COMET training\. A frozen slot extractor maps observations into slots, which are processed by a transformer backbone to produce latent representations $h_t^1, h_t^2, \dots, h_t^n$\. These latents, together with a learnable target token, are fed into the policy and value transformers to predict the action distribution or value\. Next, an action embedding is concatenated with each slot independently and passed through a shared MLP projector, producing slot\-conditioned action embeddings $a_t^1, a_t^2, \dots, a_t^n$\. These are processed by a transformer backbone to obtain $z_t^1, z_t^2, \dots, z_t^n$, which are used to predict the next state \(next slots\) and reward\.

## 2 Related Work

### 2\.1 Object\-Centric Representation Learning

A growing line of research focuses on learning structured object\-centric representations directly from raw sensory inputs without manual annotations\. Instead of encoding a scene into a single global vector, these methods decompose observations into sets of entities that can be processed independently\. A key mechanism is Slot Attention\[[24](https://arxiv.org/html/2606.14418#bib.bib35)\], which iteratively assigns a fixed number of latent slots to different parts of the input via competitive cross\-attention\. Subsequent work extends this idea to sequential data\. SAVi\[[18](https://arxiv.org/html/2606.14418#bib.bib36)\]and SAVi\+\+\[[8](https://arxiv.org/html/2606.14418#bib.bib37)\]introduce temporal consistency using motion cues such as optical flow and depth, enabling slots to persist across frames\. Other approaches focus on improving reconstruction quality with more expressive generative models\. SLATE and STEVE\[[38](https://arxiv.org/html/2606.14418#bib.bib38)\]combine discrete latent tokenization \(dVAE\[[42](https://arxiv.org/html/2606.14418#bib.bib39)\]\) with transformer\-based decoders and Slot Attention\-based grouping\. In contrast, DINOSAUR\[[35](https://arxiv.org/html/2606.14418#bib.bib30)\]replaces pixel reconstruction with feature\-level objectives using pretrained DINO\[[1](https://arxiv.org/html/2606.14418#bib.bib41)\]representations to learn semantically meaningful objects\. More recent work Slot Contrast\[[25](https://arxiv.org/html/2606.14418#bib.bib54)\]enforces alignment between slots across time by contrasting corresponding object representations, resulting in more robust tracking and reduced slot ambiguity in dynamic scenes\.

Not all object\-centric models rely on Slot Attention\. Deep Latent Particles \(DLP\)\[[6](https://arxiv.org/html/2606.14418#bib.bib56)\]represent images as low\-dimensional particles that decouple spatial position and appearance\. In a different direction, Artificial Kuramoto Oscillatory Neurons \(AKOrN\)\[[26](https://arxiv.org/html/2606.14418#bib.bib57)\]introduce oscillatory neural dynamics, where synchronized neurons form coherent groups corresponding to objects or parts\.

### 2\.2 Object\-Centric Reinforcement Learning

Recent studies incorporate object\-centric representations into model\-based reinforcement learning to better capture the compositional structure of environments\. COBRA\[[45](https://arxiv.org/html/2606.14418#bib.bib44)\]learns a transition model over latent slots obtained from MONet\[[3](https://arxiv.org/html/2606.14418#bib.bib45)\]and combines it with intrinsic motivation to improve data efficiency\. FOCUS\[[9](https://arxiv.org/html/2606.14418#bib.bib46)\]uses an encoder\-decoder architecture that segments scenes into object\-specific latent variables via learned masks\. OC\-STORM\[[49](https://arxiv.org/html/2606.14418#bib.bib47)\]employs a spatiotemporal transformer to jointly reason over object\-centric and pixel\-level representations for dynamics modeling\. COBRA is limited by the lack of explicit modeling of object interactions, restricting its ability to capture relational dynamics addressed by our method\. In contrast, FOCUS and OC\-STORM rely on annotated segmentation masks, limiting their applicability in fully unsupervised settings\.

Closer to our setting, STICA\[[28](https://arxiv.org/html/2606.14418#bib.bib33)\]proposes an object\-centric model\-based RL framework combining slot\-based representations with transformer\-based world models and decision modules with object causal attention\. SOLD\[[27](https://arxiv.org/html/2606.14418#bib.bib48)\]learns object\-centric latent dynamics directly from pixels without supervision via an action\-conditioned slot\-based dynamics model and a Slot Aggregation Transformer for policy and value learning\. Object\-Centric Dreamer\[[41](https://arxiv.org/html/2606.14418#bib.bib55)\]\(OCDreamer\) extends Dreamer by replacing the RSSM with an object\-centric RSSM and incorporating GNNs to explicitly model object interactions during prediction and control\. Beyond model\-based approaches, object\-centric representations are also used in model\-free RL\. OCRL\[[47](https://arxiv.org/html/2606.14418#bib.bib50)\]integrates a transformer\-based object encoder into PPO\[[34](https://arxiv.org/html/2606.14418#bib.bib51)\], enabling flexible use of different object\-centric features\. Similarly, OC\-CA and OC\-SA\[[40](https://arxiv.org/html/2606.14418#bib.bib52)\]use Slot Attention as a feature extractor and study its generalization across environments\.

## 3 COMET

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/norm_score.png)

Figure 3: Mean Normalized Score \([6](https://arxiv.org/html/2606.14418#S4.E6)\) versus normalized steps\. Normalization parameters are listed in Appendix [A](https://arxiv.org/html/2606.14418#A1)\. Left: normalized score averaged over all considered tasks \(Object Goal, Object Interaction, Object Comparison, Property Comparison, Object Reaching, Block Lifting, Cube Pushing, and Defend The Line\) for all algorithms except SOLD\. Right: normalized score averaged over continuous\-control tasks compatible with SOLD \(Object Reaching, Block Lifting, and Cube Pushing\)\.

COMET is an MCTS\-based object\-centric model\-based RL algorithm that performs planning in a slot\-structured latent space\. The method combines three components: a frozen object\-centric encoder that maps visual observations to object slots, a transformer\-based world model that predicts future slots and rewards, and policy/value heads equipped with object\-causal attention\. Our implementation builds on the LightZero framework\[[29](https://arxiv.org/html/2606.14418#bib.bib43)\] and follows the UniZero training pipeline\[[30](https://arxiv.org/html/2606.14418#bib.bib34)\]\. Unlike UniZero, which operates on monolithic state embeddings, COMET represents each observation as a set of object\-centric slots and performs both dynamics prediction and decision\-making over these slots\. This design introduces an explicit object\-level inductive bias into MuZero\-style latent planning, enabling the model to reason over individual entities and their interactions\. As in UniZero, COMET uses a unified transformer backbone built on a nanoGPT\-based architecture\[[17](https://arxiv.org/html/2606.14418#bib.bib63)\]\.

### 3\.1 Slots Extractor

The slots extractor maps an image observation $o_t$ into a set of object\-centric latent representations\. Specifically, it produces an unordered collection of vectors $\bar{s}_t = \{s_t^1, \dots, s_t^n\}$, where $n$ is a fixed hyperparameter conventionally defined as the maximum number of objects in the scene plus one slot for the background\.

A key challenge of slot\-based architectures, which are trained on static image observations, is that the ordering of slots is not guaranteed to be consistent across time steps due to the permutation\-invariant and stochastic nature of slot attention\. To mitigate this issue and ensure temporal consistency of object representations within an episode, we initialize the slot representations at time $t+1$ using the slots obtained at time $t$\. This encourages stable assignment of slots to underlying objects over time\. In contrast, this issue does not arise in video\-based object\-centric models, where temporal consistency is directly modeled within the architecture\. For example, in Slot Contrast, such consistency is handled within the learning pipeline, and no additional slot\-initialization mechanism is required, as temporal correspondence is learned end\-to\-end through the model design\.

In our approach, slot extractors are pretrained on observations collected using a random policy and remain frozen during reinforcement learning\. We experiment with different object\-centric representation models depending on the task, including SLATE, DINOSAUR, and Slot Contrast\.

### 3\.2 Object\-Centric Token Processing

UniZero uses a transformer backbone based on the nanoGPT architecture\. It processes sequences of state and action embeddings arranged alternately in a single sequence\. In discrete action spaces, actions are represented as learnable embedding vectors, while in continuous action spaces, actions are passed through a two\-layer MLP to produce corresponding embeddings\. The transformer processes the sequence in two stages\. First, the state embedding $z_t$ is fed into the backbone, producing latent $h_t^z$, which is then passed to the decision head to model policy and value\. Next, the action embedding $a_t$ is processed by the same transformer backbone, yielding latent $h_t^a$, which is passed through the dynamics head to predict the future state $\hat{z}_{t+1}$ and reward $\hat{r}_t$\. UniZero employs standard causal attention masking and learnable absolute positional encodings, with the total sequence length bounded by the context size\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/obj_attn.jpg)

Figure 4: Illustration of the attention mask used in the transformer backbone for a setting with two slots per block across three time steps\.

We adapt this architecture for object\-centric representations\. In object\-centric settings, the state is represented as $n$ slots, $\bar{s}_t = \{s_t^1, \dots, s_t^n\}$\. In our approach these slots are fed into the backbone, producing latent representations $\{h_t^1, \dots, h_t^n\}$, which are then passed to the policy and value networks implemented as transformer modules with causal attention, as described in Section [3\.3](https://arxiv.org/html/2606.14418#S3.SS3)\. Predicting the next state $\bar{\hat{s}}_{t+1} = \{\hat{s}_{t+1}^1, \dots, \hat{s}_{t+1}^n\}$ is non\-trivial\. Generating $n$ slots from a single action embedding $a_t$ creates a bottleneck, as all object\-centric latents must be compressed into a single vector\. In our experiments, architectures using this approach perform poorly\. To address this, we concatenate the action embedding with each slot, $\{a_t \oplus s_t^1, \dots, a_t \oplus s_t^n\}$, and pass the resulting vectors through a shared MLP projector\. This produces slot\-conditioned action embeddings $\bar{a}_t = \{a_t^1, \dots, a_t^n\}$, which are fed into the transformer backbone, yielding latents $\bar{z}_t = \{z_t^1, \dots, z_t^n\}$\. These latents are passed through a shared observation MLP head to predict the next\-step object slots $\bar{\hat{s}}_{t+1}$\. For reward prediction, the latents $\bar{z}_t$ are summed into a single vector, which is then processed by a reward head implemented as an MLP\.

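
The action–slot fusion described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the projector depth (two linear layers with a ReLU), the layer sizes, the random initialization, and all function names are hypothetical and are not taken from the COMET implementation.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one vector (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_layer(rng, n_in, n_out):
    """Hypothetical initialization: small uniform weights, zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def fuse_action_with_slots(action_emb, slots, layer1, layer2):
    """Concatenate one action embedding with every slot, then apply a shared MLP."""
    fused = []
    for slot in slots:
        concat = action_emb + slot              # a_t ⊕ s_t^i as list concatenation
        hidden = relu(linear(concat, *layer1))  # same weights for every slot
        fused.append(linear(hidden, *layer2))
    return fused

rng = random.Random(0)
d_action, d_slot, d_model, n_slots = 4, 6, 8, 3  # assumed sizes
layer1 = make_layer(rng, d_action + d_slot, 16)
layer2 = make_layer(rng, 16, d_model)
action = [rng.gauss(0, 1) for _ in range(d_action)]
slots = [[rng.gauss(0, 1) for _ in range(d_slot)] for _ in range(n_slots)]

# One slot-conditioned action token per slot, each of width d_model.
action_tokens = fuse_action_with_slots(action, slots, layer1, layer2)
```

Because the projector is shared, the number of action tokens always equals the number of slots, so the backbone never has to expand a single action vector into $n$ slot predictions.
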
The input to the transformer is the sequence of $\bar{s}_t$ and $\bar{a}_t$, naturally decomposed into blocks, each corresponding to a single time step\. As in UniZero, each block is augmented with a learnable absolute positional encoding, and the transformer’s context size defines the total number of timesteps processed\. In our attention mask, each slot embedding attends to all slots within the same block, all slots from previous blocks, and the action embeddings associated with its position\. Each action embedding attends to itself as well as to all slots in the current and preceding blocks\. The attention mask for the transformer backbone is shown in Figure [4](https://arxiv.org/html/2606.14418#S3.F4)\. We view our slot\-conditioned action embeddings as closely related to the mechanism of soft action–object binding\[[2](https://arxiv.org/html/2606.14418#bib.bib62)\], where each slot is influenced by a version of the source action conditioned on the current object state\. The overall training pipeline and architecture are illustrated in Figure [2](https://arxiv.org/html/2606.14418#S1.F2)\.
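
The masking rules can be made concrete with a small sketch. It rests on two assumptions that the text does not spell out: each block is laid out as its $n$ slot tokens followed by its $n$ action tokens, and "the action embeddings associated with its position" means the action tokens with the same slot index from earlier blocks. The function name and token labels are hypothetical.

```python
def build_attention_mask(num_steps, num_slots):
    """Return (tokens, mask); mask[q][k] is True if query q may attend to key k."""
    tokens = []
    for t in range(num_steps):
        # Assumed layout of one block: all slot tokens, then all action tokens.
        tokens += [("slot", t, i) for i in range(num_slots)]
        tokens += [("action", t, i) for i in range(num_slots)]

    def allowed(query, key):
        q_kind, q_t, q_i = query
        k_kind, k_t, k_i = key
        if k_kind == "slot":
            # Slots of the current and earlier blocks are visible to every token.
            return k_t <= q_t
        if q_kind == "slot":
            # Assumption: a slot sees only earlier action tokens with its own index.
            return k_t < q_t and k_i == q_i
        # An action token sees itself but no other action token.
        return query == key

    mask = [[allowed(q, k) for k in tokens] for q in tokens]
    return tokens, mask

# Same setting as Figure 4: two slots per block, three time steps.
tokens, mask = build_attention_mask(num_steps=3, num_slots=2)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Slots within a block attend to one another in both directions, so the mask is causal across blocks but not within the slot part of a block.
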

### 3\.3 Object\-Causal Attention

The policy and value networks are implemented as a transformer with the modified attention mechanism introduced in STICA\. Alongside the latent representations $h_t^1, \dots, h_t^n$, a learnable target token, specific to the policy or value head, is provided as input; its transformer output is decoded by an MLP to produce the corresponding prediction\. To model causal structure, a causal graph $G$ is defined over three groups of objects \(the target, causal objects, and non\-causal objects\):

$$G = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \tag{1}$$

where $G_{i,j} = 1$ indicates that group $j$ exerts a causal influence on group $i$\. Thus, causal objects influence the target \($G_{1,2} = 1$\) and one another \($G_{2,2} = 1$\), while non\-causal objects influence only themselves \($G_{3,3} = 1$\)\.

Since object causality is not known a priori, a causality score $\alpha_t^k \in [0,1]$ is estimated for each latent $h_t^k$, denoting the probability that the corresponding object is relevant for the policy or value prediction\.

The scores define a soft assignment of every token to the three groups, collected in the matrix

$$W_t = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \alpha_t^1 & 1 - \alpha_t^1 \\ \vdots & \vdots & \vdots \\ 0 & \alpha_t^n & 1 - \alpha_t^n \end{bmatrix}, \tag{2}$$

whose first row corresponds to the target token and remaining rows to the latent object tokens\. The product $W_t G W_t^\top$ lifts $G$ to token\-level interactions, encoding the strength of causal influence between every pair of tokens, and is used to modulate scaled dot\-product attention:

$$\mathrm{CA}_t = \mathrm{Norm}\!\left(\mathrm{softmax}\!\left(\frac{Q_t K_t^\top}{\sqrt{d}}\right) \odot W_t G W_t^\top\right) V_t, \tag{3}$$

where $Q_t$, $K_t$, $V_t$ are the query, key, and value matrices, $d$ is the key dimensionality, $\odot$ is element\-wise multiplication, and $\mathrm{Norm}(\cdot)$ denotes row\-wise normalization\. The mechanism thereby concentrates attention on objects that directly influence the target and suppresses irrelevant ones\. Although we follow the terminology of STICA and refer to these quantities as causality scores, they should be interpreted as learned object\-relevance weights rather than independently identified causal effects\.
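
As a worked illustration of Eqs. (1) to (3), the sketch below computes single-head object-causal attention for one target token and two object tokens using plain Python lists. It assumes that $\mathrm{Norm}$ divides each row by its sum and that the causality scores are given; in COMET they are learned. The helper names are hypothetical.

```python
import math

# Group-level causal graph from Eq. (1): target, causal, non-causal.
G = [[1, 1, 0],
     [0, 1, 0],
     [0, 0, 1]]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_weights(alphas):
    """Build W_t from Eq. (2): row 0 is the target token, then one row per object."""
    return [[1.0, 0.0, 0.0]] + [[0.0, a, 1.0 - a] for a in alphas]

def object_causal_attention(q, k, v, alphas):
    d = len(k[0])
    w = causal_weights(alphas)
    modulation = matmul(matmul(w, G), transpose(w))  # W_t G W_t^T
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    attn = [softmax(row) for row in scores]
    weighted = [[a * m for a, m in zip(a_row, m_row)]
                for a_row, m_row in zip(attn, modulation)]
    # Assumed Norm: divide each row by its sum. Every row keeps positive mass
    # because the diagonal of W_t G W_t^T is at least 0.5.
    normed = [[x / sum(row) for x in row] for row in weighted]
    return matmul(normed, v), normed

# Token 0 is the target; object 1 is fully relevant, object 2 fully irrelevant.
q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = object_causal_attention(q, k, v, alphas=[1.0, 0.0])
```

With these scores the target row of `attn` places no weight on the second object, and neither object attends back to the target, which follows from the zero entries in the first column of $G$ below the diagonal.
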

### 3\.4 Policy and World Model Learning

MuZero\-like methods learn a latent model for planning with MCTS rather than using the true environment dynamics\[[33](https://arxiv.org/html/2606.14418#bib.bib60)\]\. The model consists of a representation function, a dynamics function, and a prediction function\. The representation function encodes the observation history into a root latent state $x_t = h_\theta(o_{\leq t}, a_{<t})$\. The dynamics function predicts imagined transitions and rewards, $\hat{r}_t, x_{t+1} = g_\theta(x_t, a_t)$, and the prediction function outputs a policy prior and value estimate, $\pi_t, \hat{v}_t = f_\theta(x_t)$\. MCTS is then performed entirely in latent space\. The learned dynamics expands candidate future trajectories, while the prediction function evaluates latent states and provides action priors\. After a fixed number of simulations, the visit counts $N(x_t, a_t)$ at the root are normalized to produce an improved policy target $p_t$:

$$p_t = \frac{N(x_t, a_t)^{1/\mathbb{T}}}{\sum_{b_t} N(x_t, b_t)^{1/\mathbb{T}}}, \tag{4}$$
where $\mathbb{T}$ denotes the temperature, which modulates the extent of exploration, and the visit count $N(x_t, a_t)$ is the number of times action $a_t$ was selected at the root latent state $x_t$ during MCTS\. The denominator $\sum_{b_t} N(x_t, b_t)^{1/\mathbb{T}}$ sums over all possible actions $b_t$ from the same state $x_t$ for discrete action spaces, or over sampled candidate actions in continuous\-control tasks\. The model is trained end\-to\-end by unrolling the dynamics for $K$ steps and optimizing policy, value, and reward prediction losses\. This framework forms the basis of several MuZero\-style algorithms\[[33](https://arxiv.org/html/2606.14418#bib.bib60),[16](https://arxiv.org/html/2606.14418#bib.bib17),[46](https://arxiv.org/html/2606.14418#bib.bib19),[44](https://arxiv.org/html/2606.14418#bib.bib20)\]\.
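
The effect of the temperature in Eq. (4) is easy to see numerically. The sketch below is a minimal illustration; the function name, the action labels, and the visit counts are made up for the example.

```python
def visit_count_policy(visit_counts, temperature=1.0):
    """Turn root visit counts N(x_t, a) into the policy target p_t of Eq. (4)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    powered = {a: n ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(powered.values())  # assumes at least one action was visited
    return {a: p / total for a, p in powered.items()}

# Hypothetical root statistics after 50 simulations.
counts = {"left": 30, "right": 15, "fire": 5}
p_plain = visit_count_policy(counts, temperature=1.0)   # proportional to counts
p_sharp = visit_count_policy(counts, temperature=0.25)  # counts raised to the 4th power
```

A temperature of 1 reproduces the empirical visit frequencies, while lower temperatures concentrate the target on the most visited action.
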

While standard MuZero\-like methods represent the planning state as a monolithic latent vector or feature map, COMET represents it as a set of object\-centric slots, enabling planning and prediction over slot\-structured latent space\. The joint optimization objective for COMET can be written as

$$\begin{split}\mathcal{L}_{\text{COMET}}(\theta) = \mathbb{E}_{(o_t, a_t, r_t, o_{t+1}, p_t) \sim \mathcal{B}} \Big( \sum_{t=0}^{H-1} \Big( \beta_s \frac{1}{n} \sum_{i=1}^{n} \| \hat{s}_{t+1}^{i} - s_{t+1}^{i} \|_2^2 \\ + \beta_r \, \mathrm{CE}(\hat{r}_t, r_t) + \beta_p \, \mathrm{CE}(\pi_t, p_t) + \beta_v \, \mathrm{CE}(\hat{v}_t, v_t) \Big) \Big), \end{split} \tag{5}$$
where $\mathcal{B}$ is a replay buffer that stores trajectories $\{o_t, a_t, r_t, o_{t+1}, p_t\}$, and $H$ denotes the training context length, which corresponds to the rollout length used during training\. In COMET, $H$ matches the context window of the transformer backbone\. The constant coefficients $\beta_s, \beta_r, \beta_p, \beta_v$ balance the loss terms for next\-state prediction, reward prediction, policy prediction, and value prediction, respectively\. $\mathrm{CE}$ denotes the cross\-entropy loss function\. Following UniZero, we formulate reward and value prediction as a discrete regression problem in a log\-transformed space, optimized by minimizing cross\-entropy using $v_t$ and $r_t$ as soft targets\. $\hat{s}_{t+1}^{i}$ denotes the predicted representation of the $i$\-th slot, while $s_{t+1}^{i}$ denotes the corresponding ground\-truth slot obtained from a frozen pretrained object\-centric encoder\. $v_t$ denotes the bootstrapped $n$\-step TD target, and $r_t$ denotes the target reward\.
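
The structure of Eq. (5) for a single sampled trajectory can be sketched as follows. This is an illustration only: it takes the predicted and target categorical distributions as given and does not reproduce the log-space discretization used by UniZero. The dictionary keys, the coefficient values, and the function names are hypothetical.

```python
import math

def slot_mse(pred_slots, true_slots):
    """(1/n) * sum_i ||s_hat^i - s^i||_2^2 over the n slots of one step."""
    per_slot = [sum((p - q) ** 2 for p, q in zip(ps, ts))
                for ps, ts in zip(pred_slots, true_slots)]
    return sum(per_slot) / len(per_slot)

def cross_entropy(pred_probs, target_probs, eps=1e-12):
    """Cross-entropy between a predicted distribution and a soft target."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred_probs, target_probs))

def comet_step_loss(step, betas):
    return (betas["s"] * slot_mse(step["pred_slots"], step["true_slots"])
            + betas["r"] * cross_entropy(step["reward_pred"], step["reward_target"])
            + betas["p"] * cross_entropy(step["policy_pred"], step["policy_target"])
            + betas["v"] * cross_entropy(step["value_pred"], step["value_target"]))

def comet_loss(trajectory, betas):
    """Sum the per-step loss over the H steps of one sampled trajectory."""
    return sum(comet_step_loss(step, betas) for step in trajectory)

# Toy step: only the first slot is mispredicted, every head matches its target.
step = {
    "pred_slots": [[1.0, 0.0], [0.0, 0.0]],
    "true_slots": [[0.0, 0.0], [0.0, 0.0]],
    "reward_pred": [1.0, 0.0], "reward_target": [1.0, 0.0],
    "policy_pred": [1.0, 0.0], "policy_target": [1.0, 0.0],
    "value_pred": [1.0, 0.0], "value_target": [1.0, 0.0],
}
betas = {"s": 2.0, "r": 1.0, "p": 1.0, "v": 1.0}  # assumed values
loss = comet_loss([step, step], betas)
```

In this toy case the cross-entropy terms are numerically zero, so the loss reduces to the weighted slot-prediction error summed over the two steps.
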

## 4 Experimental Setup

### 4\.1 Environments

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/envs.jpg)

Figure 5: Visualization of observations and slot\-wise attention maps across environments\. In each row, the real observation is followed by attention maps for each slot produced by the corresponding model\. From top to bottom: SLATE on the Object Reaching task, SLATE on an Object Goal task, Slot Contrast on the Cube Pushing task, DINOSAUR on the Block Lifting task, and Slot Contrast on the Defend The Line task\.

We evaluate our approach on the Object\-Centric Visual RL benchmark\[[47](https://arxiv.org/html/2606.14418#bib.bib50)\], which includes object\-centric environments designed to test perception, interaction, and relational reasoning\. The suite consists of Object Goal, Object Interaction, Object Comparison, Property Comparison, and Object Reaching tasks, featuring target objects with distractors and requiring different forms of goal\-directed behavior\. Across tasks, the agent must identify relevant objects, reason about their properties or relationships, and act under sparse rewards with either discrete or continuous action spaces\. We further extend the evaluation to manipulation and control tasks from ManiSkill, Robosuite, and VizDoom\. Figure [5](https://arxiv.org/html/2606.14418#S4.F5) shows examples of observations from the considered environments\.

In the ManiSkill framework, we use the Cube Pushing task, where a cube is placed on a tabletop and its initial position is randomly sampled within a small region in front of the agent\. The goal is to push the cube into a target region at a fixed offset from its initial position, indicated by a visual marker\. The reward is dense and shaped, with pose\-based components weighted by `pose_reward_coef` = 0\.01 and `place_reward_coef` = 0\.1\. Episodes are limited to 50 steps\.

In Robosuite, we evaluate the Block Lifting task, where a single Panda arm operates in a tabletop environment\. A cube is placed on the table at a fixed position, while the robot’s initial configuration is randomized at each episode\. The objective is to grasp and lift the cube above a predefined height threshold\. The task uses a dense reward that encourages gradual progress toward successful lifting, requiring stable grasping and vertical manipulation\. Episodes are limited to 125 steps\.

Finally, in VizDoom we use the Defend the Line scenario, where the agent is placed on one side of a rectangular map, while melee and ranged monsters spawn on the opposite side and continuously move toward it\. Monsters are eliminated with a single shot and respawn after a delay; over time they deal increasing damage\. The agent has limited ammunition, and the episode ends when the agent dies\. The reward is \+1 for killing a monster and \-1 for death\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/disc_res.png)Figure 6:Success rate averaged over 30 episodes and three seeds for COMET and baselines in tasks with discrete action space\. Shaded areas indicate standard deviation\. Exponential smoothing with coefficient 0\.5 is applied\.![Refer to caption](https://arxiv.org/html/2606.14418v1/images/cont_res.png)Figure 7:Success rate averaged over 30 episodes and three seeds for COMET and baselines in tasks with continuous action space\. Shaded areas indicate standard deviation\. Exponential smoothing with coefficient 0\.5 is applied\.
### 4\.2 Mean Normalized Score

The performance and sample efficiency of object\-centric RL depend on the quality of learned representations, which are strongly affected by visual complexity\. Although agents can sometimes compensate for imperfect representations, this usually reduces sample efficiency and final performance, and different architectures vary in their robustness across environments\. Therefore, a key goal is to design object\-centric RL agents that maintain stable performance across visually diverse tasks under limited interaction budgets\.

Our task suite spans environments with varying visual complexity, ranging from relatively simple settings \(e\.g\., 2D shapes on a monochrome background\) to more challenging ones \(e\.g\., ManiSkill, Robosuite, and VizDoom\)\. To quantify performance under a fixed interaction budget, we introduce a normalized score that aggregates an agent’s performance across a set of visually and dynamically diverse tasks into a single interpretable metric\. Because evaluation metrics \(e\.g\., cumulative reward and success rate\) differ in scale and nature across tasks, raw scores must be normalized prior to aggregation\. We normalize each agent’s performance relative to the best\-performing method in each environment\. Let $\mathcal{E}$ denote the set of environments, $\mathcal{T}_{e}$ the time\-step budget for environment $e\in\mathcal{E}$, $\mathcal{M}$ the set of RL agents \(algorithms\) we evaluate, $\tau=t/\mathcal{T}_{e}\leq 1$ the normalized time step, and $F(m,e;t)$ the evaluation metric \(success rate or cumulative reward\) achieved by method $m\in\mathcal{M}$ on environment $e$ after $t$ training steps\. We define the normalized score $\mathcal{S}(m,e;\tau)$ for method $m$ in environment $e$, and the mean normalized score $\hat{\mathcal{S}}(m;\tau)$ that aggregates performance across environments:

$$\mathcal{S}(m,e;\tau)=\dfrac{F(m,e;t)}{\max_{m^{\prime},t^{\prime}}F(m^{\prime},e;t^{\prime})}\leq 1,\quad\hat{\mathcal{S}}(m;\tau)=\dfrac{1}{|\mathcal{E}|}\sum_{e\in\mathcal{E}}\mathcal{S}(m,e;\tau). \tag{6}$$
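
As an illustration of Equation \(6\), the following sketch computes the normalized score and its mean over environments from stored evaluation curves\. It assumes that every method is evaluated at the same grid of normalized time steps $\tau$ in each environment and that the best observed value in each environment is positive\. The names `normalized_scores` and `mean_normalized_score`, and the toy numbers, are hypothetical and not taken from the paper\.

```python
from typing import Dict, List

# curves[method][env] holds F(m, e; t) at a shared grid of normalized steps tau.
Curves = Dict[str, Dict[str, List[float]]]


def normalized_scores(curves: Curves, method: str, env: str) -> List[float]:
    """S(m, e; tau): divide by the best value of any method at any step in env."""
    best = max(max(curves[m][env]) for m in curves if env in curves[m])
    if best <= 0:
        # Assumption: the ratio is only meaningful for a positive maximum.
        raise ValueError(f"non-positive best score in environment {env!r}")
    return [f / best for f in curves[method][env]]


def mean_normalized_score(curves: Curves, method: str) -> List[float]:
    """S_hat(m; tau): average S(m, e; tau) over all environments of the method."""
    envs = sorted(curves[method])
    per_env = [normalized_scores(curves, method, e) for e in envs]
    n_points = len(per_env[0])
    if any(len(s) != n_points for s in per_env):
        raise ValueError("all environments must share the same tau grid")
    return [sum(s[i] for s in per_env) / len(per_env) for i in range(n_points)]


if __name__ == "__main__":
    # Toy data: two methods, two environments, two checkpoints each.
    toy: Curves = {
        "A": {"env1": [0.5, 1.0], "env2": [5.0, 10.0]},
        "B": {"env1": [0.25, 0.25], "env2": [10.0, 20.0]},
    }
    print(mean_normalized_score(toy, "A"))
    print(mean_normalized_score(toy, "B"))
```

Because the maximum is taken over all methods and all time steps, each per\-environment score is bounded by 1, so environments with large reward scales do not dominate the average\.
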

## 5 Experiments

We compare COMET against OCRL \[[47](https://arxiv.org/html/2606.14418#bib.bib50)\], a model\-free, object\-centric PPO baseline that uses a transformer encoder to pool object\-centric representations\. As object\-centric MBRL baselines, we use OCDreamer and SOLD\. OCDreamer is agnostic to the action space, whereas SOLD is implemented only for continuous action spaces\. We experimented with a discrete\-action variant of SOLD by replacing the continuous actor with a categorical policy and optimizing imagined rollouts with a score\-function estimator, but this variant failed to learn reliably in preliminary experiments\. We therefore restrict SOLD comparisons to continuous\-action tasks, where the original algorithm is directly applicable\. For SOLD, we use the SAVi encoder from the original implementation; examples of its attention maps on continuous tasks are provided in Appendix [J](https://arxiv.org/html/2606.14418#A10)\. For OCRL, OCDreamer, and COMET, we use SLATE on the Object Goal, Object Interaction, Object Comparison, Property Comparison, and Object Reaching tasks; DINOSAUR on Block Lifting; and Slot Contrast on Cube Pushing and Defend the Line\. Examples of attention maps produced by these encoders are shown in Figure [5](https://arxiv.org/html/2606.14418#S4.F5)\. Pre\-trained SLATE is used as described in Appendix [C](https://arxiv.org/html/2606.14418#A3), while all other encoders are trained on the collected data \(Appendix [B](https://arxiv.org/html/2606.14418#A2)\) and kept frozen thereafter\. Hyperparameters for SLATE, Slot Contrast, and DINOSAUR are provided in Appendices [C](https://arxiv.org/html/2606.14418#A3), [E](https://arxiv.org/html/2606.14418#A5), and [D](https://arxiv.org/html/2606.14418#A4), respectively\. As a monolithic MBRL baseline, we use UniZero\. For all baselines, we adopt the original hyperparameters specified in their respective publications and official repositories\.
Hyperparameters for COMET are provided in Appendix [F](https://arxiv.org/html/2606.14418#A6)\. For the Cube Pushing task, we reduce $\texttt{discount\_factor}$ to $0.925$ for both COMET and UniZero\. True and predicted trajectory rollouts for all tasks are presented in Appendix [H](https://arxiv.org/html/2606.14418#A8), and causality\-score visualizations for the policy and value transformers across all tasks are presented in Appendix [I](https://arxiv.org/html/2606.14418#A9)\.
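
To give a sense of scale for the two discount factors \(0\.997 by default and 0\.925 on Cube Pushing, whose episodes last 50 steps\), the sketch below computes two standard quantities: the effective horizon $1/(1-\gamma)$ and the weight $\gamma^{k}$ given to a reward $k$ steps ahead\. This is plain arithmetic rather than anything taken from the paper, and the helper names are hypothetical\.

```python
def effective_horizon(gamma: float) -> float:
    """Standard rule of thumb: rewards beyond ~1 / (1 - gamma) steps contribute little."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma)


def discount_weight(gamma: float, k: int) -> float:
    """Weight gamma**k applied to a reward received k steps in the future."""
    return gamma ** k


if __name__ == "__main__":
    for gamma in (0.997, 0.925):
        print(
            f"gamma={gamma}: horizon ~ {effective_horizon(gamma):.1f} steps, "
            f"weight of a reward 50 steps ahead ~ {discount_weight(gamma, 50):.3f}"
        )
```

With $\gamma = 0.925$ the effective horizon is about 13 steps, well inside a 50\-step episode, whereas with $\gamma = 0.997$ it is about 333 steps\.
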

Figures [6](https://arxiv.org/html/2606.14418#S4.F6) and [7](https://arxiv.org/html/2606.14418#S4.F7) show the training curves of success rate and cumulative reward for COMET and the baselines across all tasks\. The set of top\-performing algorithms changes across tasks\. COMET achieves faster convergence and higher final performance than the baselines on Object Comparison and Property Comparison, but does not outperform the competing methods on Block Lifting\. To account for variability in task difficulty and performance scales, we normalize the results per task for each algorithm as described in Equation [6](https://arxiv.org/html/2606.14418#S4.E6)\. The resulting normalized score, shown in Figure [3](https://arxiv.org/html/2606.14418#S3.F3), demonstrates that COMET achieves a higher mean normalized score than the baselines across a visually diverse set of tasks\.

In visually simple tasks, such as Object Goal, Object Comparison, and Property Comparison, COMET benefits from strong object\-centric representations and from a task structure with a single target object and multiple distractors\. Using object\-causal attention, which benefits from high\-quality representations, COMET accurately identifies task\-relevant objects by assigning them higher causality scores, as illustrated in Appendix [I](https://arxiv.org/html/2606.14418#A9), enabling it to outperform the baselines\. In more dynamically complex tasks, such as Defend the Line and Object Interaction, COMET achieves performance comparable to the baselines\. We attribute this behavior to task\-specific factors: in Defend the Line, most objects are relevant for prediction, while in Object Interaction, the agent must rely on object\-pushing mechanics, so the goal is achieved indirectly through deeper causal chains than those captured by our object\-causal attention mechanism\. For tasks that are challenging in both visual complexity and control, such as Block Lifting and Cube Pushing, COMET achieves moderate performance\. This is likely due to limitations in the object\-centric representation model, which sometimes merges the cube and the background into a single slot\. OCDreamer demonstrates better performance on these tasks, indicating that it is more robust to imperfect representations in such settings\.
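
To make the idea of per\-slot causality scores concrete, the sketch below shows one simple way such scores could modulate attention: each key slot receives a relevance value in $(0, 1)$, the softmax attention weights are multiplied by it, and each row is renormalized\. This is an illustrative assumption only, not COMET’s actual formulation; the function `relevance_modulated_attention` and the sigmoid gating are hypothetical choices\.

```python
import math
from typing import List


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_modulated_attention(
    logits: List[List[float]], relevance_logits: List[float]
) -> List[List[float]]:
    """Scale attention to each key slot by its relevance, then renormalize rows.

    Assumption: relevance is sigmoid(relevance_logit) per slot. This is a
    hypothetical gating scheme used purely for illustration.
    """
    relevance = [1.0 / (1.0 + math.exp(-r)) for r in relevance_logits]
    out: List[List[float]] = []
    for row in logits:
        weights = softmax(row)
        gated = [w * c for w, c in zip(weights, relevance)]
        total = sum(gated)
        out.append([g / total for g in gated])
    return out


if __name__ == "__main__":
    # Three slots with uniform attention logits; slot 0 is scored as relevant.
    attn = relevance_modulated_attention(
        logits=[[0.0, 0.0, 0.0] for _ in range(3)],
        relevance_logits=[4.0, -4.0, -4.0],
    )
    for row in attn:
        print([round(w, 3) for w in row])
```

In this toy case every query ends up attending almost entirely to the single high\-relevance slot, which mirrors the single\-target, multiple\-distractor structure described above\.
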

## 6 Limitations & Future Work

Despite advances in unsupervised object\-centric representation learning, current methods still struggle to reliably segment complex, cluttered real\-world scenes, especially under occlusion or ambiguous boundaries, which limits their applicability in unconstrained settings\. Additionally, the cost of self\-attention grows quadratically with the number of object slots, which limits the scalability of transformer\-based approaches in multi\-object scenes\. Future work will focus on extending these methods to realistic, open\-ended environments, such as household tasks, and on improving causal attention mechanisms that dynamically prioritize relevant objects\. Applying these mechanisms not only to policy and value estimation but also to transition dynamics could improve efficiency, relational reasoning, and scalability in complex, multi\-object settings\.

## 7 Conclusion

In this work, we introduced COMET, an MCTS\-based object\-centric model\-based reinforcement learning method that combines structured object\-centric representations with a transformer\-based world model\. COMET employs an action\-slot binding mechanism that fuses object\-centric slots with actions, enabling transition modeling for slots within a unified transformer backbone\. By leveraging a causal attention mechanism, the policy and value models focus on task\-relevant object representations, improving both the effectiveness and interpretability of decision\-making\. Experimental results across visually diverse discrete and continuous control tasks show that COMET achieves a higher mean normalized score during the early stages of training compared to object\-centric and monolithic baselines\.

## References

- \[1\] \(2021\) Deep vit features as dense visual descriptors\. arXiv preprint arXiv:2112\.05814\. Cited by: [§2\.1](https://arxiv.org/html/2606.14418#S2.SS1.p1.1)\.
- \[2\] O\. Biza, R\. Platt, J\. van de Meent, L\. L\. Wong, and T\. Kipf \(2022\) Binding actions to objects in world models\. arXiv preprint arXiv:2204\.13022\. Cited by: [§3\.2](https://arxiv.org/html/2606.14418#S3.SS2.p2.13)\.
- \[3\] C\. P\. Burgess, L\. Matthey, N\. Watters, R\. Kabra, I\. Higgins, M\. Botvinick, and A\. Lerchner \(2019\) Monet: unsupervised scene decomposition and representation\. arXiv preprint arXiv:1901\.11390\. Cited by: [§2\.2](https://arxiv.org/html/2606.14418#S2.SS2.p1.1)\.
- \[4\] H\. K\. Cheng, S\. W\. Oh, B\. Price, J\. Lee, and A\. Schwing \(2024\) Putting the object back into video object segmentation\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 3151–3161\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1)\.
- \[5\] R\. Coulom \(2006\) Efficient selectivity and backup operators in monte\-carlo tree search\. In International conference on computers and games, pp\. 72–83\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p2.1)\.
- \[6\] T\. Daniel and A\. Tamar \(2022\) Unsupervised image representation learning with deep latent particles\. In International Conference on Machine Learning, pp\. 4644–4665\. Cited by: [§2\.1](https://arxiv.org/html/2606.14418#S2.SS1.p2.1)\.
- \[7\] T\. Daniel and A\. Tamar \(2024\) DDLP: unsupervised object\-centric video prediction with deep dynamic latent particles\. Transactions on Machine Learning Research\. External Links: ISSN 2835\-8856, [Link](https://openreview.net/forum?id=Wqn8zirthg)\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1)\.
- \[8\] G\. Elsayed, A\. Mahendran, S\. Van Steenkiste, K\. Greff, M\. C\. Mozer, and T\. Kipf \(2022\) Savi\+\+: towards end\-to\-end object\-centric learning from real\-world videos\. Advances in Neural Information Processing Systems 35, pp\. 28940–28954\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1), [§2\.1](https://arxiv.org/html/2606.14418#S2.SS1.p1.1)\.
- \[9\] S\. Ferraro, P\. Mazzaglia, T\. Verbelen, and B\. Dhoedt \(2023\) FOCUS: object\-centric world models for robotic manipulation\. In Intrinsically\-Motivated and Open\-Ended Learning Workshop @NeurIPS2023\. Cited by: [§2\.2](https://arxiv.org/html/2606.14418#S2.SS2.p1.1)\.
- \[10\] D\. Ha and J\. Schmidhuber \(2018\) World models\. CoRR abs/1803\.10122\. External Links: [Link](http://arxiv.org/abs/1803.10122), 1803\.10122\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p1.1)\.
- \[11\] D\. Hafner, T\. Lillicrap, J\. Ba, and M\. Norouzi \(2020\) Dream to control: learning behaviors by latent imagination\. In International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=S1lOTC4tDS)\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p2.1)\.
- \[12\] D\. Hafner, T\. P\. Lillicrap, M\. Norouzi, and J\. Ba \(2021\) Mastering atari with discrete world models\. In International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=0oabwyZbOu)\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p2.1)\.
- \[13\] D\. Hafner, J\. Pasukonis, J\. Ba, and T\. Lillicrap \(2025\) Mastering diverse control tasks through world models\. Nature 640\(8059\), pp\. 647–653\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p2.1)\.
- \[14\] N\. A\. Hansen, H\. Su, and X\. Wang \(2022\) Temporal difference learning for model predictive control\. In International Conference on Machine Learning, pp\. 8387–8406\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p2.1)\.
- \[15\] N\. Hansen, H\. Su, and X\. Wang \(2024\) TD\-MPC2: scalable, robust world models for continuous control\. In The Twelfth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=Oxh5CstDJU)\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p2.1)\.
- \[16\] T\. Hubert, J\. Schrittwieser, I\. Antonoglou, M\. Barekatain, S\. Schmitt, and D\. Silver \(2021\) Learning and planning in complex action spaces\. In International Conference on Machine Learning, pp\. 4476–4486\. Cited by: [§3\.4](https://arxiv.org/html/2606.14418#S3.SS4.p3.10)\.
- \[17\] A\. Karpathy \(2023\) NanoGPT: the simplest, fastest repository for training/finetuning medium\-sized gpts\. GitHub\. Note: [https://github\.com](https://github.com/)\. Cited by: [§3](https://arxiv.org/html/2606.14418#S3.p1.1)\.
- \[18\] T\. Kipf, G\. F\. Elsayed, A\. Mahendran, A\. Stone, S\. Sabour, G\. Heigold, R\. Jonschkowski, A\. Dosovitskiy, and K\. Greff \(2022\) Conditional object\-centric learning from video\. In International Conference on Learning Representations\. Cited by: [§2\.1](https://arxiv.org/html/2606.14418#S2.SS1.p1.1)\.
- \[19\] T\. Kipf, E\. van der Pol, and M\. Welling \(2020\) Contrastive learning of structured world models\. In International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=H1gax6VtDB)\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1)\.
- \[20\] A\. Kirillov, E\. Mintun, N\. Ravi, H\. Mao, C\. Rolland, L\. Gustafson, T\. Xiao, S\. Whitehead, A\. C\. Berg, W\. Lo, et al\. \(2023\) Segment anything\. In Proceedings of the IEEE/CVF international conference on computer vision, pp\. 4015–4026\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1)\.
- \[21\] Y\. LeCun, Y\. Bengio, and G\. Hinton \(2015\-05\) Deep learning\. Nature 521\(7553\), pp\. 436–444 \(en\)\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p3.1)\.
- \[22\] A\. Liang, J\. Thomason, and E\. Bıyık \(2024\) Visarl: visual reinforcement learning guided by human saliency\. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\), pp\. 2907–2912\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p3.1)\.
- \[23\] Z\. Lin, Y\. Wu, S\. V\. Peri, W\. Sun, G\. Singh, F\. Deng, J\. Jiang, and S\. Ahn \(2020\) SPACE: unsupervised object\-oriented scene representation via spatial attention and decomposition\. In International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=rkl03ySYDH)\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1)\.
- \[24\] F\. Locatello, D\. Weissenborn, T\. Unterthiner, A\. Mahendran, G\. Heigold, J\. Uszkoreit, A\. Dosovitskiy, and T\. Kipf \(2020\) Object\-centric learning with slot attention\. Advances in neural information processing systems 33, pp\. 11525–11538\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1), [§2\.1](https://arxiv.org/html/2606.14418#S2.SS1.p1.1)\.
- \[25\] A\. Manasyan, M\. Seitzer, F\. Radovic, G\. Martius, and A\. Zadaianchuk \(2025\) Temporally consistent object\-centric learning by contrasting slots\. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp\. 5401–5411\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1), [§2\.1](https://arxiv.org/html/2606.14418#S2.SS1.p1.1)\.
- \[26\] T\. Miyato, S\. Löwe, A\. Geiger, and M\. Welling \(2025\) Artificial kuramoto oscillatory neurons\. In The Thirteenth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=nwDRD4AMoN)\. Cited by: [§2\.1](https://arxiv.org/html/2606.14418#S2.SS1.p2.1)\.
- \[27\] M\. Mosbach, J\. N\. Ewertz, A\. Villar\-Corrales, and S\. Behnke \(2024\) SOLD: reinforcement learning with slot object\-centric latent dynamics\. arXiv preprint arXiv:2410\.08822\. Cited by: [§2\.2](https://arxiv.org/html/2606.14418#S2.SS2.p2.1)\.
- \[28\] Y\. Nishimoto and T\. Matsubara \(2026\) Object\-centric world models for causality\-aware reinforcement learning\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 40, pp\. 24585–24593\. Cited by: [§2\.2](https://arxiv.org/html/2606.14418#S2.SS2.p2.1)\.
- \[29\] Y\. Niu, Y\. Pu, Z\. Yang, X\. Li, T\. Zhou, J\. Ren, S\. Hu, H\. Li, and Y\. Liu \(2023\) Lightzero: a unified benchmark for monte carlo tree search in general sequential decision scenarios\. Advances in Neural Information Processing Systems 36, pp\. 37594–37635\. Cited by: [§3](https://arxiv.org/html/2606.14418#S3.p1.1)\.
- \[30\] Y\. Pu, Y\. Niu, Z\. Yang, J\. Ren, H\. Li, and Y\. Liu \(2025\) UniZero: generalized and efficient planning with scalable latent world models\. Transactions on Machine Learning Research\. Cited by: [§3](https://arxiv.org/html/2606.14418#S3.p1.1)\.
- \[31\] N\. Ravi, V\. Gabeur, Y\. Hu, R\. Hu, C\. Ryali, T\. Ma, H\. Khedr, R\. Rädle, C\. Rolland, L\. Gustafson, E\. Mintun, J\. Pan, K\. V\. Alwala, N\. Carion, C\. Wu, R\. Girshick, P\. Dollar, and C\. Feichtenhofer \(2025\) SAM 2: segment anything in images and videos\. In The Thirteenth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=Ha6RTeWMd0)\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1)\.
- \[32\] A\. Santoro, D\. Raposo, D\. G\. Barrett, M\. Malinowski, R\. Pascanu, P\. Battaglia, and T\. Lillicrap \(2017\) A simple neural network module for relational reasoning\. Advances in neural information processing systems 30\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p3.1)\.
- \[33\] J\. Schrittwieser, I\. Antonoglou, T\. Hubert, K\. Simonyan, L\. Sifre, S\. Schmitt, A\. Guez, E\. Lockhart, D\. Hassabis, T\. Graepel, et al\. \(2020\) Mastering atari, go, chess and shogi by planning with a learned model\. Nature 588\(7839\), pp\. 604–609\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p2.1), [§3\.4](https://arxiv.org/html/2606.14418#S3.SS4.p1.5), [§3\.4](https://arxiv.org/html/2606.14418#S3.SS4.p3.10)\.
- \[34\] J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\) Proximal policy optimization algorithms\. arXiv preprint arXiv:1707\.06347\. Cited by: [§2\.2](https://arxiv.org/html/2606.14418#S2.SS2.p2.1)\.
- \[35\] M\. Seitzer, M\. Horn, A\. Zadaianchuk, D\. Zietlow, T\. Xiao, C\. Simon\-Gabriel, T\. He, Z\. Zhang, B\. Schölkopf, T\. Brox, and F\. Locatello \(2023\) Bridging the gap to real\-world object\-centric learning\. In The Eleventh International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=b9tUk-f_aG)\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1), [§2\.1](https://arxiv.org/html/2606.14418#S2.SS1.p1.1)\.
- \[36\] R\. M\. Shiffrin, D\. S\. Bassett, N\. Kriegeskorte, and J\. B\. Tenenbaum \(2020\) The brain produces mind by modeling\. Proceedings of the National Academy of Sciences 117\(47\), pp\. 29299–29301\. External Links: [Document](https://dx.doi.org/10.1073/pnas.1912340117), [Link](https://www.pnas.org/doi/abs/10.1073/pnas.1912340117), https://www\.pnas\.org/doi/pdf/10\.1073/pnas\.1912340117\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p1.1)\.
- \[37\] G\. Singh, F\. Deng, and S\. Ahn \(2022\) Illiterate DALL\-e learns to compose\. In International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=h0OYV0We3oh)\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1)\.
- \[38\] G\. Singh, Y\. Wu, and S\. Ahn \(2022\) Simple unsupervised object\-centric learning for complex and naturalistic videos\. Advances in Neural Information Processing Systems 35, pp\. 18181–18196\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1), [§2\.1](https://arxiv.org/html/2606.14418#S2.SS1.p1.1)\.
- \[39\] E\. S\. Spelke and K\. D\. Kinzler \(2007\-01\) Core knowledge\. Dev\. Sci\. 10\(1\), pp\. 89–96 \(en\)\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1)\.
- \[40\] A\. Stanić, Y\. Tang, D\. Ha, and J\. Schmidhuber \(2023\) Learning to generalize with object\-centric agents in the open world survival game crafter\. IEEE Transactions on Games\. Cited by: [§2\.2](https://arxiv.org/html/2606.14418#S2.SS2.p2.1)\.
- \[41\] L\. Ugadiarov, V\. Vorobyov, and A\. Panov \(2025\) Object\-centric dreamer\. In International Conference on Artificial Neural Networks, pp\. 153–165\. Cited by: [§2\.2](https://arxiv.org/html/2606.14418#S2.SS2.p2.1)\.
- \[42\] A\. Van Den Oord, O\. Vinyals, et al\. \(2017\) Neural discrete representation learning\. Advances in neural information processing systems 30\. Cited by: [§2\.1](https://arxiv.org/html/2606.14418#S2.SS1.p1.1)\.
- \[43\] A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. Advances in neural information processing systems 30\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p5.1)\.
- \[44\] S\. Wang, S\. Liu, W\. Ye, J\. You, and Y\. Gao \(2024\) EfficientZero v2: mastering discrete and continuous control with limited data\. In International Conference on Machine Learning, pp\. 51041–51062\. Cited by: [§3\.4](https://arxiv.org/html/2606.14418#S3.SS4.p3.10)\.
- \[45\] N\. Watters, L\. Matthey, M\. Bosnjak, C\. P\. Burgess, and A\. Lerchner \(2019\) Cobra: data\-efficient model\-based rl through unsupervised object discovery and curiosity\-driven exploration\. arXiv preprint arXiv:1905\.09275\. Cited by: [§2\.2](https://arxiv.org/html/2606.14418#S2.SS2.p1.1)\.
- \[46\] W\. Ye, S\. Liu, T\. Kurutach, P\. Abbeel, and Y\. Gao \(2021\) Mastering atari games with limited data\. Advances in neural information processing systems 34, pp\. 25476–25488\. Cited by: [§3\.4](https://arxiv.org/html/2606.14418#S3.SS4.p3.10)\.
- \[47\] J\. Yoon, Y\. Wu, H\. Bae, and S\. Ahn \(2023\) An investigation into pre\-training object\-centric representations for reinforcement learning\. In Proceedings of the 40th International Conference on Machine Learning, pp\. 40147–40174\. Cited by: [§2\.2](https://arxiv.org/html/2606.14418#S2.SS2.p2.1), [§4\.1](https://arxiv.org/html/2606.14418#S4.SS1.p1.1), [§5](https://arxiv.org/html/2606.14418#S5.p1.1)\.
- \[48\] A\. Zadaianchuk, M\. Seitzer, and G\. Martius \(2023\) Object\-centric learning for real\-world videos by predicting temporal feature similarities\. Advances in neural information processing systems 36, pp\. 61514–61545\. Cited by: [§1](https://arxiv.org/html/2606.14418#S1.p4.1)\.
- \[49\] W\. Zhang, A\. Jelley, T\. McInroe, and A\. Storkey \(2025\) Objects matter: object\-centric world models improve reinforcement learning in visually complex environments\. arXiv preprint arXiv:2501\.16443\. Cited by: [§2\.2](https://arxiv.org/html/2606.14418#S2.SS2.p1.1)\.

## Appendix A Mean Normalized Score

Table 1: Normalization Parameters\. Notation is consistent with Equation \([6](https://arxiv.org/html/2606.14418#S4.E6)\)\.

## Appendix B Datasets

We collect images to train object\-centric representation models using a uniform random policy\.

Table 2: Parameters of the collected datasets\.

## Appendix C SLATE’s Hyperparameters

For the Object Goal, Object Interaction, Object Comparison, Property Comparison, and Object Reaching tasks, we used pre\-trained SLATE models from the official OCRL repository: [https://github\.com/jsikyoon/OCRL](https://github.com/jsikyoon/OCRL)\.

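Table 3: Hyperparameters for SLATE

| Group | Hyperparameter | Value |
| --- | --- | --- |
| Learning | Training dataset size | 1000000 |
| Learning | Temp\. cooldown | 1\.0 to 0\.1 |
| Learning | Temp\. cooldown steps | 30000 |
| Learning | LR for DVAE | 0\.0003 |
| Learning | LR for CNN Encoder | 0\.0001 |
| Learning | LR for Transformer Decoder | 0\.0003 |
| Learning | LR warm\-up steps | 30000 |
| Learning | LR half time | 250000 |
| Learning | Dropout | 0\.1 |
| Learning | Clip | 0\.05 |
| Learning | Batch size | 32 |
| Learning | Epochs | 100 |
| DVAE | Vocabulary size | 4096 |
| CNN Encoder | Hidden size | 64 |
| Slot Attention | Iterations | 3 |
| Slot Attention | Slot heads | 1 |
| Slot Attention | Slot dim\. | 192 |
| Slot Attention | MLP hidden dim\. | 192 |
| Transformer Decoder | Layers | 4 |
| Transformer Decoder | Heads | 4 |
| Transformer Decoder | Hidden dim\. | 192 |

## Appendix D DINOSAUR’s Hyperparameters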

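Table 4: Hyperparameters for DINOSAUR

| Group | Hyperparameter | Value |
| --- | --- | --- |
| Learning | Training dataset size | 300000 |
| Learning | Training steps | 500000 |
| Learning | Batch size | 64 |
| Learning | LR warm\-up steps | 10000 |
| Learning | Peak LR | 0\.0004 |
| Learning | Exp\. decay half\-life | 100000 |
| Learning | ViT Architecture | ViT\-B |
| Learning | Feature dim\. | 768 |
| Learning | Patch size | 8 |
| Learning | Gradient norm clipping | 1\.0 |
| Learning | Image/Crop size | 224 |
| Learning | Cropping strategy | Full |
| Learning | Tokens | 784 |
| Decoder | Type | MLP |
| Decoder | Layers | 4 |
| Decoder | MLP hidden dim\. | 1024 |
| Slot Attention | Iterations | 3 |
| Slot Attention | Slots | 5 |
| Slot Attention | Slot dim\. | 64 |
| Slot Attention | MLP hidden dim\. | 512 |

## Appendix E Slot Contrast’s Hyperparameters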

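Table 5: Hyperparameters for Slot Contrast

| Group | Hyperparameter | Value |
| --- | --- | --- |
| Learning | Training steps | 100000 |
| Learning | Batch size | 64 |
| Learning | Training segment length | 4 |
| Learning | LR warm\-up steps | 2500 |
| Learning | Optimizer | Adam |
| Learning | Peak LR | 0\.0004 |
| Learning | Exp\. decay half\-life | 100000 |
| Learning | ViT Architecture | DINOv2 Small |
| Learning | Initialization | FixedLearnedInit |
| Learning | Patch size | 14 |
| Learning | Feature dim\. | 384 |
| Learning | Gradient norm clipping | 0\.05 |
| Learning | Image/Crop size | 336 |
| Learning | Cropping strategy | Full |
| Learning | Image tokens | 576 |
| Predictor | Type | Transformer |
| Predictor | Layers | 1 |
| Predictor | Heads | 4 |
| Decoder | Type | MLP |
| Slot Attention | Iterations \(first / other frames\) | 3 / 2 |
| Slot Attention | Slot dim\. | 64 |
| Loss Parameters | Slot\-slot contrastive loss | disabled |

## Appendix F COMET’s Hyperparameters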

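Table 6: Hyperparameters for COMET

| Group | Hyperparameter | Value |
| --- | --- | --- |
| Planning | Number of MCTS Simulations \(sim\) | 50 |
| Planning | Number of Sampled Actions \(K\) | 20 \(Continuous tasks only\) |
| Planning | Inference Context Length | 10 |
| Planning | Temperature | 0\.25 |
| Planning | Dirichlet Noise \($\alpha$\) | 0\.3 |
| Planning | Dirichlet Noise Weight | 0\.25 |
| Planning | Coefficient $c_{1}$ | 1\.25 |
| Planning | Coefficient $c_{2}$ | 19652 |
| Environment and Replay Buffer | Replay Buffer Capacity | 1,000,000 |
| Environment and Replay Buffer | Sampling Strategy | Uniform |
| Environment and Replay Buffer | Reward Clipping | True \(Discrete only\) |
| Environment and Replay Buffer | Data Augmentation | False |
| Environment and Replay Buffer | Game Segment Length | 400 \(Discrete\); 100 \(Continuous\) |
| Architecture | Number of Backbone Transformer Heads | 8 \(Discrete\); 4 \(Continuous\) |
| Architecture | Number of Backbone Transformer Layers \(N\) | 2 |
| Architecture | Number of Policy/Value Transformer Heads | 4 |
| Architecture | Number of Policy/Value Transformer Layers \(N\) | 1 |
| Architecture | Dropout Rate \(p\) | 0\.1 |
| Architecture | Activation Function | GELU |
| Architecture | Reward/Value Bins | 101 \(Continuous\); 601 \(Discrete\) |
| Optimization | Training Context Length \(H\) | 10 |
| Optimization | Replay Ratio | 0\.25 |
| Optimization | Buffer Reanalyze Frequency | 0 \(Discrete\); $1/100000$ \(Continuous\) |
| Optimization | Batch Size | 64 |
| Optimization | Optimizer | AdamW |
| Optimization | Learning Rate | $1\times 10^{-4}$ |
| Optimization | Next Latent State Loss Coefficient | 10 |
| Optimization | Reward Loss Coefficient | 1 \(Discrete\); 0\.1 \(Continuous\) |
| Optimization | Policy Loss Coefficient | 1 \(Discrete\); 0\.1 \(Continuous\) |
| Optimization | Value Loss Coefficient | 0\.5 \(Discrete\); 0\.1 \(Continuous\) |
| Optimization | Policy Entropy Coefficient | $1\times 10^{-4}$ |
| Optimization | Weight Decay | $10^{-4}$ |
| Optimization | Max Gradient Norm | 5 |
| Optimization | Discount Factor | 0\.997 \(0\.925 in Cube Pushing Task\) |
| Optimization | Soft Target Update Momentum | 0\.05 |
| Optimization | Hard Target Network Update Frequency | 100 |
| Optimization | Temporal Difference \(TD\) Steps | 5 |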
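
The planning coefficients $c_{1}=1.25$ and $c_{2}=19652$ match the constants of the pUCT selection rule in MuZero \[33\]\. Assuming they play the same role in COMET \(the table itself does not say\), the sketch below shows how they would enter the child\-selection score, and how root priors could be mixed with Dirichlet noise using the listed $\alpha = 0.3$ and weight $0.25$\. All function names are hypothetical, and this is not COMET’s actual search code\.

```python
import math
import random
from typing import List


def puct_score(q: float, prior: float, parent_visits: int, child_visits: int,
               c1: float = 1.25, c2: float = 19652.0) -> float:
    """MuZero-style pUCT: Q + P * sqrt(N_parent) / (1 + N_child) * (c1 + log((N_parent + c2 + 1) / c2))."""
    exploration = c1 + math.log((parent_visits + c2 + 1.0) / c2)
    return q + prior * math.sqrt(parent_visits) / (1 + child_visits) * exploration


def sample_dirichlet(alpha: float, k: int, rng: random.Random) -> List[float]:
    """Draw from a symmetric Dirichlet(alpha) by normalizing Gamma(alpha, 1) samples."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def add_root_noise(priors: List[float], noise: List[float], weight: float = 0.25) -> List[float]:
    """Mix root priors with exploration noise: (1 - weight) * p + weight * noise."""
    return [(1.0 - weight) * p + weight * n for p, n in zip(priors, noise)]


if __name__ == "__main__":
    rng = random.Random(0)
    priors = [0.7, 0.2, 0.1]  # hypothetical policy-head output at the root
    noisy = add_root_noise(priors, sample_dirichlet(0.3, len(priors), rng))
    # Child visit counts and Q-values below are made up for illustration.
    scores = [puct_score(q=0.1, prior=p, parent_visits=10, child_visits=n)
              for p, n in zip(noisy, [6, 3, 1])]
    print(max(range(len(scores)), key=scores.__getitem__))
```

With $c_{2}$ this large, the logarithmic term stays close to zero for a budget of 50 simulations, so the exploration bonus is governed almost entirely by $c_{1}$\.
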

## Appendix G Compute Resources

In our setup, training COMET for 500k environment steps on a single NVIDIA H100 \(80 GB\) GPU takes approximately 18 hours on average across different tasks\.

## Appendix H COMET’s trajectory rollouts

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/goal_dyn_policy.jpg)

Figure 8: Trajectory rollout generated using COMET’s policy for the Object Goal Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the SLATE model over each slot inferred by the SLATE model\. The second row shows attention maps produced by the SLATE model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/goal_dyn_random.jpg)

Figure 9: Trajectory rollout generated using a random policy for the Object Goal Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the SLATE model over each slot inferred by the SLATE model\. The second row shows attention maps produced by the SLATE model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/inter_dyn_policy.jpg)

Figure 10: Trajectory rollout generated using COMET’s policy for the Object Interaction Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the SLATE model over each slot inferred by the SLATE model\. The second row shows attention maps produced by the SLATE model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations; however, near the end of the trajectory, when objects became spatially close, the model produced inaccurate predictions\. Notably, the ground\-truth slots in this situation also became inconsistent with those from previous steps\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/inter_dyn_random.jpg)

Figure 11: Trajectory rollout generated using a random policy for the Object Interaction Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the SLATE model over each slot inferred by the SLATE model\. The second row shows attention maps produced by the SLATE model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations; however, near the end of the trajectory, when objects became spatially close, the model produced inaccurate predictions\. Notably, the ground\-truth slots in this situation also became inconsistent with those from previous steps\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/oddobj_dyn_policy.jpg)

Figure 12: Trajectory rollout generated using COMET’s policy for the Object Comparison Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the SLATE model over each slot inferred by the SLATE model\. The second row shows attention maps produced by the SLATE model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/oddobj_dyn_random.jpg)

Figure 13: Trajectory rollout generated using a random policy for the Object Comparison Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the SLATE model over each slot inferred by the SLATE model\. The second row shows attention maps produced by the SLATE model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/oddprop_dyn_policy.jpg)

Figure 14: Trajectory rollout generated using COMET’s policy for the Property Comparison Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the SLATE model over each slot inferred by the SLATE model\. The second row shows attention maps produced by the SLATE model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/oddprop_dyn_random.jpg)

Figure 15: Trajectory rollout generated using a random policy for the Property Comparison Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the SLATE model over each slot inferred by the SLATE model\. The second row shows attention maps produced by the SLATE model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/cw_dyn_policy.jpg)

Figure 16: Trajectory rollout generated using COMET’s policy for the Object Reaching Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the SLATE model over each slot inferred by the SLATE model\. The second row shows attention maps produced by the SLATE model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/cw_dyn_random.jpg)

Figure 17: Trajectory rollout generated using a random policy for the Object Reaching Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the SLATE model over each slot inferred by the SLATE model\. The second row shows attention maps produced by the SLATE model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/robo_dyn_policy.jpg)

Figure 18: Trajectory rollout generated using COMET’s policy for the Block Lifting Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the DINOSAUR model over each slot inferred by the DINOSAUR model\. The second row shows attention maps produced by the DINOSAUR model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/robo_dyn_random.jpg)

Figure 19: Trajectory rollout generated using a random policy for the Block Lifting Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the DINOSAUR model over each slot inferred by the DINOSAUR encoder\. The second row shows attention maps produced by the DINOSAUR model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/mani_dyn_policy.jpg)

Figure 20: Trajectory rollout generated using COMET’s policy for the Cube Pushing Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the Slot Contrast model over each slot inferred by the Slot Contrast model\. The second row shows attention maps produced by the Slot Contrast model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/mani_dyn_random.jpg)

Figure 21: Trajectory rollout generated using a random policy for the Cube Pushing Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the Slot Contrast model over each slot inferred by the Slot Contrast model\. The second row shows attention maps produced by the Slot Contrast model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/viz_dyn_policy.jpg)

Figure 22: Trajectory rollout generated using COMET’s policy for the Defend the Line Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the Slot Contrast model over each slot inferred by the Slot Contrast model\. The second row shows attention maps produced by the Slot Contrast model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations, with prediction errors appearing near the end of the trajectory\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/viz_random_policy.jpg)

Figure 23: Trajectory rollout generated using a random policy for the Defend the Line Task\. The first row at each time step shows the real observation from the environment along with attention maps produced by the Slot Contrast model over each slot inferred by the Slot Contrast model\. The second row shows attention maps produced by the Slot Contrast model for each slot predicted by COMET’s dynamics model\. COMET’s dynamics model correctly predicted object representations\.
## Appendix I COMET’s causality probabilities in policy and value models

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/goal_value_probs.png)
Figure 24: Per\-slot causality scores for the value transformer in the Object Goal Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the SLATE model for each slot inferred by the SLATE model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for value prediction\. Red bounding boxes indicate agent\-related objects, and blue bounding boxes indicate the target object\. The causality score is highest for the agent object across all time steps\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/goal_probs_policy.png)
Figure 25: Per\-slot causality scores for the policy transformer in the Object Goal Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the SLATE model for each slot inferred by the SLATE model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for policy prediction\. Red bounding boxes indicate agent\-related objects, and blue bounding boxes indicate the target object\. The causality score is highest for the agent object across most time steps\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/inter_probs_value.png)
Figure 26: Per\-slot causality scores for the value transformer in the Object Interaction Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the SLATE model for each slot inferred by the SLATE model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for value prediction\. Red bounding boxes indicate agent\-related objects, blue bounding boxes indicate the target object, and green bounding boxes indicate the auxiliary object\. The agent and target objects generally receive high causality scores; however, near the end of the trajectory, the model incorrectly assigns a higher causality score to a background object and produces comparable causality scores for other irrelevant objects\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/inter_probs_policy.png)
Figure 27: Per\-slot causality scores for the policy transformer in the Object Interaction Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the SLATE model for each slot inferred by the SLATE model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for policy prediction\. Red bounding boxes indicate agent\-related objects, blue bounding boxes indicate the target object, and green bounding boxes indicate the auxiliary object\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/oddobj_probs_value.png)
Figure 28: Per\-slot causality scores for the value transformer in the Object Comparison Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the SLATE model for each slot inferred by the SLATE model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for value prediction\. Red bounding boxes indicate agent\-related objects, and blue bounding boxes indicate the target object\. The agent, auxiliary, and target objects receive higher causality scores than other objects across time steps\. The causality score is highest for the target object across most time steps\. Some other objects receive higher causality scores than the agent object\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/oddobj_probs_policy.png)
Figure 29: Per\-slot causality scores for the policy transformer in the Object Comparison Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the SLATE model for each slot inferred by the SLATE model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for policy prediction\. Red bounding boxes indicate agent\-related objects, and blue bounding boxes indicate the target object\. The causality score is highest for the target object across most time steps\. Some other objects receive higher causality scores than the agent object\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/oddprop_probs_value.png)
Figure 30: Per\-slot causality scores for the value transformer in the Property Comparison Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the SLATE model for each slot inferred by the SLATE model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for value prediction\. Red bounding boxes indicate agent\-related objects, and blue bounding boxes indicate the target object\. In this case, the model incorrectly assigns causality scores to the agent and target objects relative to other objects\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/oddprop_probs_policy.png)
Figure 31: Per\-slot causality scores for the policy transformer in the Property Comparison Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the SLATE model for each slot inferred by the SLATE model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for policy prediction\. Red bounding boxes indicate agent\-related objects, and blue bounding boxes indicate the target object\. The causality score is highest for the target object across most time steps\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/cw_probs_value.png)
Figure 32: Per\-slot causality scores for the value transformer in the Object Reaching Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the SLATE model for each slot inferred by the SLATE model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for value prediction\. Blue bounding boxes indicate the target object\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/cw_probs_policy.png)
Figure 33: Per\-slot causality scores for the policy transformer in the Object Reaching Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the SLATE model for each slot inferred by the SLATE model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for policy prediction\. Red bounding boxes indicate agent\-related objects, and blue bounding boxes indicate the target object\. The causality score is highest for the target object and for the agent’s objects, which are distributed across multiple slots\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/robo_probs_value.png)
Figure 34: Per\-slot causality scores for the value transformer in the Block Lifting Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the DINOSAUR model for each slot inferred by the DINOSAUR model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for value prediction\. Blue bounding boxes indicate the target object\. The causality score is highest for the target object and for the agent’s objects, which are distributed across multiple slots\. The causality score is highest for the target and agent objects across most time steps\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/robo_probs_policy.png)
Figure 35: Per\-slot causality scores for the policy transformer in the Block Lifting Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the DINOSAUR model for each slot inferred by the DINOSAUR model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for policy prediction\. Red bounding boxes indicate agent\-related objects, and blue bounding boxes indicate target objects\. The causality score is highest for the target and agent objects across most time steps\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/mani_probs_value.png)
Figure 36: Per\-slot causality scores for the value transformer in the Cube Pushing Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the Slot Contrast model for each slot inferred by the Slot Contrast model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for value prediction\. Red bounding boxes indicate agent\-related objects, and blue bounding boxes indicate target objects\. The causality score is highest for the target and agent objects across most time steps\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/mani_probs_policy.png)
Figure 37: Per\-slot causality scores for the policy transformer in the Cube Pushing Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the Slot Contrast model for each slot inferred by the Slot Contrast model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for policy prediction\. Red bounding boxes indicate agent\-related objects, and blue bounding boxes indicate target objects\. The causality score is highest for the target and agent objects across most time steps\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/viz_probs_value.png)
Figure 38: Per\-slot causality scores for the value transformer in the Defend The Line Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the Slot Contrast model for each slot inferred by the Slot Contrast model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for value prediction\. Red bounding boxes indicate the agent\-related object\. All objects, including the agent and the monsters, receive high causality scores\. However, the model incorrectly assigns a high causality score to a background object\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/viz_probs_policy.png)
Figure 39: Per\-slot causality scores for the policy transformer in the Defend The Line Task\. Each row corresponds to a time step and shows the real observation from the environment together with attention maps produced by the Slot Contrast model for each slot inferred by the Slot Contrast model\. The number above each slot denotes its causality score $\alpha_t^i$, indicating the probability that the corresponding object is causally relevant for policy prediction\. Red bounding boxes indicate the agent\-related object\. All objects, including the agent and the monsters, receive high causality scores\. However, the model incorrectly assigns a high causality score to a background object\.

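
To make the role of the per\-slot causality scores in the figures above more concrete, the following is a minimal sketch of one way a relevance score could modulate attention between slots\. It is an illustration under stated assumptions, not COMET’s actual mechanism: the function name `gated_slot_attention`, the sigmoid parameterization of the score, and the choice to scale each key slot’s attention weight by its score are all hypothetical\.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_slot_attention(scores, relevance_logits):
    """Reweight slot-to-slot attention by per-slot relevance scores.

    scores[i][j]        raw attention logit from query slot i to key slot j
    relevance_logits[j] hypothetical logit whose sigmoid plays the role of
                        the causality score alpha for slot j (an assumption)
    Returns (alphas, weights), where each row of weights sums to 1.
    """
    alphas = [sigmoid(z) for z in relevance_logits]
    weights = []
    for row in scores:
        row_max = max(row)  # subtract the max for numerical stability
        # Scaling exp(score) by alpha_j is equivalent to adding log(alpha_j)
        # to the logit before the softmax.
        unnormalized = [a * math.exp(s - row_max) for s, a in zip(row, alphas)]
        total = sum(unnormalized)
        weights.append([u / total for u in unnormalized])
    return alphas, weights


# Three slots with identical raw attention logits, so any difference in the
# resulting weights comes only from the relevance scores.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
relevance_logits = [4.0, 0.0, -4.0]  # e.g. target, ambiguous, background
alphas, weights = gated_slot_attention(scores, relevance_logits)

for alpha, row in zip(alphas, weights):
    print(round(alpha, 3), [round(w, 3) for w in row])
```

Under this gating, a slot with a score near zero contributes almost nothing to any other slot’s update, which matches the qualitative reading of the figures: a high score on a background slot means that slot still influences the policy or value prediction\.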
## Appendix J SAVI’s visualizations for SOLD

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/savi_maniskill.png)
Figure 40: Hard attention maps produced by the SAVI model in the Cube Pushing task\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/savi_robosuite.png)
Figure 41: Hard attention maps produced by the SAVI model in the Block Lifting task\.

![Refer to caption](https://arxiv.org/html/2606.14418v1/images/savi_cw.png)
Figure 42: Hard attention maps produced by the SAVI model in the Object Reaching task\.