AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites

arXiv cs.AI 05/11/26, 04:00 AM Papers
Summary
This paper proposes AGWM, an affordance-grounded world model that uses a dynamic prerequisite graph to track action executability in environments with compositional prerequisites. Experiments show it reduces prediction error and improves generalization compared to standard world models.
arXiv:2605.06841v1 Announce Type: new
# AGWM: Affordance-Grounded World Models for Environments with Compositional Prerequisites
Source: [https://arxiv.org/html/2605.06841](https://arxiv.org/html/2605.06841)
Qinshi Zhang¹, Weipeng Deng², Zhihan Jiang³, Jiaming Qu⁴, Qianren Li⁵, Weitao Xu⁵, Ray LC⁵

¹University of California, San Diego; ²University of Hong Kong; ³Columbia University; ⁴Amazon; ⁵City University of Hong Kong

qiz065@ucsd\.edu, vincentdengp@icloud\.com, zj2445@cumc\.columbia\.edu, qjiaming@amazon\.com, qianrenli2@cityu\.edu\.hk, weitaoxu@cityu\.edu\.hk, raylc@cityu\.edu\.hk

###### Abstract

In model\-based learning, the agent learns behaviors by simulating trajectories based on world model predictions\. Standard world models typically learn a stationary transition function that maps states and actions to next states\. When an action and an outcome frequently co\-occur in training data, the model tends to internalize this correlation as a general causal rule while ignoring action preconditions\. In interactive environments, however, agent actions can reshape the future affordance space\. At each timestep, an action may become executable only after its prerequisites are met, or non\-executable when they are destroyed\. We term such events *structure\-changing events* \(SC events\)\. As a result, a conventional world model often fails to determine whether a given action is executable in the current state, especially in multi\-step predictions\. Each imagined step is conditioned on an incorrect affordance state, and therefore the prediction error compounds over the rollout horizon\. In this paper, we propose AGWM \(Affordance\-Grounded World Model\), which learns an abstract affordance structure represented as a DAG of prerequisite dependencies to explicitly track the dynamic executability of actions\. Experiments on game\-based simulated environments demonstrate the effectiveness of our method by achieving lower multi\-step prediction error, better generalization to novel configurations, and improved interpretability\.

![Refer to caption](https://arxiv.org/html/2605.06841v1/figures/figure1.jpg) Figure 1: AGWM overview\. Top: The agent traverses a four\-tier tech tree; SC events \(colored markers\) progressively expand the applicable action set\. Bottom: AGWM operates in three stages: \(1\) Detect SC events via the SC Classifier; \(2\) Update the Dynamic Affordance Graph to track active \(green\), frontier \(blue\), and locked \(purple\) capabilities without oracle input; \(3\) Imagine by gating RSSM rollouts with the graph embedding, so only feasible action sequences are considered\.

## 1 Introduction

World models encode environment dynamics in neural network parameters, learning statistical associations between states, actions, and their outcomes \(Ha & Schmidhuber, [2018](https://arxiv.org/html/2605.06841#bib.bib1); Ke et al\., [2019](https://arxiv.org/html/2605.06841#bib.bib7); Li et al\., [2020](https://arxiv.org/html/2605.06841#bib.bib9)\)\. Given a learned world model, an agent can roll out imagined trajectories and select the action sequence that maximizes expected return without additional environment interaction \(Hafner et al\., [2023](https://arxiv.org/html/2605.06841#bib.bib5)\)\. Standard world models often reconstruct observations well, but they do not model affordances, i\.e\., the set of actions currently available to the agent \(Gibson, [1979](https://arxiv.org/html/2605.06841#bib.bib2)\)\. They can predict what follows a given action yet cannot reliably determine whether that action is executable in the current state \(Khetarpal et al\., [2020](https://arxiv.org/html/2605.06841#bib.bib8)\)\. In open\-ended environments such as robotic manipulation and autonomous planning, an agent may confidently commit to action sequences that are physically or logically impossible, pursuing goals it can never reach or taking irreversible steps with real\-world consequences\. In safety\-critical settings, such failures may further lead the agent to select irreversible or high\-cost actions based on predicted future states that are unattainable in practice \(Amodei et al\., [2016](https://arxiv.org/html/2605.06841#bib.bib45); Berkenkamp et al\., [2017](https://arxiv.org/html/2605.06841#bib.bib46)\)\.

This issue becomes critical when agent actions alter the future executable action sets; we term these Structure\-Changing Events \(SC Events\)\. Such events are common in domains with compositional prerequisites\. For example, crafting a wooden pickaxe unlocks stone mining in Craftax \(Hafner, [2022](https://arxiv.org/html/2605.06841#bib.bib3)\); drinking a potion enables lava traversal in MiniHack \(Samvelyan et al\., [2021](https://arxiv.org/html/2605.06841#bib.bib11)\); picking up a potato near a sink enables *clean* in ALFWorld \(Shridhar et al\., [2021](https://arxiv.org/html/2605.06841#bib.bib25)\)\. Over long horizons, SC events can cascade, with each event expanding the set of executable actions and inducing exponential growth in the joint affordance space as compositional depth increases\. Standard world models cannot reliably track which actions are executable after each SC event, and thus accumulate compounding prediction errors over multi\-step rollouts\. Prior work on structured world models has explored context\-dependent causal graphs \(Hwang et al\., [2024](https://arxiv.org/html/2605.06841#bib.bib6)\), but these methods capture transition dynamics within fixed causal regimes\. They infer causal structure from observational data without representing the preconditions that govern when each action becomes available, and therefore still cannot prevent rule violations during imagination\. Fundamentally, implicit world models struggle with compositional affordance structures because they fail to separate the question "is this action allowed now?" from "what will this action do?", and thus cannot recombine known preconditions to reason about action availability at depths unseen during training\.

In this paper, we propose the Affordance\-Grounded World Model \(AGWM\) to address two failure modes of implicit world models: compounding multi\-step imagination error and failure to generalize when novel rule combinations appear at test time\. Rather than leaving affordance structure implicit, AGWM explicitly tracks at each timestep which affordances are currently active, which are newly reachable given satisfied prerequisites \(the frontier\), and how prerequisite relations are organized across the environment’s DAG schema\. By conditioning its dynamics on this explicit representation, AGWM adapts immediately to SC events instead of relying on latent state drift\. Across multiple benchmarks, AGWM significantly reduces multi\-step imagination error and generalizes to novel affordance configurations without oracle supervision\.

Our contributions are:

- \(1\) Formalization\. We formalize SC events and identify two key failure modes of implicit world models under affordance\-structure change\.
- \(2\) AGWM\. We propose AGWM, combining an SC Classifier with a Dynamic Affordance Graph that enforces a frontier\-mask constraint to track affordance change explicitly\.
- \(3\) Generalization\. We show empirically that the self\-evolved affordance graph generalizes to novel rule combinations unseen during training\.

## 2 Related Work

World models learn environment dynamics to enable agents to plan in imagination and select optimal action sequences \(Ha & Schmidhuber, [2018](https://arxiv.org/html/2605.06841#bib.bib1); Hafner et al\., [2020](https://arxiv.org/html/2605.06841#bib.bib4)\)\. DreamerV3 \(Hafner et al\., [2023](https://arxiv.org/html/2605.06841#bib.bib5)\) rolls out imagined trajectories in latent space, with follow\-up works exploring Transformer, diffusion, and JEPA architectures\. These approaches share a common formulation: actions condition the transition function, but the model cannot distinguish action legality from transition dynamics\. To address this, MuZero \(Schrittwieser et al\., [2020](https://arxiv.org/html/2605.06841#bib.bib41)\) introduces legal action masks in MCTS planning, and ResWM \(Zhang et al\., [2026](https://arxiv.org/html/2605.06841#bib.bib42)\) decomposes action\-induced changes\. However, a shared limitation remains: action legality is treated as a static function of the current state that does not evolve as the agent acts\. We propose to explicitly track both the legality of current actions and the evolution of capabilities following execution, making the world model’s prediction process fully traceable\.

More fundamentally, affordance modeling in RL \(Khetarpal et al\., [2020](https://arxiv.org/html/2605.06841#bib.bib8)\) formalizes valid actions conditioned on agent capabilities and improves planning efficiency, but treats affordances as static attributes not incorporated into world model dynamics\. Recent visual affordance reasoning \(Wang et al\., [2026](https://arxiv.org/html/2605.06841#bib.bib38)\) similarly decouples affordances from dynamics prediction\. A related line of work introduces explicit relational structure into world models via graph networks \(Kipf et al\., [2020](https://arxiv.org/html/2605.06841#bib.bib44); Huang et al\., [2022](https://arxiv.org/html/2605.06841#bib.bib40); Li et al\., [2020](https://arxiv.org/html/2605.06841#bib.bib9)\), but the graph topology is fixed at training time and cannot evolve with environment state\. Recent causal graph work recognizes that causal structure can vary with state \(Hwang et al\., [2024](https://arxiv.org/html/2605.06841#bib.bib6); Zhao et al\., [2025](https://arxiv.org/html/2605.06841#bib.bib43)\), yet switches are indexed by latent meta\-states rather than causally bound to specific agent actions\.

Evolving environment dynamics have also been studied from the perspective of non\-stationarity\. Hidden\-parameter MDPs \(Doshi\-Velez & Konidaris, [2016](https://arxiv.org/html/2605.06841#bib.bib17)\) and meta\-RL \(Wang et al\., [2016](https://arxiv.org/html/2605.06841#bib.bib18)\) model task variation as latent parameters or task distributions, allowing agents to adapt to different dynamics regimes\. WALL\-E \(Zhou et al\., [2025](https://arxiv.org/html/2605.06841#bib.bib30)\) uses large language models to extract if\-then symbolic rules from trajectories, enabling interpretable policy learning\. Gospodinov et al\. \([2024](https://arxiv.org/html/2605.06841#bib.bib34)\) embed non\-stationarity handling directly into a DreamerV3\-style world model, continuously estimating distributional drift in latent space and adapting dynamics parameters online\. Each of these approaches models dynamics that shift across tasks or over time, but none treats action\-triggered affordance expansion as a first\-class modeling target\.

The unifying blind spot across all three threads is attribution: which agent action caused this affordance change, and when? In a cascading tech tree, one action at step $t$ can activate several downstream affordance edges spanning multiple tiers; each activation is a zero\-to\-one transition triggered by an identifiable action, not a background dynamics drift or a latent regime shift\. A frozen\-graph model predicts with the topology from training time and misses newly activated edges entirely\. A causal switcher represents the transition as a move between pre\-fitted regime distributions, attributing the change to a latent meta\-state rather than to the action\. A non\-stationarity model detects that dynamics changed but cannot identify the cause\. Our proposed design targets attribution directly: $g_t$ is derived per\-timestep from the current observation via a fixed DAG schema, so each transition in node states and frontier mask is tied to the SC event that caused it\. Because the graph explicitly encodes which affordances are currently reachable \(frontier mask\) and which prerequisites are satisfied \(edge states\), the model can represent affordance states that reflect novel combinations of rules, a capability that latent\-indexed methods, bounded by their training\-distribution regimes, cannot provide by construction\.

## 3 Method: Affordance\-Grounded World Models

### 3\.1 Problem Formulation

#### Structure\-Changing Events\.

We consider environments modeled as a Markov decision process \(MDP\) $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $T: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$ the transition function, $R$ the reward function, and $\gamma \in [0, 1)$ the discount factor\. An action $a_t$ at state $s_t$ is *structure\-changing* \(SC\) if it alters the affordance set:

$$\text{SC}(s_t, a_t) = \mathbb{1}\left[\mathcal{F}(s_{t+1}) \neq \mathcal{F}(s_t)\right], \tag{1}$$

where $\mathcal{F}(s) \subseteq \mathcal{A}$ denotes the *applicable action set* at state $s$, i\.e\., the actions whose environmental preconditions hold\. SC events are distinct from ordinary state changes: an ordinary action \(e\.g\., moving or looking\) updates the state but preserves the affordance set, while a structure\-changing action \(e\.g\., crafting or equipping\) alters both\. Implicit world models fail in this setting through two mechanisms\. First, multi\-step imagination suffers from compounding error: without an explicit affordance state, each rollout step is conditioned on stale action preconditions, and prediction error accumulates\. We observe this directly on MiniHack: Vanilla’s error multiplies by $73.7\times$ from step 1 to step 8, while AGWM remains at $5.5\times$ \(Table [1](https://arxiv.org/html/2605.06841#S4.T1)\)\. Second, compositional generalization fails: Vanilla achieves $0\%$ CDA on KeyDungeon and $2.5\%$ on Craftax, near the random baseline of $5.9\%$, when evaluated on SC\-critical decisions \(Table [2](https://arxiv.org/html/2605.06841#S4.T2)\), and collapses on out\-of\-distribution affordance configurations where AGWM reaches $90.9$–$100\%$ CDA \(Table [2](https://arxiv.org/html/2605.06841#S4.T2)\)\.
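To make the SC indicator of Eq. (1) concrete, the following is a minimal sketch on a toy crafting state. The inventory encoding, the action names, and the `applicable_actions` rules are all hypothetical stand-ins for an environment's real affordance function and are not taken from the paper.

```python
from typing import FrozenSet


def applicable_actions(inventory: FrozenSet[str]) -> FrozenSet[str]:
    """Toy affordance function F(s). The rules below are hypothetical."""
    actions = {"move", "collect_wood"}
    if "wood" in inventory:
        actions.add("craft_wood_pickaxe")
    if "wood_pickaxe" in inventory:
        actions.add("mine_stone")
    return frozenset(actions)


def is_structure_changing(state: FrozenSet[str], next_state: FrozenSet[str]) -> int:
    """Eq. (1): 1 if the applicable action set differs between s_t and s_{t+1}."""
    return int(applicable_actions(state) != applicable_actions(next_state))


s0 = frozenset()                            # empty inventory
s1 = frozenset({"wood"})                    # after collecting wood
s2 = frozenset({"wood", "wood_pickaxe"})    # after crafting the pickaxe

print(is_structure_changing(s0, s0))  # ordinary step, affordance set unchanged
print(is_structure_changing(s0, s1))  # crafting becomes available
print(is_structure_changing(s1, s2))  # stone mining becomes available
```

In this toy setting only the second and third transitions are SC events; a plain movement step changes the state but leaves the applicable action set untouched.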

### 3\.2 AGWM

#### Dynamic Affordance Graph\.

![Refer to caption](https://arxiv.org/html/2605.06841v1/figures/fig_agwm_overview.png) Figure 2: AGWM system overview\. The environment delivers reward and observation to AGWM\. The SC Classifier predicts whether $(h_t, a_t, e_t)$ triggers a structure\-changing event and signals the Dynamic Affordance Graph to self\-evolve $g_t$\. The graph embedding $e_t$ conditions the RSSM World Model, gating imagination rollouts to the current affordance frontier\. The Imagination Planning loop uses the imagined trajectories to optimize the Actor\-Critic policy\.

To address these problems, we represent $\mathcal{F}(s)$ as a structured feature vector $g_t \in \{0,1\}^d$, derived per\-timestep from the current observation, which we call the *affordance graph*\. Each environment has a fixed DAG schema defining $N$ affordance nodes \(items, tools, structures\) and $E$ prerequisite edges \(tech\-tree dependencies\); $g_t$ concatenates three binary components: *node states* \($N$ dims\) encoding which affordances are currently achieved; a *frontier mask* \($N$ dims\) encoding which affordances are newly reachable \(all prerequisites satisfied but not yet achieved\); and *edge states* \($E$ dims\) indicating which prerequisite edges are currently satisfied\. The model learns to predict how the graph transitions at each step:

$$\hat{g}_{t+1} = f_{\text{graph}}(h_t, a_t, g_t), \tag{2}$$

where $h_t$ is the recurrent hidden state and $f_{\text{graph}}$ is a learned predictor\. The key design property is the *frontier\-mask constraint*: an affordance can become active only when its DAG prerequisites are already met, which matches the OR\-monotone structure of tech trees and enables $d$ independent binary predictions rather than joint modeling of the full $2^d$ affordance space\.
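As an illustration of how the three components of $g_t$ could be assembled from a fixed DAG schema, here is a minimal sketch. The three-node schema and the name `build_graph_vector` are hypothetical, and treating an edge as "satisfied" once its parent node is achieved is an assumption made for this example rather than a detail stated in the text. Under this reading the vector has $d = 2N + E$ entries.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical tech tree: each node maps to its list of prerequisite parents.
SCHEMA: Dict[str, List[str]] = {
    "wood": [],
    "wood_pickaxe": ["wood"],
    "stone": ["wood_pickaxe"],
}
NODES: List[str] = list(SCHEMA)
EDGES: List[Tuple[str, str]] = [(p, c) for c in NODES for p in SCHEMA[c]]


def build_graph_vector(achieved: Set[str]) -> List[int]:
    """Concatenate node states (N), frontier mask (N), and edge states (E)."""
    node_states = [int(n in achieved) for n in NODES]
    # Frontier: not yet achieved, and every prerequisite already achieved.
    frontier = [
        int(n not in achieved and all(p in achieved for p in SCHEMA[n]))
        for n in NODES
    ]
    # Assumption: an edge counts as satisfied once its parent is achieved.
    edge_states = [int(parent in achieved) for parent, _child in EDGES]
    return node_states + frontier + edge_states


print(build_graph_vector(set()))      # only "wood" is on the frontier
print(build_graph_vector({"wood"}))   # frontier moves to "wood_pickaxe"
```

With $N = 3$ nodes and $E = 2$ edges the vector has 8 entries, each of which can be predicted as a separate binary target.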

Figure [3](https://arxiv.org/html/2605.06841#S3.F3) contrasts the graphical models of a conventional world model and AGWM\. In a standard POMDP world model, the action $a^t$ feeds unconditionally into the next state $s^{t+1}$: the model associates action\-outcome pairs statistically but cannot enforce whether $a^t$ is currently executable\. AGWM introduces the affordance graph $g^t$ as an explicit variable\. $g^t$ gates the transition $a^t \to s^{t+1}$, preventing the model from predicting outcomes for infeasible actions\. Crucially, $g^t$ is not observed but self\-evolved: it is predicted from $(h_t, a_t)$ and updated monotonically as the agent discovers new affordances\.

![Refer to caption](https://arxiv.org/html/2605.06841v1/figures/fig_graphical_model.png) Figure 3: Probabilistic graphical models of world model variants\. \(a\) Vanilla world model: $a^t$ feeds unconditionally into $s^{t+1}$; the model cannot enforce whether $a^t$ is currently executable, causing compounding imagination error after SC events\. \(b\) AGWM \(one step\): $g^t$ is introduced as an explicit affordance variable\. A structure\-changing action $a^t$ triggers an SC event edge \(magenta\) that updates $g^{t+1}$, while $g^t$ gates the transition to $s^{t+1}$ via a gating edge \(teal\), enforcing action legality\. \(c\) AGWM imagination rollout: the predicted graph $\hat{g}^t$ \(dashed\) is propagated forward at each step, gating every imagined transition so that multi\-step rollouts remain within the current affordance frontier\.

Specifically, AGWM augments a GRU\-based recurrent world model with three modules \(Figure [2](https://arxiv.org/html/2605.06841#S3.F2)\)\. The base architecture encodes observations into latent states $z_t = \text{Enc}(o_t)$, updates a recurrent hidden state $h_t = \text{GRU}([z_t, a_t], h_{t-1})$, predicts the next latent $\hat{z}_{t+1} = f_{\text{dyn}}(h_t)$, and reconstructs $\hat{o}_{t+1} = \text{Dec}(\hat{z}_{t+1})$\.

Graph Encoder\. The affordance graph $g_t \in \{0,1\}^d$ is embedded via a linear projection $e_t = \text{GraphEnc}(g_t) \in \mathbb{R}^{d_e}$, where $d_e = 64$ across all environments\. This embedding is injected into two places: \(1\) the GRU input, extending it to $h_t = \text{GRU}([z_t, a_t, e_t], h_{t-1})$, so the recurrent state is informed by the current affordance structure; and \(2\) the decoder, giving $\hat{o}_{t+1} = \text{Dec}([\hat{z}_{t+1}, e_{t+1}])$, so the reconstruction loss directly backpropagates through the graph embedding\. The decoder path is critical: without it, the graph enters only through the GRU, where the model can learn to ignore it\. Conditioning the decoder forces the reconstruction loss to depend on graph quality, providing a strong gradient signal to the graph predictor\.
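The shape bookkeeping of the graph encoder can be sketched in a few lines of plain Python. This is only an illustration: the randomly initialised weights stand in for learned parameters, the bias term is an assumption, and the latent size (32) and action size (4) are hypothetical values chosen for the example.

```python
import random
from typing import Callable, List


def make_graph_encoder(d: int, d_e: int = 64, seed: int = 0) -> Callable[[List[int]], List[float]]:
    """Return a linear projection g_t -> e_t with placeholder (untrained) weights."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(d_e)]
    bias = [0.0] * d_e  # assumption: the projection carries a bias term

    def encode(g: List[int]) -> List[float]:
        return [sum(w * x for w, x in zip(row, g)) + b for row, b in zip(weight, bias)]

    return encode


encode = make_graph_encoder(d=8)
g_t = [1, 0, 0, 0, 1, 0, 1, 0]     # example binary affordance vector
e_t = encode(g_t)                  # graph embedding, length d_e = 64

z_t = [0.0] * 32                   # hypothetical latent size
a_t = [0.0, 1.0, 0.0, 0.0]         # hypothetical one-hot action
gru_input = z_t + a_t + e_t        # [z_t, a_t, e_t] concatenation fed to the GRU
print(len(e_t), len(gru_input))
```

The decoder input would be formed the same way, by concatenating the predicted latent with the embedding of the next graph.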

SC Classifier\. A two\-layer MLP $f_{\text{sc}}$ takes the hidden state, action embedding, and graph embedding as input and predicts whether $a_t$ triggers a structure change: $\hat{p}_{\text{sc}} = \sigma(f_{\text{sc}}(h_t, a_t, e_t))$\. SC events are rare in practice \(typically 5–15% of steps\), so we apply positive\-class weighting \($w_{\text{pos}} = 5$\) to the binary cross\-entropy loss to prevent the classifier from collapsing to always\-negative predictions\.
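A minimal sketch of positive-class weighting for binary cross-entropy is shown below, assuming the weight multiplies only the positive-label term and the batch loss is a plain mean. The function name `weighted_bce` and the example batch are hypothetical; the paper's exact normalisation is not specified in the text.

```python
import math
from typing import Sequence


def weighted_bce(probs: Sequence[float], labels: Sequence[int],
                 w_pos: float = 5.0, eps: float = 1e-7) -> float:
    """Mean BCE where errors on positive (SC) labels are scaled by w_pos."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(w_pos * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Hypothetical batch: 1 SC event in 10 steps, classifier always predicts 0.05.
labels = [1] + [0] * 9
always_negative = [0.05] * 10
print(weighted_bce(always_negative, labels, w_pos=1.0))  # unweighted loss
print(weighted_bce(always_negative, labels, w_pos=5.0))  # missed positive costs 5x more
```

With the weighting in place, a classifier that ignores the rare positive class pays a noticeably higher loss, which is the collapse the weighting is meant to discourage.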

Graph Predictor\. A separate MLP $f_{\text{graph}}$ predicts the next affordance state per dimension: $\hat{g}_{t+1} = \sigma(f_{\text{graph}}(h_t, a_t, e_t)) \in [0,1]^d$\. Each dimension is predicted independently, decomposing the $2^d$ joint affordance space into $d$ binary problems\. At inference, the predicted logits are thresholded and combined with the current graph via the monotonicity constraint \(Section [3\.3](https://arxiv.org/html/2605.06841#S3.SS3)\)\. Because $e_t = \text{GraphEnc}(g_t)$ enters the predictor input, a gradient shortcut through $e_t$ is possible in principle; however, the 10\.0% Aff\. Acc improvement over Vanilla on Craftax \(Table [2](https://arxiv.org/html/2605.06841#S4.T2)\) confirms that the predictor learns meaningful affordance dynamics rather than trivially copying its input\.

![Refer to caption](https://arxiv.org/html/2605.06841v1/figures/architecture_comparison.png) Figure 4: Architecture comparison\. \(a\) Vanilla RSSM processes observations and actions through a GRU\. \(b\) AGWM augments the RSSM with a self\-evolving affordance graph: the Graph Encoder embeds affordance structure into the GRU input and decoder, while the SC Classifier and Graph Predictor auxiliary heads learn to detect and predict structure changes\.

### 3\.3 Self\-Evolving Affordance Discovery

Unlike prior affordance models that assume a fixed or oracle\-provided mapping, AGWM’s affordance graph is*self\-evolved*from agent experience, oracle\-free at test time\.

SC label generation\. At each training step, the SC label is computed by comparing consecutive affordance states: $p_{\text{sc}} = \mathbf{1}[g_{t+1} \neq g_t]$\. The per\-dimension graph target is $g_{t+1}$ itself, obtained by comparing $\mathcal{F}(s_t)$ and $\mathcal{F}(s_{t+1})$ directly from environment state\. This requires access to the affordance function $\mathcal{F}$ during training but not at test time, where $g_t$ is maintained by the model’s own predictions\.

![Refer to caption](https://arxiv.org/html/2605.06841v1/figures/fig_graph_evolution.png) Figure 5: Affordance graph evolution in Craftax\. As the agent progresses through the tech tree within an episode, the node\-state and frontier\-mask components of $g_t$ update to reflect newly achieved affordances and currently reachable next steps; the graph predictor learns to anticipate these transitions from $(h_t, a_t, g_t)$\.

Frontier\-mask constraint\. Affordances in tech\-tree environments follow prerequisite ordering: stone is mineable only after a wood pickaxe is crafted\. The frontier mask encodes this structurally: the frontier bit for node $v$ is 1 if and only if all parent prerequisites of $v$ are satisfied in the current node\-state vector\. The graph predictor therefore only needs to activate nodes whose prerequisites are already met, restricting the prediction space to the DAG’s reachable frontier at each step\. This constraint is computed analytically from the fixed DAG schema and requires no additional learning, preventing the model from producing physically impossible affordance combinations under distribution shift\.
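One way to read the frontier-mask constraint together with the monotone update is sketched below. This is an interpretation under stated assumptions, not the paper's implementation: predictions are thresholded at 0.5, a node may switch on only if it is on the current frontier, and achieved nodes are never switched off. The function name `gated_update`, the threshold, and the three-node chain are hypothetical.

```python
from typing import List


def gated_update(node_states: List[int], pred_probs: List[float],
                 parents: List[List[int]], threshold: float = 0.5) -> List[int]:
    """Apply thresholded predictions, gated by the analytically computed frontier."""
    # Frontier bit: node not yet achieved and all parent nodes achieved.
    frontier = [
        int(state == 0 and all(node_states[p] == 1 for p in parents[i]))
        for i, state in enumerate(node_states)
    ]
    new_states = []
    for state, on_frontier, prob in zip(node_states, frontier, pred_probs):
        activated = on_frontier == 1 and prob >= threshold
        # Monotone: achieved nodes stay achieved regardless of the prediction.
        new_states.append(1 if (state == 1 or activated) else 0)
    return new_states


# Chain wood -> wood_pickaxe -> stone, expressed as parent index lists.
parents = [[], [0], [1]]
node_states = [1, 0, 0]              # wood achieved
pred_probs = [0.10, 0.90, 0.95]      # predictor is (wrongly) confident about stone

print(gated_update(node_states, pred_probs, parents))
```

In this example the confident prediction for stone is blocked because its prerequisite has not yet been achieved, and the low score for wood does not undo an affordance that is already active.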

Graph evolution within an episode\.Becausegtg\_\{t\}is computed from the current observation at each step, it naturally tracks the agent’s affordance state as it progresses through the tech tree within an episode\. As the agent crafts tools and gathers resources, the frontier mask expands to reflect newly reachable affordances\. Figure[5](https://arxiv.org/html/2605.06841#S3.F5)illustrates this evolution alongside the agent’s progression in Craftax\.

### 3\.4 Training Objective

The total loss combines reconstruction, dynamics, SC classification, and graph prediction:

$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{dyn}} + \lambda_{\text{SC}}\,\mathcal{L}_{\text{SC}} + \lambda_{\text{graph}}\,\mathcal{L}_{\text{graph}}, \tag{3}$$

where each term is:

$$\mathcal{L}_{\text{recon}} = \text{MSE}(\hat{o}_{t+1}, o_{t+1}), \qquad \mathcal{L}_{\text{dyn}} = \text{MSE}(\hat{z}_{t+1}, z_{t+1}), \tag{4}$$

$$\mathcal{L}_{\text{SC}} = \text{BCE}(\hat{p}_{\text{sc}}, p_{\text{sc}}), \qquad \mathcal{L}_{\text{graph}} = \tfrac{1}{d} \textstyle\sum_{i=1}^{d} \text{BCE}(\hat{g}_{t+1}^{(i)}, g_{t+1}^{(i)}). \tag{5}$$

The graph loss averages over all $d$ dimensions independently\. We set $\lambda_{\text{SC}} = 1.0$ and $\lambda_{\text{graph}} = 2.0$\. Results are robust to both hyperparameters: performance is stable across $\lambda_{\text{graph}} \in [0.5, 5.0]$ and across graph embedding dimensions $d_e \in \{8, 16, 32, 64\}$, with $\lambda_{\text{graph}} = 0$ the only configuration that causes substantial degradation \(see Appendix [F](https://arxiv.org/html/2605.06841#A6)\)\. The reconstruction and dynamics losses are unweighted relative to each other\.
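The weighted sum in Eq. (3) and the per-dimension graph loss in Eq. (5) can be written out as a short sketch. The helper names are hypothetical, the inputs are plain Python lists rather than tensors, and the SC term here is an unweighted BCE for brevity (the positive-class weighting described for the SC Classifier would be applied inside that term).

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce(p: float, y: int, eps: float = 1e-7) -> float:
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def graph_loss(g_hat: Sequence[float], g_next: Sequence[int]) -> float:
    """Eq. (5), right: BCE averaged over the d graph dimensions."""
    return sum(bce(p, y) for p, y in zip(g_hat, g_next)) / len(g_next)


def total_loss(l_recon: float, l_dyn: float, l_sc: float, l_graph: float,
               lambda_sc: float = 1.0, lambda_graph: float = 2.0) -> float:
    """Eq. (3) with the weights reported in the text."""
    return l_recon + l_dyn + lambda_sc * l_sc + lambda_graph * l_graph


# Hypothetical per-term values for one training step.
l_graph = graph_loss([0.9, 0.2, 0.6], [1, 0, 1])
print(total_loss(l_recon=0.10, l_dyn=0.20, l_sc=bce(0.3, 1), l_graph=l_graph))
```

Setting `lambda_graph=0.0` in this sketch removes the graph term entirely, which corresponds to the one configuration the text reports as causing substantial degradation.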

## 4 Experiments

We evaluate AGWM by examining the following research questions: \(1\) Does AGWM reduce multi\-step imagination error over the Vanilla baseline, and does the advantage grow with longer rollout horizons? \(Table [1](https://arxiv.org/html/2605.06841#S4.T1)\) \(2\) Does it accurately detect SC events and learn interpretable affordance structure? \(Table [2](https://arxiv.org/html/2605.06841#S4.T2), Figure [5](https://arxiv.org/html/2605.06841#S3.F5)\) \(3\) Does the self\-evolved graph generalize to novel rule combinations not seen during training? \(Table [2](https://arxiv.org/html/2605.06841#S4.T2)\) \(4\) Are SC detection and graph conditioning both necessary? \(Table [3](https://arxiv.org/html/2605.06841#S4.T3)\)

### 4\.1 Benchmarks

Full environment descriptions are provided in Appendix [B](https://arxiv.org/html/2605.06841#A2)\. Briefly: KeyDungeon, Forage, and Harvest are custom pixel gridworlds at compositional depth $D = 1$ where SC events erase colored squares from the observation\. Relay \($D = 3$\) and Cascade \($D = 4$\) extend this with chain\-unlocking mechanics that produce two simultaneous pixel changes per SC event\. MiniHack \(Samvelyan et al\., [2021](https://arxiv.org/html/2605.06841#bib.bib11)\) \(LavaCross, $D = 2$\), Crafter \(Hafner, [2022](https://arxiv.org/html/2605.06841#bib.bib3)\) \($D = 3$\), and Craftax \(Matthews et al\., [2024](https://arxiv.org/html/2605.06841#bib.bib26)\) \($D = 4$\) are standard benchmarks included for breadth and comparability with prior work\.

### 4\.2 Implementation Details

All three variants share the RSSM backbone from DreamerV3 \(Hafner et al\., [2023](https://arxiv.org/html/2605.06841#bib.bib5), [2020](https://arxiv.org/html/2605.06841#bib.bib4)\) and comparable capacity \(within 17%\)\. Vanilla is the unmodified RSSM\. SC\-Only adds the SC Classifier head to Vanilla \(Section [3](https://arxiv.org/html/2605.06841#S3)\) but *without* the affordance graph; it isolates whether SC event detection alone \(without explicit graph conditioning\) suffices\. AGWM is the full model: SC Classifier plus Graph Encoder, Graph Predictor, and frontier\-mask constraint\. All models are trained with AdamW \(lr $= 5 \times 10^{-4}$, weight decay $= 10^{-4}$\), gradient clipping \(norm 1\.0\), on 16\-step sequences with batch size 64, for 500k steps\. The observation encoder and decoder are environment\-specific: CNN for pixel observations \(all environments except Craftax\) and MLP for symbolic observations \(Craftax\)\. All results report 3\-seed mean $\pm$ std\.

Prediction quality is measured via imagination MSE \(8\-step rollout after 10\-step posterior warmup\), counterfactual detection accuracy \(CDA\), counterfactual prediction gap, and affordance prediction accuracy\. For Vanilla, CDA is measured by training a linear probe on frozen hidden states; for AGWM and SC\-Only, CDA uses the trained SC Classifier directly\.

### 4\.3 World Model Quality

Table 1: Imagination MSE \($\times 10^{-3}$, lower is better\)\. Avg: step\-averaged over the full rollout; Step 1/Step 8: short\- and long\-horizon error accumulation\. AGWM uses a self\-evolved graph \(no oracle\)\. Bold\+underline: best per column; bold: second best\. 3\-seed mean $\pm$ std\. †MiniHack: glyph\-embedding space \(16\-dim $\times$ 21 $\times$ 79 grid\), not comparable to pixel/symbolic rows\. ‡Craftax: 256\-dim obs\-embedding space; DR $\approx$ 1\.0 for both models \(no horizon degradation\); CDA is the primary differentiator \(Table [2](https://arxiv.org/html/2605.06841#S4.T2)\)\.

Graph conditioning helps most when SC events cause large, visible changes to the observation\. Our $2 \times 2$ design isolates this: Relay and Cascade have the same compositional depth as Crafter and Craftax, but with clear pixel changes at every SC event, AGWM achieves $101.7\times$ and $12.3\times$ lower MSE\. At the same depth, Crafter and Craftax show no improvement because their SC signals are subtle or symbolic\. When crafting an item changes only a small inventory icon, even a model with the correct affordance graph cannot improve its pixel predictions enough to reduce overall MSE\. The difference grows over longer rollouts: on MiniHack, Vanilla reaches $73.7\times$ of its step\-1 error by step 8, while AGWM stays at $5.5\times$\. This happens because errors build up: without knowing what the new affordance state is, each step starts from a wrong assumption about which actions are available\.
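The horizon-degradation figures quoted above are ratios of last-step to first-step imagination error. A small sketch of how such rollout summaries could be computed from a per-step MSE curve follows; the function name `rollout_summary` and the error values are hypothetical and are not results from the paper.

```python
from typing import Dict, Sequence


def rollout_summary(per_step_mse: Sequence[float]) -> Dict[str, float]:
    """Summarise an imagination rollout: average, first/last step, and their ratio."""
    first, last = per_step_mse[0], per_step_mse[-1]
    return {
        "avg": sum(per_step_mse) / len(per_step_mse),
        "step_first": first,
        "step_last": last,
        "degradation_ratio": last / first,  # e.g. step-8 error over step-1 error
    }


# Hypothetical 8-step error curve (arbitrary units).
errors = [1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.5, 8.0]
summary = rollout_summary(errors)
print(summary["avg"], summary["degradation_ratio"])
```

A ratio close to 1 indicates that error does not grow with the horizon, while a large ratio indicates compounding error across imagined steps.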

#### SC detection and affordance understanding\.

Table 2: SC Detection Accuracy \(CDA, %, higher is better\), probing results, and OOD generalization \(3\-seed mean $\pm$ std\)\. Aff\. Acc: next\-step affordance prediction accuracy \(%\); CF Gap: L2 distance between SC and non\-SC predictions\. ‡OOD splits \(AGWM only, Vanilla/\+SC not evaluated on held\-out configs\): KD L4 = Levels 1–3 $\to$ L4 cross\-color; KD L5 = Levels 1–4 $\to$ L5 chain4; Craftax = Tech tiers 0–1 $\to$ tiers 2–3\.

As the affordance structure gets deeper, it becomes harder for a fixed\-size latent vector to keep track of it\. On Craftax, where four prerequisite tiers must be maintained at the same time, Vanilla scores only $2.5\%$ CDA while AGWM reaches $98.6\%$, which suggests that an unstructured hidden state lacks the capacity to store that much structured information\. On Crafter, SC\-Only reaches $24.2\%$ CDA \(vs\. $0\%$ for AGWM\) but with high variance \($\pm 22.7\%$, driven by a single seed at $54.5\%$\) and no improvement in MSE, suggesting the SC head sometimes detects events that are not real affordance transitions\. The full model avoids this because any detected SC event must update the graph, and the model is then judged on whether those updates lead to better predictions\. The graph also makes the model’s internal state readable: on KeyDungeon, AGWM’s predicted observations differ $20.5\times$ more between SC and non\-SC actions than Vanilla’s \($1.64$ vs\. $0.08$ CF Gap\)\. Figure [5](https://arxiv.org/html/2605.06841#S3.F5) shows each unlock as a single added edge, with no probing required to see what the model knows\.

Table 3: Graph source ablation on KeyDungeon. All AGWM variants share the same trained model. MSE ($\times 10^{-3}$), mean $\pm$ std, 3 seeds.

#### Compositional generalization.

The learned graph works well on affordance rules not seen during training. On KeyDungeon, AGWM reaches $90.9\%$ CDA on cross-color rules and $100.0\%$ on entirely new L5 chain mechanisms (Table [2](https://arxiv.org/html/2605.06841#S4.T2)); on Craftax, training on tiers 0–1 and testing on tiers 2–3 gives $87.9 \pm 0.7\%$ OOD CDA. The $100\%$ accuracy on new L5 chains suggests the graph has learned general affordance rules (“collecting *some* item enables *some* next item”) rather than memorizing specific item pairs, so it handles new item combinations correctly. The oracle control experiment (Appendix [E](https://arxiv.org/html/2605.06841#A5)) shows why this works: the self-evolved graph ($1.34\times$ degradation) is close to the oracle ceiling ($1.12\times$) without any ground-truth graph at test time, while zeroing the graph ($6.06\times$) performs worst, confirming that the self-evolving mechanism is the main source of test-time accuracy.

#### Ablations\.

Table [3](https://arxiv.org/html/2605.06841#S4.T3) shows that removing either component hurts. Without graph-conditioned decoding, the graph becomes a passive input the model can learn to ignore, which is why SC-Only outperforms the full AGWM in that setting. Adding the decoding loss brings the benefit back. Even a frozen graph (fixed at its initial state, $5.76$) already beats Vanilla ($9.19$), showing that any reasonable affordance structure is more useful than none. The self-evolving mechanism then closes most of the gap to the oracle ($5.23$ vs. $4.95$). Giving the model an all-zero graph at test time ($19.54$) is worse than no graph at all ($9.19$), since the model was never trained on that input. Any non-zero $\lambda$ helps, and the gain levels off above $0.5$.

#### Sample efficiency\.

With less training data, the benefit of explicit SC structure gets larger. At 25% data, AGWM and SC-Only both beat Vanilla ($14.94$ and $14.76$ vs. $16.64$); at full data all three models perform about the same (Table [4](https://arxiv.org/html/2605.06841#S4.T4)). Graph accuracy also levels off early: Aff.Acc reaches $97.2\% \pm 1.8\%$ at just 25% data and goes above $98\%$ at 50%, showing that the affordance transition structure can be learned from a small number of trajectories.

Table 4: Sample efficiency on KeyDungeon. Imagination MSE ($\times 10^{-3}$) and affordance-graph accuracy (Aff.Acc) vs. training data fraction. Mean $\pm$ std, 3 seeds. MSE lower is better; Aff.Acc higher is better.

## 5 Conclusion

We introduced AGWM, a world model that explicitly tracks how agent actions change environment affordances through a self\-evolving affordance graph\. Across our benchmark suite, we demonstrated that explicit SC detection and graph\-conditioned imagination reduce multi\-step prediction error; that the self\-evolved graph generalizes to novel rule combinations without oracle supervision; and that the frontier\-mask constraint is essential: removing it causes performance worse than the graph\-free Vanilla baseline due to distribution shift at test time\.

## Limitations\.

The graph dimensions \(which affordance types to track\) are currently hand\-designed per environment\. Learning the graph schema from data is an important direction\. The self\-evolving mechanism also requires comparing consecutive states to detect SC events; extending to delayed or partial observability is future work\.

## Broader impact\.

This work proposes a method for learning world models that explicitly track which agent actions remain executable as the environment’s rule structure evolves\. The primary intended application is model\-based reinforcement learning in compositional domains such as open\-ended game environments and robotic task planning\. On the positive side, better world models that respect action preconditions could improve the safety and reliability of autonomous agents deployed in structured environments, since the agent’s imagination is constrained to feasible action sequences rather than arbitrary counterfactuals\. On the negative side, more capable planning agents could be applied to adversarial or harmful tasks\. We note that the current work is evaluated exclusively on simulated benchmark environments; deployment in physical systems would require additional safety validation beyond what this paper addresses\. We do not anticipate near\-term negative societal impacts specific to this work\.

## References

- Ha & Schmidhuber (2018) Ha, D. & Schmidhuber, J. (2018). World models. Advances in Neural Information Processing Systems.
- Gibson (1979) Gibson, J. J. (1979). The Ecological Approach to Visual Perception. Houghton Mifflin.
- Hafner (2022) Hafner, D. (2022). Benchmarking the spectrum of agent capabilities. Transactions on Machine Learning Research.
- Hafner et al. (2020) Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to control: Learning behaviors by latent imagination. ICLR.
- Hafner et al. (2023) Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2025). Mastering diverse control tasks through world models. Nature, 640, 647–653.
- Hwang et al. (2024) Hwang, I., Kwak, Y., Choi, S., Zhang, B.-T., & Lee, S. (2024). Fine-grained causal dynamics learning with quantization for improving robustness in reinforcement learning. ICML.
- Ke et al. (2019) Ke, N. R., Singh, A., Touati, A., Goyal, A., Bengio, Y., Parikh, D., & Batra, D. (2019). Learning dynamics model in reinforcement learning by incorporating the long term future. ICLR.
- Khetarpal et al. (2020) Khetarpal, K., Ahmed, Z., Comanici, G., Abel, D., & Precup, D. (2020). What can I do here? A theory of affordances in reinforcement learning. ICML.
- Li et al. (2020) Li, M., Yang, M., Liu, F., Chen, X., Chen, Z., & Wang, J. (2020). Causal world models by unsupervised deconfounding of physical dynamics. arXiv preprint arXiv:2012.14228.
- Pearl (2009) Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
- Samvelyan et al. (2021) Samvelyan, M., Kirk, R., Kurin, V., Parker-Holder, J., Jiang, M., Hambro, E., Zilly, F., Küttler, H., Grefenstette, E., & Rocktäschel, T. (2021). MiniHack the planet: A sandbox for open-ended reinforcement learning research. NeurIPS Datasets and Benchmarks Track.
- Schölkopf et al. (2021) Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., & Bengio, Y. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5), 612–634.
- Hafner et al. (2025a) Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2025). Mastering diverse control tasks through world models. Nature, 640, 647–653.
- Micheli et al. (2023) Micheli, V., Alonso, E., & Fleuret, F. (2023). Transformers are sample-efficient world models. ICLR.
- Alonso et al. (2024) Alonso, E., Jelley, A., Micheli, V., Kanervisto, A., Beard, A., & Fleuret, F. (2024). Diffusion for world modeling: Visual details matter in Atari. NeurIPS.
- Robine et al. (2023) Robine, J., Höftmann, M., Uel, T., & Harmeling, S. (2023). Transformer-based world models are happy with 100k interactions. ICLR.
- Doshi-Velez & Konidaris (2016) Doshi-Velez, F. & Konidaris, G. (2016). Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. IJCAI.
- Wang et al. (2016) Wang, J.X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J.Z., Munos, R., Blundell, C., Kumaran, D., & Botvinick, M. (2017). Learning to reinforcement learn. Proceedings of the 39th Annual Conference of the Cognitive Science Society.
- Bellemare et al. (2012) Bellemare, M.G., Veness, J., & Bowling, M. (2012). Investigating contingency awareness using Atari 2600 games. AAAI.
- Badia et al. (2020) Badia, A.P., Sprechmann, P., Viber, A., Guo, D., Piot, B., Kapturowski, S., Tieleman, O., Arjovsky, M., Pritzel, A., Bolt, A., & Blundell, C. (2020). Agent57: Outperforming the Atari human benchmark. ICML.
- Gibson (1977) Gibson, J.J. (1977). The theory of affordances. In Perceiving, Acting, and Knowing. Erlbaum.
- Do et al. (2018) Do, T.T., Nguyen, A., & Reid, I. (2018). AffordanceNet: An end-to-end deep learning approach for object affordance detection. ICRA.
- Mo et al. (2021) Mo, K., Guibas, L.J., Mukadam, M., Gupta, A., & Tulsiani, S. (2021). Where2Act: From pixels to actions for articulated 3D objects. ICCV.
- Abel et al. (2022) Abel, D., Dabney, W., Harutyunyan, A., Ho, M.K., Littman, M., Precup, D., & Singh, S. (2022). A definition of continual reinforcement learning. NeurIPS.
- Shridhar et al. (2021) Shridhar, M., Yuan, X., Côté, M.A., Bisk, Y., Trischler, A., & Hausknecht, M. (2021). ALFWorld: Aligning text and embodied environments for interactive learning. ICLR.
- Matthews et al. (2024) Matthews, M., Sheratt, M., Sheratt, O., Sheratt, E., & Sheratt, J. (2024). Craftax: A lightning-fast benchmark for open-ended reinforcement learning. arXiv preprint.
- Pearl (2000) Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press.
- Behrens et al. (2018) Behrens, T.E.J., Muller, T.H., Whittington, J.C.R., Mark, S., Baram, A.B., Stachenfeld, K.L., & Kurth-Nelson, Z. (2018). What is a cognitive map? Organizing knowledge for flexible behavior. Neuron, 100(4), 946–954.
- Choi et al. (2019) Choi, J., Guo, Y., Moczulski, M., Oh, J., Wu, N., Norouzi, M., & Lee, H. (2019). Contingency-aware exploration in reinforcement learning. ICLR.
- Zhou et al. (2025) Zhou, S., Zhou, T., Yang, Y., Long, G., Ye, D., Jiang, J., & Zhang, C. (2025). WALL-E: World alignment by rule learning improves world model-based LLM agents. NeurIPS.
- Hafner et al. (2025b) Hafner, D., Yan, W., & Lillicrap, T. (2025). Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527.
- Morihira et al. (2026) Morihira, N. et al. (2026). R2-Dreamer: Redundancy-reduced world models without decoders or augmentation. ICLR.
- Wu et al. (2025) Wu, J., Yin, S., Feng, N., & Long, M. (2025). RLVR-World: Training world models with reinforcement learning. NeurIPS.
- Gospodinov et al. (2024) Gospodinov, E., Shaj, V., Becker, P., Geyer, S., & Neumann, G. (2024). Adaptive world models: Learning behaviors by latent imagination under non-stationarity. NeurIPS Workshop on Adaptive Foundation Models.
- Zhang et al. (2025) Zhang, Y. et al. (2025). Multi-level RL with model-changing actions over transition kernel spaces. arXiv preprint arXiv:2510.15056.
- Maes et al. (2026) Maes, L. et al. (2026). LeWorldModel: Stable end-to-end JEPA world models from pixels. arXiv preprint arXiv:2603.19312.
- Dainese et al. (2024) Dainese, N., Merler, M., Alakuijala, M., & Marttinen, P. (2024). Generating code world models with large language models guided by Monte Carlo tree search. NeurIPS.
- Wang et al. (2026) Wang, H. et al. (2026). Affordance-R1: Reinforcement learning for generalizable affordance reasoning in multimodal LLMs. AAAI.
- Farebrother et al. (2025) Farebrother, J., Pirotta, M., Tirinzoni, A., Munos, R., Lazaric, A., & Touati, A. (2025). Temporal difference flows. ICML (Oral).
- Huang et al. (2022) Huang, B., Lu, C., Leqi, L., Hernandez-Lobato, J. M., Glymour, C., Scholkopf, B., & Zhang, K. (2022). Action-sufficient state representation learning for control with structural constraints. ICML.
- Schrittwieser et al. (2020) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T., & Silver, D. (2020). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588, 604–609.
- Zhang et al. (2026) Zhang, J. et al. (2026). ResWM: Residual-action world model for visual RL. arXiv preprint arXiv:2603.11110.
- Zhao et al. (2025) Zhao, Z., Li, H., Zhang, H., Wang, J., Faccio, F., Schmidhuber, J., & Yang, M. (2025). Curious causality-seeking agents learn meta causal world. NeurIPS.
- Kipf et al. (2020) Kipf, T., van der Pol, E., & Welling, M. (2020). Contrastive learning of structured world models. ICLR.
- Amodei et al. (2016) Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mané, D. (2016). Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
- Berkenkamp et al. (2017) Berkenkamp, F., Turchetta, M., Schoellig, A. P., & Krause, A. (2017). Safe model-based reinforcement learning with stability guarantees. NeurIPS.

## Appendix A Implementation Details

#### Network architecture\.

All models share an identical base architecture. The observation encoder is a 3-layer CNN (channels 32/64/64, kernel 3, stride 2) for pixel observations (Crafter, KeyDungeon), a glyph-embedding CNN (16-dim per glyph, then 3 Conv layers) for MiniHack, and a 2-layer MLP (hidden dim 256) for symbolic observations (Craftax). The GRU has hidden size 512. The dynamics network $f_{\text{dyn}}$ and the decoder are symmetric 2-layer MLPs with hidden dim 512. For AGWM, the Graph Encoder is a single linear layer mapping $g_t \in \{0,1\}^d$ to $e_t \in \mathbb{R}^{64}$. The SC Classifier and Graph Predictor are each 2-layer MLPs (hidden dim 256, ReLU) taking $[h_t, a_t, e_t]$ as input. Total parameter counts on KeyDungeon: Vanilla 1.76M, SC-Only 1.77M, AGWM 1.89M; on Crafter: Vanilla 1.60M with proportionally similar scaling. AGWM adds roughly 7% more parameters than Vanilla.

#### Training\.

All models are trained for 100 epochs on pre-collected expert trajectories (500 train / 100–200 eval per environment). We use the AdamW optimizer with learning rate $5\times 10^{-4}$, weight decay $10^{-4}$, and gradient clipping at norm 1.0. Batch size is 16 sequence windows. Loss weights: $\lambda_{\text{SC}} = 1.0$, $\lambda_{\text{graph}} = 2.0$. The SC classifier uses positive-class weighting $w_{\text{pos}} = 5$ to compensate for class imbalance (SC events occur in 5–15% of steps depending on environment). The affordance graph $g_t$ is computed per timestep from the current observation via the environment’s fixed DAG schema (node states, frontier mask, edge states); graph targets for training are derived directly from environment state, requiring no manual annotation. Seeds 42, 123, and 456 are used for reproducibility; we report 3-seed mean $\pm$ std throughout.
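To make the loss weighting concrete, the following sketch shows a positive-class-weighted binary cross-entropy for a single SC prediction and one way the loss terms could be combined. The weights $\lambda_{\text{SC}} = 1.0$, $\lambda_{\text{graph}} = 2.0$, and $w_{\text{pos}} = 5$ come from the text; the function names, the clipping constant, and the assumption that the terms are simply added to a reconstruction loss are illustrative and not taken from the paper.

```python
import math


def weighted_bce(prob, label, w_pos=5.0, eps=1e-7):
    """Binary cross-entropy where positive (SC) steps count w_pos times as much."""
    prob = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
    if label == 1:
        return -w_pos * math.log(prob)
    return -math.log(1.0 - prob)


def total_loss(recon_loss, sc_loss, graph_loss, lam_sc=1.0, lam_graph=2.0):
    """Assumed combination: reconstruction plus weighted SC and graph terms."""
    return recon_loss + lam_sc * sc_loss + lam_graph * graph_loss


# A missed SC event (label 1, predicted 0.5) costs five times a non-SC step
# predicted with the same uncertainty.
loss_positive = weighted_bce(0.5, 1)
loss_negative = weighted_bce(0.5, 0)
combined = total_loss(recon_loss=1.0, sc_loss=0.5, graph_loss=0.25)
```

With SC events in only 5–15% of steps, an unweighted loss could be minimized by rarely predicting an SC event; the positive weight counteracts that.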

#### Evaluation\.

Imagination MSE is measured over 8-step rollouts following a 10-step posterior warmup from ground-truth observations. All metrics are reported as mean $\pm$ std over 3 independent seeds with different random initializations.
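The protocol can be sketched as a loop with two phases: a warmup in which the model is updated from ground-truth observations, and an open-loop phase in which it sees only actions. The three callables (`observe`, `imagine`, `decode`) are a hypothetical interface introduced for illustration, and the indexing convention (that `observations[t + 1]` follows from `observations[t]` and `actions[t]`) is an assumption.

```python
def imagination_mse(observe, imagine, decode, observations, actions,
                    warmup=10, horizon=8):
    """Per-step MSE of an open-loop rollout after a posterior warmup."""
    if len(observations) < warmup + horizon:
        raise ValueError("trajectory too short for warmup + horizon")
    state = None
    # Warmup: the model is updated with ground-truth observations.
    for t in range(warmup):
        state = observe(state, observations[t])
    errors = []
    # Imagination: the model receives actions only, never observations.
    for k in range(horizon):
        t = warmup - 1 + k
        state = imagine(state, actions[t])
        predicted = decode(state)
        target = observations[t + 1]
        squared = [(p - q) ** 2 for p, q in zip(predicted, target)]
        errors.append(sum(squared) / len(squared))
    return errors


# Toy 1-D world: each action adds its value to the observation.
observations = [[float(t)] for t in range(18)]
actions = [1.0] * 17

perfect = imagination_mse(
    observe=lambda state, obs: obs,
    imagine=lambda state, action: [state[0] + action],
    decode=lambda state: state,
    observations=observations, actions=actions)

# A model that overshoots by 0.5 per step accumulates error over the rollout.
biased = imagination_mse(
    observe=lambda state, obs: obs,
    imagine=lambda state, action: [state[0] + action + 0.5],
    decode=lambda state: state,
    observations=observations, actions=actions)
```

In the toy world the biased model's error grows quadratically with the step index, which is the compounding effect the degradation ratio is meant to capture.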

#### Environment\-specific details\.

## Appendix B Environment Descriptions

Table 5: Environment overview. Compositional depth $D$ = maximum cascading SC chain length. † This work.

#### KeyDungeon (this work, $D=1$).

A custom $48\times 48$ pixel gridworld where the agent collects a key to open a locked door and retrieves a chest. A single affordance transition (key held $\to$ door traversable) defines the sole SC event per episode. Item acquisition produces a discrete, visible pixel change: the key sprite disappears and the door glyph switches to an open doorway. KeyDungeon isolates the minimal SC setting, providing a controlled testbed for single-transition affordance prediction.

#### Forage (this work, $D=1$).

A $64\times 64$ RGB gridworld with six colored items on an $8\times 8$ grid. Collecting an item causes its colored square to disappear (SC event). The binary affordance graph $g_t \in \{0,1\}^6$ records which items have been collected; a scripted policy with 15% random noise yields SC events at $\approx 5\%$ of steps. Forage provides a controlled pixel-change setting: the decoder conditioned on the predicted $g_{t+1}$ can anticipate item disappearance directly.

#### Harvest (this work, $D=1$).

Extends Forage to eight items with two additional item colors, raising the SC rate to 6.7% and increasing simultaneous-item-tracking complexity. The graph is $g_t \in \{0,1\}^8$; all other mechanics are identical to Forage.

#### Relay (this work, $D=3$).

A $64\times 64$ RGB gridworld with a single chain of four items ($D=3$, three SC transitions). Collecting item $k$ desaturates it to a neutral used-marker color *and* simultaneously switches item $k+1$ from gray (locked) to its assigned bright color, producing two large pixel changes per SC event. The affordance graph $g_t \in \{0,1\}^4$ encodes which items have been collected (monotone: bits only transition $0 \to 1$). A scripted policy targeting the nearest visible item yields SC events at $\approx 6.7\%$ of steps.

#### Cascade (this work, $D=4$).

Extends Relay to a single chain of five items ($D=4$, four SC transitions, $g_t \in \{0,1\}^5$). The same dual-change SC mechanism applies: collecting item $k$ desaturates it to a used marker while item $k+1$ becomes visible. The SC rate is $\approx 8.3\%$. Cascade provides a pixel-observation counterpart to Craftax at matching compositional depth.

#### MiniHack (Samvelyan et al., [2021](https://arxiv.org/html/2605.06841#bib.bib11)) ($D=2$).

LavaCross task: the agent acquires a levitation potion and consumes it to cross lava. SC event = potion consumption activating the levitation affordance. Observations are $21\times 79$ glyph maps; affordance changes manifest as inventory transitions with no dramatic pixel shift. This environment tests prediction when SC events are detectable but not visually salient.

#### Crafter (Hafner, [2022](https://arxiv.org/html/2605.06841#bib.bib3)) ($D=3$).

A 2D survival game with a three-tier tech tree. SC events include placing a crafting table (enables wood tools), placing a furnace (enables metal processing), and crafting successive tool tiers. Observations are $64\times 64$ RGB images; affordance changes produce subtle rather than dramatic pixel differences. The joint affordance space spans $2^{17}$ states.

#### Craftax (Matthews et al., [2024](https://arxiv.org/html/2605.06841#bib.bib26)) ($D=4$).

A JAX-accelerated symbolic implementation of Crafter with a four-tier cascading tech tree. Observations are 1345-dimensional symbolic vectors encoding inventory, achievements, and local map state. The symbolic observation directly encodes affordance-relevant information, so the horizon-degradation ratio DR $\approx 1.0$ for all models; CDA is the primary differentiator at this depth.

Craftax uses the symbolic observation format (1345-dim) with the JAX-based implementation (Matthews et al., [2024](https://arxiv.org/html/2605.06841#bib.bib26)), enabling fast parallel rollouts. MiniHack uses the NetHack symbolic observation format: a $21\times 79$ glyph map (each cell is a categorical glyph ID in [0, 6000)) paired with a 27-dim blstats vector; a learned glyph embedding (16-dim per glyph) followed by a CNN encoder maps this to a latent state. KeyDungeon is a custom MiniGrid-based environment with procedurally generated key-door configurations; we evaluate on held-out map layouts unseen during training.

## Appendix C Affordance Graph Design

We design compact affordance graphs for each environment:

#### Craftax \(60 dims, DAG\)\.

15 tech-tree nodes (resources, tools, structures) $\times$ 2 state types (`node_active`, `frontier`) plus 30 directed prerequisite edges: $15+15+30=60$ dims. The frontier vector encodes nodes whose prerequisites are satisfied; edge states are active only when both endpoints are unlocked.

#### KeyDungeon \(14 dims, DAG\)\.

5 nodes (`find_key`, `pickup_key`, `reach_door`, `unlock_door`, `reach_goal`) $\times$ 2 state types (`node_active`, `frontier`) plus 4 directed prerequisite edges: $5+5+4=14$ dims. The key SC events are picking up the key (activates `pickup_key`, gates the frontier to `unlock_door`) and unlocking the door (activates `unlock_door`, expands the frontier to `reach_goal`).
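As an illustration of this layout, the sketch below assembles a 14-dimensional KeyDungeon graph vector from the set of active nodes. The node names and the three-block layout (node_active, frontier, edges) come from the text; the specific edge list (a simple chain), the rule that the frontier excludes already-active nodes, and the function name are assumptions, since the source states only that there are 4 prerequisite edges.

```python
NODES = ["find_key", "pickup_key", "reach_door", "unlock_door", "reach_goal"]
# Hypothetical edge list: a simple chain of (prerequisite, dependent) indices.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]


def graph_vector(active):
    """Concatenate node_active, frontier, and edge bits into one binary vector."""
    prereqs = {i: [src for src, dst in EDGES if dst == i] for i in range(len(NODES))}
    # Assumed frontier rule: not yet active, and every prerequisite is active.
    frontier = [
        int(not active[i] and all(active[p] for p in prereqs[i]))
        for i in range(len(NODES))
    ]
    # An edge is on only when both of its endpoints are unlocked.
    edges = [int(bool(active[src]) and bool(active[dst])) for src, dst in EDGES]
    return [int(bool(a)) for a in active] + frontier + edges


start = graph_vector([0, 0, 0, 0, 0])          # nothing achieved yet
after_pickup = graph_vector([1, 1, 0, 0, 0])   # key found and picked up
```

Under these assumptions, picking up the key turns on the first edge bit and moves the frontier from `find_key` to `reach_door`.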

#### Crafter \(38 dims, DAG\)\.

11 tech-tree nodes $\times$ 2 state types (`node_active`, `frontier`) plus 16 directed prerequisite edges: $11+11+16=38$ dims. The DAG captures Crafter’s 3-tier tech tree (wood $\to$ stone $\to$ iron $\to$ diamond); SC events are tier-crossing actions (placing a table, crafting a pickaxe) that activate new frontier nodes.

#### MiniHack LavaCross \(15 dims, DAG\)\.

6 affordance nodes (`find_potion`, `drink_potion`, `lava_traversable`, `reach_goal`, `find_stairs`, `reach_stairs`) $\times$ 2 state types (`node_active`, `frontier`) plus 3 directed prerequisite edges: $6+6+3=15$ dims. The key SC event is drinking the levitation potion, which activates the `lava_traversable` node and expands the frontier to include `reach_goal`.

## Appendix D Additional Results

Table 6: Imagination MSE ($\times 10^{-3}$) across all environments. AGWM uses the self-evolved graph. Mean $\pm$ std over 3 seeds. Red: prior architecture; blue: current DAG architecture. Entries marked “$\cdot$” indicate results not yet collected.

Table 7: Full probing results across all environments. Mean $\pm$ std over 3 seeds.

## Appendix E Oracle Graph Control Experiment

To rule out the alternative explanation that the oracle model underperforms on novel rules due to stale graph inputs \(i\.e\., the graph was fixed at training time and does not reflect novel\-rule affordances\), we run a controlled ablation on KeyDungeon novel\-rule test sets\. We compare four oracle graph conditions at test time:

Table 8: 8-step degradation ratio on KeyDungeon novel rules under different oracle graph conditions. Mean $\pm$ std over 3 seeds. Lower is better.

The ground-truth graph (Oracle-Current, $1.12\times$) achieves the lowest degradation, confirming that correct affordance conditioning reduces imagination error and establishing Oracle AGWM as an upper bound. The stale training-time graph ($1.17\times$) performs close to Oracle-Current: although semantically incorrect for novel rules, it preserves the dimension-activation patterns seen during training and introduces minimal distributional shift. The self-evolved graph ($1.34\times$) closely approaches the oracle upper bound without ground-truth access; the $0.22\times$ gap demonstrates that the self-evolving mechanism captures novel rule structure effectively. All-zero graph conditioning ($6.06\times$) is the worst condition: the model was trained with non-zero graphs, so an all-zero input at test time constitutes an OOD signal that severely disrupts imagination.

## Appendix F Hyperparameter Sensitivity

We evaluate sensitivity to the graph loss weight $\lambda_{\text{graph}}$ and the graph embedding dimension $d_e$ on Craftax ($D=4$), using 8-step degradation ratio and affordance prediction accuracy (Aff. Acc) as metrics. Results are reported as mean $\pm$ std over 3 seeds.

Table 9: Effect of graph loss weight $\lambda_{\text{graph}}$ (graph embedding dimension fixed at $d_e = 32$). Degradation ratio = MSE at step 8 / MSE at step 1; lower is better.

Table 10: Effect of graph embedding dimension $d_e$ ($\lambda_{\text{graph}}$ fixed at 1.0). Mean over 3 seeds.

$\lambda_{\text{graph}} = 0$ (no graph loss) yields an Aff. Acc of only 0.46 and $2.53\times$ degradation, confirming the graph loss is essential. Once $\lambda_{\text{graph}} \geq 0.5$, degradation stabilizes within a $0.05\times$ band regardless of further increases, indicating the model is not sensitive to the exact value in this range. $d_e$ has negligible effect: all four values achieve the same Aff. Acc (0.939) and degradation differences remain within $0.11\times$.

## Appendix G Monotonicity Constraint Ablation

We ablate the monotonicity constraint on Craftax ($D=4$), comparing four conditions: (1) Vanilla: no affordance graph; (2) Graph (no constraint): graph-conditioned model with oracle graph targets, no OR-operation enforced at inference; (3) Graph-AR (no constraint): same but with an auto-regressive predicted graph; (4) AGWM (constrained): full model with the monotonicity OR-operation.

Table 11: 8-step degradation ratio (MSE at step 8 / MSE at step 1) on Craftax. Mean $\pm$ std over 3 seeds. Lower is better.

The monotonicity constraint is decisive: AGWM holds at $1.10\times$ degradation across all three seeds (std $= 0.01$), while all unconstrained variants diverge. Notably, graph-conditioned models *without* the constraint perform far worse than Vanilla ($136.3\times$ vs. $51.4\times$). Without the OR-operation, the graph predictor can deactivate previously discovered affordances, introducing distribution shift that compounds more severely than having no graph at all. The constraint eliminates this failure mode by construction.
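The OR-operation can be sketched in a few lines: the next graph is the element-wise OR of the previous graph and the thresholded prediction, so a bit that is already set can never be cleared. The function name and the 0.5 threshold are assumptions made for illustration; the source specifies only that an OR-operation enforces monotonicity at inference.

```python
def monotone_update(previous, predicted_probs, threshold=0.5):
    """Element-wise OR of the previous graph with the thresholded prediction."""
    predicted = [int(p >= threshold) for p in predicted_probs]
    return [old | new for old, new in zip(previous, predicted)]


previous = [1, 0, 0, 0]
# The predictor wrongly assigns low probability to an affordance already unlocked.
predicted_probs = [0.1, 0.9, 0.2, 0.4]

constrained = monotone_update(previous, predicted_probs)
unconstrained = [int(p >= 0.5) for p in predicted_probs]
```

Here the unconstrained graph drops the first affordance while the constrained one keeps it, which is the deactivation failure the ablation describes.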

## Appendix H Compute Resources

All experiments were conducted on a single NVIDIA RTX 4090 GPU \(24 GB VRAM\)\. No distributed or cloud compute was used\.

## Appendix I MiniHack LavaCross Details

MiniHack-LavaCross-v0 (Samvelyan et al., [2021](https://arxiv.org/html/2605.06841#bib.bib11)) places the agent on a map separated from the goal by a lava corridor. A levitation potion is available nearby; drinking it activates the `lava_traversable` affordance ($D=2$ SC chain: find potion $\to$ drink potion $\to$ cross lava). The environment has 77 discrete actions. Imagination MSE is computed in the glyph-embedding space (16-dim per glyph $\times$ $21\times 79$ grid) and is not directly comparable to pixel-space MSE in KeyDungeon/Crafter. Over 3 seeds, AGWM achieves $(1021.2 \pm 102.4)\times 10^{-3}$ MSE vs. Vanilla’s $(1496.2 \pm 1116.5)\times 10^{-3}$ ($1.47\times$ improvement). Notably, AGWM exhibits substantially lower cross-seed variance (std $102.4$) than Vanilla (std $1116.5$), indicating that affordance graph conditioning stabilizes imagination quality across random initializations.

## Usage of Large Language Models

During the preparation of this manuscript, large language models were used solely for grammar checking and writing refinement\. They were not used as any component of the AGWM method, did not contribute to experimental design or result analysis, and had no impact on the scientific content or conclusions of this paper\.