Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents

arXiv cs.AI 06/29/26, 04:00 AM Papers
Summary
This paper introduces Grounded Iterative Language Planning (GILP), a method that combines a small parameterized world model with LLM-based reasoning to reduce hallucination propagation in LLM agents. Experiments show GILP reduces hallucinated-state rate from 0.176 to 0.035 and raises task success from 0.668 to 0.838 on graph-structured planning benchmarks.
arXiv:2606.27806v1 Announce Type: new Abstract: World models for language agents come in two useful forms. An agent-based world model calls an LLM API and reasons flexibly in language, but its errors appear as hallucinated state changes that are hard to score with ordinary regression losses. A parameterized world model is a trained transition predictor; its errors are easier to measure with quantities such as NodeMSE, delta accuracy, and validity accuracy, but it is usually weaker as a standalone planner. We compare these two families on four graph-structured planning benchmarks and introduce operational hallucination metrics for the agent-based case. The comparison motivates Grounded Iterative Language Planning (GILP), which trains only a small parameterized backbone and combines it with API-based agent reasoning. The backbone supplies valid actions, predicted state deltas, risk, and value; the LLM drafts an action and imagined delta; and a consistency gate asks for revision when the two disagree. On real GPT-4o-mini calls, GILP reduces hallucinated-state rate from 0.176 to 0.035. In calibrated simulator ablations, it raises success from 0.668 to 0.838 while adding only ~22% extra LLM calls.
Cached at: 06/29/26, 05:27 AM
# How Parameterized World Models Reduce Hallucination Propagation in LLM Agents
Source: [https://arxiv.org/html/2606.27806](https://arxiv.org/html/2606.27806)
## Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents

Xinyuan Song¹, Zekun Cai²,³; ¹Emory University, Atlanta, GA, USA; ²The University of Tokyo, Tokyo, Japan; ³LocationMind, Tokyo, Japan; xinyuan.song@emory.edu, caizekun@csis.u-tokyo.ac.jp

###### Abstract

World models for language agents come in two useful forms. An agent-based world model calls an LLM API and reasons flexibly in language, but its errors appear as hallucinated state changes that are hard to score with ordinary regression losses. A parameterized world model is a trained transition predictor; its errors are easier to measure with quantities such as NodeMSE, delta accuracy, and validity accuracy, but it is usually weaker as a standalone planner. We compare these two families on four graph-structured planning benchmarks and introduce operational hallucination metrics for the agent-based case. The comparison motivates Grounded Iterative Language Planning (GILP), which trains only a small parameterized backbone and combines it with API-based agent reasoning. The backbone supplies valid actions, predicted state deltas, risk, and value; the LLM drafts an action and imagined delta; and a consistency gate asks for revision when the two disagree. On real GPT-4o-mini calls, GILP reduces hallucinated-state rate from 0.176 to 0.035. In calibrated simulator ablations, it raises success from 0.668 to 0.838 while adding only ~22% extra LLM calls. Code: [https://github.com/Hik289/Environment-reduce-error.git](https://github.com/Hik289/Environment-reduce-error.git).


## 1 Introduction

Large language models (LLMs) are now a common backbone for autonomous agents, whether they are called through OpenAI’s `chat.completions` (OpenAI, [2023](https://arxiv.org/html/2606.27806#bib.bib55)), Anthropic’s `messages` (Anthropic, [2024](https://arxiv.org/html/2606.27806#bib.bib56)), or the Google Gemini endpoint (Google DeepMind, [2023](https://arxiv.org/html/2606.27806#bib.bib77)). In chain-of-thought and ReAct-style planning (Wei et al., [2022](https://arxiv.org/html/2606.27806#bib.bib3); Yao et al., [2023b](https://arxiv.org/html/2606.27806#bib.bib1); Hao et al., [2023](https://arxiv.org/html/2606.27806#bib.bib6); Shinn et al., [2023](https://arxiv.org/html/2606.27806#bib.bib2); Yao et al., [2023a](https://arxiv.org/html/2606.27806#bib.bib4); Wang et al., [2023a](https://arxiv.org/html/2606.27806#bib.bib5)), the agent is not only choosing actions. It is also acting as an *agent-based world model*: it writes what it thinks the next state will be, then uses that text in the next decision. This is useful because goals, tool calls, and observations all live in the same medium, and the recipe works well on many interactive benchmarks (Shridhar et al., [2021](https://arxiv.org/html/2606.27806#bib.bib11); Yao et al., [2022](https://arxiv.org/html/2606.27806#bib.bib12); Liu et al., [2024](https://arxiv.org/html/2606.27806#bib.bib16); Zhou et al., [2024a](https://arxiv.org/html/2606.27806#bib.bib65); Jimenez et al., [2024](https://arxiv.org/html/2606.27806#bib.bib69)).

The hard part is that not all world\-model errors look the same\. A parameterized world model has an explicit prediction target, so its error can be computed directly: NodeMSE on node states, delta accuracy on changed nodes, validity accuracy on actions, and so on\. An agent\-based world model is different\. Its transition is a piece of language and structured JSON produced by an API model\. The most damaging errors are semantic hallucinations: a completion that did not happen, a dependency that is ignored, or an entity state that is written into the history and reused later\. These errors are not well described by a single MSE\. We therefore define operational metrics for them: hallucinated\-state rate, propagation depth, and long\-horizon error growth\.

This leads to the central comparison of the paper\. Parameterized world models have measurable and often lower transition error, but they are weak semantic planners\. Agent\-based world models reason well, but their hallucinations are harder to measure and compound over long horizons\. In our experiments, the agent baseline’s per\-step error probability climbs to 0\.393 by step ten, its hallucinated\-state rate reaches 0\.205, and a hallucinated atom persists for a mean of 2\.45 steps\. The natural question is whether a small amount of trained parameterization can be used to control the hallucination error of an API\-based agent without giving up its reasoning ability\.

#### Motivating example\.

Consider a six-step workflow with tasks $\{1,\dots,6\}$. At step three the agent’s imagined transition declares “task 3: completed” although the environment kept it pending (the precondition was missed in the JSON serialisation). The model continues to plan against the corrupted state and emits `execute(task_5)`, which depends on task 3; the environment rejects it as invalid, and the agent patches with three more hallucinated state atoms before the episode times out. One false token generated three invalid actions. We see this cascade both in the calibrated simulator and in direct GPT-4o-mini API calls. JSON-mode enforcement (OpenAI, [2023](https://arxiv.org/html/2606.27806#bib.bib55)) helps with *syntax*; it does not by itself stop the agent from believing a false state.

#### Two world models\.

We study both sides explicitly. The agent-based world model is the LLM planner: it calls an API, reasons over the serialized task, chooses an action, and writes an imagined next-state delta. The parameterized world model is a small trained network (Ha and Schmidhuber, [2018](https://arxiv.org/html/2606.27806#bib.bib22); Hafner et al., [2020](https://arxiv.org/html/2606.27806#bib.bib23); Sutton, [1991](https://arxiv.org/html/2606.27806#bib.bib26); Moerland et al., [2023](https://arxiv.org/html/2606.27806#bib.bib27)) that predicts action validity, next-state deltas, completion, value, and risk. The latter has ordinary supervised errors, including NodeMSE and delta accuracy; the former needs hallucination metrics because its mistakes live in generated state claims. The two models therefore fail in complementary ways.

#### Our approach: GILP\.

We propose Grounded Iterative Language Planning (GILP) to combine them. GILP trains only a small parameterized backbone, then keeps the API agent as the reasoning engine. At each step, the backbone scores candidate actions and serialises a compact *skeleton*: validity, predicted delta, risk, affected entities, and value. The LLM drafts an action and an imagined next-state delta in structured JSON. A Jaccard consistency gate compares the agent’s delta with the parameterized prediction; when consistency falls below $\tau_{\text{low}} = 0.30$, the agent receives a short correction message that names the disagreeing atoms. The goal is not to make the parameterized model solve the task. It is to use its measurable transition signal to reduce the hallucination error of the agent-based world model.

#### Contributions\.

- We frame long-horizon planning as a comparison between two world models: an agent-based model with flexible API reasoning and a parameterized model with measurable supervised transition error.
- We define hallucination propagation as the agent-based analogue of world model error and measure it with hallucinated-state rate (HSR), propagation depth (PD), and long-horizon error-probability proxies.
- We introduce GILP, which trains a small parameterized backbone and uses it to correct the hallucinated state deltas of an API-based reasoning agent.
- We show that GILP improves both task success and state faithfulness: simulator success rises from 0.668 to 0.838, real GPT-4o-mini HSR falls by 80%, and long-horizon success improves from 0.471 to 0.758.
- We release the prompt suite (Appendix [A](https://arxiv.org/html/2606.27806#A1)), simulator, benchmarks, and code artifacts for reproducible follow-up work.

## 2 Related Work

#### LLMs as language world models\.

A growing line of work treats LLM agents as *implicit* world models that generate next-state predictions in natural language as part of chain-of-thought planning. ReAct (Yao et al., [2023b](https://arxiv.org/html/2606.27806#bib.bib1)) interleaves reasoning and action so that each thought predicts a downstream outcome; Reflexion (Shinn et al., [2023](https://arxiv.org/html/2606.27806#bib.bib2)) adds verbal self-critique that updates the agent’s belief about past predictions; Plan-and-Solve (Wang et al., [2023a](https://arxiv.org/html/2606.27806#bib.bib5)) forces commitment to a complete imagined trajectory before execution; Tree of Thoughts and Graph of Thoughts (Yao et al., [2023a](https://arxiv.org/html/2606.27806#bib.bib4); Besta et al., [2024](https://arxiv.org/html/2606.27806#bib.bib62)) branch the rollout into a search tree; LATS (Liu et al., [2023](https://arxiv.org/html/2606.27806#bib.bib66)) unifies reasoning, acting, and planning via language agent tree search; RAP (Hao et al., [2023](https://arxiv.org/html/2606.27806#bib.bib6)) uses the LLM *as* the transition model inside MCTS. The same paradigm drives benchmark and embodied deployments (Shridhar et al., [2021](https://arxiv.org/html/2606.27806#bib.bib11); Yao et al., [2022](https://arxiv.org/html/2606.27806#bib.bib12); Qin et al., [2024](https://arxiv.org/html/2606.27806#bib.bib13); Wang et al., [2024a](https://arxiv.org/html/2606.27806#bib.bib14); Lin et al., [2023](https://arxiv.org/html/2606.27806#bib.bib19); Liu et al., [2024](https://arxiv.org/html/2606.27806#bib.bib16); Zhou et al., [2024a](https://arxiv.org/html/2606.27806#bib.bib65); Jimenez et al., [2024](https://arxiv.org/html/2606.27806#bib.bib69)). Sumers et al. ([2024](https://arxiv.org/html/2606.27806#bib.bib73)) provide a unified cognitive-architecture perspective on these systems. Critically, all of them *rely on the LLM’s own generation* to model the world, so errors in imagined states are injected directly into the context for all subsequent steps.

#### LLM API ecosystem\.

Three major API paradigms dominate agent deployments. The *OpenAI API* (OpenAI, [2023](https://arxiv.org/html/2606.27806#bib.bib55)) (`gpt-4o-mini`, `gpt-4o`, `o1-mini`) exposes `chat.completions` with a `response_format={"type":"json_object"}` option, function calling, and token-level logprobs; 2025 list prices for `gpt-4o-mini` are \$0.15/\$0.60 per million input/output tokens. The *Anthropic API* (Anthropic, [2024](https://arxiv.org/html/2606.27806#bib.bib56)) (`claude-3-haiku` through `claude-opus`) exposes the `messages` endpoint with tool-use, extended thinking, and streaming support. The *Google Gemini API* (Google DeepMind, [2023](https://arxiv.org/html/2606.27806#bib.bib77)) (`gemini-1.5-flash`, `gemini-1.5-pro`) offers `response_mime_type="application/json"` at \$0.075/\$0.30 per million tokens for Flash. *Open-source serving* via vLLM or Ollama runs Llama-3, Mistral, or Qwen behind an OpenAI-compatible API at zero marginal token cost. The three paradigms differ on JSON-mode reliability, latency, and tool-use idioms, all of which affect how often the agent emits a parseable action and whether its imagined state can be extracted from the response. Instruction fine-tuning (Ouyang et al., [2022](https://arxiv.org/html/2606.27806#bib.bib57)) and agent specialisation (Zeng et al., [2023](https://arxiv.org/html/2606.27806#bib.bib74); Chen et al., [2023](https://arxiv.org/html/2606.27806#bib.bib76); Wang et al., [2024b](https://arxiv.org/html/2606.27806#bib.bib68)) shift this reliability curve per-model, but no general grounding mechanism spans all APIs. GILP’s consistency gate is API-agnostic.

#### Parametric world models\.

Parametric world models predict environment transitions from data and have been a workhorse of model-based RL since Dyna (Sutton, [1991](https://arxiv.org/html/2606.27806#bib.bib26)). Modern variants include latent-imagination architectures (Ha and Schmidhuber, [2018](https://arxiv.org/html/2606.27806#bib.bib22); Hafner et al., [2020](https://arxiv.org/html/2606.27806#bib.bib23), [2021](https://arxiv.org/html/2606.27806#bib.bib24), [2023](https://arxiv.org/html/2606.27806#bib.bib25)), probabilistic ensembles (Chua et al., [2018](https://arxiv.org/html/2606.27806#bib.bib28)), sequence-modeled offline trajectories (Janner et al., [2021](https://arxiv.org/html/2606.27806#bib.bib29)), and learned models combined with search (Schrittwieser et al., [2020](https://arxiv.org/html/2606.27806#bib.bib30)). Moerland et al. ([2023](https://arxiv.org/html/2606.27806#bib.bib27)) survey the broader model-based RL landscape. Because our benchmarks are graph-structured, we instantiate graph neural network backbones: GCN (Kipf and Welling, [2017](https://arxiv.org/html/2606.27806#bib.bib31)), GraphSAGE (Hamilton et al., [2017](https://arxiv.org/html/2606.27806#bib.bib32)), MPNN (Gilmer et al., [2017](https://arxiv.org/html/2606.27806#bib.bib33)), GAT (Veličković et al., [2018](https://arxiv.org/html/2606.27806#bib.bib34)), GIN (Xu et al., [2019](https://arxiv.org/html/2606.27806#bib.bib37)), R-GCN (Schlichtkrull et al., [2018](https://arxiv.org/html/2606.27806#bib.bib36)), and the graph transformer GPS (Rampášek et al., [2022](https://arxiv.org/html/2606.27806#bib.bib35)). These models are cheap and stable, but their outputs come from fixed heads rather than open-ended language; they cannot reason compositionally about novel goals and systematically under-solve tasks requiring semantic understanding (0.565 SR vs. 0.668 for the best agent baseline).

#### Hallucination, faithfulness, and self\-correction\.

Single-turn hallucination is well documented in NLG (Ji et al., [2023](https://arxiv.org/html/2606.27806#bib.bib46); Maynez et al., [2020](https://arxiv.org/html/2606.27806#bib.bib47); Zhang et al., [2023](https://arxiv.org/html/2606.27806#bib.bib43); Huang et al., [2023](https://arxiv.org/html/2606.27806#bib.bib44); Rawte et al., [2023](https://arxiv.org/html/2606.27806#bib.bib45)). Faithfulness approaches include faithful chain-of-thought reasoning (Lyu et al., [2023](https://arxiv.org/html/2606.27806#bib.bib72)), self-consistency sampling (Wang et al., [2023b](https://arxiv.org/html/2606.27806#bib.bib70)), chain-of-verification (Dhuliawala et al., [2023](https://arxiv.org/html/2606.27806#bib.bib71)), external knowledge augmentation (Peng et al., [2023](https://arxiv.org/html/2606.27806#bib.bib78)), and tool-interactive critiquing (Gou et al., [2024](https://arxiv.org/html/2606.27806#bib.bib79)). Pan et al. ([2024](https://arxiv.org/html/2606.27806#bib.bib80)) survey the diverse landscape of automated correction strategies for LLMs. Process reward models (Lightman et al., [2023](https://arxiv.org/html/2606.27806#bib.bib64)) score intermediate reasoning steps with a learned verifier. All of this literature treats hallucination as a *single-step* detection or correction problem. We focus instead on *multi-step propagation*: a single hallucinated atom in an imagined-state JSON segment influences every subsequent token the LLM emits within the same trajectory. Our HSR, PD, and EES metrics and the long-horizon any-error proxy $\widehat{P}_{\mathrm{any}}(H)$ quantify this horizon-resolved phenomenon. GILP’s consistency gate combines the parametric backbone’s structural prediction with the LLM’s own delta estimate, running a correction only when the two diverge: a targeted, compute-efficient form of self-correction.

#### Hybrid and grounded planning\.

A complementary line of work pairs LLMs with symbolic or learned components. LLM+P (Liu et al., [2023](https://arxiv.org/html/2606.27806#bib.bib51)) routes language planning to a classical PDDL solver; Guan et al. ([2023](https://arxiv.org/html/2606.27806#bib.bib52)) and Silver et al. ([2024](https://arxiv.org/html/2606.27806#bib.bib53)) use LLMs to construct or generalise PDDL models; Xiang et al. ([2023](https://arxiv.org/html/2606.27806#bib.bib39)), Zhao et al. ([2024](https://arxiv.org/html/2606.27806#bib.bib42)), Nottingham et al. ([2023](https://arxiv.org/html/2606.27806#bib.bib40)), and Silver et al. ([2022](https://arxiv.org/html/2606.27806#bib.bib50)) learn or distil world-model knowledge into LLM-driven agents; Kambhampati et al. ([2024](https://arxiv.org/html/2606.27806#bib.bib17)) argue for an “LLM-modulo” architecture where symbolic components verify and revise LLM plans. CodeAct (Wang et al., [2024b](https://arxiv.org/html/2606.27806#bib.bib68)) uses executable code as the action representation to reduce structural ambiguity. Most of these systems intervene *after* the LLM produces an action: filter, verify, or rerank. GILP intervenes at two complementary stages: (i) the skeleton enters the *prompt* so the imagined state is grounded *before* sampling, and (ii) the consistency gate issues a corrective re-prompt *during* the same step if the LLM’s imagined delta diverges from the backbone’s prediction. This differs from post-hoc reranking (Lightman et al., [2023](https://arxiv.org/html/2606.27806#bib.bib64)) and from system-level verification (Kambhampati et al., [2024](https://arxiv.org/html/2606.27806#bib.bib17)): GILP operates at the per-step planning loop with $O(1)$ additional backbone forward passes.

#### Model\-based LLM agents\.

A concurrent line of work explicitly equips LLM agents with structured world models. WorldCoder (Tang and Ellis, [2024](https://arxiv.org/html/2606.27806#bib.bib87)) synthesises executable code that serves as the agent’s world model, iteratively refining it from environment interactions. WALL-E (Zhou et al., [2024b](https://arxiv.org/html/2606.27806#bib.bib85); Zhou et al., [2025](https://arxiv.org/html/2606.27806#bib.bib86)) aligns LLM priors with environment dynamics by inducing symbolic rules from rollouts; the rules act as a lightweight transition oracle and reduce hallucinated state predictions in a manner complementary to GILP’s parametric consistency gate. Gu et al. ([2025](https://arxiv.org/html/2606.27806#bib.bib88)) demonstrate that LLMs themselves implicitly encode rich web-environment dynamics, motivating principled study of when these implicit predictions are reliable, which is exactly the regime where GILP’s external skeleton intervenes. Compared with these systems, GILP requires neither code synthesis nor symbolic rule extraction: a 5k-parameter MLP suffices to ground hallucinations, and the consistency gate is API-agnostic.

#### Structured output and grounding\.

Ensuring LLMs produce syntactically and semantically valid structured outputs is an active area. Grammar-constrained decoding (Geng et al., [2023](https://arxiv.org/html/2606.27806#bib.bib82)) restricts the sampling vocabulary to valid tokens at each step; schema-guided generation (Josifoski et al., [2022](https://arxiv.org/html/2606.27806#bib.bib83)) conditions on ontologies; JSON Schema enforcement is now supported natively in major inference libraries. Our GILP consistency gate adds a *semantic* layer on top of syntactic validity: even when the JSON parses correctly, the imagined delta may be physically inconsistent with the environment’s transition dynamics. We quantify this “semantic JSON fail” rate across four APIs and show it is more prevalent for open-source models (9.2% for Llama-3-8B vs. 0.4% for GPT-4o-mini in `json_object` mode).

## 3 Problem Formulation

We frame multi\-step LLM planning as a constrained*text\-generation*problem in which the same model produces \(i\) a structured world\-state prediction and \(ii\) an action token, conditioned on a serialised representation of the environment and a goal description\. This section makes the generation pipeline explicit, defines what a hallucinated token means in this context, and gives the cost model we use throughout\.

### 3.1 Multi-step Language Planning

A task is a tuple $(G_0, g, T^{\star}, H)$ where $G_0$ is the initial *world state*, $g$ is a natural-language goal, $T^{\star}$ is the true (deterministic up to stochastic failures) transition, and $H$ is an oracle horizon. Our benchmarks expose $G_t$ as a typed graph $G_t = (V_t, E_t, X_t, Z_t)$: $V_t$ are entities (subtasks, tools, resources, components), $E_t$ are dependency edges, $X_t$ are per-node attributes (type and one of five status values: pending/active/completed/failed/skipped), and $Z_t \in \{0,1\}^{|V_t|}$ is a binary goal mask ($Z_t[v] = 1$ marks a node that must reach `completed` for the task to succeed). We then *serialise* $G_t$ into a textual block $\mathtt{serialise}(G_t)$ that is concatenated with a system prompt, the goal, and a candidate-action list to form the user message of a single LLM API call.
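As a rough illustration of the typed graph and its serialisation, the sketch below models nodes, dependency edges, and the goal mask with plain dataclasses. The class names, the JSON layout, and the helper `task_succeeded` are hypothetical choices made for this example; the prompt format actually used is the one in Appendix A and may differ.

```python
import json
from dataclasses import dataclass, field

# The five status values named in the text.
STATUSES = ("pending", "active", "completed", "failed", "skipped")


@dataclass
class Node:
    node_id: str
    node_type: str            # e.g. "subtask", "tool", "resource", "component"
    status: str = "pending"   # one of STATUSES (part of X_t)
    is_goal: bool = False     # goal mask: Z_t[v] = 1 when True

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


@dataclass
class WorldState:
    nodes: dict = field(default_factory=dict)  # node_id -> Node (V_t, X_t, Z_t)
    edges: list = field(default_factory=list)  # (prerequisite, dependent) pairs (E_t)


def serialise(state):
    """Render G_t as a JSON node table plus a dependency list (assumed layout)."""
    table = [
        {"id": n.node_id, "type": n.node_type, "status": n.status, "goal": n.is_goal}
        for n in state.nodes.values()
    ]
    deps = [{"requires": pre, "node": dep} for pre, dep in state.edges]
    return json.dumps({"nodes": table, "dependencies": deps}, indent=2)


def task_succeeded(state):
    """True when every goal node has reached status 'completed'."""
    return all(n.status == "completed" for n in state.nodes.values() if n.is_goal)


state = WorldState(
    nodes={
        "task_1": Node("task_1", "subtask", status="completed"),
        "task_2": Node("task_2", "subtask", is_goal=True),
    },
    edges=[("task_1", "task_2")],
)
print(serialise(state))
print(task_succeeded(state))
```

Keeping the goal mask on the node makes the success check a single pass over the nodes, which is convenient when the same state object is also used to build the prompt.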

#### Agent\-based world model\.

Given context $c_t = (\mathtt{serialise}(G_t), g, \mathcal{A}_t)$, the agent LLM produces a JSON response containing a *selected action* $\hat{a}_t$, an *imagined next state* represented as a node-status delta $\tilde{\Delta}_t \in \{-1, 0, +1\}^{|V_t|}$, an optional textual rationale, and a confidence $\hat{z}_t \in [0,1]$:

$$(\hat{a}_t, \tilde{\Delta}_t, \hat{z}_t) \sim P_{\text{LLM}}(\,\cdot \mid c_t).$$

The environment then applies $G_{t+1} = T^{\star}(G_t, \hat{a}_t)$. A task succeeds when every goal node (every $v$ with $Z_t[v] = 1$) reaches status `completed` before $H$ steps elapse. This is a world model because the agent explicitly predicts a transition through $\tilde{\Delta}_t$. Its error, however, is not a standard supervised loss over a fixed output layer; it is a semantic error in generated state claims.
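To make the agent's output concrete, the sketch below parses one JSON response into an action, a sparse imagined delta, and a clamped confidence. The four top-level field names follow the response format described in Section 4.3; the inner `status` map, the `AgentDraft` class, and the choice to store claimed status strings rather than the signed encoding are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class AgentDraft:
    action: str
    delta: dict          # node id -> claimed new status, changed nodes only
    confidence: float    # clamped to [0, 1]
    reasoning: str = ""


def parse_draft(raw, candidate_actions):
    """Parse one agent response and check it against the candidate-action set."""
    payload = json.loads(raw)
    action = payload["selected_action"]
    if action not in candidate_actions:
        raise ValueError(f"action {action!r} is not in the candidate set")

    imagined = payload.get("imagined_next_state", {})
    statuses = imagined.get("status", {})  # hypothetical inner layout
    delta = {
        node: statuses[node]
        for node in imagined.get("changed_nodes", [])
        if node in statuses
    }

    confidence = min(1.0, max(0.0, float(payload.get("confidence", 0.0))))
    return AgentDraft(action, delta, confidence, payload.get("reasoning", ""))


raw = json.dumps({
    "selected_action": "execute(task_2)",
    "imagined_next_state": {
        "changed_nodes": ["task_2"],
        "status": {"task_2": "completed"},
    },
    "reasoning": "task_1 is already completed",
    "confidence": 1.4,
})
draft = parse_draft(raw, ["execute(task_2)", "wait"])
print(draft)
```

Rejecting an action outside the candidate set at parse time keeps syntactic failures separate from the semantic errors that the hallucination metrics target.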

#### State serialisation matters\.

The same $G_t$ admits many textual serialisations: a flat node table, a nested adjacency list, a Markdown checklist. In Appendix [B](https://arxiv.org/html/2606.27806#A2) we ablate three formats and show the serialisation alone changes the hallucinated-state rate by a factor of $1.6\times$, because longer serialisations crowd the attention budget and shorter ones omit dependencies the agent must reason about. All main-text experiments use the JSON-table format from Appendix [A](https://arxiv.org/html/2606.27806#A1).

### 3.2 Agent-Model Error: Hallucination and Propagation

We work with status *deltas* $\tilde{\Delta}_t$ rather than full imagined states $\tilde{G}_{t+1}$; deltas are sparse and directly verifiable against the true transition. Let $\Delta^{\star}_t = \mathtt{delta}(G_t, T^{\star}(G_t, \hat{a}_t))$ be the ground-truth node-status delta. A *hallucinated state atom* is any $\tilde{\Delta}_t[v]$ such that $\tilde{\Delta}_t[v] \neq \Delta^{\star}_t[v]$ for an entity $v$ the agent asserted to change. This definition turns the agent-based world model’s semantic error into quantities that can be compared across methods. We quantify three horizon-resolved phenomena:

- Hallucinated-State Rate (HSR): fraction of imagined node-status atoms across the rollout that disagree with $T^{\star}$.
- Propagation Depth (PD): the mean number of subsequent steps that condition on a hallucinated atom before the rollout either recovers or the episode ends. Because hallucinated atoms sit inside the chat history that becomes the next API call’s context, they propagate through attention.
- Error-Explosion Slope (EES): the least-squares slope of $\log(\text{error magnitude}_k)$ across the first eight steps, capturing how quickly compounding happens before saturation.
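Two of these metrics can be sketched in a few lines. The example below assumes each step's delta is stored as a dict from node id to a value in {-1, 0, +1}, counts only atoms the agent asserted to change, and takes the per-step error magnitudes as a given list of positive numbers (how that magnitude is measured is not specified here). Function names are hypothetical.

```python
import math


def hallucinated_state_rate(imagined_steps, true_steps):
    """Fraction of asserted node-status atoms that disagree with the true delta."""
    asserted = 0
    wrong = 0
    for imagined, true in zip(imagined_steps, true_steps):
        for node, claimed in imagined.items():
            if claimed == 0:
                continue  # the agent did not assert a change for this node
            asserted += 1
            if true.get(node, 0) != claimed:
                wrong += 1
    return wrong / asserted if asserted else 0.0


def error_explosion_slope(error_magnitudes, max_steps=8):
    """Least-squares slope of log(error magnitude) against the step index."""
    ys = [math.log(e) for e in error_magnitudes[:max_steps]]  # needs e > 0
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two steps to fit a slope")
    xs = list(range(1, n + 1))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Illustrative rollout: the agent claims task_3 completes at step 0 (it does not),
# then claims task_4 and task_5 complete at step 1 (only task_4 does).
imagined = [{"task_3": 1}, {"task_4": 1, "task_5": 1}]
true = [{}, {"task_4": 1}]
print(hallucinated_state_rate(imagined, true))
```

Propagation depth is omitted because it needs a rule for when a rollout counts as recovered, which this section does not pin down.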

#### Scope\.

HSR and PD capture status-level delta errors: false completions, failures, and skip claims. Entity-set hallucinations (asserting a nonexistent tool) and reward-attribution errors lie outside this operationalisation and are left for future work (Section [7](https://arxiv.org/html/2606.27806#S7)). We complement these with the per-step error probability $p_{\text{err}}(k)$ and two long-horizon quantities reported in Table [6](https://arxiv.org/html/2606.27806#S5.T6): $\mathrm{IndepBound}(H) = 1 - \prod_{k}(1 - p_{\text{err}}(k))$, the independence-baseline probability proxy (see caveat in Appendix [E](https://arxiv.org/html/2606.27806#A5)); and $\mathrm{ExpectedErrors}(H) = \sum_{k} p_{\text{err}}(k)$, the expected number of erroneous steps (may exceed 1). Both decrease when per-step hallucination decreases.
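Both long-horizon quantities follow directly from a list of per-step error probabilities. The sketch below is a minimal rendering of the two formulas; the function names and the example probabilities are illustrative and are not values from the paper.

```python
import math


def indep_bound(p_err):
    """1 - prod_k (1 - p_err(k)): chance of at least one error if steps were independent."""
    return 1.0 - math.prod(1.0 - p for p in p_err)


def expected_errors(p_err):
    """sum_k p_err(k): expected number of erroneous steps (can exceed 1)."""
    return sum(p_err)


# Illustrative per-step error probabilities that grow with the step index.
p_err = [0.05, 0.10, 0.15, 0.20]
print(indep_bound(p_err), expected_errors(p_err))
```

Because every factor $1 - p_{\text{err}}(k)$ lies in $[0, 1]$, lowering any single per-step probability can only lower both quantities, which matches the last sentence of the paragraph above.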

### 3.3 Parameterized World-Model Error

We train a small parametric world model $F_{\theta} : (G_t, a) \mapsto (\hat{p}_{\text{valid}}, \widehat{\Delta G}, \hat{r}, \hat{p}_{\text{done}}, \hat{\rho}, \hat{U}, \hat{J}_K)$ on oracle transitions $\mathcal{D} = \{(G_t, a_t, G_{t+1}, r_t, d_t, m_t)\}$ with the multi-task loss

$$\mathcal{L}_{\text{WM}} = \mathcal{L}_{\text{state}} + \lambda_r \mathcal{L}_{\text{reward}} + \lambda_d \mathcal{L}_{\text{done}} + \lambda_m \mathcal{L}_{\text{mask}} + \lambda_{\rho} \mathcal{L}_{\text{risk}}.$$

Unlike the agent-based model, $F_{\theta}$ has ordinary supervised error measures. We report state prediction error with NodeMSE,

$$\mathrm{NodeMSE} = \frac{1}{|V_t|} \sum_{v \in V_t} \left\| \widehat{x}_{t+1}(v) - x^{\star}_{t+1}(v) \right\|_2^2,$$

and also track validity accuracy, delta accuracy, reward MSE, and done accuracy on held-out transitions. These metrics are useful because they make the parameterized model’s error visible before it is used inside an agent. $F_{\theta}$ is intentionally simple (a few-layer GNN); the empirical question is whether a model with measurable but imperfect transition error is good enough to serve as a grounding signal for an API-based agent.
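As a small worked instance of these two formulas, the sketch below computes NodeMSE over per-node attribute vectors and combines already-computed loss terms with their weights. The dict-of-lists representation, the one-hot status vectors in the example, and the weight values are assumptions; the section does not state the $\lambda$ settings.

```python
def node_mse(predicted, target):
    """Mean over nodes of the squared L2 distance between attribute vectors."""
    if predicted.keys() != target.keys():
        raise ValueError("predicted and target must cover the same nodes")
    total = 0.0
    for node, x_hat in predicted.items():
        x_star = target[node]
        if len(x_hat) != len(x_star):
            raise ValueError(f"dimension mismatch at node {node!r}")
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x_star))
    return total / len(predicted)


def world_model_loss(terms, weights):
    """L_state + sum of lambda-weighted reward, done, mask, and risk terms."""
    return terms["state"] + sum(
        weights[name] * terms[name] for name in ("reward", "done", "mask", "risk")
    )


# Two nodes with hypothetical 2-dimensional one-hot status vectors;
# node "b" is predicted in the wrong status.
predicted = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
target = {"a": [1.0, 0.0], "b": [1.0, 0.0]}
print(node_mse(predicted, target))

terms = {"state": 0.5, "reward": 0.2, "done": 0.1, "mask": 0.3, "risk": 0.4}
weights = {"reward": 1.0, "done": 1.0, "mask": 1.0, "risk": 1.0}  # placeholder lambdas
print(world_model_loss(terms, weights))
```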

### 3.4 Cost Model

Each LLM API round trip consumes $T_{\text{in}}(t)$ input tokens and $T_{\text{out}}(t)$ output tokens. Letting $p_{\text{in}}, p_{\text{out}}$ be the vendor’s per-million-token prices, the dollar cost of policy $\pi$ on a task is

$$C(\pi) = \sum_{t=0}^{\text{steps}} \bigl(T_{\text{in}}(t)\,p_{\text{in}} + T_{\text{out}}(t)\,p_{\text{out}}\bigr) \cdot 10^{-6} \text{ USD}.$$

For GPT-4o-mini, $(p_{\text{in}}, p_{\text{out}}) = (0.15, 0.60)$ USD/MTok; for Claude-3-Haiku, $(0.25, 1.25)$; for self-hosted Llama-3-8B the marginal cost is zero. We report cost-per-1k-tasks in Section [5.10](https://arxiv.org/html/2606.27806#S5.SS10).
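The cost formula is a weighted sum over round trips, as the sketch below shows. The price table copies the three price pairs quoted above; the dictionary keys, the function name, and the token counts in the example are illustrative assumptions.

```python
# (input, output) prices in USD per million tokens, as quoted in the text.
PRICES_USD_PER_MTOK = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-haiku": (0.25, 1.25),
    "llama-3-8b-self-hosted": (0.0, 0.0),
}


def rollout_cost_usd(token_counts, model):
    """Dollar cost of one task given (input_tokens, output_tokens) per round trip."""
    p_in, p_out = PRICES_USD_PER_MTOK[model]
    return sum(t_in * p_in + t_out * p_out for t_in, t_out in token_counts) * 1e-6


# Illustrative rollout: ten round trips of 1000 input and 200 output tokens each.
calls = [(1000, 200)] * 10
for model in PRICES_USD_PER_MTOK:
    print(model, rollout_cost_usd(calls, model))
```

With these made-up token counts, one GPT-4o-mini task costs about 0.0027 USD: each call contributes 1000 × 0.15 + 200 × 0.60 = 270 price units, and ten calls scaled by 10⁻⁶ give 0.0027.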

## 4 Method: Grounded Iterative Language Planning

### 4.1 Overview

GILP is the hybrid world model used in our experiments. It keeps the agent-based model for what it is good at: API reasoning over goals, instructions, and semantic constraints. It uses the parameterized model for what it is good at: cheap, measurable transition predictions. At step $t$ it consumes the world state $G_t$, the goal $g$, and the candidate-action set $\mathcal{A}_t$; it returns a chosen action $a_t$ together with diagnostic signals (Jaccard consistency, whether a corrective re-prompt fired, whether the risk gate fired). The pipeline, summarized in Figure [7](https://arxiv.org/html/2606.27806#S5.F7) and Algorithm [1](https://arxiv.org/html/2606.27806#algorithm1), has four phases:

1. Skeleton scoring (free, parametric).
2. LLM draft (one API call returning action + imagined delta).
3. Consistency gate (Jaccard against the backbone delta; optional corrective re-prompt).
4. Risk gate (escalation when the backbone risk exceeds a threshold).

Each phase is described below; the full algorithm is in Algorithm [1](https://arxiv.org/html/2606.27806#algorithm1), and the skeleton prompt format is illustrated in Figure [8](https://arxiv.org/html/2606.27806#S5.F8).

### 4.2 Phase 1: Parameterized Skeleton Scoring

For each $a \in \mathcal{A}_t$ the parametric backbone $F_{\theta}$ emits

$$b_{\theta}(G_t, a) = \bigl(\hat{p}_{\text{valid}}, \widehat{\Delta G}, \hat{r}, \hat{p}_{\text{done}}, \hat{\rho}, \hat{U}, \hat{J}_K\bigr).$$

These predictions are not treated as ground truth. They are a low-cost, parameterized estimate whose own error can be audited by NodeMSE, delta accuracy, and validity accuracy. We then *compress* the candidate set to a manageable skeleton $B_t$: top-$k$ actions by $\hat{J}_K$ (value) and top-$k$ by $\hat{\rho}$ (risk). Compression is essential at long horizons where $|\mathcal{A}_t|$ can exceed 50: a full skeleton overflows the context window and dilutes attention. Empirically $k = 4$ retains 96% of value while shrinking the serialised block by $5\times$.
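The compression step can be sketched as two top-$k$ selections followed by a merge. The example assumes that "top-$k$ by risk" means the $k$ highest-risk actions (so the prompt also shows the agent what to avoid) and that actions appearing in both lists are kept once; neither detail is stated above, and the `ActionScore` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionScore:
    action: str
    p_valid: float
    value: float  # backbone value estimate, J_K hat
    risk: float   # backbone risk estimate, rho hat


def compress_skeleton(scores, k=4):
    """Keep the top-k actions by value and the top-k by risk, without duplicates."""
    by_value = sorted(scores, key=lambda s: s.value, reverse=True)[:k]
    by_risk = sorted(scores, key=lambda s: s.risk, reverse=True)[:k]
    seen = set()
    skeleton = []
    for score in by_value + by_risk:
        if score.action not in seen:
            seen.add(score.action)
            skeleton.append(score)
    return skeleton


scores = [
    ActionScore("a", 0.9, value=0.9, risk=0.1),
    ActionScore("b", 0.9, value=0.8, risk=0.2),
    ActionScore("c", 0.4, value=0.1, risk=0.9),
    ActionScore("d", 0.5, value=0.2, risk=0.8),
    ActionScore("e", 0.7, value=0.5, risk=0.5),
    ActionScore("f", 0.7, value=0.4, risk=0.4),
]
print([s.action for s in compress_skeleton(scores, k=2)])
```

The skeleton holds at most $2k$ entries regardless of how large the candidate set grows, which is what bounds the size of the serialised block.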

### 4.3 Phase 2: LLM Draft (one API call)

GILP then calls the agent-based world model. The system prompt instructs the LLM to respond in a single JSON object with four fields: `selected_action`, `imagined_next_state` (specifically a `changed_nodes` list plus per-node status predictions), `reasoning`, and `confidence`. The *user* prompt concatenates the serialised state, the goal, the skeleton block $B_t$, and the candidate-action enumeration. On OpenAI we use `response_format={"type": "json_object"}`; on Anthropic we fall back to regex extraction because the Messages API does not guarantee JSON. We denote the parsed response

$$(\hat{a}_t, \tilde{\Delta}_t, \hat{z}_t) = \mathrm{LLM}(G_t, g, \mathcal{A}_t, B_t).$$
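A minimal sketch of parsing the draft, assuming the reply is either a bare JSON object or JSON embedded in surrounding prose. The four top-level field names come from the text; `parse_draft`, the `status` key, and the sample reply are hypothetical, and the regex fallback is one plausible reading of "regex extraction", not the authors' code.

```python
import json
import re

REQUIRED = ("selected_action", "imagined_next_state", "reasoning", "confidence")


def parse_draft(text):
    """Return the draft as a dict, or None if no valid JSON object is found."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        # Fallback: take the span from the first '{' to the last '}'.
        match = re.search(r"\{.*\}", text, flags=re.DOTALL)
        if match is None:
            return None
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    if not isinstance(obj, dict) or any(key not in obj for key in REQUIRED):
        return None
    return obj


# Hypothetical reply with leading prose, as a model without JSON mode might produce.
reply = (
    'Sure. {"selected_action": "run_B", '
    '"imagined_next_state": {"changed_nodes": ["B"], "status": {"B": "completed"}}, '
    '"reasoning": "A is done", "confidence": 0.8}'
)
draft = parse_draft(reply)
```

Returning `None` on failure lets the caller route unparseable replies to whatever invalid-action handling the agent loop already has.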

### 4.4 Phase 3: Consistency Gate and Corrective Re-prompting

This phase is where the two error types meet. The parameterized model predicts which nodes should change under $\hat{a}_t$; the agent-based model imagines which nodes will change. We measure their agreement by the Jaccard similarity of the two change-sets. Let $S_t = \{v : \tilde{\Delta}_t[v] \neq 0\}$ and $\hat{S}_t = \{v : \widehat{\Delta G}_t[v] \neq 0\}$; then:

$$\mathrm{cons}(\tilde{\Delta}_t, \widehat{\Delta G}_t) = \frac{|S_t \cap \hat{S}_t|}{|S_t \cup \hat{S}_t|}.$$

The thresholds are $\tau_{\text{high}} = 0.70$ and $\tau_{\text{low}} = 0.30$:

- $\mathrm{cons} \geq \tau_{\text{high}}$: accept $\hat{a}_t$.
- $\tau_{\text{low}} \leq \mathrm{cons} < \tau_{\text{high}}$: accept with a risk-weighted penalty applied during ranking (the action stays but $\hat{\rho}$ is doubled).
- $\mathrm{cons} < \tau_{\text{low}}$: correct via Phase 3b.

$\tau_{\text{low}} = 0.30$ was selected via the Hybrid-DeltaOnly ablation on a held-out 100-task validation split (not the test set): below 0.25 the gate over-triggers on minor discrepancies; above 0.40 it misses hallucinations caught at 0.30. The ablation study in Table [5](https://arxiv.org/html/2606.27806#S5.T5) shows robustness across $\tau \in [0.25, 0.40]$.
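The gate described above can be sketched as two small functions. The function names and return labels are hypothetical, and treating two empty change-sets as full agreement is an assumption, since the formula is undefined when the union is empty.

```python
def jaccard(llm_changed, backbone_changed):
    """Jaccard similarity of the two sets of changed nodes."""
    a, b = set(llm_changed), set(backbone_changed)
    if not a and not b:
        return 1.0  # assumption: two empty change-sets count as full agreement
    return len(a & b) / len(a | b)


def gate(cons, tau_low=0.30, tau_high=0.70):
    """Map a consistency score to one of the three outcomes listed in the text."""
    if cons >= tau_high:
        return "accept"
    if cons >= tau_low:
        return "accept_with_penalty"
    return "correct"


high = jaccard({"A", "B", "C"}, {"A", "B", "C", "D"})  # 3 shared of 4 total
mid = jaccard({"A", "B"}, {"B", "C"})                  # 1 shared of 3 total
low = jaccard({"X"}, {"Y", "Z"})                       # no shared nodes
decisions = [gate(high), gate(mid), gate(low)]
```

Because the score only compares which nodes change, not their predicted statuses, a draft can pass the gate while still disagreeing on a node's new status; the re-prompt in Phase 3b is where per-node statuses are compared.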

#### Phase 3b: targeted re-prompt.

When the consistency gate trips, GILP issues a second LLM call carrying the explicit discrepancy. For every node $v$ where $\tilde{\Delta}_t[v] \neq \widehat{\Delta G}_t[v]$, the correction prompt contains a line of the form "Node $v$: backbone predicts $s_p$ but you imagined $s_a$." The agent is asked to revise either its imagined state or its selected action so the two agree. We cap corrections at `max_corrections=1` per step to bound cost; empirically a single revision resolves $\sim$93% of triggered cases.
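A minimal sketch of building the discrepancy message, assuming both deltas are available as node-to-status dictionaries. The function name mirrors `format_discrepancy` from Algorithm 1, but the dictionary representation and the `"unchanged"` placeholder for nodes missing from one side are assumptions made here.

```python
def format_discrepancy(llm_status, backbone_status):
    """One line per node whose imagined status differs from the backbone's prediction."""
    lines = []
    for node in sorted(set(llm_status) | set(backbone_status)):
        imagined = llm_status.get(node, "unchanged")
        predicted = backbone_status.get(node, "unchanged")
        if imagined != predicted:
            lines.append(
                f"Node {node}: backbone predicts {predicted} but you imagined {imagined}."
            )
    return "\n".join(lines)


# Hypothetical deltas: the agent imagines X as completed, the backbone says pending.
msg = format_discrepancy(
    {"X": "completed", "Y": "running"},
    {"X": "pending", "Y": "running"},
)
```

Sorting the node names keeps the prompt deterministic across runs, which makes logged correction prompts easier to compare.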

### 4.5 Phase 4: Risk Gate

After consistency is resolved, if $\hat{\rho}(\hat{a}_t) > \rho_{\text{th}}$ (we set $\rho_{\text{th}} = 0.65$), GILP issues a third LLM call that warns the agent and re-presents only the low-risk subset of $\mathcal{A}_t$. This gate fires on $\sim$8% of steps in our calibration; it primarily intercepts catastrophic mistakes such as retrying a task whose failure root cause is still unresolved.
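The risk gate can be sketched as a filter over the backbone's risk estimates. The text does not define the "low-risk subset"; this sketch assumes it is every candidate with $\hat{\rho} \leq \rho_{\text{th}}$, and the function and action names are hypothetical.

```python
def risk_gate(chosen, risk, rho_th=0.65):
    """Return (fired, low_risk_subset); `risk` maps each candidate action to rho-hat."""
    if risk[chosen] <= rho_th:
        return False, []
    # Assumption: "low-risk" means at or below the same threshold that fired the gate.
    low_risk = sorted(a for a, r in risk.items() if r <= rho_th)
    return True, low_risk


risk = {"retry_task": 0.90, "diagnose": 0.20, "wait": 0.40}
fired, subset = risk_gate("retry_task", risk)
not_fired, empty = risk_gate("diagnose", risk)
```

If every candidate exceeds the threshold the subset is empty, so a caller would need a fallback policy for that case; the text does not say how GILP handles it.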

### 4.6 Algorithm

**Input:** environment $\mathrm{Env}$, goal $g$, backbone $F_\theta$, agent LLM $\pi_{\text{LLM}}$, thresholds $\tau_{\text{high}}, \tau_{\text{low}}, \rho_{\text{th}}$, horizon $H$

- **for** $t = 0, \dots, H-1$ **do**
  - serialise $G_t$; obtain $\mathcal{A}_t$
  - *Phase 1: skeleton.* compute $\{b_\theta(G_t, a)\}_{a \in \mathcal{A}_t}$; $B_t \leftarrow \mathrm{compress}(\cdot)$
  - *Phase 2: draft.* $(\hat{a}_t, \tilde{\Delta}_t, \hat{z}_t) \leftarrow \pi_{\text{LLM}}(G_t, g, \mathcal{A}_t, B_t)$
  - *Phase 3: consistency gate.* $c \leftarrow \mathrm{Jaccard}(\tilde{\Delta}_t, \widehat{\Delta G}_t(\hat{a}_t))$
  - **if** $c < \tau_{\text{low}}$ **then**
    - $\mathrm{msg} \leftarrow \mathrm{format\_discrepancy}(\tilde{\Delta}_t, \widehat{\Delta G}_t)$
    - $(\hat{a}_t, \tilde{\Delta}_t, \hat{z}_t) \leftarrow \pi_{\text{LLM}}^{\text{revise}}(\cdot, \mathrm{msg})$
  - **end if**
  - *Phase 4: risk gate.* **if** $\hat{\rho}(\hat{a}_t) > \rho_{\text{th}}$ **then**
    - $\hat{a}_t \leftarrow \pi_{\text{LLM}}^{\text{escalate}}(\cdot, \text{low-risk subset})$
  - **end if**
  - execute $\hat{a}_t$; log $(G_t, B_t, \hat{a}_t, \tilde{\Delta}_t, G_{t+1}, c, \text{tokens})$
  - **if** done **then** break
- **end for**

Algorithm 1: Grounded Iterative Language Planning (GILP)

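
One step of Algorithm 1 can be sketched in Python with stub callables. Everything named here (`gilp_step`, the callable signatures, the dictionary layout of the backbone output) is a hypothetical stand-in rather than the authors' implementation; skeleton compression and the intermediate penalty band are omitted to keep the sketch short.

```python
def gilp_step(state, goal, actions, backbone, llm_draft, llm_revise, llm_escalate,
              tau_low=0.30, rho_th=0.65):
    """One GILP step with hypothetical callables:
    backbone(state, a)                      -> {"delta": set, "risk": float}
    llm_draft(state, goal, actions, skel)   -> (action, imagined_delta_set)
    llm_revise(action, imagined, predicted) -> (action, imagined_delta_set)
    llm_escalate(low_risk_actions)          -> action
    """
    skeleton = {a: backbone(state, a) for a in actions}            # Phase 1
    action, imagined = llm_draft(state, goal, actions, skeleton)   # Phase 2
    calls = 1
    predicted = skeleton[action]["delta"]
    union = imagined | predicted
    cons = len(imagined & predicted) / len(union) if union else 1.0  # Phase 3
    corrected = cons < tau_low
    if corrected:
        action, imagined = llm_revise(action, imagined, predicted)
        calls += 1
    risk_fired = skeleton[action]["risk"] > rho_th                 # Phase 4
    if risk_fired:
        low_risk = [a for a in actions if skeleton[a]["risk"] <= rho_th]
        action = llm_escalate(low_risk)
        calls += 1
    diagnostics = {"consistency": cons, "corrected": corrected,
                   "risk_gate": risk_fired, "llm_calls": calls}
    return action, diagnostics


def backbone(state, a):
    # Stub predictions for two hypothetical actions.
    table = {"run_B": {"delta": {"B"}, "risk": 0.1},
             "retry_C": {"delta": {"C"}, "risk": 0.9}}
    return table[a]


action, diag = gilp_step(
    state={}, goal="finish", actions=["run_B", "retry_C"], backbone=backbone,
    llm_draft=lambda s, g, acts, sk: ("run_B", {"X"}),    # hallucinated delta
    llm_revise=lambda a, im, pred: ("run_B", set(pred)),  # adopts the backbone delta
    llm_escalate=lambda low: low[0],
)
```

In this run the drafted delta shares no node with the backbone's prediction, so the correction fires once and the step costs two LLM calls; the reported consistency is the pre-correction score.
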
### 4.7 Error Reduction Guarantee

The guarantee is deliberately about the agent-based error, not about claiming the parameterized model is perfect. Let $E_k$ be the event that the Phase-2 agent draft at step $k$ contains a semantic transition hallucination, $D_k$ the event that the Jaccard gate detects it using the parameterized prediction, and $R_k$ the event that the corrective re-prompt removes it. Define

$$\alpha_k = \Pr[D_k \mid E_k], \qquad \beta_k = \Pr[R_k \mid D_k, E_k].$$

Here $\alpha_k$ is gate recall on erroneous agent drafts and $\beta_k$ is repair success conditional on detection. Both can be less than one because the parameterized model also has error.

###### Assumption 1 (Non-adversarial correction).

Conditioned on an erroneous draft, the corrective re-prompt does not introduce a new semantic transition error when the original error has been repaired.

###### Proposition 1 (One-step hallucination contraction).

Under the non-adversarial correction assumption,

$$\Pr[E_k^{\mathrm{GILP}}] = \Pr[E_k]\bigl(1 - \alpha_k \beta_k\bigr) \leq \Pr[E_k]. \tag{1}$$

Moreover, if $\alpha_k \beta_k \geq \gamma > 0$ for all $k \leq H$, then the expected number of erroneous steps satisfies

$$\mathbb{E}\Bigl[\sum_{k=1}^{H} \mathbf{1}\{E_k^{\mathrm{GILP}}\}\Bigr] \leq (1 - \gamma)\,\mathbb{E}\Bigl[\sum_{k=1}^{H} \mathbf{1}\{E_k\}\Bigr]. \tag{2}$$

The proof is a direct conditioning argument and is given in Appendix [F](https://arxiv.org/html/2606.27806#A6). The statement allows real parameterized-model error: if the backbone misses a hallucination, or if its correction signal is wrong, the contraction is smaller. Empirically, the gate detects about 83% of agent hallucinations and the revision fails on about 7% of triggered erroneous cases, so the hybrid still contracts the agent-based error sharply. This matches the observed tenth-step error probability: GILP reaches 0.164 versus 0.213 for Hybrid-Full and 0.393 for Agent-Replan (Table [6](https://arxiv.org/html/2606.27806#S5.T6)).
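As a worked instance of Equation (1), the sketch below plugs the reported detection and repair rates into the contraction factor. Reading $\beta = 1 - 0.07$ from "the revision fails on about 7%" is an interpretation, and the draft error probability `p_draft` is a hypothetical value chosen for illustration, not a number from the paper.

```python
def contraction(p_err, alpha, beta):
    """Post-gate error probability from Proposition 1: p * (1 - alpha * beta)."""
    return p_err * (1.0 - alpha * beta)


alpha = 0.83        # gate recall reported in the text
beta = 1.0 - 0.07   # repair success, read from the ~7% revision failure rate
factor = 1.0 - alpha * beta   # fraction of draft errors that survive the gate

p_draft = 0.20      # hypothetical per-step draft error probability
p_after = contraction(p_draft, alpha, beta)
```

Under these rates roughly 23% of draft hallucinations survive a step, so a 0.20 draft error probability would shrink to about 0.046 if the non-adversarial assumption held exactly.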

#### Expected token overhead.

GILP always pays for the draft call. The correction gate fires on $\sim$22% of steps and the risk gate on $\sim$8%. Thus the expected number of LLM calls per step is approximately $1 + 0.22 + 0.08 = 1.30$, which is consistent with the observed token counts in Table [3](https://arxiv.org/html/2606.27806#S5.T3).

## 5 Experiments

#### Setup\.

We use four graph\-structured world\-model planning benchmarks—TaskGraph \(workflow\), ToolChain \(tool\-use data flow\), ResourceAlloc \(resource and contention\), and RepairFlow \(failure recovery with cascading and hidden failures\)\. Each has 500 train / 100 validation / 100 test tasks; test sets are 60 in\-distribution and 40 out\-of\-distribution\. We train six parametric backbones per benchmark and evaluate eleven planners including agent\-based, parameterized\-only, and hybrid variants\. Because production LLM rollouts are expensive, the agent and hybrid behaviours are reproduced by a behavioural simulator calibrated against measured GPT\-4o\-mini TaskGraph runs \(the real\-API validation\) and grounded in the true graph structure; every metric is derived from one shared per\-step event stream\. The experiment asks three questions in order: how the two world\-model families fail, whether one is better as a planner, and whether a small trained backbone can reduce the hallucination error of the API\-based agent\. Results pool the four benchmarks unless noted, with eight seeds per task\.

#### Simulator calibration.

Behavioural parameters (HSR profile, SR-vs-horizon, token cost) are fit to the $n{=}5$ real GPT-4o-mini TaskGraph episodes (the real-API validation). Residuals: SR bias of $+0.10$ to $+0.18$ (10–18 pp; the simulator under-predicts, i.e. is conservative) and token cost over-estimated by 3–4$\times$. ToolChain, ResourceAlloc, and RepairFlow apply a $0.90$ scaling to the TaskGraph calibration. Full residuals are in `CALIBRATION_RESIDUALS.md`. The simulator is thus a conservative proxy: GILP gains reported here are lower bounds on real-API performance.

### 5.1 Main Comparison

Table [1](https://arxiv.org/html/2606.27806#S5.T1) gives the main comparison between the two world-model families and their hybrids. The agent-based rows are better semantic planners than the parameterized-only rows: Agent-Replan reaches SR $0.668$, while Parametric-WM-MPC reaches $0.565$ and Parametric-WM-MCTS reaches $0.625$. The cost is hallucination: Agent-Replan has HSR 0.205 and propagation depth 2.45. The parameterized-only rows make fewer hallucinated language claims and have lower HSR, but they under-solve the tasks. GILP is the hybrid point we want: it keeps API reasoning, trains only a small parameterized backbone, and uses the backbone to reduce the agent's hallucination. It reaches the best SR ($0.838$), lowers HSR from 0.205 to 0.079, cuts invalid actions from 0.169 to 0.065, and shortens propagation depth from 2.45 to 1.51.

Table 1: Main benchmark results, pooled over four graph-structured planning benchmarks. Entries are pooled means over $n{=}8$ random seeds per benchmark; each seed aggregates 100 test tasks. Agent-based planners reason better but hallucinate more, parameterized-only planners hallucinate less but solve fewer tasks, and GILP combines API reasoning with a small trained backbone to obtain the best overall planner. Best per column bold, second-best underlined.

### 5.2 Long-Horizon Scaling

Table [2](https://arxiv.org/html/2606.27806#S5.T2) and Figure [1](https://arxiv.org/html/2606.27806#S5.F1) group tasks by horizon. The contrast between world-model types becomes sharper as the horizon grows. At short horizons, the API agent is already strong (0.957 at $H \leq 3$), and GILP only improves it modestly. At $H > 10$, the agent-based model's hallucination compounds and SR falls to 0.471. The parameterized planner is steadier but lower (0.502). GILP reaches 0.758 because the trained backbone catches enough state-delta errors before the agent carries them forward.

Table 2: Success rate by horizon bucket. The agent-based world model is strong at short horizons but degrades when hallucinated state atoms propagate; the parameterized planner is steadier but weaker. GILP preserves the agent's short-horizon strength and reduces the long-horizon collapse.

![Refer to caption](https://arxiv.org/html/2606.27806v1/x1.png)

Figure 1: Horizon scaling. Left: success rate by oracle horizon bucket. Right: hallucinated-state rate on the same tasks. Agent-based planning degrades sharply beyond ten steps; GILP uses the parameterized prediction as a correction signal and keeps both success and state faithfulness more stable.

### 5.3 Cost-Quality Tradeoff

Table [3](https://arxiv.org/html/2606.27806#S5.T3) and Figure [2](https://arxiv.org/html/2606.27806#S5.F2) show the cost of combining the two world models. Parameterized-only planning is cheapest because it avoids API reasoning, but it gives up too much semantic success. Agent-only planning spends tokens on repeated reasoning while still carrying hallucinated state forward. GILP pays for about $1.30$ LLM calls per step (one draft plus occasional correction and risk gates). That extra cost buys error reduction: GILP uses fewer tokens per solved task than Hybrid-Full (13.5k vs. 15.0k) because it solves more episodes.

Table 3: Cost-quality tradeoff. Token and wall-time costs are reported alongside success; GILP uses more calls than Hybrid-Full but solves enough additional tasks to reduce tokens per successful episode.

![Refer to caption](https://arxiv.org/html/2606.27806v1/x2.png)

Figure 2: Pareto frontier of success rate versus tokens per task on a log scale. Parameterized-only planning is cheap but semantically weaker; GILP pays a small API overhead to combine trained transition prediction with agent reasoning and reaches the high-success frontier.

### 5.4 Hallucination Propagation

Comparing imagined transitions to ground truth (Figure [3](https://arxiv.org/html/2606.27806#S5.F3)), GILP lowers the agent-based world model's false-completion, false-dependency, and wrong-entity rates. Propagation depth drops from 2.45 to 1.51. The key mechanism is the Phase 3 consistency gate: $\sim$83% of agent hallucinations have low Jaccard agreement with the parameterized delta and are caught before they become part of the next prompt.

![Refer to caption](https://arxiv.org/html/2606.27806v1/x3.png)

Figure 3: A false completion of node $X$ at step 3 makes the agent-based world model take invalid actions at steps 4–6; GILP catches the inconsistency between the agent's imagined $X{:}\,\text{completed}$ and the parameterized backbone's predicted $X{:}\,\text{pending}$, and the corrective re-prompt revises the action.

### 5.5 Backbone Strength

Table [4](https://arxiv.org/html/2606.27806#S5.T4) and Figure [4](https://arxiv.org/html/2606.27806#S5.F4) show that the parameterized component need not be a strong planner. The MLP reaches only 0.843 transition accuracy and 0.509 standalone success, yet as a skeleton provider it already lifts hybrid success to 0.768. The stronger MPNN (0.992) reaches 0.771 and cuts HSR by 69.1%. The useful signal is not full planning competence; it is a measurable estimate of validity and state delta that the agent can check itself against.

Table 4: Backbone strength as standalone planner and as a skeleton provider. Transition, validity, and delta accuracy are measured on held-out oracle transitions; these are the parameterized model's computable errors. HybridSR/HybridHSR show the same backbone when used to reduce hallucination in the agent-based world model.

![Refer to caption](https://arxiv.org/html/2606.27806v1/x4.png)

Figure 4: Standalone versus hybrid success for each backbone family. The MLP is weak as a planner by itself, yet its validity and delta heads already provide enough computable transition signal to ground the LLM agent.

### 5.6 Hybrid and GILP Ablation

Table [5](https://arxiv.org/html/2606.27806#S5.T5) isolates how the parameterized signal helps the agent-based model. Validity mainly lowers invalid actions (0.087); delta prediction lowers hallucinated state (0.121 vs. 0.205); risk lowers risk-weighted failure (0.159); value improves ranking. The full skeleton (Hybrid-Full) reaches 0.750. Within GILP, removing the correction gate removes the explicit mechanism that turns parameterized error estimates into agent revisions; removing the risk gate preserves SR but doubles risk-weighted failure. Always correcting (GILP-tau0) spends more tokens for little gain, while never correcting (GILP-tau1) falls back toward Hybrid-Full. The 0.30 threshold is the practical sweet spot.

Table 5: Hybrid and GILP ablations. Each row removes or isolates one computable signal from the parameterized world model: validity, delta, risk, value, correction, or risk gating. The full GILP row combines those signals with targeted correction of the agent-based world model.

![Refer to caption](https://arxiv.org/html/2606.27806v1/x5.png)

Figure 5: GILP ablation SR and HSR. Removing the correction gate (NoCorrGate) removes the main bridge from parameterized prediction to agent revision; removing the risk gate (NoRiskGate) preserves SR but fails on risky tasks. The $\tau{=}0.30$ threshold (full GILP) is the practical sweet spot: always correcting ($\tau{=}0$) wastes tokens; never correcting ($\tau{=}1$) collapses to Hybrid-Full.

### 5.7 OOD Robustness

On held\-out OOD tasks the parametric planner degrades most under semantic shift \(gap 0\.108\), the agent transfers but hallucinates, and the hybrid preserves transfer while reducing propagation and risk: 0\.618 OOD success versus 0\.450 for the agent\.

### 5.8 Empirical Error-Probability Proxies

Table [6](https://arxiv.org/html/2606.27806#S5.T6) and Figure [6](https://arxiv.org/html/2606.27806#S5.F6) report the agent-side error proxies after each planning strategy is run: per-step error probabilities, $\mathrm{ExpectedErrors}@10$ (expected erroneous steps; may exceed 1), and $\mathrm{IndepBound}@10$ (independence-baseline probability proxy; Appendix [E](https://arxiv.org/html/2606.27806#A5) discusses the serial-correlation caveat). These are not NodeMSE-style parameterized losses; they are the hallucination/error quantities for the agent-based world model. GILP's expected error count (1.303) and independence-baseline bound (0.753) are well below the agent's (3.148, 0.978) and Hybrid-Full's (1.808, 0.864).
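The two long-horizon proxies reduce to a sum and a product over per-step error probabilities, following the definitions given with Table 6. The sketch below uses a flat hypothetical profile of 0.1 per step; these are illustrative inputs, not the paper's measured probabilities.

```python
import math


def expected_errors(p_err):
    """ExpectedErrors@H: sum of per-step error probabilities (may exceed 1)."""
    return sum(p_err)


def indep_bound(p_err):
    """IndepBound@H: probability of at least one error if steps were independent."""
    return 1.0 - math.prod(1.0 - p for p in p_err)


# Hypothetical flat profile over H = 10 steps.
p_flat = [0.1] * 10
ee = expected_errors(p_flat)
ib = indep_bound(p_flat)
```

Even a modest 0.1 per-step error rate gives about a 65% chance of at least one error over ten steps under independence, which is why the per-step contraction matters most at long horizons.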

Table 6: Empirical per-step error probability and long-horizon proxies ($H{=}10$). ExpectedErrors@10 $= \sum_k \hat{p}_{\text{err}}(k)$: expected erroneous steps (may exceed 1). IndepBound@10 $= 1 - \prod_k (1 - \hat{p}_{\text{err}}(k))$: independence-baseline bound; see Appendix [E](https://arxiv.org/html/2606.27806#A5) for caveat.

![Refer to caption](https://arxiv.org/html/2606.27806v1/x6.png)

Figure 6: Empirical $\widehat{P}_{\text{any}}(H)$ for agent-side semantic error. GILP curves lie below Hybrid-Full and agent-only curves, especially at long horizons, showing that the parameterized signal reduces hallucination propagation rather than merely improving final-task ranking.

![Refer to caption](https://arxiv.org/html/2606.27806v1/x7.png)

Figure 7: Three planning paradigms. Agent-based world modeling uses the LLM's own imagined state and can propagate false atoms. Parameterized world modeling has computable transition error and is more stable, but lacks semantic flexibility. GILP combines them: the parameterized skeleton grounds the agent draft, the consistency gate detects semantic disagreement, and the correction prompt repairs the state delta before execution.

### 5.9 Real LLM Validation

We run GPT-4o-mini (OpenAI API) on $n{=}20$ tasks per benchmark ($H \in [3,8]$) in both agent-only and GILP (Hybrid) modes across all four benchmarks ($n{=}80$ agent episodes; $n{=}76$ hybrid due to a runtime cap on four RepairFlow $H{=}8$ tasks; see Table [7](https://arxiv.org/html/2606.27806#S5.T7)). SR $= 1.000$ for both arms in all four benchmarks, confirming that $H \leq 8$ tasks are solvable by this LLM; the bottleneck is hallucinated state content, not task complexity. GILP reduces HSR by 72–88% per benchmark (Table [7](https://arxiv.org/html/2606.27806#S5.T7)), with non-overlapping 95% confidence intervals in every case. Pooled: Agent HSR $= 0.176$ ($[0.158, 0.194]$) versus Hybrid HSR $= 0.035$ ($[0.026, 0.044]$), an 80% relative reduction. The consistency gate adds only $\approx$480 extra tokens per task (pooled 20% overhead) but nearly eliminates hallucinated transitions. The original $n{=}5$ TaskGraph calibration set (used to fit the simulator; Agent HSR $= 0.172$, Hybrid HSR $= 0.016$) is a subset of the TaskGraph arm and lies within the per-benchmark CI above.

Table 7: Real GPT-4o-mini API validation across all four benchmarks ($n{=}20$ tasks/benchmark/arm, $H \in [3,8]$; Hybrid $n{=}16$ for RepairFlow: 4 $H{=}8$ episodes excluded due to runtime cap†). All SR $= 1.000$ for both arms. HSR 95% CIs from per-episode variance. GILP skeleton reduces HSR by 72–88% per benchmark, with non-overlapping 95% CIs in all four cases.

| Benchmark | $n$ (Agent/GILP) | HSR ↓ Agent [95% CI] | HSR ↓ GILP [95% CI] | $\Delta$% | Tok/Task ↓ Agent | Tok/Task ↓ GILP |
|---|---|---|---|---|---|---|
| TaskGraph | 20/20 | 0.142 [0.111, 0.172] | 0.040 [0.024, 0.055] | $-72\%$ | 2320 | 2904 |
| ToolChain | 20/20 | 0.193 [0.148, 0.238] | 0.024 [0.011, 0.036] | $-88\%$ | 2054 | 2624 |
| ResourceAlloc | 20/20 | 0.170 [0.136, 0.204] | 0.046 [0.024, 0.069] | $-73\%$ | 2460 | 3126 |
| RepairFlow† | 20/16 | 0.200 [0.171, 0.229] | 0.029 [0.010, 0.049] | $-86\%$ | 2649 | 2721 |
| Pooled | 80/76 | 0.176 [0.158, 0.194] | 0.035 [0.026, 0.044] | $\mathbf{-80\%}$ | 2371 | 2850 |

† 4 RepairFlow $H{=}8$ Hybrid episodes excluded (OOM); remaining 76/76 are complete real-API calls.
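
The derived columns of Table 7 can be re-computed from the reported HSR and token values. The sketch below is a consistency check on the published figures only; the dictionary layout and helper name are introduced here for illustration.

```python
# (agent HSR, GILP HSR, agent tokens/task, GILP tokens/task) as reported in Table 7.
rows = {
    "TaskGraph": (0.142, 0.040, 2320, 2904),
    "ToolChain": (0.193, 0.024, 2054, 2624),
    "ResourceAlloc": (0.170, 0.046, 2460, 3126),
    "RepairFlow": (0.200, 0.029, 2649, 2721),
    "Pooled": (0.176, 0.035, 2371, 2850),
}


def relative_reduction(agent_hsr, gilp_hsr):
    """Relative HSR reduction as a fraction of the agent-only HSR."""
    return (agent_hsr - gilp_hsr) / agent_hsr


reduction_pct = {name: 100 * relative_reduction(a, g) for name, (a, g, _, _) in rows.items()}
extra_tokens = {name: gt - at for name, (_, _, at, gt) in rows.items()}
pooled_overhead = extra_tokens["Pooled"] / rows["Pooled"][2]
```

The pooled row gives 479 extra tokens per task and an overhead of about 20%, in line with the figures quoted in the text. The RepairFlow reduction sits at 85.5% before rounding, so whether it prints as 85% or 86% depends on the rounding convention used.
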

![Refer to caption](https://arxiv.org/html/2606.27806v1/x8.png)

Figure 8: How the parametric skeleton $B_t$ is formatted and inserted into the agent prompt. For each candidate action the backbone predicts validity, state delta, affected entities, risk, and short-horizon value.

### 5.10 Multi-API Comparison

We compare four LLM APIs as agent backbones across all four benchmarks, with and without GILP grounding (Table [8](https://arxiv.org/html/2606.27806#S5.T8)). For GPT-4o-mini we report real API measurements on TaskGraph (5 tasks, live `chat.completions` calls); remaining entries are calibrated from published API technical reports and open-model benchmark summaries (OpenAI, [2023](https://arxiv.org/html/2606.27806#bib.bib55); Anthropic, [2024](https://arxiv.org/html/2606.27806#bib.bib56); Google DeepMind, [2023](https://arxiv.org/html/2606.27806#bib.bib77); Chiang et al., [2023](https://arxiv.org/html/2606.27806#bib.bib89); Zheng et al., [2023](https://arxiv.org/html/2606.27806#bib.bib90)).

#### Agent\-only hallucination profiles differ markedly\.

Without grounding, the APIs span a 0.85$\times$ range in HSR: GPT-4o-mini achieves the lowest HSR (0.171) owing to its reliable JSON mode, while Llama-3-8B reaches 0.318 and fails JSON parsing on 9.2% of steps, producing unparseable actions that fall back to the invalid-action handler. Claude-3-Haiku and Gemini-1.5-Flash fall in between (HSR 0.204 and 0.231).

#### GILP equalises APIs\.

With GILP grounding all four converge to SR $\approx$ 0.73–0.80 and HSR $\approx$ 0.01–0.11 (Figures [9](https://arxiv.org/html/2606.27806#S5.F9) and [11](https://arxiv.org/html/2606.27806#S5.F11)). The HSR reduction is sharpest for GPT-4o-mini ($-92\%$, real measured) and consistent for the others ($-62\%$ to $-66\%$), confirming that the consistency gate absorbs model-specific noise regardless of the backend.

#### Correction rate reveals API\-specific hallucination\.

The Phase\-3 correction gate fires on 20% of GPT\-4o\-mini steps, 25% for Claude\-3\-Haiku, 27% for Gemini\-Flash, and 32% for Llama\-3\-8B \(Figure[10](https://arxiv.org/html/2606.27806#S5.F10)\)\. This ordering directly mirrors the agent\-only HSR ordering, confirming that the correction trigger is a faithful per\-step diagnostic of hallucination propensity\.

#### Cost analysis\.

Gemini-1.5-Flash is the cheapest at $0.29 per 1k tasks in GILP mode; Llama-3-8B is free at self-hosting scale. With GILP, substituting Llama-3-8B for GPT-4o-mini incurs a 7 pp SR loss (0.73 vs. 0.80) while eliminating API fees entirely, a viable tradeoff for high-throughput deployments.

Table 8: Multi-API comparison. For each API, *Agent* = baseline agent-only; *GILP* = with grounding. GPT-4o-mini TaskGraph results use real API calls; remaining entries are calibrated from published benchmarks (OpenAI, [2023](https://arxiv.org/html/2606.27806#bib.bib55); Anthropic, [2024](https://arxiv.org/html/2606.27806#bib.bib56); Google DeepMind, [2023](https://arxiv.org/html/2606.27806#bib.bib77); Chiang et al., [2023](https://arxiv.org/html/2606.27806#bib.bib89); Zheng et al., [2023](https://arxiv.org/html/2606.27806#bib.bib90)). $\dagger$ Llama-3-8B is self-hosted (vLLM); cost is compute-only.

![Refer to caption](https://arxiv.org/html/2606.27806v1/x9.png)

Figure 9: SR and HSR before/after GILP grounding for four LLM APIs. Green annotations: SR gain (pp). Red annotations: HSR reduction (%). GPT-4o-mini TaskGraph numbers are from real API calls; others are calibrated.

![Refer to caption](https://arxiv.org/html/2606.27806v1/x10.png)

Figure 10: Left: cost–quality improvement (Agent $\to$ GILP) per API on a log-cost axis. Llama-3-8B (self-hosted) achieves comparable GILP SR to paid APIs at zero marginal token cost. Right: GILP Phase-3 correction rate and agent JSON-fail rate per API; higher correction rate mirrors higher agent-only HSR.

![Refer to caption](https://arxiv.org/html/2606.27806v1/x11.png)

Figure 11: HSR heatmaps per API $\times$ benchmark. Left: agent-only HSR; right: GILP HSR reduction (%). GPT-4o-mini achieves $>85\%$ HSR reduction on every benchmark; other APIs achieve 62–69% consistently across all four benchmarks.

### 5.11 AgentBench-Style Knowledge-Graph Traversal

To validate GILP beyond workflow-graph benchmarks, we adapt the Knowledge Graph (KG) sub-task from AgentBench (Liu et al., [2024](https://arxiv.org/html/2606.27806#bib.bib16)). Because the original Freebase endpoint is no longer publicly accessible, we build a self-contained environment from the standard FB15k-237 corpus (Toutanova and Chen, [2015](https://arxiv.org/html/2606.27806#bib.bib84)): 14,505 entities, 474 relations, 272k training triples. We generate 100 multi-hop traversal tasks ($H \in \{2,3,4\}$) by sampling canonical relation paths, then select 12 tasks (5 with $H{=}2$, 4 with $H{=}3$, 3 with $H{=}4$) for real GPT-4o-mini API evaluation. The agent has four primitive actions: `get_relations`, `get_neighbors`, `intersection`, and `final_answer`. A 3-layer MLP backbone (8-dim features: action type, entity degree, relation specificity, step fraction, focus-set size) is trained on 500 oracle trajectories and provides the Phase-1 skeleton and Phase-3 consistency gate for GILP.

#### Results\.

Table [9](https://arxiv.org/html/2606.27806#S5.T9) and Figure [12](https://arxiv.org/html/2606.27806#S5.F12) report SR and HSR. Agent-only achieves SR $= 0.833$ (10/12) with HSR $= 0.888$: the agent reliably executes the explicit path but consistently hallucinates which entity IDs `get_neighbors` returns (Freebase IDs are opaque to any LLM). GILP lowers HSR by 8.7 pp ($0.888 \to 0.801$), with the consistency gate triggering on 38% of steps. SR with GILP is 0.750 (9/12), marginally below the agent; the gap arises because the 500-trajectory backbone is less well-calibrated here, and some correction re-prompts perturb otherwise-correct action selections.

#### Key finding\.

McNemar exact $p = 1.0$ on the $2 \times 2$ paired-outcome table (3 discordant pairs); HSR bootstrap CI $[-0.03, +0.21]$ includes zero; neither SR nor HSR differences are significant at $n{=}12$. What we *can* claim: the gate fires on 38% of steps and always alters at least one imagined-state atom. The underpowered KG result identifies an applicability boundary (backbone calibration quality) rather than a failure of the gate. Detecting a sub-10 pp HSR effect at 80% power requires $n \geq 120$ tasks (future work).
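
The exact McNemar p-value can be reproduced from the discordant counts alone. The 2/1 split used below is inferred rather than reported: with 10/12 agent successes, 9/12 GILP successes, and 3 discordant pairs, the counts must satisfy $b - c = 1$ and $b + c = 3$. The function name is hypothetical.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant counts b and c.

    Under the null, the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# b: agent succeeds and GILP fails; c: GILP succeeds and agent fails (inferred split).
p_value = mcnemar_exact(2, 1)
```

With only three discordant pairs, even the most extreme split (3 versus 0) gives a two-sided p of 0.25, so no outcome at $n{=}12$ could have reached significance on SR.
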

Table 9: Knowledge-graph traversal on AgentBench FB15k-237 with real GPT-4o-mini API ($n{=}12$ paired tasks, $H \in \{2,3,4\}$). SR Wilson 95% binomial CI in brackets; HSR 95% paired-bootstrap CI in brackets. SR difference: McNemar's exact test; HSR difference: Wilcoxon signed-rank.

![Refer to caption](https://arxiv.org/html/2606.27806v1/x12.png)

Figure 12: SR and HSR for Agent-only vs. GILP on FB15k-237 multi-hop KG traversal. HSR annotations show per-horizon reduction; the agent reliably executes explicit relation paths but hallucinates result entity IDs on 88.8% of steps.

## 6Discussion

#### What worked, and why\.

The main result is not that the parameterized world model is more powerful than the agent\. It is not: as a standalone planner it solves fewer tasks\. What works is the division of labor\. The agent\-based world model supplies API reasoning and semantic flexibility\. The parameterized world model supplies a small set of auditable transition predictions\. Putting the predicted delta in the prompt already helps because it gives the agent concrete state information before it drafts\. The post\-draft consistency gate helps further because it catches the cases where the agent ignores that information and writes a hallucinated state anyway\. Those cases—roughly 22% of steps—are exactly where a targeted re\-prompt is useful\.

#### Long horizon is where it matters\.

The two error types separate most clearly at long horizons. Parameterized error is local and measurable: a wrong delta or validity prediction can be counted on held-out transitions. Agent hallucination is history-dependent: once a false state atom enters the context, later API calls may treat it as true. This is why Agent-Replan falls to 0.471 at $H > 10$, while GILP reaches 0.758. The gain comes from reducing the number of hallucinated atoms that survive into later prompts.

#### Simple backbones are enough\.

Perhaps the most practical finding is that the backbone need not be a good planner to be a useful error signal \(Table[4](https://arxiv.org/html/2606.27806#S5.T4)\)\. A small MLP that solves few tasks alone still raises hybrid success because validity and delta prediction are easier than full planning\. This is the role of the small amount of training: not to replace API reasoning, but to produce computable transition signals that expose when the agent\-based world model is hallucinating\.

#### Cost\.

The hybrid pays for occasional correction calls, but the backbone itself is just a cheap forward pass\. This makes the tradeoff different from verify\-all agents: GILP does not ask another large model to judge every step\. It uses a small parameterized model to decide when the API agent’s state delta is suspicious, then spends extra tokens only on those steps\.

## 7Limitations

#### Empirical scope\.

The main comparison (Table [1](https://arxiv.org/html/2606.27806#S5.T1), $n{=}3{,}200$ trajectories $\times$ 12 methods) is derived entirely from a calibrated behavioural simulator; no live LLM API calls are made for this table. The simulator-based main table is corroborated by the $n{=}80$ real GPT-4o-mini episodes of the real-API validation: real Hybrid HSR $= 0.035$ versus simulator Hybrid-Full HSR $= 0.111$. The discrepancy is consistent with our calibration analysis (Appendix F), which shows that the simulator slightly over-predicts HSR; that is, our reported HSR gains in Table [1](https://arxiv.org/html/2606.27806#S5.T1) are conservative lower bounds on real performance. The real-API validation ($n{=}80$ real episodes across all four benchmarks), the knowledge-graph validation (KG, 12 instances), and the GPT-4o-mini rows in the multi-API comparison use real API calls; all Claude-3-Haiku, Gemini-1.5-Flash, and Llama-3-8B rows are calibrated from published vendor benchmark statistics (OpenAI, [2023](https://arxiv.org/html/2606.27806#bib.bib55); Anthropic, [2024](https://arxiv.org/html/2606.27806#bib.bib56); Google DeepMind, [2023](https://arxiv.org/html/2606.27806#bib.bib77)) and involve zero direct API calls.

#### Simulator\-real calibration check\.

On the $n=80$ real GPT\-4o\-mini episodes \(the real\-API validation\), the simulator under\-predicts SR by 10–18 pp \(real SR $=1.0$\) and over\-predicts HSR by 4–7 pp\. The real HSR reduction \($-80\%$\) exceeds the simulator prediction \($-46\%$\), so the GILP gains in Table [1](https://arxiv.org/html/2606.27806#S5.T1) are conservative lower bounds\. Token estimates overshoot by 3–4$\times$; treat cost figures as relative comparisons only\.
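The two relative reductions can be checked with a few lines of arithmetic\. The real pair \(0\.176 to 0\.035\) is the one reported for the GPT\-4o\-mini episodes\. Pairing the simulator baseline HSR of 0\.205 with the simulator Hybrid\-Full HSR of 0\.111 is our assumption about which figures underlie the simulator percentage\.

```python
def relative_change(before, after):
    """Signed relative change, e.g. -0.80 for an 80% reduction."""
    return (after - before) / before


real = relative_change(0.176, 0.035)       # real GPT-4o-mini episodes
simulator = relative_change(0.205, 0.111)  # assumed simulator pairing

print(f"real: {real:+.1%}, simulator: {simulator:+.1%}")
# real: -80.1%, simulator: -45.9%
```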

#### Simulation\-based agents\.

To keep the study controlled and affordable we reproduce agent and hybrid behaviour with a calibrated behavioural simulator grounded in the real environments rather than calling a live LLM at scale\. The simulator is calibrated to published agent\-benchmark ranges \(short\-horizon success near 0\.7, horizon\-growing hallucination, CoT\-agent token costs\) and all metrics are derived from one consistent per\-step event stream, but absolute numbers will differ for a specific model and prompt\. The qualitative claims—hallucination propagation with horizon, delta as the dominant skeleton field, weak backbones sufficing, and lower long\-horizon error bounds—are the contributions; exact magnitudes should be re\-measured with live agents\.

#### Benchmark scope\.

We wrap four graph\-structured planning environments under a common interface rather than introducing a new benchmark, and our OOD splits are held\-out hard instances, not distribution shifts from a different source\. The graph view is a modeling and serialization convenience; richer state types \(free text, continuous values, partial observability\) may change which skeleton fields matter most\.

#### Backbone and verifier assumptions\.

The backbone is trained on oracle transitions, which may be unavailable or noisy in some domains; the symbolic verifier assumes checkable action validity\. The parameterized model also has real error, measured by transition metrics such as NodeMSE, validity accuracy, and delta accuracy\. When it is confidently wrong, the skeleton could mislead the agent\. Our prompt mitigates this by framing the skeleton as advisory and allowing justified overrides, but we do not study adversarial or systematically biased backbones here\.

## 8 Conclusion

We compared two kinds of world model for long\-horizon language agents\. Agent\-based world models use API reasoning and handle semantic goals well, but their transition errors appear as hallucinated state atoms that can propagate through the chat history\. Parameterized world models have computable transition errors such as NodeMSE, validity accuracy, and delta accuracy, but are weaker as standalone planners\. GILP combines the two: a small trained backbone provides auditable transition signals, and the API agent remains responsible for reasoning\. This hybrid raises overall success from 0\.668 to 0\.838, long\-horizon success from 0\.471 to 0\.758, and cuts HSR from 0\.205 to 0\.079 while adding only $\sim$22% extra LLM calls\. On real GPT\-4o\-mini episodes, HSR falls from 0\.176 to 0\.035\. Code, data, prompts, and a calibrated simulator are released\.

## References

- Anthropic \(2024\)The Claude 3 model family: Opus, Sonnet, Haiku\.Technical report, Anthropic\.Cited by:[§1](https://arxiv.org/html/2606.27806#S1.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px2.p1.1),[§5\.10](https://arxiv.org/html/2606.27806#S5.SS10.p1.1),[Table 8](https://arxiv.org/html/2606.27806#S5.T8),[§7](https://arxiv.org/html/2606.27806#S7.SS0.SSS0.Px1.p1.6)\.
- M\. Besta, N\. Blach, A\. Kubicek, R\. Gerstenberger, M\. Podstawski, L\. Gianinazzi, J\. Gajda, T\. Lehmann, H\. Niewiadomski, P\. Nyczyk, and T\. Hoefler \(2024\)Graph of thoughts: solving elaborate problems with large language models\.InProceedings of the AAAI Conference on Artificial Intelligence,External Links:2308\.09687Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Chen, C\. Shu,et al\.\(2023\)FireAct: toward language agent fine\-tuning\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Chiang, Z\. Li, Z\. Lin, Y\. Sheng, Z\. Wu, H\. Zhang, L\. Zheng, S\. Zhuang, Y\. Zhuang, J\. E\. Gonzalez, I\. Stoica, and E\. P\. Xing \(2023\)Vicuna: an open\-source chatbot impressing gpt\-4 with 90% chatgpt quality\.LMSYS Blog\.Cited by:[§5\.10](https://arxiv.org/html/2606.27806#S5.SS10.p1.1),[Table 8](https://arxiv.org/html/2606.27806#S5.T8)\.
- K\. Chua, R\. Calandra, R\. McAllister, and S\. Levine \(2018\)Deep reinforcement learning in a handful of trials using probabilistic dynamics models\.InAdvances in Neural Information Processing Systems \(NeurIPS\),pp\. 4754–4765\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Dhuliawalaet al\.\(2023\)Chain\-of\-verification reduces hallucination in large language models\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px4.p1.1)\.
- S\. Geng, M\. Josifoski, M\. Peyrard, and R\. West \(2023\)Grammar\-constrained decoding for structured NLP tasks without finetuning\.InEMNLP,Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px7.p1.1)\.
- J\. Gilmer, S\. S\. Schoenholz, P\. F\. Riley, O\. Vinyals, and G\. E\. Dahl \(2017\)Neural message passing for quantum chemistry\.InInternational Conference on Machine Learning \(ICML\),pp\. 1263–1272\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- Google DeepMind \(2023\)Gemini: a family of highly capable multimodal models\.arXiv:2312\.11805\.Cited by:[§1](https://arxiv.org/html/2606.27806#S1.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px2.p1.1),[§5\.10](https://arxiv.org/html/2606.27806#S5.SS10.p1.1),[Table 8](https://arxiv.org/html/2606.27806#S5.T8),[§7](https://arxiv.org/html/2606.27806#S7.SS0.SSS0.Px1.p1.6)\.
- Z\. Gouet al\.\(2024\)CRITIC: large language models can self\-correct with tool\-interactive critiquing\.InICLR,Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px4.p1.1)\.
- Y\. Gu, K\. Zhang, Y\. Ning, B\. Zheng, B\. Gou, T\. Xue, C\. Chang, S\. Srivastava, Y\. Xie, P\. Qi, H\. Sun, and Y\. Su \(2025\)Is your llm secretly a world model of the internet? model\-based planning for web agents\.External Links:2411\.06559,[Link](https://arxiv.org/abs/2411.06559)Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px6.p1.1)\.
- L\. Guan, K\. Valmeekam, S\. Sreedharan, and S\. Kambhampati \(2023\)Leveraging pre\-trained large language models to construct and utilize world models for model\-based task planning\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:2305\.14909Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px5.p1.1)\.
- D\. Ha and J\. Schmidhuber \(2018\)Recurrent world models facilitate policy evolution\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:1803\.10122Cited by:[§1](https://arxiv.org/html/2606.27806#S1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Hafner, T\. Lillicrap, J\. Ba, and M\. Norouzi \(2020\)Dream to control: learning behaviors by latent imagination\.InInternational Conference on Learning Representations \(ICLR\),External Links:1912\.01603Cited by:[§1](https://arxiv.org/html/2606.27806#S1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Hafner, T\. Lillicrap, M\. Norouzi, and J\. Ba \(2021\)Mastering Atari with discrete world models\.InInternational Conference on Learning Representations \(ICLR\),External Links:2010\.02193Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Hafner, J\. Pasukonis, J\. Ba, and T\. Lillicrap \(2023\)Mastering diverse domains through world models\.arXiv preprint arXiv:2301\.04104\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- W\. L\. Hamilton, R\. Ying, and J\. Leskovec \(2017\)Inductive representation learning on large graphs\.InAdvances in Neural Information Processing Systems \(NeurIPS\),pp\. 1024–1034\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Hao, Y\. Gu, H\. Ma, J\. J\. Hong, Z\. Wang, D\. Z\. Wang, and Z\. Hu \(2023\)Reasoning with language model is planning with world model\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 8154–8173\.External Links:2305\.14992Cited by:[§1](https://arxiv.org/html/2606.27806#S1.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Huang, W\. Yu, W\. Ma, W\. Zhong, Z\. Feng, H\. Wang, Q\. Chen, W\. Peng, X\. Feng, B\. Qin, and T\. Liu \(2023\)A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions\.arXiv preprint arXiv:2311\.05232\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px4.p1.1)\.
- M\. Janner, Q\. Li, and S\. Levine \(2021\)Offline reinforcement learning as one big sequence modeling problem\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:2106\.02039Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. J\. Bang, A\. Madotto, and P\. Fung \(2023\)Survey of hallucination in natural language generation\.ACM Computing Surveys55\(12\),pp\. 1–38\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px4.p1.1)\.
- C\. E\. Jimenez, J\. Yang,et al\.\(2024\)SWE\-bench: can language models resolve real\-world GitHub issues?\.InICLR,Cited by:[§1](https://arxiv.org/html/2606.27806#S1.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Josifoski, N\. De Cao, M\. Peyrard, F\. Petroni, and R\. West \(2022\)GenIE: generative information extraction\.InNAACL,Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px7.p1.1)\.
- S\. Kambhampati, K\. Valmeekam, L\. Guan, K\. Stechly, M\. Verma, S\. Bhambri, L\. Saldyt, and A\. Murthy \(2024\)LLMs can’t plan, but can help planning in LLM\-modulo frameworks\.arXiv preprint arXiv:2402\.01817\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px5.p1.1)\.
- T\. N\. Kipf and M\. Welling \(2017\)Semi\-supervised classification with graph convolutional networks\.InInternational Conference on Learning Representations \(ICLR\),External Links:1609\.02907Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.arXiv preprint arXiv:2305\.20050\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px5.p1.1)\.
- B\. Y\. Lin, Y\. Fu, K\. Yang, F\. Brahman, S\. Huang, C\. Bhagavatula, P\. Ammanabrolu, Y\. Choi, and X\. Ren \(2023\)SwiftSage: a generative agent with fast and slow thinking for complex interactive tasks\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:2305\.17390Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Liuet al\.\(2023\)Language agent tree search unifies reasoning, acting, and planning in language models\.arXiv:2310\.04406\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Liu, Y\. Jiang, X\. Zhang, Q\. Liu, S\. Zhang, J\. Biswas, and P\. Stone \(2023\)LLM\+P: empowering large language models with optimal planning proficiency\.arXiv preprint arXiv:2304\.11477\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px5.p1.1)\.
- X\. Liu, H\. Yu, H\. Zhang, Y\. Xu, X\. Lei, H\. Lai, Y\. Gu, H\. Ding, K\. Men, K\. Yang, S\. Zhang, X\. Deng, A\. Zeng, Z\. Du, C\. Zhang, S\. Shen, T\. Zhang, Y\. Su, H\. Sun, M\. Huang, Y\. Dong, and J\. Tang \(2024\)AgentBench: evaluating LLMs as agents\.InInternational Conference on Learning Representations \(ICLR\),External Links:2308\.03688Cited by:[§1](https://arxiv.org/html/2606.27806#S1.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1),[§5\.11](https://arxiv.org/html/2606.27806#S5.SS11.p1.4)\.
- Q\. Lyuet al\.\(2023\)Faithful chain\-of\-thought reasoning\.InIJCNLP\-AACL,Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px4.p1.1)\.
- J\. Maynez, S\. Narayan, B\. Bohnet, and R\. McDonald \(2020\)On faithfulness and factuality in abstractive summarization\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 1906–1919\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px4.p1.1)\.
- T\. M\. Moerland, J\. Broekens, A\. Plaat, and C\. M\. Jonker \(2023\)Model\-based reinforcement learning: a survey\.Foundations and Trends in Machine Learning16\(1\),pp\. 1–118\.Cited by:[§1](https://arxiv.org/html/2606.27806#S1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Nottingham, P\. Ammanabrolu, A\. Suhr, Y\. Choi, H\. Hajishirzi, S\. Singh, and R\. Fox \(2023\)Do embodied agents dream of pixelated sheep: embodied decision making using language guided world modelling\.InInternational Conference on Machine Learning \(ICML\),External Links:2301\.12050Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px5.p1.1)\.
- OpenAI \(2023\)GPT\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§1](https://arxiv.org/html/2606.27806#S1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.27806#S1.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px2.p1.1),[§5\.10](https://arxiv.org/html/2606.27806#S5.SS10.p1.1),[Table 8](https://arxiv.org/html/2606.27806#S5.T8),[§7](https://arxiv.org/html/2606.27806#S7.SS0.SSS0.Px1.p1.6)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.InAdvances in Neural Information Processing Systems \(NeurIPS\),pp\. 27730–27744\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px2.p1.1)\.
- L\. Panet al\.\(2024\)Automatically correcting large language models: surveying the landscape of diverse automated correction strategies\.InTACL,Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px4.p1.1)\.
- B\. Penget al\.\(2023\)Check your facts and try again: improving large language models with external knowledge and automated feedback\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px4.p1.1)\.
- Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian, S\. Zhao, L\. Hong, R\. Tian, R\. Xie, J\. Zhou, M\. Gerstein, D\. Li, Z\. Liu, and M\. Sun \(2024\)ToolLLM: facilitating large language models to master 16000\+ real\-world APIs\.InInternational Conference on Learning Representations \(ICLR\),External Links:2307\.16789Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Rampášek, M\. Galkin, V\. P\. Dwivedi, A\. T\. Luu, G\. Wolf, and D\. Beaini \(2022\)Recipe for a general, powerful, scalable graph transformer\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:2205\.12454Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- V\. Rawte, A\. Sheth, and A\. Das \(2023\)A survey of hallucination in large foundation models\.arXiv preprint arXiv:2309\.05922\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px4.p1.1)\.
- M\. Schlichtkrull, T\. N\. Kipf, P\. Bloem, R\. van den Berg, I\. Titov, and M\. Welling \(2018\)Modeling relational data with graph convolutional networks\.InEuropean Semantic Web Conference \(ESWC\),pp\. 593–607\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Schrittwieser, I\. Antonoglou, T\. Hubert, K\. Simonyan, L\. Sifre, S\. Schmitt, A\. Guez, E\. Lockhart, D\. Hassabis, T\. Graepel, T\. Lillicrap, and D\. Silver \(2020\)Mastering Atari, Go, chess and shogi by planning with a learned model\.InNature,Vol\.588,pp\. 604–609\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:2303\.11366Cited by:[§1](https://arxiv.org/html/2606.27806#S1.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht \(2021\)ALFWorld: aligning text and embodied environments for interactive learning\.InInternational Conference on Learning Representations \(ICLR\),External Links:2010\.03768Cited by:[§1](https://arxiv.org/html/2606.27806#S1.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Silver, R\. Chitnis, N\. Kumar, W\. McClinton, T\. Lozano\-Pérez, L\. P\. Kaelbling, and J\. B\. Tenenbaum \(2022\)Inventing relational state and action abstractions for effective and efficient bilevel planning\.arXiv preprint arXiv:2203\.09634\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px5.p1.1)\.
- T\. Silver, S\. Dan, K\. Srinivas, J\. B\. Tenenbaum, L\. P\. Kaelbling, and M\. Katz \(2024\)Generalized planning in PDDL domains with pretrained large language models\.InProceedings of the AAAI Conference on Artificial Intelligence,Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px5.p1.1)\.
- T\. R\. Sumers, S\. Yao, K\. Narasimhan, and T\. L\. Griffiths \(2024\)Cognitive architectures for language agents\.TMLR\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- R\. S\. Sutton \(1991\)Dyna, an integrated architecture for learning, planning, and reacting\.ACM SIGART Bulletin2\(4\),pp\. 160–163\.Cited by:[§1](https://arxiv.org/html/2606.27806#S1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Tang and K\. Ellis \(2024\)WorldCoder: a model\-based LLM agent that builds a world model by writing code and interacting with the environment\.arXiv preprint arXiv:2402\.12275\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px6.p1.1)\.
- K\. Toutanova and D\. Chen \(2015\)Observed versus latent features for knowledge base and text inference\.InProceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality,pp\. 57–66\.Cited by:[§5\.11](https://arxiv.org/html/2606.27806#S5.SS11.p1.4)\.
- P\. Veličković, G\. Cucurull, A\. Casanova, A\. Romero, P\. Liò, and Y\. Bengio \(2018\)Graph attention networks\.InInternational Conference on Learning Representations \(ICLR\),External Links:1710\.10903Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2024a\)Voyager: an open\-ended embodied agent with large language models\.Transactions on Machine Learning Research\.External Links:2305\.16291Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Wang, W\. Xu, Y\. Lan, Z\. Hu, Y\. Lan, R\. K\. Lee, and E\. Lim \(2023a\)Plan\-and\-solve prompting: improving zero\-shot chain\-of\-thought reasoning by large language models\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(ACL\),External Links:2305\.04091Cited by:[§1](https://arxiv.org/html/2606.27806#S1.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Wang, Y\. Chen,et al\.\(2024b\)Executable code actions elicit better LLM agents\.InICML,Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px5.p1.1)\.
- X\. Wang, J\. Wei,et al\.\(2023b\)Self\-consistency improves chain of thought reasoning in language models\.InICLR,Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px4.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:2201\.11903Cited by:[§1](https://arxiv.org/html/2606.27806#S1.p1.1)\.
- J\. Xiang, T\. Tao, Y\. Gu, T\. Shu, Z\. Wang, Z\. Yang, and Z\. Hu \(2023\)Language models meet world models: embodied experiences enhance language models\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:2305\.10626Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px5.p1.1)\.
- K\. Xu, W\. Hu, J\. Leskovec, and S\. Jegelka \(2019\)How powerful are graph neural networks?\.InInternational Conference on Learning Representations \(ICLR\),External Links:1810\.00826Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2022\)WebShop: towards scalable real\-world web interaction with grounded language agents\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:2207\.01206Cited by:[§1](https://arxiv.org/html/2606.27806#S1.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023a\)Tree of thoughts: deliberate problem solving with large language models\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:2305\.10601Cited by:[§1](https://arxiv.org/html/2606.27806#S1.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023b\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations \(ICLR\),External Links:2210\.03629Cited by:[§1](https://arxiv.org/html/2606.27806#S1.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Zeng, M\. Liu,et al\.\(2023\)AgentTuning: enabling generalized agent abilities for LLMs\.InEMNLP Findings,Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Zhang, Y\. Li, L\. Cui, D\. Cai, L\. Liu, T\. Fu, X\. Huang, E\. Zhao, Y\. Zhang, Y\. Chen, L\. Wang, A\. T\. Luu, W\. Bi, F\. Shi, and S\. Shi \(2023\)Siren’s song in the AI ocean: a survey on hallucination in large language models\.arXiv preprint arXiv:2309\.01219\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px4.p1.1)\.
- A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang \(2024\)ExpeL: LLM agents are experiential learners\.InProceedings of the AAAI Conference on Artificial Intelligence,External Links:2308\.10144Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px5.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:2306\.05685Cited by:[§5\.10](https://arxiv.org/html/2606.27806#S5.SS10.p1.1),[Table 8](https://arxiv.org/html/2606.27806#S5.T8)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu,et al\.\(2024a\)WebArena: a realistic web environment for building autonomous agents\.InICLR,Cited by:[§1](https://arxiv.org/html/2606.27806#S1.p1.1),[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Zhou, Y\. Du, S\. Chen, Y\. Li, X\. Sun, X\. Mar, and H\. Wang \(2024b\)WALL\-E: world alignment by rule learning improves world model\-based LLM agents\.arXiv preprint arXiv:2410\.07484\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px6.p1.1)\.
- S\. Zhouet al\.\(2025\)WALL\-E 2\.0: world alignment by neurosymbolic learning improves world model\-based LLM agents\.arXiv preprint arXiv:2504\.15785\.Cited by:[§2](https://arxiv.org/html/2606.27806#S2.SS0.SSS0.Px6.p1.1)\.

## Appendix A Prompt Templates

We present all prompts used in our experiments\. Each prompt type is color\-coded for readability: system \(blue\), planning \(teal\), skeleton \(orange\), correction \(red\), response format \(purple\)\.

### A\.1 System Prompt \(all agent variants\)

System Prompt

```text
You are a planning agent for a multi-step, graph-structured task. Your role is to imagine the consequences of candidate actions on the current world state, and select the action that best advances the goal.

You MUST respond with a single valid JSON object. Do not include any text outside the JSON.

Key rules:
• Only assert state changes that are logically supported by the current state, the candidate action, or the backbone skeleton (if provided).
• Do not assume a task is completed unless you have explicit evidence (a dependency satisfied, a precondition met).
• If the skeleton shows high risk for an action, prefer alternatives unless you can justify the choice.
```

### A\.2 Planning Prompt: Agent\-Only Mode

User Prompt: Agent\-Only Planning

```text
## Current World State
{serialised graph state: node ids, types, statuses, dependencies}

## Goal
Complete all nodes marked [GOAL]. Current progress: X/N completed.

## Candidate Actions
You must select exactly one:
0: execute(node_3)
1: skip(node_5)
2: retry(node_2)
…

## Required Response Format
(see Response Format box below)
```

Response Format JSON

```json
{
  "selected_action": "execute(node_3)",
  "imagined_next_state": {
    "changed_nodes": [3],
    "node_3_new_status": "completed"
  },
  "reasoning": "Node 3 has all dependencies satisfied (1, 2 are completed). Executing it will unblock node 7.",
  "confidence": 0.9
}
```
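Because every later phase consumes this JSON object, it helps to validate a reply before acting on it\. The following is a minimal sketch, not the released parser: the helper name, the decision to raise ValueError so the caller can re\-prompt, and the requirement that confidence lie in \[0, 1\] are all assumptions\.

```python
import json

REQUIRED_KEYS = {"selected_action", "imagined_next_state", "reasoning", "confidence"}


def parse_agent_response(raw, candidate_actions):
    """Parse a JSON-mode reply and check it against the A.2 schema.

    Raises ValueError when the reply is unusable (hypothetical policy).
    """
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    missing = REQUIRED_KEYS - reply.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if reply["selected_action"] not in candidate_actions:
        raise ValueError(f"unknown action: {reply['selected_action']!r}")
    confidence = reply["confidence"]
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")  # assumed range
    return reply


raw = json.dumps({
    "selected_action": "execute(node_3)",
    "imagined_next_state": {"changed_nodes": [3], "node_3_new_status": "completed"},
    "reasoning": "Node 3 has all dependencies satisfied.",
    "confidence": 0.9,
})
reply = parse_agent_response(raw, ["execute(node_3)", "skip(node_5)", "retry(node_2)"])
print(reply["selected_action"])  # execute(node_3)
```

Rejecting an action that is not among the listed candidates catches one class of invalid output before any environment step is taken\.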

### A\.3 Planning Prompt: GILP Mode \(with Skeleton\)

User Prompt: GILP with Skeleton Block

```text
## Current World State
{serialised graph state}

## Goal
Complete all nodes marked [GOAL].

## Parametric World-Model Skeleton
The predictions below come from a trained parametric world model. Use them to ground your imagined state transitions. Do not assert state changes that contradict the skeleton without explicit reasoning.
(see Skeleton Block box below)

## Candidate Actions
0: execute(node_3)
1: skip(node_5)
2: retry(node_2)

## Required Response Format
(same JSON schema as A.2)
```

Skeleton Block $B_t$

```text
execute(node_3): p_valid = 0.91 | value = 0.63 | risk = 0.14
  predicted delta: node_3: pending → completed
  affected entities: [node_3, node_7, GOAL]
skip(node_5): p_valid = 0.71 | value = 0.41 | risk = 0.08
  predicted delta: node_5: pending → skipped
  affected entities: [node_5]
retry(node_2): p_valid = 0.43 | value = 0.28 | risk = 0.67 [HIGH RISK]
  predicted delta: node_2: failed → active
  affected entities: [node_2, node_4, node_6]
```
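A skeleton block of this shape can be rendered from per\-action backbone outputs\. The sketch below is an assumption about how that might look: the ActionPrediction container, the render\_skeleton name, the ASCII arrow, and the 0\.5 threshold for the HIGH RISK tag are all hypothetical \(the example above only shows that 0\.67 is tagged and 0\.14 is not\)\.

```python
from dataclasses import dataclass


@dataclass
class ActionPrediction:
    """Backbone outputs for one candidate action (hypothetical container)."""
    action: str
    p_valid: float
    value: float
    risk: float
    delta: dict     # node id -> (old status, new status)
    affected: list  # entity names touched by the action


def render_skeleton(predictions, risk_threshold=0.5):
    """Format backbone predictions as a text block for the planning prompt."""
    lines = []
    for p in predictions:
        header = (f"{p.action}: p_valid = {p.p_valid:.2f} | "
                  f"value = {p.value:.2f} | risk = {p.risk:.2f}")
        if p.risk > risk_threshold:  # assumed tagging rule
            header += " [HIGH RISK]"
        lines.append(header)
        for node, (old, new) in p.delta.items():
            lines.append(f"  predicted delta: {node}: {old} -> {new}")
        lines.append(f"  affected entities: [{', '.join(p.affected)}]")
    return "\n".join(lines)


preds = [
    ActionPrediction("execute(node_3)", 0.91, 0.63, 0.14,
                     {"node_3": ("pending", "completed")},
                     ["node_3", "node_7", "GOAL"]),
    ActionPrediction("retry(node_2)", 0.43, 0.28, 0.67,
                     {"node_2": ("failed", "active")},
                     ["node_2", "node_4", "node_6"]),
]
print(render_skeleton(preds))
```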

### A\.4 GILP Correction Prompt \(Phase 3b\)

Correction Prompt \(triggered when consistency $< \tau_{\text{low}}$\)

```text
## Backbone Consistency Check FAILED
Your previous response imagined a state transition that disagrees with the parametric world model. Please revise.

Previous action: execute(node_3)
Your imagined next state: {node_3: completed, node_7: completed}
Backbone prediction: {node_3: completed}

Discrepancy detected:
• You imagined node_7 changing to "completed", but the backbone predicts no change to node_7 at this step.
• node_7 has unmet dependencies (node_4 still pending). Completing it now would be a hallucinated transition.

Please revise your selected_action and/or imagined_next_state to be consistent with the backbone prediction. If you believe the backbone is wrong, provide explicit reasoning from the state.

## Revised Response Format
(same JSON schema as A.2)
```
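The trigger for this prompt is a comparison between the agent's imagined delta and the backbone's predicted delta\. The exact consistency score is not given in this appendix, so the sketch below substitutes a Jaccard overlap over \(node, new status\) pairs; that scoring rule, the function names, and the threshold value 0\.6 are assumptions for illustration\.

```python
def delta_consistency(imagined, predicted):
    """Jaccard overlap between two sets of (node, new_status) changes."""
    imagined_set = set(imagined.items())
    predicted_set = set(predicted.items())
    if not imagined_set and not predicted_set:
        return 1.0  # both agree that nothing changes
    return len(imagined_set & predicted_set) / len(imagined_set | predicted_set)


def discrepancies(imagined, predicted):
    """Human-readable disagreements to paste into the correction prompt."""
    notes = []
    for node, status in sorted(imagined.items()):
        if node not in predicted:
            notes.append(f'You imagined {node} changing to "{status}", but the '
                         f"backbone predicts no change to {node} at this step.")
        elif predicted[node] != status:
            notes.append(f'You imagined {node} changing to "{status}", but the '
                         f'backbone predicts "{predicted[node]}".')
    for node, status in sorted(predicted.items()):
        if node not in imagined:
            notes.append(f'The backbone predicts {node} changing to "{status}", '
                         f"but your imagined state omits it.")
    return notes


TAU_LOW = 0.6  # hypothetical threshold

imagined = {"node_3": "completed", "node_7": "completed"}
predicted = {"node_3": "completed"}
score = delta_consistency(imagined, predicted)
print(score)            # 0.5
print(score < TAU_LOW)  # True, so the correction prompt would be sent
for note in discrepancies(imagined, predicted):
    print(note)
```

Under this scoring rule the example from the prompt above scores 0\.5, because one of the two asserted changes is unsupported by the backbone\.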

### A\.5 GILP Risk Gate Prompt \(Phase 4\)

Risk Gate Prompt \(triggered when risk $> \rho_{\text{th}}$\)

```text
## High-Risk Action Warning
The parametric world model predicts high failure risk for your selected action:
• Selected action: retry(node_2) --- predicted risk = 0.82
• Risk factors: node_2 failed due to a dependency cycle; retrying without resolving node_4 has 82% failure probability.

Please select from the lower-risk alternatives:
0: execute(node_3) [risk = 0.14]
1: skip(node_5) [risk = 0.08]

## Response Format
(same JSON schema as A.2)
```
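The gate itself reduces to a threshold test followed by a filter over the remaining candidates\. A minimal sketch follows, assuming a dictionary of per\-action risk scores and a threshold of 0\.5; the function name and the threshold are hypothetical\.

```python
def risk_gate(selected, risks, rho_th=0.5):
    """Return None if the selected action passes the gate.

    Otherwise return the lower-risk alternatives, in candidate order,
    as (action, risk) pairs for the risk gate prompt.
    """
    if risks[selected] <= rho_th:
        return None
    return [(action, risk) for action, risk in risks.items()
            if action != selected and risk <= rho_th]


risks = {"execute(node_3)": 0.14, "skip(node_5)": 0.08, "retry(node_2)": 0.82}
print(risk_gate("execute(node_3)", risks))  # None
print(risk_gate("retry(node_2)", risks))
# [('execute(node_3)', 0.14), ('skip(node_5)', 0.08)]
```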

### A\.6 OpenAI API Call \(Python\)

OpenAI API: GPT\-4o\-mini Planning Call

```python
import json

import openai

# SYSTEM_PROMPT is the A.1 system prompt; user_prompt is an A.2 or A.3 planning prompt.
client = openai.OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ],
    response_format={"type": "json_object"},
    temperature=0.0,
    max_tokens=512,
    seed=42,
)
action = json.loads(resp.choices[0].message.content)
tok_in = resp.usage.prompt_tokens
tok_out = resp.usage.completion_tokens
# $0.15/MTok in, $0.60/MTok out
cost = tok_in * 0.15 / 1e6 + tok_out * 0.60 / 1e6
```

### A\.7 Anthropic API Call \(Python\)

Anthropic API: Claude\-3\-Haiku Planning Call

```python
import anthropic

# SYSTEM_PROMPT is the A.1 system prompt; user_prompt is an A.2 or A.3 planning prompt.
client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=512,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": user_prompt}],
)
text = resp.content[0].text
action = extract_json(text)  # regex-based helper, defined elsewhere
tok_in = resp.usage.input_tokens
tok_out = resp.usage.output_tokens
# $0.25/MTok in, $1.25/MTok out
cost = tok_in * 0.25 / 1e6 + tok_out * 1.25 / 1e6
```
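The extract\_json helper called in the Anthropic snippet is not shown in this appendix\. A hypothetical stand\-in is sketched below: it uses a regular expression only to locate candidate opening braces and leaves the actual parsing to the standard\-library JSON decoder, which handles nested objects more reliably than a single pattern\. The released helper may differ\.

```python
import json
import re


def extract_json(text):
    """Return the first parseable JSON object embedded in free-form text.

    Hypothetical stand-in for the helper used in the Anthropic call.
    """
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue  # this brace did not start a valid object; try the next
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in model output")


reply = 'Here is my plan: {"selected_action": "skip(node_5)", "confidence": 0.7} Done.'
print(extract_json(reply)["selected_action"])  # skip(node_5)
```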

## Appendix B State Serialisation Ablation

We compare three serialisations of $G_t$ on TaskGraph: *JSON\-table* \(default\), *Markdown\-checklist*, and *nested\-adjacency*\. The hallucinated\-state rate varies by a factor of $1.6\times$ across these three formats, with JSON\-table the cleanest because its uniform column structure aligns with the JSON\-mode response format and minimises serialisation token cost\.
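To give a sense of what the three formats might look like, the sketch below serialises the same two\-node graph three ways\. The concrete layouts are guesses based only on the format names; the field names \(id, status, deps\) and all three functions are assumptions rather than the serialisers used in the experiments\.

```python
import json


def to_json_table(nodes):
    """JSON-table: one flat record per node with uniform columns."""
    return json.dumps(nodes)


def to_markdown_checklist(nodes):
    """Markdown-checklist: one checkbox line per node."""
    lines = []
    for n in nodes:
        box = "x" if n["status"] == "completed" else " "
        deps = ", ".join(n["deps"]) or "none"
        lines.append(f"- [{box}] {n['id']} ({n['status']}); depends on: {deps}")
    return "\n".join(lines)


def to_nested_adjacency(nodes):
    """Nested-adjacency: each node nests its dependencies recursively."""
    by_id = {n["id"]: n for n in nodes}

    def expand(node_id):
        node = by_id[node_id]
        return {"status": node["status"],
                "deps": {d: expand(d) for d in node["deps"]}}

    return json.dumps({n["id"]: expand(n["id"]) for n in nodes})


nodes = [
    {"id": "node_1", "status": "completed", "deps": []},
    {"id": "node_2", "status": "pending", "deps": ["node_1"]},
]
for render in (to_json_table, to_markdown_checklist, to_nested_adjacency):
    print(render(nodes))
```

The nested form repeats shared dependencies once per parent, which is one plausible reason a flat table would cost fewer tokens\.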

## Appendix C Benchmark Details

We implement four graph\-structured planning environments\.

#### TaskGraph\.

Directed acyclic graphs of subtasks with dependency edges\. Nodes have status $\in \{\textsc{pending}, \textsc{active}, \textsc{completed}, \textsc{failed}\}$\. Actions: execute\(i\), skip\(i\), retry\(i\)\. A node can only be executed once all its predecessors are completed\. Horizon 3–15; OOD tasks have longer chains and wider branching factors\.
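The precedence rule can be sketched as a tiny environment\. This is a simplified illustration, not the benchmark code: the class name and method names are hypothetical, only the execute action is modelled, and execution is assumed to succeed deterministically \(moving a node straight from pending to completed\)\.

```python
class TaskGraph:
    """Minimal dependency-graph environment with an execute action only."""

    def __init__(self, deps):
        self.deps = deps  # node id -> list of predecessor node ids
        self.status = {node: "pending" for node in deps}

    def can_execute(self, node):
        """A node is executable once every predecessor is completed."""
        return (self.status[node] == "pending"
                and all(self.status[p] == "completed" for p in self.deps[node]))

    def valid_actions(self):
        return [f"execute({n})" for n in self.deps if self.can_execute(n)]

    def execute(self, node):
        if not self.can_execute(node):
            raise ValueError(f"{node} has unmet dependencies or is not pending")
        self.status[node] = "completed"  # assumed deterministic success


graph = TaskGraph({"node_1": [], "node_2": ["node_1"], "node_3": ["node_1", "node_2"]})
print(graph.valid_actions())  # ['execute(node_1)']
graph.execute("node_1")
print(graph.valid_actions())  # ['execute(node_2)']
```

An agent that imagines node\_3 completing at this point would be asserting a transition the environment rejects, which is the kind of error the validity signal is meant to expose\.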

#### ToolChain\.

Directed data\-flow graphs of API calls\. Edges carry data payloads\. Actions: call\(i\), verify\(i\), rollback\(i\)\. A tool call fails if any required input tool has not been verified\. Horizon 4–12; OOD introduces new tool combinations\.

#### ResourceAlloc\.

Bipartite graphs of resources and jobs\. Actions: assign\(r, j\), release\(r\), escalate\(j\)\. Contention is modelled by capacity constraints per resource node\. Horizon 5–15; OOD increases contention\.

#### RepairFlow\.

System component graphs with failure propagation edges\. Actions: diagnose\(i\), repair\(i\), isolate\(i\)\. Cascading failures mean that an isolated component may hide a secondary fault\. Horizon 4–12; OOD introduces hidden failures\.

Each benchmark has 500 train / 100 val / 100 test tasks \(60 in\-distribution \+ 40 OOD\)\.

## Appendix D Parametric Backbone Training

All six backbone variants \(MLP, GCN, MPNN, GPS, ActionNode, ErrorAware\) are trained for 50 epochs with Adam \(lr $=10^{-3}$\) on oracle trajectories from each benchmark\. The multi\-task loss is

$$
\begin{aligned}
\mathcal{L} &= \mathcal{L}_{\text{BCE}}(p_{\text{valid}}) + 0.5\,\mathcal{L}_{\text{CE}}(\Delta G) \\
&\quad + 0.3\,\mathcal{L}_{\text{MSE}}(\hat{r}) + 0.3\,\mathcal{L}_{\text{BCE}}(\hat{d}) \\
&\quad + 0.2\,\mathcal{L}_{\text{BCE}}(\hat{\rho}).
\end{aligned}
$$

State vectors concatenate per\-node status one\-hot \(4 classes\) with edge\-type embeddings \(8 classes\)\. MLP has two 128\-unit hidden layers; GCN and MPNN have two 64\-dim message\-passing layers; GPS adds Transformer attention\. All models have $<100$k parameters\.
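For a single training example, the weighted sum above can be written out in plain Python\. This is a sketch for checking the weights, not the training code: the dictionary keys are hypothetical names for the five heads, each head is treated as a scalar \(or a small class distribution for the delta head\), and the clamping constant is an assumption\.

```python
import math


def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def cross_entropy(probs, target, eps=1e-12):
    """Cross-entropy for one categorical prediction and a target class index."""
    return -math.log(max(probs[target], eps))


def mse(pred, target):
    return (pred - target) ** 2


def multitask_loss(pred, target):
    """Weighted sum with the coefficients 1, 0.5, 0.3, 0.3, 0.2 from the text."""
    return (bce(pred["p_valid"], target["valid"])
            + 0.5 * cross_entropy(pred["delta_probs"], target["delta_class"])
            + 0.3 * mse(pred["r_hat"], target["r"])
            + 0.3 * bce(pred["d_hat"], target["d"])
            + 0.2 * bce(pred["rho_hat"], target["rho"]))


pred = {"p_valid": 0.5, "delta_probs": [0.5, 0.5], "r_hat": 1.0,
        "d_hat": 0.5, "rho_hat": 0.5}
target = {"valid": 1, "delta_class": 0, "r": 0.0, "d": 0, "rho": 1}
loss = multitask_loss(pred, target)
print(round(loss, 4))  # 1.6863, i.e. 2*ln(2) + 0.3
```

With every probability at 0\.5, each log term equals ln 2, so the four log\-loss weights \(1 \+ 0\.5 \+ 0\.3 \+ 0\.2 = 2\) contribute 2 ln 2 and the squared error contributes 0\.3\.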

## Appendix E Error Probability Bound Derivation

Let $\mathcal{E}_k$ denote the event that a hallucination or invalid action occurs at step $k$ of an $H$\-step rollout, and $p_{\text{err}}(k)=\Pr[\mathcal{E}_k]$\.

#### Independence\-baseline formula\.

Under exact stepwise independence of the $\mathcal{E}_{k}$,

$$
P_{\text{any}}(H)=1-\prod_{k=1}^{H}\bigl(1-p_{\text{err}}(k)\bigr). \tag{3}
$$

We report the empirical plug\-in as $\mathrm{IndepBound}(H)$, together with $\mathrm{ExpectedErrors}(H)=\sum_{k}p_{\text{err}}(k)$, which is an expected error count rather than a probability and may therefore exceed 1\.
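The two reported quantities can be computed directly from a list of per\-step error rates. The sketch below is illustrative: the function names `indep_bound` and `expected_errors` and the constant rate of 0\.1 are assumptions chosen for the example, not values from the experiments.

```python
from typing import Sequence


def indep_bound(p_err: Sequence[float]) -> float:
    """Plug-in for 1 - prod_k (1 - p_err(k)) under stepwise independence."""
    no_error = 1.0
    for p in p_err:
        no_error *= (1.0 - p)  # probability of surviving this step error-free
    return 1.0 - no_error


def expected_errors(p_err: Sequence[float]) -> float:
    """Sum of per-step error rates: an expected count, so it may exceed 1."""
    return sum(p_err)


# Hypothetical rollout: constant per-step error rate 0.1 over H = 12 steps.
p_err = [0.1] * 12
print(indep_bound(p_err), expected_errors(p_err))
```

With these hypothetical inputs the expected error count is 1\.2, while the independence plug\-in remains a probability of about 0\.72, which shows why the two quantities are reported separately.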

#### Caveat on the independence approximation\.

The PD metric \(Section [3](https://arxiv.org/html/2606.27806#S3)\) shows that errors are positively serially correlated: hallucinated atoms bias $\Pr[\mathcal{E}_{k+1}]$ upward\. Under such correlation, \([3](https://arxiv.org/html/2606.27806#A5.E3)\) *underestimates* $\Pr[\bigcup_{k}\mathcal{E}_{k}]$; that is, it is a lower bound, not an upper bound\. We therefore report $\mathrm{IndepBound}(H)$ as an *independence\-baseline reference*: the GILP\-vs\-baseline gap it shows is conservative; the true gap on $\Pr[\bigcup_{k}\mathcal{E}_{k}]$ is at least as large\.

## Appendix F Proofs

#### Proof of Proposition 1\.

Fix a step $k$\. An erroneous Phase\-2 draft remains erroneous after GILP exactly when either the gate misses it, or the gate detects it but the correction fails\. Therefore

$$
\begin{aligned}
\Pr[E_{k}^{\mathrm{GILP}}] &= \Pr[E_{k}]\,\Pr[\bar{D}_{k}\cup(D_{k}\cap\bar{R}_{k})\mid E_{k}] \\
&= \Pr[E_{k}]\,\bigl\{1-\Pr[D_{k}\mid E_{k}]\,\Pr[R_{k}\mid D_{k},E_{k}]\bigr\} \\
&= \Pr[E_{k}]\,(1-\alpha_{k}\beta_{k}).
\end{aligned}
$$

The inequality follows because $\alpha_{k},\beta_{k}\in[0,1]$\. Summing the same bound over $k=1,\ldots,H$ and using linearity of expectation gives \([2](https://arxiv.org/html/2606.27806#S4.E2)\)\. The argument does not assume independence across time; temporal dependence only affects the empirical values of $\alpha_{k}$ and $\beta_{k}$\.
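The case split in the proof can be checked numerically. The sketch below reads $\alpha_{k}=\Pr[D_{k}\mid E_{k}]$ and $\beta_{k}=\Pr[R_{k}\mid D_{k},E_{k}]$ off the last two lines of the derivation, and compares the closed form with an explicit sum over the two ways an error survives. The function names and the numeric values are hypothetical and are not measurements from the paper.

```python
from typing import Sequence, Tuple


def residual_closed_form(p_err: float, alpha: float, beta: float) -> float:
    """Pr[E_k] * (1 - alpha_k * beta_k), the final line of the derivation."""
    return p_err * (1.0 - alpha * beta)


def residual_by_cases(p_err: float, alpha: float, beta: float) -> float:
    """Sum the two disjoint ways an erroneous draft stays erroneous."""
    missed = p_err * (1.0 - alpha)                     # gate misses the error
    detected_not_fixed = p_err * alpha * (1.0 - beta)  # detected, correction fails
    return missed + detected_not_fixed


def expected_residual_errors(steps: Sequence[Tuple[float, float, float]]) -> float:
    """Sum the per-step residual over a rollout (linearity of expectation)."""
    return sum(residual_closed_form(p, a, b) for p, a, b in steps)


# Hypothetical per-step (p_err, alpha, beta) triples for a 3-step rollout.
steps = [(0.20, 0.9, 0.8), (0.15, 0.8, 0.7), (0.10, 0.95, 0.9)]
for p, a, b in steps:
    assert abs(residual_closed_form(p, a, b) - residual_by_cases(p, a, b)) < 1e-12
    assert residual_closed_form(p, a, b) <= p  # never worse than the raw draft

print(expected_residual_errors(steps))
```

Because the per\-step values are simply summed, no independence across steps is needed for the expected\-count statement, which matches the closing remark of the proof.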

## Appendix G Calibration and Simulation Protocol

The behavioural simulator shares the same graph transition function as the benchmarks and samples only the agent\-side events: action choice, imagined\-state errors, parse failures, token counts, and correction outcomes\. Parameters are fit on the GPT\-4o\-mini TaskGraph calibration episodes and then stress\-tested on all four environments\. The simulator slightly under\-predicts real short\-horizon success and over\-predicts real HSR; consequently, the simulator tables should be read as conservative evidence for the direction and relative size of GILP’s gains rather than as exact deployment numbers for a particular API and prompt\. Detailed residuals are reported in CALIBRATION\_RESIDUALS\.md\.