HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

arXiv cs.CL Papers

Summary

HalluWorld is a controlled benchmark framework for evaluating hallucination in large language models using explicit reference world models across synthetic environments like gridworlds, chess, and realistic terminal tasks. It enables fine-grained analysis of failure modes such as perceptual hallucination, multi-step state tracking, and causal simulation, revealing that frontier models still struggle with complex reasoning not solved by extended thinking.

arXiv:2605.19341v1 Announce Type: new Abstract: Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or rely on observations in settings that are difficult to reproduce. To study root causes, we introduce HalluWorld, an extensible benchmark grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this world. Building on this view, we construct synthetic and semi-synthetic environments in which the reference world is fully specified, the model's view is controlled, and hallucination labels are generated automatically. HalluWorld spans gridworlds, chess, and realistic terminal tasks, enabling controlled variation of world complexity, observability, temporal change, and source-conflict policy, and disentangling hallucinations into fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation remain difficult and are not generally solved by extended thinking. In the terminal setting, models also struggle with when to abstain. The uneven profile of failures across probe types and domains suggests that hallucinations arise from distinct failure modes rather than a single capability. Our results suggest that controlled reference worlds offer a scalable and reproducible path toward measuring and reducing hallucinations in modern language models.
Original Article
View Cached Full Text

Cached at: 05/20/26, 08:25 AM

# HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models
Source: [https://arxiv.org/html/2605.19341](https://arxiv.org/html/2605.19341)
Emmy Liu Carnegie Mellon University &Varun Gangal Patronus AI &Michael Yu Independent Researcher &Zhuofu Tao Independent Researcher &Karan Singh Stanford University &Sachin Kumar The Ohio State University &Steven Y\. Feng Stanford University All authors are affiliated with DegenAI Labs\. Corresponding authors: Emmy Liu \(mengyan3@cs\.cmu\.edu\), Varun Gangal \(vgtomahawk@gmail\.com\), and Steven Y\. Feng \(syfeng@stanford\.edu\)\.

###### Abstract

Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across tasks such as summarization, question answering, retrieval\-augmented generation, and agentic interaction\. This fragmentation makes it unclear whether a mitigation that works in one setting actually reduces hallucinations across contexts\. Current hallucination benchmarks either require human annotation and fixed references that may eventually be memorized, or rely on naturalistic observations often recorded in settings that are difficult to reproduce or test systematically\. To enable further research on the root causes of hallucination, we introduceHalluWorld, an extensible benchmark framework grounded in an explicit reference\-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this reference world\. Building on this view, we construct a family of synthetic and semi\-synthetic benchmark environments in which the reference world is fully specified, the model’s observable view is controlled, and hallucination labels can be generated automatically by construction\.HalluWorldspans multiple settings that are classically representative for AI, i\.e\., gridworlds, chess, and realistic terminal tasks\. This enables controlled variation of key factors such as world complexity, observability, temporal change, and source\-conflict policy, allowing us to disentangle hallucinations into more fine\-grained error categories\. We evaluate frontier and open\-weight language models across these settings and find consistent patterns across domains: perceptual hallucination on directly observed information is near\-solved for frontier models, while multi\-step state tracking and causal forward simulation are still difficult for frontier models, and are not generally solved by extended thinking\. In the terminal setting specifically, models also struggle with when to abstain from answering\. The uneven profile of failures across probe types and domains suggest that different hallucinations arise from qualitatively distinct failure modes rather than reflecting a single underlying capability\. Our results suggest that controlled reference worlds offer a scalable and reproducible path toward measuring and reducing hallucinations in modern language models\.

††footnotetext:Code and data:[https://github\.com/DegenAI\-Labs/HalluWorld](https://github.com/DegenAI-Labs/HalluWorld)## 1Introduction

Despite years of effort on mitigation, hallucinations remain an unresolved problem in frontier language models \(LM\)\. As LM\-based agents are increasingly used in high\-stakes scenarios, it is imperative to understand exactly what conditions lead LMs to hallucinate and why\. Despite being grouped under the same term, misreading a fact, misremembering a prior decision, incorrectly reasoning about the deployment context, or trusting misinformation are different failure modes which require different forms of mitigation\. Although existing benchmarks have made progress in taxonomizing hallucinations\[[52](https://arxiv.org/html/2605.19341#bib.bib21),[77](https://arxiv.org/html/2605.19341#bib.bib77),[2](https://arxiv.org/html/2605.19341#bib.bib65)\], a critical limitation is that the environment, what the LM can observe, and how it must resolve conflicting information are fixed by design and cannot be varied independently, making it difficult to isolate failure modes\.

To address this, we ground our approach in a recent unified formulation of hallucination proposed byLiuet al\.\[[41](https://arxiv.org/html/2605.19341#bib.bib24)\], which decomposes hallucinations into areference worldencoding the set of ground truth states and transitions, aview functioncontrolling what the model can observe, and aconflict policydeciding how conflicts between observable sources are settled\. In this setting, ahallucinationoccurs when the model’s output contradicts the ground\-truth reference world\. Making these components explicit allows us to stress\-test models in a systematic way by independently varying one factor while keeping others fixed\. To make this concrete, consider a terminal agent tasked with locating a configuration file: the same agent may answer correctly when given full directory output, hallucinate a file path when given partial output, and hallucinate a different path when a stale log contradicts what it can currently observe\. By isolating which conditions trigger hallucination, we can proactively support the model with targeted interventions\.

![Refer to caption](https://arxiv.org/html/2605.19341v1/images/Fig1Images/Figure1_v1_FOV.png)Figure 1:TheHalluWorldbenchmarkspans three domains \(gridworlds, chess, and terminals\) and tests models using five probe categories targeting distinct cognitive skills:Causal\(C\) tests understanding of cause\-effect relationships,Perceptual\(P\) tests spatial reasoning and object tracking,Memory\(M\) tests retention of past observations,Uncertainty\(U\) tests reasoning under partial observability, andCompound/X\(X\) tests multi\-step reasoning across connected environments\. Hallucination is measured by placed probes that query models about environment observations they have seen\.HalluWorldqualitative examples from each of the three domains can be found in §[A](https://arxiv.org/html/2605.19341#A1)\.We construct a family of benchmarks,HalluWorld, with three domains spanning a customizable gridworld, chess environment, and agentic terminal environment\. In these worlds, the ground truth labels for hallucinations are defined by construction, with no human labeling required\. Crucially,HalluWorldis designed forextensibility– new questions probing hallucinations can be easily added for each environment\. Our contributions can be summarized as:

1. 1\.We introduceHalluWorld, a benchmark\-framework that operationalizes hallucination as observable error relative to an explicit reference world and conflict policy\.
2. 2\.We instantiate our benchmark as a suite of controlled environments spanning gridworlds, chess, and terminal tasks\. Our benchmark comprises 33 unique levels and 839 probe questions in the gridworld environment, 7 unique levels and 350 probes in the chess environment, and 110 unique terminal tasks and 529 probes in the terminal environment\. Additionally, a level editor and trajectory recorder on the gridworld environment enable extension to more complex scenarios\.
3. 3\.We benchmark a dozen frontier and open models onHalluWorldand find that perceptual accuracy on direct observed information is near\-solved on frontier models, while both memory hallucinations and causal forward simulation remain an issue even for frontier models\. Furthermore, models consistently rely on environmental testimony over direct observation\. Increasing thinking effort seldom uniformly helps, suggesting that reasoning alone may not mitigate hallucinations\.

## 2Related Works \(Extended in §[K](https://arxiv.org/html/2605.19341#A11)\)

Hallucination has traditionally been defined relative to a fixed source, e\.g\., unsupported content in summarization, and later broadened to factual errors in LLM outputs\[[32](https://arxiv.org/html/2605.19341#bib.bib64),[44](https://arxiv.org/html/2605.19341#bib.bib80),[28](https://arxiv.org/html/2605.19341#bib.bib88),[39](https://arxiv.org/html/2605.19341#bib.bib97),[34](https://arxiv.org/html/2605.19341#bib.bib72)\]\. Recent work highlights that benchmarks often conflate multiple failure modes and lack a consistent definition, motivating more structured formulations such as the “world model” view of hallucination\[[2](https://arxiv.org/html/2605.19341#bib.bib65),[41](https://arxiv.org/html/2605.19341#bib.bib24)\]\.

Most existing benchmarks evaluate hallucination in*static*settings: summarization and QA benchmarks define truth w\.r\.t\. fixed documents or facts, while RAG benchmarks study conflicts between parametric and retrieved knowledge but treat provided context as the full reference\[[31](https://arxiv.org/html/2605.19341#bib.bib59),[48](https://arxiv.org/html/2605.19341#bib.bib60),[49](https://arxiv.org/html/2605.19341#bib.bib86),[17](https://arxiv.org/html/2605.19341#bib.bib85)\]\. Recent work on RAG\-considerate pretraining studies how models should allocate knowledge between parametric memory and retrieval during pretraining, highlighting that hallucination tendencies may depend on what information is made externally observable vs\. learned in the parameters\[[57](https://arxiv.org/html/2605.19341#bib.bib7)\]\. Agent benchmarks such asMIRAGE\-BenchandAgentHalluevaluate action\-level hallucinations, but rely on snapshot audits or human annotation rather than fully specified, controllable worlds\[[77](https://arxiv.org/html/2605.19341#bib.bib77),[42](https://arxiv.org/html/2605.19341#bib.bib44)\]\.

HalluWorlddiffers by evaluating hallucination in*partially observed, evolving environments*with ground truth defined by simulator state\. This enables automatic, reproducible labeling and controlled variation of observability, temporal dynamics, and conflicting evidence\. By combining an explicit definition of hallucination with controllable environments and automatic labels,HalluWorldprovides a more precise and diagnostic framework for studying when and why hallucinations occur\.

## 3TheHalluWorldBenchmark Suite

HalluWorldspans three domains: gridworlds, chess, and terminal tasks\. Each instantiates a different set of challenges for models\. Gridworlds are fully synthetic and customizable game\-like environments, commonly used as testbeds for RL and planning\[[61](https://arxiv.org/html/2605.19341#bib.bib26),[64](https://arxiv.org/html/2605.19341#bib.bib27)\]\. Hence, they provide a high degree of control over world state and specific challenges instantiated, while having minimal overlap with pretraining data other than game\-specific priors\. On the other hand, the terminal environment provides the most realistic scenarios for a deployed agent as the trajectories we use to generate the benchmark are based on real software engineering tasks\. However, the complexity and noisiness of terminal contexts makes it somewhat harder to isolate specific failure modes\. Chess occupies a middle ground: it is a real and well\-specified domain with representation in pretraining, but it is easy to generate new board states and check correctness\.

Instantiating the reference world model framework\.Our benchmark suite directly instantiates the formal hallucination definition proposed in recent work\[[41](https://arxiv.org/html/2605.19341#bib.bib24)\], which defines hallucination as observable world\-model error w\.r\.t\. a reference world𝒲=\(𝒮,ℋ,ℛ\)\\mathcal\{W\}=\(\\mathcal\{S\},\\mathcal\{H\},\\mathcal\{R\}\), view functionVV, and truth functionT𝒲,𝒫T\_\{\\mathcal\{W\},\\mathcal\{P\}\}\. Table[1](https://arxiv.org/html/2605.19341#S3.T1)shows how each environment family specifies these components\. In all three domains,𝒲\\mathcal\{W\}is*well specified and programmatically known*, enabling automatic ground\-truth generation without human/model annotation\. Conflict policies𝒫\\mathcal\{P\}play a role when adversarial information arises \(e\.g\., misleading signposts in gridworlds, transposed FEN strings in chess\)\.

Table 1:Instantiation of the reference world model framework across the threeHalluWorldfamilies\. Each domain provides a fully specified𝒲\\mathcal\{W\}, controllableVV, and computableT𝒲,𝒫T\_\{\\mathcal\{W\},\\mathcal\{P\}\}\.Shared probe categories\.We use a set of fiveprobe categoriesfor each domain, each targeting a distinct cognitive demand\.Perceptual \(P\)probes test accurate read\-out of values directly present in the current observation\.Memory \(M\)probes require tracking values or states across multiple prior observations\.Causal \(C\)probes require reasoning about cause\-effect relationships or forward\-simulating the outcome of actions under environment mechanics\.Uncertainty \(U\)probes require recognizing the limits of available evidence and abstaining from a definite answer when context is insufficient\. Chess omits uncertainty probes by design \(§[3\.2](https://arxiv.org/html/2605.19341#S3.SS2)\)\.Compound \(X\)probes require integrating evidence across multiple context sections, room visits, or artifact types simultaneously\.

### 3\.1HalluWorld\-Grid: Interactive Environments for Controlled Hallucination Probing

HalluWorld\-Gridis a family of hand\-crafted gridworld environments, built on MiniGrid\[[7](https://arxiv.org/html/2605.19341#bib.bib15)\]\. It contains33 unique levelsand 839 probe questions \(see §[C](https://arxiv.org/html/2605.19341#A3)for a full list of levels\)\. Hallucination labels are generated automatically from environment state\. We provide a level editor and trajectory recorder to allow crafting of more complex environment\-probe tuples \(see §[J](https://arxiv.org/html/2605.19341#A10)\)\.Serializerscontrol*how*the environment state is presented to the model; in other words, a manipulation ofVVindependent of the content itself\. We implement three formats \(Table[2](https://arxiv.org/html/2605.19341#S3.T2)\)\.

Table 2:Serializer formats used inHalluWorld\-Grid\. Each controls the presentation of world state to the model, independently of what the state contains\.Evaluation Details:Levels are organized into the five probe categories \(§[1](https://arxiv.org/html/2605.19341#S3.T1)\), with specific instantiations:P\(6 levels\) tests static\-scene perception under varying object density, orientations, and change conditions\.M\(6 levels\) probes temporal integration via multi\-room traversal and object tracking under river physics\.C\(9 levels\) introduces interactive mechanics: fire, flood, and pressure plates, with harder levels presenting adversarial notice\-boards that contradict observations \(see §[D](https://arxiv.org/html/2605.19341#A4)\)\.U\(5 levels\) spans unobserved rooms, signposts with varying reliability, and scenes with unanswerable questions\.X\(7 levels\) combines the above into extended multi\-room episodes probed across the full observation history\.000The adversarial notice\-boards and signposts, along with the agent’s observations, are an example of multiple info sources, with conflict resolution policy𝒫\\mathcal\{P\}being that the agent should rely on its own observations \(source of truth\)\.Within each level, probes take one of six closed\-form answer types:presence\(yes/no object detection\),count\(exact integer\),state\(attribute/door status\),location\(coordinate\),causal\(forward simulation outcome\), anduncertainty\(answer concretely or“can’t determine”\)\.

### 3\.2HalluWorld\-Chess: Structured World\-Model Probes in a Rule\-Governed Domain

Chess provides a complementary environment to gridworlds: chess is a domain well represented in training data\[[36](https://arxiv.org/html/2605.19341#bib.bib25)\], and a rich set of cognitively distinct question types that map naturally onto our probe categories\. Chess also probes an adversarial form of hallucination: positions are drawn from real Lichess puzzles, so the model may simultaneously hold strong distributional priors over “typical” game continuations*and*be confronted with an observed position that contradicts them\.

Environment and observation format\.Each episode draws a position from a curated pool of Lichess FEN\[[11](https://arxiv.org/html/2605.19341#bib.bib28)\]strings \(varied rating and theme strata\)\. Before the probe is presented,kkrandom legal continuations are executed from that position \(defaultk=14k=14\), advancing the board into novel territory and suppressing opening\-book exploitation\. The serializer then renders a fixed observation format: i\) an ASCII 8×\\times8 board grid with labeled files \(a–h\) and ranks \(1–8\), ii\) the side to move, and iii\) an FEN string \(standardized, single\-line text used to describe a specific board position; optionally misleading with two pieces transposed\)\.

Probes\.We evaluate 7 probes \(see §[B](https://arxiv.org/html/2605.19341#A2)\) across 3 of the probe categories \(§[3](https://arxiv.org/html/2605.19341#S3)\)\. Uncertainty is omitted as the observed board state always fully specifies the current position\. Perceptual probes require static read\-out from the current board\. Causal probes require forward simulation: applying one or two hypothetical moves and evaluating a property of the resulting position\. Memory probes remove the board, requiring the model to maintain accurate game state across a long move sequence\.ChatContextmode prepends 4\-16 past\-game moves with commentary before the real observation \(chess analogue of gridworld X levels\), testing whether models separate context signal from noise\.

### 3\.3HalluWorld\-Terminal: Agentic Hallucinations in Terminal Tasks

HalluWorld\-TerminalgroundsHalluWorldin realistic agentic deployment: the reference world𝒲\\mathcal\{W\}is the full state of a live Linux filesystem, and the viewVVis the raw terminal history visible to the agent at a given trajectory step, which is≈\\approx30\-60K chars long\.

Probe construction\.We derive 529 probes from 132Terminal\-Bench\[[46](https://arxiv.org/html/2605.19341#bib.bib37)\]trajectories spanning 110 software\-engineering tasks \(e\.g\., kernel builds, ML framework configuration, etc\.\), executed by agpt\-4o\-miniagent \(to elicit diverse error modes instead of optimized solutions from stronger models\) in a live Linux environment\. Probes are LLM\-generated with \(gpt\-5\.4,reasoning\_effort=high\) and grounded in file\-system diffs \(e\.g\. SHA\-256 hashes, mtimes, directory listings\) at the trigger step\. We retain probes withanswerability\_score = 5anddifficulty\_score≥\\geq4, obtaining an answerable and hard final set of probes\.111Both scores are assigned by the probe\-generating LLM during creation\. Answerability \(1–5\) measures whether the probe has unambiguous ground truth from supplied context; difficulty \(1–5\) measures the extent of reasoning required to answer\.All 5 probe categories are instantiated with approximately equal coverage \(≈105\\approx 105\-108108probes each\)\. Each probe targets 1/5 hallucination failure modes:stale memory\(187 probes\) i\.e\., relying on an outdated value rather than the most recent terminal output;cross\-category reasoning\(110\) i\.e\., failing to bridge between context sections such as a process listing and a directory snapshot;uncertainty overclaim\(107\) i\.e\., asserting a definite answer with insufficient context;causal shortcut\(104\) i\.e\., incorrect attribution of cause in the trace; andversion/API hallucination\(21\) i\.e\., substituting train\-time API knowledge for observed behavior\.

Evaluation protocol\.Models receive the unedited, ANSI\-stripped terminal context and must return an answer in a constrained key\-value schema \(e\.g\.,outcome=⟨ok\|fail⟩\)\. Scoring is fully rule\-based and no LLM judge is used\. Details shared in §[E](https://arxiv.org/html/2605.19341#A5)\.

## 4Experimental Setup

We evaluate several models, both frontier closed and open\-weight, with both thinking and non\-thinking variants when applicable\.222We use API credits from external providers \(Baseten, OpenAI, and Anthropic\) for open and closed models\.Non\-thinking:GPT\-4o\-mini, GPT\-5\.4\-mini, GPT\-4o, Sonnet 4\.6, Opus 4\.6; DeepSeek\-V3\.1, GLM\-5, Qwen\-3\-30B, Kimi K2\.6\.Thinking:o3\-mini, o4\-mini, o3, GPT\-5\.5, Sonnet 4\.6 \(T\), Opus 4\.6 \(T\)\. All models with temperature are queried atτ=0\\tau=0\. Thinking models are by default queried at amediumlevel, with reasoning ablations for selected models per domain\. ForHalluWorld\-Grid, each \(level, serializer\) pair is evaluated over 10 fixed seeds for nondeterministic levels \(P1 and P3\)\. ForHalluWorld\-Chess, each probe is run over 50 different board positions\. ForHalluWorld\-Terminal, each probe is evaluated with the full runtime context plus the probe question, up to 60k characters with middle\-truncation on longer agent trajectories\. Unless otherwise specified, models are allocated 256 tokens for their answer and 16k tokens for thinking\.

## 5HalluWorld\-GridResults

Table 3:Micro\-averaged hallucination rates by probe category and serializer forHalluWorld\-Grid, with models ordered by overall hallucination\. \(T\) = medium thinking\. Columns are grouped into \(left\) probe categories —P\(Perceptual\),M\(Memory\),C\(Causal\),U\(Uncertainty\),X\(Cross\-category\); and \(right\) input serializers —Symbolic,Grid, andMemory\. TheOverallcolumn reports global average hallucination across all conditions\. Cell color reflects relative global hallucination rate \(green = low, red = high\);boldmarks the column minimum\.### 5\.1Category\-Level Summary

Table[3](https://arxiv.org/html/2605.19341#S5.T3)reports probe\-level hallucination rates by category for each model\. A consistent hierarchy holds for most models:U<P<M<C\\text\{U\}<\\text\{P\}<\\text\{M\}<\\text\{C\}, with X showing high variance\. Presence probes approach 100% accuracy for strong models, while the causal category is difficult even for frontier models \(Claude Opus 4\.6: 20\.1%, GPT\-5\.5: 25\.3%\)\. Two frontier clusters emerge: reasoning models \(GPT\-5\.5, Claude Opus/Sonnet 4\.6, o3, o4\-mini, o3\-mini\) achieve 5\.0\-8\.1% overall, while non\-reasoning frontier models \(GPT\-4o\) and most open\-weight models cluster at 12\.6\-15\.4%\. Kimi K2\.6 shows a striking anomaly: a low overall rate of 9\.3% driven almost entirely by the C\-category at 47\.4%, while remaining relatively low on P and U\. This breakdown highlights that overall hallucination rate conflates different failure modes, motivating category\-level decomposition\.

### 5\.2Serializer as an Independent Variable

The format in which the environment state is presented has a notable effect on hallucination, as seen in[Table 3](https://arxiv.org/html/2605.19341#S5.T3)\. Grid serialization is harder overall, but the damage is probe\-specific: allocentric rotation \(P3\) collapses to near 0% under grid for all models, while presence probes remain near\-ceiling in both formats\. Cross\-observation attribute tracking \(M2\) is catastrophic under grid even for strong models, yet count probes are differentially preserved – o3\-mini maintains near\-100% count accuracy under grid where non\-thinking GPT models drop to 13–25%\. This serializer and probe type×\\timesmodel interaction cannot be decomposed by existing benchmarks, and has a direct practical implication: the format in which an agent receives environment state is an independent driver of hallucination\.

### 5\.3Thinking Models

Comparing thinking and non\-thinking variants within the same model family reveals a consistent and counterintuitive pattern\. Extended thinking slightly improves perceptual accuracy \(Sonnet 4\.6 P:2\.8%→1\.1%2\.8\\%\\to 1\.1\\%; Opus 4\.6 P:3\.7%→1\.7%3\.7\\%\\to 1\.7\\%\) but*worsens*causal hallucination \(Sonnet 4\.6 C:19\.7%→24\.0%19\.7\\%\\to 24\.0\\%; Opus 4\.6 C:20\.1%→25\.2%20\.1\\%\\to 25\.2\\%\), with negligible difference on uncertainty\. We hypothesize that this may be caused by perceptual probes having a clear answer that is derivable by carefully reading the context through an extended CoT, whereas causal probes require open\-ended forward simulation; additional reasoning provides more opportunity to confabulate plausible intermediate states rather than trusting direct observations\. This echoes results in VLMs where extended reasoning can increase hallucination\[[40](https://arxiv.org/html/2605.19341#bib.bib38)\]\. See §[G](https://arxiv.org/html/2605.19341#A7)for full reasoning ablation\.

### 5\.4Hard Subset

The Hard subset comprises 12 \(level, serializer\) pairs where≥\\geq5 models achieve≥\\geq20% hallucination rate \(full results in §[H](https://arxiv.org/html/2605.19341#A8)\), skewing toward memory\-serializer causal levels and grid\-serializer perceptual levels – confirming serializer format as a first\-order difficulty driver\. Frontier models achieve 15\.5\-21\.3% mean hallucination rates on this subset; thinking models provide no consistent advantage \(Sonnet 4\.6: 17\.9% with thinking vs\. 20\.0% without\), consistent with the causal thinking paradox in §[5\.3](https://arxiv.org/html/2605.19341#S5.SS3)\. Kimi K2\.6 is the clearest outlier at 53\.9%, driven by near\-total failure on the C1b variants \(95\.0% and 100\.0%\), and C6 \(Flood\-Fire Escape\) is among the hardest levels, where adversarial testimony compounds causal reasoning under a time constraint \(Sonnet 4\.6: 56\.1%\)\. Overall, we see that the Hard subset ofHalluWorld\-Gridis effective and challenges even frontier models\.

### 5\.5Failure Mode Analysis

Over\-trusting textual information\.Models consistently privilege written testimony over direct observation\. In M4 \(Unreliable Narrator\), even strong models partially defer to a signpost that contradicts what they can directly observe\. C1a with\-board vs\. no\-board sharpens this: Opus 4\.6 hallucinates*more*with an*accurate*notice board present \(42\.0% vs\. 28\.4% without\), suggesting that written rules \(whether accurate or not\) can derail implicit causal reasoning rather than support it\.

Failure to track multi\-step state changes\.Memory levels reveal that models struggle to maintain accurate world state across observations\. In M1 \(River Field\), models fallback to the stale notice board rather than computing current object positions from elapsed steps\. In M2 \(Witness Stand\), cross\-chamber attribute tracking collapses under grid serialization even when presence detection remains intact, isolating the failure to the binding of properties to objects across time\.

Difficulty reasoning about future states\.Causal levels requiring forward simulation are the hardest\. C3 \(Flood Room\) demands prediction of which tiles will be passable at a future timestep\. Hallucination rates reveal a striking within\-frontier split: Opus 4\.6 and Sonnet 4\.6 achieve 0%, while o3 reaches 43\.3% and o4\-mini 33\.3% – reflecting different causal reasoning behavior and exposing distinct failure modes across model families\. C6 \(Flood\-Fire Escape\) compounds forward simulation with adversarial testimony, and is among the hardest levels for frontier models \(Sonnet 4\.6: 56\.1%\)\. Overall,HalluWorld\-Gridallows us to stress\-test models effectively by isolating features such as the grid serialization, forward\-simulation dynamics, and adversarial testimony\.

### 5\.6In\-Navigation Setting \(InNav\)

To studyHalluWorld\-Gridin a more interactive setting and mirror real\-world agent deployment, we evaluate models in an In\-Navigation \(InNav\) setting where they answer probes while actively navigating the environment\. By comparingInNavto aCtrlStaticcondition with identical trajectories, we can examine two effects: epistemic grounding from action\-perception feedback vs\. cognitive load from navigation demands\. We find that navigation often reduces hallucination for reasoning models \(e\.g\., up to−13\.2%\-13\.2\\%\), especially in causal and uncertainty categories, but can increase errors in memory or long\-horizon tasks, where cognitive load dominates\. Full details are in §[I](https://arxiv.org/html/2605.19341#A9)\. Clearly, hallucination depends on more than just static reasoning ability, but on how models interact with their environment\. This demonstrates thatHalluWorld\-Gridis also useful for evaluating agents in more realistic, embodied contexts\.

## 6HalluWorld\-ChessResults

Table 4:HalluWorld\-Chesshallucination rates by probe category under two serializers: excluding a FEN and including an incorrect FEN\.Columns: Overall \(all 7 probes,∼\\sim348 questions\); P = Perceptual \(can\_capture,defended,hanging\); C = Causal \(hypothetical\_in\_check,after\_move\_undefended\_count\); M = Memory \(hidden\_side\_capture\_stats\); X = Compound \(san\_legal\_move\)\. Color marksrelativehallucination rate \(green = low, red = high\);boldmarks column minimum\.Category\-level summary\.Non\-reasoning models and those without extended reasoning cluster tightly on P and C:gpt\-4o,gpt\-4o\-mini, andgpt\-5\.4\-miniwithout reasoning are nearly indistinguishable on those two categories with≈\\approx50% hallucination\. All three collapse on both long\-sequence categories: M and X reach avg\. hallucination \>80%\. The same broad pattern holds for the open\-weight models: Qwen\-3\-30B, GLM\-5, and DeepSeek\-V3\.1\. Adding a reasoning budget togpt\-5\.4\-minicuts hallucination sharply on P \(49\.3%→\\to7\.3%\) and C \(52\.0%→\\to13\.0%\), but leaves substantial long\-horizon failures: M remains 75\.0% and X remains 36\.0%\. Clearly, our probe categories allow us to separate failure modes rather than collapsing them into one aggregate score\.

Memory and compound probes as frontier stress tests\.At P and C, the four strongest models are near\-saturated:gpt\-5\.5,o3,gpt\-5\.4, ando4\-miniall achieve hallucination rates<<10%\. The long\-sequence categories separate them from weaker models, but also reveal meaningful differences within the frontier group\.gpt\-5\.5solves both M and X at 0\.0%;o3remains near\-saturated \(M 2\.1%, X 6\.0%\);gpt\-5\.4is similarly strong \(M 8\.3%, X 4\.0%\)\. By contrast,o4\-minisolves the compound SAN probe \(X 0\.0%\) but is weaker on the pure memory\-counting probe \(M 20\.8%\), whileo3\-minihas low P \(1\.3%\) but much higher C \(39\.0%\), M \(20\.8%\), and X \(28\.0%\)\. Claude thinking models show the opposite: Opus 4\.6 \(T\) and Sonnet 4\.6 \(T\) are strong on M/X \(M 2\.1%, X 6\.0%; M 8\.3%, X 10\.0%\) despite high causal hallucination \(48\.0% and 44\.0%\)\.

Incorrect FEN metadatatests robustness against corrupted auxiliary states\. While overall model ordering is largely preserved, the perturbation exposes brittleness; even top models likegpt\-5\.5ando3show a 3\-4% increase in hallucination\. The effect is more mixed for weaker models: some improve slightly because the corrupted FEN is not the dominant failure source, while others remain dominated by M/X failures\. The FEN ablation acts as a serializer\-robustness probe, distinguishing models that reason from the board itself from those distracted by inconsistent textual states\.

HalluWorld\-Chessillustrates the diagnostic value of a structured probe taxonomy over aggregate scores\. The weakest non\-reasoning models cluster tightly under No FEN \(Overall 57\-63%\) while the P/C/M/X decomposition immediately locates the dominant failures in long\-sequence state tracking and compound SAN reconstruction\. Model\-specific anomalies only surface at the probe level\. The incorrect\-FEN condition adds a robustness test, revealing whether models over\-trust corrupted auxiliary metadata even with a visible board\. This validates the benchmark design: the ground truths are algorithmically checkable, difficulty is tunable via game length, and the taxonomy reveals particular failure modes\.

## 7HalluWorld\-TerminalResults

Table 5:HalluWorld\-Terminalhallucination rates by probe category\.\(T\) = extended thinking\. Categories: C = Causal, X = Cross\-tier compound, M = Memory, P = Perceptual, U = Uncertainty\. Cell color reflects relative hallucination \(green = low, red = high\);boldmarks column min\.Overall accuracy\.The terminal domain is the most difficult inHalluWorld: even GPT\-5\.5 achieves only 94\.1% overall accuracy, and no model saturates any individual probe category\. Two tiers emerge, mirroring the gridworld finding: thinking/reasoning models achieve 5\.9–21\.0% hallucination rates overall, while standard models cluster at 26\.3–56\.5%\. As in the gridworld evaluation, this gap partly reflects a compute budget disparity: thinking models were givenmax\_output\_tokens=16 384while standard models were capped at 256\. The performance difference should therefore be interpreted as an upper bound on the underlying capability difference, not a controlled comparison\.

Probe category hierarchy\.A consistent ordering holds across most models \(by accuracy\):Perceptual≈Causal\>Cross\-Tier≈Memory≫Uncertainty\\text\{Perceptual\}\\approx\\text\{Causal\}\>\\text\{Cross\-Tier\}\\approx\\text\{Memory\}\\gg\\text\{Uncertainty\}\. Perceptual and causal accuracy are near\-ceiling for the strongest models \- correctly reading a value directly present in a 50k\-character terminal context is tractable once long\-context retrieval is reliable\. Uncertainty probes are the hardest: even GPT\-5\.5 and o3 cap at 76\.9% accuracy, meaning the best models hallucinate≈\\approx1/4 answers when the correct response is“cannot determine”\. This contrasts with the controlled gridworld U\-category, where uncertainty is often easier, and suggests that epistemic abstention becomes substantially harder in realistic terminal contexts\.

Notable anomalies\.GPT\-4o\-mini shows a unique inverted memory profile: its memory hallucination rate \(57\.4%\) substantially exceeds its causal rate \(39\.6%\), suggesting a specific failure in long\-horizon state tracking\. Kimi\-K2\.6 exhibits the highest uncertainty hallucination in the terminal domain \(73\.1%\)\. Since U\-probes specifically target theuncertainty overclaimfailure mode, this shows Kimi’s high tendency towards confident confabulation under epistemic limits here\.

Thinking depth ablation\.The thinking paradox identified inHalluWorld\-Grid\(§[5\.3](https://arxiv.org/html/2605.19341#S5.SS3)\) has a mixed instantiation inHalluWorld\-Terminal\. Looking at[Table 14](https://arxiv.org/html/2605.19341#A5.T14), for Sonnet 4\.6, accuracy increases monotonically from 79\.2% \(no thinking\) to 86\.2% \(max\), with every probe category benefiting from deeper reasoning\. GPT\-5\.5 shows a different pattern: any reasoning budget yields a large jump \(\+\+8\.5 pp, 86\.4%→\\to94\.9%\), but the gains are non\-monotonic:mediumis slightly belowlow,highis best, andxhighslightly regresses relative tohigh\. GPT\-5\.5 shows a non\-monotonic pattern on uncertainty:low,medium, andxhighreasoning underperform theno\-reasoning setting, whilehighreasoning reaches the best U accuracy \(84\.3%\)\. This suggests that, for the most part, reasoning interferes specifically with epistemic abstention rather than with factual retrieval or causal inference\. Uncertainty remains the hardest category regardless of reasoning depth for both models, with all configurations plateauing in the 60–85% range\. Full ablation results are in §[F](https://arxiv.org/html/2605.19341#A6)\. Additional analysis of hallucination rate as a function of trajectory depth is provided in[Figure 2](https://arxiv.org/html/2605.19341#A5.F2)\.

## 8Conclusion & Future Work

Our benchmark,HalluWorld, demonstrates that hallucination is fundamentally a world\-modeling failure that decomposes into distinct categories \(perception, memory, causality, uncertainty\), many of which remain unsolved for frontier models\. By grounding evaluation in explicit, controllable environments,HalluWorldenables reproducible and fine\-grained diagnosis of when and why hallucinations occur\.HalluWorldturns hallucination from a vague metric into an experimentally tractable problem, supporting targeted interventions and deeper analysis beyond aggregate scores\.

There are several directions for future work\. First, our current probing measures explicit or stated beliefs, but models may have a wider distribution in their internal representations\[[56](https://arxiv.org/html/2605.19341#bib.bib16)\]\. Mechanistic techniques such as causal tracing\[[51](https://arxiv.org/html/2605.19341#bib.bib18),[45](https://arxiv.org/html/2605.19341#bib.bib19)\]could provide one lens to observe these internal beliefs\. Second, current probes use canonical locations co\-designed with the constructed environment as locations where cognitive load is likely higher\. More general formulation of probe placement could use active learning approaches\[[73](https://arxiv.org/html/2605.19341#bib.bib20)\]to dynamically discover states in a given environment where the agent would face maximal cognitive load\. Further, one could more explicitly incorporate and assess multi\-source conflict resolution\. Lastly, it would be interesting to investigate hallucinations in small model and data limited regimes, e\.g\., at the BabyLM\-scale\[[71](https://arxiv.org/html/2605.19341#bib.bib2),[25](https://arxiv.org/html/2605.19341#bib.bib3),[16](https://arxiv.org/html/2605.19341#bib.bib5),[43](https://arxiv.org/html/2605.19341#bib.bib14),[12](https://arxiv.org/html/2605.19341#bib.bib13),[24](https://arxiv.org/html/2605.19341#bib.bib12),[75](https://arxiv.org/html/2605.19341#bib.bib6)\]\.

## Acknowledgments

We gratefully acknowledge Modal, the National Science Foundation ACCESS Program, Lambda Labs’ Research Grant Program, and NVIDIA’s Academic Grant Program for providing compute resources that enabled this work\. EL was supported by the National Sciences and Engineering Research Council of Canada \(NSERC\), \[funding reference number 578085\], as well as the SoftBank\-ARM Fellowship\.

## References

- \[1\]\(2025\-04\)Hallucination of Multimodal Large Language Models: A Survey\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px3.p1.1)\.
- \[2\]Y\. Bang, Z\. Ji, A\. Schelten, A\. Hartshorn, T\. Fowler, C\. Zhang, N\. Cancedda, and P\. Fung\(2025\-04\)HalluLens: LLM Hallucination Benchmark\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.19341#S1.p1.1),[§2](https://arxiv.org/html/2605.19341#S2.p1.1)\.
- \[3\]F\. S\. Bao, M\. Li, R\. Qu, G\. Luo, E\. Wan, Y\. Tang, W\. Fan, M\. S\. Tamber, S\. Kazi, V\. Sourabh, M\. Qi, R\. Tu, C\. Xu, M\. Gonzales, O\. Mendelevitch, and A\. Ahmad\(2024\-10\)FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1)\.
- \[4\]Center for AI Safety, Scale AI, and HLE Contributors Consortium\(2026\)A benchmark of expert\-level academic questions to assess ai capabilities\.Nature649,pp\. 1139–1146\.External Links:[Document](https://dx.doi.org/10.1038/s41586-025-09962-4),[Link](https://doi.org/10.1038/s41586-025-09962-4)Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1)\.
- \[5\]J\. Chen, H\. Lin, X\. Han, and L\. Sun\(2023\-12\)Benchmarking Large Language Models in Retrieval\-Augmented Generation\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1)\.
- \[6\]M\. Chevalier\-Boisvert, D\. Bahdanau, S\. Lahlou, L\. Willems, C\. Saharia, T\. H\. Nguyen, and Y\. Bengio\(2019\-12\)BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[7\]M\. Chevalier\-Boisvert, B\. Dai, M\. Towers, R\. Perez\-Vicente, L\. Willems, S\. Lahlou, S\. Pal, P\. S\. Castro, and J\. Terry\(2023\-12\)Minigrid & miniworld: modular & customizable reinforcement learning environments for goal\-oriented tasks\.InAdvances in Neural Information Processing Systems 36, New Orleans, LA, USA,Cited by:[§3\.1](https://arxiv.org/html/2605.19341#S3.SS1.p1.1)\.
- \[8\]P\. Clark, O\. Tafjord, and K\. Richardson\(2020\-05\)Transformers as Soft Reasoners over Language\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[9\]M\. Côté, Á\. Kádár, X\. Yuan, B\. Kybartas, T\. Barnes, E\. Fine, J\. Moore, R\. Y\. Tao, M\. Hausknecht, L\. E\. Asri, M\. Adada, W\. Tay, and A\. Trischler\(2019\-11\)TextWorld: A Learning Environment for Text\-based Games\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[10\]A\. Drouin, M\. Gasse, M\. Caccia, I\. H\. Laradji, M\. Del Verme, T\. Marty, L\. Boisvert, M\. Thakkar, Q\. Cappart, D\. Vazquez, N\. Chapados, and A\. Lacoste\(2024\-07\)WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px3.p1.1)\.
- \[11\]S\. J\. Edwards, S\. Forsyth, J\. Stanback, and A\. Saremba\(1994\)Standard: portable game notation specification and implementation guide\. 1994\.URL https://ia802908\.us\.archive\.org/26/items/pgn\-standard\-1994\-03\-12/PGN\_standard\_1994\-03\-12\.txt\.Cited by:[§3\.2](https://arxiv.org/html/2605.19341#S3.SS2.p2.3)\.
- \[12\]S\. Y\. Feng, N\. D\. Goodman, and M\. C\. Frank\(2024\-11\)Is child\-directed speech effective training data for language models?\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 22055–22071\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.1231/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1231)Cited by:[§8](https://arxiv.org/html/2605.19341#S8.p2.1)\.
- \[13\]S\. Y\. Feng, J\. Huynh, C\. P\. Narisetty, E\. Hovy, and V\. Gangal\(2021\-08\)SAPPHIRE: approaches for enhanced concept\-to\-text generation\.InProceedings of the 14th International Conference on Natural Language Generation,A\. Belz, A\. Fan, E\. Reiter, and Y\. Sripada \(Eds\.\),Aberdeen, Scotland, UK,pp\. 212–225\.External Links:[Link](https://aclanthology.org/2021.inlg-1.21/),[Document](https://dx.doi.org/10.18653/v1/2021.inlg-1.21)Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[14\]S\. Y\. Feng, V\. Khetan, B\. Sacaleanu, A\. Gershman, and E\. Hovy\(2023\-05\)CHARD: clinical health\-aware reasoning across dimensions for text generation models\.InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics,A\. Vlachos and I\. Augenstein \(Eds\.\),Dubrovnik, Croatia,pp\. 313–327\.External Links:[Link](https://aclanthology.org/2023.eacl-main.24/),[Document](https://dx.doi.org/10.18653/v1/2023.eacl-main.24)Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[15\]S\. Y\. Feng, K\. Lu, Z\. Tao, M\. Alikhani, T\. Mitamura, E\. Hovy, and V\. Gangal\(2022\-Jun\.\)Retrieve, caption, generate: visual grounding for enhancing commonsense in text generation models\.Proceedings of the AAAI Conference on Artificial Intelligence36\(10\),pp\. 10618–10626\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/21306),[Document](https://dx.doi.org/10.1609/aaai.v36i10.21306)Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[16\]S\. Y\. Feng, A\. W\. M\. Tan, and M\. C\. Frank\(2026\)Baby scale: investigating models trained on individual children’s language input\.External Links:2603\.29522,[Link](https://arxiv.org/abs/2603.29522)Cited by:[§8](https://arxiv.org/html/2605.19341#S8.p2.1)\.
- \[17\]R\. Friel, M\. Belyi, and A\. Sanyal\(2025\-01\)RAGBench: Explainable Benchmark for Retrieval\-Augmented Generation Systems\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.19341#S2.p2.1)\.
- \[18\]T\. Guan, F\. Liu, X\. Wu, R\. Xian, Z\. Li, X\. Liu, X\. Wang, L\. Chen, F\. Huang, Y\. Yacoob, D\. Manocha, and T\. Zhou\(2024\-03\)HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision\-Language Models\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px3.p1.1)\.
- \[19\]D\. Ha and J\. Schmidhuber\(2018\-05\)World Models\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.1207631)Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[20\]D\. Hafner, T\. Lillicrap, J\. Ba, and M\. Norouzi\(2020\-03\)Dream to Control: Learning Behaviors by Latent Imagination\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[21\]D\. Hafner, T\. Lillicrap, I\. Fischer, R\. Villegas, D\. Ha, H\. Lee, and J\. Davidson\(2019\-06\)Learning Latent Dynamics for Planning from Pixels\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[22\]R\. Harang, J\. Naradowsky, Y\. Gujju, and Y\. Miyao\(2025\-08\)Tracking World States with Language Models: State\-Based Evaluation Using Chess\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[23\]M\. Henaff, J\. Weston, A\. Szlam, A\. Bordes, and Y\. LeCun\(2017\-05\)Tracking the World State with Recurrent Entity Networks\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[24\]J\. Hu, A\. W\. M\. Tan, S\. Y\. Feng, and M\. C\. Frank\(2025\)Language production is harder than comprehension for children and language models\.InProceedings of the Annual Meeting of the Cognitive Science Society,Vol\.47\.External Links:[Link](https://escholarship.org/uc/item/5rz8b9jg)Cited by:[§8](https://arxiv.org/html/2605.19341#S8.p2.1)\.
- \[25\]M\. Y\. Hu, A\. Mueller, C\. Ross, A\. Williams, T\. Linzen, C\. Zhuang, R\. Cotterell, L\. Choshen, A\. Warstadt, and E\. G\. Wilcox\(2024\)Findings of the second babylm challenge: sample\-efficient pretraining on developmentally plausible corpora\.InThe 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning,pp\. 1–21\.Cited by:[§8](https://arxiv.org/html/2605.19341#S8.p2.1)\.
- \[26\]L\. Huang, W\. Yu, W\. Ma, W\. Zhong, Z\. Feng, H\. Wang, Q\. Chen, W\. Peng, X\. Feng, B\. Qin, and T\. Liu\(2024\-11\)A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions\.External Links:[Document](https://dx.doi.org/10.1145/3703155)Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px1.p1.1)\.
- \[27\]A\. Jacovi, A\. Wang, C\. Alberti, C\. Tao, J\. Lipovetz, K\. Olszewska, L\. Haas, M\. Liu, N\. Keating, A\. Bloniarz, C\. Saroufim, C\. Fry, D\. Marcus, D\. Kukliansky, G\. S\. Tomar, J\. Swirhun, J\. Xing, L\. Wang, M\. Gurumurthy, M\. Aaron, M\. Ambar, R\. Fellinger, R\. Wang, Z\. Zhang, S\. Goldshtein, and D\. Das\(2025\-01\)The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long\-Form Input\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1)\.
- \[28\]Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. Bang, D\. Chen, W\. Dai, H\. S\. Chan, A\. Madotto, and P\. Fung\(2024\-07\)Survey of Hallucination in Natural Language Generation\.External Links:[Document](https://dx.doi.org/10.1145/3571730)Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.19341#S2.p1.1)\.
- \[29\]M\. Kazemi, Q\. Yuan, D\. Bhatia, N\. Kim, X\. Xu, V\. Imbrasaite, and D\. Ramachandran\(2023\-06\)BoardgameQA: A Dataset for Natural Language Reasoning with Contradictory Information\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[30\]H\. Kokel, M\. Katz, K\. Srinivas, and S\. Sohrabi\(2026\-02\)ACPBench: Reasoning about Action, Change, and Planning\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v39i25.34857)Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[31\]W\. Kryściński, B\. McCann, C\. Xiong, and R\. Socher\(2019\-10\)Evaluating the Factual Consistency of Abstractive Text Summarization\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.19341#S2.p2.1)\.
- \[32\]K\. Lee, O\. Firat, A\. Agarwal, C\. Fannjiang, and D\. Sussillo\(2019\)Hallucinations in Neural Machine Translation\.External Links:[Link](https://openreview.net/forum?id=SkxJ-309FQ)Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.19341#S2.p1.1)\.
- \[33\]B\. Z\. Li, Z\. C\. Guo, and J\. Andreas\(2025\-10\)\(How\) Do Language Models Track State?\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[34\]J\. Li, X\. Cheng, W\. X\. Zhao, J\. Nie, and J\. Wen\(2023\-10\)HaluEval: A Large\-Scale Hallucination Evaluation Benchmark for Large Language Models\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.19341#S2.p1.1)\.
- \[35\]Y\. Li, Y\. Du, K\. Zhou, J\. Wang, W\. X\. Zhao, and J\. Wen\(2023\-10\)Evaluating Object Hallucination in Large Vision\-Language Models\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px3.p1.1)\.
- \[36\]Lichess\(2022\)Lichess open database\.Note:Accessed: 2026\-05\-04External Links:[Link](https://database.lichess.org/)Cited by:[§3\.2](https://arxiv.org/html/2605.19341#S3.SS2.p1.1)\.
- \[37\]B\. Y\. Lin, W\. Zhou, M\. Shen, P\. Zhou, C\. Bhagavatula, Y\. Choi, and X\. Ren\(2020\-11\)CommonGen: a constrained text generation challenge for generative commonsense reasoning\.InFindings of the Association for Computational Linguistics: EMNLP 2020,T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 1823–1840\.External Links:[Link](https://aclanthology.org/2020.findings-emnlp.165/),[Document](https://dx.doi.org/10.18653/v1/2020.findings-emnlp.165)Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[38\]J\. Lin, Y\. Du, O\. Watkins, D\. Hafner, P\. Abbeel, D\. Klein, and A\. Dragan\(2024\-05\)Learning to Model the World with Language\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[39\]S\. Lin, J\. Hilton, and O\. Evans\(2022\-05\)TruthfulQA: Measuring How Models Mimic Human Falsehoods\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.19341#S2.p1.1)\.
- \[40\]C\. Liu, Z\. Xu, Q\. Wei, J\. Wu, J\. Zou, X\. E\. Wang, Y\. Zhou, and S\. Liu\(2025\)More thinking, less seeing? assessing amplified hallucination in multimodal reasoning models\.External Links:2505\.21523,[Link](https://arxiv.org/abs/2505.21523)Cited by:[§5\.3](https://arxiv.org/html/2605.19341#S5.SS3.p1.4)\.
- \[41\]E\. Liu, V\. Gangal, C\. Zou, M\. Yu, X\. Huang, A\. Chang, Z\. Tao, K\. Singh, S\. Kumar, and S\. Y\. Feng\(2026\)A unified definition of hallucination: it’s the world model, stupid\!\.External Links:2512\.21577,[Link](https://arxiv.org/abs/2512.21577)Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.19341#S1.p2.1),[§2](https://arxiv.org/html/2605.19341#S2.p1.1),[§3](https://arxiv.org/html/2605.19341#S3.p2.5)\.
- \[42\]X\. Liu, X\. Yang, Z\. Li, P\. Li, and R\. He\(2026\-01\)AgentHallu: Benchmarking Automated Hallucination Attribution of LLM\-based Agents\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2605.19341#S2.p2.1)\.
- \[43\]B\. Long, R\. Z\. Sparks, V\. Xiang, S\. Stojanov, Z\. Yin, G\. E\. Keene, A\. W\. Tan, S\. Y\. Feng, C\. Zhuang, V\. A\. Marchman,et al\.\(2024\)The babyview dataset: high\-resolution egocentric videos of infants’ and young children’s everyday experiences\.arXiv preprint arXiv:2406\.10447\.Cited by:[§8](https://arxiv.org/html/2605.19341#S8.p2.1)\.
- \[44\]J\. Maynez, S\. Narayan, B\. Bohnet, and R\. McDonald\(2020\-05\)On Faithfulness and Factuality in Abstractive Summarization\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.19341#S2.p1.1)\.
- \[45\]K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov\(2022\)Locating and editing factual associations in gpt\.Advances in neural information processing systems35,pp\. 17359–17372\.Cited by:[§8](https://arxiv.org/html/2605.19341#S8.p2.1)\.
- \[46\]M\. A\. Merrill, A\. G\. Shaw, N\. Carlini, B\. Li, H\. Raj, I\. Bercovich, L\. Shi, J\. Y\. Shin, T\. Walshe, E\. K\. Buchanan, J\. Shen, G\. Ye, H\. Lin, J\. Poulos, M\. Wang, M\. Nezhurina, J\. Jitsev, D\. Lu, O\. M\. Mastromichalakis, Z\. Xu, Z\. Chen, Y\. Liu, R\. Zhang, L\. L\. Chen, A\. Kashyap, J\. Uslu, J\. Li, J\. Wu, M\. Yan, S\. Bian, V\. Sharma, K\. Sun, S\. Dillmann, A\. Anand, A\. Lanpouthakoun, B\. Koopah, C\. Hu, E\. Guha, G\. H\. S\. Dreiman, J\. Zhu, K\. Krauth, L\. Zhong, N\. Muennighoff, R\. Amanfu, S\. Tan, S\. Pimpalgaonkar, T\. Aggarwal, X\. Lin, X\. Lan, X\. Zhao, Y\. Liang, Y\. Wang, Z\. Wang, C\. Zhou, D\. Heineman, H\. Liu, H\. Trivedi, J\. Yang, J\. Lin, M\. Shetty, M\. Yang, N\. Omi, N\. Raoof, S\. Li, T\. Y\. Zhuo, W\. Lin, Y\. Dai, Y\. Wang, W\. Chai, S\. Zhou, D\. Wahdany, Z\. She, J\. Hu, Z\. Dong, Y\. Zhu, S\. Cui, A\. Saiyed, A\. Kolbeinsson, J\. Hu, C\. M\. Rytting, R\. Marten, Y\. Wang, A\. Dimakis, A\. Konwinski, and L\. Schmidt\(2026\)Terminal\-bench: benchmarking agents on hard, realistic tasks in command line interfaces\.External Links:2601\.11868,[Link](https://arxiv.org/abs/2601.11868)Cited by:[Appendix E](https://arxiv.org/html/2605.19341#A5.p1.1),[§3\.3](https://arxiv.org/html/2605.19341#S3.SS3.p2.3)\.
- \[47\]G\. Mialon, C\. Fourrier, C\. Swift, T\. Wolf, Y\. LeCun, and T\. Scialom\(2023\-11\)GAIA: a benchmark for General AI Assistants\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px3.p1.1)\.
- \[48\]S\. Min, K\. Krishna, X\. Lyu, M\. Lewis, W\. Yih, P\. W\. Koh, M\. Iyyer, L\. Zettlemoyer, and H\. Hajishirzi\(2023\-10\)FActScore: Fine\-grained Atomic Evaluation of Factual Precision in Long Form Text Generation\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.19341#S2.p2.1)\.
- \[49\]C\. Niu, Y\. Wu, J\. Zhu, S\. Xu, K\. Shum, R\. Zhong, J\. Song, and T\. Zhang\(2024\-05\)RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval\-Augmented Language Models\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.19341#S2.p2.1)\.
- \[50\]A\. Pagnoni, V\. Balachandran, and Y\. Tsvetkov\(2021\-07\)Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1)\.
- \[51\]N\. Prakash, N\. Shapira, A\. S\. Sharma, C\. Riedl, Y\. Belinkov, T\. R\. Shaham, D\. Bau, and A\. Geiger\(2026\)Language models use lookbacks to track beliefs\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§8](https://arxiv.org/html/2605.19341#S8.p2.1)\.
- \[52\]A\. Ravichander, S\. Ghela, D\. Wadden, and Y\. Choi\(2025\-07\)HALoGEN: fantastic LLM hallucinations and where to find them\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 1402–1425\.External Links:[Link](https://aclanthology.org/2025.acl-long.71/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.71),ISBN 979\-8\-89176\-251\-0Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.19341#S1.p1.1)\.
- \[53\]A\. Rohrbach, L\. A\. Hendricks, K\. Burns, T\. Darrell, and K\. Saenko\(2019\-03\)Object Hallucination in Image Captioning\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px3.p1.1)\.
- \[54\]J\. Schrittwieser, I\. Antonoglou, T\. Hubert, K\. Simonyan, L\. Sifre, S\. Schmitt, A\. Guez, E\. Lockhart, D\. Hassabis, T\. Graepel, T\. Lillicrap, and D\. Silver\(2020\-02\)Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model\.External Links:[Document](https://dx.doi.org/10.1038/s41586-020-03051-4)Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[55\]M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht\(2021\-03\)ALFWorld: Aligning Text and Embodied Environments for Interactive Learning\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[56\]A\. Shrivastava and A\. Holtzman\(2025\)Linearly decoding refused knowledge in aligned language models\.arXiv preprint arXiv:2507\.00239\.Cited by:[§8](https://arxiv.org/html/2605.19341#S8.p2.1)\.
- \[57\]K\. Singh, M\. Yu, V\. Gangal, Z\. Tao, S\. Kumar, E\. Liu, and S\. Y\. Feng\(2026\)To memorize or to retrieve: scaling laws for rag\-considerate pretraining\.External Links:2604\.00715,[Link](https://arxiv.org/abs/2604.00715)Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.19341#S2.p2.1)\.
- \[58\]K\. Sinha, S\. Sodhani, J\. Dong, J\. Pineau, and W\. L\. Hamilton\(2019\-09\)CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[59\]Z\. Su, J\. Zhang, X\. Qu, T\. Zhu, Y\. Li, J\. Sun, J\. Li, M\. Zhang, and Y\. Cheng\(2024\-08\)ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLM\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1)\.
- \[60\]Z\. Sun, S\. Shen, S\. Cao, H\. Liu, C\. Li, Y\. Shen, C\. Gan, L\. Gui, Y\. Wang, Y\. Yang, K\. Keutzer, and T\. Darrell\(2023\-09\)Aligning Large Multimodal Models with Factually Augmented RLHF\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px3.p1.1)\.
- \[61\]R\. S\. Sutton, A\. G\. Barto,et al\.\(1998\)Reinforcement learning: an introduction\.Vol\.1,MIT press Cambridge\.Cited by:[§3](https://arxiv.org/html/2605.19341#S3.p1.1)\.
- \[62\]O\. Tafjord, B\. D\. Mishra, and P\. Clark\(2021\-06\)ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[63\]R\. Tamari, K\. Richardson, A\. Sar\-Shalom, N\. Kahlon, N\. Liu, R\. Tsarfaty, and D\. Shahaf\(2021\-11\)Dyna\-bAbI: unlocking bAbI’s potential with dynamic synthetic benchmarking\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[64\]C\. Tovey and S\. Koenig\(2000\)Gridworlds as testbeds for planning with incomplete information\.InAAAI/IAAI,pp\. 819–824\.Cited by:[§3](https://arxiv.org/html/2605.19341#S3.p1.1)\.
- \[65\]K\. Valmeekam, M\. Marquez, A\. Olmo, S\. Sreedharan, and S\. Kambhampati\(2023\-11\)PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[66\]P\. N\. Venkit, T\. Chakravorti, V\. Gupta, H\. Biggs, M\. Srinath, K\. Goswami, S\. Rajtmajer, and S\. Wilson\(2024\-09\)An Audit on the Perspectives and Challenges of Hallucinations in NLP\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px1.p1.1)\.
- \[67\]A\. Wang, K\. Cho, and M\. Lewis\(2020\-04\)Asking and Answering Questions to Evaluate the Factual Consistency of Summaries\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1)\.
- \[68\]J\. Wang, Y\. Wang, G\. Xu, J\. Zhang, Y\. Gu, H\. Jia, J\. Wang, H\. Xu, M\. Yan, J\. Zhang, and J\. Sang\(2024\-02\)AMBER: An LLM\-free Multi\-dimensional Benchmark for MLLMs Hallucination Evaluation\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px3.p1.1)\.
- \[69\]Y\. Wang, S\. Feng, H\. Wang, W\. Shi, V\. Balachandran, T\. He, and Y\. Tsvetkov\(2024\-10\)Resolving Knowledge Conflicts in Large Language Models\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px2.p1.1)\.
- \[70\]Z\. Wang, G\. Bingham, A\. Yu, Q\. Le, T\. Luong, and G\. Ghiasi\(2024\-07\)HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px3.p1.1)\.
- \[71\]A\. Warstadt, A\. Mueller, L\. Choshen, E\. Wilcox, C\. Zhuang, J\. Ciro, R\. Mosquera, B\. Paranjabe, A\. Williams, T\. Linzen,et al\.\(2023\)Findings of the babylm challenge: sample\-efficient pretraining on developmentally plausible corpora\.InProceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning,Cited by:[§8](https://arxiv.org/html/2605.19341#S8.p2.1)\.
- \[72\]J\. Weston, A\. Bordes, S\. Chopra, A\. M\. Rush, B\. van Merriënboer, A\. Joulin, and T\. Mikolov\(2015\-12\)Towards AI\-Complete Question Answering: A Set of Prerequisite Toy Tasks\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[73\]Y\. Xia, X\. Liu, T\. Yu, S\. Kim, R\. Rossi, A\. Rao, T\. Mai, and S\. Li\(2024\)Hallucination diversity\-aware active learning for text summarization\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 8665–8677\.Cited by:[§8](https://arxiv.org/html/2605.19341#S8.p2.1)\.
- \[74\]T\. Xie, D\. Zhang, J\. Chen, X\. Li, S\. Zhao, R\. Cao, T\. J\. Hua, Z\. Cheng, D\. Shin, F\. Lei, Y\. Liu, Y\. Xu, S\. Zhou, S\. Savarese, C\. Xiong, V\. Zhong, and T\. Yu\(2024\-05\)OSWorld: Benchmarking Multimodal Agents for Open\-Ended Tasks in Real Computer Environments\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px3.p1.1)\.
- \[75\]L\. Zeng, S\. Y\. Feng, and M\. C\. Frank\(2026\)Bringing up a bilingual babylm: investigating multilingual language acquisition using small\-scale models\.External Links:2603\.29552,[Link](https://arxiv.org/abs/2603.29552)Cited by:[§8](https://arxiv.org/html/2605.19341#S8.p2.1)\.
- \[76\]A\. Zhang, K\. Nguyen, J\. Tuyls, A\. Lin, and K\. Narasimhan\(2024\-09\)Language\-Guided World Models: A Model\-Based Approach to AI Control\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px4.p1.1)\.
- \[77\]W\. Zhang, Y\. Sun, P\. Huang, J\. Pu, H\. Lin, and D\. Song\(2025\-07\)MIRAGE\-Bench: LLM Agent is Hallucinating and Where to Find Them\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.19341#S1.p1.1),[§2](https://arxiv.org/html/2605.19341#S2.p2.1)\.
- \[78\]S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig\(2024\-04\)WebArena: A Realistic Web Environment for Building Autonomous Agents\.Cited by:[Appendix K](https://arxiv.org/html/2605.19341#A11.SS0.SSS0.Px3.p1.1)\.

## Appendix AHalluWorldQualitative Examples

### A\.1HalluWorld\-Grid

Table 6:HalluWorld\-GridPerception Probes \(P1 Dense Array\):Example showing world state \(Grid serializer\), four probe types, and model responses\.✗indicates hallucination,✓indicates correct answer\.Table 7:HalluWorld\-GridMemory/Causal Probes \(C1a Persistent Chain\):InNav mode navigation amplifies hallucinations in GPT\-4o and o3\-mini \(demonstratingcognitive load\), whereas GPT\-5\.5 maintains accuracy \(epistemic grounding\)\.
### A\.2HalluWorld\-Chess

Table 8:HalluWorld\-ChessPerceptual/Causal Probes:Example board used for static board\-reading probes and short\-horizon causal rollout\. Ground truth is computed directly from the chess engine\.
### A\.3HalluWorld\-Terminal

Table 9:HalluWorld\-TerminalRuntime Probes \(3d\-model\-format\-legacy\):Example showing a late compile loop where the model must reconcile terminal output, command history, filesystem snapshots, and an inconsistent working directory signal\.ComponentDetailsTask:3d\-model\-format\-legacyRuntime Terminal Evidence \(step 0258\):Serializer:Terminal trace \+ path snapshotsCmdOperation/app/MdfLib mtimelisting\_sha2560249timeline memset edit1777902356661d\.\.\.0250header memset edit177790235648dd\.\.\.0251delete windows\.h from stdafx\.cpp177790235648dd\.\.\.0252g\+\+ compile177790235648dd\.\.\.0253header memset edit1777902371e0cf\.\.\.0254mesh memset edit1777902372e0cf\.\.\.0255timeline memset edit1777902372e0cf\.\.\.0256header memset edit17779023723dfc\.\.\.0257delete windows\.h from stdafx\.cpp17779023733dfc\.\.\.0258g\+\+ compile17779023733dfc\.\.\.Trigger Command:g\+\+ \-o MdfFormat MdfFormat\.cpp stdafx\.cpp \-I\.Trace Signal:Compile output still reportsMdfFormat\.cpp:543:55andstdafx\.h:11:10: fatal error: windows\.h; captured prompt shows/app/MdfLib\#, while relative snapshots forMdfFormat\.cppandstdafx\.cppunder/appare nonexistent\.Mode:Runtime probeScope:through\_triggerProbes \(5 total\)Probe 1 \(Perceptual\):“Which command indices changedlisting\_sha256while leavingmtime\_epochunchanged?”→\\rightarrowidx=0250,0256Probe 2 \(Memory\):“Which earlier reported compile issue was never directly targeted by anysedpattern in commands 0219–0257?”→\\rightarrowuntouched=line543\_castProbe 3 \(Causal\):“If command 0257 deleted\#include <windows\.h\>fromstdafx\.hinstead ofstdafx\.cpp, what would happen to that fatal diagnostic?”→\\rightarrowdiag=disappearProbe 4 \(Uncertainty\):“Were the relative snapshotsMdfFormat\.cppandstdafx\.cpptaken in the same directory context that the compile commands used?”→\\rightarrowstatus=contradictedProbe 5 \(Cross\-Category\):“Which recurring edit type is associated with every observed/app/MdfLiblisting\-hash change and unrelated to thewindows\.hinclude chain?”→\\rightarrowedit=header\_memset\_fixEvaluation NoteThe extracted probe JSONs store grounded questions and gold answers, but this checkout does not include per\-probeprobe\_evalmodel\-answer records\. Aggregate results over all 529 probes are available separately: GPT\-5\.5 94\.1%, o3 89\.8%, GPT\-4o 61\.6%, GPT\-4o\-mini 48\.2%\.Table 10:HalluWorld\-TerminalMemory/Causal Probes \(hf\-train\-lora\-adapter\):Example showing how terminal traces distinguish syntax repair, API repair, stale metadata, and unsupported claims about training\.ComponentDetailsTask:hf\-train\-lora\-adapterRuntime Terminal Evidence \(step 0008\):Serializer:Terminal trace \+ file snapshotsCmdEventtrain\_lora\.py stateOutput artifact state0002echo script with literal \\nsize 1094, alpha=16none0003pip install peft==0\.15\.0unchangednone0004python /app/train\_lora\.pySyntaxErrornone0005sed converts literal \\nsize 1060none0006python /app/train\_lora\.pyTypeError: unexpected alphanone0008remove alpha and rerunsize 1051, no alpha in configadapter \+ results JSONTrigger Command:python /app/train\_lora\.pyFinal Source Signal:LoraConfig\(\.\.\.\)no longer containsalpha, butresults\[’lora\_config’\]still records\{’r’: 8, ’alpha’: 16, ’epochs’: 3\}\. The script loads train/val JSON only to count samples, then saves the adapter without an optimizer, backward pass, or training loop\.Mode:Runtime probeScope:through\_triggerProbes \(5 total\)Probe 1 \(Perceptual\):“Compare the executableLoraConfig\(\.\.\.\)call with the later metadata\. Which mismatch class applies?”→\\rightarrowmismatch=metadata\_extra\_alphaProbe 2 \(Memory\):“Immediately before command 8 began, how had/app/train\_lora\.pychanged relative to the original file?”→\\rightarrowdelta=newline\_and\_alpha\_fixProbe 3 \(Causal\):“If after command 5 the code had replacedalpha=16withlora\_alpha=16, what would the next run most likely have done?”→\\rightarrowoutcome=would\_succeedProbe 4 \(Uncertainty\):“Do final adapter weights contain gradient\-based training updates from/app/data/train\.json?”→\\rightarrowstatus=contradictedProbe 5 \(Cross\-Category\):“Which created output is known to reflect dataset contents rather than only model/config defaults?”→\\rightarrowdepends\_on\_dataset=results\_onlyEvaluation NoteThis example targets hallucinations that confuse successful artifact creation with real training, or conflate stale metadata with executable configuration\. Per\-probe model responses can be appended fromruns/probe\_evalresult JSONs when those artifacts are restored\.

## Appendix BHalluWorld\-ChessProbe Types

Table 11:AllHalluWorld\-Chessprobes\. Cat\. = probe category \(P = Perceptual, C = Causal, M = Memory\)\. Board = whether the ASCII board is shown; History = whether a move sequence is provided\. All probes scored 0/1; 50 episodes each\.Table 12:HalluWorld\-ChessMemory and Compound Probes:Examples of long\-horizon chess probes\. The M probe tests event tracking over a move sequence; the X probe combines memory, causal transition rules, and valid action generation from SAN\-only input\.
## Appendix CList of allHalluWorld\-GridLevels

Table 13:All 33HalluWorld\-Gridlevels\.Probes:Pr = Presence, Ct = Count, St = State/Attribute, Lo = Location, Ca = Causal, Un = Uncertainty\.Ser\.:Sym = Symbolic, Grd = Grid, Mem = Memory, Mix = Mixed\.∙\\bullet= included in HalluWorld\-Hard subset\.IDNameProbesSer\.Core challengeHardPerceptual \(P\)P1Dense ArrayPr, Ct, StSym, GrdDense regular grid with deliberate violations \(color swap, absence, state mismatch\); tests pattern\-completion failureGrd∙\\bulletP2Corridor GauntletPr, Ct, LoSym, GrdNarrow corridor; objects at varying depths and lateral offsets; tests spatial ordering precisionGrd∙\\bulletP3Rotation ChallengeLoSym, GrdFixed scene, random agent spawn direction; tests egocentric vs\. allocentric frame conversion—P4aHarder ArrayPr, Ct, StSym, GrdDenser, larger P1 variant with more object types; tests compound counting under densityGrd∙\\bulletP4bDelta PerceptionPr, Ct, StSym, GrdTwo observations separated by agent movement; tests FOV\-shift appearance and disappearance detection—P5Object PermanencePr, Ct, StSym, GrdObject moved out of FOV between observations; tests out\-of\-view state tracking—Memory \(M\)M1 \(N=3\)River Field \(early\)Pr, Ct, StMemRiver carries objects east at fixed rate; stale notice board shows t=0 state; 3 steps elapsed—M1 \(N=6\)River Field \(mid\)Pr, Ct, StMemSame as M1 at 6 steps; object positions maximally ambiguous∙\\bulletM1 \(N=9\)River Field \(late\)Pr, Ct, StMemSame as M1 at 9 steps; one object has dried, testing temporal decay tracking∙\\bulletM2Witness StandPr, Ct, StSym, GrdFive sequential chamber observations; tests cross\-observation source attribution and recency interference—M3Incident ReportPr, Ct, StSym, GrdTwo room snapshots with 3 silent changes \(color swap, state change, removal\); tests change enumeration—M4Unreliable NarratorPr, Ct, StMemSignpost makes 3 false claims about the room alongside direct observation; tests sign vs\. observation conflict—Causal \(C\)C1aPersistent ChainPr, CaMemThree\-step chain; all effectspersistent\(one\-shot triggers\); notice board states rule; tests continuous\-bias error∙\\bulletC1a \(nb\)Persistent Chain \(no board\)Pr, CaMemC1a without notice board; model must infer persistence from mechanics alone∙\\bulletC1bContinuous ChainPr, CaMemIdentical layout to C1a; all effectscontinuous\(gates re\-lock on key removal\); tests persistent\-bias error∙\\bulletC1b \(nb\)Continuous Chain \(no board\)Pr, CaMemC1b without notice board; model must infer continuity from mechanics alone∙\\bulletC2Fire CrossingPr, St, CaMemWet ball extinguishes fire barrier; dry decoy ball cannot; tests wet\-condition state tracking under time pressure—C3Flood RoomPr, St, CaMemFloodtiles advance one row per step; agent must reach elevated goal before path is cut off; tests step\-precise forward simulation—C4Forking PathsPr, CaMemThree paths to goal: iron key \(valid\), lit torch on wood door \(valid\), water crossing \(red herring—no boat\); tests solution\-space exploration—C5aAdversarial BoardPr, CaMemC1a layout; notice boardlies\(claims gate is continuous\); annotation is correct; pits adversarial testimony against structural inference∙\\bulletC6Flood–Fire EscapePr, St, CaMemFlood extinguishes fire at step 4 creating a timed action window; lying notice board gives wrong timings; tests cross\-mechanic composition with adversarial testimony∙\\bulletUncertainty \(U\)U1Fog of WarPr, UnSymHub with four never\-observed sealed rooms described on notice board; tests epistemic abstention vs\. confident confabulation—U2 \(high\)Oracle Problem \(80%\)UnMixTwo signposts claim room contents; notice board states 80% signpost accuracy; tests confidence scaling to stated reliability—U2 \(mid\)Oracle Problem \(50%\)UnMixSame setup at 50% stated accuracy—U2 \(low\)Oracle Problem \(20%\)UnMixSame setup at 20% stated accuracy—U4The AmnesiacSt, Ca, UnSymRoom with evidence of prior activity but no stored observations; tests deducible vs\. genuinely unanswerable causal attribution—Cross\-Category Compound \(X\)X1Facility TourPr, Ct, StSym3\-zone facility tour; baseline multi\-room test of cross\-zone intrusion—X2Facility Tour\+Pr, Ct, StSym, Grd5\-zone tour; lying signpost claims incorrect details about a prior zone—X3Facility Tour\+\+Pr, Ct, StSym, Grd7\-zone tour; earlier zones degrade more under recency bias—X4Compound WitnessPr, Ct, StSym5\-zone tour with embedded Witness Stand sub\-sequence mid\-tour—X5Cascading TestimonyPr, Ct, StSym, Grd7\-zone tour; later signposts contradict the agent’s own earlier direct observationsGrd∙\\bulletX6Return VisitPr, Ct, StSym, Grd5\-zone tour \+ revisit; Zone A modified on return; tests world\-model update after backtracking—X7Dragon KeepPr, Ct, St, CaSym, Grd8\-zone RPG tour; unreliable NPCs, detour, backtracking, goal requires integrating information across all zones—
## Appendix DHalluWorld\-GridEnvironmental Mechanics

Beyond standard MiniGrid objects, we implement the following mechanics to support probing of memory and causal hallucinations\.

##### Tiles\.

RiverTile\(direction, speed\)carries objects in a fixed direction at a fixed rate per step\.FireTileis impassable and is extinguished by any object withwet\_turns\_remaining\>\>0\.FloodTile\(rise\_step\)becomes impassable at a specified step count\.PressurePlate\(effect\)supports two effect types:*continuous*\(gate open while plate is weighted\) and*trigger*\(one\-shot, permanent\)\.DarkZonetiles are passable but block the agent’s field of view\.

##### Object states\.

Objects carry aconditionattribute \(dry\|\|wet\|\|soaked\) withwet\_turns\_remainingdecrementing each step off\-river and resetting on re\-entry\. This state is the primary mechanic for C2 and C6\.

##### Adversarial testimony objects\.

NoticeBoardObject\(text\)renders stale written information that is always visible in the serializer\. It is accurate in M1, inaccurate in M4, and actively wrong in C5a and C6, making it the primary source of adversarial testimony across tiers\.SignpostObject\(text, accurate\)supports controlled false information with stated accuracy rates, used in U2\.

## Appendix EHalluWorld\-Terminal: Further Details

HalluWorld\-Terminalis constructed from a collection of over 100 Terminal\-Bench\[[46](https://arxiv.org/html/2605.19341#bib.bib37)\]tasks\. We intentionally chose an older and smaller agent, GPT\-4o\-mini, to attempt these tasks and generate trajectories\. This choice reflects the hypothesis that weaker agents tend to be less methodical and less thorough, producing longer, noisier, and more failure\-prone interaction traces\. Such trajectories create richer settings in which hallucination\-relevant evidence may be hidden, stale, partially contradicted, or distributed across many terminal steps, allowing us to test whether later and stronger models can still recover the correct answer from the available evidence\.

To construct the benchmark, we developed a runtime probe injector that observes the agent’s terminal session during task execution\. The injector records each command, working directory, terminal output, pre\- and post\-command pane state, and relevant file snapshots\. From these observations, it constructs evidence contexts and generates targeted probe questions whose answers are grounded in the recorded trajectory\. Crucially, the probes are generated from the runtime trace rather than shown to the original task\-solving agent, so they do not alter the agent’s behavior or contaminate the trajectory\.

The resulting probes are designed to test several hallucination failure modes in agentic settings, including stale memory, perceptual mistakes, causal over\-attribution, uncertainty overclaiming, and cross\-category reasoning failures that require connecting commands, files, outputs, and task state\. Each probe has a structured answer schema and an associated golden answer, making evaluation mostly automatic while keeping the questions narrow and auditable\. After extraction, probes are filtered and curated for answerability and difficulty, yielding a compact benchmark subset of high\-quality examples\.

Table 14:Thinking depth ablation onHalluWorld\-Terminal\(probe accuracy, higher is better\)\. Probe categories: Ca = Causal, X = Cross\-category, M = Memory, P = Perceptual, U = Uncertainty\.Mediumrows are the main leaderboard entries\.![Refer to caption](https://arxiv.org/html/2605.19341v1/images/TerminalEnvPlots/depth_vs_hallucination_smooth_by_type_trajectory_cap500.png)\(a\)Hallucination Rate vs\. Trajectory Depth pooled across all models
![Refer to caption](https://arxiv.org/html/2605.19341v1/images/TerminalEnvPlots/depth_vs_hallucination_smooth_by_type_per_model_trajectory_cap500.png)\(b\)Hallucination Rate vs\. Trajectory Depth per model

Figure 2:HalluWorld\-TerminalHallucination Rate vs\. Trajectory Depth\. Trajectory depth is defined as the index of the trigger command, i\.e\., the number of shell commands executed before the probe injection\. Hallucination rates are smoothed using a Gaussian kernel in log\-depth space\. Context is truncated to 60k characters using middle truncation \(retaining the start and end of trajectories\)\. Uncertainty probes \(red\) show the strongest depth sensitivity, with increasing rates at larger depths, while most other probe types remain stable or increase mildly\.
## Appendix FHalluWorld\-TerminalThinking Depth Ablation

Table[14](https://arxiv.org/html/2605.19341#A5.T14)reports probe accuracy by category across thinking/reasoning depth for Sonnet 4\.6 and GPT\-5\.5, while Figure[2](https://arxiv.org/html/2605.19341#A5.F2)visualizes the hallucination rate by trajectory depth\. Themediumrows match the main leaderboard entries exactly\.†\\daggerTwo probes permanently excluded from GPT\-5\.5 high/xhigh runs due to content policy rejection onsql\-injection\-attackprobes at high reasoning effort \(N=527N\{=\}527\)\.

## Appendix GHalluWorld\-GridReasoning Ablation

From Table[15](https://arxiv.org/html/2605.19341#A7.T15), it appears that more reasoning does not necessarily correlate with lower hallucination\. Interestingly, it mainly appears to increase hallucination for Causal while having negligible improvement on the other probe categories, aligning with the analysis in §[5\.3](https://arxiv.org/html/2605.19341#S5.SS3)\.

Table 15:Effect of reasoning budget on hallucination rate by category forHalluWorld\-Grid\.Categories:P= Perceptual,M= Memory,C= Causal,U= Uncertainty,X= Cross\-category\. For GPT\-5\.5, “None” indicates reasoning disabled; the default evaluation uses medium budget\. For Claude Sonnet 4\.6, the non\-thinking model serves as the None baseline\.
## Appendix HHalluWorld\-GridHard Subset Results

Table 16:HalluWorld\-GridHard subset results\.The 12 hardest \(level, serializer\) pairs where≥\\geq5 models achieve≥\\geq20% hallucination rate \(see §[C](https://arxiv.org/html/2605.19341#A3)\)\. Models sorted by mean hard\-subset hallucination rate \(ascending\)\. Cell color reflects relative hallucination rate \(green = low, red = high\)\.Bold= column minimum\.Columns:C1a nb, C1a, C1b, C1b nb, C5a, C6, M16, M19, P1, P2, P4, X5\.
## Appendix IIn\-Navigation Experiments forHalluWorld\-Grid

In this section, we evaluate models in anIn\-Navigation\(InNav\) setting where the model actively navigates the environment while being probed about world state\. Technically,InNavmode uses chat completions with conversational turn\-based history in the standard messages format: each navigation step \(action selection, observation receipt\) is appended to the conversation history, and probes are asked as follow\-up messages within this ongoing dialogue\. This mirrors how agents would be deployed in interactive settings where world related queries arise during task execution, e\.g\., with a user in the loop conversing with the model to understand what’s going on\.

To enable controlled comparison without conflating navigation ability with reasoning ability, we use a single high\-performance navigation model \(GPT\-5\.4\-mini withreasoning\_effort=high\) to generate trajectories forallprobe models\.

### I\.1Epistemic Grounding vs\. Cognitive Load: Does Navigation Help or Hurt Hallucination?

Navigation introduces two competing effects:epistemic grounding\(action\-perception feedback may anchor the agent’s internal world model better\) versuscognitive load\(navigation demands may degrade state tracking\)\. To isolate navigation’s net impact, we compare two conditions using identical trajectories:In\-Navigation\(InNav\), where models answer probes while navigating, versusControlled Static\(CtrlStatic\), where models observe the same states without navigation context\. This paired design controls for trajectory information, isolating whether navigation helps or hurts hallucination rates\. Results are in Table[17](https://arxiv.org/html/2605.19341#A9.T17)\.

Table 17:Navigation impact on hallucination rates \(overall and by category\) forHalluWorld\-Grid\.Navigation Effect \(NavEff\) =InNav−CtrlStatic\\text\{\{InNav\}\}\-\\text\{\{CtrlStatic\}\}hallucination %\. Left: Overall results across all 31 worlds\. Right: Category\-specific NavEff results \(P=Perception, C=Causal, M=Memory, U=Uncertainty, X=X\-level\)\. Models sorted by overall NavEff\. \*\* means statistically significant \(95% CI excludes zero\)\. Cell color reflects relative hallucination rate or effect \(green=low/improvement, red=high/degradation\);boldmarks the column minimum\.ModelInNavCtrlStNavEffgpt\-5\.5\\cellcolorpastelgreen\!92\!pastelred27\.4\\cellcolorpastelgreen\!48\!pastelred40\.7\\cellcolorpastelgreen\!83\!pastelred−13\.2\-13\.2\*\*GLM\-5\\cellcolorpastelgreen\!60\!pastelred37\.0\\cellcolorpastelgreen\!45\!pastelred41\.5\\cellcolorpastelgreen\!45\!pastelred−4\.4\-4\.4\*\*o3\-mini\\cellcolorpastelgreen\!63\!pastelred36\.1\\cellcolorpastelgreen\!48\!pastelred40\.5\\cellcolorpastelgreen\!45\!pastelred−4\.4\-4\.4\*\*o4\-mini\\cellcolorpastelgreen\!58\!pastelred37\.6\\cellcolorpastelgreen\!46\!pastelred41\.3\\cellcolorpastelgreen\!43\!pastelred−3\.8\-3\.8\*\*o3\\cellcolorpastelgreen\!57\!pastelred37\.8\\cellcolorpastelgreen\!47\!pastelred41\.0\\cellcolorpastelgreen\!40\!pastelred−3\.2\-3\.2\*\*DeepSeek\-V3\\cellcolorpastelgreen\!40\!pastelred43\.0\\cellcolorpastelgreen\!32\!pastelred45\.4\\cellcolorpastelgreen\!37\!pastelred−2\.4\-2\.4gpt\-4o\\cellcolorpastelgreen\!54\!pastelred38\.9\\cellcolorpastelgreen\!49\!pastelred40\.2\\cellcolorpastelgreen\!32\!pastelred−1\.3\-1\.3gpt\-4o\-mini\\cellcolorpastelgreen\!52\!pastelred39\.5\\cellcolorpastelgreen\!51\!pastelred39\.8\\cellcolorpastelgreen\!27\!pastelred−0\.3\-0\.3Opus 4\.6\\cellcolorpastelgreen\!61\!pastelred36\.8\\cellcolorpastelgreen\!64\!pastelred35\.7\\cellcolorpastelgreen\!21\!pastelred\+1\.1\+1\.1Sonnet 4\.6\\cellcolorpastelgreen\!59\!pastelred37\.3\\cellcolorpastelgreen\!63\!pastelred36\.1\\cellcolorpastelgreen\!21\!pastelred\+1\.2\+1\.2Kimi\-K2\.6\\cellcolorpastelgreen\!60\!pastelred37\.0\\cellcolorpastelgreen\!64\!pastelred35\.8\\cellcolorpastelgreen\!20\!pastelred\+1\.3\+1\.3gpt\-5\.4\-mini\\cellcolorpastelgreen\!50\!pastelred40\.1\\cellcolorpastelgreen\!59\!pastelred37\.2\\cellcolorpastelgreen\!13\!pastelred\+2\.9\+2\.9
ModelPCMUXgpt\-5\.5\\cellcolorpastelgreen\!17\!pastelred\+2\.0\+2\.0\\cellcolorpastelgreen\!95\!pastelred−15\.8\-15\.8\\cellcolorpastelgreen\!68\!pastelred−9\.6\-9\.6\\cellcolorpastelgreen\!100\!pastelred−16\.9\-16\.9\\cellcolorpastelgreen\!84\!pastelred−13\.3\-13\.3GLM\-5\\cellcolorpastelgreen\!27\!pastelred−0\.3\-0\.3\\cellcolorpastelgreen\!45\!pastelred−4\.3\-4\.3\\cellcolorpastelgreen\!17\!pastelred\+2\.0\+2\.0\\cellcolorpastelgreen\!47\!pastelred−4\.9\-4\.9\\cellcolorpastelgreen\!81\!pastelred−12\.6\-12\.6o3\-mini\\cellcolorpastelgreen\!28\!pastelred−0\.6\-0\.6\\cellcolorpastelgreen\!52\!pastelred−6\.0\-6\.0\\cellcolorpastelgreen\!40\!pastelred−3\.3\-3\.3\\cellcolorpastelgreen\!67\!pastelred−9\.3\-9\.3\\cellcolorpastelgreen\!40\!pastelred−3\.1\-3\.1o4\-mini\\cellcolorpastelgreen\!35\!pastelred−2\.1\-2\.1\\cellcolorpastelgreen\!39\!pastelred−2\.9\-2\.9\\cellcolorpastelgreen\!42\!pastelred−3\.6\-3\.6\\cellcolorpastelgreen\!55\!pastelred−6\.6\-6\.6\\cellcolorpastelgreen\!45\!pastelred−4\.3\-4\.3o3\\cellcolorpastelgreen\!27\!pastelred−0\.3\-0\.3\\cellcolorpastelgreen\!31\!pastelred−1\.2\-1\.2\\cellcolorpastelgreen\!40\!pastelred−3\.2\-3\.2\\cellcolorpastelgreen\!60\!pastelred−7\.7\-7\.7\\cellcolorpastelgreen\!45\!pastelred−4\.3\-4\.3DeepSeek\-V3\\cellcolorpastelgreen\!23\!pastelred\+0\.7\+0\.7\\cellcolorpastelgreen\!64\!pastelred−8\.7\-8\.7\\cellcolorpastelgreen\!0\!pastelred\+10\.2\+10\.2\\cellcolorpastelgreen\!61\!pastelred−8\.1\-8\.1\\cellcolorpastelgreen\!27\!pastelred−0\.1\-0\.1gpt\-4o\\cellcolorpastelgreen\!34\!pastelred−1\.8\-1\.8\\cellcolorpastelgreen\!51\!pastelred−5\.8\-5\.8\\cellcolorpastelgreen\!22\!pastelred\+0\.9\+0\.9\\cellcolorpastelgreen\!35\!pastelred−2\.1\-2\.1\\cellcolorpastelgreen\!12\!pastelred\+3\.3\+3\.3gpt\-4o\-mini\\cellcolorpastelgreen\!9\!pastelred\+3\.9\+3\.9\\cellcolorpastelgreen\!48\!pastelred−5\.1\-5\.1\\cellcolorpastelgreen\!0\!pastelred\+6\.1\+6\.1\\cellcolorpastelgreen\!47\!pastelred−4\.9\-4\.9\\cellcolorpastelgreen\!23\!pastelred\+0\.9\+0\.9Opus 4\.6\\cellcolorpastelgreen\!2\!pastelred\+5\.6\+5\.6\\cellcolorpastelgreen\!21\!pastelred\+1\.1\+1\.1\\cellcolorpastelgreen\!18\!pastelred\+1\.9\+1\.9\\cellcolorpastelgreen\!53\!pastelred−6\.3\-6\.3\\cellcolorpastelgreen\!17\!pastelred\+2\.2\+2\.2Sonnet 4\.6\\cellcolorpastelgreen\!30\!pastelred−0\.8\-0\.8\\cellcolorpastelgreen\!19\!pastelred\+1\.7\+1\.7\\cellcolorpastelgreen\!9\!pastelred\+4\.1\+4\.1\\cellcolorpastelgreen\!28\!pastelred\+0\.6\+0\.6\\cellcolorpastelgreen\!22\!pastelred\+1\.1\+1\.1Kimi\-K2\.6\\cellcolorpastelgreen\!13\!pastelred\+3\.1\+3\.1\\cellcolorpastelgreen\!30\!pastelred−0\.9\-0\.9\\cellcolorpastelgreen\!14\!pastelred\+2\.7\+2\.7\\cellcolorpastelgreen\!42\!pastelred−2\.6\-2\.6\\cellcolorpastelgreen\!8\!pastelred\+4\.3\+4\.3gpt\-5\.4\-mini\\cellcolorpastelgreen\!2\!pastelred\+5\.5\+5\.5\\cellcolorpastelgreen\!28\!pastelred−0\.5\-0\.5\\cellcolorpastelgreen\!4\!pastelred\+5\.3\+5\.3\\cellcolorpastelgreen\!17\!pastelred\+3\.2\+3\.2\\cellcolorpastelgreen\!15\!pastelred\+2\.7\+2\.7

Epistemic grounding dominates overall \(left table\): 5 of 12 models show statistically significant hallucinationreductionsfrom navigation \(95% CI excludes zero\), while zero models show significant increases\. The strongest benefits appear in reasoning\-focused models \(gpt\-5\.5:−13\.2%\-13\.2\\%, GLM\-5:−4\.4%\-4\.4\\%, o3\-mini:−4\.4%\-4\.4\\%\), showing reasoning models steadily gain from multiple action\-feedback cycles in the world\.

Category\-specific patterns reveal nuanced effects \(right table\):Causalworlds show universal grounding benefits \(nearly all models negative or near\-zero\), with gpt\-5\.5 achieving−15\.8%\-15\.8\\%\.Uncertaintyworlds have largest variation: gpt\-5\.5 achieves the strongest benefit overall \(−16\.9%\-16\.9\\%\), while most reasoning models benefit substantially \(−4\.9\-4\.9to−9\.3%\-9\.3\\%\), but weaker models show mixed patterns \(gpt\-5\.4\-mini:\+3\.2%\+3\.2\\%, Sonnet 4\.6:\+0\.6%\+0\.6\\%\)\.PerceptionandMemoryworlds exhibit model\-dependent effects: frontier models show diverse patterns \(gpt\-5\.5 Perception:\+2\.0%\+2\.0\\%but Memory:−9\.6%\-9\.6\\%\), while smaller models experience cognitive load on Memory tasks \(gpt\-4o\-mini:\+6\.1%\+6\.1\\%, gpt\-5\.4\-mini:\+5\.3%\+5\.3\\%\)\.X\-levelmulti\-zone worlds show the largest model variance: GLM\-5 achieves−12\.6%\-12\.6\\%\(strong grounding\), while Kimi\-K2\.6 shows\+4\.3%\+4\.3\\%\(cognitive load dominates\)\. These patterns suggest navigation’s net impact depends on both task complexity and model reasoning capacity\. Detailed category analysis appears in §[I\.5](https://arxiv.org/html/2605.19341#A9.SS5)\. Overall,InNavprovides a rich setting to test if models can jointly handle both goal\-seeking \(navigation\) and internal world model updating, and if they can use reasoning to synergize both\.

### I\.2Trajectory Depth Analysis: Intrinsic Difficulty vs\. Navigation Impact

To understand whether hallucinations increase with trajectory depth, and whether this pattern differs between In\-Navigation and Controlled Static modes, we analyze hallucination rates as a function ofrelative trajectory position\. Each trajectory is divided into five quintiles: 1/5 \(first 20% of steps\), 2/5 \(20–40%\), 3/5 \(40–60%\), 4/5 \(60–80%\), and 5/5 \(final 80–100%\)\. We compute linear slopes \(percentage point increase in hallucination per quintile\) for both modes\.

Key finding:All 13 models show positive slopes inbothIn\-Navigation and Controlled Static modes \(Figure[3](https://arxiv.org/html/2605.19341#A9.F3)\), indicating that states become intrinsically harder to model as trajectories progress regardless of whether the model is navigating or passively observing\. The median slope is\+5\.1%\+5\.1\\%per quintile in Controlled Static versus\+6\.6%\+6\.6\\%in In\-Navigation, corresponding to approximately 20–26% higher hallucination rates at trajectory endpoints compared to starting positions\.

However, thedifferentialslopes reveal model\-specific navigation strategies \(Figure[3](https://arxiv.org/html/2605.19341#A9.F3)\):

- •6 models show cognitive load accumulation\(red bars\):InNavslopes exceedCtrlStaticby \+1\.4% to \+5\.9%/quintile \(e\.g\., o3\-mini:\+11\.2%\+11\.2\\%vs\.\+5\.3%\+5\.3\\%, Opus 4\.6:\+9\.3%\+9\.3\\%vs\.\+3\.6%\+3\.6\\%\), suggesting navigation amplifies depth effects\.
- •7 models show grounding mitigation\(green bars\):InNavslopes are lower thanCtrlStaticby 0\.7% to 4\.4%/quintile \(e\.g\., GPT\-5\.4\-mini:\+0\.7%\+0\.7\\%vs\.\+5\.1%\+5\.1\\%, GLM\-5:\+1\.7%\+1\.7\\%vs\.\+4\.8%\+4\.8\\%\), indicating navigation mitigates depth effects through epistemic grounding\.

i\) Implications for Deployment:Models cluster into two strategies for long\-horizon tasks\. Grounding\-mitigation models \(green bars\) actively leverage navigation feedback to counteract complexity accumulation, making them more suitable for extended autonomous operation\. Cognitive\-load\-accumulation models \(red bars\) show compounding errors over time despite strong static performance, requiring more frequent human intervention or checkpointing in embodied settings\.

ii\) Implications for Evaluation:Static benchmarks systematically underestimate depth\-related failures in navigation\-capable models\. The universal positive slopes \(\+5\.1\+5\.1to\+6\.6%\+6\.6\\%per quintile\) mean hallucination rates nearly double from trajectory start to end, yet this effect manifests differently under navigation, some models adapt through epistemic grounding, others deteriorate further through cognitive load\. Future embodied AI evaluations must measure performance across trajectory depth to capture these dynamics\.

![Refer to caption](https://arxiv.org/html/2605.19341v1/images/EgocentricGraphsAndPlots/CognitiveLoad/trajectory_depth_ego_vs_static.png)\(a\)Slope difference \(InNav−CtrlStatic\\text\{\{InNav\}\}\-\\text\{\{CtrlStatic\}\}\) forHalluWorld\-Grid\. Red: cognitive load amplifies depth effect\. Green: grounding mitigates depth effect\. Labels show: Diff \(InNav/CtrlStatic\) in %/quintile\. Magnitude indicates strength of navigation’s impact on depth accumulation\.
![Refer to caption](https://arxiv.org/html/2605.19341v1/images/EgocentricGraphsAndPlots/CognitiveLoad/final_depth_progression_compact.png)\(b\)Depth progression forHalluWorld\-Grid\(select models shown for clarity\)\.InNavhallucinations increase with depth, illustrating how world understanding is harder as models explore deeper\.

Figure 3:Navigation impact on trajectory depth effects\.\(a\) All 13 models show differential slopes betweenInNavandCtrlStaticmodes, revealing model\-specific balance between epistemic grounding and cognitive load\. Positive difference \(red\) indicates cognitive load dominates; negative \(green\) indicates grounding dominates\. Frontier reasoning models paradoxically show MORE load accumulation, while efficiency\-optimized models show grounding mitigation\. \(b\) Representative models illustrate depth progression patterns: all show increasing hallucination rates over trajectory quintiles, demonstrating that states become intrinsically harder to model as trajectories progress\.
### I\.3Trajectory Depth Analysis by World Category

We observe that the effect of depth onInNavhallucination varies substantially by category \(Figure[4](https://arxiv.org/html/2605.19341#A9.F4); we defer detailed analysis to §[I\.6](https://arxiv.org/html/2605.19341#A9.SS6)\)\. Perception \(\-1\.0%/quintile\) and Causal worlds \(\-1\.8%/quintile\) showdecreasinghallucination with trajectory depth, indicatingepistemic grounding dominates\. Memory and Uncertainty worlds show minimal depth effects \(\+0\.2%/quintile\), with grounding and cognitive load balanced\. X\-level multi\-zone worlds exhibit massive accumulation \(\+6\.5%/quintile\), wherecognitive load dominates: long trajectories through many zones overwhelm the LM’s working memory\.

### I\.4In\-Navigation Serialization Sensitivity Analysis

To check if the better\-performing serialization persists in theInNavcase, we compare canonical serializers listed in TableLABEL:tab:level\-referencewith alternative ones\. Two categories showed striking inversions:Perception\(grid canonical\) favored symbolic by 14\.3% on average \(5/6 worlds\), whileUncertainty\(memory canonical\) favored grid by 12\.8% on average \(4/5 worlds\), showing some serializers support better epistemic grounding even though suboptimal for answering probes in isolation\. We defer full detail to §[I\.7](https://arxiv.org/html/2605.19341#A9.SS7)\.

### I\.5Navigation Effect by World Category

Table[18](https://arxiv.org/html/2605.19341#A9.T18)breaks down the navigation effect by world category\. Navigation impact varies substantially by category: Uncertainty worlds show the strongest overall benefit \(median−6\.0%\-6\.0\\%\), suggesting navigation helps models resolve partial observability\. X\-Levels show large model\-dependent effects, with GLM\-5 \(−12\.6%\-12\.6\\%\) and GPT\-5\.5 \(−11\.2%\-11\.2\\%\) benefiting substantially while weaker models show slight costs\. Memory worlds exhibit mixed effects \(DeepSeek\-V3:\+10\.2%\+10\.2\\%vs GPT\-5\.5:−9\.6%\-9\.6\\%\), suggesting complex architecture\-dependent interactions between memory demands and navigation\. Perception worlds show minimal navigation effect \(median of−0\.3%\-0\.3\\%\), indicating visual\-spatial reasoning operates largely independently of navigation demands\.

Table 18:HalluWorld\-GridNavigation effect by world category\.Values show In\-Navigation hallucination rate minus Controlled Static rate\. Negative values indicate navigation helps \(reduces hallucinations\)\. Categories: P = Perception, C = Causal, M = Memory, U = Uncertainty, X = Cross\-category\.
### I\.6More Details: Trajectory Depth Analysis by World Category

To investigate whether the balance between epistemic grounding and cognitive load varies across reasoning domains, we decompose our trajectory depth analysis by world category\. Figure[4](https://arxiv.org/html/2605.19341#A9.F4)presentsInNavhallucination rates versus trajectory depth \(quintiles 1/5 through 5/5\) separately for Perception, Causal, Memory, Uncertainty, and X\-level worlds\. Each category exhibits distinct depth dynamics, revealing that navigation’s net impact depends critically on the nature of the reasoning task\.

#### I\.6\.1Perception Worlds \(\-1\.0% per quintile\)

Perception\-focused worlds \(P1\-P5; 5,066 probes across 6 worlds\) test spatial reasoning, object tracking, and visual memory\. Counter to the overall trend, these worlds showdecreasing hallucinationas trajectories deepen \(average slope: \-1\.0% per quintile\)\. This reversed pattern indicates thatepistemic grounding dominates: additional observations help models resolve perceptual ambiguities by providing disambiguating evidence about spatial relationships and object properties, reducing uncertainty over time\. The effect is particularly pronounced in P4\_harder\_array and P1\_dense\_array, where systematic exploration helps models build accurate spatial representations incrementally\. Cognitive load from navigation is outweighed by the grounding benefit of multiple viewpoints\.

#### I\.6\.2Causal Worlds \(\-1\.8% per quintile\)

Causal reasoning worlds \(C1a\-C6, C1b\_continuous\_chain; 8,964 probes across 8 worlds\) test understanding of cause\-effect relationships, state transitions, and action consequences\. These worlds exhibit the strongestepistemic grounding effect\(average slope: \-1\.8% per quintile\), with models showing consistent improvement as trajectories progress\. This pattern indicates that navigation provides critical disambiguating evidence for causal inference: observing the results of actions \(e\.g\., “Does pushing this boulder open the path?”, “Is the fire barrier passable after the lever is pulled?”\) allows models to refine their causal models over time\. The effect is most pronounced in C2\_fire\_crossing and C3\_flood\_room, where state changes must be observed to answer probes correctly\. Grounding from action\-perception feedback loops substantially outweighs cognitive load\.

#### I\.6\.3Memory Worlds \(\+0\.2% per quintile\)

Memory\-focused worlds \(M1–M4; 6,900 probes across 4 worlds\) test retention of past observations, event sequencing, and testimony integration\. These worlds show minimal depth effect \(average slope: \+0\.2% per quintile\), nearly flat across trajectory positions\. This suggestsepistemic grounding and cognitive load roughly balance: while models must track more events as navigation proceeds \(increasing cognitive load\), additional observations provide compensating context \(epistemic grounding\)\. The incremental cost is small\. M2\_witness\_stand and M3\_incident\_report show the weakest effects, likely because structured testimony provides explicit memory cues that reduce working memory load while simultaneously grounding responses\.

#### I\.6\.4Uncertainty Worlds \(\+0\.2% per quintile\)

Uncertainty\-focused worlds \(U1, U2\_oracle\_\*, U4; 4,898 probes across 5 worlds\) test reasoning under partial observability, probabilistic inference, and information\-seeking behavior\. Like Memory worlds, these show minimal depth effect \(\+0\.2% per quintile\)\. The flat trend suggests models handle increasing trajectory length well: fog\-of\-war restrictions \(U1\) and oracle limitations \(U2\) impose constant rather than accumulating cognitive load, while additional observations provide grounding for probabilistic reasoning\. However, U4\_amnesiac shows slightly stronger accumulation \(\+0\.8%\), indicating that memory erasure creates compounding difficulties that overwhelm grounding benefits\.

#### I\.6\.5X\-Level Multi\-Zone Worlds \(\+6\.5% per quintile\)

X\-level compound worlds \(X1–X7; 6,860 probes across 7 worlds\) feature multi\-room navigation through 3–8 connected zones with complex state tracking requirements\. These worlds showmassive cognitive load accumulation\(average slope: \+6\.5% per quintile\), more than 5× the overall average\. Long trajectories through multiple zones overwhelm working memory, causing dramatic hallucination increases at later trajectory positions\.Cognitive load dominates epistemic groundingin these complex multi\-zone environments\. The effect is strongest in X3\_facility\_7zone \(\+8\.2% per quintile\) and X7\_dragon\_keep \(\+7\.8%\), where traversing many rooms creates compounding memory demands\. X1\_facility\_3zone shows the weakest accumulation \(\+4\.1%\), confirming that zone count drives cognitive load\.

#### I\.6\.6Summary and Implications

These category\-specific patterns reveal that depth effects reflect thebalance between epistemic grounding and cognitive load, which varies by category\. We identify three distinct regimes:

1. 1\.Grounding\-dominated regimes\(Perception, Causal\): Negative slopes \(\-1\.0% to \-1\.8% per quintile\) indicate that additional observations provide more benefit \(disambiguation of spatial/causal relationships\) than cost \(working memory load\)\. Navigation activelyhelpsstate tracking in these domains\.
2. 2\.Balanced regimes\(Memory, Uncertainty\): Near\-zero slopes \(\+0\.2% per quintile\) indicate that grounding benefits and cognitive load costs roughly cancel\. Models handle increasing trajectory length without substantial performance degradation, suggesting efficient resource allocation for these task types\.
3. 3\.Load\-dominated regimes\(X\-Levels\): Large positive slopes \(\+6\.5% per quintile\) indicate that working memory costs from multi\-zone navigation overwhelm any grounding benefits\. The super\-linear growth \(5× overall average\) suggests compounding interference rather than linear scaling\.

This category\-dependence has practical implications for agent deployment: models should be deployed with navigation in perceptual/causal tasks \(where it helps\), can navigate freely in memory/uncertainty tasks \(neutral impact\), but should minimize navigation steps in multi\-zone environments \(where it hurts\)\. The patterns also suggest distinct underlying mechanisms: perceptual and causal reasoning benefit from action\-perception feedback loops \(grounding effect\), while spatial navigation across many rooms imposes compounding working memory costs \(cognitive load effect\)\. Future work should investigate whether these patterns reflect fundamental differences in how models allocate cognitive resources across task types, and whether architectural interventions \(e\.g\., external memory, spatial memory modules\) can mitigate load\-dominated regimes\.

![Refer to caption](https://arxiv.org/html/2605.19341v1/images/EgocentricGraphsAndPlots/DepthOfNavigation/egocentric_by_depth_perception_v2.png)\(a\)Perception worlds \(\-1\.0%/quintile\)
![Refer to caption](https://arxiv.org/html/2605.19341v1/images/EgocentricGraphsAndPlots/DepthOfNavigation/egocentric_by_depth_causal_v2.png)\(b\)Causal worlds \(\-1\.8%/quintile\)
![Refer to caption](https://arxiv.org/html/2605.19341v1/images/EgocentricGraphsAndPlots/DepthOfNavigation/egocentric_by_depth_memory_v2.png)\(c\)Memory worlds \(\+0\.2%/quintile\)
![Refer to caption](https://arxiv.org/html/2605.19341v1/images/EgocentricGraphsAndPlots/DepthOfNavigation/egocentric_by_depth_uncertainty_v2.png)\(d\)Uncertainty worlds \(\+0\.2%/quintile\)
![Refer to caption](https://arxiv.org/html/2605.19341v1/images/EgocentricGraphsAndPlots/DepthOfNavigation/egocentric_by_depth_xlevels_v2.png)\(e\)X\-level multi\-zone worlds \(\+6\.5%/quintile\)

Figure 4:Trajectory depth effects by world category forHalluWorld\-Grid\.Depth dynamics vary considerably across domains, reflecting the balance between epistemic grounding and cognitive load\. Perception and Causal worlds show reversed effects \(grounding dominates: navigation helps\), Memory and Uncertainty worlds show minimal depth effects \(balanced\), while X\-level multi\-zone facilities show large cognitive load accumulation \(\+6\.5% per quintile: cognitive load dominates\)\. This reveals that navigation’s impact is category\-dependent\.

### I\.7More Details: Serialization Sensitivity Analysis

To validate our canonical serialization choices, we systematically compared alternative serializations on worlds where multiple representations were available duringInNavevaluations\. This analysis addresses a critical methodological question: are canonical serializers universally optimal during navigation, or do certain world types benefit from alternative representations?

#### I\.7\.1Methodology

Coverage:We identified 30 worlds tested with multiple serializers:

- •18 worlds with grid and memory \(Causal, Memory, Uncertainty categories\)
- •11 worlds with grid and symbolic \(Perception and X\-level categories\)
- •1 world with memory and symbolic \(M2\_witness\_stand\)

Models:For each world\-serializer pair, we included only models tested on both serializers, yielding 229 model\-level comparisons across 30 worlds\.

Aggregation:To ensure fair comparison despite different navigation trajectories per serializer, we use hierarchical aggregation:

1. 1\.Probe→\\rightarrowTrace:Group probes by episode seed, calculate hallucination rate per trace
2. 2\.Trace→\\rightarrowWorld:Average across traces \(equal weight per episode\)
3. 3\.World→\\rightarrowSerializer:Compare world\-level hallucination rates

This prevents bias from trajectory length differences \(grid navigation may take more/fewer steps than memory navigation\)\.

Table 19:Serialization comparison by world category forHalluWorld\-Grid\.Canonical serializers win only half of In\-Navigation comparisons, with systematic inversions in Perception and Uncertainty categories\. “Best Non\-Canonical” shows the serializer that outperformed the canonical choice, with average margin in percentage points\.![Refer to caption](https://arxiv.org/html/2605.19341v1/images/EgocentricGraphsAndPlots/Serialization/serialization_innav_only.png)Figure 5:Serialization comparison in In\-Navigation mode forHalluWorld\-Grid\.Each bar shows the hallucination difference between the winning and losing serializer, sorted by margin\. Green indicates canonical serializer wins \(15/30 worlds\); red indicates non\-canonical wins \(15/30 worlds\)\. Notable inversions include Perception worlds \(symbolic dominates despite grid being canonical: P3\_rotation\_challenge 21\.5%, P2\_corridor\_gauntlet 17\.2%\) and Uncertainty worlds \(grid dominates despite memory being canonical: U2\_oracle\_mid 19\.0%, U2\_oracle\_high 16\.1%\)\. X\-level worlds validate the canonical symbolic choice \(6/7 wins\), with X7\_dragon\_keep showing the largest margin \(39\.0%\)\.
#### I\.7\.2Overall Results

Table[19](https://arxiv.org/html/2605.19341#A9.T19)summarizes results by world category\. Canonical serializers won only 50\.0% ofInNavcomparisons \(15 of 30 worlds\), revealing that no single serialization is universally optimal during navigation\. The pattern varies substantially by category, with systematic inversions in Perception and Uncertainty worlds\.

The inversions are particularly striking: in Perception worlds where grid was chosen as canonical based on prior evaluation, symbolic representations reduced hallucination by an average of 14\.3 percentage points across models, winning 5 of 6 worlds\. Similarly, Uncertainty worlds showed a 12\.8 percentage point advantage for grid over the canonical memory choice, winning 4 of 5 worlds\. These margins are substantial \- comparable to or exceeding the navigation effect itself \(Section[I\.5](https://arxiv.org/html/2605.19341#A9.SS5)\) \- indicating that serialization choice can be as impactful as the evaluation mode during navigation\-based reasoning\.

#### I\.7\.3Category\-Specific Analysis

##### Perception Worlds: Symbolic Dominance

Perception worlds \(P1–P5\) test spatial reasoning, object tracking, and visual memory\. Despite grid being canonical, symbolic representations achieved lower hallucination in 5 of 6 worlds during navigation:

- •P3\_rotation\_challenge:Symbolic wins by 21\.5%
- •P2\_corridor\_gauntlet:Symbolic wins by 17\.2%
- •P5\_object\_permanence:Symbolic wins by 6\.6%
- •P4\_harder\_array:Symbolic wins by 5\.5%
- •P1\_dense\_array:Symbolic wins by 1\.3%

Only P4b\_delta\_perception favored the canonical grid \(6\.2% margin\)\. This systematic preference suggests that during navigation, language\-based spatial descriptions \(“Ahead: red ball\. Left: blue box\. Right: wall\.”\) provide more effective grounding than raw ASCII grid layouts for perceptual reasoning\. The magnitude of this effect is substantial: the 21\.5% margin on P3\_rotation\_challenge exceeds most model\-to\-model differences, indicating serialization choice can matter more than model selection for certain tasks\.

##### Uncertainty Worlds: Grid Dominance Inspite of Memory Omniscience

Uncertainty worlds \(U1, U2\_oracle\_\*, U4\) test reasoning under partial observability with fog\-of\-war, oracle access limits, and memory erasure\. Despite memory being canonical, grid representations achieved lower hallucination in 4 of 5 worlds:

- •U2\_oracle\_mid:Grid wins by 19\.0%
- •U2\_oracle\_high:Grid wins by 16\.1%
- •U2\_oracle\_low:Grid wins by 10\.7%
- •U1\_fog\_of\_war:Grid wins by 5\.5%

Only U4\_amnesiac favored memory \(3\.7% margin\)\. This pattern suggests that explicit spatial layouts help models reason about unobserved regions more effectively than textual memory descriptions during navigation\. Grids may provide clearer spatial invariants \(“I know what’s at \(5,7\) even if I can’t see it now”\) compared to memory’s sequential event descriptions\.

##### X\-Level Worlds: Symbolic Validated

X\-level multi\-zone worlds \(X1–X7\) test navigation across 3–8 connected rooms with witness testimony and complex spatial reasoning\. Symbolic representations \(canonical choice\) achieved lower hallucination in 6 of 7 worlds:

- •X7\_dragon\_keep:Symbolic wins by 39\.0%
- •X1\_facility\_3zone:Symbolic wins by 15\.8%
- •X2\_facility\_5zone:Symbolic wins by 14\.7%
- •X5\_facility\_tour:Symbolic wins by 11\.1%
- •X4\_compound\_witness:Symbolic wins by 9\.1%
- •X3\_facility\_7zone:Symbolic wins by 7\.0%

Only X6\_return\_visit favored grid \(3\.0% margin\)\. The strong symbolic advantage validates our canonical choice: reasoning about multi\-room environments benefits from higher\-level linguistic abstractions that describe room connectivity, zone properties, and witness locations more naturally than low\-level grid coordinates\.

##### Causal and Memory Worlds: Mixed Results

Causal worlds \(memory canonical\) showed 5/8 wins for memory, with grid competitive on dynamic state\-change worlds \(C2\_fire\_crossing, C3\_flood\_room\)\. Memory worlds showed mixed patterns, with symbolic winning on testimony\-based worlds \(M2\_witness\_stand: 19\.4%, M3\_incident\_report: 21\.4%\) but grid winning on M4\_narrator \(11\.9%\)\.

#### I\.7\.4Model\-Specific Patterns

Different models show varying serialization preferences \(Table[20](https://arxiv.org/html/2605.19341#A9.T20)\)\. GPT\-4o and GPT\-4o\-mini exhibit the strongest symbolic preference \(66\.7% win rate\), while Opus 4\.6 shows more balanced performance across serializers\. O4\-mini demonstrates the most uniform distribution \(11 symbolic, 9 grid, 9 memory wins\), suggesting more serialization\-agnostic reasoning\.

Table 20:Model\-specific serialization preferences \(In\-Navigation mode\) forHalluWorld\-Grid\.Win counts across all world\-serializer comparisons\. GPT\-4o models show strong symbolic preference; Claude models are more balanced\.
#### I\.7\.5Implications

These results have both methodological and practical implications:

Methodologically, they validate our multi\-serializer evaluation approach: no single serialization is universally optimal during navigation\. Perception and Uncertainty categories show systematic inversions where non\-canonical choices substantially outperform canonical ones \(14\.3% and 12\.8% average margins\)\.

Practically, they suggest task\-dependent deployment guidelines: agents should use symbolic serializers for local perceptual reasoning during navigation \(P worlds\), grid layouts for spatial reasoning under uncertainty \(U worlds\), and symbolic for complex multi\-zone environments \(X worlds\)\.

Theoretically, the 50% canonical win rate reveals that navigation contexts have distinct representational requirements\. Symbolic’s linguistic clarity appears to help with action\-integrated perceptual grounding, while grid’s spatial explicitness helps with uncertainty reasoning\. Future work should investigate whether these patterns reflect fundamental differences in how models process spatial vs linguistic representations during active navigation\.

## Appendix JHalluWorld\-GridLevel Editor and Trajectory Recorder Tools

To allow frictionless benchmark extensibility for the community, we provide two complementary tools, aLevel Editorfor prototyping/designing worlds and aTrajectory Recorderfor ground truth annotation and qualitative analysis of trajectories and vulnerability to hallucination inHalluWorld\-Grid\.

### J\.1Level Editor: Enabling Rapid World Prototyping

To both visualize as well as modify existing levels \(worlds\) and craft new ones, we provide a lightweight but intuitive level editor, available both as a localhost Web UI as well as a ncurses based in\-terminal editor\.

The Web UI additonally provides sidebars for object selection, placement and grid state preview\. This is reminiscent of the world building UI in RTS video games such as Age of Empires II, Warcraft III etc where players can visually place units, terrains and objectives on a grid canvas\.

The editor supports all MiniGrid\-compatible objects \(walls, doors, keys, balls, boxes\) extended with our custom mechanics: fire tiles, flood tiles, pressure plates, notice boards, signposts, river tiles, and dark zones\. Researchers can interactively place objects on a grid canvas, configure their properties \(e\.g\., door colors, flood activation steps, notice board text with deliberately false claims\), and export environments to plain\-text\.txtformat compatible withAsciiEnv\. The editor enforces MiniGrid constraints \(e\.g\., wall borders, valid object placements\) and automatically generates level metadata \(agent start position, view size, see\-through\-walls flag\)\.

##### Rapid prototyping in practice\.

The barrier to creating new test worlds is low: our custom “Ring of Fire” level \(Figure[6\(c\)](https://arxiv.org/html/2605.19341#A10.F6.sf3)\), featuring a real path and a decoy path both constrained by water and fire tiles, with signposts that provide misleading navigational clues, was prototyped in under 8 minutes using the web interface\. Similarly, the C6 Flood\-Fire Escape level \(Figure[6\(a\)](https://arxiv.org/html/2605.19341#A10.F6.sf1)\), with timed flood tile advancement, fire mechanics, pushable boulders, pressure plate triggering, and adversarial notice board testimony, demonstrates the editor’s support for complex multi\-mechanic interactions without requiring manual text file editing\.

![Refer to caption](https://arxiv.org/html/2605.19341v1/images/LevelEditorAndTrajRecorder/levelwebui_flood_fire_escape.png)\(a\)C6 Flood\-Fire Escape level on the Web UI, with elements e\.g\. flood, fire, boulder, plate, door etc\.
![Refer to caption](https://arxiv.org/html/2605.19341v1/images/LevelEditorAndTrajRecorder/levelncurses_ui_example_floodfireescape_final.png)\(b\)Terminal ncurses editor with keyboard\-driven workflow for in\-termina editing
![Refer to caption](https://arxiv.org/html/2605.19341v1/images/LevelEditorAndTrajRecorder/levelwebui_ringoffire_final.png)\(c\)Custom “Ring of Fire” level: real path \+ decoy path with misleading signposts \(8 min prototyping\)\.

Figure 6:Level Editor interfaces across modalities\.\(a\) Web\-based editor showing C6 Flood Fire Escape’s complex mechanics\. \(b\) Terminal ncurses editor for SSH\-only environments\. \(c\) Custom created new level demonstrating rapid prototyping\. All export to the same plain\-text\.txtformat\.

### J\.2Trajectory Recorder: Ground Truth by Construction

The Trajectory Recorder is a Flask based interactive web application \(accessible atlocalhost:5050\) that enables manually navigating through levels to annotate and embed in hallucination probes\. Unlike benchmarks requiring post\-hoc human labeling or LLM adjudication, our recorder generates naturally generates verifiable ground truth*amidst the process of probe placement*: the human operator walks through the environment using keyboard controls \(WASD for movement, G/F for pick/drop, T for toggle\), observes the exact simulator state at each step, and plants probes with known\-correct answers at moments of interest\.

##### Interactive navigation and probe planting\.

The recorder interface displays two synchronized views \(Figure[7\(a\)](https://arxiv.org/html/2605.19341#A10.F7.sf1)\): a full grid state panel \(left\) showing the god’s\-eye view of all objects, agent position, and hidden state \(e\.g\., whether flood tiles are active, fire extinguished\), and an observation text panel \(right\) showing the agent’s field\-of\-view rendered viaMemorySerializer, exactly as models will see it during evaluation\. At any trajectory step, the operator can pressPto open the probe planting panel, which provides a dropdown for probe type \(presence/count/attribute\), a text field for the question, and a ground\-truth answer field\. The operator can verify the correct answer by inspecting the grid state, ensuring probes are neither ambiguous nor inadvertently wrong\. Undo functionality \(Zkey\) allows backtracking within the current room segment\.

![Refer to caption](https://arxiv.org/html/2605.19341v1/images/LevelEditorAndTrajRecorder/trajectoryrecorder_floodfire_12thstep.png)\(a\)Trajectory Recorder web interface showing grid state \(left\), observation text \(right\), and probe planting panel \(bottom\)\. Operator has navigated to step 12 in C6 and is planting a probe about the exit door DE being open or not\.
\{

"segments":\[

\{

"level\_file":"levels/c6\.txt",

"seed":42,

"actions":\[2,2,0,2,5,\.\.\.\]

\}

\],

"probes":\[

\{

"segment":0,

"step":12,

"probe\_type":"presence",

"question":"Isthefirebarrier

atrow6currently

passable?",

"ground\_truth":"false",

"metadata":\{\}

\}

\]

\}

\(b\)Trajectory JSON output showing action sequence and embedded probe\. Ground truth \(false\) is known by operator’s observation; fire extinguishes at step 14, not step 12\.

Figure 7:Trajectory Recorder workflow\.\(a\) Interactive web UI for manual navigation and probe annotation\. \(b\) Deterministic JSON format consumed by evaluation harness\. Ground truth determined by operator’s direct observation, not post\-hoc adjudication\.
##### Deterministic replay and output format\.

The recorder saves trajectories as JSON files with two top\-level keys \(Figure[7\(b\)](https://arxiv.org/html/2605.19341#A10.F7.sf2)\):segments\(list of \{level\_file,seed,actions\} dictionaries, one per room\) andprobes\(list of \{segment,step,probe\_type,question,ground\_truth,metadata\} dictionaries\)\. This format is consumed byrun\_trajectory\_eval\.py, which replays the action sequence deterministically—same level files, same random seeds, same actions—and evaluates model responses against the stored ground truth\. The deterministic replay guarantees that different models are probed about identical world states, enabling controlled comparison without confounding by stochastic environment variation\.

### J\.3Integrated Workflow: From Prototyping to Evaluation

The tools form an end\-to\-end pipeline that lowers the barrier to creating reproducible hallucination benchmarks:

1. 1\.Design: Researcher sketches a hallucination scenario \(e\.g\., “test if models trust stale notice boards over direct observation”\)\.
2. 2\.Prototype: Level Editor is used to construct the environment\. For C6 Flood\-Fire Escape, this involved placing flood tiles with timed activation \(rows 1–5, activating at steps 0–4\), a fire barrier at row 6 that blocks the agent’s path, a pushable boulder at \(col = 7, row = 6\), a pressure plate at \(col = 9, row = 6\) linked to an exit door, and a notice board with deliberately wrong flood timing claims \(“flood reaches fire at step 2”, actual: step 4\)—all configured visually in the web editor\.
3. 3\.Experience: Trajectory Recorder loadsc6\_flood\_fire\_escape\.txtwith seed 42\. The researcher manually solves the puzzle: waits until step 14 \(flood extinguishes fire\), navigates to the now\-passable fire barrier location, pushes the boulder east twice onto the pressure plate \(triggering the exit door to open\), and walks to the goal\. This manual walkthrough ensures the researcher observes the exact sequence of state changes \(flood advancing, fire extinguishing, door opening\)\.
4. 4\.Annotate: At critical steps, the researcher plants probes by pressingP\. For example, in the world C6 if one just moves upwards, then at step 8: “Is the fire barrier at row 6 currently passable?” \(ground truth:false, fire still active\)\. At step 16: “Is the fire barrier passable now?” \(ground truth:true, fire extinguished at step 14\)\. At step 5: “According to the notice board, when does the flood reach the fire barrier?” \(ground truth: “step 2”, testing whether models defer to written information despite observing the actual timing\)\.
5. 5\.Evaluate:run\_trajectory\_eval\.pyreplays the trajectory for each model\. At each step, the harness feeds the model the observation \(rendered by the specified serializer: grid/memory/symbolic\) and trajectory history, then collects the model’s probe response\. Hallucination rates are computed by comparing model answers to the stored ground\-truth labels\.
6. 6\.Extend: The level\.txtfile and trajectory\.jsonare committed to the repository \(e\.g\.,levels/c6\_flood\_fire\_escape\.txt,trajectories/c6\_s42\.json\)\. Other researchers can load the level in the editor to create variants \(e\.g\., change flood timing to step 6, add decoy paths with different fire patterns\) or record alternative trajectories with different probe placements \(e\.g\., focus on boulder pushing mechanics instead of flood timing\)\.

This workflow ensures reproducibility: anyone with the repository can rerun evaluations on identical world states \(same levels, same seeds, same action sequences\)\. It also enables extensibility: the tools are open\-source, and the plain\-text formats \(level\.txt, trajectory\.json\) are human\-readable and version\-control\-friendly, facilitating pull requests for new levels and trajectories\.

### J\.4Integration and Extensibility

A key design goal is enabling external researchers to contribute new test cases without modifying core benchmark code\. The tools support this through modular architecture and shared trajectory collections for egocentric probing\.

##### Modular object system\.

New object types can be added tohalluworld/envs/objects/by subclassingWorldObjand implementing\_\_init\_\_,render, andtogglemethods\. The Level Editor automatically detects registered object classes via Python’s class registry and adds them to the object palette\. For example, addingTorchObject\(a toggleable light source affectingDarkZonevisibility\) requires 40 lines of code and zero editor UI changes

##### Serializer plugins\.

Alternative observation formats can be tested by implementing theSerializerinterface \(serialize\(env, step\) \-\> str\)\. The benchmark harness accepts\-\-serializer grid\|memory\|symbolicflags, enabling A/B testing of whether representation format affects hallucination rates\. Our three serializers differ by∼\\sim2×\\timesin verbosity \(grid: 150 tokens/step, memory: 80 tokens/step, symbolic: 40 tokens/step\), yet show <5pp hallucination rate difference for most models—suggesting format granularity matters less than world complexity for hallucination detection\.

##### Probe type extensibility\.

New probe categories can be added by subclassingBaseProbeand implementinggenerate\(env, trajectory, step\)\(returns question string and ground truth\) andevaluate\(ground\_truth, model\_response\)\(returns correctness boolean\)\. The Trajectory Recorder’s probe type dropdown is populated from registeredBaseProbesubclasses, making new probe types immediately available in the annotation UI\. For instance, aSpatialRelationProbeasking “Is object X to the left/right of object Y from the agent’s perspective?” \(allocentric reasoning\) was added by implementing a 60\-line subclass and became usable in the recorder without UI code changes\.

##### Shared trajectory repository\.

We release all 250\+ navigation trajectories \(5 episodes×\\times31 worlds×\\times1–3 serializers, depending on canonical serializer per world\) used in our experiments, enabling exact replication of our evaluation setup even in the egocentric case\. Researchers can fork trajectories \(e\.g\., extend M2 Witness Stand from 5 chambers to 10\), add new probes to existing trajectories \(e\.g\., test counterfactual reasoning: “If the agent had turned left at step 5, would the fire be visible at step 10?”\) inter alia\.

## Appendix KRelated Works Expanded

##### Definitions and scope\.

Hallucination work in the early days was mainly grounded in a source\. For example, in neural machine translation, hallucination was defined as translations detached from the input, and in abstractive summarization, it referred to content unsupported by the source document\[[32](https://arxiv.org/html/2605.19341#bib.bib64),[44](https://arxiv.org/html/2605.19341#bib.bib80),[28](https://arxiv.org/html/2605.19341#bib.bib88)\]\. Later with LLMs, the term broadened to encompass fluent but nonfactual or unfaithful content\. Recent audits and benchmark papers note that the field still lacks a single stable definition of hallucination, and often mixes source inconsistency, world\-knowledge error, and other failure types under one label\[[66](https://arxiv.org/html/2605.19341#bib.bib49),[26](https://arxiv.org/html/2605.19341#bib.bib40),[2](https://arxiv.org/html/2605.19341#bib.bib65)\]\. The recent “world model” view is closest toHalluWorld\. It makes the reference world explicit and treats hallucination as an observable mismatch between the model’s internal beliefs and that reference world\[[41](https://arxiv.org/html/2605.19341#bib.bib24)\]\.

##### Static hallucination benchmarks in text and RAG\.

Summarization benchmarks such as FactCC, QAGS, FRANK, and FaithBench are strong tests of document faithfulness, but they define truth with respect to a fixed source article rather than an evolving world state\[[31](https://arxiv.org/html/2605.19341#bib.bib59),[67](https://arxiv.org/html/2605.19341#bib.bib50),[50](https://arxiv.org/html/2605.19341#bib.bib98),[3](https://arxiv.org/html/2605.19341#bib.bib61)\]\. Open\-domain benchmarks such as TruthfulQA, HaluEval, FActScore, HALoGEN, and HalluLens broaden coverage to short\-form QA, long\-form claim decomposition, and cross\-domain verification, but they still evaluate static outputs against fixed facts, corpora, or retrieved evidence\[[39](https://arxiv.org/html/2605.19341#bib.bib97),[34](https://arxiv.org/html/2605.19341#bib.bib72),[48](https://arxiv.org/html/2605.19341#bib.bib60),[52](https://arxiv.org/html/2605.19341#bib.bib21)\]\. Expert\-level QA benchmarks such as HLE further stress\-test frontier models on difficult academic questions, but primarily measure answer correctness on static problems rather than hallucination as mismatch against an evolving reference world\[[4](https://arxiv.org/html/2605.19341#bib.bib4)\]\. RAG benchmarks such as RGB, RAGTruth, RAGBench, FACTS Grounding,*Resolving Knowledge Conflicts in Large Language Models*, and ConflictBank study conflicts between parametric and contextual knowledge, but the reference world is still the retrieved or provided evidence rather than a controllable simulator with explicit hidden state and observability\[[5](https://arxiv.org/html/2605.19341#bib.bib52),[49](https://arxiv.org/html/2605.19341#bib.bib86),[17](https://arxiv.org/html/2605.19341#bib.bib85),[27](https://arxiv.org/html/2605.19341#bib.bib91),[69](https://arxiv.org/html/2605.19341#bib.bib87),[59](https://arxiv.org/html/2605.19341#bib.bib55)\]\. Recent work on RAG\-considerate pretraining studies how models should allocate knowledge between parametric memory and retrieval during pretraining, highlighting that hallucination tendencies may also depend on what information is made externally observable vs\. learned in the parameters\[[57](https://arxiv.org/html/2605.19341#bib.bib7)\]\.

##### Multimodal and agent benchmarks\.

In VLMs, CHAIR and POPE focus on object hallucination, MMHal\-Bench and HallusionBench stress open\-ended visual grounding and control\-paired reasoning, and AMBER and HaloQuest expand evaluation to attributes, relations, and synthetic imagery\[[53](https://arxiv.org/html/2605.19341#bib.bib79),[35](https://arxiv.org/html/2605.19341#bib.bib58),[60](https://arxiv.org/html/2605.19341#bib.bib47),[18](https://arxiv.org/html/2605.19341#bib.bib67),[68](https://arxiv.org/html/2605.19341#bib.bib48),[70](https://arxiv.org/html/2605.19341#bib.bib70)\]\. These benchmarks are central for multimodal grounding, but most are single\-scene evaluations and hence do not test hallucinations under a temporally evolving latent state\[[1](https://arxiv.org/html/2605.19341#bib.bib63)\]\. For agents, MIRAGE\-Bench and AgentHallu are the closest prior work because they explicitly target action\-level hallucinations and step attribution, while WebArena, WorkArena, OSWorld, and GAIA expose related failures through realistic task execution\[[77](https://arxiv.org/html/2605.19341#bib.bib77),[42](https://arxiv.org/html/2605.19341#bib.bib44),[78](https://arxiv.org/html/2605.19341#bib.bib99),[10](https://arxiv.org/html/2605.19341#bib.bib100),[74](https://arxiv.org/html/2605.19341#bib.bib81),[47](https://arxiv.org/html/2605.19341#bib.bib62)\]\. However, MIRAGE\-Bench audits decision snapshots from existing environments and AgentHallu relies on human\-annotated trajectories, whereasHalluWorldis designed around synthetic environments with labels available by construction\.

##### Controllable environments, world models, and structured reasoning\.

Synthetic environments such as bAbI, Dyna\-bAbI, TextWorld, BabyAI, and ALFWorld, together with formal planning benchmarks such as PlanBench and ACPBench, demonstrate the value of controllable generation, reproducible dynamics, and exact evaluation\[[72](https://arxiv.org/html/2605.19341#bib.bib92),[63](https://arxiv.org/html/2605.19341#bib.bib57),[9](https://arxiv.org/html/2605.19341#bib.bib89),[6](https://arxiv.org/html/2605.19341#bib.bib51),[55](https://arxiv.org/html/2605.19341#bib.bib46),[65](https://arxiv.org/html/2605.19341#bib.bib82),[30](https://arxiv.org/html/2605.19341#bib.bib42)\]\. But these resources primarily score QA, planning, or task success, not hallucination as an observable false claim\. A parallel literature on world models and state tracking \- from World Models, PlaNet, MuZero, Dreamer, Dynalang, and language\-guided world models to EntNet, CLUTRR, RuleTaker, ProofWriter, BoardgameQA, and recent explicit state\-tracking probes \- studies whether models can represent and update latent state over time\[[19](https://arxiv.org/html/2605.19341#bib.bib101),[21](https://arxiv.org/html/2605.19341#bib.bib74),[54](https://arxiv.org/html/2605.19341#bib.bib76),[20](https://arxiv.org/html/2605.19341#bib.bib56),[76](https://arxiv.org/html/2605.19341#bib.bib73),[38](https://arxiv.org/html/2605.19341#bib.bib75),[23](https://arxiv.org/html/2605.19341#bib.bib94),[58](https://arxiv.org/html/2605.19341#bib.bib54),[8](https://arxiv.org/html/2605.19341#bib.bib96),[62](https://arxiv.org/html/2605.19341#bib.bib84),[29](https://arxiv.org/html/2605.19341#bib.bib53),[33](https://arxiv.org/html/2605.19341#bib.bib39),[22](https://arxiv.org/html/2605.19341#bib.bib95)\]\. This literature is relevant to our motivation, but its standard targets are control, prediction, proof generation, or answer correctness rather than automatic hallucination labels against a fully specified reference world\. Related work on commonsense\-aware generation and reasoning has also explored grounding generation in retrieved visual/contextual evidence, concept\-level constraints, and domain\-specific reasoning dimensions\[[37](https://arxiv.org/html/2605.19341#bib.bib1),[15](https://arxiv.org/html/2605.19341#bib.bib9),[13](https://arxiv.org/html/2605.19341#bib.bib11),[14](https://arxiv.org/html/2605.19341#bib.bib10)\]\. These approaches target more faithful or semantically constrained generation, whereasHalluWorldevaluates whether model claims are true against a specified reference world\.

##### PositioningHalluWorld\.

For the specific goal of measuring hallucination as observable false claims in partially observed, evolving worlds,HalluWorldis not just another source\-grounded or static factuality benchmark: unlike prior text and QA benchmarks, it is not limited to fixed documents, facts, or corpora\. Unlike RAG\-based benchmarks, it does not assume retrieved context is the entire reference world, and unlike multimodal grounding benchmarks, it is not confined to single\-scene evaluation\. Relative to MIRAGE\-Bench and AgentHallu, its main advantage for our purpose is that the world is synthetic and fully specified, so hallucination labels can come from simulator state rather than snapshot audits or post\-hoc human annotation\. In that sense,HalluWorldcombines four ingredients that prior work usually treats separately: an explicit definition of hallucination, a benchmark, a controllable environment, and automatic labels\.

Similar Articles

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

arXiv cs.CL

This paper reveals that much of the reported progress in LLM hallucination detection is due to benchmark construction artifacts, where ground-truth answers are embedded in prompts, allowing a simple text-similarity baseline to achieve near-perfect scores. Through a large-scale controlled evaluation, the authors show that most methods perform near chance under proper controls, except for supervised probes on upper-layer hidden states such as SAPLMA and their proposed DRIFT.

HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders

arXiv cs.CL

Researchers from Beihang University and other institutions propose HalluSAE, a framework using sparse autoencoders and phase transition theory to detect hallucinations in LLMs by modeling generation as trajectories through a potential energy landscape and identifying critical transition zones where factual errors occur.

Hallucination Detection via Activations of Open-Weight Proxy Analyzers

arXiv cs.CL

This paper introduces a proxy-analyzer framework that detects hallucinations in large language models by analyzing internal activations of small, open-weight models rather than the generator itself. The method achieves superior performance on benchmarks like RAGTruth compared to existing methods like ReDeEP, demonstrating that model size is less critical than the analysis approach.