EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

arXiv cs.AI 05/11/26, 04:00 AM Papers
Summary
This paper introduces EnvSimBench, a benchmark for evaluating Large Language Models' ability to simulate environments for agent training. It identifies a 'state change cliff' in current LLMs and proposes a constraint-driven pipeline to reduce hallucinations and costs.
arXiv:2605.07247v1 Announce Type: new Abstract: Scalable AI agent training relies on interactive environments that faithfully simulate the consequences of agent actions. Manually crafted environments are expensive to build, brittle to extend, and fundamentally limited in diversity. A promising direction is to replace manually crafted environments with LLM-simulated counterparts. However, this paradigm hinges on an unexamined core assumption: LLMs can accurately simulate environmental feedback. In practice, LLM-simulated environments suffer from hallucinations, logical inconsistencies, and silent state drift, failures that corrupt agent reward signals and compound the construction costs that the paradigm was designed to eliminate. To address this gap, we propose EnvSimBench with four contributions: 1) We provide the first formal definition and operationalization of Environment Simulation Ability (EnvSim Ability) as a quantifiable research objective. 2) We construct EnvSimBench, a rigorous benchmark covering 400 samples across 167 diverse environments, equipped with verifiable labels and fine-grained difficulty stratification along three axes. 3) Systematic evaluations reveal that all state-of-the-art language models suffer from a universal state change cliff: they achieve near-perfect accuracy on tasks when the environment state remains invariant, yet fail catastrophically when multiple states need simultaneous updates. This finding exposes EnvSim Ability as a critical yet largely unaddressed capability gap. 4) We design a constraint-driven simulation pipeline that substantially reduces hallucination, boosts environment synthesis yield by 6.8%, and cuts costs by over 90%. Overall, EnvSimBench serves as both a diagnostic framework and a practical optimization path for reliable LLM-based environment simulation, establishing a foundation for scalable agent training. Code and data are available at https://github.com/cookieApril/EnvSimBench
Cached at: 05/11/26, 07:15 AM
# EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
Source: [https://arxiv.org/html/2605.07247](https://arxiv.org/html/2605.07247)
Yi Liu TingFeng Hui Wei Zhang Li Sun Beijing University of Posts and Telecommunications louie@bupt.edu.cn Ningxin Su The Hong Kong University of Science and Technology Jian Wang Chongqing University Sen Su Beijing University of Posts and Telecommunications

###### Abstract

Scalable AI agent training relies on interactive environments that faithfully simulate the consequences of agent actions. Manually crafted environments are expensive to build, brittle to extend, and fundamentally limited in diversity. A promising direction is to replace manually crafted environments with LLM-simulated counterparts. However, this paradigm hinges on an unexamined core assumption: LLMs can accurately simulate environmental feedback. In practice, LLM-simulated environments suffer from hallucinations, logical inconsistencies, and silent state drift, failures that corrupt agent reward signals and compound the construction costs that the paradigm was designed to eliminate. To address this gap, we propose EnvSimBench with four contributions: (1) We provide the first formal definition and operationalization of Environment Simulation Ability (EnvSim Ability) as a quantifiable research objective. (2) We construct EnvSimBench, a rigorous benchmark covering 400 samples across 167 diverse environments, equipped with verifiable labels and fine-grained difficulty stratification along three axes. (3) Systematic evaluations reveal that all state-of-the-art language models suffer from a universal state change cliff: they achieve near-perfect accuracy on tasks when the environment state remains invariant, yet fail catastrophically when multiple states need simultaneous updates. This finding exposes EnvSim Ability as a critical yet largely unaddressed capability gap. (4) We design a constraint-driven simulation pipeline that substantially reduces hallucination, boosts environment synthesis yield by 6.8%, and cuts costs by over 90%. Overall, EnvSimBench serves as both a diagnostic framework and a practical optimization path for reliable LLM-based environment simulation, establishing a solid foundation for scalable agent training. Code and data are available at: https://github.com/cookieApril/EnvSimBench

## 1 Introduction

Large language models (LLMs) are increasingly required to serve as agents across diverse real-world scenarios Luo et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib10)); Yao et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib9)); Qian et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib8)). Recent research efforts Patil et al. ([2024](https://arxiv.org/html/2605.07247#bib.bib25)); Yao et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib9)); Lu et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib3)) have constructed stateful, tool-interactive sandbox environments, delivering improved controllability and stability.

Training autonomous agents at scale requires interactive environments that feature rich diversity, high simulation fidelity, and low deployment costs Huang et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib37)); DeepSeek-AI et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib36)); Froger et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib35)). However, manually designed environments suffer from prohibitive construction costs, poor extensibility, and inherent limitations in scenario coverage, making them ill-suited for large-scale agent training. A compelling recent direction addresses this challenge by replacing manually crafted, executable environments Jimenez et al. ([2024a](https://arxiv.org/html/2605.07247#bib.bib4)); Zhou et al. ([2024a](https://arxiv.org/html/2605.07247#bib.bib5)); Shridhar et al. ([2021](https://arxiv.org/html/2605.07247#bib.bib34)) with LLM-simulated counterparts, where a language model generates the feedback for each agent action Li et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib1)); Song et al. ([2026](https://arxiv.org/html/2605.07247#bib.bib7)). This paradigm dramatically lowers the barrier to environment creation and promises seamless scaling across domains. However, it rests on a foundational assumption that has received surprisingly little scrutiny: LLMs can accurately simulate environmental feedback with sufficient fidelity to be trusted as training substrates.

If this assumption fails, agents trained in hallucinated environments optimize against corrupted reward signals, and synthesis pipelines incur compounding failure costs that erode the very cost advantage the paradigm promises. Yet simulation fidelity has never been systematically characterized.

In practice, when models attempt to act as simulators, their performance is undermined by specific vulnerabilities. Three concrete failure modes underlie this difficulty: hallucination (Ji et al., [2023](https://arxiv.org/html/2605.07247#bib.bib11); Patil et al., [2023](https://arxiv.org/html/2605.07247#bib.bib29); Zhang et al., [2025](https://arxiv.org/html/2605.07247#bib.bib12)), whereby a model fabricates plausible but incorrect state transitions; logical inconsistency (Ruan et al., [2024](https://arxiv.org/html/2605.07247#bib.bib26); Elazar et al., [2021](https://arxiv.org/html/2605.07247#bib.bib24)), whereby inter-field constraints within a single response are violated; and state drift (Lu et al., [2025](https://arxiv.org/html/2605.07247#bib.bib3); Prabhakar et al., [2025](https://arxiv.org/html/2605.07247#bib.bib23); Yao et al., [2025](https://arxiv.org/html/2605.07247#bib.bib9)), whereby the absence of persistent memory causes earlier state changes to be silently lost. These failures are not incidental engineering deficiencies but reflect a fundamental tension between the generative nature of language models and the deterministic, compositional logic of executable environments. Without a framework to precisely measure these failures, they remain invisible to practitioners and unaddressable by researchers.

To address this gap and establish a rigorous foundation for studying simulation fidelity, we introduce the concept of Environment Simulation Ability (EnvSim Ability) to denote a model’s capacity to accurately predict the state transition and observational feedback induced by an agent action, given the current environment state and the action’s implementation logic. Without a precise, operationalizable definition, neither systematic evaluation nor principled improvement of this capability is possible.

To address these three failure modes and make EnvSim Ability concretely measurable, we construct EnvSimBench. EnvSimBench converts the inherently partially observable simulation problem into a fully observable, independently verifiable one. Standard LLM simulators operate under a partially observable Markov decision process (POMDP) Oliehoek and Amato ([2016](https://arxiv.org/html/2605.07247#bib.bib31)): the model must infer the current environment state from conversation history alone, with no explicit access to the ground-truth state or the environment’s transition logic. Our central insight is that recasting simulation as a fully observable Markov decision process (MDP), in which the model receives the explicit before-state and the action’s implementation logic as input and must predict the resulting state and observation as output, resolves all three issues simultaneously. This reframing is the architectural foundation of EnvSimBench.

![Refer to caption](https://arxiv.org/html/2605.07247v1/Images/Fig3.drawio.png)

Figure 1: Overview of EnvSimBench. Module A: EnvScaler Song et al. ([2026](https://arxiv.org/html/2605.07247#bib.bib7)) environments serve as seed data; a GPT-4o-mini agent collects multi-turn execution trajectories, preprocessed into self-contained single-turn state prediction samples $(s_t, a_t, s'_t, o_t)$. Each step is independently verifiable against a programmatic label, decoupling simulation fidelity from state tracking and making EnvSim Ability objectively measurable. Module B: Samples undergo three-axis stratification (action outcome, state-change complexity, argument cardinality) and executor-based labeling, yielding 400 benchmark samples across 167 environments. The three axes allow failures to be precisely localized rather than collapsed into a single score. Module C: *Path 1* evaluates seven frontier LLMs under identical conditions. *Path 2* trains a specialized 4B simulation model that, when integrated into EnvScaler, improves synthesis yield by 6.8% while cutting costs by over 90%.

EnvSimBench makes four contributions, linked by a deliberate logical chain from conceptualization to diagnosis to remedy:

(I) Formalizing EnvSim Ability. We provide the first formal definition and operationalization of EnvSim Ability as a quantifiable research objective, establishing it as a distinct capability from related skills and providing a conceptual foundation for rigorous evaluation and improvement.

(II) A rigorous benchmark. Grounded in the POMDP-to-MDP reframing, we construct a benchmark of 400 samples across 167 diverse tool-interactive environments, with verifiable programmatic labels and fine-grained difficulty stratification along three orthogonal axes: action outcome, state-change complexity, and argument cardinality.

(III) A diagnostic finding: the state-change cliff. Systematic evaluation reveals a universal failure pattern: all models achieve near-perfect accuracy on state-preserving operations yet collapse catastrophically when multiple state variables must update simultaneously, a threshold we term the *state-change cliff*. This gap is orthogonal to model scale and general reasoning ability. More critically, models that generate superficially correct feedback can simultaneously produce incorrect state transitions, silently corrupting training signals with no observable divergence for the agent.

(IV) A constraint-driven remedy. Guided by these findings, we propose a constraint-driven simulation paradigm that makes environment schemas and transition logic explicit at each step, substantially reducing hallucination. A simulation model trained under this paradigm surpasses all evaluated frontier LLMs on configuration match, boosts synthesis yield by 6.8%, and cuts costs by over 90%. This demonstrates that targeted specialization is a cost-efficient path to reliable environment simulation.

## 2 Related Work

#### LLM environment simulation and synthesis\.

Simia Li et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib1)) demonstrated that reasoning models can generate plausible tool feedback for agent training, establishing LLM-simulated tool-interactive environments as a practical substrate for scalable agent training. EnvScaler Song et al. ([2026](https://arxiv.org/html/2605.07247#bib.bib7)) extended this direction by demonstrating fully automated synthesis of tool-interactive environments, substantially reducing construction costs. However, neither work formally defines simulation fidelity as a measurable capability, provides cross-model comparisons of simulation quality, or offers principled evaluation metrics. EnvSimBench addresses all three gaps simultaneously. Relatedly, interactive fiction environments Hausknecht et al. ([2020](https://arxiv.org/html/2605.07247#bib.bib16)) and code-execution benchmarks Jimenez et al. ([2024a](https://arxiv.org/html/2605.07247#bib.bib4)) use executable environments to evaluate agent behavior; our work inverts this relationship, treating the simulator itself rather than the agent as the object of evaluation.

#### Agent benchmarks and tool use\.

Building on a long line of prior work on language agents Yao et al. ([2023](https://arxiv.org/html/2605.07247#bib.bib30)); Zhou et al. ([2024b](https://arxiv.org/html/2605.07247#bib.bib18)); Jimenez et al. ([2024b](https://arxiv.org/html/2605.07247#bib.bib28)); Liu et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib27)); Ruan et al. ([2024](https://arxiv.org/html/2605.07247#bib.bib26)); Deng et al. ([2023](https://arxiv.org/html/2605.07247#bib.bib33)); Zeng et al. ([2023](https://arxiv.org/html/2605.07247#bib.bib32)), LLM tool utilization Patil et al. ([2024](https://arxiv.org/html/2605.07247#bib.bib25)); Qin et al. ([2024](https://arxiv.org/html/2605.07247#bib.bib19)); Huang et al. ([2024](https://arxiv.org/html/2605.07247#bib.bib22)), and task-oriented dialogue systems, $\tau$-bench Yao et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib9)) and $\tau^{2}$-bench Barres et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib2)) established multi-turn, tool-using evaluation paradigms and introduced the MDP/POMDP distinction that we adopt as the architectural foundation of our diagnostic framework. ToolSandbox Lu et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib3)) evaluates stateful tool use with fine-grained state-transition feedback, providing a complementary perspective in which the agent is evaluated against a fixed executable environment; our work evaluates the simulator rather than the agent, and asks whether a language model can *replace* such executors reliably. APIGen-MT Prabhakar et al. ([2026](https://arxiv.org/html/2605.07247#bib.bib14)) generates multi-turn training data via LLM simulation; our pipeline shares structural similarities but replaces LLM-generated labels with programmatically verified ground truth, directly addressing the circular validation problem that motivates our work.

#### Hallucination and simulation fidelity\.

Hallucination in language model generation has been extensively studied in open-ended settings Ji et al. ([2023](https://arxiv.org/html/2605.07247#bib.bib11)); Zhang et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib12)). Gorilla Patil et al. ([2023](https://arxiv.org/html/2605.07247#bib.bib29)) demonstrated that language models frequently produce incorrect tool arguments even when the correct API schema is provided, a finding that further motivates our work. Prior work on language model program execution Lu et al. ([2025](https://arxiv.org/html/2605.07247#bib.bib3)) shows that prediction accuracy degrades as the number of interdependent state updates increases, underscoring the cumulative effect of state updates across multiple execution rounds. EnvSimBench avoids the cascading dependency identified in that work through its POMDP-to-MDP reframing.

## 3 Problem Formulation

#### Tool interactive environments\.

Following EnvScaler, we model an environment as $\mathcal{E}=(\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{O})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T}:\mathcal{S}\times\mathcal{A}\to\mathcal{S}$ is the deterministic transition function, and $\mathcal{O}:\mathcal{S}\times\mathcal{A}\to\Sigma^{*}$ is the observation function that returns feedback to the agent. Each environment maintains a persistent configuration that encodes its full state. For example, a mutual fund management system tracks securities, portfolio holdings, and transaction logs; a mobile authentication system tracks OTP records, session states, and verification timestamps. The transition function $\mathcal{T}$ is realized through tool definitions, and the label $(o,s')$ is acquired by executing the real environment.

#### State prediction task\.

Given a tool call $a$, the pre-execution state $s$, and the tool’s implementation $\texttt{code}(a)$, a model must predict both the resulting observation $\hat{o}$ and the resulting configuration $\hat{s}'$ (represented as a structured list of add/modify/delete operations applied to $s$). Evaluation uses two binary metrics: Feedback Match (FM), exact string equality between $\hat{o}$ and $o$, and Config Match (CM), whether the predicted changes, when applied to $s$, reproduce $s'$ exactly. Both are verified against ground truth, not another LLM.
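The two metrics can be restated compactly as indicator functions. This is only a notational sketch of the definitions above: the operator $\mathrm{apply}$ and the symbol $\widehat{\Delta}$ for the predicted change list are borrowed from the prompt-structure description in Section 4.2, and the indicator notation $\mathbb{1}[\cdot]$ is our assumption rather than the paper's own formula.

```latex
\mathrm{FM} \;=\; \mathbb{1}\bigl[\hat{o} = o\bigr],
\qquad
\mathrm{CM} \;=\; \mathbb{1}\bigl[\mathrm{apply}(\widehat{\Delta},\, s) = s'\bigr],
\qquad
\hat{s}' \;=\; \mathrm{apply}(\widehat{\Delta},\, s).
```

Read this way, FM depends only on the surface string, while CM depends only on the reconstructed configuration, which is why the two can disagree on the same sample.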

#### POMDP vs\. MDP formulation\.

Standard LLM simulators operate as POMDPs. Formally, at step $t$, the model receives only the observation history $h_t=(a_1,o_1,\ldots,a_{t-1},o_{t-1},a_t)$ and must implicitly maintain a belief state $b_t\approx P(s_t\mid h_t)$ without access to the true state $s_t$. The simulation objective is thus:

$$
\hat{o}_t,\,\hat{s}'_t \;=\; f_\theta(h_t),
$$

where $f_\theta$ conflates three distinct sub-problems: *state estimation* (recovering $s_t$ from $h_t$), *transition reasoning* (applying $\mathcal{T}(s_t,a_t)$), and *observation generation* (computing $\mathcal{O}(s_t,a_t)$). This conflation is the root cause of the three failure modes illustrated in Figure [2](https://arxiv.org/html/2605.07247#S3.F2): without explicit access to $s_t$, the model must hallucinate a plausible belief state, and any error in $b_t$ compounds silently across turns, since $h_{t+1}$ is constructed from the model’s own (potentially incorrect) output $\hat{o}_t$.

#### Constraint\-driven MDP formulation\.

Our key insight is that state estimation is not intrinsic to simulation, but an artifact of information hiding\. We therefore decouple the two by supplyingsts\_\{t\}directly, converting the POMDP into a fully observable MDP\. The simulation objective becomes:

$$
\hat{o}_t,\,\hat{s}'_t \;=\; f_\theta\bigl(s_t,\;a_t,\;\texttt{code}(a_t)\bigr),
$$

where the prompt supplies (1) the full before-config $s_t$ as an explicit JSON dictionary, (2) the tool call $a_t$, and (3) the implementation $\texttt{code}(a_t)$ that defines $\mathcal{T}$ and $\mathcal{O}$ for action $a_t$. The model’s task narrows from free-form state propagation to *structured code comprehension and state transformation*. Crucially, each step is now *independent*: there is no accumulated history, so estimation errors cannot propagate. Each prediction $(\hat{o}_t,\hat{s}'_t)$ is independently verifiable against the ground truth $(o_t,s'_t)$, enabling objective, LLM-free evaluation. As illustrated in Figure [2](https://arxiv.org/html/2605.07247#S3.F2), this design directly eliminates all three failure modes: the explicit $s_t$ removes the source of hallucinated state transitions, the schema encoded in $\texttt{code}(a_t)$ enforces logical consistency, and the single-turn structure prevents state drift by construction.
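To make the three-part input concrete, the following is a minimal sketch of how such a prompt might be assembled. The function name `build_mdp_prompt`, the section headings, the task wording, and the toy `withdraw` tool are all hypothetical; the paper's actual prompt template is not reproduced here.

```python
import json


def build_mdp_prompt(before_config: dict, tool_call: dict, tool_code: str) -> str:
    """Assemble (s_t, a_t, code(a_t)) into a single-turn prompt.

    No conversation history is included, so each prompt is self-contained.
    Headings and wording are illustrative assumptions, not the paper's template.
    """
    sections = [
        "## Before-config (s_t)",
        json.dumps(before_config, indent=2, sort_keys=True),
        "## Tool call (a_t)",
        json.dumps(tool_call, indent=2, sort_keys=True),
        "## Tool implementation (code(a_t))",
        tool_code.strip(),
        "## Task",
        "Predict the observation and the list of state-change operations.",
    ]
    return "\n\n".join(sections)


# Toy example with a hypothetical banking tool.
example_state = {"balance": 100}
example_call = {"name": "withdraw", "arguments": {"amount": 30}}
example_code = """
def withdraw(self, amount):
    if amount > self.balance:
        return {"status": False, "msg": "insufficient funds"}
    self.balance -= amount
    return {"status": True, "msg": "ok"}
"""
example_prompt = build_mdp_prompt(example_state, example_call, example_code)
```

Because the prompt is a pure function of the triple, two samples never share hidden context, which is the independence property the formulation relies on.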

![Refer to caption](https://arxiv.org/html/2605.07247v1/Images/Fig1.drawio.png)

Figure 2: POMDP vs. MDP formulation. *Left*: Standard simulation; the LLM infers state from conversation history, causing state drift and hallucination. *Right*: Constraint-driven simulation; the full before-config $s_t$, tool call $a_t$, and implementation $\texttt{code}(a_t)$ are provided explicitly, making each step independently verifiable.

## 4 EnvSimBench and Constraint-Driven Simulation Paradigm

### 4.1 Benchmark Construction

#### From trajectories to single\-turn samples\.

The constraint-driven MDP formulation (Section [3](https://arxiv.org/html/2605.07247#S3)) motivates a specific data structure: each benchmark sample must be *self-contained*, supplying the model with the complete triple $(s_t,a_t,\texttt{code}(a_t))$ and requiring it to produce $(\hat{o}_t,\hat{s}'_t)$ without access to any prior turn. This independence is not merely a design convenience. It is precisely what makes each prediction independently verifiable and prevents estimation errors from compounding across steps, as they inevitably do in the POMDP regime.

As illustrated in Figure [4](https://arxiv.org/html/2605.07247#A2.F4), we operationalize this by collecting multi-turn execution trajectories $\tau=\{(a_t,o_t,s_t,s'_t)\}_{t=1}^{T}$ from 191 tool-interactive seed environments. The subscript $t=0$ is excluded by construction: at initialization, no before-state $s_0$ exists in the sense required by the MDP formulation because the environment has not yet undergone any transition. The state that would serve as $s_0$ is the after-state produced by the environment’s initialization routine, with no preceding action to condition on. Excluding it ensures that every retained sample $(s_t,a_t,s'_t,o_t)$ is a genuine state-transition instance with a well-defined, executor-verified before-state. This process converts each trajectory into a set of independent state prediction samples, each annotated with the ground-truth label $(o_t,s'_t)$ produced by the deterministic external executor, not by any language model.
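The trajectory-to-sample conversion can be sketched as follows. This is an illustrative sketch only: the `Sample` class, the function `flatten_trajectory`, and the step keys `"before"`, `"action"`, `"after"`, and `"observation"` are hypothetical names, and the actual preprocessing pipeline (Appendix B of the paper) may differ.

```python
import copy
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Sample:
    """One self-contained state-prediction sample (s_t, a_t, s'_t, o_t)."""
    before_state: dict[str, Any]
    action: dict[str, Any]
    after_state: dict[str, Any]
    observation: str


def flatten_trajectory(steps: list[dict[str, Any]]) -> list[Sample]:
    """Turn a multi-turn trajectory (steps t = 1..T) into independent samples.

    Each step is assumed to already carry executor-recorded before/after
    states, so no sample needs information from any other turn.
    """
    samples = []
    for step in steps:
        samples.append(
            Sample(
                # Deep copies keep samples independent of later mutation.
                before_state=copy.deepcopy(step["before"]),
                action=copy.deepcopy(step["action"]),
                after_state=copy.deepcopy(step["after"]),
                observation=step["observation"],
            )
        )
    return samples


# Toy two-step trajectory from a hypothetical counter environment.
toy_trajectory = [
    {"before": {"n": 0}, "action": {"name": "inc"}, "after": {"n": 1}, "observation": "ok"},
    {"before": {"n": 1}, "action": {"name": "inc"}, "after": {"n": 2}, "observation": "ok"},
]
toy_samples = flatten_trajectory(toy_trajectory)
```

Note that the before-state of step $t+1$ equals the executor's after-state of step $t$, never a model prediction, which is what prevents error accumulation.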

#### Three\-axis difficulty stratification\.

To enable fine-grained diagnosis rather than a single aggregate score, we stratify the sample pool along three orthogonal axes, each targeting a distinct source of simulation difficulty. Figure [1](https://arxiv.org/html/2605.07247#S1.F1) provides an overview of EnvSimBench, and a detailed construction pipeline is described in Appendix [B](https://arxiv.org/html/2605.07247#A2).

Axis 1 (Action Outcome). Samples are first partitioned by whether the action $a_t$ succeeds or fails under $\mathcal{T}$. The *Failure* group ($|\Delta(s_t,s'_t)|=0$ with $\mathcal{O}$ returning an error signal) tests whether a model correctly withholds state changes when the transition guard is not satisfied.

Axis 2 (State-Change Complexity). Within the success group, samples are stratified by $|\Delta(s_t,s'_t)|$, the cardinality of the state-change set, into four tiers: *No-Change* ($|\Delta|=0$, 80 samples), *Simple* ($|\Delta|\in\{1,2\}$, 50 samples), *Medium* ($|\Delta|\in\{3,\ldots,6\}$, 200 samples), and *Difficult* ($|\Delta|\in\{7,\ldots,12\}$, 50 samples). This distribution is not arbitrary: it mirrors the empirical frequency of state-change magnitudes observed across the 191 source environments, which follows an approximately bell-shaped curve concentrated at moderate complexity. By allocating the largest stratum (200 samples) to the Medium tier (where the state-change cliff first manifests) and smaller strata to the extremes, the benchmark both reflects ecological validity and maximizes diagnostic resolution precisely where failures are most consequential.

Axis 3 (Input Argument Cardinality). For Failure and No-Change samples, a third axis further partitions by the number of arguments supplied to $a_t$ (zero vs. one or more). This controls for a qualitatively simpler sub-case: when no arguments are provided, the model need not reason about argument-dependent branching in the transition logic, making correct prediction structurally easier regardless of state-change complexity.
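The three axes can be combined into a single stratum label, sketched below. The tier boundaries are taken from the Axis 2 description; the function names, the label strings, and the choice to raise an error for $|\Delta| > 12$ (a case the text does not address) are our assumptions.

```python
def complexity_tier(num_changes: int) -> str:
    """Map |Delta| of a successful action to its Axis 2 tier."""
    if num_changes < 0:
        raise ValueError("number of state changes cannot be negative")
    if num_changes == 0:
        return "No-Change"
    if num_changes <= 2:
        return "Simple"
    if num_changes <= 6:
        return "Medium"
    if num_changes <= 12:
        return "Difficult"
    # Assumption: values above 12 fall outside the benchmark's stated tiers.
    raise ValueError("|Delta| > 12 is outside the stated tier ranges")


def stratum(succeeded: bool, num_changes: int, num_args: int) -> str:
    """Combine the three axes into one label (label format is hypothetical)."""
    # Axis 1: failed actions form their own group.
    group = complexity_tier(num_changes) if succeeded else "Failure"
    # Axis 3 applies only to the Failure and No-Change groups.
    if group in ("Failure", "No-Change"):
        arity = "zero-arg" if num_args == 0 else "with-args"
        return f"{group}/{arity}"
    return group
```

For instance, a successful call that modifies four fields lands in `Medium` regardless of its argument count, while a failed zero-argument call is labeled `Failure/zero-arg`.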

#### Diversity rule\.

Within each stratum, samples are selected to maximize the number of distinct environments $|\mathcal{E}_{\text{sub}}|$ represented. This *diversity rule* prevents the benchmark from concentrating in a small number of high-trajectory environments and ensures that evaluation generalizes across the full breadth of domain types present in the source collection. The final benchmark comprises 400 samples drawn from 167 distinct environments (Table [1](https://arxiv.org/html/2605.07247#S4.T1)). All ground-truth labels are produced by the deterministic external executor, making the benchmark entirely LLM-free in its construction.
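One simple way to realize such a diversity rule is a greedy pass that takes one sample from each unseen environment before reusing any environment. The sketch below is an assumption about how the rule could be implemented, not the authors' selection procedure; `select_diverse` and the `(env_id, sample_id)` tuple layout are hypothetical.

```python
def select_diverse(candidates: list[tuple[str, str]], quota: int) -> list[tuple[str, str]]:
    """Pick up to `quota` (env_id, sample_id) pairs from one stratum.

    First pass: one sample per previously unseen environment, which yields
    min(quota, number of environments) distinct environments.
    Second pass: fill any remaining slots from the leftovers, in input order.
    """
    if quota < 0:
        raise ValueError("quota must be non-negative")
    selected: list[tuple[str, str]] = []
    leftovers: list[tuple[str, str]] = []
    seen_envs: set[str] = set()
    for env_id, sample_id in candidates:
        if env_id not in seen_envs and len(selected) < quota:
            seen_envs.add(env_id)
            selected.append((env_id, sample_id))
        else:
            leftovers.append((env_id, sample_id))
    selected.extend(leftovers[: quota - len(selected)])
    return selected


# Toy stratum: environment "A" contributes two candidate samples.
toy_pool = [("A", "a1"), ("A", "a2"), ("B", "b1"), ("C", "c1")]
toy_pick = select_diverse(toy_pool, quota=3)
```

With a quota of three, the toy pool yields one sample each from environments A, B, and C rather than two from A.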

Table 1: Benchmark sample distribution.
#### Evaluation metrics\.

We adopt two binary metrics per sample. Feedback Match (FM) measures exact equality between the predicted observation $\hat{o}_t$ and the ground-truth $o_t$, capturing whether the model’s surface response is correct. Config Match (CM) measures whether the predicted state $\hat{s}'_t$, reconstructed by applying the predicted change operations to $s_t$, coincides exactly with the executor-produced $s'_t$. CM is the stricter and more informative metric: it is invariant to output-format conventions and directly reflects whether the model has correctly traced the transition $\mathcal{T}(s_t,a_t)$.

### 4.2 Constraint-Driven Simulation Paradigm

#### MDP prompt structure\.

Each evaluation prompt instantiates the MDP input triple $(s_t,\,a_t,\,\texttt{code}(a_t))$ in a structured format. The model is required to produce a structured output encoding $(\hat{o}_t,\hat{s}'_t)$ as a predicted observation string and a set of typed state-change operations $\widehat{\Delta}=\{\delta_1,\ldots,\delta_k\}$, where each $\delta_i$ specifies a field path and its new value. The predicted after-state is then recovered as $\hat{s}'_t=\text{apply}(\widehat{\Delta},\,s_t)$ and compared against $s'_t$ to compute CM.
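A minimal sketch of the apply-and-compare step is given below. The operation encoding (a dict with `"op"`, `"path"` as a list of keys, and `"value"`) is a hypothetical format chosen for illustration; the text only states that operations are typed (add/modify/delete) and carry a field path and new value. Treating a malformed prediction as a CM failure is likewise our assumption.

```python
import copy
from typing import Any


def apply_changes(state: dict[str, Any], changes: list[dict[str, Any]]) -> dict[str, Any]:
    """Return a new state with the add/modify/delete operations applied."""
    new_state = copy.deepcopy(state)  # never mutate the before-state
    for change in changes:
        op = change["op"]
        if op not in ("add", "modify", "delete"):
            raise ValueError(f"unknown operation type: {op!r}")
        *parents, leaf = change["path"]
        node = new_state
        for key in parents:
            # "add" may create intermediate containers; others require them.
            node = node.setdefault(key, {}) if op == "add" else node[key]
        if op == "delete":
            del node[leaf]
        else:
            node[leaf] = change["value"]
    return new_state


def config_match(before: dict[str, Any], predicted: list[dict[str, Any]],
                 true_after: dict[str, Any]) -> bool:
    """CM: does apply(predicted, before) equal the executor's after-state?"""
    try:
        return apply_changes(before, predicted) == true_after
    except (KeyError, TypeError, ValueError):
        return False  # assumption: an unappliable prediction counts as a miss


# Toy example: a withdrawal that also writes a log entry (a side effect).
before = {"balance": 100, "log": {}}
after = {"balance": 70, "log": {"t1": "withdraw 30"}}
full = [
    {"op": "modify", "path": ["balance"], "value": 70},
    {"op": "add", "path": ["log", "t1"], "value": "withdraw 30"},
]
partial = full[:1]  # forgets the auxiliary log update
```

Here `config_match(before, full, after)` holds while `config_match(before, partial, after)` does not, illustrating how a single neglected side-effect update is enough to fail CM.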

#### Reasoning enhancement\.

Training data is augmented with explicit intermediate derivations, formatted as structured reasoning traces that walk through the transition logic of $\texttt{code}(a_t)$ step by step before producing $(\hat{o}_t,\widehat{\Delta})$. These traces teach the supervised model to internalize code execution as a sequential inference process rather than a pattern-matching lookup.

## 5 Experimental Results and Analysis

### 5.1 Evaluation of Frontier LLMs

#### Setup\.

We evaluate seven frontier language models: DeepSeek-V3.2, Qwen3.5-397B-A17B, GPT-5.4, Gemini-3.1-Pro-Preview, Claude-Sonnet-4.6, MiniMax-M2.7, and GLM-5, all in non-thinking mode via their respective inference APIs, with identical prompts instantiating the MDP triple $(s_t,a_t,\texttt{code}(a_t))$. Ground-truth labels are produced entirely by the deterministic external executor. Full evaluation details are provided in Appendix [G](https://arxiv.org/html/2605.07247#A7).

Table 2: Results on Failure and No-Change samples. Bold: column-wise maximum; red: column-wise minimum.
#### Failure and No\-Change groups\.

Table [2](https://arxiv.org/html/2605.07247#S5.T2) shows that CM is near-perfect (97–100%) across all state-preserving operations, confirming that every model correctly withholds predicted state changes when none are warranted. FM exhibits wider variation. Notably, Claude-Sonnet-4.6 achieves only 15% FM on the Failure group despite 100% CM, a divergence that reveals a systematic output-format mismatch: the model omits the enclosing structured response schema and returns only the inner message string. Concretely, whereas the ground-truth observation $o_t$ takes the form of a key-value structured response $\langle \textit{status}: \text{false},\, \textit{msg}: \varepsilon \rangle$, the model outputs only $\varepsilon$ directly. MiniMax-M2.7 exhibits the same unwrapping behavior on success-type responses. Because this mismatch is an artifact of a formatting convention rather than a reasoning failure, CM remains unaffected. We adopt CM as the primary cross-model reasoning metric throughout.

Table 3: Averaged results on state-changing groups. CM degrades sharply with $|\Delta|$. Bold: column-wise maximum; red: column-wise minimum.
#### State\-changing groups \(Simple / Medium / Difficult\)\.

Table [3](https://arxiv.org/html/2605.07247#S5.T3) summarizes CM and FM across groups requiring actual state updates. CM degrades sharply with $|\Delta|$: most models fall below 20% on Medium and approach zero on Difficult, while FM remains substantially higher throughout. This persistent FM–CM gap reveals that models can generate surface-plausible response strings for an action without faithfully tracing the induced state transition $\mathcal{T}(s_t,a_t)$. Detailed per-$|\Delta|$ breakdowns appear in Appendix [A](https://arxiv.org/html/2605.07247#A1).

Table 4: Overall results. CM is robust to output-format variation; FM is depressed for models with format-unwrapping behavior (Claude-Sonnet-4.6, MiniMax-M2.7). Bold: column-wise maximum; red: column-wise minimum.

Finding 1: Substantial capability gaps exist across frontier LLMs. Overall CM ranges from 32.5% (DeepSeek-V3.2) to 42.3% (Qwen3.5-397B-A17B), a nearly 10 pp spread that originates *entirely* from state-changing samples, since all models achieve $\geq$99% CM on state-preserving operations. On state-changing samples, DeepSeek-V3.2 achieves only 10.0% CM while the best frontier models reach 22–23%. This gap is not explained by model scale: DeepSeek-V3.2 scores competitively on standard reasoning benchmarks yet exhibits the largest CM deficit here, suggesting that *EnvSim Ability* is a capability orthogonal to general reasoning proficiency (strong reasoning does not imply strong simulation).

Finding 2 (State-change cliff): All models collapse in CM when $|\Delta| \geq 3$. Per-$|\Delta|$ results in Appendix [A](https://arxiv.org/html/2605.07247#A1) reveal a universal threshold. At $|\Delta| = 1$, CM spans 36–72%; by $|\Delta| = 5$, every model falls to $\leq 4\%$. Across all samples with $|\Delta| \geq 5$, CM rates converge near zero: DeepSeek-V3.2 2.7%, Claude-Sonnet-4.6 6.0%, Gemini-3.1-Pro-Preview 9.3%, MiniMax-M2.7 10.7%, GLM-5 11.3%, GPT-5.4 12.0%. CM already drops sharply at $|\Delta| \geq 3$, and the boundary at $|\Delta| = 5$ marks a *qualitative* transition rather than a gradual decline. Below it ($|\Delta| \leq 4$), models at least partially track state changes, getting some fields right while missing others. Above it ($|\Delta| \geq 5$), predictions collapse entirely: many models output $\widehat{\Delta} = \emptyset$, reporting no state change at all, even though the ground-truth action modifies a large number of state fields. A partial CM recovery visible at $|\Delta| = 7$ for several models is not evidence of improved capability but a structural artifact: the high-$|\Delta|$ samples that models answer correctly are dominated by *bulk-uniform* operations whose changes instantiate a single repeating template, a qualitatively easier task than the heterogeneous multi-field updates that cause the cliff (see Appendix [A.4](https://arxiv.org/html/2605.07247#A1.SS4)).

Finding 3: FM and CM expose two partially orthogonal failure modes. *Format mismatch* is model-specific and $|\Delta|$-independent: models that omit the enclosing response schema show FM near 10–15% even on state-preserving operations, while CM remains at 97–100%. This failure is absent from the CM column, confirming CM as the more reliable reasoning metric.

*State-tracking failure* is universal and $|\Delta|$-dependent, manifesting in three sub-patterns: (i) models systematically neglect auxiliary state updates induced as side effects of the primary action; (ii) fields whose values are determined only at execution time (such as clock-generated timestamps) cannot be correctly predicted without actually running the transition logic; and (iii), most consequentially, a model may produce $\hat{o}_t = o_t$ while simultaneously computing $\hat{s}'_t \neq s'_t$, giving the agent no divergence signal whatsoever. On $|\Delta| \geq 3$ samples where CM fails, 50–64% of failures also have $\text{FM} = 1$ for DeepSeek-V3.2, GLM-5, GPT-5.4, and Gemini-3.1-Pro-Preview. This means the environment state is silently corrupted while the agent’s reward signal remains intact. Quantitative breakdowns and cross-model case studies for all three sub-patterns are provided in Appendix [D](https://arxiv.org/html/2605.07247#A4).
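The silent-corruption statistic in sub-pattern (iii) is a conditional rate: among samples where CM fails, the share where FM still passes. A minimal sketch follows, assuming per-sample results are stored as dicts with boolean `fm` and `cm` fields (a hypothetical layout, with toy data rather than benchmark results).

```python
def silent_corruption_rate(results):
    """Among samples where CM fails, return the share where FM still passes."""
    cm_failures = [r for r in results if not r["cm"]]
    if not cm_failures:
        return 0.0
    silent = sum(1 for r in cm_failures if r["fm"])
    return silent / len(cm_failures)

# Toy results, not taken from the benchmark.
results = [
    {"fm": True, "cm": False},   # observation looks right, state is wrong
    {"fm": True, "cm": False},
    {"fm": False, "cm": False},  # both wrong: the failure is at least visible
    {"fm": True, "cm": True},    # fully correct, excluded from the denominator
]
rate = silent_corruption_rate(results)
print(f"{rate:.2f}")
```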

### 5.2 Evaluation of Small Models: Before and After SFT

#### Setup\.

We fine-tune Qwen3-4B-Base via full-parameter SFT on 2$\times$A800 (80 GB) GPUs, evaluating on an `env_id` holdout set that excludes all 167 benchmark environments from training. A pre-SFT scaling baseline confirms that specialization is required beyond scale alone (Appendix [I](https://arxiv.org/html/2605.07247#A9)). Full hyperparameter details and training method comparisons are provided in Appendix [G](https://arxiv.org/html/2605.07247#A7). Full-parameter SFT substantially improves FM ($27\% \to 72.8\%$, $2.7\times$) and corrects output-schema alignment; the state-change cliff persists at the 5K-sample scale, confirming it as a fundamental capability gap. Detailed per-group results and a comparison of training strategies are reported in Appendix [J](https://arxiv.org/html/2605.07247#A10).
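An environment-level holdout of this kind can be sketched as follows. The sketch assumes each sample carries an `env_id` field; the function name and the toy identifiers are hypothetical, and the authors' actual split code may differ.

```python
def split_by_env(samples, holdout_env_ids):
    """Keep every sample from a held-out environment out of the training set."""
    holdout = set(holdout_env_ids)
    train = [s for s in samples if s["env_id"] not in holdout]
    test = [s for s in samples if s["env_id"] in holdout]
    return train, test

# Toy pool with made-up environment identifiers.
samples = [
    {"env_id": "env_a", "step": 1},
    {"env_id": "env_a", "step": 2},
    {"env_id": "env_b", "step": 1},
    {"env_id": "env_c", "step": 1},
]
train, test = split_by_env(samples, holdout_env_ids=["env_b"])

# No environment appears on both sides of the split.
train_envs = {s["env_id"] for s in train}
test_envs = {s["env_id"] for s in test}
print(train_envs & test_envs)
```

Splitting on `env_id` rather than on individual samples prevents trajectories from the same environment from leaking between training and evaluation.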

![Refer to caption](https://arxiv.org/html/2605.07247v1/x1.png)

Figure 3: Config Match vs. $|\Delta|$: Full-Balance2 vs. frontier LLMs. All frontier models (thin lines) drop sharply at $|\Delta| \geq 3$. Balance2 (violet, thick) outperforms all frontier LLMs at $|\Delta| \in \{1,2,3,4\}$ by up to $+10$ pp. Both collapse toward near-zero at $|\Delta| \geq 5$.

Finding 4: Reasoning-augmented traces do not consistently improve CM at 5K samples. As shown in Table [19](https://arxiv.org/html/2605.07247#A10.T19), full SFT with structured reasoning traces achieves 30.3% CM, compared with 35.0% without structured reasoning, a 4.7 pp regression. Traces hurt on Simple samples (20% vs. 34%), likely because they add distributional noise where transition logic is short, while providing a modest gain on Difficult samples (10% vs. 6%). We speculate that consistent benefits require either substantially more data ($\geq$20K samples) or difficulty-conditioned routing that applies traces selectively to complex samples.

Finding 5 (Data composition): Composition governs generalization more strongly than volume. We train Balance2 by mirroring the empirical difficulty distribution of $|\Delta|$ across source environments: 1,000 failure + 1,000 no-change + 2,000 simple-change + 2,230 complex-change samples (6,230 total). This balanced coverage prevents the model from developing pathological biases toward any single difficulty tier. A full composition ablation comparing alternative data mixtures is provided in Table [17](https://arxiv.org/html/2605.07247#A8.T17) (Appendix [H](https://arxiv.org/html/2605.07247#A8)). Balance2 achieves 45.3% overall CM and 79.5% FM, surpassing all frontier LLMs on CM ($+3.0$ pp over Qwen3.5-397B-A17B) and nearly matching the best FM (79.5% vs. 80.5% for GLM-5). The advantage is concentrated in the deployable regime: Balance2 leads by up to $+10$ pp at $|\Delta| \in \{1,2,3,4\}$, while all approaches converge near zero at $|\Delta| \geq 5$. Figure [3](https://arxiv.org/html/2605.07247#S5.F3) visualizes Balance2’s per-$|\Delta|$ advantage over all frontier models across the Simple and Medium tiers.
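The Balance2 mixture is defined by per-tier sample counts, which can be expressed as a small sampling routine. The sketch below uses the tier sizes stated above; the tier keys, the `sample_mixture` helper, and the pool layout are hypothetical names introduced for illustration only.

```python
import random

# Tier sizes reported for the Balance2 mixture.
BALANCE2_TARGETS = {
    "failure": 1000,
    "no_change": 1000,
    "simple_change": 2000,
    "complex_change": 2230,
}

def sample_mixture(pool, targets, seed=0):
    """Draw a fixed number of samples per tier, then shuffle the result."""
    rng = random.Random(seed)
    mixture = []
    for tier, count in targets.items():
        candidates = pool[tier]
        if len(candidates) < count:
            raise ValueError(f"tier {tier!r} has only {len(candidates)} samples")
        mixture.extend(rng.sample(candidates, count))
    rng.shuffle(mixture)
    return mixture

total = sum(BALANCE2_TARGETS.values())
print(total)  # 6230
```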

#### Downstream Validation: EnvScaler Pipeline\.

We integrate Full-Balance2 into the EnvScaler synthesis pipeline in place of its large-model ensemble. Synthesized environments are retained only if their pass accuracy meets a quality threshold of 0.85. The original pipeline yields 191 passing environments; Full-Balance2 yields 204, a 6.8% improvement in synthesis yield, with an approximately $59\times$ lower parameter count. This demonstrates that targeted specialization at the 4B scale is Pareto-superior to frontier-model pipelines on both cost and synthesis quality.
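The yield figure follows directly from the two environment counts. The short sketch below reproduces the arithmetic and shows the retention filter in its simplest form; the `pass_accuracy` field name is an assumption, and "meets the threshold" is read here as greater than or equal to 0.85.

```python
QUALITY_THRESHOLD = 0.85  # pass-accuracy threshold stated in the text

def retained(envs, threshold=QUALITY_THRESHOLD):
    """Keep environments whose pass accuracy meets the threshold (assumed >=)."""
    return [e for e in envs if e["pass_accuracy"] >= threshold]

def relative_gain(baseline, new):
    """Relative improvement of `new` over `baseline`."""
    return (new - baseline) / baseline

gain = relative_gain(191, 204)
print(f"{gain:.1%}")  # 6.8%
```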

## 6 Discussion and Conclusion

Why is environment simulation hard? Our results expose two orthogonal failure modes. *Format mismatch* generates structurally different feedback strings that fail exact matching despite correct semantics, motivating semantic metrics as future complements to FM. *State-tracking failure* affects all models: predicting which nested fields change and to what values requires faithful execution of the tool’s logic, a capability closer to program execution than to language modeling. Our SFT approach partially addresses this, but cases with $|\Delta| \geq 5$ remain challenging.

Cost-quality trade-off. Full-Balance2 surpasses all frontier LLMs on CM (45.3% vs. 42.3%) and nearly matches the best FM (79.5% vs. 80.5%), while running at dramatically lower cost, both on the benchmark and in the downstream synthesis pipeline. This demonstrates that EnvSimBench enables principled cost-quality comparison and targeted model improvement.

Limitations. (1) Benchmark sample size and statistical reliability. Fine-grained per-$|\Delta|$ analysis in the Difficult group rests on strata as small as $n = 4$–$6$, where a single sample shifts CM by 17–25 pp. The Failure group ($n = 20$) shows the same fragility. Deterministic ground-truth labels do not eliminate selection-induced sampling variance, and bootstrap error bounds should accompany all quantitative findings in future work. (2) Circular downstream evaluation. The synthesis quality filter overlaps in distribution with SFT training data, introducing circular validation. Scarcity of high-$|\Delta|$ training samples further compounds this; targeted data augmentation for $|\Delta| \geq 5$ remains the primary open challenge.
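The fragility described in limitation (1) can be made concrete. With $n$ samples in a stratum, one sample moves CM by $100/n$ pp, and a percentile bootstrap gives a rough error bound on a stratum-level CM rate. The sketch below is one possible way to compute both, using only the standard library; the helper names and the toy outcomes are illustrative, not the authors' procedure.

```python
import random

def per_sample_shift_pp(n):
    """Percentage points by which one sample changes an accuracy over n samples."""
    return 100.0 / n

def bootstrap_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

print(per_sample_shift_pp(4))            # 25.0
print(round(per_sample_shift_pp(6), 1))  # 16.7

# Toy stratum: 1 correct out of 6 (not benchmark data).
lo, hi = bootstrap_ci([1, 0, 0, 0, 0, 0])
print(lo, hi)
```

With strata this small, the resulting interval is wide, which is the reason per-$|\Delta|$ comparisons in the Difficult group should be read with caution.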

Conclusion. We introduced EnvSimBench, a benchmark that reframes LLM environment simulation as a fully observable state prediction task, enabling objective, LLM-free evaluation across 400 samples and 167 diverse tool-interactive environments. Our central finding, the state-change cliff, reveals that simulation fidelity is a capability orthogonal to general reasoning ability and degrades catastrophically beyond three simultaneous state updates. A constraint-driven MDP paradigm addresses this by making transition logic explicit, and targeted fine-tuning with balanced data composition produces a 4B model that surpasses all frontier LLMs on Config Match while reducing synthesis costs by over 90%. We release EnvSimBench as a public diagnostic framework and hope it grounds future research on reliable environment simulation, a prerequisite for trustworthy scalable agent training.

## Acknowledgments and Disclosure of Funding

## References

- \[1\]\(2025\)τ2\\tau^\{2\}\-Bench: evaluating conversational agents in a dual\-control environment\.External Links:2506\.07982,[Link](https://arxiv.org/abs/2506.07982)Cited by:[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2)\.
- \[2\]DeepSeek\-AI, A\. Liu, A\. Mei, B\. Lin, B\. Xue, B\. Wang, B\. Xu, B\. Wu, B\. Zhang, C\. Lin, C\. Dong, C\. Lu, C\. Zhao, C\. Deng, C\. Xu, C\. Ruan, D\. Dai, D\. Guo, D\. Yang, D\. Chen, E\. Li, F\. Zhou, F\. Lin, F\. Dai, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Xu, H\. Li, H\. Liang, H\. Wei, H\. Zhang, H\. Luo, H\. Ji, H\. Ding, H\. Tang, H\. Cao, H\. Gao, H\. Qu, H\. Zeng, J\. Huang, J\. Li, J\. Xu, J\. Hu, J\. Chen, J\. Xiang, J\. Yuan, J\. Cheng, J\. Zhu, J\. Ran, J\. Jiang, J\. Qiu, J\. Li, J\. Song, K\. Dong, K\. Gao, K\. Guan, K\. Huang, K\. Zhou, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Wang, L\. Zhao, L\. Yin, L\. Guo, L\. Luo, L\. Ma, L\. Wang, L\. Zhang, M\. S\. Di, M\. Y\. Xu, M\. Zhang, M\. Zhang, M\. Tang, M\. Zhou, P\. Huang, P\. Cong, P\. Wang, Q\. Wang, Q\. Zhu, Q\. Li, Q\. Chen, Q\. Du, R\. Xu, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. Yin, R\. Xu, R\. Shen, R\. Zhang, S\. H\. Liu, S\. Lu, S\. Zhou, S\. Chen, S\. Cai, S\. Chen, S\. Hu, S\. Liu, S\. Hu, S\. Ma, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. Zhou, T\. Ni, T\. Yun, T\. Pei, T\. Ye, T\. Yue, W\. Zeng, W\. Liu, W\. Liang, W\. Pang, W\. Luo, W\. Gao, W\. Zhang, X\. Gao, X\. Wang, X\. Bi, X\. Liu, X\. Wang, X\. Chen, X\. Zhang, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yu, X\. Li, X\. Yang, X\. Li, X\. Chen, X\. Su, X\. Pan, X\. Lin, X\. Fu, Y\. Q\. Wang, Y\. Zhang, Y\. Xu, Y\. Ma, Y\. Li, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Qian, Y\. Yu, Y\. Zhang, Y\. Ding, Y\. Shi, Y\. Xiong, Y\. He, Y\. Zhou, Y\. Zhong, Y\. Piao, Y\. Wang, Y\. Chen, Y\. Tan, Y\. Wei, Y\. Ma, Y\. Liu, Y\. Yang, Y\. Guo, Y\. Wu, Y\. Wu, Y\. Cheng, Y\. Ou, Y\. Xu, Y\. Wang, Y\. Gong, Y\. Wu, Y\. Zou, Y\. Li, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Z\. F\. Wu, Z\. Z\. Ren, Z\. Zhao, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Gou, Z\. Ma, Z\. Yan, Z\. Shao, Z\. Huang, Z\. Wu, Z\. Li, Z\. Zhang, Z\. Xu, Z\. Wang, Z\. Gu, Z\. Zhu, Z\. Li, Z\. Zhang, Z\. Xie, Z\. 
Gao, Z\. Pan, Z\. Yao, B\. Feng, H\. Li, J\. L\. Cai, J\. Ni, L\. Xu, M\. Li, N\. Tian, R\. J\. Chen, R\. L\. Jin, S\. S\. Li, S\. Zhou, T\. Sun, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Song, X\. Zhou, Y\. X\. Zhu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Z\. Huang, Z\. Xu, Z\. Zhang, D\. Ji, J\. Liang, J\. Guo, J\. Chen, L\. Xia, M\. Wang, M\. Li, P\. Zhang, R\. Chen, S\. Sun, S\. Wu, S\. Ye, T\. Wang, W\. L\. Xiao, W\. An, X\. Wang, X\. Sun, X\. Wang, Y\. Tang, Y\. Zha, Z\. Zhang, Z\. Ju, Z\. Zhang, and Z\. Qu\(2025\)DeepSeek\-v3\.2: pushing the frontier of open large language models\.External Links:2512\.02556,[Link](https://arxiv.org/abs/2512.02556)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p2.1)\.
- \[3\]X\. Deng, Y\. Gu, B\. Zheng, S\. Chen, S\. Stevens, B\. Wang, H\. Sun, and Y\. Su\(2023\)Mind2Web: towards a generalist agent for the web\.External Links:2306\.06070,[Link](https://arxiv.org/abs/2306.06070)Cited by:[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2)\.
- \[4\]Y\. Elazar, N\. Kassner, S\. Ravfogel, A\. Ravichander, E\. Hovy, H\. Schütze, and Y\. Goldberg\(2021\)Measuring and improving consistency in pretrained language models\.Transactions of the Association for Computational Linguistics9,pp\. 1012–1031\.External Links:[Link](https://aclanthology.org/2021.tacl-1.60/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00410)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p4.1)\.
- \[5\]R\. Froger, P\. Andrews, M\. Bettini, A\. Budhiraja, R\. S\. Cabral, V\. Do, E\. Garreau, J\. Gaya, H\. Laurençon, M\. Lecanu, K\. Malkan, D\. Mekala, P\. Ménard, G\. M\. Bertran, U\. Piterbarg, M\. Plekhanov, M\. Rita, A\. Rusakov, V\. Vorotilov, M\. Wang, I\. Yu, A\. Benhalloum, G\. Mialon, and T\. Scialom\(2025\)ARE: scaling up agent environments and evaluations\.External Links:2509\.17158,[Link](https://arxiv.org/abs/2509.17158)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p2.1)\.
- \[6\]M\. Hausknecht, P\. Ammanabrolu, M\. Côté, and X\. Yuan\(2020\)Interactive fiction games: a colossal adventure\.External Links:1909\.05398,[Link](https://arxiv.org/abs/1909.05398)Cited by:[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px1.p1.1)\.
- \[7\]Y\. Huang, S\. Li, Z\. Fan, M\. LIU, W\. Liu, and Y\. R\. Fung\(2025\)Scaling environments for LLM agents: fundamentals, approaches, and future directions\.InWorkshop on Scaling Environments for Agents,External Links:[Link](https://openreview.net/forum?id=9axZcDTiJm)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p2.1)\.
- \[8\]Y\. Huang, J\. Shi, Y\. Li, C\. Fan, S\. Wu, Q\. Zhang, Y\. Liu, P\. Zhou, Y\. Wan, N\. Z\. Gong, and L\. Sun\(2024\)MetaTool benchmark for large language models: deciding whether to use tools and which to use\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=R0c2qtalgG)Cited by:[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2)\.
- \[9\]Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. J\. Bang, A\. Madotto, and P\. Fung\(2023\-03\)Survey of hallucination in natural language generation\.ACM Computing Surveys55\(12\),pp\. 1–38\.External Links:ISSN 1557\-7341,[Link](http://dx.doi.org/10.1145/3571730),[Document](https://dx.doi.org/10.1145/3571730)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p4.1),[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px3.p1.1)\.
- \[10\]C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. R\. Narasimhan\(2024\)SWE\-bench: can language models resolve real\-world github issues?\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=VTF8yNQM66)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p2.1),[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px1.p1.1)\.
- \[11\]C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan\(2024\)SWE\-bench: can language models resolve real\-world github issues?\.External Links:2310\.06770,[Link](https://arxiv.org/abs/2310.06770)Cited by:[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2)\.
- \[12\]Y\. Li, H\. A\. Inan, X\. Yue, W\. Chen, L\. Wutschitz, J\. Kulkarni, R\. Poovendran, R\. Sim, and S\. Rajmohan\(2025\)Simulating environments with reasoning models for agent training\.External Links:2511\.01824,[Link](https://arxiv.org/abs/2511.01824)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p2.1),[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px1.p1.1)\.
- \[13\]X\. Liu, H\. Yu, H\. Zhang, Y\. Xu, X\. Lei, H\. Lai, Y\. Gu, H\. Ding, K\. Men, K\. Yang, S\. Zhang, X\. Deng, A\. Zeng, Z\. Du, C\. Zhang, S\. Shen, T\. Zhang, Y\. Su, H\. Sun, M\. Huang, Y\. Dong, and J\. Tang\(2025\)AgentBench: evaluating llms as agents\.External Links:2308\.03688,[Link](https://arxiv.org/abs/2308.03688)Cited by:[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2)\.
- \[14\]J\. Lu, T\. Holleis, Y\. Zhang, B\. Aumayer, F\. Nan, F\. Bai, S\. Ma, S\. Ma, M\. Li, G\. Yin, Z\. Wang, and R\. Pang\(2025\)ToolSandbox: a stateful, conversational, interactive evaluation benchmark for llm tool use capabilities\.External Links:2408\.04682,[Link](https://arxiv.org/abs/2408.04682)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p1.1),[§1](https://arxiv.org/html/2605.07247#S1.p4.1),[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2),[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px3.p1.1)\.
- \[15\]J\. Luo, W\. Zhang, Y\. Yuan, Y\. Zhao, J\. Yang, Y\. Gu, B\. Wu, B\. Chen, Z\. Qiao, Q\. Long, R\. Tu, X\. Luo, W\. Ju, Z\. Xiao, Y\. Wang, M\. Xiao, C\. Liu, J\. Yuan, S\. Zhang, Y\. Jin, F\. Zhang, X\. Wu, H\. Zhao, D\. Tao, P\. S\. Yu, and M\. Zhang\(2025\)Large language model agent: a survey on methodology, applications and challenges\.External Links:2503\.21460,[Link](https://arxiv.org/abs/2503.21460)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p1.1)\.
- \[16\]F\. A\. Oliehoek and C\. Amato\(2016\)A concise introduction to decentralized pomdps\.Springer\.Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p6.1)\.
- \[17\]S\. G\. Patil, T\. Zhang, X\. Wang, and J\. E\. Gonzalez\(2023\)Gorilla: large language model connected with massive apis\.arXiv preprint arXiv:2305\.15334\.Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p4.1),[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px3.p1.1)\.
- \[18\]S\. G\. Patil, H\. Mao, C\. Cheng\-Jie Ji, F\. Yan, V\. Suresh, I\. Stoica, and J\. E\. Gonzalez\(2024\)The berkeley function calling leaderboard \(bfcl\): from tool use to agentic evaluation of large language models\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p1.1),[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2)\.
- \[19\]A\. Prabhakar, Z\. Liu, M\. Zhu, J\. Zhang, T\. M\. Awalgaonkar, S\. Wang, Z\. Liu, H\. Chen, T\. Q\. Hoang, J\. C\. Niebles, S\. Heinecke, W\. Yao, H\. Wang, S\. Savarese, and C\. Xiong\(2026\)APIGen\-MT: agentic pipeline for multi\-turn data generation via simulated agent\-human interplay\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=qk6ORqQ4Cu)Cited by:[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2)\.
- \[20\]A\. Prabhakar, Z\. Liu, M\. Zhu, J\. Zhang, T\. Awalgaonkar, S\. Wang, Z\. Liu, H\. Chen, T\. Hoang, J\. C\. Niebles, S\. Heinecke, W\. Yao, H\. Wang, S\. Savarese, and C\. Xiong\(2025\)APIGen\-mt: agentic pipeline for multi\-turn data generation via simulated agent\-human interplay\.External Links:2504\.03601,[Link](https://arxiv.org/abs/2504.03601)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p4.1)\.
- \[21\]C\. Qian, Z\. Liu, A\. Prabhakar, Z\. Liu, J\. Zhang, H\. Chen, H\. Ji, W\. Yao, S\. Heinecke, S\. Savarese, C\. Xiong, and H\. Wang\(2025\)UserBench: an interactive gym environment for user\-centric agents\.External Links:2507\.22034,[Link](https://arxiv.org/abs/2507.22034)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p1.1)\.
- \[22\]Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian, S\. Zhao, L\. Hong, R\. Tian, R\. Xie, J\. Zhou, M\. Gerstein, d\. li, Z\. Liu, and M\. Sun\(2024\)ToolLLM: facilitating large language models to master 16000\+ real\-world apis\.InInternational Conference on Learning Representations,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 9695–9717\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/28e50ee5b72e90b50e7196fde8ea260e-Paper-Conference.pdf)Cited by:[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2)\.
- \[23\]S\. Rajbhandari, J\. Rasley, O\. Ruwase, and Y\. He\(2020\)ZeRO: memory optimizations toward training trillion parameter models\.InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis,SC ’20\.External Links:ISBN 9781728199986Cited by:[§G\.3](https://arxiv.org/html/2605.07247#A7.SS3.p1.4)\.
- \[24\]Y\. Ruan, H\. Dong, A\. Wang, S\. Pitis, Y\. Zhou, J\. Ba, Y\. Dubois, C\. J\. Maddison, and T\. Hashimoto\(2024\)Identifying the risks of lm agents with an lm\-emulated sandbox\.External Links:2309\.15817,[Link](https://arxiv.org/abs/2309.15817)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p4.1),[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2)\.
- \[25\]M\. Shridhar, X\. Yuan, M\. Cote, Y\. Bisk, A\. Trischler, and M\. Hausknecht\(2021\)\{alfw\}orld: aligning text and embodied environments for interactive learning\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=0IOX0YcCdTn)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p2.1)\.
- \[26\]X\. Song, H\. Chang, G\. Dong, Y\. Zhu, J\. Wen, and Z\. Dou\(2026\)EnvScaler: scaling tool\-interactive environments for llm agent via programmatic synthesis\.External Links:2601\.05808,[Link](https://arxiv.org/abs/2601.05808)Cited by:[Figure 1](https://arxiv.org/html/2605.07247#S1.F1),[§1](https://arxiv.org/html/2605.07247#S1.p2.1),[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px1.p1.1)\.
- \[27\]S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan\(2023\)WebShop: towards scalable real\-world web interaction with grounded language agents\.External Links:2207\.01206,[Link](https://arxiv.org/abs/2207.01206)Cited by:[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2)\.
- \[28\]S\. Yao, N\. Shinn, P\. Razavi, and K\. R\. Narasimhan\(2025\)τ\\tau\-Bench: a benchmark for tool\-agent\-user interaction in real\-world domains\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=roNSXZpUDN)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p1.1),[§1](https://arxiv.org/html/2605.07247#S1.p4.1),[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2)\.
- \[29\]A\. Zeng, M\. Liu, R\. Lu, B\. Wang, X\. Liu, Y\. Dong, and J\. Tang\(2023\)AgentTuning: enabling generalized agent abilities for llms\.External Links:2310\.12823,[Link](https://arxiv.org/abs/2310.12823)Cited by:[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2)\.
- \[30\]Y\. Zhang, Y\. Li, L\. Cui, D\. Cai, L\. Liu, T\. Fu, X\. Huang, E\. Zhao, Y\. Zhang, C\. Xu, Y\. Chen, L\. Wang, A\. T\. Luu, W\. Bi, F\. Shi, and S\. Shi\(2025\)Siren’s song in the ai ocean: a survey on hallucination in large language models\.External Links:2309\.01219,[Link](https://arxiv.org/abs/2309.01219)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p4.1),[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px3.p1.1)\.
- \[31\]Y\. Zheng, R\. Zhang, J\. Zhang, Y\. Ye, and Z\. Luo\(2024\-08\)LlamaFactory: unified efficient fine\-tuning of 100\+ language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\),Y\. Cao, Y\. Feng, and D\. Xiong \(Eds\.\),Bangkok, Thailand,pp\. 400–410\.External Links:[Link](https://aclanthology.org/2024.acl-demos.38/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-demos.38)Cited by:[§G\.3](https://arxiv.org/html/2605.07247#A7.SS3.p1.4)\.
- \[32\]S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig\(2024\)WebArena: a realistic web environment for building autonomous agents\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=oikOpiigXo)Cited by:[§1](https://arxiv.org/html/2605.07247#S1.p2.1)\.
- \[33\]S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig\(2024\)WebArena: a realistic web environment for building autonomous agents\.External Links:2307\.13854,[Link](https://arxiv.org/abs/2307.13854)Cited by:[§2](https://arxiv.org/html/2605.07247#S2.SS0.SSS0.Px2.p1.2)\.

## Appendix A Detailed Per-Delta Experimental Results

### A.1 Simple Group ($|\Delta| \in \{1,2\}$)

Table 5: Results on Simple Group. Bold: column-wise maximum; red: column-wise minimum.

### A.2 Medium Group ($|\Delta| \in \{3,4,5,6\}$)

Table 6: Results on Medium Group. Bold: column-wise maximum; red: column-wise minimum.

### A.3 Difficult Group ($|\Delta| \in \{7,\ldots,12\}$)

Table 7: Results on Difficult Group, per $|\Delta|$ value and averaged. Bold: column-wise maximum; red: column-wise minimum. The CM rebound at $|\Delta| = 7$ reflects bulk-uniform operations (see Appendix [A.4](https://arxiv.org/html/2605.07247#A1.SS4)).

### A.4 Non-Monotonicity at $|\Delta| \geq 7$: Bulk-Uniform Operations

Per-$|\Delta|$ results show a partial CM recovery at $|\Delta| = 7$, where Gemini-3.1-Pro-Preview, GLM-5, MiniMax-M2.7, and GPT-5.4 reach 31–50% CM before collapsing again at $|\Delta| \geq 8$. This rebound is not evidence of improved state-tracking capability. Inspection of the 15 Difficult-group samples on which at least one model achieves CM = True reveals a consistent structural regularity. Every such sample corresponds to a *bulk-uniform* operation, in which a single tool call produces $N$ structurally identical changes under the same path prefix. Two dominant patterns account for nearly all recoveries.

Pattern 1: Consecutive date-entry additions. A tool call such as `block_dates(property_id, start, end)` adds one availability record per day across a date range, yielding $|\Delta| = N$ changes that all follow an identical template: `availability_calendar.{property_id}.{YYYY-MM-DD}` $\leftarrow$ `{status: blocked}`, with keys that are consecutive calendar dates. Sample 376 ($|\Delta| = 10$) exemplifies this: all six models answer correctly, because identifying the pattern “add blocked entries for 2025-05-01 through 2025-05-10” reduces the task to enumerating a date sequence. No independent value derivation is required per entry. Samples 397 ($|\Delta| = 12$) and 352 ($|\Delta| = 7$) follow the same structure, and each is answered correctly by 5 of 6 models.
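To see why Pattern 1 is easy, note that the whole change set can be generated from one template and a date range. The sketch below enumerates it, assuming an inclusive range (consistent with ten entries for 2025-05-01 through 2025-05-10); the helper name and the property identifier `P1` are hypothetical.

```python
from datetime import date, timedelta

def block_dates_delta(property_id, start, end):
    """Enumerate one blocked-availability change per day in [start, end]."""
    n_days = (end - start).days + 1
    delta = {}
    for i in range(n_days):
        day = (start + timedelta(days=i)).isoformat()  # YYYY-MM-DD
        delta[f"availability_calendar.{property_id}.{day}"] = {"status": "blocked"}
    return delta

delta = block_dates_delta("P1", date(2025, 5, 1), date(2025, 5, 10))
print(len(delta))  # 10
```

Every entry has the same value and the keys differ only by date, so no per-field reasoning is needed once the template is recognized.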

Pattern 2: Multi-field update on a single object. A call that overwrites all fields of one record (e.g., `update_reservation`) produces $|\Delta|$ equal to the number of updatable fields. Still, every change concerns the same object (e.g., `reservations.RES2.*`) and its values are read directly from the call arguments. Sample 377 ($|\Delta| = 10$) achieves 5/6 correct for exactly this reason: no cross-object inference or runtime-value derivation is required.

The contrast with Medium-group failures is instructive. A $|\Delta| = 5$ sample such as `verify_otp` requires five heterogeneous updates across three sub-objects, including a clock-generated timestamp whose value is unknowable without executing the transition logic. Each field demands an independent derivation. Bulk-uniform samples collapse this complexity into a single template instantiated $N$ times, which is structurally easier despite the larger $|\Delta|$.

This distinction has a direct implication for benchmark design. The Difficult tier currently conflates two qualitatively different difficulty sources: *heterogeneous multi-field* updates, which cause a permanent CM collapse, and *homogeneous bulk* updates, which remain tractable at high $|\Delta|$. Future iterations of EnvSimBench should stratify these two sub-types separately to provide more precise diagnostic resolution at the upper end of the complexity axis.
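One possible way to separate the two sub-types is a path-prefix heuristic: a change set is treated as bulk-uniform when all changed paths share a single parent object. This criterion is an assumption made for illustration, not the authors' definition, and the example paths below are invented.

```python
def parent_prefixes(paths):
    """Return the set of parent objects, i.e. each path minus its last segment."""
    return {p.rsplit(".", 1)[0] for p in paths}

def looks_bulk_uniform(paths):
    """Heuristic (assumed): several changes that all sit under one parent prefix."""
    return len(paths) > 1 and len(parent_prefixes(paths)) == 1

# Invented examples: a date-range block vs. a multi-object update.
bulk = [f"availability_calendar.P1.2025-05-{d:02d}" for d in range(1, 8)]
hetero = [
    "users.U1.verified",
    "otp_codes.C1.used",
    "otp_codes.C1.used_at",
    "sessions.S1.token",
    "sessions.S1.created_at",
]

print(looks_bulk_uniform(bulk))    # True
print(looks_bulk_uniform(hetero))  # False
```

A production stratifier would likely also compare value templates, since a shared prefix alone does not guarantee that values are trivially derivable.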

## Appendix B Benchmark Construction Pipeline (Detailed)

Figure [4](https://arxiv.org/html/2605.07247#A2.F4) illustrates the full sample-selection pipeline with three-axis stratification and the diversity rule. The pipeline proceeds in three stages.

Stage 1: trajectory collection. GPT-4o-mini is deployed as an agent across all 191 EnvScaler environments, executing up to 30 steps per environment. At each step $t \geq 1$, the Python executor records the complete quadruple $(a, o, s, s')$: the tool call, the resulting observation, the before-config, and the after-config, along with the derived change set $\Delta(s, s')$. Step 0 is discarded (no before-config exists for the initial state). This yields a large pool of candidate single-turn state prediction samples.
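The change set $\Delta(s, s')$ can be derived by diffing the before- and after-configs. The following sketch flattens nested dicts into dotted paths and reports every path whose value differs; the leaf-level granularity and the helper names are assumptions, and the benchmark's own diff may count changes differently.

```python
_MISSING = object()  # sentinel distinguishing "absent" from a stored None

def flatten(config, prefix=""):
    """Flatten a nested dict into {dotted_path: leaf_value}."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def change_set(before, after):
    """Return {path: (old, new)} for every leaf that was added, removed, or modified."""
    b, a = flatten(before), flatten(after)
    delta = {}
    for path in b.keys() | a.keys():
        old, new = b.get(path, _MISSING), a.get(path, _MISSING)
        if old != new:
            delta[path] = (
                None if old is _MISSING else old,
                None if new is _MISSING else new,
            )
    return delta

# Toy configs with invented field names.
s = {"reservations": {"RES2": {"status": "pending", "guests": 2}}}
s_next = {"reservations": {"RES2": {"status": "confirmed", "guests": 2}}}
delta = change_set(s, s_next)
print(delta)  # {'reservations.RES2.status': ('pending', 'confirmed')}
```

Here $|\Delta|$ is simply `len(delta)`.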

Stage 2: three-axis stratification and filtering. Candidate samples are classified simultaneously along the three axes described in Section [4](https://arxiv.org/html/2605.07247#S4). Axis 1 partitions by action outcome (`success=True/False`); Axis 2 partitions the success group by $|\Delta|$ into No-Change, Simple, Medium, and Difficult; Axis 3 further splits Failure and No-Change subgroups by input argument cardinality (0 vs. 1). Each resulting subgroup is then subsampled to its target count (Table [1](https://arxiv.org/html/2605.07247#S4.T1)).
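Axes 1 and 2 amount to a small decision rule. The sketch below uses the $|\Delta|$ ranges given in the Appendix A group headings (Simple 1–2, Medium 3–6, Difficult 7–12); the function name is hypothetical, and Axis 3 (argument cardinality) is omitted for brevity.

```python
def classify(success: bool, delta_size: int) -> str:
    """Assign a sample to a group by action outcome and number of state changes."""
    if not success:
        return "Failure"
    if delta_size == 0:
        return "No-Change"
    if delta_size <= 2:
        return "Simple"
    if delta_size <= 6:
        return "Medium"
    return "Difficult"

for success, size in [(False, 0), (True, 0), (True, 2), (True, 5), (True, 10)]:
    print(success, size, classify(success, size))
```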

Stage 3: diversity rule. Within each subgroup, samples are selected greedily to maximize the number of distinct `env_id`s covered. If the target count exceeds the number of available environments in the subgroup, multiple samples per environment are permitted, prioritizing those with different $|\Delta|$ values. This rule ensures that the 400 benchmark samples span 167 distinct environments rather than being concentrated in a few high-trajectory environments, providing broad domain coverage for evaluation.
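A round-robin pass over environments is one simple way to realize this rule: each pass takes at most one sample per environment, and later passes prefer a $|\Delta|$ value not yet chosen for that environment. The sketch below is an interpretation of the rule, not the authors' implementation; the sample fields `env_id` and `delta_size` are assumed names.

```python
from collections import defaultdict

def select_diverse(samples, target):
    """Pick up to `target` samples, covering distinct env_ids before repeating any."""
    by_env = defaultdict(list)
    for s in samples:
        by_env[s["env_id"]].append(s)

    selected = []
    # Each pass takes at most one sample per environment.
    while len(selected) < target and any(by_env.values()):
        for env_id in list(by_env):
            if len(selected) >= target:
                break
            queue = by_env[env_id]
            if not queue:
                continue
            seen = {s["delta_size"] for s in selected if s["env_id"] == env_id}
            # Prefer a |Δ| value not yet chosen for this environment.
            pick = next((s for s in queue if s["delta_size"] not in seen), queue[0])
            queue.remove(pick)
            selected.append(pick)
    return selected

# Toy subgroup with invented identifiers.
samples = [
    {"id": 1, "env_id": "A", "delta_size": 1},
    {"id": 2, "env_id": "A", "delta_size": 1},
    {"id": 3, "env_id": "A", "delta_size": 2},
    {"id": 4, "env_id": "B", "delta_size": 1},
]
chosen = select_diverse(samples, target=3)
print([s["id"] for s in chosen])  # [1, 4, 3]
```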

![Refer to caption](https://arxiv.org/html/2605.07247v1/Images/Fig4.drawio.png)

Figure 4: Benchmark construction pipeline. *Left*: Multi-turn trajectories from 191 EnvScaler environments are collected by a GPT-4o-mini agent and preprocessed into single-turn $(a, o, s, s')$ samples; Step 0 (no before-config) is discarded. *Center*: Three-axis stratification partitions the pool by action outcome, state-change complexity, and input argument cardinality. *Right*: A diversity rule maximizes distinct `env_id`s per subgroup, yielding 400 samples across 167 environments with Python-executed ground truth.

## Appendix C Performance Visualizations

### C.1 Bar Chart Analysis: Per-Group FM and CM (Figure [5](https://arxiv.org/html/2605.07247#A3.F5))

Figure [5](https://arxiv.org/html/2605.07247#A3.F5) renders FM (filled bars) and CM (hatched bars) side by side for all seven frontier models across the six difficulty subgroups. Reading across panels reveals four visual signatures.

Panels (a) Fail and (b) No-Change: format divergence with perfect state prediction. CM bars are uniformly at or near 100% across all models, forming a flat ceiling and confirming that none hallucinates a state change when none is expected. The striking exception is Claude-Sonnet-4.6 in Panel (a): its FM bar drops to approximately 15% while its CM bar stays at 100%, creating a visually dramatic gap that is immediately identifiable as a format-divergence signature rather than a reasoning failure. MiniMax-M2.7 shows a milder version of the same pattern across both panels. GPT-5.4 presents an interesting mirror image in Panel (b): its FM drops to 65% on No-Change samples (the lowest of all models), not due to format divergence but because it sometimes generates incorrect feedback *content* on zero-argument, no-change calls, a distinct failure mode from Claude’s unwrapping behavior.

Panel \(c\) Simple \(\|Δ\|∈\{1,2\}\|\\Delta\|\\in\\\{1,2\\\}\): FM–CM gap emerges\.The FM–CM gap now appears for all models: FM bars \(60–96%\) tower above CM bars \(22–50%\), visualizing that models can routinely generate plausible feedback strings while getting the state transition wrong\. Gemini\-3\.1\-Pro\-Preview exhibits the tallest CM bar \(48%\) and the highest FM bar \(96%\) in this panel, making it the strongest overall performer at low complexity\. DeepSeek\-V3\.2 stands out with the shortest CM bar \(22%\), far below its FM bar \(80%\), consistent with its large FM–CM gap identified in Appendix[F](https://arxiv.org/html/2605.07247#A6)\.

Panels \(d\) Medium \(\|Δ\|∈\{3–6\}\|\\Delta\|\\in\\\{3\\text\{\-\-\}6\\\}\) and \(e\) Difficult \(\|Δ\|∈\{7–12\}\|\\Delta\|\\in\\\{7\\text\{\-\-\}12\\\}\): the cliff floor\.Panel \(d\) shows the cliff in transition: CM bars have collapsed to roughly 8–17% across all models, while FM bars remain in the 12–82% range\. The visual contrast between tall FM bars and nearly absent CM bars is most extreme for GLM\-5 \(FM 81\.5%, CM 14%\) and DeepSeek\-V3\.2 \(FM 76\.5%, CM 8\.5%\), both of which generate fluent feedback while almost entirely losing track of state\. Panel \(e\) continues the trend: CM bars for all models are visually indistinguishable from zero \(2–28%\), confirming that the state\-change cliff is a hard performance floor rather than a gradual decline\. Interestingly, in Panel \(e\), GPT\-5\.4 and GLM\-5 maintain relatively taller FM bars \(74% and 70% respectively\), which at this delta level mostly reflects correct high\-level feedback language rather than correct state\. The gap between language modeling and code execution is widest here\.

Panel \(f\) Overall: model ranking summary\.The aggregate view makes the cross\-model contrast most accessible\. GLM\-5 has the tallest FM bar \(80\.5%\) but a moderate CM bar \(41\.0%\)\. Qwen3\.5\-397B\-A17B has the tallest CM bar \(42\.3%\)\. The two format\-divergent models, Claude\-Sonnet\-4\.6 and MiniMax\-M2\.7, have the shortest FM bars despite having CM bars comparable to the other models\. Low FM does not imply low CM\.

![Refer to caption](https://arxiv.org/html/2605.07247v1/x2.png) Figure 5: Frontier LLM performance across difficulty subgroups. FM (filled) and CM (hatched) for all seven models. Panels (a)–(b): near-perfect CM on non-mutating operations; format-divergence gap visible for Claude-Sonnet-4.6 and MiniMax-M2.7. Panels (c)–(e): the FM–CM gap widens as $|\Delta|$ grows, reaching near-zero CM on Difficult samples for all models. Panel (f): GLM-5 leads FM (80.5%); Qwen3.5-397B-A17B leads CM (42.3%).

### C.2 Heatmap Analysis: FM and CM Across All Models and $\Delta$ Levels (Figure [6](https://arxiv.org/html/2605.07247#A3.F6))

Figure [6](https://arxiv.org/html/2605.07247#A3.F6) presents two heatmaps, FM on the left, CM on the right, with models on the x-axis and $\Delta$ values (0 through 7–12) on the y-axis. Color encodes match rate from deep blue (high) to deep red (low). The two panels tell structurally different stories.

Left panel (FM): two vertical stripes of red. The dominant visual feature is a pair of uniformly red columns for Claude-Sonnet-4.6 and MiniMax-M2.7, spanning *every* row from $\Delta = 0$ to $\Delta = 7\text{--}12$. This column-uniform pattern is the visual signature of format divergence: the failure is identical in magnitude regardless of state-change complexity, confirming it is a formatting convention mismatch rather than a reasoning deficiency. The remaining five models show a mild top-to-bottom gradient (FM generally decreases slightly as $\Delta$ increases). Still, the dominant pattern is horizontal homogeneity within each column, indicating that FM is primarily a model-level property rather than a complexity-level property. The Qwen3.5-397B-A17B column is notable for being the palest blue in the Simple and Medium rows among the non-divergent models, consistent with its above-average state tracking, which shows up as slightly reduced feedback fluency under complex operations.

Right panel (CM): horizontal stripes of blue collapsing to red. The dominant visual feature is the sharp horizontal transition: the top two rows ($\Delta = 0$ and $\Delta = 1$) are deep blue across all models, indicating near-perfect or high CM; by $\Delta = 3$, every column has transitioned toward pink or red; and from $\Delta = 5$ onward the entire panel is deep red, visually confirming the state-change cliff as a hard threshold rather than a smooth degradation. The transition is steeper than even the bar charts suggest: at $\Delta = 2$, most columns already show light blue rather than deep blue (CM at 8–28%), while at $\Delta = 5$ the visual drop to near-red is universal. Two columns are worth isolating: the DeepSeek-V3.2 column is among the deepest red starting from $\Delta = 2$ (consistent with its 8% CM at $\Delta = 2$), while MiniMax-M2.7 shows a slightly paler red at $\Delta = 1$ and $\Delta = 4$ than its neighbors, consistent with MiniMax-M2.7 leading CM at Simple (50%) and Medium (16%).

Comparing the two panels jointly. The most informative comparison is reading the same row across both panels. At $\Delta = 5$, the FM panel shows a mixed landscape (blue for GLM-5 and DeepSeek-V3.2, red for the format-divergent models), while the CM panel shows uniform deep red for all models. This asymmetry is the heatmap manifestation of Finding 3, sub-pattern (iii): high FM combined with low CM at the same delta level means models are generating correct-sounding feedback while entirely failing at state tracking.

![Refer to caption](https://arxiv.org/html/2605.07247v1/x3.png) Figure 6: Heatmap view of FM and CM. *Left (FM)*: Uniformly red columns for Claude-Sonnet-4.6 and MiniMax-M2.7 across all $\Delta$ levels. The remaining columns show mild top-to-bottom gradients, confirming FM is primarily a model property. *Right (CM)*: Sharp horizontal transition from deep blue at $\Delta \leq 1$ to deep red at $\Delta \geq 5$, the state-change cliff as a visual phase boundary. Reading both panels at the same row ($\Delta = 5$) reveals the FM–CM decoupling that makes simulation failures invisible to agents.

## Appendix D Extended Failure Mode Analysis

This appendix provides expanded qualitative and quantitative analyses of the three state\-tracking failure sub\-patterns identified in Finding 3, together with cross\-model comparisons on shared benchmark samples\.

Table 8: Format divergence examples. CM = 100% in all listed cases; FM = 0%.

### D.1 Format Divergence: Full Example Set

As described in Finding 3, Claude-Sonnet-4.6 and MiniMax-M2.7 systematically unwrap the structured response schema and return only the inner message string. Table [8](https://arxiv.org/html/2605.07247#A4.T8) enumerates representative cases. In every instance, CM is 100% (state prediction is correct) while FM is 0% (string mismatch). This pattern holds uniformly across all difficulty levels and both success and failure operations.
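To make the unwrapping failure concrete, the following sketch shows how a strict feedback comparison rejects a response that carries the right message but drops the wrapper. The response schema (`success`, `message`), the example message, and the `feedback_match` helper are assumptions for illustration; the benchmark's actual schema and matching rule are not shown in this appendix.

```python
import json


def feedback_match(predicted: str, expected: dict) -> bool:
    """Strict FM sketch: the prediction must parse to the same structured object."""
    try:
        return json.loads(predicted) == expected
    except json.JSONDecodeError:
        # A bare message string is not valid JSON, so it cannot match the schema.
        return False


# Hypothetical ground-truth feedback in an assumed wrapper schema.
expected = {"success": False, "message": "User not found"}

wrapped = json.dumps(expected)     # schema preserved
unwrapped = expected["message"]    # schema stripped, only the inner string

fm_wrapped = feedback_match(wrapped, expected)
fm_unwrapped = feedback_match(unwrapped, expected)
```

Under this assumed comparator, the unwrapped response fails FM even though its content is semantically correct, which matches the CM = 100%, FM = 0% signature described above.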

One edge case deserves attention: a second Claude failure-group example predicts "Workout scheduled successfully" where the ground truth is a failure response. Here, beyond the format mismatch, the model also misclassifies the error condition. This is the rare case where FM = 0 reflects a genuine semantic error rather than a pure formatting convention difference.

This pattern reveals a counterintuitive deployment tradeoff: Claude-Sonnet-4.6 and MiniMax-M2.7, despite their low absolute FM, may be *safer* choices for simulation pipelines that use feedback mismatches as a reliability signal, since their errors are more likely to be *visible* to the monitoring layer. In contrast, models with near-perfect FM but poor CM produce silent corruptions that accumulate undetected.

### D.2 State-Change Cliff: Cross-Model Analysis on Shared Samples

#### Example 1: `verify_otp` ($|\Delta| = 5$).

This call must update five fields across three sub-objects simultaneously. Table [9](https://arxiv.org/html/2605.07247#A4.T9) shows all seven models' field-level predictions on the same sample. Ground truth requires all five fields to be updated.

Table 9: Cross-model field-level predictions on `verify_otp` ($|\Delta| = 5$). ✓ = correctly predicted; $\times$ = missed or wrong value; $\dagger$ = path correctly identified but execution-time value wrong (stale-approximation error, sub-type (b) of Appendix [D.3](https://arxiv.org/html/2605.07247#A4.SS3)). Ground truth: all five fields must be updated.

Three patterns are salient. First, the runtime-generated timestamp field is missed by *every* model: set to the current epoch at call time, it cannot be inferred statically. Second, DeepSeek-V3.2, Gemini-3.1-Pro-Preview, and GLM-5 independently converge on the same three-field prediction, suggesting a shared attentional bias toward the most linguistically salient fields. Third, GPT-5.4 collapses to zero predicted changes despite generating a feedback string, the full cliff failure mode. A fourth pattern distinguishes Qwen3.5-397B-A17B from the other six: it is the only model to identify all five field paths, yet still fails CM because its predicted `last_verification_timestamp` (1774167900.0) differs from the executor's runtime value (1774167865.92). This is a stale-approximation error rather than a path-identification failure, placing it in sub-type (b) of Appendix [D.3](https://arxiv.org/html/2605.07247#A4.SS3).
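A field-level grader in the spirit of the Table 9 legend can be sketched as follows. The two timestamp values are the ones quoted above; the `grade_fields` helper, the flat-path representation, and the `otp.verified` path are hypothetical stand-ins, since the sample's other four fields are not listed here.

```python
def grade_fields(pred, truth, runtime_fields=frozenset()):
    """Mark each ground-truth field as ok, dagger (runtime value wrong), or x."""
    marks = {}
    for path, value in truth.items():
        if path in pred and pred[path] == value:
            marks[path] = "ok"
        elif path in pred and path in runtime_fields:
            marks[path] = "dagger"   # path found, execution-time value wrong
        else:
            marks[path] = "x"        # path missed, or a static value is wrong
    return marks


# "otp.verified" is a hypothetical field; the timestamps come from the text.
truth = {"otp.verified": True, "last_verification_timestamp": 1774167865.92}
pred = {"otp.verified": True, "last_verification_timestamp": 1774167900.0}

marks = grade_fields(pred, truth, runtime_fields={"last_verification_timestamp"})
cm_pass = all(mark == "ok" for mark in marks.values())
drift_seconds = (
    pred["last_verification_timestamp"] - truth["last_verification_timestamp"]
)
```

The predicted timestamp is off by roughly 34 seconds, which is enough to fail an exact state comparison even though every field path was identified.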

#### Example 2: `return_equipment` ($|\Delta| = 5$).

The ground-truth transition requires four primary field updates plus one newly created auxiliary log record. Every model fails:

- DeepSeek-V3.2: correctly updates three primary fields; omits one primary field and the auxiliary log record. FM = True, CM = False.
- Claude-Sonnet-4.6: assigns a domain-plausible but code-absent value to the primary status field; omits the auxiliary record. FM = False, CM = False.
- Gemini-3.1-Pro-Preview / GLM-5 / MiniMax-M2.7: same incorrect status value as Claude; omit one primary field and the auxiliary record. FM = False, CM = False.
- GPT-5.4: correctly derives all four primary field values, the only model to do so, but fabricates a runtime-generated key for the auxiliary record that mismatches the executor's output. FM = True, CM = False.

Across all $|\Delta| \geq 3$ CM failures, models omit newly *added* components in 68–79% of cases where such additions exist, substantially exceeding the omission rate for *modified* components. Four models independently hallucinate the same domain-plausible but code-absent status value, illustrating how parametric world knowledge overrides code-grounded reasoning when the required value is not directly readable from $\texttt{code}(a_t)$.

### D.3 Runtime-Dependent Values: Failure Taxonomy

Timestamp-bearing and runtime-generated fields appear in the ground truth of 23–28% of all $|\Delta| \geq 3$ CM failures across models (Table [10](https://arxiv.org/html/2605.07247#A4.T10)). A model performing static reasoning over $\texttt{code}(a_t)$ cannot infer execution-time values without actually running the transition logic; it must either omit the field entirely or substitute a plausible but incorrect prior value. Three sub-patterns are observed.

Table 10: Presence of runtime-dependent fields in $|\Delta| \geq 3$ CM-failure samples.

(a) Complete omission. The model identifies the correct field path but omits it from $\widehat{\Delta}$ entirely. Dominant pattern for DeepSeek-V3.2 and GLM-5.

(b) Stale-value substitution. The model includes the field with a plausible but stale value interpolated from the before-state $s_t$. In a representative case, a model predicts a stale proxy $\hat{v}_{\tau}$ for a clock-dependent field while the executor writes the true runtime value $v^{*}_{\tau} \neq \hat{v}_{\tau}$; the surface response $\hat{o}_t = o_t$ nonetheless matches (FM = 1, CM = 0).

(c) Transaction-ID key mismatch. For auto-generated record keys encoding a runtime epoch, a model must both create the record *and* predict the correct key. A correct record body with a wrong key still fails CM, a compound challenge of hidden side effect plus runtime dependency, as illustrated by GPT-5.4 in Example 2 above.
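The three sub-patterns can be told apart mechanically once predicted and ground-truth changes are available as flat path-to-value maps. The sketch below is one possible classifier under that assumption; the function name, the labels, and all paths and values in the examples are hypothetical.

```python
def classify_runtime_failure(pred, truth, path):
    """Classify how `pred` handles the runtime-dependent entry at `path`."""
    if path in pred:
        return "correct" if pred[path] == truth[path] else "b: stale value"
    # Path absent: was the same record body created under a different key?
    same_body_elsewhere = any(
        value == truth[path] for key, value in pred.items() if key not in truth
    )
    return "c: key mismatch" if same_body_elsewhere else "a: omission"


# Hypothetical clock-dependent field.
truth_ts = {"session.last_seen": 1700000030.5}
omitted = classify_runtime_failure({}, truth_ts, "session.last_seen")
stale = classify_runtime_failure(
    {"session.last_seen": 1700000000.0}, truth_ts, "session.last_seen"
)

# Hypothetical auto-generated record whose key encodes a runtime epoch.
truth_log = {"logs.txn_1700000030": {"item": "drill", "action": "return"}}
pred_log = {"logs.txn_1700000000": {"item": "drill", "action": "return"}}
wrong_key = classify_runtime_failure(pred_log, truth_log, "logs.txn_1700000030")
```

All three outcomes fail an exact state comparison; the classifier only separates why, which is the distinction the taxonomy draws.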

### D.4 Plausible Feedback Masking Wrong State: Quantitative Breakdown

Table [11](https://arxiv.org/html/2605.07247#A4.T11) reports the fraction of CM-failure samples that also pass FM at each $|\Delta|$ level. These are the most dangerous cases for simulation-as-training: the agent's reward signal is intact while the environment state is silently wrong.

Table 11: Percentage of CM-failure samples that also pass FM, by $|\Delta|$ level. High values indicate silent state corruption that is invisible to the agent. Bold: column-wise maximum; red: column-wise minimum.

For DeepSeek-V3.2, GLM-5, GPT-5.4, and Gemini-3.1-Pro-Preview, 60–100% of CM failures at every $|\Delta|$ level are accompanied by FM = True: the agent sees no divergence signal. For Claude-Sonnet-4.6 and MiniMax-M2.7, the pattern inverts: their format divergence means errors are *more likely* also to fail FM, making divergence visible. This reveals a counterintuitive deployment tradeoff: models with format divergence, despite low absolute FM, may be safer for simulation pipelines that use feedback mismatches as a reliability signal.

In one representative case at $|\Delta| = 3$, two models independently produce the correct surface response $o_t$ but collapse a nested sub-object to a coarse scalar representation (losing field-level granularity) and substitute a stale value for the clock-dependent timestamp field: $\hat{o}_t = o_t$ but $\hat{s}'_t \neq s'_t$. A third model generates correct feedback and correctly predicts two primary fields but omits the timestamp field entirely. In both cases, the prediction passes FM and fails CM. In a training pipeline, such silent divergences accumulate and systematically bias agent behavior.
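The quantity reported in Table 11 is a conditional rate: among samples that fail CM, the share that still pass FM. A minimal sketch, assuming per-sample results are stored as dicts with boolean `fm` and `cm` flags (a hypothetical layout) and using made-up toy data:

```python
def silent_failure_rate(results):
    """Share of CM failures that still pass FM; None if there are no CM failures."""
    cm_failures = [r for r in results if not r["cm"]]
    if not cm_failures:
        return None
    return sum(r["fm"] for r in cm_failures) / len(cm_failures)


# Toy data for illustration only, not benchmark results.
results = [
    {"fm": True, "cm": True},
    {"fm": True, "cm": False},
    {"fm": True, "cm": False},
    {"fm": False, "cm": False},
    {"fm": True, "cm": False},
]
rate = silent_failure_rate(results)  # 3 of the 4 CM failures pass FM
```

A high rate means feedback mismatches are a poor alarm for state corruption, which is the risk this section quantifies.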

## Appendix E Hallucination Case Studies

The introduction identifies three concrete failure modes of LLM-based environment simulation: hallucination (fabricating incorrect state transitions), logical inconsistency (violating inter-field constraints within the same response), and statelessness (ignoring the explicitly provided before-state). This appendix presents one benchmark sample that instantiates each failure mode, verified against Python-executed ground truth.

### E.1 Failure Mode 1 Hallucination: Fabricating Non-Existent Constraint Violations

In this failure mode, the model invents a plausible business-rule violation to reject a call that the executor accepts. No such constraint exists in the tool code; the model substitutes reasoning grounded in real-world domain conventions for the actual implementation logic.

#### Sample: `add_observations` ($|\Delta| = 3$, all models fail).

The call submits three wildlife observations with IDs `obs_9`, `obs_10`, `obs_11` for distinct observers (`OBSR02`, `OBSR05`, `OBSR04`) at timestamps `1685000000`, `1686000000`, `1687000000`, each approximately 11.6 days apart. The executor accepts all three and creates the corresponding records. Every evaluated model fails this sample, and most reject the call outright and predict no state change.

Table 12: Hallucination: all six models fabricate different constraint violations to reject three valid observations. Ground truth: 3 records created, `success=True`.

The split between models is instructive. Four models (DeepSeek, Gemini, GPT-5.4, GLM-5) generate entirely different constraint violations (one introduces a temporal deduplication window; three introduce a coordinate-validity check), none of which appear anywhere in the tool code. Each fabricated constraint is domain-plausible (geospatial data pipelines often deduplicate, and coordinates have validity ranges), yet factually incorrect for this specific implementation. The remaining two models (Claude, MiniMax) correctly identify that all observations should succeed, but then hallucinate incorrect field values for the created records, resulting in a different type of state hallucination.

### E.2 Failure Mode 2 Logical Inconsistency: Self-Contradictory Response

In this failure mode, the model's own predicted feedback and predicted state changes are mutually contradictory: the feedback claims one thing happened while the state changes record something incompatible. An agent receiving this response cannot reconcile the two signals.

#### Sample: `run_bulk_analysis` ($|\Delta| = 3$, DeepSeek-V3.2).

The call requests a trend analysis for one Japanese stock, `7203.T` (Toyota). The executor processes only `7203.T` and updates three fields accordingly. DeepSeek-V3.2 produces a response that is internally self-contradictory in two ways simultaneously.

Table 13: Logical inconsistency: DeepSeek-V3.2 feedback claims "2 stocks" were analyzed, but the call only specifies 1, and the predicted state invents a full analysis for a second stock (SONY) not present in the input.

The inconsistency operates at two levels. First, the feedback count ("2 stocks") contradicts the input (1 stock was requested). Second, the predicted state changes include a complete analysis of `SONY` (with internally coherent SMA values, trend direction, and confidence score) that has no basis in the call arguments. The model correctly processes `7203.T` (the core prediction is right), then spontaneously hallucinates an entire second entity. An agent receiving this response would record a ghost analysis for SONY in the environment state, causing all subsequent reasoning about the SONY data to be grounded in fabricated values.
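Both levels of this inconsistency are checkable without an LLM, by comparing the call arguments against the predicted feedback and predicted state. The sketch below illustrates one such check; the feedback wording, the state layout, and the `consistency_issues` helper are assumptions, since only the "2 stocks" claim and the `SONY` entry are quoted in the text.

```python
import re


def consistency_issues(call_symbols, feedback, predicted_state):
    """Return a list of contradictions between call, feedback, and state."""
    issues = []
    # Level 1: does the count claimed in the feedback match the request?
    match = re.search(r"(\d+)\s+stocks?", feedback)
    if match and int(match.group(1)) != len(call_symbols):
        issues.append("feedback count differs from number of requested symbols")
    # Level 2: does the predicted state mention entities that were never requested?
    extra = sorted(set(predicted_state) - set(call_symbols))
    if extra:
        issues.append(f"state contains unrequested symbols: {extra}")
    return issues


issues = consistency_issues(
    call_symbols=["7203.T"],
    feedback="Analysis completed for 2 stocks",           # hypothetical wording
    predicted_state={"7203.T": {"trend": "up"},           # hypothetical values
                     "SONY": {"trend": "down"}},
)
```

A guard of this kind could flag the ghost `SONY` entry before it is written into the environment state, at the cost of needing per-tool knowledge of which arguments name entities.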

### E.3 Failure Mode 3 Statelessness: Ignoring the Provided Before-State

In this failure mode, the model contradicts the *explicitly provided* before-config. Because our constraint-driven paradigm supplies the full before-state in every prompt, this failure cannot be attributed to implicit state tracking; the model overrides observed facts with a hallucinated prior.

#### Sample: `complete_task` ($|\Delta| = 2$, DeepSeek-V3.2).

The tool call marks task `T-002` as completed. The before-config provided to the model explicitly states:

> `tasks.T-002.status`: `"in progress"`
>
> `tasks.T-002.completed_at`: `null`

The executor therefore accepts the call: the task transitions from `"in progress"` to `"completed"`, and `completed_at` is set to the current epoch.

Table 14: Statelessness: the model asserts the task is "already completed" despite the before-config explicitly showing `status = "in progress"`.

The model asserts `"already completed"` while simultaneously having access to the before-config that says the opposite. This is not a state-tracking failure arising from implicit multi-turn inference. It explicitly contradicts the data present in the prompt for a single-turn prediction. The most likely cause is that the model's parametric knowledge about task-management workflows (where double-completion calls are commonly guarded against) overrides its attention to the provided state. This pattern represents the purest form of hallucination in the environment simulation setting: the model's world model supersedes the ground truth it was given. The same sample was also misclassified by Gemini-3.1-Pro-Preview (which hallucinated a "dependency not met" constraint) and MiniMax-M2.7 (which hallucinated a "task not found" error), each substituting a different invented state for the explicitly provided one.
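The point that the outcome is fully determined by the supplied before-state can be illustrated with a toy executor. This is a hypothetical stand-in for the real tool code, which is not reproduced here: the guard branches, the message strings, and the fixed `now` value are all assumptions.

```python
import copy
import time


def complete_task(state, task_id, now=None):
    """Hypothetical executor: returns (feedback, after_state); input is not mutated."""
    after = copy.deepcopy(state)
    task = after["tasks"].get(task_id)
    if task is None:
        return {"success": False, "message": "task not found"}, after
    if task["status"] == "completed":
        return {"success": False, "message": "already completed"}, after
    task["status"] = "completed"
    task["completed_at"] = time.time() if now is None else now
    return {"success": True, "message": "task completed"}, after


# Before-config as quoted in the text.
before = {"tasks": {"T-002": {"status": "in progress", "completed_at": None}}}
feedback, after = complete_task(before, "T-002", now=1_700_000_000.0)

old, new = before["tasks"]["T-002"], after["tasks"]["T-002"]
changed_fields = [key for key in new if new[key] != old[key]]
```

Even if the real implementation does guard against double completion, that branch is unreachable from a before-state of `"in progress"`, so predicting "already completed" contradicts the prompt rather than the code.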

## Appendix F Cross-Model Capability Profile Analysis

#### GPT-5.4 vs. Gemini-3.1-Pro-Preview: same aggregate CM, different per-delta profiles.

Both models achieve 42.0% overall CM, but the per-delta breakdown from Appendix [A](https://arxiv.org/html/2605.07247#A1) reveals a crossover structure: Gemini-3.1-Pro-Preview leads at $\Delta \in \{1,2\}$ (68%/28% vs. 60%/20% for GPT-5.4); both are identical at $\Delta \in \{3,4\}$ (36%/24%); GPT-5.4 leads substantially at $\Delta \in \{7,8\}$ (44%/25% vs. 31%/12%). This suggests that Gemini-3.1-Pro-Preview has stronger alignment for precise, low-complexity field prediction, while GPT-5.4 has higher resilience against the most complex multi-branch conditional logic. For practitioners choosing a simulation backend, Gemini-3.1-Pro-Preview may be preferable for environments dominated by simple CRUD-style operations, while GPT-5.4 may be preferable for environments with deeply nested state transitions.

#### DeepSeek-V3.2: strong FM, weak CM.

DeepSeek-V3.2 achieves 72.5% overall FM (third highest) but only 32.5% CM (lowest), a 40 pp gap that is the largest FM–CM divergence in the evaluation. On state-changing samples, the gap reaches 58.7 pp (68.7% FM vs. 10.0% CM). As Table [11](https://arxiv.org/html/2605.07247#A4.T11) shows, 75–87% of its CM failures are accompanied by FM = True across delta levels 1–5: the model generates correctly-formatted, domain-appropriate response strings while substantially underperforming on code-execution tracing. This makes it a high-risk choice as a simulation backend for generating training data.

#### FM–CM gap as a model-selection signal.

Table [15](https://arxiv.org/html/2605.07247#A6.T15) ranks all models by the FM–CM gap on state-changing samples. A large positive gap indicates the model generates believable but inaccurate simulations; a near-zero or negative gap indicates visible failures. For use cases where the reliability of the training signal matters more than surface fluency, CM alone is the appropriate criterion for backend selection.

Table 15: FM–CM gap on state-changing samples. Large gaps indicate models that generate correct-sounding feedback despite a wrong state, which is dangerous for simulation. Bold: column-wise maximum; red: column-wise minimum.
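The gap itself is a simple difference in percentage points. The sketch below recomputes it from the figures quoted in this appendix and in Appendix C (overall scores for DeepSeek-V3.2 and GLM-5, plus the state-changing scores for DeepSeek-V3.2); the dictionary layout and variable names are illustrative, and the full Table 15 values are not reproduced here.

```python
# (FM %, CM %) pairs quoted in the text.
overall = {
    "DeepSeek-V3.2": (72.5, 32.5),
    "GLM-5": (80.5, 41.0),
}
state_changing = {
    "DeepSeek-V3.2": (68.7, 10.0),
}


def fm_cm_gap(scores):
    """FM minus CM in percentage points, per model."""
    return {name: round(fm - cm, 1) for name, (fm, cm) in scores.items()}


overall_gap = fm_cm_gap(overall)
state_gap = fm_cm_gap(state_changing)

# Rank models from largest to smallest gap (largest = most silently wrong).
ranking = sorted(overall_gap, key=overall_gap.get, reverse=True)
```

On these inputs the DeepSeek-V3.2 gap is 40.0 pp overall and 58.7 pp on state-changing samples, in line with the figures stated above.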

## Appendix G Experimental Setup Details

### G.1 Frontier LLM Evaluation

All seven frontier models are queried via their respective inference APIs in non-thinking mode. Every model receives identical prompts instantiating the MDP triple $(s_t, a_t, \texttt{code}(a_t))$; no model has access to prior turns or trajectory history. Ground-truth labels $(o_t, s'_t)$ are produced entirely by the deterministic external executor, making evaluation LLM-free and immune to circular validation.

### G.2 Model Identifiers and API Endpoints

Table [16](https://arxiv.org/html/2605.07247#A7.T16) maps each display name used throughout the paper to the exact API model string queried during evaluation, together with the model provider. All models were called via their respective public inference APIs in non-thinking (standard) mode with no system-prompt modifications.

Table 16: Display name, exact API model string, and provider for each evaluated frontier model.

### G.3 SFT Training Configuration

We fine-tune Qwen3-4B-Base using full-parameter SFT via LLaMA-Factory \[[31](https://arxiv.org/html/2605.07247#bib.bib21)\] on 2 $\times$ A800 (80 GB) GPUs with DeepSpeed ZeRO-3 parallelism \[[23](https://arxiv.org/html/2605.07247#bib.bib20)\]. Key hyperparameters: learning rate $2 \times 10^{-5}$ with a cosine schedule and 5% linear warm-up; 3 training epochs; per-device batch size 1 with gradient accumulation over 16 steps (effective batch size 32); bf16 mixed precision with FlashAttention-2. The maximum sequence length is set to 8,192 tokens. Each sample encodes the full transition logic $\texttt{code}(a_t)$ together with the structured before-state $s_t$, with token counts peaking near 5,000 across the corpus, so this window ensures no truncation during training or evaluation. Inference is served via vLLM with a maximum generation length of 16,384 tokens and 90% GPU memory utilization.
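The batch-size arithmetic and the shape of the learning-rate schedule can be checked with a few lines. This is a sketch under stated assumptions: the schedule decays to zero and counts warm-up as `(step + 1) / warmup_steps`, which may differ in detail from the scheduler actually used by the training framework, and `total_steps=1000` is an arbitrary illustrative value.

```python
import math

# Values stated in the text.
num_gpus = 2
per_device_batch = 1
grad_accum_steps = 16
effective_batch = num_gpus * per_device_batch * grad_accum_steps


def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.05):
    """Linear warm-up to peak_lr, then cosine decay to zero (assumed convention)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))


total_steps = 1000                      # illustrative only
lr_end_of_warmup = lr_at(49, total_steps)
lr_midpoint = lr_at(525, total_steps)   # halfway through the cosine phase
lr_final = lr_at(total_steps, total_steps)
```

The effective batch size of 32 follows from 2 GPUs times 1 sample per device times 16 accumulation steps.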

All experiments, including Findings 4 and 5 and the composition ablation in Appendix [H](https://arxiv.org/html/2605.07247#A8), are evaluated under the `env_id` holdout protocol, which excludes all 167 benchmark environments from training.

## Appendix H Data Composition Ablation

Table [17](https://arxiv.org/html/2605.07247#A8.T17) reports the full composition ablation comparing three data mixture strategies, with frontier model performance as reference.

Change-only (trained exclusively on state-mutating samples) achieves the highest CM on state-changing subsets (60% Simple, 24% Medium, 32% Difficult) but collapses to 0% FM on Failure and No-Change groups, despite maintaining 96% CM on these same samples. The model correctly predicts that no state change should occur in most cases, but has never learned to generate the corresponding feedback strings: it outputs state-change-style success messages where error messages or minimal acknowledgments are expected. In a deployed pipeline, this feedback corruption directly poisons the agent's reward signal: the environment state is preserved, but the agent receives systematically wrong feedback for every failed or no-op action, causing it to misattribute outcomes and learn incorrect action-consequence associations.

Balance adds failure and no-change samples but does not mirror the empirical $|\Delta|$ distribution, resulting in weaker performance on complex-change tiers. Balance2 additionally upweights complex-change samples to match the source distribution, yielding the best overall CM (45.3%) and FM (79.5%), surpassing all frontier models on CM.

Table 17: Data composition ablation under `env_id` holdout. Bold: column-wise maximum; red: column-wise minimum.

## Appendix I Pre-SFT Scaling Baseline

Table [18](https://arxiv.org/html/2605.07247#A9.T18) shows Qwen3.5 instruct models (4B, 9B, 27B) without fine-tuning. Clear FM scaling trends emerge (2.3% $\to$ 62.8%), but even the 27B model is substantially below the best-performing frontier LLMs, confirming that specialization is required beyond scale alone.

Table 18: Qwen3.5 non-thinking instruct models before SFT. Bold: column-wise maximum; red: column-wise minimum.

## Appendix J Training Method Comparison

Table [19](https://arxiv.org/html/2605.07247#A10.T19) compares LoRA and full-parameter fine-tuning on Qwen3-4B-Base at a 5K-sample budget under the `env_id` holdout. All hyperparameters are held fixed as described in Appendix [G](https://arxiv.org/html/2605.07247#A7).

Table 19: Training method comparison on Qwen3-4B-Base (5K samples, `env_id` holdout). Full fine-tuning outperforms LoRA on CM, particularly in the Simple tier. Bold: column-wise maximum; red: column-wise minimum.

Full-parameter SFT raises FM from 27% to 72.8% (2.7$\times$) and aligns the model's output schema with the expected response convention, correcting near-zero FM on Failure and No-Change samples (CM rises from 51% to 99% on these groups). On CM, the state-change cliff persists at the 5K-sample scale: Medium-group CM reaches only 10.5%, and Difficult-group CM falls to 6%, confirming that the cliff reflects a fundamental capability gap rather than a formatting artifact.

Full-parameter fine-tuning surpasses LoRA at equal data budget on CM (35.0% vs. 31.8%), with the advantage concentrated in the Simple tier (34% vs. 20%). We attribute this to LoRA's limited representational capacity: low-rank updates may be unable to fully reshape the dense attention patterns that underlie sequential code-execution tracing. The state-change cliff persists under both methods at 5K samples, confirming it as a genuine capability gap rather than a parameter-efficiency artifact.