Coding Agent Is Good As World Simulator
Summary
This paper presents an agentic framework that uses coding agents to generate physically plausible world simulations from natural language prompts, outperforming video-based models in physical accuracy and instruction fidelity.
# Coding Agent Is Good As World Simulator

Source: [https://arxiv.org/html/2605.14398](https://arxiv.org/html/2605.14398)

Hongyu Wang (Department of Mechanical Engineering, University of Wisconsin-Madison, Madison, WI 53706, hwang2487@wisc.edu), Jingquan Wang (Department of Mechanical Engineering, University of Wisconsin-Madison, Madison, WI 53706, jwang2373@wisc.edu), Bocheng Zou (School of Computer, Data, and Information Sciences, University of Wisconsin-Madison, Madison, WI 53706, bzou24@wisc.edu), Radu Serban (Department of Mechanical & Aerospace Engineering, University of Wisconsin-Madison, Madison, WI 53706, serban@wisc.edu), and Dan Negrut (Department of Mechanical & Aerospace Engineering, University of Wisconsin-Madison, Madison, WI 53706, negrut@wisc.edu)

###### Abstract

World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models typically infer dynamics from video and represent them in latent states, they do not explicitly enforce physical constraints. As a result, the generated video rollouts are not physically plausible, exhibiting unstable contacts, distorted shapes, or inconsistent motion. In this paper, we present an agentic framework that constructs physics-based world models through executable simulation code. The framework coordinates planning, code generation, visual review, and physics analysis agents. The planning agent converts the natural-language prompt into a structured scene plan, the code agent implements it as executable simulation code, the visual review agent provides visual feedback, and the physics analysis agent checks physical consistency. The code is iteratively revised based on this feedback until the simulation matches the prompt requirements and physical constraints. Experimental results show that our framework outperforms advanced video-based models in physical accuracy, instruction fidelity, and visual quality, and that it can be applied to various scenarios including driving simulation and embodied robot tasks.

## 1 Introduction

World models have shown that learned dynamics can support planning and control from compact latent states [[9](https://arxiv.org/html/2605.14398#bib.bib9), [11](https://arxiv.org/html/2605.14398#bib.bib10), [10](https://arxiv.org/html/2605.14398#bib.bib31)]. More recently, generative video models have pushed this idea toward interactive and visually rich world simulation, including controllable environments, autonomous-driving scenes, and video-based world simulators [[6](https://arxiv.org/html/2605.14398#bib.bib1), [14](https://arxiv.org/html/2605.14398#bib.bib17), [5](https://arxiv.org/html/2605.14398#bib.bib16)]. These models can produce plausible future observations, but their dynamics are usually represented implicitly rather than as explicit bodies, joints, contacts, materials, or solver states. This distinction matters in long-horizon interaction: a world model must not only render the next plausible frame, but also preserve the physical state that determines what can happen next. Recent efforts toward physical AI and visual world simulation highlight the importance of this problem, but they still leave open how to construct worlds whose mechanics can be inspected, executed, and repaired [[1](https://arxiv.org/html/2605.14398#bib.bib19), [2](https://arxiv.org/html/2605.14398#bib.bib20), [44](https://arxiv.org/html/2605.14398#bib.bib43)].
##### Related Work

Prior work has taken three different views of the world state: a learned latent state, a generated visual state, or an explicit simulator state. Latent world models support planning and control through learned rollouts [[25](https://arxiv.org/html/2605.14398#bib.bib32), [12](https://arxiv.org/html/2605.14398#bib.bib11)]. Video-based approaches extend world modeling toward video prediction, action-conditioned rollouts, and benchmarks for evaluating whether generated videos behave like world models [[3](https://arxiv.org/html/2605.14398#bib.bib2), [46](https://arxiv.org/html/2605.14398#bib.bib18)]. Both lines are important, but neither directly exposes the simulator-level state needed to specify contacts, articulated mechanisms, deformable objects, sensors, or numerical validation. Robot-learning and deformable-object studies expose this issue because success depends on physical behavior rather than plausible appearance alone [[20](https://arxiv.org/html/2605.14398#bib.bib44), [41](https://arxiv.org/html/2605.14398#bib.bib45), [8](https://arxiv.org/html/2605.14398#bib.bib46)].

Physics simulators start from explicit state rather than learned or generated state. Engines and embodied simulation environments such as MuJoCo, Project Chrono, and Isaac Gym expose bodies, joints, contacts, terrain, sensors, and numerical integration as explicit components of the world state [[29](https://arxiv.org/html/2605.14398#bib.bib21), [24](https://arxiv.org/html/2605.14398#bib.bib8), [19](https://arxiv.org/html/2605.14398#bib.bib23)]. They provide physically meaningful state and diagnostics that video-only world simulators generally lack [[38](https://arxiv.org/html/2605.14398#bib.bib33), [28](https://arxiv.org/html/2605.14398#bib.bib34), [15](https://arxiv.org/html/2605.14398#bib.bib35)]. Their bottleneck is not physical fidelity but world construction: users must choose assets, instantiate bodies, write simulator code, tune numerical parameters, and inspect failures. Scene-generation methods reduce part of this burden by producing embodied environments, indoor layouts, and language-guided 3D scenes [[7](https://arxiv.org/html/2605.14398#bib.bib36), [23](https://arxiv.org/html/2605.14398#bib.bib37), [47](https://arxiv.org/html/2605.14398#bib.bib38)]. Physically interactable scene synthesis and physics-augmented LLM agents move closer to physically grounded world construction [[42](https://arxiv.org/html/2605.14398#bib.bib39), [40](https://arxiv.org/html/2605.14398#bib.bib22), [35](https://arxiv.org/html/2605.14398#bib.bib7)]. However, generating a scene is not the same as building a working simulation: the system must also write simulator-aware code, execute it, review the result, and repair failures.

This gap suggests a different route for world modeling. Instead of learning a latent video transition model, a system can construct an executable physics world from user input. This shifts the problem from frame prediction to simulator-aware world construction, where the system must specify geometry, bodies, joints, contacts, materials, sensors, and numerical settings in executable code. This framework turns world construction into an agentic code-generation problem: LLM agents produce plans, call tools, write code, and modify the code through feedback until it satisfies the prompt [[43](https://arxiv.org/html/2605.14398#bib.bib12), [26](https://arxiv.org/html/2605.14398#bib.bib40), [18](https://arxiv.org/html/2605.14398#bib.bib13)].
Prior work has shown that generated code can serve as an executable interface between model reasoning and external systems, making agent behavior more inspectable, editable, and testable [[34](https://arxiv.org/html/2605.14398#bib.bib30), [39](https://arxiv.org/html/2605.14398#bib.bib25), [45](https://arxiv.org/html/2605.14398#bib.bib47)]. For simulation specifically, recent work adopts LLMs to create, evaluate, self-validate, and specialize physics-based simulation code [[32](https://arxiv.org/html/2605.14398#bib.bib5), [33](https://arxiv.org/html/2605.14398#bib.bib6)]. Complex physical worlds, however, require more than a single code-generation step. Multi-agent coordination provides a paradigm for decomposing simulation construction into planning, coding, review, and validation roles [[17](https://arxiv.org/html/2605.14398#bib.bib26), [37](https://arxiv.org/html/2605.14398#bib.bib42), [13](https://arxiv.org/html/2605.14398#bib.bib29)]. Self-correcting multi-agent systems for physics simulation further show why execution feedback is important for fixing code, especially when the system is not trained on a large corpus of simulator code [[30](https://arxiv.org/html/2605.14398#bib.bib50), [22](https://arxiv.org/html/2605.14398#bib.bib4)]. Recent work such as [[21](https://arxiv.org/html/2605.14398#bib.bib3)] demonstrates how a multi-agent system can generate code for multibody dynamics simulation, but it does not yet integrate rich assets. These developments point toward a prompt-to-simulation loop in which planning, code generation, execution, visual review, and physics validation work together to construct and repair executable worlds.

These insights lead to the core idea of this paper: a coding agent can act as a world simulator. Rather than modeling future frames directly, the proposed paradigm constructs executable simulator programs that define the physical world itself. In this view, generated code serves as the world representation: it specifies bodies, joints, contacts, terrains, sensors, visual assets, materials, and numerical settings within a physics engine. Program execution then yields both physical trajectories and rendered observations, while runtime diagnostics, physics checks, and visual feedback provide grounded signals for iterative repair. We test this paradigm across robot interaction in indoor environments, outdoor vehicle simulation, and high-fidelity fluid–solid interaction. Our contributions are summarized as follows:

- We propose a multi-agent framework for world simulation, in which an agent constructs executable physics worlds through simulator-aware planning, skill-grounded code generation, execution feedback, visual review, and iterative repair.
- We incorporate physics simulation into the world-construction process, representing objects, joints, contacts, terrains, sensors, materials, and numerical settings as executable simulator programs. This design enables explicit physical state, inspectable dynamics, and physically grounded interaction beyond frame-level visual prediction.
- We demonstrate the effectiveness and generality of the proposed framework across diverse simulation tasks, including robot interaction in indoor environments, outdoor vehicle dynamics, and high-fidelity fluid–solid interaction. Through qualitative and quantitative evaluation, we show that a coding agent can construct complex world simulations.
## 2 Methodology

### 2.1 Asset Library and Collision Representation

The system constructs simulation scenes from two complementary sources of digital assets: external 3D assets collected from a public platform [[27](https://arxiv.org/html/2605.14398#bib.bib55)] and simulator-native assets distributed with Project Chrono. External assets provide semantic and visual diversity for everyday objects and indoor scenes, while Chrono assets provide components that are already tied to physical simulation, including vehicles, robots, terrain, and geometric shapes. Both sources are organized into a unified asset library that maps high-level object descriptions to simulator-ready geometry, supporting both semantic scene completion and executable physical simulation.

##### Decomposed Convex Hulls for Collision.

Because using raw 3D meshes for collision is computationally expensive, the system separates visual geometry from collision geometry. High-resolution meshes are retained for visualization, while physical interactions are computed using simplified collision shapes. For detailed 3D assets, these shapes are constructed as decomposed convex hulls generated by the collision-aware approximate convex decomposition (CoACD) algorithm [[36](https://arxiv.org/html/2605.14398#bib.bib51)]. CoACD decomposes a 3D mesh into convex components while minimizing collision-aware concavity, yielding a compact collision representation for contact simulation.

### 2.2 Multi-Agent Framework

As illustrated in Fig. [1](https://arxiv.org/html/2605.14398#S2.F1), the proposed framework decomposes physical world construction into a closed-loop agent workflow. Given a user prompt and an optional reference image, the system first produces a structured simulation plan, generates executable PyChrono code, runs the program in the Chrono engine, and reviews the simulation using physical diagnostics and visual evidence. In this design, the simulator program serves as the world model: the generated code specifies geometry, mass properties, constraints, contacts, controllers, sensors, rendering, and numerical parameters, while the physics engine advances the simulated world through time.

Figure 1: Multi-agent pipeline.

The framework fixes the same program across iterations rather than regenerating a new script from scratch after each failure. It begins by translating the user request into a structured plan with one or more implementation stages. Before code generation, the user can approve the plan or specify missing details, since natural-language requests often leave concrete simulator choices underspecified, including object dimensions, actuation, duration, time step, camera placement, and output modality. Once the plan is confirmed, the code agent generates an initial program using simulator knowledge from the skill library and assets from the asset library. At each implementation stage, the generated program is executed in Chrono, producing trajectory data, simulation video, and execution logs. The visual review agent describes the video in terms of scene layout, dynamics, object interactions, and possible visual inconsistencies. The validation stage then combines simulator logs, physical states, and visual evidence to decide whether the simulation matches the confirmed plan. If it does, the system proceeds to the next stage. Otherwise, the validator returns a structured error report, and the code agent patches the current program. The loop ends when the final program satisfies all steps in the plan.
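As a minimal sketch of the collision-geometry step in Section 2.1 (not the paper's actual pipeline), the open-source `coacd` package can be combined with `trimesh` roughly as follows; the asset path and concavity threshold are illustrative assumptions.

```python
# Sketch: decompose a high-resolution visual mesh into convex hulls for collision.
# Assumes the open-source `coacd` and `trimesh` packages; the path and threshold
# below are illustrative choices, not settings reported in the paper.
import coacd
import trimesh

visual_mesh = trimesh.load("assets/laptop.obj", force="mesh")   # detailed visual geometry
mesh = coacd.Mesh(visual_mesh.vertices, visual_mesh.faces)

# Decompose into convex components while limiting collision-aware concavity.
parts = coacd.run_coacd(mesh, threshold=0.05)                   # list of (vertices, faces) pieces

# Each convex piece becomes a collision shape; the original mesh is kept for rendering only.
collision_hulls = [trimesh.Trimesh(v, f) for v, f in parts]
print(f"Generated {len(collision_hulls)} convex hulls for collision")
```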
### 2.3 Plan Agent

The plan agent converts an underspecified user request into a simulator-oriented plan before code generation; the plan includes the objects, construction sources, topology relations, physical roles, implementation steps, and camera configurations. This intermediate representation is critical, since simulator code requires concrete choices that are often omitted in natural-language prompts.

#### 2.3.1 Optional Image Input

The plan agent takes a text prompt as its default input and can additionally condition on a reference image. The image is used as auxiliary evidence for planning rather than as a direct simulator state. From the image, the agent extracts task-relevant cues such as visible objects, approximate scale, support relations, relative layout, scene type, action intent, and visual constraints. These hints are then translated into simulator-oriented plan fields, including object construction choices, topology relations, physical roles, camera settings, and validation targets. If no image is provided, the same plan schema is completed from the text prompt alone.

#### 2.3.2 Asset Extraction

The plan agent first identifies the physical entities required by the request, including rigid bodies, articulated mechanisms, vehicles, robots, terrain, fluids, sensors, and background scene elements. For each entity, the plan records its semantic role, intended scale, approximate pose, and whether it should be instantiated from the asset library, constructed from a geometric proxy, or requested as an external asset candidate. This step preserves the physical requirements of the scene even when an exact asset is unavailable, allowing the code agent to generate a simulator-compatible approximation.

#### 2.3.3 Scene Inference

Natural-language requests often describe a scene through relations rather than exact simulator poses. For example, a prompt such as "a laptop on a table facing the chair" specifies support and orientation, but not the laptop's coordinates. The plan agent therefore infers support, containment, layout, orientation, and repeated structure before code generation. We use spatial predicates as the algebra underlying these object-level relations [[35](https://arxiv.org/html/2605.14398#bib.bib7)]. The emitted plan stores this information in `objects[*].topology.relation` and the resolved `pose`, while the predicate vocabulary defines the semantics used to compute them. The plan represents these relations at two levels: primitive predicates encode individual algebraic constraints, while composite predicate templates name common combinations of side placement, support, and vertical alignment. Each relation remains symbolic until the code agent converts it into simulator poses using object sizes, asset metadata, and the Chrono coordinate frame. The complete predicate definitions and implementation details are provided in Appendix [A.4](https://arxiv.org/html/2605.14398#Ax1.SS4).

##### Position Predicates.

Position predicates describe where objects are placed in the ground plane and whether they are supported or contained by other objects.

- Position in the XY plane: `LEFT-OF` / `RIGHT-OF` / `FRONT-OF` / `BACK-OF` place an object to the left, right, front, or back of a reference object in the ground plane. `PLACE-ON-BASE` places an object on the simulation base plane when no support object is specified.
- Alignment: `ALIGN-LEFT` / `ALIGN-RIGHT` / `ALIGN-FRONT` / `ALIGN-BACK` align an object with the corresponding edge of a reference object. `ALIGN-CENTER-LR` / `ALIGN-CENTER-FB` align object centers along the left–right or front–back axis.
- Containment and support: `PLACE-ON` places an object on top of a reference object. `PLACE-IN` places an object inside a reference object registered as a container.
- Unconstrained placement: `PLACE-ANYWHERE` keeps an object in the plan when the prompt requires it but does not specify its location.

##### Height Predicates.

Height predicates declare vertical extents needed to resolve support and contact relations.

- Height declaration: `HEIGHT` assigns a concrete height to a procedural body whose vertical extent cannot be inferred from mesh metadata or a catalog entry.

##### Orientation Predicates.

Orientation predicates describe how an object should face in either the global coordinate frame or the local frame of a reference object. For simulator assets, we use +X as the forward direction in the XYZ coordinate system.

- Global orientation: `FACING-RIGHT` / `FACING-LEFT` / `FACING-FRONT` / `FACING-BACK` orient an object toward a specified global direction.
- Relative orientation: `FACING-TO` / `FACING-OPPOSITE-TO` / `FACING-SAME-AS` orient an object with respect to a reference object, such as facing the reference, facing away from it, or matching its heading.
- Derived orientation: `RANDOM-ROT` samples a concrete yaw value at plan time, while `ORIENT-BY-RELATIVE-SIDE` selects the yaw that best matches the object's side relation to a reference object.

##### Fluid Predicates.

Fluid predicates describe object placement relative to a free surface in fluid–solid interaction scenes.

- Surface declaration: `FREE-SURFACE-AT` sets the free-surface height of a fluid body and must appear before other objects refer to that fluid.
- Buoyancy: `FLOATS-AT-SURFACE` places an object with its bottom face on the fluid surface. `SUBMERGED` places an object below the surface at a specified depth, using either its top face or center as the anchor.
- Container marking: `CONTAINS-FLUID` marks an object as a fluid container so that the generated code preserves the geometry and visibility of the contained particles.

##### Symmetry and Grouping Predicates.

Symmetry and grouping predicates describe repeated layouts without listing every low-level placement separately.

- Symmetry: `SYMMETRY-ALONG` places an object as the mirror of a reference object about a designated axis object.
- Grouping: `GROUP` combines multiple assets into a virtual object whose pose is anchored by one member. `COPY-GROUP` creates another group with the same internal configuration, so spatial or rotation predicates can be applied to the copied group as a whole.

##### Composite Placement Predicates.

Composite predicates are named templates built from the primitive predicate algebra above. They are used when a common scene relation would otherwise require multiple primitive rows, such as placing an object outside a reference face while also aligning its top surface, or positioning a camera relative to a scene boundary.

- On-top templates: `spawned_on_top`, `placed_on_top`, and `centered_on_ref` combine support or center alignment with a resolved object pose.
- Adjacent templates: `adjacent_plus_x_top_flush`, `adjacent_minus_x_top_flush`, `adjacent_plus_y_top_flush`, and `adjacent_minus_y_top_flush` combine a side predicate with top-face alignment. The corresponding `bottom_flush` and `centers` variants use the same side placement with different vertical alignment.
- Water-surface templates: `bottom_flush_water_surface`, `center_at_water_surface`, `top_flush_water_surface`, and `floats_at_surface` combine a fluid reference with a vertical anchor on the object.
- Container and bridge templates: `fills_container_to_top`, `fills_container_lower_half`, `bridge_between_a_and_b`, and `flush_with_platform_top` capture frequent filling, spanning, and platform-level alignment relations.
- Camera templates: `side_minus_x`, `side_plus_x`, `side_minus_y`, `side_plus_y`, and `top_down` variants place cameras relative to a scene bounding box or enclosed room.

#### 2.3.4 User Interaction

Before code generation, the structured plan is exposed to the user for confirmation or correction. This approval step is necessary because the generated simulator program commits to concrete choices such as time step, duration, output modality, object dimensions, camera placement, and actuation strategy. If these choices remain underspecified, the system either requests clarification or inserts conservative defaults so that code generation can proceed. However, defaults inferred from incomplete requests may not match the user intent, leading to additional repair iterations or invalid simulation results.

### 2.4 Code Agent

The code agent translates the approved simulation plan into an executable PyChrono program. Instead of generating the script from the plan alone, the agent conditions code generation on three sources of simulator-specific information: a skill library, a tool interface, and a version-specific API index. This design reduces the amount of simulator knowledge that must be inferred by the code agent alone.

#### 2.4.1 Skill Library

The code agent retrieves task-relevant skills from a curated skill library before generating code. Each skill specifies the implementation procedure for a particular pattern in PyChrono, such as rigid-body creation, joint setup, vehicle initialization, robot loading, sensor configuration, fluid–solid interaction setup, terrain generation, or VSG visualization. The retrieved skills provide implementation context for the approved plan and help keep the generated script consistent with valid Chrono usage.

#### 2.4.2 Tool Interface

The code agent also has access to a set of deterministic tools for querying project resources and performing common simulator operations. These tools provide information about the asset library and the available simulator API, and they support routine actions such as adding assets, configuring cameras, and recording simulation videos. Exposing these operations as tools keeps frequently used procedures separate from free-form code generation and makes the generated program easier to inspect and debug.

#### 2.4.3 API Retrieval and Validation

A practical challenge in PyChrono code generation is API drift across different versions. The agent may call outdated classes, use incorrect argument signatures, or propose functions that are not available in the installed environment. To reduce these errors, the system retrieves relevant entries from a version-specific API index during code generation and applies a static API validator before execution. The validator checks imports, classes, functions, method calls, and argument patterns against the available API surface. When the check fails, the error report is returned to the code agent so that API-level mistakes can be repaired before entering the more expensive code-generation loop.
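As an illustration of this static check, the sketch below walks a generated script's abstract syntax tree and flags attribute accesses on the simulator module that are missing from a version-specific API index. It is a simplified, assumed stand-in for the paper's validator: the index entries, the `chrono` alias handling, and the example script are illustrative only.

```python
# Simplified, illustrative stand-in for a static API validator (not the paper's code).
# It flags `chrono.<Name>` usages that do not appear in an assumed API index.
import ast

API_INDEX = {"ChSystemNSC", "ChBodyEasyBox", "ChContactMaterialNSC"}  # assumed excerpt

def validate_api(source: str, alias: str = "chrono") -> list[str]:
    errors = []
    for node in ast.walk(ast.parse(source)):
        # Catch attribute accesses such as chrono.ChBodyEasySphere(...)
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            if node.value.id == alias and node.attr not in API_INDEX:
                errors.append(f"line {node.lineno}: unknown API 'chrono.{node.attr}'")
    return errors

script = (
    "import pychrono as chrono\n"
    "sys = chrono.ChSystemNSC()\n"
    "ball = chrono.ChBodyEasySphere(0.1, 1000)\n"
)
print(validate_api(script))  # flags ChBodyEasySphere as absent from the assumed index
```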
### 2.5 Execution and Review Agents

#### 2.5.1 VLM Inference

During execution, the generated code runs in an isolated process. The system records camera frames, simulation results, and diagnostic logs. The visual review agent analyzes the camera frames and produces a textual description of visible objects, scene layout, motion, contacts, and discrepancies between the video and the plan. This review does not replace physics-based validation. Instead, it provides semantic visual evidence that is difficult to infer from logs or trajectory data alone.

#### 2.5.2 Simulation Judge

The simulation judge evaluates whether the executed program satisfies the confirmed plan. It combines evidence from three sources: diagnostic logs, physical trajectory data, and the visual review. The logs indicate whether the program ran successfully and whether the simulator reported runtime errors or solver warnings. The trajectory data records physical quantities such as positions, velocities, contacts, and task-specific measurements. The visual review describes scene-level properties, including object presence, layout, visible interaction, and mismatches between the rendered video and the intended behavior. Based on this evidence, the judge determines whether the simulation is valid, stable, visually consistent, and complete. If the simulation fails, the judge returns a structured error report that identifies the likely repair target, such as physical parameters, object settlement, camera placement, or visual mismatch. This report is passed back to the code agent, which uses it to fix the current program before the next execution.

## 3 Experiments

We evaluate the framework through three complementary analyses. First, we ablate the optional reference image used by the Plan Agent to examine how visual grounding affects the generated plan. Second, we report the time and token usage required for a single generation. Third, we evaluate rendered simulation rollouts on WorldModelBench [[16](https://arxiv.org/html/2605.14398#bib.bib28)], a benchmark for assessing video generation models as world models. Together, these evaluations examine planning robustness, resource usage, prompt adherence, and physical plausibility. The experiments include three scenarios described in Appendix [A.5](https://arxiv.org/html/2605.14398#Ax1.SS5): a Go2 robot patrolling an office, an HMMWV driving on outdoor terrain, and a fluid–solid interaction (FSI) scenario in which a Polaris vehicle crosses a floating block on water.

### 3.1 Ablation Study

We ablate the optional reference-image input of the Plan Agent. For each task, plans are generated under two input types: text prompt only, and text prompt with a reference image. Each task-input pair is run for five independent trials. A plan is judged successful when it parses cleanly into the structured format consumed by the Code Agent and a human reviewer confirms that the listed entities, object construction choices, topology relations, and physical roles match the requested scene. Table [1](https://arxiv.org/html/2605.14398#S3.T1) reports the Pass@k scores for the completed trial cells.

Table 1: Plan-Agent ablation study. Pass@k denotes the probability that at least one successful plan is obtained within k sampled attempts.
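Table 1's definition of Pass@k can be made concrete with a small estimator sketch. The paper does not state how the probability is computed; the snippet below uses the standard unbiased estimator over n trials with c successes, purely as an illustration (the n, c, k values are hypothetical).

```python
# Unbiased Pass@k: probability that at least one of k sampled attempts succeeds,
# given n independent trials with c successes. The paper does not specify its
# exact estimator; this is a standard formulation, shown for illustration only.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset of the n trials contains a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical cell with five trials (as in the ablation) and four successes.
print(pass_at_k(n=5, c=4, k=1))  # 0.8
print(pass_at_k(n=5, c=4, k=3))  # 1.0
```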
The high success rates in both input conditions should be interpreted in the context of the planning schema and prompt constraints. In most trials, both the text-only and text-plus-image settings produced plans with the correct required objects and physical roles. The main differences between sampled plans were often in how the task was divided into implementation steps, rather than in whether the scene contained the requested objects. These differences in step decomposition should not be treated as failures, since they preserve the information needed by the Code Agent to implement the simulation correctly.

### 3.2 Time and Token Usage

We report the wall-clock time and token usage for one successful end-to-end agent run of each demo scenario. The measurement starts from plan generation and ends when the system produces an accepted simulation. The token counts are summed over all LLM calls in the run. These results are meant to provide a representative cost profile of the system rather than an estimate of average runtime. Since LLM agents are stochastic, the number of agent calls and repair iterations can vary across runs; the values in Table [2](https://arxiv.org/html/2605.14398#S3.T2) should therefore be read as examples of successful runs.

Table 2: Wall-clock time and token usage for representative successful end-to-end runs. Token counts are summed over all LLM calls in each run.

### 3.3 Benchmark Evaluation of Physical World Modeling

We evaluate generated rollouts on WorldModelBench [[16](https://arxiv.org/html/2605.14398#bib.bib28)], a benchmark that assesses whether generated videos behave as plausible world models. WorldModelBench reports scores along instruction, physical-laws, and common-sense axes. In our evaluation, the WorldModelBench judge uses `gemini-3.1-pro-preview` as its backend LLM. For the proposed framework, each evaluated video is rendered from the executable PyChrono program produced by the agent loop. We use Wan2.2-TI2V-5B [[31](https://arxiv.org/html/2605.14398#bib.bib54)] as the video-generation baseline; baseline inference is run on an NVIDIA A100 GPU with 80 GB of memory. The baseline video is generated directly from the same prompt and, when applicable, the same reference image. This setup compares two different representations of a generated world: executable simulator code and frame-level video generation. Table [3](https://arxiv.org/html/2605.14398#S3.T3) reports the detailed WorldModelBench scores for each scenario and metric, while Tables [4](https://arxiv.org/html/2605.14398#S3.T4) and [5](https://arxiv.org/html/2605.14398#S3.T5) summarize the same results with paired significance tests. We use two-sided paired t-tests over the 10 matched runs for each comparison.

Table 3: WorldModelBench scores on three scenarios. Higher is better on each axis. Instr. = instruction score, Phys. = physical-laws score, and CS = common-sense score.

Table 4: Scenario-level WorldModelBench scores. Each score is the sum of the instruction, physical-laws, and common-sense scores for a scenario. Difference is Multi-Agent Framework minus Wan2.2-TI2V-5B.

Table 5: Metric-level WorldModelBench scores aggregated across the three scenarios. Difference is Multi-Agent Framework minus Wan2.2-TI2V-5B.
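The paired comparisons in Tables 4 and 5 use two-sided paired t-tests over the 10 matched runs. A minimal sketch with SciPy's `ttest_rel` is shown below; the use of SciPy and the score arrays are assumptions for illustration, not the paper's data or tooling.

```python
# Two-sided paired t-test over matched runs, mirroring the comparison protocol
# described for Tables 4 and 5. The scores below are made-up placeholders.
from scipy.stats import ttest_rel

framework_scores = [8.1, 7.9, 8.4, 8.0, 8.2, 7.8, 8.3, 8.1, 8.0, 8.2]  # hypothetical
baseline_scores  = [7.2, 7.5, 7.1, 7.4, 7.3, 7.6, 7.0, 7.2, 7.5, 7.1]  # hypothetical

result = ttest_rel(framework_scores, baseline_scores)  # two-sided by default
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4g}")
```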
The WorldModelBench results provide a preliminary comparison of executable simulation and direct video generation on action-level physical commonsense. Looking first at the mean scores, the Multi-Agent Framework obtains higher scenario-level totals in all three tasks, with the largest margin on the FSI vehicle task. The paired p-values indicate how consistently these differences appear across the 10 matched runs: the FSI improvement is statistically significant (p = 0.0012), whereas the outdoor-vehicle and robot-in-office differences are positive but not statistically significant. At the metric level, the significant gain is concentrated in instruction adherence (p = 0.000059), while the physical-laws and common-sense scores remain comparable between methods. This pattern suggests that the main advantage of the framework is preserving requested entities, actions, and scene constraints through executable simulation code, rather than uniformly improving every physical-commonsense metric. The scores should still be interpreted as preliminary evidence rather than a comprehensive ranking, since the current evaluation covers a small number of scenarios.

## 4 Conclusion and Future Work

This paper studies a code-centric alternative to video-only world modeling. Instead of predicting future frames directly, the proposed framework constructs executable PyChrono programs that specify bodies, contacts, terrains, controllers, sensors, rendering, and numerical settings. Across the evaluated scenarios, this representation makes the generated world both renderable and inspectable: failures can be traced through execution logs, trajectory data, contacts, and visual review, and the current program can be repaired rather than regenerated from scratch. The preliminary benchmark results suggest that executable simulation is useful for preserving requested entities, actions, and scene constraints, although the quality of the final simulation still depends on planning, asset retrieval, simulator-specific code generation, and validation feedback.

However, the framework also has several limitations. The repair process is not guaranteed to improve monotonically and can move between different failure modes before reaching an accepted simulation. The asset library remains finite, so missing objects must be approximated by geometric proxies or replaced with available assets. Code generation is constrained by the skill library and the version-specific API index, which makes unsupported sensors, custom solvers, and less common physical regimes difficult to instantiate reliably. The present evaluation also covers a small number of scenarios and still relies partly on human judgment for plan acceptance and qualitative validation.

Several directions remain to be explored. First, the current asset library is limited, so missing objects must be approximated by geometric proxies. Integrating a 3D asset generation pipeline would improve coverage of long-tail scene elements and reduce manual asset preparation. Second, the present infrastructure is still expensive in token usage because planning, code generation, execution analysis, and visual review are mediated by multiple LLM calls; more compact state representations and better caching could reduce this cost. Third, the experiments are currently run sequentially.
Rendering and simulation place substantial load on both CPU and GPU resources, which prevents multiple agents from running PyChrono programs in parallel under the available hardware budget. Future frameworks could explore parallel execution and scheduling with more powerful hardware.

## References

- [1] N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y. Chen, Y. Cui, Y. Ding, et al. (2025). Cosmos world foundation model platform for physical AI. arXiv preprint arXiv:2501.03575.
- [2] (2025). World simulation with video foundation models for physical AI. arXiv preprint arXiv:2511.00062.
- [3] M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, et al. (2025). V-JEPA 2: self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985.
- [4] Blender Foundation (2024). Blender, version 4.0. https://www.blender.org/
- [5] T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh (2024). Video generation models as world simulators. OpenAI. https://openai.com/research/video-generation-models-as-world-simulators
- [6] J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, et al. (2024). Genie: generative interactive environments. In Forty-first International Conference on Machine Learning.
- [7] M. Deitke, E. VanderBilt, A. Herrasti, L. Weihs, K. Ehsani, J. Salvador, W. Han, E. Kolve, A. Kembhavi, and R. Mottaghi (2022). ProcTHOR: large-scale embodied AI using procedural generation. Advances in Neural Information Processing Systems 35, pp. 5982–5994.
- [8] P. Fung, Y. Bachrach, A. Celikyilmaz, K. Chaudhuri, D. Chen, W. Chung, E. Dupoux, H. Gong, H. Jégou, A. Lazaric, et al. (2025). Embodied AI agents: modeling the world. arXiv preprint arXiv:2506.22355.
- [9] D. Ha and J. Schmidhuber (2018). Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. https://proceedings.neurips.cc/paper_files/paper/2018/file/2de5d16682c3c35007e4e92982f1a2ba-Paper.pdf
- [10] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi (2020). Dream to control: learning behaviors by latent imagination. In International Conference on Learning Representations.
- [11] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2019). Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pp. 2555–2565.
- [12] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap (2023). Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104.
- [13] S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber (2024). MetaGPT: meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=VtmBAGCN7o
- [14] A. Hu, L. Russell, H. Yeo, Z. Murez, G. Fedoseev, A. Kendall, J. Shotton, and G. Corrado (2023). GAIA-1: a generative world model for autonomous driving. arXiv preprint arXiv:2309.17080.
- [15] C. Li, F. Xia, R. Martin-Martin, M. Lingelbach, S. Srivastava, B. Shen, K. E. Vainio, C. Gokmen, G. Dharan, T. Jain, et al. (2022). iGibson 2.0: object-centric simulation for robot learning of everyday household tasks. In Conference on Robot Learning, pp. 455–465.
- [16] D. Li, Y. Fang, Y. Chen, S. Yang, S. Cao, J. Wong, M. Luo, X. Wang, H. Yin, J. E. Gonzalez, I. Stoica, S. Han, and Y. Lu (2026). WorldModelBench: judging video generation models as world models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=a3hafrDzuA
- [17] G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem (2023). CAMEL: communicative agents for "mind" exploration of large language model society. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=3IyL2XWDkG
- [18] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. (2023). Self-Refine: iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36, pp. 46534–46594.
- [19] V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, et al. (2021). Isaac Gym: high performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470.
- [20] J. Mao, S. He, H. Wu, Y. You, S. Sun, Z. Wang, Y. Bao, H. Chen, L. Guibas, V. Guizilini, et al. (2025). Robot learning from a physical world model. arXiv preprint arXiv:2511.07416.
- [21] T. Möltner, P. Manzl, M. Pieber, and J. Gerstmayr (2025). Creation, evaluation and self-validation of simulation models with large language models. Neurocomputing, pp. 132030.
- [22] D. Park, H. Moon, and S. Ryu (2026). A self-correcting multi-agent LLM framework for language-based physics simulation and explanation. npj Artificial Intelligence 2(1), pp. 10.
- [23] D. Paschalidou, A. Kar, M. Shugrina, K. Kreis, A. Geiger, and S. Fidler (2021). ATISS: autoregressive transformers for indoor scene synthesis. In Advances in Neural Information Processing Systems, Vol. 34, pp. 12013–12026.
- [24] Project Chrono (2026). Project Chrono - an open-source physics engine. https://projectchrono.org/
- [25] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al. (2020). Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588(7839), pp. 604–609.
- [26] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023). Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 36.
- [27] Sketchfab Development Team (2012). Sketchfab. https://sketchfab.com
- [28] A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. S. Chaplot, O. Maksymets, et al. (2021). Habitat 2.0: training home assistants to rearrange their habitat. In Advances in Neural Information Processing Systems, Vol. 34, pp. 251–266.
- [29] E. Todorov, T. Erez, and Y. Tassa (2012). MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
- [30] K. Tran, D. Dao, M. Nguyen, Q. Pham, B. O'Sullivan, and H. D. Nguyen (2025). Multi-agent collaboration mechanisms: a survey of LLMs. arXiv preprint arXiv:2501.06322.
- [31] T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. (2025). Wan: open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314.
- [32] J. Wang, A. Negrut, H. Wang, H. Zhang, and D. Negrut (2026). SimBench: a framework for evaluating and diagnosing LLM-based digital-twin generation for multi-physics simulation. IEEE Access 14, pp. 61784–61808. https://dx.doi.org/10.1109/ACCESS.2026.3685519
- [33] J. Wang, A. Negrut, H. Zhang, K. Slaton, S. Wang, R. Serban, J. Wu, and D. Negrut (2026). ChronoLLM: customizing language models for physics-based simulation code generation. Multibody System Dynamics, pp. 1–45.
- [34] X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji (2024). Executable code actions elicit better LLM agents. In Forty-first International Conference on Machine Learning.
- [35] Y. Wang, H. Yang, M. Guo, X. Qiu, T. Wang, W. Matusik, J. B. Tenenbaum, and C. Gan (2026). PhyScensis: physics-augmented LLM agents for complex physical scene arrangement. In The Fourteenth International Conference on Learning Representations.
- [36] X. Wei, M. Liu, Z. Ling, and H. Su (2022). Approximate convex decomposition for 3D meshes with collision-aware concavity and tree search. ACM Transactions on Graphics (TOG) 41(4), pp. 1–18.
- [37] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, et al. (2023). AutoGen: enabling next-gen LLM applications via multi-agent conversation. arXiv preprint arXiv:2308.08155.
- [38] F. Xiang, Y. Qin, K. Mo, Y. Xia, H. Zhu, F. Liu, M. Liu, H. Jiang, Y. Yuan, H. Wang, et al. (2020). SAPIEN: a simulated part-based interactive environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11097–11107.
- [39] J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. R. Narasimhan, and O. Press (2024). SWE-agent: agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=mXpq6ut8J3
- [40] Y. Yang, B. Jia, P. Zhi, and S. Huang (2024). PhyScene: physically interactable 3D scene synthesis for embodied AI. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR).
- [41] Y. Yang, Z. Zhang, X. Zhang, Y. Zeng, H. Li, and W. Zuo (2025). PhysWorld: from real videos to world models of deformable objects via physics-aware demonstration synthesis. arXiv preprint arXiv:2510.21447.
- [42] Y. Yang, F. Zhao, Y. Zhu, P. Zhang, X. Chen, and S. Huang (2024). Holodeck: language guided generation of 3D embodied AI environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [43] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023). ReAct: synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR).
- [44] J. Yue, Z. Huang, Z. Chen, X. Wang, P. Wan, and Z. Liu (2025). Simulating the visual world with artificial intelligence: a roadmap. arXiv preprint arXiv:2511.08585.
- [45] Y. Zhang, H. Ruan, Z. Fan, and A. Roychoudhury (2024). AutoCodeRover: autonomous program improvement. arXiv preprint arXiv:2404.05427.
- [46] F. Zhu, H. Wu, S. Guo, Y. Liu, C. Cheang, and T. Kong (2024). IRASim: learning interactive real-robot action simulators. arXiv preprint arXiv:2406.14540.
- [47] Z. Zhuang, Y. Wang, X. Qiu, W. Matusik, J. B. Tenenbaum, and C. Gan (2023). CommonScenes: generating commonsense 3D indoor scenes with scene graph diffusion. In Advances in Neural Information Processing Systems, Vol. 36.

## Appendix A

### A.1 The Use of Large Language Models

In the preparation of this manuscript, the LLM was used for tasks such as grammar correction, sentence restructuring, and improving the overall readability of the manuscript. The LLM also assisted with code debugging and optimization. The LLM did not contribute to any scientific ideas, experimental results, or the core structure of the paper.

### A.2 External 3D Asset Attribute List

##### Asset Attribution

Table 6: Third-party assets used in the experiments.

### A.3 Agent Backend LLMs

Table [7](https://arxiv.org/html/2605.14398#Ax1.T7) reports the LLM backends used by each agent in the experiments.

Table 7: Backend LLMs used by each agent.

### A.4 Implementation Details

##### Plan Format.

The Plan Agent emits a structured plan before code generation. The compact format follows the current objects-based schema used by the proposed-plan view: each concrete body appears in `objects`, with its construction source, topology relation, pose, dynamic state, and simulation registration recorded in one entry.

##### Scene Coordinate System Skill.

The Plan Agent uses this skill when resolving `objects[*].topology` and predicate-based layout constraints into concrete simulator coordinates. The skill defines the coordinate frame, predicate algebra, relation patterns, and self-checks used to derive `pose.position` and `pose.rotation_deg` from symbolic relations such as `spawned_on_top`, `adjacent_plus_x_top_flush`, and `floats_at_surface`. The Code Agent then treats the resolved plan as the source of truth for placement instead of inferring coordinates again from natural language. In the original skill text, `scene_predicates` denotes the primitive predicate trace; in the compact plan schema used in this paper, those rows are represented by `objects[*].topology.relation` together with the resolved pose fields.
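To illustrate the objects-based schema, the entry below is a hypothetical example written as a Python dictionary. Every key and value is an assumption for illustration, not the paper's verbatim plan format.

```python
# Hypothetical plan entry for a single object, loosely following the fields named
# in A.4 (construction source, topology relation, pose, dynamic state, registration).
# All keys and values here are illustrative assumptions, not the paper's exact schema.
laptop_entry = {
    "name": "laptop",
    "construction": {"source": "asset_library", "asset_id": "laptop_01"},
    "topology": {"relation": "placed_on_top", "reference": "table"},          # symbolic predicate
    "pose": {"position": [0.0, 0.0, 0.74], "rotation_deg": [0.0, 0.0, 90.0]}, # resolved by the skill
    "dynamic_state": "free_rigid_body",
    "simulation": {"register_collision": True, "fixed": False},
}
```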
### A.5 Experiment Details

##### Robot in Office

Fig. 2 shows the robot-in-office task, which requires the system to construct a scene with a robot interacting with an office environment.

Figure 2: Robot in office scene.

##### Vehicle in Outdoor Scene

Fig. 3 shows the vehicle-in-outdoor-scene task, which requires the system to construct a scene with a vehicle driving on outdoor terrain.

Figure 3: Vehicle in outdoor scene.

##### Vehicle through FSI Ground

Fig. 4 shows the vehicle-through-FSI-ground task, which requires the system to construct a scene with a vehicle driving through fluid–structure interaction (FSI) ground, demonstrating the system's ability to handle complex multi-physics scenarios. (For visual clarity of the water, the FSI scene generated by the Chrono-Agent is re-rendered with water-splash objects in Blender [4]; these objects are not included in the original Chrono simulation.)

Figure 4: Vehicle through FSI ground.