SimWorlds: A Multi-Agent System for Dynamic 3D Scene Creation

arXiv cs.AI 07/03/26, 04:00 AM Papers
multi-agent 3d-scene-creation dynamic-4d blender procedural text-to-3d physics-simulation
Summary
SimWorlds is a multi-agent framework that generates dynamic, editable 4D scenes from natural language, using Blender-specific procedural knowledge and a planner-coder-reviewer workflow, outperforming prior baselines.
arXiv:2607.01766v1 Announce Type: new Abstract: LLM agents are increasingly used to translate natural language into 3D scenes in a procedural way, but existing systems focus on static output. Dynamic 4D scenes from text alone, in which liquids flow, particles emit, rigid bodies cascade, and articulated mechanisms move, remain largely unexplored despite their value as editable content and as physics-grounded training data for video generation and embodied AI. Two challenges set the dynamic case apart from static text-to-scene work: an agent must jointly coordinate spatial layout, multiple physics solvers, temporal sequencing, camera, and lighting in a single coherent scene, and verifying motion correctness from rendered video is fundamentally harder than judging a single image. We present SimWorlds: a multi-agent framework that produces dynamic, editable 4D scenes from text, with Blender-specific procedural knowledge, a planner-coder-reviewer workflow driving a fixed ordered sequence of construction stages, a layered scene protocol enforced by a deterministic verifier, and a runtime-state inspection tool suite that catches mechanism failures the rendered image cannot reveal. We also introduce 4DBuildBench, a benchmark for assessing both visual fidelity and physical consistency of the procedural dynamic 3D scenes generated from text prompts. Experiments show that SimWorlds outperforms prior dynamic Blender generation baselines.
Cached at: 07/03/26, 05:45 AM
# SimWorlds: A Multi-Agent System for Dynamic 3D Scene Creation
Source: [https://arxiv.org/html/2607.01766](https://arxiv.org/html/2607.01766)
Chunjiang Liu¹, Xiaoyuan Wang¹, Haoyu Chen², Yizhou Zhao¹, Ming\-Hsuan Yang³, László A\. Jeni¹

¹Carnegie Mellon University, ²Harvard University, ³University of California, Merced

[https://dynsimworlds\.github\.io](https://dynsimworlds.github.io/)

###### Abstract

LLM agents are increasingly used to translate natural language into 3D scenes in a procedural way, but existing systems focus on static output\. Dynamic 4D scenes from text alone, in which liquids flow, particles emit, rigid bodies cascade, and articulated mechanisms move, remain largely unexplored despite their value as editable content and as physics\-grounded training data for video generation and embodied AI\. Two challenges set the dynamic case apart from static text\-to\-scene work: an agent must jointly coordinate spatial layout, multiple physics solvers, temporal sequencing, camera, and lighting in a single coherent scene, and verifying motion correctness from rendered video is fundamentally harder than judging a single image\. We present SimWorlds: a multi\-agent framework that produces dynamic, editable 4D scenes from text, with Blender\-specific procedural knowledge, a planner–coder–reviewer workflow driving a fixed ordered sequence of construction stages, a layered scene protocol enforced by a deterministic verifier, and a runtime\-state inspection tool suite that catches mechanism failures the rendered image cannot reveal\. We also introduce 4DBuildBench, a benchmark for assessing both visual fidelity and physical consistency of the procedural dynamic 3D scenes generated from text prompts\. Experiments show that SimWorlds outperforms prior dynamic Blender generation baselines\.

![Refer to caption](https://arxiv.org/html/2607.01766v1/x1.png)

Figure 1: SimWorlds turns text into dynamic, editable 3D Blender scenes\. Given a natural\-language prompt, a planner, coder, and reviewer cooperate to emit a \.blend whose geometry, materials, lighting, camera, and motion all remain controllable for downstream editing and reuse\.

## 1 Introduction

A modern 3D generative system is increasingly expected to produce more than a visually plausible render\. For downstream graphics, simulation, and content\-creation workflows, the desired output is an editable scene artifact: a Blender project in which geometry, materials, lighting, cameras, animation, and physics solvers remain explicit and controllable\. Recent text\-to\-3D methods have made rapid progress on object\-level assets\[[48](https://arxiv.org/html/2607.01766#bib.bib6),[27](https://arxiv.org/html/2607.01766#bib.bib10),[61](https://arxiv.org/html/2607.01766#bib.bib12)\], and LLM agents have begun to construct static scenes from natural language\[[29](https://arxiv.org/html/2607.01766#bib.bib36),[59](https://arxiv.org/html/2607.01766#bib.bib37),[42](https://arxiv.org/html/2607.01766#bib.bib66),[74](https://arxiv.org/html/2607.01766#bib.bib22),[77](https://arxiv.org/html/2607.01766#bib.bib38)\]\. Generating a dynamic 3D scene from text alone is substantially harder: a dynamic scene must not only look correct in rendered frames but be produced through the correct underlying mechanisms, drawing on rigid\-body simulation, cloth, fluids, particles, force fields, deformers, and keyframed control, often combined within a single shot\.

This distinction exposes a failure mode largely absent in static generation\. In a static scene, visual inspection is a reasonable proxy for correctness; in a dynamic scene, the same rendered video can correspond to very different underlying states: a tablecloth that drapes via a cloth solver, via hand\-authored shape keys, or via keyframed mesh edits renders identically, yet only the solver version stays editable, composable, and physically meaningful as the scene changes\. Text\-to\-dynamic\-scene generation therefore demands mechanism correctness, not just visual plausibility\.

Existing LLM\-agent pipelines for Blender are not designed around this requirement\. Most render the scene, ask a vision\-language model to critique the image, and revise the code, which catches missing objects and obvious material errors but cannot tell whether a fluid domain, properly configured flow and effector objects, and a baked cache exist, or whether the geometry is merely animated\. As objects, interactions, and temporal phases multiply, these unchecked failures compound, and the final scene approximates the prompt visually while remaining unusable as a 4D asset\.

We formulate dynamic 3D scene generation as plan\-grounded, mechanism\-aware program synthesis\. A text prompt is first converted into an explicit scene plan specifying objects, spatial composition, physical roles and motion phases\. Generation then proceeds through an ordered sequence of typed subtasks, each with its own context, acceptance criteria, and verification\. Crucially, review is not limited to rendered images: the agent inspects the live Blender state, checking whether the expected modifiers are attached, whether physics caches are baked, whether simulated actors move over the intended temporal phase, and whether collision and effector relationships are present\.

We instantiate this formulation as SimWorlds, a multi\-agent framework for text\-to\-4D scene generation in Blender\. A planner compiles the prompt into a structured scene plan; a coder then builds the scene through a fixed ordered sequence of typed stages, each closed by a deterministic verifier that checks the assembled state against a layered scene protocol and a reviewer that judges per\-stage criteria, with failed checks triggering localised retries that keep early errors from contaminating later behaviour\. Engine\-level tools let both the coder and the reviewer read Blender’s runtime state: modifier stacks, physics caches, animation channels, and multi\-angle previews\. A knowledge base auto\-derived from upstream Blender sources supplies procedural detail on demand\. The result is an editable \.blend file whose geometry, materials, lighting, camera, and dynamics remain available for downstream editing, resimulation, and reuse\.

We evaluate SimWorlds on text\-only dynamic scene generation and on multi\-step Blender editing via BlenderBench\[[77](https://arxiv.org/html/2607.01766#bib.bib38)\]\. Compared with visual\-only agent baselines, SimWorlds improves both scene\-level correctness and physical integrity, with the gap widening sharply on complex inputs: generation prompts that require multiple interacting objects, long temporal structure, or nontrivial solver configuration, and edit instructions that span several objects or modify physics simultaneously\.

Our contributions are:

- We present SimWorlds, an LLM\-agent system that turns text into an editable 4D Blender project: geometry, materials, lighting, cameras, animation, and physics, all controllable\.
- We build a controllable, physics\-grounded generation pipeline that combines a scene protocol and deterministic verifier, render\-based review, and an engine\-level tool suite\.
- We introduce 4DBuildBench, 50 scenes across five solver categories \(cloth, fluid, rigid body, particle, soft body\) and three difficulty levels plus a static category, paired with a two\-track evaluation: a deterministic engine\-state audit for mechanism correctness, and an itemized VLM judge for whether the prompt’s content is visually delivered\.

## 2 Related Work

#### Code\-Driven and Procedural 3D Scene Generation\.

A growing line of work treats Blender as the runtime for an LLM agent that synthesises scene\-construction code\. SceneCraft\[[29](https://arxiv.org/html/2607.01766#bib.bib36)\] and 3D\-GPT\[[59](https://arxiv.org/html/2607.01766#bib.bib37)\] translate text into Blender scripts coordinated through relational scene graphs\. BlenderAlchemy\[[30](https://arxiv.org/html/2607.01766#bib.bib39)\] iteratively refines materials under VLM feedback\. LL3M\[[42](https://arxiv.org/html/2607.01766#bib.bib66)\] composes planner, retrieval, and coder agents over a BlenderRAG knowledge base, and reports object\-level results at high quality\. A complementary thread bypasses learned generation entirely: procedural pipelines such as Infinigen\[[50](https://arxiv.org/html/2607.01766#bib.bib63)\] hand\-craft generators for Blender\-rendered nature scenes, and ProcTHOR\[[14](https://arxiv.org/html/2607.01766#bib.bib25),[19](https://arxiv.org/html/2607.01766#bib.bib73),[37](https://arxiv.org/html/2607.01766#bib.bib74)\] programmatically synthesises indoor environments for embodied agents\. Layout\-generation methods predict object placements from text, scene graphs, or partial context\[[47](https://arxiv.org/html/2607.01766#bib.bib34),[15](https://arxiv.org/html/2607.01766#bib.bib33),[16](https://arxiv.org/html/2607.01766#bib.bib23),[36](https://arxiv.org/html/2607.01766#bib.bib24),[45](https://arxiv.org/html/2607.01766#bib.bib32),[60](https://arxiv.org/html/2607.01766#bib.bib35)\], typically retrieving furniture from large asset libraries\[[13](https://arxiv.org/html/2607.01766#bib.bib20),[12](https://arxiv.org/html/2607.01766#bib.bib21)\]\. These systems share our artifact target, an editable, code\-defined Blender file, but their headline results target static objects or single\-room layouts; the dynamic regime, in which motion and physics are first\-class outputs, has not been demonstrated end\-to\-end from text\.

#### Render\-Inspect Loops and Editing Benchmarks\.

A parallel thread closes the verification loop by rendering the in\-progress scene and asking a VLM to identify discrepancies\. VIGA\[[77](https://arxiv.org/html/2607.01766#bib.bib38)\] formalises this as a code\-render\-inspect loop and is the closest published system to ours; it conditions on a reference image of the target and reports a 4D mode through qualitative figures only\. Alongside its system, VIGA introduces BlenderBench, an open\-ended editing suite of 27 tasks spanning spatial adjustments, progressive editing, and compositional generation, on which existing one\-shot baselines remain far below human performance\. A separate effort, BlenderGym\[[22](https://arxiv.org/html/2607.01766#bib.bib65)\], contributes 245 handcrafted editing tasks and explicitly identifies a class of failures uncaught by its own photometric and CLIP\-based metrics: scenes that match the goal pixels through the wrong mechanism\. SimWorlds retains the iterative\-verification design but drives it with both rendered previews and mechanism\-level signals \(bake state, modifier stack, fcurves, motion deltas\) read from the engine itself, and extends the loop to text\-only 4D\. Our edit mode runs unchanged on BlenderBench, enabling a direct comparison with VIGA on the benchmark it introduced\.

#### Neural Text\-to\-3D and Text\-to\-4D\.

A separate paradigm produces 3D and 4D content as neural representations rather than as graphics\-engine code\. Object\-level methods distil 2D priors into NeRFs or 3D Gaussians\[[48](https://arxiv.org/html/2607.01766#bib.bib6),[35](https://arxiv.org/html/2607.01766#bib.bib7),[9](https://arxiv.org/html/2607.01766#bib.bib8),[67](https://arxiv.org/html/2607.01766#bib.bib9),[27](https://arxiv.org/html/2607.01766#bib.bib10),[61](https://arxiv.org/html/2607.01766#bib.bib12),[73](https://arxiv.org/html/2607.01766#bib.bib11),[32](https://arxiv.org/html/2607.01766#bib.bib19),[62](https://arxiv.org/html/2607.01766#bib.bib13),[76](https://arxiv.org/html/2607.01766#bib.bib15),[10](https://arxiv.org/html/2607.01766#bib.bib14),[18](https://arxiv.org/html/2607.01766#bib.bib27)\], with multi\-view and image\-conditioned variants resolving Janus\-like inconsistencies\[[39](https://arxiv.org/html/2607.01766#bib.bib17),[55](https://arxiv.org/html/2607.01766#bib.bib16),[41](https://arxiv.org/html/2607.01766#bib.bib18)\]\. Scene\-level optimisation extends this to layouts, rooms, and gallery environments\[[11](https://arxiv.org/html/2607.01766#bib.bib31),[78](https://arxiv.org/html/2607.01766#bib.bib26),[25](https://arxiv.org/html/2607.01766#bib.bib29),[17](https://arxiv.org/html/2607.01766#bib.bib30),[83](https://arxiv.org/html/2607.01766#bib.bib28),[33](https://arxiv.org/html/2607.01766#bib.bib64)\]\. 
4D variants animate the representation with video priors\[[58](https://arxiv.org/html/2607.01766#bib.bib52),[3](https://arxiv.org/html/2607.01766#bib.bib53),[51](https://arxiv.org/html/2607.01766#bib.bib54),[72](https://arxiv.org/html/2607.01766#bib.bib56),[52](https://arxiv.org/html/2607.01766#bib.bib57),[4](https://arxiv.org/html/2607.01766#bib.bib55),[34](https://arxiv.org/html/2607.01766#bib.bib59),[71](https://arxiv.org/html/2607.01766#bib.bib58),[65](https://arxiv.org/html/2607.01766#bib.bib79)\], themselves drawing on text\-to\-image and text\-to\-video diffusion\[[53](https://arxiv.org/html/2607.01766#bib.bib1),[24](https://arxiv.org/html/2607.01766#bib.bib2),[6](https://arxiv.org/html/2607.01766#bib.bib5),[57](https://arxiv.org/html/2607.01766#bib.bib3),[7](https://arxiv.org/html/2607.01766#bib.bib4),[40](https://arxiv.org/html/2607.01766#bib.bib76),[63](https://arxiv.org/html/2607.01766#bib.bib75)\]\. Physics\-aware methods attach material parameters to existing fields\[[79](https://arxiv.org/html/2607.01766#bib.bib60),[70](https://arxiv.org/html/2607.01766#bib.bib61)\], while system\-identification methods recover per\-object physical parameters from video through differentiable simulation or learned neural constitutive models\[[38](https://arxiv.org/html/2607.01766#bib.bib78),[80](https://arxiv.org/html/2607.01766#bib.bib77)\]\. This output is visually compelling but is not our target artifact: it cannot be opened in Blender, resimulated, or composed with downstream tools, and its motion is learned rather than solved\. Kubric\[[21](https://arxiv.org/html/2607.01766#bib.bib62)\] renders programmatic Blender physics but is configured by code, not natural language; none of these produce the editable, physics\-driven scenes in Blender that our system targets\.

#### LLM Agents and Long\-Horizon Execution\.

Outside graphics, a parallel line of work studies what makes LLM agents reliable on long\-horizon tasks\. ReAct\[[75](https://arxiv.org/html/2607.01766#bib.bib50)\] and Reflexion\[[56](https://arxiv.org/html/2607.01766#bib.bib45)\] alternate reasoning with environment feedback; Self\-Refine\[[43](https://arxiv.org/html/2607.01766#bib.bib44)\] formalises iterative revision, while Huang and others \[[31](https://arxiv.org/html/2607.01766#bib.bib47)\] caution that LLMs cannot reliably self\-correct without external grounding; Voyager\[[64](https://arxiv.org/html/2607.01766#bib.bib49)\], Toolformer\[[54](https://arxiv.org/html/2607.01766#bib.bib51)\], CRITIC\[[20](https://arxiv.org/html/2607.01766#bib.bib46)\], and CodeAct\[[66](https://arxiv.org/html/2607.01766#bib.bib48)\] ground critique and action in tool calls and code; AutoGen\[[69](https://arxiv.org/html/2607.01766#bib.bib42)\], MetaGPT\[[26](https://arxiv.org/html/2607.01766#bib.bib40)\], and ChatDev\[[49](https://arxiv.org/html/2607.01766#bib.bib41)\] factorise tasks across role\-specialised agents\. Recent perspectives crystallise the load\-bearing components as context engineering\[[1](https://arxiv.org/html/2607.01766#bib.bib67),[44](https://arxiv.org/html/2607.01766#bib.bib69)\], tool design\[[2](https://arxiv.org/html/2607.01766#bib.bib68)\], and plan\-grounded execution\[[68](https://arxiv.org/html/2607.01766#bib.bib70),[46](https://arxiv.org/html/2607.01766#bib.bib43)\]\. SimWorlds adopts these as its spine; the contribution is not the principles but their instantiation for 4D scene generation in Blender: context scoped to typed stages, tools that read runtime state alongside previews, and a plan whose physics commitments are reconciled against the engine after every step\.

## 3 Method

SimWorlds is organised around one idea: a correct render does not guarantee a correctly built scene, so every construction step is validated against Blender’s engine state rather than its rendered image, and the scene is assembled through a fixed, checkable sequence of stages so that structural and mechanism errors are caught deterministically and early\. Concretely, SimWorlds turns a text prompt into a dynamic, editable scene by separating one\-shot planning from a stage\-by\-stage execute–verify–review loop \(Alg\. [1](https://arxiv.org/html/2607.01766#alg1)\)\. The planner first converts the prompt into a global scene plan that specifies objects, their spatial layout as a typed relation graph, physical properties, motion phases, and rendering intent\. Generation then proceeds through a fixed ordered sequence of construction stages \(modeling, UV, texture, deformation setup, motion, camera, light, render\); for each, the planner emits a tactical plan against the scene plan, and may opt a stage out for scenes that do not need it\. At each stage the coder writes a bpy script that extends a running scene, an orchestrator\-side verifier mechanically checks the resulting Blender state against a fixed protocol, and a reviewer agent then judges per\-stage acceptance criteria from rendered previews and runtime readouts before the pipeline advances\.

Figure [2](https://arxiv.org/html/2607.01766#S3.F2) sketches the full loop, which rests on two core mechanisms \(each validated by its own ablation, §[4\.3](https://arxiv.org/html/2607.01766#S4.SS3)\) and two enabling components\. The first core mechanism is a staged construction pipeline \(§[3\.1](https://arxiv.org/html/2607.01766#S3.SS1)\) that anchors every coder turn to a single structured specification rather than to an open\-ended task list, so retries and reviews are well\-scoped\. 
The second is a scene protocol layered on Blender’s natives and the orchestrator\-side verifier that enforces it \(§[3\.2](https://arxiv.org/html/2607.01766#S3.SS2)\); together they turn assembly correctness into a deterministic check, catching structural failures \(e\.g\. missing parent chains, ungrounded objects, undeclared interpenetrations\) before any visual verification runs\. Two enabling components complete the loop: a tool suite \(§[3\.3](https://arxiv.org/html/2607.01766#S3.SS3)\) that gives the verifier and reviewer direct runtime\-state readouts and multi\-angle previews, and an auto\-derived knowledge base \(§[3\.4](https://arxiv.org/html/2607.01766#S3.SS4)\) that supplies Blender\-specific procedural knowledge per stage\.

![Refer to caption](https://arxiv.org/html/2607.01766v1/x2.png)

Figure 2: Pipeline overview\. The planner compiles the prompt into a single scene plan; construction then proceeds through a fixed stage sequence, each stage running the coder, a deterministic verifier \(scene protocol, §[3\.2](https://arxiv.org/html/2607.01766#S3.SS2)\), and a reviewer, with failed checks triggering bounded local retries\. Once all stages close, the scene is rendered as a frame\-sequence video\.

### 3\.1 Staged Construction Pipeline

SimWorlds builds on a fixed pipeline of construction stages $\mathcal{S}$: modeling, UV, texture, deformation setup, motion, camera, light, and render, applied in the same order to every scene\. The order follows the dependency structure of a Blender scene: materials depend on UVs, deformers on geometry, motion on rigging, and lighting and camera on the assembled scene\. The planner emits a single global scene plan once, then a per\-stage tactical plan at each stage entry, and may opt a stage out when a scene does not need it \(e\.g\. UV/texture for primitive geometry, deformation setup for fully rigid scenes\)\. Staging this way keeps retries, checkpoints, and reviewer scope well\-defined per stage, while the global scene plan stays immutable across the run \(Alg\. [1](https://arxiv.org/html/2607.01766#alg1)\)\.
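The ordering argument above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the stage identifiers and the `DEPENDS_ON` edges are assumptions that encode one plausible reading of "materials depend on UVs, deformers on geometry, motion on rigging, and lighting and camera on the assembled scene".

```python
# Fixed stage order as listed in the text (identifier spellings are assumed).
STAGES = ["modeling", "uv", "texture", "deformation_setup",
          "motion", "camera", "light", "render"]

# Assumed dependency edges: stage -> stages that must come first.
DEPENDS_ON = {
    "uv": {"modeling"},
    "texture": {"uv"},
    "deformation_setup": {"modeling"},
    "motion": {"deformation_setup"},
    "camera": {"motion"},
    "light": {"motion"},
    "render": {"texture", "camera", "light"},
}


def respects_dependencies(order, depends_on):
    """True if every stage appears after all stages it depends on."""
    position = {stage: i for i, stage in enumerate(order)}
    return all(position[dep] < position[stage]
               for stage, deps in depends_on.items()
               for dep in deps)


def active_stages(order, opt_out):
    """Drop opted-out stages while keeping the fixed relative order."""
    return [stage for stage in order if stage not in opt_out]


# A scene built from primitive geometry might opt out of UV and texture.
print(active_stages(STAGES, {"uv", "texture"}))
```

Because opting out only removes stages and never reorders them, the remaining stages still satisfy every dependency among themselves.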

$$
\begin{array}{l}
\textbf{Input: } \text{prompt } q;\ \text{fixed stage sequence } \mathcal{S}=(s^{1},\ldots,s^{N});\ \text{initial scene state } \sigma_{0}=\emptyset \\
\mathcal{P}\leftarrow\mathrm{StrategicPlanner}(q) \quad \triangleright\ \text{scene plan: scene specification + relation graph + motion phases + per-stage opt-outs} \\
\textbf{for } k=1 \textbf{ to } N-1 \textbf{ do} \quad \triangleright\ \text{construction stages (render handled separately)} \\
\quad \textbf{if } s^{k}\in\mathcal{P}.\texttt{opt\_out} \textbf{ then } \sigma_{k}\leftarrow\sigma_{k-1};\ \textbf{continue} \quad \triangleright\ \text{skipped stage carries state forward} \\
\quad f\leftarrow\emptyset \quad \triangleright\ \text{feedback: verifier + reviewer reports} \\
\quad \textbf{repeat} \\
\qquad \tau_{k}\leftarrow\mathrm{TacticalPlanner}(s^{k},\mathcal{P},\sigma_{k-1}) \\
\qquad \sigma_{k}\leftarrow\mathrm{Exec}(\mathrm{Coder}(\tau_{k},\sigma_{k-1},f)) \\
\qquad v_{k}\leftarrow\mathrm{Verifier}(\sigma_{k},\mathcal{P}) \quad \triangleright\ \text{deterministic: scene protocol (Sec. 3.2)} \\
\qquad \rho_{k}\leftarrow\mathrm{Reviewer}(\tau_{k},\mathcal{P},\sigma_{k}) \quad \triangleright\ \text{perceptual: per-stage criteria} \\
\qquad f\leftarrow(v_{k},\rho_{k});\ d\leftarrow\mathrm{Planner}(v_{k},\rho_{k}) \quad \triangleright\ \text{advance / retry / replan} \\
\quad \textbf{until } d=\texttt{advance} \textbf{ or } \text{retries/replans exhausted } (\Rightarrow \text{abort}) \quad \triangleright\ \text{retries/replans bounded; see App. A} \\
\quad \mathrm{Checkpoint}(\sigma_{k},s^{k}) \\
\mathcal{O}\leftarrow\mathrm{Render}(\sigma_{N-1}) \quad \triangleright\ \text{frame sequence} \\
\rho^{\star}\leftarrow\mathrm{FinalReviewer}(\mathcal{P},\mathcal{O}) \\
\textbf{if } \rho^{\star}=\texttt{needs\_fix} \textbf{ then} \\
\quad \sigma_{N-1}\leftarrow\mathrm{Exec}(\mathrm{Coder}(\rho^{\star},\sigma_{N-1}));\ \mathcal{O}\leftarrow\mathrm{Render}(\sigma_{N-1}) \\
\textbf{return } (\mathcal{O},\sigma_{N-1})
\end{array}
$$

Algorithm 1: SimWorlds staged construction loop\. The strategic planner emits one global scene plan; construction then runs the fixed stage sequence\. Each stage repeats coder, deterministic verifier \(scene protocol, §[3\.2](https://arxiv.org/html/2607.01766#S3.SS2)\), and reviewer until the planner advances it, so a hard verifier or reviewer failure cannot pass; the stage aborts once its retry and replan budgets are exhausted\. After all stages close, the scene is rendered and a final reviewer judges the output\.

#### Scene Plan\.

The scene plan consists of three parts: a scene specification \(objects with dimensions, PBR material, position, and physics role; a typed relation graph over objects encoding spatial assembly; coordinated groups; lighting; camera; render\), a motion plan for dynamic scenes, and a list of stages to opt out for the current scene\. Full field schemas are listed in Appendix [A](https://arxiv.org/html/2607.01766#A1)\.
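To make the three\-part structure concrete, here is a hedged sketch of a scene plan as Python dataclasses. The class names, field names, and units are illustrative assumptions; the authoritative schemas are the ones in the paper's appendix, and groups, lighting, camera, and render settings are omitted here for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectSpec:
    name: str
    dimensions: tuple[float, float, float]  # assumed to be metres
    material: str                           # PBR material label
    position: tuple[float, float, float]
    physics_role: str                       # e.g. "cloth", "rigid_passive" (assumed labels)


@dataclass
class Relation:
    kind: str            # typed spatial relation (names here are hypothetical)
    objects: list[str]


@dataclass
class MotionPhase:
    actor: str
    start_frame: int
    end_frame: int
    description: str


@dataclass
class ScenePlan:
    objects: list[ObjectSpec]
    relations: list[Relation] = field(default_factory=list)
    motion_plan: list[MotionPhase] = field(default_factory=list)
    opt_out: set[str] = field(default_factory=set)

    def undefined_references(self):
        """Names used by relations or motion phases that no object declares."""
        known = {obj.name for obj in self.objects}
        used = {name for rel in self.relations for name in rel.objects}
        used |= {phase.actor for phase in self.motion_plan}
        return sorted(used - known)
```

A plan\-level consistency check such as `undefined_references` is one cheap thing an immutable global plan makes possible before any stage runs.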

#### Per\-Stage Execution Loop\.

At stage $s^{k}$, the orchestrator opens a fresh coder and reviewer session, while the planner persists across the whole run, and drives a per\-stage state machine\. The tactical planner first emits a plan $\tau_{k}$ specifying the assets to realise, the relations to satisfy, the per\-stage acceptance criteria, and the perceptual judgments the reviewer should answer\. The coder extends the live scene with a bpy script; the verifier of §[3\.2](https://arxiv.org/html/2607.01766#S3.SS2) checks the assembled state; and the reviewer judges $\tau_{k}$’s perceptual criteria from rendered previews and the structured scene readout, additionally consuming per\-phase motion evidence on motion stages\. On this evidence the planner chooses one of four transitions: it can advance to the next stage; retry, having the coder reuse its session to fix the current attempt from the verifier and reviewer feedback; replan, revising the tactical plan itself; or abort\. A hard verifier failure or a blocking review can never advance: the orchestrator forces a retry, so the stage loops through coder, verifier, reviewer, and decision until it passes or its bounded retry and replan budgets are exhausted\.
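The four\-way transition can be sketched as a small state machine. This is an illustrative sketch under stated assumptions: `attempt` and `decide` are hypothetical callables standing in for the coder/verifier/reviewer pass and the planner's decision, and the budget defaults are placeholders rather than the values used in the paper.

```python
def run_stage(attempt, decide, max_retries=2, max_replans=1):
    """Loop one stage until it advances, aborts, or exhausts its budgets.

    attempt(replan: bool) -> (verifier_ok: bool, review_ok: bool)
    decide(verifier_ok, review_ok) -> "advance" | "retry" | "replan" | "abort"
    """
    retries = replans = 0
    replan = False
    while True:
        verifier_ok, review_ok = attempt(replan)
        decision = decide(verifier_ok, review_ok)
        # A hard verifier failure or blocking review can never advance:
        # the orchestrator overrides the planner and forces a retry.
        if decision == "advance" and not (verifier_ok and review_ok):
            decision = "retry"
        if decision in ("advance", "abort"):
            return decision
        if decision == "retry":
            retries += 1
            replan = False
            if retries > max_retries:
                return "abort"
        else:  # "replan": revise the tactical plan before the next attempt
            replans += 1
            replan = True
            if replans > max_replans:
                return "abort"
```

Even with a planner that always answers `"advance"`, the override keeps a failing stage from passing, and the counters guarantee termination.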

#### Final Render and Review\.

Once every construction stage closes, the runner renders the scene as a single frame for static scenes or a frame\-sequence video for dynamic ones, and a separate final reviewer judges the rendered result against the plan\. A failed verdict rolls back to the last stage checkpoint, re\-invokes the coder once with consolidated fix instructions, and re\-renders\.

#### Edit Mode\.

The same loop accommodates edit\-style prompts at no architectural cost\. Given an existing \.blend file, a natural\-language instruction, and an optional reference image of the target, the runner loads the file, populates $s_{0}$ from the file’s scene state, attaches the reference image as additional planner context, and asks the planner to decompose the request into edit\-only tasks \(additions, modifications, removals\) rather than redoing static elements that already exist correctly\.

### 3\.2 Scene Protocol and Verifier

Blender’s native scene graph \(parent–child, collections, modifiers, drivers, constraints\) expresses transform inheritance, organisation, and runtime dependencies, but it does not express assembly correctness\. There is no native primitive for “these meshes form one logical object,” “surface $A$ touches surface $B$,” or “this object is grounded\.” Without these notions every coder turn must be re\-validated by visual inspection, and the failure modes the rendered image cannot show \(a chair leg floating 5 mm above the floor, a tabletop interpenetrating a wall, an object never parented to anything\) silently accumulate\. SimWorlds layers a small, declarative protocol on top of Blender’s natives and enforces it with an orchestrator\-side verifier that runs automatically after every coder turn, catching structural and mechanism errors before they ever reach a render\.

#### Protocol structure\.

The protocol imposes a strict three\-level containment hierarchy, built entirely from Blender’s native collections, custom properties, and parenting\. L1 is the scene as a whole\. Each L2 inside it is a system grouping: a named set of related objects, such as one room, or a holder of scene\-level state, such as the lights and cameras\. Each L3 inside an L2 is a single logical object, however many meshes it is built from: it gathers all of those meshes under one root Empty, a geometry\-free anchor object, and parents every mesh to it, so the object moves and is checked as a unit\. Since a Blender collection carries no inherent role, a custom property tags each one as an L2 or an L3; collections left untagged are ordinary organisational collections that lie outside the protocol\.

On top of this hierarchy the protocol makes explicit two object\-to\-object relations that Blender’s scene graph leaves implicit\. The first is surface contact, where one object rests on or against another, such as a chair leg on the floor; it is declared on either of the two objects and the verifier later confirms that the surfaces actually meet\. The second is co\-movement, where one object must travel with another, such as a sword carried in a hand; it is expressed through Blender’s native parent–child constraints\.
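The containment rules for a single L3 object can be illustrated without Blender. The sketch below operates on plain dictionaries that stand in for a tagged collection and its objects; the dictionary keys are assumptions made for this example and are not the property names used by SimWorlds.

```python
def check_l3(collection):
    """Return a list of violations for one L3 collection.

    `collection` is a plain dict standing in for a Blender collection:
    {"name": str, "parent_level": "L2" or None,
     "objects": [{"name": str, "type": "EMPTY" or "MESH", "parent": str or None}]}
    """
    problems = []
    if collection.get("parent_level") != "L2":
        problems.append(f"{collection['name']}: L3 must sit inside an L2")

    # Exactly one geometry-free root anchor is expected per logical object.
    roots = [obj for obj in collection["objects"]
             if obj["type"] == "EMPTY" and obj["parent"] is None]
    if len(roots) != 1:
        problems.append(
            f"{collection['name']}: expected one root Empty, found {len(roots)}")
        return problems

    # Every mesh must be parented to that root so the object moves as a unit.
    root = roots[0]["name"]
    for obj in collection["objects"]:
        if obj["type"] == "MESH" and obj["parent"] != root:
            problems.append(f"{obj['name']}: mesh not parented to {root}")
    return problems
```

In a live scene the same facts would be read from the collection tree and each object's parent pointer; the check itself stays a deterministic function of that state.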

#### Bipartite plan graph\.

The planner commits the scene’s intended assembly up front as a bipartite plan graph: one set of nodes for the objects, another for the relations among them\. Each relation node carries a typed spatial relation and the objects it relates, spanning support, containment, orientation, and regular arrangement, such as one object resting on another or several arranged in a ring \(the full vocabulary and its rules are in Appendix [B](https://arxiv.org/html/2607.01766#A2)\)\. Because relations are nodes, not edges, one relation can span several objects at once\. The coder realises this graph as the collection tree and the contact declarations above, giving the verifier a stated assembly intent to check\.
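A bipartite plan graph of this kind is straightforward to represent. The following sketch is illustrative only: the `PlanGraph` class and the relation kinds `ring_arrangement` and `rests_on` are hypothetical names, not the vocabulary defined in the paper's appendix.

```python
from dataclasses import dataclass, field


@dataclass
class PlanGraph:
    objects: set = field(default_factory=set)      # object nodes
    relations: dict = field(default_factory=dict)  # relation id -> (kind, members)

    def add_relation(self, rel_id, kind, members):
        """Add a relation node linked to every object it relates."""
        missing = [m for m in members if m not in self.objects]
        if missing:
            raise ValueError(f"unknown objects: {missing}")
        self.relations[rel_id] = (kind, tuple(members))

    def edges(self):
        """Bipartite edges: each links one relation node to one object node."""
        return [(rel_id, obj)
                for rel_id, (_, members) in self.relations.items()
                for obj in members]

    def relations_of(self, obj):
        """Ids of all relation nodes that touch the given object."""
        return sorted(rel_id for rel_id, (_, members) in self.relations.items()
                      if obj in members)


graph = PlanGraph(objects={"leg1", "leg2", "leg3", "floor"})
graph.add_relation("r1", "ring_arrangement", ["leg1", "leg2", "leg3"])
graph.add_relation("r2", "rests_on", ["leg1", "floor"])
```

Here `r1` relates three objects through a single node, which an edge\-based encoding could only express as several pairwise edges.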

#### Rule families\.

The verifier runs four families of deterministic rules against the live Blender state\. Structural rules C1–C6 check the collection layout, and G1–G3 check the anchor\-and\-parent chains within each object\. Geometric rules V1–V2 are BVH distance tests confirming that every declared contact pair actually touches and that no undeclared pair interpenetrates, which catches floating contacts and accidental punch\-through\. Soft rules W1–W5 raise warnings about grounding, within\-object connectivity, and missing cameras or lights\. Plan\-vs\-state rules R1–R18, with seven further motion\-timing rules, re\-check the planner’s relation graph and motion phases against the realised geometry, for example that a declared circular leg arrangement is realised to tolerance and that a settle phase comes to rest in its final frames\. The full catalogue and BVH\-test details are in Appendix [B](https://arxiv.org/html/2607.01766#A2)\.
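The logic of the two geometric rules can be illustrated with a deliberately simplified stand\-in. The sketch below uses axis\-aligned bounding boxes instead of the BVH distance tests the paper describes, so it only approximates real mesh contact; the function names and the 1 mm tolerance are assumptions.

```python
from itertools import combinations


def aabb_gap(a, b):
    """Signed separation between two axis-aligned boxes.

    Each box is ((min_x, min_y, min_z), (max_x, max_y, max_z)).
    Positive: distance between the boxes. Zero: touching. Negative: the
    boxes overlap on every axis, reported as minus the smallest overlap depth.
    """
    (amin, amax), (bmin, bmax) = a, b
    per_axis = [max(bmin[i] - amax[i], amin[i] - bmax[i]) for i in range(3)]
    positive = [g for g in per_axis if g > 0]
    if positive:
        return sum(g * g for g in positive) ** 0.5
    return max(per_axis)


def check_contacts(boxes, declared, tol=1e-3):
    """V1-style: declared pairs must touch. V2-style: others must not overlap."""
    declared = {frozenset(pair) for pair in declared}
    issues = []
    for a, b in combinations(sorted(boxes), 2):
        gap = aabb_gap(boxes[a], boxes[b])
        if frozenset((a, b)) in declared:
            if abs(gap) > tol:
                issues.append(("V1", a, b))  # floating or sunk contact
        elif gap < -tol:
            issues.append(("V2", a, b))      # undeclared interpenetration
    return issues


# A chair leg floating 5 mm above the floor, with the contact declared.
boxes = {"floor": ((-1, -1, -0.1), (1, 1, 0.0)),
         "leg": ((0, 0, 0.005), (0.1, 0.1, 0.5))}
print(check_contacts(boxes, [("leg", "floor")]))
```

A 5 mm gap is invisible in most renders but exceeds the tolerance here, which is the kind of failure a deterministic geometric rule is meant to surface.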

### 3\.3 Tool Suite

Verification in prior Blender\-agent systems\[[29](https://arxiv.org/html/2607.01766#bib.bib36),[59](https://arxiv.org/html/2607.01766#bib.bib37),[42](https://arxiv.org/html/2607.01766#bib.bib66),[77](https://arxiv.org/html/2607.01766#bib.bib38)\] reduces to visual critique: render, feed to a VLM, request issues\. This is often enough for static scenes but cannot tell whether a dynamic effect uses the right mechanism or is merely faked to look identical\. SimWorlds replaces visual self\-critique with direct inspection of Blender’s runtime state, through a tool suite \(full inventory in Appendix Table [4](https://arxiv.org/html/2607.01766#A1.T4)\) grouped into state observation, modification, and knowledge access\. Three tools supply the evidence frames cannot: blender\_scene\_state returns a structured readout of the live scene \(collections, modifier stacks, physics caches, fcurve channels\), showing what was actually built rather than rendered; a multi\-granularity preview family renders a mesh, an L3 object, an L2 system, or the whole scene from fixed multi\-angle presets; and blender\_motion\_sheet\_preview samples a per\-actor frame strip annotated with phase boundaries and bake state\.
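As an illustration of why a state readout distinguishes mechanisms that render identically, the sketch below classifies how a draped cloth was produced. The readout layout is a hypothetical stand\-in for whatever blender\_scene\_state actually returns, and the returned labels are invented for this example; only the `"CLOTH"` modifier type string follows Blender's own naming.

```python
def audit_cloth(readout, obj_name):
    """Classify how an object's drape is produced, from a state readout.

    `readout` is a hypothetical dict shaped like
    {obj: {"modifiers": [{"type": "CLOTH", "cache_baked": bool}, ...],
           "shape_keys": int, "fcurves": [data_path, ...]}}
    """
    state = readout[obj_name]
    cloth = [m for m in state["modifiers"] if m["type"] == "CLOTH"]
    if cloth and all(m.get("cache_baked") for m in cloth):
        return "solver"            # simulated and baked: the intended mechanism
    if cloth:
        return "solver_unbaked"    # solver attached but no baked cache
    if state.get("shape_keys", 0) > 0:
        return "shape_keys"        # hand-authored deformation
    if state.get("fcurves"):
        return "keyframed"         # animated, not simulated
    return "static"


readout = {
    "tablecloth": {"modifiers": [{"type": "CLOTH", "cache_baked": True}],
                   "shape_keys": 0, "fcurves": []},
    "fake_cloth": {"modifiers": [], "shape_keys": 4, "fcurves": []},
}
print(audit_cloth(readout, "tablecloth"), audit_cloth(readout, "fake_cloth"))
```

Both objects could produce the same frames, yet only the first would stay editable and resimulable, which is the distinction the tool suite is designed to expose.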

### 3\.4 Knowledge

4D Blender content draws on too many subsystems \(meshes, PBR shaders, the rigid\-body world, cloth and fluid solvers, particle systems, keyframe animation, the compositor\) for any single system prompt to cover\. SimWorlds therefore exposes knowledge through a single tool, blender\_docs\(query\), backed by a knowledge base auto\-derived per Blender version from upstream sources: the bl\_rna schemas of every bpy\.types class, every bpy\.ops operator’s signature, the official Python API guides, and the full Blender manual\. A query resolves to an exact schema, a manual page, or a ranked candidate list; the base regenerates automatically when the Blender version changes, with no LLM in the build except a content\-addressed page\-summary step\.
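
The three outcomes of a query \(exact schema, manual page, ranked candidate list\) can be sketched as a small resolver\. This is an illustration only: the function name resolve, the dictionary layout, the placeholder page bodies, and the use of difflib for ranking are assumptions, not the released blender\_docs implementation\.

```python
import difflib


def resolve(query: str, schemas: dict[str, str], manual: dict[str, str],
            limit: int = 3) -> dict:
    """Resolve a query to a schema, a manual page, or ranked candidates."""
    key = query.strip()
    if key in schemas:
        # Exact hit on an API name, e.g. a bpy.types class.
        return {"kind": "schema", "key": key, "body": schemas[key]}
    if key in manual:
        # Exact hit on a manual page identifier.
        return {"kind": "manual_page", "key": key, "body": manual[key]}
    # Otherwise fall back to fuzzy matching over every known name.
    names = list(schemas) + list(manual)
    ranked = difflib.get_close_matches(key, names, n=limit, cutoff=0.3)
    return {"kind": "candidates", "candidates": ranked}


# Hypothetical knowledge-base contents with placeholder bodies.
schemas = {
    "bpy.types.ClothModifier": "<schema text>",
    "bpy.types.FluidModifier": "<schema text>",
}
manual = {"physics/cloth/introduction": "<manual page text>"}

exact = resolve("bpy.types.ClothModifier", schemas, manual)
page = resolve("physics/cloth/introduction", schemas, manual)
fuzzy = resolve("ClothModifer", schemas, manual)  # misspelled on purpose
print(exact["kind"], page["kind"], fuzzy["kind"])
```

Returning a candidate list instead of failing lets the calling agent retry with an exact name rather than guess at an API\.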

## 4 Experiments

### 4\.1 Setup

#### Benchmark\.

We evaluate SimWorlds on 4DBuildBench, a 50\-scene benchmark of self\-contained prompts for text\-to\-4D Blender scene generation\. 4DBuildBench is organised along two axes: the *mechanism* an artifact must use, and the *difficulty* of authoring it correctly\. The mechanism axis covers the five core Blender physics solvers \(cloth, fluid, rigid body, particle systems, and soft body\), each as a category of 9 prompts, plus a static category of 5 furnished\-scene prompts that exercise object inventory and spatial layout\. The difficulty axis defines three levels per dynamic category, three prompts each, escalating from a single solver actor \(D1\) to within\-category complexity \(D2, e\.g\. self\-collision or force fields\) to cross\-category interaction in one shot \(D3, e\.g\. a rigid block crushing a soft\-body slab\); per\-level definitions and example prompts are in Appendix [D](https://arxiv.org/html/2607.01766#A4)\. Editing is evaluated separately on BlenderBench \[[77](https://arxiv.org/html/2607.01766#bib.bib38)\] via our edit mode \(§[3\.1](https://arxiv.org/html/2607.01766#S3.SS1)\)\.
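
The benchmark composition stated above can be checked with a few lines of arithmetic\. The category and level identifiers below are hypothetical labels chosen for the sketch; only the counts come from the text\.

```python
from itertools import product

# Hypothetical identifiers for the five dynamic categories and three levels.
DYNAMIC_CATEGORIES = ["cloth", "fluid", "rigid_body", "particles", "soft_body"]
DIFFICULTY_LEVELS = ["D1", "D2", "D3"]
PROMPTS_PER_CELL = 3   # three prompts per (category, level) cell
STATIC_PROMPTS = 5     # furnished-scene prompts with no dynamics

dynamic_cells = list(product(DYNAMIC_CATEGORIES, DIFFICULTY_LEVELS))
prompts_per_category = len(DIFFICULTY_LEVELS) * PROMPTS_PER_CELL
total_prompts = len(dynamic_cells) * PROMPTS_PER_CELL + STATIC_PROMPTS

print(len(dynamic_cells), prompts_per_category, total_prompts)
```

The 15 dynamic cells account for 45 prompts, and the static category supplies the remaining 5 of the 50 scenes\.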

#### Evaluation Metrics\.

The central failure mode in dynamic\-scene generation is a scene that looks right but is built through the wrong mechanism\. A purely visual judge cannot see this, while an engine\-state check cannot see whether the prompt’s objects and motions are visually delivered\. 4DBuildBench therefore scores each scene on two complementary tracks, detailed in Appendix [C](https://arxiv.org/html/2607.01766#A3)\.

*\(1\) Engine\-state audit\.* A deterministic audit runs in headless Blender and reads the scene’s runtime state directly: solver modifiers, baked caches, collision partners, constraints, force fields, and the keyframe density that betrays faked motion\. It checks this state against a per\-scene ground\-truth mechanism specification drawn from a library of 42 typed predicates \(e\.g\. *cloth modifier present and cache baked*; *static scene carries no baked physics caches*\)\. We report two aggregates: MPR \(Mechanism Pass Rate\), the per\-scene mean over actors of the fraction of each actor’s predicates satisfied; and SPR \(Structural Pass Rate\), the per\-scene fraction of declared spatial relations that hold geometrically under a BVH surface\-distance test\. The audit shares no code with SimWorlds’s in\-loop verifier and reads the \.blend alone; where a baseline emits no logical\-object grouping, SPR infers each object’s assembly geometrically, scoring all systems on identical terms\.
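
The two aggregates follow directly from their definitions\. In the sketch below the function names, the input layout, and the choice to return None for a scene with nothing to score are assumptions; the predicate outcomes are made\-up booleans, not benchmark data\.

```python
from typing import Optional


def mechanism_pass_rate(actor_predicates: dict[str, list[bool]]) -> Optional[float]:
    """Per-scene MPR: mean over actors of each actor's satisfied fraction."""
    fractions = [sum(results) / len(results)
                 for results in actor_predicates.values() if results]
    if not fractions:
        return None  # assumption: no scored actors means MPR is undefined
    return sum(fractions) / len(fractions)


def structural_pass_rate(relation_holds: list[bool]) -> Optional[float]:
    """Per-scene SPR: fraction of declared spatial relations that hold."""
    if not relation_holds:
        return None  # assumption: no declared relations means SPR is undefined
    return sum(relation_holds) / len(relation_holds)


# Hypothetical audit outcome for one scene with two actors.
scene_predicates = {
    "curtain": [True, True, False, True],  # 3 of 4 predicates satisfied
    "fan": [True, True],                   # 2 of 2 predicates satisfied
}
scene_relations = [True, True, False, True]

mpr = mechanism_pass_rate(scene_predicates)
spr = structural_pass_rate(scene_relations)
print(mpr, spr)
```

Averaging per actor first means an actor with many predicates does not outweigh one with few\.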

*\(2\) Itemized VLM judge\.* Building on VLM\- and LLM\-as\-judge evaluation for generative vision \[[81](https://arxiv.org/html/2607.01766#bib.bib80),[28](https://arxiv.org/html/2607.01766#bib.bib81),[23](https://arxiv.org/html/2607.01766#bib.bib82),[8](https://arxiv.org/html/2607.01766#bib.bib83)\], a GPT\-5\.5 judge receives the prompt and five frames sampled uniformly across the clip \($t \in \{0, 25, 50, 75, 100\}\%$\) and scores five dimensions: objects present, spatial relations, actions visible, visual quality, and aesthetics\. Rather than emit numeric scores, it enumerates atomic items from the prompt \(one per named object, relation, and action\) and returns a binary verdict per item; the dimension score is the fraction that hold, giving partial credit and a concrete evidence trail\. Mechanism realism is left to the audit track, since vision models judge it unreliably \[[5](https://arxiv.org/html/2607.01766#bib.bib72)\]\.
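
The frame sampling and the itemized scoring can be sketched as follows\. The function names, the rounding of sampled positions to whole frames, and the example verdicts are assumptions made for illustration; the judge itself is a VLM and is not reproduced here\.

```python
def sample_frames(frame_start: int, frame_end: int,
                  fractions=(0.0, 0.25, 0.5, 0.75, 1.0)) -> list[int]:
    """Frame numbers at the given fractions of the clip (rounded to frames)."""
    span = frame_end - frame_start
    return [frame_start + round(f * span) for f in fractions]


def dimension_score(verdicts: list[bool]) -> float:
    """Fraction of atomic items judged to hold for one dimension."""
    return sum(verdicts) / len(verdicts)


# Hypothetical binary verdicts, one per atomic item taken from a prompt.
itemized = {
    "objects present": {"curtain": True, "desk": True, "window": True, "fan": False},
    "actions visible": {"curtain billows": True, "papers flutter": False},
}

frames = sample_frames(1, 101)
scores = {dim: dimension_score(list(items.values()))
          for dim, items in itemized.items()}
print(frames, scores)
```

Keeping the per\-item verdicts alongside the score preserves the evidence trail: a low score can be traced to the specific items that failed\.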

#### Baseline\.

We compare against VIGA \[[77](https://arxiv.org/html/2607.01766#bib.bib38)\], the only prior published system with a dedicated dynamic\-scene mode for from\-scratch 4D scene generation in Blender\. VIGA exposes a dynamic\-scene mode alongside its static\-scene mode, runs a dual\-agent generator–verifier loop, and emits an editable Blender file, which makes it directly runnable on 4DBuildBench prompts\. It also shares the dual\-agent shape with SimWorlds, so the comparison isolates the contributions of the staged construction pipeline, the scene protocol and verifier, and the engine\-level tool suite \(see Footnote 1\)\.

Footnote 1: Other recent LLM\-driven 3D scene systems target adjacent settings and are not directly runnable on 4DBuildBench\. LL3M \[[42](https://arxiv.org/html/2607.01766#bib.bib66)\] routes generation through a closed cloud server with no code or model release\. SceneCraft \[[29](https://arxiv.org/html/2607.01766#bib.bib36)\] produces static layouts only, with no physics or temporal output\. Holodeck \[[74](https://arxiv.org/html/2607.01766#bib.bib22)\] outputs AI2\-THOR rooms, a different artifact type\. BlenderGym \[[22](https://arxiv.org/html/2607.01766#bib.bib65)\] methods operate in editing mode against reference images\. Concurrent VoxelCodeBench \[[82](https://arxiv.org/html/2607.01766#bib.bib71)\] targets static voxel construction in Unreal\.

### 4\.2 Results

#### Quantitative Results\.

We evaluate SimWorlds and VIGA on 4DBuildBench and score both with the engine\-state audit and the itemized VLM judge\. The results are reported in Table [1](https://arxiv.org/html/2607.01766#S4.T1)\.

Table 1: Per\-category results on 4DBuildBench\. Overall is the macro average over cells: 15 dynamic cells for MPR and all 17 for SPR and VLM score\.

The results show that MPR is where SimWorlds and VIGA diverge most decisively \(0\.87 vs 0\.67\), while VLM stays comparable \(0\.82 vs 0\.78\)\. The reason is that the VLM judge scores only a few frames sampled at fixed intervals, which cannot reveal whether the motion across them is correct, so a keyframed or shape\-key scene can be wrong yet score well when each sampled frame looks right on its own; the engine\-state audit instead reads the solver state behind those frames and rejects such scenes\. The same blind spot is what lets VIGA’s visual\-only verifier accept such fakes during generation\. SPR moves in the same direction \(\+0\.19\), but because it scores spatial structure against object groupings that SimWorlds emits and baselines do not, its cross\-system fairness is limited; we mitigate the confound geometrically \(Appendix [C](https://arxiv.org/html/2607.01766#A3)\) and read SPR as supporting evidence\.
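
A macro average over cells weights every cell equally regardless of how many scenes it holds, which differs from pooling all scenes\. The sketch below contrasts the two; the cell keys and scores are invented for illustration and are not results from Table 1\.

```python
from statistics import mean


def macro_average(scores_by_cell: dict[str, list[float]]) -> float:
    """Mean of per-cell means: every cell counts once."""
    return mean(mean(scores) for scores in scores_by_cell.values())


def micro_average(scores_by_cell: dict[str, list[float]]) -> float:
    """Mean over all scenes pooled together: every scene counts once."""
    return mean(s for scores in scores_by_cell.values() for s in scores)


# Hypothetical per-scene scores grouped by cell.
example = {
    "cloth/D1": [1.0, 1.0, 1.0],
    "fluid/D3": [0.0],
}

macro = macro_average(example)
micro = micro_average(example)
print(macro, micro)
```

With unequal cell sizes the two averages diverge, so the choice of macro averaging keeps a small cell from being swamped by larger ones\.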

#### Qualitative Results\.

As shown in Fig\. [3](https://arxiv.org/html/2607.01766#S4.F3), across all four sequences SimWorlds realises the correct mechanism while VIGA does not\. The common cause is architectural: VIGA closes its loop with a VLM verifier that inspects sampled frames, which is insufficient to tell whether motion or physics is actually correct, so a scene whose sampled frames look right is accepted however its dynamics were produced\. SimWorlds instead verifies each stage against engine state through its protocol verifier \(§[3\.2](https://arxiv.org/html/2607.01766#S3.SS2)\), so that faked keyframe motion is caught at construction time\.

![Refer to caption](https://arxiv.org/html/2607.01766v1/figures/qualtative.png)

Figure 3: Qualitative comparison across four dynamic sequences: \(a\) a domino cascade in a child’s bedroom; \(b\) a paper airplane landing on a cluttered desk; \(c\) a wind\-blown curtain billowing over a desk; \(d\) a glass hourglass pouring sand from its upper to its lower bulb\.

#### Scene Editing\.

SimWorlds runs in edit mode \(§[3\.1](https://arxiv.org/html/2607.01766#S3.SS1)\) on VIGA’s BlenderBench \[[77](https://arxiv.org/html/2607.01766#bib.bib38)\] without architectural change, at a matched 6\-iteration budget on Opus 4\.7, and is scored by VIGA’s own PL, N\-CLIP, and VLM metrics \(Table [2](https://arxiv.org/html/2607.01766#S4.T2); full setup in Appendix [E](https://arxiv.org/html/2607.01766#A5)\)\. Both systems receive the same input: an existing scene with a text instruction and a target image\. VIGA edits through a single generator\-verifier loop that re\-examines the whole scene each iteration; SimWorlds instead decomposes the request and routes it through the staged pipeline\. Localising each edit in this way, together with the planning\-coding\-reviewing loop carried over from the generation pipeline \(§[3\.1](https://arxiv.org/html/2607.01766#S3.SS1), §[3\.3](https://arxiv.org/html/2607.01766#S3.SS3)\), leads to SimWorlds’s improved performance over VIGA across all difficulty levels on BlenderBench\.

Table 2: Scene\-editing results on BlenderBench\.

### 4\.3 Ablations

We ablate the two mechanisms SimWorlds adds to a plain planner–coder–reviewer loop: the scene protocol with its deterministic verifier \(§[3\.2](https://arxiv.org/html/2607.01766#S3.SS2)\), and the staged construction pipeline \(§[3\.1](https://arxiv.org/html/2607.01766#S3.SS1)\)\. Experiments are conducted on a 15\-scene subset, whose absolute scores sit above the full\-50 numbers of Table [1](https://arxiv.org/html/2607.01766#S4.T1)\.

Table 3: Ablation results on a fixed 15\-scene subset of 4DBuildBench: one scene per mechanism category $\times$ difficulty level\.

On this 15\-scene subset the two ablations show a double dissociation\. Removing the verifier hurts SPR most: without the deterministic gate, structural breaks reach the final scene uncorrected instead of triggering a retry, most visibly on the multi\-object D3 scenes, two of which drop to zero SPR\. Removing staged construction hurts VLM most: a single pass still assembles the right objects but loses the per\-stage review that keeps materials, lighting, and composition on track\. MPR is robust to both, since the solver setup itself is authored reliably regardless\. Each mechanism thus guards a different axis, structural correctness and visual polish, and neither is redundant with the other\.

## 5 Limitations

SimWorlds grounds mechanism and geometry against the engine, but its perceptual judgments, such as whether the plan follows the prompt, whether objects are arranged sensibly, and whether the scene composes well, still rest on the LLM and VLM rather than the deterministic checks, and are correspondingly less reliable\. SimWorlds is also text\-only; conditioning on a reference image to supervise layout and appearance is a promising direction for future work\.

## 6 Conclusion

We presented SimWorlds, a multi\-agent framework for text\-to\-4D scene generation in Blender that emits editable \.blend files whose dynamics are realised by physics solvers rather than imitated by hand\-authored animation, together with 4DBuildBench, a benchmark that scores generated scenes on both visual quality and mechanism correctness\. SimWorlds structures generation as a fixed staged pipeline that builds and verifies the scene one stage at a time, so errors are caught and repaired locally instead of accumulating across the build\. A lightweight scene protocol keeps the generated assets well\-organised and verifiable, leaving the VLM reviewer the perceptual judgments, such as aesthetic composition and prompt alignment, that deterministic checks cannot make\. We see SimWorlds as a step toward procedural agents whose output is not a piece of media but an editable, physics\-driven asset that others can pick up, perturb, and build on\.

## Acknowledgements

YZ was supported in part by the SoftBank Group–ARM Fellowship\.

## References

- \[1\]Anthropic\(2025\)Effective context engineering for AI agents\.External Links:[Link](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[2\]Anthropic\(2025\)Writing tools for agents\.External Links:[Link](https://www.anthropic.com/engineering/writing-tools-for-agents)Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[3\]S\. Bahmaniet al\.\(2024\)4D\-fy: text\-to\-4d generation using hybrid score distillation sampling\.InCVPR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[4\]S\. Bahmaniet al\.\(2024\)TC4D: trajectory\-conditioned text\-to\-4d generation\.InECCV,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[5\]H\. Bansal, Z\. Lin, T\. Xie, Z\. Zong, M\. Yarom, Y\. Bitton, C\. Jiang, Y\. Sun, K\. Chang, and A\. Grover\(2024\)VideoPhy: evaluating physical commonsense for video generation\.arXiv:2406\.03520\.Cited by:[§C\.2](https://arxiv.org/html/2607.01766#A3.SS2.p1.2),[§4\.1](https://arxiv.org/html/2607.01766#S4.SS1.SSS0.Px2.p3.1)\.
- \[6\]A\. Blattmann, T\. Dockhorn, S\. Kulal, D\. Mendelevitch, M\. Kilian, D\. Lorenz, Y\. Levi, Z\. English, V\. Voleti, A\. Letts, V\. Jampani, and R\. Rombach\(2023\)Stable video diffusion: scaling latent video diffusion models to large datasets\.arXiv:2311\.15127\.Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[7\]T\. Brooks, B\. Peebles, C\. Holmes,et al\.\(2024\)Video generation models as world simulators\.Note:OpenAI Technical ReportCited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[8\]H\. Chen, K\. Zhou, H\. Hua, K\. Zhang, J\. Qian, W\. Ma, H\. Chen, C\. Liu, Y\. Zhao, X\. Wang, W\. Li, A\. Yuille, P\. P\. Liang, and Y\. Du\(2026\)MemoBench: benchmarking world modeling in dynamically changing environments\.InEuropean Conference on Computer Vision \(ECCV\),Cited by:[§4\.1](https://arxiv.org/html/2607.01766#S4.SS1.SSS0.Px2.p3.1)\.
- \[9\]R\. Chen, Y\. Chen, N\. Jiao, and K\. Jia\(2023\)Fantasia3D: disentangling geometry and appearance for high\-quality text\-to\-3d content creation\.InICCV,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[10\]Z\. Chen, F\. Wang, Y\. Wang, and H\. Liu\(2024\)Text\-to\-3d using Gaussian Splatting\.InCVPR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[11\]D\. Cohen\-Baret al\.\(2023\)Set\-the\-scene: global\-local training for generating controllable NeRF scenes\.InICCV,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[12\]M\. Deitke, R\. Liu, M\. Wallingford, H\. Ngo,et al\.\(2023\)Objaverse\-XL: a universe of 10M\+ 3d objects\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1)\.
- \[13\]M\. Deitke, D\. Schwenk, J\. Salvador, L\. Weihs, O\. Michel, E\. VanderBilt, L\. Schmidt, K\. Ehsani, A\. Kembhavi, and A\. Farhadi\(2023\)Objaverse: a universe of annotated 3d objects\.InCVPR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1)\.
- \[14\]M\. Deitke, E\. VanderBilt, A\. Herrasti, L\. Weihs, K\. Ehsani, J\. Salvador, W\. Han, E\. Kolve, A\. Kembhavi, and R\. Mottaghi\(2022\)ProcTHOR: large\-scale embodied AI using procedural generation\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1)\.
- \[15\]C\. Fanget al\.\(2023\)Ctrl\-Room: controllable text\-to\-3d room meshes generation with layout constraints\.arXiv:2310\.03602\.Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1)\.
- \[16\]W\. Feng, W\. Zhu, T\. Fu, V\. Jampani, A\. Akula, X\. He, S\. Basu, X\. E\. Wang, and W\. Y\. Wang\(2023\)LayoutGPT: compositional visual planning and generation with large language models\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1)\.
- \[17\]R\. Fridmanet al\.\(2023\)SceneScape: text\-driven consistent scene generation\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[18\]G\. Gaoet al\.\(2024\)GraphDreamer: compositional 3d scene synthesis from scene graphs\.InCVPR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[19\]X\. Ge, Y\. Pan, Y\. Zhang, X\. Li, W\. Zhang, D\. Zhang, Z\. Wan, X\. Lin, X\. Zhang, J\. Liang, J\. Li, W\. Jiang, B\. Du, M\. Yang, and L\. Qi\(2026\)AirSim360: a panoramic simulation platform within drone view\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1)\.
- \[20\]Z\. Gouet al\.\(2024\)CRITIC: large language models can self\-correct with tool\-interactive critiquing\.InICLR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[21\]K\. Greff, F\. Belletti, L\. Beyer, C\. Doersch, Y\. Du, D\. Duckworth, D\. J\. Fleet, D\. Gnanapragasam, F\. Golemo, C\. Herrmann,et al\.\(2022\)Kubric: a scalable dataset generator\.InCVPR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[22\]Y\. Gu, I\. Huang, J\. Je, G\. Yang, and L\. Guibas\(2025\)BlenderGym: benchmarking foundational model systems for graphics editing\.InCVPR,Note:HighlightCited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px2.p1.1),[footnote 1](https://arxiv.org/html/2607.01766#footnote1)\.
- \[23\]X\. He, D\. Jiang, G\. Zhang,et al\.\(2024\)VideoScore: building automatic metrics to simulate fine\-grained human feedback for video generation\.InProceedings of the Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Cited by:[§4\.1](https://arxiv.org/html/2607.01766#S4.SS1.SSS0.Px2.p3.1)\.
- \[24\]J\. Ho, T\. Salimans, A\. Gritsenko, W\. Chan, M\. Norouzi, and D\. J\. Fleet\(2022\)Video diffusion models\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[25\]L\. Hölleinet al\.\(2023\)Text2Room: extracting textured 3d meshes from 2d text\-to\-image models\.InICCV,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[26\]S\. Honget al\.\(2024\)MetaGPT: meta programming for a multi\-agent collaborative framework\.InICLR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[27\]Y\. Hong, K\. Zhang, J\. Gu, S\. Bi, Y\. Zhou, D\. Liu, F\. Liu, K\. Sunkavalli, T\. Bui, and H\. Tan\(2024\)LRM: large reconstruction model for single image to 3d\.InICLR,Cited by:[§1](https://arxiv.org/html/2607.01766#S1.p1.1),[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[28\]Y\. Hu, B\. Liu, J\. Kasai, Y\. Wang, M\. Ostendorf, R\. Krishna, and N\. A\. Smith\(2023\)TIFA: accurate and interpretable text\-to\-image faithfulness evaluation with question answering\.InProceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\),Cited by:[§4\.1](https://arxiv.org/html/2607.01766#S4.SS1.SSS0.Px2.p3.1)\.
- \[29\]Z\. Hu, A\. Iscen, A\. Jain, T\. Kipf, Y\. Yue, D\. A\. Ross, C\. Schmid, and A\. Fathi\(2024\)SceneCraft: an LLM agent for synthesizing 3d scene as Blender code\.arXiv:2403\.01248\.Cited by:[§1](https://arxiv.org/html/2607.01766#S1.p1.1),[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1),[§3\.3](https://arxiv.org/html/2607.01766#S3.SS3.p1.1),[footnote 1](https://arxiv.org/html/2607.01766#footnote1)\.
- \[30\]I\. Huang, G\. Yang, and L\. Guibas\(2024\)BlenderAlchemy: editing 3d graphics with vision\-language models\.arXiv:2404\.17672\.Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1)\.
- \[31\]J\. Huanget al\.\(2024\)Large language models cannot self\-correct reasoning yet\.InICLR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[32\]B\. Kerbl, G\. Kopanas, T\. Leimkühler, and G\. Drettakis\(2023\)3D Gaussian Splatting for real\-time radiance field rendering\.InSIGGRAPH,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[33\]X\. Liet al\.\(2024\)Director3D: real\-world camera trajectory and 3d scene generation from text\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[34\]H\. Lianget al\.\(2024\)Diffusion4D: fast spatial\-temporal consistent 4d generation via video diffusion models\.InECCV,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[35\]C\. Lin, J\. Gao, L\. Tang, T\. Takikawa, X\. Zeng, X\. Huang, K\. Kreis, S\. Fidler, M\. Liu, and T\. Lin\(2023\)Magic3D: high\-resolution text\-to\-3d content creation\.InCVPR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[36\]C\. Linet al\.\(2024\)InstructScene: instruction\-driven 3d indoor scene synthesis with semantic graph prior\.InICLR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1)\.
- \[37\]X\. Lin, M\. Song, D\. Zhang, W\. Lu, H\. Li, B\. Du, M\. Yang, T\. Nguyen, and L\. Qi\(2026\)Depth any panoramas: a foundation model for panoramic depth estimation\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1)\.
- \[38\]C\. Liu, X\. Wang, Q\. Lin, A\. Xiao, H\. Chen, S\. Wen, H\. Zhang, L\. Qi, M\. Yang, L\. A\. Jeni, M\. Xu, and Y\. Zhao\(2026\)MOSIV: multi\-object system identification from videos\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[39\]R\. Liu, R\. Wu, B\. Van Hoorick, P\. Tokmakov, S\. Zakharov, and C\. Vondrick\(2023\)Zero\-1\-to\-3: zero\-shot one image to 3d object\.InICCV,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[40\]Y\. Liu, X\. Lin, X\. Li, B\. Yang, C\. Wang, K\. Sunkavalli, Y\. Hold\-Geoffroy, H\. Tan, K\. Zhang, X\. Xie, Z\. Shi, and Y\. Hu\(2026\)OmniRoam: world wandering via long\-horizon panoramic video generation\.InACM SIGGRAPH,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[41\]X\. Long, Y\. Guo, C\. Lin, Y\. Liu, Z\. Dou, L\. Liu, Y\. Ma, S\. Zhang, M\. Habermann, C\. Theobalt,et al\.\(2024\)Wonder3D: single image to 3d using cross\-domain diffusion\.InCVPR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[42\]S\. Lu, G\. Chen, N\. A\. Dinh, I\. Lang, A\. Holtzman, and R\. Hanocka\(2025\)LL3M: large language 3D modelers\.arXiv:2508\.08228\.Cited by:[§1](https://arxiv.org/html/2607.01766#S1.p1.1),[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1),[§3\.3](https://arxiv.org/html/2607.01766#S3.SS3.p1.1),[footnote 1](https://arxiv.org/html/2607.01766#footnote1)\.
- \[43\]A\. Madaanet al\.\(2023\)Self\-refine: iterative refinement with self\-feedback\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[44\]L\. Mei, J\. Yao,et al\.\(2025\)A survey of context engineering for large language models\.arXiv:2507\.13334\.Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[45\]B\. M\. Öcalet al\.\(2024\)SceneTeller: language\-to\-3d scene generation\.InECCV,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1)\.
- \[46\]J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein\(2023\)Generative agents: interactive simulacra of human behavior\.InUIST,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[47\]D\. Paschalidouet al\.\(2021\)ATISS: autoregressive transformers for indoor scene synthesis\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1)\.
- \[48\]B\. Poole, A\. Jain, J\. T\. Barron, and B\. Mildenhall\(2023\)DreamFusion: text\-to\-3d using 2d diffusion\.InICLR,Cited by:[§1](https://arxiv.org/html/2607.01766#S1.p1.1),[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[49\]C\. Qianet al\.\(2024\)ChatDev: communicative agents for software development\.InACL,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[50\]A\. Raistricket al\.\(2023\)Infinite photorealistic worlds using procedural generation\.InCVPR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1)\.
- \[51\]J\. Renet al\.\(2024\)DreamGaussian4D: generative 4d Gaussian Splatting\.arXiv:2312\.17142\.Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[52\]J\. Renet al\.\(2024\)L4GM: large 4d Gaussian reconstruction model\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[53\]R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer\(2022\)High\-resolution image synthesis with latent diffusion models\.InCVPR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[54\]T\. Schicket al\.\(2023\)Toolformer: language models can teach themselves to use tools\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[55\]Y\. Shi, P\. Wang, J\. Ye, M\. Long, K\. Li, and X\. Yang\(2024\)MVDream: multi\-view diffusion for 3d generation\.InICLR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[56\]N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao\(2023\)Reflexion: language agents with verbal reinforcement learning\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[57\]U\. Singer, A\. Polyak, T\. Hayes, X\. Yin, J\. An, S\. Zhang, Q\. Hu, H\. Yang, O\. Ashual, O\. Gafni,et al\.\(2023\)Make\-a\-video: text\-to\-video generation without text\-video data\.InICLR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[58\]U\. Singer, S\. Sheynin, A\. Polyak, O\. Ashual, I\. Makarov, F\. Kokkinos, N\. Goyal, A\. Vedaldi, D\. Parikh, J\. Johnson, and Y\. Taigman\(2023\)Text\-to\-4d dynamic scene generation\.InICML,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[59\]C\. Sunet al\.\(2023\)3D\-GPT: procedural 3d modeling with large language models\.arXiv:2310\.12945\.Cited by:[§1](https://arxiv.org/html/2607.01766#S1.p1.1),[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1),[§3\.3](https://arxiv.org/html/2607.01766#S3.SS3.p1.1)\.
- \[60\]J\. Tanget al\.\(2024\)DiffuScene: denoising diffusion models for generative indoor scene synthesis\.InCVPR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px1.p1.1)\.
- \[61\]J\. Tang, Z\. Chen, X\. Chen, T\. Wang, G\. Zeng, and Z\. Liu\(2024\)LGM: large multi\-view Gaussian model for high\-resolution 3d content creation\.InECCV,Cited by:[§1](https://arxiv.org/html/2607.01766#S1.p1.1),[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[62\]J\. Tang, J\. Ren, H\. Zhou, Z\. Liu, and G\. Zeng\(2024\)DreamGaussian: generative Gaussian Splatting for efficient 3d content creation\.InICLR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[63\]C\. Wang, X\. Lin, J\. Liu, Y\. Liu, Z\. Wang, D\. Qi, Y\. Yan, and X\. Chen\(2026\)PanoWorld: towards spatial supersensing in 360° panorama world\.arXiv preprint arXiv:2605\.13169\.Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[64\]G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar\(2023\)Voyager: an open\-ended embodied agent with large language models\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[65\]X\. Wang, Y\. Zhao, B\. Ye, X\. Shan, W\. Lyu, L\. Qi, K\. C\. K\. Chan, Y\. Li, and M\. Yang\(2025\)HoliGS: holistic gaussian splatting for embodied view synthesis\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[66\]X\. Wanget al\.\(2024\)Executable code actions elicit better LLM agents\.InICML,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[67\]Z\. Wang, C\. Lu, Y\. Wang, F\. Bao, C\. Li, H\. Su, and J\. Zhu\(2023\)ProlificDreamer: high\-fidelity and diverse text\-to\-3d generation with variational score distillation\.InNeurIPS,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[68\]L\. Weng\(2023\)LLM powered autonomous agents\.Note:lilianweng\.github\.ioExternal Links:[Link](https://lilianweng.github.io/posts/2023-06-23-agent/)Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[69\]Q\. Wuet al\.\(2023\)AutoGen: enabling next\-gen LLM applications via multi\-agent conversation\.arXiv:2308\.08155\.Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[70\]T\. Xieet al\.\(2024\)PhysGaussian: physics\-integrated 3d Gaussians for generative dynamics\.InCVPR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[71\]Y\. Xieet al\.\(2024\)SV4D: dynamic 3d content generation with multi\-frame and multi\-view consistency\.arXiv:2407\.17470\.Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[72\]D\. Xuet al\.\(2024\)Comp4D: LLM\-guided compositional 4d scene generation\.arXiv:2403\.16993\.Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[73\]J\. Xu, W\. Cheng, Y\. Gao, X\. Wang, S\. Gao, and Y\. Shan\(2024\)InstantMesh: efficient 3d mesh generation from a single image with sparse\-view large reconstruction models\.arXiv:2404\.07191\.Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[74\]Y\. Yang, F\. Sun, L\. Weihs, E\. VanderBilt, A\. Herrasti, W\. Han, J\. Wu, N\. Haber, R\. Krishna, L\. Liu, C\. Callison\-Burch, M\. Yatskar, A\. Kembhavi, and C\. Clark\(2024\)Holodeck: language guided generation of 3d embodied AI environments\.InCVPR,Cited by:[§1](https://arxiv.org/html/2607.01766#S1.p1.1),[footnote 1](https://arxiv.org/html/2607.01766#footnote1)\.
- \[75\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao\(2023\)ReAct: synergizing reasoning and acting in language models\.InICLR,Cited by:[§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px4.p1.1)\.
- \[76\] T\. Yi, J\. Fang, J\. Wang, G\. Wu, L\. Xie, X\. Zhang, W\. Liu, Q\. Tian, and X\. Wang \(2024\) GaussianDreamer: fast generation from text to 3d Gaussians by bridging 2d and 3d diffusion models\. In CVPR, Cited by: [§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[77\] S\. Yin, J\. Ge, Z\. Z\. Wang, C\. Wang, X\. Li, M\. J\. Black, T\. Darrell, A\. Kanazawa, and H\. Feng \(2026\) VIGA: vision\-as\-inverse\-graphics agent via interleaved multimodal reasoning\. arXiv:2601\.11109\. Cited by: [Appendix A](https://arxiv.org/html/2607.01766#A1.SS0.SSS0.Px7.p1.1), [Appendix E](https://arxiv.org/html/2607.01766#A5.SS0.SSS0.Px3.p1.1), [Appendix E](https://arxiv.org/html/2607.01766#A5.p1.1), [§1](https://arxiv.org/html/2607.01766#S1.p1.1), [§1](https://arxiv.org/html/2607.01766#S1.p6.1), [§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px2.p1.1), [§3\.3](https://arxiv.org/html/2607.01766#S3.SS3.p1.1), [§4\.1](https://arxiv.org/html/2607.01766#S4.SS1.SSS0.Px1.p1.1), [§4\.1](https://arxiv.org/html/2607.01766#S4.SS1.SSS0.Px3.p1.1), [§4\.2](https://arxiv.org/html/2607.01766#S4.SS2.SSS0.Px3.p1.1), [Table 2](https://arxiv.org/html/2607.01766#S4.T2.6.7.1.2.1)\.
- \[78\] Q\. Zhang et al\. \(2024\) SceneWiz3D: towards text\-guided 3d scene composition\. In CVPR, Cited by: [§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[79\] T\. Zhang et al\. \(2024\) PhysDreamer: physics\-based interaction with 3d objects via video generation\. arXiv:2404\.13026\. Cited by: [§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[80\] Y\. Zhao, H\. Chen, C\. Liu, Z\. Li, C\. Herrmann, J\. Hur, Y\. Li, M\. Yang, B\. Raj, and M\. Xu \(2025\) MASIV: toward material\-agnostic system identification from videos\. In Proceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\), Cited by: [§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.
- \[81\] L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\) Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\. In Advances in Neural Information Processing Systems \(NeurIPS\), Datasets and Benchmarks Track, Cited by: [§4\.1](https://arxiv.org/html/2607.01766#S4.SS1.SSS0.Px2.p3.1)\.
- \[82\] Y\. Zheng and F\. Bordes \(2026\) VoxelCodeBench: benchmarking 3d world modeling through code generation\. arXiv:2604\.02580\. Cited by: [footnote 1](https://arxiv.org/html/2607.01766#footnote1)\.
- \[83\] X\. Zhou et al\. \(2024\) GALA3D: towards text\-to\-3d complex scene generation via layout\-guided generative Gaussian Splatting\. In ICML, Cited by: [§2](https://arxiv.org/html/2607.01766#S2.SS0.SSS0.Px3.p1.1)\.

## Appendix A Implementation Details

#### Stack\.

SimWorlds uses claude\-opus\-4\-7 as the underlying model for all three agent roles \(planner, coder, and reviewer\)\. Evaluation uses two independent scorers \(Appendix [C](https://arxiv.org/html/2607.01766#A3)\): a deterministic engine\-state audit that runs inside headless Blender with no LLM in the loop, and an itemized VLM judge on gpt\-5\.5 run in a separate session; the BlenderBench editing comparison instead uses a gpt\-4o judge \(Appendix [E](https://arxiv.org/html/2607.01766#A5)\)\. Each judge scores every scene once at the provider’s default temperature, with no multi\-seed averaging, and the judge prompts are released verbatim under src/bench/\. The audit’s predicate scoring uses no LLM; only an optional actor\-matching fallback uses claude\-sonnet\-4\-6\. Blender 5\.1 is exposed through an MCP server that provides the tool\-use APIs of §[3\.3](https://arxiv.org/html/2607.01766#S3.SS3)\. End\-to\-end runtime is approximately 70–120 minutes per scene on a single workstation \(Apple M4, 16 GB unified memory\), dominated by Blender bake time on dynamic prompts; the LLM is consumed via API and contributes negligible local compute\.

#### Scene Plan\.

The planner writes a structured plan in JSON and never operates Blender directly: it emits one *strategic plan* at the start of a run and a per\-stage *tactical plan* at each stage entry, then issues an advance/retry/replan/abort decision after each stage\. The plan’s load\-bearing commitments are declared as custom properties on each logical object’s L3 collection: a deformation\_kind $\in$ \{rig, sim\_cloth, sim\_fluid, sim\_rigid, none\}, set at the deformation\_setup stage, and a motion\_kind $\in$ \{keyframe, bake, none\}, set at the motion stage\. These, together with the spatial\-relation graph of Appendix [B](https://arxiv.org/html/2607.01766#A2), fix what the verifier’s R and T rules check\. A stage left undeclared is a zero\-cost no\-op \(the R/T rules self\-gate on the declaration\), so a static scene passes through deformation\_setup and motion cleanly\. The planner sets no bpy implementation specifics \(which API call to use, which value to tune\); those are the coder’s job\. The full plan schema and planner prompt are in the released repository \(prompts/\)\.
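As a rough illustration of the self\-gating behaviour, the sketch below models the two per\-object declarations as plain Python data. Only the property names and their allowed values come from the text above. The class `ObjectDeclaration`, the helper `stages_with_work`, the object names, and the choice to treat `none` as "undeclared" are assumptions made for this example; the actual plan schema is the one in the released repository.

```python
from dataclasses import dataclass

# Allowed values as listed in the text; everything else here is illustrative.
DEFORMATION_KINDS = {"rig", "sim_cloth", "sim_fluid", "sim_rigid", "none"}
MOTION_KINDS = {"keyframe", "bake", "none"}


@dataclass
class ObjectDeclaration:
    """Hypothetical stand-in for the custom properties on one L3 collection."""
    name: str
    deformation_kind: str = "none"
    motion_kind: str = "none"

    def __post_init__(self):
        if self.deformation_kind not in DEFORMATION_KINDS:
            raise ValueError(f"unknown deformation_kind: {self.deformation_kind}")
        if self.motion_kind not in MOTION_KINDS:
            raise ValueError(f"unknown motion_kind: {self.motion_kind}")


def stages_with_work(decls):
    """Return the stages that have something to check.

    Assumption: a value of "none" means the stage is undeclared for that
    object, so a scene with no declarations passes through both stages.
    """
    stages = []
    if any(d.deformation_kind != "none" for d in decls):
        stages.append("deformation_setup")
    if any(d.motion_kind != "none" for d in decls):
        stages.append("motion")
    return stages


static_scene = [ObjectDeclaration("sofa"), ObjectDeclaration("rug")]
dynamic_scene = [
    ObjectDeclaration("tablecloth", "sim_cloth", "bake"),
    ObjectDeclaration("table"),
]
print(stages_with_work(static_scene))   # []
print(stages_with_work(dynamic_scene))  # ['deformation_setup', 'motion']
```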

#### Per\-Stage Loop Implementation\.

Each stage runs the coder, the deterministic verifier, and the per\-stage reviewer until the planner advances it\. The coder is allowed up to MAX\_RETRIES\_PER\_STAGE = 10 coder retries and MAX\_REPLANS\_PER\_STAGE = 5 tactical replans per stage before the stage aborts\. Each blender\_execute call runs in a fresh Python namespace; persistent state lives in the live Blender scene \(bpy\.data\), which later stages build on rather than rebuild\. Every closed stage writes a checkpoint checkpoints/<stage\>\.blend, so a coder failure rolls the scene back to the previous stage’s checkpoint\.
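The bounded loop described above can be sketched in a few lines of plain Python. This is a minimal sketch, not the released implementation: the two budget constants are taken from the text, while `run_stage`, its callable arguments, the stage name, and the assumption that the retry counter is not reset by a replan are all hypothetical.

```python
MAX_RETRIES_PER_STAGE = 10   # from the text
MAX_REPLANS_PER_STAGE = 5    # from the text


def run_stage(stage, plan_tactics, run_coder, verify, review, decide):
    """Run one stage until the planner advances it or a budget runs out.

    All callables are hypothetical stand-ins:
      plan_tactics(stage)        -> tactical plan for the stage
      run_coder(stage, tactics)  -> None (would mutate the live scene)
      verify(stage), review(stage) -> bool
      decide(verifier_ok, reviewer_ok) -> "advance" | "retry" | "replan" | "abort"
    """
    retries = replans = 0
    tactics = plan_tactics(stage)
    while True:
        run_coder(stage, tactics)
        decision = decide(verify(stage), review(stage))
        if decision == "advance":
            return "advanced"
        if decision == "retry" and retries < MAX_RETRIES_PER_STAGE:
            retries += 1
        elif decision == "replan" and replans < MAX_REPLANS_PER_STAGE:
            replans += 1
            tactics = plan_tactics(stage)
        else:
            # Explicit abort or exhausted budget; the caller would roll the
            # scene back to the previous stage's checkpoint here.
            return "aborted"


# Usage: a verifier that fails twice, then passes.
attempts = {"n": 0}


def flaky_verify(stage):
    attempts["n"] += 1
    return attempts["n"] >= 3


result = run_stage(
    "geometry",
    plan_tactics=lambda s: {"stage": s},
    run_coder=lambda s, t: None,
    verify=flaky_verify,
    review=lambda s: True,
    decide=lambda v, r: "advance" if (v and r) else "retry",
)
print(result, attempts["n"])  # advanced 3
```

With a verifier that never passes and a planner that always answers "retry", this sketch runs the coder eleven times (the first attempt plus ten retries) before aborting.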

#### Agents\.

The pipeline runs four roles: a *planner* \(one persistent session across the run, reasoning over plans in JSON with no Blender tools\), a *coder* \(fresh session per stage, persisting across that stage’s retries\), a *per\-stage reviewer* \(fresh session per stage\), and a *final reviewer* \(one\-shot, on the rendered output\)\. Each role prompt is composed at runtime from the role file plus the shared scene\-protocol documentation \(src/agent/prompts/scene\_protocol\.md and docs/scene\_protocol\.md\)\. The role files are provided verbatim in the released repository \(prompts/\)\.

#### MCP tool server\.

Blender 5\.1 runs as a long\-lived process; an MCP server exposes the tool\-use APIs of §[3\.3](https://arxiv.org/html/2607.01766#S3.SS3) \(Table [4](https://arxiv.org/html/2607.01766#A1.T4)\) over a local TCP port\. blender\_execute runs arbitrary bpy Python in the live process; the inspect and preview tools wrap deterministic Python helpers that read scene state without re\-rendering\. Checkpointing saves the live \.blend after every closed stage, so a coder failure on stage $s_{i+1}$ rolls back to the checkpoint written at the close of $s_i$\.

| Category | Tools | Purpose |
| --- | --- | --- |
| *State Observation* | | |
| Scene Snapshot | blender\_scene\_state | Structured readout of the live scene: objects, materials, modifier stacks, f\-curves, and physics/cache state\. |
| Visual Preview | blender\_single\_mesh\_preview, blender\_object\_preview, blender\_system\_layout\_preview, blender\_scene\_layout\_preview | Targeted multi\-angle preview renders at mesh, object, system, and whole\-scene granularity\. |
| Motion Audit | blender\_motion\_sheet\_preview | Samples a per\-actor frame strip with phase boundaries and bake state to surface motion\-mechanism failures\. |
| *State Modification* | | |
| Code Execution | blender\_execute | Runs bpy code in the live Blender process; fresh namespace per call\. |
| Protocol Tagging | blender\_tag\_object | Writes the scene\-protocol custom properties \(protocol\_role, is\_object\_root, contact\_with\) that make state checkable by the verifier\. |
| Checkpoint | blender\_save | Saves the live \.blend as the per\-stage checkpoint\. |
| Render | blender\_render | Renders the final frame sequence \(or still\) for the scene\. |
| *Knowledge & Reference* | | |
| API Lookup | blender\_docs | Queries the auto\-derived Blender API knowledge base\. |

Table 4: Tool suite exposed to SimWorlds agents, grouped by category\. Bolded tools are the engine\-state inspection tools that supply the mechanism\-level evidence the rendered image cannot\.

#### Knowledge base\.

blender\_docs\(query\) is backed by knowledge/blender\_docs/, an auto\-generated reference compiled per Blender version by src/knowledge/build\_knowledge\_base\.py \(it extracts class hierarchies, property signatures, and enum values, and serves curated markdown pages with raw JSON fallback\)\. Consistent with §[3\.4](https://arxiv.org/html/2607.01766#S3.SS4), this is the only knowledge source: no hand\-curated reference material is shipped\. The coder queries it on demand with blender\_docs\(topic\) for the stage at hand\.

#### Asset Licences\.

Third\-party assets used: Blender 5\.1 \(GPLv2\+\); Anthropic Claude Opus 4\.7 \(agent LLM, consumed via the Anthropic API under its Terms of Service\); OpenAI GPT\-5\.5 \(evaluation VLM judge, consumed via the OpenAI API under its Terms of Service\); BlenderBench \[[77](https://arxiv.org/html/2607.01766#bib.bib38)\] \(used for editing evaluation only; license per the BlenderBench repository\)\.

## Appendix B Scene Protocol and Verifier Ruleset

This appendix gives the full ruleset for the in\-loop verifier introduced in §[3\.2](https://arxiv.org/html/2607.01766#S3.SS2)\. It is part of the SimWorlds *system* \(the deterministic gate the generation loop runs after every stage\) and is separate from the external benchmark audit of Appendix [C](https://arxiv.org/html/2607.01766#A3), which scores any method’s output and shares no code with it\.

#### Protocol structure\.

Objects are organised into a three\-level collection tree: the scene root \(L1\); *system* collections \(L2, protocol\_role=system\) that group logical objects or hold scene\-level state, including a mandatory ground and the conventional lighting/cameras/physics collections; and *object* collections \(L3, protocol\_role=object\), each holding the meshes of one logical object plus exactly one Empty root to which they are parented\. Surface contact, absent from Blender’s native vocabulary, is declared via the contact\_with custom property and the protocol\-graph relations SupportedBy, StableAgainst, FixedAttachment, and Aperture; a pair counts as declared if either side names the other\. The verifier accepts the declaration only if the surfaces actually meet \(rule V1\)\.

#### Ruleset\.

The verifier reports ok=false if any *hard* rule fails; warnings are listed separately and do not flip ok\. Rules fall in four families: structural \(C, G\), geometric \(V\), soft \(W\), and plan\-vs\-state \(R, T\)\.

*Structural, collection layout \(C, cheap pure\-data\):*

- C1: every L3 sits in $\geq 1$ system L2; C2: every protocol L2 sits directly under the scene root; C3: every mesh sits in exactly one object L3; C4: every L3 has exactly one is\_object\_root Empty; C5: all protocol collections are explicitly named \(no Collection\.001 debris\); C6: a ground system L2 exists with $\geq 1$ L3 inside\. All hard fails\.

*Structural, Empty and parent–child \(G, cheap pure\-data\):*

- G1: every mesh’s parent chain ends at its L3’s root Empty; G2: every root Empty has parent=None; G3: every root Empty is in its own L3\. All hard fails\.
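Because the C and G rules are pure\-data checks, a few of them can be illustrated without Blender at all. The sketch below covers C3, C4, G1, and G2 on a hand\-built dictionary model of the scene; the data layout, the function names, and the object names are assumptions for this example, and the released verifier reads the same facts from bpy data instead.

```python
def chain_end(obj, parents):
    """Follow parent links upward and return the last object in the chain."""
    seen = set()
    while parents.get(obj) is not None and obj not in seen:
        seen.add(obj)
        obj = parents[obj]
    return obj


def check_structure(all_meshes, l3s, parents):
    """Return (rule, offender) pairs for a subset of the structural rules.

    all_meshes: every mesh name in the scene
    l3s:        {l3_name: {"meshes": [...], "roots": [...root Empties...]}}
    parents:    {object_name: parent_name or None}
    """
    failures = []

    # C3: every mesh sits in exactly one object L3.
    counts = {m: 0 for m in all_meshes}
    for l3 in l3s.values():
        for mesh in l3["meshes"]:
            counts[mesh] = counts.get(mesh, 0) + 1
    failures += [("C3", m) for m, c in sorted(counts.items()) if c != 1]

    for name, l3 in sorted(l3s.items()):
        # C4: every L3 has exactly one root Empty.
        if len(l3["roots"]) != 1:
            failures.append(("C4", name))
            continue
        root = l3["roots"][0]
        # G2: the root Empty itself has no parent.
        if parents.get(root) is not None:
            failures.append(("G2", root))
        # G1: every mesh's parent chain ends at the L3's root Empty.
        for mesh in l3["meshes"]:
            if chain_end(mesh, parents) != root:
                failures.append(("G1", mesh))
    return failures


l3s = {
    "chair": {"meshes": ["seat", "leg"], "roots": ["chair_root"]},
    "lamp": {"meshes": ["shade"], "roots": []},
}
parents = {"seat": "chair_root", "leg": "seat", "chair_root": None, "shade": None}
print(check_structure(["seat", "leg", "shade", "stray"], l3s, parents))
# [('C3', 'stray'), ('C4', 'lamp')]
```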

*Geometric \(V, BVH\-based, heavy\):*

- V1: every declared contact relation \(contact\_with and the protocol\-graph relations, cross\-object and within\-object\) must resolve to a nearest\-surface distance $\leq \epsilon_m$, measured by BVH on the evaluated meshes\. Declaring an attachment never exempts the parts from touching: *declared contact $\Rightarrow$ measured contact* at every level\. Hard fail\.
- V2: any AABB\-overlapping mesh pair with penetration depth above a numerical\-noise tolerance must be declared on at least one side, else it is flagged as an undeclared penetration\. Hard fail\.

*Soft \(W, warnings\):*

- W1: every L3 is reachable from a ground L3 via contact edges \(promoted to hard fail under strict\_grounding\); W2: meshes within an L3 form one connected component \(skipped for L3s flagged allows\_disconnected\); W3: no light/camera/force\-field inside an object L3; W4: scene has $\geq 1$ camera; W5: scene has a light or an emissive world\.

*Plan\-vs\-state \(R, T\):*

- R1–R18 re\-check the planner’s protocol\_graph relations against the realised geometry \(e\.g\. a Distributed\(circle, 4\-fold\) layout is actually a 4\-fold circular arrangement to tolerance\); seven motion\-timing rules \(the T family; T1 and T5 are folded into the motion\-sheet preview and R15\) re\-check the realised motion against the planned phases \(e\.g\. a settle phase has near\-zero velocity in its final frames\)\. The full per\-rule catalogue and tolerances are in the released code\.
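To give a feel for what a motion\-timing rule looks like, here is a minimal sketch of the settle example: it estimates speed by finite differences over the last few frames of a phase and compares it to a threshold. The function name `settles`, the frame rate, the tail length, and the speed threshold are all assumptions chosen for illustration; the actual T rules and their tolerances are defined in the released code.

```python
import math


def settles(positions, fps=24, tail_frames=5, max_speed=0.01):
    """Return True if the actor is nearly at rest at the end of a phase.

    positions:   one (x, y, z) sample per frame of the phase
    fps:         frames per second (assumed value)
    tail_frames: how many final frame-to-frame steps to inspect (assumed)
    max_speed:   speed threshold in scene units per second (assumed)
    """
    tail = positions[-(tail_frames + 1):]
    for previous, current in zip(tail, tail[1:]):
        speed = math.dist(previous, current) * fps
        if speed > max_speed:
            return False
    return True


# A block that falls for ten frames and then rests for six.
falling_then_still = [(0.0, 0.0, 1.0 - 0.1 * i) for i in range(10)] + [(0.0, 0.0, 0.0)] * 6
# A block that is still falling when the phase ends.
still_moving = [(0.0, 0.0, 1.0 - 0.1 * i) for i in range(10)]

print(settles(falling_then_still))  # True
print(settles(still_moving))        # False
```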

Together V1 and V2 catch the geometric failure mode that naive structural checks miss: V1 rejects a declared contact that does not hold \(a chair leg floating above the floor\), and V2 rejects an undeclared interpenetration \(a mesh punching through another\), so a scene cannot pass by being merely well\-organised while being geometrically broken\.
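The interplay of V1 and V2 can be sketched with axis\-aligned boxes standing in for meshes. This is a deliberate simplification: the verifier measures nearest\-surface distance by BVH on the evaluated meshes, whereas the sketch uses the gap and overlap of bounding boxes. The function names, the tolerance values `eps` and `noise`, and the example scene are assumptions for illustration only.

```python
from itertools import combinations


def aabb_gap(a, b):
    """Euclidean gap between two boxes ((min xyz), (max xyz)); 0 if they touch."""
    squared = 0.0
    for axis in range(3):
        d = max(a[0][axis] - b[1][axis], b[0][axis] - a[1][axis], 0.0)
        squared += d * d
    return squared ** 0.5


def aabb_penetration(a, b):
    """Smallest overlap extent over the three axes; 0 if the boxes are disjoint."""
    depth = min(min(a[1][i], b[1][i]) - max(a[0][i], b[0][i]) for i in range(3))
    return max(depth, 0.0)


def is_declared(x, y, contact_with):
    """A pair counts as declared if either side names the other."""
    return y in contact_with.get(x, ()) or x in contact_with.get(y, ())


def check_contacts(boxes, contact_with, eps=1e-3, noise=1e-4):
    """Box-level stand-in for V1 (declared => touching) and V2 (overlap => declared)."""
    failures = []
    for x, y in combinations(sorted(boxes), 2):
        declared = is_declared(x, y, contact_with)
        if declared and aabb_gap(boxes[x], boxes[y]) > eps:
            failures.append(("V1", x, y))   # declared contact does not hold
        if not declared and aabb_penetration(boxes[x], boxes[y]) > noise:
            failures.append(("V2", x, y))   # undeclared interpenetration
    return failures


boxes = {
    "floor": ((-2.0, -2.0, -0.1), (2.0, 2.0, 0.0)),
    "leg": ((0.0, 0.0, 0.05), (0.1, 0.1, 0.5)),      # floats 5 cm above the floor
    "book": ((1.0, 1.0, -0.05), (1.2, 1.2, 0.1)),    # sinks 5 cm into the floor
}
contact_with = {"leg": ["floor"]}
print(check_contacts(boxes, contact_with))
# [('V2', 'book', 'floor'), ('V1', 'floor', 'leg')]
```

The floating leg fails V1 because its declared contact is not realised, and the sunken book fails V2 because its overlap with the floor was never declared.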

## Appendix C Evaluation Protocol: Engine\-State Audit and VLM Judge

4DBuildBench scores each scene on two independent tracks \(§[4](https://arxiv.org/html/2607.01766#S4)\): a deterministic engine\-state audit that measures *mechanism* correctness from Blender’s runtime state, and an itemized VLM judge that measures whether the prompt’s content is *visually* delivered\. The two are deliberately disjoint \(the audit never looks at a pixel, the judge never inspects a modifier\) so that a scene which looks right but is built the wrong way \(the dominant failure mode of §[4](https://arxiv.org/html/2607.01766#S4)\) scores high on one track and low on the other rather than passing both\.

### C\.1 Engine\-state audit

The audit runs inside a headless Blender process and reads the generated scene’s runtime state directly\. It is *system\-agnostic*: it inspects bpy data \(modifier stacks, physics caches, rigid\-body world, collision settings, constraints, force fields, and the per\-channel keyframe density on location/rotation\) and depends on no SimWorlds\-specific protocol, so it scores any method’s \.blend file on identical terms\.

#### Ground truth\.

Each scene carries a hand\-authored specification \(a YAML file\) listing the *expected actors*, each with a role \(cloth, fluid\_domain, rigid\_active, …\), a set of must\_have predicates, and a set of must\_not\_have predicates; the expected *spatial relations* between actors \(SupportedBy, Inside, OnTopOf, …\); and scene\-level anti\-cheat assertions\. Actors are resolved to objects in the generated scene by name and by matching hints \(AABB size band, topology hint, expected collection role\)\.

#### Predicate library\.

Predicates are drawn from a typed library of 42 checks, grouped by mechanism:

- Universal \(2\): actor is renderable; solver modifiers are enabled in both viewport and render\.
- Cloth \(5\): CLOTH modifier present; cache baked over the frame range; self\-collision; pin vertex group; collision partners carry COLLISION\.
- Fluid \(9\): FLUID modifier and type \(DOMAIN/FLOW/EFFECTOR\); a domain exists; required flow count; domain cache baked; dynamic effector on a moving actor; guiding velocity; minimum domain resolution; liquid mesh output\.
- Rigid body \(9\): rigid\-body settings and type \(ACTIVE/PASSIVE\); collision shape; populated rigid\-body world; world cache baked; constraint type and resolved partner; disabled collisions on constraint; collision modifier when interacting with a deformable; positive mass\.
- Particle \(7\): particle system present; type \(EMITTER/HAIR\); emission source; cache baked; collision partners; emission from a deformed surface \(modifier\-stack order\); force fields present\.
- Soft body \(5\): SOFT\_BODY modifier; cache baked; collision partners; goal vertex group; rigid interaction partners carry both rigid\-body and collision\.
- Anti\-cheat \(5, must\_not\_have\): solver actor does not carry dense location/rotation keyframes \(caps faked animation\); cloth/soft do not use shape keys as a motion source; static scenes carry no baked caches, no populated rigid\-body world, and no spurious solver modifiers\.

#### Aggregates\.

For each actor the audit computes the fraction of its must\_have/must\_not\_have predicates that pass; MPR \(Mechanism Pass Rate\) is the per\-scene mean of these per\-actor fractions\. SPR \(Structural Pass Rate\) is the per\-scene fraction of declared spatial relations that hold, each checked geometrically by nearest\-surface distance via BVH on the evaluated meshes, sampled at start/mid/end frames\. Both are reported in Tables [1](https://arxiv.org/html/2607.01766#S4.T1)–[3](https://arxiv.org/html/2607.01766#S4.T3)\.
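The two aggregates reduce to short arithmetic once the per\-predicate and per\-relation outcomes are known. The sketch below works from already\-evaluated booleans; the function names, the example actors, and the rule that a relation holds only if it holds at all three sampled frames are assumptions, since the text does not state how the start/mid/end samples are combined.

```python
def actor_pass_fraction(results):
    """Fraction of one actor's must_have / must_not_have predicates that pass."""
    return sum(results) / len(results)


def mechanism_pass_rate(per_actor_results):
    """MPR: per-scene mean of the per-actor pass fractions."""
    fractions = [actor_pass_fraction(r) for r in per_actor_results.values()]
    return sum(fractions) / len(fractions)


def structural_pass_rate(relation_samples):
    """SPR: fraction of declared relations that hold.

    Assumption: a relation holds only if it holds at every sampled frame
    (start, mid, end).
    """
    holds = [all(samples) for samples in relation_samples.values()]
    return sum(holds) / len(holds)


# Hypothetical scene: a cloth actor passing 4 of 5 predicates, a table passing 2 of 2.
predicates = {
    "tablecloth": [True, True, True, True, False],
    "table": [True, True],
}
# Hypothetical relations, each sampled at start / mid / end.
relations = {
    ("tablecloth", "SupportedBy", "table"): [True, True, True],
    ("table", "SupportedBy", "floor"): [True, False, True],
}

print(round(mechanism_pass_rate(predicates), 3))  # 0.9
print(structural_pass_rate(relations))            # 0.5
```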

#### SPR cross\-system fairness\.

SPR scores spatial relations against the live \.blend state, which in a naive implementation would advantage SimWorlds: its contact checks expand a logical object to its full set of meshes through the scene protocol’s grouping, which baselines do not emit \(VIGA produces hundreds of ungrouped primitive meshes\)\. We remove this confound by inferring an object’s assembly geometrically when no protocol grouping is present: the 3D connected component of meshes in mutual axis\-aligned\-bounding\-box contact, grown from the matched mesh and stopped at the support surface so it cannot trivially absorb its target\. This inference is a strict no\-op on protocol\-compliant scenes \(every SimWorlds mesh is already grouped, so SimWorlds’s SPR is unchanged\) and only relaxes the score for baselines: VIGA’s macro SPR rises from 0\.62 to the reported 0\.70, every cell monotonically non\-decreasing, while MPR is unaffected by the re\-match \(a shift under 0\.01\)\. The recovered points concentrate in static scenes where objects were correctly placed but ungrouped \(e\.g\. static\_L1, 0\.22 $\to$ 0\.74\), whereas scenes whose objects genuinely float or interpenetrate stay low \(e\.g\. static\_L2, unchanged at 0\.10, a furniture cluster suspended $\sim$0\.8 m above the floor\)\. Even so, structural scoring across systems with different grouping conventions is hard to make fully fair, so we treat the \+0\.19 SPR gap as supporting evidence and lead with MPR\.
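The geometric assembly inference amounts to a breadth\-first search over bounding\-box contacts that refuses to cross the support surface. The following is a minimal sketch under stated assumptions: boxes are given directly as min/max corners, the contact tolerance `tol` is an invented value, and the names `boxes_touch` and `infer_assembly` are hypothetical rather than taken from the released audit.

```python
from collections import deque


def boxes_touch(a, b, tol=1e-3):
    """True if two boxes ((min xyz), (max xyz)) overlap or touch within tol."""
    return all(
        a[0][i] <= b[1][i] + tol and b[0][i] <= a[1][i] + tol for i in range(3)
    )


def infer_assembly(seed, boxes, support):
    """Grow the connected component of touching boxes from the matched mesh.

    The support surface is never added, so growth cannot pass through it
    and absorb unrelated objects that merely rest on the same surface.
    """
    assembly = {seed}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for name, box in boxes.items():
            if name in assembly or name == support:
                continue
            if boxes_touch(boxes[current], box):
                assembly.add(name)
                queue.append(name)
    return assembly


boxes = {
    "top": ((0.0, 0.0, 0.7), (1.0, 1.0, 0.75)),
    "leg_a": ((0.0, 0.0, 0.0), (0.05, 0.05, 0.7)),
    "leg_b": ((0.95, 0.95, 0.0), (1.0, 1.0, 0.7)),
    "floor": ((-5.0, -5.0, -0.1), (5.0, 5.0, 0.0)),
    "lamp": ((3.0, 3.0, 0.0), (3.2, 3.2, 1.5)),
}
print(sorted(infer_assembly("top", boxes, support="floor")))
# ['leg_a', 'leg_b', 'top']
```

The lamp also rests on the floor, but because the floor is excluded from growth it is never reached from the table.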

### C\.2 Itemized VLM judge

A GPT\-5\.5 judge receives the user prompt and five frames sampled uniformly from the rendered video at $t \in \{0, 25, 50, 75, 100\}\%$ of the clip \(a single still for static scenes\); it sees neither the planner’s intent nor the audit\. Instead of numeric scores, it *enumerates atomic items* from the prompt and returns a binary verdict per item, which yields item\-level partial credit and a concrete evidence trail\. The five dimensions:

- objects\_present: one item per object the prompt names; present/absent \(a disassembled object that no longer reads as its class counts absent\)\.
- spatial\_relations: one item per stated relation; holds/violated \(floating where support is implied, interpenetration, wrong side\)\.
- actions\_visible \(dynamic only\): one item per described motion; happened/missing, judged by comparing the first and last frame; empty for static scenes \(the dimension is then excluded\)\.
- visual\_quality: fixed three\-item technical checklist \(materials assigned, exposure/lighting, no render artifacts\); ok/broken\.
- aesthetics: fixed four\-item artistic checklist \(material fidelity, colour palette, lighting mood, composition\); good/poor\.

Each dimension’s score is the fraction of its items with a positive verdict; the reported VLM score is the mean over the dimensions a scene exercises\. Mechanism realism \(whether motion is a real simulation or keyframed\) is explicitly excluded from the judge’s remit, since vision models are unreliable on it \[[5](https://arxiv.org/html/2607.01766#bib.bib72)\]; the audit covers it instead\. The full judge prompt, including the per\-item output schema and anchor examples, is provided in the released code repository \(src/bench/vlm\_judge\_rubric\.md\)\.
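The aggregation rule and the frame sampling can both be written down in a few lines. This sketch assumes the verdicts have already been collected as booleans per dimension; the function names, the rounding used to turn percentages into frame indices, and the example verdicts are assumptions for illustration, not the released judge code.

```python
def dimension_score(verdicts):
    """Fraction of items in one dimension with a positive verdict."""
    return sum(verdicts) / len(verdicts)


def vlm_score(items_by_dimension):
    """Mean of per-dimension scores over the dimensions the scene exercises.

    A dimension with no items (e.g. actions_visible on a static scene)
    is excluded from the mean.
    """
    scores = [dimension_score(v) for v in items_by_dimension.values() if v]
    return sum(scores) / len(scores)


def sample_frames(frame_start, frame_end):
    """Frame indices at 0, 25, 50, 75, 100 percent of the clip (rounding assumed)."""
    span = frame_end - frame_start
    return [frame_start + round(span * p / 100) for p in (0, 25, 50, 75, 100)]


# Hypothetical verdicts for a static scene.
static_scene_verdicts = {
    "objects_present": [True, True, True, False],     # 0.75
    "spatial_relations": [True, False],               # 0.5
    "actions_visible": [],                            # excluded
    "visual_quality": [True, True, True],             # 1.0
    "aesthetics": [True, False, True, False],         # 0.5
}

print(vlm_score(static_scene_verdicts))  # 0.6875
print(sample_frames(1, 121))             # [1, 31, 61, 91, 121]
```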

## Appendix D Benchmark Details

#### Composition\.

4DBuildBench contains 50 scenes \(Section [4](https://arxiv.org/html/2607.01766#S4)\): 45 dynamic scenes across five solver categories \(cloth, fluid, rigid body, particle, soft body\), each split into three difficulty levels with three prompts per level, plus 5 static scenes \(three single\-room interiors, two scene\-scale layouts\)\. The difficulty axis is defined by mechanism complexity rather than by clip length or raw object count:

- D1 \(single actor\): a single solver actor in its default configuration: one cloth draping, one fluid pouring, one stack of rigid bodies settling\.
- D2 \(within\-category\): multiple instances of the category, or internal solver complexity within it: self\-collision, pinning and goal vertex groups, rigid\-body constraints, force fields, multiple flow sources\.
- D3 \(cross\-category\): cross\-category interaction realised in a single shot \(a rigid block crushing a soft\-body slab, particles emitted from a deforming cloth surface, a fluid effector riding an animated rigid body\), each requiring two or more solvers to be configured and to interact correctly\.

Static scenes carry no solver at all; they exercise object inventory, material assignment, and spatial layout \(15–20 named objects per single\-room scene\), and their anti\-cheat predicates assert the absence of any physics state\.

#### Example prompts\.

One prompt per difficulty tier \(the canonical ground\-truth IDs are <category\>\_<level\>\_<nn\>\):

- cloth\_D1\_01 \(cloth, single actor\): “A red tablecloth drapes over a small round wooden table in a quiet dining room\.”
- rigid\_D2\_01 \(rigid body, within\-category\): twelve wooden dominoes arranged in a chain on a table; the first is tipped and the cascade runs to the end \(multiple interacting rigid bodies with a baked rigid\-body world\)\.
- soft\_D3\_01 \(soft body $\times$ rigid body, cross\-category\): “A thick green jelly slab on a wooden board is crushed under a falling heavy stone block, the jelly squashing flat under the block as it settles\.”
- static\_D1\_01 \(static, single\-room\): “A furnished living room interior: a three\-seat sofa against the back wall with two cushions, a coffee table on a large rug, a framed picture and a round wall clock, a floor lamp, a tall bookshelf holding rows of books and a small potted plant, a television on a low media console, a side armchair near the window, and a basket of magazines on the floor\.”
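The composition arithmetic (five categories, three levels, three prompts each, plus five static scenes) can be checked by enumerating IDs in the pattern shown above. The short names `cloth`, `rigid`, and `soft` appear in the example IDs; `fluid` and `particle` are guesses at the remaining category prefixes, and the static IDs are not enumerated because only one of them is shown.

```python
from itertools import product

# "cloth", "rigid", "soft" come from the example IDs; "fluid" and "particle"
# are assumed prefixes for the other two solver categories.
CATEGORIES = ("cloth", "fluid", "rigid", "particle", "soft")
LEVELS = (1, 2, 3)          # D1, D2, D3
PROMPTS_PER_LEVEL = (1, 2, 3)
N_STATIC = 5                # three single-room interiors, two scene-scale layouts

dynamic_ids = [
    f"{category}_D{level}_{index:02d}"
    for category, level, index in product(CATEGORIES, LEVELS, PROMPTS_PER_LEVEL)
]

print(len(dynamic_ids))             # 45
print(len(dynamic_ids) + N_STATIC)  # 50
print(dynamic_ids[:3])              # ['cloth_D1_01', 'cloth_D1_02', 'cloth_D1_03']
```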

#### Authoring protocol\.

Each scene is specified by a hand\-authored ground\-truth YAML \(Appendix [C](https://arxiv.org/html/2607.01766#A3)\) that fixes the expected actors, their required solver predicates, the spatial\-relation graph, and the motion phases, alongside the natural\-language prompt\. Prompts use domain\-specific nouns \(no ball/object placeholders\) with place, time, and material anchors, and each is validated to exercise its category’s solver\. The difficulty level is fixed by the authored predicate set \(the number of interacting actors and whether the required predicates cross a category boundary\) rather than inferred from prompt text\.

#### VIGA run protocol\.

VIGA is run from its open\-source release in dynamic\-scene mode on the 4DBuildBench prompts\. Each scene receives the prompt text only, with no target or reference image, matching SimWorlds’s text\-only setting\. VIGA uses its own generator–verifier loop and Claude Opus 4\.7 backend, capped at its native budget of 15 rounds; SimWorlds instead runs the per\-stage bounded\-retry loop of §[3\.1](https://arxiv.org/html/2607.01766#S3.SS1), so the two budgets are reported as configured rather than forced equal\. Both systems’ final \.blend files are then scored by the identical external audit and VLM judge of Appendix [C](https://arxiv.org/html/2607.01766#A3)\.

## Appendix E BlenderBench Setup

BlenderBench \[[77](https://arxiv.org/html/2607.01766#bib.bib38)\] is reused unchanged from VIGA for the editing comparison in Section [4](https://arxiv.org/html/2607.01766#S4)\. We summarise its structure and metrics here for transparency, and defer to the VIGA paper and the BlenderBench dataset card for the canonical definitions\.

#### Task structure\.

Each of the 27 tasks contains:

- a Blender scene \.blend file \(the “start” scene\);
- a $512 \times 512$ reference render of the post\-edit scene \(the “goal” render\);
- a one\- to two\-sentence natural\-language task description \(task\.txt\);
- the reference Python edit script used by the dataset authors to produce the goal render \(goal\.py\); not exposed to the agent\.

The agent receives the start \.blend file, the goal render, and the task description, and emits Python that, when executed against the start scene, should produce a render approximating the goal\.

#### Difficulty levels\.

The 27 tasks split evenly across three difficulty levels \(9 tasks each\):

- Level 1: Camera adjustment\. Scene contents and lighting are held fixed; only the camera pose differs between start and goal\. Example task description: *“Adjust the camera position so that the viewing angle is consistent with the target image\.”*
- Level 2: Multi\-step attribute editing\. The camera is held fixed; the agent must change two or more lighting, material, or object\-geometry attributes within the same task\. Example task description: *“First adjust the room brightness, then adjust the size of the character’s belly so that it looks like the target image\.”*
- Level 3: Compositional editing\. The same attribute changes as Level 2 plus a Level\-1 camera change in the same task\. Example task description: *“First adjust the room brightness, then adjust the size of the character’s belly so that it looks like the target image\. You need to adjust the camera angle so that you can see the object you want to modify\.”*

#### Evaluation metrics\.

We report VIGA’s three reference\-comparing metrics, computed by a re\-evaluation pass that mirrors the open\-source VIGA implementation \[[77](https://arxiv.org/html/2607.01766#bib.bib38)\]:

- PL $\downarrow$ \(photometric loss\)\. Mean squared error between the agent’s final render and the goal render, after both are converted to RGB, normalised to $[0,1]$, and resized to the goal\-render resolution\. Reported on the paper’s $\times 100$ scale\.
- N\-CLIP $\downarrow$ \(CLIP distance\)\. $(1 - \cos\langle \phi(\text{render}), \phi(\text{goal}) \rangle) \times 100$, where $\phi$ is the image embedding from the openai/clip\-vit\-base\-patch32 model, the same CLIP variant VIGA uses\.
- VLM $\uparrow$ \(judge score\)\. A GPT\-4o judge is shown the goal render, the agent’s final render, and the task description, and assigns four 0–5 integer scores along VIGA’s four criteria \(task completion, visual quality, spatial accuracy, detail accuracy\) using VIGA’s verbatim instruction template\. We report the per\-task mean of the four scores\.
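The three formulas can be made concrete on toy inputs. The sketch below is not the VIGA evaluation code: it takes flat lists of already\-normalised pixel values and precomputed embedding vectors, skips image loading, resizing, and the CLIP model itself, and uses hypothetical function names.

```python
import math


def photometric_loss(render, goal):
    """Mean squared error between pixel values in [0, 1], on the x100 scale."""
    if len(render) != len(goal):
        raise ValueError("render and goal must have the same number of values")
    mse = sum((r - g) ** 2 for r, g in zip(render, goal)) / len(goal)
    return 100.0 * mse


def n_clip(render_embedding, goal_embedding):
    """(1 - cosine similarity) x 100 between two embedding vectors."""
    dot = sum(r * g for r, g in zip(render_embedding, goal_embedding))
    norm = math.sqrt(sum(r * r for r in render_embedding)) * math.sqrt(
        sum(g * g for g in goal_embedding)
    )
    return (1.0 - dot / norm) * 100.0


def vlm_task_score(criterion_scores):
    """Per-task mean of the four 0-5 integer judge scores."""
    return sum(criterion_scores) / len(criterion_scores)


# Toy inputs: four pixel values, two-dimensional embeddings, four judge scores.
print(photometric_loss([0.0, 0.5, 1.0, 1.0], [0.0, 0.5, 0.5, 1.0]))  # 6.25
print(n_clip([1.0, 0.0], [1.0, 0.0]))                                # 0.0
print(n_clip([1.0, 0.0], [0.0, 1.0]))                                # 100.0
print(vlm_task_score([5, 4, 4, 3]))                                  # 4.0
```

Lower is better for the first two metrics and higher is better for the third, matching the arrows in the list above.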

For both methods, the final renders produced by the sweep were re\-scored end\-to\-end through this implementation, so the head\-to\-head means in Table [2](https://arxiv.org/html/2607.01766#S4.T2) are computed against identical metric code rather than against per\-method scoring pipelines\.

The per\-level head\-to\-head numbers are reported in Table [2](https://arxiv.org/html/2607.01766#S4.T2) in the main text\.