BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling

arXiv cs.AI Papers

Summary

BIM-Edit is a benchmark for evaluating LLMs on natural-language editing of Building Information Models (BIM) in IFC format. Results show a substantial gap, with the best model achieving only 49.5% average score across geometric, semantic, and topological metrics.

arXiv:2606.20146v1 Announce Type: new

Cached at: 06/20/26, 02:35 PM

# BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling
Source: [https://arxiv.org/html/2606.20146](https://arxiv.org/html/2606.20146)
Bharathi Kannan Nithyanantham¹, Clemens Kujat¹, Tobias Sesterhenn², Stefan Telgmann¹, Jörn Plönnigs¹, Stefan Lüdtke¹, Christian Bartelt²

¹University of Rostock ²Clausthal University of Technology bharathikannan\.nithyanantham@uni\-rostock\.de

###### Abstract

Large language models \(LLMs\) are increasingly applied to computer\-aided design \(CAD\) to generate design artifacts from textual instructions\. In engineering practice, this requires more than creating new geometry: models must also understand existing scenes, edit them correctly, and preserve semantics and relations\. However, many CAD benchmarks focus on creating new models rather than editing existing ones, and mostly evaluate geometric correctness\. We introduce BIM\-Edit, a benchmark for evaluating LLMs on natural\-language editing of Building Information Models \(BIM\) represented in the Industry Foundation Classes \(IFC\) format\. BIM provides a challenging testbed because building models encode geometry together with semantic and relational structure\. BIM\-Edit contains 324 editing tasks spanning 11 realistic building models and 36 synthetic scenes\. Tasks are expressed using three instruction categories \(direct, spatial, and topological\), covering both explicit and scene\-grounded edits\. We evaluate outputs along three dimensions: geometric accuracy, semantic validity, and topological consistency\. Across evaluated LLMs, the best\-performing model achieves only a 49\.5% average score across the three metrics, and no model fully solves more than 3\.4% of tasks\. These results demonstrate a substantial gap between current LLM capabilities and the requirements of structured engineering design workflows\.

Dataset available at: [https://huggingface\.co/BIM\-Edit](https://huggingface.co/BIM-Edit)

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.20146v1/x1.png)

Figure 1: Overview of the tasks of BIM\-Edit\. BIM\-Edit includes create, update, and delete tasks expressed as natural\-language instructions of varying descriptions: direct \(fully specified\), spatial \(context\-based\), and topological \(relation\-based\)\.

Large language models \(LLMs\) show strong performance on code generation \[[6](https://arxiv.org/html/2606.20146#bib.bib6),[15](https://arxiv.org/html/2606.20146#bib.bib15),[22](https://arxiv.org/html/2606.20146#bib.bib22),[24](https://arxiv.org/html/2606.20146#bib.bib24)\], which has motivated their use in 3D and computer\-aided design \(CAD\)\. Recent systems can generate CAD sequences from textual descriptions \[[25](https://arxiv.org/html/2606.20146#bib.bib25),[29](https://arxiv.org/html/2606.20146#bib.bib29),[38](https://arxiv.org/html/2606.20146#bib.bib38)\] and from images of target models \[[5](https://arxiv.org/html/2606.20146#bib.bib5),[14](https://arxiv.org/html/2606.20146#bib.bib14)\]\. However, most CAD benchmarks evaluate LLMs on small synthetic examples and require models to generate geometries from scratch \[[46](https://arxiv.org/html/2606.20146#bib.bib46),[10](https://arxiv.org/html/2606.20146#bib.bib10),[30](https://arxiv.org/html/2606.20146#bib.bib30)\]\. This setup diverges from real\-world engineering practice, where CAD models are not purely geometric but also encode relationships between components and semantic properties under domain\-specific constraints \[[36](https://arxiv.org/html/2606.20146#bib.bib36),[7](https://arxiv.org/html/2606.20146#bib.bib7),[40](https://arxiv.org/html/2606.20146#bib.bib40),[2](https://arxiv.org/html/2606.20146#bib.bib2)\]\. Additionally, in practice, CAD workflows rarely start from scratch and instead involve collaboration among experts on shared models \[[34](https://arxiv.org/html/2606.20146#bib.bib34)\]\.

We identify three key discrepancies between current CAD benchmarks and real\-world requirements\. \(G1\) Interaction with existing CAD models\. A realistic benchmark should evaluate whether LLMs can interpret and modify existing CAD models\. Prior work has mostly focused on visual scene understanding in multimodal LLMs \[[1](https://arxiv.org/html/2606.20146#bib.bib1),[32](https://arxiv.org/html/2606.20146#bib.bib32),[33](https://arxiv.org/html/2606.20146#bib.bib33),[47](https://arxiv.org/html/2606.20146#bib.bib47)\]\. In contrast, the ability to retrieve information from and manipulate large CAD models via code remains largely underexplored \[[26](https://arxiv.org/html/2606.20146#bib.bib26)\]\. \(G2\) Consistency of modifications\. Benchmarks should ensure that modifications to CAD models not only produce correct geometry but also preserve consistency in topology and semantics, such as element classes, material properties, containment, and connectivity\. A model can be invalid from an engineering perspective even when it appears geometrically correct\. For example, two structural beams that appear to be in contact but are actually disjoint compromise structural integrity \[[35](https://arxiv.org/html/2606.20146#bib.bib35)\]\. \(G3\) Handling implicit natural language\. In real\-world scenarios, edit requests are often underspecified and refer to elements through spatial, semantic, or topological relations\. For example, *“Enlarge the central hole of the wooden element”* provides only contextual spatial and semantic cues, but no explicit element IDs \[[45](https://arxiv.org/html/2606.20146#bib.bib45)\]\.

BIM\-Edit addresses these gaps: it operates on existing IFC models \(G1\) and evaluates LLMs in two complementary dimensions that target G2 and G3\. The first dimension is valid BIM manipulation in create, update, and delete tasks\. Each edit is scored with geometry, topology, and semantics metrics, which together also capture engineering integrity \(G2\)\. The second dimension is contextual scene understanding\. Each task is written in one of three instruction variants: direct, spatial, or topological, with increasing underspecification of prompts \(G3\)\. Direct instructions explicitly specify the target element and the required change\. Spatial prompts specify the target indirectly through geometric context, such as position, direction, or distance\. Topological prompts specify the target indirectly through BIM relations, such as adjacency, containment, hosting, or connectivity\. These variants are used in both small models and larger realistic house models, allowing controlled evaluation across instruction ambiguity and scene complexity \(see Figure [1](https://arxiv.org/html/2606.20146#S1.F1)\)\. The resulting benchmark is challenging for LLMs: the best of seven proprietary and open\-weight models reaches an average score of only 49\.48%\. Our contributions are:

- BIM\-Edit, a benchmark of 324 natural\-language editing tasks spanning create, update, and delete operations under direct, spatial, and topological instructions on small to large IFC models\.
- A three\-axis evaluation protocol that scores edits on geometric accuracy, semantic validity, and topological consistency rather than geometry alone\.
- Experiments on seven recent LLMs that assess their performance on BIM\-Edit and analyze their strengths and limitations in BIM editing\.

## 2 Related Work

Related work can be grouped into benchmarks that \(i\) evaluate 3D scene understanding more generally, \(ii\) study CAD modeling in different applications, and \(iii\) specifically investigate BIM workflows\.

#### 3D scene understanding benchmarks\.

Existing work evaluates whether LLMs and MLLMs can reason about 3D scenes, for example through visual question answering or spatial reasoning tasks over object locations, relations, and scene composition \[[1](https://arxiv.org/html/2606.20146#bib.bib1),[32](https://arxiv.org/html/2606.20146#bib.bib32),[33](https://arxiv.org/html/2606.20146#bib.bib33),[47](https://arxiv.org/html/2606.20146#bib.bib47)\]\. Most relevant in our context is emerging work on programmatic access to CAD artifacts, such as QueryCAD \[[26](https://arxiv.org/html/2606.20146#bib.bib26)\], which evaluates whether LLMs can synthesize code to extract information from CAD models\. These benchmarks capture important parts of the indirect reasoning required in our setting, especially for spatial and relational references\. However, they evaluate perception and reasoning in isolation: models are not required to generate executable CAD or BIM artifacts, nor to preserve consistency across geometry, semantics, and topology\.

#### CAD modeling benchmarks\.

Most CAD modeling benchmarks study *generation* of complete models from modalities such as text \[[25](https://arxiv.org/html/2606.20146#bib.bib25),[29](https://arxiv.org/html/2606.20146#bib.bib29),[38](https://arxiv.org/html/2606.20146#bib.bib38)\], images \[[5](https://arxiv.org/html/2606.20146#bib.bib5),[14](https://arxiv.org/html/2606.20146#bib.bib14)\], or other geometric inputs \[[39](https://arxiv.org/html/2606.20146#bib.bib39),[37](https://arxiv.org/html/2606.20146#bib.bib37),[11](https://arxiv.org/html/2606.20146#bib.bib11)\], often using large repositories such as ShapeNet \[[4](https://arxiv.org/html/2606.20146#bib.bib4)\], DeepCAD \[[43](https://arxiv.org/html/2606.20146#bib.bib43)\], ABC \[[27](https://arxiv.org/html/2606.20146#bib.bib27)\], and Fusion360 \[[42](https://arxiv.org/html/2606.20146#bib.bib42)\]\. For example, Text2CAD \[[25](https://arxiv.org/html/2606.20146#bib.bib25)\] benchmarks text\-to\-CAD generation across prompts with different levels of specificity, which is conceptually related to our distinction between direct and indirect task formulations\. Other work extends this setting to multimodal prompting or LLM\-based evaluation of generated CAD models \[[10](https://arxiv.org/html/2606.20146#bib.bib10),[30](https://arxiv.org/html/2606.20146#bib.bib30)\]\. However, these benchmarks primarily assess model synthesis from scratch rather than modification of an existing artifact and, therefore, require no understanding of existing scenes\. More closely related is a parallel line of work on *CAD editing*, which targets natural\-language edits over parametric CAD models: the model receives a source model and a natural\-language edit instruction and must transform it into the desired target \[[45](https://arxiv.org/html/2606.20146#bib.bib45),[18](https://arxiv.org/html/2606.20146#bib.bib18)\]\. CAD\-Editor \[[45](https://arxiv.org/html/2606.20146#bib.bib45)\] introduced this setting with a benchmark of create, update, and delete operations on existing CAD artifacts\. A related setting is studied in BlenderGym \[[16](https://arxiv.org/html/2606.20146#bib.bib16)\], which considers edits from start\-goal pairs, but assumes access to an explicit BlenderPython construction sequence\. However, these benchmarks operate on CAD representations that primarily capture geometry, limiting their ability to assess engineering validity\. In contrast, BIM models encode explicit relationships and domain semantics, enabling evaluation of structural consistency and functional correctness required for real\-world Architecture, Engineering, and Construction \(AEC\) applications\.

#### Building Information Modeling\.

Building Information Modeling is a life\-cycle data\-management paradigm that is established in civil engineering to ensure tool interoperability\. It extends traditional CAD geometry models with a semantic, object\-oriented model comprising classes, attributes, and explicit relationships\. The established data exchange format is the Industry Foundation Classes \(IFC\), an open, vendor\-neutral standard \[[21](https://arxiv.org/html/2606.20146#bib.bib21)\] that is independent of proprietary BIM tools and formats \(e\.g\., Revit, Vectorworks\)\. Within IFC, building components \(e\.g\., walls, slabs, windows\) are defined as typed objects that encapsulate geometry, placement, properties, and inter\-element relationships such as spatial containment\. As a result, BIM models must maintain consistency of these properties, rather than being evaluated solely on geometric representation as in conventional CAD \[[3](https://arxiv.org/html/2606.20146#bib.bib3)\]\. Recent work has begun to explore LLM\-based systems for BIM, targeting tasks such as information retrieval and Text2BIM generation \[[19](https://arxiv.org/html/2606.20146#bib.bib19),[9](https://arxiv.org/html/2606.20146#bib.bib9),[48](https://arxiv.org/html/2606.20146#bib.bib48),[8](https://arxiv.org/html/2606.20146#bib.bib8),[23](https://arxiv.org/html/2606.20146#bib.bib23),[13](https://arxiv.org/html/2606.20146#bib.bib13),[31](https://arxiv.org/html/2606.20146#bib.bib31)\]\. However, these approaches are typically validated on small from\-scratch generation scenarios, and no unified benchmark systematically evaluates LLMs on BIM editing\. BIM\-Edit closes this gap with broad coverage of create and edit tasks that test contextual understanding \(G1\) and implicit prompting \(G3\) and that must result in valid engineering models \(G2\)\.

## 3 Methodology

![Refer to caption](https://arxiv.org/html/2606.20146v1/x2.png)

Figure 2: Overview of the task structure of the benchmark\. Each task consists of an operator, a target element, and a combination of an instruction type and a scene complexity\. Therefore, each operation can be done in six different ways \(Instruction Type $\times$ Scene Complexity\)\.

BIM\-Edit evaluates how well LLMs perform in editing existing structured 3D building models from natural\-language instructions\. The LLM must identify the referenced part of the scene, apply the requested change, and preserve the rest of the model\. This reflects real BIM workflows, where edits occur inside large shared models, and a visually plausible result can still be invalid if it breaks element types, spatial relations, or properties\. Therefore, BIM\-Edit evaluates the full edited model, not only its geometry\. We define each BIM\-Edit task as a triplet $(M^{0}, x, M^{\ast})$, where $M^{0}$ is the input IFC model, $x$ is a natural\-language edit instruction, and $M^{\ast}$ is the manually authored ground\-truth model\. The agent receives $(M^{0}, x)$ and must produce an edited model $M'$\. During the evaluation, $M'$ is then compared to $M^{\ast}$ with respect to geometric, semantic, and topological correctness\. This makes BIM\-Edit independent of the agent design, since any system that reads and writes IFC files can be evaluated under the same protocol\.

### 3\.1 Benchmark construction

#### Modular Task Syntax\.

BIM\-Edit contains 324 tasks across 11 distinct large IFC models and 36 synthetic models\. To enable reliable failure analysis, we structure the benchmark around tasks with controlled variations, as shown in Figure [2](https://arxiv.org/html/2606.20146#S3.F2)\. Each prompt follows a strict pattern: \(i\) Operator \(create, update, delete\), which specifies the underlying editing objective; \(ii\) Target Element \(e\.g\., wall, window, room\); and \(iii\) Instruction, which determines whether implicit information must be inferred from context\. Instructions always have three variants and can refer to elements either directly \(e\.g\., via element IDs\) or indirectly through spatial or topological descriptions\. Additionally, tasks are evaluated in two scenarios: \(i\) a simple scene with a limited number of elements, and \(ii\) a complex, realistic building, enabling assessment of the impact of increasing scene complexity\.

This design allows us to systematically evaluate tasks across three complexity dimensions: operator complexity \(create, update, delete\), reference complexity \(direct, spatial, topological\), and scene complexity \(simple, complex\)\. In detail:

#### Operation complexity\.

This dimension evaluates which operator types are handled well by the LLMs\. *Create* tasks require adding one or more new BIM elements\. *Update* tasks require modifying existing elements, for example by changing their size, position, or shape\. Lastly, *Delete* tasks require removing target elements and cleaning up dependent structure where necessary\. We focus on common architectural entities as target elements, including walls, slabs, doors, windows, columns, and spaces\. Furthermore, although the tasks always refer to exactly one target element, this does not mean that only one BIM element needs to be edited\. For example, moving a window requires editing both the position of the window and the wall opening that it fills\. Spaces are a special case of building elements, as they represent non\-physical, semantic constructs that are mostly defined by surrounding building elements such as walls, rather than physical objects themselves\. As a result, modifying spaces may require indirect changes to adjacent structural elements\.

#### Reference complexity\.

Varying the instruction types assesses the contextual understanding and reasoning capabilities of the LLMs with respect to implicit user prompts \(G3\)\. First, *direct* instructions explicitly name the target element \(e\.g\., via ID\) and fully specify all required parameters, including geometric properties and any relevant relations, thus requiring minimal scene understanding\. Indirect instructions require the system to understand the scene by querying the IFC model first\. Here, *spatial* instructions use geometric context such as relative position, distance, orientation, or viewpoint, whereas *topological* tasks rely on BIM relations including adjacency, containment, hosting, or connectivity\. This uses the representational structure of IFC models, as they can be understood as semantic property graphs in which building elements form nodes with semantic classes and geometric representations, while relations between elements form edges\. Consequently, the instruction categories induce different reasoning requirements: direct instructions rely on explicit element references, spatial instructions require geometric reasoning over the scene layout, and topological instructions require reasoning over relations such as adjacency, containment, and connectivity\.

#### Scene complexity\.

This dimension evaluates how well the LLMs can scale to large scenes providing potentially ambiguous context\. We distinguish between *simple* and *complex* models based on structural complexity\. Simple models consist of small BIM substructures with a limited number of elements and relations, designed to isolate specific editing behaviors\. On average, they contain 21\.03 elements and 98\.74 relations\. The complex models represent realistic house\-like IFC files with larger layouts and denser relational structure\. On average, they contain 614\.88 elements and 2088\.88 relations\. All IFC models in BIM\-Edit are manually created by human experts \(G1\)\. Examples of both models are shown in Appendix [I\.5](https://arxiv.org/html/2606.20146#A9.SS5)\. This distinction allows us to evaluate performance both in controlled settings and in realistic scenarios where the target element must be identified within a large and complex context, with each setting containing 162 tasks\.

### 3\.2 Evaluation Metrics

We design a custom evaluation suite for BIM\-Edit because standard CAD metrics mainly focus on geometry\[[46](https://arxiv.org/html/2606.20146#bib.bib46)\]\. A valid IFC edit must also be correct in its element definitions and topological relationships\. For example, a prediction may match the reference shape exactly but still be partially invalid if it uses the wrong element class or adds a door without properly attaching it to its host wall\. We therefore evaluate each prediction along three axes: geometry, semantics, and topology\. The final score is defined as their unweighted mean:

$$S = \tfrac{1}{3}\bigl(S_{\text{geo}} + S_{\text{sem}} + S_{\text{topo}}\bigr), \tag{1}$$

where $S_{\text{geo}}$, $S_{\text{sem}}$, and $S_{\text{topo}}$ represent the geometry, semantic, and topology scores, respectively\. Each score is bounded between 0 and 1, where higher is better\.
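As a minimal sketch of Eq. (1), the aggregation can be written as a plain average. The function name `final_score` and the range check are illustrative assumptions, not part of the benchmark's released code.

```python
def final_score(s_geo: float, s_sem: float, s_topo: float) -> float:
    """Unweighted mean of the three per-axis scores (Eq. 1)."""
    for name, value in (("s_geo", s_geo), ("s_sem", s_sem), ("s_topo", s_topo)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return (s_geo + s_sem + s_topo) / 3.0


# Hypothetical edit: exact shape, half of the semantics right, no correct relations.
example = final_score(1.0, 0.5, 0.0)
print(f"{example:.2f}")
```

Because the mean is unweighted, a geometrically exact edit that gets no relations right cannot score above two thirds.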

Most edits affect only a small region of a much larger building model\. As stated above, IFC models are semantic property graphs $G=(N,R)$, where building elements correspond to nodes $n \in N$ with semantic classes $c_n$, attribute sets $a_n$, and geometric representations $m_n$, while inter\-element relations are edges $r \in R$ with a class $c_r$\. Comparing complete IFC files would be dominated by unchanged graph structure\. To correctly reflect edits, we evaluate only the edit graph $G_\Delta$\. Let $G^{\ast}$ and $G'$ denote the semantic property graphs corresponding to the ground\-truth model $M^{\ast}$ and the predicted model $M'$, respectively\. Using a lightweight IFC diff built on IfcOpenShell \[[20](https://arxiv.org/html/2606.20146#bib.bib20)\], we compute the edit graph $G_\Delta=(N_\Delta,R_\Delta)$ that identifies nodes and relations that were added, removed, or modified between the two graphs\. Modifications also include changes to semantic classes $c_n$, attributes $a_n$, geometry $m_n$, or relation types $c_r$\. All metrics are computed on these edit sets, so the score reflects the quality of the edit itself rather than the unchanged parts of the model\. Together, the three metrics evaluate not only geometric correctness but also engineering validity by verifying that the semantic and topological consistency of the IFC model is preserved \(G2\)\.
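The idea of an edit graph can be illustrated with a small set-based diff over two property graphs. This is only a sketch under simplifying assumptions: it is not the authors' IfcOpenShell-based diff, element ids are assumed to be stable across both graphs, and the `Node` fields, the `"hosts"` relation label, and the string geometry keys are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    ifc_class: str            # semantic class c_n
    attributes: tuple = ()    # attribute set a_n as sorted (key, value) pairs
    geometry_key: str = ""    # stand-in for m_n, e.g. a hash of the mesh


def diff_graphs(nodes_a, rels_a, nodes_b, rels_b):
    """Return the edit sets that turn graph A into graph B.

    nodes_*: dict mapping element id -> Node
    rels_*:  set of (relation_class, source_id, target_id) tuples
    """
    shared = nodes_a.keys() & nodes_b.keys()
    return {
        "nodes_added": set(nodes_b) - set(nodes_a),
        "nodes_removed": set(nodes_a) - set(nodes_b),
        "nodes_modified": {i for i in shared if nodes_a[i] != nodes_b[i]},
        # A relation whose class changes shows up as one removal plus one addition.
        "rels_added": rels_b - rels_a,
        "rels_removed": rels_a - rels_b,
    }


# Toy scene: a window is moved and a door is added to the same wall.
before_nodes = {
    "wall1": Node("IfcWall", geometry_key="g-wall"),
    "win1": Node("IfcWindow", geometry_key="g-win-old"),
}
after_nodes = {
    "wall1": Node("IfcWall", geometry_key="g-wall"),
    "win1": Node("IfcWindow", geometry_key="g-win-new"),
    "door1": Node("IfcDoor", geometry_key="g-door"),
}
before_rels = {("hosts", "wall1", "win1")}
after_rels = {("hosts", "wall1", "win1"), ("hosts", "wall1", "door1")}

delta = diff_graphs(before_nodes, before_rels, after_nodes, after_rels)
print(delta["nodes_added"], delta["nodes_modified"], delta["rels_added"])
```

The unchanged wall never enters the edit sets, which is the property the metrics rely on: only touched nodes and relations contribute to the score.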

#### Geometry Score\.

The geometry score measures whether the edit produces the correct shape in the correct location\. We uniformly sample points from the surfaces of all $n \in N_\Delta$ in their reference geometry $m^{\ast}_{n}$ and predicted geometry $m'_{n}$ to obtain point clouds $P^{\ast}$ and $P'$\. We then compare these point clouds using the median bidirectional Chamfer distance $\mathrm{CD}_{\mathrm{med}}$ \[[12](https://arxiv.org/html/2606.20146#bib.bib12)\]\. This set\-level comparison supports edits that affect multiple objects without requiring one\-to\-one object matching\. Since Chamfer distance is unbounded, we normalize it by the diagonal $D$ of the joint bounding box of the two point clouds and map it to $[0,1]$ using exponential decay:

$$S_{\text{geo}} = \exp\!\left(-\,\frac{\mathrm{CD}_{\mathrm{med}}(P^{\ast}, P')}{D} \cdot \alpha\right), \tag{2}$$

where $S_{\text{geo}}=1$ indicates a perfect geometric match, and $S_{\text{geo}} \to 0$ as the normalized geometric error increases\. The parameter $\alpha=5$ controls how quickly the score decreases as the error increases\.
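Eq. (2) can be sketched in pure Python on already sampled point clouds. Several details here are assumptions, since the text does not spell them out: the median Chamfer distance is taken as the mean of the two directional medians of nearest-neighbour distances, the bounding box is axis-aligned, and the handling of empty clouds is a guess. The brute-force neighbour search is only suitable for small examples.

```python
import math
from statistics import median


def _nearest_distances(src, dst):
    """Distance from every point in src to its nearest neighbour in dst (brute force)."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def median_chamfer(p_ref, p_pred):
    """Assumed definition: mean of the two directional medians."""
    forward = median(_nearest_distances(p_ref, p_pred))
    backward = median(_nearest_distances(p_pred, p_ref))
    return 0.5 * (forward + backward)


def joint_bbox_diagonal(p_ref, p_pred):
    """Diagonal D of the axis-aligned box enclosing both clouds."""
    points = list(p_ref) + list(p_pred)
    lows = [min(axis) for axis in zip(*points)]
    highs = [max(axis) for axis in zip(*points)]
    return math.dist(lows, highs)


def geometry_score(p_ref, p_pred, alpha=5.0):
    if not p_ref or not p_pred:
        # Assumption: two empty clouds agree, one empty cloud is a total miss.
        return 1.0 if not p_ref and not p_pred else 0.0
    diagonal = joint_bbox_diagonal(p_ref, p_pred)
    if diagonal == 0.0:
        return 1.0  # every sampled point coincides
    return math.exp(-alpha * median_chamfer(p_ref, p_pred) / diagonal)


# Corners of a unit square versus the same square shifted by 0.1 along x.
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 0.1, y, z) for x, y, z in square]
print(geometry_score(square, square), geometry_score(square, shifted))
```

Normalizing by the diagonal makes the score scale-invariant: the same relative offset is penalized equally for a door and for a whole floor slab.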

#### Semantic Score\.

The semantic score measures whether the edited objects carry the correct type \(G2\)\. A geometrically valid result can still be incorrect if, for example, the agent represents a door as a window object\. When node identities differ between $N'$ and $N^{\ast}$, we compute a one\-to\-one matching using the Hungarian algorithm \[[28](https://arxiv.org/html/2606.20146#bib.bib28)\] based on oriented bounding box IoU \(OBB\-IoU\)\. For each matched pair, we compute two semantic terms\. The first term checks whether the predicted object has the same IFC class as the reference object\. The second term measures the fraction of semantic properties relevant to the task that match, such as predefined type, name, material, or other attributes used by the task\. For delete tasks, where no edited object remains to compare, the semantic term is considered correct if the intended target object is successfully removed\. The semantic score for a matched pair is the average of these two terms\. The task\-level semantic score is the mean over all reference edited objects after matching each reference object to its assigned prediction, with unmatched references assigned zero:

$$S_{\text{sem}} = \frac{1}{|N_{\Delta}|} \sum_{n \in N_{\Delta}} \frac{1}{2}\Bigl(\mathbbm{1}\!\left[c'_{n} = c_{n}^{\ast}\right] + \rho(a'_{n}, a_{n}^{\ast})\Bigr), \tag{3}$$

where $c_n$ denotes the IFC class of an object, $\mathbbm{1}$ is the indicator function, and $\rho(a'_{n}, a_{n}^{\ast}) \in [0,1]$ is the property score defined above\.
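The following sketch evaluates Eq. (3) for objects that have already been matched, so the Hungarian matching on OBB-IoU is assumed to have happened upstream. The dictionary layout (`"ifc_class"`, `"attrs"`), the treatment of an empty property list, and the omission of the delete-task special case are simplifying assumptions for illustration.

```python
def property_score(pred_attrs: dict, ref_attrs: dict) -> float:
    """rho: fraction of task-relevant reference properties that the prediction matches."""
    if not ref_attrs:
        return 1.0  # assumption: with no task-relevant properties nothing can mismatch
    hits = sum(1 for k, v in ref_attrs.items() if k in pred_attrs and pred_attrs[k] == v)
    return hits / len(ref_attrs)


def semantic_score(pairs) -> float:
    """pairs: one (ref, pred) tuple per reference edited object; pred is None if unmatched.

    Each object is a dict with the keys "ifc_class" and "attrs".
    """
    if not pairs:
        raise ValueError("semantic score is undefined for an empty edit set")
    total = 0.0
    for ref, pred in pairs:
        if pred is None:
            continue  # an unmatched reference object contributes zero
        class_term = 1.0 if pred["ifc_class"] == ref["ifc_class"] else 0.0
        total += 0.5 * (class_term + property_score(pred["attrs"], ref["attrs"]))
    return total / len(pairs)


# A door that was written as a window with one of two properties right,
# plus a reference wall for which no prediction was produced.
ref_door = {"ifc_class": "IfcDoor", "attrs": {"Name": "D1", "PredefinedType": "DOOR"}}
pred_window = {"ifc_class": "IfcWindow", "attrs": {"Name": "D1"}}
ref_wall = {"ifc_class": "IfcWall", "attrs": {"Name": "W7"}}
print(semantic_score([(ref_door, pred_window), (ref_wall, None)]))
```

In this toy case the first pair earns 0.25 (wrong class, half of the properties) and the unmatched wall earns 0, giving a task-level score of 0.125.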

#### Topology Score\.

The topology score measures changes in the graph edges, which are critical for IFC validity\. For example, a missing connection on a load\-bearing column can have serious structural implications\. The metric directly evaluates changes on the difference graph $G_\Delta=(N_\Delta,R_\Delta)$, considering node and relation edits\. To scale to large IFC graphs, we align node identities using a greedy bipartite matching heuristic and match relations only between aligned node pairs\. We compute $\operatorname{F1}$ scores separately for edited nodes and edited relations, and combine them as

$$S_{\text{topo}} = \lambda\,\operatorname{F1}(N', N^{\ast}) + (1-\lambda)\,\operatorname{F1}(R', R^{\ast}). \tag{4}$$

We set $\lambda=0.3$, so relation correctness receives the higher weight $(1-\lambda=0.7)$ because the relations more strongly determine the validity of the model\. If the reference edit contains no topological modifications, we assign $S_{\mathrm{topo}}=1$ when the prediction also introduces none, and $S_{\mathrm{topo}}=0$ otherwise\.
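Eq. (4) reduces to two set-level F1 computations once node identities are aligned. The sketch below assumes that alignment has already been done, so predicted and reference elements share ids, and encodes relations as `(relation_class, source, target)` tuples. That encoding and the element ids are illustrative choices; `IfcRelVoidsElement` and `IfcRelFillsElement` are standard IFC relation entities used here only as labels.

```python
def f1(pred: set, ref: set) -> float:
    """Set-level F1 between predicted and reference edit sets."""
    if not pred and not ref:
        return 1.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def topology_score(pred_nodes, ref_nodes, pred_rels, ref_rels, lam=0.3):
    """Weighted combination of node F1 and relation F1 (Eq. 4)."""
    if not ref_nodes and not ref_rels:
        # No topological change expected: any predicted change is penalized fully.
        return 1.0 if not pred_nodes and not pred_rels else 0.0
    return lam * f1(pred_nodes, ref_nodes) + (1 - lam) * f1(pred_rels, ref_rels)


# A door is created, but the agent forgets to cut the opening into the host wall.
ref_nodes = {"door1"}
pred_nodes = {"door1"}
ref_rels = {
    ("IfcRelVoidsElement", "wall1", "opening1"),
    ("IfcRelFillsElement", "opening1", "door1"),
}
pred_rels = {("IfcRelFillsElement", "opening1", "door1")}
print(round(topology_score(pred_nodes, ref_nodes, pred_rels, ref_rels), 4))
```

With one of two relations missing, relation F1 is 2/3, so the score is 0.3 + 0.7 × 2/3, roughly 0.77, even though every edited node is correct.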

## 4Evaluation

### 4\.1Experimental Setup

We evaluate LLM agents in a code\-execution environment for IFC editing\. For each of the 324 tasks, an agent receives the natural\-language instruction and the path to an input IFC model\. The agent has access to a single tool, execute\_ifc\_code\(code: str\), which executes generated Python code in a sandbox environment where the IFC model is preloaded using IfcOpenShell \(see Appendix [F\.4](https://arxiv.org/html/2606.20146#A6.SS4)\)\. The agent must generate Python code and pass it to the tool to interact with the IFC model\. It can use multiple tool calls, first to inspect the IFC model and later to apply edits such as creating geometry, changing placements, or editing semantic properties\. For indirect instructions, the target elements and required operations are not explicitly stated\. The LLM therefore has to query the IFC model, identify the relevant context, and ground its edit in spatial or relational reasoning\. We adopt code generation rather than a curated toolset to provide a broad and flexible action space\. This design evaluates whether LLMs can transfer general\-purpose code generation capabilities to structured BIM editing without task\-specific fine\-tuning\. A run ends when the agent declares the task is complete or reaches the budget of 20 tool calls\. The final saved IFC file is the only artifact passed to the evaluator\. We evaluate seven models on BIM\-Edit, including proprietary and open\-weight models: Gemini 3\.0 Flash, Qwen 3\.6 Plus, Claude Sonnet 4\.6, GPT\-5\.4 Pro, GPT\-5\.4 Mini, Gemma 4 31B, and DeepSeek V3\.2\. All models are tested with the same agent harness, system prompt \(Appendix [F\.3](https://arxiv.org/html/2606.20146#A6.SS3)\), and code\-execution tool \(Appendix [F\.4](https://arxiv.org/html/2606.20146#A6.SS4)\)\.
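The control flow of such a run can be sketched as a bounded loop around the single tool. Everything below is a hypothetical stand-in: `next_action` represents the LLM policy, `fake_tool` replaces the real sandboxed `execute_ifc_code`, and the actual harness (Appendix F) may differ in how it signals completion and records results. Only the budget of 20 tool calls is taken from the text.

```python
from typing import Callable, List, Optional, Tuple

MAX_TOOL_CALLS = 20  # budget stated in the experimental setup

Transcript = List[Tuple[str, str]]


def run_episode(
    next_action: Callable[[Transcript], Optional[str]],
    execute_ifc_code: Callable[[str], str],
) -> Transcript:
    """Drive one task until the agent stops or the tool-call budget is exhausted.

    next_action returns Python source to execute, or None to declare completion.
    """
    transcript: Transcript = []
    for _ in range(MAX_TOOL_CALLS):
        code = next_action(transcript)
        if code is None:  # the agent declares the task complete
            break
        transcript.append((code, execute_ifc_code(code)))
    return transcript


def fake_tool(code: str) -> str:
    """Stand-in for the sandbox: echoes instead of touching an IFC model."""
    return f"executed {len(code)} characters"


def inspect_then_edit(transcript: Transcript) -> Optional[str]:
    """Toy policy: one inspection call, one edit call, then stop."""
    steps = ["# inspect the model", "# apply the edit"]
    return steps[len(transcript)] if len(transcript) < len(steps) else None


print(len(run_episode(inspect_then_edit, fake_tool)))
```

A policy that never returns `None` is cut off after 20 calls, which mirrors the termination rule described above.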

### 4\.2Main Results

#### BIM\-Edit exposes substantial gaps in current LLM capabilities\.

Table [1](https://arxiv.org/html/2606.20146#S4.T1) summarizes overall performance on BIM\-Edit\. No evaluated model achieves an average score above 50%, highlighting the difficulty of reliable IFC\-based BIM editing\. Gemini 3\.0 Flash achieves the best overall performance, followed by Qwen 3\.6 Plus and Claude Sonnet 4\.6\. The per\-metric breakdown reveals complementary model strengths: Gemini 3\.0 Flash achieves the highest geometry and semantic scores, whereas Qwen 3\.6 Plus performs best on topology\. Across all models, geometry scores are consistently higher than semantic and topology scores, suggesting that current LLM agents can often approximate the correct shape while failing to preserve IFC semantics and relational consistency\. The large standard deviations across all metrics further indicate substantial task\-level variance across edit operations, instruction types, and scene contexts\. The different rankings across metrics support our choice to evaluate BIM edits separately for geometry, semantics, and topology\.

Table 1: Average BIM\-Edit scores across 324 tasks on a 0 to 100 scale\. Final is the average of geometry, semantics, and topology metrics\. Entries are mean ± std across tasks\.

| Model | Final ↑ | Geom. ↑ | Sem. ↑ | Topo. ↑ |
| --- | --- | --- | --- | --- |
| Gemini 3.0 Flash | 49.48 ± 32.72 | 68.87 ± 43.55 | 41.81 ± 46.67 | 37.77 ± 38.46 |
| Qwen 3.6 Plus | 47.63 ± 35.35 | 59.21 ± 46.87 | 35.93 ± 44.14 | 47.77 ± 42.65 |
| Claude Sonnet 4.6 | 45.31 ± 34.72 | 55.90 ± 47.95 | 33.85 ± 44.10 | 46.19 ± 43.66 |
| GPT-5.4 Pro | 43.94 ± 34.35 | 50.14 ± 47.86 | 38.69 ± 44.64 | 42.97 ± 40.89 |
| DeepSeek V3.2 | 43.21 ± 35.19 | 50.97 ± 47.30 | 34.80 ± 44.41 | 43.86 ± 42.73 |
| GPT-5.4 Mini | 39.79 ± 34.21 | 45.18 ± 47.29 | 38.80 ± 45.89 | 35.38 ± 39.46 |
| Gemma 4 31B | 37.54 ± 36.17 | 44.94 ± 48.07 | 29.94 ± 44.61 | 37.75 ± 41.94 |

Table 2: Strict BIM\-Edit solve rates \(%\) at a 98% correctness threshold\. Overall Solve Rate reports tasks where geometry, semantic, and topology pass simultaneously: All over the full 324\-task benchmark, and Direct, Spatial, Topological restricted to the corresponding 108\-task instruction subsets\. Evaluation Dimension reports the share of all 324 tasks that pass each metric individually\.

| Model | Overall: All (324) | Overall: Direct (108) | Overall: Spatial (108) | Overall: Topological (108) | Dimension: Geom. | Dimension: Sem. | Dimension: Topo. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini 3.0 Flash | 1.5 | 0.93 | 1.85 | 1.85 | 40.7 | 35.8 | 14.5 |
| Qwen 3.6 Plus | 3.4 | 4.63 | 3.70 | 1.85 | 37.3 | 25.9 | 27.8 |
| Claude Sonnet 4.6 | 1.9 | 2.78 | 0.93 | 1.85 | 36.7 | 25.9 | 29.9 |
| GPT-5.4 Pro | 1.2 | 3.70 | 0.00 | 0.00 | 27.2 | 28.7 | 20.7 |
| DeepSeek V3.2 | 2.2 | 5.56 | 0.93 | 0.00 | 29.9 | 25.9 | 26.2 |
| GPT-5.4 Mini | 0.6 | 1.85 | 0.00 | 0.00 | 23.8 | 31.8 | 16.0 |
| Gemma 4 31B | 0.9 | 2.78 | 0.00 | 0.00 | 26.9 | 26.2 | 21.3 |

#### Fully correct BIM edits remain extremely rare\.

While the average scores reflect the current capabilities of LLMs on the three evaluation dimensions, they do not directly show how many tasks are solved completely, i\.e\., with near\-perfect scores on all metrics\. We therefore consider a task solved if the LLM achieves a score of at least 98% on each metric\. Table [2](https://arxiv.org/html/2606.20146#S4.T2) shows that the models solve only a very small proportion of tasks, with Qwen 3\.6 Plus achieving the best solve rate of only 3\.4%\. Models generally perform better on direct tasks than on indirect tasks\. This is especially true for the smaller models GPT\-5\.4 Mini and Gemma 4 31B, which do not solve any spatial or topological task\. Considering the pass rate for each evaluation dimension, the best models pass around 40% of tasks on the geometric dimension but mostly remain below 30% on the semantic and topological dimensions\.
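The relation between the averaged final score and the strict solve criterion can be illustrated with a minimal Python sketch\. The function names and the example task scores are hypothetical and only serve as illustration; the 98% threshold and the unweighted averaging follow the text\.

```python
SOLVE_THRESHOLD = 98.0  # strict per-metric threshold on the 0-100 scale


def final_score(geom, sem, topo):
    """Unweighted mean of the geometry, semantic, and topology scores."""
    return (geom + sem + topo) / 3.0


def is_solved(geom, sem, topo, threshold=SOLVE_THRESHOLD):
    """A task counts as solved only if every metric reaches the threshold."""
    return min(geom, sem, topo) >= threshold


def solve_rate(tasks, threshold=SOLVE_THRESHOLD):
    """Percentage of tasks for which all three metrics pass simultaneously."""
    if not tasks:
        return 0.0
    solved = sum(is_solved(g, s, t, threshold) for g, s, t in tasks)
    return 100.0 * solved / len(tasks)


# Hypothetical per-task (geometry, semantics, topology) scores.
example_tasks = [
    (100.0, 100.0, 99.0),  # passes all three metrics
    (99.5, 100.0, 60.0),   # topology fails despite good geometry and semantics
    (97.9, 100.0, 100.0),  # geometry narrowly misses the threshold
    (0.0, 0.0, 0.0),       # failed edit
]

print(solve_rate(example_tasks))
print(round(final_score(68.87, 41.81, 37.77), 2))
```

The second and third example tasks still contribute high values to the average score, which is why average scores near 50% can coexist with solve rates in the low single digits\.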

#### Operation & Reference complexity: Create operations are substantially harder\.

Performance differs strongly across both edit operations and instruction categories, as shown in Figure [3\(a\)](https://arxiv.org/html/2606.20146#S4.F3.sf1)\. Update tasks achieve the highest scores across all models, followed by delete tasks, whereas create tasks are consistently the most challenging setting\. The additional difficulty arises because creation requires the agent not only to generate new geometry, but also to assign correct IFC semantics, place the element coherently within the scene, and establish valid relations to surrounding elements\. Interestingly, instruction type has only a limited effect on aggregated task scores\. Performance remains similar across instruction categories within each edit operation, with indirect tasks occasionally even outperforming direct ones\. Combined with the strict solve rates in Table [2](https://arxiv.org/html/2606.20146#S4.T2), this indicates that models can often satisfy individual evaluation dimensions on indirect tasks, while failing to jointly preserve geometry, semantics, and topology in a fully correct BIM edit\. Furthermore, indirect tasks require substantially more output tokens\. Averaged across all runs, direct instructions require 6\.2k output tokens per task, compared to 11\.4k and 9\.9k for spatial and topological instructions, respectively\. This suggests that models compensate for the additional reference ambiguity through longer reasoning and interaction traces, without consistently translating this additional computation into fully correct BIM edits\.

#### Scene complexity: Larger IFC scenes do not substantially increase difficulty\.

To assess whether larger IFC scenes increase the difficulty of BIM editing, we compare performance across the two scene complexity settings in Figure [3\(b\)](https://arxiv.org/html/2606.20146#S4.F3.sf2)\. Although the complex IFC models contain substantially more elements and relations, most models achieve comparable scores across both settings\. This suggests that the size of the IFC model alone is not the dominant source of difficulty in the experimental setup, likely because the IFC models are accessed programmatically rather than being placed directly into the LLM context\. Scene size may become more relevant in future benchmark settings with different interaction paradigms\.

![Refer to caption](https://arxiv.org/html/2606.20146v1/x3.png) \(a\) Performance by edit operation and instruction type\.
![Refer to caption](https://arxiv.org/html/2606.20146v1/x4.png) \(b\) Performance by scene complexity\.
![Refer to caption](https://arxiv.org/html/2606.20146v1/x5.png) \(c\) Performance versus mean tool\-call rounds per task\.

Figure 3: Additional BIM\-Edit analyses\. \(a\) Performance by edit operation and instruction category\. \(b\) Comparison between simple and complex scenes\. \(c\) Relationship between BIM\-Edit score and agent interaction length\.

#### Operational Effort: Longer agent trajectories do not guarantee better edits\.

Figure [3\(c\)](https://arxiv.org/html/2606.20146#S4.F3.sf3) compares final BIM\-Edit scores with the amount of agent interaction used during evaluation\. This analysis is relevant because BIM\-Edit requires iterative inspection and modification of structured IFC models, making longer interaction trajectories a potential indicator of more extensive reasoning or recovery behavior\. However, the results show that longer trajectories do not necessarily translate into better performance\. Qwen 3\.6 Plus generates the most output tokens \(25\.5k\), and Claude Sonnet 4\.6 uses the largest number of tool\-call rounds \(16\.5\), yet Gemini 3\.0 Flash achieves the highest overall score\. In contrast, GPT\-5\.4 Pro and GPT\-5\.4 Mini rely on comparatively short interaction sequences while remaining competitive with several more verbose models\.

### 4\.3Failure Cases

#### Runtime failures are concentrated in distinct model\-specific patterns\.

To better understand the operational robustness of LLM agents on BIM\-Edit, Table [3](https://arxiv.org/html/2606.20146#S4.T3) reports the percentage of tasks ending in different runtime failure modes \(definitions are provided in Appendix [G\.1](https://arxiv.org/html/2606.20146#A7.SS1)\)\. This analysis is important because BIM\-Edit requires models not only to reason about IFC structures, but also to successfully manipulate them programmatically through IfcOpenShell\. Without this distinction, benchmark performance could be dominated by failures in tool usage rather than limitations in BIM understanding itself\. Since agents may already have modified the IFC model before termination, failed runs do not necessarily correspond to zero\-score outputs\. The results reveal substantially different failure profiles across models\. Claude Sonnet 4\.6 exhibits the highest overall runtime failure rate \(46\.3%\), almost entirely caused by budget exhaustion \(46\.0%\), indicating that the agent frequently continues tool interaction until reaching the maximum tool\-call limit\. In contrast, DeepSeek V3\.2 fails primarily through process crashes and streaming timeouts, suggesting lower execution stability despite avoiding budget exhaustion entirely\. Qwen 3\.6 Plus shows a more distributed failure profile across multiple categories, whereas Gemini 3\.0 Flash and Gemma 4 31B remain comparatively stable apart from moderate rates of budget exhaustion\. The GPT models achieve the highest operational stability, with almost no runtime failures across the benchmark\.

Table 3: Runtime failure rates \(%\) across the different models\.

| Model | Budget | Crash | API | Timeout | Other | Overall |
|---|---|---|---|---|---|---|
| Gemini 3.0 Flash | 13.3 | 0.9 | 0.3 | 0.0 | 0.0 | 14.5 |
| Qwen 3.6 Plus | 17.6 | 2.2 | 3.1 | 1.2 | 0.3 | 24.4 |
| Claude Sonnet 4.6 | 46.0 | 0.0 | 0.3 | 0.0 | 0.0 | 46.3 |
| GPT-5.4 Pro | 0.0 | 0.0 | 0.6 | 0.0 | 0.0 | 0.6 |
| DeepSeek V3.2 | 0.0 | 5.9 | 2.5 | 12.3 | 0.0 | 20.7 |
| GPT-5.4 Mini | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Gemma 4 31B | 8.3 | 4.6 | 1.9 | 0.0 | 0.0 | 14.8 |

## 5Conclusion & Future Work

We introduce BIM\-Edit, a benchmark for natural\-language editing of IFC\-based building models that evaluates edits across geometric accuracy, semantic validity, and topological consistency\. The benchmark contains 324 tasks spanning create, update, and delete operations under direct, spatial, and topological instructions on both simple substructures and complex building models\. Across the seven evaluated LLMs, the best\-performing model achieves an average score of only 49\.48%, and no model fully solves more than 3\.4% of tasks\. The experiments show that current LLM agents can often make partial geometric progress, but rarely preserve geometry, semantics, and topology simultaneously\. As a result, visually plausible edits frequently remain invalid as structured engineering artifacts\. Create operations expose the largest capability gap, while, interestingly, the instruction type has only a limited effect on the aggregate task scores\. However, indirect spatial and topological instructions substantially increase generation cost and are far less likely to result in fully correct BIM edits\. Similarly, increasing scene complexity has limited impact on overall performance in our setup, suggesting that the dominant bottleneck is not raw scene scale, but the ability to execute valid structured edits after the relevant IFC context has been identified\. Overall, BIM\-Edit shows that structured BIM editing remains far from solved for current LLM agents\. Future work should therefore focus on improving the agents’ precision in BIM editing and the coupling of scene understanding and program execution, while also optimizing model cost and response time\. Future benchmark development could focus on multi\-step BIM editing tasks spanning multiple interconnected elements, moving beyond isolated single\-object modifications\.

## References

- Azuma et al\. \[2022\] Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe\. ScanQA: 3d question answering for spatial scene understanding\. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pages 19129–19139, 2022\.
- Beitz et al\. \[1996\] W Beitz, G Pahl, and K Grote\. Engineering design: a systematic approach\. *Mrs Bulletin*, 71\(30\):3, 1996\.
- Borrmann et al\. \[2018\] André Borrmann, Jakob Beetz, Christian Koch, Thomas Liebich, and Sergej Muhic\. Industry foundation classes: A standardized data model for the vendor\-neutral exchange of digital building models\. In *Building information modeling: Technology foundations and industry practice*, pages 81–126\. Springer, 2018\.
- Chang et al\. \[2015\] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al\. Shapenet: An information\-rich 3d model repository\. *arXiv preprint arXiv:1512\.03012*, 2015\.
- Chen et al\. \[2025\] Cheng Chen, Jiacheng Wei, Tianrun Chen, Chi Zhang, Xiaofeng Yang, Shangzhan Zhang, Bingchen Yang, Chuan\-Sheng Foo, Guosheng Lin, Qixing Huang, et al\. Cadcrafter: Generating computer\-aided design models from unconstrained images\. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pages 11073–11082, 2025\.
- Chen et al\. \[2024\] Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, et al\. A survey on evaluating large language models in code generation tasks\. *arXiv preprint arXiv:2408\.16498*, 2024\.
- Corrado et al\. \[2022\] G Corrado, G Ntourmas, M Sferza, N Traiforos, A Arteiro, L Brown, D Chronopoulos, F Daoud, F Glock, J Ninic, et al\. Recent progress, challenges and outlook for multidisciplinary structural optimization of aircraft and aerial vehicles\. *Progress in Aerospace Sciences*, 135:100861, 2022\.
- Deng et al\. \[2025\] Zihan Deng, Changyu Du, Stavros Nousias, and André Borrmann\. BIMgent: Towards autonomous building modeling via computer\-use agents\. *arXiv preprint arXiv:2506\.07217*, 2025\.
- Du et al\. \[2026\] Changyu Du, Sebastian Esser, Stavros Nousias, and André Borrmann\. Text2BIM: Generating building models using a large language model\-based multiagent framework\. *Journal of Computing in Civil Engineering*, 40\(2\):04025142, 2026\.
- Du et al\. \[2024\] Yuhao Du, Shunian Chen, Wenbo Zan, Peizhao Li, Mingxuan Wang, Dingjie Song, Bo Li, Yan Hu, and Benyou Wang\. BlenderLLM: Training large language models for computer\-aided design with self\-improvement\. *arXiv preprint arXiv:2412\.14203*, 2024\.
- Dupont et al\. \[2024\] Elona Dupont, Kseniya Cherenkova, Dimitrios Mallis, Gleb Gusev, Anis Kacem, and Djamila Aouada\. Transcad: A hierarchical transformer for cad sequence inference from point clouds\. In *European Conference on Computer Vision*, pages 19–36\. Springer, 2024\.
- Fan et al\. \[2017\] Haoqiang Fan, Hao Su, and Leonidas J Guibas\. A point set generation network for 3d object reconstruction from a single image\. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, pages 605–613, 2017\.
- Fernandes et al\. \[2024\] David Fernandes, Sahej Garg, Matthew Nikkel, and Gursans Guven\. A gpt\-powered assistant for real\-time interaction with building information models\. *Buildings*, 14\(8\):2499, 2024\.
- Giannone et al\. \[2026\] Giorgio Giannone, Anna Clare Doris, Amin Heyrani Nobari, Kai Xu, Akash Srivastava, and Faez Ahmed\. GIFT: Bootstrapping image\-to\-cad program synthesis via geometric feedback\. *arXiv preprint arXiv:2603\.27448*, 2026\.
- Gu et al\. \[2025a\] Xiaodong Gu, Meng Chen, Yalan Lin, Yuhan Hu, Hongyu Zhang, Chengcheng Wan, Zhao Wei, Yong Xu, and Juhong Wang\. On the effectiveness of large language models in domain\-specific code generation\. *ACM Transactions on Software Engineering and Methodology*, 34\(3\):1–22, 2025a\.
- Gu et al\. \[2025b\] Yunqi Gu, Ian Huang, Jihyeon Je, Guandao Yang, and Leonidas Guibas\. BlenderGym: benchmarking foundational model systems for graphics editing\. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pages 18574–18583, 2025b\.
- Guan et al\. \[2025\] Yandong Guan, Xilin Wang, Ximing Xing, Jing Zhang, Dong Xu, and Qian Yu\. CAD\-Coder: Text\-to\-cad generation with chain\-of\-thought and geometric reward\. *arXiv preprint arXiv:2505\.19713*, 2025\.
- Hasan and Sarkar \[2026\] Md Zahid Hasan and Soumalya Sarkar\. SCOPE: Spatially\-constrained parametric editing for text\-guided cad models\. *Efficient Spatial Reasoning Workshop at ICLR*, 2026\.
- Hellin et al\. \[2025\] Sylvain Hellin, Stavros Nousias, and André Borrmann\. Natural language information retrieval from bim models: An llm\-based multi\-agent system approach\. In *EC3 Conference 2025*, volume 6\. European Council on Computing in Construction, 2025\.
- IfcOpenShell Contributors \[2026\] IfcOpenShell Contributors\. IfcOpenShell: The open source ifc toolkit and geometry engine\. [https://ifcopenshell\.org/](https://ifcopenshell.org/), 2026\. Accessed: 2026\-05\-05\.
- International Organization for Standardization \[2024\] International Organization for Standardization\. ISO 16739\-1:2024 industry foundation classes \(ifc\) for data sharing in the construction and facility management industries – part 1: Data schema\. [https://www\.iso\.org/standard/84123\.html](https://www.iso.org/standard/84123.html), 2024\. Accessed: 2026\-05\-05\.
- Jain et al\. \[2024\] Naman Jain, King Han, Alex Gu, Wen\-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar\-Lezama, Koushik Sen, and Ion Stoica\. LiveCodeBench: Holistic and contamination free evaluation of large language models for code\. *arXiv preprint arXiv:2403\.07974*, 2024\.
- Jang et al\. \[2024\] Suhyung Jang, Ghang Lee, Jiseok Oh, Junghun Lee, and Bonsang Koo\. Automated detailing of exterior walls using nadia: Natural\-language\-based architectural detailing through interaction with ai\. *Advanced Engineering Informatics*, 61:102532, 2024\.
- Jimenez et al\. \[2023\] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan\. SWE\-bench: Can language models resolve real\-world github issues? *arXiv preprint arXiv:2310\.06770*, 2023\.
- Khan et al\. \[2024\] Mohammad S Khan, Sankalp Sinha, Talha U Sheikh, Didier Stricker, Sk A Ali, and Muhammad Z Afzal\. Text2CAD: Generating sequential cad designs from beginner\-to\-expert level text prompts\. *Advances in Neural Information Processing Systems*, 37:7552–7579, 2024\.
- Kienle et al\. \[2025\] Claudius Kienle, Benjamin Alt, Darko Katic, Rainer Jäkel, and Jan Peters\. QueryCAD: Grounded question answering for cad models\. In *2025 IEEE International Conference on Robotics and Automation \(ICRA\)*, pages 5798–5805\. IEEE, 2025\.
- Koch et al\. \[2019\] Sebastian Koch, Albert Matveev, Zhongshi Jiang, Francis Williams, Alexey Artemov, Evgeny Burnaev, Marc Alexa, Denis Zorin, and Daniele Panozzo\. ABC: A big cad model dataset for geometric deep learning\. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pages 9601–9611, 2019\.
- Kuhn \[1955\] Harold W\. Kuhn\. The Hungarian method for the assignment problem\. *Naval Research Logistics Quarterly*, 2\(1–2\):83–97, 1955\.
- Li et al\. \[2025\] Jiahao Li, Weijian Ma, Xueyang Li, Yunzhong Lou, Guichun Zhou, and Xiangdong Zhou\. CAD\-Llama: leveraging large language models for computer\-aided design parametric 3d model generation\. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pages 18563–18573, 2025\.
- Li et al\. \[2024\] Xingang Li, Yuewan Sun, and Zhenghui Sha\. LLM4CAD: Multi\-modal large language models for 3d computer\-aided design generation\. In *International Design Engineering Technical Conferences and Computers and Information in Engineering Conference*, volume 88407, page V006T06A015\. American Society of Mechanical Engineers, 2024\.
- Liu and Chen \[2025\] Bingru Liu and Hainan Chen\. BIMCoder: A comprehensive large language model fusion framework for natural language\-based bim information retrieval\. *Applied Sciences*, 15\(14\):7647, 2025\.
- Liu et al\. \[2025\] Jingping Liu, Ziyan Liu, Zhedong Cen, Yan Zhou, Yinan Zou, Weiyan Zhang, Haiyun Jiang, and Tong Ruan\. Can multimodal large language models understand spatial relations? In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 620–632, 2025\.
- Ma et al\. \[2025\] Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu\-Cheng Chou, Jieneng Chen, Celso de Melo, and Alan Yuille\. 3DSRBench: A comprehensive 3d spatial reasoning benchmark\. In *Proceedings of the IEEE/CVF International Conference on Computer Vision*, pages 6924–6934, 2025\.
- Martins and Lambe \[2013\] Joaquim RRA Martins and Andrew B Lambe\. Multidisciplinary design optimization: a survey of architectures\. *AIAA journal*, 51\(9\):2049–2075, 2013\.
- Ploennigs et al\. \[2025\] Joern Ploennigs, Markus Berger, Thomas Wortmann, Jakob Kirchner, Jakob Beetz, Alina Roitberg, Karsten Menzel, and Björn Ommer\. Building foundation models\-potentials, challenges and research directions for using llm and lvm in aec\. In *EC3 Conference 2025*, volume 6\. European Council on Computing in Construction, 2025\.
- Qureshi et al\. \[2012\] Ahmed Jawad Qureshi, Jean\-Yves Dantan, Vahid Sabri, Paul Beaucaire, and Nicolas Gayton\. A statistical tolerance analysis approach for over\-constrained mechanism based on optimization and monte carlo simulation\. *Computer\-Aided Design*, 44\(2\):132–142, 2012\.
- Seff et al\. \[2021\] Ari Seff, Wenda Zhou, Nick Richardson, and Ryan P Adams\. Vitruvion: A generative model of parametric cad sketches\. *arXiv preprint arXiv:2109\.14124*, 2021\.
- Wang et al\. \[2025a\] Ruiyu Wang, Yu Yuan, Shizhao Sun, and Jiang Bian\. Text\-to\-cad generation through infusing visual feedback in large language models\. *arXiv preprint arXiv:2501\.19054*, 2025a\.
- Wang et al\. \[2025b\] Xilin Wang, Jia Zheng, Yuanchao Hu, Hao Zhu, Qian Yu, and Zihan Zhou\. From 2d cad drawings to 3d parametric models: A vision\-language approach\. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 39, pages 7961–7969, 2025b\.
- Wang et al\. \[2024\] Zijian Wang, Rafael Sacks, Boyuan Ouyang, Huaquan Ying, and André Borrmann\. A framework for generic semantic enrichment of bim models\. *Journal of Computing in Civil Engineering*, 38\(1\):04023038, 2024\.
- Wei and Li \[2025\] Yinyi Wei and Xiao Li\. Text\-to\-code generation for modular building layouts in building information modeling\. In *The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.
- Willis et al\. \[2021\] Karl DD Willis, Yewen Pu, Jieliang Luo, Hang Chu, Tao Du, Joseph G Lambourne, Armando Solar\-Lezama, and Wojciech Matusik\. Fusion 360 gallery: A dataset and environment for programmatic cad construction from human design sequences\. *ACM Transactions on Graphics \(TOG\)*, 40\(4\):1–24, 2021\.
- Wu et al\. \[2021\] Rundi Wu, Chang Xiao, and Changxi Zheng\. Deepcad: A deep generative network for computer\-aided design models\. In *Proceedings of the IEEE/CVF International Conference on Computer Vision*, pages 6772–6782, 2021\.
- Yang et al\. \[2026\] Luyu Yang, Yutong Dai, An Yan, Viraj Prabhu, Ran Xu, and Zeyuan Chen\. How far are vision\-language models from constructing the real world? a benchmark for physical generative reasoning\. *arXiv preprint arXiv:2603\.24866*, 2026\. doi:10\.48550/arXiv\.2603\.24866\. URL [https://arxiv\.org/abs/2603\.24866](https://arxiv.org/abs/2603.24866)\.
- Yuan et al\. \[2025\] Yu Yuan, Shizhao Sun, Qi Liu, and Jiang Bian\. CAD\-Editor: A locate\-then\-infill framework with automated training data synthesis for text\-based cad editing\. *arXiv preprint arXiv:2502\.03997*, 2025\.
- Zhang et al\. \[2026\] Licheng Zhang, Bach Le, Naveed Akhtar, Siew\-Kei Lam, and Duc Ngo\. Large language models for computer\-aided design: A survey\. *ACM Computing Surveys*, 58\(9\):1–39, 2026\.
- Zhang et al\. \[2025\] Weichen Zhang, Zile Zhou, Xin Zeng, Xuchen Liu, Jianjie Fang, Chen Gao, Yong Li, Jinqiang Cui, Xinlei Chen, and Xiao\-Ping Zhang\. Open3D\-VQA: A benchmark for comprehensive spatial reasoning with multimodal large language model in open space\. *arXiv preprint arXiv:2503\.11094*, 2025\.
- Zheng and Fischer \[2023\] Junwen Zheng and Martin Fischer\. BIM\-GPT: A prompt\-based virtual assistant framework for bim information retrieval\. *arXiv preprint arXiv:2304\.09333*, 2023\.

## Appendix ALimitations

BIM\-Edit makes several choices that should be considered when interpreting the results\. First, each task uses a single human\-authored ground\-truth IFC model\. This makes scoring deterministic, but it may penalize valid alternative edits when an instruction is spatially or topologically ambiguous\. However, we argue that useful systems for real\-world tasks should also understand the design intent of the user given the model context, and due to the continuous scoring, alternative edits may still receive a good score\. Second, the benchmark focuses on six common architectural element types: walls, slabs, spaces, doors, windows, and columns\. Therefore, we do not evaluate LLM capabilities in further relevant areas, such as Mechanical, Electrical, and Plumbing \(MEP\) systems, detailed structural systems, furniture, property\-only edits, or multi\-discipline coordination\. Third, the agent harness in the conducted experiments is intentionally minimal and uses a single code\-execution tool, without retrieval, schema\-aware tools, or multi\-agent orchestration\. Therefore, the reported scores should be interpreted as out\-of\-the\-box LLM performance that can be treated as a baseline rather than an upper bound for a production BIM assistant\. Some runs are also limited by the 20\-round tool\-call budget, which may affect the overall score\. However, in a real\-world use case, the number of tool calls and output tokens is a significant optimization objective, as the acceptance of such systems is directly influenced by their reliability and response time\. The results of our work indicate that neither requirement is currently met, with models performing poorly while simultaneously incurring high interaction overhead through long execution traces\.
The metric design choices, including the unweighted aggregation of geometry, semantics, and topology, the exponential normalization of geometric error, the $\lambda = 0.3$ weighting in the topology score, and the 98% strict solve threshold, were selected to provide a consistent, interpretable, and computationally tractable evaluation across heterogeneous BIM edit types, while placing greater emphasis on relational correctness in IFC models\. Although we do not provide a full sensitivity analysis or calibration against expert judgments, we expect the main conclusions to remain qualitatively stable, since performance gaps are large overall and the same trends appear consistently across the individual metrics\. Finally, the chosen metrics evaluate only the generated IFC artifact, not the generated code or reasoning traces\. We therefore provide an analysis of example runs of GPT\-5\.4 Pro and Claude Sonnet 4\.6 in Appendix [H](https://arxiv.org/html/2606.20146#A8)\.

## Appendix BBroader Impact

AI\-assisted BIM authoring has the potential to accelerate design work in architecture, engineering, and construction, but it also introduces risks: an incorrect edit in a safety\-critical model, such as a load\-bearing wall, a fire\-rated partition, or a structural opening, can have consequences beyond those typical for software bugs\. BIM\-Edit is intended to help characterize these risks rather than to produce a fielded authoring assistant\. We believe that output\-level, structure\-sensitive evaluation of the kind proposed here is a prerequisite for responsible deployment because it exposes failure modes that surface\-level metrics miss\. We release the benchmark under an open license and expect it to be used as an evaluation tool and not as a training signal for deployed authoring systems\. BIM\-Edit is intended to expose limitations in critical engineering workflows\. Therefore, high scores should not be interpreted as evidence that agents are ready for unsupervised use in safety\-critical design tasks\.

## Appendix CAdditional Related Work

#### Agentic BIM approaches\.

Recent work has started to connect large language models to BIM workflows\. Zheng and Fischer \[[48](https://arxiv.org/html/2606.20146#bib.bib48)\] study natural\-language information retrieval over BIM data, with the goal of making model information easier to access without extensive manual interface engineering\. Hellin et al\. \[[19](https://arxiv.org/html/2606.20146#bib.bib19)\] propose an LLM\-based workflow for natural\-language querying over IFC\-encoded BIM models\. Another line of work moves from retrieval toward direct model modification\. Text2BIM takes a further step toward authoring by using a multi\-agent LLM framework to generate semantically rich BIM models from textual instructions inside a BIM authoring environment \[[9](https://arxiv.org/html/2606.20146#bib.bib9)\]\. Text2MBL similarly generates executable BIM code from text, but it focuses on modular building layouts rather than general IFC editing \[[41](https://arxiv.org/html/2606.20146#bib.bib41)\]\. These systems show that language\-driven BIM interaction is feasible, but they do not provide a shared benchmark for artifact\-level BIM editing\. Their evaluations are typically tied to a specific authoring tool, task distribution, or workflow objective\. Our benchmark differs in two main ways\. First, it focuses on modifying an existing model rather than only generating a model from scratch\. Second, it evaluates the final IFC artifact across geometry, topology, and semantics, rather than only assessing intermediate tool use or visual plausibility\. Lastly, Yang et al\. \[[44](https://arxiv.org/html/2606.20146#bib.bib44)\] proposed a benchmark for evaluating the physical plausibility of 3D house generation using Vision\-Language Model \(VLM\) agents\.
Although their setting focuses on reconstructing houses from images, the benchmark is closely related to our work because it evaluates not only geometric reconstruction quality, but also physical constraints such as the structural and topological validity of generated IFC models\.

## Appendix DBenchmark Details and Dataset Card

### D\.1Purpose and Scope

BIM\-Edit was created to address a gap in the evaluation of LLMs for structured 3D building models\. Existing CAD benchmarks mainly assess whether a system can generate plausible geometry \[[17](https://arxiv.org/html/2606.20146#bib.bib17),[29](https://arxiv.org/html/2606.20146#bib.bib29)\], but most do not test whether object\-level edits also preserve semantic properties and relational structure\. In building design, these properties are tightly connected\. A BIM file is correct only when its geometry is accurate, its elements have the appropriate BIM classes and attributes, and its relational graph encodes the required connections between elements\. BIM\-Edit formalizes this joint requirement as an evaluation target\. The benchmark is intended for researchers working on AI for architecture, engineering, and construction, for developers evaluating IFC\-capable LLMs, and for ablation studies that measure how instruction specificity affects structured editing performance\.

### D\.2Scene Collection and Task Construction

BIM\-Edit contains 47 human\-authored IFC models\. These comprise 11 realistic building models and 36 controlled substructures created by the authors to isolate specific element behaviors and reduce noise from unrelated model content\. All models are released under CC\-BY 4\.0\.

Each task in BIM\-Edit is defined as a triplet $(M^{0}, x, M^{*})$, where $M^{0}$ is the input IFC model, $x$ is a natural\-language edit instruction, and $M^{*}$ is a manually authored ground\-truth model\. The tasks were created using two techniques\. In the first technique, the authors identified editable target elements that already existed in an IFC model and wrote a direct instruction for the corresponding edit\. In the second technique, the authors manually applied a controlled edit to an IFC model, took the resulting model as $M^{*}$, and then wrote an instruction describing that edit\. The IFC models were edited using Blender with the Bonsai add\-on and Revit\. For each task, the spatial and topological instruction variants were derived manually from the same underlying edit\. This ensures that all three instruction variants refer to the same $M^{*}$\. The design separates the effect of instruction type from task difficulty, because the target edit is identical across the three variants\. Each task was verified by a second author to confirm that the ground truth was unambiguous and that the instruction fully specified the required edit\.
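One way to picture the task triplet and its three instruction variants is the following Python sketch\. The class names, field names, file names, and instruction texts are hypothetical and are not taken from the released benchmark; the sketch only mirrors the structure described above, in which all variants of an edit share the same input model and ground truth\.

```python
from dataclasses import dataclass
from enum import Enum


class Operation(Enum):
    CREATE = "create"
    UPDATE = "update"
    DELETE = "delete"


class InstructionType(Enum):
    DIRECT = "direct"
    SPATIAL = "spatial"
    TOPOLOGICAL = "topological"


@dataclass(frozen=True)
class EditTask:
    input_model: str         # path to the input IFC model M^0
    instruction: str         # natural-language instruction x
    ground_truth_model: str  # path to the ground-truth IFC model M^*
    operation: Operation
    instruction_type: InstructionType


def instruction_variants(input_model, ground_truth_model, operation, instructions):
    """Build one task per instruction type, all sharing the same M^0 and M^*."""
    return [
        EditTask(input_model, instructions[kind], ground_truth_model, operation, kind)
        for kind in InstructionType
    ]


# Hypothetical example: the same deletion phrased in three ways.
variants = instruction_variants(
    "scene_01.ifc",
    "scene_01_gt.ifc",
    Operation.DELETE,
    {
        InstructionType.DIRECT: "Delete the door with the given identifier.",
        InstructionType.SPATIAL: "Delete the door on the north side of the kitchen.",
        InstructionType.TOPOLOGICAL: "Delete the door connecting the kitchen and the hallway.",
    },
)
```

Because the three variants point to the same ground\-truth file, any score difference between them can be attributed to the instruction phrasing rather than to the edit itself\.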

### D\.3Task Statistics

Table [4](https://arxiv.org/html/2606.20146#A4.T4) summarizes the distribution of tasks by element type and edit operation\. BIM\-Edit contains 324 tasks in total, split between human\-authored scenes and controlled artificial scenes\. The three instruction categories are balanced: direct, spatial, and topological instructions each contribute 108 tasks\. The distribution is not fully uniform across element types because some elements support more realistic edit variants than others\.

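Table 4: Task distribution in BIM\-Edit by element type and edit operation\. Counts are aggregated over all instruction categories and both scene types\.

| Element type | Create | Update | Delete | Total |
|---|---|---|---|---|
| IfcWall | 24 | 24 | 24 | 72 |
| IfcSlab | 24 | 24 | 24 | 72 |
| IfcSpace | 24 | 24 | 24 | 72 |
| IfcDoor | 18 | 6 | 12 | 36 |
| IfcWindow | 6 | 18 | 12 | 36 |
| IfcColumn | 12 | 12 | 12 | 36 |
| Total | 108 | 108 | 108 | 324 |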

## Appendix EFull Evaluation Metric Definitions

BIM\-Edit evaluates the changed parts of the 3D model instead of comparing full model files\. This prevents scores from being dominated by unchanged building content\. For each task, let $M^{0}$ be the input model, $M^{*}$ be the ground\-truth model, and $M'$ be the model produced by the LLM agent\. We construct a reference edit set $G_{\Delta}^{*}$ from $M^{0}$ and $M^{*}$, and a predicted edit set $G'_{\Delta}$ from $M^{0}$ and $M'$\. Entities are marked as added, removed, or modified based on their class, identifier, geometry, and selected properties\. These edit sets are then used to compute the geometry, semantic, and topology scores\.

#### Per\-task\-type construction\.

The contents of $G_{\Delta}^{*}$ and $G'_{\Delta}$ depend on the task type\. For *create* tasks, $G_{\Delta}^{*}$ contains the ground\-truth object to be created, while $G'_{\Delta}$ contains the entities added by the agent relative to $M^{0}$\. For *update* tasks, both sets contain the final state of the target object in $M^{*}$ and $M'$, respectively\. Newly added entities are also included in $G'_{\Delta}$, so updates implemented by deleting and recreating the target can still be scored\. For *delete* tasks, $G_{\Delta}^{*}$ contains the entities that should be removed from $M^{*}$, while $G'_{\Delta}$ contains the entities actually removed by the agent\. The diff uses the target ID specified in the task when available and falls back to an added\-entity diff if the agent deletes and recreates the target under a new identifier\.
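The per\-task\-type construction can be sketched in Python as follows\. This is a simplified illustration and not the benchmark's evaluator: models are represented as plain dictionaries mapping an identifier to a hashable entity description, and the function names, the dictionary representation, and the handling of a missing target ID are assumptions made for this example\.

```python
def diff_models(before, after):
    """Split entities into added, removed, and modified, keyed by identifier."""
    added = {k: v for k, v in after.items() if k not in before}
    removed = {k: v for k, v in before.items() if k not in after}
    modified = {
        k: after[k]
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return added, removed, modified


def edit_set(task_type, m0, edited, target_id=None):
    """Edit set of `edited` relative to the input model `m0`.

    Passing the ground-truth model gives the reference set, passing the
    agent output gives the predicted set.
    """
    added, removed, modified = diff_models(m0, edited)
    if task_type == "create":
        return added
    if task_type == "delete":
        return removed
    if task_type == "update":
        # Added entities cover updates done by delete-and-recreate.
        result = dict(added)
        if target_id is None:
            # Assumption: without a target ID, use all modified entities.
            result.update(modified)
        elif target_id in edited:
            result[target_id] = edited[target_id]
        return result
    raise ValueError(f"unknown task type: {task_type}")


# Hypothetical toy models: identifier -> (IFC class, length in metres).
m0 = {"wall-1": ("IfcWall", 3.0), "door-1": ("IfcDoor", 0.9)}
in_place = {"wall-1": ("IfcWall", 4.0), "door-1": ("IfcDoor", 0.9)}
recreated = {"wall-9": ("IfcWall", 4.0), "door-1": ("IfcDoor", 0.9)}
without_door = {"wall-1": ("IfcWall", 3.0)}
```

In this toy setup, an in\-place update and a delete\-and\-recreate update both yield a one\-element predicted edit set describing the 4 m wall, so both can be compared against the same reference edit\.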

### E.1 Geometry Score

#### Aggregate point\-cloud comparison\.

We uniformly sample points from the surface of each entity in the reference edit set and from each entity in the predicted edit set. The sampled points are combined into two edit-level point clouds: $P^{*}$ for the ground-truth edit and $P^{\prime}$ for the agent-produced edit. Sampling is subject to a per-object budget of 4096 surface points and a total budget of 16384 points across all edited objects. The per-object allocation is proportional to surface area and is clamped at a minimum of 256 points per object. The headline geometry score is based on the median Chamfer distance between $P^{*}$ and $P^{\prime}$:

$$S_{\text{geo}}=\exp\!\left(-\frac{\mathrm{CD}_{\mathrm{med}}(P^{*},P^{\prime})}{D}\cdot 5.0\right), \tag{5}$$

where $\mathrm{CD}_{\mathrm{med}}(P^{*},P^{\prime})$ is the median bidirectional nearest-neighbor distance between the two point clouds, and $D$ is the diagonal length of the joint axis-aligned bounding box enclosing both point clouds. The distance is normalized by $D$ to make the score comparable across edits of different scales. The constant $5.0$ is a fixed decay factor that controls how quickly the score decreases as the normalized geometric error increases.
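As a rough illustration of Eq. 5, the sketch below computes the score for two small point clouds with a brute-force nearest-neighbor search. It is not the released evaluator. The function names are hypothetical, and one detail is an assumption: the "median bidirectional" distance is taken here as the median over the pooled nearest-neighbor distances from both directions.

```python
import math
from statistics import median


def nn_distances(src, dst):
    """Distance from each point in src to its nearest neighbor in dst (brute force)."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def geometry_score(p_ref, p_pred, decay=5.0):
    """Eq. 5 for two non-empty lists of (x, y, z) points."""
    # Assumption: both directions are pooled before taking the median.
    pooled = nn_distances(p_ref, p_pred) + nn_distances(p_pred, p_ref)
    cd_med = median(pooled)
    # D: diagonal of the joint axis-aligned bounding box of both clouds.
    pts = list(p_ref) + list(p_pred)
    lo = [min(p[i] for p in pts) for i in range(3)]
    hi = [max(p[i] for p in pts) for i in range(3)]
    diag = math.dist(lo, hi)
    if diag == 0.0:
        return 1.0  # degenerate case: all points coincide
    return math.exp(-cd_med / diag * decay)


square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
shifted = [(x + 0.1, y, z) for x, y, z in square]

score_same = geometry_score(square, square)
score_shifted = geometry_score(square, shifted)
print(score_same, score_shifted)
```

Because the error is divided by $D$, the same 0.1 m offset would be penalized less on a larger edit, which is the scale-invariance property described above.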

#### Additional geometry diagnostics\.

The evaluator additionally computes oriented bounding-box IoU (OBB-IoU), axis-aligned bounding-box IoU (AABB-IoU), voxel IoU at 0.05 m resolution on a $128^{3}$ grid, F-score at $\tau=0.01$ of the bounding-box diagonal, Jensen-Shannon divergence of voxel occupancy, and Hausdorff distance. These metrics are reported as additional diagnostics and do not contribute to the final score.

### E.2 Semantic Score

The semantic score measures whether the edited entities carry the correct BIM meaning\. Since semantic checks require object correspondences, we first match entities in the predicted edit set to entities in the reference edit set using oriented bounding\-box IoU \(OBB\-IoU\)\. The resulting OBB\-IoU cost matrix is solved as a one\-to\-one assignment using the Hungarian algorithm\[[28](https://arxiv.org/html/2606.20146#bib.bib28)\]\. Matched predicted entities with an OBB\-IoU below 0\.05 against their assigned reference entity are discarded\. Delete tasks are treated separately, since a successful edit leaves no predicted edited entity to compare against the reference entity\. For these tasks, semantic correctness is defined by target removal: the score is 1\.0 if the intended IFC entity is removed from the model, and 0\.0 otherwise\.

For each remaining matched pair $(n^{\prime},n^{*})$, the semantic score has two components:

1. Class score: The score is 1.0 if the class type of $c^{\prime}_{n}$ matches the IFC class of $n^{*}_{n}$, and 0.0 otherwise. For example, a predicted IfcWall matched to a reference IfcWall receives 1.0, while a predicted IfcSlab or a proxy element matched to a reference IfcWall receives 0.0.
2. Property score: The score is the fraction of task-relevant property keys whose values in $a^{\prime}_{n}$ match the corresponding values in $n^{*}_{n}$ within a relative tolerance of 5%. The properties Tag, Description, and LongName are excluded from this comparison.

The per\-pair semantic score is the average of the class score and the property score\. The task\-level semantic score is the mean over all reference entities, with unmatched reference entities assigned a score of zero\.
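The per-pair and task-level aggregation can be sketched as follows. This is a minimal illustration under stated assumptions, not the released evaluator: the OBB-IoU matching is assumed to be already computed and passed in as a hypothetical `matching` dictionary, the tolerance is taken relative to the reference value, non-numeric values are compared by equality, and an empty key list is scored as 1.0.

```python
EXCLUDED_KEYS = {"Tag", "Description", "LongName"}


def values_match(pred, ref, rel_tol=0.05):
    """Numeric values match within rel_tol of the reference; others by equality."""
    if isinstance(pred, bool) or isinstance(ref, bool):
        return pred == ref
    if isinstance(pred, (int, float)) and isinstance(ref, (int, float)):
        return abs(pred - ref) <= rel_tol * abs(ref)
    return pred == ref


def pair_score(pred, ref, keys):
    """Average of the class score and the property score for one matched pair."""
    class_score = 1.0 if pred["ifc_class"] == ref["ifc_class"] else 0.0
    keys = [k for k in keys if k not in EXCLUDED_KEYS]
    if not keys:
        prop_score = 1.0  # assumption: nothing to check counts as correct
    else:
        hits = sum(
            k in pred["props"] and values_match(pred["props"][k], ref["props"].get(k))
            for k in keys
        )
        prop_score = hits / len(keys)
    return 0.5 * (class_score + prop_score)


def semantic_score(refs, preds, matching, keys):
    """Mean over all reference entities; unmatched references contribute zero."""
    total = 0.0
    for ref_id, ref in refs.items():
        pred_id = matching.get(ref_id)
        if pred_id is not None:
            total += pair_score(preds[pred_id], ref, keys)
    return total / len(refs)


refs = {
    "r1": {"ifc_class": "IfcWall", "props": {"Width": 0.24, "IsExternal": True}},
    "r2": {"ifc_class": "IfcDoor", "props": {"Width": 0.90, "IsExternal": False}},
}
preds = {
    "p1": {"ifc_class": "IfcWall", "props": {"Width": 0.25, "IsExternal": False}},
}
matching = {"r1": "p1", "r2": None}  # r2 has no matched prediction
task_score = semantic_score(refs, preds, matching, ["Width", "IsExternal", "Tag"])
print(task_score)
```

In this toy case the matched wall has the right class and one of two relevant properties within tolerance, and the unmatched door pulls the task-level mean down.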

### E.3 Topology Score

We represent each IFC model as a typed property graph, where nodes correspond to structurally relevant IFC entities and edges correspond to standard IFC relations. These include spatial containment (IfcRelContainedInSpatialStructure), aggregation (IfcRelAggregates), space boundaries (IfcRelSpaceBoundary), opening and voiding (IfcRelVoidsElement), filling (IfcRelFillsElement), and element connection (IfcRelConnectsElements). We construct graphs for $M^{0}$, $M^{*}$, and $M^{\prime}$. The reference topology edit is the symmetric difference between the graphs of $M^{0}$ and $M^{*}$, and the predicted topology edit is the symmetric difference between the graphs of $M^{0}$ and $M^{\prime}$.

Because internal IFC identifiers, such as GUIDs and line numbers, may not be preserved across agent outputs, nodes between the predicted and reference edits must be aligned before comparison\. Unlike the semantic score, where object correspondences directly determine the evaluated class and property terms and are therefore computed using an optimal Hungarian assignment, the topology score uses node matching only as an identity\-canonicalization step before comparing relation deltas\. Since topology graphs can contain hundreds of nodes and thousands of relations, we use a lightweight bipartite matching heuristic based on IFC class agreement and spatial proximity of local placements\. We construct candidate correspondences, sort them by matching score, and greedily select non\-conflicting node pairs\. This avoids constructing dense assignment problems for large graphs while preserving the purpose of the topology metric, which is to evaluate whether the correct relational edits are present after alignment\.
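The greedy alignment described above can be sketched in a few lines. This is an illustrative approximation, not the released heuristic: the matching score is assumed to be the Euclidean distance between local placement origins (smaller is better), and the `max_dist` cutoff is a hypothetical parameter.

```python
import math


def greedy_match(pred_nodes, ref_nodes, max_dist=1.0):
    """Greedy one-to-one node alignment.

    pred_nodes / ref_nodes map a node id to (ifc_class, (x, y, z)).
    """
    # Candidate correspondences: same IFC class and close enough in space.
    candidates = []
    for pid, (pcls, ppos) in pred_nodes.items():
        for rid, (rcls, rpos) in ref_nodes.items():
            if pcls != rcls:
                continue
            dist = math.dist(ppos, rpos)
            if dist <= max_dist:
                candidates.append((dist, pid, rid))

    # Best candidates first, then keep only non-conflicting pairs.
    candidates.sort()
    used_pred, used_ref, pairs = set(), set(), {}
    for _, pid, rid in candidates:
        if pid in used_pred or rid in used_ref:
            continue
        pairs[pid] = rid
        used_pred.add(pid)
        used_ref.add(rid)
    return pairs


pred_nodes = {
    "p1": ("IfcWall", (0.0, 0.0, 0.0)),
    "p2": ("IfcWall", (5.0, 0.0, 0.0)),
    "p3": ("IfcDoor", (0.1, 0.0, 0.0)),
}
ref_nodes = {
    "r1": ("IfcWall", (0.2, 0.0, 0.0)),
    "r2": ("IfcWall", (5.1, 0.0, 0.0)),
}
pairs = greedy_match(pred_nodes, ref_nodes)
print(pairs)
```

The door `p3` stays unmatched because no reference node shares its class, so it would surface as a prediction-only node edit after alignment.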

After alignment, we compute precision, recall, and F1 separately for node edits and edge edits. Node edits include added, removed, or modified entities, while edge edits include added or removed relations. The two F1 scores are combined with $\lambda=0.3$ as in Eq. [4](https://arxiv.org/html/2606.20146#S3.E4), with edges weighted more heavily because relations determine whether an edited element is properly integrated into the building model. As a special case, when the reference edit requires no topology change, the score is $1$ if the prediction also introduces no topology change and $0$ otherwise. This penalizes prediction-only spurious topology edits.

## Appendix F Experimental Details

### F.1 Hardware and Software Environment
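A compact sketch of this combination is given below. Eq. 4 is defined in the main paper and not repeated here, so the weighted form $\lambda\,F1_{\text{node}}+(1-\lambda)\,F1_{\text{edge}}$ is an assumption that is merely consistent with edges being weighted more heavily at $\lambda=0.3$. The function names, the edit encoding as tuples, and the choice to score an empty-versus-empty comparison as 1.0 are likewise hypothetical.

```python
def f1(pred: set, ref: set) -> float:
    """F1 between a predicted and a reference set of edits."""
    if not pred and not ref:
        return 1.0  # assumption: nothing expected and nothing predicted
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def topology_score(pred_nodes, ref_nodes, pred_edges, ref_edges, lam=0.3):
    # Special case: the reference edit requires no topology change.
    if not ref_nodes and not ref_edges:
        return 1.0 if not pred_nodes and not pred_edges else 0.0
    # Assumed form of Eq. 4: nodes weighted by lam, edges by (1 - lam).
    return lam * f1(pred_nodes, ref_nodes) + (1 - lam) * f1(pred_edges, ref_edges)


ref_nodes = {("added", "door")}
ref_edges = {("added", "fills", "door", "opening"), ("added", "voids", "opening", "wall")}
pred_nodes = {("added", "door")}
pred_edges = {("added", "fills", "door", "opening")}  # the voiding relation is missing

partial = topology_score(pred_nodes, ref_nodes, pred_edges, ref_edges)
spurious = topology_score(set(), set(), pred_edges, set())
print(partial, spurious)
```

The second call shows the special case: a prediction that adds a relation when none is required scores zero.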

All inference is performed through cloud APIs\. After inference, the evaluation pipeline itself runs on CPU only\. On a standard desktop processor, evaluating a single task takes 1\-2 minutes on average\. Our parallelized evaluation script scores the full 324\-task benchmark in under 2 hours once the inference outputs are available\. The LLM pipeline and evaluator are implemented in Python 3\.13 or later, with IfcOpenShell 0\.8\.x used as the IFC backend\. All dependencies are pinned in the repository’s requirements file\.

### F.2 Model Versions and Configurations

Table [5](https://arxiv.org/html/2606.20146#A6.T5) lists the models evaluated in this paper, together with their provider, checkpoint identifier, access route, and inference settings. All models are evaluated with the same system prompt, tool definition, tool-call budget, and evaluation pipeline. For Claude Sonnet 4.6, temperature and top-$p$ are left at the provider defaults because this model does not allow both parameters to be specified together. For all other models, temperature is set to 0 and top-$p$ is set to 1.0. The reasoning_effort parameter is not overridden for any model, so each provider uses its default reasoning behavior.

Table 5: Model versions and inference configurations used in BIM-Edit.

| Model | Provider | Checkpoint ID | Access | Temp. | Top-$p$ | Budget |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini 3.0 Flash | Google | gemini-3-flash-preview | Native API | 0 | 1.0 | 20 |
| Qwen 3.6 Plus | Alibaba Cloud | qwen/qwen3.6-plus | OpenRouter | 0 | 1.0 | 20 |
| Claude Sonnet 4.6 | Anthropic | claude-sonnet-4-6 | Native API | default | default | 20 |
| GPT-5.4 Pro | OpenAI | gpt-5.4-pro | Native API | 0 | 1.0 | 20 |
| DeepSeek V3.2 | DeepSeek | deepseek/deepseek-v3.2 | OpenRouter | 0 | 1.0 | 20 |
| GPT-5.4 Mini | OpenAI | gpt-5.4-mini | Native API | 0 | 1.0 | 20 |
| Gemma 4 31B | Google | google/gemma-4-31b-it | OpenRouter | 0 | 1.0 | 20 |

### F.3 System Prompt

All models receive the same system prompt:

> You are a BIM assistant, with a deep knowledge in Building Information Modeling. You are working with IFC files and need to create IFCOpenshell calls to fulfill the task. Make sure to understand the current model before modification and modify the model always correctly on geometry, topology and semantics. Always understand the unit scale of the model. Use sensible defaults for unspecified values. Execute commands directly and never ask for confirmation.

### F.4 Tool Description

The agent is given a single tool: `execute_ifc_code(code: str) → str`. The tool is exposed to the model with the following description:

> Execute Python code against the current IFC file. Input: {code: str}. The IFC model is pre-loaded as ifc (ifcopenshell.file). Also available: ifcopenshell, api (ifcopenshell.api), util (ifcopenshell.util), element_util (ifcopenshell.util.element), and guid (ifcopenshell.guid). Assign to result to return data. Call commit() to save modifications.

The input is a Python code string\. The code runs in a sandboxed subprocess where the IFC file is pre\-loaded into a variable using IfcOpenShell\. The subprocess is restricted to the target IFC file and does not have access to unrelated files\. Each tool call has a timeout of 120 seconds\. If an inference call fails, the harness retries once after a 30\-second delay, giving at most two inference attempts per task\. After the agent completes or exhausts its tool\-call budget of 20, the pipeline saves the current state of the IFC file to disk\. This saved file is the only output passed to the evaluator\.

### F.5 Reproducing Results

Every score reported in the main paper can be verified in the code folder that is included with the submission. Verification does not require rerunning model inference, because all cached outputs and evaluation results are provided. The released runs/ directory contains the configuration files, cached outputs, and per-task evaluation reports for each of the seven evaluated models. For more information, check out the code repository.

## Appendix G Additional Results and Analysis

This section provides additional per-model breakdowns that support Section [4.2](https://arxiv.org/html/2606.20146#S4.SS2).

### G.1 Failure-mode classifier

The counts in Table [6](https://arxiv.org/html/2606.20146#A7.T6) are computed deterministically from the per-task runtime metadata recorded in the cached run logs. Each task's first attempt is assigned to exactly one of six buckets:

- *Executed*: the agent ran without raising any runtime error.
- *Budget exhausted*: the agent reached the 20-round tool-call budget, enforced as an internal limit of $20\times 2+5=45$ graph nodes in langchain.
- *Process crash*: the worker subprocess executing the task terminated unexpectedly.
- *API error*: the model provider returned a transport or protocol-level failure (HTTP error, malformed response, or read error from the client).
- *Streaming timeout*: a streaming chunk failed to arrive within the configured window.
- *Other runtime*: any remaining non-null error type.
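The bucket assignment above amounts to a lookup on the recorded error type. The sketch below illustrates this. It does not reproduce the released classifier: the `error_type` field and its string labels are hypothetical stand-ins for whatever the cached run logs actually store.

```python
from collections import Counter

TOOL_CALL_BUDGET = 20
# Internal graph-node limit stated in the text: 20 * 2 + 5 = 45.
NODE_LIMIT = TOOL_CALL_BUDGET * 2 + 5

# Hypothetical error-type labels; the released logs may use other names.
ERROR_TO_BUCKET = {
    "budget_exhausted": "Budget",
    "process_crash": "Crash",
    "api_error": "API",
    "streaming_timeout": "Timeout",
}


def classify(first_attempt: dict) -> str:
    """Assign one task's first attempt to exactly one of the six buckets."""
    error_type = first_attempt.get("error_type")
    if error_type is None:
        return "Executed"
    # Any remaining non-null error type falls into the catch-all bucket.
    return ERROR_TO_BUCKET.get(error_type, "Other")


def summarize(attempts: list) -> dict:
    """Executed percentage plus per-bucket failure counts."""
    counts = Counter(classify(a) for a in attempts)
    summary = {"executed_pct": round(100.0 * counts["Executed"] / len(attempts), 2)}
    for bucket in ("Budget", "Crash", "API", "Timeout", "Other"):
        summary[bucket] = counts[bucket]
    return summary


# Toy log with 322 clean runs and 2 API errors out of 324 tasks.
log = [{"error_type": None}] * 322 + [{"error_type": "api_error"}] * 2
summary = summarize(log)
print(summary)
```

Since each first attempt maps to exactly one bucket, the failure counts and the executed count always sum to the number of tasks.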

Table 6: Per-model runtime outcomes over 324 tasks. *Executed* is the percentage of tasks the agent completed without a runtime error; the remaining columns count aborted runs by first failure type.

| Model | Executed (%) ↑ | Budget | Crash | API | Timeout | Other |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5.4 Mini | 100.00 | 0 | 0 | 0 | 0 | 0 |
| GPT-5.4 Pro | 99.38 | 0 | 0 | 2 | 0 | 0 |
| Gemini 3.0 Flash | 85.49 | 43 | 3 | 1 | 0 | 0 |
| Gemma 4 31B | 85.19 | 27 | 15 | 6 | 0 | 0 |
| DeepSeek V3.2 | 79.32 | 0 | 19 | 8 | 40 | 0 |
| Qwen 3.6 Plus | 75.62 | 57 | 7 | 10 | 4 | 1 |
| Claude Sonnet 4.6 | 53.70 | 149 | 0 | 1 | 0 | 0 |

### G.2 Task-Family Strengths and Weaknesses

Figure [4](https://arxiv.org/html/2606.20146#A7.F4) shows the fraction of tasks scoring above $50$ for each operation and instruction category, averaged across the seven models.

![Refer to caption](https://arxiv.org/html/2606.20146v1/x6.png)

Figure 4: Mean fraction of tasks scoring above $50$ across the seven models.

### G.3 Metrics by Prompt Instruction Type

Figure [5](https://arxiv.org/html/2606.20146#A7.F5) compares geometry, semantics, and topology scores across direct, spatial, and topological instructions. Each instruction category contains 108 tasks, so each plotted score is computed over the corresponding 108-task subset. Focusing on topology, direct instructions achieve the highest average score ($46.3$), followed by topological ($41.2$) and spatial instructions ($37.5$). Grouped together, the two indirect instruction types average $39.4$, about $7$ points below direct instructions. This suggests that topology performance is affected by how explicitly the relevant entities and relationships are given in the prompt. Topological instructions do not automatically lead to the best topology score because the instruction type and the evaluation metric capture different things: the instruction may use relational references to identify the target, while the metric checks whether the final IFC model satisfies the required relational structure. Direct instructions perform better because they expose more of the relational structure needed for IFC graph editing. We interpret this gap as evidence that current LLM agents benefit from explicit entity and relationship information.

![Refer to caption](https://arxiv.org/html/2606.20146v1/x7.png)

Figure 5: Per-axis performance by instruction category across different metrics.

## Appendix H End-to-End Example Runs

This section presents four complete agent runs in detail. For each run, we show the LLM-generated tool-call code, the agent's natural-language summary, and the per-axis BIM-Edit scores. The four examples are selected to cover the score range observed in the benchmark: a perfect run with all three axes scoring $1.0$, a high-partial run where two axes score $1.0$ and one axis drops, a run with incorrect geometric editing, and an all-zero run where the per-element matcher fails completely.

The examples include two models, GPT-5.4 Pro and Claude Sonnet 4.6. This makes it possible to compare their different agent behaviors. GPT-5.4 Pro often commits to a one-shot edit, while Claude Sonnet 4.6 usually performs several read-only inspection rounds before applying the mutation. All four tasks come from the synthetic-scene subset of the benchmark, denoted as *scene: A* in the task tags. These scenes are small enough to inspect visually, which allows the edited IFC elements in the code to be cross-referenced with the rendered geometry. The four runs show a common pattern. The Python code emitted by the agent often looks good at first attempt, but small errors in the IFC relational structure can strongly reduce the geometry score. These errors can occur even when the agent's textual summary reports that the edit succeeded.

### H.1 Example 1: GPT-5.4 Pro on Door-Opening Translation (Successful)

#### Task.

DOR-UPD-DIR-A-001. *operation:* update, *category:* direct, *element:* IfcDoor, *scene:* A.

> move the door with the ID `3weZ9R4j52LPSdY$3bzEMo`. Move the placement of its opening by 0.5 m in $+x$-direction.

#### Agent trace.

The agent solved the task in two tool-call rounds. Round 1 was discarded by the harness because it was a malformed call and returned null. In round 2, the agent retried with a small set of imports and applied the same edit. We show the round-2 code below, since this is the version that produced the saved IFC file.

```python
import ifcopenshell
import ifcopenshell.util.unit

door = ifc.by_guid('3weZ9R4j52LPSdY$3bzEMo')
if door is None:
    raise ValueError('Door with given GUID not found')

unit_scale = ifcopenshell.util.unit.calculate_unit_scale(ifc)
delta_model = 0.5 / unit_scale

openings = []
for rel in getattr(door, 'FillsVoids', []) or []:
    opening = rel.RelatingOpeningElement
    if opening:
        openings.append(opening)

if not openings:
    raise ValueError('No opening associated with the door via FillsVoids relationship')

moved = []
for opening in openings:
    placement = opening.ObjectPlacement
    if placement is None or not placement.is_a('IfcLocalPlacement'):
        raise ValueError(f'Opening {opening.GlobalId} has no IfcLocalPlacement')
    rel_place = placement.RelativePlacement
    if rel_place is None or not rel_place.is_a('IfcAxis2Placement3D'):
        raise ValueError(f'Opening {opening.GlobalId} placement is not IfcAxis2Placement3D')
    loc = rel_place.Location
    coords = list(loc.Coordinates)
    while len(coords) < 3:
        coords.append(0.0)
    old_coords = tuple(coords)
    coords[0] += delta_model
    loc.Coordinates = coords
    moved.append({
        'opening_guid': opening.GlobalId,
        'old_location': old_coords,
        'new_location': tuple(coords),
        'placement_relto_id': placement.PlacementRelTo.id() if placement.PlacementRelTo else None
    })

commit()

result = {
    'door_guid': door.GlobalId,
    'unit_scale_m_per_model_unit': unit_scale,
    'delta_applied_model_units': delta_model,
    'openings_moved': moved
}
```

Listing 1: GPT-5.4 Pro tool call from round 2, which is the executed and committed round. The agent uses FillsVoids to find the opening, computes the project unit scale, and shifts the local placement coordinate of the opening's IfcAxis2Placement3D. The commit() call then writes the edited model back to disk.

#### Final response from the agent.

> Done. Moved the opening placement for the door `3weZ9R4j52LPSdY$3bzEMo` by $+0.5$ m in the X direction.
>
> Details:
>
> - Model unit scale: $1$ m / unit
> - Applied offset: $+0.5$ model units
> - Opening moved: `1EHZix8UQ7pXxn8WcpugFE`
> - Old opening location: $(2.5000,\,-0.0875,\,0.0)$
> - New opening location: $(3.0000,\,-0.0875,\,0.0)$
>
> The IFC file has been saved.

#### Score\.

Table [7](https://arxiv.org/html/2606.20146#A8.T7) reports the per-axis BIM-Edit result for this run.

Table 7: Per-axis evaluation result for Example 1 (DOR-UPD-DIR-A-001, GPT-5.4 Pro).

| Geometry | Semantics | Topology | Final | Tool calls | Out tokens |
| --- | --- | --- | --- | --- | --- |
| 1.0000 | 1.0000 | 1.0000 | **1.0000** | 2 | 1,191 |

#### Visual Result.

Figure [6](https://arxiv.org/html/2606.20146#A8.F6) shows the input and edited IFC model.

![Refer to caption](https://arxiv.org/html/2606.20146v1/x8.png)

Figure 6: Input and edited IFC model.

### H.2 Example 2: Claude Sonnet 4.6 on Wall Length Resize (High-Partial: $1.00/1.00/0.47$)

#### Task.

WAL-UPD-DIR-A-004. *operation:* update, *category:* direct, *element:* IfcWall, *scene:* A.

> change the length of the bounding box of the wall with the ID `0SW1L9fQP9xeGsED2XHlvo` to 10.852 m, keeping its placement fixed.

#### Agent trace (7 rounds, condensed).

Sonnet used the first five rounds to inspect the wall, including its representation list, profile parameters (XDim, YDim, and Depth), axis polyline, placement, and project unit assignment. It found that the rectangular profile is centred at $(4.25625,\,0)$ with RefDirection $=(-1,\,0)$, so the wall spans from local $X=0$ to local $X=8.5125$ m. Round 6 is the only updating round. It updates three elements together: the profile XDim, the profile centre, and the axis polyline end point. Round 7 then verifies the result.

```python
new_length = 10.852
wall = ifc.by_guid("0SW1L9fQP9xeGsED2XHlvo")

for rep in wall.Representation.Representations:
    if rep.RepresentationIdentifier == "Body":
        for item in rep.Items:
            if item.is_a("IfcExtrudedAreaSolid"):
                profile = item.SweptArea
                profile.XDim = new_length
                profile.Position.Location.Coordinates = (new_length / 2, 0.0)

for rep in wall.Representation.Representations:
    if rep.RepresentationIdentifier == "Axis":
        for item in rep.Items:
            if item.is_a("IfcPolyline"):
                item.Points[1].Coordinates = (new_length, 0.0)

commit()
```

Listing 2: Round 6, the editing round. The agent updates IfcRectangleProfileDef.XDim to the new length, repositions the profile centre to new_length/2 so the start end stays fixed, and updates the axis polyline end point. All three writes happen before commit().

#### Final response from the agent.

> The wall length has been successfully updated.
>
> Changes made to wall `0SW1L9fQP9xeGsED2XHlvo`:
>
> | Property | Before | After |
> | --- | --- | --- |
> | Wall length (XDim) | $8.5125$ m | $\mathbf{10.852}$ m |
> | Profile centre $X$ | $4.25625$ m | $\mathbf{5.426}$ m |
> | Axis end point | $(8.5125,\,0)$ | $\mathbf{(10.852,\,0)}$ |
>
> Kept fixed: ObjectPlacement at $(0.175,\,4.0875,\,0.0)$, wall thickness ($0.175$ m), wall height ($2.8$ m), extrusion direction (Z-axis).

#### Score\.

Table [8](https://arxiv.org/html/2606.20146#A8.T8) reports the per-axis BIM-Edit result; geometry and semantics are perfect, topology is partial.

Table 8: Per-axis evaluation result for Example 2 (WAL-UPD-DIR-A-004, Claude Sonnet 4.6). Geometry and semantics are perfect because the updated wall has the correct bounding box, Ifc class, and property set. Topology drops to $0.47$ because some connection relations to neighbouring walls were not re-checked after the wall length changed.

| Geometry | Semantics | Topology | Final | Tool calls | Out tokens |
| --- | --- | --- | --- | --- | --- |
| 1.0000 | 1.0000 | **0.4714** | 0.8238 | 7 | 3,742 |

#### Visual Result.

Figure [7](https://arxiv.org/html/2606.20146#A8.F7) shows the input and edited IFC model.

![Refer to caption](https://arxiv.org/html/2606.20146v1/x9.png)

Figure 7: Input and edited IFC model.

#### Take-away.

This is a high-partial outcome. The agent rebuilt the wall geometry correctly across the three coupled parameters: XDim, the profile centre, and the axis end point. As a result, the bounding box matches the ground truth. However, the final score does not reach $1.0$ because the wall was connected to neighboring walls through IfcRelConnectsPathElements. After resizing the wall, these connections were not revisited, leaving the topology axis at a partial value. This partial score is assigned when relations are present but do not match the ground-truth relation set. Overall, this example shows a common high-scoring update pattern: two axes are correct, but the third remains only partially valid.

### H.3 Example 3: GPT-5.4 Pro on Wall Thickness Resize (Low Geometry Score: $0.38/0.72/1.0$)

#### Task.

WAL-UPD-TOP-A-003. *operation:* update, *category:* topological, *element:* IfcWall, *scene:* A.

> change the wall thickness of the wall which is touching the walls with the IDs `0okaGdY4n1xulccRQajmzQ` and `0okaGdY4n1xulccRQajmzO` and is also bounding the rooms with the ID `2UsXAbj6n0fwEWqdkVuHvh` and `2f8KHsrajBsOOvPUvkuuAN` to 0.24m in $-x$-direction.

#### Agent trace.

The agent solved the task in one tool-call round. It reasoned over connections to identify the target wall, adjusted the thickness property, and updated geometry. It also adjusted the material layer set thickness.

```python
import ifcopenshell
from ifcopenshell.util import element as element_util
from ifcopenshell.util import unit as unit_util
from ifcopenshell.util import placement as placement_util
from ifcopenshell.util import representation as repr_util

def get_global(elem):
    return getattr(elem, 'GlobalId', None)

def map_pts(op, fn):
    if op.is_a('IfcPolyline'):
        for p in op.Points:
            fn(p)
    elif op.is_a('IfcCompositeCurve'):
        for seg in op.Segments:
            parent = seg.ParentCurve
            map_pts(parent, fn)
    elif op.is_a('IfcTrimmedCurve'):
        basis = op.BasisCurve
        map_pts(basis, fn)
    elif op.is_a('IfcIndexedPolyCurve'):
        pts = op.Points.CoordList
        for i in range(len(pts)):
            fn((op.Points, i))
    else:
        pass

def get_axis2placement3d_matrix(pl):
    return placement_util.get_axis2placement(pl)

def local_point_from_world(mat, xyz):
    import math
    R = [[mat[r][c] for c in range(3)] for r in range(3)]
    t = [mat[r][3] for r in range(3)]
    v = [xyz[i] - t[i] for i in range(3)]
    return (
        R[0][0]*v[0] + R[1][0]*v[1] + R[2][0]*v[2],
        R[0][1]*v[0] + R[1][1]*v[1] + R[2][1]*v[2],
        R[0][2]*v[0] + R[1][2]*v[1] + R[2][2]*v[2],
    )

def point_in_bbox_2d(pt, bbox, tol=1e-6):
    x, y = pt
    minx, miny, maxx, maxy = bbox
    return (minx - tol <= x <= maxx + tol) and (miny - tol <= y <= maxy + tol)

def bbox_intersects(a, b, tol=1e-6):
    return not (a[2] < b[0] - tol or b[2] < a[0] - tol or a[3] < b[1] - tol or b[3] < a[1] - tol)

def get_storey_of(elem):
    for rel in getattr(elem, 'ContainedInStructure', []) or []:
        if rel.RelatingStructure and rel.RelatingStructure.is_a('IfcBuildingStorey'):
            return rel.RelatingStructure
    return None

unit_scale = unit_util.calculate_unit_scale(ifc)
target_thickness_m = 0.24
target_thickness = target_thickness_m / unit_scale

wall_a = ifc.by_guid('0okaGdY4n1xulccRQajmzQ')
wall_b = ifc.by_guid('0okaGdY4n1xulccRQajmzO')
space_a = ifc.by_guid('2UsXAbj6n0fwEWqdkVuHvh')
space_b = ifc.by_guid('2f8KHsrajBsOOvPUvkuuAN')

if not all([wall_a, wall_b, space_a, space_b]):
    result = {'error': 'One or more referenced elements not found'}
else:
    storey = get_storey_of(space_a) or get_storey_of(space_b) or get_storey_of(wall_a) or get_storey_of(wall_b)
    candidates = []
    walls = ifc.by_type('IfcWall') + ifc.by_type('IfcWallStandardCase')
    seen = set()
    for w in walls:
        if w.id() in seen:
            continue
        seen.add(w.id())
        if w == wall_a or w == wall_b:
            continue
        if storey is not None and get_storey_of(w) != storey:
            continue
        bounded_spaces = set()
        for rel in ifc.by_type('IfcRelSpaceBoundary'):
            if rel.RelatedBuildingElement == w and rel.RelatingSpace:
                bounded_spaces.add(rel.RelatingSpace)
        if space_a not in bounded_spaces or space_b not in bounded_spaces:
            continue
        connected = set()
        for rel in getattr(w, 'ConnectedTo', []) or []:
            if rel.RelatedElement: connected.add(rel.RelatedElement)
        for rel in getattr(w, 'ConnectedFrom', []) or []:
            if rel.RelatingElement: connected.add(rel.RelatingElement)
        if wall_a not in connected or wall_b not in connected:
            continue
        candidates.append(w)

    if not candidates:
        def element_bbox_world_2d(elem):
            rep = getattr(elem, 'Representation', None)
            if not rep:
                return None
            mats = []
            try:
                m = placement_util.get_local_placement(elem.ObjectPlacement)
            except:
                return None
            minx = miny = float('inf')
            maxx = maxy = float('-inf')
            found = False
            for r in rep.Representations:
                items = list(r.Items)
                for it in items:
                    if it.is_a('IfcExtrudedAreaSolid'):
                        swept = it.SweptArea
                        pos = getattr(it, 'Position', None)
                        sm = placement_util.get_axis2placement(pos) if pos else ((1,0,0,0),(0,1,0,0),(0,0,1,0),(0,0,0,1))
                        if swept.is_a('IfcRectangleProfileDef'):
                            xdim = swept.XDim
                            ydim = swept.YDim
                            pts = [(-xdim/2,-ydim/2,0),(xdim/2,-ydim/2,0),(xdim/2,ydim/2,0),(-xdim/2,ydim/2,0)]
                            pm = placement_util.get_axis2placement(swept.Position) if getattr(swept,'Position',None) else ((1,0,0,0),(0,1,0,0),(0,0,1,0),(0,0,0,1))
                            for x, y, z in pts:
                                px = pm[0][0]*x + pm[0][1]*y + pm[0][3]
                                py = pm[1][0]*x + pm[1][1]*y + pm[1][3]
                                pz = pm[2][0]*x + pm[2][1]*y + pm[2][3]
                                ox = sm[0][0]*px + sm[0][1]*py + sm[0][2]*pz + sm[0][3]
                                oy = sm[1][0]*px + sm[1][1]*py + sm[1][2]*pz + sm[1][3]
                                wx = m[0][0]*ox + m[0][1]*oy + m[0][3]
                                wy = m[1][0]*ox + m[1][1]*oy + m[1][3]
                                minx = min(minx, wx); miny = min(miny, wy)
                                maxx = max(maxx, wx); maxy = max(maxy, wy)
                                found = True
                        elif swept.is_a('IfcArbitraryClosedProfileDef') and swept.OuterCurve.is_a('IfcPolyline'):
                            for p in swept.OuterCurve.Points:
                                c = list(p.Coordinates) + [0]*(3 - len(p.Coordinates))
                                px, py, pz = c[0], c[1], c[2]
                                ox = sm[0][0]*px + sm[0][1]*py + sm[0][2]*pz + sm[0][3]
                                oy = sm[1][0]*px + sm[1][1]*py + sm[1][2]*pz + sm[1][3]
                                wx = m[0][0]*ox + m[0][1]*oy + m[0][3]
                                wy = m[1][0]*ox + m[1][1]*oy + m[1][3]
                                minx = min(minx, wx); miny = min(miny, wy)
                                maxx = max(maxx, wx); maxy = max(maxy, wy)
                                found = True
            return (minx, miny, maxx, maxy) if found else None

        bbox_a = element_bbox_world_2d(wall_a)
        bbox_b = element_bbox_world_2d(wall_b)
        walls2 = []
        for w in walls:
            if w in (wall_a, wall_b):
                continue
            bounded_spaces = set()
            for rel in ifc.by_type('IfcRelSpaceBoundary'):
                if rel.RelatedBuildingElement == w and rel.RelatingSpace:
                    bounded_spaces.add(rel.RelatingSpace)
            if space_a not in bounded_spaces or space_b not in bounded_spaces:
                continue
            bw = element_bbox_world_2d(w)
            if bw and bbox_a and bbox_b and bbox_intersects(bw, bbox_a, tol=0.05/unit_scale) and bbox_intersects(bw, bbox_b, tol=0.05/unit_scale):
                walls2.append(w)
        candidates = walls2

    if len(candidates) != 1:
        result = {
            'error': 'Target wall not uniquely identified',
            'candidate_ids': [get_global(w) for w in candidates],
            'count': len(candidates)
        }
    else:
        wall = candidates[0]
        changed = {'wall': wall.GlobalId, 'changes': []}

        assoc_rels = getattr(wall, 'HasAssociations', []) or []
        for rel in assoc_rels:
            mat = getattr(rel, 'RelatingMaterial', None)
            if not mat:
                continue
            if mat.is_a('IfcMaterialLayerSetUsage'):
                ls = mat.ForLayerSet
                if ls and len(ls.MaterialLayers) == 1:
                    old = ls.MaterialLayers[0].LayerThickness
                    ls.MaterialLayers[0].LayerThickness = target_thickness
                    changed['changes'].append({'type': 'material_layer', 'old': old, 'new': target_thickness})
            elif mat.is_a('IfcMaterialLayerSet'):
                if len(mat.MaterialLayers) == 1:
                    old = mat.MaterialLayers[0].LayerThickness
                    mat.MaterialLayers[0].LayerThickness = target_thickness
                    changed['changes'].append({'type': 'material_layer', 'old': old, 'new': target_thickness})

        rep = wall.Representation
        body_items = []
        if rep:
            for r in rep.Representations:
                if r.RepresentationIdentifier == 'Body' or r.RepresentationType in ('SweptSolid', 'Clipping', 'CSG', 'Brep'):
                    body_items.extend(list(r.Items))

        def adjust_profile(profile):
            if profile.is_a('IfcRectangleProfileDef'):
                old = profile.XDim
                profile.XDim = target_thickness
                changed['changes'].append({'type': 'rectangle_xdim', 'old': old, 'new': target_thickness})
                pos = profile.Position
                delta = (target_thickness - old) / 2.0
                old_loc = tuple(pos.Location.Coordinates)
                coords = list(old_loc) + [0.0]*(2 - len(old_loc))
                coords[0] -= delta
                pos.Location.Coordinates = coords[:len(old_loc)] if len(old_loc) >= 2 else coords[:2]
                changed['changes'].append({'type': 'profile_position_shift_x', 'delta': -delta})
                return True
            elif profile.is_a('IfcArbitraryClosedProfileDef'):
                curve = profile.OuterCurve
                pts = []
                if curve.is_a('IfcPolyline'):
                    for p in curve.Points:
                        c = list(p.Coordinates)
                        if len(c) < 2: continue
                        pts.append((p, c))
                    xs = [c[0] for _, c in pts]
                    if not xs:
                        return False
                    minx, maxx = min(xs), max(xs)
                    old = maxx - minx
                    dx = target_thickness - old
                    if abs(dx) < 1e-9:
                        return True
                    for p, c in pts:
                        if abs(c[0] - minx) < 1e-9:
                            c[0] -= dx
                            p.Coordinates = c
                    changed['changes'].append({'type': 'arbitrary_profile_minx_shift', 'old_thickness': old, 'new': target_thickness, 'delta': -dx})
                    return True
                elif curve.is_a('IfcIndexedPolyCurve'):
                    coordlist = curve.Points.CoordList
                    xs = [c[0] for c in coordlist if len(c) >= 2]
                    if not xs:
                        return False
                    minx, maxx = min(xs), max(xs)
                    old = maxx - minx
                    dx = target_thickness - old
                    if abs(dx) < 1e-9:
                        return True
                    newlist = []
                    for c in coordlist:
                        cc = list(c)
                        if len(cc) >= 2 and abs(cc[0] - minx) < 1e-9:
                            cc[0] -= dx
                        newlist.append(tuple(cc))
                    curve.Points.CoordList = newlist
                    changed['changes'].append({'type': 'indexed_profile_minx_shift', 'old_thickness': old, 'new': target_thickness, 'delta': -dx})
                    return True
            return False

        geom_changed = False
        for item in body_items:
            if item.is_a('IfcExtrudedAreaSolid'):
                geom_changed = adjust_profile(item.SweptArea) or geom_changed

        if not geom_changed and not changed['changes']:
            result = {'error': 'No editable thickness geometry/material found', 'wall': wall.GlobalId}
        else:
            commit()
            result = {
                'status': 'success',
                'unit_scale': unit_scale,
                'target_wall': wall.GlobalId,
                'target_thickness_model_units': target_thickness,
                'details': changed
            }
```

Listing 3: The single tool call to solve the task\.

#### Final response from the agent\.

Done\. The wall was identified as:

• 000irGlVjApOlI2apOJp9x

Applied changes:

• Thickness set to 0\.24 m

• Modified in \-x direction

• Material layer thickness updated from 0\.175 to 0\.24

Note: the wall body used an IfcRectangleProfileDef, and its XDim was changed from 4\.45 to 0\.24 with a profile shift applied to keep the change toward \-x\.

#### Score\.

Table [9](https://arxiv.org/html/2606.20146#A8.T9) reports the per\-axis BIM\-Edit result: topology is fully preserved \(1\.0000\), while geometry drops to 0\.3813\.

Table 9: Per\-axis evaluation result for Example 3 \(WAL\-UPD\-TOP\-A\-003, GPT\-5\.4 Pro\)\.

| Geometry | Semantics | Topology | Final | Tool calls | Out tokens |
| --- | --- | --- | --- | --- | --- |
| **0\.3813** | **0\.7222** | **1\.0000** | **0\.7012** | 1 | 13,701 |
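
The final score in Table 9 is numerically consistent with an unweighted mean of the three axis scores. The sketch below checks that arithmetic. The function name `final_score` and the equal weighting are assumptions inferred from the reported numbers, not a description of the released evaluator.

```python
from statistics import mean


def final_score(geometry: float, semantics: float, topology: float) -> float:
    """Unweighted mean of the three axes (assumed aggregation, see lead-in)."""
    return mean([geometry, semantics, topology])


# Axis values copied from Table 9.
example_3 = final_score(0.3813, 0.7222, 1.0000)
print(round(example_3, 4))  # prints 0.7012
```

Under the same assumption, a task where every axis is 0 also yields a final score of 0.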
#### Visual Result\.

Figure [8](https://arxiv.org/html/2606.20146#A8.F8) shows the input and edited IFC model\.

![Refer to caption](https://arxiv.org/html/2606.20146v1/x10.png)

Figure 8: Input and edited IFC model\.

#### Take\-away\.

In BIM models, the keyword *thickness* refers to the y\-direction by default\. The LLM relied on this conventional interpretation and adjusted the thickness along the y\-direction\. In this scene, however, the wall is oriented orthogonally to the x\-axis, so its thickness must be interpreted along the x\-direction\. The error is therefore semantic rather than purely geometric: although the prompt explicitly specified an adjustment in the negative x\-direction, the model still applied the change along the y\-axis\.
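
The orientation dependence described above can be made concrete with a small sketch. It assumes the wall's length direction in the global XY plane is already known and takes thickness to be perpendicular to it. The function `thickness_axis` is hypothetical and is not part of the benchmark code.

```python
import math


def thickness_axis(wall_dir_xy: tuple[float, float]) -> str:
    """Return the global plan axis ('x' or 'y') along which thickness runs.

    wall_dir_xy is the direction of the wall's length in the global XY plane.
    Thickness is measured perpendicular to that direction.
    """
    dx, dy = wall_dir_xy
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("wall direction must be non-zero")
    # Perpendicular in plan: rotate the unit direction by 90 degrees.
    px, py = -dy / norm, dx / norm
    return "x" if abs(px) >= abs(py) else "y"


print(thickness_axis((1.0, 0.0)))  # wall runs along x, prints y
print(thickness_axis((0.0, 1.0)))  # wall runs along y, prints x
```

The second call corresponds to the situation in this example: a wall orthogonal to the x\-axis has its thickness along x.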

### H\.4 Example 4: Claude Sonnet 4\.6 on Window Translation \(All\-Zero Catastrophic Failure\)

#### Task\.

WIN\-UPD\-DIR\-A\-002\. *operation:* update, *category:* direct, *element:* IfcWindow, *scene:* A\.

> Move the window with ID 0okaGdY4n1xulccRQajpyj by 1 m in positive y direction\. Also move the opening 3H9tIOVCqvfsgJgyvP95X$ the same and preserve all relationships\.

#### Agent trace \(7 rounds, condensed\)\.

Rounds 1 to 5 are read\-only inspections\. The agent prints the placements for the window and the opening, checks the project unit assignment to confirm that the model is in metres, follows IfcRelFillsElement and IfcRelVoidsElement to identify the host wall, and finds the key invariant: the window’s PlacementRelTo is the opening’s placement\. Round 6 is the only updating round, and round 7 prints the verification result\.

opening=ifc\.by\_guid\("3H9tIOVCqvfsgJgyvP95X$"\)

op\_location=opening\.ObjectPlacement\.RelativePlacement\.Location

print\(f"Before:\{op\_location\.Coordinates\}"\)

current\_coords=op\_location\.Coordinates

new\_y=current\_coords\[1\]\+1\.0

op\_location\.Coordinates=\(current\_coords\[0\],new\_y,current\_coords\[2\]\)

print\(f"After:\{op\_location\.Coordinates\}"\)

window=ifc\.by\_guid\("0okaGdY4n1xulccRQajpyj"\)

win\_ax=window\.ObjectPlacement\.RelativePlacement

assert window\.ObjectPlacement\.PlacementRelTo==opening\.ObjectPlacement

commit\(\)

Listing 4: Round 6, the only updating round\.

#### Final response from the agent\.

The operation is complete\.

Movement applied\. Both the *window* \(0okaGdY4n1xulccRQajpyj\) and the *opening* \(3H9tIOVCqvfsgJgyvP95X$\) were moved \+1\.0 m in the positive y direction\.

How it was done\. The IFC placement hierarchy was leveraged: the opening’s placement \(\#3577\) is shared as the PlacementRelTo for the window’s placement, so updating only the opening’s IfcCartesianPoint from \(11\.3, 0\.1, 0\.9\) to \(11\.3, 1\.1, 0\.9\) moves both elements together\.

Relationships preserved\. IfcRelFillsElement \(window fills the opening\), IfcRelVoidsElement \(opening voids the wall 0okaGdY4n1xulccRQajmzO\), and the placement hierarchy \(window remains at \(0, 0, 0\) relative to the opening\) all unchanged\.
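
The placement hierarchy that the agent relied on can be illustrated with a translation\-only sketch. The `Placement` class is a hypothetical stand\-in for IfcLocalPlacement that ignores rotations, and the opening is treated as the root of the chain for simplicity. The coordinates are the ones quoted in the agent's response.

```python
from dataclasses import dataclass
from typing import Optional

Vec3 = tuple[float, float, float]


@dataclass
class Placement:
    """Translation-only stand-in for IfcLocalPlacement (rotations ignored)."""
    location: Vec3
    rel_to: Optional["Placement"] = None

    def world(self) -> Vec3:
        # Accumulate offsets up the chain of parent placements.
        base = self.rel_to.world() if self.rel_to else (0.0, 0.0, 0.0)
        return tuple(b + l for b, l in zip(base, self.location))


# Simplification: the opening is the root placement in this sketch.
opening = Placement((11.3, 0.1, 0.9))
window = Placement((0.0, 0.0, 0.0), rel_to=opening)

before = window.world()
opening.location = (11.3, 1.1, 0.9)  # edit only the parent placement
after = window.world()
print(before, after)
```

Editing the parent changes the child's resolved position without touching the child's own relative coordinates. As the score below shows, this mechanism alone does not guarantee that the resulting position matches the ground truth.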

#### Score\.

Table [10](https://arxiv.org/html/2606.20146#A8.T10) reports the per\-axis BIM\-Edit result; every axis collapses to 0\.

Table 10: Per\-axis evaluation result for Example 4 \(WIN\-UPD\-DIR\-A\-002, Claude Sonnet 4\.6\)\. The agent’s edit moves the window to coordinates that match the prompt, but those coordinates differ from the ground\-truth window’s location, so the per\-element matcher fails and all axes are 0\.

| Geometry | Semantics | Topology | Final | Tool calls | Out tokens |
| --- | --- | --- | --- | --- | --- |
| **0\.0000** | **0\.0000** | **0\.0000** | **0\.0000** | 7 | 3,262 |

#### Visual Result\.

Figure [9](https://arxiv.org/html/2606.20146#A8.F9) shows the input and edited IFC model\.

![Refer to caption](https://arxiv.org/html/2606.20146v1/x11.png)

Figure 9: Input and edited IFC model\.

#### Take\-away\.

Example 4 shows that a high iteration count can mainly reflect read\-only work, such as element inspection, querying, and scene understanding\. Even when these steps succeed, they do not prevent an all\-zero outcome if the final edit is incorrect\. High scores require the edit itself to be correct\.

## Appendix I Representative Task Examples

This section shows representative BIM\-Edit tasks, grouped by element type and edit operation\. All examples are taken from human\-authored scenes\. The dataset release includes the full set of 324 task prompts\. Task IDs follow the pattern ELEM\-OP\-INSTR\-SCENE\-NNN, where SCENE is R for realistic scenes and A for artificial \(synthetic\) scenes\.
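
As a minimal sketch of the ID pattern, the following parser splits a task ID into its five fields. The code tables are inferred only from the IDs that appear in this appendix \(for example, ROM for space tasks\), so they may be incomplete. The names `TaskId` and `parse_task_id` are hypothetical.

```python
import re
from typing import NamedTuple


class TaskId(NamedTuple):
    element: str
    operation: str
    instruction: str
    scene: str
    number: int


_TASK_ID = re.compile(r"^([A-Z]{3})-([A-Z]{3})-([A-Z]{3})-([RA])-(\d{3})$")

# Code meanings inferred from the examples in this appendix (assumption).
ELEMENTS = {"WAL": "wall", "DOR": "door", "WIN": "window", "SLB": "slab", "ROM": "space"}
OPERATIONS = {"CRE": "create", "UPD": "update", "DEL": "delete"}
INSTRUCTIONS = {"DIR": "direct", "SPA": "spatial", "TOP": "topological"}
SCENES = {"R": "realistic", "A": "artificial"}


def parse_task_id(task_id: str) -> TaskId:
    m = _TASK_ID.match(task_id)
    if m is None:
        raise ValueError(f"not a task ID: {task_id!r}")
    elem, op, instr, scene, num = m.groups()
    # Unknown codes raise KeyError, which flags gaps in the inferred tables.
    return TaskId(ELEMENTS[elem], OPERATIONS[op], INSTRUCTIONS[instr], SCENES[scene], int(num))


print(parse_task_id("WAL-UPD-TOP-A-003"))
```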

### I\.1 Wall Tasks

#### Create task, direct instruction\.

*Task ID: WAL\-CRE\-DIR\-R\-001\.* The direct create task specifies the wall geometry and the target relationship GUIDs explicitly\. In this prompt, the agent must create the wall, assign it to the correct building storey, create the required wall\-to\-wall connections, and add the corresponding space\-boundary relationships\.

Create Prompt:

> Create a wall with a length of 2\.62 m from \(x1, y1, z1\) = \(7\.180, 14\.140, 3\.000\) to \(x2, y2, z2\) = \(9\.800, 14\.140, 3\.000\) with a thickness of 0\.240 m in −y direction and a height of 3 m\. Make sure that the wall is added to the building storey with id 1jPCssxb5C1RSM\_YHIx$zl, connected to walls 30a7nM35T5pgZzaItbPb1u and 153QDldl9AdhGy6O1dePed, and bounds the spaces 19QUlaWcT26g1KZHffEeW9 and 19QUlaWcT26g1KZHffEeWz\.
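
The geometric part of this prompt can be checked with a few lines of arithmetic. The sketch below confirms that the two endpoints are 2\.62 m apart and derives a plan footprint with the thickness extending toward −y. The helper `footprint` is hypothetical and only handles walls parallel to the x\-axis, which is the case here.

```python
import math

# Values copied from the prompt above (metres).
start = (7.180, 14.140, 3.000)
end = (9.800, 14.140, 3.000)
thickness = 0.240  # extends in the -y direction

length = math.dist(start, end)


def footprint(p0, p1, t):
    """Plan footprint for a wall running along x, thickened toward -y."""
    (x0, y0, _), (x1, y1, _) = p0, p1
    if not math.isclose(y0, y1):
        raise ValueError("sketch only handles walls parallel to the x-axis")
    return [(x0, y0), (x1, y0), (x1, y0 - t), (x0, y0 - t)]


print(round(length, 3))  # prints 2.62
print(footprint(start, end, thickness))
```

The relational part of the task \(storey assignment, wall connections, space boundaries\) is not covered by this sketch.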

#### Create task, spatial instruction\.

*Task ID: WAL\-CRE\-SPA\-R\-001\.* The spatial create task specifies the wall position through scene context rather than explicit coordinates\. In this prompt, the stairway is used as the spatial reference, and the agent must compute the wall position from the stairway geometry and the given offset\.

Create Prompt:

> Create a wall 0\.9 m away in −y direction from the stairway after going it up\. Wall must have thickness of 0\.240 m in y direction and a height of 3 m\.

#### Create task, topological instruction\.

*Task ID: WAL\-CRE\-TOP\-R\-001\.* The topological create task specifies the wall through the relationship between two spaces rather than explicit coordinates or wall IDs\. In this prompt, the agent must identify the two rooms, infer the boundary between them, and place the new wall along that boundary\.

Create Prompt:

> Create a wall that separates the rooms \(19QUlaWcT26g1KZHffEeW9\) and \(19QUlaWcT26g1KZHffEeWz\) and has a height of 3 m\.

#### Update task, direct instruction\.

*Task ID: WAL\-UPD\-DIR\-R\-001\.*

Update Prompt:

> Decrease the length of the wall with ID 30a7nM35T5pgZzaItbPb1c by 2\.1 m fixing one side at x = 0\.2\. Delete the connecting relationship to wall 30a7nM35T5pgZzaItbPb1v\. Make sure that the other relationships stay consistent\.

#### Update task, spatial instruction\.

*Task ID: WAL\-UPD\-SPA\-R\-001\.*

Update Prompt:

> Decrease the length of the wall that is on the right side after entering room \(19QUlaWcT26g1KZHffEeWX\) through door \(30a7nM35T5pgZzaItbPb1h\) by 2\.1 m keeping the side with lower x value fixed\.

#### Update task, topological instruction\.

*Task ID: WAL\-UPD\-TOP\-R\-001\.*

Update Prompt:

> Decrease the length of the wall that bounds only the rooms \(19QUlaWcT26g1KZHffEeWj\) and \(19QUlaWcT26g1KZHffEeWX\) by 2\.1 m keeping the side with lower x value fixed\.
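
All three update prompts above ask for the same geometric change: shorten the wall by 2\.1 m while one end stays fixed\. A minimal sketch of that operation on a one\-dimensional extent follows. The function `shrink_keep_lower` and the initial upper bound of 5\.0 are hypothetical; only the 2\.1 m decrease and the fixed side at x = 0\.2 come from the prompts.

```python
def shrink_keep_lower(x_min: float, x_max: float, decrease: float) -> tuple[float, float]:
    """Shorten the interval [x_min, x_max] by `decrease`, holding x_min fixed."""
    new_length = (x_max - x_min) - decrease
    if new_length <= 0:
        raise ValueError("decrease exceeds the current length")
    return x_min, x_min + new_length


# Hypothetical wall extent along x; the fixed side and decrease follow the prompt.
print(shrink_keep_lower(0.2, 5.0, 2.1))
```

The three prompts differ only in how the target wall is identified \(by ID, by spatial description, or by the rooms it bounds\), which is the part this sketch leaves out.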

#### Delete task, direct instruction\.

*Task ID: WAL\-DEL\-DIR\-R\-001\.*

Delete Prompt:

> Delete the wall with ID 2AOGoBTWz3ieiE459aLhPN\. Make sure to delete the wall from all its relationships: connections to walls 2AOGoBTWz3ieiE459aLeaP, 0FwbbEHzD6bwLn0l0aWD1y, and 0FwbbEHzD6bwLn0l0aWBr\_, bounding of spaces 19QUlaWcT26g1KZHffEedX, 19QUlaWcT26g1KZHffEedY, and 2KN4OKK7D6Uw\_xHkV6DTvG, and voided by openings 3p6vUUnefjnD$Kee\_zq0zd and 0tUp8t\_xqOHBUun1YEiX\_v\.

#### Delete task, spatial instruction\.

*Task ID: WAL\-DEL\-SPA\-R\-001\.*

Delete Prompt:

> Delete the wall that is on the left side after walking from room \(2KN4OKK7D6Uw\_xHkV6DTv6\) to wall \(2AOGoBTWz3ieiE459aLeaP\)\.

#### Delete task, topological instruction\.

*Task ID: WAL\-DEL\-TOP\-R\-001\.*

Delete Prompt:

> Delete the wall that bounds only the rooms \(19QUlaWcT26g1KZHffEedX\), \(19QUlaWcT26g1KZHffEedY\), and \(2KN4OKK7D6Uw\_xHkV6DTvG\)\.
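
The phrase "bounds only the rooms" in the topological prompts amounts to an exact set match between a wall's bounded spaces and the listed rooms. The sketch below shows that selection on a plain dictionary. In an IFC model such a map would typically be derived from space\-boundary relationships; the dictionary, its shortened IDs, and the function name are hypothetical.

```python
def walls_bounding_exactly(boundaries: dict[str, set[str]], spaces: set[str]) -> list[str]:
    """Return walls whose set of bounded spaces equals `spaces` exactly."""
    return sorted(w for w, bounded in boundaries.items() if bounded == spaces)


# Hypothetical wall -> bounded-space map with placeholder IDs.
boundaries = {
    "wall_a": {"room_1", "room_2"},
    "wall_b": {"room_1", "room_2", "room_3"},
    "wall_c": {"room_2"},
}
print(walls_bounding_exactly(boundaries, {"room_1", "room_2"}))  # prints ['wall_a']
```

A subset test would also return `wall_b` here, which is why the exact comparison matters for the word "only".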

### I\.2 Door and Window Tasks

Door and window tasks follow the same direct, spatial, and topological structure as the wall tasks\. However, doors and windows involve additional IFC relationships, such as openings, voiding relationships, and filling relationships\. The agent must place the element correctly and also satisfy these dependent relationships\.
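
The dependency chain described above \(a door or window fills an opening, and the opening voids a wall\) can be sketched as a simple consistency check. The `Model` class below is a toy relationship store with hypothetical placeholder IDs, not the benchmark's evaluator or an IFC library API.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """Toy relationship store; IDs are hypothetical placeholders."""
    fills: dict[str, str] = field(default_factory=dict)  # door/window -> opening
    voids: dict[str, str] = field(default_factory=dict)  # opening -> host wall


def missing_dependencies(model: Model, fillers: list[str]) -> dict[str, str]:
    """Report doors/windows whose filling or voiding relationship is absent."""
    problems = {}
    for f in fillers:
        opening = model.fills.get(f)
        if opening is None:
            problems[f] = "no filling relationship to an opening"
        elif opening not in model.voids:
            problems[f] = "opening does not void any wall"
    return problems


model = Model(
    fills={"door_1": "opening_1", "window_1": "opening_2"},
    voids={"opening_1": "wall_1"},
)
print(missing_dependencies(model, ["door_1", "window_1", "door_2"]))
```

Here `door_1` is fully connected, `window_1` has an opening that voids nothing, and `door_2` has no opening at all.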

#### Door create task, topological instruction\.

*Task ID: DOR\-CRE\-TOP\-R\-001\.*

Create Prompt:

> Add a door to the wall that connects room 2gkzSyKgLDVALhNNC1fZ\_k to room 2gkzSyKgLDVALhNNC1fZ\_d\. The door opening must have a distance of −0\.61 m to the wall edge with the maximum y value, a width of 0\.89 m, and a height of 2\.045 m\.

#### Window update task, spatial instruction\.

*Task ID: WIN\-UPD\-SPA\-R\-001\.*

Update Prompt:

> Move the only window in the second storey that has no window right below down so that it aligns with the other windows of the new wall\.

#### Door delete task, topological instruction\.

*Task ID: DOR\-DEL\-TOP\-R\-001\.*

Delete Prompt:

> Delete the door that connects the room 2gkzSyKgLDVALhNNC1fZ\_q to the outside\.

### I\.3 Slab Tasks

Slab tasks focus on IfcSlab elements\.

#### Slab create task, direct instruction\.

*Task ID: SLB\-CRE\-DIR\-R\-001\.*

Create Prompt:

> Create a slab starting at \(2\.161, 4\.310, −0\.4\) with a height of 0\.2 m in z direction, a length of 30 m in x direction, and a width of 15 m in y direction\. Assign it to the storey with ID 13LV5dTeP3CAox54x56C1Z\.

#### Slab update task, spatial instruction\.

*Task ID: SLB\-UPD\-SPA\-R\-001\.*

Update Prompt:

> Decrease the length of the slab below the ground storey by 7 m keeping it fixed at the smallest x value\.

#### Slab delete task, topological instruction\.

*Task ID: SLB\-DEL\-TOP\-R\-001\.*

Delete Prompt:

> Delete the slab of the building storey with ID 13LV5dTeP3CAox54x56C1Z\.

### I\.4 Space Tasks

Space tasks focus on IfcSpace elements, which represent rooms or usable areas in a building model\.

#### Space create task, direct instruction\.

*Task ID: ROM\-CRE\-DIR\-R\-001\.*

Create Prompt:

> Create a room at \(0\.438, 5\.542, 0\), with distances in x direction of 3\.823m, in y direction of 4\.399m, and in z direction of 2\.438m\. The area is bounded by walls 09aTDCDGbDbB5YMd6zhdRK, 2o7$Px5Hf4\_OWCiwfVz3qS, 2o7$Px5Hf4\_OWCiwfVz3\_g, and 2o7$Px5Hf4\_OWCiwfVz3$d\. Add the according relationships to the walls, the slab 2o7$Px5Hf4\_OWCiwfVz0eP and the building storey 13LV5dTeP3CAox54x56C1Z\.

#### Space create task, spatial instruction\.

*Task ID: ROM\-CRE\-SPA\-R\-001\.*

Create Prompt:

> Create a room in the area that you can look into through the most central window of the wall with id 09aTDCDGbDbB5YMd6zhdRK\. Make it 2\.438 m high\.

#### Space create task, topological instruction\.

*Task ID: ROM\-CRE\-TOP\-R\-001\.*

Create Prompt:

> Create a room in the area that is enclosed by the walls 09aTDCDGbDbB5YMd6zhdRK, 2o7$Px5Hf4\_OWCiwfVz3qS, 2o7$Px5Hf4\_OWCiwfVz3\_g, and 2o7$Px5Hf4\_OWCiwfVz3$d with a height of 2\.438 m\.

### I\.5 IFC Models used in the Benchmark

Figure [10](https://arxiv.org/html/2606.20146#A9.F10) shows example IFC files used in this benchmark\.

![Refer to caption](https://arxiv.org/html/2606.20146v1/imgs/real7.png)

![Refer to caption](https://arxiv.org/html/2606.20146v1/imgs/real8.png)

![Refer to caption](https://arxiv.org/html/2606.20146v1/imgs/real11.png)

![Refer to caption](https://arxiv.org/html/2606.20146v1/imgs/real4.png)

![Refer to caption](https://arxiv.org/html/2606.20146v1/imgs/real5.png)

![Refer to caption](https://arxiv.org/html/2606.20146v1/imgs/real6.png)

![Refer to caption](https://arxiv.org/html/2606.20146v1/imgs/artificial1.png)

![Refer to caption](https://arxiv.org/html/2606.20146v1/imgs/artificial2.png)

![Refer to caption](https://arxiv.org/html/2606.20146v1/imgs/artificial3.png)

![Refer to caption](https://arxiv.org/html/2606.20146v1/imgs/artificial4.png)

![Refer to caption](https://arxiv.org/html/2606.20146v1/imgs/artificial6.png)

![Refer to caption](https://arxiv.org/html/2606.20146v1/imgs/artificial5.png)

Figure 10: Example IFC files used in the benchmark\. The top 6 are realistic large\-scale models, and the bottom 6 are smaller synthetic models specifically created for this benchmark\.

## Appendix J Licenses and Asset Terms

The release package includes the license file for the code\. The inference harness and the evaluator are released under the MIT License\. Benchmark prompts, task metadata, author\-created artificial IFC files, and author\-created realistic IFC files are released under CC\-BY 4\.0\.
