Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans

arXiv cs.AI Papers

Summary

This paper presents Architect-Ant, an editable automatic furnishing framework for architectural floor plans, together with a curated dataset (AntPlan-270) of 270 floor plans with furniture annotations. The method uses a fine-tuned vision-language model and a domain-specific language to generate geometrically valid and functionally plausible furniture layouts that can be rasterized into blueprint-style images.

arXiv:2606.10953v1 Announce Type: new

# Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans
Source: [https://arxiv.org/html/2606.10953](https://arxiv.org/html/2606.10953)
Aleksandar Cvejić \([0009\-0005\-4414\-4457](https://orcid.org/0009-0005-4414-4457)\), King Abdullah University of Science and Technology \(KAUST\), Saudi Arabia; Michael Birsak \([0000\-0001\-6375\-8124](https://orcid.org/0000-0001-6375-8124)\), King Abdullah University of Science and Technology \(KAUST\), Saudi Arabia; John Femiani \([0000\-0002\-0924\-6686](https://orcid.org/0000-0002-0924-6686)\), Miami University, United States of America; and Peter Wonka \([0000\-0003\-0627\-9746](https://orcid.org/0000-0003-0627-9746)\), King Abdullah University of Science and Technology \(KAUST\), Saudi Arabia

\(2026\)

###### Abstract\.

Furnished floor plans are fundamental to real estate visualization, interior design, and architectural workflows\. However, progress in automatic furniture arrangement has been limited by the lack of real, professionally designed floor\-plan datasets with object\-level furniture annotations\. To address this gap, we introduce AntPlan\-270, a curated dataset of 270 architectural floor plans with per\-room furniture bounding box annotations across ten residential room categories\. Building on this dataset, we present Architect\-Ant, an editable automatic furnishing framework powered by a fine\-tuned vision\-language model\. Furniture layouts are represented using a compact, coordinate\-based domain\-specific language \(DSL\) that encodes object categories and placements relative to the room geometry\. To improve spatial reasoning, we generate procedural reasoning traces that capture architectural constraints such as wall alignment, door and window clearance, circulation, fixture compatibility, and room\-specific furniture inventories, and use them to supervise fine\-tuning of the model\. We then apply preference optimization over candidate object placements to further refine layout quality\. The generated DSL can be rasterized into semantic masks and used to condition a Flux\-based LoRA renderer, producing realistic blueprint\-style furnished floor\-plan images while preserving the editable symbolic layout\. Experiments on layout furnishing show that Architect\-Ant produces geometrically valid and functionally plausible layouts, and suggest a scalable path for furnishing larger structure\-only floor\-plan datasets\.

LLM spatial reasoning, Furniture placement, Floorplan generation

CCS Concepts: Computing methodologies → Spatial and physical reasoning; Computing methodologies → Knowledge representation and reasoning; Computing methodologies → Scene understanding

![Refer to caption](https://arxiv.org/html/2606.10953v1/figures/Teaser_compressed.png)

Figure 1\. Architect\-Ant turns empty structured floor plans \(left\) into multiple plausible furnished, blueprint\-style renderings \(right, $2\times 2$ grid of layout variants\)\. The intermediate symbolic DSL remains the editable source of truth\.

Teaser placeholder: a two\-by\-two grid of furnished floor plans is the intended final figure\.

## 1\.Introduction

Furnished floor plans are central to real estate visualization, interior design, and architectural communication\. Furniture makes a plan interpretable: it conveys room scale, likely function, circulation, and whether a space can support the intended use\. Producing such layouts manually is time\-consuming, while automatic furnishing is useful only when the result is geometrically valid, functional, and available as an object\-level representation rather than only as pixels\.

Furniture placement is a constrained layout problem\. A layout must place objects of appropriate type, size, and position inside a room boundary while keeping them accessible, visible, and usable\. These requirements are partly geometric and partly semantic\. A bed is not just a rectangle that should avoid collisions; it is an object with typical relations to walls, doors, circulation paths, and other furniture\. A chair is valid only if it remains reachable and usable after nearby objects are placed\. A plausible layout must satisfy constraints that are easy to express in design language but difficult to learn from clean examples, especially when complete examples of furnished floor plans are scarce\.

This problem is distinct from architectural floor\-plan generation, which typically concerns the organization of rooms, walls, adjacencies, boundaries, and openings\. We focus on the furnishing stage: given a room or floor\-plan geometry, generate the objects that occupy the room and determine where they should go\. This stage has different failure modes\. A furnished room may fail because furniture overlaps, blocks a door, leaves no traversable path, or violates basic use constraints\.

Furniture layout is an object\-level problem\. Designers edit walls, openings, furniture instances, dimensions, positions, and relations, not pixels\. We therefore represent the task in structured text: the input describes the room boundary and relevant architectural elements such as doors, windows, and openings, and the output describes furniture objects with category, position, and axis\-aligned extent\. The same representation exposes the variables needed to define layout validity\. Collision, containment, clearance, door obstruction, reachability, wall affinity, and pairwise object relations can be evaluated directly over structured geometry and labels\. In pixel space, these checks depend on first recovering the underlying objects and geometry\.
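
To make the point about structured geometry concrete, the following is a minimal sketch of two such checks, overlap and containment, over axis\-aligned boxes\. The `Box` schema, the field names, and the rectangular room frame are assumptions made for illustration; they are not the paper's DSL or scorer\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in metric room coordinates (hypothetical schema)."""
    category: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes (0.0 if disjoint)."""
    w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    return max(w, 0.0) * max(h, 0.0)


def contains(room: Box, obj: Box) -> bool:
    """True if obj lies entirely inside a rectangular room frame."""
    return (room.x_min <= obj.x_min and room.y_min <= obj.y_min
            and obj.x_max <= room.x_max and obj.y_max <= room.y_max)


# Toy 4 m x 3 m room; the wardrobe deliberately intersects the bed.
room = Box("frame", 0.0, 0.0, 4.0, 3.0)
bed = Box("bed", 0.0, 0.0, 2.0, 1.6)
wardrobe = Box("wardrobe", 1.5, 1.0, 2.5, 1.6)

print(contains(room, bed))          # containment check on labeled geometry
print(overlap_area(bed, wardrobe))  # positive area signals a collision
```

Both checks read coordinates and labels directly, which is the property the structured representation is meant to provide; a pixel\-space pipeline would first need to recover these boxes\.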

The data needed for this formulation is limited\. Public datasets rarely provide many complete, real, furnished floor plans as discrete editable objects\. Architectural datasets may provide images or vector geometry, but they usually describe walls, rooms, doors, windows, and other building elements rather than furniture instances\. Furnished scene datasets exist, including synthetic 3D datasets with object\-level layouts; they can be converted into this form, but they are a poor substitute for real furnished floor plans when the goal is to learn how rooms are typically furnished\. In practice, useful furnishing information is more often found in images, scans, or drawings, where structure must be extracted by detectors or parsers\.

Those extracted layouts are useful but noisy\. They may contain incorrect categories, missing furniture, inaccurate dimensions, or imprecise locations\. We use them as pseudo\-labels for lightweight adaptation: enough to move the model toward the target representation and approximate room statistics, but not as evidence that every extracted object is correct\. We refer to the resulting per\-room corpus, drawn from 270 professionally designed floor plans across ten residential room categories, as AntPlan\-270; the experiments in this paper focus on the four most\-furnished categories \(bedroom, bathroom, kitchen, and living room\)\.

We train a structured layout generator in stages\. Prompting provides an initial prior over furniture categories and coarse spatial relations\. A lightweight fine\-tuning stage on pseudo\-labeled layouts then adapts the model to the target format and approximate room statistics\. A rule\-based evaluator scores sampled layouts using geometric and semantic criteria such as containment within the room, object overlap, door access, traversable paths, wall affinity, and object–object relations\. We then apply preference optimization with preferences derived from this rule\-based evaluator, training the model to assign higher probability to the better\-scoring layouts\. The criteria are weighted by severity, with larger penalties for violations such as obstruction or out\-of\-room placement and smaller penalties for weaker design preferences such as wall affinity or pairwise relationships\.

The training uses three kinds of signal\. The pretrained model supplies semantic priors over object co\-occurrence and common relations\. The pseudo\-labels give the model approximate room\-scale statistics and examples in the target output format\. The rule\-based evaluator supplies explicit design preferences without requiring additional clean demonstrations\. The rules therefore act as supervision for the learned generator\.

The contributions of this paper are as follows:

- We formulate furnished room layout synthesis as structured sequence generation over editable geometric objects, rather than as image generation\.
- We adapt a pretrained generator to this representation using pseudo\-labeled layouts, providing a task\-specific starting point for later preference optimization\.
- We define a rule\-based evaluator that converts geometric and semantic layout criteria into preference signals, and combine those signals with fail\-and\-fix reasoning traces to train the generator toward layouts that satisfy the desired constraints directly\.

For visualization, the resulting DSL layouts are rendered into blueprint\-style architectural images via a domain\-specific diffusion model \(FLUX\.2\-dev \(Black Forest Labs, [2025](https://arxiv.org/html/2606.10953#bib.bib5)\) LoRA\) conditioned on the colored room\-type mask\. The symbolic layout remains the editable source of truth, and the rendered image serves as a downstream view rather than the representation the system operates on\. Figure [1](https://arxiv.org/html/2606.10953#S0.F1) illustrates the overall input\-output behavior: empty structured floor plans are converted into multiple furnished blueprint\-style renderings while retaining an editable DSL layout\.

Although the experiments focus on furniture placement, the setting reflects a broader class of graphics and design problems in which clean demonstrations are limited, but weak observations and explicit rules are available\. The central result is a method for adapting a pretrained structured generator using both noisy examples and symbolic preferences, so that geometric and functional criteria influence the learned distribution rather than appearing only as checks applied after generation\.

## 2\.Related Work

![Refer to caption](https://arxiv.org/html/2606.10953v1/x1.png)

Training and data\-preparation pipeline\. An input floor plan is processed by an RT\-DETR\-X detector to identify structural elements and furniture\. The detected plan is split into room\-level examples, which are converted into structured inputs for a Qwen3\.5\-9B vision\-language model\. The model is adapted with supervised fine\-tuning and direct preference optimization\.

Figure 2\. Build\-time pipeline \(data preparation and training\)\. Raw floor plans are processed by RT\-DETR\-X into per\-room structural primitives and furniture pseudo\-labels, paired with procedural reasoning traces, and used to fine\-tune the Qwen3\.5\-9B VLM via SFT and DPO\. The output is a set of trained per\-room LoRA adapters, which serve as the generator at inference time \(Figure [3](https://arxiv.org/html/2606.10953#S3.F3)\)\.

Floor\-plan structure and vectorization\. Architectural floor\-plan work targets the building shell: rooms, walls, doors, windows, and topology\. Boundary\-conditioned generation predicts rooms and walls from a plan outline \(Wu et al\., [2019](https://arxiv.org/html/2606.10953#bib.bib47)\), while graph\-conditioned methods produce room boxes or rasterized plans from layout graphs \(Hu et al\., [2020](https://arxiv.org/html/2606.10953#bib.bib18); Nauata et al\., [2020](https://arxiv.org/html/2606.10953#bib.bib29)\)\. Vector\-graph residential datasets such as ResPlan extend this line at scale \(Abouagour and Garyfallidis, [2025](https://arxiv.org/html/2606.10953#bib.bib2)\)\. A complementary direction parses raster plans into structure: Deep Floor Plan Recognition predicts rooms, openings, and types directly from images \(Zeng et al\., [2019](https://arxiv.org/html/2606.10953#bib.bib52)\), CubiCasa5K supplies large\-scale vector annotations \(Kalervo et al\., [2019](https://arxiv.org/html/2606.10953#bib.bib19)\), MSD extends to building complexes \(Van Engelenburg et al\., [2024](https://arxiv.org/html/2606.10953#bib.bib43)\), and FloorplanVLM converts raster plans into topological representations with a vision\-language model \(Liu et al\., [2026](https://arxiv.org/html/2606.10953#bib.bib25)\)\. HouseDiffusion generates vector plans with a discrete–continuous diffusion model \(Shabani et al\., [2022](https://arxiv.org/html/2606.10953#bib.bib38)\)\. These methods supply structure rather than furnishing: their outputs describe the architectural shell and do not place furniture instances inside rooms\.

Indoor scene datasets and the 2D–3D mismatch\. Furniture\-rich indoor data are concentrated in 3D scene corpora\. 3D\-FRONT \(Fu et al\., [2020a](https://arxiv.org/html/2606.10953#bib.bib14)\) and its furniture\-asset companion 3D\-FUTURE \(Fu et al\., [2020b](https://arxiv.org/html/2606.10953#bib.bib15)\) are the dominant supervision source for object\-level indoor synthesis; Structured3D \(Zheng et al\., [2020](https://arxiv.org/html/2606.10953#bib.bib54)\), Hypersim \(Roberts et al\., [2020](https://arxiv.org/html/2606.10953#bib.bib36)\), HSSD \(Khanna et al\., [2023](https://arxiv.org/html/2606.10953#bib.bib20)\), and Aria Digital Twin \(Pan et al\., [2023](https://arxiv.org/html/2606.10953#bib.bib31)\) provide synthetic or scanned scenes at scale\. SceneScript represents scenes as a structured language for reconstruction tasks \(Avetisyan et al\., [2024](https://arxiv.org/html/2606.10953#bib.bib4)\), and ScanNet provides real RGB\-D scans with semantic annotation \(Dai et al\., [2017](https://arxiv.org/html/2606.10953#bib.bib9)\)\. Procedural and CAD\-style sources complement these: ProcTHOR builds embodied 3D houses procedurally \(Deitke et al\., [2022](https://arxiv.org/html/2606.10953#bib.bib10)\), FloorPlanCAD \(Fan et al\., [2021](https://arxiv.org/html/2606.10953#bib.bib12)\) and ArchCAD\-400K \(Luo et al\., [2026](https://arxiv.org/html/2606.10953#bib.bib26)\) provide panoptic CAD symbols, ZInD pairs floor plans with 360\-degree panoramas \(da Cruz et al\., [2021](https://arxiv.org/html/2606.10953#bib.bib8)\), and FurniScene contributes densely furnished 3D rooms \(Wang et al\., [2026](https://arxiv.org/html/2606.10953#bib.bib45)\)\. None of these aligns the three properties our setting requires simultaneously: real 2D architectural geometry, per\-instance editable furniture bounding boxes, and a symbolic representation suited to rule\-based scoring\. Projecting 3D scenes to 2D is possible but changes the annotation problem along five axes: coordinate frame, drawing style, furniture taxonomy, evaluation metrics, and the availability of professional plan\-style supervision\.

Constraint\-based arrangement and LLM agents\. Furniture layout has a constraint\-driven tradition\. Classical systems encode design guidelines or ergonomic objectives and search for arrangements that satisfy accessibility, visibility, and similar criteria \(Merrell et al\., [2011](https://arxiv.org/html/2606.10953#bib.bib28); Yu et al\., [2011](https://arxiv.org/html/2606.10953#bib.bib51)\)\. Para et al\. separate transformer\-based layout proposal from a downstream constraint solver \(Para et al\., [2020](https://arxiv.org/html/2606.10953#bib.bib32)\)\. Learning\-based scene synthesis moved the burden into autoregressive generators \(ATISS \(Paschalidou et al\., [2021](https://arxiv.org/html/2606.10953#bib.bib33)\)\) and denoising diffusion \(DiffuScene \(Tang et al\., [2023](https://arxiv.org/html/2606.10953#bib.bib41)\), InstructScene \(Lin and Mu, [2024](https://arxiv.org/html/2606.10953#bib.bib24)\)\); LayoutEnhancer instead pushes rules into training as a differentiable expert\-rule loss \(Leimer et al\., [2022](https://arxiv.org/html/2606.10953#bib.bib22)\)\. LLM\-driven agents continue the line: Holodeck and I\-Design produce 3D scenes from text via constraint solvers and scene graphs \(Yang et al\., [2023](https://arxiv.org/html/2606.10953#bib.bib50); Çelen et al\., [2024](https://arxiv.org/html/2606.10953#bib.bib6)\), Open\-Universe synthesizes scenes via LLM program synthesis with uncurated assets \(Aguina\-Kang et al\., [2024](https://arxiv.org/html/2606.10953#bib.bib3)\), and Procedural Scene Programs places objects through iterative self\-training \(Chang et al\., [2025](https://arxiv.org/html/2606.10953#bib.bib7)\)\. In most of these methods, constraints are enforced outside the generator, through a solver, a search step, or post\-hoc repair; LayoutEnhancer is the exception that bakes a differentiable surrogate into the loss\. Architect\-Ant converts the same rules into preference signals that adapt the generator’s own distribution\.

LLMs for structured layout generation\. Large language models have been applied as structured layout planners\. LayoutGPT generates layouts via in\-context prompting and extends to 3D scenes \(Feng et al\., [2023](https://arxiv.org/html/2606.10953#bib.bib13)\); Chat2Layout adds multimodal prompting and iterative editing \(Wang et al\., [2024](https://arxiv.org/html/2606.10953#bib.bib44)\); LLplace edits 3D layouts via LLM control \(Yang et al\., [2024](https://arxiv.org/html/2606.10953#bib.bib48)\); LayoutVLM integrates a vision\-language model for spatial planning \(Sun et al\., [2024](https://arxiv.org/html/2606.10953#bib.bib40)\)\. OptiScene fine\-tunes an open LLM for indoor scene layout with multi\-stage preference optimization \(Yang et al\., [2025](https://arxiv.org/html/2606.10953#bib.bib49)\)\. FloorplanQA shows that a general\-purpose language model is brittle on symbolic indoor\-layout tasks even when the input is explicit \(Rodionov et al\., [2025](https://arxiv.org/html/2606.10953#bib.bib37)\), motivating task\-specific adaptation\. SceneScript is related in representation but targets structured reconstruction rather than furnishing generation \(Avetisyan et al\., [2024](https://arxiv.org/html/2606.10953#bib.bib4)\)\. The recurring failure mode in these systems is geometric: layouts pass coarse semantic checks yet violate overlap, containment, door\-clearance, and wall\-affinity rules unless an external solver or post\-hoc repair step intervenes\. Our work moves rule enforcement into training so that geometric criteria influence the learned distribution\.

Preference optimization with rule\-derived rewards\. Direct Preference Optimization replaces the explicit reward model of RLHF with a closed\-form pairwise loss over preferred and rejected completions \(Rafailov et al\., [2023](https://arxiv.org/html/2606.10953#bib.bib35); Ouyang et al\., [2022](https://arxiv.org/html/2606.10953#bib.bib30)\)\. Verifier\-based post\-training has used this template in domains with deterministic correctness checks: program execution \(Le et al\., [2022](https://arxiv.org/html/2606.10953#bib.bib21)\), compiler feedback \(Dou et al\., [2024](https://arxiv.org/html/2606.10953#bib.bib11)\), mathematical answer matching \(Shao et al\., [2024](https://arxiv.org/html/2606.10953#bib.bib39)\), and code preferences derived from execution and judge models \(Weyssow et al\., [2026](https://arxiv.org/html/2606.10953#bib.bib46)\)\. OptiScene applies multi\-stage preference optimization to indoor scene layout \(Yang et al\., [2025](https://arxiv.org/html/2606.10953#bib.bib49)\)\. Architect\-Ant follows the same recipe, with a programmatic verifier as the source of preferences, but the verifier is a geometric rule scorer over 2D furniture coordinates\. Section [3](https://arxiv.org/html/2606.10953#S3) describes the rule set, the pair\-construction restriction that isolates placement quality from surface\-form differences, and the failure modes observed under broader pair construction\.

Rendering as visualization\. Image\-conditioned and diffusion\-based renderers translate masks or schematics into architectural images \(Shabani et al\., [2022](https://arxiv.org/html/2606.10953#bib.bib38); Zhang et al\., [2023](https://arxiv.org/html/2606.10953#bib.bib53); Li et al\., [2024](https://arxiv.org/html/2606.10953#bib.bib23)\)\. Pixel output is a useful end product but not a layout representation: object\-level edits such as moving a bed, resizing a wardrobe, or clearing a doorway require the underlying objects, not their rasterization\. Architect\-Ant keeps this separation: the structured DSL is the representation the pipeline operates on, and a domain\-specific diffusion model rasterizes it into a blueprint\-style view as a downstream visualization step\.

## 3\.Architect\-Ant

![Refer to caption](https://arxiv.org/html/2606.10953v1/x2.png)

Generation and rendering pipeline\. A structural room input is split into rooms and combined with a furniture list\. A fine\-tuned Qwen3\.5\-9B VLM generates $K$ structured DSL layout candidates with furniture classes and coordinates\. The rule\-based scorer ranks the candidates and selects the best by clearance, reachability, collision, and aesthetic\-rule signals\. The selected DSL is the editable output; a semantic mask derived from it optionally conditions a FLUX\.2\-dev model with LoRA and a text prompt to render the final blueprint\-style image\.

Figure 3\. Run\-time pipeline \(inference and rendering\)\. Using the per\-room adapter trained in Figure [2](https://arxiv.org/html/2606.10953#S2.F2), the Qwen3\.5\-9B VLM emits $K$ DSL candidates per prompt; the rule scorer \(Section [3\.3](https://arxiv.org/html/2606.10953#S3.SS3)\) selects the highest\-scoring one\. The selected DSL is the editable output, with optional FLUX\.2\-dev LoRA rendering as a downstream visualization branch\.

Figure [2](https://arxiv.org/html/2606.10953#S2.F2) summarizes the data\-preparation and training pipeline\. Given a room with its geometric primitives \(frame, walls, doors, windows, optional railings\) and a list of furniture, Architect\-Ant produces a furniture layout as a sequence of axis\-aligned bounding boxes, expressed using a structured DSL\. The task is to *place* furniture at plausible positions, following the provided items list\. A valid layout satisfies geometric constraints \(containment, no overlap with walls or door swings, opening clearances\) and functional constraints \(wall affinity for large items, accessibility, room\-specific pairwise relationships\)\.

Our approach is to train a language model to produce editable layouts that match real furnished floor plans while satisfying coded geometric preferences\. Because no suitable real 2D dataset exists for this task, we first construct AntPlan\-270 \(§[3\.1](https://arxiv.org/html/2606.10953#S3.SS1)\)\. We then fine\-tune the model on pseudo\-labeled layouts \(§[3\.2](https://arxiv.org/html/2606.10953#S3.SS2)\), score generated layouts with deterministic rules \(§[3\.3](https://arxiv.org/html/2606.10953#S3.SS3)\), and use controlled preference pairs to improve placements and reduce rule violations \(§[3\.4](https://arxiv.org/html/2606.10953#S3.SS4)\)\. At inference time, Architect\-Ant samples multiple DSL candidates, selects the highest\-scoring layout, and optionally renders it into a blueprint\-style visualization, as summarized in Figure [3](https://arxiv.org/html/2606.10953#S3.F3)\.

### 3\.1\.AntPlan\-270 dataset

AntPlan\-270 contains 270 professionally designed anonymized residential floor plans, collected from publicly accessible online sources\. Figure [6](https://arxiv.org/html/2606.10953#S4.F6) shows representative source drawings from the dataset\. We use these plans only as source material for annotation and experimentation and do not redistribute the original images\. Each plan is converted into room\-level structural primitives and furniture bounding\-box pseudo\-labels\. Annotated data is split into per\-room samples spanning ten residential room categories\. The four most\-furnished categories \(bedroom, bathroom, kitchen, living room\) are the focus of the experiments in this paper; the remaining categories \(for example, balcony, terrace, entry, storage, and other, which includes corridor and garage\) typically carry little or only narrow furnishing such as a single wardrobe class, and are not part of the quantitative evaluation\. Each sample carries the room geometry \(walls, doors, windows, railings, frame\) in metric coordinates and a furniture pseudo\-label list with per\-instance bounding boxes\.

Annotations are produced by a three\-tier pipeline\. Structural primitives \(walls, windows, doors, railings\) are extracted fully automatically with an RT\-DETR\-X \(Lv et al\., [2024](https://arxiv.org/html/2606.10953#bib.bib27)\) detector trained on CubiCasa5K\. Room labels are produced manually\. Furniture bounding boxes are bootstrapped from a hand\-labelled subsample on which a separate RT\-DETR\-X is trained; the trained detector is then applied to the remaining plans, with a manual review pass that fixes detector errors\. This procedure is the reason the furniture\-side annotations are referred to as pseudo\-labels: they reflect a detector pipeline whose outputs were corrected but not exhaustively re\-drawn\.

Per\-room class whitelists distinguish room\-appropriate furniture \(for example, kitchen appliances are not valid bedroom classes\)\. The dataset is split per room type so that the held\-out validation set contains 10% of the rooms of that type, with the remaining 90% used for training; augmentation \(horizontal flip and 180\-degree rotation\) is applied to the training side only\. Detailed statistics on the total number of rooms, furniture diversity, and object counts per room are provided in Appendix [A](https://arxiv.org/html/2606.10953#A1)\. Section [2](https://arxiv.org/html/2606.10953#S2) discusses how AntPlan\-270 differs from large 3D scene corpora and from real 2D plan datasets that lack per\-room furniture supervision\.
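
The two augmentations named above act on box coordinates in a simple way\. The sketch below assumes a room frame with its origin at one corner and a known width and height in metres; the tuple layout and function names are hypothetical, and the same transform would also have to be applied to walls, doors, and windows so that structure and furniture stay aligned\.

```python
from typing import Tuple

# (x_min, y_min, x_max, y_max) in metric room coordinates (assumed layout).
Box = Tuple[float, float, float, float]


def hflip(box: Box, width: float) -> Box:
    """Mirror a box across the vertical centre line of a room of given width."""
    x0, y0, x1, y1 = box
    # The old right edge becomes the new left edge, so min/max stay ordered.
    return (width - x1, y0, width - x0, y1)


def rot180(box: Box, width: float, height: float) -> Box:
    """Rotate a box by 180 degrees about the centre of the room frame."""
    x0, y0, x1, y1 = box
    return (width - x1, height - y1, width - x0, height - y0)


bed = (0.0, 0.0, 2.0, 1.5)          # toy box in a 4 m x 3 m room
print(hflip(bed, 4.0))              # bed moves to the opposite side wall
print(rot180(bed, 4.0, 3.0))        # bed moves to the opposite corner
```

Both transforms keep boxes axis\-aligned, which matters here because the DSL and the scorer are described as operating on axis\-aligned extents\.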

### 3\.2\.Reasoning traces with recovery

Each training example pairs a DSL target layout with a procedural reasoning trace that walks the model through the placement decision step by step\. The DSL is a compact line\-oriented representation of furniture objects and axis\-aligned bounding boxes; its full grammar is provided in Appendix [B](https://arxiv.org/html/2606.10953#A2)\. The trace first identifies anchor objects \(typically large items that should be placed against walls\), then iterates through the remaining inventory and places each item in turn, explicitly checking room containment, wall contact for wall\-touch classes, door\-swing clearance, and pairwise relationships against already\-placed objects\. In half of the training traces, recovery is inserted: a placement that violates exactly one rule is emitted, followed by an in\-trace correction that produces a valid alternative\. Recovery is a training\-time augmentation, not an inference\-time repair loop; the model emits a single trace at inference\. A structure\-only top\-down PNG of the room accompanies the text prompt at both training and inference time, so the model sees the room geometry as both metric prose and a top\-down rendering\.

![Refer to caption](https://arxiv.org/html/2606.10953v1/x3.png)

Figure 4\. Rule\-score examples for variants of the same bedroom layout\. Scores start from a base value of $+10$, with rule\-specific penalties deducted for blocked openings, wall/window overlaps, and disallowed furniture overlaps\. From left to right: \(a\) $-6$, with multiple severe overlaps and blocked openings; \(b\) $-2$, with severe and medium object overlaps; \(c\) $+1$, with wall overlap, door blocking, and pairwise overlap penalties; \(d\) $+4$, with blocked access and a medium pairwise overlap; \(e\) $+6$, with two medium pairwise overlaps; and \(f\) GT $+10$, with no fired rules\. The complete per\-rule breakdown is given in Table [10](https://arxiv.org/html/2606.10953#A3.T10)\.

### 3\.3\.Rule\-based scorer

The rule\-based scorer takes a parsed DSL and the room structure and returns a score with a per\-rule breakdown\. The base score is $+10$; rule violations deduct severity\-dependent penalties\. Penalties accumulate, giving a fixed upper bound of $+10$ but no hard lower bound; observed scores were roughly in $[-15, +10]$\. Table [1](https://arxiv.org/html/2606.10953#S3.T1) summarises the rule families and Figure [4](https://arxiv.org/html/2606.10953#S3.F4) illustrates their effect on layout variants of the same bedroom\. The full per\-room specification, including class whitelists and pair tables, is provided in Appendix [C](https://arxiv.org/html/2606.10953#A3)\.

Table 1\. Rule families used by the scorer\. Each family contains multiple deterministic rules whose penalties sum to the deducted total\.

The scorer plays two roles\. It supplies the preference signal for direct preference optimization \(Section [3\.4](https://arxiv.org/html/2606.10953#S3.SS4)\), and it acts as the inference\-time selector that ranks $K$ candidates and picks the highest\-scoring one\. Because it is deterministic and decomposes by rule family, each preference can be traced back to explicit geometric or semantic violations\. The score does not capture all aspects of layout quality, so we complement it with an independent vision\-language judgment study in Section [4\.3](https://arxiv.org/html/2606.10953#S4.SS3)\. The scorer assumes axis\-aligned boxes, depends on consistent class names, and can be gamed by preference optimization; this failure mode is analyzed in Section [4\.4](https://arxiv.org/html/2606.10953#S4.SS4)\.
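
The scoring scheme described in this subsection (a base of +10, accumulating penalties, and a per\-rule breakdown) can be sketched as follows\. This is an illustration under stated assumptions, not the paper's scorer: the three rule names, the penalty magnitudes in `PENALTY`, the tuple layout, and the rectangular room frame are all hypothetical, and the real rule set lives in the paper's Appendix C\.

```python
from typing import Dict, List, Tuple

# (category, x_min, y_min, x_max, y_max) in metric room coordinates (assumed).
Box = Tuple[str, float, float, float, float]

BASE_SCORE = 10.0  # base value stated in the text
# Hypothetical penalty magnitudes, chosen only for illustration.
PENALTY: Dict[str, float] = {"out_of_room": 4.0, "door_blocked": 3.0, "overlap": 2.0}


def intersects(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share positive area."""
    return (min(a[3], b[3]) > max(a[1], b[1])
            and min(a[4], b[4]) > max(a[2], b[2]))


def inside(room: Box, obj: Box) -> bool:
    """True if obj lies entirely within a rectangular room frame."""
    return (room[1] <= obj[1] and room[2] <= obj[2]
            and obj[3] <= room[3] and obj[4] <= room[4])


def score_layout(layout: List[Box], room: Box,
                 door_zones: List[Box]) -> Tuple[float, Dict[str, float]]:
    """Return (score, per-rule deductions); penalties accumulate per violation."""
    fired: Dict[str, float] = {}

    def fire(rule: str) -> None:
        fired[rule] = fired.get(rule, 0.0) + PENALTY[rule]

    for i, obj in enumerate(layout):
        if not inside(room, obj):
            fire("out_of_room")
        if any(intersects(obj, door) for door in door_zones):
            fire("door_blocked")
        for other in layout[i + 1:]:  # each unordered pair is checked once
            if intersects(obj, other):
                fire("overlap")
    return BASE_SCORE - sum(fired.values()), fired


room = ("frame", 0.0, 0.0, 4.0, 3.0)
doors = [("door_swing", 3.0, 0.0, 4.0, 1.0)]
clean = [("bed", 0.0, 0.0, 2.0, 1.6), ("wardrobe", 0.0, 2.4, 1.5, 3.0)]
flawed = [("bed", 0.0, 0.0, 2.0, 1.6), ("wardrobe", 1.5, 1.0, 2.5, 1.6),
          ("desk", 3.2, 0.2, 4.5, 0.8)]
print(score_layout(clean, room, doors))
print(score_layout(flawed, room, doors))
```

Returning the breakdown alongside the scalar mirrors the traceability property noted above: a preference between two layouts can be explained by listing which rules fired for each\.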

### 3\.4\.Preference optimization

On top of the supervised checkpoint, we use direct preference optimization \(DPO\) to align the generator with preferences induced by the rule scorer \(Rafailov et al\., [2023](https://arxiv.org/html/2606.10953#bib.bib35)\)\. Our main recipe is synthetic\-pair DPO\. For each pseudo\-labeled layout, we construct a chosen–rejected pair by perturbing exactly one bounding box so that the rejected layout violates one scorer rule, while keeping the procedural reasoning trace identical on both sides\. The two sequences are therefore identical except for one OBJ line, which prevents the model from exploiting differences in trace style or surface form; the preference signal is localized to the placement change\.
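
A minimal sketch of this pair construction is given below\. The line format `OBJ <category> <x_min> <y_min> <x_max> <y_max>` is a hypothetical stand\-in for the DSL grammar, which is specified in the paper's appendix and not reproduced here; the function name and the choice of a containment violation are likewise assumptions\.

```python
import random
from typing import Dict, List


def make_synthetic_pair(trace: str, obj_lines: List[str],
                        room_width: float, rng: random.Random) -> Dict[str, str]:
    """Build a chosen/rejected pair that differs in exactly one OBJ line.

    Assumes a hypothetical line format:
        OBJ <category> <x_min> <y_min> <x_max> <y_max>
    """
    idx = rng.randrange(len(obj_lines))
    tag, cat, x0, y0, x1, y1 = obj_lines[idx].split()
    width = float(x1) - float(x0)
    # Shift the box so it straddles the right wall: a containment violation.
    new_x0 = room_width - width / 2
    rejected = list(obj_lines)
    rejected[idx] = f"{tag} {cat} {new_x0:.2f} {y0} {new_x0 + width:.2f} {y1}"
    # The reasoning trace is copied verbatim onto both sides of the pair.
    return {
        "chosen": trace + "\n" + "\n".join(obj_lines),
        "rejected": trace + "\n" + "\n".join(rejected),
    }


lines = ["OBJ bed 0.00 0.00 2.00 1.60", "OBJ wardrobe 0.00 2.40 1.50 3.00"]
pair = make_synthetic_pair("TRACE: place bed, then wardrobe", lines, 4.0,
                           random.Random(0))
print(pair["rejected"])
```

The shift guarantees a containment violation but may incidentally trigger other rules, such as an overlap with a neighbouring object\. A pipeline that needs exactly one fired rule would presumably re\-score each rejected layout and discard pairs where more than one rule fires\.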

We also evaluate a broader *model-pair DPO* variant. In this variant, candidate layouts are sampled from the supervised model and paired by score gap: higher-scoring samples, and in some cases pseudo-labeled layouts above a threshold, are used as chosen responses, while lower-scoring samples are used as rejected responses. Unlike synthetic pairs, model pairs can differ in object placements, reasoning traces, and other surface-form details. We therefore report this variant as an ablation: Section [4.4](https://arxiv.org/html/2606.10953#S4.SS4) shows that it can increase the rule score while degrading visual quality, indicating reward hacking. Additional pair-construction details are provided in Appendix [D](https://arxiv.org/html/2606.10953#A4).
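For contrast with synthetic pairs, pairing by score gap can be sketched as below. The greedy best-with-worst matching and the `min_gap` threshold are assumptions chosen for illustration; the actual pairing procedure is described in Appendix D.

```python
def make_model_pairs(candidates, min_gap=2.0):
    """Pair sampled layouts for one prompt by score gap.

    `candidates` is a list of (text, score) tuples. `min_gap` is a
    hypothetical threshold, not a value reported in the paper.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    pairs = []
    lo, hi = 0, len(ranked) - 1
    # Match the best remaining sample with the worst remaining one.
    while lo < hi and ranked[lo][1] - ranked[hi][1] >= min_gap:
        pairs.append({"chosen": ranked[lo][0], "rejected": ranked[hi][0]})
        lo += 1
        hi -= 1
    return pairs
```

Nothing in this construction constrains how the two texts differ, so the chosen and rejected sequences may vary in trace wording as well as in placements.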

## 4. Experiments

### 4.1. Setup

#### Training\.

The base model is Qwen3.5-9B (vision-language) (Qwen Team, [2026](https://arxiv.org/html/2606.10953#bib.bib34)); we attach per-room LoRA adapters and fine-tune one adapter per room type. Each training and inference example combines a text prompt (system message, one-shot example, structure primitives, and requested inventory) with a structure-only top-down PNG of the room. Supervised fine-tuning uses the augmented reasoning traces (50% with fail-and-fix recovery); we train for 5 epochs at learning rate $2 \times 10^{-5}$ with a cosine schedule, LoRA rank 128 and alpha 256 with dropout 0.05. Hyperparameters are identical across the four rooms. Direct preference optimization runs on top of the supervised checkpoint for 2 epochs at learning rate $1 \times 10^{-6}$, DPO regularization coefficient $\beta = 0.1$; the best-performing checkpoint is room-dependent and is selected on the in-distribution validation set.
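For reference, the coefficient $\beta$ enters through the standard DPO objective of Rafailov et al. (2023), restated here in its usual form. In this setting $x$ is the prompt (text plus room image), $y_w$ and $y_l$ are the chosen and rejected sequences, and $\pi_{\mathrm{ref}}$ is the frozen reference policy; taking the supervised checkpoint as $\pi_{\mathrm{ref}}$ is an assumption based on the statement that DPO runs on top of it.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

A smaller $\beta$ lets the policy move further from the reference for the same preference margin.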

#### Inference\.

At inference, the model samples $K$ candidate DSL layouts per prompt with temperature $0.9$ and top-$p$ $0.95$. The rule scorer ranks the $K$ candidates and selects the highest-scoring one as the system output. We use $K = 6$ for in-distribution validation on AntPlan-270 and $K = 10$ for out-of-distribution evaluation on CubiCasa5K (two inventory lists $\times$ five generations each). As frontier multimodal-agent baselines, we evaluate Kimi K2.5 (Team et al., [2026](https://arxiv.org/html/2606.10953#bib.bib42)), an open-weight 1.1T-parameter native multimodal agentic model, and GLM-5V-Turbo (GLM-V Team et al., [2026](https://arxiv.org/html/2606.10953#bib.bib16)). Both baselines are evaluated zero-shot at $K = 2$ (one inventory list $\times$ two generations) under a fixed evaluation budget. Because the candidate budget differs across methods, these frontier-scale models are included as reference zero-shot comparisons rather than as strictly matched best-of-$K$ baselines.
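The selection step amounts to best-of-$K$ sampling with the scorer as the ranking function. The sketch below uses toy stand-ins: `toy_sample` replaces the call to the fine-tuned model (where temperature and top-$p$ would be set), and `toy_score` replaces the rule scorer; both are hypothetical and exist only to make the example runnable.

```python
import random


def best_of_k(sample_fn, score_fn, k, seed=0):
    """Sample k candidates; return the best one, its score, and all scores."""
    rng = random.Random(seed)
    candidates = [sample_fn(rng) for _ in range(k)]
    scored = [(score_fn(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda sc: sc[0])
    return best, best_score, [s for s, _ in scored]


def toy_sample(rng):
    """Stand-in for model sampling: a 'layout' is a single number in [0, 1)."""
    return rng.random()


def toy_score(layout):
    """Stand-in scorer: upper bound of 10, penalized by distance from 0.5."""
    return 10.0 - 20.0 * abs(layout - 0.5)


best, best_score, all_scores = best_of_k(toy_sample, toy_score, k=6)
```

Returning all $K$ scores alongside the winner makes it possible to report both the typical candidate quality and the selected output from the same pool.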

#### Evaluation protocol\.

Out-of-distribution evaluation uses 100 deterministically sampled CubiCasa5K rooms per type: bedroom, bathroom, kitchen, and living room. For each room type, we compare the zero-shot base model, the supervised fine-tuned adapter, the corresponding synthetic-pair DPO adapter, and the zero-shot frontier multimodal-agent baselines Kimi K2.5 and GLM-5V-Turbo. *Ours* refers to the synthetic-pair DPO model described in Section [3.4](https://arxiv.org/html/2606.10953#S3.SS4).

![Refer to caption](https://arxiv.org/html/2606.10953v1/x4.png)

Figure 5. Representative per-room qualitative comparison in the schematic DSL view. Each row corresponds to a room type: bedroom, kitchen, bathroom, and living room. Columns show, from left to right, the zero-shot baseline, SFT, GLM-5V-Turbo, Kimi K2.5, and Architect-Ant (Ours). The examples illustrate typical differences in wall alignment, functional grouping, object overlap, and circulation clearance across methods.

![Refer to caption](https://arxiv.org/html/2606.10953v1/figures/AntPlanFloorPlans.png)

Figure 6. Examples of six original architectural floor plan drawings from the AntPlan-270 dataset.

#### Metrics\.

The scorer assigns a numerical score to every generated candidate, and we report two complementary views of the resulting distribution. The headline view is per-room *mean $\pm$ standard deviation* of scores across the $K$ candidates per prompt, aggregated across the evaluation set; this reflects typical generation quality and consistency. The secondary view is *best-of-$K$*, the average score of the best candidate per prompt selected by the scorer, which reflects the inference-time protocol. Both views use the same scorer and candidate pool, so they are directly comparable.
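The two views can be computed from the same per-room score lists, as in the sketch below. It follows the reading given in the Table 2 caption (per-room mean of the $K$ scores, then mean and standard deviation across rooms); the use of the sample standard deviation is an assumption, as the text does not say which estimator is used.

```python
from statistics import mean, stdev


def summarize(scores_per_room):
    """Summarize rule scores; one inner list of K candidate scores per room."""
    room_means = [mean(s) for s in scores_per_room]
    room_best = [max(s) for s in scores_per_room]
    return {
        "mean": mean(room_means),       # typical candidate quality
        "std": stdev(room_means),       # spread across rooms (sample std)
        "best_of_k": mean(room_best),   # quality of the selected output
    }
```

By construction `best_of_k` is never below `mean`, since each room's best score is at least its mean score.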

An independent visual judge complements the rule scorer. Gemini 3 Flash Preview (`thinking_level = MEDIUM`), a recent frontier VLM with strong multimodal and agentic capabilities (Google DeepMind, [2025](https://arxiv.org/html/2606.10953#bib.bib17)), receives two anonymized renders per pair with the A/B order randomized and returns one of {A, B, Tie}. The judge does not receive the rule score or violation breakdown, so it provides an evaluation signal separate from the scorer used for DPO.
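The blinding protocol can be sketched as follows: the two renders are assigned to slots A and B at random, the judge's verdict is mapped back to a method label, and preference rates are tallied. The function names, the label strings, and the choice to report ties as their own category are assumptions for illustration; the judge call itself is omitted.

```python
import random
from collections import Counter


def blind_pair(ours, other, rng):
    """Randomly assign the two renders to slots A and B; return slots and key."""
    if rng.random() < 0.5:
        return {"A": ours, "B": other}, {"A": "ours", "B": "other"}
    return {"A": other, "B": ours}, {"A": "other", "B": "ours"}


def unblind(verdict, key):
    """Map a judge verdict in {'A', 'B', 'Tie'} back to a method label."""
    return "tie" if verdict == "Tie" else key[verdict]


def preference_rates(labels):
    """Fraction of comparisons won by each side, with ties kept separate."""
    counts = Counter(labels)
    n = len(labels)
    return {k: counts[k] / n for k in ("ours", "other", "tie")}
```

Keeping the slot-to-method key outside the judge prompt is what prevents position or identity cues from leaking into the verdict.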

### 4.2. Main results: out-of-distribution evaluation on CubiCasa5K

Table [2](https://arxiv.org/html/2606.10953#S4.T2) reports rule-scorer performance on out-of-distribution CubiCasa5K rooms. The *mean $\pm$ std* columns measure the typical quality of sampled candidates, while *best-of-$K$* reports the score of the candidate selected by the inference-time scorer. Supervised fine-tuning provides the largest improvement over the zero-shot base model, increasing the overall mean score from $-8.02$ to $1.02$ and the overall best-of-$K$ score from $2.04$ to $7.27$. Synthetic-pair DPO further improves the overall mean score to $1.42$ and the overall best-of-$K$ score to $7.34$, with best-of-$K$ gains in bedroom, bathroom, and kitchen.

The frontier multimodal-agent baselines are more consistent than the zero-shot base model and achieve higher mean scores in all room types except kitchen. However, their best-of-$K$ scores remain substantially below SFT and Ours. This suggests that strong general-purpose VLMs can often avoid severe failures, but their generated layouts are not always fully correct or functionally plausible in this structured setting.

The kitchen setting remains the hardest across methods\. Kitchens often contain dense, layered structures such as cabinets, counters, islands, embedded appliances, bar seating, and table\-chair groups\. These create ambiguous 2D overlaps and functional relations that are difficult to evaluate from top\-down boxes alone\. More generally, the rule score is only a partial measure of layout quality: it captures explicit geometric and semantic violations, but it does not fully reflect visual\-functional plausibility\. We therefore complement the scorer with a VLM\-based judgment study and qualitative comparisons below\.

### 4.3. Visual quality: VLM-as-judge

We complement the rule-scorer evaluation with an independent visual judgment study. Gemini 3 Flash receives two anonymized rendered layouts per comparison, with randomized A/B order, and is asked to choose which layout has better *functional layout quality*, or to return a tie. The prompt instructs the judge to focus on spatial functionality rather than rendering style, using explicit criteria for wall intersections, door and passageway clearance, functional grouping and wall hugging, and furniture-to-furniture collisions. Full prompt details and representative judge failure cases are provided in Appendix [E](https://arxiv.org/html/2606.10953#A5).

Table [3](https://arxiv.org/html/2606.10953#S4.T3) compares our synthetic-pair DPO model against the corresponding SFT model. The judge prefers DPO in bathroom and kitchen, is nearly tied on living room, and prefers SFT on bedroom. These results suggest that DPO improves visual-functional quality in some categories but does not uniformly dominate SFT. The gains are not limited to hard rule violations: qualitative examples show that DPO often improves layout plausibility through softer spatial preferences, such as placing chairs more symmetrically around tables, tightening kitchen groupings, attaching beds and nightstands to walls, and producing more coherent object arrangements.

Table [4](https://arxiv.org/html/2606.10953#S5.T4) compares our model with frontier multimodal-agent baselines evaluated zero-shot. Ours substantially outperforms GLM-5V-Turbo in bedroom, bathroom, and living room, and is near parity in kitchen. Against Kimi K2.5, an open-weight 1.1T-parameter model, our method is close in bedroom, bathroom, and living room, but trails in kitchen. The kitchen gap reflects the same difficulty observed in the rule-score analysis: kitchens contain dense fixtures, embedded appliances, and multi-object functional groups that are difficult to generate and judge reliably from 2D renderings.

Qualitative comparisons are shown in Figures [5](https://arxiv.org/html/2606.10953#S4.F5) and [7](https://arxiv.org/html/2606.10953#Sx1.F7). The per-room schematic examples in Figure [5](https://arxiv.org/html/2606.10953#S4.F5) highlight common failure modes of the baselines, including implausible object placement, collisions, weak wall attachment, and poor functional grouping. Figure [7](https://arxiv.org/html/2606.10953#Sx1.F7) extends the comparison to full floor plans, showing both the rendered blueprint-style output and the underlying schematic DSL view. These examples also illustrate cases where scorer values alone are insufficient: visually plausible arrangements may depend on softer layout preferences, while the VLM judge can still fail on dense or ambiguous kitchen configurations; representative cases are discussed in Appendix [E](https://arxiv.org/html/2606.10953#A5).

Figure [8](https://arxiv.org/html/2606.10953#Sx1.F8) shows additional outputs from the final Architect-Ant model across different floor plans, illustrating variation in generated schematic layouts and rendered blueprint-style results.

### 4.4. Ablations

#### Pipeline stages on bedroom\.

Table [5](https://arxiv.org/html/2606.10953#S5.T5) isolates the contribution of each pipeline component on bedroom in-distribution validation. The zero-shot baseline evaluates the pretrained model without task-specific adaptation. The first SFT row uses text-only prompts and reasoning traces without fail-and-fix recovery: the trace describes a direct placement sequence and then emits the final layout. Adding fail-and-fix examples improves the score from $4.65$ to $5.23$, suggesting that recovery traces help the model learn how local placement errors should be corrected. Adding the structure-image input further improves the score to $5.98$, indicating that the top-down room rendering provides useful spatial information beyond the textual primitives. Synthetic-pair DPO gives the best score among the main pipeline variants, reaching $6.24$.

Table 2. Out-of-distribution evaluation on CubiCasa5K ($n = 100$ rooms per type, $K = 10$ candidates per prompt). Rule-scorer values; higher is better. *mean $\pm$ std* is the primary view: per-room mean of the $K$ candidate scores, aggregated across rooms with the standard deviation across rooms. *best* is the secondary view: per-room best-of-$K$ score, averaged across rooms. *Baseline* is zero-shot Qwen3.5-9B; *SFT* is supervised fine-tuning on AntPlan-270; *Ours* is SFT followed by synthetic-pair DPO. Kimi K2.5 and GLM-5V-Turbo are frontier baselines evaluated zero-shot.

Table 3. VLM-as-judge ablation on CubiCasa5K ($n = 100$ per type): *Ours* vs. the corresponding SFT model. Values are pairwise preference rates; higher *Ours%* indicates stronger preference for synthetic-pair DPO.

#### Model\-pair DPO ablation\.

The final row evaluates a broader model-pair DPO construction. Unlike synthetic pairs, where chosen and rejected outputs differ only in one perturbed bounding box, model pairs are sampled from the trained model and selected by score gap. This variant reaches a higher rule score ($6.81$) than synthetic-pair DPO, but qualitative inspection shows poorer layouts, including less stable object placement and implausible arrangements; examples are shown in Appendix [D.3](https://arxiv.org/html/2606.10953#A4.SS3). This confirms that a higher rule score does not necessarily imply better layout quality when the preference pairs expose non-placement shortcuts or reward-hacking behavior. We therefore use synthetic-pair DPO as the main recipe.

## 5. Conclusion

Table 4. VLM-as-judge comparison against frontier multimodal-agent baselines on CubiCasa5K ($n = 100$ per type). Values are pairwise preference rates between rendered layouts; higher *Ours%* indicates stronger preference for Architect-Ant.

We presented Architect-Ant, a framework for furnishing residential floor plans with object-level structured layouts. Given room geometry and a requested furniture inventory, Architect-Ant generates furniture classes and axis-aligned bounding boxes in a compact DSL that can be parsed, scored, modified, and rendered. The system combines pseudo-labeled furnished layouts from AntPlan-270, procedural reasoning traces with fail-and-fix recovery, and preference pairs derived from a deterministic rule scorer.

On out-of-distribution CubiCasa5K rooms, Architect-Ant matches or improves on the supervised baseline by rule score in three of four room types. The independent vision-language judge gives a more nuanced result: gains in bathroom and kitchen, a near tie in living room, and a regression in bedroom. Qualitative comparisons suggest that the gains often involve softer visual-functional preferences, such as coherent chair-table arrangements, tighter kitchen groupings, and better wall attachment, which are not fully captured by hard rule scores.

Table 5. Pipeline ablation on bedroom in-distribution validation ($n = 42$, best-of-6 rule score; higher is better).

| Configuration | best-of-6 |
| --- | --- |
| Baseline (zero-shot Qwen3.5-9B) | 0.17 |
| + SFT (text only, no fail-and-fix) | 4.65 |
| + SFT (text only) | 5.23 |
| + SFT (text + structure image) | 5.98 |
| + DPO (synthetic-pair, *ours*) | 6.24 |
| + DPO (model-pair; ablation) | 6.81 |

Our ablations show that the way preference pairs are constructed affects the learned layout distribution. Synthetic-pair DPO localizes each preference to a single perturbed bounding box, while broader model-pair DPO can achieve higher rule scores but produce worse qualitative layouts. This suggests that verifier-derived preferences are useful for spatial layout generation, but they should be constructed carefully so that the preference signal reflects the intended placement property.

Overall, Architect\-Ant provides a practical route for furnishing residential floor plans when clean object\-level demonstrations are limited but weak annotations and explicit spatial rules are available\. By keeping the layout as a structured object\-level representation, the method supports workflows where furniture placement must be inspected, adjusted, and rendered, including real estate visualization, interior design, and architectural floor\-plan workflows\. Similar ideas may also apply to nearby layout\-design problems that combine object placement, explicit constraints, and visual output\.

## Acknowledgments

The work is supported by funding from King Abdullah University of Science and Technology \(KAUST\)—Center of Excellence for Generative AI, under award number 5940, and a gift from Google\.

![Refer to caption](https://arxiv.org/html/2606.10953v1/x5.png)

Figure 7. Representative full-floor-plan qualitative comparison on CubiCasa5K. Each example shows the input floor plan, the extracted structural input, and generated outputs from the zero-shot baseline, Architect-Ant (Ours), GLM-5V-Turbo, and Kimi K2.5. Outputs are shown in both the rendered FLUX LoRA blueprint-style view and the schematic DSL view, enabling visual inspection of openings, object collisions, wall alignment, and functional groupings.

![Refer to caption](https://arxiv.org/html/2606.10953v1/x6.png)

Figure 8. Variations within Architect-Ant produced by our final model. Four examples of different floor plans, each showing the structural input, the generated schematic DSL view, and the rendered result.

## References

- Abouagour and Garyfallidis \(2025\)Mohamed Abouagour and Eleftherios Garyfallidis\. 2025\.ResPlan: A Large\-Scale Vector\-Graph Dataset of 17,000 Residential Floor Plans\.*ArXiv*abs/2508\.14006 \(2025\)\.[https://api\.semanticscholar\.org/CorpusID:280686492](https://api.semanticscholar.org/CorpusID:280686492)
- Aguina\-Kang et al\.\(2024\)Rio Aguina\-Kang, Maxim Gumin, Do Heon Han, Stewart Morris, Seung Jean Yoo, Aditya Ganeshan, R\. K\. Jones, Qiuhong Anna Wei, Kailiang Fu, and Daniel Ritchie\. 2024\.Open\-Universe Indoor Scene Generation using LLM Program Synthesis and Uncurated Object Databases\.*ArXiv*abs/2403\.09675 \(2024\)\.[https://api\.semanticscholar\.org/CorpusID:268509991](https://api.semanticscholar.org/CorpusID:268509991)
- Avetisyan et al\.\(2024\)Armen Avetisyan, Christopher Xie, Henry Howard\-Jenkins, Tsun\-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, et al\.2024\.SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model\. In*European Conference on Computer Vision*\. Springer, 247–263\.[https://api\.semanticscholar\.org/CorpusID:268536695](https://api.semanticscholar.org/CorpusID:268536695)
- Black Forest Labs \(2025\)Black Forest Labs\. 2025\.FLUX\.2: Frontier Visual Intelligence\.[https://bfl\.ai/blog/flux\-2](https://bfl.ai/blog/flux-2)\.
- Çelen et al\.\(2024\)Ata Çelen, Guo Han, Konrad Schindler, Luc Van Gool, Iro Armeni, Anton Obukhov, and Xi Wang\. 2024\.I\-Design: Personalized LLM Interior Designer\. In*European Conference on Computer Vision*\. Springer, 217–234\.[https://api\.semanticscholar\.org/CorpusID:268876421](https://api.semanticscholar.org/CorpusID:268876421)
- Chang et al\.\(2025\)Adrián Chang, Kai Wang, Yuanbo Li, Manolis Savva, Angel X\. Chang, and Daniel Ritchie\. 2025\.Learning to Place Objects with Programs and Iterative Self Training\.*arXiv preprint arXiv:2503\.04496*\(2025\)\.[https://api\.semanticscholar\.org/CorpusID:276812983](https://api.semanticscholar.org/CorpusID:276812983)
- da Cruz et al\.\(2021\)Steve Dias da Cruz, Will Hutchcroft, Yuguang Li, Naji Khosravan, Ivaylo Boyadzhiev, and Sing Bing Kang\. 2021\.Zillow Indoor Dataset: Annotated Floor Plans With 360° Panoramas and 3D Room Layouts\.*2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)*\(2021\), 2133–2143\.[https://api\.semanticscholar\.org/CorpusID:235694968](https://api.semanticscholar.org/CorpusID:235694968)
- Dai et al\.\(2017\)Angela Dai, Angel X\. Chang, Manolis Savva, Maciej Halber, Thomas A\. Funkhouser, and Matthias Nießner\. 2017\.ScanNet: Richly\-Annotated 3D Reconstructions of Indoor Scenes\.*2017 IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\)*\(2017\), 2432–2443\.[https://api\.semanticscholar\.org/CorpusID:7684883](https://api.semanticscholar.org/CorpusID:7684883)
- Deitke et al\.\(2022\)Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi\. 2022\.ProcTHOR: Large\-Scale Embodied AI Using Procedural Generation\.*Advances in Neural Information Processing Systems*35 \(2022\), 5982–5994\.[https://api\.semanticscholar\.org/CorpusID:249642405](https://api.semanticscholar.org/CorpusID:249642405)
- Dou et al\.\(2024\)Shihan Dou, Yan Liu, Haoxiang Jia, Enyu Zhou, Limao Xiong, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan, Zhiheng Xi, et al\.2024\.StepCoder: Improving Code Generation with Reinforcement Learning from Compiler Feedback\. In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*\. 4571–4585\.[https://api\.semanticscholar\.org/CorpusID:271915494](https://api.semanticscholar.org/CorpusID:271915494)
- Fan et al\.\(2021\)Zhiwen Fan, Lingjie Zhu, Honghua Li, Xiaohao Chen, Siyu Zhu, and Ping Tan\. 2021\.FloorPlanCAD: A Large\-Scale CAD Drawing Dataset for Panoptic Symbol Spotting\.*2021 IEEE/CVF International Conference on Computer Vision \(ICCV\)*\(2021\), 10108–10117\.[https://api\.semanticscholar\.org/CorpusID:234742455](https://api.semanticscholar.org/CorpusID:234742455)
- Feng et al\.\(2023\)Weixi Feng, Wanrong Zhu, Tsu\-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang\. 2023\.LayoutGPT: Compositional Visual Planning and Generation with Large Language Models\.*Advances in Neural Information Processing Systems*36 \(2023\), 18225–18250\.
- Fu et al\.\(2020a\)Huan Fu, Bowen Cai, Lin Gao, Ling\-Xiao Zhang, Cao Li, Zengqi Xun, Chengyue Sun, Yiyun Fei, Yu qiong Zheng, Ying Li, Yi Liu, Peng Liu, Lin Ma, Le Weng, Xiaohang Hu, Xin Ma, Qian Qian, Rongfei Jia, Binqiang Zhao, and Hao Helen Zhang\. 2020a\.3D\-FRONT: 3D Furnished Rooms with layOuts and semaNTics\.*2021 IEEE/CVF International Conference on Computer Vision \(ICCV\)*\(2020\), 10913–10922\.[https://api\.semanticscholar\.org/CorpusID:227013144](https://api.semanticscholar.org/CorpusID:227013144)
- Fu et al\.\(2020b\)Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Stephen J\. Maybank, and Dacheng Tao\. 2020b\.3D\-FUTURE: 3D Furniture Shape with TextURE\.*International Journal of Computer Vision*129 \(2020\), 3313 – 3337\.[https://api\.semanticscholar\.org/CorpusID:221819358](https://api.semanticscholar.org/CorpusID:221819358)
- GLM\-V Team et al\.\(2026\)GLM\-V Team, Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanli Wang, Yan Wang, and … Jie Tang\. 2026\.GLM\-5V\-Turbo: Toward a Native Foundation Model for Multimodal Agents\.[https://api\.semanticscholar\.org/CorpusID:287902038](https://api.semanticscholar.org/CorpusID:287902038)
- Google DeepMind \(2025\)Google DeepMind\. 2025\.*Gemini 3 Flash Model Card*\.Technical Report\. Google DeepMind\.[https://storage\.googleapis\.com/deepmind\-media/Model\-Cards/Gemini\-3\-Flash\-Model\-Card\.pdf](https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf)
- Hu et al\.\(2020\)Ruizhen Hu, Zeyu Huang, Yuhan Tang, Oliver Matias van Kaick, Hao Zhang, and Hui Huang\. 2020\.Graph2Plan: Learning Floorplan Generation from Layout Graphs\.*ACM Transactions on Graphics \(TOG\)*39 \(2020\), 118:1 – 118:14\.[https://api\.semanticscholar\.org/CorpusID:216562245](https://api.semanticscholar.org/CorpusID:216562245)
- Kalervo et al\.\(2019\)Ahti Kalervo, Juha Ylioinas, Markus Häikiö, Antti Karhu, and Juho Kannala\. 2019\.CubiCasa5K: A Dataset and an Improved Multi\-Task Model for Floorplan Image Analysis\. In*Scandinavian Conference on Image Analysis*\. Springer, 28–40\.[https://api\.semanticscholar\.org/CorpusID:102487507](https://api.semanticscholar.org/CorpusID:102487507)
- Khanna et al\.\(2023\)Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Schacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X\. Chang, and Manolis Savva\. 2023\.Habitat Synthetic Scenes Dataset \(HSSD\-200\): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation\.*2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)*\(2023\), 16384–16393\.[https://api\.semanticscholar\.org/CorpusID:259203445](https://api.semanticscholar.org/CorpusID:259203445)
- Le et al\.\(2022\)Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi\. 2022\.CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning\.*Advances in Neural Information Processing Systems*35 \(2022\), 21314–21328\.[https://api\.semanticscholar\.org/CorpusID:250280117](https://api.semanticscholar.org/CorpusID:250280117)
- Leimer et al\.\(2022\)Kurt Leimer, Paul Guerrero, Tomer Weiss, and Przemyslaw Musialski\. 2022\.LayoutEnhancer: Generating Good Indoor Layouts from Imperfect Data\.*SIGGRAPH Asia 2022 Conference Papers*\(2022\)\.[https://api\.semanticscholar\.org/CorpusID:252734701](https://api.semanticscholar.org/CorpusID:252734701)
- Li et al\.\(2024\)Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, and Chen Chen\. 2024\.ControlNet\+\+: Improving Conditional Controls with Efficient Consistency Feedback\. In*European Conference on Computer Vision*\. Springer, 129–147\.[https://api\.semanticscholar\.org/CorpusID:269043104](https://api.semanticscholar.org/CorpusID:269043104)
- Lin and Mu \(2024\)Chenguo Lin and Yadong Mu\. 2024\.InstructScene: Instruction\-Driven 3D Indoor Scene Synthesis with Semantic Graph Prior\. In*The Twelfth International Conference on Learning Representations*\.[https://openreview\.net/forum?id=LtuRgL03pI](https://openreview.net/forum?id=LtuRgL03pI)
- Liu et al\.\(2026\)Yuanqing Liu, Ziming Yang, Yulong Li, and Yue Yang\. 2026\.FloorplanVLM: A Vision\-Language Model for Floorplan Vectorization\.*ArXiv*abs/2602\.06507 \(2026\)\.[https://api\.semanticscholar\.org/CorpusID:285401853](https://api.semanticscholar.org/CorpusID:285401853)
- Luo et al\.\(2026\)Ruifeng Luo, Zhengjie Liu, Tianxiao Cheng, Jie Wang, Tongjie Wang, Fei Cheng, Fu Chai, YanPeng Li, Xingguang Wei, Haomin Wang, et al\.2026\.ArchCAD\-400K: A Large\-Scale CAD drawings Dataset and New Baseline for Panoptic Symbol Spotting\.*Advances in Neural Information Processing Systems*38 \(2026\), 127715–127739\.[https://api\.semanticscholar\.org/CorpusID:277434981](https://api.semanticscholar.org/CorpusID:277434981)
- Lv et al\.\(2024\)Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, and Yi Liu\. 2024\.RT\-DETRv2: Improved Baseline with Bag\-of\-Freebies for Real\-Time Detection Transformer\.arXiv:2407\.17140 \[cs\.CV\][https://arxiv\.org/abs/2407\.17140](https://arxiv.org/abs/2407.17140)
- Merrell et al\.\(2011\)Paul C\. Merrell, Eric Schkufza, Zeyang Li, Maneesh Agrawala, and Vladlen Koltun\. 2011\.Interactive furniture layout using interior design guidelines\.*ACM SIGGRAPH 2011 papers*\(2011\)\.[https://api\.semanticscholar\.org/CorpusID:53246134](https://api.semanticscholar.org/CorpusID:53246134)
- Nauata et al\.\(2020\)Nelson Nauata, Kai\-Hung Chang, Chin\-Yi Cheng, Greg Mori, and Yasutaka Furukawa\. 2020\.House\-GAN: Relational Generative Adversarial Networks for Graph\-constrained House Layout Generation\. In*European Conference on Computer Vision*\. Springer, 162–177\.[https://api\.semanticscholar\.org/CorpusID:212725507](https://api.semanticscholar.org/CorpusID:212725507)
- Ouyang et al\.\(2022\)Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al\.2022\.Training language models to follow instructions with human feedback\.*Advances in neural information processing systems*35 \(2022\), 27730–27744\.[https://api\.semanticscholar\.org/CorpusID:246426909](https://api.semanticscholar.org/CorpusID:246426909)
- Pan et al\.\(2023\)Xiaqing Pan, Nicholas Charron, Yongqiang Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar M\. Parkhi, Richard A\. Newcombe, and Carl Yuheng Ren\. 2023\.Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception\.*2023 IEEE/CVF International Conference on Computer Vision \(ICCV\)*\(2023\), 20076–20086\.[https://api\.semanticscholar\.org/CorpusID:259137475](https://api.semanticscholar.org/CorpusID:259137475)
- Para et al\.\(2020\)Wamiq Reyaz Para, Paul Guerrero, Tom Kelly, Leonidas J\. Guibas, and Peter Wonka\. 2020\.Generative Layout Modeling using Constraint Graphs\.*2021 IEEE/CVF International Conference on Computer Vision \(ICCV\)*\(2020\), 6670–6680\.[https://api\.semanticscholar\.org/CorpusID:227209310](https://api.semanticscholar.org/CorpusID:227209310)
- Paschalidou et al\.\(2021\)Despoina Paschalidou, Amlan Kar, Maria Shugrina, Karsten Kreis, Andreas Geiger, and Sanja Fidler\. 2021\.ATISS: Autoregressive Transformers for Indoor Scene Synthesis\. In*Neural Information Processing Systems*\.[https://api\.semanticscholar\.org/CorpusID:238419213](https://api.semanticscholar.org/CorpusID:238419213)
- Qwen Team \(2026\)Qwen Team\. 2026\.Qwen3\.5: Towards Native Multimodal Agents\.[https://qwen\.ai/blog?id=qwen3\.5](https://qwen.ai/blog?id=qwen3.5)
- Rafailov et al\.\(2023\)Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn\. 2023\.Direct Preference Optimization: Your Language Model is Secretly a Reward Model\.*Advances in neural information processing systems*36 \(2023\), 53728–53741\.[https://api\.semanticscholar\.org/CorpusID:258959321](https://api.semanticscholar.org/CorpusID:258959321)
- Roberts et al\.\(2020\)Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind\. 2020\.Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding\.*2021 IEEE/CVF International Conference on Computer Vision \(ICCV\)*\(2020\), 10892–10902\.[https://api\.semanticscholar\.org/CorpusID:226254406](https://api.semanticscholar.org/CorpusID:226254406)
- Rodionov et al\.\(2025\)Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John C\. Femiani, Bernard Ghanem, and Peter Wonka\. 2025\.FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations\. In*Proceedings of the 43rd International Conference on Machine Learning \(ICML\)*\.[https://api\.semanticscholar\.org/CorpusID:280219639](https://api.semanticscholar.org/CorpusID:280219639)
- Shabani et al\.\(2022\)Mohammad Amin Shabani, Sepidehsadat Hosseini, and Yasutaka Furukawa\. 2022\.HouseDiffusion: Vector Floorplan Generation via a Diffusion Model with Discrete and Continuous Denoising\.*2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)*\(2022\), 5466–5475\.[https://api\.semanticscholar\.org/CorpusID:254018175](https://api.semanticscholar.org/CorpusID:254018175)
- Shao et al\.\(2024\)Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al\.2024\.DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models\.*ArXiv*abs/2402\.03300 \(2024\)\.[https://api\.semanticscholar\.org/CorpusID:267412607](https://api.semanticscholar.org/CorpusID:267412607)
- Sun et al\.\(2024\)Fan\-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, and Jiajun Wu\. 2024\.LayoutVLM: Differentiable Optimization of 3D Layout via Vision\-Language Models\.*2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)*\(2024\), 29469–29478\.[https://api\.semanticscholar\.org/CorpusID:274446060](https://api.semanticscholar.org/CorpusID:274446060)
- Tang et al\.\(2023\)Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, and Matthias Nießner\. 2023\.DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis\.*2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)*\(2023\), 20507–20518\.[https://api\.semanticscholar\.org/CorpusID:268363865](https://api.semanticscholar.org/CorpusID:268363865)
- Team et al\.\(2026\)Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S\. H\. Cai, Yuan Cao, Y\. Charles, H\. S\. Che, Cheng Chen, Guanduo Chen, and … Xinxing Zu\. 2026\.Kimi K2\.5: Visual Agentic Intelligence\.arXiv:2602\.02276 \[cs\.CL\][https://api\.semanticscholar\.org/CorpusID:285269548](https://api.semanticscholar.org/CorpusID:285269548)
- Van Engelenburg et al\.\(2024\)Casper Van Engelenburg, Fatemeh Mostafavi, Emanuel Kuhn, Yuntae Jeon, Michael Franzen, Matthias Standfest, Jan van Gemert, and Seyran Khademi\. 2024\.MSD: A Benchmark Dataset for Floor Plan Generation of Building Complexes\. In*European Conference on Computer Vision*\. Springer, 60–75\.[https://api\.semanticscholar\.org/CorpusID:271213468](https://api.semanticscholar.org/CorpusID:271213468)
- Wang et al\.\(2024\)Can Wang, Hongliang Zhong, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao\. 2024\.Chat2Layout: Interactive 3D Furniture Layout With a Multimodal LLM\.*IEEE transactions on visualization and computer graphics*32 \(2024\), 2243–2259\.[https://api\.semanticscholar\.org/CorpusID:271571635](https://api.semanticscholar.org/CorpusID:271571635)
- Wang et al\.\(2026\)Yuxi Wang, Junran Peng, Genghao Zhang, Chuanchen Luo, Shibiao Xu, Man Zhang, and Zhaoxiang Zhang\. 2026\.FurniScene: A Large\-scale 3D Room Dataset with Intricate Furnishing Scenes\.*International Journal of Computer Vision*134, 3 \(2026\), 125\.[https://api\.semanticscholar\.org/CorpusID:266844416](https://api.semanticscholar.org/CorpusID:266844416)
- Weyssow et al\.\(2026\)Martin Weyssow, Aton Kamanda, Xin Zhou, and Houari Sahraoui\. 2026\.CodeUltraFeedback: An LLM\-as\-a\-Judge Dataset for Aligning Large Language Models to Coding Preferences\.*ACM Transactions on Software Engineering and Methodology*35, 3 \(2026\), 1–36\.[https://api\.semanticscholar\.org/CorpusID:268385144](https://api.semanticscholar.org/CorpusID:268385144)
- Wu et al\.\(2019\)Wenming Wu, Xiaoming Fu, Rui Tang, Yuhan Wang, Yuanhang Qi, and Ligang Liu\. 2019\.Data\-driven interior plan generation for residential buildings\.*ACM Transactions on Graphics \(TOG\)*38 \(2019\), 1 – 12\.[https://api\.semanticscholar\.org/CorpusID:207998029](https://api.semanticscholar.org/CorpusID:207998029)
- Yang et al\.\(2024\)Yixuan Yang, Junru Lu, Zixiang Zhao, Zhen Luo, James Jianqiao Yu, Victor Sanchez, and Feng Zheng\. 2024\.LLplace: The 3D Indoor Scene Layout Generation and Editing via Large Language Model\.*ArXiv*abs/2406\.03866 \(2024\)\.[https://api\.semanticscholar\.org/CorpusID:270286118](https://api.semanticscholar.org/CorpusID:270286118)
- Yang et al\.\(2025\)Yixuan Yang, Zhen Luo, Tongsheng Ding, Junru Lu, Mingqi Gao, Jinyu Yang, Victor Sanchez, and Feng Zheng\. 2025\.LLM\-driven Indoor Scene Layout Generation via Scaled Human\-aligned Data Synthesis and Multi\-Stage Preference Optimization\.*Advances in Neural Information Processing Systems*38 \(2025\), 42499–42529\.[https://api\.semanticscholar\.org/CorpusID:279251590](https://api.semanticscholar.org/CorpusID:279251590)
- Yang et al\.\(2023\)Yue Yang, Fan\-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison\-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark\. 2023\.Holodeck: Language Guided Generation of 3D Embodied AI Environments\.*2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\)*\(2023\), 16277–16287\.[https://api\.semanticscholar\.org/CorpusID:266210109](https://api.semanticscholar.org/CorpusID:266210109)
- Yu et al\.\(2011\)Lap\-Fai Yu, Sai Kit Yeung, Chi\-Keung Tang, Demetri Terzopoulos, Tony F Chan, and Stanley J Osher\. 2011\.Make it home: Automatic Optimization of Furniture Arrangement\.*ACM SIGGRAPH 2011 papers*30, 4 \(2011\), 86\.[https://api\.semanticscholar\.org/CorpusID:14227](https://api.semanticscholar.org/CorpusID:14227)
- Zeng et al\.\(2019\)Zhiliang Zeng, Xianzhi Li, Ying Kin Yu, and Chi\-Wing Fu\. 2019\.Deep Floor Plan Recognition Using a Multi\-Task Network With Room\-Boundary\-Guided Attention\.*2019 IEEE/CVF International Conference on Computer Vision \(ICCV\)*\(2019\), 9095–9103\.[https://api\.semanticscholar\.org/CorpusID:201670016](https://api.semanticscholar.org/CorpusID:201670016)
- Zhang et al\.\(2023\)Lvmin Zhang, Anyi Rao, and Maneesh Agrawala\. 2023\.Adding Conditional Control to Text\-to\-Image Diffusion Models\.*2023 IEEE/CVF International Conference on Computer Vision \(ICCV\)*\(2023\), 3813–3824\.[https://api\.semanticscholar\.org/CorpusID:256827727](https://api.semanticscholar.org/CorpusID:256827727)
- Zheng et al\.\(2020\)Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou\. 2020\.Structured3D: A Large Photo\-realistic Dataset for Structured 3D Modeling\. In*European Conference on Computer Vision*\. Springer, 519–535\.[https://api\.semanticscholar\.org/CorpusID:199064623](https://api.semanticscholar.org/CorpusID:199064623)

## Appendix A AntPlan\-270 Details

AntPlan\-270 contains 270 professionally drawn residential floor plans collected from publicly accessible online sources\. We process each plan with an RT\-DETR\-X bounding\-box extractor trained on CubiCasa5K to recover per\-room geometry in metric units, including walls, doors, windows, and railings\. Furniture boxes are obtained with a separate detector trained on a hand\-labeled subset and then manually reviewed; we therefore treat them as corrected pseudo\-labels rather than exhaustively redrawn annotations\. Each floor plan is decomposed into room\-level samples, and samples with fewer than two whitelisted furniture objects are discarded\.

#### Per\-room raw statistics\.

Table [6](https://arxiv.org/html/2606.10953#A1.T6) summarizes the room\-level contents of AntPlan\-270 before the train/validation split and before augmentation\. Each room type has a fixed *class whitelist*\. Boxes whose class is outside the corresponding whitelist are removed before sample filtering\. “Samples” denotes the number of room samples that retain at least two whitelisted objects\. “Classes” is the number of distinct whitelisted classes observed in the retained layouts\. “Kept objs” counts the retained whitelisted boxes, while “Min”, “Mean”, and “Max” report the number of retained objects per sample\.
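The filtering and the per\-sample columns can be sketched as follows\. This is a minimal illustration, assuming each room sample is reduced to a list of class names; `filter_samples`, `summarize`, and the example whitelist are hypothetical names and values, not part of the released pipeline\.

```python
from statistics import mean

def filter_samples(samples, whitelist, min_objects=2):
    """Drop non-whitelisted boxes, then drop samples left with too few objects."""
    kept = []
    for classes in samples:
        retained = [c for c in classes if c in whitelist]
        if len(retained) >= min_objects:
            kept.append(retained)
    return kept

def summarize(kept):
    """Compute the Samples / Classes / Kept objs / Min / Mean / Max columns."""
    counts = [len(sample) for sample in kept]
    if not counts:
        return {"samples": 0, "classes": 0, "kept_objs": 0}
    return {
        "samples": len(kept),
        "classes": len({c for sample in kept for c in sample}),
        "kept_objs": sum(counts),
        "min": min(counts),
        "mean": mean(counts),
        "max": max(counts),
    }

# Hypothetical bedroom whitelist and three raw room samples.
whitelist = {"bed", "nightstand", "wardrobe"}
raw = [
    ["bed", "nightstand", "vase"],
    ["bed", "vase"],  # only one whitelisted object, so it is discarded
    ["bed", "wardrobe", "nightstand", "nightstand"],
]
stats = summarize(filter_samples(raw, whitelist))
```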

Table 6\. Per\-room statistics for AntPlan\-270 before splitting and augmentation\. The “Raw objs” column counts all ground\-truth boxes in the source annotations, and “Kept objs” counts boxes retained after applying the per\-room class whitelist\. The per\-sample “Min”, “Mean”, and “Max” columns are computed over retained boxes\. In the “Total” row, “Samples”, “Raw objs”, and “Kept objs” are summed across room types; “Classes” reports the union of observed whitelisted classes; and “Mean” is the sample\-weighted mean number of retained objects per room sample\. ∗ Union across rooms; classes overlap across room types \(e\.g\., `radiator`, `chair`, `table`, `shelf`, `curtains`, and `rug`\)\. Discarded boxes \(1666 in total, 14%\) correspond to classes outside the room\-specific scorer whitelists, including small decorative or auxiliary objects, unsupported accessories, and rare or inconsistent source labels\.

#### Train/validation splits used for SFT and DPO\.

We split samples 90/10 by source floor\-plan ID so that augmented versions of the same plan cannot appear in both training and validation\. Augmentation is applied only to the training split: each training sample is kept in its original form and augmented by horizontal flipping and $180^{\circ}$ rotation, yielding a three\-fold training set\. Table [7](https://arxiv.org/html/2606.10953#A1.T7) reports the per\-room split sizes used to train the SFT checkpoints in the main paper\.
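A minimal sketch of the plan\-level split and the two augmentations, assuming boxes are `(x, y, w, h)` tuples in a top\-left\-origin room frame whose size `room_w`, `room_h` is known\. The function names and the seeded shuffle are illustrative assumptions; the architectural geometry \(walls, doors, windows\) would need the same transforms\.

```python
import random

def validation_plan_ids(plan_ids, val_fraction=0.10, seed=0):
    """Pick whole floor plans for validation so no plan straddles the split."""
    unique = sorted(set(plan_ids))
    random.Random(seed).shuffle(unique)
    n_val = max(1, round(len(unique) * val_fraction))
    return set(unique[:n_val])

def hflip(box, room_w):
    """Mirror a box across the vertical centre line of the room."""
    x, y, w, h = box
    return (round(room_w - x - w, 2), y, w, h)

def rot180(box, room_w, room_h):
    """Rotate a box by 180 degrees about the room centre (its size is unchanged)."""
    x, y, w, h = box
    return (round(room_w - x - w, 2), round(room_h - y - h, 2), w, h)

def augment(boxes, room_w, room_h):
    """Return the original layout plus its flipped and rotated copies."""
    return [
        list(boxes),
        [hflip(b, room_w) for b in boxes],
        [rot180(b, room_w, room_h) for b in boxes],
    ]
```

Because neither transform swaps width and height, the axis\-aligned aspect of every object is preserved, which is consistent with a DSL that carries no rotation angle\.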

Table 7\. Per\-room split sizes used for training and evaluation\. “Train \(aug\)” includes the original training samples plus horizontal flip and $180^{\circ}$ rotation; validation contains only original, non\-augmented samples\.

#### Class frequency\.

Table [8](https://arxiv.org/html/2606.10953#A1.T8) lists the ten most frequent classes for each room type, counted over all training and validation layouts\. The remaining long tail consists mostly of decorative objects \(e\.g\., `rug`, `curtains`, `plant`, `mirror`, `floor_lamp`\) and small auxiliary fixtures \(e\.g\., `towel_warmer`, `side_table`, `range_hood`\)\. Although rare, these objects account for many difficult placement edge cases\.

Table 8\. Top\-10 class frequencies per room type\. Counts include both training and validation layouts\.

#### Comparison with other furniture\-layout corpora\.

AntPlan\-270 is small but heavily curated\. Unlike abstract scene\-graph datasets, it represents rooms in real\-world metric coordinates\. It differs from large 3D scene corpora such as 3D\-FRONT and ScanNet in two main ways: it is 2D and bounding\-box\-only, with no mesh geometry or explicit orientation angle; and each room includes both parametric architectural geometry \(walls, doors, windows, and railings\) and per\-class furniture supervision\. In contrast, most 2D floor\-plan datasets, including CubiCasa5K and RPLAN, provide structural annotations but no furniture boxes, and therefore cannot directly train or evaluate furnishing models\.

## Appendix B DSL Format

Furniture layouts are represented with a compact line\-oriented grammar:

```
FURNITURE
OBJ class=<snake_case> x=<m> y=<m> w=<m> h=<m>
...
END
```

Each layout starts with the literal `FURNITURE`, ends with `END`, and contains zero or more `OBJ` lines\. Each object line specifies the class token, the top\-left corner `(x, y)`, and the width and height `(w, h)` of an axis\-aligned bounding box\. All coordinates are measured in metres and written with two decimal places\. The format can be parsed in a single linear scan, edited object by object, and rasterized into a colored room\-type mask using a fixed schematic renderer\.

The long\-side orientation of an object is implied by `w` and `h`\. The DSL does not encode a separate rotation angle; throughout this paper, references to orientation therefore mean axis\-aligned aspect rather than continuous rotation\. This representation makes geometric and combinatorial constraints, including overlap, containment, clearance, and pairwise relations, directly computable from the parsed objects, without an intermediate reconstruction step\.
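As an illustration of how such checks reduce to arithmetic on parsed boxes, the sketch below computes containment in the room frame, the gap between two boxes, and the implied long\-side axis\. The `Box` type and the helper names are hypothetical, and any room size passed to `inside_room` is an assumed value rather than one given in the paper\.

```python
from dataclasses import dataclass
from math import hypot

@dataclass(frozen=True)
class Box:
    cls: str
    x: float  # left edge, metres
    y: float  # top edge, metres
    w: float  # width, metres
    h: float  # height, metres

def inside_room(b, room_w, room_h, eps=1e-9):
    """True when the box lies entirely within the room frame."""
    return (b.x >= -eps and b.y >= -eps
            and b.x + b.w <= room_w + eps
            and b.y + b.h <= room_h + eps)

def gap(a, b):
    """Shortest distance between two axis-aligned boxes (0 if they touch or overlap)."""
    dx = max(a.x - (b.x + b.w), b.x - (a.x + a.w), 0.0)
    dy = max(a.y - (b.y + b.h), b.y - (a.y + a.h), 0.0)
    return hypot(dx, dy)

def long_axis(b):
    """Axis along which the longer side runs; ties default to 'x'."""
    return "x" if b.w >= b.h else "y"

# Two boxes with illustrative coordinates.
wardrobe = Box("wardrobe", 0.12, 0.12, 0.60, 1.80)
bed = Box("bed", 1.60, 1.38, 1.60, 2.00)
clearance = gap(wardrobe, bed)  # horizontal gap between wardrobe and bed
```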

#### Example\.

A minimal bedroom layout in the DSL is shown below\.

```
FURNITURE
OBJ class=bed         x=1.60 y=1.38 w=1.60 h=2.00
OBJ class=nightstand  x=1.15 y=2.98 w=0.45 h=0.40
OBJ class=nightstand  x=3.20 y=2.98 w=0.45 h=0.40
OBJ class=wardrobe    x=0.12 y=0.12 w=0.60 h=1.80
OBJ class=tv_stand    x=2.88 y=0.12 w=1.20 h=0.40
OBJ class=tv          x=2.98 y=0.20 w=1.00 h=0.10
END
```

The coordinate origin is the top\-left corner of the room frame; `x` increases to the right, and `y` increases downward\. The same line\-oriented structure is used for all four room types; only the whitelist of allowed `class` tokens changes across rooms\.

#### Strict\-mode parsing\.

Generation is decoded greedily until the first `END` token\. Lines that do not match the `OBJ class=…` regular expression are dropped, and an error is logged\. If no `FURNITURE…END` block is present, the layout receives a score of $-15$\. This penalty dominates all other rules, making malformed completions effectively unusable and encouraging the model to learn a valid surface form during SFT\.
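A strict\-mode parser along these lines might look as follows\. The exact regular expression is not given in the text, so `OBJ_RE` is an assumed pattern \(snake\_case class, two\-decimal coordinates, optional sign on `x` and `y`\), and `parse_layout` is a hypothetical name\. The function returns `None` for the objects when no block is found, leaving the $-15$ rule to the caller\.

```python
import re

# Assumed pattern; the paper only states that lines must match "OBJ class=...".
OBJ_RE = re.compile(
    r"^OBJ\s+class=([a-z][a-z0-9_]*)"
    r"\s+x=(-?\d+\.\d{2})\s+y=(-?\d+\.\d{2})"
    r"\s+w=(\d+\.\d{2})\s+h=(\d+\.\d{2})$"
)

def parse_layout(text):
    """Return (objects, errors); objects is None if no FURNITURE...END block exists."""
    lines = [line.strip() for line in text.splitlines()]
    try:
        start = lines.index("FURNITURE")
        end = lines.index("END", start + 1)  # stop at the first END after FURNITURE
    except ValueError:
        return None, ["missing FURNITURE...END block"]

    objects, errors = [], []
    for line in lines[start + 1:end]:
        if not line:
            continue
        match = OBJ_RE.match(line)
        if match is None:
            errors.append(f"dropped malformed line: {line!r}")  # logged, not fatal
            continue
        cls, x, y, w, h = match.groups()
        objects.append((cls, float(x), float(y), float(w), float(h)))
    return objects, errors
```

Anything before `FURNITURE` \(for example a reasoning trace\) and anything after the first `END` is ignored, which mirrors the single linear scan described above\.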

## Appendix C Score Function Details

The rule\-based scorer assigns a real\-valued score to each parsed DSL layout\. Each layout starts from a base score of $+10$, and every rule violation deducts a penalty proportional to its severity\. Table [9](https://arxiv.org/html/2606.10953#A3.T9) lists all rules, their penalty schedules, and the thresholds used by the scorer\.

Table 9\. Rule\-based scorer penalties\. All thresholds are measured in metres or as fractions of object area\. “Ratio” denotes the intersection\-over\-smaller\-area ratio between two bounding boxes\. Rules marked as per\-room apply only when they are enabled by the corresponding room specification\.

| Rule | Penalty | Trigger |
| --- | --- | --- |
| `format_errors_no_objects` | $-15$ | no `FURNITURE…END` block |
| `out_of_bounds` | $-2$ each | object exits the room frame |
| `wall_overlap` \(light\) | $-1$ each | 10%–40% of object area inside wall |
| `wall_overlap` \(severe\) | $-3$ each | \>40% of object area inside wall |
| `<class>_not_at_wall_strict` | $-2$ each | strict\-wall classes with \>10% in wall \(per\-room\) |
| `internal_not_in_wall` | $-2$ each | wall\-internal class fails to overlap a wall |
| `rail_overlap` | $-1$ each | object intersects a railing |
| `window_overlap` | $-1$ each | \>5% of object area inside a window |
| `door_overlap` | $-2$ each | \>5% of object area inside a door |
| `door_blocked` | $-2$ each | object lies in door swing zone \(0\.60 m deep\) |
| `fixture_not_at_wall` | $-1$ each | wall\-touch class has gap \>0\.15 m to nearest wall |
| `radiator_misaligned` | $-1$ each | radiator\-like object has longer side not parallel to wall |
| `short_side_not_to_wall` | $-1$ each | per\-room short\-side\-to\-wall class has wrong orientation |
| `disallowed_overlap` \(light\) | $-1$ | pair overlap ratio 10%–15% |
| `disallowed_overlap` \(medium\) | $-2$ | pair overlap ratio 15%–50% |
| `disallowed_overlap` \(severe\) | $-4$ | ratio ≥50% or pair in `forbidden_pairs` |
| `self_overlap_excess` | $-2$ each | two same\-class objects exceed per\-room overlap cap |
| `appliance_not_at_wall` | $-1$ each | kitchen appliance touches neither wall, counter, nor island |
| `appliance_partial_in_countertop` | $-2$ each | appliance is 5%–95% inside a countertop |
| `window_blocked_by_blocker` | $-2$ each | fridge/cabinet/wardrobe in window\-front zone \(0\.40 m\) |
| `chair_fully_under` | $-1$ each | chair bbox fully inside table or island |
| `chair_far_from_seating` | $-1$ each | chair more than 0\.60 m from any table/counter/island |
| `chair_not_tucked` | $-0.5$ each | per\-room facing chair fails minimum tuck ratio |
| `chair_distribution_imbalanced` | $-1$/side | ≥3 chairs on one side of table \(capped at $-2$\) |
| `island_no_aisle` | $-1$ fixed | freestanding island has all four sides blocked |
| `insufficient_clearance` | proportional | table/island within 0\.60 m of non\-island countertop |
| `furniture_not_in_line` | $-1$ each | kitchen anchor class is neither wall\- nor transitively anchored |
| `inventory_mismatch` | $-2$/item \(cap $-8$\) | class counts deviate from `REQUEST` |

#### Global thresholds\.

All rules share one set of constants: `wall_touch_tolerance` $= 0.15$ m, `wall_overlap_ratio` $= 0.10$, `wall_partial_internal_ratio` $= 0.60$, `door_clearance_depth` $= 0.60$ m, `opening_overlap_tolerance` $= 0.05$, `pair_overlap_touch_ratio` $= 0.10$, `pair_overlap_mod_ratio` $= 0.15$, and `pair_overlap_large_ratio` $= 0.50$\. Given the parsed DSL and the room geometry, penalties are deterministic\. The same scorer can therefore be used both as the DPO reward signal and as the best\-of\-$N$ selector at inference time\.
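To make the arithmetic concrete, the sketch below implements only two of the Table 9 rules \(`out_of_bounds` and the three `disallowed_overlap` tiers\) together with a best\-of\-$N$ selector\. It is not the released scorer: the `allowed_pairs` argument, the handling of ratios that fall exactly on a tier boundary, and the room size in the example are assumptions\.

```python
from itertools import combinations

BASE_SCORE = 10.0
PAIR_OVERLAP_TOUCH_RATIO = 0.10
PAIR_OVERLAP_MOD_RATIO = 0.15
PAIR_OVERLAP_LARGE_RATIO = 0.50

def overlap_ratio(a, b):
    """Intersection area over the smaller box area; boxes are (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(min(ax + aw, bx + bw) - max(ax, bx), 0.0)
    iy = max(min(ay + ah, by + bh) - max(ay, by), 0.0)
    smaller = min(aw * ah, bw * bh)
    return (ix * iy) / smaller if smaller > 0 else 0.0

def pair_penalty(ratio):
    """Map a pair overlap ratio to the light / medium / severe tiers."""
    if ratio >= PAIR_OVERLAP_LARGE_RATIO:
        return -4.0
    if ratio >= PAIR_OVERLAP_MOD_RATIO:
        return -2.0
    if ratio >= PAIR_OVERLAP_TOUCH_RATIO:
        return -1.0
    return 0.0

def score_layout(objects, room_w, room_h, allowed_pairs=frozenset(), eps=1e-9):
    """Score a list of (cls, x, y, w, h) using two of the Table 9 rules."""
    score = BASE_SCORE
    for _, x, y, w, h in objects:
        if x < -eps or y < -eps or x + w > room_w + eps or y + h > room_h + eps:
            score -= 2.0  # out_of_bounds
    for (ca, *a), (cb, *b) in combinations(objects, 2):
        if frozenset((ca, cb)) in allowed_pairs:
            continue  # assumed-permitted pairing, e.g. a TV on its stand
        score += pair_penalty(overlap_ratio(a, b))  # disallowed_overlap
    return score

def best_of_n(candidates, room_w, room_h, allowed_pairs=frozenset()):
    """Return the highest-scoring candidate layout."""
    return max(candidates,
               key=lambda objs: score_layout(objs, room_w, room_h, allowed_pairs))

layout = [
    ("bed", 1.60, 1.38, 1.60, 2.00),
    ("nightstand", 3.20, 2.98, 0.45, 0.40),
    ("tv_stand", 2.88, 0.12, 1.20, 0.40),
    ("tv", 2.98, 0.20, 1.00, 0.10),
]
tv_ok = frozenset({frozenset({"tv", "tv_stand"})})
score = score_layout(layout, 4.20, 3.50, tv_ok)  # assumed 4.20 m x 3.50 m room
```

Because every penalty is a pure function of the parsed boxes and the room geometry, scoring the same layout twice gives the same value, which is what allows one scorer to serve both roles\.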

#### Example score trace\.

Table [10](https://arxiv.org/html/2606.10953#A3.T10) reports the scorer trace for the Figure [4](https://arxiv.org/html/2606.10953#S3.F4) bedroom variants\. It makes the score arithmetic explicit: each row starts from the same $+10$ base score and subtracts the listed penalties to obtain the final score shown in the figure\.

Table 10\. Per\-rule breakdown for the Figure [4](https://arxiv.org/html/2606.10953#S3.F4) variants\. The base score is $+10$; penalties are summed to obtain the listed total\.

## Appendix D Preference\-Pair Construction

DPO requires preference pairs $(\text{chosen}, \text{rejected})$ of model completions for the same prompt\. We use two complementary pair\-construction recipes\. Both start from the same SFT\-aug checkpoint and use the same training hyperparameters: DPO regularization coefficient $\beta = 0.1$, learning rate $10^{-6}$ with cosine decay, two epochs, and checkpointing every 10 or 25 steps\.
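For reference, the per\-pair objective can be sketched as below, assuming the standard DPO loss $-\log\sigma\big(\beta\,[(\log\pi_\theta(c \mid x) - \log\pi_{\text{ref}}(c \mid x)) - (\log\pi_\theta(r \mid x) - \log\pi_{\text{ref}}(r \mid x))]\big)$ with a frozen reference policy\. The appendix does not spell out the loss, so this form is an assumption, and the function name and scalar log\-probability inputs are illustrative\.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # completion than the reference policy does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# With no margin the loss is log(2); it falls as the chosen side gains probability.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
improved = dpo_loss(-4.0, -9.0, -5.0, -5.0)
```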

### D\.1\. Strict\-pair DPO

For every validation prompt, we sample up to eight candidate completions from the SFT\-aug checkpoint and score them with the rule\-based scorer\. We define $\theta_{\text{good}}$ as the minimum score for a sampled model candidate to enter the chosen pool, $\theta_{gt}$ as the minimum score for a GT layout to enter the chosen pool, $\delta_{\min}$ as the required chosen–rejected score gap, and $K_{\text{pairs}}$ as the maximum number of preference pairs emitted per prompt\. The chosen pool is

$$\text{chosen pool} = \{\,\text{GT} : \text{score}(\text{GT}) \geq \theta_{gt}\,\} \cup \{\,\text{best candidate} : \text{score} \geq \theta_{\text{good}}\,\}.$$

For each chosen completion $c$, we sample a rejected completion $r$ from the candidate pool subject to

$$\text{score}(c) - \text{score}(r) \geq \delta_{\min}, \qquad \text{score}(r) < \theta_{\text{good}}, \qquad r \neq c.$$

At most $K_{\text{pairs}}$ pairs are emitted per prompt\. The procedural reasoning trace is preserved on both sides of the pair, so the gradient signal primarily isolates placement quality rather than surface form\.
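The selection rule can be sketched as follows, under the assumption that one rejected completion is drawn per chosen completion and that the $K_{\text{pairs}}$ cap is applied in pool order \(the text leaves both details open\)\. The function name, argument names, and seeded sampler are hypothetical\.

```python
import random

def build_strict_pairs(gt, gt_score, candidates, theta_good, theta_gt,
                       delta_min, k_pairs, seed=0):
    """Build (chosen, rejected) pairs for one prompt.

    candidates is a list of (completion, score) tuples sampled from the model.
    """
    rng = random.Random(seed)

    # Chosen pool: the GT layout if it scores well enough, plus the best
    # sampled candidate if it clears theta_good.
    chosen_pool = []
    if gt_score >= theta_gt:
        chosen_pool.append((gt, gt_score))
    if candidates:
        best = max(candidates, key=lambda item: item[1])
        if best[1] >= theta_good:
            chosen_pool.append(best)

    pairs = []
    for chosen, chosen_score in chosen_pool:
        if len(pairs) >= k_pairs:
            break
        # Rejected side: below theta_good, far enough below the chosen score,
        # and not the chosen completion itself.
        eligible = [
            completion for completion, score in candidates
            if score < theta_good
            and chosen_score - score >= delta_min
            and completion != chosen
        ]
        if eligible:
            pairs.append((chosen, rng.choice(eligible)))
    return pairs
```

When no sampled candidate clears $\theta_{\text{good}}$, the chosen pool contains only the GT layout, so every emitted pair compares GT against a low\-scoring sample\.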

Table [11](https://arxiv.org/html/2606.10953#A4.T11) lists the per\-room hyperparameters and resulting pair counts\. Bedroom and living room produce several hundred pairs with a healthy mixture of GT\-as\-chosen and best\-candidate\-as\-chosen examples\. Kitchen is the most data\-starved setting: 62% of prompts have no candidate above $\theta_{\text{good}}$, so 70% of chosen entries fall back to GT and the median chosen–rejected score gap rises to 9\.0\. In this regime, the strict recipe trains on “GT vs\. garbage” comparisons and tends to memorize GT layouts rather than generalize\. This failure mode is analyzed in Appendix [D\.3](https://arxiv.org/html/2606.10953#A4.SS3)\.

Table 11\. Strict\-pair hyperparameters and resulting pair counts\.

### D\.2\. Synthetic\-pair DPO

Strict pairs require the SFT model to be strong enough that its best samples are separated from its worst samples by a meaningful score margin\. When the chosen pool is sparse, as in kitchens, strict\-pair DPO can degenerate into memorization\. Synthetic pairs avoid this failure mode by constructing the rejected side manually\. Starting from a GT layout, we perturb exactly one bounding box in a way that violates exactly one scorer rule, while leaving the rest of the layout and the procedural reasoning trace byte\-identical\. The preference contrast is therefore a single placement edit rather than a broader stylistic difference\. Table [12](https://arxiv.org/html/2606.10953#A4.T12) summarizes the perturbations\.

Table 12\. Synthetic perturbations used to generate the rejected side of each pair\. Each perturbation is designed to violate one scorer rule from Appendix [C](https://arxiv.org/html/2606.10953#A3); one perturbation is sampled per pair\.

#### Sampling\.

For every GT prompt, we sample one perturbation uniformly from the corresponding room\-specific list\. We retry up to four times when a perturbation is infeasible \(for example, when there is no valid wall from which to pull an anchor object\)\. Both sides of each pair use the same prompt, inventory, and procedural reasoning trace; they differ only in the perturbed OBJ placement\. This keeps the DPO signal focused on the placement change rather than differences in wording or trace structure\.
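The sampling loop can be sketched as below\. The two perturbation functions are hypothetical stand\-ins for the Table 12 entries, and two further choices are assumptions: that “retry up to four times” means one initial attempt plus four retries, and that each retry resamples the perturbation type\. A fuller implementation would also re\-score the result to confirm that exactly one rule is violated\.

```python
import random

def push_out_of_bounds(objects, room_w, room_h, rng):
    """Hypothetical perturbation: slide one box past the right room edge."""
    i = rng.randrange(len(objects))
    cls, x, y, w, h = objects[i]
    return i, (cls, room_w - w / 2, y, w, h)

def pull_from_wall(objects, room_w, room_h, rng, tol=0.15, shift=0.50):
    """Hypothetical perturbation: move a left-wall-touching box off the wall."""
    at_wall = [i for i, obj in enumerate(objects) if obj[1] <= tol]
    if not at_wall:
        return None  # infeasible: nothing is anchored to the left wall
    i = rng.choice(at_wall)
    cls, x, y, w, h = objects[i]
    return i, (cls, x + shift, y, w, h)

def make_rejected(objects, room_w, room_h, perturbations, rng, max_retries=4):
    """Return a copy of the GT layout with exactly one box perturbed, or None."""
    for _ in range(1 + max_retries):
        perturb = rng.choice(perturbations)  # uniform over the room-specific list
        result = perturb(objects, room_w, room_h, rng)
        if result is not None:
            i, new_obj = result
            # All other objects are copied unchanged, so only one OBJ line differs.
            return objects[:i] + [new_obj] + objects[i + 1:]
    return None
```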

### D\.3\. Strict\-pair versus synthetic\-pair DPO

We also compare strict\-pair and synthetic\-pair DPO on bedrooms\. The selected strict\-pair checkpoint achieves a higher rule score than the selected synthetic\-pair checkpoint in both in\-distribution and OOD evaluation \(Table [13](https://arxiv.org/html/2606.10953#A4.T13)\)\. However, qualitative inspection reveals cases where this gain reflects reward hacking rather than better layouts \(Figure [9](https://arxiv.org/html/2606.10953#A4.F9)\)\. Some strict\-pair outputs satisfy the written rules while producing looser or less functional arrangements\. Synthetic\-pair DPO is more conservative by construction, because each pair differs by a single bounding\-box perturbation\. We therefore use the selected synthetic\-pair checkpoint in the main pipeline, where visual\-functional quality is more important than maximizing the rule score alone\.

Table 13\. Bedroom rule\-score comparison of strict\-pair and synthetic\-pair DPO\.

![Refer to caption](https://arxiv.org/html/2606.10953v1/x7.png)

\(a\) Strict\-pair DPO scores higher, but leaves large furniture less tightly arranged\.

![Refer to caption](https://arxiv.org/html/2606.10953v1/x8.png)

\(b\) Both outputs receive the same rule score, although the synthetic\-pair layout is visually tighter\.

Figure 9\. Bedroom examples comparing strict\-pair and synthetic\-pair DPO on CubiCasa rooms\. The examples illustrate that higher rule score does not always correspond to better visual\-functional layout quality\.

## Appendix E VLM\-as\-Judge Pipeline

The rule\-based scorer is also the DPO reward signal, so DPO checkpoints are reward\-greedy by construction\. To obtain an independent estimate of layout quality, we use a VLM\-as\-judge pipeline\. We render two layouts on the same empty floor plan, pass the two full\-resolution PNGs to Gemini 3 Flash Preview as separate images rather than as a single composite, set `thinking_level=MEDIUM`, and parse a structured `WINNER / REASON / CONFIDENCE` response\.
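A sketch of the bookkeeping around the judge call is given below\. The model call itself is omitted\. The `FIELD: value` line syntax, the slot labels A/B/TIE, and the function names are assumptions made for illustration; the actual response format is defined by the released prompt\.

```python
import random
import re

def assign_slots(image_ours, image_other, rng):
    """Randomize presentation order; return (image_a, image_b, slot_key)."""
    if rng.random() < 0.5:
        return image_ours, image_other, {"A": "ours", "B": "other"}
    return image_other, image_ours, {"A": "other", "B": "ours"}

# Assumed response syntax: one "FIELD: value" line per field.
FIELD_RE = re.compile(r"^(WINNER|REASON|CONFIDENCE)\s*:\s*(.+)$", re.MULTILINE)

def parse_verdict(response, slot_key):
    """Map the judge's A/B/TIE answer back to the system names, or None if unparsable."""
    fields = {name: value.strip() for name, value in FIELD_RE.findall(response)}
    winner = fields.get("WINNER", "").upper()
    if winner not in {"A", "B", "TIE"}:
        return None
    return {
        "winner": slot_key.get(winner, "tie"),  # TIE has no slot entry
        "reason": fields.get("REASON", ""),
        "confidence": fields.get("CONFIDENCE", ""),
    }

verdict = parse_verdict(
    "WINNER: B\nREASON: fewer door conflicts\nCONFIDENCE: high",
    {"A": "other", "B": "ours"},
)
```

Keeping the slot key outside the prompt means the judge never sees which system produced which image, and the mapping back to system names happens only after parsing\.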

#### Judge prompt\.

For the visual judgment study, the VLM judge receives two anonymized rendered layouts for the same room geometry, with randomized A/B order\. The prompt asks the judge to compare functional layout quality only, ignoring rendering style, and to return one of $\{\text{A}, \text{B}, \text{TIE}\}$\. The full prompt is included with the released evaluation code; the excerpt below shows the criteria used in all comparisons\.

`Prompt: VLM layout judge`

#### Judge calibration on kitchens\.

Kitchens are the most difficult setting for the visual judge\. They contain large inventories, averaging 10 objects and reaching up to 20, and their constraints tightly couple appliances to countertops and islands\. Figure 10 shows two representative calibration cases\. In the first case, the judge appears to over\-penalize visually busy but valid kitchen pairings, such as chair\-under\-table and stove\-in\-countertop, and selects the lower\-scoring Kimi layout despite wall intersections and free\-floating appliances\. In the second case, the judge follows the intended rule hierarchy and selects the DPO layout when the competing Kimi layout places furniture outside the room boundary and near the door\.

\(a\) Judge selects Kimi despite lower rule score\.

\(b\) Judge selects DPO when Kimi has severe geometric violations\.

Figure 10\. Representative kitchen judge\-calibration cases\. The examples illustrate both a likely judge failure on dense but valid kitchen pairings and a correct preference when one layout contains clear geometric violations\.
