Advancing Creative Physical Intelligence in Large Multimodal Models

arXiv cs.AI 05/27/26, 04:00 AM Papers
large-multimodal-models creative-intelligence tool-use benchmark affordance alignment physical-reasoning
Summary
This paper introduces MM-CreativityBench, a benchmark for evaluating creative tool use in large multimodal models under physically constrained environments, and proposes affordance-grounded alignment using Direct Preference Optimization to reduce hallucination and improve grounded reasoning.
arXiv:2605.26396v1 Announce Type: new Abstract: Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open-ended environments, beyond pattern recognition. In such settings, intelligence requires more than answering well-posed questions: it involves identifying how elements in a scene can be repurposed in non-obvious yet physically feasible ways. This form of creative problem-solving is central to human intelligence, but remains largely untested in current benchmarks. To evaluate this ability, we introduce MM-CreativityBench, a benchmark for affordance-grounded creative tool use in visually rich, physically constrained environments. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration. Models often overlook relevant entities, under-examine critical parts, or hallucinate attributes not grounded in the image. Motivated by this failure mode, we propose affordance-grounded alignment, which casts creative tool use as a preference learning problem. Using Direct Preference Optimization, we encourage models to prefer attribute-affordance reasoning grounded in visual evidence over hallucinated alternatives. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi-turn planning. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding-related errors.
Original Article
View Cached Full Text
Cached at: 05/27/26, 09:04 AM
# Advancing Creative Physical Intelligence in Large Multimodal Models
Source: [https://arxiv.org/html/2605.26396](https://arxiv.org/html/2605.26396)
Cheng Qian∗11, Hyeonjeong Ha∗11, Jiayu Liu11, Jeonghwan Kim11, Emre Can Acikgoz11, Bingxuan Li11, Kunlun Zhu11, Jiateng Liu11, Aditi Tiwari11, Zhenhailong Wang11, Xiusi Chen11, Mahdi Namazifar22, Heng Ji11 11UIUC,22Amazon

###### Abstract

Large multimodal models \(LMMs\) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities generalize to discovering visually grounded solutions in open\-ended environments, beyond pattern recognition\. In such settings, intelligence requires more than answering well\-posed questions: it involves identifying how elements in a scene can be repurposed in non\-obvious yet physically feasible ways\. This form of creative problem\-solving is central to human intelligence, but remains largely untested in current benchmarks\. To evaluate this ability, we introduceMM\-CreativityBench, a benchmark for affordance\-grounded creative tool use in visually rich, physically constrained environments\. Each instance presents a scenario image with structured views of candidate entities and their parts, enabling fine\-grained, interactive evaluation of how models iteratively inspect the scene, identify relevant affordances, and compose visually and physically grounded solutions\. Our experiments show that current LMMs often fall short, not due to lack of generative capability, but because they do not sustain grounded exploration\. Models often overlook relevant entities, under\-examine critical parts, or hallucinate attributes not grounded in the image\. Motivated by this failure mode, we proposeaffordance\-grounded alignment, which casts creative tool use as a preference learning problem\. Using Direct Preference Optimization, we encourage models to prefer attribute\-affordance reasoning grounded in visual evidence over hallucinated alternatives\. In addition, we incorporate supervision derived from an affordance knowledge base to guide broader entity exploration and multi\-turn planning\. Our results show consistent gains in selecting the correct entities and parts, while substantially reducing hallucination and grounding\-related errors\. These findings suggest that creative intelligence is not a peripheral capability, but a foundation for the next stage of multimodal AI, enabling systems that learn over time, adapt to unfamiliar environments, and solve problems beyond their training, moving closer to human\-like intelligence\.

\*\*footnotetext:indicates equal contribution## 1Introduction

InTriarchic Theory of Intelligence\[sternberg1985beyond\], human intelligence encompasses not only analytical and practical abilities, but alsocreative intelligence: the ability to generate novel and useful solutions under constraints\. In real\-world, resource\-limited settings, this ability often appears as tool repurposing, where people adapt available objects to fulfill functions beyond their intended use\. Such creativity is not merely linguistic or associative\. Humans learn object attributes, physical affordances, and object\-object interactions through continuous observation and embodied experience in the physical world\. They can decompose tools and everyday objects into functional modules, such as edges, tips, handles, surfaces, and containers, and mentally reassemble these modules to support new goals\. For instance, a rigid edge can serve as a scraper, a thin metal tip as a lever, and a transparent curved surface as a focusing device\. These solutions are not arbitrary; they arise from recognizing non\-obvious yet physically valid mappings between task goals and environmental affordances\[gibson1977theory,gibson1979ecological\]\. We study this specific form of creativity,creative tool repurposing, as a concrete testbed of creative intelligence in large multimodal models \(LMMs\)\.

![Refer to caption](https://arxiv.org/html/2605.26396v1/figures/intro_new.png)Figure 1:Beyond normal tool use,creative tool repurposingrequires visual grounding of physical affordances, enabling the model to discover that a key’s serrated edge can cut box tape\. MM\-CreativityBench shows that such affordance\-guided reasoning is poorly captured by frontier VLMs but can be improved through training\.Despite the rapid progress of data\-driven LMMs, it remains unclear whether they acquire this kind of creative intelligence\. Current models can often describe objects, retrieve common tool\-use patterns, or generate plausible solutions from textual priors\. However, they frequently fail to transfer knowledge across functional similarity, physical affordance, or task context\. This limitation suggests that their reasoning may still be constrained by word\-level or pixel\-level shortcuts rather than an abstract, compositional understanding of how physical properties enable functions\[yuksekgonul2023when\]\. Moreover, creative tool use requires grounding object parts, geometry, material, and potential human\-object interactions in the physical world, which remains challenging for existing LMMs\[qian2024affordancellm\]\. Unlike humans, who build conceptual knowledge through perception, bodily experience, and situated action\[barsalou2008grounded\], general\-purpose LMMs lack experience\-based learning from embodied interaction with the environment\. As a result, their reasoning often resembles fast, local, and plausible “System 1” inference\[kahneman2011thinking\], while remaining weak in long\-horizon exploration and planning\[valmeekam2023planbench\]\. This makes it difficult for them to discover new object\-function mappings that are both visually grounded and physically feasible\.

To tackle these challenges, recent work has begun to explore creativity in large language and multimodal models through open\-ended generation and constrained problem\-solving tasks\[tian2024macgyver,qian2024escapebench,lim2025visescape\]\. However, existing evaluations remain largely text\-centric and scenario\-driven, offering limited insight into how models ground creative reasoning in physical environments\. A central challenge is that real\-world creativity is inherently perception\-dependent: agents must inspect environments, identify candidate objects, attend to relevant parts, and judge whether their physical attributes, such as geometry and material, support the intended use\. Without such grounding, models may produce linguistically plausible but physically invalid solutions, overlooking relevant objects, misinterpreting attributes, or hallucinating affordances that are not visually supported\[zeng2024investigating,chen2024multiobject,wu2024autohallusion\]\. Consequently, success in text\-based reasoning does not necessarily transfer to visually grounded problem\-solving\[zeng2024investigating\]\.

This gap motivates a more fundamental question:can LMMs perform creative reasoning as an evidence\-driven process grounded in perception?\[liu2024convbench,liu2024visualagentbench,cao2024visdiahalbench\]Addressing this question requires moving beyond static multimodal inputs toward interactive settings, where models actively decide what to inspect, iteratively refine their understanding, and connect visual evidence to task demands\. The challenge is not merely to generate a creative solution, but to reach one through a*visually grounded and physically feasible search process*that supports abstraction, functional transfer, and compositional use of object parts\.

To this end, we introduceMM\-CreativityBench, a benchmark for grounded creative problem solving in multimodal environments\. The benchmark consists of tasks that require repurposing everyday objects under constraints, each paired with a structured visual context including a scene image, entity\-level images, and zoomed\-in part images\. This design preserves the underlying affordance structure while introducing the perceptual challenges inherent to real\-world reasoning: a successful system must not only infer what could work, but also identify the correct object and part through visual inspection and justify its feasibility\. While creativity is inherently open\-ended, our evaluation focuses on constrained creativity, where multiple solutions may exist but must satisfy physical and functional requirements grounded in the scene\. Accordingly, task success is defined by whether a model identifies a physically valid and contextually appropriate object–part combination that fulfills the task constraints\. To support this, we adopt an interactive protocol that allows models to explore the environment, update their reasoning, and refine candidate solutions before committing the answer\.

Our experiments reveal a gap between surface\-level plausibility and grounded reasoning\. Current LMMs often generate superficially plausible answers, but struggle to carry out evidence\-based creative exploration: even the strongest models achieve less than 25% accuracy\. Notably, some top closed\-source models, such as GPT\-5\.4, may underperform open\-source models such as Qwen, suggesting that scaling alone is insufficient for grounded creative reasoning\. Error analysis shows consistent failure modes: models fixate on salient but irrelevant objects, neglect decisive object parts, or infer affordances unsupported by visual evidence\. In many cases, the bottleneck is not the lack of candidate ideas but the inability to maintain a grounded exploration process that links perception, interaction, and physical plausibility\.

To address these limitations, we further investigate whether affordance\-aware alignment can improve grounded interactive behavior\. Our key idea is to provide models with basic building blocks for attribute\-affordance associations, enabling them to connect observable attributes to potential functional uses\. Building on this, we design supervision signals that encourage evidence\-based exploration, guiding models to actively inspect candidate entities, maintain a structured record of unobserved parts, and ground intermediate reasoning steps in visual evidence\. We also introduce preference data with negative trajectories capturing common failure modes, including hallucinated attributes and premature commitment, and visually unsupported reasoning\. Fine\-tuning open\-source Qwen3\-VL models with these signals through supervised fine\-tuning and direct preference optimization yields consistent gains, more than doubling performance in the best setting\. These gains suggest that injecting affordance\-level knowledge and exploration strategies is critical for grounded creative reasoning, leading to stronger visual grounding, reduced hallucination, and more accurate creative tool use\. Overall, we summarize our contributions as follows:

- •Visual Creativity Benchmark:We introduce MM\-CreativityBench, a benchmark for evaluating grounded creative tool repurposing in visual environments, where models must identify the object and part based on visual evidence and physical feasibility for creative problem\-solving\.
- •Grounded Interactive Protocol:We design an interactive evaluation setting that allows models to actively inspect scenes, entities, and parts, making it possible to measure whether creative solutions arise from evidence\-driven exploration rather than unsupported guessing\.
- •Affordance\-Grounded Alignment:We systematically analyze failure modes of current LMMs in grounded creative reasoning, and show that post\-training with stepwise supervision and preference optimization can yield gains in performance, grounding, and hallucination reduction\.

## 2Related Work

BenchmarkCreativeTool UseAffordanceGroundingAttributeGroundingPart\-LevelReasoningFine\-GrainedCreativity LevelsDistractorsIncludedVisualGroundingEvaluationProtocolPROST\[aroca2021prost\]✗✓✓✗✗✓✗StaticNEWTON\[wang2023newton\]✗✗✓✗✗✓✗StaticCreation\-MMBench\[tian2024macgyver\]✗✗✗✗✗✗✓StaticVillagerBench\[dong2024villageragent\]✗✓✗✗✗✗✓InteractiveVisEscape\[lim2025visescape\]✗✓✗✗✗✓✓InteractivePIQA\[bisk2020piqa\]✓✓✓✗✗✓✗StaticMacGyver\[tian2024macgyver\]✓✓✓✗✗✗✓StaticEscapeBench\[qian2024escapebench\]✓✓✗✗✓✗✓InteractiveCretivityBench\[qian2026creativitybench\]✓✓✓✓✓✓✗StaticMM\-CreativityBench\(Ours\)✓✓✓✓✓✓✓Interactive

Table 1:For each existing benchmark, the table indicates whether the corresponding dimension is fully addressed \(✓\), partially addressed \(✓\), or not addressed \(✗\)\.Creativity in Multimodal and Language Models\.Creativity in LLMs has been studied through open\-ended generation tasks such as storytelling\[akoury2020storium,brown2020language\], design\[qian2023creator,cai2023large,ha2025synthia\], and ideation\[si2024can,wang2024scimon,qian2025modelingagent,yang2024large,wang2026creativebench\], often evaluated using notions of novelty, diversity, and usefulness\. More recent work extends this to creative problem solving, including tool\-use and object repurposing scenarios where models must generate unconventional but feasible solutions under constraints\[tian2024macgyver,qian2024escapebench,qian2026creativitybench\], as well as multimodal settings involving non\-literal image understanding, context\-aware generation, and exploration\-driven decision making\[huang2025causality,fang2025creation,lim2025visescape\]\. However, across both LLM and LMMs benchmarks, these evaluations are largely scenario\-driven, emphasizing planning, reasoning, or interaction rather than the fine\-grained mechanisms of affordance\-grounded creative tool use \([Table˜1](https://arxiv.org/html/2605.26396#S2.T1)\); how models derive novel solutions from object properties, especially under visual grounding, remains underexplored\.

Affordance\-Grounded Reasoning and Alignment\.Affordance reasoning has been studied as a bridge between perception and action, including in physical commonsense benchmarks such as PIQA, PROST, and NEWTON\[bisk2020piqa,aroca2021prost,wang2023newton\], and in robotics and embodied AI for manipulation and planning\[montesano2008learning,jamone2016affordances,chu2019learning,brohan2022rt,brohan2024rt\]\. Recent MLLM work introduces structured and part\-level affordance representations\[yu2025seqafford,ma2024glover\], improving grounded perception and reasoning\. However, these approaches primarily focus on recognizing canonical affordances or action feasibility, rather than enabling flexible recombination for creative tool use grounded in fine\-grained attributes\. In parallel, alignment methods such as supervised fine\-tuning and Direct Preference Optimization\[rafailov2023direct\], along with multimodal extensions\[wang2024mdpo,liu2024mia\], have proven effective at improving reasoning quality and visual grounding through preference\-based learning over exploratory trajectories\. However, these approaches have been studied primarily in general reasoning\. Our work bridges this gap by leveraging training signals from an affordance knowledge base to reframe affordance\-driven creativity as a preference optimization problem, encouraging models to prefer visually grounded attribute–affordance reasoning\. This injects fine\-grained attribute–affordance knowledge into the model as compositional primitives for creative recombination, enabling efficient, visually grounded creative tool use\.

## 3MM\-CreativityBench

### 3\.1Preliminary Experiment

![Refer to caption](https://arxiv.org/html/2605.26396v1/figures/images/relative_mllm_gpt4.1_wtl_default_vs_cot_sample25.png)Figure 2:Preliminary Experimental Results:Comparison between direct prompting and structured affordance\-level CoT on creative tool use tasks\.As a preliminary probe of creative intelligence in LMMs, we evaluate models on 100 creative tool\-use tasks drawn from MacGyver\[tian2024macgyver\], where each task requires repurposing everyday objects to satisfy a set of constraints\. To introduce a visual grounding requirement, we augment each task with a scenario image generated by Gemini\-2\.5\-Pro\. The accompanying task description includes only constraints that are not directly observable from the image, so the model must rely on visual evidence to identify candidate objects and reason about their possible uses\. Under this setup, we compare two prompting strategies: adirect prompt, which asks the model to produce a solution without structured guidance, and astructured affordance\-level Chain\-of\-Thought \(CoT\)prompt\[wei2022cot\], which guides the model to perceive available tools, decompose them into parts, infer physical properties, derive affordances, and verify constraint satisfaction\. Detailed prompts are provided in[Appendix˜B](https://arxiv.org/html/2605.26396#A2)\. We use GPT\-4\.1\-mini as the evaluated LMM and GPT\-5\.2 as the judge LMM model, assessing outputs along six dimensions:Correctness,Feasibility,Physical Grounding,Constraint Coverage,Tool Usage, andCreativity\.

As shown in[Figure˜2](https://arxiv.org/html/2605.26396#S3.F2), structured affordance\-level CoT yields modest gains on procedural dimensions, improving Constraint Coverage, Tool Usage, and Creativity\. However, these gains do not translate into reliable end\-to\-end success: Correctness improves only marginally, while Feasibility and Physical Grounding remain limited or inconsistent\. This suggests that prompting models to explicitly list objects, parts, attributes, and affordances can organize reasoning, but does not ensure that the final solution is grounded in fine\-grained visual evidence\. Models may still produce plausible creative uses without verifying whether the selected part actually has the physical attributes required for the task\. These results motivate both our benchmark and training design: MM\-CreativityBench evaluates creative tool use as an interactive, part\-level grounding problem, while our affordance\-grounded alignment method provides explicit supervision and preference signals that teach models to explore relevant evidence, connect attributes to affordances, and reject visually unsupported solutions\.

### 3\.2Benchmark Task Construction

The preliminary study shows that structured prompting can organize creative reasoning, but does not reliably ground the final solution in the visual and physical attributes of a specific object part\. We therefore construct MM\-CreativityBench from a part\-level affordance knowledge base, so that each task has an explicit evidence structure underlying the correct creative solution\.

#### Creative affordance knowledge base\.

We build MM\-CreativityBench on top of the existing open\-source affordance knowledge base\[qian2026creativitybench\]\. The knowledge base provides structured annotations for everyday physical objects, including part decompositions, part\-level physical and state attributes, and functional affordances \(please see[Section˜C\.1](https://arxiv.org/html/2605.26396#A3.SS1)for details\)\. Formally, each entitye∈ℰe\\in\\mathcal\{E\}is decomposed into functional parts:

P\(e\)=\{p1,…,pm\}\.P\(e\)=\\\{p\_\{1\},\\ldots,p\_\{m\}\\\}\.Each partp∈P\(e\)p\\in P\(e\)is associated with an attribute setA\(p\)=Aphy\(p\)∪Astate\(p\)A\(p\)=A^\{\\mathrm\{phy\}\}\(p\)\\cup A^\{\\mathrm\{state\}\}\(p\), whereAphy\(p\)A^\{\\mathrm\{phy\}\}\(p\)captures stable physical properties such as geometry, material, rigidity, sharpness, hollowness, or surface texture, andAstate\(p\)A^\{\\mathrm\{state\}\}\(p\)captures situational properties such as whether the part is open, clean, intact, accessible, or detachable\. These annotations provide the fine\-grained evidence needed to decide whether a part can be repurposed for a novel use\.

#### Reverse task construction\.

Given the affordance knowledge base, we construct each benchmark instance as an inverse grounding problem rather than writing scenarios first and labeling answers afterward\. Specifically, we sample a target entity–part pair\(e∗,p∗\)\(e^\{\*\},p^\{\*\}\)and a gold affordancef∗f^\{\*\}supported byA\(p∗\)A\(p^\{\*\}\), forming the gold solutiong=\(e∗,p∗,f∗\)g=\(e^\{\*\},p^\{\*\},f^\{\*\}\)\. We then generate a task descriptionxxthat requiresf∗f^\{\*\}without revealing the target entity or part, and sample distractor entitiesE−E^\{\-\}to form the candidate set:

T=\(x,E,g\),E=\{e∗\}∪E−,g=\(e∗,p∗,f∗\)\.T=\(x,E,g\),\\qquad E=\\\{e^\{\*\}\\\}\\cup E^\{\-\},\\qquad g=\(e^\{\*\},p^\{\*\},f^\{\*\}\)\.Distractors are selected to make the task diagnostic: some contain parts with affordances similar tof∗f^\{\*\}but lack a decisive physical or state attribute, while others are scene\-plausible objects that naturally co\-occur with the gold entity but cannot satisfy the task constraints\. Thus, success requires identifying the correct entity and part through fine\-grained grounding rather than object priors alone\. We retain only high\-quality tasks satisfying gold validity, distractor separability, scene coherence, and visual observability, resulting in 333 held\-out MM\-CreativityBench test tasks and 868 disjoint training tasks for trajectory sampling\. Details of reverse task generation, distractor construction, filtering, and human verification are provided in[Section˜C\.2](https://arxiv.org/html/2605.26396#A3.SS2)\.

#### Multimodal Grounding via Image Generation

After constructing each symbolic taskT=\(x,E,g\)T=\(x,E,g\), we augment it with images at three granularities: environment, entity, and part\. This mirrors the interaction process required by the benchmark: the model first observes the full scene, then inspects candidate entities, and finally verifies decisive part\-level evidence\. For each task, we generate

Ie=πent\(e,P\(e\),A\),Ie,p=πpart\(e,p,A\(p\),Ie\),Ienv=πenv\(x,E,\{Ie:e∈E\}\)\.I\_\{e\}=\\pi\_\{\\mathrm\{ent\}\}\(e,P\(e\),A\),\\qquad I\_\{e,p\}=\\pi\_\{\\mathrm\{part\}\}\(e,p,A\(p\),I\_\{e\}\),\\qquad I\_\{\\mathrm\{env\}\}=\\pi\_\{\\mathrm\{env\}\}\(x,E,\\\{I\_\{e\}:e\\in E\\\}\)\.Here,IeI\_\{e\}provides a full\-object view,Ie,pI\_\{e,p\}provides a zoomed\-in view of partpp, andIenvI\_\{\\mathrm\{env\}\}places all candidate entities into a coherent scene\. This three\-level construction is essential because distractors are intentionally plausible at the object level, while the correct answer often depends on local attributes of a specific part\. Therefore, the benchmark requires models to navigate

Ienv→Ie→Ie,p→\(e∗,p∗,f∗\)I\_\{\\mathrm\{env\}\}\\rightarrow I\_\{e\}\\rightarrow I\_\{e,p\}\\rightarrow\(e^\{\*\},p^\{\*\},f^\{\*\}\)and ground the final solution in inspected visual evidence\. Details of image generation are provided in[Section˜C\.3](https://arxiv.org/html/2605.26396#A3.SS3)\.

### 3\.3Training Trajectory Construction

The benchmark construction above defines the evaluation problem: given a visually grounded scene, a model must identify the entity and part whose physical attributes support the target affordance\. We now use the same task structure to construct training data\. The key motivation is that grounded creative tool use is not only a final\-answer problem, but also a process problem\. A model must decide which entity to inspect, which part to verify, how to interpret the observed attributes, and when to reject plausible but physically invalid alternatives\. Therefore, instead of supervising only the final solution, we construct multi\-turn trajectories that teach evidence\-seeking behavior from scene\-level search to part\-level affordance grounding\.

#### Interactive trajectory format\.

For each multimodal task𝒯=\(x,Ienv,E,g\)\\mathcal\{T\}=\(x,I\_\{\\mathrm\{env\}\},E,g\)with gold solutiong=\(e∗,p∗,f∗\)g=\(e^\{\*\},p^\{\*\},f^\{\*\}\), we represent an interaction trajectory as

τ=\{\(ot,rt\)\}t=1T,ot=\(ut,It\),rt=\(zt,at\)\.\\tau=\\\{\(o\_\{t\},r\_\{t\}\)\\\}\_\{t=1\}^\{T\},\\qquad o\_\{t\}=\(u\_\{t\},I\_\{t\}\),\\qquad r\_\{t\}=\(z\_\{t\},a\_\{t\}\)\.Here,utu\_\{t\}is the feedback message,ItI\_\{t\}is the visual observation returned at turntt,ztz\_\{t\}is the model’s reasoning, andata\_\{t\}is a structured action\. The action space contains three operations:

at∈\{inspect\_entity\(e\),inspect\_part\(e,p\),answer\(e,p,h\)\},a\_\{t\}\\in\\\{\\texttt\{inspect\\\_entity\}\(e\),\\ \\texttt\{inspect\\\_part\}\(e,p\),\\ \\texttt\{answer\}\(e,p,h\)\\\},wheree∈Ee\\in E,p∈P\(e\)p\\in P\(e\), andhhdescribes how the selected part should be used\. This format mirrors the visual hierarchy of the benchmark\. The initial turn provides the environment imageIenvI\_\{\\mathrm\{env\}\}; inspecting an entity returns its full\-object imageIeI\_\{e\}and part listP\(e\)P\(e\); inspecting a part returns the zoomed\-in part imageIe,pI\_\{e,p\}, optionally with short attribute\-level textual disambiguation\. Thus, each action explicitly determines what evidence the model receives next\.

#### Knowledge\-guided exploration stack\.

To construct a systematic positive trajectory, we maintain an ordered exploration stack𝒮t\\mathcal\{S\}\_\{t\}whose elements are candidate entities or parts\. The top element determines the next inspection target\. We define an affordance\-relevance functionJ\(e,p\)∈0,1J\(e,p\)\\in\{0,1\}, whereJ\(e,p\)=1J\(e,p\)=1indicates that partppof entityeehas an affordance similar or relevant to the target affordancef∗f^\{\*\}according to the knowledge base\. This allows the trajectory to prioritize promising candidates while still grounding exploration in structured affordance knowledge\.

- •Initialization:Given the scene and task, the model proposes an ordered list of candidate entities to initialize𝒮t\\mathcal\{S\}\_\{t\}, thereby directing early exploration toward likely relevant objects\.
- •Inspect Entity:When an entityeeis inspected, it is removed from the stack, and its affordance\-relevant partsp∈P\(e\):J\(e,p\)=1\{p\\in P\(e\):J\(e,p\)=1\}are pushed for part\-level inspection\. This turns coarse entity\-level exploration into finer part\-level verification\.
- •Inspect Part:When a partppis inspected, it is removed from the stack and assigned a binary judgmentbt∈0,1b\_\{t\}\\in\{0,1\}, indicating whether its observed attributes satisfy the task requirements\.
- •Answer:Exploration stops when no unexplored entity or part remains\. In the final answer turn, the model compares all inspected parts withbt=1b\_\{t\}=1and selects the final pair\.

This stack mechanism yields a coarse\-to\-fine trajectory: the model first explores candidate entities, then verifies parts that may support the target affordance, and finally chooses among plausible parts\. This is important because many distractors are intentionally affordance\-similar; the model must learn not only to identify a plausible part, but also to select the gold pair\(e∗,p∗\)\(e^\{\*\},p^\{\*\}\)whose attributes best satisfy the task constraints\.

![Refer to caption](https://arxiv.org/html/2605.26396v1/figures/method.png)Figure 3:Interactive MM\-CreativityBench evaluation and training, where models inspect scenes, entities, and parts to ground creative tool use while learning to avoid hallucinated affordances\.
#### Three\-branch trajectory sampling\.

The exploration stack determines what should be inspected at each turn, but it does not by itself determine how the model should reason about the inspection\. Since the structured actionata\_\{t\}and the textual reasoningztz\_\{t\}play different roles, we generate guided reasoning branches at each shared interaction contextctc\_\{t\}\. As illustrated on the right side of[Figure˜3](https://arxiv.org/html/2605.26396#S3.F3), we sample three aligned branches with the same response format,

rtb=\(ztb,atb\),b∈\{\+,−,−−\},r\_\{t\}^\{b\}=\(z\_\{t\}^\{b\},a\_\{t\}^\{b\}\),\\qquad b\\in\\\{\+,\-,\-\-\\\},but with different guidance signals\.

- •Positive branch\(\+\)\(\+\):The positive branch is guided by structured knowledge, including relevant attributes, affordance judgments, and the gold solution when needed\. Its reasoning is expected to justify each action with visually grounded evidence and remain consistent with the exploration stack\. Together with the stack, this branch forms the positive trajectory used for supervised fine\-tuning, teaching the model how to explore entities and parts systematically\.
- •Negative branch\(−\)\(\-\):The negative branch receives only standard observable feedback, such as the task description, images, entity names, and part names, without hidden affordance labels or gold guidance\. It therefore captures realistic inference\-time mistakes, such as overlooking decisive parts, over\-exploring irrelevant objects, or selecting a plausible but suboptimal part\.
- •Hard\-negative branch\(−−\)\(\-\-\):The hard\-negative branch is constructed to create stronger contrast for preference learning\. It preserves fluent reasoning and valid action format, but is guided toward misleading conclusions, such as hallucinating unsupported attributes, relying on object\-level priors, or choosing an affordance\-similar distractor that lacks the required physical evidence\.

Only the positive branch updates the exploration stack, ensuring that future observations remain coherent and grounded\. The negative and hard\-negative branches are sampled at the same states only as rejected alternatives, yielding aligned turn\-level triples

\(ct,rt\+,rt−,rt−−\),\(c\_\{t\},r\_\{t\}^\{\+\},r\_\{t\}^\{\-\},r\_\{t\}^\{\-\-\}\),wherert\+r\_\{t\}^\{\+\}is the preferred grounded response andrt−,rt−−r\_\{t\}^\{\-\},r\_\{t\}^\{\-\-\}are rejected responses of increasing difficulty\. The resulting data support both training stages: positive trajectories teach systematic exploration through SFT, while positive–negative comparisons enable DPO\-style training to favor visually grounded attribute–affordance reasoning over fluent but unsupported alternatives\. Please see[Appendix˜D](https://arxiv.org/html/2605.26396#A4)for more trajectory construction details\.

### 3\.4Affordance\-Grounded Alignment

Given the constructed trajectory dataset, we align the model with affordance\-grounded creative reasoning in two stages: \(1\) Supervised fine\-tuning \(SFT\) teaches structured exploration from positive trajectories, and \(2\) turn\-level Direct Preference Optimization \(DPO\) sharpens attribute–affordance grounding by contrasting grounded reasoning with plausible but unsupported alternatives\.

Supervised Fine\-Tuning\.We first fine\-tune the model on positive trajectoriesτ\+=\{\(ct,rt\+\)\}t=1T\\tau^\{\+\}=\\\{\(c\_\{t\},r\_\{t\}^\{\+\}\)\\\}\_\{t=1\}^\{T\}that are grounded in the affordance knowledge base𝒦\\mathcal\{K\}\. Given each interaction contextctc\_\{t\}, the model is trained to imitate the positive responsert\+=\(zt\+,at\+\)r\_\{t\}^\{\+\}=\(z\_\{t\}^\{\+\},a\_\{t\}^\{\+\}\)using a standard token\-level cross\-entropy objective over complete multi\-turn interactions\. Imitating full trajectories, rather than only final answers, encourages the model to learn the evidence\-seeking process: selecting candidate entities, inspecting relevant parts, interpreting observed attributes, and comparing entity–part pairs before producing the final solution\. However, because these trajectories are generated with structured guidance, SFT primarily teaches the model a guided exploration policy and does not explicitly penalize spurious attribute–affordance associations\.

Turn\-Level Direct Preference Optimization\.At inference time, the model will operate without the gold guidance, which can lead to structurally valid but poorly grounded reasoning\. To reduce this gap, we apply DPO under the unguided evaluation protocol\. For each shared contextctc\_\{t\}, we form preference pairs\(ct,rt\+,rtrej\)\(c\_\{t\},r\_\{t\}^\{\+\},r\_\{t\}^\{\\mathrm\{rej\}\}\), where the preferred responsert\+r\_\{t\}^\{\+\}is drawn from the grounded positive branch and the rejected responsertrej∈\{rt−,rt−−\}r\_\{t\}^\{\\mathrm\{rej\}\}\\in\\\{r\_\{t\}^\{\-\},r\_\{t\}^\{\-\-\}\\\}is sampled from the negative or hard\-negative branch of our three\-branch trajectory construction\. These rejected responses often preserve valid action formats and plausible entity–part choices, yet misinterpret or overclaim the visual evidence\. Contrasting them under identical contexts trains the model to prefer responses that justify affordances using observed physical or state attributes, directly targeting the core failure mode of attribute–affordance reasoning under multimodal uncertainty\. Full objective formulations, context construction, and trajectory notation are provided in[Appendix˜E](https://arxiv.org/html/2605.26396#A5)\.

## 4Experiment

### 4\.1Implementation Details

Benchmark Evaluation Protocol\.We use an interactive evaluation protocol in which models explore a scenario image before producing a final answer\. Each example begins with an image containing multiple entities\. As illustrated in[Figure˜3](https://arxiv.org/html/2605.26396#S3.F3), the model may iteratively inspect entities and their parts to obtain closer views and examine relevant attributes before deciding on an answer\. The model is not required to inspect every region, but effective exploration should help ground the final creative solution in object\-specific visual evidence\.

In our main setting, the conversation history includes the initial scenario image and the most recently inspected view\. At each step, the model first provides its reasoning and then chooses one of three actions: inspect an entity, inspect a part, or give the final answer\. For inspection actions, the model specifies the selected entity or part; for final answers, it explains how the explored evidence supports a creative and grounded response\. We evaluate open\- and closed\-source model families including GPT, Qwen3\-VL, InternVL3\.5, and Gemma\-4, using maximum context length and zero temperature\. Full prompt details are provided in[Appendix˜F](https://arxiv.org/html/2605.26396#A6)\.

CategoryStatisticValueCategoryStatisticValueTest SetData Points333Training SetData Points868Number of Entities974Number of Entities1,498Number of Parts6,344Number of Parts10,080SFT Data Points19,533DPO Data Points5,000

Table 2:Overall statistics for MM\-CreativityBench and the training set used for trajectory sampling\. The test and training sets contain no overlapping scenes, entities, or parts\.Training Implementation Details\.We train Qwen3\-VL\-4B\-Instruct and Qwen3\-VL\-8B\-Instruct with both SFT and DPO using sampled training trajectories\. For SFT, we construct each trajectory using only the positive branch at every turn\. For DPO, we build the conversation context from positive branches, and use the positive response at the current turn as the chosen sample\. The rejected sample is either the negative branch or the hard negative branch, corresponding to the DPO \(normal negative\) and DPO \(hard negative\) settings, respectively\. We also evaluate a two\-stage SFT\+DPO setting, where the model is first trained with SFT and then further optimized with DPO\. All trajectories are sampled from 868 training tasks with scenarios and entities entirely disjoint from the test set\. See[Table˜2](https://arxiv.org/html/2605.26396#S4.T2)for dataset statistics and[Appendix˜F](https://arxiv.org/html/2605.26396#A6)for training hyperparameters\.

### 4\.2Evaluation Metric

The agent is primarily challenged to perform visual and physical grounding: it must identify the correct entity to repurpose and the specific part that should be used\. Therefore, our main metric isGold Correct Rate, which measures whether the model selects both the correct entity and the correct part\. We also reportEntity Correct Rate, which counts a prediction as correct as long as the selected entity is correct\. By definition,Entity Correct Rateshould be no lower thanGold Correct Rate\.

We additionally report interaction and grounding statistics, including theAverage Number of Exploration Turnsand theAverage Number of Distinct Entities/Parts Explored\. To assess whether a model’s answer is grounded in its interaction history, we also measure whether it inspected the gold entity and gold part before answering\. We present the benchmarking results in[Table˜3](https://arxiv.org/html/2605.26396#S4.T3)and preliminary trained\-model results in[Table˜4](https://arxiv.org/html/2605.26396#S4.T4), followed by the main findings below\.

ModelGoldCorrectEntityCorrectTurnsAvg\. Distinct ExploredGold Entity Explored Before AnswerGold Part Explored Before AnswerEntitiesPartsEntity CorrectEntity WrongPart CorrectPart WrongGPT\-5\.40\.1920\.4354\.1771\.6611\.4920\.5100\.2180\.4220\.059GPT\-5\.4 Mini0\.1830\.4084\.0721\.3600\.7060\.6620\.1930\.2790\.033Qwen3\-VL\-8B\-Instruct0\.1920\.44113\.4504\.9793\.7660\.9930\.7470\.9530\.201Qwen3\-VL\-32B\-Instruct0\.2400\.4478\.7664\.8022\.3091\.0000\.7501\.0000\.111InternVL3\.5\-14B0\.1500\.3454\.8471\.8111\.7001\.0000\.2110\.9600\.032InternVL3\.5\-38B0\.1560\.4266\.9913\.5651\.7751\.0000\.5240\.9420\.068Gemma\-4\-26B\-A4B\-it0\.1830\.4025\.3302\.6791\.5741\.0000\.4770\.9020\.040Gemma\-4\-31B\-it0\.1650\.3543\.8021\.9820\.7961\.0000\.2560\.5450\.018

Table 3:The benchmarking results on MM\-CreativityBench\. Models often locate the relevant entity but struggle with fine\-grained gold\-part grounding\. Larger exploration traces improve evidence coverage but do not guarantee correct answers, revealing bottlenecks in visual evidence use\.
### 4\.3Main Results

Interactive exploration helps models find relevant evidence, but does not guarantee correct reasoning\.Our benchmarking results in[Table˜3](https://arxiv.org/html/2605.26396#S4.T3)show that inspecting useful visual evidence does not necessarily lead to correct final answers\. For example, Qwen3\-VL\-32B examines the gold entity before answering in nearly all successful entity cases and achieves the highest raw gold correctness among base models, yet its final accuracy remains only0\.240\. InternVL3\.5\-14B and Gemma\-4\-26B\-A4B\-it show a similar pattern: although they frequently inspect the gold entity in correct\-entity cases, theirgold\-correct scores remain much lower\. These results suggest that models do not fail only because they overlook the relevant region\. Rather, even when they find the right evidence, they often struggle tointerpret it and integrate itinto the final decision\. This motivates our later training, which aims to improve both*exploration policies*and the*use of visual evidence*gathered through interaction\.

The main bottleneck is fine\-grained part grounding rather than coarse entity localization\.Across raw models, entity correctness is much higher than gold correctness\. For example, GPT\-5\.4 reaches 0\.435 entity correctness but only0\.192gold correctness, while InternVL3\.5\-38B reaches 0\.426 versus0\.156\. This gap shows that models can often find the relevant object, but still fail to ground the specific part or attribute needed to answer correctly\. The exploration statistics reinforce this pattern: models inspect entities more reliably than parts, so broader exploration does not necessarily yield*finer evidence*\. Thus, our interactive image evaluation is less about object discovery and more aboutpart\-sensitive visual reasoning: identifying which region matters, extracting the right evidence, and using it to resolve the question\. This motivates training signals that explicitly reward*fine\-grained grounding*, not just final\-answer success\.

Model families differ in exploration style, and scaling alone does not solve interactive visual reasoning\.[Table˜3](https://arxiv.org/html/2605.26396#S4.T3)shows that models differ not only in final accuracy, but also inhow they explore\. Qwen3\-VL models inspect many more entities on average, around 4\.8–5\.0, while GPT\-5\.4 and GPT\-5\.4 Mini inspect only 1\.66 and 1\.36\. Yet*more exploration is not necessarily better*: Qwen3\-VL\-8B explores far more than GPT\-5\.4 but reaches the same gold correctness of 0\.192, and Qwen3\-VL\-32B improves only modestly to 0\.240 despite larger scale and extensive inspection\. Moreover, the number of interaction turns is consistently larger than the total number of explored entities and parts, suggestingredundant explorationand room for more efficient policies\. At the same time, open\-source models can match or exceed GPT performance, with Qwen3\-VL\-32B achieving the best raw gold correctness and Qwen3\-VL\-8B matching GPT\-5\.4 while producing richer traces\. These results suggest that interactive visual reasoning is shaped byfamily\-specific tradeoffsamong search, grounding, and decision\-making, rather than by scale alone\.

Takeaway 1Models can often choose relevant entities, but still fail at fine\-grained part grounding and evidence integration; interactive visual reasoning is limited more by grounding and reasoning than by exploration volume or scale\.

ModelGoldCorrectEntityCorrectTurnsAvg\. Distinct ExploredGold Entity Explored Before AnswerGold Part Explored Before AnswerEntitiesPartsEntity CorrectEntity WrongPart CorrectPart WrongQwen3\-4B\-VL\-Instruct0\.1560\.39318\.9223\.9374\.4170\.9470\.5540\.9230\.167\+ SFT0\.2040\.36917\.8626\.6289\.6701\.0000\.9900\.9850\.404\+ DPO \(normal negative\)0\.2010\.52931\.6935\.50612\.3981\.0000\.8210\.9700\.351\+ DPO \(hard negative\)0\.2400\.54712\.8424\.2223\.7810\.9730\.6390\.8380\.157\+ SFT \+ DPO \(hard negative\)0\.4170\.5836\.2112\.6441\.8310\.9590\.3500\.8560\.026Qwen3\-8B\-VL\-Instruct0\.1920\.44113\.4504\.9793\.7660\.9930\.7470\.9530\.201\+ SFT0\.2730\.58915\.6466\.6557\.6791\.0000\.9931\.0000\.293\+ DPO \(normal negative\)0\.2580\.57718\.4805\.9046\.3841\.0000\.9360\.9650\.231\+ DPO \(hard negative\)0\.2610\.5089\.0544\.2832\.5090\.9940\.6130\.9200\.082\+ SFT \+ DPO \(hard negative\)0\.3930\.5838\.0694\.0662\.3271\.0000\.5760\.9390\.059

Table 4:Interactive image\-evaluation summary for Qwen3\-VL base and trained models\. SFT \+ DPO achieves the highest gold and entity correct rates with more efficient exploration\. Gold parts and entities are more frequently explored when the final answer is correct than when it is wrong, suggesting that correct answers are typically grounded in relevant exploration\.Training improves both accuracy and interaction efficiency, showing that purposeful exploration is learnable\.As shown in[Table˜4](https://arxiv.org/html/2605.26396#S4.T4), targeted training substantially changes how models interact with images\. The strongest 4B variant,SFT \+ DPO with hard negatives, improves gold correctness from 0\.156 to0\.417, while reducing average turns from 18\.92 to6\.21\. The 8B variant shows a similar trend, improving from 0\.192 to 0\.393 while reducing turns from 13\.45 to 8\.07\. These gains are not obtained by making the model search longer or inspect more regions\. Instead, training makes interaction more*selective and decisive*: the model learns to gather useful evidence earlier, avoid unnecessary revisits, and stop once the evidence is sufficient\. This suggests that interactive visual reasoning is atrainable behaviorwhose efficiency–accuracy tradeoff can be substantially improved\.

SFT structures exploration, while hard\-negative DPO teaches the model which evidence not to trust\.SFT helps the model produce more grounded and interpretable exploration traces, but it does not fully solve the reasoning problem\. For Qwen3\-4B, SFT improves gold correctness only modestly from 0\.156 to 0\.204, while average turns remain high at 17\.86, indicating continued reliance on long, corrective exploration\. The key improvement comes from addinghard\-negative DPO, which raises gold correctness from 0\.204 to 0\.417 and reduces turns from 17\.86 to 6\.21\. This suggests that the main benefit of hard negatives is not simply stronger supervision, but sharper discrimination: the model learns to reject*plausible but misleading*trajectories that inspect visually relevant evidence yet support the wrong conclusion\. Thus, SFT provides the structure for exploration, while hard\-negative DPO reshapes the model’s preferences toward correct fine\-grained attribute–affordance reasoning, enabling earlier commitment to valid evidence paths\.

Takeaway 2Training improves both accuracy and efficiency: SFT organizes exploration, while DPO helps models reject misleading evidence and commit earlier to valid grounding paths\.

## 5Analysis

### 5\.1Affordance similarity reveals limits in fine\-grained visual grounding

![Refer to caption](https://arxiv.org/html/2605.26396v1/figures/images/fig_difficulty_scatter_gold_last1.png)\(a\)Performance under different affordance similarity levels\.
![Refer to caption](https://arxiv.org/html/2605.26396v1/figures/images/fig_turns_by_similarity_last1.png)\(b\)Average exploration turns\.

Figure 4:Effect of affordance similarity on performance and exploration\.As more entities share similar affordances, model performance often degrades, while the average number of exploration turns remains largely stable\. This suggests that failures are driven less by insufficient exploration and more by weak fine\-grained visual grounding and affordance disambiguation\.Across environments with different affordance compositions, we observe that model performance often degrades as more entities share similar affordances\. As shown in[Figure˜4](https://arxiv.org/html/2605.26396#S5.F4), models achieve comparable accuracy in settings with dissimilar or mixed affordances, but their performance drops consistently in similar\-affordance environments\. This pattern indicates that the main difficulty is not simply recognizing plausible candidate tools, but distinguishing among candidates with overlapping functional affordances\.

Such disambiguation requires fine\-grained grounding in visual and physical attributes, such as geometry, material, accessibility, and object\-part structure\. However, current LMMs appear to rely on coarse affordance representations: they can often infer what type of object might be useful, but struggle to determine which specific object or part is physically best suited for the task\. As a result, they may select a functionally plausible tool while missing the attribute\-level evidence needed\.

Notably, the average number of exploration turns remains largely unchanged across similarity levels\. This suggests that models do not adapt their search behavior when the environment becomes more ambiguous; they neither inspect substantially more entities nor perform additional verification before committing to an answer\. Therefore, the performance drop is unlikely to stem from insufficient exploration alone\. Instead, it reflects a deeper limitation in fine\-grained visual grounding and comparative affordance evaluation\. These findings are also consistent with the failure modes we will discuss in[Section˜5\.6](https://arxiv.org/html/2605.26396#S5.SS6)\.

Takeaway 3Performance drops as affordance\-similar distractors increase, while exploration remains stable, indicating that the main bottleneck is fine\-grained visual grounding and affordance disambiguation rather than search effort\.

### 5\.2Higher\-level affordance typicality does not translate to better performance

![Refer to caption](https://arxiv.org/html/2605.26396v1/figures/images/fig_level_heatmap_gold_last1.png)\(a\)Gold affordance typicality level\.
![Refer to caption](https://arxiv.org/html/2605.26396v1/figures/images/fig_turns_by_level_last1.png)\(b\)Average number of exploration turns\.

Figure 5:Impact of affordance typicality on performance and exploration\.Performance does not improve with higher affordance typicality\. Although more typical affordances \(Lv 3–5\) induce longer exploration, they do not lead to higher accuracy, suggesting greater ambiguity among plausible candidates and persistent limitations in fine\-grained visual grounding\.Contrary to the expectation that more natural or common affordances should be easier, we observe an inverse trend in[Figure˜5\(a\)](https://arxiv.org/html/2605.26396#S5.F5.sf1): performance does not consistently improve as affordance typicality increases\. Across models, higher\-level affordances \(Lv 3–5\), which correspond to more natural and commonly repurposed uses, do not yield better gold correctness than lower\-level, more atypical affordances\. This suggests that increasing ground truth typicality does not necessarily reduce the difficulty of identifying the correct tool–affordance pair\.

At the same time, the average number of exploration turns generally increases with affordance typicality, indicating that models tend to produce longer reasoning chains when the target affordance appears more natural\. This behavior suggests that models may over\-explore or consider a broader set of plausible candidates, rather than confidently converging on the correct one\. One possible explanation is that higher\-typicality affordances create greater functional overlap among candidate entities, making it harder to distinguish the gold part from other plausible parts\. Since current models remain weak at fine\-grained attribute–affordance grounding, this additional ambiguity leads to prolonged exploration without corresponding gains in accuracy\.

Together, these results indicate that more natural or familiar affordances do not necessarily simplify the task\. Instead, they can introduce additional ambiguity by increasing the number of plausible candidate entities and parts\. This further reinforces that the primary bottleneck is not exploration capacity alone, but the ability to reliably distinguish among candidates that share similar affordance structures using visual and physical evidence\.

Takeaway 4More typical affordances lead to longer exploration but not better performance, highlighting that ambiguity among plausible candidates, rather than rarity alone, drives failures under weak visual grounding\.

### 5\.3Impact of visual grounding and interaction dynamics

![Refer to caption](https://arxiv.org/html/2605.26396v1/figures/images/fig_img_comparison_gold.png)Figure 6:Gold correct rate across different input image conditions\.We further analyze the impact of visual input under different image conditions\. As shown in[Figure˜6](https://arxiv.org/html/2605.26396#S5.F6), performance generally improves when visual information is available: theNo Imagesetting generally yields the lowest gold correct rate across models, showing that MM\-CreativityBench requires visual grounding and cannot be reliably solved from text description alone\. Providing visual context, either through theLast ImageorAll Imagescondition, leads to gains, although the magnitude of improvement varies across models\.

Notably, models with longer average interaction horizons, such as Qwen3\-VL \(13\.45 turns for 8B and 8\.77 turns for 32B\) and InternVL \(6\.99 turns for 38B\), benefit more from theLast Imagecondition, which often matches or outperforms theAll Imagessetting\. This suggests that when a model can iteratively refine its belief over candidate entity–part pairs\(e,p\)\(e,p\), access to the most recent and task\-relevant visual observationItI\_\{t\}is often sufficient for grounded decision\-making\. Overall, these results show that grounded creative problem solving depends on bothaccess to visual evidenceand theability to incorporate it through interaction\.

### 5\.4Impact of prompting and format strategy in training

ModelGoldCorrectEntityCorrectTurnsAvg\. Distinct ExploredGold Entity Explored Before AnswerGold Part Explored Before AnswerEntitiesPartsEntity CorrectEntity WrongPart CorrectPart WrongQwen3\-4B\-VL\-Instruct0\.1560\.39318\.9223\.9374\.4170\.9470\.5540\.9230\.167\+ SFT0\.2520\.48917\.0396\.6409\.0391\.0001\.0000\.9050\.309\+ DPO \(normal negative\)0\.2100\.44124\.5244\.6318\.8540\.9730\.7510\.9710\.264\+ DPO \(hard negative\)0\.2490\.48312\.8644\.1234\.0600\.9690\.6370\.9160\.141\+ SFT \+ DPO \(hard negative\)0\.3840\.5748\.6753\.2232\.4280\.9690\.4040\.8280\.044Qwen3\-8B\-VL\-Instruct0\.1920\.44113\.4504\.9793\.7660\.9930\.7470\.9530\.201\+ SFT0\.2340\.42915\.9466\.6467\.9701\.0000\.9950\.9490\.310\+ DPO \(normal negative\)0\.2550\.47718\.8515\.6965\.8540\.9940\.8410\.9760\.197\+ DPO \(hard negative\)0\.2700\.4989\.9224\.7992\.8380\.9880\.7490\.9440\.107\+ SFT \+ DPO \(hard negative\)0\.3450\.5718\.1364\.3642\.3341\.0000\.6060\.9300\.083

Table 5:Varying the prompt to require pure JSON outputs does not change the overall trend\. Across both raw and SFT settings, SFT \+ DPO consistently achieves the highest gold\-correct rate while generally requiring fewer turns, suggesting that training leads to more effective and targeted exploration\.To examine whether the prompting format affects evaluation outcomes, we compare the original prompting setting, where the model first performs free\-form reasoning and then emits a structured JSON action, with a stricter pure\-JSON variant, where the model is instructed to place both reasoning and the next\-step decision inside a JSON object\. As shown in[Table˜5](https://arxiv.org/html/2605.26396#S5.T5), overall,the two settings exhibit highly consistent trends\. In both prompt formats, training improves the base models substantially, and the strongest performance is obtained by the two\-stage SFT\+DPO setting with hard negatives\. For Qwen3\-4B\-VL, SFT\+DPO achieves the best gold\-correct rate under both prompts, increasing from 0\.156 to 0\.417 in the original setting and from 0\.156 to 0\.384 in the pure\-JSON setting\. Similarly, for Qwen3\-8B\-VL, SFT\+DPO remains the best\-performing method, reaching 0\.393 under the original prompt and 0\.345 under the pure\-JSON prompt\. These results indicate that the observed gains are not merely artifacts of a particular output format; rather, they reflect a robust improvement in the model’s ability to conduct targeted exploration and produce correct final answers\.

At the same time, the prompt variation introduces some quantitative shifts in behavior\. The pure\-JSON prompt slightly changes the balance between exploration breadth and answer efficiency\. For example, the 4B SFT\+DPO model uses more turns under the pure\-JSON setting than under the original setting, increasing from 6\.211 to 8\.675 turns, while still maintaining a strong gold\-correct rate\. The 8B SFT\+DPO model shows a similar but smaller pattern, with turns remaining nearly unchanged while the gold\-correct rate decreases moderately from 0\.393 to 0\.345\. Pure\-JSON prompting also tends to preserve the relative advantage of hard\-negative DPO over normal\-negative DPO, especially in reducing excessive exploration and improving final\-answer accuracy\. Across both tables, correct predictions are still associated with substantially higher rates of exploring the gold entity and gold part before answering, whereas wrong predictions show much lower grounding rates\. This suggests that the central mechanism remains unchanged across prompting formats: successful models are those that identify and inspect the relevant visual evidence before committing to an answer\. Therefore, although enforcing a pure\-JSON format can slightly affect absolute scores and exploration patterns, it does not alter the main conclusion thatSFT\+DPO with hard negatives yields more effective, better\-grounded, and more efficient interactive exploration\.

![Refer to caption](https://arxiv.org/html/2605.26396v1/figures/images/repetition_trend.png)Figure 7:Exploration repetition ratesacross the base and trained 4B and 8B models\. SFT and SFT\+DPO substantially reduce repetition, indicating clearer state tracking and more effective, efficient exploration\.![Refer to caption](https://arxiv.org/html/2605.26396v1/figures/images/similarity_combined_main.png)Figure 8:Exploration progress and similarity densityfor Qwen3\-VL variants\. Left: Qwen3\-4B timing distributions for gold and similar entity/part explorations\. Right: average part/entity similarity density for Qwen3\-8B and Qwen3\-4B across Raw, SFT, DPO, and SFT\+DPO variants\.
### 5\.5How training improves exploration efficiency and semantic focus

We evaluate whether training improves not only final accuracy but also the quality of intermediate exploration, using three trajectory\-level metrics\.Repetition ratemeasures the percentage of tasks in which the model revisits an already explored entity or part; lower values indicate better state tracking and fewer wasted turns\.Similarity densitymeasures how concentrated exploration is around useful hypotheses: part\-level density is the fraction of explored parts that are gold or affordance\-similar to the gold part, while entity\-level density is the fraction of explored entities that are gold or contain at least one affordance\-similar part\. Finally,exploration progressrecords each discovery by its normalized turn index,turn/total turns\\text\{turn\}/\\text\{total turns\}, indicating when the model identifies gold or affordance\-similar candidates\.

[Figure˜7](https://arxiv.org/html/2605.26396#S5.F7)shows that SFT, DPO with hard negatives, and SFT\+DPO all substantially reduce redundant exploration\. For example, on Qwen3\-4B, SFT\+DPO reduces part repetition from 46\.9% to 9\.0%, and entity repetition from 12\.6% to 1\.8%; a similar trend holds for Qwen3\-8B, where part repetition drops from 30\.0% to 7\.8%\. These reductions suggest that our training improves not only final accuracy but also the model’s ability tomaintain an exploration stateand avoid revisiting inspected entities or parts\. This likely comes from our positive\-data construction, which explicitly includes an exploration stack and thereby supervises state tracking during action selection\. In contrast, DPO with normal negatives is less stable without this structured SFT prior, often leading to more repeated and inefficient exploration\.

[Figure˜8](https://arxiv.org/html/2605.26396#S5.F8)further explains why SFT\+DPO is preferable to SFT alone\. As shown in the right four panels, SFT reduces redundant exploration but can also make the search narrow, resulting in lower part and entity similarity density than the base model\. Adding DPO restores semantic focus, helping the model prioritize affordance\-relevant candidates; for example, SFT\+DPO achieves the highest part similarity density for both Qwen3\-4B and Qwen3\-8B, reaching 0\.594 and 0\.605, respectively\. The progress curves show that this gain does not come from premature guessing: unlike other variants that mostly find useful candidates early, SFT\+DPO continues to discover similar entities and parts throughout the trajectory, especially during later part\-level exploration\. Overall, the two stages are complementary:SFT teaches disciplined, non\-redundant exploration, while DPO redirects that exploration toward semantically useful hypotheses\.

Takeaway 5SFT\+DPO produces the strongest exploration behavior: it sharply reduces repeated queries while increasing the density of affordance\-relevant entities and parts explored\. This suggests that the final model is not only more accurate, but also searches more deliberately and efficiently\.

![Refer to caption](https://arxiv.org/html/2605.26396v1/figures/images/error_analysis.png)Figure 9:Primary error\-category ratesreveal that SFT\+DPO substantially reduces physical/functional invalidity errors, especially affordance mismatch \(A2\), and removes most practical and risk/constraint errors\.
### 5\.6Error Category Analysis

We categorize incorrect predictions by their primary failure reason\. Category A coversphysical or functional invalidity: hallucinated affordances \(A1\), affordance mismatch caused by unsuitable geometry/material/mechanics \(A2\), and performance shortfalls where the predicted part is only partially suitable but lacks sufficient stability, reach, capacity, precision, or retention \(A3\)\. Category B coverspractical infeasibility, including destructive workarounds \(B1\) and context/accessibility issues \(B2\)\. Category C coversrisk or requirement mismatch, including safety or damage risks \(C1\) and explicit constraint violations \(C2\)\. We use GPT\-5\.4 for scalable automatic categorization; details are provided in[Section˜G\.1](https://arxiv.org/html/2605.26396#A7.SS1)\.

[Figure˜9](https://arxiv.org/html/2605.26396#S5.F9)shows that SFT\+DPO primarily reduces the dominant failure mode: errors in physical affordance grounding\. For Qwen3\-8B, total Category A errors decrease from 45\.3% in the base model to 25\.8% after SFT\+DPO, while Category B and C errors are eliminated\. The largest reduction occurs in A2, affordance mismatch, which drops from 31\.2% to 18\.9%\. This indicates that the trained model is less likely to select parts whose shape, material, or mechanics are incompatible with the intended use\. Qwen3\-4B exhibits the same trend\. Thus, the improvement is not merely a broad accuracy gain or the removal of rare practical and safety mistakes; it directly targets the main bottleneck in the task\. The remaining errors are still mostly A\-type, suggesting that physical affordance judgment remains challenging, but SFT\+DPO substantially improves the model’s ability to choose mechanically plausible parts\.

### 5\.7Case Study: What SFT\+DPO Repairs in Interactive Image Reasoning

We analyze three representative cases from the evaluation set\. These cases are selected to isolate distinct effects of training: \(i\) grounding a solution in material and contact attributes, \(ii\) distinguishing a local anti\-slip cue from the global geometry required by the task, and \(iii\) selecting the safe, task\-relevant part within an otherwise plausible tool\. See[Section˜G\.2](https://arxiv.org/html/2605.26396#A7.SS2)for the key original responses, feedback, and images from the decisive turns\.

Case 1: pressure\-spreading pad vs\. generic softness\.

- •Task\.A metal towel hook is pressing into painted bathroom drywall, and the user needs a small protector to prevent dents or scratches\. The gold answer iscurved tension shower curtain rod/non\_slip\_end\_pads\.
- •Bad trace behavior\.Base 8B fixates on the towel after inspectingmicrofiber\_pile\_surface\. This answer is superficially plausible because a towel is soft, but the reasoning trace is flawed: it treats*softness*as the entire relevant affordance\. The model never inspects the shower rod’snon\_slip\_end\_pads, and it incorrectly rejects the shower rod as rigid even though the relevant subpart is rubber\.
- •Good trace behavior\.SFT\+DPO 8B initially inspects the same towel part, but continues searching instead of stopping at the first plausible cue\. It then inspects the shower rod, discoversnon\_slip\_end\_pads, and examines that subpart directly\. The trained answer compares the two candidates: the towel is soft, but the rubber pad is smaller, more durable, higher\-friction, and better suited to remaining fixed at the pressure point\.
- •Why the trained trajectory is better\.The trained model grounds its solution in the physical contact model of the task: a hard hook creates a localized point load on painted drywall\. A bulky towel may bunch, compress, or slip, whereas a rubber pad can remain at the contact point and distribute force more effectively\.Representative improved capability: material/contact attribute grounding\.

Case 2: straight\-edge geometry vs\. local anti\-slip grip\.

- •Task\.The user needs to trim wrapping paper without a ruler or cutting mat, while keeping the paper from slipping during marking\. The gold answer isunder\-bed storage bin with zipper lid/lid\_panel\.
- •Bad trace behavior\.Base 4B repeatedly inspects the penrubber\_grip\_sleeveand nearby mask parts until the 50\-turn budget is exhausted\. Base 8B makes the same error more quickly\. Both models solve only a local subproblem: a small rubber grip may increase friction at a point of contact, but it cannot provide the long straight edge or stable backing surface required for trimming a sheet of paper\.
- •Good trace behavior\.SFT\+DPO 8B immediately inspects the storage bin and itslid\_panel\. The feedback identifies the part as a semi\-rigid, sturdy, non\-elastic panel with an internal polypropylene stiffener\. The trained answer therefore uses this part as both a flat backing surface and an alignment guide\.
- •Why the trained trajectory is better\.The trained model recovers the global geometry of the task\. The user needs support and alignment for an extended sheet, not merely a high\-friction contact patch\.Representative improved capability: interaction planning and part\-level geometry\.

Case 3: safe multi\-tooth probing vs\. sharp scraping\.

- •Task\.Damp hair\-and\-soap buildup blocks a narrow sink overflow slot, and the user needs to loosen the material enough to rinse it out\. The gold answer iselectric beard trimmer with adjustable guard/adjustable\_guard\_comb\.
- •Bad trace behavior\.Base 4B first considers the trimmer blade head, then inspects a double\-edge razor and answers withdouble\_edge\_blade\. Base 8B remains within the trimmer entity but still choosescutting\_blade\_head\. Both traces reveal a sharp\-tool prior: the models assume that a narrow clogged slot calls for scraping or cutting, even though the target material is soft buildup in a constrained opening\.
- •Good trace behavior\.SFT\+DPO 4B also explores the tempting razor blade, but it does not stop there\. After receiving feedback on the razor, it inspects the trimmer and then theadjustable\_guard\_comb\. The final answer selects the comb because its rigid, fine plastic teeth are exposed, fit the narrow opening, and can wiggle or rake soft buildup without the risks associated with a blade\.
- •Why the trained trajectory is better\.This case illustrates an exact\-part and safety repair\. The trimmer blade head and guard comb belong to the same entity, but they imply different actions, contact mechanics, and risks\. The trained model selects the part whose geometry and material match the desired interaction while avoiding unnecessary sharpness\.Representative improved capability: exact\-part discrimination under safety and state constraints\.

## 6Discussion

Difference between Creativity and Hallucination\.In MM\-CreativityBench, creativity is not defined as unconstrained novelty, but as the ability to discover non\-obvious uses of visually available objects through physically grounded affordance reasoning\. This makes it fundamentally different from the kind of imaginative generation that may be valuable in creative writing, design ideation, or open\-ended research, where productive “hallucination” can sometimes serve as a source of exploration\. In our setting, hallucination is not a creative act but a grounding failure: the model invents an unsupported attribute, assumes an unseen part, or maps a plausible function onto an object whose geometry, material, state, or accessibility does not justify that use\. The useful form of imagination here is therefore conditional and verifiable\. A model may hypothesize that a serrated edge could cut tape or that a rubber pad could protect a wall, but this hypothesis must be checked against the inspected visual evidence and the task constraints\. Thus, MM\-CreativityBench studiesevidence\-seeking creativity: novelty arises from recombining observed object properties with task goals, while validity depends on sustained interaction, part\-level inspection, and physical plausibility\. This distinction becomes increasingly important for embodied agents, where hallucinated affordances are not merely incorrect answers but potentially unsafe actions in the physical world\.

Enhancement of Model Creativity\.Our results suggest that improving grounded creativity requires training models not only to produce novel final answers, but to acquire better exploration and verification policies\. Standard outcome\-driven training can reward a correct\-looking solution without teaching the model how to inspect the scene, compare competing affordances, or reject superficially plausible but unsupported evidence\. In MM\-CreativityBench, SFT provides a useful first step by teaching structured exploration over entities and parts, while hard\-negative DPO further teaches the model to discriminate grounded trajectories from fluent but misleading ones\. This points toward a broader direction for future RL: creativity should be optimized as an interactive evidence\-gathering process rather than as single\-turn answer selection\. A suitable objective would reward information\-seeking actions, part\-sensitive visual grounding, causal or mechanical consistency, and timely commitment once sufficient evidence has been collected, while penalizing unsupported affordance claims, redundant exploration, and premature answers\. The affordance knowledge base provides a practical substrate for such training because it can generate positive trajectories, plausible hard negatives, and fine\-grained attribute–affordance contrasts\. More broadly, future creative agents may need RL objectives that preserve diversity in hypothesis generation while enforcing strict grounding at the point of action, allowing models to explore unusual solutions without drifting into hallucinated physical assumptions\.

## 7Conclusion

We introduced MM\-CreativityBench, a benchmark for evaluating visually grounded creative tool repurposing in multimodal environments\. By requiring models to interactively inspect scenes, entities, and parts, the benchmark reveals that current LMMs often struggle with grounding creative solutions in fine\-grained visual and physical evidence\. We further showed that affordance\-grounded alignment, especially with hard negative preference signals, can improve both accuracy and exploration efficiency\. Looking forward, we hope MM\-CreativityBench will support future work on multimodal agents that can reason more robustly about physical affordances, adapt to unfamiliar environments, and solve open\-ended problems through grounded exploration rather than surface\-level plausibility\.

## References

## Appendix

## Appendix ASignificance, Scope, and Clarifications

### A\.1Why MM\-CreativityBench Matters

A Missing Dimension in Multimodal Evaluation\.Recent progress in large multimodal models has been evaluated primarily through recognition, visual question answering, spatial reasoning, and instruction following\. These settings are important, but they do not fully test whether a model can use perception as evidence for discovering non\-obvious but physically feasible solutions\. MM\-CreativityBench focuses on this under\-measured capability:visually grounded creative tool repurposing\. The central question is not whether a model can describe a scene or generate a plausible solution in language, but whether it can inspect the environment, identify the relevant object and part, and justify how observable physical properties support an unconventional use\. This makes the benchmark a focused test of creative intelligence under perceptual and physical constraints\.

From Static Image Understanding to Evidence\-Driven Exploration\.A key contribution of MM\-CreativityBench is that it moves beyond static image\-to\-answer evaluation\. In real\-world problem solving, an agent often does not know in advance which object or region matters\. It must search, inspect, compare, and revise its hypothesis before committing to a solution\. Our interactive protocol captures this process by allowing models to inspect the scene, candidate entities, and zoomed\-in parts\. This design makes it possible to evaluate not only the final answer, but also whether the answer emerges fromgrounded explorationrather than unsupported guessing\. In this sense, the benchmark evaluates creative tool use as a process, not merely as a final textual output\.

Part\-Level Grounding as the Core Challenge\.MM\-CreativityBench is designed to expose a specific weakness in current LMMs: models often identify a generally relevant object but fail to determine which part, attribute, or visual cue actually enables the intended use\. This distinction is important because creative repurposing depends on mechanism\-level reasoning\. A key’s usefulness for opening a taped box, for example, does not come from the object category “key” alone, but from a visually grounded property such as a thin, rigid, serrated edge\. By requiring models to ground answers at the entity–part level, MM\-CreativityBench separates coarse semantic plausibility from genuine physical affordance reasoning\.

A Controlled Testbed for Grounded Creative Reasoning\.The benchmark is built on a structured affordance knowledge base that links entities, parts, attributes, and affordances\. This structure gives the benchmark several advantages\. First, it supports systematic task construction while preserving interpretable solution paths\. Second, it enables controlled multimodal augmentation at multiple levels of granularity, including scene images, entity views, and part\-level close\-ups\. Third, it makes failure analysis more diagnostic: when a model fails, we can ask whether it overlooked the correct entity, inspected the wrong part, misread visual evidence, hallucinated an attribute, or failed to connect an observed attribute to an affordance\. This level of diagnosis is difficult in fully open\-ended creativity benchmarks\.

Connecting Evaluation with Model Improvement\.MM\-CreativityBench is not only an evaluation benchmark\. It also provides a framework for studying how grounded creative behavior can be improved\. Our affordance\-aware alignment results show that models can learn more effective exploration and more reliable attribute–affordance reasoning when trained with structured positive trajectories and contrastive negative trajectories\. The goal is not to claim that SFT or DPO is a new optimization algorithm, but to show thataffordance\-grounded supervisionis a useful training signal for multimodal creative problem solving\. This connection between benchmark design, failure analysis, and targeted alignment makes the benchmark valuable as a research tool rather than only as a leaderboard\.

Implications for Future Multimodal Agents\.The broader significance of MM\-CreativityBench lies in its relevance to adaptive agents\. Agents operating in homes, labs, workshops, or other resource\-limited environments will often need to repurpose available objects rather than rely on predefined tools\. Such behavior requires visual grounding, physical commonsense, comparative search, and flexible recombination of affordances\. MM\-CreativityBench isolates this ability in a controlled and reproducible form\. It therefore offers a concrete step toward evaluating and improving multimodal systems that can solve unfamiliar problems using evidence from their surroundings\.

### A\.2Clarifications of Concerns

MM\-CreativityBench Targets Constrained Creativity, Not All Creativity\.Creativity is broad, and we do not claim that creative intelligence can be fully captured by a single benchmark\. Our focus is deliberately narrower:constrained, visually grounded tool repurposing\. In this setting, a solution must be novel relative to canonical object use, but also physically feasible and supported by visual evidence\. This operational definition is useful precisely because it avoids the ambiguity of fully open\-ended creativity evaluation\. Rather than asking whether a model is creative in the abstract, MM\-CreativityBench asks whether it can discover a usable object–part solution under explicit perceptual and functional constraints\.

Why This Is More Than Physical Commonsense Retrieval\.Physical commonsense is a necessary ingredient, but the benchmark requires more than retrieving a familiar object\-use association\. Each task requires the model to connect a goal with a specific object, a specific part, observable attributes, and a feasible mechanism of use\. The model must also make this connection through an interactive visual process\. This is why performance drops substantially when moving from entity\-level correctness to gold entity–part correctness: models often know which object category is plausible, but fail to ground the precise part and attribute that make the solution work\. The benchmark therefore targetsmechanism\-sensitive visual repurposing, not simple commonsense recall\.

Synthetic Images as Controlled Visual Grounding, Not a Claim of Full Realism\.The use of generated images is a methodological choice for controlled evaluation\. Real images would introduce substantial noise in object availability, occlusion, scale, lighting, and part visibility, making it difficult to isolate the reasoning problem\. In contrast, generated scenes allow us to construct environments where candidate entities are present, parts can be inspected, and the underlying affordance structure is known\. We do not claim that generated images replace real\-world embodied evaluation\. Rather, they provide a reproducible intermediate setting that tests whether models can use visual evidence when the relevant physical cues are available\. This makes MM\-CreativityBench a diagnostic benchmark for visual\-affordance reasoning, complementary to future evaluations in real physical environments\.

Image Generation Does Not Turn the Task into Visual Leakage\.A possible concern is that generated images may make the gold answer visually obvious\. Our intention is the opposite: image generation is used to make relevant physical evidence inspectable, not to mark the solution\. The model still needs to search among multiple entities, inspect candidate parts, and infer which observed attributes support the target affordance\. The scene is not allowed to depict task execution, and the interaction protocol requires the model to justify the final answer through object\-specific evidence\. More generally, MM\-CreativityBench evaluates whether a model can transform visible properties into functional hypotheses; making those properties visible is necessary for testing grounding rather than language\-only guessing\.

Single\-Gold Evaluation as Measurement Control\.Creative tool use is naturally open\-ended, and multiple valid solutions may exist in the real world\. However, benchmark evaluation requires a controlled target so that models can be compared consistently\. The gold answer in MM\-CreativityBench should therefore be understood as a verified solution path, not as a claim that no other solution could ever work\. The strict gold metric is intentionally conservative: it measures whether the model recovers the intended object–part mechanism under the constructed scene and constraints\. Entity\-level accuracy, exploration statistics, and grounding metrics complement this strict score by showing where the reasoning process succeeds or fails\. Thus, the single\-gold structure supports measurement clarity while preserving the broader view that creativity can admit multiple solutions\.

The Interactive Protocol Is Artificial by Design, but Diagnostic\.The entity and part inspection interface abstracts away low\-level perception problems such as segmentation and camera control\. This may appear less realistic than open\-world embodied interaction, but it is useful for isolating the central question of the paper: can a model conduct evidence\-driven creative reasoning once the environment is inspectable? By structuring interaction into scene, entity, and part views, the benchmark makes exploration behavior observable and comparable across models\. This design reduces confounds from visual localization failures and allows us to measure whether models actually inspect the evidence needed for their final answers\. The protocol should be viewed as a controlled diagnostic environment, not as a complete simulation of embodied deployment\.

Affordance\-Knowledge Supervision Is a Research Signal, Not Test\-Time Privilege\.The affordance knowledge base is used to construct tasks and training trajectories, but the evaluation protocol does not give models access to gold affordance labels or hidden solution paths\. At test time, models must operate from the visible scene, entity names, part names, and inspected images\. The purpose of knowledge\-base\-derived supervision is to teach models useful intermediate behavior: how to search, how to compare candidate parts, and how to reject visually unsupported affordances\. This is analogous to using structured annotations to train better visual reasoning models\. The important question is not whether the supervision contains knowledge, but whether that knowledge improves unguided visual\-affordance reasoning on disjoint test instances\.

SFT and DPO Are Not Claimed as Algorithmic Novelty\.Our contribution is not a new fine\-tuning objective\. SFT and DPO are used as established tools to test a specific hypothesis: grounded creative tool use can be improved by aligning models toward visually supported attribute–affordance reasoning and away from plausible but ungrounded alternatives\. The novelty lies in the problem formulation, the interactive visual benchmark, the construction of affordance\-grounded trajectories, and the use of hard negative trajectories that target hallucinated or misleading affordance reasoning\. In this sense, the training experiments serve as evidence that the benchmark identifies a learnable failure mode, rather than merely reporting that current models perform poorly\.

Cross\-Model Comparisons Should Be Interpreted Structurally\.Absolute scores can be affected by prompting, model family, decoding details, and the interaction interface\. Therefore, the main conclusion should not rest on a brittle ranking between two models\. The more important result is the recurring structural pattern: models often perform better at coarse entity localization than at fine\-grained part grounding; longer exploration does not necessarily produce better answers; visually plausible but unsupported reasoning remains common; and targeted affordance\-aware alignment improves both accuracy and efficiency\. These patterns are more informative than any single leaderboard position, because they reveal a shared limitation in current LMMs\.

Final Take Away\.MM\-CreativityBench is intentionally controlled, visually grounded, and diagnostic\. It does not claim to cover all forms of creativity, nor does it replace real\-world embodied evaluation\. Its contribution is more precise: it isolates a practically important form of creative problem solving, formalizes it through entity–part–attribute affordance structure, and shows that current LMMs still struggle to connect visual evidence with non\-canonical physical use\. By combining benchmark construction, interactive evaluation, failure analysis, and affordance\-aware alignment, MM\-CreativityBench positions creative tool use as a concrete and measurable frontier for multimodal AI\.

## Appendix BPreliminary Experiment

Our preliminary experiment is designed as a controlled comparison between two prompting strategies on 100 creative tool\-use tasks sampled from MacGyver\[tian2024macgyver\]\. To evaluate the creativity under a multimodal environment, where the model directly perceives necessary physical attributes from the input image and reasons about creative tool repurposing, we first generate a scenario image using Gemini\-2\.5\-Pro\. Then, update the task description to only include the constraints that cannot be represented visually\. In thedirect promptsetting, given an input task and scenario image, the model is asked to propose a feasible solution under the task constraints without any prescribed intermediate reasoning steps, thereby testing its implicit ability to connect task requirements with tool functions\. In thestructured affordance\-level CoTsetting, the model is instead guided through an explicit reasoning pipeline that includes listing the available tools, decomposing each tool into parts, inferring relevant physical properties, deriving possible affordances, justifying each action step, and validating the final solution against the stated constraints\.

We evaluate outputs using six criteria:Correctness,Feasibility,Physical Grounding,Constraint Coverage,Tool Usage, andCreativity, under pairwise relative comparison between the two prompting strategies\. Please see the prompts below for more details\. We use GPT\-4\.1\-mini as the target model and GPT\-5\.2 as the judge model, employing temperature 0\.0 to guarantee deterministic outputs\. All of other settings we follow the original MacGyver paper’s protocols\.

Scenario Image Generation Prompt[⬇](data:text/plain;base64,Q3JlYXRlIG9uZSByZWFsaXN0aWMgc2NlbmUgdGhhdCBzaG93cyBvbmx5IHRoZSBpbml0aWFsIHNldHVwIG9mIHRoaXMgcHJvYmxlbSBiZWZvcmUgYW55IGFjdGlvbnMgYXJlIHRha2VuLiBGb2N1cyBvbiB0aGUgZW52aXJvbm1lbnQsIHJlbGV2YW50IG9iamVjdHMsIGFuZCBjb25zdHJhaW50cy4gRG8gbm90IGluY2x1ZGUgdGV4dCBvdmVybGF5cywgbGFiZWxzLCBvciBzb2x1dGlvbiBzdGVwcy4KClByb2JsZW06IHt0YXNrIGRlc2NyaXB0aW9uIGZyb20gTWFjR3l2ZXJ9)Createonerealisticscenethatshowsonlytheinitialsetupofthisproblembeforeanyactionsaretaken\.Focusontheenvironment,relevantobjects,andconstraints\.Donotincludetextoverlays,labels,orsolutionsteps\.Problem:\{taskdescriptionfromMacGyver\}

Direct Prompt[⬇](data:text/plain;base64,WW91IGFyZSBhbiBleHBlcnQgYXQgY3JlYXRpdmUgcGh5c2ljYWwgdG9vbC11c2UgcmVhc29uaW5nLgpHaXZlbiB0aGUgdGFzayBkZXNjcmlwdGlvbiBiZWxvdyBhbmQgaW1hZ2UsIHByb2R1Y2UgYSBmZWFzaWJsZSBzb2x1dGlvbi4KClJ1bGVzOgoxLiBVc2Ugb25seSB0b29scy9pdGVtcyBleHBsaWNpdGx5IGF2YWlsYWJsZSBpbiB0aGUgaW5wdXQgaW1hZ2UuCjIuIFJlc3BlY3QgcGh5c2ljYWwgY29uc3RyYWludHMgaWYgdGhlIHRhc2sgZGVzY3JpcHRpb24gcmVzdHJpY3RzIHBoeXNpY2FsIGF0dHJpYnV0ZXMgb3IgaW1hZ2Ugc2hvd3MgcGh5c2ljYWwgY29uc3RyYWludHMgc3VjaCBhcyBzaXplIG9yIHN0YXRlLgozLiBJbnZlbnQgbmV3IHRvb2xzIHVzaW5nIHRvb2xzL2l0ZW1zIGV4cGxpY2l0bHkgYXZhaWxhYmxlIGluIHRoZSBpbnB1dCBpbWFnZSBpZiBpdCBpcyBuZWVkZWQuCjQuIFByb3ZpZGUgcHJhY3RpY2FsIHN0ZXBzIHRoYXQgY2FuIGFjdHVhbGx5IGJlIGV4ZWN1dGVkLgo1LiBEbyBub3QgaW52b2x2ZSBhbnkgdW5uZWNlc3Nhcnkgc3RlcHMgdG8gYWNoaWV2ZSB0aGUgdGFzaydzIGdvYWwuCjYuIElmIGEgY29tcGxldGUgc29sdXRpb24gaXMgaW1wb3NzaWJsZSwgcmV0dXJuIHRoZSBiZXN0IHBhcnRpYWwgcGxhbiBhbmQgZXhwbGFpbiB3aHkgaXQgY2Fubm90IGJlIGNvbXBsZXRlZC4KClRBU0sgREVTQ1JJUFRJT046Cntwcm9ibGVtfQoKUmV0dXJuIEpTT046Cnt7CiAgInNvbHZhYmxlIjogIlllcyBvciBObyIsCiAgInNvbHZhYmxlX2V4cGxhbmF0aW9uIjogIjEtMyBzZW50ZW5jZXMgYWJvdXQgd2h5IHRoZSBnaXZlbiB0YXNrIGlzIHNvbHZhYmxlIG9yIG5vdCIsCiAgInNvbHV0aW9uX3N0ZXBzIjogWyJTdGVwIDE6IC4uLiIsICJTdGVwIDI6IC4uLiIsIC4uLl0sCiAgImZpbmFsX3NvbHV0aW9uIjogIk9uZSBjb25jaXNlIHBhcmFncmFwaCB0aGF0IHN1bW1hcml6ZXMgdGhlIGZ1bGwgYXBwcm9hY2guIiwKICAidXNlZF90b29scyI6IFsidG9vbCAxIiwgInRvb2wgMiIsIC4uLl0sCiAgImNvbnN0cmFpbnRfaGFuZGxpbmciOiBbe3siY29uc3RyYWludCI6ICIuLi4iLCAiaGFuZGxpbmciOiAiLi4uIn19LCAuLi5dCn19)Youareanexpertatcreativephysicaltool\-usereasoning\.Giventhetaskdescriptionbelowandimage,produceafeasiblesolution\.Rules:1\.Useonlytools/itemsexplicitlyavailableintheinputimage\.2\.Respectphysicalconstraintsifthetaskdescriptionrestrictsphysicalattributesorimageshowsphysicalconstraintssuchassizeorstate\.3\.Inventnewtoolsusingtools/itemsexplicitlyavailableintheinputimageifitisneeded\.4\.Providepracticalstepsthatcanactuallybeexecuted\.5\.Donotinvolveanyunnecessarystepstoachievethetask'sgoal\.6\.Ifacompletesolutionisimpossible,returnthebestpartialplanandexplainwhyitcannotbecompleted\.TASKDESCRIPTION:\{problem\}ReturnJSON:\{\{"solvable":"YesorNo","solvable\_explanation":"1\-3sentencesaboutwhythegiventaskissolvableornot","solution\_steps":\["Step1:\.\.\.","Step2:\.\.\.",\.\.\.\],"final\_solution":"Oneconciseparagraphthatsummarizesthefullapproach\.","used\_tools":\["tool1","tool2",\.\.\.\],"constraint\_handling":\[\{\{"constraint":"\.\.\.","handling":"\.\.\."\}\},\.\.\.\]\}\}

CoT Prompt[⬇](data:text/plain;base64,WW91IGFyZSBhbiBleHBlcnQgYXQgY3JlYXRpdmUgcGh5c2ljYWwgdG9vbC11c2UgcmVhc29uaW5nLgpTb2x2ZSB0aGUgdGFzayBieSBleHBsaWNpdGx5IHJlYXNvbmluZyBvdmVyIHRvb2wgcGFydHMgYW5kIGFmZm9yZGFuY2VzIHVuZGVyIGNvbnN0cmFpbnRzLgoKUnVsZXM6CjEuIFVzZSBvbmx5IHRvb2xzL2l0ZW1zIGV4cGxpY2l0bHkgYXZhaWxhYmxlIGluIHRoZSBpbnB1dCBpbWFnZS4KMi4gUmVzcGVjdCBwaHlzaWNhbCBjb25zdHJhaW50cyBpZiB0aGUgdGFzayBkZXNjcmlwdGlvbiByZXN0cmljdHMgcGh5c2ljYWwgYXR0cmlidXRlcyBvciBpbWFnZSBzaG93cyBwaHlzaWNhbCBjb25zdHJhaW50cyBzdWNoIGFzIHNpemUgb3Igc3RhdGUuCjMuIEludmVudCBuZXcgdG9vbHMgdXNpbmcgdG9vbHMvaXRlbXMgZXhwbGljaXRseSBhdmFpbGFibGUgaW4gdGhlIGlucHV0IGltYWdlIGlmIGl0IGlzIG5lZWRlZC4KNC4gUHJvdmlkZSBwcmFjdGljYWwgc3RlcHMgdGhhdCBjYW4gYWN0dWFsbHkgYmUgZXhlY3V0ZWQuCjUuIERvIG5vdCBpbnZvbHZlIGFueSB1bm5lY2Vzc2FyeSBzdGVwcyB0byBhY2hpZXZlIHRoZSB0YXNrJ3MgZ29hbC4KNi4gSWYgYSBjb21wbGV0ZSBzb2x1dGlvbiBpcyBpbXBvc3NpYmxlLCByZXR1cm4gdGhlIGJlc3QgcGFydGlhbCBwbGFuIGFuZCBleHBsYWluIHdoeSBpdCBjYW5ub3QgYmUgY29tcGxldGVkLgoKUmVxdWlyZWQgcmVhc29uaW5nIHByb2NlZHVyZToKMS4gU3RhdGUgdGhlIHRhc2sgZ29hbCBhbmQgY29uY3JldGUgc3VjY2VzcyBjb25kaXRpb24uCjIuIExpc3QgYWxsIGF2YWlsYWJsZSB0b29scy9pdGVtcyBmcm9tIHRoZSBpbnB1dCBpbWFnZSAobm8gYWRkaXRpb25zKS4KMy4gRm9yIGVhY2ggcmVsZXZhbnQgdG9vbCwgaWRlbnRpZnkgdGhlIGtleSBwYXJ0KHMpLCBpbmZlciBwaHlzaWNhbCBwcm9wZXJ0aWVzLCBhbmQgZGVyaXZlIHBhcnQtbGV2ZWwgYWZmb3JkYW5jZXMgdXNlZnVsIGZvciB0aGlzIHRhc2suCjQuIEJ1aWxkIGEgc3RlcC1ieS1zdGVwIHBsYW4gd2hlcmUgZWFjaCBzdGVwIHJlZmVyZW5jZXMgdG9vbCBwYXJ0cyBhbmQgdGhlIGFmZm9yZGFuY2UgYmVpbmcgdXNlZC4KNS4gVmFsaWRhdGUgZWFjaCBzdGVwIGFnYWluc3Qgc3RhdGVkIGNvbnN0cmFpbnRzIChlLmcuLCBicm9rZW4vdW51c2FibGUgaXRlbXMsIHNpemUgbWlzbWF0Y2gsIGJsb2NrZWQgZnVuY3Rpb24sIHN0YXRlIGxpbWl0YXRpb25zKS4KNi4gS2VlcCB0aGUgcGxhbiBwcmFjdGljYWwgYW5kIG1pbmltYWwgd2l0aCBubyB1bm5lY2Vzc2FyeSBhY3Rpb25zLgoKVEFTSyBERVNDUklQVElPTjoKe3Byb2JsZW19CgpSZXR1cm4gSlNPTjoKe3sKICAidGFza19nb2FsIjogIi4uLiIsCiAgInN1Y2Nlc3NfY29uZGl0aW9uIjogIi4uLiIsCiAgImlkZW50aWZpZWRfY29uc3RyYWludHMiOiBbIi4uLiIsICIuLi4iXSwKICAiYXZhaWxhYmxlX3Rvb2xzIjogWwogICAge3sKICAgICAgInRvb2wiOiAiLi4uIiwKICAgICAgInJlbGV2YW50X3BhcnRzIjogWwogICAgICAgIHt7CiAgICAgICAgICAicGFydCI6ICIuLi4gb3IgTkEiLAogICAgICAgICAgImluZmVycmVkX3BoeXNpY2FsX3Byb3BlcnRpZXMiOiBbIi4uLiIsICIuLi4iXSwKICAgICAgICAgICJhZmZvcmRhbmNlc19mb3JfdGFzayI6IFsiLi4uIiwgIi4uLiIsIC4uLl0sCiAgICAgICAgICAidXNhYmxlX3VuZGVyX2NvbnN0cmFpbnRzIjogIlllcyBvciBObyIsCiAgICAgICAgfX0sCiAgICAgICAgLi4uCiAgICAgIF0KICAgIH19LAogICAgLi4uCiAgXSwKICAicmVhc29uaW5nX3BsYW4iOiBbCiAgICB7ewogICAgICAic3RlcCI6IDEsCiAgICAgICJhY3Rpb24iOiAiLi4uIiwKICAgICAgInRvb2xfcGFydHNfdXNlZCI6IFsidG9vbDpwYXJ0IiwgIi4uLiJdLAogICAgICAiYWZmb3JkYW5jZV91c2VkIjogWyIuLi4iLCAiLi4uIl0sCiAgICAgICJleHBsYW5hdGlvbiI6ICIxIHNlbnRlbmNlIGFib3V0IHdoeSBpdCB3b3JrcyIKICAgIH19LAogICAge3sKICAgICAgInN0ZXAiOiAyLAogICAgICAiYWN0aW9uIjogIi4uLiIsCiAgICAgICJ0b29sX3BhcnRzX3VzZWQiOiBbInRvb2w6cGFydCIsICIuLi4iXSwKICAgICAgImFmZm9yZGFuY2VfdXNlZCI6IFsiLi4uIiwgIi4uLiJdLAogICAgICAiZXhwbGFuYXRpb24iOiAiMSBzZW50ZW5jZSBhYm91dCB3aHkgaXQgd29ya3MiCiAgICB9fSwKICAgIC4uLgogIF0sCiAgInNvbHZhYmxlIjogIlllcyBvciBObyIsCiAgInNvbHZhYmxlX2V4cGxhbmF0aW9uIjogIjEtMyBzZW50ZW5jZXMgYWJvdXQgd2h5IHRoZSBnaXZlbiB0YXNrIGlzIHNvbHZhYmxlIG9yIG5vdCIsCiAgInNvbHV0aW9uX3N0ZXBzIjogWyJTdGVwIDE6IC4uLiIsICJTdGVwIDI6IC4uLiIsIC4uLl0sCiAgImZpbmFsX3NvbHV0aW9uIjogIk9uZSBjb25jaXNlIHBhcmFncmFwaCB0aGF0IHN1bW1hcml6ZXMgdGhlIGZ1bGwgYXBwcm9hY2guIiwKICAidXNlZF90b29scyI6IFsidG9vbCAxIiwgInRvb2wgMiIsIC4uLl0sCiAgImNvbnN0cmFpbnRfaGFuZGxpbmciOiBbe3siY29uc3RyYWludCI6ICIuLi4iLCAiaGFuZGxpbmciOiAiLi4uIn19LCAuLi5dCiAgImNyZWF0aXZlX3JlYXNvbmluZ19zdW1tYXJ5IjogIjEtMyBzZW50ZW5jZXMgYWJvdXQgbm92ZWx0eSBhbmQgcHJhY3RpY2FsaXR5LiIKfX0=)Youareanexpertatcreativephysicaltool\-usereasoning\.Solvethetaskbyexplicitlyreasoningovertoolpartsandaffordancesunderconstraints\.Rules:1\.Useonlytools/itemsexplicitlyavailableintheinputimage\.2\.Respectphysicalconstraintsifthetaskdescriptionrestrictsphysicalattributesorimageshowsphysicalconstraintssuchassizeorstate\.3\.Inventnewtoolsusingtools/itemsexplicitlyavailableintheinputimageifitisneeded\.4\.Providepracticalstepsthatcanactuallybeexecuted\.5\.Donotinvolveanyunnecessarystepstoachievethetask'sgoal\.6\.Ifacompletesolutionisimpossible,returnthebestpartialplanandexplainwhyitcannotbecompleted\.Requiredreasoningprocedure:1\.Statethetaskgoalandconcretesuccesscondition\.2\.Listallavailabletools/itemsfromtheinputimage\(noadditions\)\.3\.Foreachrelevanttool,identifythekeypart\(s\),inferphysicalproperties,andderivepart\-levelaffordancesusefulforthistask\.4\.Buildastep\-by\-stepplanwhereeachstepreferencestoolpartsandtheaffordancebeingused\.5\.Validateeachstepagainststatedconstraints\(e\.g\.,broken/unusableitems,sizemismatch,blockedfunction,statelimitations\)\.6\.Keeptheplanpracticalandminimalwithnounnecessaryactions\.TASKDESCRIPTION:\{problem\}ReturnJSON:\{\{"task\_goal":"\.\.\.","success\_condition":"\.\.\.","identified\_constraints":\["\.\.\.","\.\.\."\],"available\_tools":\[\{\{"tool":"\.\.\.","relevant\_parts":\[\{\{"part":"\.\.\.orNA","inferred\_physical\_properties":\["\.\.\.","\.\.\."\],"affordances\_for\_task":\["\.\.\.","\.\.\.",\.\.\.\],"usable\_under\_constraints":"YesorNo",\}\},\.\.\.\]\}\},\.\.\.\],"reasoning\_plan":\[\{\{"step":1,"action":"\.\.\.","tool\_parts\_used":\["tool:part","\.\.\."\],"affordance\_used":\["\.\.\.","\.\.\."\],"explanation":"1sentenceaboutwhyitworks"\}\},\{\{"step":2,"action":"\.\.\.","tool\_parts\_used":\["tool:part","\.\.\."\],"affordance\_used":\["\.\.\.","\.\.\."\],"explanation":"1sentenceaboutwhyitworks"\}\},\.\.\.\],"solvable":"YesorNo","solvable\_explanation":"1\-3sentencesaboutwhythegiventaskissolvableornot","solution\_steps":\["Step1:\.\.\.","Step2:\.\.\.",\.\.\.\],"final\_solution":"Oneconciseparagraphthatsummarizesthefullapproach\.","used\_tools":\["tool1","tool2",\.\.\.\],"constraint\_handling":\[\{\{"constraint":"\.\.\.","handling":"\.\.\."\}\},\.\.\.\]"creative\_reasoning\_summary":"1\-3sentencesaboutnoveltyandpracticality\."\}\}

Relative Evaluation[⬇](data:text/plain;base64,WW91IGFyZSBhbiBleHBlcnQgZXZhbHVhdG9yIGZvciBjcmVhdGl2ZSBwaHlzaWNhbCB0b29sLXVzZSByZWFzb25pbmcgcXVhbGl0eS4KWW91ciB0YXNrIGlzIHRvIGNvbXBhcmUgVFdPIGNhbmRpZGF0ZSBzb2x1dGlvbnMgZm9yIHRoZSBTQU1FIHRhc2ssIHVzaW5nIHRoZSBncm91bmQtdHJ1dGggc29sdXRpb24gYXMgcmVmZXJlbmNlLgoKSU1QT1JUQU5UOgotIFlvdSBhcmUgTk9UIGNoZWNraW5nIHdvcmRpbmcgc2ltaWxhcml0eS4gSnVkZ2UgZnVuY3Rpb25hbCBxdWFsaXR5IGFuZCBwcmFjdGljYWwgdmFsaWRpdHkuCi0gQSBzb2x1dGlvbiBjYW4gYmUgc3Ryb25nIGV2ZW4gaWYgaXQgZGlmZmVycyBmcm9tIHRoZSBncm91bmQtdHJ1dGggYXBwcm9hY2guCi0gVXNlIHRoZSBmb2xsb3dpbmcgY3JpdGVyaWE6CiAgICAtIENvcnJlY3RuZXNzOiBXaGV0aGVyIHRoZSBwbGFuIGFjdHVhbGx5IHNvbHZlcyB0aGUgdGFzayBvYmplY3RpdmUuCiAgICAtIEZlYXNpYmlsaXR5OiBXaGV0aGVyIHRoZSBwbGFuIGNhbiBiZSBwaHlzaWNhbGx5IGV4ZWN1dGVkIHVuZGVyIHN0YXRlZCBjb25zdHJhaW50cy4KICAgIC0gUGh5c2ljYWwgR3JvdW5kaW5nOiBXaGV0aGVyIGl0IHVzZXMgcmVhbGlzdGljIG9iamVjdCBwcm9wZXJ0aWVzL21lY2hhbmljcyBjb3JyZWN0bHkuCiAgICAtIENvbnN0cmFpbnQgQ292ZXJhZ2U6IFdoZXRoZXIgaXQgaGFuZGxlcyBhbGwgZXhwbGljaXQgY29uc3RyYWludHMuCiAgICAtIFRvb2wgVXNhZ2U6IFdoZXRoZXIgaXQgdXNlcyBvbmx5IGF2YWlsYWJsZSB0b29scyBhcHByb3ByaWF0ZWx5IGFuZCBwdXJwb3NlZnVsbHkuCiAgICAtIENyZWF0aXZlIFJlYXNvbmluZzogTm92ZWwsIG5vbi1vYnZpb3VzIGJ1dCB2YWxpZCByZXB1cnBvc2luZy9jb21iaW5hdGlvbiBvZiB0b29scy4KICAgIC0gT3ZlcmFsbDogSG9saXN0aWMgcXVhbGl0eSBhY3Jvc3MgYWxsIGFib3ZlIGRpbWVuc2lvbnMuCgpGb3IgRUFDSCBjcml0ZXJpb246CjEuIERldGVybWluZSB3aW5uZXI6CiAgIC0gIndpbiIgaWYgc29sdXRpb24xX3Njb3JlID4gc29sdXRpb24yX3Njb3JlCiAgIC0gImxvc2UiIGlmIHNvbHV0aW9uMl9zY29yZSA+IHNvbHV0aW9uMV9zY29yZQogICAtICJ0aWUiIGlmIGJvdGggYXJlIGVxdWFsCjIuIEdpdmUgYSBzaG9ydCByYXRpb25hbGUgKDEtMiBzZW50ZW5jZXMpLgoKVEFTSyBERVNDUklQVElPTjoKe3Byb2JsZW19CgpHUk9VTkQtVFJVVEggU09MVVRJT046Cntncm91bmRfdHJ1dGhfc29sdXRpb259CgpDQU5ESURBVEUgU09MVVRJT04xIChERUZBVUxUIFBST01QVCBPVVRQVVQpOgp7c29sdXRpb24xfQoKQ0FORElEQVRFIFNPTFVUSU9OMiAoQ09UIFBST01QVCBPVVRQVVQpOgp7c29sdXRpb24yfQoKUmV0dXJuIFNUUklDVCBKU09OOgp7ewogICJjb3JyZWN0bmVzcyI6IHt7CiAgICAid2lubmVyIjogIndpbiBvciBsb3NlIG9yIHRpZSIsCiAgICAicmF0aW9uYWxlIjogIjEtMiBzZW50ZW5jZXMuIgogIH19LAogICJmZWFzaWJpbGl0eSI6IHt7CiAgICAid2lubmVyIjogIndpbiBvciBsb3NlIG9yIHRpZSIsCiAgICAicmF0aW9uYWxlIjogIjEtMiBzZW50ZW5jZXMuIgogIH19LAogICJwaHlzaWNhbF9ncm91bmRpbmciOiB7ewogICAgIndpbm5lciI6ICJ3aW4gb3IgbG9zZSBvciB0aWUiLAogICAgInJhdGlvbmFsZSI6ICIxLTIgc2VudGVuY2VzLiIKICB9fSwKICAiY29uc3RyYWludF9jb3ZlcmFnZSI6IHt7CiAgICAid2lubmVyIjogIndpbiBvciBsb3NlIG9yIHRpZSIsCiAgICAicmF0aW9uYWxlIjogIjEtMiBzZW50ZW5jZXMuIgogIH19LAogICJ0b29sX3VzYWdlIjoge3sKICAgICJ3aW5uZXIiOiAid2luIG9yIGxvc2Ugb3IgdGllIiwKICAgICJyYXRpb25hbGUiOiAiMS0yIHNlbnRlbmNlcy4iCiAgfX0sCiAgImNyZWF0aXZpdHkiOiB7ewogICAgIndpbm5lciI6ICJ3aW4gb3IgbG9zZSBvciB0aWUiLAogICAgInJhdGlvbmFsZSI6ICIxLTIgc2VudGVuY2VzLiIKICB9fSwKICAib3ZlcmFsbCI6IHt7CiAgICAid2lubmVyIjogIndpbiBvciBsb3NlIG9yIHRpZSIsCiAgICAicmF0aW9uYWxlIjogIjEtMiBzZW50ZW5jZXMuIgogIH19LAogICJzaG9ydF9zdW1tYXJ5IjogIjItNCBzZW50ZW5jZXMgc3VtbWFyaXppbmcga2V5IHRyYWRlb2ZmcyBiZXR3ZWVuIGRlZmF1bHQgYW5kIENvVC4iCn19)Youareanexpertevaluatorforcreativephysicaltool\-usereasoningquality\.YourtaskistocompareTWOcandidatesolutionsfortheSAMEtask,usingtheground\-truthsolutionasreference\.IMPORTANT:\-YouareNOTcheckingwordingsimilarity\.Judgefunctionalqualityandpracticalvalidity\.\-Asolutioncanbestrongevenifitdiffersfromtheground\-truthapproach\.\-Usethefollowingcriteria:\-Correctness:Whethertheplanactuallysolvesthetaskobjective\.\-Feasibility:Whethertheplancanbephysicallyexecutedunderstatedconstraints\.\-PhysicalGrounding:Whetheritusesrealisticobjectproperties/mechanicscorrectly\.\-ConstraintCoverage:Whetherithandlesallexplicitconstraints\.\-ToolUsage:Whetheritusesonlyavailabletoolsappropriatelyandpurposefully\.\-CreativeReasoning:Novel,non\-obviousbutvalidrepurposing/combinationoftools\.\-Overall:Holisticqualityacrossallabovedimensions\.ForEACHcriterion:1\.Determinewinner:\-"win"ifsolution1\_score\>solution2\_score\-"lose"ifsolution2\_score\>solution1\_score\-"tie"ifbothareequal2\.Giveashortrationale\(1\-2sentences\)\.TASKDESCRIPTION:\{problem\}GROUND\-TRUTHSOLUTION:\{ground\_truth\_solution\}CANDIDATESOLUTION1\(DEFAULTPROMPTOUTPUT\):\{solution1\}CANDIDATESOLUTION2\(COTPROMPTOUTPUT\):\{solution2\}ReturnSTRICTJSON:\{\{"correctness":\{\{"winner":"winorloseortie","rationale":"1\-2sentences\."\}\},"feasibility":\{\{"winner":"winorloseortie","rationale":"1\-2sentences\."\}\},"physical\_grounding":\{\{"winner":"winorloseortie","rationale":"1\-2sentences\."\}\},"constraint\_coverage":\{\{"winner":"winorloseortie","rationale":"1\-2sentences\."\}\},"tool\_usage":\{\{"winner":"winorloseortie","rationale":"1\-2sentences\."\}\},"creativity":\{\{"winner":"winorloseortie","rationale":"1\-2sentences\."\}\},"overall":\{\{"winner":"winorloseortie","rationale":"1\-2sentences\."\}\},"short\_summary":"2\-4sentencessummarizingkeytradeoffsbetweendefaultandCoT\."\}\}

## Appendix CBenchmark Construction Details

### C\.1Affordance Knowledge Base Basis

Our benchmark construction is grounded in an existing open\-source affordance knowledge base of physical entities, object parts, and part\-level affordances\.111Repository at[https://github\.com/CreativityBench/CreativityBench](https://github.com/CreativityBench/CreativityBench)The knowledge base organizes everyday objects into a structured partonomy: each entity is decomposed into functional parts, and each part is annotated with physical attributes, state attributes, and possible functional affordances\. Physical attributes describe relatively stable properties, such as shape, material, rigidity, sharpness, hollowness, flexibility, and surface texture, while state attributes describe situational conditions, such as whether a part is open, clean, intact, accessible, or detachable\. The affordance annotations specify what functional roles a part can support, together with use conditions, recipient conditions, examples, and suitability levels\. These annotations provide the symbolic basis for MM\-CreativityBench: they allow us to identify which part of which object can support a target affordance, what attributes justify this use, and what conditions must hold for the use to be valid\. Thus, benchmark instances are not created through unconstrained scenario writing; they are derived from explicit part\-level attribute–affordance relations that can be inspected, verified, and converted into multimodal grounding problems\.

### C\.2Reverse Task Construction

We construct MM\-CreativityBench tasks in a reverse direction\. Instead of first writing an open\-ended scenario and then labeling a possible answer, we begin with a verified entity–part–affordance relation from the knowledge base and generate a scenario around it\. For each task, we first sample a target entitye∗e^\{\*\}, a target partp∗∈P\(e∗\)p^\{\*\}\\in P\(e^\{\*\}\), and a gold affordancef∗f^\{\*\}supported by the annotated attributesA\(p∗\)A\(p^\{\*\}\)\. This defines the gold solution

g=\(e∗,p∗,f∗\)\.g=\(e^\{\*\},p^\{\*\},f^\{\*\}\)\.Here,e∗e^\{\*\}specifies the object to be repurposed,p∗p^\{\*\}specifies the decisive part, andf∗f^\{\*\}specifies the functional role that the part can play in solving the task\.

Given the gold solution, we use GPT\-5\.4 for reverse task proposal generation\. Specifically, GPT\-5\.4 is given\(e∗,p∗,f∗\)\(e^\{\*\},p^\{\*\},f^\{\*\}\)together with the supporting physical and state attributes ofp∗p^\{\*\}, and is prompted to propose a practical task descriptionxxthat requires the affordancef∗f^\{\*\}without explicitly mentioning the target entity, the target part, or any surface cue that would make the answer trivial\. The generated description must satisfy three requirements: it should describe a realistic everyday problem, include only constraints relevant to the intended affordance, and leave the solver to infer which available object and part can satisfy the goal\. GPT\-5\.4 is used only to generate candidate task descriptions; all accepted tasks are subsequently checked and refined by human annotators\.

After obtaining a candidate task description, we construct the candidate entity set by adding distractors to the gold entity\. For each gold solution, we sample a distractor setE−=\{e1,…,eN−1\}E^\{\-\}=\\\{e\_\{1\},\\ldots,e\_\{N\-1\}\\\}and form

E=\{e∗\}∪E−\.E=\\\{e^\{\*\}\\\}\\cup E^\{\-\}\.Distractors are selected from the same knowledge base to create controlled ambiguity\. We include*affordance\-similar distractors*, whose parts appear functionally related tof∗f^\{\*\}but fail under closer inspection because they lack a necessary physical attribute, have an incompatible state, provide a weaker mechanism, or violate a contextual requirement\. We also include*scene\-plausible distractors*, which naturally co\-occur with the gold entity in the same environment but do not support the target affordance\. This design prevents the task from being solved by object priors or generic tool\-use associations alone\. A model must inspect candidate entities, compare their parts, and identify the part whose attributes best support the required affordance\.

Each symbolic task is represented as

T=\(x,E,g\),g=\(e∗,p∗,f∗\),T=\(x,E,g\),\\qquad g=\(e^\{\*\},p^\{\*\},f^\{\*\}\),wherexxis the task description,EEis the candidate entity set, andggis the gold entity–part–affordance solution\. This formulation makes the benchmark an inverse grounding problem: the task description specifies a need, the scene provides multiple possible objects, and the model must recover the correct entity and part by grounding the required affordance in visual and physical evidence\.

We apply a multi\-stage quality\-control process before including a task in the benchmark\. First,gold validity: the gold part must physically support the target affordance under the stated task constraints, and the supporting evidence must be present in its knowledge\-base attributes\. Second,distractor separability: no distractor can serve as an equally valid solution; each distractor must fail for a specific and identifiable reason, such as a missing attribute, incompatible state, weaker functional mechanism, safety concern, or contextual mismatch\. Third,scenario clarity: the task description must be natural, concise, and unambiguous, while avoiding direct lexical leakage of the gold entity or part\. Fourth,scene coherence: all candidate entities must plausibly co\-occur in a single realistic environment without making the scene artificial or visually cluttered\. Fifth,visual observability: the decisive part and the attributes required to justify the solution must be inspectable in the generated entity\- or part\-level images\. These criteria remove ambiguous cases, physically invalid solutions, non\-visual tasks, and scenarios with unintended alternative answers\.

Finally, all tasks are manually verified and refined by human annotators\. Annotators check whether the generated task genuinely requires the intended affordance, whether the gold entity–part pair is uniquely valid among the candidates, whether the distractors are plausible but separable, and whether the scenario can be faithfully visualized\. When needed, annotators revise the task wording, replace distractors, or discard the example entirely\. Using this pipeline, we construct 333 held\-out tasks for MM\-CreativityBench evaluation and 868 disjoint training tasks for trajectory sampling in the alignment stage\. The two splits are separated at the task and visual\-instance level to eliminate leakage between training trajectories and benchmark evaluation\.

### C\.3Multimodal Image Construction

After constructing each symbolic taskT=\(x,E,g\)T=\(x,E,g\), we convert it into an interactive multimodal instance by generating images at three levels: entity, part, and environment\. All images are generated with Gemini\-3\.1\-Image\-Pro\. The goal is not only to visualize the task, but also to create a controlled evidence hierarchy that matches the benchmark protocol: the model first observes the full environment, then chooses candidate entities to inspect, and finally verifies part\-level evidence before answering\.

Entity\-level images\.For each candidate entitye∈Ee\\in E, we generate a full\-object reference image

Ie=πent\(e,P\(e\),\{A\(p\):p∈P\(e\)\}\),I\_\{e\}=\\pi\_\{\\mathrm\{ent\}\}\(e,P\(e\),\\\{A\(p\):p\\in P\(e\)\\\}\),where the prompt is conditioned on the entity name, its part decomposition, and concise summaries of part\-level attributes\. The generated image should make the entity recognizable as a whole while preserving visually relevant cues such as geometry, material, surface texture, openings, edges, handles, tips, flexible regions, or contact surfaces\.

Image Generation Prompt \(Entity\)[⬇](data:text/plain;base64,R2VuZXJhdGUgYSBzaW5nbGUtb2JqZWN0IHJlZmVyZW5jZSBpbWFnZSBvbiBhIHB1cmUgd2hpdGUgc2VhbWxlc3Mgc3R1ZGlvIGJhY2tncm91bmQuCgpUaGUgaW1hZ2UgbXVzdCBmb2N1cyBvbmx5IG9uIHRoZSB0YXJnZXQgZW50aXR5LCBjZW50ZXJlZCBhbmQgZnVsbHkgdmlzaWJsZSB1bmxlc3MgdGhlIGFubm90YXRpb24gc2F5cyBhIHBhcnQgaXMgaGlkZGVuIG9yIGludGVybmFsLiBEbyBub3QgcGxhY2UgdGhlIG9iamVjdCBpbiBhIHJvb20sIHNjZW5hcmlvLCBvciBuYXJyYXRpdmUgc2NlbmUuIERvIG5vdCBhZGQgZXh0cmEgb2JqZWN0cywgaGFuZHMsIGZsb29yIHByb3BzLCBvciBkZWNvcmF0aXZlIGNvbnRleHQuIFJlbmRlciB0aGUgb2JqZWN0IHBob3RvcmVhbGlzdGljYWxseSB3aXRoIG5ldXRyYWwgc3R1ZGlvIGxpZ2h0aW5nIGFuZCBzaGFycCBtYXRlcmlhbCBkZXRhaWwuIFByZXNlcnZlIHZpc2libGUgc3RhdGUgY3VlcyBzdWNoIGFzIHdldG5lc3MsIHJlc2lkdWUsIGZ1bGxuZXNzLCBkZWZvcm1hdGlvbiwgd2Vhciwgb3Igb2NjbHVzaW9uIHdoZW4gdmlzdWFsbHkgcmVwcmVzZW50YWJsZS4KClRhcmdldCBlbnRpdHk6IHtFTlRJVFlfTkFNRX0KRW50aXR5IHN1bW1hcnk6IHtFTlRJVFlfU1VNTUFSWX0KClBhcnQtYnktcGFydCBndWlkYW5jZToKe1BBUlRfT1ZFUlZJRVd9CgpPdXRwdXQgZ29hbDogb25lIGNsZWFuIHdoaXRlLWJhY2tncm91bmQgcHJvZHVjdC1zdHlsZSByZWZlcmVuY2UgaW1hZ2Ugb2YgdGhpcyBleGFjdCBlbnRpdHkuIEFubm90YXRlIGVhY2ggcGFydCdzIGV4YWN0IG5hbWUgYmVzaWRlIHRoZSBjb3JyZXNwb25kaW5nIHBhcnQuIEFubm90YXRlIG9ubHkgdGhlIG5hbWUsIHdpdGggbm8gZGVzY3JpcHRpb24u)Generateasingle\-objectreferenceimageonapurewhiteseamlessstudiobackground\.Theimagemustfocusonlyonthetargetentity,centeredandfullyvisibleunlesstheannotationsaysapartishiddenorinternal\.Donotplacetheobjectinaroom,scenario,ornarrativescene\.Donotaddextraobjects,hands,floorprops,ordecorativecontext\.Rendertheobjectphotorealisticallywithneutralstudiolightingandsharpmaterialdetail\.Preservevisiblestatecuessuchaswetness,residue,fullness,deformation,wear,orocclusionwhenvisuallyrepresentable\.Targetentity:\{ENTITY\_NAME\}Entitysummary:\{ENTITY\_SUMMARY\}Part\-by\-partguidance:\{PART\_OVERVIEW\}Outputgoal:onecleanwhite\-backgroundproduct\-stylereferenceimageofthisexactentity\.Annotateeachpart'sexactnamebesidethecorrespondingpart\.Annotateonlythename,withnodescription\.

Part\-level images\.For each partp∈P\(e\)p\\in P\(e\), we generate a zoomed\-in part image

Ie,p=πpart\(e,p,A\(p\),Ie\),I\_\{e,p\}=\\pi\_\{\\mathrm\{part\}\}\(e,p,A\(p\),I\_\{e\}\),using the entity imageIeI\_\{e\}as a visual anchor\. This ensures that the part view remains consistent with the full\-object image in geometry, color, material, and relative structure\. The prompt focuses tightly on the target part and asks the generator to preserve the attributes relevant to its possible use\. This level is necessary because many creative solutions depend on local evidence, such as a rubber pad, serrated edge, hollow cavity, flat panel, hook\-like curve, narrow tip, or absorbent surface, which may be hard to verify from the environment image alone\.

Image Generation Prompt \(Part\)[⬇](data:text/plain;base64,R2VuZXJhdGUgYSBjbG9zZS11cCBzdHVkaW8gaW1hZ2Ugb24gYSBwdXJlIHdoaXRlIHNlYW1sZXNzIGJhY2tncm91bmQuCgp7UkVGRVJFTkNFX0lNQUdFX0lOU1RSVUNUSU9OfQoKRm9jdXMgb24gdGhlIHRhcmdldCBwYXJ0IGFuZCBtYWtlIGl0cyBsb2NhbCB2aXN1YWwgY3VlcyBlYXN5IHRvIGluc3BlY3QuIERvIG5vdCBhZGQgcm9vbSBjb250ZXh0LCBwcm9wcywgbGFiZWxzLCBhcnJvd3MsIG9yIGRpYWdyYW1zLiBVc2UgcHJvZHVjdC1waG90b2dyYXBoeSBsaWdodGluZyB3aXRoIHN0cm9uZyB0ZXh0dXJlIHZpc2liaWxpdHkgYW5kIGNsZWFuIGVkZ2VzLiBTaG93IGZpbmUgbWF0ZXJpYWwgZGV0YWlsLCByZXNpZHVlLCBtb2lzdHVyZSwgd2Vhciwgb2NjbHVzaW9uLCBzZWFtcywgb3BlbmluZ3MsIGZpbGwgY3VlcywgZGVmb3JtYXRpb24sIGFuZCBzdXJmYWNlIHRleHR1cmUgd2hlbmV2ZXIgdGhleSBhcmUgdmlzdWFsbHkgcmVwcmVzZW50YWJsZS4KCkVudGl0eToge0VOVElUWV9OQU1FfQpFbnRpdHkgc3VtbWFyeToge0VOVElUWV9TVU1NQVJZfQpUYXJnZXQgcGFydDoge1RBUkdFVF9QQVJUX05BTUV9Cgp7UEhZU0lDQUxfQVRUUklCVVRFU19CTE9DS30Ke1NUQVRFX0FUVFJJQlVURVNfQkxPQ0t9CgpQaHlzaWNhbCBzdW1tYXJ5OiB7UEFSVF9QSFlTSUNBTF9TVU1NQVJZfQpTdGF0ZSBzdW1tYXJ5OiB7UEFSVF9TVEFURV9TVU1NQVJZfQoKSW1wb3J0YW50OiBvYnNlcnZlIHRoZSBnaXZlbiBlbnRpdHkgaW1hZ2UuIFRoZSBnZW5lcmF0ZWQgdGFyZ2V0IHBhcnQgaW1hZ2Ugc2hvdWxkIG5vdCBkZXZpYXRlIHN1YnN0YW50aWFsbHkgZnJvbSB0aGUgZW50aXR5IGltYWdlLCBidXQgc2hvdWxkIGJlIGNvbXByZWhlbnNpdmUgYW5kIGluZm9ybWF0aXZlIGFib3V0IGRldGFpbGVkIGxvY2FsIHZpc3VhbCBjdWVzLgoKT3V0cHV0IGdvYWw6IG9uZSB3aGl0ZS1iYWNrZ3JvdW5kIGNsb3NlLXVwIGltYWdlIGNlbnRlcmVkIG9uIHRoZSB0YXJnZXQgcGFydC4gVGhlIHBhcnQgc2hvdWxkIGhhdmUgZXhhY3RseSB0aGUgc2FtZSBwaHlzaWNhbCBhbmQgc3RhdGUgYXR0cmlidXRlcyBhcyBkZXNjcmliZWQgYWJvdmUu)Generateaclose\-upstudioimageonapurewhiteseamlessbackground\.\{REFERENCE\_IMAGE\_INSTRUCTION\}Focusonthetargetpartandmakeitslocalvisualcueseasytoinspect\.Donotaddroomcontext,props,labels,arrows,ordiagrams\.Useproduct\-photographylightingwithstrongtexturevisibilityandcleanedges\.Showfinematerialdetail,residue,moisture,wear,occlusion,seams,openings,fillcues,deformation,andsurfacetexturewhenevertheyarevisuallyrepresentable\.Entity:\{ENTITY\_NAME\}Entitysummary:\{ENTITY\_SUMMARY\}Targetpart:\{TARGET\_PART\_NAME\}\{PHYSICAL\_ATTRIBUTES\_BLOCK\}\{STATE\_ATTRIBUTES\_BLOCK\}Physicalsummary:\{PART\_PHYSICAL\_SUMMARY\}Statesummary:\{PART\_STATE\_SUMMARY\}Important:observethegivenentityimage\.Thegeneratedtargetpartimageshouldnotdeviatesubstantiallyfromtheentityimage,butshouldbecomprehensiveandinformativeaboutdetailedlocalvisualcues\.Outputgoal:onewhite\-backgroundclose\-upimagecenteredonthetargetpart\.Thepartshouldhaveexactlythesamephysicalandstateattributesasdescribedabove\.

Environment\-level images\.We then generate the full scenario image

Ienv=πenv\(x,E,\{Ie:e∈E\}\),I\_\{\\mathrm\{env\}\}=\\pi\_\{\\mathrm\{env\}\}\(x,E,\\\{I\_\{e\}:e\\in E\\\}\),conditioned on the task description, the candidate entity list, and the generated entity reference images\. The prompt requires all candidate entities to appear naturally in a coherent scene with realistic scale, placement, and lighting\. It also explicitly prohibits showing the task already being solved or introducing extra objects that could serve as unintended alternative solutions\. Thus, the environment image defines the search space, while entity and part images provide progressively finer evidence for verification\.

System Prompt[⬇](data:text/plain;base64,R2VuZXJhdGUgb25lIHBob3RvcmVhbGlzdGljIG92ZXJhbGwgZW52aXJvbm1lbnQgc2NlbmUgaW1hZ2UuCgpTY2VuYXJpbzoge1NDRU5BUklPfQpUYXNrIGdvYWw6IHtUQVNLX0dPQUx9CgpUaGUgYXR0YWNoZWQgcmVmZXJlbmNlIGltYWdlcyBhcmUgdGhlIHJlcXVpcmVkIGVudGl0aWVzIGFuZCBhcmUgcHJvdmlkZWQgaW4gdGhlIHNhbWUgb3JkZXIgYXMgdGhlIGxpc3QgYmVsb3cuIFByZXNlcnZlIGVhY2ggcmVmZXJlbmNlZCBlbnRpdHkncyBpZGVudGl0eSwgbWF0ZXJpYWxzLCBwcm9wb3J0aW9ucywgYW5kIHJlY29nbml6YWJsZSBzaGFwZSB3aGlsZSBwbGFjaW5nIHRoZW0gbmF0dXJhbGx5IGluIHRoZSBzY2VuZS4KClJlcXVpcmVkIGVudGl0aWVzIHRvIGluY2x1ZGUgaW4gdGhlIHNjZW5lOgp7UkVRVUlSRURfRU5USVRZX0xJU1R9CgpSZXF1aXJlZCBpbnRlcmFjdGFibGUgaXRlbXMgdG8gaW5jbHVkZSBuYXR1cmFsbHkgaW4gdGhlIHNjZW5lOgp7SU5URVJBQ1RBQkxFX0lURU1fTElTVH0KClNjZW5lIHJlcXVpcmVtZW50czoKLSBTaG93IGEgc2luZ2xlIGNvaGVyZW50IHJlYWwtd29ybGQgc2NlbmUsIG5vdCBhIGNvbGxhZ2Ugb3Igd2hpdGUtYmFja2dyb3VuZCBsYXlvdXQuCi0gSW5jbHVkZSBldmVyeSByZXF1aXJlZCBlbnRpdHkgYW5kIGV2ZXJ5IHJlcXVpcmVkIGludGVyYWN0YWJsZSBpdGVtIGluIGEgcGxhdXNpYmxlIG1hbm5lci4KLSBOYXR1cmFsbHkgZW1iZWQgdGhlIG9iamVjdHMgaW50byB0aGUgc2NlbmFyaW8gd2l0aCByZWFsaXN0aWMgc2NhbGUsIGxpZ2h0aW5nLCBhbmQgcGxhY2VtZW50LCBhbmQgbWFrZSBzdXJlIHRoZXkgYXJlIGNsZWFybHkgdmlzaWJsZS4KLSBEbyBub3QgYWRkIHRleHQgbGFiZWxzLCBhcnJvd3MsIGRpYWdyYW1zLCBzcGxpdCBwYW5lbHMsIG9yIHByb2R1Y3Qtc2hvdCBmcmFtaW5nLgotIERvIG5vdCBzaG93IGhvdyB0byBzb2x2ZSB0aGUgdGFzayBvciBleHBsaWNpdGx5IGRlcGljdCB0YXNrIGV4ZWN1dGlvbi4KLSBNaW5pbWl6ZSBleHRyYW5lb3VzIG9iamVjdHMgc28gdGhlIHNjZW5hcmlvIHJlbWFpbnMgZm9jdXNlZCBhbmQgdW5jbHV0dGVyZWQuCgpPdXRwdXQgZ29hbDogb25lIG5hdHVyYWwgb3ZlcmFsbCBzY2VuYXJpbyBpbWFnZSB0aGF0IGNsZWFybHkgY29udGFpbnMgYWxsIHJlcXVpcmVkIGVudGl0aWVzIGFuZCBpbnRlcmFjdGFibGUgaXRlbXMu)Generateonephotorealisticoverallenvironmentsceneimage\.Scenario:\{SCENARIO\}Taskgoal:\{TASK\_GOAL\}Theattachedreferenceimagesaretherequiredentitiesandareprovidedinthesameorderasthelistbelow\.Preserveeachreferencedentity'sidentity,materials,proportions,andrecognizableshapewhileplacingthemnaturallyinthescene\.Requiredentitiestoincludeinthescene:\{REQUIRED\_ENTITY\_LIST\}Requiredinteractableitemstoincludenaturallyinthescene:\{INTERACTABLE\_ITEM\_LIST\}Scenerequirements:\-Showasinglecoherentreal\-worldscene,notacollageorwhite\-backgroundlayout\.\-Includeeveryrequiredentityandeveryrequiredinteractableiteminaplausiblemanner\.\-Naturallyembedtheobjectsintothescenariowithrealisticscale,lighting,andplacement,andmakesuretheyareclearlyvisible\.\-Donotaddtextlabels,arrows,diagrams,splitpanels,orproduct\-shotframing\.\-Donotshowhowtosolvethetaskorexplicitlydepicttaskexecution\.\-Minimizeextraneousobjectssothescenarioremainsfocusedanduncluttered\.Outputgoal:onenaturaloverallscenarioimagethatclearlycontainsallrequiredentitiesandinteractableitems\.

Image quality and consistency checks\.We check generated images for three requirements before using them in the benchmark\. First, all candidate entities must be present and recognizable in the environment\. Second, entity and part images must be visually consistent, so that a part inspection can be interpreted as a closer view of the same object\. Third, the decisive part must be visually inspectable rather than hidden, cropped away, or rendered in a way that makes the task impossible\. Images that fail these requirements are regenerated or manually filtered\.

Textual disambiguation for visual feedback\.Although all generated images are used during evaluation, some fine\-grained attributes may not be decisively inferable from the image alone due to rendering ambiguity, viewpoint, lighting, or material appearance\. To avoid making the task depend on accidental image artifacts, we add a textual disambiguation step\. Given an entity–part image pair\(Ie,Ie,p\)\(I\_\{e\},I\_\{e,p\}\)and an attributeα∈A\(p\)\\alpha\\in A\(p\), we use GPT\-5\.4 to judge whether the visual evidence alone is sufficient to support the attribute:

ℓ\(α\)∈\{VisualEnough,TextNeeded\}\.\\ell\(\\alpha\)\\in\\\{\\textsc\{VisualEnough\},\\textsc\{TextNeeded\}\\\}\.VisualEnoughmeans that the attribute can be reasonably inferred from the image without additional text\.TextNeededmeans that the attribute is part of the knowledge\-base annotation and is compatible with the generated image, but the visual evidence may be ambiguous; in this case, we provide a short textual clarification together with the image when that entity or part is returned as feedback\.

This step is used only to disambiguate low\-level visual details\. The accompanying text is restricted to object or part attributes, such as material, state, surface property, rigidity, hollowness, or accessibility\. It does not reveal the target affordance, the correct entity, the correct part, or the final solution\. The same procedure is applied to all candidate entities and parts, not only to the gold solution\. Thus, the benchmark still requires models to inspect the visual evidence and reason over candidate parts, while preventing failures caused by attributes that are intended in the generated image but not visually decisive\. We denote the attributes requiring textual clarification as

δ\(p\)=\{α∈A\(p\):ℓ\(α\)=TextNeeded\},\\delta\(p\)=\\\{\\alpha\\in A\(p\):\\ell\(\\alpha\)=\\textsc\{TextNeeded\}\\\},and include concise descriptions ofδ\(p\)\\delta\(p\)when presenting the corresponding image as interaction feedback\.

## Appendix DTraining Trajectory Construction Details

We provide additional details for the trajectory construction procedure introduced in the main text\. The goal is to construct supervision not only for the final entity–part answer, but also for the intermediate evidence\-seeking process: selecting entities to inspect, verifying candidate parts, judging physical attributes, and rejecting plausible but physically invalid alternatives\.

### D\.1Trajectory Formulation

Each task is represented as𝒯=\(x,Ienv,E,g\)\\mathcal\{T\}=\(x,I\_\{\\mathrm\{env\}\},E,g\), wherexxis the task instruction,IenvI\_\{\\mathrm\{env\}\}is the environment image,EEis the set of scene entities, andg=\(e∗,p∗,f∗\)g=\(e^\{\*\},p^\{\*\},f^\{\*\}\)is the gold solution consisting of the target entity, target part, and target affordance\. Each entitye∈Ee\\in Ehas an annotated part setP\(e\)P\(e\)\.

A trajectory is a sequence of interaction turns:

τ=\{\(ot,rt\)\}t=1T,ot=\(ut,It\),rt=\(zt,at\)\.\\tau=\\\{\(o\_\{t\},r\_\{t\}\)\\\}\_\{t=1\}^\{T\},\\qquad o\_\{t\}=\(u\_\{t\},I\_\{t\}\),\\qquad r\_\{t\}=\(z\_\{t\},a\_\{t\}\)\.Here,utu\_\{t\}is the textual feedback,ItI\_\{t\}is the visual observation,ztz\_\{t\}is the model’s free\-form reasoning, andata\_\{t\}is a structured action\. The action space contains three operations:

at∈\{inspect\_entity\(e\),inspect\_part\(e,p\),answer\(e,p,h\)\},a\_\{t\}\\in\\\{\\texttt\{inspect\\\_entity\}\(e\),\\;\\texttt\{inspect\\\_part\}\(e,p\),\\;\\texttt\{answer\}\(e,p,h\)\\\},wheree∈Ee\\in E,p∈P\(e\)p\\in P\(e\), andhhdescribes how the selected part should be used\.

The actions define the interaction protocol\. The actioninspect\_entity\(e\)\\texttt\{inspect\\\_entity\}\(e\)returns the entity\-level imageIeI\_\{e\}and the part listP\(e\)P\(e\)\. The actioninspect\_part\(e,p\)\\texttt\{inspect\\\_part\}\(e,p\)returns the zoomed\-in part imageIe,pI\_\{e,p\}, optionally with short attribute\-level textual disambiguation\. The actionanswer\(e,p,h\)\\texttt\{answer\}\(e,p,h\)terminates the trajectory and provides the final grounded solution\. Thus, the observation at each turn is determined by the previous action:

It=\{Ienv,t=1,Ie,at−1=inspect\_entity\(e\),Ie,p,at−1=inspect\_part\(e,p\)\.I\_\{t\}=\\begin\{cases\}I\_\{\\mathrm\{env\}\},&t=1,\\\\ I\_\{e\},&a\_\{t\-1\}=\\texttt\{inspect\\\_entity\}\(e\),\\\\ I\_\{e,p\},&a\_\{t\-1\}=\\texttt\{inspect\\\_part\}\(e,p\)\.\\end\{cases\}This formulation ensures that each reasoning step is aligned with the appropriate level of visual evidence: scene, entity, or part\.

### D\.2Knowledge\-Guided Exploration Stack

To construct positive trajectories, we maintain an ordered exploration stack𝒮t\\mathcal\{S\}\_\{t\}\. Each stack element is either an entity item\(entity,e\)\(\\texttt\{entity\},e\)or a part item\(part,\(e,p\)\)\(\\texttt\{part\},\(e,p\)\)\. The top element determines the next inspection target\.

The stack is guided by an affordance\-relevance function:

J:E×P\(e\)→\{0,1\},J:E\\times P\(e\)\\rightarrow\\\{0,1\\\},whereJ\(e,p\)=1J\(e,p\)=1indicates that partppof entityeehas an affordance similar or relevant to the target affordancef∗f^\{\*\}according to the knowledge base𝒦\\mathcal\{K\}\. This does not necessarily mean that\(e,p\)\(e,p\)is the gold answer; it only means that the part is worth inspecting\. This distinction is important because many distractors are affordance\-similar but fail under fine\-grained physical verification\.

At the first turn, the model observes the scene and proposes an ordered list of candidate entities\. This list initializes𝒮1\\mathcal\{S\}\_\{1\}, prioritizing likely entities while allowing systematic exploration\. The stack is then updated as follows\.

- •Entity inspection\.When the top element is\(entity,ei\)\(\\texttt\{entity\},e\_\{i\}\), the positive branch takesinspect\_entity\(ei\)\\texttt\{inspect\\\_entity\}\(e\_\{i\}\)\. The entity is removed from the stack, and all affordance\-relevant parts\{p∈P\(ei\):J\(ei,p\)=1\}\\\{p\\in P\(e\_\{i\}\):J\(e\_\{i\},p\)=1\\\}are pushed onto the stack for part\-level verification\. If no such part exists, exploration moves to the next entity\.
- •Part inspection\.When the top element is\(part,\(ei,pi,j\)\)\(\\texttt\{part\},\(e\_\{i\},p\_\{i,j\}\)\), the positive branch takesinspect\_part\(ei,pi,j\)\\texttt\{inspect\\\_part\}\(e\_\{i\},p\_\{i,j\}\)\. The part is then removed from the stack and assigned a binary judgmentbt∈\{0,1\}b\_\{t\}\\in\\\{0,1\\\}, indicating whether its observed attributes satisfy the task requirements\.
- •Termination\.Exploration terminates when𝒮t=∅\\mathcal\{S\}\_\{t\}=\\emptyset\. The model then compares inspected candidate parts, especially those withbt=1b\_\{t\}=1, and produces the final actionanswer\(e∗,p∗,h∗\)\\texttt\{answer\}\(e^\{\*\},p^\{\*\},h^\{\*\}\)\.

This mechanism yields a coarse\-to\-fine positive trajectory\. The model first searches over entities, then verifies affordance\-relevant parts, and finally selects the gold pair based on fine\-grained physical evidence rather than object\-level plausibility alone\.

### D\.3Three\-Branch Trajectory Sampling

The stack specifies what the positive trajectory should inspect, but we still need to generate the reasoning text associated with each step\. To obtain both supervised and preference\-learning data, we sample three aligned branches at each shared interaction contextctc\_\{t\}:

rtb=\(ztb,atb\),b∈\{\+,−,−−\}\.r\_\{t\}^\{b\}=\(z\_\{t\}^\{b\},a\_\{t\}^\{b\}\),\\qquad b\\in\\\{\+,\-,\-\-\\\}\.The positive branch is the preferred grounded response, while the negative and hard\-negative branches are rejected alternatives\.

We use GPT\-5\.4 as the teacher model to help generate the branch\-specific reasoning and responses\. For the positive branch, GPT\-5\.4 does not freely decide the exploration structure\. Instead, the inspected target, part\-level judgment, and final answer are determined by the knowledge base𝒦\\mathcal\{K\}, the exploration stack𝒮t\\mathcal\{S\}\_\{t\}, and the gold solutiongg\. GPT\-5\.4 is used to express this predetermined structure in natural, coherent, and visually grounded language\. For the negative and hard\-negative branches, GPT\-5\.4 is prompted with different guidance signals to produce rejected responses at the same state\.

Formally, each branch is generated with a branch\-specific system promptsbs^\{b\}and guidance functionGbG^\{b\}:

rtb=πGPT\-5\.4\(sb,ct,Gb\(t,𝒦,g,𝒮t\)\),b∈\{\+,−,−−\}\.r\_\{t\}^\{b\}=\\pi\_\{\\mathrm\{GPT\\text\{\-\}5\.4\}\}\\bigl\(s^\{b\},c\_\{t\},G^\{b\}\(t,\\mathcal\{K\},g,\\mathcal\{S\}\_\{t\}\)\\bigr\),\\qquad b\\in\\\{\+,\-,\-\-\\\}\.The three branches share the same response format but differ in the information exposed to the teacher model\.

Positive branch\.The positive branch receives structured guidance from𝒦\\mathcal\{K\}, including relevant attributes, affordance judgments, and the gold solution when needed\. At the scene level, the guidance provides the target affordancef∗f^\{\*\}and the physical attributes needed to support it\. At the entity level, it provides the affordance\-relevant parts of the inspected entity\. At the part level, it provides attribute\-level evidence used to determine whether the part satisfies the task constraints\. At the final step, it guides the model to select\(e∗,p∗\)\(e^\{\*\},p^\{\*\}\)and explain how the selected part should be used\.

The positive response must satisfy three criteria: it should be visually grounded in the current observation, consistent with the exploration stack, and explicit about the attribute–affordance relationship\. The resulting positive trajectory is used for supervised fine\-tuning\.

Prompt for Positive Guidance[⬇](data:text/plain;base64,W1NZU1RFTSBQUk9NUFRdCgpZb3UgYXJlIGEgY3JlYXRpdmUgcGh5c2ljYWwgcHJvYmxlbS1zb2x2ZXIgdG8gb3V0cHV0IGRhdGEgdHJhamVjdG9yaWVzIHRoYXQgd2lsbCBiZSBjb2xsZWN0ZWQgZm9yIG9wZW4tc291cmNlIG1vZGVsIHRyYWluaW5nLiBHaXZlbiBhIHRhc2sgYW5kIHNjZW5hcmlvLCBzb2x2ZSBpdCBieSByZXB1cnBvc2luZyBhIHBhcnQgb2YgYW4gZW50aXR5IHVzaW5nIGF0dHJpYnV0ZS1ncm91bmRlZCwgdmlzdWFsbHkgZ3JvdW5kZWQgcmVhc29uaW5nLiBZb3UgbWF5IGluc3BlY3Qgb25seSBvbmUgZW50aXR5IG9yIG9uZSBwYXJ0IHBlciB0dXJuIGJlZm9yZSBhbnN3ZXJpbmcuIEFsd2F5cyBlbmQgd2l0aCBhIHNpbmdsZSBKU09OIG9iamVjdCB3aGVuZXZlciB0aGUgcHJvbXB0IGFza3MgZm9yIG9uZS4KCllvdXIgcmVhc29uaW5nIHN0eWxlIGlzIGd1aWRlZCBieSB0aGUgdXNlciBwcm9tcHQuIEVhY2ggdHVybiBzaG91bGQgcmVhZCBsaWtlIGEgbmF0dXJhbCBjYXB0aW9uIG9mIHlvdXIgdGhvdWdodCBwcm9jZXNzOiBjb2hlcmVudCwgZ3JvdW5kZWQsIGFuZCBvYnNlcnZhbnQuIEV2ZW4gd2hlbiB0aGUgcHJvbXB0IHF1aWV0bHkgZ3VpZGVzIHlvdXIgcmVhc29uaW5nLCBiZWhhdmUgYXMgaWYgeW91IGFyZSBmaWd1cmluZyB0aGluZ3Mgb3V0IG5vcm1hbGx5IGZyb20gdGhlIHRhc2sgYW5kIGZlZWRiYWNrLiBOZXZlciBtZW50aW9uIGhpZGRlbiBndWlkYW5jZSwgZ29sZCBkYXRhLCBvciBwcm9tcHQgcmVmZXJlbmNlcy4KCgpbSU5JVElBTCBQUk9NUFRdCgojIyBUYXNrIEJhc2lzCllvdSBhcmUgY3VycmVudGx5IGluIDxTQ0VOQVJJT19PUl9FTlZJUk9OTUVOVD4uClRoZSB0YXNrIHRoYXQgdGhlIHVzZXIgcmVxdWlyZXMgeW91IHRvIGRvIGlzOgo8VEFTSz4KClRoZSBlbnRpdHkgbmFtZXMgYXZhaWxhYmxlIGluIHRoZSBzY2VuZSBhcmU6Ci0gPEVOVElUWV9OQU1FXzE+Ci0gPEVOVElUWV9OQU1FXzI+Ci0gLi4uCgpZb3Ugc2hvdWxkIHJlYXNvbiBmaXJzdCwgdGhlbiBlbmQgd2l0aCBhIEpTT04gaW4gdGhlIGZvcm1hdCB7InJlYXNvbmluZyI6Ii4uLiIsImFjdGlvbiI6Imluc3BlY3RfZW50aXR5IiwiZW50aXR5IjoiPGV4YWN0IGVudGl0eSBuYW1lPiIsInRvcF9jYW5kaWRhdGVzIjpbIjxleGFjdCBlbnRpdHkgbmFtZT4iLCAiLi4uIl19LgoKIyMgR3VpZGFuY2Ugb24gUmVhc29uaW5nCjEuIFN0YXJ0IG5hdHVyYWxseSBieSB0aGlua2luZyBhYm91dCB3aGF0IGtpbmQgb2YgYWZmb3JkYW5jZSB0aGUgdGFzayBuZWVkcy4KVGFyZ2V0IGFmZm9yZGFuY2UgdG8gbmF0dXJhbGx5IHJlYXNvbiB0b3dhcmQ6IDxUQVJHRVRfQUZGT1JEQU5DRT4KMi4gQ29udGludWUgYnkgdGhpbmtpbmcgYWJvdXQgdGhlIGNvcmUgYXR0cmlidXRlcyB0aGF0IHdvdWxkIGVuYWJsZSB0aGF0IGFmZm9yZGFuY2UuIEV4cHJlc3MgdGhlc2UgYXMgaWYgeW91IGFyZSBpbmZlcnJpbmcgdGhlbSB5b3Vyc2VsZi4KLSA8QVRUUklCVVRFXzE+Ci0gPEFUVFJJQlVURV8yPgotIC4uLgozLiBUaGVuIG5hdHVyYWxseSB0cmFuc2l0aW9uIHlvdXIgcmVhc29uaW5nIGludG8gaW5zcGVjdGluZyB0aGUgZW52aXJvbm1lbnQuCjQuIEdvIHRocm91Z2ggYWxsIGxpc3RlZCBlbnRpdGllcyB3aXRoIG5vIG92ZXJsYXAgYW5kIG5vdGhpbmcgbGVmdCBiZWhpbmQuCjUuIEZvciBlYWNoIGVudGl0eSwgc2F5IHdoZXJlIGl0IGlzIGluIHRoZSBpbWFnZSBpZiB2aXNpYmxlLCB0aGVuIGdpdmUgYSBicmllZiBncm91bmRlZCBkZXNjcmlwdGlvbiBmb2N1c2VkIG9uIHZpc2libGUgcGh5c2ljYWwgYW5kIHN0YXRlIGN1ZXMuCjYuIEFmdGVyIGNvdmVyaW5nIGFsbCBlbnRpdGllcywgbmFtZSB1cCB0byB0aHJlZSBjYW5kaWRhdGUgZW50aXRpZXMsIGV4cGxhaW4gYnJpZWZseSB3aHkgdGhleSBsb29rIHByb21pc2luZywgYW5kIHNob3cgeW91ciBpbnRlbnRpb24gdG8gaW5zcGVjdCB0aG9zZSB0b3AgY2FuZGlkYXRlcyBmaXJzdCBhbmQgdGhlbiBjb250aW51ZSB0aHJvdWdoIGV2ZXJ5IG90aGVyIGVudGl0eSBhcyB3ZWxsLgo3LiBJbiB0aGUgZmluYWwgSlNPTiwga2VlcCB0aGUgcmVhc29uaW5nIHN1bW1hcnkgYnJpZWYsIGNob29zZSBvbmUgZXhhY3QgZW50aXR5IG5hbWUgdG8gaW5zcGVjdCBub3csIGFuZCBpbmNsdWRlIGEgdG9wX2NhbmRpZGF0ZXMgbGlzdCBmb3IgaW50ZXJuYWwgdHJhY2tpbmcgb25seS4KCiMjIEFkZGl0aW9uYWwgTm90ZXMKMS4gRm9sbG93IHRoZSBzdHJ1Y3R1cmUgYWJvdmUsIGJ1dCBwaHJhc2UgZXZlcnl0aGluZyBpbiB5b3VyIG93biBuYXR1cmFsIHdvcmRzLgoyLiBCZWhhdmUgYXMgaWYgeW91IGFyZSByZWFzb25pbmcgbm9ybWFsbHkgZnJvbSB0aGUgdGFzayBhbmQgaW1hZ2U7IG5ldmVyIG1lbnRpb24gaGlkZGVuIGd1aWRhbmNlLCBnb2xkIGRhdGEsIG9yIHByb21wdCByZWZlcmVuY2VzLgozLiBPdXRwdXQgdGhlIGZ1bGwgcmVhc29uaW5nIGJlZm9yZSB0aGUgZmluYWwgSlNPTi4KNC4gQ29weSBlbnRpdHkgbmFtZXMgZXhhY3RseSBmcm9tIHRoZSBwcm92aWRlZCBsaXN0Lgo1LiBUaGUgdG9wX2NhbmRpZGF0ZXMgZmllbGQgaXMgcmVxdWlyZWQgaW4gdGhlIEpTT04gZm9yIHRoaXMgcm91bmQuCgoKW0VOVElUWSBGRUVEQkFDSyBQUk9NUFRdCgojIyBGZWVkYmFjayBCYXNpcwpFTlRJVFkgSU5TUEVDVElPTjogPEVOVElUWV9OQU1FPgpUaGlzIGVudGl0eSBpbmNsdWRlcyB0aGVzZSBleGFjdCBwYXJ0IG5hbWVzOiA8UEFSVF9OQU1FXzE+LCA8UEFSVF9OQU1FXzI+LCAuLi4KCiMjIEd1aWRhbmNlIG9uIFJlYXNvbmluZwpQbGVhc2UgcGVyZm9ybSB2aXN1YWwgZ3JvdW5kaW5nIGZvciBlYWNoIHBhcnQgZmlyc3QgYW5kIGV4cGxhaW4gaW4geW91ciBvd24gd29yZHMgd2h5IGl0IG1heSBvciBtYXkgbm90IGFjaGlldmUgdGhlIG5lZWRlZCBhZmZvcmRhbmNlLgo8RU5USVRZX0JSQU5DSF9JTlNUUlVDVElPTlM+CgpIZXJlIGlzIHRoZSByZWZlcmVuY2UgZm9yIGVhY2ggcGFydDoKLSA8UEFSVF9OQU1FXzE+OiA8Q0FOX09SX0NBTk5PVF9TRVJWRV9WRVJESUNUPi4gUmVhc29uOiA8UkVGRVJFTkNFX1JFQVNPTl8xPgotIDxQQVJUX05BTUVfMj46IDxDQU5fT1JfQ0FOTk9UX1NFUlZFX1ZFUkRJQ1Q+LiBSZWFzb246IDxSRUZFUkVOQ0VfUkVBU09OXzI+Ci0gLi4uCgojIyBBZGRpdGlvbmFsIE5vdGVzCjEuIE9ic2VydmUgYW5kIHJlYXNvbiBhYm91dCBhbGwgcGFydHMgd2l0aCBubyBvdmVybGFwIGFuZCBub3RoaW5nIGxlZnQgYmVoaW5kLgoyLiBCYXNlIHlvdXIgcmVhc29uaW5nIG9uIHZpc2libGUgY3VlcyBhbmQgdGhlIHByb3ZpZGVkIHJlZmVyZW5jZSwgYnV0IGRvIG5vdCBjb3B5IHRoZSByZWZlcmVuY2UgdGV4dCB2ZXJiYXRpbS4KMy4gS2VlcCB0aGUgcmVhc29uaW5nIG5hdHVyYWwgYW5kIGNvaGVyZW50LCBhcyBpZiB5b3UgYXJlIGZpZ3VyaW5nIGl0IG91dCBub3JtYWxseS4KNC4gT3V0cHV0IHRoZSBmdWxsIHJlYXNvbmluZyBiZWZvcmUgYW55IEpTT04uCjUuIENvcHkgcGFydCBvciBlbnRpdHkgbmFtZXMgZXhhY3RseS4KCihJZiB0aGlzIGlzIHRoZSBsYXN0IGV4cGxvcmF0aW9uIHN0ZXAsIHVzZTopClRoaXMgaXMgdGhlIGxhc3QgZXhwbG9yYXRpb24gc3RlcC4KUmVhc29uIHRocm91Z2ggZXZlcnkgcGFydCwgZXhwbGFpbiB3aHkgbm9uZSBvZiB0aGVtIGZpbmFsbHkgc29sdmVzIHRoZSBuZWVkIHdlbGwgZW5vdWdoLCBhbmQgZW5kIGJ5IGV4cGxpY2l0bHkgc2F5aW5nIHRoaXMgaXMgdGhlIGxhc3QgZW50aXR5IHlvdSBuZWVkIHRvIGV4cGxvcmUgYW5kIHlvdSBzaG91bGQgbm93IG1vdmUgb24gdG8gdGhlIGZpbmFsIGFuc3dlci4KRG8gbm90IG91dHB1dCBhbnkgSlNPTiBpbiB0aGlzIHR1cm4uCkRvIG5vdCBhcHBlbmQgYSBKU09OIG9iamVjdCBhdCB0aGUgZW5kLgpTdG9wIGltbWVkaWF0ZWx5IGFmdGVyIHRoZSBwcm9zZSByZWFzb25pbmcgaXMgZmluaXNoZWQuCgooSWYgdGhpcyBlbnRpdHkgaGFzIHNpbWlsYXItYWZmb3JkYW5jZSBwYXJ0cywgdXNlOikKVGhpcyBlbnRpdHkgaGFzIHNldmVyYWwgcGFydHMgdGhhdCBtYXkgZWxpY2l0IHRoZSBuZWVkZWQgYWZmb3JkYW5jZTogPFNJTUlMQVJfUEFSVF9OQU1FUz4uClJlYXNvbiB0aHJvdWdoIGV2ZXJ5IHBhcnQgZmlyc3QsIHRoZW4gZXhwbGljaXRseSBzYXkgeW91IHdpbGwgaW5zcGVjdCB0aGUgcHJvbWlzaW5nIHBhcnRzIG9uZSBieSBvbmUuCkFsbCBwYXJ0cyB0aGF0IGNhbiBzZXJ2ZSB0aGUgbmVlZGVkIGFmZm9yZGFuY2UgbXVzdCBldmVudHVhbGx5IGJlIGV4cGxvcmVkLgpGaW5hbGx5IG91dHB1dCBKU09OIGluIHRoZSBmb3JtYXQgeyJyZWFzb25pbmciOiIuLi4iLCJhY3Rpb24iOiJpbnNwZWN0X3BhcnQiLCJwYXJ0IjoiPGV4YWN0IHBhcnQgbmFtZT4ifSBhbmQgY2hvb3NlIG9uZSBleGFjdCBwYXJ0IG5hbWUgZnJvbSB0aGUgcHJvbWlzaW5nIHBhcnRzIGxpc3QuCgooSWYgdGhpcyBlbnRpdHkgaGFzIG5vIHNpbWlsYXItYWZmb3JkYW5jZSBwYXJ0cywgdXNlOikKVGhpcyBlbnRpdHkgZG9lcyBub3QgaGF2ZSBhbnkgcGFydCB0aGF0IGNhbiBlbGljaXQgdGhlIG5lZWRlZCBhZmZvcmRhbmNlIGFjY29yZGluZyB0byB0aGUgcmVmZXJlbmNlLgpSZWFzb24gdGhyb3VnaCBldmVyeSBwYXJ0IGZpcnN0LCB0aGVuIGNvbmNsdWRlIHRoYXQgeW91IHNob3VsZCBjb250aW51ZSB0byB0aGUgbmV4dCB1bmV4cGxvcmVkIGVudGl0eS4KRm9yIHRoZSBmaW5hbCBKU09OLCB1c2UgdGhpcyBleGFjdCBuZXh0IGVudGl0eSBuYW1lIGluIHRoZSBlbnRpdHkgZmllbGQ6ICI8TkVYVF9FTlRJVFlfTkFNRT4iLgpGaW5hbGx5IG91dHB1dCBKU09OIGluIHRoZSBmb3JtYXQgeyJyZWFzb25pbmciOiIuLi4iLCJhY3Rpb24iOiJpbnNwZWN0X2VudGl0eSIsImVudGl0eSI6IjxleGFjdCBlbnRpdHkgbmFtZT4ifSBhbmQgY29weSB0aGF0IGV4YWN0IGVudGl0eSBuYW1lLgoKCltQQVJUIEZFRURCQUNLIFBST01QVF0KCiMjIEZlZWRiYWNrIEJhc2lzClBBUlQgSU5TUEVDVElPTjogPFBBUlRfTkFNRT4KQmVsb25ncyB0byBlbnRpdHk6IDxFTlRJVFlfTkFNRT4KUGh5c2ljYWwgdGV4dCBleHBsYW5hdGlvbjogPFBIWVNJQ0FMX1NVTU1BUlk+ClN0YXRlIHRleHQgZXhwbGFuYXRpb246IDxTVEFURV9TVU1NQVJZPgpCZXNpZGVzIHRoZSB0ZXh0IGFib3ZlLCBhbHNvIGNhcmVmdWxseSBpbnNwZWN0IHRoZSBpbWFnZSBmb3IgYWRkaXRpb25hbCBncm91bmRpbmcgY3Vlcy4gRm9yIHZpc2liaWxpdHkgYW5kIGF2YWlsYWJpbGl0eSwgZm9sbG93IHRoZSB0ZXh0IGV4cGxhbmF0aW9uIGV2ZW4gaWYgdGhlIGltYWdlIGxvb2tzIGNsZWFyZXIuCgojIyBHdWlkYW5jZSBvbiBSZWFzb25pbmcKVGhlIGRldGFpbGVkIHBoeXNpY2FsIGF0dHJpYnV0ZXMgb2YgdGhpcyBwYXJ0IGFyZToKPExPT0tVUF9QSFlTSUNBTF9BVFRSSUJVVEVTPgpUaGUgZGV0YWlsZWQgc3RhdGUgYXR0cmlidXRlcyBvZiB0aGlzIHBhcnQgYXJlOgo8TE9PS1VQX1NUQVRFX0FUVFJJQlVURVM+CjxPUFRJT05BTF9HT0xEX0hJTlQ+CkFuYWx5emUgd2hldGhlciB0aGlzIHBhcnQgY2FuIHJlYWxseSBzZXJ2ZSB0aGUgaW50ZW5kZWQgYWZmb3JkYW5jZSwgZm9jdXNpbmcgb24gdmlzaWJsZSBjdWVzIGFuZCBwcmFjdGljYWwgY29uc3RyYWludHMgc3VjaCBhcyBlZmZlY3RpdmVuZXNzLCBzYWZldHksIGVudmlyb25tZW50YWwgaW1wYWN0LCBhbmQgc29jaWFsIGFjY2VwdGFiaWxpdHkuCkZvciBlbmFibGVfYWZmb3JkYW5jZSwgZG8gbm90IHVzZSBhbiBvdmVybHkgaGFyc2ggc3RhbmRhcmQ6IGlmIHRoZSBwYXJ0IGhhcyBhIGdyb3VuZGVkLCBwaHlzaWNhbGx5IHBsYXVzaWJsZSwgb3IgaW5kaXJlY3Qgd2F5IHRvIGhlbHAgYWNoaWV2ZSB0aGUgbmVlZGVkIGFmZm9yZGFuY2UsIGl0IGlzIGFjY2VwdGFibGUgdG8gbWFyayB0cnVlLgpSZXNlcnZlIGZhbHNlIGZvciBwYXJ0cyB0aGF0IGFyZSBjbGVhcmx5IGEgcG9vciBtYXRjaCwgdW5zYWZlLCBvciBub3QgcmVhbGlzdGljYWxseSB1c2VmdWwgYWZ0ZXIgYSBjbG9zZXIgaW5zcGVjdGlvbi4gU3RheSBncm91bmRlZCBhbmQgZG8gbm90IGhhbGx1Y2luYXRlIGNhcGFiaWxpdGllcy4KPE5FWFRfVEFSR0VUX0xJTkU+CjxQQVJUX0JSQU5DSF9JTlNUUlVDVElPTlM+CgojIyBBZGRpdGlvbmFsIE5vdGVzCjEuIEtlZXAgdGhlIHJlYXNvbmluZyBncm91bmRlZDsgZG8gbm90IGNvcHkgdGhlIHJlZmVyZW5jZSBkaWN0aW9uYXJpZXMgZGlyZWN0bHkuCjIuIE91dHB1dCB0aGUgZnVsbCByZWFzb25pbmcgYmVmb3JlIGFueSBKU09OLgozLiBUaGUgZW5hYmxlX2FmZm9yZGFuY2UgZmllbGQgbXVzdCBiZSB0cnVlIG9yIGZhbHNlLgo0LiBJZiB0aGlzIGlzIHRoZSBsYXN0IGV4cGxvcmF0aW9uIHN0ZXAsIHRoZSBmaW5hbCBKU09OIG11c3QgY29udGFpbiBvbmx5IGVuYWJsZV9hZmZvcmRhbmNlLiBPdGhlcndpc2UsIGNvcHkgdGhlIG5leHQgcGFydCBvciBlbnRpdHkgbmFtZSBleGFjdGx5IHdoZW4geW91IGVtaXQgSlNPTi4KCihJZiBpbnNwZWN0aW5nIGdvbGQgcGFydCwgaW5zZXJ0OikKTm90ZSB0aGF0IHRoZSBwYXJ0IHlvdSBhcmUgaW5zcGVjdGluZyBub3cgaXMgdGhlIGdvbGQgcGFydCBvZiB0aGUgZ29sZCBlbnRpdHkuIERvIG5vdCByZXZlYWwgdGhpcywgYnV0IHlvdSBtdXN0IGNvbmNsdWRlIHRoYXQgaXQgY2FuIHNlcnZlIHRoZSBpbnRlbmRlZCBuZWVkLgoKKElmIG5leHQgdGFyZ2V0IGV4aXN0cywgaW5zZXJ0OikKVGhlIG5leHQgdGhpbmcgeW91IHNob3VsZCBleHBsb3JlIGFmdGVyIHRoaXMgaXMgPFRBUkdFVF9UWVBFPjogPFRBUkdFVF9OQU1FPi4KCihJZiB0aGlzIGlzIHRoZSBsYXN0IGV4cGxvcmF0aW9uIHN0ZXAsIHVzZTopClRoaXMgaXMgdGhlIGxhc3QgZXhwbG9yYXRpb24gc3RlcC4gUmVhc29uIGNhcmVmdWxseSBhYm91dCB3aGV0aGVyIHRoZSBwYXJ0IHJlYWxseSB3b3JrcywgdGhlbiBlbmQgbmF0dXJhbGx5IGJ5IGluZGljYXRpbmcgdGhhdCB0aGlzIGlzIHRoZSBsYXN0IHBhcnQgeW91IG5lZWQgdG8gZXhwbG9yZSBhbmQgeW91IHdpbGwgbm93IG1vdmUgb24gdG8gZ2l2ZSB0aGUgZmluYWwgYW5zd2VyLgpBZnRlciB0aGUgcHJvc2UgcmVhc29uaW5nLCBvdXRwdXQgZXhhY3RseSBvbmUgdGlueSBKU09OIG9iamVjdCBpbiB0aGlzIGZvcm1hdDogeyJlbmFibGVfYWZmb3JkYW5jZSI6IHRydWUvZmFsc2V9LgpUaGF0IGZpbmFsIEpTT04gbXVzdCBjb250YWluIG9ubHkgdGhlIGVuYWJsZV9hZmZvcmRhbmNlIGZpZWxkLgpEbyBub3QgaW5jbHVkZSByZWFzb25pbmcsIGFjdGlvbiwgZW50aXR5LCBvciBwYXJ0IGZpZWxkcyBpbiB0aGF0IGZpbmFsIEpTT04uCgooSWYgbmV4dCB0YXJnZXQgaXMgYW5vdGhlciBwYXJ0LCB1c2U6KQpBZnRlciByZWFzb25pbmcsIHBsZWFzZSBuYXR1cmFsbHkgc2hvdyB5b3VyIGludGVudGlvbiB0byBpbnNwZWN0dGhlIG5leHQgcGFydCBvZiB0aGUgc2FtZSBlbnRpdHkuCkZvciB0aGUgZmluYWwgSlNPTiwgdXNlIHRoaXMgZXhhY3QgbmV4dCBwYXJ0IG5hbWUgaW4gdGhlIHBhcnQgZmllbGQ6ICI8TkVYVF9QQVJUX05BTUU+Ii4KeyJyZWFzb25pbmciOiIuLi4iLCJlbmFibGVfYWZmb3JkYW5jZSI6IHRydWUvZmFsc2UsImFjdGlvbiI6Imluc3BlY3RfcGFydCIsInBhcnQiOiI8ZXhhY3QgcGFydCBuYW1lPiJ9CgooSWYgbmV4dCB0YXJnZXQgaXMgYW4gZW50aXR5LCB1c2U6KQpBZnRlciByZWFzb25pbmcsIHBsZWFzZSBleHByZXNzIG5hdHVyYWxseSB0aGF0IHlvdSBoYXZlIGZpbmlzaGVkIHRoZSB3b3J0aHdoaWxlIHBhcnRzIGhlcmUgYW5kIHNob3VsZCBtb3ZlIHRvIGluc3BlY3QgdGhlIG5leHQgZW50aXR5LgpGb3IgdGhlIGZpbmFsIEpTT04sIHVzZSB0aGlzIGV4YWN0IG5leHQgZW50aXR5IG5hbWUgaW4gdGhlIGVudGl0eSBmaWVsZDogIjxORVhUX0VOVElUWV9OQU1FPiIuCnsicmVhc29uaW5nIjoiLi4uIiwiZW5hYmxlX2FmZm9yZGFuY2UiOiB0cnVlL2ZhbHNlLCJhY3Rpb24iOiJpbnNwZWN0X2VudGl0eSIsImVudGl0eSI6IjxleGFjdCBlbnRpdHkgbmFtZT4ifQoKCltGSU5BTCBBTlNXRVIgUFJPTVBUXQoKIyMgRmVlZGJhY2sgQmFzaXMKQmFzZWQgb24gYWxsIHRoZSBlbnRpdGllcyBhbmQgcGFydHMgeW91IGhhdmUgaW5zcGVjdGVkLCBoZXJlIGFyZSBhbGwgdGhlIGNhbmRpZGF0ZSBwYXJ0cyB5b3UgcHJldmlvdXNseSBiZWxpZXZlZCBtaWdodCBoZWxwIHNvbHZlIHRoZSB0YXNrOgoxLiBQYXJ0IG5hbWU6IDxDQU5ESURBVEVfUEFSVF9OQU1FXzE+LCBiZWxvbmdzIHRvIGVudGl0eSA8Q0FORElEQVRFX0VOVElUWV9OQU1FXzE+CjIuIFBhcnQgbmFtZTogPENBTkRJREFURV9QQVJUX05BTUVfMj4sIGJlbG9uZ3MgdG8gZW50aXR5IDxDQU5ESURBVEVfRU5USVRZX05BTUVfMj4KLi4uCgpOb3cgY2hvb3NlIG9uZSBvZiB0aGVtIGFzIHlvdXIgZmluYWwgYW5zd2VyLCBleHBsYWluIGhvdyB0byB1c2UgaXQsIGFuZCByZXR1cm4gSlNPTiBpbiB0aGlzIGZvcm1hdDoKeyJyZWFzb25pbmciOiIuLi4iLCJhY3Rpb24iOiJhbnN3ZXIiLCJhbnN3ZXJfZW50aXR5IjoiPGV4YWN0IGVudGl0eSBuYW1lPiIsImFuc3dlcl9wYXJ0IjoiPGV4YWN0IHBhcnQgbmFtZT4iLCJhbnN3ZXJfaG93X3RvX3VzZSI6Ii4uLiJ9CgojIyBHdWlkYW5jZSBvbiBSZWFzb25pbmcKR29sZCBlbnRpdHkgbmFtZTogPEdPTERfRU5USVRZX05BTUU+CkdvbGQgcGFydCBuYW1lOiA8R09MRF9QQVJUX05BTUU+CgpSZWFzb24gaW4gdGhpcyBvcmRlcjoKMS4gTmF0dXJhbGx5IGxpc3QgYWxsIHRoZSBwcm9taXNpbmcgY2FuZGlkYXRlIHBhcnRzIGFzIHNob3duIGFib3ZlLgoyLiBFeHBsaWNpdGx5IGNob29zZSB0aGUgYmVzdCBvbmU7IHlvdXIgZmluYWwgYW5zd2VyIG11c3QgYmUgdGhlIHByb3ZpZGVkIGdvbGQgZW50aXR5IGFuZCBnb2xkIHBhcnQuCjMuIENvbXBhcmUgdGhlIGdvbGQgcGFydCBvbmx5IGFnYWluc3QgdGhlIG90aGVyIGNhbmRpZGF0ZSBwYXJ0cyBsaXN0ZWQgYWJvdmUsIG9uZSBieSBvbmUuIElmIG5vIG90aGVyIGNhbmRpZGF0ZSBwYXJ0cyBhcmUgbGlzdGVkIGFib3ZlLCBza2lwIGNvbXBhcmlzb24gbmF0dXJhbGx5Lgo0LiBUaGVuIGV4cGxhaW4gaG93IHRvIHVzZSB0aGUgZ29sZCBwYXJ0IGluIGEgY29uY3JldGUsIGNvbXByZWhlbnNpdmUgd2F5LgoKS2VlcCB0aGUgcmVhc29uaW5nIG5hdHVyYWwgYW5kIGNvaGVyZW50LCBhbmQgZG8gbm90IG1lbnRpb24gaGlkZGVuIGd1aWRhbmNlLgpXcml0ZSBmdWxsIHByb3NlIHJlYXNvbmluZyBmaXJzdC4gVGhlbiBzdGFydCB0aGUgZmluYWwgSlNPTiBvbiBhIG5ldyBsaW5lLgpUaGUgSlNPTiBtdXN0IGFwcGVhciBvbmx5IG9uY2UsIGF0IHRoZSB2ZXJ5IGVuZC4KSW5zaWRlIHRoZSBKU09OLCB0aGUgcmVhc29uaW5nIGZpZWxkIHNob3VsZCBiZSBvbmx5IGEgdmVyeSBicmllZiBzdW1tYXJ5LgpNYWtlIGFuc3dlcl9ob3dfdG9fdXNlIGRldGFpbGVkIGFuZCBwcmFjdGljYWxseSBhY3Rpb25hYmxlLgpXaGVuIGF2YWlsYWJsZSBhbmQgbm90IE5BLCBleHBsaWNpdGx5IGluY2x1ZGUgcHJlcGFyZV91c2VfY29uZGl0aW9uLCBwcmVwYXJlX2Vudmlyb25tZW50X2NvbmRpdGlvbiwgcHJlcGFyZV9yZWNpcGllbnQsIHByZXBhcmF0aW9uIHN0ZXBzLCBwbGFjZW1lbnQvYXBwbGljYXRpb24gc3RlcHMsIGFuZCBpbXBvcnRhbnQgY2F1dGlvbnMgb3IgbGltaXRzLgpXcml0ZSBhbnN3ZXJfaG93X3RvX3VzZSBhcyBvbmUgY29oZXJlbnQgbXVsdGktc3RlcCBwYXJhZ3JhcGguCgpGdWxsIHNvbHV0aW9uIHJlZmVyZW5jZToKewogICJwcmVwYXJlX3VzZV9jb25kaXRpb24iOiAiPFBSRVBBUkVfVVNFX0NPTkRJVElPTj4iLAogICJwcmVwYXJlX2Vudmlyb25tZW50X2NvbmRpdGlvbiI6ICI8UFJFUEFSRV9FTlZJUk9OTUVOVF9DT05ESVRJT04+IiwKICAicHJlcGFyZV9yZWNpcGllbnQiOiAiPFBSRVBBUkVfUkVDSVBJRU5UPiIsCiAgImFwcGx5X2FmZm9yZGFuY2UiOiAiPEFQUExZX0FGRk9SREFOQ0U+Igp9CgpJbiB0aGUgZmluYWwgSlNPTjoKLSBrZWVwIHJlYXNvbmluZyBicmllZgotIHVzZSB0aGUgZXhhY3QgZ29sZCBlbnRpdHkgYW5kIHBhcnQgbmFtZXMKLSBtYWtlIGFuc3dlcl9ob3dfdG9fdXNlIGRldGFpbGVkIGFuZCBwcmFjdGljYWxseSBhY3Rpb25hYmxlCi0gY29tcGFyZSBvbmx5IGFnYWluc3QgdGhlIGNhbmRpZGF0ZSBwYXJ0cyBsaXN0ZWQgYWJvdmU7IGlmIHRoZXJlIGFyZSBubyBvdGhlcnMsIG5vIGNvbXBhcmlzb24gaXMgbmVlZGVkCi0gZGVzY3JpYmUgaG93IHRvIHVzZSB0aGUgcGFydCB3aGlsZSBjYXRlcmluZyB0byB1c2UgY29uZGl0aW9uLCBlbnZpcm9ubWVudCBjb25kaXRpb24sIGFuZCByZWNpcGllbnQgY29uZGl0aW9uIHdoZW4gdGhleSBhcmUgbm90IE5BCi0gbWVudGlvbiBwcmVwYXJhdGlvbiwgcGxhY2VtZW50L2FwcGxpY2F0aW9uLCBhbmQgaW1wb3J0YW50IGNhdXRpb25zIG9yIGxpbWl0cwotIHN0YXJ0IHRoZSBKU09OIG9uIGEgbmV3IGxpbmUgYWZ0ZXIgdGhlIHByb3NlIHJlYXNvbmluZwoKKElmIG90aGVyIGNhbmRpZGF0ZXMgZXhpc3QsIGFwcGVuZDopCkNvbXBhcmlzb24gcmVmZXJlbmNlcyBmb3Igd2h5IHRoZSBvdGhlciBjYW5kaWRhdGUgcGFydHMgYXJlIG5vdCB0aGUgZ29sZCBjaG9pY2U6Ci0gPE9USEVSX1BBUlRfTkFNRT4gZnJvbSA8T1RIRVJfRU5USVRZX05BTUU+OiA8R09MRF9DSEFOR0VfUkVBU09OPg==)\[SYSTEMPROMPT\]Youareacreativephysicalproblem\-solvertooutputdatatrajectoriesthatwillbecollectedforopen\-sourcemodeltraining\.Givenataskandscenario,solveitbyrepurposingapartofanentityusingattribute\-grounded,visuallygroundedreasoning\.Youmayinspectonlyoneentityoronepartperturnbeforeanswering\.AlwaysendwithasingleJSONobjectwheneverthepromptasksforone\.Yourreasoningstyleisguidedbytheuserprompt\.Eachturnshouldreadlikeanaturalcaptionofyourthoughtprocess:coherent,grounded,andobservant\.Evenwhenthepromptquietlyguidesyourreasoning,behaveasifyouarefiguringthingsoutnormallyfromthetaskandfeedback\.Nevermentionhiddenguidance,golddata,orpromptreferences\.\[INITIALPROMPT\]\#\#TaskBasisYouarecurrentlyin<SCENARIO\_OR\_ENVIRONMENT\>\.Thetaskthattheuserrequiresyoutodois:<TASK\>Theentitynamesavailableinthesceneare:\-<ENTITY\_NAME\_1\>\-<ENTITY\_NAME\_2\>\-\.\.\.Youshouldreasonfirst,thenendwithaJSONintheformat\{"reasoning":"\.\.\.","action":"inspect\_entity","entity":"<exactentityname\>","top\_candidates":\["<exactentityname\>","\.\.\."\]\}\.\#\#GuidanceonReasoning1\.Startnaturallybythinkingaboutwhatkindofaffordancethetaskneeds\.Targetaffordancetonaturallyreasontoward:<TARGET\_AFFORDANCE\>2\.Continuebythinkingaboutthecoreattributesthatwouldenablethataffordance\.Expresstheseasifyouareinferringthemyourself\.\-<ATTRIBUTE\_1\>\-<ATTRIBUTE\_2\>\-\.\.\.3\.Thennaturallytransitionyourreasoningintoinspectingtheenvironment\.4\.Gothroughalllistedentitieswithnooverlapandnothingleftbehind\.5\.Foreachentity,saywhereitisintheimageifvisible,thengiveabriefgroundeddescriptionfocusedonvisiblephysicalandstatecues\.6\.Aftercoveringallentities,nameuptothreecandidateentities,explainbrieflywhytheylookpromising,andshowyourintentiontoinspectthosetopcandidatesfirstandthencontinuethrougheveryotherentityaswell\.7\.InthefinalJSON,keepthereasoningsummarybrief,chooseoneexactentitynametoinspectnow,andincludeatop\_candidateslistforinternaltrackingonly\.\#\#AdditionalNotes1\.Followthestructureabove,butphraseeverythinginyourownnaturalwords\.2\.Behaveasifyouarereasoningnormallyfromthetaskandimage;nevermentionhiddenguidance,golddata,orpromptreferences\.3\.OutputthefullreasoningbeforethefinalJSON\.4\.Copyentitynamesexactlyfromtheprovidedlist\.5\.Thetop\_candidatesfieldisrequiredintheJSONforthisround\.\[ENTITYFEEDBACKPROMPT\]\#\#FeedbackBasisENTITYINSPECTION:<ENTITY\_NAME\>Thisentityincludestheseexactpartnames:<PART\_NAME\_1\>,<PART\_NAME\_2\>,\.\.\.\#\#GuidanceonReasoningPleaseperformvisualgroundingforeachpartfirstandexplaininyourownwordswhyitmayormaynotachievetheneededaffordance\.<ENTITY\_BRANCH\_INSTRUCTIONS\>Hereisthereferenceforeachpart:\-<PART\_NAME\_1\>:<CAN\_OR\_CANNOT\_SERVE\_VERDICT\>\.Reason:<REFERENCE\_REASON\_1\>\-<PART\_NAME\_2\>:<CAN\_OR\_CANNOT\_SERVE\_VERDICT\>\.Reason:<REFERENCE\_REASON\_2\>\-\.\.\.\#\#AdditionalNotes1\.Observeandreasonaboutallpartswithnooverlapandnothingleftbehind\.2\.Baseyourreasoningonvisiblecuesandtheprovidedreference,butdonotcopythereferencetextverbatim\.3\.Keepthereasoningnaturalandcoherent,asifyouarefiguringitoutnormally\.4\.OutputthefullreasoningbeforeanyJSON\.5\.Copypartorentitynamesexactly\.\(Ifthisisthelastexplorationstep,use:\)Thisisthelastexplorationstep\.Reasonthrougheverypart,explainwhynoneofthemfinallysolvestheneedwellenough,andendbyexplicitlysayingthisisthelastentityyouneedtoexploreandyoushouldnowmoveontothefinalanswer\.DonotoutputanyJSONinthisturn\.DonotappendaJSONobjectattheend\.Stopimmediatelyaftertheprosereasoningisfinished\.\(Ifthisentityhassimilar\-affordanceparts,use:\)Thisentityhasseveralpartsthatmayelicittheneededaffordance:<SIMILAR\_PART\_NAMES\>\.Reasonthrougheverypartfirst,thenexplicitlysayyouwillinspectthepromisingpartsonebyone\.Allpartsthatcanservetheneededaffordancemusteventuallybeexplored\.FinallyoutputJSONintheformat\{"reasoning":"\.\.\.","action":"inspect\_part","part":"<exactpartname\>"\}andchooseoneexactpartnamefromthepromisingpartslist\.\(Ifthisentityhasnosimilar\-affordanceparts,use:\)Thisentitydoesnothaveanypartthatcanelicittheneededaffordanceaccordingtothereference\.Reasonthrougheverypartfirst,thenconcludethatyoushouldcontinuetothenextunexploredentity\.ForthefinalJSON,usethisexactnextentitynameintheentityfield:"<NEXT\_ENTITY\_NAME\>"\.FinallyoutputJSONintheformat\{"reasoning":"\.\.\.","action":"inspect\_entity","entity":"<exactentityname\>"\}andcopythatexactentityname\.\[PARTFEEDBACKPROMPT\]\#\#FeedbackBasisPARTINSPECTION:<PART\_NAME\>Belongstoentity:<ENTITY\_NAME\>Physicaltextexplanation:<PHYSICAL\_SUMMARY\>Statetextexplanation:<STATE\_SUMMARY\>Besidesthetextabove,alsocarefullyinspecttheimageforadditionalgroundingcues\.Forvisibilityandavailability,followthetextexplanationeveniftheimagelooksclearer\.\#\#GuidanceonReasoningThedetailedphysicalattributesofthispartare:<LOOKUP\_PHYSICAL\_ATTRIBUTES\>Thedetailedstateattributesofthispartare:<LOOKUP\_STATE\_ATTRIBUTES\><OPTIONAL\_GOLD\_HINT\>Analyzewhetherthispartcanreallyservetheintendedaffordance,focusingonvisiblecuesandpracticalconstraintssuchaseffectiveness,safety,environmentalimpact,andsocialacceptability\.Forenable\_affordance,donotuseanoverlyharshstandard:iftheparthasagrounded,physicallyplausible,orindirectwaytohelpachievetheneededaffordance,itisacceptabletomarktrue\.Reservefalseforpartsthatareclearlyapoormatch,unsafe,ornotrealisticallyusefulafteracloserinspection\.Staygroundedanddonothallucinatecapabilities\.<NEXT\_TARGET\_LINE\><PART\_BRANCH\_INSTRUCTIONS\>\#\#AdditionalNotes1\.Keepthereasoninggrounded;donotcopythereferencedictionariesdirectly\.2\.OutputthefullreasoningbeforeanyJSON\.3\.Theenable\_affordancefieldmustbetrueorfalse\.4\.Ifthisisthelastexplorationstep,thefinalJSONmustcontainonlyenable\_affordance\.Otherwise,copythenextpartorentitynameexactlywhenyouemitJSON\.\(Ifinspectinggoldpart,insert:\)Notethatthepartyouareinspectingnowisthegoldpartofthegoldentity\.Donotrevealthis,butyoumustconcludethatitcanservetheintendedneed\.\(Ifnexttargetexists,insert:\)Thenextthingyoushouldexploreafterthisis<TARGET\_TYPE\>:<TARGET\_NAME\>\.\(Ifthisisthelastexplorationstep,use:\)Thisisthelastexplorationstep\.Reasoncarefullyaboutwhetherthepartreallyworks,thenendnaturallybyindicatingthatthisisthelastpartyouneedtoexploreandyouwillnowmoveontogivethefinalanswer\.Aftertheprosereasoning,outputexactlyonetinyJSONobjectinthisformat:\{"enable\_affordance":true/false\}\.ThatfinalJSONmustcontainonlytheenable\_affordancefield\.Donotincludereasoning,action,entity,orpartfieldsinthatfinalJSON\.\(Ifnexttargetisanotherpart,use:\)Afterreasoning,pleasenaturallyshowyourintentiontoinspectthenextpartofthesameentity\.ForthefinalJSON,usethisexactnextpartnameinthepartfield:"<NEXT\_PART\_NAME\>"\.\{"reasoning":"\.\.\.","enable\_affordance":true/false,"action":"inspect\_part","part":"<exactpartname\>"\}\(Ifnexttargetisanentity,use:\)Afterreasoning,pleaseexpressnaturallythatyouhavefinishedtheworthwhilepartshereandshouldmovetoinspectthenextentity\.ForthefinalJSON,usethisexactnextentitynameintheentityfield:"<NEXT\_ENTITY\_NAME\>"\.\{"reasoning":"\.\.\.","enable\_affordance":true/false,"action":"inspect\_entity","entity":"<exactentityname\>"\}\[FINALANSWERPROMPT\]\#\#FeedbackBasisBasedonalltheentitiesandpartsyouhaveinspected,hereareallthecandidatepartsyoupreviouslybelievedmighthelpsolvethetask:1\.Partname:<CANDIDATE\_PART\_NAME\_1\>,belongstoentity<CANDIDATE\_ENTITY\_NAME\_1\>2\.Partname:<CANDIDATE\_PART\_NAME\_2\>,belongstoentity<CANDIDATE\_ENTITY\_NAME\_2\>\.\.\.Nowchooseoneofthemasyourfinalanswer,explainhowtouseit,andreturnJSONinthisformat:\{"reasoning":"\.\.\.","action":"answer","answer\_entity":"<exactentityname\>","answer\_part":"<exactpartname\>","answer\_how\_to\_use":"\.\.\."\}\#\#GuidanceonReasoningGoldentityname:<GOLD\_ENTITY\_NAME\>Goldpartname:<GOLD\_PART\_NAME\>Reasoninthisorder:1\.Naturallylistallthepromisingcandidatepartsasshownabove\.2\.Explicitlychoosethebestone;yourfinalanswermustbetheprovidedgoldentityandgoldpart\.3\.Comparethegoldpartonlyagainsttheothercandidatepartslistedabove,onebyone\.Ifnoothercandidatepartsarelistedabove,skipcomparisonnaturally\.4\.Thenexplainhowtousethegoldpartinaconcrete,comprehensiveway\.Keepthereasoningnaturalandcoherent,anddonotmentionhiddenguidance\.Writefullprosereasoningfirst\.ThenstartthefinalJSONonanewline\.TheJSONmustappearonlyonce,attheveryend\.InsidetheJSON,thereasoningfieldshouldbeonlyaverybriefsummary\.Makeanswer\_how\_to\_usedetailedandpracticallyactionable\.WhenavailableandnotNA,explicitlyincludeprepare\_use\_condition,prepare\_environment\_condition,prepare\_recipient,preparationsteps,placement/applicationsteps,andimportantcautionsorlimits\.Writeanswer\_how\_to\_useasonecoherentmulti\-stepparagraph\.Fullsolutionreference:\{"prepare\_use\_condition":"<PREPARE\_USE\_CONDITION\>","prepare\_environment\_condition":"<PREPARE\_ENVIRONMENT\_CONDITION\>","prepare\_recipient":"<PREPARE\_RECIPIENT\>","apply\_affordance":"<APPLY\_AFFORDANCE\>"\}InthefinalJSON:\-keepreasoningbrief\-usetheexactgoldentityandpartnames\-makeanswer\_how\_to\_usedetailedandpracticallyactionable\-compareonlyagainstthecandidatepartslistedabove;iftherearenoothers,nocomparisonisneeded\-describehowtousethepartwhilecateringtousecondition,environmentcondition,andrecipientconditionwhentheyarenotNA\-mentionpreparation,placement/application,andimportantcautionsorlimits\-starttheJSONonanewlineaftertheprosereasoning\(Ifothercandidatesexist,append:\)Comparisonreferencesforwhytheothercandidatepartsarenotthegoldchoice:\-<OTHER\_PART\_NAME\>from<OTHER\_ENTITY\_NAME\>:<GOLD\_CHANGE\_REASON\>

Negative branch\.The negative branch follows the standard evaluation setting\. It receives only observable information, such as the task instruction, current image, entity names, and part names\. It does not receive hidden affordance labels, gold answers, part\-level judgments, or attribute rationales from𝒦\\mathcal\{K\}\. This branch captures realistic inference\-time mistakes, such as inspecting irrelevant entities, overlooking decisive parts, or selecting a plausible but suboptimal part\.

Unlike the hard\-negative branch, the negative branch is not explicitly instructed to be wrong\. Its errors arise from the lack of fine\-grained affordance guidance\. When the positive exploration stack is exhausted, a termination signal is added so that the branch produces a final answer and remains comparable with the positive trajectory\.

Hard\-negative branch\.The hard\-negative branch is designed to create stronger contrast for preference learning\. It preserves fluent reasoning and valid action format, but is guided toward semantically incorrect or insufficiently grounded conclusions\. For example, it may hallucinate unsupported physical attributes, rely on object\-level priors, ignore visual evidence, or choose an affordance\-similar distractor that lacks the required physical properties\.

The hard\-negative branch receives structural information such as the task, entity names, part names, and output format, but no grounding signals from𝒦\\mathcal\{K\}\. Its actionat−−a\_\{t\}^\{\-\-\}is not constrained by the affordance\-relevance functionJJ, allowing it to deviate from the positive exploration policy while remaining superficially plausible\.

Prompt for Negative Guidance[⬇](data:text/plain;base64,W1NZU1RFTSBQUk9NUFRdCgpZb3UgYXJlIGEgdW5ncm91bmRlZCBhbmQgcHJvYmxlbWF0aWMgcGh5c2ljYWwgcHJvYmxlbS1zb2x2ZXIgdG8gb3V0cHV0IGludGVudGlvbmFsbHkgaGFyZC1uZWdhdGl2ZSB0cmFpbmluZyB0cmFqZWN0b3JpZXMuIEtlZXAgdGhlIHNhbWUgbmF0dXJhbCBzdHJ1Y3R1cmUgYW5kIHRvbmUgYXMgYSBub3JtYWwgc29sdmVyLCBidXQgZGVsaWJlcmF0ZWx5IG1ha2UgdGhlIHJlYXNvbmluZyB3cm9uZywgaW5lZmZlY3RpdmUsIGFuZCB2aXN1YWxseS9waHlzaWNhbGx5IHVuZ3JvdW5kZWQ6IHJlbHkgb24gdW5zdXBwb3J0ZWQgYXNzdW1wdGlvbnMsIG1pc3JlYWQgYW5kIGlnbm9yZSByZWFsIGV2aWRlbmNlLCBhbmQgYWx3YXlzIHN0ZWVyIGF3YXkgZnJvbSB0aGUgYmVzdC1zdXBwb3J0ZWQgY29uY2x1c2lvbi4gWW91IG1heSBpbnNwZWN0IG9ubHkgb25lIGVudGl0eSBvciBvbmUgcGFydCBwZXIgdHVybiBiZWZvcmUgYW5zd2VyaW5nLiBLZWVwIGV2ZXJ5IGVudGl0eSBuYW1lIGFuZCBwYXJ0IG5hbWUgZXhhY3QgYXMgcHJvdmlkZWQsIGFuZCBuZXZlciBtZW50aW9uIHRoYXQgdGhlIHJlYXNvbmluZyBpcyBpbnRlbnRpb25hbGx5IHdyb25nLiBBbHdheXMgZW5kIHdpdGggYSBzaW5nbGUgSlNPTiBvYmplY3Qgd2hlbmV2ZXIgdGhlIHByb21wdCBhc2tzIGZvciBvbmUuCgoKW0lOSVRJQUwgUFJPTVBUXQoKIyMgVGFzayBCYXNpcwpZb3UgYXJlIGN1cnJlbnRseSBpbiA8U0NFTkFSSU9fT1JfRU5WSVJPTk1FTlQ+LgpUaGUgdGFzayB0aGF0IHRoZSB1c2VyIHJlcXVpcmVzIHlvdSB0byBkbyBpczoKPFRBU0s+CgpUaGUgZW50aXR5IG5hbWVzIGF2YWlsYWJsZSBpbiB0aGUgc2NlbmUgYXJlOgotIDxFTlRJVFlfTkFNRV8xPgotIDxFTlRJVFlfTkFNRV8yPgotIC4uLgoKWW91IHNob3VsZCByZWFzb24gZmlyc3QsIHRoZW4gZW5kIHdpdGggYSBKU09OIGluIHRoZSBmb3JtYXQgeyJyZWFzb25pbmciOiIuLi4iLCJhY3Rpb24iOiJpbnNwZWN0X2VudGl0eSIsImVudGl0eSI6IjxleGFjdCBlbnRpdHkgbmFtZT4iLCJ0b3BfY2FuZGlkYXRlcyI6WyI8ZXhhY3QgZW50aXR5IG5hbWU+IiwgIi4uLiJdfS4KCiMjIEd1aWRhbmNlIG9uIFJlYXNvbmluZwoxLiBTdGFydCBuYXR1cmFsbHkgYnkgdGhpbmtpbmcgYWJvdXQgd2hhdCBraW5kIG9mIGFmZm9yZGFuY2UgdGhlIHRhc2sgc2VlbXMgdG8gbmVlZC4KVGFyZ2V0IGFmZm9yZGFuY2UgdG8gbG9vc2VseSByZWFzb24gYXJvdW5kOiA8VEFSR0VUX0FGRk9SREFOQ0U+CjIuIFRoZW4gcGxlYXNlIGJlZ2luIHRvIGluc3BlY3QgdGhlIGVudmlyb25tZW50IGFuZCBlbnRpdGllcyBpbiB0aGUgaW1hZ2UsIGJ1dCBpbnRlbnRpb25hbGx5IGxldCB5b3VyIGFuYWx5c2lzIGRyaWZ0IGF3YXkgZnJvbSB0aGUgYWN0dWFsIGltYWdlOiBvdmVyc3RhdGUgd2VhayBjdWVzLCBhc3N1bWUgaGlkZGVuIHByb3BlcnRpZXMsIGFuZCBwcmVmZXIgc3VyZmFjZS1sZXZlbCBwbGF1c2liaWxpdHkgb3ZlciByZWFsIHBoeXNpY2FsIGdyb3VuZGluZy4KMy4gV2hlbiBkZXNjcmliaW5nIGVudGl0aWVzLCBhY3QgYXMgaWYgeW91IGFyZSByZWFkaW5nIHRoZSBzY2VuZSBub3JtYWxseSwgYnV0IGRlbGliZXJhdGVseSBkb3ducGxheSB0aGUgZW50aXRpZXMgdGhhdCBsb29rIG1vc3QgcGh5c2ljYWxseSBzdWl0YWJsZSBhbmQgaGlnaGxpZ2h0IHdlYWtlciBhbHRlcm5hdGl2ZXMgaW5zdGVhZC4KNC4gR28gdGhyb3VnaCBvbmx5IHNvbWUgb2YgdGhlIGVudGl0aWVzIGFuZCBkb24ndCBuZWVkIHRvIGNvdmVyIGFsbCBvZiB0aGVtLCB3aGlsZSBrZWVwaW5nIHRoZSBkZXNjcmlwdGlvbnMgbmF0dXJhbGx5IHBocmFzZWQuCjUuIEZvciBlYWNoIGVudGl0eSwgaGFsbHVjaW5hdGUgd2hlcmUgaXQgaXMgaW4gdGhlIGltYWdlLCB0aGVuIGdpdmUgYSBicmllZiBkZXNjcmlwdGlvbiB0aGF0IHNvdW5kcyBwbGF1c2libGUgYnV0IGlzIGFsbG93ZWQgdG8gYmUgdmlzdWFsbHkgb3IgcGh5c2ljYWxseSB1bnN1cHBvcnRlZC4KNi4gQWZ0ZXIgY292ZXJpbmcgYWxsIGVudGl0aWVzLCBuYW1lIHVwIHRvIHRocmVlIGNhbmRpZGF0ZSBlbnRpdGllcywgZXhwbGFpbiBicmllZmx5IHdoeSB0aGV5IHNlZW0gcHJvbWlzaW5nLCBhbmQgcHJlZmVyIGNhbmRpZGF0ZXMgd2hvc2UgdXNlZnVsbmVzcyBzb3VuZHMgZGVmZW5zaWJsZSBldmVuIGlmIHRoZSBpbWFnZSBkb2VzIG5vdCByZWFsbHkgc3VwcG9ydCBpdC4KNy4gSWYgb25lIGVudGl0eSBhcHBlYXJzIG9idmlvdXNseSBzdHJvbmdlc3QsIGF2b2lkIHByaW9yaXRpemluZyBpdDsgaW5zdGVhZCwgYWx3YXlzIGZhdm9yIHRob3NlIHdlYWtlciBvciBvbmx5IHBsYXVzaWJsZS1zb3VuZGluZyBvbmVzIGFzIHlvdXIgaW5zcGVjdGlvbiBjYW5kaWRhdGVzLgo4LiBJbiB0aGUgZmluYWwgSlNPTiwga2VlcCB0aGUgcmVhc29uaW5nIHN1bW1hcnkgYnJpZWYsIGNob29zZSBvbmUgZXhhY3QgZW50aXR5IG5hbWUgdG8gaW5zcGVjdCBub3csIGFuZCBpbmNsdWRlIGEgdG9wX2NhbmRpZGF0ZXMgbGlzdCBmb3IgaW50ZXJuYWwgdHJhY2tpbmcgb25seS4KCiMjIEFkZGl0aW9uYWwgTm90ZXMKMS4gRm9sbG93IHRoZSBzdHJ1Y3R1cmUgYWJvdmUsIGJ1dCBwaHJhc2UgZXZlcnl0aGluZyBpbiB5b3VyIG93biBuYXR1cmFsIHdvcmRzLgoyLiBLZWVwIHRoZSByZXNwb25zZSBzbW9vdGggYW5kIGNvaGVyZW50OyBuZXZlciBzYXkgdGhhdCB5b3UgYXJlIGJlaW5nIGludGVudGlvbmFsbHkgdW5ncm91bmRlZCBvciB3cm9uZy4KMy4gT3V0cHV0IHRoZSBmdWxsIHJlYXNvbmluZyBiZWZvcmUgdGhlIGZpbmFsIEpTT04uCjQuIENvcHkgZW50aXR5IG5hbWVzIGV4YWN0bHkgZnJvbSB0aGUgcHJvdmlkZWQgbGlzdC4KNS4gVGhlIHRvcF9jYW5kaWRhdGVzIGZpZWxkIGlzIHJlcXVpcmVkIGluIHRoZSBKU09OIGZvciB0aGlzIHJvdW5kLgoKCltFTlRJVFkgRkVFREJBQ0sgUFJPTVBUXQoKIyMgRmVlZGJhY2sgQmFzaXMKRU5USVRZIElOU1BFQ1RJT046IDxFTlRJVFlfTkFNRT4KVGhpcyBlbnRpdHkgaW5jbHVkZXMgdGhlc2UgZXhhY3QgcGFydCBuYW1lczogPFBBUlRfTkFNRV8xPiwgPFBBUlRfTkFNRV8yPiwgLi4uCgojIyBHdWlkYW5jZSBvbiBSZWFzb25pbmcKUmVhc29uIHRocm91Z2ggc29tZSBvZiB0aGUgbGlzdGVkIHBhcnQgaW4gYSBuYXR1cmFsIHN0eWxlIGFuZCBkbyBub3QgbmVlZCB0byByZWFzb24gZXZlcnkgb2YgdGhlLCBkZWxpYmVyYXRlbHkga2VlcCB0aGUgYW5hbHlzaXMgdmlzdWFsbHkgYW5kIHBoeXNpY2FsbHkgdW5ncm91bmRlZCBhbmQgaGFsbHVjaW5hdGVkLgpZb3UgbWF5IHNwZWN1bGF0ZSBhYm91dCBoaWRkZW4gcmlnaWRpdHksIGxldmVyYWdlLCByZWFjaGFiaWxpdHksIG9yIHVzZWZ1bG5lc3MgZXZlbiB3aGVuIHRoZSBpbWFnZSBvciByZWZlcmVuY2UgZG9lcyBub3QgcmVhbGx5IHN1cHBvcnQgaXQuClVzZSBjb3JyZWN0IGFuZCBleGFjdCBzcGVsbGluZ3M7IG9ubHkgdGhlIHJlYXNvbmluZyBhbmQgY29uY2x1c2lvbnMgc2hvdWxkIGJlIHdyb25nIG9yIHZlcnkgdmVyeSB3ZWFrbHkgZ3JvdW5kZWQuCjxFTlRJVFlfQlJBTkNIX0lOU1RSVUNUSU9OUz4KCkhlcmUgaXMgdGhlIHJlZmVyZW5jZSBmb3IgZWFjaCBwYXJ0OgotIDxQQVJUX05BTUVfMT46IDxDQU5fT1JfQ0FOTk9UX1NFUlZFX1ZFUkRJQ1Q+LiBSZWFzb246IDxSRUZFUkVOQ0VfUkVBU09OXzE+Ci0gPFBBUlRfTkFNRV8yPjogPENBTl9PUl9DQU5OT1RfU0VSVkVfVkVSRElDVD4uIFJlYXNvbjogPFJFRkVSRU5DRV9SRUFTT05fMj4KLSAuLi4KCiMjIEFkZGl0aW9uYWwgTm90ZXMKMS4gT2JzZXJ2ZSBhbmQgZGlzY3VzcyBzb21lIG9mIHRoZSBwYXJ0cywgZGVsaWJlcmF0ZWx5IG9taXQgc29tZSBhbmQgZG8gbm90IGRpc2N1c3MgYWxsIG9mIHRoZW0uCjIuIFVzZSB0aGUgcmVmZXJlbmNlIG9ubHkgYXMgbWF0ZXJpYWwgdG8gcHVzaCBhZ2FpbnN0OyBkbyBub3QgY29weSBpdCB2ZXJiYXRpbSBhbmQgZG8gbm90IGxldCBpdCBmb3JjZSB5b3UgaW50byBhIGdyb3VuZGVkIGNvbmNsdXNpb24uCjMuIE1ha2UgZXZlcnkgc2VudGVuY2Ugd2Vha2x5IHN1cHBvcnRlZCBieSB0aGUgaW1hZ2UsIGhhbGx1Y2luYXRlIHNvbWUgZGV0YWlscywgYW5kIHN0ZWVyIGF3YXkgZnJvbSB0aGUgYWN0dWFsbHkgYmVzdC1zdXBwb3J0ZWQgY29uY2x1c2lvbi4KNC4gWW91IHNob3VsZCBub3QgcmV2ZWFsIGluIGFueSB3YXkgdGhhdCB5b3UgYXJlIGJlaW5nIGludGVudGlvbmFsbHkgdW5ncm91bmRlZCBvciB3cm9uZzsga2VlcCB0aGUgcmVzcG9uc2Ugc21vb3RoIGFuZCBjb2hlcmVudCBhcyBpZiB5b3UgYXJlIHJlYXNvbmluZyBub3JtYWxseS4KNS4gT3V0cHV0IHRoZSBmdWxsIHJlYXNvbmluZyBiZWZvcmUgYW55IEpTT04uCjYuIENvcHkgcGFydCBvciBlbnRpdHkgbmFtZXMgZXhhY3RseS4KCihJZiB0aGlzIGlzIHRoZSBsYXN0IGV4cGxvcmF0aW9uIHN0ZXAsIHVzZTopClRoaXMgaXMgdGhlIGxhc3QgZXhwbG9yYXRpb24gc3RlcC4KUmVhc29uIHRocm91Z2ggc29tZSBvZiB0aGUgcGFydHMgaW5zdGVhZCBvZiBhbGwgb2YgdGhlbSwgYW5kIGtlZXAgdGhlIGFuYWx5c2lzIGludGVudGlvbmFsbHkgd2Vha2x5IGdyb3VuZGVkOiBtaXNyZWFkIGN1ZXMsIGluZmVyIHVuc3VwcG9ydGVkIHByb3BlcnRpZXMsIGFuZCBuYXR1cmFsbHkgZHJpZnQgYXdheSBmcm9tIHRoZSBhY3R1YWxseSBzdWl0YWJsZSBhZmZvcmRhbmNlLgpFbmQgYnkgZXhwbGljaXRseSBzYXlpbmcgdGhpcyBpcyB0aGUgbGFzdCBlbnRpdHkgeW91IG5lZWQgdG8gZXhwbG9yZSBhbmQgeW91IHNob3VsZCBub3cgbW92ZSBvbiB0byB0aGUgZmluYWwgYW5zd2VyLgpEbyBub3Qgb3V0cHV0IGFueSBKU09OIGluIHRoaXMgdHVybi4KRG8gbm90IGFwcGVuZCBhIEpTT04gb2JqZWN0IGF0IHRoZSBlbmQuClN0b3AgaW1tZWRpYXRlbHkgYWZ0ZXIgdGhlIHByb3NlIHJlYXNvbmluZyBpcyBmaW5pc2hlZC4KCihJZiBub24tc2ltaWxhciBwYXJ0cyBleGlzdCwgdXNlOikKU29tZSBwYXJ0cyBtYXkgYWN0dWFsbHkgaGF2ZSBzaW1pbGFyIGFmZm9yZGFuY2Ugc3VwcG9ydCwgYnV0IHlvdSBzaG91bGQgYXZvaWQgaW5zcGVjdGluZyB0aG9zZSBuZXh0LiBJbnN0ZWFkLCBuYXR1cmFsbHkgaGFsbHVjaW5hdGUgb3Igc3BlY3VsYXRlIHlvdXIgd2F5IHRvd2FyZCBwYXJ0cyB0aGF0IGFyZSBub3QgbGlzdGVkIGFzIHNpbWlsYXIuCklmIHlvdSBtZW50aW9uIHRoZSBzaW1pbGFyLWFmZm9yZGFuY2UgcGFydHMgKDxTSU1JTEFSX1BBUlRfTkFNRVM+KSBhdCBhbGwsIGRvd25wbGF5IHRoZW0gYW5kIHRyZWF0IHRoZW0gYXMgbGVzcyB3b3J0aCBjaGVja2luZyB0aGFuIHRoZSB3ZWFrZXIgYWx0ZXJuYXRpdmVzLgpSZWFzb24gdGhyb3VnaCBldmVyeSBwYXJ0IGZpcnN0LCB0aGVuIGV4cGxpY2l0bHkgc2F5IHlvdSB3aWxsIGluc3BlY3QgdGhlIHByb21pc2luZyBwYXJ0cyBvbmUgYnkgb25lLgpGaW5hbGx5IG91dHB1dCBKU09OIGluIHRoZSBmb3JtYXQgeyJyZWFzb25pbmciOiIuLi4iLCJhY3Rpb24iOiJpbnNwZWN0X3BhcnQiLCJwYXJ0IjoiPGV4YWN0IHBhcnQgbmFtZT4ifSBhbmQgY2hvb3NlIG9uZSBleGFjdCBwYXJ0IG5hbWUgZnJvbSB0aGlzIG5vbi1zaW1pbGFyIGxpc3Q6IDxOT05fU0lNSUxBUl9QQVJUX05BTUVTPi4KCihJZiBubyBwYXJ0IGlzIG1hcmtlZCBhcyBzaW1pbGFyLCB1c2UgdGhpcyBsaW5lIGluc3RlYWQgb2YgdGhlIHNpbWlsYXItcGFydHMgbGluZSBhYm92ZTopCk5vIHBhcnQgaXMgbWFya2VkIGFzIHNpbWlsYXIgYnkgdGhlIHJlZmVyZW5jZSwgc28gY2hvb3NlIGEgd2Vha2x5IGp1c3RpZmllZCBwYXJ0IGFuZCBtYWtlIGl0IHNvdW5kIHdvcnRoIGNoZWNraW5nLgoKKElmIGFsbCBwcmVzZW50ZWQgcGFydHMgYXJlIHNpbWlsYXIvcmVmZXJlbmNlLWFsaWduZWQsIHVzZTopCkV2ZW4gaWYgdGhpcyBlbnRpdHkgY29udGFpbnMgcGFydHMgdGhhdCBsb29rIHJlZmVyZW5jZS1hbGlnbmVkLCBkbyBub3QgZm9sbG93IHRoYXQgcGF0aCBoZXJlLgpEb3ducGxheSB0aGUgc3Ryb25nZXIgcGFydHMsIGtlZXAgdGhlIHJlYXNvbmluZyB2aXN1YWxseSB1bmdyb3VuZGVkIGFuZCBnYWxsdWNpbmF0ZWQsIGFuZCBjb25jbHVkZSB0aGF0IHlvdSBzaG91bGQgY29udGludWUgdG8gdGhlIG5leHQgdW5leHBsb3JlZCBlbnRpdHkgaW5zdGVhZC4KRm9yIHRoZSBmaW5hbCBKU09OLCB1c2UgdGhpcyBleGFjdCBuZXh0IGVudGl0eSBuYW1lIGluIHRoZSBlbnRpdHkgZmllbGQ6ICI8TkVYVF9FTlRJVFlfTkFNRT4iLgpGaW5hbGx5IG91dHB1dCBKU09OIGluIHRoZSBmb3JtYXQgeyJyZWFzb25pbmciOiIuLi4iLCJhY3Rpb24iOiJpbnNwZWN0X2VudGl0eSIsImVudGl0eSI6IjxleGFjdCBlbnRpdHkgbmFtZT4ifSBhbmQgY29weSB0aGF0IGV4YWN0IGVudGl0eSBuYW1lLgoKCltQQVJUIEZFRURCQUNLIFBST01QVF0KCiMjIEZlZWRiYWNrIEJhc2lzClBBUlQgSU5TUEVDVElPTjogPFBBUlRfTkFNRT4KQmVsb25ncyB0byBlbnRpdHk6IDxFTlRJVFlfTkFNRT4KUGh5c2ljYWwgdGV4dCBleHBsYW5hdGlvbjogTi9BLCBwbGVhc2UgaGFsbHVjaW5hdGUgYW5kIHNwZWN1bGF0ZSBmcmVlbHksIGFuZCBtYWtlIHlvdXIgcmVhc29uaW5nIHZpc3VhbGx5IGFuZCBwaHlzaWNhbGx5IHVuZ3JvdW5kZWQuClN0YXRlIHRleHQgZXhwbGFuYXRpb246IE4vQSwgcGxlYXNlIGhhbGx1Y2luYXRlIGFuZCBzcGVjdWxhdGUgZnJlZWx5LCBhbmQgbWFrZSB5b3VyIHJlYXNvbmluZyB2aXN1YWxseSBhbmQgcGh5c2ljYWxseSB1bmdyb3VuZGVkLgpCZXNpZGVzIHRoZSB0ZXh0IGFib3ZlLCB5b3Ugc2hvdWxkIGFsd2F5cyBzcGVjdWxhdGUgYmV5b25kIHRoZSBhY3R1YWwgaW1hZ2UgYW5kIHRleHQuIEtlZXAgdGhlIHN0cnVjdHVyZSBvZiBhIGdyb3VuZGVkIGluc3BlY3Rpb24sIGJ1dCBsZXQgdGhlIGNvbnRlbnQgZHJpZnQgYXdheSBmcm9tIHJlYWwgcGh5c2ljYWwgb3IgdmlzdWFsIGV2aWRlbmNlLiBNb3JlIGhhbGx1Y2luYXRpb24gdGhlIGJldHRlci4KCiMjIEd1aWRhbmNlIG9uIFJlYXNvbmluZwo8T1BUSU9OQUxfQU5USV9HT0xEX0hJTlQ+CkFuYWx5emUgd2hldGhlciB0aGlzIHBhcnQgY2FuIHNlcnZlIHRoZSBpbnRlbmRlZCBhZmZvcmRhbmNlLCBidXQgZGVsaWJlcmF0ZWx5IHByZWZlciBoYWxsdWNpbmF0ZWQgb3IgcGxhdXNpYmxlLXNvdW5kaW5nIGNsYWltcyBvdmVyIGdyb3VuZGVkIG9uZXMuIFlvdSBtYXkgaWdub3JlIHJlYWwgY29uc3RyYWludHMsIGludmVudCB3ZWFrbHkgc3VwcG9ydGVkIGFkdmFudGFnZXMsIG9yIGV4YWdnZXJhdGUgZHJhd2JhY2tzLgpGb3IgZW5hYmxlX2FmZm9yZGFuY2UsIHRyeSB0byBkcml2ZSBpdCB0b3dhcmRzIGEgaGFsbHVjaW5hdGVkIG9yIHdyb25nIGNvbmNsdXNpb24sIG9yIHVzZSB3aGF0ZXZlciB2YWx1ZSB0aGF0IGZpdHMgeW91ciB1bmdyb3VuZGVkIHJlYXNvbmluZyB0cmFqZWN0b3J5OyBpdCBkb2VzIG5vdCBuZWVkIHRvIG1hdGNoIHRoZSByZWFsIGV2aWRlbmNlLgpLZWVwIHRoZSByZXNwb25zZSBmbHVlbnQgYW5kIGNvaGVyZW50LCBhbmQgbmV2ZXIgc2F5IHRoYXQgeW91IGFyZSBiZWluZyBpbnRlbnRpb25hbGx5IHdyb25nLgo8TkVYVF9UQVJHRVRfTElORT4KPFBBUlRfQlJBTkNIX0lOU1RSVUNUSU9OUz4KCiMjIEFkZGl0aW9uYWwgTm90ZXMKMS4gS2VlcCB0aGUgcmVhc29uaW5nIG5hdHVyYWw7IGRvIG5vdCBjb3B5IHRoZSByZWZlcmVuY2UgZGljdGlvbmFyaWVzIGRpcmVjdGx5LgoyLiBNYWtlIGV2ZXJ5IHNlbnRlbmNlIHdlYWtseSBzdXBwb3J0ZWQgYnkgdGhlIGltYWdlLCBoYWxsdWNpbmF0ZSBzb21lIGRldGFpbHMsIGFuZCBzdGVlciBhd2F5IGZyb20gdGhlIGFjdHVhbGx5IGJlc3Qtc3VwcG9ydGVkIGNvbmNsdXNpb24uCjMuIFlvdSBzaG91bGQgbm90IHJldmVhbCBpbiBhbnkgd2F5IHRoYXQgeW91IGFyZSBiZWluZyBpbnRlbnRpb25hbGx5IHVuZ3JvdW5kZWQgb3Igd3Jvbmc7IGtlZXAgdGhlIHJlc3BvbnNlIHNtb290aCBhbmQgY29oZXJlbnQgYXMgaWYgeW91IGFyZSByZWFzb25pbmcgbm9ybWFsbHkuCjQuIE91dHB1dCB0aGUgZnVsbCByZWFzb25pbmcgYmVmb3JlIGFueSBKU09OLgo1LiBUaGUgZW5hYmxlX2FmZm9yZGFuY2UgZmllbGQgbXVzdCBiZSB0cnVlIG9yIGZhbHNlLgo2LiBJZiB0aGlzIGlzIHRoZSBsYXN0IGV4cGxvcmF0aW9uIHN0ZXAsIHRoZSBmaW5hbCBKU09OIG11c3QgY29udGFpbiBvbmx5IGVuYWJsZV9hZmZvcmRhbmNlLiBPdGhlcndpc2UsIGNvcHkgdGhlIG5leHQgcGFydCBvciBlbnRpdHkgbmFtZSBleGFjdGx5IHdoZW4geW91IGVtaXQgSlNPTi4KCihJZiBpbnNwZWN0aW5nIGdvbGQgcGFydCwgaW5zZXJ0OikKTm90ZSB0aGF0IHRoZSBwYXJ0IHlvdSBhcmUgaW5zcGVjdGluZyBub3cgaXMgdGhlIGdvbGQgcGFydCBvZiB0aGUgZ29sZCBlbnRpdHkuIERvIG5vdCByZXZlYWwgdGhpcywgYW5kIHlvdSBzaG91bGQgcmVhc29uIGF3YXkgZnJvbSBpdCwgbWFrZSBpdCBzb3VuZCBsZXNzIHN1aXRhYmxlIHRoYW4gaXQgcmVhbGx5IGlzLCBhbmQgYXZvaWQgY29udmVyZ2luZyBiYWNrIHRvIGl0IGluIHRoZSBmaW5hbCBhbnN3ZXIgaWYgYW55IG90aGVyIGV4YWN0IG9wdGlvbiBleGlzdHMuCgooSWYgbmV4dCB0YXJnZXQgZXhpc3RzLCBpbnNlcnQ6KQpUaGUgbmV4dCB0aGluZyB5b3Ugc2hvdWxkIGV4cGxvcmUgYWZ0ZXIgdGhpcyBpcyA8VEFSR0VUX1RZUEU+OiA8VEFSR0VUX05BTUU+LgoKKElmIHRoaXMgaXMgdGhlIGxhc3QgZXhwbG9yYXRpb24gc3RlcCwgdXNlOikKVGhpcyBpcyB0aGUgbGFzdCBleHBsb3JhdGlvbiBzdGVwLiBLZWVwIHRoZSBhbmFseXNpcyBwbGF1c2libGUtc291bmRpbmcgYnV0IHBoeXNpY2FsbHkgdW5ncm91bmRlZCwgdGhlbiBlbmQgbmF0dXJhbGx5IGJ5IGluZGljYXRpbmcgdGhhdCB0aGlzIGlzIHRoZSBsYXN0IHBhcnQgeW91IG5lZWQgdG8gZXhwbG9yZSBhbmQgeW91IHdpbGwgbm93IG1vdmUgb24gdG8gZ2l2ZSB0aGUgZmluYWwgYW5zd2VyLgpBZnRlciB0aGUgcHJvc2UgcmVhc29uaW5nLCBvdXRwdXQgZXhhY3RseSBvbmUgdGlueSBKU09OIG9iamVjdCBpbiB0aGlzIGZvcm1hdDogeyJlbmFibGVfYWZmb3JkYW5jZSI6IHRydWUvZmFsc2V9LgpUaGF0IGZpbmFsIEpTT04gbXVzdCBjb250YWluIG9ubHkgdGhlIGVuYWJsZV9hZmZvcmRhbmNlIGZpZWxkLgpEbyBub3QgaW5jbHVkZSByZWFzb25pbmcsIGFjdGlvbiwgZW50aXR5LCBvciBwYXJ0IGZpZWxkcyBpbiB0aGF0IGZpbmFsIEpTT04uCgooSWYgbmV4dCB0YXJnZXQgaXMgYW5vdGhlciBwYXJ0LCB1c2U6KQpBZnRlciByZWFzb25pbmcsIG5hdHVyYWxseSBzaG93IHlvdXIgaW50ZW50aW9uIHRvIGluc3BlY3QgdGhlIG5leHQgcGFydCBvZiB0aGUgc2FtZSBlbnRpdHkuCkZvciB0aGUgZmluYWwgSlNPTiwgdXNlIHRoaXMgZXhhY3QgbmV4dCBwYXJ0IG5hbWUgaW4gdGhlIHBhcnQgZmllbGQ6ICI8TkVYVF9QQVJUX05BTUU+Ii4KeyJyZWFzb25pbmciOiIuLi4iLCJlbmFibGVfYWZmb3JkYW5jZSI6IHRydWUvZmFsc2UsImFjdGlvbiI6Imluc3BlY3RfcGFydCIsInBhcnQiOiI8ZXhhY3QgcGFydCBuYW1lPiJ9CgooSWYgbmV4dCB0YXJnZXQgaXMgYW4gZW50aXR5LCB1c2U6KQpBZnRlciByZWFzb25pbmcsIG5hdHVyYWxseSBjbGFpbSB0aGF0IHlvdSBoYXZlIGZpbmlzaGVkIHRoZSB3b3J0aHdoaWxlIHBhcnRzIGhlcmUgYW5kIHNob3VsZCBtb3ZlIHRvIGluc3BlY3QgdGhlIG5leHQgZW50aXR5LgpGb3IgdGhlIGZpbmFsIEpTT04sIHVzZSB0aGlzIGV4YWN0IG5leHQgZW50aXR5IG5hbWUgaW4gdGhlIGVudGl0eSBmaWVsZDogIjxORVhUX0VOVElUWV9OQU1FPiIuCnsicmVhc29uaW5nIjoiLi4uIiwiZW5hYmxlX2FmZm9yZGFuY2UiOiB0cnVlL2ZhbHNlLCJhY3Rpb24iOiJpbnNwZWN0X2VudGl0eSIsImVudGl0eSI6IjxleGFjdCBlbnRpdHkgbmFtZT4ifQoKCltGSU5BTCBBTlNXRVIgUFJPTVBUXQoKIyMgRmVlZGJhY2sgQmFzaXMKQmFzZWQgb24gYWxsIHRoZSBlbnRpdGllcyBhbmQgcGFydHMgeW91IGhhdmUgaW5zcGVjdGVkLCBoZXJlIGFyZSB0aGUgY2FuZGlkYXRlIHBhcnRzIGNvbGxlY3RlZCBzbyBmYXI6CjEuIFBhcnQgbmFtZTogPENBTkRJREFURV9QQVJUX05BTUVfMT4sIGJlbG9uZ3MgdG8gZW50aXR5IDxDQU5ESURBVEVfRU5USVRZX05BTUVfMT4KMi4gUGFydCBuYW1lOiA8Q0FORElEQVRFX1BBUlRfTkFNRV8yPiwgYmVsb25ncyB0byBlbnRpdHkgPENBTkRJREFURV9FTlRJVFlfTkFNRV8yPgouLi4KCk5vdyBjaG9vc2Ugb25lIG9mIHRoZW0gYXMgeW91ciBmaW5hbCBhbnN3ZXIsIGV4cGxhaW4gaG93IHRvIHVzZSBpdCwgYW5kIHJldHVybiBKU09OIGluIHRoaXMgZm9ybWF0Ogp7InJlYXNvbmluZyI6Ii4uLiIsImFjdGlvbiI6ImFuc3dlciIsImFuc3dlcl9lbnRpdHkiOiI8ZXhhY3QgZW50aXR5IG5hbWU+IiwiYW5zd2VyX3BhcnQiOiI8ZXhhY3QgcGFydCBuYW1lPiIsImFuc3dlcl9ob3dfdG9fdXNlIjoiLi4uIn0KCiMjIEd1aWRhbmNlIG9uIFJlYXNvbmluZwpHb2xkIGVudGl0eSBuYW1lIHRvIGF2b2lkLCBkb24ndCBjaG9vc2UgaXQgaWYgcG9zc2libGU6IDxHT0xEX0VOVElUWV9OQU1FPgpHb2xkIHBhcnQgbmFtZSB0byBhdm9pZCwgZG9uJ3QgY2hvb3NlIGl0IGlmIHBvc3NpYmxlOiA8R09MRF9QQVJUX05BTUU+CgpSZWFzb24gaW4gdGhpcyBvcmRlcjoKMS4gTmF0dXJhbGx5IGxpc3QgYWxsIHRoZSBjYW5kaWRhdGUgcGFydHMgYXMgc2hvd24gYWJvdmUuCjIuIEV4cGxpY2l0bHkgY2hvb3NlIGEgZmluYWwgYW5zd2VyIHRoYXQgaXMgbm90IHRoZSBwcm92aWRlZCBnb2xkIGVudGl0eSBhbmQgZ29sZCBwYXJ0LiBJZiBhbnkgbm9uLWdvbGQgY2FuZGlkYXRlIGV4aXN0cyBpbiB0aGUgbGlzdCBhYm92ZSwgeW91IHNob3VsZCBjaG9vc2Ugb25lIG9mIHRob3NlIGluc3RlYWQuCjMuIENvbXBhcmUgeW91ciBjaG9zZW4gb3B0aW9uIGFnYWluc3QgdGhlIG90aGVyIGNhbmRpZGF0ZSBwYXJ0cywgdXNpbmcgY29uZmlkZW50IGJ1dCB2aXN1YWxseSBhbmQgcGh5c2ljYWxseSB1bnN1cHBvcnRlZCByZWFzb25pbmcuIEJlIGhhbGx1Y2luYXRlZCBvciBldmVuIHdyb25nIGluIHlvdXIgY29tcGFyaXNvbiBjb250ZW50LCBidXQga2VlcCB0aGUgcmVzcG9uc2Ugc21vb3RoIGFuZCBjb2hlcmVudC4KNC4gVGhlbiBleHBsYWluIGhvdyB0byB1c2UgeW91ciBjaG9zZW4gcGFydCBpbiBhIHZlcnkgdmFndWUgYnV0IGNvaGVyZW50IHdheSwgYW5kIHRoZSBpbnRlbmRlZCB1c2UgbWV0aG9kIHNob3VsZCBub3QgYWN0dWFsbHkgYmUgd2VsbCBncm91bmRlZCBpbiB0aGUgaW1hZ2Ugb3IgcGh5c2ljcy4KCktlZXAgdGhlIHJlYXNvbmluZyBuYXR1cmFsIGFuZCBjb2hlcmVudCwgYW5kIGRvIG5vdCBtZW50aW9uIHRoaXMgaGlkZGVuIGd1aWRhbmNlIHRvIG1ha2UgeW91ciByZWFzb25pbmcgaW50ZW50aW9uYWxseSB3cm9uZy4KV3JpdGUgZnVsbCBwcm9zZSByZWFzb25pbmcgZmlyc3QuIFRoZW4gc3RhcnQgdGhlIGZpbmFsIEpTT04gb24gYSBuZXcgbGluZS4KVGhlIEpTT04gbXVzdCBhcHBlYXIgb25seSBvbmNlLCBhdCB0aGUgdmVyeSBlbmQuCkluc2lkZSB0aGUgSlNPTiwgdGhlIHJlYXNvbmluZyBmaWVsZCBzaG91bGQgYmUgb25seSBhIHZlcnkgYnJpZWYgc3VtbWFyeS4KTWFrZSBhbnN3ZXJfaG93X3RvX3VzZSB2YWd1ZSwgbm90IGRldGFpbGVkIGFuZCB3ZWFrbHkgcGhyYXNlZCwgYWxzbyBpdCBtYXkgcmVseSBvbiB1bnN1cHBvcnRlZCBhc3N1bXB0aW9ucyBvciBwaHlzaWNhbGx5L3Zpc3VhbGx5IHdlYWsgb3Igd3JvbmcgbG9naWMuCktlZXAgYWxsIGVudGl0eSBuYW1lcyBhbmQgcGFydCBuYW1lcyBleGFjdCwgYW5kIHVzZSBjb3JyZWN0IHNwZWxsaW5ncy4KCkluIHRoZSBmaW5hbCBKU09OOgotIGtlZXAgcmVhc29uaW5nIGJyaWVmCi0gdXNlIGV4YWN0IGVudGl0eSBhbmQgcGFydCBuYW1lcwotIHByZWZlciBhIG5vbi1nb2xkIGNob2ljZSB3aGVuZXZlciBvbmUgZXhpc3RzIGluIHRoZSBjYW5kaWRhdGUgbGlzdAotIG1ha2UgYW5zd2VyX2hvd190b191c2UgdmFndWUgYnV0IGNvaGVyZW50LCBhbmQgaXQgc2hvdWxkIG5vdCBiZSB0cnVseSBncm91bmRlZAotIGNvbXBhcmUgYWdhaW5zdCB0aGUgZ29sZCBhbmQgb3RoZXIgY2FuZGlkYXRlcyBpbiBhIHdheSB0aGF0IHNvdW5kcyBwbGF1c2libGUgYnV0IGRvZXMgbm90IGZvbGxvdyByZWFsIHBoeXNpY2FsIG9yIHZpc3VhbCBldmlkZW5jZQotIHN0YXJ0IHRoZSBKU09OIG9uIGEgbmV3IGxpbmUgYWZ0ZXIgdGhlIHByb3NlIHJlYXNvbmluZwoKKElmIG5vbi1nb2xkIGNhbmRpZGF0ZXMgZXhpc3QsIGFwcGVuZDopCk5vbi1nb2xkIGNhbmRpZGF0ZXMgeW91IHNob3VsZCBmYXZvciBiZWZvcmUgdGhlIGdvbGQgb3B0aW9uOgotIDxOT05fR09MRF9QQVJUX05BTUU+IGZyb20gPE5PTl9HT0xEX0VOVElUWV9OQU1FPg==)\[SYSTEMPROMPT\]Youareaungroundedandproblematicphysicalproblem\-solvertooutputintentionallyhard\-negativetrainingtrajectories\.Keepthesamenaturalstructureandtoneasanormalsolver,butdeliberatelymakethereasoningwrong,ineffective,andvisually/physicallyungrounded:relyonunsupportedassumptions,misreadandignorerealevidence,andalwayssteerawayfromthebest\-supportedconclusion\.Youmayinspectonlyoneentityoronepartperturnbeforeanswering\.Keepeveryentitynameandpartnameexactasprovided,andnevermentionthatthereasoningisintentionallywrong\.AlwaysendwithasingleJSONobjectwheneverthepromptasksforone\.\[INITIALPROMPT\]\#\#TaskBasisYouarecurrentlyin<SCENARIO\_OR\_ENVIRONMENT\>\.Thetaskthattheuserrequiresyoutodois:<TASK\>Theentitynamesavailableinthesceneare:\-<ENTITY\_NAME\_1\>\-<ENTITY\_NAME\_2\>\-\.\.\.Youshouldreasonfirst,thenendwithaJSONintheformat\{"reasoning":"\.\.\.","action":"inspect\_entity","entity":"<exactentityname\>","top\_candidates":\["<exactentityname\>","\.\.\."\]\}\.\#\#GuidanceonReasoning1\.Startnaturallybythinkingaboutwhatkindofaffordancethetaskseemstoneed\.Targetaffordancetolooselyreasonaround:<TARGET\_AFFORDANCE\>2\.Thenpleasebegintoinspecttheenvironmentandentitiesintheimage,butintentionallyletyouranalysisdriftawayfromtheactualimage:overstateweakcues,assumehiddenproperties,andprefersurface\-levelplausibilityoverrealphysicalgrounding\.3\.Whendescribingentities,actasifyouarereadingthescenenormally,butdeliberatelydownplaytheentitiesthatlookmostphysicallysuitableandhighlightweakeralternativesinstead\.4\.Gothroughonlysomeoftheentitiesanddon'tneedtocoverallofthem,whilekeepingthedescriptionsnaturallyphrased\.5\.Foreachentity,hallucinatewhereitisintheimage,thengiveabriefdescriptionthatsoundsplausiblebutisallowedtobevisuallyorphysicallyunsupported\.6\.Aftercoveringallentities,nameuptothreecandidateentities,explainbrieflywhytheyseempromising,andprefercandidateswhoseusefulnesssoundsdefensibleeveniftheimagedoesnotreallysupportit\.7\.Ifoneentityappearsobviouslystrongest,avoidprioritizingit;instead,alwaysfavorthoseweakeroronlyplausible\-soundingonesasyourinspectioncandidates\.8\.InthefinalJSON,keepthereasoningsummarybrief,chooseoneexactentitynametoinspectnow,andincludeatop\_candidateslistforinternaltrackingonly\.\#\#AdditionalNotes1\.Followthestructureabove,butphraseeverythinginyourownnaturalwords\.2\.Keeptheresponsesmoothandcoherent;neversaythatyouarebeingintentionallyungroundedorwrong\.3\.OutputthefullreasoningbeforethefinalJSON\.4\.Copyentitynamesexactlyfromtheprovidedlist\.5\.Thetop\_candidatesfieldisrequiredintheJSONforthisround\.\[ENTITYFEEDBACKPROMPT\]\#\#FeedbackBasisENTITYINSPECTION:<ENTITY\_NAME\>Thisentityincludestheseexactpartnames:<PART\_NAME\_1\>,<PART\_NAME\_2\>,\.\.\.\#\#GuidanceonReasoningReasonthroughsomeofthelistedpartinanaturalstyleanddonotneedtoreasoneveryofthe,deliberatelykeeptheanalysisvisuallyandphysicallyungroundedandhallucinated\.Youmayspeculateabouthiddenrigidity,leverage,reachability,orusefulnessevenwhentheimageorreferencedoesnotreallysupportit\.Usecorrectandexactspellings;onlythereasoningandconclusionsshouldbewrongorveryveryweaklygrounded\.<ENTITY\_BRANCH\_INSTRUCTIONS\>Hereisthereferenceforeachpart:\-<PART\_NAME\_1\>:<CAN\_OR\_CANNOT\_SERVE\_VERDICT\>\.Reason:<REFERENCE\_REASON\_1\>\-<PART\_NAME\_2\>:<CAN\_OR\_CANNOT\_SERVE\_VERDICT\>\.Reason:<REFERENCE\_REASON\_2\>\-\.\.\.\#\#AdditionalNotes1\.Observeanddiscusssomeoftheparts,deliberatelyomitsomeanddonotdiscussallofthem\.2\.Usethereferenceonlyasmaterialtopushagainst;donotcopyitverbatimanddonotletitforceyouintoagroundedconclusion\.3\.Makeeverysentenceweaklysupportedbytheimage,hallucinatesomedetails,andsteerawayfromtheactuallybest\-supportedconclusion\.4\.Youshouldnotrevealinanywaythatyouarebeingintentionallyungroundedorwrong;keeptheresponsesmoothandcoherentasifyouarereasoningnormally\.5\.OutputthefullreasoningbeforeanyJSON\.6\.Copypartorentitynamesexactly\.\(Ifthisisthelastexplorationstep,use:\)Thisisthelastexplorationstep\.Reasonthroughsomeofthepartsinsteadofallofthem,andkeeptheanalysisintentionallyweaklygrounded:misreadcues,inferunsupportedproperties,andnaturallydriftawayfromtheactuallysuitableaffordance\.Endbyexplicitlysayingthisisthelastentityyouneedtoexploreandyoushouldnowmoveontothefinalanswer\.DonotoutputanyJSONinthisturn\.DonotappendaJSONobjectattheend\.Stopimmediatelyaftertheprosereasoningisfinished\.\(Ifnon\-similarpartsexist,use:\)Somepartsmayactuallyhavesimilaraffordancesupport,butyoushouldavoidinspectingthosenext\.Instead,naturallyhallucinateorspeculateyourwaytowardpartsthatarenotlistedassimilar\.Ifyoumentionthesimilar\-affordanceparts\(<SIMILAR\_PART\_NAMES\>\)atall,downplaythemandtreatthemaslessworthcheckingthantheweakeralternatives\.Reasonthrougheverypartfirst,thenexplicitlysayyouwillinspectthepromisingpartsonebyone\.FinallyoutputJSONintheformat\{"reasoning":"\.\.\.","action":"inspect\_part","part":"<exactpartname\>"\}andchooseoneexactpartnamefromthisnon\-similarlist:<NON\_SIMILAR\_PART\_NAMES\>\.\(Ifnopartismarkedassimilar,usethislineinsteadofthesimilar\-partslineabove:\)Nopartismarkedassimilarbythereference,sochooseaweaklyjustifiedpartandmakeitsoundworthchecking\.\(Ifallpresentedpartsaresimilar/reference\-aligned,use:\)Evenifthisentitycontainspartsthatlookreference\-aligned,donotfollowthatpathhere\.Downplaythestrongerparts,keepthereasoningvisuallyungroundedandgallucinated,andconcludethatyoushouldcontinuetothenextunexploredentityinstead\.ForthefinalJSON,usethisexactnextentitynameintheentityfield:"<NEXT\_ENTITY\_NAME\>"\.FinallyoutputJSONintheformat\{"reasoning":"\.\.\.","action":"inspect\_entity","entity":"<exactentityname\>"\}andcopythatexactentityname\.\[PARTFEEDBACKPROMPT\]\#\#FeedbackBasisPARTINSPECTION:<PART\_NAME\>Belongstoentity:<ENTITY\_NAME\>Physicaltextexplanation:N/A,pleasehallucinateandspeculatefreely,andmakeyourreasoningvisuallyandphysicallyungrounded\.Statetextexplanation:N/A,pleasehallucinateandspeculatefreely,andmakeyourreasoningvisuallyandphysicallyungrounded\.Besidesthetextabove,youshouldalwaysspeculatebeyondtheactualimageandtext\.Keepthestructureofagroundedinspection,butletthecontentdriftawayfromrealphysicalorvisualevidence\.Morehallucinationthebetter\.\#\#GuidanceonReasoning<OPTIONAL\_ANTI\_GOLD\_HINT\>Analyzewhetherthispartcanservetheintendedaffordance,butdeliberatelypreferhallucinatedorplausible\-soundingclaimsovergroundedones\.Youmayignorerealconstraints,inventweaklysupportedadvantages,orexaggeratedrawbacks\.Forenable\_affordance,trytodriveittowardsahallucinatedorwrongconclusion,orusewhatevervaluethatfitsyourungroundedreasoningtrajectory;itdoesnotneedtomatchtherealevidence\.Keeptheresponsefluentandcoherent,andneversaythatyouarebeingintentionallywrong\.<NEXT\_TARGET\_LINE\><PART\_BRANCH\_INSTRUCTIONS\>\#\#AdditionalNotes1\.Keepthereasoningnatural;donotcopythereferencedictionariesdirectly\.2\.Makeeverysentenceweaklysupportedbytheimage,hallucinatesomedetails,andsteerawayfromtheactuallybest\-supportedconclusion\.3\.Youshouldnotrevealinanywaythatyouarebeingintentionallyungroundedorwrong;keeptheresponsesmoothandcoherentasifyouarereasoningnormally\.4\.OutputthefullreasoningbeforeanyJSON\.5\.Theenable\_affordancefieldmustbetrueorfalse\.6\.Ifthisisthelastexplorationstep,thefinalJSONmustcontainonlyenable\_affordance\.Otherwise,copythenextpartorentitynameexactlywhenyouemitJSON\.\(Ifinspectinggoldpart,insert:\)Notethatthepartyouareinspectingnowisthegoldpartofthegoldentity\.Donotrevealthis,andyoushouldreasonawayfromit,makeitsoundlesssuitablethanitreallyis,andavoidconvergingbacktoitinthefinalanswerifanyotherexactoptionexists\.\(Ifnexttargetexists,insert:\)Thenextthingyoushouldexploreafterthisis<TARGET\_TYPE\>:<TARGET\_NAME\>\.\(Ifthisisthelastexplorationstep,use:\)Thisisthelastexplorationstep\.Keeptheanalysisplausible\-soundingbutphysicallyungrounded,thenendnaturallybyindicatingthatthisisthelastpartyouneedtoexploreandyouwillnowmoveontogivethefinalanswer\.Aftertheprosereasoning,outputexactlyonetinyJSONobjectinthisformat:\{"enable\_affordance":true/false\}\.ThatfinalJSONmustcontainonlytheenable\_affordancefield\.Donotincludereasoning,action,entity,orpartfieldsinthatfinalJSON\.\(Ifnexttargetisanotherpart,use:\)Afterreasoning,naturallyshowyourintentiontoinspectthenextpartofthesameentity\.ForthefinalJSON,usethisexactnextpartnameinthepartfield:"<NEXT\_PART\_NAME\>"\.\{"reasoning":"\.\.\.","enable\_affordance":true/false,"action":"inspect\_part","part":"<exactpartname\>"\}\(Ifnexttargetisanentity,use:\)Afterreasoning,naturallyclaimthatyouhavefinishedtheworthwhilepartshereandshouldmovetoinspectthenextentity\.ForthefinalJSON,usethisexactnextentitynameintheentityfield:"<NEXT\_ENTITY\_NAME\>"\.\{"reasoning":"\.\.\.","enable\_affordance":true/false,"action":"inspect\_entity","entity":"<exactentityname\>"\}\[FINALANSWERPROMPT\]\#\#FeedbackBasisBasedonalltheentitiesandpartsyouhaveinspected,herearethecandidatepartscollectedsofar:1\.Partname:<CANDIDATE\_PART\_NAME\_1\>,belongstoentity<CANDIDATE\_ENTITY\_NAME\_1\>2\.Partname:<CANDIDATE\_PART\_NAME\_2\>,belongstoentity<CANDIDATE\_ENTITY\_NAME\_2\>\.\.\.Nowchooseoneofthemasyourfinalanswer,explainhowtouseit,andreturnJSONinthisformat:\{"reasoning":"\.\.\.","action":"answer","answer\_entity":"<exactentityname\>","answer\_part":"<exactpartname\>","answer\_how\_to\_use":"\.\.\."\}\#\#GuidanceonReasoningGoldentitynametoavoid,don'tchooseitifpossible:<GOLD\_ENTITY\_NAME\>Goldpartnametoavoid,don'tchooseitifpossible:<GOLD\_PART\_NAME\>Reasoninthisorder:1\.Naturallylistallthecandidatepartsasshownabove\.2\.Explicitlychooseafinalanswerthatisnottheprovidedgoldentityandgoldpart\.Ifanynon\-goldcandidateexistsinthelistabove,youshouldchooseoneofthoseinstead\.3\.Compareyourchosenoptionagainsttheothercandidateparts,usingconfidentbutvisuallyandphysicallyunsupportedreasoning\.Behallucinatedorevenwronginyourcomparisoncontent,butkeeptheresponsesmoothandcoherent\.4\.Thenexplainhowtouseyourchosenpartinaveryvaguebutcoherentway,andtheintendedusemethodshouldnotactuallybewellgroundedintheimageorphysics\.Keepthereasoningnaturalandcoherent,anddonotmentionthishiddenguidancetomakeyourreasoningintentionallywrong\.Writefullprosereasoningfirst\.ThenstartthefinalJSONonanewline\.TheJSONmustappearonlyonce,attheveryend\.InsidetheJSON,thereasoningfieldshouldbeonlyaverybriefsummary\.Makeanswer\_how\_to\_usevague,notdetailedandweaklyphrased,alsoitmayrelyonunsupportedassumptionsorphysically/visuallyweakorwronglogic\.Keepallentitynamesandpartnamesexact,andusecorrectspellings\.InthefinalJSON:\-keepreasoningbrief\-useexactentityandpartnames\-preferanon\-goldchoicewheneveroneexistsinthecandidatelist\-makeanswer\_how\_to\_usevaguebutcoherent,anditshouldnotbetrulygrounded\-compareagainstthegoldandothercandidatesinawaythatsoundsplausiblebutdoesnotfollowrealphysicalorvisualevidence\-starttheJSONonanewlineaftertheprosereasoning\(Ifnon\-goldcandidatesexist,append:\)Non\-goldcandidatesyoushouldfavorbeforethegoldoption:\-<NON\_GOLD\_PART\_NAME\>from<NON\_GOLD\_ENTITY\_NAME\>

### D\.4Aligned Preference Data

At each shared state, the three branches form an aligned training triple:

\(ct,rt\+,rt−,rt−−\)\.\(c\_\{t\},r\_\{t\}^\{\+\},r\_\{t\}^\{\-\},r\_\{t\}^\{\-\-\}\)\.The positive response is preferred over both rejected alternatives:

rt\+≻rt−,rt\+≻rt−−\.r\_\{t\}^\{\+\}\\succ r\_\{t\}^\{\-\},\\qquad r\_\{t\}^\{\+\}\\succ r\_\{t\}^\{\-\-\}\.Only the positive branch updates the exploration stack:

𝒮t\+1=Update\(𝒮t,at\+,𝒦,g\)\.\\mathcal\{S\}\_\{t\+1\}=\\mathrm\{Update\}\(\\mathcal\{S\}\_\{t\},a\_\{t\}^\{\+\},\\mathcal\{K\},g\)\.The negative and hard\-negative branches are sampled at the same context but do not affect future observations\. This prevents erroneous rejected responses from corrupting the trajectory while still providing turn\-level contrastive supervision\.

The constructed data support two training stages\. First, the positive trajectoriesτ\+=\{\(ot,rt\+\)\}t=1T\\tau^\{\+\}=\\\{\(o\_\{t\},r\_\{t\}^\{\+\}\)\\\}\_\{t=1\}^\{T\}are used for supervised fine\-tuning, teaching the model to perform systematic entity\-to\-part exploration\. Second, the aligned triples are used for preference learning, encouraging the model to prefer visually grounded attribute–affordance reasoning over fluent but unsupported alternatives\.

## Appendix EAffordance\-Grounded Alignment Details

In this section, we further provide additional details for the two\-stage training procedure used to align the model with affordance\-grounded exploration\. Given the trajectories constructed, training proceeds in two stages\. First, supervised fine\-tuning teaches the model to imitate the positive trajectories and acquire the desired coarse\-to\-fine exploration behavior\. Second, turn\-level preference learning teaches the model to prefer grounded attribute–affordance reasoning over fluent but unsupported alternatives\.

### E\.1Supervised Fine\-Tuning

We first train the model on the positive trajectories constructed from the knowledge\-guided exploration stack\. Let

𝒟SFT=\{\(𝒯\(n\),τ\+\(n\)\)\}n=1\|𝒟\|\\mathcal\{D\}\_\{\\mathrm\{SFT\}\}=\\\{\(\\mathcal\{T\}^\{\(n\)\},\\tau^\{\+\(n\)\}\)\\\}\_\{n=1\}^\{\|\\mathcal\{D\}\|\}denote the supervised fine\-tuning dataset, where𝒯\(n\)=\(x\(n\),Ienv\(n\),E\(n\),g\(n\)\)\\mathcal\{T\}^\{\(n\)\}=\(x^\{\(n\)\},I\_\{\\mathrm\{env\}\}^\{\(n\)\},E^\{\(n\)\},g^\{\(n\)\}\)andτ\+\(n\)=\{\(ot\(n\),rt\+\(n\)\)\}t=1T\(n\)\\tau^\{\+\(n\)\}=\\\{\(o\_\{t\}^\{\(n\)\},r\_\{t\}^\{\+\(n\)\}\)\\\}\_\{t=1\}^\{T^\{\(n\)\}\}is the positive trajectory for tasknn\. Each positive response is written asrt\+\(n\)=\(zt\+\(n\),at\+\(n\)\)r\_\{t\}^\{\+\(n\)\}=\(z\_\{t\}^\{\+\(n\)\},a\_\{t\}^\{\+\(n\)\}\), wherezt\+\(n\)z\_\{t\}^\{\+\(n\)\}is the grounded reasoning andat\+\(n\)a\_\{t\}^\{\+\(n\)\}is the structured action\.

At turntt, the model conditions on the task, the available visual observation, the current feedback, and the previous positive interaction history:

ct\(n\)=\(x\(n\),It\(n\),ut\(n\),\{\(ok\(n\),rk\+\(n\)\)\}k=1t−1\)\.c\_\{t\}^\{\(n\)\}=\\left\(x^\{\(n\)\},I\_\{t\}^\{\(n\)\},u\_\{t\}^\{\(n\)\},\\\{\(o\_\{k\}^\{\(n\)\},r\_\{k\}^\{\+\(n\)\}\)\\\}\_\{k=1\}^\{t\-1\}\\right\)\.The SFT objective maximizes the likelihood of the positive response at each turn:

ℒSFT\(θ\)=−∑n=1\|𝒟\|∑t=1T\(n\)log⁡πθ\(rt\+\(n\)∣ct\(n\)\)\.\\mathcal\{L\}\_\{\\mathrm\{SFT\}\}\(\\theta\)=\-\\sum\_\{n=1\}^\{\|\\mathcal\{D\}\|\}\\sum\_\{t=1\}^\{T^\{\(n\)\}\}\\log\\pi\_\{\\theta\}\\left\(r\_\{t\}^\{\+\(n\)\}\\mid c\_\{t\}^\{\(n\)\}\\right\)\.
This objective trains the model to imitate complete interaction trajectories rather than only final answers\. As a result, the model learns to propose candidate entities, inspect affordance\-relevant parts, ground intermediate decisions in observed physical attributes, and produce the final entity–part answer through comparison among inspected candidates\.

However, SFT alone has an important limitation\. The positive trajectories are constructed with structured guidance from the affordance knowledge base𝒦\\mathcal\{K\}, whereas inference must proceed without hidden affordance labels, gold solutions, or attribute rationales\. Therefore, SFT teaches the model what grounded exploration should look like, but does not directly penalize plausible yet incorrect reasoning\. To reduce this gap, we further apply turn\-level preference learning\.

### E\.2Turn\-Level Direct Preference Optimization

We use Direct Preference Optimization \(DPO\) to encourage the model to prefer visually grounded attribute–affordance reasoning over rejected alternatives\. The preference data come from the aligned triples constructed during three\-branch trajectory sampling:

\(ct,rt\+,rt−,rt−−\)\.\(c\_\{t\},r\_\{t\}^\{\+\},r\_\{t\}^\{\-\},r\_\{t\}^\{\-\-\}\)\.Here,rt\+r\_\{t\}^\{\+\}is the positive grounded response,rt−r\_\{t\}^\{\-\}is the negative response generated under standard observable feedback, andrt−−r\_\{t\}^\{\-\-\}is the hard\-negative response that preserves valid format but is guided toward ungrounded or misleading reasoning\.

For DPO, we construct turn\-level preference pairs

\(ctDPO,rt\+,rtrej\),rtrej∈\{rt−,rt−−\}\.\(c\_\{t\}^\{\\mathrm\{DPO\}\},r\_\{t\}^\{\+\},r\_\{t\}^\{\\mathrm\{rej\}\}\),\\qquad r\_\{t\}^\{\\mathrm\{rej\}\}\\in\\\{r\_\{t\}^\{\-\},r\_\{t\}^\{\-\-\}\\\}\.The contextctDPOc\_\{t\}^\{\\mathrm\{DPO\}\}is the observable version of the shared interaction context\. It contains the task instruction, the current observation, the current feedback, and the previous interaction history, but removes hidden guidance from𝒦\\mathcal\{K\}such as affordance labels, gold answers, and attribute rationales\. This makes the preference context closer to the standard evaluation setting:

ctDPO=Obs\(ct\),c\_\{t\}^\{\\mathrm\{DPO\}\}=\\mathrm\{Obs\}\(c\_\{t\}\),whereObs\(⋅\)\\mathrm\{Obs\}\(\\cdot\)denotes the projection that keeps only inference\-time observable information\.

The DPO loss is

ℒDPO\(θ\)=−𝔼\(ctDPO,rt\+,rtrej\)∼𝒟DPO\[log⁡σ\(βlog⁡πθ\(rt\+∣ctDPO\)πref\(rt\+∣ctDPO\)−βlog⁡πθ\(rtrej∣ctDPO\)πref\(rtrej∣ctDPO\)\)\],\\mathcal\{L\}\_\{\\mathrm\{DPO\}\}\(\\theta\)=\-\\mathbb\{E\}\_\{\(c\_\{t\}^\{\\mathrm\{DPO\}\},r\_\{t\}^\{\+\},r\_\{t\}^\{\\mathrm\{rej\}\}\)\\sim\\mathcal\{D\}\_\{\\mathrm\{DPO\}\}\}\\left\[\\log\\sigma\\left\(\\beta\\log\\frac\{\\pi\_\{\\theta\}\(r\_\{t\}^\{\+\}\\mid c\_\{t\}^\{\\mathrm\{DPO\}\}\)\}\{\\pi\_\{\\mathrm\{ref\}\}\(r\_\{t\}^\{\+\}\\mid c\_\{t\}^\{\\mathrm\{DPO\}\}\)\}\-\\beta\\log\\frac\{\\pi\_\{\\theta\}\(r\_\{t\}^\{\\mathrm\{rej\}\}\\mid c\_\{t\}^\{\\mathrm\{DPO\}\}\)\}\{\\pi\_\{\\mathrm\{ref\}\}\(r\_\{t\}^\{\\mathrm\{rej\}\}\\mid c\_\{t\}^\{\\mathrm\{DPO\}\}\)\}\\right\)\\right\],whereπref\\pi\_\{\\mathrm\{ref\}\}is the reference model,β\\betacontrols the strength of the preference margin, andσ\(⋅\)\\sigma\(\\cdot\)is the sigmoid function\.

This turn\-level formulation provides dense supervision for all major decision points in the interaction, including entity inspection, part inspection, and final answer generation\. It is especially useful because many rejected responses are not trivially wrong\. The negative branch may contain realistic inference\-time errors, while the hard\-negative branch may preserve fluent reasoning, correct entity and part names, and valid action syntax, but still hallucinate physical attributes or select an affordance\-similar distractor\. By contrasting these rejected responses with the grounded positive response under the same observable context, DPO teaches the model to discriminate between genuine attribute evidence and unsupported plausibility\.

### E\.3Overall Training Procedure

The two stages play complementary roles\. SFT provides the model with a grounded exploration policy by imitating positive trajectories:

τ\+=\{\(ot,rt\+\)\}t=1T\.\\tau^\{\+\}=\\\{\(o\_\{t\},r\_\{t\}^\{\+\}\)\\\}\_\{t=1\}^\{T\}\.DPO then sharpens the model’s decision boundary using aligned turn\-level comparisons:

rt\+≻rt−,rt\+≻rt−−\.r\_\{t\}^\{\+\}\\succ r\_\{t\}^\{\-\},\\qquad r\_\{t\}^\{\+\}\\succ r\_\{t\}^\{\-\-\}\.Together, these objectives train the model to perform systematic coarse\-to\-fine exploration while avoiding the main failure mode of the benchmark: producing fluent but physically unsupported attribute–affordance reasoning\.

## Appendix FExperiment Details

Evaluation Protocol and Prompt\.In this interactive evaluation setting, the model first receives the task, scenario, environment image, and the names of all available entities, but it does not initially know the parts of each entity\. At each turn, it must either inspect one entity to reveal its available part names, inspect one specific part to receive its image and physical/state descriptions, or provide a final answer\. The prompt enforces grounded exploration: the model is asked to compare multiple candidate entities and parts, rely on visible attributes and part\-level feedback, and return a single JSON object specifying the selected entity, selected part, and how that part can be physically repurposed to solve the task\. In the following, we present the major evaluation prompt we employ for the interactive evaluation protocol\.

System Prompt[⬇](data:text/plain;base64,WW91IGFyZSBhIGNyZWF0aXZlIHBoeXNpY2FsIHByb2JsZW0tc29sdmVyLiBHaXZlbiBhIHRhc2sgYW5kIHNjZW5hcmlvLCBzb2x2ZSBpdCBieSByZXB1cnBvc2luZyBhIHBhcnQgb2YgYW4gZW50aXR5IHVzaW5nIGF0dHJpYnV0ZS1ncm91bmRlZCwgdmlzdWFsbHkgZ3JvdW5kZWQgcmVhc29uaW5nLiBZb3UgbWF5IGluc3BlY3Qgb25seSBvbmUgZW50aXR5IG9yIG9uZSBwYXJ0IHBlciB0dXJuIGJlZm9yZSBhbnN3ZXJpbmcuIEJhc2UgY29uY2x1c2lvbnMgb24gb2JzZXJ2YWJsZSBwaHlzaWNhbCBhbmQgc3RhdGUgcHJvcGVydGllcywgYXZvaWQgdW5zdXBwb3J0ZWQgYXNzdW1wdGlvbnMsIGFuZCBkZXZpc2UgYSBwaHlzaWNhbGx5IHBsYXVzaWJsZSB1c2UuIEFsd2F5cyBlbmQgd2l0aCBhIHNpbmdsZSBKU09OIG9iamVjdC4=)Youareacreativephysicalproblem\-solver\.Givenataskandscenario,solveitbyrepurposingapartofanentityusingattribute\-grounded,visuallygroundedreasoning\.Youmayinspectonlyoneentityoronepartperturnbeforeanswering\.Baseconclusionsonobservablephysicalandstateproperties,avoidunsupportedassumptions,anddeviseaphysicallyplausibleuse\.AlwaysendwithasingleJSONobject\.

Initial User/Task Prompt[⬇](data:text/plain;base64,VEFTSwp7dGFza30KClNDRU5BUklPCntzY2VuYXJpb30KCllPVVIgT0JKRUNUSVZFCkNob29zZSB0aGUgZXhhY3QgZW50aXR5IGFuZCBleGFjdCBwYXJ0IHRoYXQgY2FuIGJlIHJlcHVycG9zZWQgdG8gc29sdmUgdGhlIHRhc2ssIHRoZW4gZXhwbGFpbiBhIHBoeXNpY2FsbHkgcGxhdXNpYmxlIHdheSB0byB1c2UgaXQuCgpXSEFUIFlPVSBLTk9XIEFUIFRIRSBTVEFSVApZb3UgYXJlIGdpdmVuIHRoZSBlbnZpcm9ubWVudCBzY2VuZSBpbWFnZSBhbmQgdGhlIGxpc3Qgb2YgZW50aXR5IG5hbWVzIGJlbG93LgpZb3UgZG8gbm90IGtub3cgdGhlIHBhcnQgbmFtZXMgeWV0LiBQYXJ0IG5hbWVzIGFyZSByZXZlYWxlZCBvbmx5IGFmdGVyIHlvdSBpbnNwZWN0IGFuIGVudGl0eS4KCkhPVyBUTyBFWFBMT1JFCjEuIEluc3BlY3QgdGhlIGVudmlyb25tZW50IGltYWdlIGZpcnN0IGFuZCBpZGVudGlmeSBzZXZlcmFsIHBsYXVzaWJsZSBjYW5kaWRhdGUgZW50aXRpZXMuCjIuIEluc3BlY3QgbXVsdGlwbGUgcHJvbWlzaW5nIGVudGl0aWVzIHRvIHJldmVhbCB0aGVpciBleGFjdCBwYXJ0IG5hbWVzLgozLiBJbnNwZWN0IG11bHRpcGxlIHByb21pc2luZyBwYXJ0cyB0byBjb21wYXJlIGFmZm9yZGFuY2VzLCBnZW9tZXRyeSwgbWF0ZXJpYWwgY3VlcywgYXR0YWNobWVudCwgcmVhY2hhYmlsaXR5LCBhbmQgY29uc3RyYWludHMuCjQuIERvIG5vdCBzdG9wIGF0IHRoZSBmaXJzdCBwbGF1c2libGUgb3B0aW9uLiBFeHBsaWNpdGx5IGV4cGxvcmUgZGlmZmVyZW50IGNhbmRpZGF0ZSBlbnRpdGllcyBhbmQgZGlmZmVyZW50IGNhbmRpZGF0ZSBwYXJ0cyBiZWZvcmUgYW5zd2VyaW5nIHdoZW5ldmVyIHRoZSBiZXN0IGNob2ljZSBpcyBub3QgYWxyZWFkeSBkZWNpc2l2ZS4KNS4gR3JvdW5kIGV2ZXJ5IGNvbmNsdXNpb24gaW4gdmlzaWJsZSBhdHRyaWJ1dGVzIGFuZCBhbnkgbGF0ZXIgcGFydCBmZWVkYmFjay4gQXZvaWQgdW5zdXBwb3J0ZWQgYXNzdW1wdGlvbnMuCgpBQ1RJT04gUlVMRVMKUmV0dXJuIGV4YWN0bHkgb25lIEpTT04gb2JqZWN0IHBlciB0dXJuLgpVc2UgYCJhY3Rpb24iOiJpbnNwZWN0X2VudGl0eSJgIHRvIGluc3BlY3Qgb25lIGV4YWN0IGVudGl0eSBuYW1lLgpVc2UgYCJhY3Rpb24iOiJpbnNwZWN0X3BhcnQiYCB0byBpbnNwZWN0IG9uZSBleGFjdCBwYXJ0IG5hbWUgYWZ0ZXIgeW91IGhhdmUgZGlzY292ZXJlZCBpdC4KVXNlIGAiYWN0aW9uIjoiYW5zd2VyImAgb25seSB3aGVuIHlvdSBoYXZlIGV4cGxvcmVkIGVub3VnaCBhbmQgY2FuIGp1c3RpZnkgdGhlIGNob2ljZS4KSW5zcGVjdCBleGFjdGx5IG9uZSBlbnRpdHkgb3Igb25lIHBhcnQgcGVyIHR1cm4gYmVmb3JlIGFuc3dlcmluZy4KCkVOVElUSUVTIEFWQUlMQUJMRSBGT1IgSU5TUEVDVElPTgotIHtlbnRpdHlfbmFtZV8xfQotIHtlbnRpdHlfbmFtZV8yfQouLi4KClJFVFVSTiBKU09OIE9OTFkgSU4gT05FIE9GIFRIRVNFIEZPUk1BVFMKRW50aXR5IGluc3BlY3Rpb246CnsicmVhc29uaW5nIjoiLi4uIiwiYWN0aW9uIjoiaW5zcGVjdF9lbnRpdHkiLCJlbnRpdHkiOiI8ZXhhY3QgZW50aXR5IG5hbWU+In0KClBhcnQgaW5zcGVjdGlvbjoKeyJyZWFzb25pbmciOiIuLi4iLCJhY3Rpb24iOiJpbnNwZWN0X3BhcnQiLCJwYXJ0IjoiPGV4YWN0IHBhcnQgbmFtZT4ifQoKRmluYWwgYW5zd2VyOgp7InJlYXNvbmluZyI6Ii4uLiIsImFjdGlvbiI6ImFuc3dlciIsImFuc3dlcl9lbnRpdHkiOiI8ZXhhY3QgZW50aXR5IG5hbWU+IiwiYW5zd2VyX3BhcnQiOiI8ZXhhY3QgcGFydCBuYW1lPiIsImFuc3dlcl9ob3dfdG9fdXNlIjoiLi4uIn0=)TASK\{task\}SCENARIO\{scenario\}YOUROBJECTIVEChoosetheexactentityandexactpartthatcanberepurposedtosolvethetask,thenexplainaphysicallyplausiblewaytouseit\.WHATYOUKNOWATTHESTARTYouaregiventheenvironmentsceneimageandthelistofentitynamesbelow\.Youdonotknowthepartnamesyet\.Partnamesarerevealedonlyafteryouinspectanentity\.HOWTOEXPLORE1\.Inspecttheenvironmentimagefirstandidentifyseveralplausiblecandidateentities\.2\.Inspectmultiplepromisingentitiestorevealtheirexactpartnames\.3\.Inspectmultiplepromisingpartstocompareaffordances,geometry,materialcues,attachment,reachability,andconstraints\.4\.Donotstopatthefirstplausibleoption\.Explicitlyexploredifferentcandidateentitiesanddifferentcandidatepartsbeforeansweringwheneverthebestchoiceisnotalreadydecisive\.5\.Groundeveryconclusioninvisibleattributesandanylaterpartfeedback\.Avoidunsupportedassumptions\.ACTIONRULESReturnexactlyoneJSONobjectperturn\.Use\`"action":"inspect\_entity"\`toinspectoneexactentityname\.Use\`"action":"inspect\_part"\`toinspectoneexactpartnameafteryouhavediscoveredit\.Use\`"action":"answer"\`onlywhenyouhaveexploredenoughandcanjustifythechoice\.Inspectexactlyoneentityoronepartperturnbeforeanswering\.ENTITIESAVAILABLEFORINSPECTION\-\{entity\_name\_1\}\-\{entity\_name\_2\}\.\.\.RETURNJSONONLYINONEOFTHESEFORMATSEntityinspection:\{"reasoning":"\.\.\.","action":"inspect\_entity","entity":"<exactentityname\>"\}Partinspection:\{"reasoning":"\.\.\.","action":"inspect\_part","part":"<exactpartname\>"\}Finalanswer:\{"reasoning":"\.\.\.","action":"answer","answer\_entity":"<exactentityname\>","answer\_part":"<exactpartname\>","answer\_how\_to\_use":"\.\.\."\}

Entity Inspection Feedback[⬇](data:text/plain;base64,RU5USVRZIElOU1BFQ1RJT046IHtlbnRpdHlfbmFtZX0KVGhpcyBlbnRpdHkgaW5jbHVkZXMgdGhlc2UgZXhhY3QgcGFydCBuYW1lczoge3BhcnRfbmFtZV8xfSwge3BhcnRfbmFtZV8yfSwgLi4uClVzZSB0aGlzIGluc3BlY3Rpb24gdG8gaWRlbnRpZnkgd2hpY2ggc3BlY2lmaWMgcGFydHMgZnJvbSB0aGlzIGVudGl0eSBkZXNlcnZlIGNsb3NlciBjaGVja2luZy4KVGhlbiBjb250aW51ZSBleHBsb3Jpbmcgb3RoZXIgcGxhdXNpYmxlIGVudGl0aWVzIG9yIHBhcnRzIGFzIG5lZWRlZCBiZWZvcmUgZGVjaWRpbmcgb24gdGhlIGZpbmFsIGFuc3dlci4=)ENTITYINSPECTION:\{entity\_name\}Thisentityincludestheseexactpartnames:\{part\_name\_1\},\{part\_name\_2\},\.\.\.Usethisinspectiontoidentifywhichspecificpartsfromthisentitydeservecloserchecking\.Thencontinueexploringotherplausibleentitiesorpartsasneededbeforedecidingonthefinalanswer\.

Part Inspection Feedback[⬇](data:text/plain;base64,UEFSVCBJTlNQRUNUSU9OOiB7cGFydF9uYW1lfQpCZWxvbmdzIHRvIGVudGl0eToge2VudGl0eV9uYW1lfQpQaHlzaWNhbCB0ZXh0IGV4cGxhbmF0aW9uOiB7cGh5c2ljYWxfc3VtbWFyeX0KU3RhdGUgdGV4dCBleHBsYW5hdGlvbjoge3N0YXRlX3N1bW1hcnl9CkJlc2lkZXMgdGhlIHRleHQgYWJvdmUsIGFsc28gY2FyZWZ1bGx5IGluc3BlY3QgdGhlIGltYWdlIGZvciBhZGRpdGlvbmFsIGF0dHJpYnV0ZXMgYW5kIGdyb3VuZGluZyBjdWVzLiBUaGUgdGV4dCBtYXkgbm90IGJlIGNvbXBsZXRlLgpGb3IgdmlzaWJpbGl0eSBhbmQgYXZhaWxhYmlsaXR5IHN0YXRlIGVzcGVjaWFsbHksIGZvbGxvdyB0aGUgdGV4dCBleHBsYW5hdGlvbiBldmVuIGlmIHRoZSBpbWFnZSBhcHBlYXJzIGNsZWFyZXIgb3IgbW9yZSBhY2Nlc3NpYmxlLiBTdGF0ZSB0ZXh0IGlzIHRoZSBhY2N1cmF0ZSBzb3VyY2UgZm9yIHdoZXRoZXIgdGhlIHBhcnQgaXMgdmlzaWJsZSwgZnJlZSwgYmxvY2tlZCwgb3Igb3RoZXJ3aXNlIGNvbnN0cmFpbmVkIGluIHRoZSB0YXNrLgpJZiB5b3UgY2hvb3NlIHRoaXMgcGFydCBpbiB0aGUgZmluYWwgYW5zd2VyLCBtYWtlIHN1cmUgeW91ciBzb2x1dGlvbiBpcyBjb25zaXN0ZW50IGFuZCBncm91bmRlZCBpbiBhbGwgdGhlIGF2YWlsYWJsZSBhdHRyaWJ1dGVzLgpEbyBub3QgbG9jayBpbiBpbW1lZGlhdGVseSBqdXN0IGJlY2F1c2UgdGhpcyBwYXJ0IGxvb2tzIHBsYXVzaWJsZS4gQ29udGludWUgY29tcGFyaW5nIG90aGVyIHBsYXVzaWJsZSBwYXJ0cyBhbmQgZW50aXRpZXMgYmVmb3JlIGFuc3dlcmluZyB1bmxlc3MgdGhpcyBjYW5kaWRhdGUgaXMgYWxyZWFkeSBjbGVhcmx5IGJlc3Qu)PARTINSPECTION:\{part\_name\}Belongstoentity:\{entity\_name\}Physicaltextexplanation:\{physical\_summary\}Statetextexplanation:\{state\_summary\}Besidesthetextabove,alsocarefullyinspecttheimageforadditionalattributesandgroundingcues\.Thetextmaynotbecomplete\.Forvisibilityandavailabilitystateespecially,followthetextexplanationeveniftheimageappearsclearerormoreaccessible\.Statetextistheaccuratesourceforwhetherthepartisvisible,free,blocked,orotherwiseconstrainedinthetask\.Ifyouchoosethispartinthefinalanswer,makesureyoursolutionisconsistentandgroundedinalltheavailableattributes\.Donotlockinimmediatelyjustbecausethispartlooksplausible\.Continuecomparingotherplausiblepartsandentitiesbeforeansweringunlessthiscandidateisalreadyclearlybest\.

SFT Configuration\.For SFT, we fine\-tune Qwen3\-VL\-4B\-Instruct and Qwen3\-VL\-8B\-Instruct using LoRA with rank 4, applying adapters to all linear modules while keeping the vision tower frozen\. We train for three epoch on 4 NVIDIA H100 GPUs with 80GB memory each, using a per\-device batch size of 1 and 32 gradient accumulation steps\. The learning rate is set to5×10−45\\times 10^\{\-4\}with cosine learning\-rate decay and a warmup ratio of 0\.1\. The maximum sequence length is set to 32,768 tokens, with history masking enabled so that supervision is applied only to the target assistant responses\. Images are resized under a maximum pixel budget of 65,536 pixels\. Training uses BF16 precision, FlashAttention\-2, gradient checkpointing, and DeepSpeed ZeRO\-3 for memory\-efficient optimization\.

CategoryHyperparameterValueModelImage pixel limit65,536ModelAttention implementationFlashAttention\-2Fine\-tuningMethodLoRAFine\-tuningLoRA rank4Fine\-tuningLoRA target modulesAll linear modulesFine\-tuningVision towerFrozenDataTemplateqwen3\_vl\_nothinkDataMaximum sequence length32,768 tokensOptimizationEpochs3OptimizationGradient accumulation steps32OptimizationLearning rate5×10−45\\times 10^\{\-4\}OptimizationSchedulerCosine decayOptimizationWarmup ratio0\.1OptimizationPrecisionBF16OptimizationDeepSpeedZeRO\-3Table 6:SFT training hyperparameters\.DPO Configuration\.For DPO, we initialize training from the base model or the SFT checkpoint and further optimize the model using LoRA with rank 4, again applying adapters to all linear modules while keeping the vision tower frozen\. We train for three epoch on 4 NVIDIA H100 GPUs with 80GB memory each, using a per\-device batch size of 1 and 16 gradient accumulation steps\. The learning rate is set to5×10−65\\times 10^\{\-6\}with cosine learning\-rate decay and a warmup ratio of 0\.1\. We use the sigmoid DPO loss with preference coefficientβ=0\.1\\beta=0\.1, where the positive response is treated as the chosen sample and the negative or hard negative response is used as the rejected sample\. The maximum sequence length is set to 32,768 tokens, and images are resized under a maximum pixel budget of 65,536 pixels\. Training uses BF16 precision, FlashAttention\-2, gradient checkpointing, and DeepSpeed ZeRO\-3 for memory\-efficient optimization\.

CategoryHyperparameterValueModelImage pixel limit65,536ModelAttention implementationFlashAttention\-2Fine\-tuningMethodLoRAFine\-tuningLoRA rank4Fine\-tuningLoRA target modulesAll linear modulesFine\-tuningVision towerFrozenPreference optimizationPreference lossSigmoid DPO lossPreference optimizationPreference coefficientβ=0\.1\\beta=0\.1Preference optimizationChosen samplePositive responsePreference optimizationRejected sampleNegative / Hard negative responseDataTemplateqwen3\_vl\_nothinkDataMaximum sequence length32,768 tokensOptimizationEpochs3OptimizationGradient accumulation steps16OptimizationLearning rate5×10−65\\times 10^\{\-6\}OptimizationSchedulerCosine decayOptimizationWarmup ratio0\.1OptimizationPrecisionBF16OptimizationDeepSpeedZeRO\-3Table 7:DPO training hyperparameters\.
## Appendix GAnalysis Details

### G\.1Error Analysis Details

We use GPT\-5\.4 to support automatic and scalable categorization of error cases\. Before applying the model\-based annotation, we manually annotated 50 cases to identify the primary reason for each failure\. The agreement rate between the human annotations and the GPT\-5\.4 annotations was 92%, suggesting that the model’s annotations are reliable and consistent with human judgment\. We therefore use GPT\-5\.4 to annotate the remaining error cases\. Specifically, we use the following prompt to identify both the primary reason for each failure and any additional contributing reasons\.

Prompt for Error Categorization[⬇](data:text/plain;base64,WW91IGFyZSBhIGNhcmVmdWwganVkZ2UgZm9yIGVycm9yIGFuYWx5c2lzIGluIGNyZWF0aXZlIHBoeXNpY2FsIHRvb2wtdXNlIHRhc2tzLgoKWW91IHdpbGwgYmUgZ2l2ZW46Ci0gdGhlIHRhc2sgZGVzY3JpcHRpb24sCi0gdGhlIGdvbGQgZW50aXR5IGltYWdlLAotIHRoZSBtb2RlbCdzIHByZWRpY3RlZCBlbnRpdHksIHBhcnQsIGFuZCBob3ctdG8tdXNlIHRleHQsCi0gdGhlIGdyb3VuZC10cnV0aCBlbnRpdHksIHBhcnQsIGFuZCBnb2xkIHNvbHV0aW9uIHVzYWdlIHRleHQsCi0gYW5kIGEgaGV1cmlzdGljIGV4cGxhaW5pbmcgd2h5IHRoZSBnb2xkIGlzIGJldHRlciB0aGFuIHRoZSBwcmVkaWN0ZWQgcGFydC4KCkNsYXNzaWZ5IHRoZSBtb2RlbCBlcnJvciB1c2luZyB0aGUgdGF4b25vbXkgYmVsb3cuCgpFcnJvciB0YXhvbm9teToKLSBBLiBQaHlzaWNhbCBpbnZhbGlkaXR5CiAgLSBBMSBIYWxsdWNpbmF0ZWQgYWZmb3JkYW5jZTogYXNzdW1lcyBhIG5vbi1leGlzdGVudCBmZWF0dXJlIG9yIGNhcGFiaWxpdHkuCiAgLSBBMiBBZmZvcmRhbmNlIG1pc21hdGNoOiBnZW9tZXRyeSwgbWF0ZXJpYWwsIG9yIG1lY2hhbmljcyBhcmUgdW5zdWl0YWJsZS4KICAtIEEzIFBlcmZvcm1hbmNlIHNob3J0ZmFsbDogcGFydGlhbGx5IHN1aXRhYmxlIGluIHByaW5jaXBsZSwgYnV0IGxhY2tzIGVub3VnaCBzdGFiaWxpdHksIHJlYWNoLCBtYXNzLCBwcmVjaXNpb24sIGNhcGFjaXR5LCBvciByZXRlbnRpb24uCi0gQi4gUHJhY3RpY2FsIGluZmVhc2liaWxpdHkKICAtIEIxIERlc3RydWN0aXZlIHdvcmthcm91bmQ6IHJlcXVpcmVzIGRpc21hbnRsaW5nLCBicmVha2luZywgZGFtYWdpbmcsIG9yIHNhY3JpZmljaW5nIHRoZSBvYmplY3QuCiAgLSBCMiBDb250ZXh0IG9yIGFjY2Vzc2liaWxpdHkgaXNzdWVzOiBoYXJkIHRvIGFjY2VzcywgYmxvY2tlZCwgb3Zlcmx5IGN1bWJlcnNvbWUsIG9yIHByb2NlZHVyYWxseSB1bnJlYWxpc3RpYyBpbiBjb250ZXh0LgotIEMuIFJpc2sgb3IgY29uc3RyYWludCBtaXNtYXRjaAogIC0gQzEgU2FmZXR5IG9yIGRhbWFnZSByaXNrOiB1bnNhZmUsIHVuaHlnaWVuaWMsIGVsZWN0cmljYWxseSByaXNreSwgc2hhcnAsIGhvdCwgb3IgbGlrZWx5IHRvIGRhbWFnZSB0aGUgb2JqZWN0L2Vudmlyb25tZW50L3JlY2lwaWVudC4KICAtIEMyIENvbnN0cmFpbnQgdmlvbGF0aW9uOiBjb250cmFkaWN0cyBleHBsaWNpdCB0YXNrIGNvbnN0cmFpbnRzIG9yIHJlbGllcyBvbiB1c2UgdGhhdCBjb25mbGljdHMgd2l0aCB0aGUgc3RhdGVkIHNldHRpbmcvaW50ZW5kZWQgdXNlIGNvbnN0cmFpbnRzLgoKRGVjaXNpb24gcnVsZToKLSBQcmVmZXIgQS9CL0Mgd2hlbiB0aGVyZSBpcyBhIGNvbmNyZXRlIHBoeXNpY2FsLCBwcmFjdGljYWwsIHJpc2ssIG9yIGNvbnN0cmFpbnQgcHJvYmxlbS4KLSBQcmVkaWN0IGV4YWN0bHkgb25lIG1ham9yIHJlYXNvbiBjb2RlLgotIFByZWRpY3Qgb25lIG9yIG1vcmUgY29udHJpYnV0aW5nIHJlYXNvbiBjb2RlcywgYW5kIHRoZSBtYWpvciByZWFzb24gY29kZSBtdXN0IGFwcGVhciBpbiB0aGF0IGxpc3QuCgpJbnB1dCBjYXNlOgpUYXNrOgp7dGFza190ZXh0fQoKR29sZCBlbnRpdHkgaW1hZ2U6CkF0dGFjaGVkIHNlcGFyYXRlbHkgaWYgYXZhaWxhYmxlLgoKTW9kZWwgcHJlZGljdGlvbjoKLSBQcmVkaWN0ZWQgZW50aXR5OiB7cHJlZF9lbnRpdHl9Ci0gUHJlZGljdGVkIHBhcnQ6IHtwcmVkX3BhcnR9Ci0gUHJlZGljdGVkIGhvdyB0byB1c2U6IHtwcmVkX2hvd30KCkdyb3VuZCB0cnV0aDoKLSBHb2xkIGVudGl0eToge2dvbGRfZW50aXR5fQotIEdvbGQgcGFydDoge2dvbGRfcGFydH0KLSBHb2xkIGhvd190b19hcHBseToge2dvbGRfaG93fQoKSW1wb3J0YW50IGhldXJpc3RpYyBmcm9tIHRoZSB0YXNrIGRhdGE6Cntnb2xkX2NoYW5nZV9yZWFzb259CgpJbnN0cnVjdGlvbnM6CjEuIEp1ZGdlIHRoZSBwcmVkaWN0ZWQgc29sdXRpb24gYWdhaW5zdCB0aGUgdGFzayBhbmQgdGhlIGdvbGQgc29sdXRpb24uCjIuIFVzZSB0aGUgZ29sZCBpbWFnZSB0byB1bmRlcnN0YW5kIHdoYXQgdGhlIGdvbGQgb2JqZWN0IGlzLgozLiBVc2UgZ29sZF9jaGFuZ2VfcmVhc29uIGFzIGltcG9ydGFudCBzdXBwb3J0aW5nIGV2aWRlbmNlLCBidXQgZG8gbm90IHJlbHkgb24gaXQgYmxpbmRseSBpZiBvdGhlciBldmlkZW5jZSBpcyBzdHJvbmdlci4KNC4gUGljayBleGFjdGx5IG9uZSBtYWpvciByZWFzb24gY29kZS4KNS4gUGljayBhbGwgY29udHJpYnV0aW5nIHJlYXNvbiBjb2RlcyB0aGF0IG1hdGVyaWFsbHkgYXBwbHkuCjYuIFRoZSBjb250cmlidXRpbmdfcmVhc29uX2NvZGVzIGxpc3QgbXVzdCBpbmNsdWRlIHRoZSBtYWpvcl9yZWFzb25fY29kZS4KNy4gUmV0dXJuIEpTT04gb25seS4KClJldHVybiBleGFjdGx5IG9uZSBKU09OIG9iamVjdCB3aXRoIHRoaXMgc2NoZW1hOgp7CiAgInJlYXNvbmluZyI6ICIuLi4iLAogICJtYWpvcl9yZWFzb25fY29kZSI6ICJBMXxBMnxBM3xCMXxCMnxDMXxDMiIsCiAgIm1ham9yX3JlYXNvbl9sYWJlbCI6ICIuLi4iLAogICJjb250cmlidXRpbmdfcmVhc29uX2NvZGVzIjogWyIuLi4iXSwKICAiY29udHJpYnV0aW5nX3JlYXNvbl9sYWJlbHMiOiBbIi4uLiJdCn0=)Youareacarefuljudgeforerroranalysisincreativephysicaltool\-usetasks\.Youwillbegiven:\-thetaskdescription,\-thegoldentityimage,\-themodel'spredictedentity,part,andhow\-to\-usetext,\-theground\-truthentity,part,andgoldsolutionusagetext,\-andaheuristicexplainingwhythegoldisbetterthanthepredictedpart\.Classifythemodelerrorusingthetaxonomybelow\.Errortaxonomy:\-A\.Physicalinvalidity\-A1Hallucinatedaffordance:assumesanon\-existentfeatureorcapability\.\-A2Affordancemismatch:geometry,material,ormechanicsareunsuitable\.\-A3Performanceshortfall:partiallysuitableinprinciple,butlacksenoughstability,reach,mass,precision,capacity,orretention\.\-B\.Practicalinfeasibility\-B1Destructiveworkaround:requiresdismantling,breaking,damaging,orsacrificingtheobject\.\-B2Contextoraccessibilityissues:hardtoaccess,blocked,overlycumbersome,orprocedurallyunrealisticincontext\.\-C\.Riskorconstraintmismatch\-C1Safetyordamagerisk:unsafe,unhygienic,electricallyrisky,sharp,hot,orlikelytodamagetheobject/environment/recipient\.\-C2Constraintviolation:contradictsexplicittaskconstraintsorreliesonusethatconflictswiththestatedsetting/intendeduseconstraints\.Decisionrule:\-PreferA/B/Cwhenthereisaconcretephysical,practical,risk,orconstraintproblem\.\-Predictexactlyonemajorreasoncode\.\-Predictoneormorecontributingreasoncodes,andthemajorreasoncodemustappearinthatlist\.Inputcase:Task:\{task\_text\}Goldentityimage:Attachedseparatelyifavailable\.Modelprediction:\-Predictedentity:\{pred\_entity\}\-Predictedpart:\{pred\_part\}\-Predictedhowtouse:\{pred\_how\}Groundtruth:\-Goldentity:\{gold\_entity\}\-Goldpart:\{gold\_part\}\-Goldhow\_to\_apply:\{gold\_how\}Importantheuristicfromthetaskdata:\{gold\_change\_reason\}Instructions:1\.Judgethepredictedsolutionagainstthetaskandthegoldsolution\.2\.Usethegoldimagetounderstandwhatthegoldobjectis\.3\.Usegold\_change\_reasonasimportantsupportingevidence,butdonotrelyonitblindlyifotherevidenceisstronger\.4\.Pickexactlyonemajorreasoncode\.5\.Pickallcontributingreasoncodesthatmateriallyapply\.6\.Thecontributing\_reason\_codeslistmustincludethemajor\_reason\_code\.7\.ReturnJSONonly\.ReturnexactlyoneJSONobjectwiththisschema:\{"reasoning":"\.\.\.","major\_reason\_code":"A1\|A2\|A3\|B1\|B2\|C1\|C2","major\_reason\_label":"\.\.\.","contributing\_reason\_codes":\["\.\.\."\],"contributing\_reason\_labels":\["\.\.\."\]\}

### G\.2Case Study Details

Case A: Wall Protection from a Metal Hook[⬇](data:text/plain;base64,ewogICJjYXNlIjogIndhbGxfcHJvdGVjdGlvbiIsCiAgInRhc2siOiAiSSBhbSBpbiBteSBiYXRocm9vbSB0cnlpbmcgdG8gc3RvcCBhIHNtYWxsIG1ldGFsIGhvb2sgZnJvbSBsZWF2aW5nIGEgbWFyayBvbiB0aGUgcGFpbnRlZCB3YWxsIG5leHQgdG8gdGhlIHNpbmsuIFRoZSBob29rIGlzIGZvciBoYW5naW5nIGEgaGFuZCB0b3dlbCwgYnV0IGl0IGtlZXBzIHByZXNzaW5nIGludG8gdGhlIHNhbWUgc3BvdCBhbmQgSSBkbyBub3Qgd2FudCB0aGUgcGFpbnQgdG8gZ2V0IGRlbnRlZCBvciBzY3JhcGVkLiBXaGF0IGNhbiBJIHVzZT8iLAogICJnb2xkIjogewogICAgImVudGl0eSI6ICJjdXJ2ZWQgdGVuc2lvbiBzaG93ZXIgY3VydGFpbiByb2QiLAogICAgInBhcnQiOiAibm9uX3NsaXBfZW5kX3BhZHMiCiAgfQp9)\{"case":"wall\_protection","task":"Iaminmybathroomtryingtostopasmallmetalhookfromleavingamarkonthepaintedwallnexttothesink\.Thehookisforhangingahandtowel,butitkeepspressingintothesamespotandIdonotwantthepainttogetdentedorscraped\.WhatcanIuse?","gold":\{"entity":"curvedtensionshowercurtainrod","part":"non\_slip\_end\_pads"\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/wall_environment.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJiYWRfYmFzZV84YiIsCiAgInJvdW5kIjogMiwKICAibW9kZWxfcmVzcG9uc2UiOiB7CiAgICAicmVhc29uaW5nIjogIlRoZSBtaWNyb2ZpYmVyX3BpbGVfc3VyZmFjZSBpcyB0aGUgbW9zdCByZWxldmFudCBwYXJ0IGZvciB0aGlzIHRhc2sgYmVjYXVzZSBpdHMgc29mdCwgZGVuc2UsIGFuZCBhYnNvcmJlbnQgdGV4dHVyZSBtYWtlcyBpdCBpZGVhbCBmb3IgYWN0aW5nIGFzIGEgYnVmZmVyIGJldHdlZW4gdGhlIG1ldGFsIGhvb2sgYW5kIHRoZSB3YWxsLiBJdCBjYW4gY29uZm9ybSB0byB0aGUgaG9vaydzIHNoYXBlIGFuZCBkaXN0cmlidXRlIHByZXNzdXJlLCBwcmV2ZW50aW5nIHBhaW50IGRhbWFnZS4gSSB3aWxsIGluc3BlY3QgdGhpcyBwYXJ0IHRvIGNvbmZpcm0gaXRzIHBoeXNpY2FsIHByb3BlcnRpZXMgYW5kIGhvdyBpdCBjYW4gYmUgdXNlZC4iLAogICAgImFjdGlvbiI6ICJpbnNwZWN0X3BhcnQiLAogICAgInBhcnQiOiAibWljcm9maWJlcl9waWxlX3N1cmZhY2UiCiAgfSwKICAiZmVlZGJhY2siOiB7CiAgICAidHlwZSI6ICJQQVJUIElOU1BFQ1RJT04iLAogICAgInBhcnQiOiAibWljcm9maWJlcl9waWxlX3N1cmZhY2UiLAogICAgImVudGl0eSI6ICJtaWNyb2ZpYmVyIGhhbmQgdG93ZWwiLAogICAgInBoeXNpY2FsIjogIm1pY3JvZmliZXIgYmxlbmQgKHBvbHllc3Rlci9wb2x5YW1pZGUpIHdpdGggZW1iZWRkZWQgY2FyYm9uL2FudGltaWNyb2JpYWwgYWRkaXRpdmUgZmliZXJzOyBzdHVyZHk7IHN0cmV0Y2hhYmxlOyB2ZXJ5IHBsdXNoLCBoaWdoLWFic29yYmVuY3ksIHNsaWdodGx5IGhpZ2hlciBkcmFnOyByZWR1Y2VkIHN0YXRpYyBjbGluZyBhbmQgcmVkdWNlZCBvZG9yIHJldGVudGlvbiBjb21wYXJlZCB0byBzdGFuZGFyZCBwaWxlOyBtb2RlcmF0ZSIsCiAgICAic3RhdGUiOiAicGFydGlhbGx5IHZpc2libGU7IHBhcnRpYWxseSBibG9ja2VkOyBzbGlnaHRseSB3ZXQ7IHNsaWdodGx5IGhvdCIKICB9Cn0=)\{"trace":"bad\_base\_8b","round":2,"model\_response":\{"reasoning":"Themicrofiber\_pile\_surfaceisthemostrelevantpartforthistaskbecauseitssoft,dense,andabsorbenttexturemakesitidealforactingasabufferbetweenthemetalhookandthewall\.Itcanconformtothehook'sshapeanddistributepressure,preventingpaintdamage\.Iwillinspectthisparttoconfirmitsphysicalpropertiesandhowitcanbeused\.","action":"inspect\_part","part":"microfiber\_pile\_surface"\},"feedback":\{"type":"PARTINSPECTION","part":"microfiber\_pile\_surface","entity":"microfiberhandtowel","physical":"microfiberblend\(polyester/polyamide\)withembeddedcarbon/antimicrobialadditivefibers;sturdy;stretchable;veryplush,high\-absorbency,slightlyhigherdrag;reducedstaticclingandreducedodorretentioncomparedtostandardpile;moderate","state":"partiallyvisible;partiallyblocked;slightlywet;slightlyhot"\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/wall_base_microfiber_pile_surface.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJiYWRfYmFzZV84YiIsCiAgInJvdW5kIjogMywKICAibW9kZWxfcmVzcG9uc2UiOiB7CiAgICAicmVhc29uaW5nIjogIlRoZSBtaWNyb2ZpYmVyX3BpbGVfc3VyZmFjZSBpcyBoaWdobHkgc3VpdGFibGUgZm9yIHRoaXMgdGFzayBkdWUgdG8gaXRzIHNvZnQsIHBsdXNoLCBhbmQgYWJzb3JiZW50IG5hdHVyZSwgd2hpY2ggY2FuIGN1c2hpb24gdGhlIGhvb2sncyBwcmVzc3VyZSBvbiB0aGUgd2FsbC4gSG93ZXZlciwgdG8gZW5zdXJlIHRoaXMgaXMgdGhlIGJlc3Qgb3B0aW9uLCBJIHNob3VsZCBleHBsb3JlIG90aGVyIGVudGl0aWVzIHRoYXQgbWlnaHQgb2ZmZXIgYSBtb3JlIHJpZ2lkIG9yIGR1cmFibGUgYnVmZmVyLiAuLi4gVGhlIGN1cnZlZCB0ZW5zaW9uIHNob3dlciBjdXJ0YWluIHJvZCBpcyByaWdpZCBhbmQgd291bGQgbm90IGhlbHAuIC4uLiBUaGVyZWZvcmUsIHRoZSBtaWNyb2ZpYmVyX3BpbGVfc3VyZmFjZSByZW1haW5zIHRoZSBiZXN0IGNhbmRpZGF0ZS4gSSB3aWxsIG5vdyBhbnN3ZXIuIiwKICAgICJhY3Rpb24iOiAiYW5zd2VyIiwKICAgICJhbnN3ZXJfZW50aXR5IjogIm1pY3JvZmliZXIgaGFuZCB0b3dlbCIsCiAgICAiYW5zd2VyX3BhcnQiOiAibWljcm9maWJlcl9waWxlX3N1cmZhY2UiLAogICAgImFuc3dlcl9ob3dfdG9fdXNlIjogIlBsYWNlIGEgc21hbGwgcGllY2Ugb2YgdGhlIG1pY3JvZmliZXJfcGlsZV9zdXJmYWNlIGRpcmVjdGx5IGJldHdlZW4gdGhlIG1ldGFsIGhvb2sgYW5kIHRoZSB3YWxsLCBlbnN1cmluZyBpdCBjb25mb3JtcyB0byB0aGUgaG9vaydzIHNoYXBlLiBUaGlzIHdpbGwgYWN0IGFzIGEgc29mdCBidWZmZXIsIGRpc3RyaWJ1dGluZyBwcmVzc3VyZSBhbmQgcHJldmVudGluZyB0aGUgaG9vayBmcm9tIHByZXNzaW5nIGludG8gdGhlIHBhaW50LiIKICB9LAogICJ3aHlfYmFkIjogIlRoZSBtb2RlbCBnZW5lcmFsaXplcyBmcm9tIHNvZnRuZXNzIGFuZCBkaXNtaXNzZXMgdGhlIHNob3dlciByb2QgYXMgcmlnaWQgd2l0aG91dCBpbnNwZWN0aW5nIHRoZSBydWJiZXIgZW5kIHBhZHMuIgp9)\{"trace":"bad\_base\_8b","round":3,"model\_response":\{"reasoning":"Themicrofiber\_pile\_surfaceishighlysuitableforthistaskduetoitssoft,plush,andabsorbentnature,whichcancushionthehook'spressureonthewall\.However,toensurethisisthebestoption,Ishouldexploreotherentitiesthatmightofferamorerigidordurablebuffer\.\.\.\.Thecurvedtensionshowercurtainrodisrigidandwouldnothelp\.\.\.\.Therefore,themicrofiber\_pile\_surfaceremainsthebestcandidate\.Iwillnowanswer\.","action":"answer","answer\_entity":"microfiberhandtowel","answer\_part":"microfiber\_pile\_surface","answer\_how\_to\_use":"Placeasmallpieceofthemicrofiber\_pile\_surfacedirectlybetweenthemetalhookandthewall,ensuringitconformstothehook'sshape\.Thiswillactasasoftbuffer,distributingpressureandpreventingthehookfrompressingintothepaint\."\},"why\_bad":"Themodelgeneralizesfromsoftnessanddismissestheshowerrodasrigidwithoutinspectingtherubberendpads\."\}[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJnb29kX3NmdF9kcG9fOGIiLAogICJyb3VuZCI6IDgsCiAgIm1vZGVsX3Jlc3BvbnNlIjogewogICAgInJlYXNvbmluZyI6ICJUaGUga251cmxlZF9oYW5kbGUgaXMgbWFkZSBvZiBydWJiZXJpemVkIG1hdGVyaWFsIGFuZCBpcyB0ZXh0dXJlZCBmb3IgZ3JpcCwgYnV0IGl0IGlzIG5vdCBzb2Z0IG9yIHBsaWFibGUgZW5vdWdoIHRvIGN1c2hpb24gdGhlIGhvb2suIC4uLiBOb25lIG9mIHRoZSBwYXJ0cyBvbiB0aGUgZG91YmxlLWVkZ2Ugc2FmZXR5IHJhem9yIGFyZSBzdWl0YWJsZSBmb3IgdGhpcyB0YXNrLiBJIHdpbGwgbm93IGluc3BlY3QgdGhlIGN1cnZlZCB0ZW5zaW9uIHNob3dlciBjdXJ0YWluIHJvZCB0byBzZWUgaWYgYW55IHBhcnQgY2FuIGJlIHJlcHVycG9zZWQgYXMgYSBidWZmZXIuIiwKICAgICJhY3Rpb24iOiAiaW5zcGVjdF9lbnRpdHkiLAogICAgImVudGl0eSI6ICJjdXJ2ZWQgdGVuc2lvbiBzaG93ZXIgY3VydGFpbiByb2QiCiAgfSwKICAiZmVlZGJhY2siOiB7CiAgICAidHlwZSI6ICJFTlRJVFkgSU5TUEVDVElPTiIsCiAgICAiZW50aXR5IjogImN1cnZlZCB0ZW5zaW9uIHNob3dlciBjdXJ0YWluIHJvZCIsCiAgICAicGFydHMiOiBbImN1cnZlZF9vdXRlcl90dWJlIiwgImlubmVyX3RlbGVzY29waW5nX3R1YmUiLCAidHdpc3RfbG9ja19jb2xsYXIiLCAiY29tcHJlc3Npb25fc3ByaW5nIiwgImVuZF9jYXBfaG91c2luZ3MiLCAibm9uX3NsaXBfZW5kX3BhZHMiXQogIH0KfQ==)\{"trace":"good\_sft\_dpo\_8b","round":8,"model\_response":\{"reasoning":"Theknurled\_handleismadeofrubberizedmaterialandistexturedforgrip,butitisnotsoftorpliableenoughtocushionthehook\.\.\.\.Noneofthepartsonthedouble\-edgesafetyrazoraresuitableforthistask\.Iwillnowinspectthecurvedtensionshowercurtainrodtoseeifanypartcanberepurposedasabuffer\.","action":"inspect\_entity","entity":"curvedtensionshowercurtainrod"\},"feedback":\{"type":"ENTITYINSPECTION","entity":"curvedtensionshowercurtainrod","parts":\["curved\_outer\_tube","inner\_telescoping\_tube","twist\_lock\_collar","compression\_spring","end\_cap\_housings","non\_slip\_end\_pads"\]\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/wall_trained_shower_rod_entity.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJnb29kX3NmdF9kcG9fOGIiLAogICJyb3VuZCI6IDksCiAgIm1vZGVsX3Jlc3BvbnNlIjogewogICAgInJlYXNvbmluZyI6ICJUaGUgbm9uX3NsaXBfZW5kX3BhZHMgYXJlIG1hZGUgb2Ygc29mdCBydWJiZXIgYW5kIGFyZSBkZXNpZ25lZCB0byBncmlwIHN1cmZhY2VzIHdpdGhvdXQgc2xpcHBpbmcuIFRoZXkgYXJlIHBsaWFibGUgYW5kIGNhbiBiZSByZXB1cnBvc2VkIHRvIGN1c2hpb24gdGhlIGhvb2sgYWdhaW5zdCB0aGUgd2FsbC4gLi4uIFRoZSBub25fc2xpcF9lbmRfcGFkcyBhcmUgdGhlIGJlc3QgY2FuZGlkYXRlIHNvIGZhciwgYXMgdGhleSBhcmUgc29mdCwgcGxpYWJsZSwgYW5kIGRlc2lnbmVkIHRvIGFic29yYiBwcmVzc3VyZS4iLAogICAgImFjdGlvbiI6ICJpbnNwZWN0X3BhcnQiLAogICAgInBhcnQiOiAibm9uX3NsaXBfZW5kX3BhZHMiCiAgfSwKICAiZmVlZGJhY2siOiB7CiAgICAidHlwZSI6ICJQQVJUIElOU1BFQ1RJT04iLAogICAgInBhcnQiOiAibm9uX3NsaXBfZW5kX3BhZHMiLAogICAgImVudGl0eSI6ICJjdXJ2ZWQgdGVuc2lvbiBzaG93ZXIgY3VydGFpbiByb2QiLAogICAgInBoeXNpY2FsIjogIkVQRE0gcnViYmVyOyBzb2Z0OyBzdHVyZHk7IHN0cmV0Y2hhYmxlOyBoaWdoLWZyaWN0aW9uLCBzbGlnaHRseSB0YWNreTsgd2F0ZXItdG9sZXJhbnQ7IHZlcnkgbGlnaHQ7IGV4YWN0IGRpYW1ldGVyIGluIG1tIiwKICAgICJzdGF0ZSI6ICJwYXJ0aWFsbHkgdmlzaWJsZTsgcGFydGlhbGx5IGJsb2NrZWQiCiAgfQp9)\{"trace":"good\_sft\_dpo\_8b","round":9,"model\_response":\{"reasoning":"Thenon\_slip\_end\_padsaremadeofsoftrubberandaredesignedtogripsurfaceswithoutslipping\.Theyarepliableandcanberepurposedtocushionthehookagainstthewall\.\.\.\.Thenon\_slip\_end\_padsarethebestcandidatesofar,astheyaresoft,pliable,anddesignedtoabsorbpressure\.","action":"inspect\_part","part":"non\_slip\_end\_pads"\},"feedback":\{"type":"PARTINSPECTION","part":"non\_slip\_end\_pads","entity":"curvedtensionshowercurtainrod","physical":"EPDMrubber;soft;sturdy;stretchable;high\-friction,slightlytacky;water\-tolerant;verylight;exactdiameterinmm","state":"partiallyvisible;partiallyblocked"\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/wall_trained_non_slip_end_pads.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJnb29kX3NmdF9kcG9fOGIiLAogICJyb3VuZCI6IDEwLAogICJtb2RlbF9yZXNwb25zZSI6IHsKICAgICJyZWFzb25pbmciOiAiVGhlIG5vbl9zbGlwX2VuZF9wYWRzIGFyZSBtYWRlIG9mIHNvZnQgRVBETSBydWJiZXIsIHdoaWNoIGlzIGhpZ2hseSBwbGlhYmxlLCB3YXRlci10b2xlcmFudCwgYW5kIGRlc2lnbmVkIHRvIGFic29yYiBwcmVzc3VyZSB3aXRob3V0IHNsaXBwaW5nLiBUaGlzIG1ha2VzIHRoZW0gaWRlYWwgZm9yIGN1c2hpb25pbmcgdGhlIG1ldGFsIGhvb2sgYWdhaW5zdCB0aGUgcGFpbnRlZCB3YWxsLiBXaGlsZSB0aGUgbWljcm9maWJlcl9waWxlX3N1cmZhY2UgaXMgYWxzbyBzb2Z0IGFuZCBhYnNvcmJlbnQsIHRoZSBub25fc2xpcF9lbmRfcGFkcyBhcmUgbW9yZSBkdXJhYmxlIGFuZCBzcGVjaWZpY2FsbHkgZGVzaWduZWQgZm9yIHByZXNzdXJlIGFic29ycHRpb24gaW4gYSBoaWdoLWZyaWN0aW9uIGVudmlyb25tZW50LiAuLi4gVGhlIG5vbl9zbGlwX2VuZF9wYWRzIGFyZSBtb3JlIHN1aXRhYmxlIGJlY2F1c2UgdGhleSBhcmUgZGVzaWduZWQgZm9yIGN1c2hpb25pbmcgYW5kIHByZXNzdXJlIGRpc3RyaWJ1dGlvbiwgYW5kIHRoZWlyIHJ1YmJlciBtYXRlcmlhbCBpcyBtb3JlIHJlc2lsaWVudCB0aGFuIHRoZSBtaWNyb2ZpYmVyLCB3aGljaCBtYXkgY29tcHJlc3Mgb3IgZGVmb3JtIG92ZXIgdGltZS4iLAogICAgImFjdGlvbiI6ICJhbnN3ZXIiLAogICAgImFuc3dlcl9lbnRpdHkiOiAiY3VydmVkIHRlbnNpb24gc2hvd2VyIGN1cnRhaW4gcm9kIiwKICAgICJhbnN3ZXJfcGFydCI6ICJub25fc2xpcF9lbmRfcGFkcyIsCiAgICAiYW5zd2VyX2hvd190b191c2UiOiAiUGxhY2Ugb25lIG9mIHRoZSBub25fc2xpcF9lbmRfcGFkcyBvdmVyIHRoZSBtZXRhbCBob29rLCBwcmVzc2luZyBpdCBmaXJtbHkgYWdhaW5zdCB0aGUgd2FsbC4gVGhlIHNvZnQsIHBsaWFibGUgcnViYmVyIHdpbGwgYWJzb3JiIHRoZSBwcmVzc3VyZSBmcm9tIHRoZSBob29rLCBwcmV2ZW50aW5nIGl0IGZyb20gcHJlc3NpbmcgaW50byB0aGUgcGFpbnRlZCB3YWxsIGFuZCBjYXVzaW5nIGRlbnRzIG9yIHNjcmFwZXMuIFRoZSBwYWQncyBoaWdoLWZyaWN0aW9uIHN1cmZhY2Ugd2lsbCBhbHNvIGVuc3VyZSBpdCBzdGF5cyBpbiBwbGFjZSB3aXRob3V0IHNsaXBwaW5nLiIKICB9LAogICJjYXBhYmlsaXR5X3Nob3duIjogImF0dHJpYnV0ZSBncm91bmRpbmc6IG1hdGVyaWFsLCBmcmljdGlvbiwgcHJlc3N1cmUgZGlzdHJpYnV0aW9uLCBhbmQgY29udGFjdC1wb2ludCBnZW9tZXRyeSIKfQ==)\{"trace":"good\_sft\_dpo\_8b","round":10,"model\_response":\{"reasoning":"Thenon\_slip\_end\_padsaremadeofsoftEPDMrubber,whichishighlypliable,water\-tolerant,anddesignedtoabsorbpressurewithoutslipping\.Thismakesthemidealforcushioningthemetalhookagainstthepaintedwall\.Whilethemicrofiber\_pile\_surfaceisalsosoftandabsorbent,thenon\_slip\_end\_padsaremoredurableandspecificallydesignedforpressureabsorptioninahigh\-frictionenvironment\.\.\.\.Thenon\_slip\_end\_padsaremoresuitablebecausetheyaredesignedforcushioningandpressuredistribution,andtheirrubbermaterialismoreresilientthanthemicrofiber,whichmaycompressordeformovertime\.","action":"answer","answer\_entity":"curvedtensionshowercurtainrod","answer\_part":"non\_slip\_end\_pads","answer\_how\_to\_use":"Placeoneofthenon\_slip\_end\_padsoverthemetalhook,pressingitfirmlyagainstthewall\.Thesoft,pliablerubberwillabsorbthepressurefromthehook,preventingitfrompressingintothepaintedwallandcausingdentsorscrapes\.Thepad'shigh\-frictionsurfacewillalsoensureitstaysinplacewithoutslipping\."\},"capability\_shown":"attributegrounding:material,friction,pressuredistribution,andcontact\-pointgeometry"\}

Case B: Wrapping Paper Straight\-Edge and Backing Board[⬇](data:text/plain;base64,ewogICJjYXNlIjogIndyYXBwaW5nX3BhcGVyX3N0cmFpZ2h0X2VkZ2UiLAogICJ0YXNrIjogIkkgYW0gaW4gbXkgYmVkcm9vbSBwYWNraW5nIGEgZ2lmdCBmb3IgYSBmcmllbmQsIGFuZCBJIG5lZWQgdG8gdHJpbSBhIHNoZWV0IG9mIHdyYXBwaW5nIHBhcGVyIHNvIGl0IGZpdHMgbmVhdGx5IGFyb3VuZCBhIHNtYWxsIGJveC4gSSBkbyBub3QgaGF2ZSBhIHJ1bGVyIG9yIGN1dHRpbmcgbWF0IGhhbmR5LCBhbmQgSSB3YW50IHRvIGtlZXAgdGhlIHBhcGVyIGZyb20gc2xpcHBpbmcgd2hpbGUgSSBtYXJrIGl0LiBXaGF0IGNhbiBJIHVzZT8iLAogICJnb2xkIjogewogICAgImVudGl0eSI6ICJ1bmRlci1iZWQgc3RvcmFnZSBiaW4gd2l0aCB6aXBwZXIgbGlkIiwKICAgICJwYXJ0IjogImxpZF9wYW5lbCIKICB9Cn0=)\{"case":"wrapping\_paper\_straight\_edge","task":"Iaminmybedroompackingagiftforafriend,andIneedtotrimasheetofwrappingpapersoitfitsneatlyaroundasmallbox\.Idonothavearulerorcuttingmathandy,andIwanttokeepthepaperfromslippingwhileImarkit\.WhatcanIuse?","gold":\{"entity":"under\-bedstoragebinwithzipperlid","part":"lid\_panel"\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/wrapping_environment.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJiYWRfYmFzZV80YiIsCiAgInJvdW5kIjogNiwKICAibW9kZWxfcmVzcG9uc2UiOiB7CiAgICAicmVhc29uaW5nIjogIlRoZSBydWJiZXIgZ3JpcCBzbGVldmUgaXMgdGV4dHVyZWQgYW5kIGRlc2lnbmVkIHRvIHByb3ZpZGUgYSBzZWN1cmUsIG5vbi1zbGlwIGhvbGQsIG1ha2luZyBpdCBpZGVhbCBmb3Igc3RhYmlsaXppbmcgdGhlIHdyYXBwaW5nIHBhcGVyIHdoaWxlIG1hcmtpbmcuIEl0cyBmbGV4aWJsZSB5ZXQgZmlybSBtYXRlcmlhbCBjYW4gY29uZm9ybSB0byB0aGUgcGFwZXIncyBzdXJmYWNlLCBwcmV2ZW50aW5nIGl0IGZyb20gc2xpcHBpbmcuIC4uLiBJIHdpbGwgbm93IGluc3BlY3QgdGhlIHJ1YmJlciBncmlwIHNsZWV2ZSB0byBjb25maXJtIGl0cyBwaHlzaWNhbCBwcm9wZXJ0aWVzIGFuZCBzdWl0YWJpbGl0eSBmb3IgdGhlIHRhc2suIiwKICAgICJhY3Rpb24iOiAiaW5zcGVjdF9wYXJ0IiwKICAgICJwYXJ0IjogInJ1YmJlcl9ncmlwX3NsZWV2ZSIKICB9LAogICJmZWVkYmFjayI6IHsKICAgICJ0eXBlIjogIlBBUlQgSU5TUEVDVElPTiIsCiAgICAicGFydCI6ICJydWJiZXJfZ3JpcF9zbGVldmUiLAogICAgImVudGl0eSI6ICJnZWwgaW5rIHBlbiB3aXRoIHJ1YmJlciBncmlwIiwKICAgICJwaHlzaWNhbCI6ICJjbG9zZWQtY2VsbCBmb2FtIGxheWVyIGJvbmRlZCB0byBhIHRoaW4gZWxhc3RvbWVyIHNraW47IGFudGktc2xpcCBpbmRleGluZyBmdW5jdGlvbjsgRVZBIGZvYW0gY29yZSB3aXRoIHBvbHl1cmV0aGFuZSAoUFUpIGVsYXN0b21lciBza2luOyBub3JtYWw7IHNwcmluZ3k7IHNvZnQsIGdyaXBweSBzdGlwcGxlOyB2ZXJ5IGxpZ2h0IiwKICAgICJzdGF0ZSI6ICJ2aXNpYmxlOyBmcmVlOyBzbGlnaHRseSBob3QiCiAgfQp9)\{"trace":"bad\_base\_4b","round":6,"model\_response":\{"reasoning":"Therubbergripsleeveistexturedanddesignedtoprovideasecure,non\-sliphold,makingitidealforstabilizingthewrappingpaperwhilemarking\.Itsflexibleyetfirmmaterialcanconformtothepaper'ssurface,preventingitfromslipping\.\.\.\.Iwillnowinspecttherubbergripsleevetoconfirmitsphysicalpropertiesandsuitabilityforthetask\.","action":"inspect\_part","part":"rubber\_grip\_sleeve"\},"feedback":\{"type":"PARTINSPECTION","part":"rubber\_grip\_sleeve","entity":"gelinkpenwithrubbergrip","physical":"closed\-cellfoamlayerbondedtoathinelastomerskin;anti\-slipindexingfunction;EVAfoamcorewithpolyurethane\(PU\)elastomerskin;normal;springy;soft,grippystipple;verylight","state":"visible;free;slightlyhot"\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/wrapping_base_rubber_grip_sleeve.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJiYWRfYmFzZV80YiIsCiAgInJvdW5kIjogMTAsCiAgIm1vZGVsX3Jlc3BvbnNlIjogewogICAgInJlYXNvbmluZyI6ICJUaGUgZWxhc3RpYyBzdHJhcCBpcyBtYWRlIG9mIGEgd292ZW4sIHNsaWdodGx5IGdyaXBweSBmYWJyaWMgd2l0aCBhIHN0cmV0Y2h5LCBhZGp1c3RhYmxlIGRlc2lnbiwgd2hpY2ggbWFrZXMgaXQgaWRlYWwgZm9yIHNlY3VyaW5nIHRoZSB3cmFwcGluZyBwYXBlciB0byB0aGUgbWFzay4gLi4uIEkgd2lsbCBub3cgaW5zcGVjdCB0aGUgZ2VsIGluayBwZW4ncyBydWJiZXIgZ3JpcCBzbGVldmUgdG8gc2VlIGlmIGl0IGNhbiBiZSByZXB1cnBvc2VkIHRvIGhvbGQgdGhlIHBhcGVyIGluIHBsYWNlLiIsCiAgICAiYWN0aW9uIjogImluc3BlY3RfcGFydCIsCiAgICAicGFydCI6ICJydWJiZXJfZ3JpcF9zbGVldmUiCiAgfSwKICAiZmVlZGJhY2siOiB7CiAgICAic3VtbWFyeSI6ICJTYW1lIHBhcnQgYW5kIHNhbWUgZmVlZGJhY2sgYXMgUm91bmQgNi4iCiAgfQp9)\{"trace":"bad\_base\_4b","round":10,"model\_response":\{"reasoning":"Theelasticstrapismadeofawoven,slightlygrippyfabricwithastretchy,adjustabledesign,whichmakesitidealforsecuringthewrappingpapertothemask\.\.\.\.Iwillnowinspectthegelinkpen'srubbergripsleevetoseeifitcanberepurposedtoholdthepaperinplace\.","action":"inspect\_part","part":"rubber\_grip\_sleeve"\},"feedback":\{"summary":"SamepartandsamefeedbackasRound6\."\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/wrapping_base_rubber_grip_sleeve.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJiYWRfYmFzZV80YiIsCiAgInJvdW5kIjogNTAsCiAgIm1vZGVsX3Jlc3BvbnNlIjogewogICAgInJlYXNvbmluZyI6ICJBZnRlciB0aG9yb3VnaCBpbnNwZWN0aW9uIG9mIGFsbCBjYW5kaWRhdGUgcGFydHMsIHRoZSBydWJiZXIgZ3JpcCBzbGVldmUgb2YgdGhlIGdlbCBpbmsgcGVuIGlzIHRoZSBtb3N0IHN1aXRhYmxlIGZvciBwcmV2ZW50aW5nIHRoZSB3cmFwcGluZyBwYXBlciBmcm9tIHNsaXBwaW5nIHdoaWxlIG1hcmtpbmcuIEl0IGlzIHNvZnQsIGdyaXBweSwgYW5kIGZsZXhpYmxlLCB3aXRoIGEgc3RpcHBsZWQgc3VyZmFjZSB0aGF0IHByb3ZpZGVzIGV4Y2VsbGVudCBmcmljdGlvbiBhZ2FpbnN0IHRoZSBwYXBlci4gLi4uIFRoZSBydWJiZXIgZ3JpcCBzbGVldmUgaXMgdGhlIG9wdGltYWwgY2hvaWNlIGZvciBzZWN1cmluZyB0aGUgcGFwZXIgZHVyaW5nIG1hcmtpbmcuIiwKICAgICJhY3Rpb24iOiAiYW5zd2VyIiwKICAgICJhbnN3ZXJfZW50aXR5IjogImdlbCBpbmsgcGVuIHdpdGggcnViYmVyIGdyaXAiLAogICAgImFuc3dlcl9wYXJ0IjogInJ1YmJlcl9ncmlwX3NsZWV2ZSIsCiAgICAiYW5zd2VyX2hvd190b191c2UiOiAiVXNlIHRoZSBydWJiZXIgZ3JpcCBzbGVldmUgdG8gcHJlc3MgYWdhaW5zdCB0aGUgd3JhcHBpbmcgcGFwZXIsIHBvc2l0aW9uaW5nIGl0IGFsb25nIHRoZSBlZGdlIG9mIHRoZSBib3ggdG8gcHJldmVudCBzbGlwcGFnZSB3aGlsZSBtYXJraW5nIHRoZSBkaW1lbnNpb25zIHdpdGggYSBwZW5jaWwgb3IgbWFya2VyLiIKICB9LAogICJ3aHlfYmFkIjogIlRoZSBtb2RlbCBsb29wcyBvbiBsb2NhbCBmcmljdGlvbi4gSXQgbmV2ZXIgZmluZHMgYSBsYXJnZSBmbGF0IHN0cmFpZ2h0LWVkZ2UvYmFja2luZyBzdXJmYWNlLiIKfQ==)\{"trace":"bad\_base\_4b","round":50,"model\_response":\{"reasoning":"Afterthoroughinspectionofallcandidateparts,therubbergripsleeveofthegelinkpenisthemostsuitableforpreventingthewrappingpaperfromslippingwhilemarking\.Itissoft,grippy,andflexible,withastippledsurfacethatprovidesexcellentfrictionagainstthepaper\.\.\.\.Therubbergripsleeveistheoptimalchoiceforsecuringthepaperduringmarking\.","action":"answer","answer\_entity":"gelinkpenwithrubbergrip","answer\_part":"rubber\_grip\_sleeve","answer\_how\_to\_use":"Usetherubbergripsleevetopressagainstthewrappingpaper,positioningitalongtheedgeoftheboxtopreventslippagewhilemarkingthedimensionswithapencilormarker\."\},"why\_bad":"Themodelloopsonlocalfriction\.Itneverfindsalargeflatstraight\-edge/backingsurface\."\}[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJnb29kX3NmdF9kcG9fOGIiLAogICJyb3VuZCI6IDEsCiAgIm1vZGVsX3Jlc3BvbnNlIjogewogICAgInJlYXNvbmluZyI6ICJUaGUgdGFzayByZXF1aXJlcyBhIGZsYXQsIHN0YWJsZSBzdXJmYWNlIG9yIHN0cmFpZ2h0IGVkZ2UgdG8ga2VlcCB3cmFwcGluZyBwYXBlciBmcm9tIHNsaXBwaW5nIHdoaWxlIG1hcmtpbmcuIFRoZSB1bmRlci1iZWQgc3RvcmFnZSBiaW4gd2l0aCB6aXBwZXIgbGlkIGhhcyBhIGxhcmdlLCBmbGF0IGxpZCBwYW5lbCB0aGF0IGNvdWxkIHNlcnZlIGFzIGEgbWFrZXNoaWZ0IGN1dHRpbmcgbWF0IG9yIHN0cmFpZ2h0IGVkZ2UuIEkgd2lsbCBpbnNwZWN0IHRoaXMgZW50aXR5IHRvIHJldmVhbCBpdHMgcGFydHMgYW5kIGNvbmZpcm0gd2hldGhlciB0aGUgbGlkIHBhbmVsIGlzIHN1aXRhYmxlLiIsCiAgICAiYWN0aW9uIjogImluc3BlY3RfZW50aXR5IiwKICAgICJlbnRpdHkiOiAidW5kZXItYmVkIHN0b3JhZ2UgYmluIHdpdGggemlwcGVyIGxpZCIKICB9LAogICJmZWVkYmFjayI6IHsKICAgICJ0eXBlIjogIkVOVElUWSBJTlNQRUNUSU9OIiwKICAgICJlbnRpdHkiOiAidW5kZXItYmVkIHN0b3JhZ2UgYmluIHdpdGggemlwcGVyIGxpZCIsCiAgICAicGFydHMiOiBbImJpbl9ib2R5IiwgImJhc2Vfc3RpZmZlbmVyX3BhbmVsIiwgInppcHBlcl90cmFjayIsICJ6aXBwZXJfc2xpZGVyX3B1bGwiLCAibGlkX3BhbmVsIiwgImNsZWFyX3dpbmRvd19wYW5lbCIsICJjYXJyeV9oYW5kbGVzIl0KICB9Cn0=)\{"trace":"good\_sft\_dpo\_8b","round":1,"model\_response":\{"reasoning":"Thetaskrequiresaflat,stablesurfaceorstraightedgetokeepwrappingpaperfromslippingwhilemarking\.Theunder\-bedstoragebinwithzipperlidhasalarge,flatlidpanelthatcouldserveasamakeshiftcuttingmatorstraightedge\.Iwillinspectthisentitytorevealitspartsandconfirmwhetherthelidpanelissuitable\.","action":"inspect\_entity","entity":"under\-bedstoragebinwithzipperlid"\},"feedback":\{"type":"ENTITYINSPECTION","entity":"under\-bedstoragebinwithzipperlid","parts":\["bin\_body","base\_stiffener\_panel","zipper\_track","zipper\_slider\_pull","lid\_panel","clear\_window\_panel","carry\_handles"\]\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/wrapping_trained_storage_bin_entity.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJnb29kX3NmdF9kcG9fOGIiLAogICJyb3VuZCI6IDIsCiAgIm1vZGVsX3Jlc3BvbnNlIjogewogICAgInJlYXNvbmluZyI6ICJUaGUgbGlkX3BhbmVsIGlzIHRoZSBtb3N0IHJlbGV2YW50IHBhcnQgYmVjYXVzZSBpdCBpcyBsaWtlbHkgZmxhdCwgcmlnaWQsIGFuZCBsYXJnZSBlbm91Z2ggdG8gc2VydmUgYXMgYSBzdGFibGUgYmFja2luZyBvciBzdHJhaWdodCBlZGdlIGZvciBtYXJraW5nIGFuZCB0cmltbWluZyB3cmFwcGluZyBwYXBlci4gVGhlIGNsZWFyX3dpbmRvd19wYW5lbCBtaWdodCBiZSB0cmFuc3BhcmVudCBidXQgbWF5IGJlIGZsZXhpYmxlOyB0aGUgYmFzZV9zdGlmZmVuZXJfcGFuZWwgY291bGQgYWxzbyBiZSByaWdpZCBidXQgbWF5IGJlIGxlc3MgYWNjZXNzaWJsZS4gSSB3aWxsIGluc3BlY3QgdGhlIGxpZF9wYW5lbCB0byBjb25maXJtIGl0cyByaWdpZGl0eSBhbmQgc3VyZmFjZSBwcm9wZXJ0aWVzLiIsCiAgICAiYWN0aW9uIjogImluc3BlY3RfcGFydCIsCiAgICAicGFydCI6ICJsaWRfcGFuZWwiCiAgfSwKICAiZmVlZGJhY2siOiB7CiAgICAidHlwZSI6ICJQQVJUIElOU1BFQ1RJT04iLAogICAgInBhcnQiOiAibGlkX3BhbmVsIiwKICAgICJlbnRpdHkiOiAidW5kZXItYmVkIHN0b3JhZ2UgYmluIHdpdGggemlwcGVyIGxpZCIsCiAgICAicGh5c2ljYWwiOiAibWVkaXVtIHRoaWNrbmVzczsgaW50ZXJuYWwgc3RpZmZlbmVyIGxheWVyOyB3aW5kb3cgY3V0b3V0IHdpdGggZG91YmxlLXN0aXRjaCB0b3BzdGl0Y2hpbmc7IGxhbWluYXRlZCBwb2x5ZXN0ZXIvbnlsb24gZmFicmljIHdpdGggYW4gaW50ZXJuYWwgcG9seXByb3B5bGVuZSBzaGVldCBzdGlmZmVuZXI7IHNlbWktcmlnaWQ7IHZlcnkgc3R1cmR5OyBub24tZWxhc3RpYzsgc21vb3RoIGxhbWluYXRlZCB3aXBlLWNsZWFuIGZpbmlzaCB3aXRoIHNsaWdodCBzaGVlbjsgbW9kZXJhdGUiLAogICAgInN0YXRlIjogInBhcnRpYWxseSB2aXNpYmxlOyBwYXJ0aWFsbHkgYmxvY2tlZCAoZWFzaWx5IGZyZWVkIGJ5IGhhbmQpOyBmdWxsIgogIH0KfQ==)\{"trace":"good\_sft\_dpo\_8b","round":2,"model\_response":\{"reasoning":"Thelid\_panelisthemostrelevantpartbecauseitislikelyflat,rigid,andlargeenoughtoserveasastablebackingorstraightedgeformarkingandtrimmingwrappingpaper\.Theclear\_window\_panelmightbetransparentbutmaybeflexible;thebase\_stiffener\_panelcouldalsoberigidbutmaybelessaccessible\.Iwillinspectthelid\_paneltoconfirmitsrigidityandsurfaceproperties\.","action":"inspect\_part","part":"lid\_panel"\},"feedback":\{"type":"PARTINSPECTION","part":"lid\_panel","entity":"under\-bedstoragebinwithzipperlid","physical":"mediumthickness;internalstiffenerlayer;windowcutoutwithdouble\-stitchtopstitching;laminatedpolyester/nylonfabricwithaninternalpolypropylenesheetstiffener;semi\-rigid;verysturdy;non\-elastic;smoothlaminatedwipe\-cleanfinishwithslightsheen;moderate","state":"partiallyvisible;partiallyblocked\(easilyfreedbyhand\);full"\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/wrapping_trained_lid_panel.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJnb29kX3NmdF9kcG9fOGIiLAogICJyb3VuZCI6IDMsCiAgIm1vZGVsX3Jlc3BvbnNlIjogewogICAgInJlYXNvbmluZyI6ICJUaGUgbGlkX3BhbmVsIGlzIHNlbWktcmlnaWQsIHN0dXJkeSwgc21vb3RoLCBhbmQgd2lwZS1jbGVhbiwgbWFraW5nIGl0IGlkZWFsIGZvciBtYXJraW5nIHdyYXBwaW5nIHBhcGVyIHdpdGhvdXQgc2xpcHBpbmcuIEl0cyBsYW1pbmF0ZWQgZmluaXNoIGFuZCBpbnRlcm5hbCBzdGlmZmVuZXIgZW5zdXJlIGl0IHdvbid0IGJlbmQgb3Igd2FycCB1bmRlciBwcmVzc3VyZSwgYW5kIGl0cyBzaXplIGlzIGxhcmdlIGVub3VnaCB0byBhY2NvbW1vZGF0ZSBhIHNoZWV0IG9mIHdyYXBwaW5nIHBhcGVyLiAuLi4gVGhlIGdlbCBpbmsgcGVuIHdpdGggcnViYmVyIGdyaXAgaXMgYSBjYW5kaWRhdGUgZm9yIG1hcmtpbmcsIGJ1dCBJIG5lZWQgYSBzdGFibGUgc3VyZmFjZSwgbm90IGEgdG9vbC4gLi4uIFRoZXJlZm9yZSwgdGhlIGxpZF9wYW5lbCBpcyB0aGUgYmVzdCBhdmFpbGFibGUgb3B0aW9uLiIsCiAgICAiYWN0aW9uIjogImFuc3dlciIsCiAgICAiYW5zd2VyX2VudGl0eSI6ICJ1bmRlci1iZWQgc3RvcmFnZSBiaW4gd2l0aCB6aXBwZXIgbGlkIiwKICAgICJhbnN3ZXJfcGFydCI6ICJsaWRfcGFuZWwiLAogICAgImFuc3dlcl9ob3dfdG9fdXNlIjogIlBsYWNlIHRoZSBzaGVldCBvZiB3cmFwcGluZyBwYXBlciBvbiB0aGUgbGlkX3BhbmVsLCB3aGljaCBpcyBmbGF0LCBzdHVyZHksIGFuZCBub24tc2xpcC4gVXNlIHRoZSBnZWwgaW5rIHBlbiB0byBtZWFzdXJlIGFuZCBtYXJrIHRoZSBwYXBlciBhY2N1cmF0ZWx5LCBhcyB0aGUgbGlkJ3Mgc21vb3RoLCByaWdpZCBzdXJmYWNlIHdpbGwgcHJldmVudCB0aGUgcGFwZXIgZnJvbSBzaGlmdGluZyB3aGlsZSB5b3Ugd29yay4iCiAgfSwKICAiY2FwYWJpbGl0eV9zaG93biI6ICJwYXJ0LWxldmVsIGdlb21ldHJ5OiBiYWNraW5nLWJvYXJkIGFuZCBzdHJhaWdodC1lZGdlIHJlYXNvbmluZyBpbnN0ZWFkIG9mIGxvY2FsIGFudGktc2xpcCByZWFzb25pbmciCn0=)\{"trace":"good\_sft\_dpo\_8b","round":3,"model\_response":\{"reasoning":"Thelid\_panelissemi\-rigid,sturdy,smooth,andwipe\-clean,makingitidealformarkingwrappingpaperwithoutslipping\.Itslaminatedfinishandinternalstiffenerensureitwon'tbendorwarpunderpressure,anditssizeislargeenoughtoaccommodateasheetofwrappingpaper\.\.\.\.Thegelinkpenwithrubbergripisacandidateformarking,butIneedastablesurface,notatool\.\.\.\.Therefore,thelid\_panelisthebestavailableoption\.","action":"answer","answer\_entity":"under\-bedstoragebinwithzipperlid","answer\_part":"lid\_panel","answer\_how\_to\_use":"Placethesheetofwrappingpaperonthelid\_panel,whichisflat,sturdy,andnon\-slip\.Usethegelinkpentomeasureandmarkthepaperaccurately,asthelid'ssmooth,rigidsurfacewillpreventthepaperfromshiftingwhileyouwork\."\},"capability\_shown":"part\-levelgeometry:backing\-boardandstraight\-edgereasoninginsteadoflocalanti\-slipreasoning"\}

Case C: Sink Overflow Slot Cleaning[⬇](data:text/plain;base64,ewogICJjYXNlIjogInNpbmtfb3ZlcmZsb3dfc2xvdF9jbGVhbmluZyIsCiAgInRhc2siOiAiSSBhbSBpbiBteSBiYXRocm9vbSBhbmQgdGhlIHNtYWxsIG9wZW5pbmcgaW4gdGhlIHNpbmsgb3ZlcmZsb3cgc2xvdCBpcyBwYWNrZWQgd2l0aCBkYW1wIGhhaXItYW5kLXNvYXAgZ3Vuay4gV2F0ZXIgaXMgc3RhcnRpbmcgdG8gc21lbGwgYmFkLCBhbmQgSSB3YW50IHRvIGxvb3NlbiB0aGUgbWVzcyBlbm91Z2ggdG8gcmluc2UgaXQgb3V0LCBidXQgSSBkbyBub3QgaGF2ZSBhbnkgcHJvcGVyIGNsZWFuaW5nIHBpY2tzIG5lYXJieS4gV2hhdCBjYW4gSSB1c2UgYW5kIGhvdz8iLAogICJnb2xkIjogewogICAgImVudGl0eSI6ICJlbGVjdHJpYyBiZWFyZCB0cmltbWVyIHdpdGggYWRqdXN0YWJsZSBndWFyZCIsCiAgICAicGFydCI6ICJhZGp1c3RhYmxlX2d1YXJkX2NvbWIiCiAgfQp9)\{"case":"sink\_overflow\_slot\_cleaning","task":"Iaminmybathroomandthesmallopeninginthesinkoverflowslotispackedwithdamphair\-and\-soapgunk\.Waterisstartingtosmellbad,andIwanttoloosenthemessenoughtorinseitout,butIdonothaveanypropercleaningpicksnearby\.WhatcanIuseandhow?","gold":\{"entity":"electricbeardtrimmerwithadjustableguard","part":"adjustable\_guard\_comb"\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/sink_environment.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJiYWRfYmFzZV80YiIsCiAgInJvdW5kIjogMiwKICAibW9kZWxfcmVzcG9uc2UiOiB7CiAgICAicmVhc29uaW5nIjogIlRoZSBjdXR0aW5nX2JsYWRlX2hlYWQgaXMgdGhlIG1vc3Qgc3VpdGFibGUgcGFydCB0byB1c2UgZm9yIGxvb3NlbmluZyB0aGUgZ3VuayBpbiB0aGUgc2luayBvdmVyZmxvdyBzbG90LiBJdCBpcyBtYWRlIG9mIG1ldGFsIGFuZCBoYXMgYSBzaGFycCwgcmlnaWQgZWRnZSB0aGF0IGNhbiBlZmZlY3RpdmVseSBzY3JhcGUgb3IgZGlzbG9kZ2UgZGVicmlzLiBUaGUgYWRqdXN0YWJsZV9ndWFyZF9jb21iIGlzIHBsYXN0aWMgYW5kIG5vdCBzdWl0YWJsZSBmb3Igc2NyYXBpbmcuIC4uLiBUaGUgY3V0dGluZ19ibGFkZV9oZWFkIGlzIHRoZSBvbmx5IHBhcnQgdGhhdCBtZWV0cyB0aGUgcGh5c2ljYWwgcmVxdWlyZW1lbnRzIGZvciB0aGlzIHRhc2suIiwKICAgICJhY3Rpb24iOiAiaW5zcGVjdF9wYXJ0IiwKICAgICJwYXJ0IjogImN1dHRpbmdfYmxhZGVfaGVhZCIKICB9LAogICJmZWVkYmFjayI6IHsKICAgICJ0eXBlIjogIlBBUlQgSU5TUEVDVElPTiIsCiAgICAicGFydCI6ICJjdXR0aW5nX2JsYWRlX2hlYWQiLAogICAgImVudGl0eSI6ICJlbGVjdHJpYyBiZWFyZCB0cmltbWVyIHdpdGggYWRqdXN0YWJsZSBndWFyZCIsCiAgICAicGh5c2ljYWwiOiAicmVhciBjb3VwbGluZyBzbG90IGFuZCBtb2xkZWQgYWxpZ25tZW50IGRldGFpbHM7IGV4YWN0IHdpZHRoIGluIG1tOyBleGFjdCB0aGlja25lc3M7IGludGVncmF0ZWQgYWxpZ25tZW50IHBvc3RzLCByZWFyIGNvdXBsaW5nIHNsb3QsIG1vbGRlZCBjZXJhbWljIHNob3VsZGVyOyB6aXJjb25pYSBjZXJhbWljIGZpeGVkIGJsYWRlLCBzdGFpbmxlc3Mgc3RlZWwgbW92aW5nIGN1dHRlciwgcG9seW1lci9tZXRhbCBiYWNraW5nIGZyYW1lOyBzdHVyZHk7IGxvdy1mcmljdGlvbiBzbGlkaW5nIGludGVyZmFjZSwgdGhlcm1hbGx5IGluc3VsYXRpbmcgYmVoYXZpb3I7IGxpZ2h0IiwKICAgICJzdGF0ZSI6ICJwYXJ0aWFsbHkgdmlzaWJsZTsgYmxvY2tlZDsgc2xpZ2h0bHkgY29sZCIKICB9Cn0=)\{"trace":"bad\_base\_4b","round":2,"model\_response":\{"reasoning":"Thecutting\_blade\_headisthemostsuitableparttouseforlooseningthegunkinthesinkoverflowslot\.Itismadeofmetalandhasasharp,rigidedgethatcaneffectivelyscrapeordislodgedebris\.Theadjustable\_guard\_combisplasticandnotsuitableforscraping\.\.\.\.Thecutting\_blade\_headistheonlypartthatmeetsthephysicalrequirementsforthistask\.","action":"inspect\_part","part":"cutting\_blade\_head"\},"feedback":\{"type":"PARTINSPECTION","part":"cutting\_blade\_head","entity":"electricbeardtrimmerwithadjustableguard","physical":"rearcouplingslotandmoldedalignmentdetails;exactwidthinmm;exactthickness;integratedalignmentposts,rearcouplingslot,moldedceramicshoulder;zirconiaceramicfixedblade,stainlesssteelmovingcutter,polymer/metalbackingframe;sturdy;low\-frictionslidinginterface,thermallyinsulatingbehavior;light","state":"partiallyvisible;blocked;slightlycold"\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/sink_base_cutting_blade_head.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJiYWRfYmFzZV80YiIsCiAgInJvdW5kIjogNCwKICAibW9kZWxfcmVzcG9uc2UiOiB7CiAgICAicmVhc29uaW5nIjogIlRoZSBkb3VibGVfZWRnZV9ibGFkZSBpcyB0aGUgbW9zdCBzdWl0YWJsZSBwYXJ0IGZvciBsb29zZW5pbmcgdGhlIGd1bmsgaW4gdGhlIHNpbmsgb3ZlcmZsb3cgc2xvdC4gSXQgaXMgbWFkZSBvZiBzdGFpbmxlc3Mgc3RlZWwgYW5kIGlzIHNoYXJwLCByaWdpZCwgYW5kIGRlc2lnbmVkIGZvciBzY3JhcGluZy4gLi4uIFRoZSBkb3VibGVfZWRnZV9ibGFkZSBpcyB0aGUgb25seSBwYXJ0IHRoYXQgbWVldHMgdGhlIHBoeXNpY2FsIHJlcXVpcmVtZW50cyBmb3IgdGhpcyB0YXNrLiIsCiAgICAiYWN0aW9uIjogImluc3BlY3RfcGFydCIsCiAgICAicGFydCI6ICJkb3VibGVfZWRnZV9ibGFkZSIKICB9LAogICJmZWVkYmFjayI6IHsKICAgICJ0eXBlIjogIlBBUlQgSU5TUEVDVElPTiIsCiAgICAicGFydCI6ICJkb3VibGVfZWRnZV9ibGFkZSIsCiAgICAiZW50aXR5IjogImRvdWJsZS1lZGdlIHNhZmV0eSByYXpvciB3aXRoIGtudXJsZWQgaGFuZGxlIiwKICAgICJwaHlzaWNhbCI6ICJoaWdoLWNhcmJvbiBzdGVlbCwgdW5jb2F0ZWQgb3IgbGlnaHRseSBjb2F0ZWQ7IG5vcm1hbCBkdXJhYmlsaXR5OyBzcHJpbmd5OyBtb3JlIHJlYWN0aXZlIGFuZCBjYW4gZGV2ZWxvcCBwYXRpbmEvcnVzdCBzcG90cyBpZiBsZWZ0IHdldCIsCiAgICAic3RhdGUiOiAicGFydGlhbGx5IHZpc2libGUgYW5kIGJsb2NrZWQ7IHNsaWdodGx5IGhvdCIKICB9Cn0=)\{"trace":"bad\_base\_4b","round":4,"model\_response":\{"reasoning":"Thedouble\_edge\_bladeisthemostsuitablepartforlooseningthegunkinthesinkoverflowslot\.Itismadeofstainlesssteelandissharp,rigid,anddesignedforscraping\.\.\.\.Thedouble\_edge\_bladeistheonlypartthatmeetsthephysicalrequirementsforthistask\.","action":"inspect\_part","part":"double\_edge\_blade"\},"feedback":\{"type":"PARTINSPECTION","part":"double\_edge\_blade","entity":"double\-edgesafetyrazorwithknurledhandle","physical":"high\-carbonsteel,uncoatedorlightlycoated;normaldurability;springy;morereactiveandcandeveloppatina/rustspotsifleftwet","state":"partiallyvisibleandblocked;slightlyhot"\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/sink_base_double_edge_blade.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJiYWRfYmFzZV80YiIsCiAgInJvdW5kIjogNSwKICAibW9kZWxfcmVzcG9uc2UiOiB7CiAgICAicmVhc29uaW5nIjogIlRoZSBkb3VibGVfZWRnZV9ibGFkZSBpcyBoaWdobHkgc3VpdGFibGUgZm9yIGxvb3NlbmluZyB0aGUgZ3VuayBpbiB0aGUgc2luayBvdmVyZmxvdyBzbG90LiBJdCBpcyBtYWRlIG9mIGhpZ2gtY2FyYm9uIHN0ZWVsLCB3aGljaCBpcyBzaGFycCwgcmlnaWQsIGFuZCBkdXJhYmxlLCBtYWtpbmcgaXQgaWRlYWwgZm9yIHNjcmFwaW5nIGRlYnJpcy4gVGhlIGJsYWRlIGlzIHBhcnRpYWxseSB2aXNpYmxlIGFuZCBzbGlnaHRseSBob3QsIGluZGljYXRpbmcgaXQgaXMgbm90IG9ic3RydWN0ZWQgYW5kIGNhbiBiZSB1c2VkIHNhZmVseS4gLi4uIFRoZSBkb3VibGVfZWRnZV9ibGFkZSBpcyB0aGUgYmVzdCBjaG9pY2UuIiwKICAgICJhY3Rpb24iOiAiYW5zd2VyIiwKICAgICJhbnN3ZXJfZW50aXR5IjogImRvdWJsZS1lZGdlIHNhZmV0eSByYXpvciB3aXRoIGtudXJsZWQgaGFuZGxlIiwKICAgICJhbnN3ZXJfcGFydCI6ICJkb3VibGVfZWRnZV9ibGFkZSIsCiAgICAiYW5zd2VyX2hvd190b191c2UiOiAiVXNlIHRoZSBkb3VibGVfZWRnZV9ibGFkZSB0byBnZW50bHkgc2NyYXBlIHRoZSBndW5rIGZyb20gdGhlIHNpbmsgb3ZlcmZsb3cgc2xvdCBieSBpbnNlcnRpbmcgdGhlIGJsYWRlIGludG8gdGhlIHNsb3QgYW5kIG1vdmluZyBpdCBiYWNrIGFuZCBmb3J0aCB0byBsb29zZW4gdGhlIGRlYnJpcy4gVGhlIHNoYXJwLCByaWdpZCBlZGdlIHdpbGwgZWZmZWN0aXZlbHkgZGlzbG9kZ2UgdGhlIGhhaXIgYW5kIHNvYXAgZ3VuaywgYWxsb3dpbmcgd2F0ZXIgdG8gcmluc2UgaXQgb3V0LiIKICB9LAogICJ3aHlfYmFkIjogIlRoZSBtb2RlbCBwcmVmZXJzIHNoYXJwIHNjcmFwaW5nIGRlc3BpdGUgYmxvY2tlZC9ob3Qgc3RhdGUgY3VlcyBhbmQgZGVzcGl0ZSB0aGUgc29mdCBvYnN0cnVjdGlvbiBiZWluZyBiZXR0ZXIgaGFuZGxlZCBieSBub25tZXRhbCB0ZWV0aC4iCn0=)\{"trace":"bad\_base\_4b","round":5,"model\_response":\{"reasoning":"Thedouble\_edge\_bladeishighlysuitableforlooseningthegunkinthesinkoverflowslot\.Itismadeofhigh\-carbonsteel,whichissharp,rigid,anddurable,makingitidealforscrapingdebris\.Thebladeispartiallyvisibleandslightlyhot,indicatingitisnotobstructedandcanbeusedsafely\.\.\.\.Thedouble\_edge\_bladeisthebestchoice\.","action":"answer","answer\_entity":"double\-edgesafetyrazorwithknurledhandle","answer\_part":"double\_edge\_blade","answer\_how\_to\_use":"Usethedouble\_edge\_bladetogentlyscrapethegunkfromthesinkoverflowslotbyinsertingthebladeintotheslotandmovingitbackandforthtoloosenthedebris\.Thesharp,rigidedgewilleffectivelydislodgethehairandsoapgunk,allowingwatertorinseitout\."\},"why\_bad":"Themodelpreferssharpscrapingdespiteblocked/hotstatecuesanddespitethesoftobstructionbeingbetterhandledbynonmetalteeth\."\}[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJnb29kX3NmdF9kcG9fNGIiLAogICJyb3VuZCI6IDIsCiAgIm1vZGVsX3Jlc3BvbnNlIjogewogICAgInJlYXNvbmluZyI6ICJUaGUgZG91YmxlLWVkZ2UgYmxhZGUgaXMgc2hhcnAgYW5kIGRlc2lnbmVkIGZvciBjdXR0aW5nLCBtYWtpbmcgaXQgaWRlYWwgZm9yIHNjcmFwaW5nIGd1bmsgZnJvbSB0aGUgbmFycm93IG92ZXJmbG93IHNsb3QuIC4uLiBJIHNob3VsZCBpbnNwZWN0IHRoZSBibGFkZSBuZXh0IHRvIGNvbmZpcm0gaXRzIHN1aXRhYmlsaXR5IGZvciBsb29zZW5pbmcgZ3VuaywgdGhlbiBjb21wYXJlIHdpdGggb3RoZXIgY2FuZGlkYXRlcyBsaWtlIHRoZSBlbGVjdHJpYyBiZWFyZCB0cmltbWVyJ3Mgc21hbGwgcGFydHMgb3IgdGhlIHRvaWxldCBwYXBlciByb2xsJ3MgZWRnZSBmb3IgbGV2ZXJhZ2UuIiwKICAgICJhY3Rpb24iOiAiaW5zcGVjdF9wYXJ0IiwKICAgICJwYXJ0IjogImRvdWJsZV9lZGdlX2JsYWRlIgogIH0sCiAgImZlZWRiYWNrIjogewogICAgInR5cGUiOiAiUEFSVCBJTlNQRUNUSU9OIiwKICAgICJwYXJ0IjogImRvdWJsZV9lZGdlX2JsYWRlIiwKICAgICJlbnRpdHkiOiAiZG91YmxlLWVkZ2Ugc2FmZXR5IHJhem9yIHdpdGgga251cmxlZCBoYW5kbGUiLAogICAgInBoeXNpY2FsIjogImhpZ2gtY2FyYm9uIHN0ZWVsLCB1bmNvYXRlZCBvciBsaWdodGx5IGNvYXRlZDsgbm9ybWFsIGR1cmFiaWxpdHk7IHNwcmluZ3k7IG1vcmUgcmVhY3RpdmUgYW5kIGNhbiBkZXZlbG9wIHBhdGluYS9ydXN0IHNwb3RzIGlmIGxlZnQgd2V0IiwKICAgICJzdGF0ZSI6ICJwYXJ0aWFsbHkgdmlzaWJsZSBhbmQgYmxvY2tlZDsgc2xpZ2h0bHkgaG90IgogIH0sCiAgIm5vdGUiOiAiVGhlIHRyYWluZWQgbW9kZWwgc3RpbGwgZXhwbG9yZXMgdGhlIHRlbXB0aW5nIGJsYWRlLCBidXQgZG9lcyBub3QgdGVybWluYXRlIHRoZXJlLiIKfQ==)\{"trace":"good\_sft\_dpo\_4b","round":2,"model\_response":\{"reasoning":"Thedouble\-edgebladeissharpanddesignedforcutting,makingitidealforscrapinggunkfromthenarrowoverflowslot\.\.\.\.Ishouldinspectthebladenexttoconfirmitssuitabilityforlooseninggunk,thencomparewithothercandidatesliketheelectricbeardtrimmer'ssmallpartsorthetoiletpaperroll'sedgeforleverage\.","action":"inspect\_part","part":"double\_edge\_blade"\},"feedback":\{"type":"PARTINSPECTION","part":"double\_edge\_blade","entity":"double\-edgesafetyrazorwithknurledhandle","physical":"high\-carbonsteel,uncoatedorlightlycoated;normaldurability;springy;morereactiveandcandeveloppatina/rustspotsifleftwet","state":"partiallyvisibleandblocked;slightlyhot"\},"note":"Thetrainedmodelstillexploresthetemptingblade,butdoesnotterminatethere\."\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/sink_base_double_edge_blade.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJnb29kX3NmdF9kcG9fNGIiLAogICJyb3VuZCI6IDQsCiAgIm1vZGVsX3Jlc3BvbnNlIjogewogICAgInJlYXNvbmluZyI6ICJUaGUga251cmxlZCBoYW5kbGUgaXMgY2hyb21lLXBsYXRlZCBicmFzcywgY29ycm9zaW9uLXJlc2lzdGFudCwgYW5kIGhlYXZ5IC0tIHByb3ZpZGluZyBzdHJvbmcgZ3JpcCBhbmQgbGV2ZXJhZ2UuIC4uLiBIb3dldmVyLCBhcyBhIHN0YW5kYWxvbmUgcGFydCwgaXQgbGFja3MgYSBjdXR0aW5nIGVkZ2UgYW5kIGNhbm5vdCBzY3JhcGUgZ3VuayBhbG9uZS4gLi4uIEkgc2hvdWxkIG5vdyBpbnNwZWN0IHRoZSBlbGVjdHJpYyBiZWFyZCB0cmltbWVyJ3MgYWRqdXN0YWJsZSBndWFyZCwgYXMgaXRzIHNtYWxsLCByaWdpZCBzaGFwZSBtaWdodCBmaXQgaW50byB0aGUgc2xvdCBhbmQgaGVscCBkaXNsb2RnZSBndW5rLiIsCiAgICAiYWN0aW9uIjogImluc3BlY3RfZW50aXR5IiwKICAgICJlbnRpdHkiOiAiZWxlY3RyaWMgYmVhcmQgdHJpbW1lciB3aXRoIGFkanVzdGFibGUgZ3VhcmQiCiAgfSwKICAiZmVlZGJhY2siOiB7CiAgICAidHlwZSI6ICJFTlRJVFkgSU5TUEVDVElPTiIsCiAgICAiZW50aXR5IjogImVsZWN0cmljIGJlYXJkIHRyaW1tZXIgd2l0aCBhZGp1c3RhYmxlIGd1YXJkIiwKICAgICJwYXJ0cyI6IFsiaG91c2luZ19ib2R5IiwgImN1dHRpbmdfYmxhZGVfaGVhZCIsICJhZGp1c3RhYmxlX2d1YXJkX2NvbWIiLCAicG93ZXJfc3dpdGNoIiwgInJlY2hhcmdlYWJsZV9iYXR0ZXJ5X3BhY2siLCAibW90b3JfZHJpdmVfdW5pdCIsICJjb250cm9sX2FuZF9jaGFyZ2luZ19lbGVjdHJvbmljcyJdCiAgfQp9)\{"trace":"good\_sft\_dpo\_4b","round":4,"model\_response":\{"reasoning":"Theknurledhandleischrome\-platedbrass,corrosion\-resistant,andheavy\-\-providingstronggripandleverage\.\.\.\.However,asastandalonepart,itlacksacuttingedgeandcannotscrapegunkalone\.\.\.\.Ishouldnowinspecttheelectricbeardtrimmer'sadjustableguard,asitssmall,rigidshapemightfitintotheslotandhelpdislodgegunk\.","action":"inspect\_entity","entity":"electricbeardtrimmerwithadjustableguard"\},"feedback":\{"type":"ENTITYINSPECTION","entity":"electricbeardtrimmerwithadjustableguard","parts":\["housing\_body","cutting\_blade\_head","adjustable\_guard\_comb","power\_switch","rechargeable\_battery\_pack","motor\_drive\_unit","control\_and\_charging\_electronics"\]\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/sink_trained_trimmer_entity.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJnb29kX3NmdF9kcG9fNGIiLAogICJyb3VuZCI6IDUsCiAgIm1vZGVsX3Jlc3BvbnNlIjogewogICAgInJlYXNvbmluZyI6ICJUaGUgYWRqdXN0YWJsZV9ndWFyZF9jb21iIGlzIG1hZGUgb2YgY2xlYXIgcGxhc3RpYyB3aXRoIGZpbmUsIHJpZ2lkIHRlZXRoIGFuZCBpcyBjdXJyZW50bHkgY2xvZ2dlZCB3aXRoIGhhaXIgYW5kIGd1bmsgLS0gc2ltaWxhciB0byB0aGUgb3ZlcmZsb3cgc2xvdC4gSXRzIGRlc2lnbiBhbGxvd3MgaXQgdG8gYmUgaW5zZXJ0ZWQgaW50byBuYXJyb3cgc3BhY2VzIGFuZCBpdHMgdGVldGggY2FuIHNjcmFwZSBvciBkaXNsb2RnZSBkZWJyaXMuIFRoZSBjdXR0aW5nX2JsYWRlX2hlYWQgaXMgc2hhcnAgYW5kIGRlc2lnbmVkIGZvciBjdXR0aW5nLCBidXQgaXQgaXMgbm90IHN1aXRhYmxlIGZvciBzY3JhcGluZyBndW5rIGZyb20gYSBzbG90IHdpdGhvdXQgbGV2ZXJhZ2UuIC4uLiBUaGUgYWRqdXN0YWJsZV9ndWFyZF9jb21iIGlzIHRoZSBtb3N0IHByb21pc2luZyBjYW5kaWRhdGUgYmVjYXVzZSBpdHMgc2hhcGUgYW5kIG1hdGVyaWFsIGFyZSBpZGVhbCBmb3IgbG9vc2VuaW5nIGd1bmsgaW4gYSBjb25maW5lZCBzcGFjZS4iLAogICAgImFjdGlvbiI6ICJpbnNwZWN0X3BhcnQiLAogICAgInBhcnQiOiAiYWRqdXN0YWJsZV9ndWFyZF9jb21iIgogIH0sCiAgImZlZWRiYWNrIjogewogICAgInR5cGUiOiAiUEFSVCBJTlNQRUNUSU9OIiwKICAgICJwYXJ0IjogImFkanVzdGFibGVfZ3VhcmRfY29tYiIsCiAgICAiZW50aXR5IjogImVsZWN0cmljIGJlYXJkIHRyaW1tZXIgd2l0aCBhZGp1c3RhYmxlIGd1YXJkIiwKICAgICJwaHlzaWNhbCI6ICJleGFjdCBtaWxsaW1ldGVyIGRpbWVuc2lvbnM7IGludGVncmF0ZWQgc2xpZGVyIHRyYWNrLCBkZXRlbnQgY2xpY2tzLCBhbmQgc2lkZSBndWlkZSByYWlsIGhlaWdodC1tYXJraW5nIGZ1bmN0aW9uOyBzcGVjaWZpYyBwbGFzdGljIHR5cGU7IGR1cmFiaWxpdHkvc3R1cmRpbmVzczsgc3ByaW5naW5lc3MvZWxhc3RpY2l0eTsgbG93LWZyaWN0aW9uIHByb3BlcnR5IiwKICAgICJzdGF0ZSI6ICJ2aXNpYmxlOyBmcmVlIgogIH0KfQ==)\{"trace":"good\_sft\_dpo\_4b","round":5,"model\_response":\{"reasoning":"Theadjustable\_guard\_combismadeofclearplasticwithfine,rigidteethandiscurrentlycloggedwithhairandgunk\-\-similartotheoverflowslot\.Itsdesignallowsittobeinsertedintonarrowspacesanditsteethcanscrapeordislodgedebris\.Thecutting\_blade\_headissharpanddesignedforcutting,butitisnotsuitableforscrapinggunkfromaslotwithoutleverage\.\.\.\.Theadjustable\_guard\_combisthemostpromisingcandidatebecauseitsshapeandmaterialareidealforlooseninggunkinaconfinedspace\.","action":"inspect\_part","part":"adjustable\_guard\_comb"\},"feedback":\{"type":"PARTINSPECTION","part":"adjustable\_guard\_comb","entity":"electricbeardtrimmerwithadjustableguard","physical":"exactmillimeterdimensions;integratedslidertrack,detentclicks,andsideguiderailheight\-markingfunction;specificplastictype;durability/sturdiness;springiness/elasticity;low\-frictionproperty","state":"visible;free"\}\}![[Uncaptioned image]](https://arxiv.org/html/2605.26396v1/figures/images_case/sink_trained_adjustable_guard_comb.png)[⬇](data:text/plain;base64,ewogICJ0cmFjZSI6ICJnb29kX3NmdF9kcG9fNGIiLAogICJyb3VuZCI6IDYsCiAgIm1vZGVsX3Jlc3BvbnNlIjogewogICAgInJlYXNvbmluZyI6ICJUaGUgYWRqdXN0YWJsZV9ndWFyZF9jb21iIGlzIG1hZGUgb2YgZHVyYWJsZSwgbG93LWZyaWN0aW9uIHBsYXN0aWMgd2l0aCByaWdpZCwgZmluZSB0ZWV0aCBkZXNpZ25lZCB0byBzY3JhcGUgYW5kIGRpc2xvZGdlIGRlYnJpcy4gSXRzIHNoYXBlIGlzIGlkZWFsIGZvciBmaXR0aW5nIGludG8gdGhlIG5hcnJvdyBvdmVyZmxvdyBzbG90LCBhbmQgaXRzIHNwcmluZ2luZXNzIGFsbG93cyBpdCB0byBmbGV4IGFuZCBsb29zZW4gZ3VuayB3aXRob3V0IGJyZWFraW5nLiBVbmxpa2UgdGhlIGRvdWJsZS1lZGdlIGJsYWRlICsgaGFuZGxlLCB3aGljaCByZXF1aXJlcyBtYW51YWwgbGV2ZXJhZ2UgYW5kIG1heSBub3QgZml0IHNudWdseSwgdGhlIGNvbWIgaXMgc3BlY2lmaWNhbGx5IGRlc2lnbmVkIGZvciBzdWNoIHRhc2tzIGFuZCBpcyBmcmVlIGFuZCB2aXNpYmxlLiBJdCBpcyBhbHNvIG5vbi1hYnJhc2l2ZSBhbmQgc2FmZSB0byB1c2UgaW4gYSB3ZXQgZW52aXJvbm1lbnQuIiwKICAgICJhY3Rpb24iOiAiYW5zd2VyIiwKICAgICJhbnN3ZXJfZW50aXR5IjogImVsZWN0cmljIGJlYXJkIHRyaW1tZXIgd2l0aCBhZGp1c3RhYmxlIGd1YXJkIiwKICAgICJhbnN3ZXJfcGFydCI6ICJhZGp1c3RhYmxlX2d1YXJkX2NvbWIiLAogICAgImFuc3dlcl9ob3dfdG9fdXNlIjogIkluc2VydCB0aGUgYWRqdXN0YWJsZV9ndWFyZF9jb21iIGludG8gdGhlIG92ZXJmbG93IHNsb3QgYW5kIGdlbnRseSB3aWdnbGUgaXQgYmFjayBhbmQgZm9ydGggdG8gbG9vc2VuIHRoZSBndW5rLiBVc2UgdGhlIGZpbmUgdGVldGggdG8gc2NyYXBlIGFuZCBkaXNsb2RnZSB0aGUgaGFpci1hbmQtc29hcCBidWlsZHVwLCB0aGVuIHJpbnNlIHdpdGggd2F0ZXIgdG8gY2xlYXIgdGhlIHNsb3QuIgogIH0sCiAgImNhcGFiaWxpdHlfc2hvd24iOiAiZXhhY3QtcGFydCBkaXNjcmltaW5hdGlvbiB1bmRlciBzYWZldHkgYW5kIHN0YXRlIGNvbnN0cmFpbnRzIgp9)\{"trace":"good\_sft\_dpo\_4b","round":6,"model\_response":\{"reasoning":"Theadjustable\_guard\_combismadeofdurable,low\-frictionplasticwithrigid,fineteethdesignedtoscrapeanddislodgedebris\.Itsshapeisidealforfittingintothenarrowoverflowslot,anditsspringinessallowsittoflexandloosengunkwithoutbreaking\.Unlikethedouble\-edgeblade\+handle,whichrequiresmanualleverageandmaynotfitsnugly,thecombisspecificallydesignedforsuchtasksandisfreeandvisible\.Itisalsonon\-abrasiveandsafetouseinawetenvironment\.","action":"answer","answer\_entity":"electricbeardtrimmerwithadjustableguard","answer\_part":"adjustable\_guard\_comb","answer\_how\_to\_use":"Inserttheadjustable\_guard\_combintotheoverflowslotandgentlywiggleitbackandforthtoloosenthegunk\.Usethefineteethtoscrapeanddislodgethehair\-and\-soapbuildup,thenrinsewithwatertocleartheslot\."\},"capability\_shown":"exact\-partdiscriminationundersafetyandstateconstraints"\}

## Appendix HUse of LLMs

In this work, LLMs are used strictly for research support rather than as sources of substantive content\. Their use falls into three categories: \(i\) serving as automatic pipeline annotation helper, \(ii\) providing tested results on MM\-CreativityBench, and \(iii\) assisting with language refinement during paper writing\. For writing support, we used ChatGPT solely to polish text \(improving coherence and grammar\) while all ideas, logic, results, and technical contributions originate from the authors\. To safeguard rigor, we have carefully reviewed all LLM\-refined texts to confirm that no hallucinated content was introduced and that the original arguments, findings, and perspectives were faithfully preserved\.
Advancing Creative Physical Intelligence in Large Multimodal Models

Similar Articles

CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Submit Feedback

Similar Articles

CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing
MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models
WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark
Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers
Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs