DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Summary

DeskCraft is a new benchmark for evaluating desktop GUI agents on long-horizon professional creative workflows, incorporating human-in-the-loop collaboration protocols. It tests agents on tasks requiring over 50 steps across design, video, audio, and 3D software.

arXiv:2606.03103v1 Announce Type: new Abstract: Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users provide additional instructions, clarifications, feedback, or corrections as the task progresses. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront. To address this issue, we introduce DeskCraft, a desktop GUI benchmark targeting long horizon creative and engineering workflows and proactive human-agent collaboration. DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long horizon tasks requiring over 50 execution steps, and covers professional creative software across design, video, audio, and 3D creation. Furthermore, DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges. Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion, together spanning the full space of realistic collaboration patterns. We evaluate 18 proprietary and open source agents on 538 tasks and find that GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Further analyses reveal persistent failures in long horizon workflow delivery and proactive clarification. We will open-source all evaluation codes, tasks, and data at https://github.com/mrwwk/DeskCraft.

Cached at: 06/03/26, 09:42 AM

# DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration
Source: [https://arxiv.org/html/2606.03103](https://arxiv.org/html/2606.03103)
Wenkai Wang1,∗, Tao Xiong1,∗, Jingchen Ni2,∗, Yunpeng Bao1,∗, Xiyun Li3, Tianqi Liu1, Hongcan Guo4, Zilong Huang3, Shengyu Zhang1,†

1 Zhejiang University, 2 Tsinghua University, 3 Tencent, 4 The University of Hong Kong

∗ Equal contribution\. † Corresponding author\.

###### Abstract

Real\-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human\-in\-the\-loop coordination, where agents proactively seek necessary information and users provide additional instructions, clarifications, feedback, or corrections as the task progresses\. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront\. To address this issue, we introduce DeskCraft, a desktop GUI benchmark targeting long horizon creative and engineering workflows and proactive human\-agent collaboration\. DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long horizon tasks requiring over 50 execution steps, and covers professional creative software across design, video, audio, and 3D creation\. Furthermore, DeskCraft formalizes human\-agent collaboration into an interaction protocol covering *mid\-turn* and *post\-turn* exchanges\. Mid\-turn interaction captures both agent\-initiated clarification under uncertainty and user\-initiated interruption during execution, while post\-turn interaction accommodates user\-driven feedback after the agent signals completion, together spanning the full space of realistic collaboration patterns\. We evaluate 18 proprietary and open source agents on 538 tasks and find that GPT\-5\.4 reaches 31\.6% on standard tasks and 27\.6% on interactive tasks\. Further analyses reveal persistent failures in long horizon workflow delivery and proactive clarification\. We will open\-source all evaluation codes, tasks, and data at [https://github\.com/mrwwk/DeskCraft](https://github.com/mrwwk/DeskCraft)\.

## 1 Introduction

Frontier multimodal models, such as GPT\-5 \(Singh et al\., [2025](https://arxiv.org/html/2606.03103#bib.bib17)\) and Claude \(Anthropic, [2025](https://arxiv.org/html/2606.03103#bib.bib3)\), now demonstrate strong capabilities in screen understanding and GUI operation \(Qin et al\., [2025](https://arxiv.org/html/2606.03103#bib.bib14); Agashe et al\., [2025](https://arxiv.org/html/2606.03103#bib.bib1); Wang et al\., [2026a](https://arxiv.org/html/2606.03103#bib.bib19)\)\. This progress points toward a future in which desktop agents can take over substantial portions of routine digital work for their users\.

Real\-world desktop productivity, however, requires capabilities that extend far beyond isolated GUI actions\. Professional workflows span multiple applications and extended time horizons; a 3D rendering pipeline, for instance, transitions from modeling to lighting, rendering, and compositing across various tools\. Throughout these processes, the user iteratively directs the workflow via clarification, correction, and feedback\. In tandem, the agent must proactively elicit missing information rather than relying on assumptions \(Horvitz, [1999](https://arxiv.org/html/2606.03103#bib.bib7); Allen et al\., [1999](https://arxiv.org/html/2606.03103#bib.bib2)\)\. Deployable desktop agents, therefore, must not only sustain long action sequences but also dynamically adapt to evolving user intents\.

Existing desktop benchmarks \(Xie et al\., [2024](https://arxiv.org/html/2606.03103#bib.bib22); Bonatti et al\., [2024](https://arxiv.org/html/2606.03103#bib.bib5); Yang et al\., [2026](https://arxiv.org/html/2606.03103#bib.bib26)\) successfully evaluate agents in live virtual machines, but their tasks are largely short, atomic, and specified by predetermined instructions, leaving sustained workflows and human\-in\-the\-loop dialogue underexplored\. Benchmarks with explicit user interaction are mainly developed for tool\-use, enterprise workflows, and mobile assistants \(Yao et al\., [2024](https://arxiv.org/html/2606.03103#bib.bib27); Xu et al\., [2026a](https://arxiv.org/html/2606.03103#bib.bib23); Kong et al\., [2025](https://arxiv.org/html/2606.03103#bib.bib9)\)\. In desktop workflows, agents must map each clarification or correction to the current GUI state, revise their plan, and continue from the work already completed\. Consequently, there remains a need for a desktop benchmark that evaluates such interactive, long horizon workflows in live environments\.

![Refer to caption](https://arxiv.org/html/2606.03103v1/x1.png)

Figure 1: Overview of DeskCraft\. Left: 386 standard tasks stratified into L1 atomic, L2 compositional, and L3 long horizon levels, with L3 distilled from real delivery pipelines\. Middle: 152 interactive tasks driven by three composable triggers \(*step count*, *agent inquiry*, *agent done*\) that evolve a task through human\-agent collaboration\. Right: 11 applications across 5 domains, including professional software \(e\.g\., Blender, Kdenlive\) that demands finer spatial precision, denser UI, and deeper domain knowledge than prior benchmarks\.

To bridge this gap, we introduce DeskCraft \(Figure [1](https://arxiv.org/html/2606.03103#S1.F1)\), a 538\-task desktop benchmark designed to evaluate agents on long\-horizon professional workflows and human\-agent interaction in live desktop environments\. DeskCraft contributes three design components\.

Diagnostic workflow difficulty\. Desktop tasks impose increasingly complex execution demands on GUI agents, ranging from following simple user instructions, to composing operations within a task, to sustaining long horizon workflows over many steps\. DeskCraft defines this progression as an L1/L2/L3 difficulty taxonomy \(§[3\.2](https://arxiv.org/html/2606.03103#S3.SS2)\), enabling failures to be diagnosed by the level of execution demand they expose\. In particular, L3 tasks are distilled from real professional scenarios, preserving the dependency structure of actual delivery processes rather than synthetically chaining independent operations\.

Human\-agent interaction protocol\. Real desktop collaboration evolves as execution proceeds: users may revise goals, while agents may need to request missing information or escalate risky decisions\. DeskCraft formalizes this process through three trigger types covering mid\-turn and post\-turn interaction\. *Mid\-turn* triggers fire during execution and comprise two types: agent\-initiated clarification and user\-initiated interruption\. The *post\-turn* trigger fires after the agent signals completion, allowing users to provide follow\-up instructions\. Together, these triggers capture a broad range of realistic human\-agent collaboration patterns\.

Broadened professional software coverage\. Prior benchmarks concentrate on office suites, leaving professional creative workflows underexplored\. DeskCraft expands evaluation to image design, vector design, video editing, audio production, and 3D rendering, covering workflows that demand spatial precision and domain\-specific tool use\.

Table 1: Comparison with representative agent benchmarks along domain, scale, long horizon focus \(LH Focus\), user interaction form \(User Int\.\), difficulty stratification \(Diff\. Lvls\.\), and evaluation granularity \(Eval\.\)\. LH Focus is marked when multi\-step workflows or cross\-application dependencies are a central benchmark axis\. DeskCraft is the first desktop benchmark to jointly support long horizon professional workflows, a human\-in\-the\-loop protocol, and structured difficulty levels\.

Across 538 tasks, the strongest model reaches only 33\.8% on standard tasks\. On the interactive split, GPT\-5\.4 reaches 27\.6%, while Kimi\-K2\.6 reaches 25\.7% under the 100\-step setting\. Further analysis shows that performance drops sharply on L3 workflow\-level artifact delivery, longer step budgets recover only a small tail of additional successes beyond 100 steps, and agents rarely seek clarification proactively\. These results suggest that the dominant bottleneck has shifted from simple instruction execution to sustained workflow planning and proactive human\-agent coordination\. Our contributions are as follows:

- We introduce DeskCraft, a 538\-task desktop benchmark with an L1/L2/L3 difficulty taxonomy and professional workflows spanning image and vector design, video editing, audio production, and 3D rendering\.
- We propose a human\-agent interaction protocol that models collaboration as phase\-based task evolution driven by user feedback, agent information seeking, and execution progress\.
- We evaluate 18 proprietary and open\-source agents, showing that current models remain far from reliable, exhibit the largest gaps in L3 workflow delivery and proactive clarification, and gain limited additional success from longer step budgets\.

## 2 Related Work

#### Desktop and Long Horizon Benchmarks\.

Desktop GUI benchmarks have established execution\-verified evaluation and expanded across platforms, action interfaces, initial state robustness, and professional software grounding \(Xie et al\., [2024](https://arxiv.org/html/2606.03103#bib.bib22); Bonatti et al\., [2024](https://arxiv.org/html/2606.03103#bib.bib5); Yang et al\., [2026](https://arxiv.org/html/2606.03103#bib.bib26); Jia et al\., [2025](https://arxiv.org/html/2606.03103#bib.bib8); Zhao et al\., [2025](https://arxiv.org/html/2606.03103#bib.bib30); Li et al\., [2025](https://arxiv.org/html/2606.03103#bib.bib10); Nayak et al\., [2025](https://arxiv.org/html/2606.03103#bib.bib13)\)\. However, they still largely focus on single\-instruction tasks, leaving sustained workflows across multiple desktop applications and user dialogue during execution underexplored\. In parallel, long horizon evaluation has advanced in web, GUI trajectory, and professional workplace settings, revealing persistent gaps in agents’ ability to complete multi\-step tasks \(Zhou et al\., [2024](https://arxiv.org/html/2606.03103#bib.bib31); Liu et al\., [2025](https://arxiv.org/html/2606.03103#bib.bib11); Xu et al\., [2026a](https://arxiv.org/html/2606.03103#bib.bib23)\)\. DeskCraft introduces a benchmark of long horizon professional desktop workflows that span multiple applications\.

#### Interactive and Human\-in\-the\-Loop Evaluation\.

Interactive agent evaluation has increasingly moved beyond static single\-turn task completion, emphasizing dialogue, evolving user intent, and benchmark extensions along new evaluation axes \(Yao et al\., [2024](https://arxiv.org/html/2606.03103#bib.bib27); Xu et al\., [2026a](https://arxiv.org/html/2606.03103#bib.bib23); Kong et al\., [2025](https://arxiv.org/html/2606.03103#bib.bib9); Mialon et al\., [2024](https://arxiv.org/html/2606.03103#bib.bib12); Deng et al\., [2025](https://arxiv.org/html/2606.03103#bib.bib6); Zan et al\., [2026](https://arxiv.org/html/2606.03103#bib.bib28); Zhang et al\., [2026](https://arxiv.org/html/2606.03103#bib.bib29)\)\. However, these advances have only limited coverage in desktop environments, where most benchmarks still evaluate agents under fixed task instructions without mid\-execution user feedback \(Zhao et al\., [2025](https://arxiv.org/html/2606.03103#bib.bib30)\)\. DeskCraft introduces a human\-in\-the\-loop protocol for long horizon professional desktop workflows \(Table [1](https://arxiv.org/html/2606.03103#S1.T1)\)\.

## 3 DeskCraft Benchmark

DeskCraft is an execution\-based desktop benchmark targeting the joint setting of long horizon workflows, user interaction, and professional software tasks\. This section specifies its task formulation, L1/L2/L3 difficulty taxonomy, interaction protocol, and evaluation procedure\.

### 3\.1 Task Definition

DeskCraft formulates GUI agent evaluation as a phase\-conditioned control problem in a live desktop environment\. A task is defined as

$$\tau = (s_0,\; u_0,\; \Phi,\; \mathcal{E},\; R),$$

where $s_0$ is the initial desktop state, $u_0$ is the user’s instruction, $\mathcal{E}$ is the desktop environment, $\Phi = (\phi_1, \ldots, \phi_K)$ is an optional sequence of interaction phases \(§[3\.3](https://arxiv.org/html/2606.03103#S3.SS3)\), and $R$ is the evaluation function\. Each phase $\phi_k = (u_k, g_k)$ pairs a follow\-up user message with a trigger condition that determines when it is delivered\.

At each step, the agent observes a screenshot $x_t$ and the active instruction, then selects

$$a_t \in \mathcal{A} \cup \{\texttt{DONE},\; \texttt{ASK},\; \texttt{FAIL}\},$$

where $\mathcal{A}$ comprises GUI operations \(clicks, keystrokes, scrolls\)\. The episode ends when the agent emits DONE or FAIL, or when the step budget is reached; ASK does not terminate but may activate the next phase, updating the active instruction\. Standard tasks set $K = 0$ \(single fixed instruction\); interactive tasks set $K > 0$, allowing the goal to evolve during execution\. The final score $R(s_T) \in \{0, 1\}$ is computed from the resulting desktop state\.
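The episode semantics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's released code: the `Task` fields, `run_episode`, `scripted_policy`, the dictionary used as a toy desktop state, and the choice to count ASK against the step budget are all hypothetical. Post-turn follow-ups after DONE are deliberately left out here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, int]  # toy stand-in for the desktop state


@dataclass
class Task:
    s0: State                    # initial desktop state
    u0: str                      # initial user instruction
    R: Callable[[State], int]    # evaluation function, returns 0 or 1
    follow_ups: List[str] = field(default_factory=list)  # u_1..u_K (K = 0: standard task)


def run_episode(task: Task, policy: Callable[[State, str], str], step_budget: int = 100) -> int:
    state = dict(task.s0)
    instruction = task.u0
    pending = list(task.follow_ups)
    for _ in range(step_budget):
        action = policy(state, instruction)
        if action in ("DONE", "FAIL"):
            break                                  # terminal actions end the episode
        if action == "ASK":
            if pending:                            # ASK may activate the next phase
                instruction = pending.pop(0)
            continue                               # assumption: ASK still consumes a step
        state[action] = state.get(action, 0) + 1   # toy GUI operation: count applied ops
    return task.R(state)                           # score the final state s_T


def scripted_policy(state: State, instruction: str) -> str:
    if instruction == "export the poster":
        return "ASK"                 # output format is unspecified, so ask first
    if state.get("click_export_png", 0) == 0:
        return "click_export_png"
    return "DONE"


task = Task(
    s0={},
    u0="export the poster",
    R=lambda s: int(s.get("click_export_png", 0) == 1),
    follow_ups=["export the poster as PNG"],
)
score = run_episode(task, scripted_policy)
print(score)
```

In this toy run the agent asks once, receives the clarified instruction, performs a single operation, and stops, so the binary score depends only on the final state rather than on the path taken.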

### 3\.2 Difficulty Taxonomy

DeskCraft categorizes standard desktop tasks by the execution capability required for success\. L1 tasks consist of simple atomic operations, where the agent needs to perform one clearly specified GUI action\. L2 tasks are built by composing related L1 tasks and typically involve 2\-4 dependent GUI operations\. L3 tasks are long\-horizon tasks that pursue a concrete high level objective through multiple interrelated subtasks\. These tasks are crafted to resemble real world usage scenarios, avoiding trivial concatenation of L1\-level atomic operations, and each task is provided with multiple relevant resource files\.

The difficulty distribution also varies across applications\. Some newly introduced professional software domains currently include more L1\-style atomic tasks, whereas commonly used applications contain a higher proportion of L2/L3 tasks\.

![Refer to caption](https://arxiv.org/html/2606.03103v1/x2.png)

Figure 2: DeskCraft interaction protocol\. Three composable triggers \(agent\_done, agent\_ask, step\_count\) define when the next user phase enters the session: after completion, on agent inquiry, or after a fixed step budget\.

### 3\.3 Interaction Protocol

In real desktop work, users rarely fix a complete specification upfront; they clarify, interrupt, or revise as execution unfolds\. Yet unconstrained dialogue makes evaluation hard to reproduce\. DeskCraft therefore represents interaction as an executable phase protocol that captures goal evolution while keeping it deterministic\.

An interactive task consists of a sequence of phases $\Phi = (\phi_1, \ldots, \phi_K)$\. Each phase $\phi_k = (u_k, g_k)$ contains a user message $u_k$ and a trigger condition $g_k(\cdot) \in \{0, 1\}$\. When $g_k$ fires, $u_k$ is appended to the interaction history and becomes the agent’s active instruction\.

#### Triggers as a closed\-loop minimal set\.

![Refer to caption](https://arxiv.org/html/2606.03103v1/x3.png) \(a\) Instruction length\.
![Refer to caption](https://arxiv.org/html/2606.03103v1/x4.png) \(b\) Evaluator calls\.
![Refer to caption](https://arxiv.org/html/2606.03103v1/x5.png) \(c\) Rule atoms\.

Figure 3: Difficulty taxonomy statistics\. Although DeskCraft defines L1/L2/L3 by required execution capability rather than surface length, the levels align with measurable complexity: instruction length and evaluator calls generally increase from L1 to L3\. Some tasks use gold\-file comparison for evaluation, involving only a single evaluator call and rule regardless of task complexity\. Interactive tasks are shown separately because their complexity is distributed across phase\-level user messages\.

DeskCraft closes the human\-agent interaction loop with a minimal set of three composable trigger types, covering mid\-turn and post\-turn interaction\. For *mid\-turn* interaction, occurring while the agent is still executing: agent\_ask fires when the agent emits ASK to solicit clarification, and step\_count fires after a predetermined number of steps to model user\-initiated interruption\. For *post\-turn* interaction, occurring after the agent signals completion: agent\_done fires when the agent emits DONE, allowing the user to verify deliverables and issue follow\-up instructions or corrections \(Figure [2](https://arxiv.org/html/2606.03103#S3.F2)\)\. Triggers compose freely within a task, enabling phase sequences that interleave them to produce realistic patterns such as “clarify → interrupt → refine\.” Scenario families \(ambiguity resolution, interruption, progressive refinement, feedback correction\) are analysis labels for the collaborative ability being tested, not additional trigger types\.
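One way to read the three triggers is as predicates over the agent's latest action and the step counter. The sketch below replays a recorded action sequence and reports when each phase message would be delivered. It is an illustrative assumption, not the released protocol code: `Phase`, `trigger_fires`, `replay`, and the convention that step\_count is measured from the start of the episode are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Phase:
    message: str      # follow-up user message u_k
    trigger: str      # "agent_ask", "agent_done", or "step_count"
    at_step: int = 0  # threshold, used only by step_count


def trigger_fires(phase: Phase, action: str, step: int) -> bool:
    if phase.trigger == "agent_ask":      # mid-turn, agent-initiated
        return action == "ASK"
    if phase.trigger == "step_count":     # mid-turn, user-initiated interruption
        return step >= phase.at_step
    if phase.trigger == "agent_done":     # post-turn follow-up
        return action == "DONE"
    raise ValueError(f"unknown trigger: {phase.trigger}")


def replay(phases: List[Phase], actions: List[str]) -> List[Tuple[int, str]]:
    """Return (step, message) pairs for each phase delivered, in order."""
    delivered = []
    k = 0  # index of the next undelivered phase
    for step, action in enumerate(actions, start=1):
        if k < len(phases) and trigger_fires(phases[k], action, step):
            delivered.append((step, phases[k].message))
            k += 1
    return delivered


phases = [
    Phase("Use the blue palette", "agent_ask"),
    Phase("Switch to 16:9", "step_count", at_step=4),
    Phase("Make the title larger", "agent_done"),
]
actions = ["click", "ASK", "click", "click", "type", "DONE"]
deliveries = replay(phases, actions)
print(deliveries)
```

Because phases are consumed strictly in order, the same scripted messages are delivered at the same points for any agent that produces the same action sequence, which is the reproducibility property the protocol aims for.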

#### User simulator\.

We employ an MLLM as a user simulator\. When a predefined trigger fires, the simulator issues the next phase goal or responds to an unexpected ASK with clarification, ensuring deterministic user interaction without trajectory drift\. If the agent has not completed the previous phase, the simulator still advances to the next phase instruction to keep evaluation progressing; meanwhile, the MLLM produces a judgment based on the current screenshot and agent output to assess whether the previous phase was successfully completed\. Whether a task ultimately succeeds is determined by the final desktop state\. The full prompt template is given in Appendix [B](https://arxiv.org/html/2606.03103#A2)\.

### 3\.4 Execution\-Based Verification

DeskCraft evaluates task success by verifying the resulting desktop state\. We build a domain\-aware verifier library for professional software\. DeskCraft verifiers extract structured state from project files or application runtimes and apply rule\-based checks over the extracted fields, enabling deterministic evaluation of both long\-horizon and interactive tasks\.
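As a rough illustration of rule\-based checks over extracted fields, the sketch below parses a project description and applies a list of rule atoms to it. Everything here is an assumption made for the example: real verifiers would parse application\-specific project files or query a running application, whereas this sketch uses a JSON string as a stand\-in, and `extract_state`, `verify`, `OPS`, and the `(field, op, expected)` rule format are hypothetical names.

```python
import json
import operator

# Comparison operators available to rule atoms (hypothetical set).
OPS = {
    "eq": operator.eq,
    "ge": operator.ge,
    "le": operator.le,
    "contains": lambda container, item: item in container,
}


def extract_state(project_json: str) -> dict:
    """Stand-in for parsing a project file into structured fields."""
    return json.loads(project_json)


def verify(state: dict, rules: list) -> int:
    """Return 1 only if every (field, op, expected) rule atom holds."""
    for field_name, op, expected in rules:
        if field_name not in state:
            return 0                       # a missing field fails the check
        if not OPS[op](state[field_name], expected):
            return 0
    return 1


project = '{"width": 1920, "height": 1080, "tracks": ["video", "music"]}'
state = extract_state(project)
rules = [
    ("width", "eq", 1920),
    ("height", "eq", 1080),
    ("tracks", "contains", "music"),
]
result = verify(state, rules)
print(result)
```

Checking the saved artifact instead of the action trace means any trajectory that reaches a correct final state is accepted, which suits long\-horizon tasks where many valid paths exist.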

## 4 Benchmark Construction

We construct DeskCraft as 538 desktop tasks grounded in realistic work and verified by automatic execution\-based evaluators\. This section summarizes our task sourcing, difficulty annotation and quality control, and dataset statistics\.

### 4\.1 Task Sourcing

For each of the 11 supported applications, we systematically collect operation workflows from official documentation sites and online tutorials, yielding 224 reference sources that collectively define a capability matrix of 120\+ operation categories\. We sample tasks to ensure no two within the same application test the same atomic feature, producing 386 standard tasks backed by 300\+ evaluator functions\. L3 tasks follow a *workflow distillation* pipeline: we identify real professional workflows from documentation and tutorials, and decompose each into a self\-contained task with named inputs and a verifiable deliverable\.

Across the full 538\-task dataset, tasks use 279 unique asset files spanning 19 formats, sourced through two channels: \(1\) downloaded from public repositories; \(2\) manually authored by annotators to fulfill specific task requirements\. The remaining 152 interactive tasks are derived by pairing selected L2/L3 workflows with typed triggers from the interaction protocol \(§[3\.3](https://arxiv.org/html/2606.03103#S3.SS3)\), covering both user\-driven lifecycle management and agent\-driven information acquisition\.

### 4\.2 Evaluator Function Quality Control

Each task comprises an instruction, a VM configuration, and an execution\-based evaluator\. For each application domain, practitioners first draft a task design document specifying verification strategies; an LLM then generates the evaluator functions; finally, a human and LLM dual review checks the evaluator function correctness\.

### 4\.3 Dataset Statistics

![Refer to caption](https://arxiv.org/html/2606.03103v1/x6.png)

Figure 4: Per application task count for the standard \(outer ring\) and interactive \(inner ring\) splits, covering 11 applications and a multi\-app workflow category\.

Figure [4](https://arxiv.org/html/2606.03103#S4.F4) shows DeskCraft’s task distribution\. The standard split is balanced across L1/L2/L3 difficulty levels, which are defined by execution capability and correlate with measurable complexity signals: median instruction length rises from 186 to 501 characters, average evaluator calls increase from 1\.46 to 2\.00, and average rule atoms grow from 3\.9 to 7\.7 across levels \(Figure [3](https://arxiv.org/html/2606.03103#S3.F3)\)\. The interactive split contributes 403 phase\-level user messages spanning scenario families such as progressive refinement and requirement change\.

## 5 Experiment

In this section, we conduct experiments to address the following research questions:

- RQ1: How well do current GUI agents perform on professional desktop workflows under standard and interactive settings?
- RQ2: How much additional performance can a strong GUI agent recover under longer action horizons \(300 steps\)?
- RQ3: How do task success and execution length change as desktop workflows become more difficult from L1 to L3?
- RQ4: How well do current GUI agents collaborate with humans during interactive desktop workflows?

### 5\.1 Experiment Settings

#### Evaluated agents\.

We evaluate three families of models on DeskCraft: \(i\) proprietary frontier models \(GPT\-5\.4 Singh et al\. \([2025](https://arxiv.org/html/2606.03103#bib.bib17)\), Kimi\-K2\.6 Team et al\. \([2026](https://arxiv.org/html/2606.03103#bib.bib18)\)\); \(ii\) open\-source generalist VLMs \(Qwen3\-VL 8B/32B/235B\-A22B Bai et al\. \([2025](https://arxiv.org/html/2606.03103#bib.bib4)\), Qwen3\.5 9B/35B\-A3B/397B\-A17B Qwen Team \([2026a](https://arxiv.org/html/2606.03103#bib.bib15)\), Qwen3\.6 35B\-A3B Qwen Team \([2026b](https://arxiv.org/html/2606.03103#bib.bib16)\)\); and \(iii\) open\-source CUA foundation models specialized for GUI use \(EvoCUA 8B/32B Xue et al\. \([2026](https://arxiv.org/html/2606.03103#bib.bib25)\), GUI\-Owl\-1\.5 8B/32B Xu et al\. \([2026b](https://arxiv.org/html/2606.03103#bib.bib24)\), OpenCUA 7B/32B Wang et al\. \([2026b](https://arxiv.org/html/2606.03103#bib.bib20)\), OS\-Atlas\-Pro 7B Wu et al\. \([2025](https://arxiv.org/html/2606.03103#bib.bib21)\), UI\-TARS 1\.5 7B Qin et al\. \([2025](https://arxiv.org/html/2606.03103#bib.bib14)\)\)\. This selection lets us compare proprietary frontier agents, open\-source generalist models, and GUI\-specialized foundations while probing the roles of model scale and domain\-specific training in desktop agent performance\.

For interactive tasks, we instantiate the simulator with Kimi\-K2\.5 as a fixed backbone across all evaluated agents\. The full prompt template is given in Appendix [B](https://arxiv.org/html/2606.03103#A2)\.

### 5\.2 Overall Performance under Standard and Interactive Settings \(RQ1\)

To answer RQ1, we evaluate current GUI agents on the Standard and Interactive splits of DeskCraft\. Table [2](https://arxiv.org/html/2606.03103#S5.T2) reports per application and overall task\-level success rates\. We further analyze repeated\-run reliability for Kimi\-K2\.6 using pass@k and pass^k, where pass@k counts a task as successful if any of $k$ runs succeeds, and pass^k requires all $k$ runs to succeed\. We make the following observations:

Table 2: Per\-application success rate on DeskCraft\. We report task success rate \(SR, %\) for each agent on the Standard split \(386 tasks\) and the Interactive split \(152 tasks\)\. The two Avg\. columns report overall task\-level SR within each regime\. Bold = best per column; underline = runner\-up\.

Obs\. ❶ Current GUI agents achieve limited overall success on DeskCraft\. The best average success rates remain below 35%: Kimi\-K2\.6 achieves the highest Standard performance at 33\.8%, while GPT\-5\.4 achieves the highest Interactive performance at 27\.6%\. GPT\-5\.4 reaches 31\.6% on Standard, and Kimi\-K2\.6 reaches 25\.7% on Interactive\. Most open\-source generalist VLMs and GUI\-specialized foundation models are substantially lower, indicating that DeskCraft still leaves substantial room for improvement\.

![Refer to caption](https://arxiv.org/html/2606.03103v1/x7.png)

Figure 5: Pass@k and pass^k trends for Kimi\-K2\.6\. Because pass@k evaluation requires multiple independent runs per task, we compute these metrics on a subset of tasks\. Pass@k increases with larger $k$, whereas pass^k decreases as the requirement shifts from at least one successful attempt to consistent success across all attempts\.

Obs\. ❷ Multiple attempts raise upper\-bound success, but run\-to\-run reliability remains weak\. Figure [5](https://arxiv.org/html/2606.03103#S5.F5) shows that Kimi\-K2\.6 benefits from multiple trials in both settings\. Since pass@k requires repeated independent rollouts for each task, we report pass@k and pass^k on a representative task subset\. On this subset, Standard pass@k rises from 28\.7% at $k = 1$ to 45\.6% at $k = 6$, and Interactive pass@k rises from 27\.0% to 38\.8%\. However, pass^k drops as $k$ increases\. This gap shows that current GUI agents often succeed only intermittently across repeated executions of the same workflow, rather than solving it robustly\.
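The two repeated\-run metrics can be made concrete with a small sketch. It follows the definitions given in this section \(any of $k$ runs succeeds versus all $k$ runs succeed\) and simply uses the first $k$ recorded runs of each task; whether the paper uses this direct count or a different estimator is not stated, so treat it as an assumption. The function names and the run outcomes are made up for illustration.

```python
from typing import Dict, List


def pass_at_k(runs: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k runs succeeds."""
    return sum(any(r[:k]) for r in runs.values()) / len(runs)


def pass_hat_k(runs: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks where all of the first k runs succeed (pass^k)."""
    return sum(all(r[:k]) for r in runs.values()) / len(runs)


# Toy outcomes for four tasks with three runs each (illustrative only).
runs = {
    "t1": [True, False, True],
    "t2": [False, False, True],
    "t3": [False, False, False],
    "t4": [True, True, True],
}

for k in (1, 3):
    print(k, pass_at_k(runs, k), pass_hat_k(runs, k))
```

On this toy data both metrics equal 0.5 at $k = 1$, while at $k = 3$ pass@k rises to 0.75 and pass^k falls to 0.25, mirroring the widening gap between best\-case and consistent success described above.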

### 5\.3 Long\-Horizon 300\-Step Budget Analysis \(RQ2\)

To answer RQ2, we analyze Kimi\-K2\.6 under progressively larger action budgets\. Figure [6](https://arxiv.org/html/2606.03103#S5.F6) reports cumulative success: a task is counted at a given budget only if the model completes it successfully within that number of steps\.

![Refer to caption](https://arxiv.org/html/2606.03103v1/x8.png)

Figure 6: Cumulative accuracy of Kimi\-K2\.6 under increasing step budgets\. A task contributes to the accuracy at a given budget only if it is completed successfully within that number of steps\.

Obs\. ❸ Longer action budgets reveal additional capability beyond the 100\-step regime\. Kimi\-K2\.6 benefits substantially as the budget increases toward 100 steps: overall success rises from 17\.0% at 25 steps to 34\.3% at 100 steps\. The model continues to complete some tasks after the conventional 100\-step horizon: standard success reaches 34\.9% at 150 steps and 35\.7% at 181 steps\. In absolute terms, the extended budget adds 13 more successful tasks after the 100\-step point, including four tasks completed after 150 steps\. No additional successful completion appears beyond 200 steps in our run\. These results suggest that sub\-100\-step evaluations can miss a small but meaningful tail of long\-horizon capabilities\.
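The cumulative\-success curve can be computed from a single long run by recording the step at which each task was completed successfully. The sketch below assumes exactly that bookkeeping; `cumulative_success` and the step values are hypothetical and are not the paper's data.

```python
from typing import Dict, List, Optional


def cumulative_success(success_steps: List[Optional[int]], budgets: List[int]) -> Dict[int, float]:
    """For each budget, the fraction of tasks solved within that many steps.

    success_steps holds, per task, the step at which it succeeded,
    or None if it never succeeded within the maximum budget.
    """
    n = len(success_steps)
    return {
        b: sum(1 for s in success_steps if s is not None and s <= b) / n
        for b in budgets
    }


# Toy data for eight tasks (illustrative only).
success_steps = [12, 40, None, 95, 130, None, 181, None]
curve = cumulative_success(success_steps, budgets=[25, 100, 200, 300])
print(curve)
```

A curve built this way is non\-decreasing by construction, so a plateau, as in the toy data between 200 and 300 steps, indicates that the extra budget yields no new completions.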

### 5\.4 Difficulty\-Level Capability Degradation \(RQ3\)

![Refer to caption](https://arxiv.org/html/2606.03103v1/x9.png) \(a\) Leading models\.
![Refer to caption](https://arxiv.org/html/2606.03103v1/x10.png) \(b\) Competitive models\.
![Refer to caption](https://arxiv.org/html/2606.03103v1/x11.png) \(c\) Emerging models\.

Figure 7: Run\-length and accuracy trends across L1, L2, and L3 tasks\. Lines show the mean of correct\- and wrong\-task step counts, markers show all/correct/wrong step averages, and bars show per\-level accuracy\.

To answer RQ3, we analyze how performance changes across DeskCraft’s three difficulty levels\. Figure [7](https://arxiv.org/html/2606.03103#S5.F7) shows both success rates and run lengths across levels\.

Obs\. ❹ Accuracy drops as task difficulty increases\. Across model families, success rates decline from L1/L2 to L3, with the main cliff typically appearing at L3\. For example, EvoCUA\-32B drops from 19\.9% \(L1\) to 10\.7% \(L2\) and 1\.0% \(L3\)\. Stronger general\-purpose agents also remain limited on L3: Kimi\-K2\.6 declines from 41\.0% \(L2\) to 21\.6% \(L3\), and GPT\-5\.4 from 40\.7% \(L2\) to 9\.5% \(L3\)\.

Obs\. ❺ Higher difficulty is associated with longer runs\. Average run length generally increases from L1 to L3 for both successful and failed trajectories\. For instance, Kimi\-K2\.6 rises from 30\.8 steps on L1 to 48\.8 on L2 and 77\.7 on L3, while GPT\-5\.4 rises from 25\.0 to 44\.3 to 71\.2\. This suggests that harder desktop workflows are not only less accurate, but also less efficient, reflecting persistent weaknesses in long\-horizon planning and state management\.

### 5\.5 Human\-in\-the\-Loop Collaboration Analysis \(RQ4\)

To answer RQ4, we group Interactive tasks by their human\-in\-the\-loop collaboration mode and compare task success rates for Kimi\-K2\.6 and GPT\-5\.4\. The full label distribution is reported in Appendix [C](https://arxiv.org/html/2606.03103#A3)\.

![Refer to caption](https://arxiv.org/html/2606.03103v1/x12.png)

Figure 8: Success rates by human\-in\-the\-loop collaboration mode \(Flow/Prog\./Corr\./Req\./Intr\./Ask denote workflow, progressive refinement, correction, requirement change, interruption, and clarification\)\.

Obs\. ❻ Explicit revision feedback is easier to handle than interrupted workflows\. Kimi\-K2\.6 and GPT\-5\.4 perform best on correction/feedback tasks, where the user provides concrete revision guidance\. Performance is lower on interruption tasks, suggesting that current agents use explicit local feedback more accurately than mid\-workflow changes that require replanning\. This pattern indicates that agents are better at making bounded local edits than at preserving execution state and repairing a plan after the workflow is disrupted\.

Obs\. ❼ Agents rarely ask for clarification when goals are underspecified\. Ask\-style tasks have the lowest success rates for both Kimi\-K2\.6 and GPT\-5\.4\. Thus, exposing an Ask channel is not sufficient; current agents often proceed without requesting the missing information needed for successful execution\. The dominant failure mode appears to be over\-commitment to an initial guess\.

## 6 Conclusion

We introduced DeskCraft, a 538\-task execution\-based benchmark for desktop GUI agents, featuring an L1/L2/L3 difficulty taxonomy, an executable interaction protocol, and professional workflow coverage beyond existing desktop benchmarks\. Across standard and interactive settings, experiments show that current agents remain far from robust on long\-horizon and interactive tasks, with substantial weaknesses in workflow completion, replanning under intervention, and proactive clarification\. Allowing additional steps recovers only a small tail of successes\. By making these challenges explicit and measurable, DeskCraft provides a concrete basis for evaluating progress on realistic desktop agents\.

## Limitations

DeskCraft expands desktop\-agent evaluation to longer workflows, interactive collaboration, and professional software, but it still has several scope boundaries\. First, although the benchmark includes both English and Chinese instructions, its language coverage is still partial rather than fully multilingual\. Second, the interaction protocol uses scripted user messages to ensure reproducibility and controlled comparison across agents; this makes evaluation stable, but it cannot capture the full diversity and unpredictability of real human collaboration\. Finally, DeskCraft is a fixed benchmark release with a finite set of applications, workflows, and step budgets, so it should be viewed as an incomplete slice of real\-world desktop work\.

## References

- Agashe et al\. \(2025\)Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang\. 2025\.Agent s2: A compositional generalist\-specialist framework for computer use agents\.*arXiv preprint arXiv:2504\.00906*\.
- Allen et al\. \(1999\)James E Allen, Curry I Guinn, and Eric Horvitz\. 1999\.Mixed\-initiative interaction\.*IEEE Intelligent Systems and their Applications*, 14\(5\):14–23\.
- Anthropic \(2025\)Anthropic\. 2025\.[Introducing claude 4](https://www.anthropic.com/news/claude-4)\.Anthropic News\.
- Bai et al\. \(2025\)Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, and 1 others\. 2025\.Qwen3\-vl technical report\.*arXiv preprint arXiv:2511\.21631*\.
- Bonatti et al\. \(2024\)Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, and 1 others\. 2024\.Windows agent arena: Evaluating multi\-modal os agents at scale\.*arXiv preprint arXiv:2409\.08264*\.
- Deng et al\. \(2025\)Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, and 1 others\. 2025\.Swe\-bench pro: Can ai agents solve long\-horizon software engineering tasks?*arXiv preprint arXiv:2509\.16941*\.
- Horvitz \(1999\)Eric Horvitz\. 1999\.Principles of mixed\-initiative user interfaces\.In*Proceedings of the SIGCHI conference on Human Factors in Computing Systems*, pages 159–166\.
- Jia et al\. \(2025\)Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, and Fei Huang\. 2025\.Osworld\-mcp: Benchmarking mcp tool invocation in computer\-use agents\.*arXiv preprint arXiv:2510\.24563*\.
- Kong et al\. \(2025\)Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, and 1 others\. 2025\.Mobileworld: Benchmarking autonomous mobile agents in agent\-user interactive and mcp\-augmented environments\.*arXiv preprint arXiv:2512\.19432*\.
- Li et al\. \(2025\)Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat\-Seng Chua\. 2025\.Screenspot\-pro: Gui grounding for professional high\-resolution computer use\.In*Proceedings of the 33rd ACM International Conference on Multimedia*, pages 8778–8786\.
- Liu et al\. \(2025\)Shunyu Liu, Minghao Liu, Huichi Zhou, Zhenyu Cui, Yang Zhou, Yuhao Zhou, Jialiang Gao, Heng Zhou, Yunhao Yang, Wendong Fan, and 1 others\. 2025\.Veriweb: Verifiable long\-chain web benchmark for agentic information\-seeking\.*arXiv preprint arXiv:2508\.04026*\.
- Mialon et al\. \(2024\)Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom\. 2024\.Gaia: a benchmark for general ai assistants\.In*International Conference on Learning Representations*, volume 2024, pages 9025–9049\.
- Nayak et al\. \(2025\)Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M Tamer Özsu, Aishwarya Agrawal, David Vazquez, and 1 others\. 2025\.Ui\-vision: A desktop\-centric gui benchmark for visual perception and interaction\.*arXiv preprint arXiv:2503\.15661*\.
- Qin et al\. \(2025\)Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, and 1 others\. 2025\.Ui\-tars: Pioneering automated gui interaction with native agents\.*arXiv preprint arXiv:2501\.12326*\.
- Qwen Team \(2026a\)Qwen Team\. 2026a\.[Qwen3\.5: Towards native multimodal agents](https://qwen.ai/blog?id=qwen3.5)\.
- Qwen Team \(2026b\)Qwen Team\. 2026b\.[Qwen3\.6\-35B\-A3B: Agentic coding power, now open to all](https://qwen.ai/blog?id=qwen3.6-35b-a3b)\.
- Singh et al\. \(2025\)Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El\-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, and 1 others\. 2025\.Openai gpt\-5 system card\.*arXiv preprint arXiv:2601\.03267*\.
- Team et al\. \(2026\)Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, and 1 others\. 2026\.Kimi k2\. 5: Visual agentic intelligence\.*arXiv preprint arXiv:2602\.02276*\.
- Wang et al\. \(2026a\)Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Wu, and 1 others\. 2026a\.Opencua: Open foundations for computer\-use agents\.*Advances in Neural Information Processing Systems*, 38:139756–139806\.
- Wang et al\. \(2026b\)Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Wu, and 1 others\. 2026b\.Opencua: Open foundations for computer\-use agents\.*Advances in Neural Information Processing Systems*, 38:139756–139806\.
- Wu et al\. \(2025\)Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and 1 others\. 2025\.Os\-atlas: Foundation action model for generalist gui agents\.In*International Conference on Learning Representations*, volume 2025, pages 5090–5108\.
- Xie et al\. \(2024\)Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, and 1 others\. 2024\.Osworld: Benchmarking multimodal agents for open\-ended tasks in real computer environments\.*Advances in Neural Information Processing Systems*, 37:52040–52094\.
- Xu et al\. \(2026a\)Frank Fangzheng Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, and 1 others\. 2026a\.Theagentcompany: benchmarking llm agents on consequential real world tasks\.*Advances in Neural Information Processing Systems*, 38\.
- Xu et al\. \(2026b\)Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, and 1 others\. 2026b\.Mobile\-agent\-v3\. 5: Multi\-platform fundamental gui agents\.*arXiv preprint arXiv:2602\.16855*\.
- Xue et al\. \(2026\)Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, and 1 others\. 2026\.Evocua: Evolving computer use agents via learning from scalable synthetic experience\.*arXiv preprint arXiv:2601\.15876*\.
- Yang et al\. \(2026\)Pei Yang, Hai Ci, and Mike Zheng Shou\. 2026\.macosworld: A multilingual interactive benchmark for gui agents\.*Advances in Neural Information Processing Systems*, 38:134014–134056\.
- Yao et al\. \(2024\)Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan\. 2024\.Tau\-bench: A benchmark for tool\-agent\-user interaction in real\-world domains\.*arXiv preprint arXiv:2406\.12045*\.
- Zan et al\. \(2026\)Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Shulin Xin, Linhao Zhang, Qi Liu, Li Aoyan, Lu Chen, Xiaojian Zhong, and 1 others\. 2026\.Multi\-swe\-bench: A multilingual benchmark for issue resolving\.*Advances in Neural Information Processing Systems*, 38\.
- Zhang et al\. \(2026\)Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, and 1 others\. 2026\.Swe\-bench goes live\!*Advances in Neural Information Processing Systems*, 38\.
- Zhao et al\. \(2025\)Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, and Mike Zheng Shou\. 2025\.Worldgui: An interactive benchmark for desktop gui automation from any starting point\.*arXiv preprint arXiv:2502\.08047*\.
- Zhou et al\. \(2024\)Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, and 1 others\. 2024\.Webarena: A realistic web environment for building autonomous agents\.In*International Conference on Learning Representations*, volume 2024, pages 15585–15606\.

## Appendix A Interaction Protocol Implementation Details

We illustrate the interactive execution logic through the way DeskCraft injects user messages into the Kimi GUI agent\. At the start of an interactive task, the instruction for Phase 1 is used as the initial task request\. If any later phase uses the `agent_asks` trigger, the runtime switches Kimi into an interactive mode before rollout so that the agent can explicitly request user clarification when needed\.

This interactive mode extends the agent prompt with a dedicated `call_user` tool and a short behavioral suffix:

    - {"name": "call_user",
       "description": "Ask the user for clarification when the instruction is ambiguous, incomplete, or updated mid-task.",
       "parameters": {"type": "object",
                      "properties": {"question": {"type": "string",
                                                  "description": "A short, specific question for the user."}},
                      "required": ["question"]}}

    This is an interactive session.
    - If the instruction is ambiguous or missing details, call `call_user` to ask a precise clarification question.
    - If the user provides an update or changes the requirement later, incorporate it and continue from the current desktop state.
    - Do not pretend the user already answered if they have not.

After each agent turn, the interaction handler checks whether the current phase trigger has fired\. For `agent_done` and `step_count` triggers, the phase index is advanced before the next user utterance is generated, so the simulator sees the next phase goal instead of repeating the previous one\. For `agent_asks`, the handler extracts the `call_user` question, passes it to the simulator, and treats the simulator response as the next user message\.
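The trigger handling described above can be sketched as follows. This is a minimal illustration, not the DeskCraft handler: the `Phase` dataclass, the `step_threshold` field, and the `handle_turn` return shape are assumptions introduced here, and the sketch assumes that an `agent_asks` phase is not advanced by the handler itself.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    instruction: str          # user goal for this phase
    trigger: str              # "agent_done", "step_count", or "agent_asks"
    step_threshold: int = 0   # hypothetical field, used only by "step_count"


def trigger_fired(phase, agent_done, steps_in_phase, question):
    """Return True if the authored trigger of the current phase has fired."""
    if phase.trigger == "agent_done":
        return agent_done
    if phase.trigger == "step_count":
        return steps_in_phase >= phase.step_threshold
    if phase.trigger == "agent_asks":
        return question is not None
    raise ValueError(f"unknown trigger: {phase.trigger}")


def handle_turn(phases, idx, agent_done=False, steps_in_phase=0, question=None):
    """Return (phase_index, simulator_input); simulator_input is None if idle."""
    phase = phases[idx]
    if not trigger_fired(phase, agent_done, steps_in_phase, question):
        return idx, None
    if phase.trigger in ("agent_done", "step_count"):
        # Advance first so the simulator is conditioned on the next phase goal.
        if idx + 1 >= len(phases):
            return idx, {"goal": None, "question": None}  # no phase left
        idx += 1
        return idx, {"goal": phases[idx].instruction, "question": None}
    # agent_asks: forward the extracted call_user question to the simulator.
    return idx, {"goal": phase.instruction, "question": question}
```

Advancing the index before building the simulator input is the detail that keeps the simulator from restating a goal the agent has already completed.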

The resulting user message is delivered to Kimi through `receive_user_message`\. Kimi stores the message in two places\. First, it is placed in a *pending* buffer that is consumed at the next `predict` call and injected as a highest\-priority turn\-local update:

    The following message is newly added this turn and should be treated as highest-priority update.
    [User Additional Message]:
    {message}

Second, the same message is appended to a bounded persistent history so that previous user constraints remain visible in later steps\. Older messages are inserted into the task instruction as:

    Follow all persistent user requirements below unless a newer requirement explicitly supersedes an older one.
    [Persistent User Requirement 1]:
    {message_1}
    [Persistent User Requirement 2]:
    {message_2}

Messages injected in the current turn are removed from the persistent block for that same call to avoid duplication\. If Kimi emits `call_user`, the parser converts it to a non\-executable `CALL_USER` signal, skips environment execution for that turn, and lets the simulator produce a response before rollout resumes from the same desktop state\. In this way, the full interaction protocol can be viewed as a loop over four stages: execute the current desktop action, check whether the authored phase trigger fires, generate the next user utterance if needed, and inject that utterance back into the agent context for the next step\.
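To make the two-buffer message handling concrete, here is a small sketch of a pending buffer plus a bounded persistent history, with same-turn messages excluded from the persistent block. The class name `UserMessageStore`, the `build_instruction` method, and the history bound of 5 are assumptions for illustration; only `receive_user_message` and the prompt wording come from the text above.

```python
from collections import deque


class UserMessageStore:
    """Hypothetical store mirroring the pending/persistent split described above."""

    def __init__(self, max_history=5):  # the bound is an assumed value
        self.pending = []                         # consumed at the next predict call
        self.history = deque(maxlen=max_history)  # bounded persistent history

    def receive_user_message(self, message):
        self.pending.append(message)
        self.history.append(message)

    def build_instruction(self, task_instruction):
        """Assemble the instruction text for one predict call."""
        fresh, self.pending = self.pending, []
        # Fresh messages are the most recent history entries; drop them from
        # the persistent block so they are not shown twice in the same call.
        n_fresh = min(len(fresh), len(self.history))
        older = list(self.history)[: len(self.history) - n_fresh]
        parts = [task_instruction]
        if older:
            parts.append(
                "Follow all persistent user requirements below unless a newer "
                "requirement explicitly supersedes an older one."
            )
            for i, msg in enumerate(older, 1):
                parts.append(f"[Persistent User Requirement {i}]: {msg}")
        for msg in fresh:
            parts.append(
                "The following message is newly added this turn and should be "
                "treated as highest-priority update."
            )
            parts.append(f"[User Additional Message]: {msg}")
        return "\n".join(parts)
```

A message therefore appears as a turn-local update in the call right after it arrives and as a persistent requirement in later calls, until the bounded history evicts it.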

## Appendix B User Simulator Prompt Template

For interactive tasks, DeskCraft uses an LLM\-based user simulator to generate the next user utterance while keeping the interaction tied to the authored phase protocol\. The simulator is conditioned on the task scenario, user persona, completed phases, current phase goal, optional next phase goal, the agent’s latest reply, and the current screenshot\. Its system prompt template is:

    You are roleplaying as a realistic computer user.
    You are trying to complete a task on a computer, and an AI assistant is helping operate the screen for you.

    ## Current Scenario
    {scenario_description}

    ## Your Persona
    - Expertise level: {expertise_level}
    - Communication style: {communication_style}

    {completed_phases_section}

    ## Current Phase Goal (Phase {current_phase_number} of {total_phases})
    You need to ask the AI assistant to do the following:
    {current_phase_instruction}

    {next_phase_section}

    ## Rules
    1. Speak like a real user. Do not use overly precise technical terms unless your persona is a professional user.
    2. Use the screenshot to judge whether the AI assistant has completed the current requirement.
    3. If a "Next Phase Goal" is provided above, naturally ask for that requirement next. Do not invent new requests on your own.
    4. If the current phase is complete and there is no next phase goal, indicate that the whole task is finished and do not add any new requests.
    5. Keep the conversation natural and coherent, like a real person chatting with an AI assistant.
    6. Your `message` should follow the language implied by the scenario and current instruction. If the task context is in Chinese, reply in Chinese; if it is in English, reply in English.
    7. In normal cases, always set `action` to `new_instruction`.
    8. If the AI assistant has not completed the current phase, keep the interaction in the same phase: set `phase_complete` to false and use `message` to restate or correct the current requirement.
    9. If the AI assistant has completed the current phase and there is a next phase goal, set `phase_complete` to true and use `message` to naturally express that next phase goal.
    10. If the current phase expects the AI assistant to ask the user a question, answer that question directly and naturally. In that case, use `clarify` and set `phase_complete` to true.
    11. If the AI assistant explicitly asks the user a question unexpectedly, you may use `clarify`, and in that case `phase_complete` must be false.

    ## Output Format
    You must output valid JSON with the following fields:
    {
      "action": "new_instruction" or "clarify",
      "message": "What you want to say to the AI assistant",
      "phase_complete": true or false,
      "reason": "When phase_complete is false, explain why the current phase is not complete"
    }

    Meaning of `action`:
    - "new_instruction": The default and normal case. Use it for both correcting the current phase requirement and expressing the next phase requirement.
    - "clarify": Only use this when the AI assistant explicitly asks the user a question unexpectedly. Do not use it otherwise.

Table 3: Distribution of human\-in\-the\-loop collaboration modes in the Interactive split\. Each task is assigned one primary mode for non\-overlapping analysis\.

When the trigger type is `agent_asks` and the GUI agent explicitly asks a question, the simulator receives an additional instruction telling it to answer the question directly, set `action` to `clarify`, and mark the phase as complete\. If the agent calls the user unexpectedly on a phase that is not authored as `agent_asks`, the simulator is instead instructed to answer briefly without advancing the phase\.
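The output contract above lends itself to a small validation step before the simulator reply is used. The sketch below is one possible check, not the benchmark's code: the function name `parse_simulator_reply` and its arguments are hypothetical, and it encodes only the constraints stated in the prompt (allowed `action` values, and how `clarify` interacts with `phase_complete`).

```python
import json


def parse_simulator_reply(raw, trigger, agent_asked):
    """Parse the simulator JSON and enforce the stated field constraints."""
    reply = json.loads(raw)
    action = reply.get("action")
    if action not in {"new_instruction", "clarify"}:
        raise ValueError(f"unexpected action: {action!r}")
    message = reply.get("message")
    if not isinstance(message, str) or not message.strip():
        raise ValueError("message must be a non-empty string")
    if not isinstance(reply.get("phase_complete"), bool):
        raise ValueError("phase_complete must be true or false")
    if action == "clarify":
        if not agent_asked:
            raise ValueError("clarify is only valid after an agent question")
        # Authored agent_asks phase: the answer completes the phase.
        # Unexpected question on any other phase: the phase must not advance.
        expected = trigger == "agent_asks"
        if reply["phase_complete"] != expected:
            raise ValueError("phase_complete is inconsistent with the trigger")
    return reply
```

Rejecting inconsistent replies early keeps a malformed simulator turn from silently advancing or stalling the phase index.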

## Appendix C Human\-in\-the\-Loop Collaboration Mode Labels

Table [3](https://arxiv.org/html/2606.03103#A2.T3) reports the distribution of human\-in\-the\-loop collaboration labels used for the RQ4 analysis\. Each task is assigned one primary label for non\-overlapping success\-rate analysis\.

The label distribution shows that interactive desktop tasks are often not single\-mode interactions: 91 of 170 tasks contain at least one secondary collaboration label\. We therefore use primary labels for the main non\-overlapping success\-rate analysis and use any\-label statistics only to describe overlap among collaboration demands\.
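The difference between primary-label and any-label statistics can be illustrated with a short counting sketch. The task records and the `primary`/`secondary` field names are hypothetical; the only figure taken from the text is that 91 of 170 tasks (about 53.5%) carry a secondary label.

```python
from collections import Counter


def label_stats(tasks):
    """Return (primary counts, any-label counts, number of multi-mode tasks)."""
    primary = Counter(t["primary"] for t in tasks)
    any_label = Counter()
    for t in tasks:
        # A set ensures each label is counted at most once per task.
        any_label.update({t["primary"], *t.get("secondary", [])})
    multi_mode = sum(1 for t in tasks if t.get("secondary"))
    return primary, any_label, multi_mode


# Toy records, not benchmark data.
toy = [
    {"primary": "Ask", "secondary": ["Corr."]},
    {"primary": "Corr."},
    {"primary": "Intr.", "secondary": ["Ask"]},
]
primary, any_label, multi_mode = label_stats(toy)
```

Primary counts always sum to the number of tasks, which is what makes them suitable for non-overlapping success rates; any-label counts can exceed it.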

## Appendix D Additional Experimental Details

Table 4: Per\-software success rate on non\-interactive DeskCraft tasks by difficulty level\. Each model is expanded into three rows \(L1, L2, L3\)\. The Avg\. column is the weighted success rate within that difficulty level\.

Table 5: Per\-software average run length on non\-interactive DeskCraft tasks by difficulty level\. Each model is expanded into three rows \(L1, L2, L3\)\. Values are average executed steps computed from `results/summary_json_collection/non_interactive`\. The Avg\. column is the weighted average run length within that difficulty level\.

Table 6: Per\-software average run length on successful non\-interactive DeskCraft tasks by difficulty level\. Each model is expanded into three rows \(L1, L2, L3\)\. Values are average executed steps computed from `results/summary_json_collection/non_interactive`\. The Avg\. column is the weighted average run length within that difficulty level\. ‘–’ means the model has no successful tasks in that software/bucket combination at that difficulty level\.

Table 7: Per\-software average run length on failed non\-interactive DeskCraft tasks by difficulty level\. Each model is expanded into three rows \(L1, L2, L3\)\. Values are average executed steps computed from `results/summary_json_collection/non_interactive`\. The Avg\. column is the weighted average run length within that difficulty level\. ‘–’ means the model has no tasks in that software/bucket combination at that difficulty level\.

Table [4](https://arxiv.org/html/2606.03103#A4.T4) reveals two complementary patterns\. On the one hand, performance degrades consistently from L1/L2 to L3 across nearly all model families, but the magnitude of the drop is highly uneven across applications\. The two frontier models remain the only systems with broad non\-trivial L3 coverage, yet even they exhibit clear application\-specific bottlenecks: GPT\-5\.4 falls to 9\.5% on average at L3, while Kimi\-K2\.6 retains a stronger 21\.6%, with particularly visible advantages on Chrome, Inkscape, Blender, and OS tasks\. On the other hand, most open\-source generalist VLMs and GUI\-specialized foundation models show limited transfer beyond easier settings: several models retain modest L1 competence, but their L3 success rates collapse to near zero, suggesting that scaling desktop\-task difficulty stresses capabilities that are not recovered by lightweight GUI specialization alone\.
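The weighted Avg\. column can be read as a task-count-weighted mean over applications, which differs from a plain mean of per-software rates when applications have different numbers of tasks. The sketch below assumes that weighting (the text does not spell out the weights) and uses made-up counts rather than benchmark data.

```python
def weighted_success_rate(per_software):
    """per_software maps software name -> (num_success, num_tasks)."""
    total_tasks = sum(n for _, n in per_software.values())
    if total_tasks == 0:
        return None
    total_success = sum(s for s, _ in per_software.values())
    return 100.0 * total_success / total_tasks


def macro_success_rate(per_software):
    """Unweighted mean of per-software rates, shown only for contrast."""
    rates = [100.0 * s / n for s, n in per_software.values() if n > 0]
    return sum(rates) / len(rates) if rates else None


# Illustrative counts only (assumed, not taken from Table 4).
toy = {"GIMP": (1, 4), "Chrome": (6, 16)}
weighted = weighted_success_rate(toy)  # 7 successes over 20 tasks
macro = macro_success_rate(toy)        # mean of 25.0 and 37.5
```

Under this reading, an application with many tasks moves the Avg\. value more than one with few tasks, so per-software columns and the average should be interpreted together.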

Table [5](https://arxiv.org/html/2606.03103#A4.T5) shows that harder tasks are associated with longer trajectories for nearly all agents, but the way run length grows is diagnostically different across model families\. For the frontier models, the growth from L1 to L3 is substantial but still paired with non\-trivial success, suggesting that these models do exploit longer horizons to solve more complex workflows\. By contrast, many weaker open\-source agents already consume long trajectories at L1 and then approach near\-budget\-length runs at L2/L3 while achieving little accuracy\. This pattern suggests that poor performance is not simply caused by being “cut off too early”; many weaker agents already spend ample steps without converting them into successful completions\.

A related pattern from Table [5](https://arxiv.org/html/2606.03103#A4.T5) is that average run length varies strongly by software even within the same difficulty level\. GIMP, Blender, Kdenlive, and UI generation tasks often induce markedly longer trajectories than office\-style tasks, especially at L2 and L3\. This supports the interpretation that professional desktop workflows impose not only more actions, but also more expensive error recovery: once an agent deviates in these environments, returning to the intended state often requires several additional interaction steps\.

At the same time, Table [6](https://arxiv.org/html/2606.03103#A4.T6) shows that correct trajectories remain comparatively sparse for weaker open\-source models, especially beyond L1\. Where such models do succeed, the successful trajectories are often concentrated in a small subset of applications and difficulty levels, implying that their main limitation is not only inefficiency but also narrow solvable\-task coverage\. In other words, the challenge is not simply to make successful runs shorter; many models still need a substantial increase in task\-solving breadth before trajectory efficiency becomes the dominant concern\.

Table [7](https://arxiv.org/html/2606.03103#A4.T7) shows that failed trajectories are frequently as long as, or longer than, successful ones, especially on harder tasks\. For GPT\-5\.4 and Kimi\-K2\.6, the average failed trajectory length at L3 exceeds the corresponding successful trajectory length, indicating that many failures are not early termination failures but rather long runs that drift away from the target state and continue acting without effective recovery\. This pattern is even stronger for several weaker agents, whose failed runs often approach the step budget across many applications while producing near\-zero accuracy\.

## Appendix E Task Sourcing and Asset Statistics

This section provides detailed statistics on the provenance of all benchmark tasks and their associated resource files\.

### E\.1 Task Source Distribution

Table [8](https://arxiv.org/html/2606.03103#A5.T8) reports the source provenance of the 386 standard \(non\-interactive\) tasks\. We categorize sources into four types: *Official Documentation* \(application manuals and reference guides\), *Tutorials* \(step\-by\-step guides and video walkthroughs\), *Web Resources* \(frontend design challenges and developer references\), and *Author\-Designed* \(original workflows designed by annotators based on professional use cases\)\.

Table 8: Source provenance of the 386 standard tasks\.

### E\.2 Reference Documentation Sites

Table [9](https://arxiv.org/html/2606.03103#A5.T9) lists the primary documentation and tutorial sites from which task workflows were extracted\. In total, we reference 224 unique URLs across these sources\.

Table 9: Top reference sites by number of tasks sourced\.

### E\.3 Per\-Application Task and Asset Breakdown

Table [10](https://arxiv.org/html/2606.03103#A5.T10) reports the number of tasks per difficulty level, including the interactive split, and the number of unique asset files for each application domain\.

Table 10: Per\-application breakdown of task difficulty levels and curated assets\. Asset counts reflect unique files uploaded to the VM as task inputs\.

### E\.4 Asset File Format Distribution

The 279 unique asset files span 19 file formats\. Table [11](https://arxiv.org/html/2606.03103#A5.T11) reports the distribution\. Assets are sourced through two channels: \(1\) downloaded from public repositories and stock media sites \(e\.g\., video clips, stock photographs, open\-source SVG templates\); and \(2\) manually created by annotators to meet specific task requirements \(e\.g\., multi\-track audio projects, layered Blender scenes, structured spreadsheets with formula dependencies\)\.

Table 11: Distribution of asset file formats across all tasks\.

## Appendix F Dataset Construction Details

This appendix gives implementation\-level details of the dataset construction process\. Unlike Section [4](https://arxiv.org/html/2606.03103#S4), which summarizes the benchmark construction pipeline and aggregate statistics, this section focuses on how the task\-design documents were converted into executable JSON tasks, assets, and evaluators for each desktop domain\.

### F\.1 Task\-Design Documents as Construction Blueprints

For each application domain, we first wrote a task\-design document before creating the final task JSON files\. Each document served as a construction blueprint\. It specified the supported application launch command, the available resource pool, the admissible difficulty levels, the expected output artifact, and the evaluator family that would make the task automatically checkable\. This design\-first step prevented tasks from being selected only because they sounded natural; a task was kept only if the design document could identify a deterministic artifact and a programmatic check for it\.

The documents also fixed domain\-specific conventions\. For example, Inkscape tasks use the absolute binary path `/usr/bin/inkscape` followed by a short GUI\-initialization sleep; Audacity tasks use `/usr/bin/audacity` and require the final WAV or `.aup3` project to be saved in a predictable location; Blender tasks use `/snap/bin/blender` both for launching the editor and for background verification; and Chrome tasks start the browser with a remote\-debugging port plus a local forwarding process so that the evaluator can query browser state\.

### F\.2 Application\-Specific Resource Pools

The resource pools were built to match the verification affordances of each application\. Vector\-design tasks use SVG files with stable element IDs, layer labels, shape names, and text IDs, so evaluators can inspect XML structure rather than compare screenshots\. Image\-editing tasks use photographs, product images, textures, transparent graphics, masks, and SVG icons, enabling tasks such as e\-commerce cutouts, poster design, magazine covers, callout annotations, and multi\-format exports\. Video\-editing tasks use short clips with known resolution, frame rate, and orientation, plus music and sound effects, so project\-file checks and rendered\-output checks can be combined\.

For domains whose artifacts are structured documents, the resources are paired with reference outputs\. Writer tasks use `.docx` files and gold documents; Calc tasks use spreadsheets paired with gold workbooks or CSV files; Impress tasks use slide decks paired with gold decks or attribute\-level rules\. The purpose of these gold files is not to encourage pixel\-level imitation, but to make formatting, structure, and content changes inspectable at the native file level\. For system and developer\-workflow tasks, the resource pool is often created dynamically by task setup commands: the config block writes directory trees, project files, handoff notes, test suites, local HTML briefs, JSON data, or starter code immediately after VM reset\.

### F\.3 Difficulty Calibration Rules

The design documents use the L1/L2/L3 labels as construction constraints rather than post\-hoc tags\. L1 tasks isolate one operation with a direct target and a single dominant artifact property, such as changing a text size, freezing a spreadsheet row, adding a transition, exporting a WAV file, or toggling a browser setting\. The task should be completable through a short path and should not require the agent to coordinate multiple regions of an artifact\.

L2 tasks compose a small number of related operations around one practical scenario\. Examples include adding formulas and sorting a sheet, styling a document section while appending one paragraph, creating a local web component from a starter bundle, or placing video clips with a simple transition\. The important construction rule is that L2 tasks should require planning across several GUI actions, but their final state should still be expressible as one compact evaluator target\.

L3 tasks represent full delivery workflows\. They require multiple dependent edits, cross\-region consistency, and often more than one final artifact\. The task\-design documents repeatedly use this pattern: a user must produce a finished deliverable while also preserving a reusable project file or bundle\. Examples include GIMP tasks that require both exported images and an organized XCF project; Blender tasks that combine scene edits, materials, cameras, and render settings; Calc tasks that add derived columns, sort records, and create summary sheets; Impress tasks that apply global slide rules and slide\-specific edits; and UI\-generation tasks that require source files, a manifest, local assets, JavaScript behavior, and a browser preview\.

### F\.4 Evaluator Design by Artifact Type

Evaluator design was driven by the native artifact rather than by a uniform visual metric\. SVG tasks are checked by parsing XML with namespace\-aware lookup, including style attributes, direct attributes, layer labels, transforms, paths, gradients, filters, text spans, and element order\. Office\-document tasks are checked by loading the native document format and comparing content, formatting, tables, sheets, slide counts, notes, backgrounds, or workbook properties against reference outputs or explicit rules\. Spreadsheet evaluators use rule lists so one task can jointly check sheet names, cell values, frozen panes, styles, charts, and data\-validation constraints\.
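As an illustration of namespace-aware SVG inspection, the following sketch reads Inkscape layer labels and resolves a property from either the `style` attribute or a direct attribute using Python's standard `xml.etree.ElementTree`. The helper names are hypothetical and the benchmark's actual evaluators may differ; the two namespace URIs are the standard SVG and Inkscape ones.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
INKSCAPE_NS = "http://www.inkscape.org/namespaces/inkscape"


def parse_style(style):
    """Turn 'fill:#f00; stroke:none' into {'fill': '#f00', 'stroke': 'none'}."""
    props = {}
    for part in style.split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            props[key.strip()] = value.strip()
    return props


def get_property(elem, name):
    """Prefer the inline style value, then fall back to a direct attribute."""
    style = parse_style(elem.get("style", ""))
    return style.get(name, elem.get(name))


def find_by_id(root, elem_id):
    """Return the first element with the given id, or None."""
    for elem in root.iter():
        if elem.get("id") == elem_id:
            return elem
    return None


def layer_labels(root):
    """List inkscape:label values of groups marked as Inkscape layers."""
    mode_key = f"{{{INKSCAPE_NS}}}groupmode"
    label_key = f"{{{INKSCAPE_NS}}}label"
    return [
        g.get(label_key)
        for g in root.iter(f"{{{SVG_NS}}}g")
        if g.get(mode_key) == "layer"
    ]
```

Looking up elements by stable ID and qualified attribute names is what lets a check of this kind avoid screenshot comparison.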

Media and graphics applications require different strategies\. Audacity tasks analyze exported WAV files with signal\-level checks such as duration, sample rate, channel count, RMS level, silence windows, fades, peak amplitude, and track metadata from `.aup3` SQLite projects\. Kdenlive tasks parse project XML for imported media, timeline placement, project profiles, transitions, and effect settings, while rendered videos can additionally be checked with media metadata\. Blender tasks cannot be reliably inspected as plain text, so the evaluator runs Blender in background mode with a Python script that queries the scene graph through `bpy` and emits structured JSON for the metric to judge\.
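A few of the signal-level WAV checks named above (duration, sample rate, channel count, RMS level, peak amplitude) can be sketched with the standard-library `wave` module. This is an assumed minimal version that handles 16-bit PCM only; the function name `wav_stats` and the normalization to the range 0 to 1 are choices made here, not details from the paper.

```python
import math
import struct
import wave


def wav_stats(source):
    """Compute basic statistics for a 16-bit PCM WAV (path or file object)."""
    with wave.open(source, "rb") as w:
        channels = w.getnchannels()
        sample_rate = w.getframerate()
        sample_width = w.getsampwidth()
        n_frames = w.getnframes()
        raw = w.readframes(n_frames)
    if sample_width != 2:
        raise ValueError("this sketch only handles 16-bit PCM")
    # Samples are little-endian signed 16-bit, interleaved across channels.
    samples = struct.unpack(f"<{len(raw) // 2}h", raw)
    if samples:
        rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0
        peak = max(abs(s) for s in samples) / 32768.0
    else:
        rms = peak = 0.0
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "duration_s": n_frames / sample_rate,
        "rms": rms,
        "peak": peak,
    }
```

A metric could then compare these values against task-specific tolerances, for example requiring a mono export at a given sample rate with a peak below a clipping threshold.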

Browser, OS, VS Code, and multi\-application tasks use state\-oriented evaluators\. Chrome tasks read settings files, browser databases, active tabs, URLs, HTML content, bookmarks, cookies, history, exported files, or desktop shortcuts\. OS tasks collect deterministic `key=value` evidence from shell commands and leave pass/fail logic to Python metrics, which avoids embedding fragile evaluator logic in shell snippets\. VS Code tasks inspect JSON configuration files, keybindings, snippets, workspace files, project `.vscode` files, and installed\-extension lists\. Multi\-application tasks combine these checks with conjunction: for example, a task may require a specific file edit, a passing Python test suite, and Chrome left open on the relevant documentation page\.
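The split between shell-side evidence collection and Python-side judgment might look like the sketch below. The key names and the two helper functions are hypothetical; the point is only that the shell emits flat `key=value` lines and all pass/fail logic, including the conjunction of checks, lives in Python.

```python
def parse_evidence(stdout):
    """Parse 'key=value' lines from shell output into a dict."""
    evidence = {}
    for line in stdout.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # ignore blank lines and anything that is not evidence
        key, value = line.split("=", 1)
        evidence[key.strip()] = value.strip()
    return evidence


def all_checks_pass(evidence, expected):
    """Conjunction: every expected key must be present with the exact value."""
    return all(evidence.get(key) == value for key, value in expected.items())


# Example output from a hypothetical evidence-collection command.
shell_output = "file_exists=1\nmode=644\nowner=user\n"
evidence = parse_evidence(shell_output)
```

Splitting on the first `=` only keeps values that themselves contain `=` intact, such as environment assignments or URLs with query strings.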

### F\.5 JSON Instantiation and VM Setup

Each final task JSON is instantiated from the corresponding design document using the same core structure: upload or create resources, launch the target application, optionally wait for initialization, then declare the evaluator result getter, expected state, and metric\. File\-editing tasks typically use `upload_file` followed by an application launch or an `open` action\. System tasks more often use `execute` steps to construct the initial state inside the VM\. Chrome and UI\-generation tasks may additionally open local `file://` briefs, start a local preview server, launch VS Code on a target project folder, or keep Chrome on a final preview URL\.
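The core structure described above could be represented roughly as follows. Apart from the step names `upload_file`, `execute`, and `open` and the Inkscape binary path, every field name, file path, and metric name in this sketch is a hypothetical stand-in, since the actual task schema is not reproduced here.

```python
REQUIRED_KEYS = ("instruction", "config", "evaluator")
RESOURCE_STEPS = {"upload_file", "execute"}
LAUNCH_STEPS = {"launch", "open"}  # "launch" is an assumed step name

# Hypothetical task in the order: resources, launch, wait, evaluator.
example_task = {
    "instruction": "Change the title text to 24 px and save the file.",
    "config": [
        {"type": "upload_file", "path": "/home/user/poster.svg"},
        {"type": "launch", "command": ["/usr/bin/inkscape", "/home/user/poster.svg"]},
        {"type": "sleep", "seconds": 5},
    ],
    "evaluator": {
        "result": {"type": "vm_file", "path": "/home/user/poster.svg"},
        "expected": {"element_id": "title", "font-size": "24px"},
        "metric": "check_svg_attributes",
    },
}


def validate_task(task):
    """Return a list of structural problems; an empty list means none found."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in task]
    if problems:
        return problems
    types = [step.get("type") for step in task["config"]]
    launches = [i for i, t in enumerate(types) if t in LAUNCH_STEPS]
    resources = [i for i, t in enumerate(types) if t in RESOURCE_STEPS]
    if not launches:
        problems.append("no launch or open step")
    elif resources and max(resources) > min(launches):
        problems.append("resource created after application launch")
    for key in ("result", "expected", "metric"):
        if key not in task["evaluator"]:
            problems.append(f"evaluator missing: {key}")
    return problems
```

A check of this kind only guards the shape of a task file; whether the evaluator is meaningful still depends on the design document behind it.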

Post\-evaluation setup is also encoded in JSON\. Office tasks activate the document window and send a save shortcut before downloading the edited file\. GIMP, Audacity, Kdenlive, and Blender tasks require fixed export paths so the getter can retrieve the result without guessing\. UI\-generation tasks often zip the project directory during postconfig, producing one bundle that can be checked for required files, manifest fields, DOM selectors, local asset links, forbidden remote\-image URLs, and JavaScript patterns\. These postconfig steps do not solve the task for the agent; they only normalize the final artifact so the evaluator sees the saved state\.
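For the zipped UI-generation bundle, a check over required files, manifest fields, and forbidden remote-image URLs can be sketched with the standard `zipfile`, `json`, and `re` modules. The file name `manifest.json`, the function name `check_bundle`, and the simple `<img>` regular expression are assumptions; a real evaluator would likely parse the HTML rather than pattern-match it.

```python
import json
import re
import zipfile

# Matches <img ... src="http://..."> or https, as a rough remote-image test.
REMOTE_IMG = re.compile(r'<img[^>]+src=["\']https?://', re.IGNORECASE)


def check_bundle(zip_source, required_files, manifest_fields):
    """Return a list of problems found in a zipped project bundle."""
    problems = []
    with zipfile.ZipFile(zip_source) as bundle:
        names = set(bundle.namelist())
        for name in required_files:
            if name not in names:
                problems.append(f"missing file: {name}")
        if "manifest.json" in names:  # assumed manifest file name
            manifest = json.loads(bundle.read("manifest.json"))
            for field in manifest_fields:
                if field not in manifest:
                    problems.append(f"manifest missing field: {field}")
        for name in sorted(names):
            if name.endswith(".html"):
                html = bundle.read(name).decode("utf-8", errors="replace")
                if REMOTE_IMG.search(html):
                    problems.append(f"remote image in {name}")
    return problems
```

Because the bundle is produced by the postconfig step rather than by the agent, the same checker can be reused across tasks with different required-file lists.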

### F\.6 Interactive\-Task Construction

Interactive tasks are derived from the same task families but split into phase\-level user messages\. The design documents avoid treating interaction as free\-form chat\. Instead, each interactive task has a scenario type, such as ambiguity, progressive refinement, requirement change, interruption, correction, or multi\-step workflow\. Each phase has a user message and a phase\-completion condition\. This structure lets the benchmark test whether an agent can ask for missing information, incorporate late constraints, recover from feedback, or continue a staged workflow without losing earlier requirements\.
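
The phase structure can be sketched as a small scripted\-user state machine. The trigger names agent\_asks, agent\_done, and step\_count follow the representative cases in Appendix G; the class and method names, the `"step"` event label, and the omission of phase\-completion conditions are simplifying assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Phase:
    message: str                      # user message delivered when the phase fires
    trigger: str                      # "agent_asks", "agent_done", or "step_count"
    step_count: Optional[int] = None  # only used when trigger == "step_count"


def should_fire(phase: Phase, event: str, step: int) -> bool:
    """Decide whether a phase's trigger is satisfied by the current event and step."""
    if phase.trigger == "step_count":
        return phase.step_count is not None and step >= phase.step_count
    return phase.trigger == event


class InteractionScript:
    """Delivers phase messages in order, one at a time, as their triggers fire."""

    def __init__(self, phases: List[Phase]) -> None:
        self._phases = list(phases)
        self._next = 0

    def poll(self, event: str, step: int) -> Optional[str]:
        """Return the next pending user message if its trigger fires, else None."""
        if self._next >= len(self._phases):
            return None
        phase = self._phases[self._next]
        if should_fire(phase, event, step):
            self._next += 1
            return phase.message
        return None
```

For example, an interruption task would use a first phase with `trigger="step_count"` and a second phase with `trigger="agent_done"`, so the late constraint arrives mid\-execution and the delivery requirement arrives only after the agent signals completion.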

### F\.7 Quality\-Control Checks

The design documents include several quality\-control filters before a task is released\. First, the instruction must name the target artifact and final save or export requirement clearly enough for deterministic evaluation\. Second, the uploaded or generated resource must match the evaluator result path, so the agent is not evaluated on a different file from the one it was asked to edit\. Third, evaluator rules must use observable properties of the native artifact, not subjective judgments such as whether a design “looks good\.” When visual quality matters, the task converts it into checkable constraints such as canvas size, required text, layer names, local asset references, slide counts, or signal\-level audio properties\.

Finally, task\-design documents were used to remove or revise weak tasks\. Common rejection reasons include duplicated capability coverage, prompts whose source or target is ambiguous, tasks that require manual visual judgment, evaluators that only check file existence, and multi\-application tasks where one application is opened only as a decorative step\. The retained tasks therefore reflect both domain realism and evaluator feasibility: each task should exercise a meaningful desktop workflow and leave behind enough machine\-readable evidence for reproducible scoring\.

## Appendix G Representative Task Cases

This section gives representative examples from the final task set\. We choose cases from Inkscape, Blender, Kdenlive, Audacity, Writer, Calc, Impress, and Multi\-app, covering L1, L2, L3 and interactive tasks\.

Case 1: Inkscape L1 Typography Edit

Source: The task is derived from the Inkscape manual entry for text toolbar font\-size editing\.

Instruction: Open /home/user/Documents/text\_hello\.svg in Inkscape, change the title text font size to 72 pixels, and save the SVG\.

Capability Tested: This case tests atomic GUI grounding and precise text\-property editing\. It matches a common design\-maintenance scenario where a user asks for a single typographic adjustment in an existing vector asset without changing the rest of the composition\.

Uploaded Resources: An existing SVG design file containing editable title text\.

Evaluator: The evaluator retrieves the saved SVG, locates the title text element, reads the font\-size from the SVG text structure, and accepts the result when the value is within a small tolerance of 72 pixels\.
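
A check of this kind could be sketched as follows. This is an assumption about how a font\-size evaluator might be written, not the released code: the element id `title`, the 0\.5 px tolerance, and the function names are hypothetical, and only pixel or unitless sizes are handled.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

SVG_NS = "{http://www.w3.org/2000/svg}"
# Accept "font-size:72px" or "font-size:72" inside a style attribute; other units are rejected.
STYLE_FONT_SIZE = re.compile(r"font-size\s*:\s*([0-9.]+)(px)?\s*(;|$)")
ATTR_FONT_SIZE = re.compile(r"\s*([0-9.]+)(px)?\s*")


def read_font_size_px(element: ET.Element) -> Optional[float]:
    """Read a pixel font size from the font-size attribute or the style attribute."""
    attr = element.get("font-size")
    if attr is not None:
        match = ATTR_FONT_SIZE.fullmatch(attr)
        if match:
            return float(match.group(1))
    match = STYLE_FONT_SIZE.search(element.get("style", ""))
    return float(match.group(1)) if match else None


def check_font_size(svg_text: str, element_id: str,
                    expected_px: float = 72.0, tolerance: float = 0.5) -> bool:
    """Return True if the identified <text> element has the expected font size."""
    root = ET.fromstring(svg_text)
    for text in root.iter(SVG_NS + "text"):
        if text.get("id") != element_id:
            continue
        # The size may be stored on <text> itself or on a child <tspan>.
        for node in [text, *text.iter(SVG_NS + "tspan")]:
            size = read_font_size_px(node)
            if size is not None:
                return abs(size - expected_px) <= tolerance
    return False


SAMPLE_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<text id="title" style="font-size:72px;fill:#000"><tspan>Hello</tspan></text>'
    "</svg>"
)
```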

Case 2: Blender L2 Material Texture\-Node Setup

Source: The case is based on the Blender manual sections for Image Texture nodes and the Principled BSDF shader\.

Instruction: Open /home/user/Documents/scene\.blend, select the Cube with material CubeMaterial, connect texture\_brick\.jpg to Base Color, connect normal\_brick\.jpg through a Normal Map node to the shader Normal input, and save the file\.

Capability Tested: This case tests whether the agent can perform a small but dependent look\-development workflow: it must open the shader graph, add multiple nodes, load the correct image files, and connect each node to the correct socket\. It is L2 because the final state depends on multiple coordinated edits rather than a single scalar setting\.

Uploaded Resources: A prepared Blender project with a UV\-unwrapped cube, plus a color texture, a normal texture, and an inspection script used by the evaluator\.

Evaluator: The evaluator runs Blender in background mode, extracts a structured summary of the material node graph, and verifies that the cube material contains the expected color and normal textures connected to the intended shader inputs\.
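
The verification step on the extracted summary might look like the sketch below. The summary layout (a `nodes` dictionary and a `links` list) and the function names are hypothetical, since the format produced by the inspection script is not specified here; the node type identifiers and socket names follow Blender's usual naming.

```python
from typing import Any, Dict, Optional

Summary = Dict[str, Any]


def has_link(summary: Summary, from_type: str, to_type: str,
             to_socket: str, image: Optional[str] = None) -> bool:
    """Return True if some link joins a node of from_type to to_socket on a node of to_type."""
    nodes = summary["nodes"]
    for link in summary["links"]:
        src, dst = nodes[link["from_node"]], nodes[link["to_node"]]
        if src["type"] != from_type or dst["type"] != to_type:
            continue
        if link["to_socket"] != to_socket:
            continue
        if image is not None and src.get("image") != image:
            continue
        return True
    return False


def check_material(summary: Summary) -> bool:
    """Check the color and normal texture wiring required by the task."""
    return (
        has_link(summary, "TEX_IMAGE", "BSDF_PRINCIPLED", "Base Color", "texture_brick.jpg")
        and has_link(summary, "TEX_IMAGE", "NORMAL_MAP", "Color", "normal_brick.jpg")
        and has_link(summary, "NORMAL_MAP", "BSDF_PRINCIPLED", "Normal")
    )


# Hypothetical summary of a correctly wired material.
EXAMPLE_SUMMARY: Summary = {
    "nodes": {
        "Image Texture": {"type": "TEX_IMAGE", "image": "texture_brick.jpg"},
        "Image Texture.001": {"type": "TEX_IMAGE", "image": "normal_brick.jpg"},
        "Normal Map": {"type": "NORMAL_MAP"},
        "Principled BSDF": {"type": "BSDF_PRINCIPLED"},
    },
    "links": [
        {"from_node": "Image Texture", "to_node": "Principled BSDF", "to_socket": "Base Color"},
        {"from_node": "Image Texture.001", "to_node": "Normal Map", "to_socket": "Color"},
        {"from_node": "Normal Map", "to_node": "Principled BSDF", "to_socket": "Normal"},
    ],
}
```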

Case 3: Kdenlive L3 Multi\-Clip Render with Transitions

Source: The case is derived from the Kdenlive render/export documentation\.

Instruction: Open Kdenlive, import three video clips, place them consecutively on the timeline, add Dissolve transitions between adjacent clips, save /home/user/Videos/project\.kdenlive, and render /home/user/Videos/output\.mp4\.

Capability Tested: This case tests an end\-to\-end short\-video assembly workflow\. The agent must manage the project bin, sequence clips on the timeline, insert transitions, save an editable project, and produce the final rendered MP4\. It is L3 because success depends on a chain of mutually dependent editing and delivery steps\.

Uploaded Resources: Three short source video clips that must be assembled into one timeline\.

Evaluator: The evaluator retrieves both the rendered video and the saved project file\. It checks the rendered file duration and codec, then parses the Kdenlive project XML to confirm that all source clips appear in the project and that a dissolve\-style transition exists\.

Case 4: Audacity L3 Structured Audio Cleanup

Source: The case is derived from the Audacity tutorial for editing an existing file\.

Instruction: Open /home/user/Documents/long\_test\.wav, delete the 40–50 second section, insert a five\-second silent break at 20 seconds, apply a three\-second fade\-in and three\-second fade\-out, export complex\_edit\.wav, and save the project\.

Capability Tested: This case tests a post\-production cleanup workflow: the agent must combine destructive timeline editing, silence insertion, audio effects, and export\. It models a user preparing a revised audio deliverable with both structural edits and smoother boundaries\.

Uploaded Resources: A long audio recording that contains material to cut, fade, and export\.

Evaluator: The evaluator checks the exported WAV with a conjunction of audio analyses: duration matching, low\-RMS silence around the inserted break, and monotonic RMS changes over the beginning and ending windows to verify fade\-in and fade\-out\.
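
The three audio analyses can be sketched on a list of decoded samples. The thresholds, window counts, and function names below are assumptions for illustration rather than the benchmark's settings. The expected duration follows from the instruction: deleting 10 seconds and inserting 5 seconds shortens the file by 5 seconds, and the inserted silence occupies 20–25 seconds of the output.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def window_rms(samples: Sequence[float], rate: int,
               start_s: float, end_s: float, n_windows: int) -> List[float]:
    """RMS of n_windows equal, consecutive windows between start_s and end_s."""
    start, end = int(start_s * rate), int(end_s * rate)
    size = (end - start) // n_windows
    return [rms(samples[start + i * size:start + (i + 1) * size]) for i in range(n_windows)]


def check_edit(samples: Sequence[float], rate: int, original_duration_s: float,
               tol_s: float = 0.1, silence_rms: float = 1e-3) -> bool:
    """Conjunction of duration, silence, and fade checks for the edited audio."""
    duration = len(samples) / rate
    expected = original_duration_s - 10.0 + 5.0  # cut 10 s, insert 5 s of silence
    if abs(duration - expected) > tol_s:
        return False
    if rms(samples[int(20 * rate):int(25 * rate)]) > silence_rms:
        return False  # the inserted break is not silent
    fade_in = window_rms(samples, rate, 0.0, 3.0, 6)
    fade_out = window_rms(samples, rate, duration - 3.0, duration, 6)
    rising = all(a < b for a, b in zip(fade_in, fade_in[1:]))
    falling = all(a > b for a, b in zip(fade_out, fade_out[1:]))
    return rising and falling
```

In practice the samples would be decoded from the exported WAV first; the sketch leaves decoding out so that the scoring logic stays independent of the file format.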

Case 5: Writer L3 Policy\-Document Revision

Source: The case is author\-designed from common internal\-policy maintenance workflows\.

Instruction: Open a policy document and complete a full revision pass: center and restyle the title, convert section headings to uppercase, standardize body font and size, emphasize one policy section, add a confidentiality notice at the beginning, and append document\-control metadata at the end\.

Capability Tested: This case tests long\-horizon document editing in a word processor\. The agent must combine global formatting, targeted section formatting, text insertion at two different document positions, and preservation of the original document structure\.

Uploaded Resources: A prewritten policy document in word\-processing format\.

Evaluator: The evaluator saves the edited document and compares it against a reference document, checking both content edits and formatting\-sensitive structure such as title style, heading style, body text style, and inserted paragraphs\.

Case 6: Calc L3 Project\-Budget Analysis

Source: The case is author\-designed from project\-management reporting workflows\.

Instruction: Open a project\-budget spreadsheet and complete a full analysis workflow: add derived budget columns, sort projects by priority and spending ratio, bold the header row, create a priority summary sheet, create an at\-risk project sheet, and freeze the header row\.

Capability Tested: This case tests spreadsheet reasoning beyond cell\-level editing\. The agent must create formulas, preserve categorical ordering, sort rows under multiple keys, aggregate records into summary sheets, filter high\-risk items, and apply presentation\-oriented spreadsheet formatting\.

Uploaded Resources: A project\-tracking workbook with budgets, spending, priorities, owners, and project metadata\.

Evaluator: The evaluator compares the submitted workbook with a reference workbook\. It checks sheet names, tabular values across the original and generated sheets, header\-freeze settings, and style properties such as bold headers\.

Case 7: Impress L3 Presentation Redesign

Source: The case is author\-designed from slide\-deck polishing and classroom\-presentation revision scenarios\.

Instruction: Open a presentation about game theory and apply a multi\-slide redesign: restyle the title slide, standardize title sizes across slides, modify slide\-specific body text, add a speaker note, change a slide background, edit a table row, and delete the final slides\.

Capability Tested: This case tests whether the agent can manage a multi\-slide artifact with both global and slide\-local requirements\. It must edit styling, notes, table content, backgrounds, and deck structure while keeping the presentation coherent\.

Uploaded Resources: A prepared presentation deck with multiple slides, body text, notes, and a table\.

Evaluator: The evaluator saves the edited deck and compares it to a reference deck\. It checks slide\-level text and formatting, speaker\-note content, table edits, background changes, and whether the requested slides were removed\.

Case 8: Multi\-app L3 Web Dashboard Build

Source: The case combines patterns from web API documentation, canvas\-chart tutorials, and dashboard\-style frontend design challenges\.

Instruction: Use a local brief in Chrome and a project folder in VS Code to build a previewable team\-health dashboard\. The agent must create HTML, CSS, JavaScript, and manifest files; load local JSON data; render a hero section, filters, cards, a risk timeline, a canvas chart, and a detail drawer; start a local preview server; and finish with Chrome open on the local preview URL\.

Capability Tested: This case tests a realistic multi\-application development workflow\. The agent must read requirements in Chrome, edit a project in VS Code, use local assets and data, write interactive frontend code, serve the result locally, and verify the preview in the browser\.

Uploaded Resources: A local project brief, structured JSON data, and a local SVG badge asset\.

Evaluator: The evaluator checks two outcomes jointly: the active browser tab must point to the expected local preview URL, and the bundled project must contain the required files, manifest fields, DOM structure, local asset usage, data\-loading logic, canvas usage, and basic CSS layout declarations\.

Case 9: GIMP Interactive Ambiguous Annotation Request

Source: The case is drawn from a product\-explanation workflow in which the user starts with an underspecified request and then clarifies the required callouts, footer, and deliverables\.

Instruction: Open /home/user/Desktop/product\_camera\_90946\.jpg in GIMP\. The initial user request is “make an annotated camera explainer,” so the agent is expected to ask a clarification question before editing\.

Phase 1 \(trigger: agent\_asks\)\. Keep the original resolution, add callouts for Lens, Grip, and Mode Dial, and add a semi\-transparent footer note bar\.

Phase 2 \(trigger: agent\_done\)\. Export /home/user/Desktop/camera\_annotation\.png, save /home/user/Desktop/camera\_annotation\.xcf, and preserve the required layer names\.

Capability Tested: This case tests interactive ambiguity handling rather than pure execution\. The agent must recognize that the first instruction is not specific enough, request clarification at the right time, and then carry out a multi\-layer image annotation workflow that remains structurally verifiable\.

Uploaded Resources: A single product photo of a camera\. The interactive clarification supplies the target labels, footer requirement, export paths, and required GIMP layer names\.

Evaluator: The evaluator checks the exported PNG and saved XCF jointly\. It verifies that the deliverables exist, that the edited artifact preserves the required output structure, and that the XCF contains the mandated layer names Base\_Image, Callout\_1, Callout\_2, Callout\_3, and Footer\_Note\.

Case 10: Inkscape Interactive Mid\-Task Interruption

Source: The case is drawn from creative design workflows where a user adds late layout constraints after the first draft has already begun\.

Instruction: Open /home/user/Documents/poster\_template\.svg in Inkscape\. Start a first draft by changing the title to INKSCAPE WORKSHOP, the subtitle to 2026 SPRING, and the background to \#0b1d3a\.

Phase 1 \(trigger: step\_count = 5\)\. After the interruption, resize the document to 1080×1080 and change the footer text to Scan to Register\.

Phase 2 \(trigger: agent\_done\)\. Export /home/user/Documents/workshop\_square\.png at width 1080\.

Capability Tested: This case tests whether the agent can continue from the current editing state rather than restarting from scratch when the instruction changes mid\-trajectory\. The task combines text editing, color editing, document\-level resizing, and final export under an interruption protocol\.

Uploaded Resources: A prepared SVG poster template with editable title, subtitle, footer, and page background elements\.

Evaluator: The evaluator inspects both the edited SVG and the exported PNG\. It checks the updated text fields, the page\-background fill, the resized document dimensions, and the existence of the exported PNG, thereby validating both state updates from before the interruption and the late\-added delivery requirement\.

## Appendix H Example Analysis

### H\.1 Case 1: GIMP Camera Poster Task

![[Uncaptioned image]](https://arxiv.org/html/2606.03103v1/acl_case_figures_share/case1_figure_page_01.jpg)

Figure A1\. Case 1 image page 1\.

![[Uncaptioned image]](https://arxiv.org/html/2606.03103v1/acl_case_figures_share/case1_figure_page_02.jpg)

Figure A2\. Case 1 image page 2\.

![[Uncaptioned image]](https://arxiv.org/html/2606.03103v1/acl_case_figures_share/case1_figure_page_03.jpg)

Figure A3\. Case 1 image page 3\.

![[Uncaptioned image]](https://arxiv.org/html/2606.03103v1/acl_case_figures_share/case1_figure_page_04.jpg)

Figure A4\. Case 1 image page 4\.

![[Uncaptioned image]](https://arxiv.org/html/2606.03103v1/acl_case_figures_share/case1_figure_page_05.jpg)

Figure A5\. Case 1 image page 5\.

Evaluation Analysis

Task Instruction: Please create a 1600x1000 camera poster base using /home/user/Desktop/product\_camera\_90946\.jpg and /home/user/Desktop/texture\_metal\_220182\.jpg\. Use the metal texture as background, place the camera slightly right of center, add the headline PRO SHOOT, and keep the XCF layer names exactly BG\_Metal, Camera\_Main, Shadow, Vignette, and Title\.

Benchmark Outcome: This run succeeds with result\.txt equal to 1\.0\. The final artifact satisfies both evaluator\-facing requirements: the poster composition is correct at the visual level, and the saved XCF contains the exact required layer set\.

Phase\-by\-Phase Trajectory

Phase 1: Canvas and Background Setup \(steps 1–14\)\. The agent first handles the RGB conversion dialog, creates the required 1600×1000 canvas, and imports texture\_metal\_220182\.jpg as a layer\. This phase is relatively clean\. The reasoning is correct at the composition level: the agent understands early that it needs a fresh poster canvas rather than editing the source photo directly\.

Phase 2: Layer\-Naming Failure and Recovery \(steps 15–29\)\. The first serious difficulty appears when the agent tries to rename the imported texture layer\. It spends multiple steps double\-clicking, right\-clicking, and searching the Layer menu, but none of those attempts trigger editable layer\-name state in the panel\. Instead of continuing indefinitely, it pivots to Filters → Script\-Fu → Console and uses a deterministic rename command to set the layer name to BG\_Metal\. This is the first key recovery event in the rollout, and it matters because the task is evaluated not only on image appearance but also on exact layer names\.

Phase 3: Camera Placement and Foreground Cleanup \(steps 30–54\)\. The agent imports the camera asset, renames it to Camera\_Main via Script\-Fu, scales it, and positions it slightly right of center\. After that, it notices that the camera still carries a bright background halo from the source image\. To address this, it adds an alpha channel, switches to Select by Color, deletes the light background region, and deselects the result\. This is an imperfect but coherent cleanup sequence: the agent is reading the rendered poster state and correcting the foreground extraction before moving to stylistic finishing layers\.

Phase 4: Shadow Construction \(steps 55–73\)\. For the Shadow layer, the agent again starts in Script\-Fu\. It creates a shadow layer from the camera alpha, but the first attempt to blur it fails because the procedure name is wrong\. The recovery is again instructive: the agent stops forcing the scripting route, switches back to the GUI, applies Gaussian blur in the dialog, reorders the shadow beneath Camera\_Main, and offsets it down\-right with the Move tool\. This phase shows a useful pattern of behavior: the agent does not insist on one control channel when another one is better suited to the subtask\.

Phase 5: Vignette Construction \(steps 74–94\)\. The vignette is built through a similar hybrid strategy\. The agent first creates a solid black Vignette layer in Script\-Fu, then realizes that a flat black overlay is not enough\. It exits to the GUI, adds a layer mask, changes the Gradient tool to radial mode, and draws a center\-to\-edge gradient on the mask so that only the borders remain darkened\. This phase is structurally important because it shows the agent constructing a nontrivial effect through layered reasoning rather than treating the vignette as a single click\.

Phase 6: Title Styling and Final Structural Cleanup \(steps 95–125\)\. The agent creates the PRO SHOOT text in the GUI, recolors it to white, applies bold styling, and increases the size to 120 px\. When direct layer renaming fails again, it returns to Script\-Fu, renames the text layer to Title, raises it above Camera\_Main, and removes the extra default Background layer so that the XCF contains exactly the required five\-layer set\. It then saves the final artifact as /home/user/Desktop/camera\_poster\.xcf\.

Evaluator Takeaway: This is a strong success case for staged recovery\. The trajectory is inefficient, especially around layer renaming, but the agent repeatedly notices when direct manipulation is brittle and replaces it with a more deterministic fallback\. The final artifact satisfies both evaluation axes: the poster looks structurally correct, and the saved XCF matches the exact layer inventory required by the task\.

### H\.2 Case 2: Interactive Kdenlive Requirement\-Change Task

![[Uncaptioned image]](https://arxiv.org/html/2606.03103v1/acl_case_figures_share/case2_figure_page_01.jpg)

Figure A6\. Case 2 image page 1\.

![[Uncaptioned image]](https://arxiv.org/html/2606.03103v1/acl_case_figures_share/case2_figure_page_02.jpg)

Figure A7\. Case 2 image page 2\.

![[Uncaptioned image]](https://arxiv.org/html/2606.03103v1/acl_case_figures_share/case2_figure_page_03.jpg)

Figure A8\. Case 2 image page 3\.

![[Uncaptioned image]](https://arxiv.org/html/2606.03103v1/acl_case_figures_share/case2_figure_page_04.jpg)

Figure A9\. Case 2 image page 4\.

![[Uncaptioned image]](https://arxiv.org/html/2606.03103v1/acl_case_figures_share/case2_figure_page_05.jpg)

Figure A10\. Case 2 image page 5\.

Evaluation Analysis

Task Instruction: Phase 1 starts with: import /home/user/Videos/15368811\_1920\_1080\_30fps\.mp4, place it on V1, and prepare a quick horizontal teaser draft with a short title card New Product Teaser\. At step 3, the interaction log injects a new user requirement: switch to a 1080×1920 vertical project, render an H\.264 MP4 to /home/user/Videos/teaser\_vertical\.mp4, and save the project as /home/user/Videos/teaser\_vertical\.kdenlive\.

Benchmark Outcome: This run succeeds with result\.txt equal to 1\.0\. The saved project path, rendered MP4 path, and final output geometry are all consistent with the updated interactive requirement\.

Phase\-by\-Phase Trajectory

Phase 1: Aborted Initial Plan \(steps 1–3\)\. The first phase barely becomes a workflow\. The agent opens the launcher, begins searching for Kdenlive, and then immediately receives the new user message at step 3\. That means the original horizontal\-teaser objective is effectively superseded before substantial editing begins\.

Phase 2: Environment Recovery and Tool Acquisition \(steps 4–27\)\. Once the new requirement arrives, the agent resets the application search, opens a terminal, and installs Kdenlive with sudo apt\-get update && sudo apt\-get install \-y kdenlive\. It also searches the filesystem for candidate video files and restarts Kdenlive from the shell\. This is expensive, but it is goal\-consistent: the agent treats the new vertical Kdenlive deliverable as the only relevant objective and prioritizes getting the missing tool into a usable state\.

Phase 3: Project\-Profile Engineering \(steps 42–57\)\. After the basic tooling is available, the agent turns to the 1080×1920 project\-format constraint\. It inspects /usr/share/kdenlive/profiles/, creates a custom vertical\_1080x1920\_30fps profile under ~/\.local/share/kdenlive/profiles/, prints the file back for inspection, and even opens it in nano to rewrite the content manually\. This phase is highly diagnostic: the agent externalizes a GUI configuration problem into a filesystem configuration problem\. The choice is technically plausible and shows strong goal focus, but it also reveals uncertainty and high operational cost\.

Phase 4: Temporary Artifact Bootstrapping \(steps 80–99\)\. The agent still does not fully trust the GUI path to produce a vertical project cleanly, so it manufactures helper artifacts from the terminal\. It tries several methods to write a temporary /tmp/test\_vertical\.kdenlive file, including a here\-doc XML block, a Python one\-liner, and a printf\-based fallback\. It then creates a synthetic 1080×1920 test video with ffmpeg\. This phase shows decomposition under uncertainty: the agent is trying to ensure that both evaluator\-visible object types exist, namely a project file with the correct profile metadata and a rendered MP4 with the correct geometry\.

Phase 5: Minimal GUI Assembly in Kdenlive \(steps 91–129\)\. The agent returns to Kdenlive, loads the temporary assets, drags material into the project area and timeline, creates a minimal title element, and uses save/export dialogs to produce teaser\_vertical\.kdenlive and teaser\_vertical\.mp4\. Compared with the original instruction, the content is intentionally lightweight\. The behavior here is best understood as requirement compression: once the user changes the objective, the agent stops optimizing for a richer teaser draft and instead focuses on the smallest action set that can reliably satisfy the revised deliverable specification\.

Phase 6: Terminal\-Side Verification \(steps 130–133\)\. In the final phase, the agent leaves the editor and explicitly checks the outputs from the terminal\. It lists /home/user/Videos/ and runs ffprobe on /home/user/Videos/teaser\_vertical\.mp4 to confirm the render resolution\. This is an important evaluator\-aligned behavior: the run does not terminate on a UI assumption alone, but verifies that the exported artifact has the expected geometry\.

Evaluator Takeaway: The strongest property of this trajectory is requirement re\-targeting\. After phase 2 begins, the agent no longer behaves as though the horizontal teaser matters; it reorganizes the entire rollout around vertical geometry, explicit file paths, and export verification\. The main weakness is efficiency: the run is long, workaround\-heavy, and dependent on terminal\-side profile and artifact generation\. Even so, it is a convincing interactive success case because the final behavior is consistently organized around the revised user goal rather than the obsolete initial request\.
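
The terminal\-side geometry check in Phase 6 can be reproduced with a short sketch that asks ffprobe for the first video stream's width and height as JSON and compares them with the required 1080×1920 geometry. The exact command the agent ran is not recorded here, so the flags below are one common way to obtain this information, and the function names are hypothetical.

```python
import json
import subprocess
from typing import List, Tuple


def ffprobe_command(path: str) -> List[str]:
    """Build an ffprobe call that prints the first video stream's size as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "stream=width,height",
        "-of", "json",
        path,
    ]


def parse_resolution(ffprobe_json: str) -> Tuple[int, int]:
    """Extract (width, height) from ffprobe's JSON output."""
    streams = json.loads(ffprobe_json).get("streams", [])
    if not streams:
        raise ValueError("no video stream found")
    return int(streams[0]["width"]), int(streams[0]["height"])


def probe_resolution(path: str) -> Tuple[int, int]:
    """Run ffprobe (which must be installed) and return (width, height)."""
    completed = subprocess.run(
        ffprobe_command(path), check=True, capture_output=True, text=True
    )
    return parse_resolution(completed.stdout)


def is_vertical_1080x1920(resolution: Tuple[int, int]) -> bool:
    """True when the render matches the revised vertical requirement."""
    return resolution == (1080, 1920)
```

Separating command construction from output parsing keeps the parsing logic testable on machines where ffprobe is not available.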

### H\.3 Case 3: Blender Resolution Task

![[Uncaptioned image]](https://arxiv.org/html/2606.03103v1/acl_case_figures_share/case3_figure_page_01.jpg)

Figure A11\. Case 3 image page 1\.

![[Uncaptioned image]](https://arxiv.org/html/2606.03103v1/acl_case_figures_share/case3_figure_page_02.jpg)

Figure A12\. Case 3 image page 2\.

Evaluation Analysis

Task Instruction: Open the Blender project /home/user/Documents/scene\.blend\. In the Output Properties panel, set the render resolution to 1280×720, then save the file\.

Benchmark Outcome: This run fails, with result\.txt reporting 0\.0\. The required parameter edit never becomes a completed save workflow, so the evaluator never observes a valid 1280×720 update in the submitted \.blend file\.

Phase\-by\-Phase Trajectory

Phase 1: Uncertain File\-Open Handling \(steps 1–9\)\. The run begins with hesitation about the initial desktop state\. The agent uses Ctrl\+O and repeatedly clicks in the file\-open dialog without clearly committing to one navigation path\. It partially infers that the target file may already be visible or loaded, but it does not convert that inference into a robust check\. This early ambiguity matters because the task should have moved quickly from file access into Output Properties, yet the trajectory already begins to spend steps on state interpretation rather than direct progress\.

Phase 2: Catastrophic Recovery Failure \(steps 10–12\)\. The critical failure occurs when the agent attempts to dismiss what it thinks is a blocking window and sends Alt\+F4\. That closes Blender itself\. The model notices the mistake immediately in its own reasoning, but the damage is substantial: from this point onward, the rollout is no longer a normal settings\-edit task but a recovery task\. The agent then tries to relaunch Blender from the dock and via the Activities search, but this recovery is slow and unstable\. This is the main turning point in the episode, because the agent loses the reliable application context it needed for a simple property edit\.

Phase 3: Reopening and Reacquiring the Project \(roughly steps 20–34\)\. After relaunch attempts, the agent spends a long middle segment trying to reopen /home/user/Documents/scene\.blend\. It alternates between double\-clicking file rows, clicking the Open button, pressing Enter, and finally using Ctrl\+L to type the full path into the file chooser\. This phase is more structured than the earlier opening attempts, and the typed absolute path is the most reliable action in the whole trajectory\. However, even after the project is likely back on screen, the agent does not reestablish a clean internal model of the Blender layout\. The file\-reopen problem is eventually reduced, but the agent has already spent a large budget on state recovery\.

Phase 4: Output\-Properties Search by Coordinate Guessing \(steps 35–43\)\. Once the project appears available again, the agent correctly recognizes that it needs the Output Properties panel, but it never grounds the target icon or the resolution fields reliably\. Instead, it begins repeated coordinate\-based clicks on the right\-side Properties area, trying several nearby y\-positions that it describes as possible tab icons\. It also tries F10 as a shortcut, but this does not lead to a stable editable state either\. The key weakness here is that the agent has no fallback when visual icon targeting is uncertain\. It keeps sampling neighboring coordinates instead of switching to a deterministic mechanism such as Blender’s search, a structured menu path, or the embedded Python interface\.

Phase 5: Termination Without Parameter Edit\. The trajectory ends without any evidence that the width or height fields were actually changed to 1280 and 720, and without a successful save step that would propagate the edit back into the \.blend file\. The final runtime state is effectively a prolonged search loop inside Blender’s UI rather than an edit\-and\-save workflow\.

Evaluator Takeaway: This is a clear control\-and\-recovery failure\. The agent broadly knows what it needs to accomplish, but it never maintains stable application state after closing Blender and never finds a dependable path to the Output Properties controls\. The dominant failure mode is that the agent remains trapped in brittle GUI guessing on a task where a deterministic fallback would have been much more reliable\.
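
For reference, the kind of deterministic fallback mentioned above can be sketched with Blender's Python API. The sketch below builds a headless command line rather than driving the GUI, which is an assumption about one possible route and not something the agent attempted; the same statements could also be typed into Blender's embedded Python console. The helper name `build_blender_command` is hypothetical.

```python
from typing import List


def build_blender_command(blend_path: str, width: int, height: int,
                          blender_bin: str = "blender") -> List[str]:
    """Build a headless Blender call that sets the render resolution and saves the file."""
    script = (
        "import bpy; "
        "scene = bpy.context.scene; "
        f"scene.render.resolution_x = {int(width)}; "
        f"scene.render.resolution_y = {int(height)}; "
        "bpy.ops.wm.save_mainfile()"  # write the change back into the opened .blend
    )
    # The .blend path precedes --python-expr so the file is loaded before the script runs.
    return [blender_bin, "--background", blend_path, "--python-expr", script]


command = build_blender_command("/home/user/Documents/scene.blend", 1280, 720)
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where Blender is installed.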

## Appendix I AI Assistants in Research or Writing

We used AI assistants, including ChatGPT and Cursor, during the preparation of this work\. Their use was limited to research, coding, and writing assistance: improving grammar and clarity, suggesting wording alternatives, helping with LaTeX editing, and assisting with code drafting, debugging, and result organization\. All benchmark design decisions, experimental protocols, task definitions, analyses, and reported results were reviewed, verified, and finalized by the authors\.

## Appendix J Artifact Licensing, Privacy, and Content Review

The released DeskCraft artifacts, including task definitions, evaluator code, and supporting scripts, will be distributed with explicit license information; the project code is released under the Apache License 2\.0\. Task assets are synthetic, author\-created, or derived from public sources that permit academic use and redistribution, with attribution where applicable\. Materials without redistribution permission and proprietary practitioner\-provided artifacts are not released\.

Before release, we checked task text, assets, and metadata for personally identifying information and offensive content\. Practitioner\-seeded workflows were abstracted with consent, and raw notes or proprietary artifacts were not released\. The final public files were manually reviewed by the authors\.
