Skill-Guided Continuation Distillation for GUI Agents
Summary
The paper proposes Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework that uses skill-guided policies to generate supervision for off-trajectory states during closed-loop execution, improving GUI agent success rates on OSWorld-Verified from around 30% to over 50%.
View Cached Full Text
Cached at: 06/18/26, 05:41 AM
# Skill-Guided Continuation Distillation for GUI Agents
Source: [https://arxiv.org/html/2606.18890](https://arxiv.org/html/2606.18890)
Zhimin Fan1,∗Hongwei Yu1,2,∗Yeqing Shen1,†Haolong Yan1 Guozhen Peng1Tianhao Peng4Yudong Zhang3Xiaowen Zhang2,‡ Kaijun Tan1Zheng Ge1Xiangyu Zhang1Daxin Jiang1 1StepFun2University of Science and Technology Beijing3Tsinghua University 4Nanyang Technological University ∗Equal contribution\.†Project lead\.‡Corresponding author
###### Abstract
Improving GUI agents typically relies on behavior cloning on expert trajectories\. However, as the current policy deviates from the expert policy, it inevitably encounters policy\-induced off\-trajectory states during closed\-loop execution, i\.e\., states that fall outside the expert trajectories\. Since expert trajectories provide no demonstrations for these unseen states, such states receive no effective supervision, leaving the policy unable to select the correct action\. To close this supervision gap, we propose Skill\-Guided Continuation Distillation \(SGCD\), an iterative self\-improvement framework\. SGCD first runs the plain policy without skill guidance for a few steps to reach realistic off\-trajectory states\. From these states, a skill\-guided policy then completes the task and produces successful continuations, which are mixed with expert trajectories to supply supervision over policy\-induced off\-trajectory states\. The skills are extracted from both successful and failed rollouts, consisting of Continuation Plans, Critical Targets, Failure Traps, and Success Criteria\. On OSWorld\-Verified, SGCD improves the success rate of three base models from the low\-30% range to over 50%, demonstrating its effectiveness and generality\.
![[Uncaptioned image]](https://arxiv.org/html/2606.18890v1/images/robot_title_icon.png)Skill\-Guided Continuation Distillation for GUI Agents
## 1Introduction
Built on recent vision\-language foundation modelsGoogle \([2025](https://arxiv.org/html/2606.18890#bib.bib26)\); Baiet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib24)\); Yanet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib30)\); OpenAI \([2026](https://arxiv.org/html/2606.18890#bib.bib27)\); Anthropic \([2026](https://arxiv.org/html/2606.18890#bib.bib39)\); Hurstet al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib40)\), GUI agents perceive screen observations and predict actions to operate desktop, web, and mobile interfaces in a closed loop, supporting open\-ended computer tasks such as document editing, software operation, and web navigation\. Such agents are typically trained by supervised fine\-tuning \(SFT\) over trajectory data, which adapts the underlying vision\-language models to GUI\-specific observations, action spaces, and interaction protocols\.
Existing end\-to\-end GUI agents are trained on human or synthetic expert trajectoriesQinet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib28)\); Wanget al\.\([2026b](https://arxiv.org/html/2606.18890#bib.bib29)\); Yanet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib30)\); Xuet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib16)\); Xueet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib31)\), teaching task\-specific behaviors, action formats, and procedural knowledge\. Self\-improvement methods further expand the training pool by converting model\-generated rollouts into supervision through filtered self\-training, sandbox\-based reinforcement learning, or experience\-driven knowledge refinementYanet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib30)\); Wuet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib32)\); Laiet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib43)\); Zhanget al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib68)\); Linet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib59)\); Wanget al\.\([2026c](https://arxiv.org/html/2606.18890#bib.bib69)\)\. Despite differences in data sources, these approaches share a common supervision paradigm: behavior cloning on successful expert trajectories, where the policy is trained to imitate the action taken at each visited expert state\. While such supervision is strong on the expert state distribution, the discrepancy between the expert policy and the current policy inevitably drives the current policy into states that fall outside the expert trajectoriesLaufferet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib44)\); Rosset al\.\([2011](https://arxiv.org/html/2606.18890#bib.bib18)\)\. We refer to these states aspolicy\-induced off\-trajectory states\. Expert trajectories provide no effective supervision for these states, leaving the policy unable to predict the correct action\. Reinforcement learning is explored as an alternative source of supervision for such statesLaiet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib43)\); Liet al\.\([2025a](https://arxiv.org/html/2606.18890#bib.bib52)\), but rollouts from the current policy rarely produce correct actions, yielding sparse reward signals and inefficient trainingZenget al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib57)\); Wanget al\.\([2026a](https://arxiv.org/html/2606.18890#bib.bib58)\)\.
Figure 1:Failure analysis of GUI agents\. Left: failures are concentrated in early execution\. Right: representative recurring failure patterns, including early done, fix action, hallucinated affordance, and scope misjudgment\.Unlike single\-step prediction, where each output is independent, end\-to\-end tasks involve long sequences of actions executed in a closed loop\. An early off\-trajectory action thus propagates through subsequent interactions and drives the agent into states increasingly far from the expert trajectoriesChenet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib51)\)\. Importantly, such states are not arbitrary perturbations of expert states but reflect systematic biases of the learned policy, which tends to repeat a small set of erroneous behaviors\. We refer to this distributional and supervisory mismatch as the off\-trajectory supervision deficit\.
Closing this supervision deficit is particularly challenging in GUI domains\. Reaching realistic off\-trajectory states requires actually executing actions in the environment, making such states costly to revisit, reproduce, or reset\. Existing methods rely on hand\-crafted rules to select important off\-trajectory statesLinet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib59)\), but such heuristics introduce selection bias and yield only sparse coverage of the states the current policy traverses\. Moreover, obtaining successful continuations is difficult, as the current policy seldom completes the task from such states\. An effective method for supplying such supervision should therefore ①expose the policy to realistic off\-trajectory statesand ②obtain successful continuationsfrom such states\.
To meet both objectives, we propose Skill\-Guided Continuation Distillation \(SGCD\), an iterative self\-improvement framework\. SGCD rolls out the plain policy \(i\.e\., the current policy without skill guidance\) to reach realistic off\-trajectory states, then invokes a skill\-guided policy to complete the task and produce successful continuations from those states\. Mixing verified continuations with expert trajectories for training supplies additional supervision over policy\-induced off\-trajectory states\. Concretely, each objective is realized as follows\.
For objective ①, we first analyze where GUI failures occur during trajectories\. As shown in Fig\.[1](https://arxiv.org/html/2606.18890#S1.F1)\(a\), failures are strongly concentrated in early execution, suggesting that off\-trajectory deviations are often induced early and can lead to erroneous subsequent actions and eventual task failure\. Accordingly, we induce off\-trajectory states from the early execution of the plain policy, where such deviations naturally arise\. For each task, the plain policy interacts with the GUI forkksteps to instantiate realistic off\-trajectory states\. By sweepingkkover a range of values, SGCD avoids hand\-crafted state\-selection heuristics and densely covers the off\-trajectory states the current policy actually traverses\. This aligns the supervision distribution with the states the policy encounters at deployment, directly mitigating the distributional shift inherent in behavior cloning\.
For objective ②, we analyze the structured failure patterns induced by the learned policy\. As illustrated in Fig\.[1](https://arxiv.org/html/2606.18890#S1.F1)\(b\), policy failures exhibit recurring error tendencies rather than isolated accidental mistakesWanyanet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib48)\); Lùet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib49)\)\. From successful and failed rollouts, we extract*off\-trajectory continuation skills*\(Continuation Plans, Critical Targets, Failure Traps, and Success Criteria\), which guide a skill\-guided policy to roll out from each off\-trajectory state and produce verified successful continuations for training\. By fulfilling these two objectives, SGCD synthesizes effective supervision over policy\-induced off\-trajectory states, closing the supervision deficit inherent in behavior cloning on expert trajectories\.
We evaluate our method on OSWorld\-Verified across three vision\-language models: Qwen3\-VL\-8B, Qwen3\-VL\-30B\-A3B, and STEP3\-VL\-10B\. Across all three models, SGCD consistently improves performance from the low\-30% range to over 50%\. The main contributions are as follows:
- •We identify theoff\-trajectory supervision deficit, where agents trained on successful demonstrations lack supervision for policy\-induced off\-trajectory states, and show that failures are concentrated in early stage\.
- •We proposeSkill\-Guided Continuation Distillation, which uses off\-trajectory continuation skills to collect successful continuations from off\-trajectory states and mitigate the expert\-state bias\.
- •We validate SGCD on OSWorld\-Verified across three base models, with success rates improving from the low\-30% range to over 50%, demonstrating its generality\.
Figure 2:Overview of SGCD\. \(1\)Task Trajectories Sampling: collect successful and failed plain\-policy trajectories\. \(2\)Skill Construction: extract off\-trajectory continuation skills from trajectory evidence\. \(3\)Off\-trajectory Continuation Construction: usekk\-step handoff to collect skill\-guided successful continuations\. \(4\)Mixed Trajectories Training: train the plain policy with expert and verified continuation trajectories\.
## 2Related Work
#### GUI Agents\.
Recent GUI agents advance on web, mobile, and desktop benchmarksDenget al\.\([2023](https://arxiv.org/html/2606.18890#bib.bib10)\); Zhouet al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib11)\); Rawleset al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib13)\); Xieet al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib14)\)along several directions\. UI\-TARSQinet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib28)\)unifies perception, reasoning, and action generation through large\-scale GUI\-specific pretraining\. OpenCUAWanget al\.\([2026b](https://arxiv.org/html/2606.18890#bib.bib29)\)scales human\-annotated desktop trajectories with reflective state\-action conversion to support open\-ended tasks\. SeeClickChenget al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib45)\)and UGroundQianet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib56)\)target element grounding via screen\-localized pretraining for accurate UI localization\. EvoCUAXueet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib31)\)and LiteGUIWuet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib32)\)automatically synthesize task and trajectory data to continuously update the policy\. These works establish strong foundational capabilities for GUI agents and demonstrate effective data synthesis pipelines\. Building on this foundation, SGCD proposes an iterative self\-improvement approach that further addresses the lack of expert supervision for policy\-induced off\-trajectory states\.
#### Self\-Improvement\.
Self\-improvement methods seek to enhance agent performance by leveraging the agent’s own interaction experience as a training signal\. ReflexionShinnet al\.\([2023](https://arxiv.org/html/2606.18890#bib.bib21)\)and Self\-RefineMadaanet al\.\([2023](https://arxiv.org/html/2606.18890#bib.bib22)\)use inference\-time verbal feedback to revise outputs\. Recent GUI pipelines convert model\-generated rollouts into supervision through filtered self\-trainingYanet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib30)\); Wuet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib32)\), sandbox\-based reinforcement learningLaiet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib43)\), experience\-driven knowledge refinementZhanget al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib68)\); Linet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib59)\), or policy\-aligned experience assimilationWanget al\.\([2026c](https://arxiv.org/html/2606.18890#bib.bib69)\)\. SGCD follows this paradigm and specifically targets policy\-induced off\-trajectory states, synthesizing skill\-guided continuation supervision from the states the current policy actually traverses\. Detailed related work is shown in Appendix[C](https://arxiv.org/html/2606.18890#A3)\.
## 3Preliminaries
We consider a distribution of executable GUI training tasks𝒳\\mathcal\{X\}constructed to be compatible with the OSWorld\-Verified interaction protocol\. Each taskx∈𝒳x\\in\\mathcal\{X\}contains a natural\-language instruction, an initial environment state, and an automatic verifier\. The environment supports state reset, execution of mouse and keyboard actions in real desktop applications, and rule\-based final\-state evaluation\.
At steptt, the agent observes a multimodal observationoto\_\{t\}and predicts an action from the interaction historyht=\(o1,a1,…,ot\)h\_\{t\}=\(o\_\{1\},a\_\{1\},\\ldots,o\_\{t\}\)\. A rollout defines as
τ=\(x,o1,a1,…,oT,aT\),\\tau=\(x,o\_\{1\},a\_\{1\},\\ldots,o\_\{T\},a\_\{T\}\),\(1\)and the verifierVx\(τ\)∈\{0,1\}V\_\{x\}\(\\tau\)\\in\\\{0,1\\\}determines whether the final environment state satisfies the task goal\. The plain policy model is
πpolicy\(at∣ht,x\)≜πθ\(at∣ht,x\),\\pi\_\{\\mathrm\{policy\}\}\(a\_\{t\}\\mid h\_\{t\},x\)\\triangleq\\pi\_\{\\theta\}\(a\_\{t\}\\mid h\_\{t\},x\),\(2\)which is not conditioned on any skill\. Given a task\-specific skillsxs\_\{x\}, the skill\-guided policy model can be expressed as
πskill\(at∣ht,x,sx\)≜πθ\(at∣ht,x,sx\)\.\\pi\_\{\\mathrm\{skill\}\}\(a\_\{t\}\\mid h\_\{t\},x,s\_\{x\}\)\\triangleq\\pi\_\{\\theta\}\(a\_\{t\}\\mid h\_\{t\},x,s\_\{x\}\)\.\(3\)Both policies are parameterized by the shared parametersθ\\theta, operate over the same action space, and differ only in the conditioning context\. The skill acts as a training\-time privileged recovery context that steers the current model toward a more informed recovery mode without changing its underlying interface\.
Given a trajectory dataset𝒟\\mathcal\{D\}, standard supervised fine\-tuning optimizes the plain policy with the loss
ℒSFT\(𝒟;θ\)=−𝔼τ∼𝒟∑t=1\|τ\|logπθ\(yt∣ht,x\)\.\\mathcal\{L\}\_\{\\mathrm\{SFT\}\}\(\\mathcal\{D\};\\theta\)=\-\\mathbb\{E\}\_\{\\tau\\sim\\mathcal\{D\}\}\\sum\_\{t=1\}^\{\|\\tau\|\}\\log\\pi\_\{\\theta\}\(y\_\{t\}\\mid h\_\{t\},x\)\.\(4\)Here,yty\_\{t\}denotes the ground\-truth action of the trajectoryτ\\tau\. This objective trains the policy to imitate the action labels in𝒟\\mathcal\{D\}and serves as the base training objective used throughout SGCD\.
## 4Method
### 4\.1Motivation
Expert trajectories provide no effective supervision for policy\-induced off\-trajectory statesLaufferet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib44)\); Rosset al\.\([2011](https://arxiv.org/html/2606.18890#bib.bib18)\)\. Such states are not arbitrary perturbations: induced through the agent’s closed\-loop execution process, these states reflect the systematic biases of the learned policy\. To inform the design of continuation supervision, we analyze failed rollouts from the plain policy \(Fig\.[1](https://arxiv.org/html/2606.18890#S1.F1)\) and identify two empirical properties: failures are strongly concentrated in early execution, and failed trajectories exhibit recurring error tendencies rather than isolated accidental mistakes\. These observations motivateSkill\-Guided Continuation Distillation\(SGCD\), which first runs the plain policy to reach realistic off\-trajectory states and then invokes the skill\-guided policy to obtain successful continuations from these states\. As illustrated in Fig\.[2](https://arxiv.org/html/2606.18890#S1.F2), SGCD proceeds in four stages\. In Stage I \(Sec\.[4\.2](https://arxiv.org/html/2606.18890#S4.SS2)\), we roll out the plain policy on different training tasks to collect successful and failed trajectories\. In Stage II \(Sec\.[4\.3](https://arxiv.org/html/2606.18890#S4.SS3)\), we summarize these trajectories and construct task\-specific off\-trajectory continuation skills\. In Stage III \(Sec\.[4\.4](https://arxiv.org/html/2606.18890#S4.SS4)\), the plain policy executes the firstkkGUI actions on recoverable failure tasks to reach realistic off\-trajectory states, after which the current GUI state is passed to the skill\-conditioned policy to obtain verified successful continuations\. In Stage IV \(Sec\.[4\.5](https://arxiv.org/html/2606.18890#S4.SS5)\), we process the resulting continuations and incorporate these continuations with expert trajectories to optimize the deployment policy without skill prompts\.
### 4\.2Stage I: Task Trajectories Sampling
The first stage samples the unassisted behavior of the current plain policy\. For each training taskx∈𝒳x\\in\\mathcal\{X\}, we reset the environment and run the plain policyMMtimes, whereMMdenotes the number of rollouts sampled per task\. The resulting set is
𝒯x,policy=\{τx,policym∼πpolicy\(⋅∣x\)\}m=1M,\\mathcal\{T\}\_\{x,\\mathrm\{policy\}\}=\\left\\\{\\tau\_\{x,\\mathrm\{policy\}\}^\{m\}\\sim\\pi\_\{\\mathrm\{policy\}\}\(\\cdot\\mid x\)\\right\\\}\_\{m=1\}^\{M\},\(5\)whereτx,policym\\tau\_\{x,\\mathrm\{policy\}\}^\{m\}denotes themm\-th complete trajectory collected by executingπpolicy\\pi\_\{\\mathrm\{policy\}\}on taskxx\. Using the task verifierVxV\_\{x\}, we partition all sampled rollouts of taskxxinto successful and failed trajectories:
𝒟x\+=\{τ∈𝒯x,policy:Vx\(τ\)=1\},\\mathcal\{D\}\_\{x\}^\{\+\}=\\\{\\tau\\in\\mathcal\{T\}\_\{x,\\mathrm\{policy\}\}:V\_\{x\}\(\\tau\)=1\\\},\(6\)𝒟x−=\{τ∈𝒯x,policy:Vx\(τ\)=0\},\\mathcal\{D\}\_\{x\}^\{\-\}=\\\{\\tau\\in\\mathcal\{T\}\_\{x,\\mathrm\{policy\}\}:V\_\{x\}\(\\tau\)=0\\\},\(7\)The first stage deliberately collects rollouts from the plain policy rather than an external expert, because the resulting rollouts directly expose the off\-trajectory states and recurring mistakes characteristic of the deployed agent’s closed\-loop behavior\. Successful rollouts provide feasible workflows and terminal evidence, while failed rollouts expose model\-specific traps, target confusions, and premature termination patterns that are later summarized into off\-trajectory continuation skills\.
### 4\.3Stage II: Skill Construction
We use Gemini\-3\-ProGoogle \([2025](https://arxiv.org/html/2606.18890#bib.bib26)\)to transform task\-level trajectory evidence into compact natural\-language skills\. The skill format is motivated by the path\-plural nature of GUI control\. Unlike math and code tasks, GUI tasks rarely admit a unique ground\-truth solution trace\. The same instruction for GUI agents may be completed through different menus, shortcuts, dialog states, file views, or interaction orders\. Directly conditioning on a successful path is therefore a brittle form of supervision\. It over\-binds the current decision to a particular historical state sequence and preserves incidental low\-level details such as menu order or scroll position\. As a result, the model may learn to imitate a route rather than understand how the task can be recovered from an uncertain intermediate state\.
SGCD represents a skill as a*trajectory abstraction*rather than a trajectory replay\. It summarizes trajectory\-level evidence into off\-trajectory continuation skills while filtering out incidental path details\. We construct this abstraction from our failure analysis\. As illustrated in Fig\.[1](https://arxiv.org/html/2606.18890#S1.F1)\(b\), failed GUI rollouts can be organized into four recurring failure families:*Early Done*, where the model terminates before the task is actually complete;*Fixation*, where it repeats ineffective actions without switching strategy;*Hallucinated Affordance*, where it searches for unavailable menus, settings, or commands; and*Scope Misjudgment*, where it selects an incorrect UI function or dialog despite understanding the broad instruction\.
The skill schema is designed to guide successful continuations from off\-trajectory states\. It summarizes what should be achieved, which UI targets matter, what failure traps should be avoided, and how completion should be verified, while leaving the skill\-conditioned policy free to choose a feasible continuation from the current GUI state\. This abstraction is especially suitable for continuation construction, since the handoff state is produced online by the policy model and may not match any historical successful trace\. Accordingly, each skillsxs\_\{x\}is indexed by task id and organized according to the schema in Tab\.[1](https://arxiv.org/html/2606.18890#S4.T1)\.
Table 1:The structured schema of a task\-specific off\-trajectory continuation skill\.
### 4\.4Stage III: Off\-Trajectory Continuation Construction
The third stage constructs successful continuations from recoverable failed tasks\. A recoverable task is selected only if the plain policy fails while the skill\-guided policy succeeds:
𝒳rec=\{x∈𝒳fail:Vx\(τx,skill\)=1\},\\mathcal\{X\}\_\{\\mathrm\{rec\}\}=\\left\\\{x\\in\\mathcal\{X\}\_\{\\mathrm\{fail\}\}:V\_\{x\}\(\\tau\_\{x,\\mathrm\{skill\}\}\)=1\\right\\\},\(8\)where𝒳fail\\mathcal\{X\}\_\{\\mathrm\{fail\}\}represents failed tasks of policy modelπpolicy\\pi\_\{\\mathrm\{policy\}\}\. This filtering avoids spending sampling budget on tasks for which the skill\-guided policyπskill\\pi\_\{\\mathrm\{skill\}\}cannot yet find a successful solution, since such tasks are unlikely to yield verified successful continuations\. Because SGCD is iterative, this recoverable task set is not fixed\. As the plain policy improves in later rounds, additional tasks may become recoverable and enter subsequent off\-trajectory continuation data construction\.
As shown in Fig\.[1](https://arxiv.org/html/2606.18890#S1.F1)\(a\), failures are strongly concentrated in early execution\. We therefore enumerate switch steps within this early window, i\.e\.,k∈\{1,…,20\}k\\in\\\{1,\\ldots,20\\\}\. By sweeping over multiple handoff depths, SGCD reduces hand\-crafted bias in selecting off\-trajectory states while preserving enough remaining horizon for the skill\-conditioned policy to complete the task\.
For each selected taskx∈𝒳recx\\in\\mathcal\{X\}\_\{\\mathrm\{rec\}\}and switch stepkk, we reset the environment and let the plain policy execute the firstkkGUI actions\. This produces a live policy\-induced state with history
hk\+1p=\(o1,a1,…,ok,ak,ok\+1\)∼πpolicy\(⋅∣x\)\.h\_\{k\+1\}^\{p\}=\(o\_\{1\},a\_\{1\},\\ldots,o\_\{k\},a\_\{k\},o\_\{k\+1\}\)\\sim\\pi\_\{\\mathrm\{policy\}\}\(\\cdot\\mid x\)\.\(9\)This action segment is not replayed from a stored trajectory\. It is produced in the current environment, so the handoff state reflects a state that the plain policy actually reaches\. Starting from this state, the skill\-guided policy model continues execution with the task\-specific skillsxs\_\{x\}:
τ^\>k=\(a^k\+1,…,o^T,a^T\)∼πskill\(⋅∣hk\+1p,x,sx\)\.\\hat\{\\tau\}\_\{\>k\}=\(\\hat\{a\}\_\{k\+1\},\\ldots,\\hat\{o\}\_\{T\},\\hat\{a\}\_\{T\}\)\\sim\\pi\_\{\\mathrm\{skill\}\}\(\\cdot\\mid h\_\{k\+1\}^\{p\},x,s\_\{x\}\)\.\(10\)The resulting successful continuation trajectory is
τ^=\(x,o1,a1,…,ok\+1,a^k\+1,…,o^T,a^T\)\.\\hat\{\\tau\}=\(x,o\_\{1\},a\_\{1\},\\ldots,o\_\{k\+1\},\\hat\{a\}\_\{k\+1\},\\ldots,\\hat\{o\}\_\{T\},\\hat\{a\}\_\{T\}\)\.\(11\)
Each spliced rollout is filtered by two complementary signals\. First, the executable verifier must accept the final state,Vx\(τ^\)=1V\_\{x\}\(\\hat\{\\tau\}\)=1\. Second, an LLM judgeZhenget al\.\([2023](https://arxiv.org/html/2606.18890#bib.bib50)\); Lùet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib49)\)inspects the task, observations, actions, and final outcome to remove accidental success, redundant loops, inconsistent reasoning, unsafe operations, and behavior unrelated to the instruction\. Only post\-handoff continuations that pass both filters are added to the continuation dataset𝒟cont\\mathcal\{D\}\_\{\\mathrm\{cont\}\}\.
### 4\.5Stage IV: Mixed Trajectories Training
The final stage trains the deployment policy on a mixture of original expert trajectories𝒟exp\\mathcal\{D\}\_\{\\mathrm\{exp\}\}, verified successful policy trajectories𝒟\+\\mathcal\{D\}^\{\+\}, and verified successful continuations𝒟cont\\mathcal\{D\}\_\{\\mathrm\{cont\}\}\. For each retained continuation trajectoryτ^\\hat\{\\tau\}, we discard the pre\-handoff policy segment\(o1,a1p,…,ok,akp\)\(o\_\{1\},a\_\{1\}^\{p\},\\ldots,o\_\{k\},a\_\{k\}^\{p\}\)from action supervision\. This segment is used only as historical context and does not provide supervised action targets, since it may contain the policy behaviors that induced the off\-trajectory state\.
The continuation loss is
ℒcont\(θ\)=−𝔼τ^∼𝒟cont∑t=k\+1\|τ^\|logπθ\(a^t∣h^t,x\),\\mathcal\{L\}\_\{\\mathrm\{cont\}\}\(\\theta\)=\-\\mathbb\{E\}\_\{\\hat\{\\tau\}\\sim\\mathcal\{D\}\_\{\\mathrm\{cont\}\}\}\\sum\_\{t=k\+1\}^\{\|\\hat\{\\tau\}\|\}\\log\\pi\_\{\\theta\}\(\\hat\{a\}\_\{t\}\\mid\\hat\{h\}\_\{t\},x\),\(12\)where each trajectoryτ^\\hat\{\\tau\}comes with its source taskxxand handoff stepkk\. Here,h^t\\hat\{h\}\_\{t\}denotes the history used to predict the continuation actiona^t\\hat\{a\}\_\{t\}\.
We combine standard supervised training on original expert trajectories and verified successful policy trajectories with continuation supervision:
ℒ\(θ\)=ℒSFT\(𝒟exp∪𝒟\+;θ\)\+ℒcont\(𝒟cont;θ\)\.\\mathcal\{L\}\(\\theta\)=\\mathcal\{L\}\_\{\\mathrm\{SFT\}\}\(\\mathcal\{D\}\_\{\\mathrm\{exp\}\}\\cup\\mathcal\{D\}^\{\+\};\\theta\)\+\\mathcal\{L\}\_\{\\mathrm\{cont\}\}\(\\mathcal\{D\}\_\{\\mathrm\{cont\}\};\\theta\)\.\(13\)The objective remains a standard action\-generation objective, but the mixed training data shifts supervision toward successful continuations from policy\-induced off\-trajectory states\. After training, the deployment policy receives no skill\. The skill context is used only as a data\-synthesis scaffold\.
SGCD achieves continual self\-improvement through iterative application\. After each round, the improved policy induces a new failure distribution, and previously unrecoverable tasks may become recoverable in subsequent rounds\. Each iteration reruns trajectory sampling, updates skills, and regenerates verified successful continuations for training, progressively expanding continuation supervision as the policy’s capability grows\.
Method\#ParamsOSWorld\-VerifiedFull↑\\uparrowOSOfficeDailyProf\.WorkflowGeneral\-Purpose ModelsSeed1\.8Seed \([2026](https://arxiv.org/html/2606.18890#bib.bib66)\)–61\.91681295839Kimi K2\.5Teamet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib65)\)1T63\.31781295943Claude Sonnet 4\.6Anthropic \([2026](https://arxiv.org/html/2606.18890#bib.bib39)\)–72\.12288335856GPT\-5\.5OpenAI \([2026](https://arxiv.org/html/2606.18890#bib.bib27)\)–78\.7–––––GUI\-Specialized ModelsOpenCUA\-7BWanget al\.\([2026b](https://arxiv.org/html/2606.18890#bib.bib29)\)7B28\.71128183413UI\-TARS\-1\.5\-7BQinet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib28)\)7B29\.683519378TianXi\-Action\-7B7B30\.483419435GUI\-Owl\-7BYeet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib15)\)7B32\.1122919479OpenCUA\-32BWanget al\.\([2026b](https://arxiv.org/html/2606.18890#bib.bib29)\)32B35\.61536204512DeepMiner\-Mano\-7BFuet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib53)\)7B40\.11246185316DART\-GUI\-7BLiet al\.\([2025b](https://arxiv.org/html/2606.18890#bib.bib64)\)7B40\.51339265217OpenCUA\-72B\-previewWanget al\.\([2026b](https://arxiv.org/html/2606.18890#bib.bib29)\)72B45\.91455265219UI\-TARS\-2Qinet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib28)\)–53\.11072294932DeepMiner\-Mano\-72BFuet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib53)\)72B53\.91674255723Backbone\-Matched ModelsQwen3\-VL\-30B\-A3B backboneLite\-GUI\-30B\-A3BWuet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib32)\)30B\-A3B22\.7–––––Qwen3\-VL\-30B\-A3B\-Instruct†Baiet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib24)\)30B\-A3B31\.9843153711Qwen3\-VL\-30B\-A3B\-Thinking†Baiet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib24)\)30B\-A3B31\.3114013418Holo2\-30B\-A3BH Company \([2025](https://arxiv.org/html/2606.18890#bib.bib17)\)30B\-A3B37\.4–––––SGCD\-30B\-A3B30B\-A3B58\.41878175543Qwen3\-VL\-8B backboneQwen3\-VL\-8B\-Instruct†Baiet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib24)\)8B32\.71039154014Qwen3\-VL\-8B\-Thinking†Baiet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib24)\)8B36\.01143214213Step\-GUI\-8BYanet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib30)\)8B40\.2945185518EvoCUA\-8BXueet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib31)\)8B46\.11844255624GUI\-Owl\-1\.5\-8B\-Instruct†Xuet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib16)\)8B52\.41859196528SGCD\-8B8B55\.11675185832STEP3\-VL\-10B backboneSTEP3\-VL\-10B†Huanget al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib67)\)10B24\.1121713396SGCD\-10B10B53\.21369195932
Table 2:Main results on OSWorld\-Verified\. We report overall OSWorld\-Verified performance and category\-level scores on different tasks\. SGCD models are highlighted inblueand base models are inyellow\.†denotes results reproduced under our evaluation protocol\. Officially reported full scores are 52\.3 for GUI\-Owl\-1\.5\-8B\-Instruct, 30\.3 for Qwen3\-VL\-30B\-A3B\-Instruct, 30\.6 for Qwen3\-VL\-30B\-A3B\-Thinking, and 33\.9 for Qwen3\-VL\-8B variants\.
## 5Experiments
### 5\.1Experimental Setup
#### Backbones\.
We apply SGCD to three vision\-language GUI policy models with different scales and model families: Qwen3\-VL\-8B, Qwen3\-VL\-30B\-A3BBaiet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib24)\), and STEP3\-VL\-10BHuanget al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib67)\)\. This setting evaluates the effectiveness and generality of SGCD\.
#### Training settings\.
All experiments are trained on 64 H100 GPUs\. FollowingSunet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib61)\); Wuet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib62)\), we synthesize tasks over multiple real\-world OS applications and construct the initial expert dataset by sampling trajectories with several advanced models, followed by filtering to retain high\-quality successful executions\. Detailed training settings are provided in the appendix\.
#### Baselines and metrics\.
We compare against commercial and GUI\-specialized models, with the full list provided in Tab\.[2](https://arxiv.org/html/2606.18890#S4.T2)\. FollowingLiet al\.\([2025a](https://arxiv.org/html/2606.18890#bib.bib52)\); Fuet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib53)\); Agasheet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib23)\); Gonzalez\-Pumariegaet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib55)\), we evaluate on the OSWorld\-VerifiedXieet al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib14)\), which reflects the real user experience\. We use end\-to\-end task success rate on OSWorld\-Verified\. To measure off\-trajectory continuation ability, we define*Continuation Success Rate*\. We select tasks that the plain policy model fails\. For each task, we first let the plain policy executekksteps to induce an off\-trajectory state, and then hand control to the evaluated model to complete the task from that state\. Importantly, the evaluated model is not given any skill or guidance, so the metric measures whether the model has ability itself\. Continuation Success Rate is the success rate of these tasks\.
### 5\.2Main Results on OSWorld\-Verified
Tab\.[2](https://arxiv.org/html/2606.18890#S4.T2)reports the main comparison on OSWorld\-Verified\. We include overall performance and category\-level scores on OS, Office, Daily, Professional, and Workflow tasks\. The*GUI\-Specialized Models*block contains models specifically designed or trained for GUI tasks, while the*Backbone\-Matched Models*block compares methods that share the same base backbones as our models\. SGCD achieves state\-of\-the\-art results among models with comparable parameter scales\. In the backbone\-matched comparison, SGCD substantially improves the original base models, with gains of over 20% on the Qwen3\-VL\-8B backbone and over 25% on the Qwen3\-VL\-30B\-A3B backbone\. These consistent improvements across different model families and scales demonstrate the effectiveness and generality of SGCD\. Overall, the results show that SGCD provides a strong and complementary training signal for GUI agents, beyond simply scaling the base model or relying on successful trajectory imitation\.

Figure 3:Iterative SGCD across training rounds\.
Figure 4:Effect of different starting rounds\.
Table 3:Ablation on skill components\. CP, CT, FT, and SC denote Continuation Plans, Critical Targets, Failure Traps, and Success Criteria\.
### 5\.3Off\-Trajectory Continuation Ability
We further evaluate whether SGCD trains the model to complete tasks from policy\-induced off\-trajectory states\. For each task that the original policy fails, we first run the original policy forkksteps to induce an intermediate GUI state, and then let the evaluated model continue from that state without any skill prompt\. This setting tests whether successful continuation ability is distilled into the plain policy itself\.
As shown in Tab\.[4](https://arxiv.org/html/2606.18890#S5.T4), SGCD attains the highest continuation success rate among open\-weight models at both backbone scales, reaching39\.2%39\.2\\%on the Qwen3\-VL\-8B backbone and50\.3%50\.3\\%on the Qwen3\-VL\-30B\-A3B backbone, which approaches Kimi K2\.5\. This indicates that the skill\-guided continuations are not merely useful during data synthesis, but are effectively distilled into the plain policy\. The resulting model already learns how to complete tasks from off\-trajectory states\.
Table 4:Continuation Success Rate \(Cont\. SR\) comparison on the142142OSWorld\-Verified tasks failed by the original plain policy\.
### 5\.4Iterative Distillation
SGCD is iterated because each trained policy induces a new distribution of off\-trajectory states\. After one round, some previous failures become solvable, and new skill\-solvable states emerge near the expanded capability boundary\. We therefore repeat the pipeline of SGCD\.
Fig\.[3](https://arxiv.org/html/2606.18890#S5.F3)reports performance across SGCD training rounds\. Performance improves consistently from v0 to later iterations, showing that repeated SGCD can further enhance the plain policy\.
### 5\.5Ablation Studies
We conduct all ablations on the v0\-to\-v1 stage of SGCD, where the largest improvement is observed\.
#### Skill components\.
Tab\.[3](https://arxiv.org/html/2606.18890#S5.T3)studies the contribution of the off\-trajectory continuation skill\. Starting from no skill, we progressively add Continuation Plans, Critical Targets, Failure Traps, and Success Criteria\. These components correspond to the dominant failure modes identified in our trajectory analysis: fixation, scope misjudgment, hallucinated affordance, and early done\. The ablation tests whether the full skill works as a trajectory abstraction rather than a fixed path replay\.
#### K\-start supervision\.
SGCD uses only the post\-handoff continuation for training\. The pre\-handoff segment is kept as context but excluded from action supervision\. This design follows our motivation that off\-trajectory states are more likely to produce erroneous actions\. Therefore, the early plain\-policy actions that lead to these states may contain noisy decisions rather than useful supervision\. Fig\.[4](https://arxiv.org/html/2606.18890#S5.F4)compares this design with training on the full trajectories\. The results confirm that excluding the pre\-handoff actions leads to better performance\.
## 6Conclusion
We identify the off\-trajectory supervision deficit as a fundamental limitation of behavior cloning on expert trajectories, where policy\-induced states encountered during closed\-loop execution receive no effective supervision signal\. To address this, we proposeSkill\-Guided Continuation Distillation \(SGCD\), which leverages skill\-guided rollouts to synthesize verified successful continuations from such states, supplying the supervision that expert trajectories cannot provide\. On OSWorld\-Verified, SGCD improves the success rate of three base models from the low\-30% range to over 50%, demonstrating its effectiveness and generality\.
## Limitations
This study has several limitations\. First, the current framework has limited effectiveness in acquiring successful continuations for difficult tasks\. Improving the coverage and efficiency of continuation acquisition on such tasks remains an open direction for future work\. Second, SGCD currently instantiates off\-trajectory states by re\-executing the plain policy from scratch in the live environment for each task, which incurs substantial interaction overhead\. Developing state\-caching infrastructure to store and reuse intermediate GUI states directly could substantially reduce the number of environment steps required per iteration\.
## References
- Agent s: an open agentic framework that uses computers like a human\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 22924–22946\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p4.1),[§5\.1](https://arxiv.org/html/2606.18890#S5.SS1.SSS0.Px3.p1.1)\.
- Anthropic \(2024a\)Computer use tool\.Note:[https://platform\.claude\.com/docs/en/agents\-and\-tools/tool\-use/computer\-use\-tool](https://platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool)Accessed: 2026\-05\-20Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1)\.
- Anthropic \(2024b\)Introducing computer use, a new claude 3\.5 sonnet, and claude 3\.5 haiku\.Note:[https://www\.anthropic\.com/news/3\-5\-models\-and\-computer\-use](https://www.anthropic.com/news/3-5-models-and-computer-use)Accessed: 2026\-05\-20Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1)\.
- Anthropic \(2026\)Introducing claude sonnet 4\.6\.Note:[https://www\.anthropic\.com/news/claude\-sonnet\-4\-6](https://www.anthropic.com/news/claude-sonnet-4-6)Accessed: 2026\-05\-20Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[§1](https://arxiv.org/html/2606.18890#S1.p1.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.12.5.1)\.
- S\. Bai, Y\. Cai, R\. Chen, K\. Chen, X\. Chen, Z\. Cheng, L\. Deng, W\. Ding, C\. Gao, C\. Ge,et al\.\(2025\)Qwen3\-vl technical report\.arXiv preprint arXiv:2511\.21631\.Cited by:[§1](https://arxiv.org/html/2606.18890#S1.p1.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.2.2.2.1.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.3.3.3.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.4.4.4.1.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.5.5.5.1),[§5\.1](https://arxiv.org/html/2606.18890#S5.SS1.SSS0.Px1.p1.1)\.
- Y\. Chen, B\. Xu, X\. Wang, Y\. Zhang, and Z\. Mao \(2025\)Training llm\-based agents with synthetic self\-reflected trajectories and partial masking\.arXiv preprint arXiv:2505\.20023\.Cited by:[§1](https://arxiv.org/html/2606.18890#S1.p3.1)\.
- K\. Cheng, Q\. Sun, Y\. Chu, F\. Xu, L\. YanTao, J\. Zhang, and Z\. Wu \(2024\)Seeclick: harnessing gui grounding for advanced visual gui agents\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 9313–9332\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Deng, Y\. Gu, B\. Zheng, S\. Chen, S\. Stevens, B\. Wang, H\. Sun, and Y\. Su \(2023\)Mind2web: towards a generalist agent for the web\.Advances in Neural Information Processing Systems36,pp\. 28091–28114\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p2.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Fu, A\. Su, C\. Zhao, H\. Wang, M\. Wu, Z\. Yu, F\. Hu, M\. Shi, W\. Dong, J\. Wang,et al\.\(2025\)Mano technical report\.arXiv preprint arXiv:2509\.17336\.Cited by:[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.20.13.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.24.17.1),[§5\.1](https://arxiv.org/html/2606.18890#S5.SS1.SSS0.Px3.p1.1)\.
- G\. Gonzalez\-Pumariega, V\. Tu, C\. Lee, J\. Yang, A\. Li, and X\. E\. Wang \(2025\)The unreasonable effectiveness of scaling agents for computer use\.arXiv preprint arXiv:2510\.02250\.Cited by:[§5\.1](https://arxiv.org/html/2606.18890#S5.SS1.SSS0.Px3.p1.1)\.
- Google \(2025\)A new era of intelligence with Gemini 3\.Note:[https://blog\.google/products\-and\-platforms/products/gemini/gemini\-3/](https://blog.google/products-and-platforms/products/gemini/gemini-3/)Accessed: 2026\-05\-21Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Appendix D](https://arxiv.org/html/2606.18890#A4.p1.1),[§1](https://arxiv.org/html/2606.18890#S1.p1.1),[§4\.3](https://arxiv.org/html/2606.18890#S4.SS3.p1.1)\.
- B\. Gou, D\. R\. Wang, B\. Zheng, Y\. Xie, C\. Chang, Y\. Shu, H\. Sun, and Y\. Su \(2025\)Navigating the digital world as humans do: universal visual grounding for gui agents\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 30851–30883\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1)\.
- H Company \(2025\)Holo2: cost\-efficient models for cross\-platform computer\-use\.Note:[https://hcompany\.ai/holo2](https://hcompany.ai/holo2)Accessed: 2026\-05\-20Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.28.21.1)\.
- A\. Huang, C\. Yao, C\. Han, F\. Wan, H\. Guo, H\. Lv, H\. Zhou, J\. Wang, J\. Zhou, J\. Sun,et al\.\(2026\)Step3\-vl\-10b technical report\.arXiv preprint arXiv:2601\.09668\.Cited by:[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.7.1.1),[§5\.1](https://arxiv.org/html/2606.18890#S5.SS1.SSS0.Px1.p1.1)\.
- A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes, A\. Radford,et al\.\(2024\)Gpt\-4o system card\.arXiv preprint arXiv:2410\.21276\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[§1](https://arxiv.org/html/2606.18890#S1.p1.1)\.
- H\. Lai, X\. Liu, Y\. Zhao, H\. Xu, H\. Zhang, B\. Jing, Y\. Ren, S\. Yao, Y\. Dong, and J\. Tang \(2025\)Computerrl: scaling end\-to\-end online reinforcement learning for computer use agents\.arXiv preprint arXiv:2508\.14040\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p3.1),[§1](https://arxiv.org/html/2606.18890#S1.p2.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Lauffer, X\. Deng, S\. Kundurthy, B\. Kenstler, and J\. Da \(2025\)Imitation learning for multi\-turn lm agents via on\-policy expert corrections\.arXiv preprint arXiv:2512\.14895\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p2.1),[§1](https://arxiv.org/html/2606.18890#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.18890#S4.SS1.p1.1)\.
- P\. Li, Z\. Hu, Z\. Shang, J\. Wu, Y\. Liu, H\. Liu, Z\. Gao, C\. Shi, B\. Zhang, Z\. Zhang,et al\.\(2025a\)Efficient multi\-turn rl for gui agents via decoupled training and adaptive data curation\.arXiv preprint arXiv:2509\.23866\.Cited by:[§1](https://arxiv.org/html/2606.18890#S1.p2.1),[§5\.1](https://arxiv.org/html/2606.18890#S5.SS1.SSS0.Px3.p1.1)\.
- P\. Li, Z\. Hu, Z\. Shang, J\. Wu, Y\. Liu, H\. Liu, Z\. Gao, C\. Shi, B\. Zhang, Z\. Zhang, X\. Shi, Z\. Yu, Y\. Wu, X\. Wu, Y\. Jia, L\. Xiang, Z\. He, and Q\. Li \(2025b\)Efficient multi\-turn rl for gui agents via decoupled training and adaptive data curation\.arXiv preprint arXiv:2509\.23866\.Cited by:[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.21.14.1)\.
- K\. Q\. Lin, L\. Li, D\. Gao, Z\. Yang, S\. Wu, Z\. Bai, S\. W\. Lei, L\. Wang, and M\. Z\. Shou \(2025\)Showui: one vision\-language\-action model for gui visual agent\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 19498–19508\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1)\.
- Z\. Lin, F\. Liu, Y\. Yang, J\. Lyu, Y\. Gao, Y\. Liu, Z\. Lu, Y\. Yu, M\. Yang, J\. Li,et al\.\(2026\)Ui\-voyager: a self\-evolving gui agent learning via failed experience\.arXiv preprint arXiv:2603\.24533\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p3.1),[§1](https://arxiv.org/html/2606.18890#S1.p2.1),[§1](https://arxiv.org/html/2606.18890#S1.p4.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px2.p1.1)\.
- X\. H\. Lù, A\. Kazemnejad, N\. Meade, A\. Patel, D\. Shin, A\. Zambrano, K\. Stańczak, P\. Shaw, C\. J\. Pal, and S\. Reddy \(2025\)Agentrewardbench: evaluating automatic evaluations of web agent trajectories\.arXiv preprint arXiv:2504\.08942\.Cited by:[§1](https://arxiv.org/html/2606.18890#S1.p7.1),[§4\.4](https://arxiv.org/html/2606.18890#S4.SS4.p4.2)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang,et al\.\(2023\)Self\-refine: iterative refinement with self\-feedback\.Advances in neural information processing systems36,pp\. 46534–46594\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p3.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p4.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px2.p1.1)\.
- OpenAI \(2026\)Introducing GPT\-5\.4\.Note:[https://openai\.com/index/introducing\-gpt\-5\-4/](https://openai.com/index/introducing-gpt-5-4/)Accessed: 2026\-05\-21Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[§1](https://arxiv.org/html/2606.18890#S1.p1.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.13.6.1)\.
- R\. Qian, X\. Yin, C\. Deng, Z\. Peng, J\. Xiong, W\. Zhai, and D\. Dou \(2025\)Uground: towards unified visual grounding with unrolled transformers\.arXiv preprint arXiv:2510\.03853\.Cited by:[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Qin, Y\. Ye, J\. Fang, H\. Wang, S\. Liang, S\. Tian, J\. Zhang, J\. Li, Y\. Li, S\. Huang,et al\.\(2025\)Ui\-tars: pioneering automated gui interaction with native agents\.arXiv preprint arXiv:2501\.12326\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p2.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p4.1),[§1](https://arxiv.org/html/2606.18890#S1.p2.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px1.p1.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.16.9.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.23.16.1)\.
- C\. Rawles, S\. Clinckemaillie, Y\. Chang, J\. Waltz, G\. Lau, M\. Fair, A\. Li, W\. Bishop, W\. Li, F\. Campbell\-Ajala,et al\.\(2025\)Androidworld: a dynamic benchmarking environment for autonomous agents\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 406–441\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p2.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Ross, G\. Gordon, and D\. Bagnell \(2011\)A reduction of imitation learning and structured prediction to no\-regret online learning\.InProceedings of the fourteenth international conference on artificial intelligence and statistics,pp\. 627–635\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p2.1),[§1](https://arxiv.org/html/2606.18890#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.18890#S4.SS1.p1.1)\.
- B\. Seed \(2026\)Seed1\. 8 model card: towards generalized real\-world agency\.arXiv preprint arXiv:2603\.20633\.Cited by:[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.10.3.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.Advances in neural information processing systems36,pp\. 8634–8652\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p3.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p4.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Su, R\. Sun, J\. Yoon, P\. Yin, T\. Yu, and S\. Arik \(2025\)Learn\-by\-interact: a data\-centric framework for self\-adaptive agents in realistic environments\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 89044–89097\.Cited by:[Appendix F](https://arxiv.org/html/2606.18890#A6.p1.3)\.
- Q\. Sun, K\. Cheng, Z\. Ding, C\. Jin, Y\. Wang, F\. Xu, Z\. Wu, C\. Jia, L\. Chen, Z\. Liu,et al\.\(2025\)Os\-genesis: automating gui agent trajectory construction via reverse task synthesis\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 5555–5579\.Cited by:[Appendix F](https://arxiv.org/html/2606.18890#A6.p1.3),[§5\.1](https://arxiv.org/html/2606.18890#S5.SS1.SSS0.Px2.p1.1)\.
- K\. Team, T\. Bai, Y\. Bai, Y\. Bao, S\. Cai, Y\. Cao, Y\. Charles, H\. Che, C\. Chen, G\. Chen,et al\.\(2026\)Kimi k2\. 5: visual agentic intelligence\.arXiv preprint arXiv:2602\.02276\.Cited by:[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.11.4.1)\.
- G\. Wang, S\. Dai, G\. Ye, Z\. Gan, W\. Yao, Y\. Deng, X\. Wu, and Z\. Ying \(2026a\)Information gain\-based policy optimization: a simple and effective approach for multi\-turn search agents\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.18890#S1.p2.1)\.
- X\. Wang, B\. Wang, D\. Lu, J\. Yang, T\. Xie, J\. Wang, J\. Deng, X\. Guo, Y\. Xu, C\. Wu,et al\.\(2026b\)Opencua: open foundations for computer\-use agents\.Advances in Neural Information Processing Systems38,pp\. 139756–139806\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p2.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p4.1),[§1](https://arxiv.org/html/2606.18890#S1.p2.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px1.p1.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.15.8.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.19.12.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.22.15.1)\.
- Z\. Wang, Z\. Zhang, X\. Zhang, Z\. Qian, and Y\. Lu \(2026c\)From off\-policy to on\-policy: enhancing gui agents via bi\-level expert\-to\-policy assimilation\.arXiv preprint arXiv:2601\.05787\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p3.1),[§1](https://arxiv.org/html/2606.18890#S1.p2.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Wanyan, X\. Zhang, H\. Xu, H\. Liu, J\. Wang, J\. Ye, Y\. Kou, M\. Yan, F\. Huang, X\. Yang,et al\.\(2026\)Look before you leap: a gui\-critic\-r1 model for pre\-operative error diagnosis in gui automation\.Advances in Neural Information Processing Systems38,pp\. 3907–3929\.Cited by:[§1](https://arxiv.org/html/2606.18890#S1.p7.1)\.
- Y\. Wu, Z\. Cai, L\. Ning, H\. Wang, Z\. Chen, Y\. Tang, and H\. Chen \(2026\)LiteGUI: distilling compact gui agents with reinforcement learning\.arXiv preprint arXiv:2605\.07505\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p3.1),[§1](https://arxiv.org/html/2606.18890#S1.p2.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px2.p1.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.27.20.1)\.
- Z\. Wu, Z\. Wu, F\. Xu, Y\. Wang, Q\. Sun, C\. Jia, K\. Cheng, Z\. Ding, L\. Chen, P\. P\. Liang,et al\.\(2025\)OS\-atlas: foundation action model for generalist gui agents\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 5090–5108\.Cited by:[§5\.1](https://arxiv.org/html/2606.18890#S5.SS1.SSS0.Px2.p1.1)\.
- T\. Xie, J\. Deng, X\. Li, J\. Yang, H\. Wu, J\. Chen, W\. Hu, X\. Wang, Y\. Xu, Z\. Wang,et al\.\(2026\)Scaling computer\-use grounding via user interface decomposition and synthesis\.Advances in Neural Information Processing Systems38\.Cited by:[Appendix F](https://arxiv.org/html/2606.18890#A6.p1.3)\.
- T\. Xie, D\. Zhang, J\. Chen, X\. Li, S\. Zhao, R\. Cao, T\. J\. Hua, Z\. Cheng, D\. Shin, F\. Lei,et al\.\(2024\)Osworld: benchmarking multimodal agents for open\-ended tasks in real computer environments\.Advances in Neural Information Processing Systems37,pp\. 52040–52094\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p2.1),[Appendix F](https://arxiv.org/html/2606.18890#A6.p1.3),[Appendix I](https://arxiv.org/html/2606.18890#A9.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2606.18890#S5.SS1.SSS0.Px3.p1.1)\.
- H\. Xu, X\. Zhang, H\. Liu, J\. Wang, Z\. Zhu, S\. Zhou, X\. Hu, F\. Gao, J\. Cao, Z\. Wang,et al\.\(2026\)Mobile\-agent\-v3\. 5: multi\-platform fundamental gui agents\.arXiv preprint arXiv:2602\.16855\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p2.1),[§1](https://arxiv.org/html/2606.18890#S1.p2.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.6.6.6.1)\.
- Y\. Xu, D\. Lu, Z\. Shen, J\. Wang, Z\. Wang, Y\. Mao, C\. Xiong, and T\. Yu \(2025\)Agenttrek: agent trajectory synthesis via guiding replay with web tutorials\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 79822–79843\.Cited by:[Appendix F](https://arxiv.org/html/2606.18890#A6.p1.3)\.
- T\. Xue, C\. Peng, M\. Huang, L\. Guo, T\. Han, H\. Wang, J\. Wang, X\. Zhang, X\. Yang, D\. Zhao,et al\.\(2026\)Evocua: evolving computer use agents via learning from scalable synthetic experience\.arXiv preprint arXiv:2601\.15876\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p2.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p4.1),[Appendix F](https://arxiv.org/html/2606.18890#A6.p1.3),[§1](https://arxiv.org/html/2606.18890#S1.p2.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px1.p1.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.32.25.1)\.
- H\. Yan, J\. Wang, X\. Huang, Y\. Shen, Z\. Meng, Z\. Fan, K\. Tan, J\. Gao, L\. Shi, M\. Yang,et al\.\(2025\)Step\-gui technical report\.arXiv preprint arXiv:2512\.15431\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p2.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p3.1),[§1](https://arxiv.org/html/2606.18890#S1.p1.1),[§1](https://arxiv.org/html/2606.18890#S1.p2.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px2.p1.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.31.24.1)\.
- J\. Ye, X\. Zhang, H\. Xu, H\. Liu, J\. Wang, Z\. Zhu, Z\. Zheng, F\. Gao, J\. Cao, Z\. Lu,et al\.\(2025\)Mobile\-agent\-v3: fundamental agents for gui automation\.arXiv preprint arXiv:2508\.15144\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Table 2](https://arxiv.org/html/2606.18890#S4.T2.7.7.18.11.1)\.
- S\. Zeng, Q\. Wei, W\. Brown, O\. Frunza, Y\. Nevmyvaka, and M\. Hong \(2025\)Reinforcing multi\-turn reasoning in llm agents via turn\-level credit assignment\.arXiv preprint arXiv:2505\.11821\.Cited by:[§1](https://arxiv.org/html/2606.18890#S1.p2.1)\.
- Z\. Zhang, X\. Liu, X\. Zhang, J\. Wang, G\. Chen, and Y\. Lu \(2025\)UI\-evol: automatic knowledge evolving for computer use agents\.arXiv preprint arXiv:2505\.21964\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p3.1),[§1](https://arxiv.org/html/2606.18890#S1.p2.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Zhao, J\. Huang, J\. Hu, X\. Wang, Y\. Mao, D\. Zhang, Z\. Jiang, Z\. Wu, B\. Ai, A\. Wang,et al\.\(2025\)Swift: a scalable lightweight infrastructure for fine\-tuning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 29733–29735\.Cited by:[Appendix H](https://arxiv.org/html/2606.18890#A8.SS0.SSS0.Px1.p1.6)\.
- B\. Zheng, B\. Gou, J\. Kil, H\. Sun, and Y\. Su \(2024\)Gpt\-4v \(ision\) is a generalist web agent, if grounded\.arXiv preprint arXiv:2401\.01614\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p2.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing,et al\.\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems36,pp\. 46595–46623\.Cited by:[§4\.4](https://arxiv.org/html/2606.18890#S4.SS4.p4.2)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried,et al\.\(2024\)Webarena: a realistic web environment for building autonomous agents\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 15585–15606\.Cited by:[Appendix C](https://arxiv.org/html/2606.18890#A3.p1.1),[Appendix C](https://arxiv.org/html/2606.18890#A3.p2.1),[§2](https://arxiv.org/html/2606.18890#S2.SS0.SSS0.Px1.p1.1)\.
## Appendix AAdditional Handoff\-Depth Analysis
To understand how the choice of handoff depthkkshapes the continuation dataset, we ablate thekkset used to construct𝒟cont\\mathcal\{D\}\_\{\\mathrm\{cont\}\}for Qwen3\-VL\-8B\. All variants share the same skill, verifier, and training recipe described in Sec\.[4\.5](https://arxiv.org/html/2606.18890#S4.SS5); only the range of admissible handoff depths differs\. Results are reported in Tab\.[5](https://arxiv.org/html/2606.18890#A1.T5)\.
Table 5:Ablation on the handoff\-depth set used for off\-trajectory continuation construction\.#### Sweepingkkdominates any single bucket\.
Enumeratingk∈\{1,…,20\}k\\in\\\{1,\\ldots,20\\\}reaches45\.745\.7, outperforming every individual sub\-range by2\.52\.5–4\.74\.7points\. This indicates that no single handoff depth is sufficient: different tasks expose their policy\-induced off\-trajectory states at different points along the rollout, and a fixedkkinevitably misses the off\-trajectory state where the actionable mistake occurs\. Sweeping the full window converts each failed task into multiple supervised continuations from distinct off\-trajectory states, which both increases data volume and reduces the hand\-crafted bias of any single handoff choice\.
#### Performance degrades with deeper handoff\.
Within a fixed\-width window, accuracy decreases monotonically askkgrows beyond1010\(43\.2→41\.0→41\.843\.2\\to 41\.0\\to 41\.8\)\. This matches the failure\-distribution finding in Fig\.[1](https://arxiv.org/html/2606.18890#S1.F1)\(a\): plain\-policy errors are concentrated in early execution, so an off\-trajectory state induced after only a few steps lies closest to the failure mode that distillation needs to repair\. Deeper handoff also leaves the skill\-guided policy with a shorter remaining horizon to complete the task, which lowers the verifier pass rate and yields fewer high\-quality continuations per task\.
## Appendix BPer\-Handoff\-Depth Continuation Success Rate
Tab\.[4](https://arxiv.org/html/2606.18890#S5.T4)in the main paper reports the aggregate Continuation Success Rate\. Here we provide the full per\-handoff\-depth breakdown for ten systems on the same evaluation protocol\. The pool consists of142142OSWorld\-Verified tasks on which the plain Qwen3\-VL\-8B policy fails\. For each task and eachk∈\{1,…,20\}k\\\!\\in\\\!\\\{1,\\ldots,20\\\}, we let the plain policy execute the firstkksteps to induce a policy\-induced off\-trajectory state and then hand control to the evaluated model, which must complete the task without any skill prompt\. A trial is counted as successful when the verifier score is≥0\.8\\geq\\\!0\.8\. The full20×1020\\\!\\times\\\!10table is reported in Tab\.[6](https://arxiv.org/html/2606.18890#A2.T6)\.
Table 6:Per\-handoff\-depth Continuation Success Rate \(%\) on the142142OSWorld\-Verified tasks failed by the original plain policy\. For eachkk, the plain policy executeskksteps before handing off to the evaluated model, which must complete the task without any skill prompt\. Average row reports the unweighted mean overk=1,…,20k\\\!=\\\!1,\\ldots,20\. 8B\-Inst/Thk and 30A3\-Inst/Thk denote the Qwen3\-VL\-8B\-Instruct/Thinking and Qwen3\-VL\-30B\-A3B\-Instruct/Thinking backbones, respectively\.#### SGCD beats backbone\-matched baselines at everykk\.
SGCD\-30B\-A3B averages50\.3%50\.3\\%vs\.28\.2%28\.2\\%\(GUI\-Owl\-1\.5\-8B\-Instruct\) and27\.2%27\.2\\%\(Qwen3\-VL\-8B\-Thinking\), and the gap is consistent across the entire sweep, not driven by a few favorable handoff depths\. SGCD\-8B reaches39\.2%39\.2\\%vs\. EvoCUA\-8B’s32\.7%32\.7\\%under the same backbone budget, with the gap widening on early\-to\-midkkwhere the policy\-induced off\-trajectory state is most non\-trivial\.
#### Strong commercial models degrade fastest withkk\.
Claude Sonnet 4\.6 starts the highest atk=1k\\\!=\\\!1\(80\.3%80\.3\\%\) but drops the most byk=20k\\\!=\\\!20\(47\.2%47\.2\\%\), a3333\-point decline\. Kimi K2\.5 falls1414points over the same range\. We attribute this to compounding prefix errors: the longer the plain Qwen3\-VL\-8B policy is allowed to act, the more its prefix accumulates mistakes that even strong general\-purpose models cannot fully undo\. SGCD\-30B\-A3B’s range is much tighter \(4242–60%60\\%\), consistent with its training objective: continuation distillation explicitly supervises behavior from policy\-induced off\-trajectory states, so robustness across handoff depths is the property it is trained to acquire\.
#### Weak baselines flatten or even improve withkk\.
Models with very lowk=1k\\\!=\\\!1Continuation Success Rate \(Qwen3\-VL\-8B\-Thinking at20\.4%20\.4\\%; GUI\-Owl\-1\.5\-8B\-Instruct at27\.5%27\.5\\%\) often perform better at moderate\-to\-deepkkthan atk=1k\\\!=\\\!1\. This is because such models lack their own coherent open\-ended plan; a moderately deep prefix from the plain policy effectively narrows the task and steers them toward the relevant UI surface, partially compensating for their weaker planning ability\. SGCD\-trained policies do not need this scaffolding – they remain strong at smallkk, where the off\-trajectory state is closest to the actual policy failure mode\.
## Appendix CExtended Related Work
Vision\-language GUI agents show strong promise in performing open\-ended computer tasksAnthropic \([2024b](https://arxiv.org/html/2606.18890#bib.bib8),[a](https://arxiv.org/html/2606.18890#bib.bib9)\); OpenAI \([2026](https://arxiv.org/html/2606.18890#bib.bib27)\); Google \([2025](https://arxiv.org/html/2606.18890#bib.bib26)\); Anthropic \([2026](https://arxiv.org/html/2606.18890#bib.bib39)\); Hurstet al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib40)\)\. Standardized benchmarks across web, mobile, and desktop environments provide rigorous evaluation platforms for this line of researchDenget al\.\([2023](https://arxiv.org/html/2606.18890#bib.bib10)\); Zhouet al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib11)\); Zhenget al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib12)\); Rawleset al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib13)\); Xieet al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib14)\), and a growing body of GUI\-specialized systems advances grounding, planning, action generation, and trajectory\-level supervisionQinet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib28)\); Wanget al\.\([2026b](https://arxiv.org/html/2606.18890#bib.bib29)\); Yanet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib30)\); Xuet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib16)\); H Company \([2025](https://arxiv.org/html/2606.18890#bib.bib17)\); Wuet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib32)\); Xueet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib31)\); Yeet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib15)\); Chenget al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib45)\); Gouet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib46)\); Linet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib47)\)\. SGCD is complementary to these advances: rather than proposing a new base model or benchmark, it targets the off\-trajectory supervision deficit left by behavior cloning on expert trajectories\.
Most GUI agents learn from successful demonstrations, synthetic trajectories, or state\-action tracesDenget al\.\([2023](https://arxiv.org/html/2606.18890#bib.bib10)\); Zhenget al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib12)\); Qinet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib28)\); Wanget al\.\([2026b](https://arxiv.org/html/2606.18890#bib.bib29)\); Yanet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib30)\); Xuet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib16)\); Xueet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib31)\)\. Such supervision teaches grounding, action formatting, and interface conventions, but primarily covers expert\-induced states\. In closed\-loop environmentsZhouet al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib11)\); Rawleset al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib13)\); Xieet al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib14)\), local mistakes produce policy\-induced off\-trajectory states that are absent from successful traces\. This is a direct manifestation of covariate shift in imitation learningRosset al\.\([2011](https://arxiv.org/html/2606.18890#bib.bib18)\); Laufferet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib44)\): the policy accumulates errors in states unseen during training, yet obtaining new supervision requires costly re\-interaction with the live environment\.
Self\-improvement methods seek to address this by leveraging the agent’s own experience as a training signal\. ReflexionShinnet al\.\([2023](https://arxiv.org/html/2606.18890#bib.bib21)\)and Self\-RefineMadaanet al\.\([2023](https://arxiv.org/html/2606.18890#bib.bib22)\)use inference\-time verbal feedback to revise outputs\. Recent GUI pipelines convert model\-generated rollouts into supervision through filtered self\-trainingYanet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib30)\); Wuet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib32)\), sandbox\-based reinforcement learningLaiet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib43)\), experience\-driven knowledge refinementZhanget al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib68)\); Linet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib59)\), or policy\-aligned experience assimilationWanget al\.\([2026c](https://arxiv.org/html/2606.18890#bib.bib69)\)\. SGCD follows this paradigm and specifically targets policy\-induced off\-trajectory states, synthesizing skill\-guided continuation supervision from the states the current policy actually traverses\.
Finally, structured guidance, memory, and experience retrieval improve agent planningShinnet al\.\([2023](https://arxiv.org/html/2606.18890#bib.bib21)\); Madaanet al\.\([2023](https://arxiv.org/html/2606.18890#bib.bib22)\); Agasheet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib23)\); Qinet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib28)\); Wanget al\.\([2026b](https://arxiv.org/html/2606.18890#bib.bib29)\); Xueet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib31)\)\. Agent S uses experience\-augmented hierarchical planningAgasheet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib23)\), while OpenCUA, UI\-TARS, and EvoCUA incorporate reflective conversion, memory\-like capabilities, or self\-correctionWanget al\.\([2026b](https://arxiv.org/html/2606.18890#bib.bib29)\); Qinet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib28)\); Xueet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib31)\)\. SGCD uses skills only as a training\-time synthesis scaffold: Gemini\-3\-Pro abstracts successful and failed trajectories into Continuation Plans, Critical Targets, Failure Traps, and Success Criteria, enabling verified continuations without fixed\-path replay or replacing the acting policy\.
## Appendix DSkill Extraction Prompt
We use Gemini\-3\-ProGoogle \([2025](https://arxiv.org/html/2606.18890#bib.bib26)\)to abstract trajectory evidence into structured off\-trajectory continuation skills\. For each task, the prompt is fed with the task id, all available successful reference trajectories sampled by the plain policy, and a matched set of failed reference trajectories\. The model is instructed to \(i\) infer the stable good path from successful references, \(ii\) infer obstacles, traps, and bad paths from failed references, \(iii\) compress trajectories into reusable procedural knowledge rather than verbatim replays, and \(iv\) emit a JSON object with exactly four fields aligned with the failure families analyzed in Sec\.[4\.3](https://arxiv.org/html/2606.18890#S4.SS3):*Continuation Plans*,*Critical Targets*,*Failure Traps*, and*Success Criteria*\. To suppress hallucinated guidance, alternative paths are only drawn from successful references, and fields without supporting evidence are left empty rather than filled in\. The full prompt template is shown in Fig\.[5](https://arxiv.org/html/2606.18890#A4.F5)\.
Skill Extraction Prompt \(Gemini\-3\-Pro\)System PromptTask id:\{task\_id\}\.You are extracting a concise SOP \(standard operating procedure\) for one computer\-use task\. You will be given successful references and failed references for the same task\.•Usesuccessful referencesto infer the stable good path\.•Usefailed referencesto infer obstacles, traps, and bad paths to avoid\.•Do not copy long trajectory text verbatim\. Compress them into short, reusable procedural knowledge\.•Only include information supported by the provided references\.Output FormatReturn JSON only with exactly these fields:``` { "Continuation Plans": string[], "Critical Targets": string[], "Failure Traps": string[], "Success Criteria": string[] } ``` Guidelines•Keep each bullet short, concrete, and reusable\.•Prefer task\-level decisions over click\-by\-click replay\.•Include wrong formulas / wrong pages / wrong menus inFailure Trapswhen supported by failed references\.•Alternative paths must only come from successful references; do not invent or generalize new alternatives\.•If uncertain, leave a field empty instead of hallucinating\.User PromptSuccessful references:\{successful\_trajectories\} Failed references:\{failed\_trajectories\}
Figure 5:Skill\-extraction prompt used with Gemini\-3\-Pro\. Task\-specific successful and failed reference trajectories are concatenated into the user\-prompt slot, and the model is required to emit JSON with the four schema fields\.
## Appendix ESkill Examples
To illustrate the structured output produced by the skill\-extraction prompt, we show three representative skills covering different task families: a web browsing task on a Chinese auto\-information portal \(Fig\.[6](https://arxiv.org/html/2606.18890#A5.F6)\), an office\-document editing task on a word processor \(Fig\.[7](https://arxiv.org/html/2606.18890#A5.F7)\), and a browser\-settings task \(Fig\.[8](https://arxiv.org/html/2606.18890#A5.F8)\)\. Each skill contains four task\-specific fields aligned with the failure families analyzed in Sec\.[4\.3](https://arxiv.org/html/2606.18890#S4.SS3):*Continuation Plans*provide the abstracted good path,*Critical Targets*list the salient UI anchors,*Failure Traps*enumerate concrete mistakes observed in failed references, and*Success Criteria*define the verifier\-aligned terminal state\.
Skill Example: Browse Wuling MINIEV owner community posts on DongchediContinuation Plans•Find and click the “Find Car” or “Select Car” entry in the top navigation bar of the Dongchedi homepage \(dongchedi\.com\)\.•Type “Wuling MINIEV” in the search box and press Enter to search\.•On the search results page, click the “Wuling MINIEV” model card to enter the model detail page\.•On the model detail page, find and click the “Owner Community” or “Community” tab\.•Wait for the community feed to finish loading and confirm the page URL contains/community/4499\.Critical Targets•Dongchedi homepage search box\.•“Wuling MINIEV” search result card\.•“Owner Community” tab on the model detail page\.•Target URL:dongchedi\.com/community/4499\.Failure Traps•Do not browse the homepage recommended content without searching for the specific model\.•Do not click “Hongguang MINIEV” or other similarly named model suggestions; confirm it is the “MINIEV” itself\.•Do not stay on the model specs or pricing page; you must navigate to the “Owner Community” tab\.•Do not click ad pop\-ups or promotional banners on the page, as they navigate to irrelevant pages\.Success Criteria•The active tab URL isdongchedi\.com/community/4499\.•The page displays the Wuling MINIEV owner community feed with a list of posts\.
Figure 6:Skill extracted for a web browsing task on a Chinese auto\-information portal\. The Failure Traps field captures search\-suggestion confusables \(e\.g\., “Hongguang MINIEV” vs\. “Wuling MINIEV”\) that repeatedly mislead the plain policy\.Skill Example: Replace all question marks with exclamation marksContinuation Plans•Press Ctrl\+H to open the “Find and Replace” dialog\.•In the “Search For” input field, type “?” \(question mark\)\.•Click the “Replace With” input field and type “\!” \(exclamation mark\)\.•Make sure the “Regular expressions” checkbox is NOT checked; otherwise “?” will be treated as a regex metacharacter\.•Click the “Replace All” button to perform the batch replacement\.•Confirm the replacement completion prompt \(showing the number of replacements made\)\.•Press Escape to close the Find and Replace dialog\.•Press Ctrl\+S to save the document\.Critical Targets•Ctrl\+H to open the Find and Replace dialog\.•“Search For” input field \(type “?”\)\.•“Replace With” input field \(type “\!”\)\.•“Replace All” button\.•Ctrl\+S to save\.Failure Traps•Do not check the “Regular expressions” option, because “?” is a special character \(quantifier\) in regex and will cause the replacement to fail or match incorrectly\.•Do not check options under “Other options” that could affect matching, such as “Current selection only”\.•Do not forget to check whether the “Search For” field has leftover text; clear it first before typing\.•Do not replace one by one \(clicking “Replace” instead of “Replace All”\); this is extremely inefficient for documents with many occurrences\.Success Criteria•All original question marks in the document have been replaced with exclamation marks\.•All other content and formatting in the document remain unchanged\.•The document is saved\.
Figure 7:Skill extracted for an office\-document editing task\. The Failure Traps field captures a non\-obvious dialog\-option pitfall \(the “Regular expressions” checkbox interfering with the literal “?” match\) that is hard to recover from once committed\.Skill Example: Enable blocking third\-party cookies in ChromeContinuation Plans•Typechrome://settings/cookiesin the Chrome address bar and press Enter to navigate directly to the Cookie settings page\.•Alternatively, click the three\-dot menu in the top\-right corner\>\>Settings\>\>Privacy and Security\>\>Third\-party cookies\.•On the Cookie settings page, find the “Block third\-party cookies” option\.•Click the radio button or toggle for that option to set it to “Block third\-party cookies” mode\.•Confirm the setting has taken effect \(the option is selected or the toggle is in the on position\)\.Critical Targets•Chrome address bar \(typechrome://settings/cookies\)\.•Or three\-dot menu\>\>Settings\>\>Privacy and Security\.•“Third\-party cookies” or “Cookies and other site data” setting\.•“Block third\-party cookies” radio button/toggle\.Failure Traps•Do not select “Block all cookies”; that will break many websites\. Only block third\-party cookies\.•Do not look for the Cookie option under “Site settings”; it is under the main “Privacy and Security” category\.•Do not confuse “Clear cookies” with “Block cookies”; they are two different operations\.•Do not close the browser to “apply” the settings; the change takes effect immediately\.Success Criteria•On Chrome’s Cookie settings page, the “Block third\-party cookies” option is selected/enabled\.•The setting persists after restarting Chrome\.
Figure 8:Skill extracted for a browser\-settings task\. The Failure Traps field disambiguates closely related but semantically different options \(“Block all cookies” vs\. “Block third\-party cookies”, “Clear cookies” vs\. “Block cookies”\) that frequently cause plain\-policy off\-trajectory drift\.
## Appendix FTraining Task Construction
We construct an OSWorld\-compatible training pool𝒳\\mathcal\{X\}that is disjoint from the OSWorld\-VerifiedXieet al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib14)\)evaluation suite\. Each taskx∈𝒳x\\in\\mathcal\{X\}is a triple of \(i\) a natural\-language instruction, \(ii\) a scripted initial environment state on the Ubuntu sandbox, and \(iii\) an automatic verifier returningVx\(τ\)∈\{0,1\}V\_\{x\}\(\\tau\)\\in\\\{0,1\\\}from the post\-rollout environment state\. Whereas prior GUI training\-data pipelines focus on trajectory synthesisXuet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib42)\); Sunet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib61)\); Suet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib41)\); Xueet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib31)\)or grounding dataXieet al\.\([2026](https://arxiv.org/html/2606.18890#bib.bib34)\), our pipeline jointly produces the instruction, the initial\-state assets, and the verifier from a per\-application scenario specification, covers all ten desktop applications exercised by OSWorld\-Verified, and is sized to support the trajectory\-mixing recipe of Sec\.[4\.5](https://arxiv.org/html/2606.18890#S4.SS5)\.
#### Domain taxonomy\.
The ten target applications are organised into five coarse categories that serve as the unit of scenario design:
- •OS–os: file manager, system settings, terminal, package and process management, and other OS\-level operations\.
- •Office–libreoffice\_calc,libreoffice\_impress,libreoffice\_writer: spreadsheet formatting, formulas, charts, and pivot tables; slide layouts, animations, and master styles; document editing, find\-and\-replace, styles, tables, and page setup\.
- •Daily–chrome: web navigation, multi\-tab search, form filling, bookmarks and history management, downloads, extensions, and browser\-setting tasks\.
- •Professional–gimp,thunderbird,vlc,vs\_code: image editing \(layers, filters, color adjustments, selections, export\); email composition, folder organisation, and mail\-filter management; media playback control, playlists, subtitles, and conversion; code editing, search\-and\-replace, refactoring, extensions, debugging, and the integrated terminal\.
- •Workflow–multi\_apps: cross\-application tasks that compose two or more of the above, e\.g\., extracting data from a spreadsheet, rendering a chart in a presentation, and sending the result through email\.
For each category we enumerate the modal usage scenarios and feature surfaces from official documentation and an LLM\-driven web research pass over real\-world usage patterns\. The resulting per\-application feature taxonomy is used purely as a sampling scaffold for instruction synthesis and is never fed to the trained policy\.
#### Instruction synthesis and human filtering\.
Conditioned on the per\-application feature taxonomy, an LLM proposes candidate natural\-language instructions under three constraints: \(i\)realistic– the task should plausibly arise from real user workflows rather than being a synthetic UI exercise; \(ii\)concrete and verifiable– task completion must be decidable by inspecting the post\-rollout environment; \(iii\)deterministic verifier\-relevant outcome– successful completion must be decidable from a well\-defined set of terminal\-state properties, even when multiple interaction paths or irrelevant UI configurations are possible\.\. A subsequent human pass de\-duplicates near\-identical instructions, removes ambiguous or under\-specified phrasings, and rejects tasks whose terminal state cannot be operationalised without subjective judgement\.
#### Initial\-state assets and automatic verifiers\.
For every accepted instruction we manually prepare the assets required to instantiate the task on a fresh VM \(e\.g\., the working spreadsheet, source document, reference image, mailbox snapshot, or browser profile\), together with the reference file\(s\) used by the verifier\. The initial state is realised by a per\-task setup script that places these assets at the canonical paths expected by the OSWorld\-Verified contract\. The verifier itself follows one of two patterns according to the task family:
- •State\-extraction verifiers\.For tasks whose goal is to navigate the system or an application into a specific configuration \(e\.g\., reach a particular Chrome setting page, select a specific track in VLC, switch a VS Code panel\), the verifier programmatically extracts the final application state through the application’s introspection API or accessibility tree and asserts equality with the expected state\.
- •File\-match verifiers\.For tasks whose goal is to produce or modify a file \(e\.g\., editing a Calc workbook, exporting a GIMP image, replacing strings in a Writer document\), the verifier loads the post\-rollout artifact and compares it against the reference file using property\-level assertions \(cell value/format, image content/metadata, document text/styles\) rather than byte\-level equality\.
#### Difficulty binning\.
We label every task with the number of GUI primitives that an expert demonstrator needs to complete it, and bin tasks into three difficulty levels:easy\(1–5 steps\),medium\(5–15 steps\), andhard\(15–100 steps\)\. The three bins are sampled to be approximately uniformly distributed within each application, so that the training pool exercises both short skill\-grounding tasks and long\-horizon planning\.
#### Manual solvability check\.
Before a task enters the training pool, a human annotator executes it end\-to\-end on a fresh sandbox VM and confirms that the prepared initial state, the assets, and the automatic verifier are mutually consistent: the instruction is unambiguously actionable, the verifier returnsVx\(τ\)=1V\_\{x\}\(\\tau\)=1on the human\-completed trajectory, and no required asset or UI surface is missing\. Tasks that fail this manual pass – because of an under\-specified instruction, an unreachable goal state, or a verifier that disagrees with a correct human execution – are sent back for asset, instruction, or verifier revision and re\-tested, or dropped if they cannot be repaired\. This step closes the gap between syntactic verifier acceptance and end\-to\-end solvability, and guarantees that every released task in𝒳\\mathcal\{X\}has at least one verified human solution\.
#### Final corpus\.
The resulting training pool contains1,4321\{,\}432verified tasks spanning the ten domains; per\-application counts are reported in Tab\.[7](https://arxiv.org/html/2606.18890#A6.T7)\. The corpus is intentionally weighted toward theWorkflowcategory, since multi\-application tasks expose the long\-horizon off\-trajectory states that SGCD is designed to repair\.
CategoryApplication\#TasksOSos88Officelibreoffice\_calc189libreoffice\_impress179libreoffice\_writer90Dailychrome188Professionalgimp86thunderbird53vlc58vs\_code90Workflowmulti\_apps411Total1,432Table 7:Per\-application task counts in the constructed training pool𝒳\\mathcal\{X\}\. Tasks are organised into five coarse categories \(OS, Office, Daily, Professional, Workflow\) used as the unit of scenario and feature analysis during instruction synthesis\.
## Appendix GAlgorithm Overview
Algorithm[1](https://arxiv.org/html/2606.18890#alg1)summarizes the full Skill\-Guided Continuation Distillation pipeline\. Each iteration samples plain\-policy rollouts, extracts a task\-specific skill from successful and failed evidence, instantiates policy\-induced off\-trajectory states with akk\-step plain\-policy prefix, hands the state off to the skill\-guided policy to obtain a verified successful continuation, and distills the post\-handoff portion back into the plain policy without skill prompts\. As the policy improves across iterations, more failed tasks become recoverable and enter subsequent continuation construction\.
Algorithm 1Skill\-Guided Continuation Distillation1:Input:tasks
𝒳\\mathcal\{X\}, expert trajectories
𝒟exp\\mathcal\{D\}\_\{\\mathrm\{exp\}\}, model
πθ\\pi\_\{\\theta\}, skill constructor
GG, number of iterations
RR, handoff range
KmaxK\_\{\\max\}
2:foriteration
r=1r=1to
RRdo
3:Initialize
𝒟\+←∅\\mathcal\{D\}^\{\+\}\\leftarrow\\emptyset,
𝒟cont←∅\\mathcal\{D\}\_\{\\mathrm\{cont\}\}\\leftarrow\\emptyset,
𝒳fail←∅\\mathcal\{X\}\_\{\\mathrm\{fail\}\}\\leftarrow\\emptyset
4:fortask
x∈𝒳x\\in\\mathcal\{X\}do
5:Sample policy trajectories
𝒯x,policy\\mathcal\{T\}\_\{x,\\mathrm\{policy\}\}using
πpolicy\\pi\_\{\\mathrm\{policy\}\}
6:Add verified successes to
𝒟\+\\mathcal\{D\}^\{\+\}and failed tasks to
𝒳fail\\mathcal\{X\}\_\{\\mathrm\{fail\}\}
7:endfor
8:fortask
x∈𝒳failx\\in\\mathcal\{X\}\_\{\\mathrm\{fail\}\}do
9:Construct task skill
sx←G\(𝒯x,policy\)s\_\{x\}\\leftarrow G\(\\mathcal\{T\}\_\{x,\\mathrm\{policy\}\}\)from success and failure evidence
10:Sample
τx,skill∼πskill\(⋅∣x,sx\)\\tau\_\{x,\\mathrm\{skill\}\}\\sim\\pi\_\{\\mathrm\{skill\}\}\(\\cdot\\mid x,s\_\{x\}\)
11:if
Vx\(τx,skill\)=0V\_\{x\}\(\\tau\_\{x,\\mathrm\{skill\}\}\)=0thencontinue⊳\\trianglerightnot recoverable yet
12:endif
13:for
k=1k=1to
KmaxK\_\{\\max\}do
14:Run
πpolicy\\pi\_\{\\mathrm\{policy\}\}online for
kksteps to reach
hk\+1ph\_\{k\+1\}^\{p\}
15:Hand off to
πskill\(⋅∣hk\+1p,x,sx\)\\pi\_\{\\mathrm\{skill\}\}\(\\cdot\\mid h\_\{k\+1\}^\{p\},x,s\_\{x\}\)for the remaining horizon
16:ifspliced rollout
τ^\\hat\{\\tau\}passes verifier
VxV\_\{x\}and LLM quality judgethen
17:Add only post\-handoff examples
\(h^t,a^t\)\(\\hat\{h\}\_\{t\},\\hat\{a\}\_\{t\}\)for
t\>kt\>kto
𝒟cont\\mathcal\{D\}\_\{\\mathrm\{cont\}\}
18:endif
19:endfor
20:endfor
21:Train
πθ\\pi\_\{\\theta\}on
𝒟exp∪𝒟\+∪𝒟cont\\mathcal\{D\}\_\{\\mathrm\{exp\}\}\\cup\\mathcal\{D\}^\{\+\}\\cup\\mathcal\{D\}\_\{\\mathrm\{cont\}\}without skill prompts
22:endfor
23:Output:deployment policy
πθ\\pi\_\{\\theta\}\(no skill at inference\)
## Appendix HTraining Details
All three backbones are trained on 64 H100 GPUs \(8 nodes×\\times8 GPUs\) with the same trajectory mixture described in Sec\.[4\.5](https://arxiv.org/html/2606.18890#S4.SS5): original expert trajectories𝒟exp\\mathcal\{D\}\_\{\\mathrm\{exp\}\}, verified successful policy trajectories𝒟\+\\mathcal\{D\}^\{\+\}, and verified post\-handoff continuations𝒟cont\\mathcal\{D\}\_\{\\mathrm\{cont\}\}\. The Vision Transformer \(ViT\) and the vision–language aligner are frozen throughout, and only the language tower is fine\-tuned\. Training uses bfloat16 with the AdamW optimizer\.
#### Qwen3\-VL\-8B and Qwen3\-VL\-30B\-A3B\.
These two backbones share the same training recipe under theSwift\(Megatron\-SFT\) frameworkZhaoet al\.\([2025](https://arxiv.org/html/2606.18890#bib.bib60)\)\. We use tensor\-parallel size 2 with sequence parallel enabled, micro batch size 2 and global batch size 512, and a maximum sequence length of10,24010\{,\}240\. Activation recomputation is set to full with the uniform method \(one layer per recompute block\) and ViT gradient checkpointing is enabled\. Optimization uses learning rate1×10−51\\\!\\times\\\!10^\{\-5\}with a1×10−61\\\!\\times\\\!10^\{\-6\}minimum, a warmup fraction of0\.050\.05, fused cross\-entropy loss, and last\-round loss masking so that supervision is applied only to the policy\-predicted tokens of each turn\. We train for22epochs over the mixed dataset with packing enabled to reduce padding overhead\. The pre\-handoff plain\-policy actions in𝒟cont\\mathcal\{D\}\_\{\\mathrm\{cont\}\}are kept as context but masked out of the action loss \(K\-start; Sec\.[4\.5](https://arxiv.org/html/2606.18890#S4.SS5)\)\.
#### STEP3\-VL\-10B\.
This backbone is trained with theSteptronframework using a long\-context configuration tailored to its YARN\-extended128k128\\mathrm\{k\}position range\. The model uses tensor\-parallel size 8, pipeline\-parallel size 1, sequence parallel, full activation recomputation, and freezes the ViT encoder\. The training packs sequences to a global sequence length of128×1024=131,072128\\times 1024=131\{,\}072tokens, with micro batch size 1 and global batch size 32 over3,8733\{,\}873iterations \(approximately22epochs of the trajectory mixture\)\. Optimization uses AdamW \(β1=0\.9\\beta\_\{1\}=0\.9,β2=0\.95\\beta\_\{2\}=0\.95,ϵ=10−8\\epsilon=10^\{\-8\}\) with gradient clipping at1\.01\.0, a cosine schedule with peak learning rate1×10−41\\\!\\times\\\!10^\{\-4\}, minimum learning rate1×10−51\\\!\\times\\\!10^\{\-5\}, and a200200\-iteration linear warmup\. Weight decay is set to0, and bf16 is used without fp16 loss scaling\. Key training hyperparameters for the three backbones are summarized in Tab\.[8](https://arxiv.org/html/2606.18890#A8.T8)\.
Table 8:Training hyperparameters\. Qwen3\-VL\-8B and Qwen3\-VL\-30B\-A3B share the same Swift recipe; STEP3\-VL\-10B uses the Steptron long\-context configuration with its YARN\-extended128k128\\mathrm\{k\}position range\.
## Appendix IEvaluation Setup
#### Environment\.
We follow the OSWorld\-VerifiedXieet al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib14)\)interaction protocol and use the OSWorld\-VerifiedXieet al\.\([2024](https://arxiv.org/html/2606.18890#bib.bib14)\)task suite for evaluation\. The agent acts in an Ubuntu desktop sandbox with a fixed screen size of1920×10801920\\times 1080, predicting one action per step from raw screenshots\. The action space ispyautoguifor mouse and keyboard primitives, augmented with two control functions:computer\.waitfor synchronous pauses \(e\.g\., during installation or long computation\) andcomputer\.terminatefor declaring task completion or failure\. Coordinates are predicted in normalized form \(relative to image size\) and projected back to absolute pixels before execution\.
#### Inference parameters\.
For all backbones we use the same sampling configuration:max\_tokens=4096=4096,top\_p=0\.95=0\.95,temperature=1\.0=1\.0\. Each task is run with at most100100environment steps; the agent is forced to terminate with failure once the budget is exhausted\. At the API level, we allow up to1010outer retries for parse\-level errors and up to2020HTTP\-level retries per outer call, with a1,2001\{,\}200\-second request timeout\.
#### Message construction\.
Each step the agent assembles an OpenAI\-style chat\-completion payload with the following structure:
- •Asystemmessage containing the GUI\-agent system prompt \(Fig\.[9](https://arxiv.org/html/2606.18890#A9.F9)\)\. The prompt fixes the response format \(Thought,Action,Code\), the allowed action space, and a hard\-coded sandbox password placeholder that some tasks require\.
- •Interleaveduser/assistantturns for previous steps\. To control context length, only the most recentH=3H=3historical screenshots are included asimage\_urlmessages, each paired with the correspondingassistantresponse containing the pastThoughtandAction\. Steps older than the image\-history window are kept as text\-only assistant context to preserve the planning chain without their screenshots\.
- •A finaluserturn containing the current screenshot \(base64\-encoded PNG\) and the task instruction rendered through theINSTRUCTION\_TEMPLATE\(“Please generate the next move according to the screenshot, task instruction and previous steps”\)\.
For thinking\-capable backbones, the assistant turns are rendered with explicit`\[think\] \.\.\. \[/think\]`segments so that the historical reasoning is reusable; otherwise, the standard\#\# Thought / \#\# Actionmarkdown layout is used\. After each model call, the response is parsed for theActionblock and the trailing fenced code block \(pythonorcode\); a parse failure triggers a retry with reduced temperature\.
Evaluation System PromptYou are an agent that operates a computer GUI\. For each step, you are given the task goal, the latest screenshot, and the history of earlier actions\. Choose the single best next move that advances the task based on the current screen\. Keep your action consistent with what is visible, and use the action history to avoid repeating mistakes\. The password of the computer is\{password\}\.For each step, provide your response in this format:``` ## Thought: {thought} ## Action: {action} ## Code: {code} ``` Code rules:\- Return exactly one executable code block\.\- The code block must contain either:1\) pyautogui code for the next GUI action, or2\) one built\-in control call:\- computer\.wait\(\): wait 20 seconds if the page, app, or system needs time\- computer\.terminate\(status=…, answer=…\): end the task when it is finished or cannot be completed\- If you use computer\.terminate, status must be "success" or "failure"\.\- Include answer only when there is a final result to report\.
Figure 9:System prompt used at evaluation\. Thinking\-capable backbones receive a variant in which theThoughtfield is replaced by an explicit reasoning segment wrapped in\[think\]/\[/think\]tags\.Similar Articles
MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?
MMG2Skill converts web-based procedural guides into executable skills for agents through closed-loop learning, improving performance across GUI control, gameplay, and card play tasks with macro-average gains of +12.8 to +25.3 percentage points.
Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents
Introduces GUI-RobustEval, a benchmark for error recovery in GUI agents, and Robustness-driven Trajectory Synthesis (RoTS) to generate training data, achieving state-of-the-art on OSWorld.
Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts
This paper introduces Visual-SDPO, a self-distillation policy optimization framework that uses rendered visual feedback as privileged context to train code-generating LLMs, improving visual artifact quality across chart, UI, and slide generation benchmarks.
Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval
This paper proposes SGDR (State-Grounded Dynamic Retrieval), an online skill learning method for web agents that enables stepwise, state-aware skill reuse rather than static task-level retrieval. Experiments on WebArena show SGDR achieves 37.5% success rate with GPT-4.1, a ~10.6% relative gain over strong baselines.
SkillOS: Learning Skill Curation for Self-Evolving Agents
This paper introduces SkillOS, a reinforcement learning framework that enables LLM agents to learn long-term skill curation policies for self-evolution, improving performance and generalization across tasks.