VISUALSKILL: Multimodal Skills for Computer-Use Agents
Summary
VisualSkill proposes a hierarchical multimodal skill library for computer-use agents that combines text and figures. On two CUA benchmarks it achieves a 15.3-point absolute lift over the no-skill baseline and an 8.3-point gain over a matched text-only skill, by retaining visual information for GUI interaction.
Cached at: 06/18/26, 05:45 AM
# VisualSkill: Multimodal Skills for Computer-Use Agents
Source: [https://arxiv.org/html/2606.18448](https://arxiv.org/html/2606.18448)
Ziyan Jiang¹\*, Li An¹\*, Yujian Liu¹, Jiabao Ji¹, Qiucheng Wu¹, Jacob Andreas²†, Yang Zhang³†, Shiyu Chang¹†
¹UC Santa Barbara, ²MIT CSAIL, ³MIT-IBM Watson AI Lab. \*Equal contribution. †Equal advising. Correspondence to: {ziyanjiang, li\_an}@ucsb.edu
###### Abstract
Computer-use agents (CUAs) approach human-level performance on standardised benchmarks but still struggle on long-horizon tasks and unseen software. Existing skill libraries address this with reusable skills, but represent the skill artifact as text only, despite the visual nature of GUI interaction. We propose VisualSkill: a hierarchical multimodal skill, tailored to each target application and organised as a central index over per-topic files, which the agent consumes through a `load_topic` MCP tool that fetches the relevant topic’s text and figures on demand. We construct each skill with a two-stage pipeline that combines authored documentation with live-application UI exploration. On two CUA benchmarks, CUA-World and OSExpert-Eval, a Claude Code CLI agent backed by Claude Opus 4.6 reaches an average score of $0.456$ with VisualSkill, a $\mathbf{+15.3}$ point absolute lift over the no-skill baseline ($0.303$). Against a matched text-only skill that is generated from the same source content and differs from VisualSkill only in modality, VisualSkill yields a $\mathbf{+8.3}$ point absolute gain ($0.373$ vs. $0.456$), providing direct evidence that retaining visual figures in the skill artifact, rather than verbalizing them away, helps the agent both identify UI elements and verify workflow state after each action. Our code is available at [https://github.com/XMHZZ2018/VisualSkills](https://github.com/XMHZZ2018/VisualSkills).
## 1 Introduction
Agent skills have recently introduced a modular, filesystem-based abstraction that equips agents with domain-specific expertise on demand. Skills are particularly valuable for computer-use agents (CUAs), which interact with graphical user interfaces through screenshots and keyboard/mouse actions. CUAs have advanced rapidly (Wang et al., [2026](https://arxiv.org/html/2606.18448#bib.bib8); Agashe et al., [2025](https://arxiv.org/html/2606.18448#bib.bib1)) and now approach human-level performance on standardised desktop benchmarks such as OSWorld (Xie et al., [2024](https://arxiv.org/html/2606.18448#bib.bib10); Anthropic, [2026](https://arxiv.org/html/2606.18448#bib.bib3); Simular AI, [2026](https://arxiv.org/html/2606.18448#bib.bib7)), yet they still struggle on complex, long-horizon tasks and generalise poorly to unseen UIs and software (Aggarwal et al., [2026](https://arxiv.org/html/2606.18448#bib.bib2); Liu et al., [2026](https://arxiv.org/html/2606.18448#bib.bib6)). The main reason is that they lack the persistent, application-specific procedural knowledge (which menu hides a given command, what dialog appears after a particular click, how a multi-step workflow unfolds) that pre-training does not supply. A skill can supply exactly that knowledge at decision time.
However, existing skill libraries for CUAs are predominantly text-only, representing the skill artifact as a natural-language intent paired with text-based action specifications and providing no slot for figures or screenshots (Chen et al., [2026](https://arxiv.org/html/2606.18448#bib.bib4); Liu et al., [2026](https://arxiv.org/html/2606.18448#bib.bib6)). This is a poor match for computer-use environments in two ways. First, many UI elements that an agent must act on (icons, layouts, interface states, and spatial relations among widgets) are verbose or ambiguous to describe in text, so the verbal substitutes that enter the skill lose information that the original screenshot carries directly. Second, multi-step workflows require the agent to verify, after each action, that it has reached the expected intermediate UI state. A reference screenshot serves as a direct visual grounding signal for the target UI state and can be matched against the agent’s current observation. A textual description, however, specifies the state only indirectly, making it more difficult to verify whether the agent has reached the intended state or a visually similar but incorrect one. These two limitations motivate *multimodal skills* that retain visual content as figures alongside the textual procedure. As shown in Figure [1](https://arxiv.org/html/2606.18448#S1.F1), a text-only skill has difficulty describing the precise operations and icons to focus on, creating ambiguity that a multimodal skill can avoid.
We introduce VisualSkill, which constructs multimodal skills in which visual content is retained as figures. VisualSkill produces *one skill per target application*, scoping a skill to all the UI knowledge an agent needs to operate that application end to end. Each such skill is organised *hierarchically*: a central index file lists every topic with a short “when to use” description, and points to a per-topic file that holds the text body and figures for that topic. At inference time, the agent reads only the compact index and invokes a `load_topic` MCP tool to fetch the text and figures on demand.
Figure 1: A text-only skill struggles to describe the precise operations and icons to focus on, creating ambiguities that multimodal skills can avoid.

To build such skills, VisualSkill runs a two-stage pipeline in which each stage targets one of the two challenges above. Stage 1 mines authored documentation (PDF guides, HTML manuals) into this hierarchy, with figures from the source documents kept verbatim. Stage 2 enriches the skill by exploring the live application under two regimes: *free exploration*, in which an LLM planner partitions the application’s starting page into regions to cover, and *targeted exploration*, in which training-task trajectories pinpoint UI regions where the current skill underperforms. Screenshots collected from both regimes are, after optional post-processing, slotted into the corresponding per-topic skill files.
In summary, we make three contributions. First, we design VisualSkill as hierarchical multimodal artefacts with one skill per target application and a central index over per-topic files, which the agent consumes through a `load_topic` MCP tool that fetches only the relevant topic’s text and figures on demand. Second, we construct such skills with a two-stage pipeline that combines authored documentation (Stage 1) with live-application UI exploration (Stage 2). Third, we evaluate VisualSkill against a text-only control skill that is generated jointly from the same source content and differs only in modality, so any performance gap isolates the effect of visual presentation. Across $177$ tasks from CUA-World and OSExpert-Eval, using a Claude Code CLI agent backed by Claude Opus 4.6, VisualSkill lifts the unweighted-mean score from $0.303$ (no-skill) to $0.456$ ($+15.3\%$ absolute) and outperforms the text-only control by $+8.3\%$ absolute. A qualitative analysis shows that the multimodal advantage concentrates on the two failure modes of text-only skills identified above: identifying UI elements (icons, layouts, spatial relations) and verifying intermediate workflow state after each action.
## 2 Method
This section defines what a VisualSkill is, how the agent uses it at inference, and how we construct one for a target desktop application. Section [2.1](https://arxiv.org/html/2606.18448#S2.SS1) gives the formal definition of VisualSkill. Section [2.2](https://arxiv.org/html/2606.18448#S2.SS2) describes the loading mechanism by which the agent navigates the skill at decision time. Section [2.3](https://arxiv.org/html/2606.18448#S2.SS3) describes the two-stage pipeline that produces a skill for any application from authored documentation (Stage 1) and from interaction with the live application itself (Stage 2).
### 2.1 Skill Definition
A skill in VisualSkill is a structured reference about a single target application: one skill per application, shared by all tasks against that application.
Each skill is organised as a centralised `skill.md` index plus a list of per-topic guides $\{g_t\}_{t\in\mathcal{T}}$ arranged in a two-layer hierarchy. The `skill.md` index points to each per-topic guide and tags it with a one-sentence *when to use* description that the agent matches against its current task before loading the guide. Each per-topic guide $g_t=(p_t,F_t)$ pairs a text body $p_t$ with a set of UI figures $F_t$. At inference, the agent reads only the `skill.md` index up front and loads individual $g_t$ on demand (Section [2.2](https://arxiv.org/html/2606.18448#S2.SS2)). See Appendix [A.1](https://arxiv.org/html/2606.18448#A1.SS1) for the full structure and an excerpt of `skill.md`.
For each application, the artifact we deliver is a VisualSkill $\mathcal{S}^{\mathrm{mm}}$ that retains the UI figures. To isolate the contribution of the visual modality at evaluation time, we additionally construct a text-only control $\mathcal{S}^{\mathrm{txt}}$ from the same source material. It shares the same hierarchical structure as $\mathcal{S}^{\mathrm{mm}}$ (same `skill.md` index, same topic set $\mathcal{T}$, same procedural content per topic) and differs only in the per-topic guides: $\mathcal{S}^{\mathrm{mm}}$ uses $g_t^{\mathrm{mm}}=(p_t^{\mathrm{mm}},F_t)$ with the UI figures alongside the text body, while $\mathcal{S}^{\mathrm{txt}}$ uses $g_t^{\mathrm{txt}}=(p_t^{\mathrm{txt}},\emptyset)$ with no figures and a text body that describes the same visual information directly in words. The two text bodies are not word-for-word identical, since each is written in the form most natural to its modality. To guarantee that the procedural content is shared, every per-topic generation step in the pipeline (Section [2.3](https://arxiv.org/html/2606.18448#S2.SS3)) is a single LLM call that reads the topic’s source text and figures and emits $p_t^{\mathrm{mm}}$ and $p_t^{\mathrm{txt}}$ side by side. Under this construction, any gap in agent performance between $\mathcal{S}^{\mathrm{mm}}$ and $\mathcal{S}^{\mathrm{txt}}$ is attributable to modality alone, not to differences in the underlying content. See Appendix [A.2](https://arxiv.org/html/2606.18448#A1.SS2) for an excerpt of $\mathcal{S}^{\mathrm{mm}}$ and the corresponding $\mathcal{S}^{\mathrm{txt}}$ on one topic.
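The definition above can be pictured as a small data structure. The following is a minimal sketch, not the authors’ implementation: the class names `TopicGuide` and `Skill`, the helper `is_matched_control`, and the example topic are all hypothetical, and figures are represented simply as file paths.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TopicGuide:
    """Per-topic guide g_t = (p_t, F_t): a text body plus UI figures."""
    text: str
    figures: tuple[str, ...] = ()  # assumption: figures stored as file paths


@dataclass
class Skill:
    """One skill per application: a compact index plus on-demand topic guides."""
    application: str
    when_to_use: dict[str, str] = field(default_factory=dict)  # topic id -> one-sentence hint
    guides: dict[str, TopicGuide] = field(default_factory=dict)  # topic id -> g_t

    def index(self) -> str:
        """Render the index the agent reads up front (no guide bodies)."""
        return "\n".join(f"- {t}: {hint}" for t, hint in self.when_to_use.items())


def is_matched_control(mm: Skill, txt: Skill) -> bool:
    """True if the control shares the topic set and index and carries no figures."""
    return (mm.guides.keys() == txt.guides.keys()
            and mm.when_to_use == txt.when_to_use
            and all(not g.figures for g in txt.guides.values()))


# Illustrative content only; not taken from the paper's skills.
mm = Skill("writer",
           {"margins": "Use when changing page margins."},
           {"margins": TopicGuide("Open the page style dialog (see figure).",
                                  ("page_style.png",))})
txt = Skill("writer",
            dict(mm.when_to_use),
            {"margins": TopicGuide("Open the page style dialog; its Page tab "
                                   "lists four margin fields.")})
```

Under this layout the multimodal skill and its control differ only in `TopicGuide.figures` and in the wording of `TopicGuide.text`, which mirrors the matched-control design described above.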
### 2.2 Skill Loading at Inference Time
At inference, the agent does not load the entire skill up front. Its system prompt contains only the `SKILL.md` index, which lists each topic together with a one-sentence *when to use* description. To retrieve a per-topic guide, the agent calls a single MCP tool, `load_topic(t)`, which takes a topic identifier $t\in\mathcal{T}$ and returns the guide $g_t$. At every step, the agent matches its next intended action against the *when to use* lines and calls `load_topic` before acting if any of them matches. The tool may be called multiple times across a trajectory as the task moves between UI surfaces.
For the multimodal variant, the returned content interleaves text and image blocks in the order figures are referenced by $p_t$, so each image of $F_t$ is delivered immediately after the sentence that names it. For the text-only control, the same call returns a single text block whose wording already absorbs the visual information. The full system prompt, the `load_topic` schema, and a worked tool-call transcript are provided in Appendix [B](https://arxiv.org/html/2606.18448#A2).
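The interleaving behaviour can be illustrated with a short sketch. It assumes a hypothetical inline marker syntax `[[fig:name]]` placed directly after the sentence that names a figure, and a simplified block format (`{"type": ...}` dictionaries); neither is the paper’s actual `load_topic` schema, which is given in its Appendix B.

```python
import re

# Assumption: figures are referenced in the text body by [[fig:<name>]] markers.
FIG_REF = re.compile(r"\[\[fig:([^\]]+)\]\]")


def interleave(text: str, figures: dict[str, bytes]) -> list[dict]:
    """Return text and image blocks in the order the text references the figures."""
    # With one capture group, split() alternates: prose, name, prose, name, ...
    parts = FIG_REF.split(text)
    blocks: list[dict] = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            prose = part.strip()
            if prose:
                blocks.append({"type": "text", "text": prose})
        elif part in figures:
            # The image lands immediately after the prose that precedes its marker.
            blocks.append({"type": "image", "name": part, "data": figures[part]})
    return blocks


guide_text = ("Select the brush tool in the toolbox. [[fig:brush]] "
              "Then drag on the canvas.")
multimodal_blocks = interleave(guide_text, {"brush": b"<png bytes>"})
text_only_blocks = interleave("Select the brush tool, drawn as a paintbrush.", {})
```

In this sketch the text-only control falls out of the same function: with no markers and no figures, the result is a single text block.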
We expose skill loading through an MCP tool rather than direct `Read` access for two reasons. First, the MCP tool delivers each figure inline with its surrounding text in a single tool result, whereas under direct `Read` each figure incurs a separate call and is skipped in practice. Second, the MCP interface keeps skill content accessible to the agent throughout the trajectory, whereas direct `Read` tends to be invoked once at the start and not revisited. Section [4.2](https://arxiv.org/html/2606.18448#S4.SS2) verifies these effects empirically: under direct `Read`, the agent loads $\sim 10\times$ fewer figures per task and stops consulting the skill within the first $\sim 2\%$ of the rollout, collapsing the multimodal gain.
### 2.3 Two-Stage Skill Construction
Figure 2: The two-stage VisualSkill construction pipeline. Stage 1 parses the authored documentation into a topic hierarchy, extracting per-topic text bodies and the vendor-drawn figures shipped with the manual. Stage 2 drives the live application with an LLM-controlled explorer in two sub-passes (a *free explorer* that partitions the idle window and dispatches a worker per region, and a *trajectory-targeted explorer* that re-reads failed rollouts and patches the UI regions agents misread) and merges captured screenshots and notes into the per-topic guides. The matched text-only skill is generated jointly from the same source, differing only in modality.

Our pipeline (Figure [2](https://arxiv.org/html/2606.18448#S2.F2)) constructs the VisualSkill skill $\mathcal{S}^{\mathrm{mm}}$ for an application $\mathcal{A}$ in two stages, generating the text-only control $\mathcal{S}^{\mathrm{txt}}$ alongside it from the same source. Stage 1 mines an initial version of the skill from authored documentation, exploiting the fact that mature applications usually ship a structured manual whose table of contents can be reused as the topic set $\mathcal{T}$. Stage 2 augments it by driving the live application with an LLM-controlled explorer, capturing UI knowledge that exists only in the running program. At each stage, the same single-call protocol of Section [2.1](https://arxiv.org/html/2606.18448#S2.SS1) emits both $\mathcal{S}^{\mathrm{mm}}$ and $\mathcal{S}^{\mathrm{txt}}$ from the same input, so $\mathcal{T}$ and the per-topic procedural content are shared by construction; only $F_t$ and the wording of $p_t$ around each figure slot differ.
#### Stage 1: from authored documentation\.
Stage 1 takes the application’s official user guide $D$ (a PDF or HTML manual in our experiments) and produces the Stage 1 VisualSkill skill $\mathcal{S}^{\mathrm{mm}}_{1}$ in a few LLM-driven steps; the text-only control $\mathcal{S}^{\mathrm{txt}}_{1}$ is emitted in the same calls. We first parse $D$’s table of contents and reuse it as $\mathcal{T}$: the application’s maintainers have already organised the surface for us, and inheriting their structure is cheaper and more faithful than imposing a fresh one. For each topic $t$, we then build the per-topic guide $g_t=(p_t,F_t)$ by locating the pages of $D$ that cover $t$, pulling out every figure on those pages as $F_t$, and invoking the joint-generation call to write $p_t^{\mathrm{mm}}$ and $p_t^{\mathrm{txt}}$ side by side. Because the figures come straight from the official manual, $F_t$ holds vendor-drawn screenshots of the application’s UI rather than agent-captured ones. The resulting skill is hierarchically organised, and its coverage is whatever the documentation covers. Appendix [C.1](https://arxiv.org/html/2606.18448#A3.SS1) walks through one Stage 1 topic end-to-end.
#### Stage 2: from UI exploration\.
The Stage 1 skill is unavoidably incomplete: documentation is often outdated relative to the shipped UI, omits low\-frequency dialogs, and rarely captures the visual affordances that an agent must distinguish at click time\. Stage 2 closes this gap by driving the live application with a planner and a pool of worker agents\. We run two complementary sub\-passes whose outputs are merged before reassembly\.
(a) Free explorer. An Opus-class planner agent $P$ inspects a screenshot of the idle application and nominates a partition of its UI into a set of regions $R=\{r_1,\ldots,r_K\}$ (with $K=8$ in our experiments). For each region $r_i$, a Sonnet-class worker agent $W_i$ is spawned in an isolated Docker container running $\mathcal{A}$, instructed to interact with $r_i$, capture cropped screenshots of every interactive element, and emit a structured note $n_i$ summarising the affordances of $r_i$. Workers run in parallel. Appendix [C.2](https://arxiv.org/html/2606.18448#A3.SS2) walks through one free-explorer region end-to-end.
(b) Trajectory-targeted explorer. In parallel, a reviewer agent $V$ is given a held-out set of training-task trajectories $\{\tau_1,\ldots,\tau_N\}$ on which the Stage 1 skill was executed and inspected for failures, where each $\tau_j$ contains the task instruction, the agent’s action sequence, the final verifier feedback, and any consultations the agent issued. $V$ produces an additional set of UI regions $R'=V(\{\tau_j\})$ that it judges the Stage 1 skill to have measurably underexplained, and a second pool of workers $W'_i$ is dispatched against $R'$ under the same protocol as (a). The targets are scoped to UI *regions* rather than to specific tasks, so the patch transfers to any test task that touches the same UI surface. Appendix [C.3](https://arxiv.org/html/2606.18448#A3.SS3) walks through one targeted example.
The combined note set $N=\{n_i\}_{i=1}^{K}\cup\{n'_i\}_{i\,:\,r'_i\in R'}$ is consumed by three components: an assembler agent that consolidates each region’s notes into a per-region reference section, an LLM mapper that decides which existing topic $t\in\mathcal{T}$ each region most directly enriches, and an inline step that appends the consolidated reference into the corresponding per-topic guides of both skills via the same joint-generation call. The Stage 2 pair is then $\mathcal{S}^{\mathrm{mm}}_{2}=\mathrm{Assemble}(\mathcal{S}^{\mathrm{mm}}_{1},N)$ and $\mathcal{S}^{\mathrm{txt}}_{2}=\mathrm{Assemble}(\mathcal{S}^{\mathrm{txt}}_{1},N)$. The Stage 2 skill shares the topic set $\mathcal{T}$ of its Stage 1 version by construction; only the per-topic guides are augmented. $\mathcal{S}^{\mathrm{mm}}_{2}$ gains agent-captured screenshots in its $F_t$ and an extended $p_t^{\mathrm{mm}}$ written alongside them. The control $\mathcal{S}^{\mathrm{txt}}_{2}$ likewise gains an extended $p_t^{\mathrm{txt}}$ that absorbs the same new UI information into words.
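A rough sketch of the assembly step follows. It is a simplification under stated assumptions: the `Note` class, the dictionary layout of a guide, and the `assemble` signature are hypothetical; the LLM mapper is replaced by a plain callable; and both variants receive the same note text, whereas the paper rewrites the wording per modality in a joint LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Note:
    """Structured note n_i emitted by one explorer worker (hypothetical fields)."""
    region: str
    summary: str
    screenshots: list[str]


def assemble(guides: dict[str, dict],
             notes: list[Note],
             map_region_to_topic: Callable[[str], Optional[str]],
             multimodal: bool) -> dict[str, dict]:
    """Append each region's note to the existing topic it enriches.

    The topic set is never extended: notes mapped to unknown topics are dropped.
    """
    # Copy so the Stage 1 skill is left untouched.
    out = {t: {"text": g["text"], "figures": list(g["figures"])}
           for t, g in guides.items()}
    for note in notes:
        topic = map_region_to_topic(note.region)  # stand-in for the LLM mapper
        if topic not in out:
            continue
        out[topic]["text"] += "\n\n" + note.summary
        if multimodal:
            out[topic]["figures"].extend(note.screenshots)
    return out


stage1_mm = {"toolbars": {"text": "Base guide.", "figures": ["manual_fig.png"]}}
stage1_txt = {"toolbars": {"text": "Base guide.", "figures": []}}
notes = [Note("standard toolbar", "Icons run left to right.", ["toolbar_crop.png"]),
         Note("unmapped region", "Ignored.", ["other.png"])]
mapper = {"standard toolbar": "toolbars"}.get

stage2_mm = assemble(stage1_mm, notes, mapper, multimodal=True)
stage2_txt = assemble(stage1_txt, notes, mapper, multimodal=False)
```

The point of the sketch is the invariant stated above: Stage 2 leaves the topic set unchanged and only augments the per-topic guides, with figures added to the multimodal variant alone.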
## 3 Experiments
### 3.1 Experimental Setup
#### Agent\.
We evaluate a computer-use agent built on the Claude Code CLI and backed by Claude Opus 4.6. It observes the desktop through screenshots, acts via a fixed GUI tool set, and consults skills through the `load_topic` MCP tool from Section [2.2](https://arxiv.org/html/2606.18448#S2.SS2). The CLI, the tool surface, and the per-step prompt are held fixed across all conditions, so any difference in task success is attributable to the skill artefact. We use each benchmark’s own per-task budget: CUA-World ships a per-task step cap in `init.max_steps` that ranges from $40$ to $200$ actions depending on task difficulty, which we use unchanged; OSExpert-Eval is wall-clock-budgeted, and we run it with the benchmark’s default $15$-minute per-task cap. The full agent setup (the GUI tool signatures and the sandboxing setup) is deferred to Appendix [D](https://arxiv.org/html/2606.18448#A4).
#### Benchmarks\.
We evaluate on two computer-use benchmarks: CUA-World (Aggarwal et al., [2026](https://arxiv.org/html/2606.18448#bib.bib2)) (five domains in our experiments: LibreOffice Writer / Calc / Impress, QGIS, and OpenToonz, with per-application train/test splits used for both Stage 2 patching and held-out evaluation) and OSExpert-Eval (Liu et al., [2026](https://arxiv.org/html/2606.18448#bib.bib6)) (three additional domains: LibreOffice, GIMP, and Tableau, covering long-horizon compositional workflows on unseen UIs). We use the verifier shipped with each benchmark unchanged. Per-application task counts and train/test splits are listed in Appendix [E](https://arxiv.org/html/2606.18448#A5).
#### Metric\.
We report the per-task score averaged within each domain, with scores normalised to $[0,1]$ so the two benchmarks are directly comparable. CUA-World scores each task with a checklist-based VLM verifier that decomposes the task into weighted subtasks and assigns partial credit (Aggarwal et al., [2026](https://arxiv.org/html/2606.18448#bib.bib2)): we take the average checklist score per task, rescaled from $0$–$100$ to $[0,1]$. OSExpert-Eval scores each task with the deterministic state-based verifier shipped with the benchmark (Liu et al., [2026](https://arxiv.org/html/2606.18448#bib.bib6)), returning $1$ on success and $0$ otherwise; the per-domain mean therefore reduces to the domain success rate.
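The aggregation described above can be sketched in a few lines. The function names are hypothetical and the example scores are made up for illustration; they are not results from the paper.

```python
from statistics import mean


def cua_world_task_score(checklist_score: float) -> float:
    """Rescale a CUA-World checklist score from 0-100 to [0, 1]."""
    return checklist_score / 100.0


def domain_mean(task_scores: list[float]) -> float:
    """Per-domain score: the mean of per-task scores in [0, 1]."""
    return mean(task_scores)


def unweighted_average(domain_means: dict[str, float]) -> float:
    """Each domain counts once, regardless of how many tasks it contains."""
    return mean(domain_means.values())


# Illustrative numbers only.
writer = domain_mean([cua_world_task_score(s) for s in (50.0, 100.0, 0.0, 50.0)])
gimp = domain_mean([1, 0, 0, 1])  # binary verifier: the mean is the success rate
overall = unweighted_average({"writer": writer, "gimp": gimp})
```

Because the average is unweighted over domains, a small domain moves the overall score as much as a large one, which is worth keeping in mind when reading the aggregate columns.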
#### Skill conditions\.
We compare five conditions:
- No-skill: the agent runs with only the task instruction, with no skill provided.
- Stage 1 VisualSkill, with the Stage 1 text-only control for comparison: both mined from authored documentation.
- Stage 2 VisualSkill, with the Stage 2 text-only control for comparison: both after Stage 2 enrichment via UI exploration.

Within each stage, the multimodal skill and its text-only control are generated jointly from the same source content (Section [2.3](https://arxiv.org/html/2606.18448#S2.SS3)), so any performance gap between them isolates the modality’s effect.
### 3.2 Main Results
Table 1: Main results across the two computer-use benchmarks our method targets. Each cell is the per-domain mean score in $[0,1]$; the integer beneath each domain header is the task count. Stage 1 skills are mined from documentation; Stage 2 skills additionally incorporate the UI explorer output. For each stage we report two matched variants: multimodal (figures retained) and a text-only control. Per-benchmark Avg columns are unweighted means over the domains within each benchmark; the All Avg column is the unweighted mean over all domains.

#### Skill helps; UI exploration helps more.
Adding any skill over the no-skill baseline improves performance on every domain we measure. Averaged unweighted across the eight domains in Table [1](https://arxiv.org/html/2606.18448#S3.T1), the no-skill agent scores $0.303$. Stage 1 VisualSkill lifts it to $0.363$ ($+6.0\%$ absolute), and Stage 2 VisualSkill, which adds UI exploration on top of Stage 1, lifts it further to $0.456$ ($+9.3\%$ over Stage 1, $+15.3\%$ over no-skill). The Stage 2 gain concentrates on domains whose UI is most under-documented by the manual: GIMP ($+16.6\%$ absolute over Stage 1 VisualSkill) and OpenToonz ($+12.9\%$). On domains whose documentation already covers most controls (e.g. QGIS, Tableau), Stage 2 closes a smaller gap, consistent with the design intent of UI exploration as a complement to authored knowledge rather than a replacement.
#### VisualSkill outperforms its matched text-only control, especially after UI exploration.
Within each stage, the multimodal variant beats its matched text-only control on $5/8$ domains at Stage 1 (and ties on GIMP) and on $8/8$ at Stage 2. Averaged across the eight domains, the multimodal gain is modest at Stage 1 ($+1.9\%$ absolute, $0.344\to 0.363$) but substantially larger at Stage 2 ($+8.3\%$ absolute, $0.373\to 0.456$). The largest Stage 2 multimodal lifts appear on the visually intensive creative tools (GIMP $+16.6\%$, OpenToonz $+8.9\%$, OSExpert Tableau $+15.0\%$) and the OSExpert LibreOffice subset ($+8.3\%$), where success depends heavily on recognising dialogs, icons, and UI layouts. The gain shrinks but does not disappear on office-productivity workflows: CUA-World Writer ($+5.1\%$ at Stage 2), Calc ($+3.6\%$), and Impress ($+7.1\%$). Retaining figures helps most where visual grounding is the bottleneck, and gives smaller but consistent gains elsewhere.
## 4 Analysis
### 4.1 When Do Figures Help?
Across the cases where VisualSkill outperforms its text-only control, the multimodal advantage concentrates on the two failure modes of text-only skills we anticipated in the introduction.
(i) Identifying the right UI element to act on. We see three recurring situations in which a reference screenshot is hard to substitute with text. The first is graphical controls whose meaning lives in their appearance: toolbar icons, palette swatches, drawing-tool glyphs. A text description such as “the brush tool” or “the format-paint icon” is easy to confuse with a neighbouring button, since icon-based controls do not carry text labels. The second is targets that only appear after an earlier interaction: modal sub-dialogs two menus deep, expanded dropdowns, right-click context menus, and pop-up panels. These surfaces are not visible in the idle screenshot, and a text guide can only describe them in the abstract. The third is a sub-region of a larger visual element (a text field next to its drop-arrow, a button in the footer of a tall dialog), where a phrase like “click the dropdown” or “Save at the bottom of the dialog” names the surrounding element rather than the actual click target, so the agent often clicks the wrong region. A reference screenshot addresses all three by showing both the target’s appearance and its exact location. Appendix [H](https://arxiv.org/html/2606.18448#A8) walks through one example in which the text-only agent reads “Save at the bottom of the dialog” and clicks the title bar instead, closing the application.
(ii) Verifying the UI state between steps. Multi-step workflows require the agent to verify, after each action, that the UI has actually reached the expected state before continuing. With a text description, this verification is indirect: the agent has to imagine what the description should look like in pixels, then check whether the current view matches. With a reference screenshot, the comparison is direct. The gap is largest when the failure is silent (form fields that look like they accepted a typed value but rejected it on commit, sequenced dialogs whose progress is only visible in small layout changes), because in these cases the agent has no other cue except the screen itself.
When figures don’t help. The gap between VisualSkill and its text-only control collapses on tasks consisting of short, explicitly specified action sequences, where the required operations can already be communicated clearly through text alone, leaving little additional value for figures to provide.
### 4.2 Ablation: MCP Tool vs. Plain `Read`
Section [2.2](https://arxiv.org/html/2606.18448#S2.SS2) exposes skill loading to the agent as the MCP tool `load_topic` rather than letting the agent issue plain `Read` calls against the skill folder. To test that choice empirically, we ablate the MCP tool against a single Direct `Read` baseline: the system prompt still asks the agent to consult the skill at every step, but instead of calling `load_topic` the agent uses the plain `Read` tool, reading `guide.md` for the relevant topic and then issuing one further `Read` for each figure it judges worth fetching. Both methods use the same agent setup and the same skill artefact; only how the agent retrieves content from the skill folder differs.
| Method | Load rate | Figures / task | Last @ step | Writer (24) | OpenToonz (21) | QGIS (16) |
|---|---|---|---|---|---|---|
| Direct `Read` | $92.6\%$ | $0.8$ | $1.5$ | $0.236$ | $0.246$ | $0.695$ |
| MCP tool (ours) | $\mathbf{100\%}$ | $\mathbf{7.9}$ | $\mathbf{10.4}$ | $\mathbf{0.276}$ | $\mathbf{0.274}$ | $0.726$ |

Table 2: Skill-loading ablation. The Writer, OpenToonz, and QGIS columns are the *CUA-World VisualSkill test score*. *Load rate*, *Figures / task*, and *Last @ step* are measured on the Writer training split with the Stage 2 multimodal skill: *Load rate* is the fraction of trajectories that consult the skill at all; *Figures / task* is the average number of figures delivered per task; *Last @ step* is the median trajectory step of the final skill consultation. *CUA-World VisualSkill test score* is the Stage 2 multimodal mean on each domain; the integer in each header is the task count. Both methods use the same agent setup and the same skill artefact and differ only in how the agent retrieves content from the skill folder.

#### What the agent does differently.
On the Writer training split, Direct `Read` reaches the skill on $92.6\%$ of tasks but reads only $0.8$ figures per task on average, and the median last consult lands at step $1.5$: the agent reads the skill once near the start of the rollout and never returns. The MCP tool runs at $100\%$ load rate with $7.9$ figures per task ($\sim\mathbf{10\times}$ more) and a median last consult step of $10.4$, so the agent keeps consulting the skill as the task moves between UI surfaces. The gap is structural: each figure under Direct `Read` costs the agent an extra `Read` call and is usually skipped, while the MCP tool delivers every referenced figure with the prose in one tool result (Table [2](https://arxiv.org/html/2606.18448#S4.T2), left).
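The three loading statistics can be computed from consultation logs roughly as follows. This is a sketch under assumptions: the log format (a list of `(step, n_figures)` pairs per trajectory) is hypothetical, *Figures / task* is averaged over all tasks, and *Last @ step* is taken over trajectories that consulted the skill at least once. The example log is invented.

```python
from statistics import mean, median

# One list per task; each entry is (trajectory step, figures delivered).
Trajectory = list[tuple[int, int]]


def loading_stats(trajectories: list[Trajectory]) -> dict[str, float]:
    """Compute load rate, figures per task, and median step of the last consult."""
    consulted = [t for t in trajectories if t]
    return {
        # Fraction of trajectories that consult the skill at all.
        "load_rate": len(consulted) / len(trajectories),
        # Average figures delivered per task (tasks with no consult count as 0).
        "figures_per_task": mean(sum(n for _, n in t) for t in trajectories),
        # Median step of the final consultation, over consulting trajectories.
        "last_at_step": median(max(step for step, _ in t) for t in consulted),
    }


# Illustrative log: four tasks, one of which never consults the skill.
log = [[(1, 1)], [(1, 0), (2, 1)], [], [(3, 2), (12, 4)]]
stats = loading_stats(log)
```

A low `last_at_step` together with a high `load_rate` is the signature described above: the skill is opened once near the start and then abandoned.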
#### Effect on accuracy\.
The MCP tool gives a consistent $+\mathbf{3{-}4\%}$ absolute lift over Direct `Read` across all three domains: Writer ($0.236\to 0.276$, $+4.0\%$), OpenToonz ($0.246\to 0.274$, $+2.8\%$), and QGIS ($0.695\to 0.726$, $+3.1\%$). The multimodal skill carries its UI knowledge in the figures, and without the MCP tool most of that knowledge is left on the table.
### 4.3 Contribution of Each Phase of VisualSkill
VisualSkill is built in two stages (Section [2.3](https://arxiv.org/html/2606.18448#S2.SS3)): Stage 1 mines the application’s authored documentation, and Stage 2 enriches the resulting skill with screenshots and notes captured by driving the live application, in two sub-passes: S2(a), a free UI explorer that visits every region of the idle window, and S2(b), a trajectory-targeted explorer that revisits UI regions where training-task rollouts failed. Table [3](https://arxiv.org/html/2606.18448#S4.T3) reports each phase’s contribution.
Table 3: Phase ablation on CUA-World LibreOffice. *Stage 1* variants are mined from authored documentation. *Stage 2* variants additionally incorporate the UI explorer output: S2(a) = free explorer only; S2(b) = trajectory-targeted explorer only; *full* = both sub-passes. Integers below domain names give task counts.

#### Stage 1 is the backbone.
With Stage 1 alone, the LibreOffice average jumps from $0.331$ (no-skill) to $0.436$ (VisualSkill), a $\mathbf{+10.5\%}$ absolute gain, confirming that mature applications’ authored documentation already covers a large fraction of the procedural knowledge an agent needs. Within Stage 1, however, the multimodal-over-text gap is small ($+1.6\%$ absolute, $0.420\to 0.436$): the manual’s content is mostly procedural (menu paths, dialog field names, keyboard shortcuts), which text describes clearly on its own; the figures it ships are vendor renders that add little to the text.
#### Stage 2\(a\) and 2\(b\) patch complementary surfaces\.
Each Stage 2 sub\-pass adds ∼**2–3%** absolute on top of Stage 1 \(0\.436 → 0\.462 for 2\(a\) alone, 0\.436 → 0\.453 for 2\(b\) alone\), but they patch *different* parts of the UI\. Stage 2\(a\) \(free explorer\) reaches the *static* surfaces visible in the idle window: toolbars, sidebar decks, menus, palette decks; the planner partitions what is on\-screen at rest and dispatches a worker per region\. Stage 2\(b\) \(trajectory\-targeted\) instead targets *dynamic* surfaces that only appear during real task interactions: modal dialogs reached two menus deep, pop\-up state\-transition behaviours \(e\.g\. the margin spinbox rejecting unit suffixes, Section [4\.1](https://arxiv.org/html/2606.18448#S4.SS1)\), and post\-action confirmation frames\. Combining both is super\-additive: the full Stage 2 VisualSkill reaches 0\.504, beating either sub\-pass by **\+4–5%** absolute and the matched text\-only control by \+5\.3% \(0\.451 → 0\.504\)\. We adopt the full Stage 2 multimodal configuration as the default reported in Table [1](https://arxiv.org/html/2606.18448#S3.T1)\.
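The super\-additivity claim can be checked directly from the averages quoted in this section\. The following is a small arithmetic sketch in Python; the dictionary keys are labels chosen here for illustration, and the scores are the CUA\-World LibreOffice averages reported above\.

```python
# CUA-World LibreOffice averages quoted in Section 4.3.
# The key names are illustrative labels, not identifiers from the paper.
scores = {
    "no_skill": 0.331,
    "stage1_text": 0.420,
    "stage1_mm": 0.436,
    "stage2a_mm": 0.462,        # Stage 1 + free explorer only
    "stage2b_mm": 0.453,        # Stage 1 + trajectory-targeted explorer only
    "stage2_full_text": 0.451,  # matched text-only control, full Stage 2
    "stage2_full_mm": 0.504,    # full Stage 2 VisualSkill
}


def gain(after: str, before: str) -> float:
    """Absolute gain in points (score difference x 100), rounded to one decimal."""
    return round((scores[after] - scores[before]) * 100, 1)


gain_2a = gain("stage2a_mm", "stage1_mm")
gain_2b = gain("stage2b_mm", "stage1_mm")
gain_full = gain("stage2_full_mm", "stage1_mm")

# Super-additive: the combined gain exceeds the sum of the individual gains.
is_super_additive = gain_full > gain_2a + gain_2b

print(f"2(a) alone: +{gain_2a}, 2(b) alone: +{gain_2b}, both: +{gain_full}")
print(f"super-additive: {is_super_additive}")
```

Under these numbers the two sub\-passes contribute 2\.6 and 1\.7 points on their own, while the combination contributes 6\.8 points over Stage 1, which is what the text means by super\-additive\.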
## 5 Related Work
#### Computer\-Use Agents\.
Computer\-use agents \(CUAs\) operate directly on graphical user interfaces to complete user\-specified tasks \(Agashe et al\., [2025](https://arxiv.org/html/2606.18448#bib.bib1); Xie et al\., [2024](https://arxiv.org/html/2606.18448#bib.bib10); Wang et al\., [2026](https://arxiv.org/html/2606.18448#bib.bib8)\)\. Unlike API\-based agents that interact with structured tool surfaces, CUAs must perceive the screen through screenshots and act through low\-level mouse and keyboard primitives, which gives them broad applicability across software but makes every step dependent on visual grounding\. Recent systems have approached human\-level performance on standardised benchmarks \(Anthropic, [2026](https://arxiv.org/html/2606.18448#bib.bib3); Simular AI, [2026](https://arxiv.org/html/2606.18448#bib.bib7)\), yet they generalise poorly to unseen UIs and long\-horizon workflows \(Aggarwal et al\., [2026](https://arxiv.org/html/2606.18448#bib.bib2); Liu et al\., [2026](https://arxiv.org/html/2606.18448#bib.bib6)\), because the application\-specific procedural knowledge required for such tasks \(which menu hides a given command, what dialog appears after a click, how a multi\-step widget transitions between states\) is not directly supplied by pre\-training\. This leaves room for external, application\-specific knowledge sources to be consulted at decision time\.
#### Skill Libraries for CUA\.
Recent work equips agents with reusable skill libraries to supply application\-specific procedural knowledge at decision time \(Chen et al\., [2026](https://arxiv.org/html/2606.18448#bib.bib4); Liu et al\., [2026](https://arxiv.org/html/2606.18448#bib.bib6); Jiang et al\., [2026](https://arxiv.org/html/2606.18448#bib.bib5); Wang et al\., [2025](https://arxiv.org/html/2606.18448#bib.bib9)\)\. CUA\-Skill \(Chen et al\., [2026](https://arxiv.org/html/2606.18448#bib.bib4)\) encodes computer\-use knowledge as parameterised execution and composition graphs spanning common Windows applications\. OSExpert \(Liu et al\., [2026](https://arxiv.org/html/2606.18448#bib.bib6)\) learns unit skills through GUI\-based depth\-first search exploration of the live environment and composes them into curricula for complex tasks\. XSkill \(Jiang et al\., [2026](https://arxiv.org/html/2606.18448#bib.bib5)\) treats skills as cross\-task transferable behaviours learned through imitation, while Wang et al\. \([2025](https://arxiv.org/html/2606.18448#bib.bib9)\) induce structured skills directly from agent rollouts on solved tasks\. Across all of these, however, the skill artifact itself is consistently text: natural\-language intents, action specifications, or symbolic procedures, with no first\-class slot for figures or screenshots\. This is a poor fit for GUI environments: many UI elements are ambiguous to describe in text, and agents cannot directly match a textual state description against their current screenshot observation\. Verifying intermediate state in multi\-step workflows is similarly hampered when the expected post\-action UI can only be described in words rather than shown as a reference frame\.
#### Concurrent Work\.
A recently published work, MMSkills \(Zhang et al\., [2026](https://arxiv.org/html/2606.18448#bib.bib11)\), represents each skill as a per\-subtask package coupling a textual procedure with runtime state cards and multi\-view screenshots extracted from public trajectories, and introduces branch loading to align skill evidence with the live environment in a temporary inference branch\. VisualSkill differs from MMSkills in skill granularity, construction source, and loading mechanism: it scopes one skill per application organised as a hierarchical topic tree, builds skills from authored vendor documentation and live UI exploration rather than trajectory mining, and exposes an MCP tool that the agent can call to fetch relevant topics on demand rather than performing branch\-based environment alignment\. These design choices are deliberate: per\-application hierarchical organisation lets a single skill scale to the full surface of a complex application, authored documentation provides a coverage backbone that trajectory mining alone cannot match, and an MCP tool gives the agent uniform access to text and figures within its existing tool\-calling interface, without spawning a separate inference branch\.
## 6 Conclusion
We introduced VisualSkill, a hierarchical multimodal skill artifact for computer\-use agents that retains UI figures alongside text and exposes them through an on\-demand load\_topic MCP tool\. VisualSkill skills are built by a two\-stage pipeline: documentation mining followed by complementary free and trajectory\-targeted UI exploration\. To measure the contribution of the visual modality, each VisualSkill skill is paired with a text\-only control generated jointly from the same source content\. Across 177 tasks from CUA\-World and OSExpert\-Eval, VisualSkill lifts a Claude Opus 4\.6 agent from a no\-skill mean of 0\.303 to 0\.456 \(\+15\.3% absolute\), outperforming its text\-only control by \+8\.3% absolute at Stage 2\. A qualitative audit pinpoints three categories of UI knowledge for which figures are particularly load\-bearing \(where to click, whether a typed value committed, and how to operate a multi\-step widget\), and ablations show that both the MCP\-based delivery mechanism and the trajectory\-targeted UI exploration are individually necessary to the headline gain\. We see VisualSkill as evidence that the skill artifact for computer\-use agents should be a multimodal, on\-demand reference rather than a text\-only document, and we hope the open\-source pipeline and controlled evaluation protocol make further work on figure\-grounded skills straightforward\.
## Limitations
We observe that current models do not reliably decide which figures within a loaded topic are relevant to their immediate decision, and underutilize the figures delivered by multimodal skills\. Given this, our load\_topic mechanism returns the full set of figures associated with a topic in a single tool result\. However, this coarse granularity inflates the per\-call context cost and, on weaker base models, can outweigh the multimodal benefit: as reported in Appendix [F](https://arxiv.org/html/2606.18448#A6), Qwen3\.5\-397B\-A17B\-FP8 on OSWorld LibreOffice degrades to no\-skill levels under the multimodal condition, with trajectories exhibiting malformed tool calls consistent with context\-length pressure\. How to deliver multimodal skill content within a more effective context budget remains an open question\.
## References
- Agashe et al\. \(2025\) Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang\. 2025\. Agent S2: A compositional generalist\-specialist framework for computer use agents\. *arXiv preprint arXiv:2504\.00906*\.
- Aggarwal et al\. \(2026\) Pranjal Aggarwal, Graham Neubig, and Sean Welleck\. 2026\. Gym\-anything: Turn any software into an agent environment\. *arXiv preprint arXiv:2604\.06126*\.
- Anthropic \(2026\) Anthropic\. 2026\. Introducing Claude Sonnet 4\.6\. [https://www\.anthropic\.com/news/claude\-sonnet\-4\-6](https://www.anthropic.com/news/claude-sonnet-4-6)\. Accessed: 2026\-05\-20\.
- Chen et al\. \(2026\) Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Leon Xu, and 1 others\. 2026\. CUA\-Skill: Develop skills for computer using agent\. *arXiv preprint arXiv:2601\.21123*\.
- Jiang et al\. \(2026\) Guanyu Jiang, Zhaochen Su, Xiaoye Qu, and Yi R Fung\. 2026\. XSkill: Continual learning from experience and skills in multimodal agents\. *arXiv preprint arXiv:2603\.12056*\.
- Liu et al\. \(2026\) Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, and Heng Ji\. 2026\. OSExpert: Computer\-use agents learning professional skills via exploration\. *arXiv preprint arXiv:2603\.07978*\.
- Simular AI \(2026\) Simular AI\. 2026\. Agent S3\. [https://www\.simular\.ai/articles/agent\-s3](https://www.simular.ai/articles/agent-s3)\. Accessed: 2026\-05\-20\.
- Wang et al\. \(2026\) Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Wu, and 1 others\. 2026\. OpenCUA: Open foundations for computer\-use agents\. *Advances in Neural Information Processing Systems*, 38:139756–139806\.
- Wang et al\. \(2025\) Zora Zhiruo Wang, Apurva Gandhi, Graham Neubig, and Daniel Fried\. 2025\. Inducing programmatic skills for agentic tasks\. *arXiv preprint arXiv:2504\.06821*\.
- Xie et al\. \(2024\) Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, and 1 others\. 2024\. OSWorld: Benchmarking multimodal agents for open\-ended tasks in real computer environments\. *Advances in Neural Information Processing Systems*, 37:52040–52094\.
- Zhang et al\. \(2026\) Kangning Zhang, Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, and 1 others\. 2026\. MMSkills: Towards multimodal skills for general visual agents\. *arXiv preprint arXiv:2605\.13527*\.
## Appendix Contents
- [A](https://arxiv.org/html/2606.18448#A1) Skill Overview
  - [A\.1](https://arxiv.org/html/2606.18448#A1.SS1) Structure
  - [A\.2](https://arxiv.org/html/2606.18448#A1.SS2) VisualSkill and Text\-Only Control: Side\-by\-Side
- [B](https://arxiv.org/html/2606.18448#A2) Inference\-Time Skill Loading
- [C](https://arxiv.org/html/2606.18448#A3) Three Construction\-Stage Examples
  - [C\.1](https://arxiv.org/html/2606.18448#A3.SS1) Stage 1: From Authored Documentation
  - [C\.2](https://arxiv.org/html/2606.18448#A3.SS2) Stage 2\(a\): Free UI Explorer
  - [C\.3](https://arxiv.org/html/2606.18448#A3.SS3) Stage 2\(b\): Trajectory\-Targeted Explorer
- [D](https://arxiv.org/html/2606.18448#A4) Agent Setup
- [E](https://arxiv.org/html/2606.18448#A5) Datasets
## Appendix A Skill Overview
This appendix grounds the formal notation of Section [2\.1](https://arxiv.org/html/2606.18448#S2.SS1) \(the topic set $\mathcal{T}$, the per\-topic guide $g_t=(p_t,F_t)$, the central SKILL\.md index, the VisualSkill skill $\mathcal{S}^{\mathrm{mm}}$, and its text\-only control $\mathcal{S}^{\mathrm{txt}}$\) in a real skill \(the Stage 2 Writer skill we ship with the paper\)\. Section [A\.1](https://arxiv.org/html/2606.18448#A1.SS1) shows the file tree and explains how each notation element maps to a file or folder\. Section [A\.2](https://arxiv.org/html/2606.18448#A1.SS2) shows the VisualSkill skill and its text\-only control on one concrete topic side by side\.
### A\.1 Structure
A skill is laid out as a centralised SKILL\.md index plus a folder per topic, arranged in the two\-layer hierarchy of Section [2\.1](https://arxiv.org/html/2606.18448#S2.SS1): top\-level category folders group related topics, and each leaf topic folder holds the guide\. Figure [3](https://arxiv.org/html/2606.18448#A1.F3) shows the layout for our Writer skill \(two categories are fully expanded for illustration; there are 19 categories in total\)\.

    .
    |-- SKILL.md
    |-- introducing-writer/
    |   |-- writer-interface/
    |   |   |-- guide.md
    |   |   `-- fig01.png, fig02.png
    |   |-- creating-and-opening-documents/
    |   |   `-- guide.md
    |   `-- …
    |-- formatting-text/
    |   |-- character-formatting/
    |   |   |-- guide.md
    |   |   `-- fig01.png, fig02.png, ui-character-dialog.png, …
    |   |-- paragraph-formatting/
    |   `-- …
    |-- page-layout-basics/
    |-- introduction-to-styles/
    |-- working-with-styles/
    `-- …

Figure 3: File layout of the Writer skill \(root folder omitted; two categories fully expanded for illustration, 19 categories total\)\. SKILL\.md is the central index\. Each leaf folder is a topic $t\in\mathcal{T}$ and carries its $g_t=(p_t,F_t)$ as a guide\.md \(the text body $p_t$\) plus a set of PNG files \(the figures $F_t$\)\. For instance, the topic formatting\-text/character\-formatting has $p_t=$ its guide\.md and $F_t=\{$fig01\.png, fig02\.png, ui\-character\-dialog\.png, …$\}$\.

#### Mapping the notation\.
Each element of Section[2\.1](https://arxiv.org/html/2606.18448#S2.SS1)’s definition has a direct realisation in the file layout:
- $\mathcal{T}$ is the set of leaf topic folders\. A topic identifier is its path slug, e\.g\. formatting\-text/character\-formatting\.
- $g_t=(p_t,F_t)$ is the contents of a leaf folder: guide\.md is the text body $p_t$, and every PNG file in the same folder is an element of $F_t$\.
- The central SKILL\.md is the index that the agent reads first\. Each entry points to one per\-topic guide and tags it with a short *when to use* description; the agent matches its current task against these descriptions to decide which $g_t$ to load via load\_topic \(Section [2\.2](https://arxiv.org/html/2606.18448#S2.SS2)\)\. Figure [4](https://arxiv.org/html/2606.18448#A1.F4) reproduces a representative excerpt\.
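The mapping in the list above can be made concrete with a short Python sketch that walks a skill folder and recovers $\mathcal{T}$ together with each $g_t=(p_t,F_t)$\. It is an illustration under the layout of Figure 3, not the authors' loader: the helper name `read_skill`, the returned dictionary shape, and the empty placeholder files in the demo are all assumptions introduced here\.

```python
import tempfile
from pathlib import Path


def read_skill(root: Path) -> dict[str, tuple[str, list[str]]]:
    """Map each topic slug t to g_t = (p_t, F_t).

    Hypothetical helper: a topic is any folder that directly contains a
    guide.md, and its slug is the folder path relative to the skill root.
    """
    skill = {}
    for guide in sorted(root.rglob("guide.md")):
        folder = guide.parent
        slug = folder.relative_to(root).as_posix()               # element of T
        p_t = guide.read_text(encoding="utf-8")                  # text body
        f_t = sorted(png.name for png in folder.glob("*.png"))   # figure set
        skill[slug] = (p_t, f_t)
    return skill


# Demo on a throwaway directory that mimics one leaf topic from Figure 3.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    topic_dir = root / "formatting-text" / "character-formatting"
    topic_dir.mkdir(parents=True)
    (root / "SKILL.md").write_text("# index\n", encoding="utf-8")
    (topic_dir / "guide.md").write_text("# Formatting Characters\n", encoding="utf-8")
    for name in ("fig01.png", "fig02.png", "ui-character-dialog.png"):
        (topic_dir / name).write_bytes(b"")  # empty placeholder files
    skill = read_skill(root)

print(sorted(skill))
```

SKILL\.md itself is not a topic in this sketch: it sits at the root and has no guide\.md beside it, which mirrors its role as the index rather than a member of $\mathcal{T}$\.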
\#\# Guides

\#\#\# Introducing Writer

- Writer Interface Overview: Parts of the Writer window, title bar, menus, toolbars, sidebar, status bar\. Use when: navigating menu bar and submenus, showing/hiding toolbars, customising Sidebar decks, …
- Creating and Opening Documents: Creating new documents from scratch or templates, opening existing files\. Use when: creating a blank document, opening an existing document, using Start Center, browsing templates, …

\#\#\# Formatting Text

- Formatting Characters: Character styles and direct formatting \(fonts, size, colour, effects\)\. Use when: changing font name/size/style, applying character styles, adjusting spacing and kerning, …
- Formatting Paragraphs: Paragraph styles and direct paragraph formatting \(indents, spacing, alignment, tab stops\)\. Use when: setting paragraph alignment, adjusting line and paragraph spacing, configuring hanging indent, …

Figure 4: Representative excerpt of the central SKILL\.md \(verbatim from the Stage 2 Writer skill; markdown link syntax simplified to plain titles\)\. Each item names one topic and gives the *when to use* criterion that the agent matches against its current task before invoking load\_topic \(Section [2\.2](https://arxiv.org/html/2606.18448#S2.SS2)\)\. The agent reads only this index up front; per\-topic content is fetched on demand\.
### A\.2 VisualSkill and Its Text\-Only Control: One Topic in Two Modalities
To make the control construction of Section [2\.1](https://arxiv.org/html/2606.18448#S2.SS1) concrete, we show the VisualSkill skill $\mathcal{S}^{\mathrm{mm}}$ and its text\-only control $\mathcal{S}^{\mathrm{txt}}$ on one specific topic, $t=$ formatting\-text/character\-formatting\. Both are produced in the *same* LLM call from the same input context \(Section [2\.3](https://arxiv.org/html/2606.18448#S2.SS3)\), so they cover the same procedural content of $p_t$; the two text bodies are not word\-for\-word identical, but each is the natural expression of that content within its own modality\. Where $\mathcal{S}^{\mathrm{mm}}$ embeds a figure \(an element of $F_t$\) and references it inline, $\mathcal{S}^{\mathrm{txt}}$ absorbs the same visual information into words and carries no figure \($F_t^{\mathrm{txt}}=\emptyset$\)\. Figure [5](https://arxiv.org/html/2606.18448#A1.F5) shows the two variants side by side\.
$\mathcal{S}^{\mathrm{txt}}$ • character\-formatting/guide\.md

\# Formatting Characters

Direct formatting works well for one\-off changes\. With text selected, the controls live in two places: the Formatting toolbar at the top of the window, and the Character section of the Sidebar’s Properties deck\. Both surfaces expose the same core attributes: font name, font size, bold, italic, underline, strikethrough, font and highlight colours, and super/subscript\.

Layout of the Character section in the Properties deck, top to bottom: the Font Name dropdown \(e\.g\. “Liberation Sans”\) with the Font Size dropdown \(“12 pt”\) to its right; a first row of toggle buttons \(Bold, Italic, Underline, Strikethrough, Toggle Shadow, Increase Font Size, Decrease Font Size\); a second row \(Font Color, Highlight Color, Clear Direct Formatting, Set Character Spacing, Superscript, Subscript\)\. A More Options link in the top\-right of the section opens the full Character dialog\.

For step\-changes in size, the Increase Font Size / Decrease Font Size buttons jump in fixed 2 pt increments; for an exact value, type into the Font Size dropdown directly\.

$\mathcal{S}^{\mathrm{mm}}$ • character\-formatting/guide\.md

\# Formatting Characters

For one\-off tweaks, direct formatting is the fastest path\. Select your text and reach for either the Formatting toolbar or the Character section of the Sidebar’s Properties deck: font name, size, bold/italic/underline/strikethrough, font and highlight colours, and super/subscript are all there, plus a More Options link that opens the full Character dialog\.

See \`fig01\.png\` for the Character section layout:

To step font size up or down, use the Increase Font Size / Decrease Font Size buttons \(fixed 2 pt increments\); for precise sizing, type into the Font Size dropdown\.

Figure 5: VisualSkill and its text\-only control on topic $t=$ formatting\-text/character\-formatting\. Both panels are real guide\.md bodies and were produced in the same LLM call from the same source context \(Section [2\.3](https://arxiv.org/html/2606.18448#S2.SS3)\); they cover the same procedural content but their text is not word\-for\-word identical\. Left \($\mathcal{S}^{\mathrm{txt}}$\): the text\-only control carries no figure \($F_t^{\mathrm{txt}}=\emptyset$\); the second paragraph absorbs the layout of the Character section into words\. Right \($\mathcal{S}^{\mathrm{mm}}$\): the multimodal skill references fig01\.png inline and the figure is an element of $F_t^{\mathrm{mm}}$, so its text can defer that layout to the figure and stays shorter\. The figure presence and the wording of $p_t$ around it are the only systematic differences between the two variants\.

#### What is shared, and what differs\.
- Shared between the two: the SKILL\.md index, the topic set $\mathcal{T}$, the topic identifier $t=$ formatting\-text/character\-formatting, its *when to use* index entry, and the procedural content of $p_t$ \(which UI controls are mentioned, in which order, and what they do\)\.
- Differs: the set $F_t$ of UI figures \(the VisualSkill skill retains the screenshots; the text\-only control has $F_t=\emptyset$\), and the wording of $p_t$ in the regions where $\mathcal{S}^{\mathrm{mm}}$ references a figure \(there, $\mathcal{S}^{\mathrm{txt}}$ spells the same visual information out in words so that no figure\-slot remains in the text\)\.

At evaluation time, any performance gap between $\mathcal{S}^{\mathrm{mm}}$ and $\mathcal{S}^{\mathrm{txt}}$ on the same task is attributable to the modality of presentation rather than to differences in the underlying content \(Section [2\.1](https://arxiv.org/html/2606.18448#S2.SS1)\)\. This controlled construction is the contrast used throughout the experiments in Section [3](https://arxiv.org/html/2606.18448#S3)\.
## Appendix B Inference\-Time Skill Loading
This appendix makes the loading mechanism of Section [2\.2](https://arxiv.org/html/2606.18448#S2.SS2) fully concrete in four artefacts: the system prompt that seeds the agent with the SKILL\.md index, the load\_topic MCP tool definition together with the server\-side interleaving contract, the literal text\-image interleaving that arrives back in the tool result, and a worked agent turn end\-to\-end\. All artefacts are verbatim from the shipped Stage 2 Writer skill \(skills/libreoffice\_writer\-knowledge\-multimodal\-loader\-v1/\); only line wrapping and syntax highlighting are cosmetic\.

The agent is launched with a system prompt containing a short preface \(reproduced below, verbatim from the SKILL\.md header\) that explains how to consult the skill, followed by the full SKILL\.md index in which every topic is paired with its one\-sentence *when to use* criterion \(Appendix [A\.1](https://arxiv.org/html/2606.18448#A1.SS1) reproduces a representative excerpt\)\. No per\-topic prose and no figures are loaded up front\. At every step the agent matches its current sub\-goal against the *when to use* lines and, if any matches, issues load\_topic\(t\) for the chosen topic\.
System prompt preface

    You are operating LibreOffice Writer 7.3.7.
    A skill is registered for you as a set of MCP
    tools (already wired):

      - load_topic(topic): returns the chosen
           topic's guide.md AND every figure
           (PNG) in that topic folder as one
           tool response. Use this instead of
           Read for any *.md or figXX.png file
           inside this skill.
      - list_topics(): returns every topic path
           available, one per line.

    Each entry in the TOC below has the form
    [Title](<topic>/guide.md). The <topic> part
    (the path before /guide.md) is what you
    pass to load_topic.

    Rules:
      1. Before any GUI action where you are
         unsure of the menu path/dialog/icon,
         find the matching topic in the TOC and
         call load_topic first.
      2. You may call load_topic at any step of
         the trajectory, not only at the start.
         If the task moves into a new area
         (e.g. tables), call load_topic again
         for the new area.
      3. Do NOT issue separate Read calls for
         figXX.png files inside this skill --
         they are delivered by load_topic
         automatically.

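The TOC convention stated in the preface, `[Title](<topic>/guide.md)`, is simple enough to parse mechanically\. The Python sketch below extracts the argument for load\_topic from index lines\. The regular expression and the helper name `topics_from_index` are assumptions made for illustration; the first sample entry is the one quoted in the worked turn of Figure 7, and the second pairs a title from Figure 4 with a path from Figure 3, which is itself an assumption since Figure 4 omits the link targets\.

```python
import re

# Matches "[Title](<topic>/guide.md)"; group 1 is the title, group 2 the <topic> path.
ENTRY = re.compile(r"\[([^\]]+)\]\(([^)\s]+)/guide\.md\)")


def topics_from_index(index_text: str) -> dict[str, str]:
    """Return {title: topic} for every TOC entry in a SKILL.md-style index."""
    return {title: topic for title, topic in ENTRY.findall(index_text)}


sample = (
    "- [Applying Styles](introduction-to-styles/applying-styles/guide.md) "
    "Use when: applying paragraph styles via toolbar or keyboard shortcuts\n"
    # Assumed pairing of title and path (link target not shown in Figure 4).
    "- [Formatting Characters](formatting-text/character-formatting/guide.md) "
    "Use when: changing font name/size/style\n"
)

index = topics_from_index(sample)
print(index["Applying Styles"])  # the string an agent would pass to load_topic
```

In the paper's setup the matching of a sub\-goal to a *when to use* line is done by the model reading the index; the sketch only shows how a chosen entry resolves to a topic path\.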
The load\_topic tool is a stdio MCP tool registered in the agent’s tool list, with signature load\_topic\(topic: str\) → list\[TextContent \| ImageContent\], where topic is a relative path from the skill root \(e\.g\. "page\-layout\-basics/page\-numbering"\)\. The box below reproduces the docstring shown to the model and the server\-side rule that produces the interleaved content list \(verbatim from tools/skill\_server\.py\)\.
load\_topic docstring and server\-side rule

Docstring shown to the model

    Atomically load a topic's guide.md plus
    every adjacent figure image.

    Args:
        topic: relative path from the skill
            root, e.g.
            "page-layout-basics/page-numbering".

    Returns guide.md prose interleaved with
    the referenced figures, so each figure
    is delivered immediately after the
    sentence that mentions it (figures are
    referenced in the guide as `fig01.png`
    etc.). Any figures in the folder that
    are not referenced in the guide are
    appended at the end. Prefer this over
    Read for any *.md file inside this
    skill -- it removes the need to issue
    separate Read calls for the figures.

Server\-side interleaving rule \(pseudocode\)

    text   = read(topic/guide.md)
    parts  = []
    cursor = 0
    for m in regex(`figXX.png`).finditer(text):
        parts.append(TextContent(
            text[cursor : m.end()]))
        parts.append(ImageContent(
            load(m.group())))
        cursor = m.end()
    parts.append(TextContent(text[cursor:]))
    for f in unreferenced_figures(topic):
        parts.append(ImageContent(f))   # fallback
    return parts

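The interleaving rule can be exercised without an MCP server\. Below is a minimal runnable sketch in Python that follows the pseudocode under stated assumptions: `TextContent` and `ImageContent` are plain stand\-in dataclasses rather than the MCP SDK types, images are represented by file name instead of bytes, and the reference pattern \(a PNG file name wrapped in backticks\) is a guess at what the pseudocode's regex matches\. Two small deviations are also ours: an empty trailing text block is skipped, and a backticked name with no matching file in the folder is left as plain text\.

```python
import re
from dataclasses import dataclass


@dataclass
class TextContent:   # stand-in, not the MCP SDK class
    text: str


@dataclass
class ImageContent:  # stand-in; holds a file name instead of image bytes
    name: str


# Assumed reference syntax: a PNG file name wrapped in backticks, e.g. `fig01.png`.
FIG_REF = re.compile(r"`([\w.-]+\.png)`")


def interleave(guide_text, folder_figures):
    """Return text and image blocks with each figure placed right after its mention."""
    parts = []
    cursor = 0
    referenced = set()
    for m in FIG_REF.finditer(guide_text):
        name = m.group(1)
        if name not in folder_figures:
            continue  # mention without a file: keep it as ordinary text
        parts.append(TextContent(guide_text[cursor:m.end()]))
        parts.append(ImageContent(name))
        referenced.add(name)
        cursor = m.end()
    if guide_text[cursor:]:
        parts.append(TextContent(guide_text[cursor:]))
    for name in folder_figures:  # fallback: unreferenced figures go last
        if name not in referenced:
            parts.append(ImageContent(name))
    return parts


guide = (
    "Use the Formatting toolbar or the Sidebar. See `fig01.png`. "
    "For the full set of options, open the Character dialog. See `fig02.png`. "
    "Use Clear Direct Formatting to strip it out."
)
two_figs = interleave(guide, ["fig01.png", "fig02.png"])
three_figs = interleave(guide, ["fig01.png", "fig02.png", "ui-character-dialog.png"])
print([type(p).__name__ for p in two_figs])
print([type(p).__name__ for p in three_figs])
```

With two referenced figures the result alternates text and image for five blocks, matching the block count shown in Figure 6; adding an unreferenced figure to the folder appends a sixth, image\-only block at the end\.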
The alternative to interleaving \(returning $p_t$ as one text block followed by all of $F_t$ as a flat image array\) would force the model to re\-bind each figure to the sentence that introduces it from the trailing image strip\. Placing each figure adjacent to its referencing sentence matches how a multimodal model attends to vision/text mixtures and removes the binding step entirely\. For the text\-only control there is nothing to interleave: the tool returns a single text block whose prose absorbs the visual content of $F_t$ in place\. Figure [6](https://arxiv.org/html/2606.18448#A2.F6) shows what each variant actually looks like on the wire for one topic\.
$\mathcal{S}^{\mathrm{mm}}$ result: 5 blocks (interleaved)

\# formatting-text/character-formatting/guide.md

Direct formatting is fine for one-off tweaks: select your text and use the buttons on the Formatting toolbar or the Character panel in the Sidebar’s Properties deck. You’ll find font name, size, bold, italic, underline, strikethrough, font color, highlight color, superscript, subscript, and more. See `fig01.png`.

[ImageContent: `fig01.png`]

To bump font size up/down, use the Increase/Decrease Font Size buttons on the Sidebar (fixed 2 pt steps); for precise sizing, type into the Font Size dropdown directly. For the full set of options, open Format > Character… See `fig02.png`.

[ImageContent: `fig02.png`]

Direct character formatting overrides character styles. Use Clear Direct Formatting (Ctrl+M) to strip it out.

$\mathcal{S}^{\mathrm{txt}}$ result: 1 block (text)

\# formatting-text/character-formatting/guide.md

Direct formatting is fine for one-off tweaks: select your text and use the buttons on the Formatting toolbar or the Character panel in the Sidebar’s Properties deck. You’ll find font name, size, bold, italic, underline, strikethrough, font color, highlight color, superscript, subscript, and more.

*Layout of the Character section, top to bottom: Font Name dropdown (“Liberation Sans”); Font Size dropdown (“12 pt”); a row of buttons (Bold, Italic, Underline, Strikethrough, Toggle Shadow, Increase/Decrease Font Size); a second row (Font Color, Highlight Color, Clear Direct Formatting, Set Character Spacing, Superscript, Subscript); a More Options link in the top-right opens the full Character dialog.*

To bump font size up/down, use the Increase/Decrease Font Size buttons on the Sidebar (fixed 2 pt steps); for precise sizing, type into the Font Size dropdown directly. For the full set of options, open Format > Character…

*Layout of the Character dialog (Font tab): six tabs across the top (Font, Font Effects, Position, Hyperlink, Highlighting, Borders); Family dropdown (“Liberation Serif”), Style dropdown (“Regular”), Size dropdown (“12 pt”), Language dropdown; a live preview pane showing the sample text in the selected font; OK, Cancel, Help, Reset buttons along the bottom.*

Direct character formatting overrides character styles. Use Clear Direct Formatting (Ctrl+M) to strip it out.
Figure 6: The literal tool-result structure for one topic, in both variants. Left ($\mathcal{S}^{\mathrm{mm}}$): the content list alternates `TextContent` and `ImageContent` blocks; each image is delivered immediately after the sentence that names it (e.g. See `fig01.png`), giving five blocks total for this topic. Right ($\mathcal{S}^{\mathrm{txt}}$): the content list is a single `TextContent` block; at the same offsets where the multimodal result emits an image, the text-only guide carries an in-place verbal substitute (italicised *Layout of …* paragraphs). Both panels show the same procedural content of $p_t$; the figure presence and the wording around it are the only systematic differences.

The agent consumes the returned content list as part of its next assistant turn, exactly like any multimodal context: both text and image blocks count as ordinary input, with no separate “attachment” channel. Figure [7](https://arxiv.org/html/2606.18448#A2.F7) traces one full step end-to-end from a real Stage 2 Writer rollout of `exam_paper_formatting` (Appendix [C.3](https://arxiv.org/html/2606.18448#A3.SS3)): the agent has just selected a section header that should become a Heading 2, scans the `SKILL.md` index in its system prompt, matches a topic, calls `load_topic`, receives the interleaved tool result, and consumes it on its next step.
Turn $k$ • Agent (assistant message)

I need to apply the Heading 2 paragraph style. Scanning SKILL.md for a *when to use* line that mentions applying paragraph styles … \- \[Applying Styles\]\(introduction-to-styles/applying-styles/guide.md\) Use when: applying paragraph styles via toolbar or keyboard shortcuts, applying character styles via Sidebar or context menu, applying page styles via Status bar … That matches. Loading the topic.

Tool call (issued by the agent)

`load_topic(topic="introduction-to-styles/applying-styles")`

Tool result (returned by MCP server) • 5 content blocks

\[1\] TextContent: \# introduction-to-styles/applying-styles/guide.md Apply a paragraph style via the Paragraph Style dropdown at the far left of the Formatting toolbar. The widget is a combo box: a text field on the left and a separate dropdown arrow on the right. Click the arrow to open the style menu. See `fig01.png`.

\[2\] ImageContent (paragraph-style menu open): mime=image/png, data=\<base64 PNG bytes, ∼140 KB\>

\[3\] TextContent: Pick the style by clicking its row. The dropdown closes and the text field updates to the chosen style name; the status bar updates to the corresponding outline level. See `fig02.png`.

\[4\] ImageContent (“Heading 1” applied, status bar updated): mime=image/png, data=\<base64 PNG bytes, ∼120 KB\>

\[5\] TextContent: Keyboard shortcuts: Ctrl+1..5 apply Heading 1..5 to the current paragraph; Ctrl+0 applies Default Paragraph Style. …

Turn $k+1$ • Agent (assistant message + next tool call)

The guide says the dropdown has two click targets: a text field on the left and a separate arrow on the right. Clicking the arrow opens the menu (block \[2\]). I’ll click the arrow.

`computer_use(action="click", x=196, y=156)  # the dropdown arrow`
Figure 7: One full agent turn end-to-end. The agent (i) scans the `SKILL.md` index in its system prompt and matches its current sub-goal against a *when to use* line; (ii) issues `load_topic` for the chosen topic; (iii) receives a single tool result whose content list interleaves `TextContent` and `ImageContent` blocks in the order figures are referenced in the guide; (iv) consumes the interleaved guide on its next step, citing a specific image block when choosing the click target. Image payloads are shown as size annotations; everything else is verbatim from the transcript.

#### Notes on the loop.
- Many calls per trajectory. The agent may invoke `load_topic` multiple times across a rollout (Rule 2 of the preface), typically as the task moves between UI surfaces (e.g. page layout, tables, styles).
- Text and image blocks are first-class context. Both block types in the returned content list count as ordinary multimodal context for the agent’s next assistant turn; no separate “image attachment” channel is used. This is why the multimodal vs. text-only contrast at evaluation time isolates the modality of presentation rather than any difference in how the content reaches the model.
- No `Read` fallback for figures. Rule 3 of the preface explicitly forbids issuing `Read` calls for `figXX.png` files in the skill. This is enforced by the prompt rather than mechanically; in practice agents comply because `load_topic` already delivers the figures.
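
The interleaved result structure described in this appendix can be sketched as follows. This is a minimal illustration and not the authors’ server code: `TextBlock` and `ImageBlock` are hypothetical stand-ins for the `TextContent` and `ImageContent` blocks, and the reference pattern (a sentence of the form See `figNN.png`.) is assumed from the excerpts in Figures 6 and 7.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Union


@dataclass
class TextBlock:
    """Hypothetical stand-in for a TextContent block."""
    text: str


@dataclass
class ImageBlock:
    """Hypothetical stand-in for an ImageContent block."""
    name: str    # figure filename, e.g. "fig01.png"
    data: bytes  # raw PNG bytes


Block = Union[TextBlock, ImageBlock]

# Assumed reference pattern: a sentence such as "See `fig01.png`."
FIG_REF = re.compile(r"See\s+`(fig\d+\.png)`\.")


def interleave(guide: str, figures: dict[str, bytes]) -> list[Block]:
    """Split the guide after each figure reference and emit the image right there."""
    blocks: list[Block] = []
    cursor = 0
    for match in FIG_REF.finditer(guide):
        name = match.group(1)
        if name not in figures:
            continue  # unknown figure: keep the sentence inside the surrounding text
        blocks.append(TextBlock(guide[cursor:match.end()].strip()))
        blocks.append(ImageBlock(name, figures[name]))
        cursor = match.end()
    tail = guide[cursor:].strip()
    if tail:
        blocks.append(TextBlock(tail))
    return blocks


def flat(guide: str, figures: dict[str, bytes]) -> list[Block]:
    """The rejected alternative: one text block followed by a trailing image strip."""
    return [TextBlock(guide.strip())] + [ImageBlock(n, d) for n, d in figures.items()]
```

For a guide with two figure references and trailing text, `interleave` yields five blocks in the order text, image, text, image, text, which matches the block count shown for the multimodal variant in Figure 6.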
## Appendix C Three Construction-Stage Examples
This appendix grounds the three sub-pipelines of Section [2.3](https://arxiv.org/html/2606.18448#S2.SS3) in concrete artefacts. Section [C.1](https://arxiv.org/html/2606.18448#A3.SS1) walks through one Stage 1 topic, end-to-end, from a page of the official LibreOffice Writer manual to the generated multimodal `guide.md`. Section [C.2](https://arxiv.org/html/2606.18448#A3.SS2) traces one region of the Stage 2(a) free UI explorer, from the planner’s segmentation of the idle window to a worker’s per-control notes and figures. Section [C.3](https://arxiv.org/html/2606.18448#A3.SS3) traces one target of the Stage 2(b) trajectory-targeted explorer, from a failed training rollout that surfaced a grounding gap, through the reviewer’s diagnosis, to the worker capture and the patch landing in the matching topic.
All three examples are drawn from the actual Writer\-skill build artefacts; screenshots, action traces, and quoted JSON/Markdown are verbatim from those artefacts unless explicitly noted as paraphrase\.
### C.1 Stage 1: From Authored Documentation
We trace topic $t =$ `formatting-text/character-formatting` through the four LLM calls that make up Stage 1.
#### Step 1 • Source material\.
The input to Stage 1 is the official *Getting Started with LibreOffice* guide for Writer 7.3, distributed as a PDF $D$. The chapter on character formatting spans pages 84–92 of the PDF, contains body text, and ships with two vendor-drawn screenshots: a view of the Character section of the Sidebar’s Properties deck, and a view of the full Format > Character… dialog. These two figures become the elements of $F_t$ for this topic; they are extracted unchanged from the PDF.
#### Step 2 • Topic entry from the TOC parser\.
The TOC parser runs once for the whole document and emits the topic set $\mathcal{T}$ as JSON. The entry for our topic is reproduced verbatim below.
\{ "topic\_id" : "formatting\-text/character\-formatting", "title" : "Formatting Characters", "parent\_id" : "formatting\-text", "page\_start": 84, "page\_end" : 92 \}
#### Step 3 • Per\-topic guide generation\.
The joint-generation call (Section [2.1](https://arxiv.org/html/2606.18448#S2.SS1)) is invoked once for this topic with three inputs: the topic record (above), the concatenated text of pages 84–92, and the two extracted figures attached as image content. Claude Opus produces the multimodal text body $p_t^{\mathrm{mm}}$. Figure [8](https://arxiv.org/html/2606.18448#A3.F8) reproduces the opening of the generated `guide.md` alongside the vendor figures it references.

Stage 1 output • `formatting-text/character-formatting/guide.md` (excerpt)

\# Formatting Characters

The quickest way to format characters is with character styles: select your text, then pick a style from the Character Styles tab on the Styles deck in the Sidebar, the Styles menu, or the Formatting (Styles) toolbar.

Direct formatting is fine for one-off tweaks: select your text and use the buttons on the Formatting toolbar or the Character panel in the Properties deck of the Sidebar. You’ll find font name, size, bold, italic, underline, strikethrough, font color, highlight color, superscript, subscript, and more. See `fig01.png`.

[fig01.png • vendor screenshot extracted from PDF page 85]

For the full set of options, open Format > Character… The dialog has six tabs: Font, Font Effects, Position, Hyperlink, Highlighting, Borders. See `fig02.png`.

[fig02.png • vendor screenshot extracted from PDF page 88]

Figure 8: Stage 1 output for one topic. The text body $p_t^{\mathrm{mm}}$ is generated from the concatenated page text of pages 84–92 of the official manual; the two figures are extracted unchanged from the PDF. The text-only body $p_t^{\mathrm{txt}}$ is generated in the same LLM call (joint generation, Section [2.1](https://arxiv.org/html/2606.18448#S2.SS1)) and absorbs the visual content of `fig01`/`fig02` into prose; we omit it here and show only the multimodal variant.
#### Step 4 • Index entry generation\.
A separate one-shot LLM call generates the one-sentence *when to use* criterion that the agent will match against its current task at inference (Section [2.2](https://arxiv.org/html/2606.18448#S2.SS2)). The generated line for this topic is reproduced verbatim below; this is the line the agent sees in `SKILL.md` (Appendix [A.1](https://arxiv.org/html/2606.18448#A1.SS1)).
\- \[Formatting Characters\]\(formatting\-text/character\-formatting/guide\.md\) \-\- Applying character styles and direct character formatting including fonts, size, color, and effects \*\*Use when:\*\* changing font name/size/style, applying character styles, adjusting character spacing and kerning, inserting hyperlinks via Character dialog, highlighting text, clearing direct formatting
#### Notes on Stage 1\.
- Figures are vendor-drawn. $F_t$ at Stage 1 contains screenshots authored by LibreOffice’s documentation team. They show idealised, marketing-clean UI states (no real document content) and capture the application as the documentation was last revised; they may lag the shipped UI by one or more minor versions. Section [C.3](https://arxiv.org/html/2606.18448#A3.SS3) shows what happens at test time when that lag matters.
- Coverage is whatever the docs cover. Topics for which the manual lacks a section (e.g. context menus added in a maintenance release, or recently restyled dialogs) have no Stage 1 entry and rely on Stage 2 to enter the skill at all.
- Matched-pair invariants. The same joint-generation call produces $p_t^{\mathrm{mm}}$ and $p_t^{\mathrm{txt}}$ from the same context; the shared `SKILL.md` index entry is generated once and is byte-identical between the two skills.
### C.2 Stage 2(a): Free UI Explorer
We trace one region of the Stage 2(a) free explorer for the Writer skill, target `character_position_dialog`.
#### Step 1 • Planner segments the idle window\.
The Stage 2(a) planner agent $P$ (Opus-class, Section [2.3](https://arxiv.org/html/2606.18448#S2.SS3)) is launched with a single screenshot of the freshly opened Writer application and instructed to partition the visible UI into $K = 8$ regions covering everything the user can interact with. The planner’s output is a JSON list of region records; the entry for the region we will trace is reproduced below (paraphrased for brevity; the full record also lists adjacent toolbars).
\{ "region\_id": "character\_position\_dialog", "name" : "Format \> Character dialog, Position tab", "entry" : "Format menu \> Character…\> Position tab", "scope" : "Document the Position tab of the Character dialog: Superscript / Subscript / Normal radios, Raise/lower by, Relative font size, Rotation/Scaling, Spacing, Pair kerning\." \}
#### Step 2 • Worker drives the region\.
A Sonnet-class worker $W_i$ is spawned in a clean Writer Docker container with the region record above as its prompt. The worker opens the dialog, clicks each control, captures cropped screenshots of state changes, and writes a structured note. Two of the worker’s captures are reproduced below; the worker emitted twelve in total across all controls in the region.

(a) Position tab on first open: Normal radio selected, Raise/lower by and Relative font size greyed.

(b) After clicking the Subscript radio: Raise/lower auto-set to 8%, Relative font size to 58%, Automatic checkbox auto-checked, preview pane shows subscripted text.

Figure 9: Two of the worker’s twelve captures for region `character_position_dialog`. The worker reaches each state by issuing real clicks in the live application; the screenshot in (b) is taken immediately after the Subscript click and shows the resulting auto-population of the spinner fields.
#### Step 3 • Per\-control notes\.
Alongside the captures, the worker emits a structured Markdown note that names every control, its pixel\-coordinate location, its state\-transition behaviour, and which screenshot witnesses the behaviour\. An excerpt for two of the radios is reproduced verbatim below \(compressed line wrapping is the only edit\)\.
\#\#\# 3. Superscript radio button

\- Location: Second radio in the Position group; approx (662, 393) in 1920x1080

\- Behaviour: Activates superscript positioning. Sets Raise/lower by to 33% (upward) and Relative font size to 58% as automatic defaults. Raise/lower spinner becomes read-only when Automatic is checked.

\- Evidence: step_005.png

\#\#\# 4. Subscript radio button

\- Location: Third radio in the Position group; approx (662, 423) in 1920x1080

\- Behaviour: Activates subscript positioning. Sets Raise/lower by to 8% (downward) and Relative font size to 58% as automatic defaults.

\- Evidence: step_006.png
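
To make the note format concrete, the sketch below parses notes shaped like the excerpt above into records that an assembler could consume. It is an illustration only: the paper does not describe how the assembler reads the notes, `ControlNote` and `parse_notes` are hypothetical names, and the field labels (Location, Behaviour, Evidence) are taken from the excerpt.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ControlNote:
    index: int
    name: str
    xy: Optional[tuple[int, int]]  # approximate pixel location, if one was given
    behaviour: str
    evidence: str                  # screenshot that witnesses the behaviour


HEADING = re.compile(r"^###\s+(\d+)\.\s+(.+)$")
FIELD = re.compile(r"^-\s+(Location|Behaviour|Evidence):\s*(.*)$")
COORDS = re.compile(r"\((\d+),\s*(\d+)\)")


def parse_notes(markdown: str) -> list[ControlNote]:
    """Parse '### N. name' sections with '- Field: value' lines into records."""
    sections: list[dict] = []
    current: Optional[dict] = None
    last_key: Optional[str] = None
    for raw in markdown.splitlines():
        line = raw.strip()
        heading = HEADING.match(line)
        if heading:
            current = {"index": int(heading.group(1)), "name": heading.group(2)}
            sections.append(current)
            last_key = None
            continue
        if current is None or not line:
            continue
        field = FIELD.match(line)
        if field:
            last_key = field.group(1)
            current[last_key] = field.group(2)
        elif last_key:
            # continuation of a wrapped field value
            current[last_key] += " " + line
    notes = []
    for sec in sections:
        m = COORDS.search(sec.get("Location", ""))
        xy = (int(m.group(1)), int(m.group(2))) if m else None
        notes.append(ControlNote(sec["index"], sec["name"], xy,
                                 sec.get("Behaviour", ""), sec.get("Evidence", "")))
    return notes
```

Keeping the evidence filename next to each behaviour is what lets a later stage pick the screenshot that witnesses a given state transition.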
#### Step 4 • Assembly into a topic\.
After all $K$ workers return, the assembler agent consolidates each region’s notes and figures into a per-region reference section, and an LLM mapper decides which Stage 1 topic each region most directly enriches. For this region, the mapper points at `formatting-text/character-formatting/` (the same topic Stage 1 generated from pages 84–92 of the manual, Section [C.1](https://arxiv.org/html/2606.18448#A3.SS1)). The assembled section is appended to that topic’s `guide.md` and the worker’s reference screenshot (Figure [9](https://arxiv.org/html/2606.18448#A3.F9)b re-shot at full resolution) is added to $F_t^{\mathrm{mm}}$.
#### Why “free” matters\.
The free explorer reaches UI surfaces the manual is silent on. The Position tab *is* documented in the official guide, but with a vendor-drawn screenshot of the dialog at its default state; the manual does not show that selecting Subscript auto-populates the spinner fields, nor that the Automatic checkbox auto-checks. Such state-transition facts come from the worker’s free-exploration captures, and Section [C.3](https://arxiv.org/html/2606.18448#A3.SS3) shows that facts of this kind are decisive at agent test time.
### C.3 Stage 2(b): Trajectory-Targeted Explorer
The third example traces one target of the Stage 2(b) trajectory-targeted explorer, from a failed training rollout that surfaced a UI grounding gap, through the reviewer agent’s diagnosis, to the worker capture and the patch landing in the matching topic. The worked example is target `paragraph_style_applicator`, the Paragraph Style dropdown in LibreOffice Writer’s formatting toolbar.
#### Step 1 • Training task\.
`exam_paper_formatting` is one of the 16 training tasks in the Writer training split. It asks the agent to reformat a biology exam paper end-to-end: set margins, font, line spacing, format the title block, italicise the instructions paragraph, apply the Heading 2 paragraph style to the three section headers (“Part A”, “Part B”, “Part C”), and save the result as `bio101_final_exam.docx`.
This subsection focuses on the Heading 2 sub-goal, which is the part that triggered the reviewer agent’s diagnosis. Figure [10](https://arxiv.org/html/2606.18448#A3.F10) shows the document’s initial state.
Figure 10: Initial state of `exam_paper_formatting`. The three target headers (“Part A: Multiple Choice (40 points)” and the two parallel headers below) are plain body text. The paragraph-style dropdown in the top-left of the formatting toolbar reads *“Default Paragraph Style”*. The sub-goal is to change those three headers to *“Heading 2”*.

Outcome with the Stage 1 multimodal skill.
- Verifier score: $0.000$ (FAIL).
- Steps used: $80\,/\,80$ (budget exhausted).
- Verifier feedback: *“Output file ’bio101_final_exam.docx’ not found.”*
The agent applied margins, font, line spacing, title formatting and italicisation successfully, and eventually got Heading 2 applied to each section header as well\. It ran out of budget before reaching the final Save As\. The next step explains why\.
#### Step 2 • Where the agent was inefficient\.
The paragraph-style dropdown is actually *two* adjacent widgets glued into one combo box: a text field (left, ∼(75, 118) in 1920×1080) and a separate ▼ arrow (right, ∼(196, 156)). Clicking the ▼ arrow opens a menu listing every style (Heading 1, Heading 2, …); clicking the desired style applies it. The whole interaction is two clicks, no typing, no Enter.
The Stage 1 skill’s prose mentions the dropdown but provides no figure of the open menu and no callout that the arrow is a distinct click target. The agent ends up clicking the text field at (75, 118), then tries to type-then-Enter its way to the style instead. Table [4](https://arxiv.org/html/2606.18448#A3.T4) shows the verbatim 8-action trace (rollout indices 141–153).
Table 4: Verbatim 8-action trace for one Heading 2 application (rollout indices 141–153). The canonical ▼-then-pick path is only 2 actions.

What the agent did not understand: that the dropdown is composed of two click targets, not one. It treated the entire box as a text field, picked the easy click target (the text-field portion), and never tried the ▼ arrow that would have opened the menu of style choices. The Stage 1 prose said “use the dropdown to apply a style” without showing the open-menu state or specifying which click target opens it. The cost: ∼6 wasted actions per heading × 3 section headers ≈ 18 wasted actions, against an 80-step budget that the agent then runs out of before the closing Save As.
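
The cost estimate can be written out explicitly from the figures quoted in this step (an 8-action trace against a 2-action canonical path, three section headers, and an 80-step budget). The share of the budget on the right is derived here and is not a number reported in the paper.

```latex
\[
\underbrace{(8 - 2)}_{\text{wasted actions per heading}}
\times
\underbrace{3}_{\text{section headers}}
= 18 \ \text{wasted actions},
\qquad
\frac{18}{80} = 22.5\% \ \text{of the step budget}.
\]
```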
#### Step 3 • Reviewer agent identifies the weak UI region\.
The reviewer agent $V$ (Section [2.3](https://arxiv.org/html/2606.18448#S2.SS3), Stage 2(b)) reads all 16 training rollouts and looks for one signal: the same widget breaks multiple agents in similar ways. Four corroborating failures appear for the paragraph-style dropdown:
- `exam_paper_formatting`: 8-action fumble per heading, ran out of budget (this subsection).
- `format_research_paper`: used Ctrl+1/Ctrl+2 hotkeys that did not persist as Heading styles (0/7 Heading 1, 0/4 Heading 2).
- `clinical_protocol_formatting`: 0/9 major headings styled.
- `outline_numbering_handbook`: same dropdown gap.
Four independent failures on the same widget, in different ways, rule out random budget exhaustion as the sole cause and corroborate a systematic UI gap\. The reviewer writes one target:
target_id: `paragraph_style_applicator`

name: Paragraph Style dropdown in the Formatting Toolbar

scope: Document the paragraph style dropdown (leftmost box in the formatting toolbar). *Show how to click it, type a style name (e.g. ‘Heading 1’), and press Enter to apply.* Cover the dropdown list behaviour, autocomplete, and what happens when an invalid name is typed.
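
The corroboration signal used in this step (the same widget failing across several tasks) can be illustrated with a small grouping routine. This is only a sketch of the criterion: in the paper the reviewer is an LLM agent reading rollouts, not a rule like this, and the `Failure` record, the widget labels, and the `min_tasks` threshold are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    task: str     # training task in which the failure was observed
    widget: str   # hypothetical label for the UI region involved
    symptom: str  # short free-text description


def propose_targets(failures, min_tasks=2):
    """Return widgets that fail in at least `min_tasks` distinct tasks,
    most-corroborated first (ties broken alphabetically)."""
    tasks_by_widget = defaultdict(set)
    for failure in failures:
        tasks_by_widget[failure.widget].add(failure.task)
    corroborated = [w for w, tasks in tasks_by_widget.items() if len(tasks) >= min_tasks]
    return sorted(corroborated, key=lambda w: (-len(tasks_by_widget[w]), w))
```

Counting distinct tasks rather than raw failures keeps one long, repetitive rollout from looking like systematic evidence on its own.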
#### Step 4 • Trajectory\-targeted worker captures the menu\.
A worker $W'_i$ is spawned in a clean Writer container with the above target as its prompt. The worker uses the same protocol as the Stage 2(a) free workers (Section [C.2](https://arxiv.org/html/2606.18448#A3.SS2)): real clicks in the live application, cropped screenshots of state changes, per-control notes. It clicks the ▼ arrow, captures the open menu (Figure [11](https://arxiv.org/html/2606.18448#A3.F11)), then applies a style and captures the applied state (Figure [12](https://arxiv.org/html/2606.18448#A3.F12)).

Figure 11: Worker capture, step 4: the ▼ arrow has been clicked and the style menu is open. Every paragraph style is listed and rendered in its own font (Title large, Heading 1 large bold, Heading 2 large, …). This is the figure that lands in the patched skill as the primary reference.

Figure 12: Worker capture, step 38: after selecting “Heading 1” from the menu, the dropdown text field updates to “Heading 1” (top-left) and the status bar reads “Heading Numbering : Level 1” (bottom-right). This is the canonical applied-state confirmation a future agent should recognise.
#### Step 5 • Patch lands and pays off at test time\.
The same assembler and LLM mapper used by Stage 2(a) (Section [C.2](https://arxiv.org/html/2606.18448#A3.SS2)) consolidate the worker’s notes and figures and assign them to the `applying-styles` topic, the topic the failing rollouts had loaded. The assembler appends a short paragraph from the worker’s notes that names the two click targets (the text field at ∼(75, 118), the ▼ arrow at ∼(196, 156)) and prescribes the canonical 2-click interaction, plus the two figures of menu-open and applied states. At inference, any task touching paragraph styles loads the patched topic via `load_topic` (Appendix [B](https://arxiv.org/html/2606.18448#A2)) and receives both figures plus the 2-click recipe in one tool result. For `exam_paper_formatting` specifically, the freed ∼18 actions are exactly the budget the agent needed for the closing Save As. Because the patch is scoped to the *UI region* rather than to any one task, it transfers to every future test task that uses heading styles, and it also covers the gap behind the three other training-task failures the reviewer cited.
## Appendix D Agent Setup
This appendix details the agent setup, the GUI tool surface, and the sandboxing setup deferred from Section [3.1](https://arxiv.org/html/2606.18448#S3.SS1).
#### Agent\.
The evaluated agent (Claude Opus 4.6) runs inside the Claude Code CLI, an off-the-shelf agentic harness that exposes MCP tool calls and standard read/write tools to the model. The same CLI image is reused across all skill conditions; only the skill folder mounted into the container changes. Per-step prompts are held fixed.
#### GUI tool surface\.
The agent interacts with the desktop through a single MCP tool, `computer` (from the open-source `computer-use-mcp` server), which exposes eleven actions; the full set is reproduced verbatim in Table [5](https://arxiv.org/html/2606.18448#A4.T5). The action enum and the per-action behaviour are identical to those documented in the Anthropic computer-use reference. The agent obtains pixels by calling `get_screenshot`; coordinates are reported in the API image space (downsampled to fit the 1.15 MP / 1568 px API limits) and scaled to logical screen pixels by the server, so the agent does not need to track the display resolution itself. Skill consultation is exposed by a *separate* MCP server bundled inside each skill folder (`load_topic` and `list_topics`; Appendix [B](https://arxiv.org/html/2606.18448#A2)); the two MCP servers (the GUI controller and the skill MCP server) are independent and the agent sees both in its tool list.
Table 5: Actions exposed by the `computer` MCP tool (`computer-use-mcp/src/tools/computer.ts`).
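
The coordinate handling described above can be sketched as a single uniform scale factor. This is an assumption-laden illustration, not the server’s code: it reads the “1.15 MP / 1568 px” limits as a cap of 1,150,000 total pixels and 1568 px on the longer edge, assumes aspect-preserving downsampling, and uses hypothetical function names; the server’s actual rounding is not specified in the text.

```python
import math

MAX_LONG_EDGE = 1568      # px, longer-edge limit (as quoted in the text)
MAX_PIXELS = 1_150_000    # assumed reading of "1.15 MP"


def api_scale(width: int, height: int) -> float:
    """Largest factor s <= 1 such that the scaled image meets both limits."""
    edge_limit = MAX_LONG_EDGE / max(width, height)
    area_limit = math.sqrt(MAX_PIXELS / (width * height))
    return min(1.0, edge_limit, area_limit)


def to_api(x: float, y: float, width: int, height: int) -> tuple[float, float]:
    """Map logical screen pixels to API image space."""
    s = api_scale(width, height)
    return x * s, y * s


def to_screen(x_api: float, y_api: float, width: int, height: int) -> tuple[int, int]:
    """Map API image-space coordinates back to logical screen pixels."""
    s = api_scale(width, height)
    return round(x_api / s), round(y_api / s)
```

Under these assumptions a 1920×1080 display is bound by the pixel-count limit rather than the edge limit, so the agent sees a smaller image and the server multiplies click coordinates back up before acting.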
#### Two\-container sandboxing\.
Each task runs in a two-container setup:
- A desktop-environment container (Xvfb + the target application: LibreOffice, GIMP, OpenToonz, etc.) ships with the benchmark and provides the actual UI to drive. The benchmark’s verifier and `init.max_steps` budget come from this container.
- A Claude Code CLI container runs the agent, mounts both the per-task workspace and the skill folder read-only, and connects to the desktop container over an HTTP bridge (`GA_BRIDGE_URL`). All `computer` actions are forwarded to the desktop container by the bridge; verifier scoring and `env.step()` bookkeeping happen there.
The split keeps the agent fully sandboxed from the host while preserving deterministic per-task state: the desktop container is reset to a clean snapshot before each task. Both containers are spun up by a single `run_task.py` entry point per task, which also writes an MCP-config file that registers the two MCP servers (controller + skill loader) in the Claude CLI container.
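
As a rough picture of what such an MCP-config file might contain, the sketch below builds one with two server entries. Everything specific here is an assumption: the `mcpServers` layout follows the common MCP client configuration shape, and the server names, commands, and paths are placeholders rather than values from `run_task.py`; only `GA_BRIDGE_URL` comes from the text.

```python
import json


def build_mcp_config(bridge_url: str, skill_dir: str) -> dict:
    """Return a config dict registering a GUI controller and a skill loader."""
    return {
        "mcpServers": {
            # Hypothetical entry for the GUI controller (computer-use-mcp).
            "gui-controller": {
                "command": "node",
                "args": ["/opt/computer-use-mcp/dist/index.js"],  # placeholder path
                "env": {"GA_BRIDGE_URL": bridge_url},
            },
            # Hypothetical entry for the skill MCP server bundled in the skill folder.
            "skill-loader": {
                "command": "python",
                "args": [f"{skill_dir}/mcp_server.py"],  # placeholder path
            },
        }
    }


def write_mcp_config(path: str, bridge_url: str, skill_dir: str) -> None:
    """Serialise the config to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(build_mcp_config(bridge_url, skill_dir), fh, indent=2)
```

Since only the skill folder changes between conditions, a setup like this would vary just the `skill_dir` argument and leave the controller entry untouched.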
#### Step budget\.
What counts against the per-task budget is the number of `computer`-tool actions that mutate the desktop (click, drag, type, key, hotkey, scroll), as recorded by `env.step()` on the desktop side. `get_screenshot`, `get_cursor_position`, `Read`, `load_topic`, and `list_topics` are observation / lookup operations and are *not* charged against the budget; this prevents skill consultation from competing with the actual GUI interaction. The per-task ceiling is whatever the benchmark ships in `task.json`’s `init.max_steps` (median 60 in our Writer split; min 40, max 80).
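
The accounting rule can be sketched as a simple filter over the sequence of calls. The action names below are the categories listed in the paragraph above and may not match the exact enum in Table 5; in the paper the real bookkeeping is done by `env.step()` on the desktop side, so this is only an illustration with hypothetical function names.

```python
# Categories named in the text as mutating the desktop (charged).
MUTATING = {"click", "drag", "type", "key", "hotkey", "scroll"}

# Observation / lookup operations named in the text (not charged).
FREE = {"get_screenshot", "get_cursor_position", "Read", "load_topic", "list_topics"}


def steps_charged(calls) -> int:
    """Count only the calls that mutate the desktop."""
    return sum(1 for call in calls if call in MUTATING)


def within_budget(calls, max_steps: int) -> bool:
    """True while the charged steps do not exceed the task's init.max_steps."""
    return steps_charged(calls) <= max_steps
```

For instance, a rollout that takes a screenshot, loads a topic, and then clicks twice is charged two steps, so consulting the skill never shortens the interaction budget.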
## Appendix E Datasets
OSExpert-Eval (Liu et al., [2026](https://arxiv.org/html/2606.18448#bib.bib6)) is a curated benchmark designed to evaluate computer-use agents on professional-level challenges that go beyond routine GUI interaction. The benchmark consists of 113 tasks spanning three major categories: Long Horizon Compositional Workflows, Unseen UI Generalization, and Fine-Grained Action Execution. Long Horizon contains 30 tasks, including 24 Office tasks and 6 GIMP tasks, emphasizing multi-step workflows that require composing multiple unit functions in a correct and robust order. Unseen UI contains 50 tasks, including 20 Tableau tasks and 30 MiniWord tasks, targeting novel layouts and interaction patterns that are uncommon in current agents’ training distributions. Fine-Grained contains 33 tasks, including 14 GIMP tasks and 19 Office tasks, requiring precise low-level control such as accurate text selection, object manipulation, and spatial alignment. Among these three categories, Long Horizon and Unseen UI are exactly the scenarios where agent skills are expected to help. We exclude the Fine-Grained category because Claude Code CLI powered by Claude Opus 4.6 already performs well on it. We therefore include all tasks from the Long Horizon category, as well as the Tableau domain from Unseen UI; the MiniWord domain in that category is excluded because its verifier files are missing. The dataset is released under the MIT License.
CUA-World (Aggarwal et al., [2026](https://arxiv.org/html/2606.18448#bib.bib2)) is a large-scale benchmark consisting of over 10K long-horizon computer-use tasks spanning diverse domains, including medical science, astronomy, engineering, creative tools, and enterprise software applications. Each environment is configured with realistic data and includes predefined train/test splits. The environments and tasks are generated through an agentic pipeline with executable verifiers for evaluation. In this work, we manually verify and curate five domains for our experiments: LibreOffice Writer, LibreOffice Calc, LibreOffice Impress, QGIS, and OpenToonz. The dataset is released under the MIT License.
## Appendix F Results of Qwen3.5-397B-A17B-FP8
To test how well our multimodal skill-generation pipeline generalizes, we additionally evaluate Qwen3.5-397B-A17B-FP8 on OSWorld, reusing the skills built for the LibreOffice domain. As shown in Table [6](https://arxiv.org/html/2606.18448#A6.T6), tasks with the text-only skill achieve the highest scores, while tasks with the multimodal skill score about the same as the no-skill baseline. Inspecting the trajectories, we attribute this to the burden of an overly long context: when a topic is loaded from the multimodal skill, all figures in that topic are placed into the context. This burden can degrade model behavior, for example by producing malformed tool calls.
Table 6: Cross-model replication on OSWorld with Qwen3-397B as the underlying agent. Per-domain mean score in $[0,1]$; the integer beneath each column header is the task count; Avg is the unweighted mean across the three LibreOffice domains. The two skill conditions are the Stage 1 (Documentation-only) variants in their text-matched and multimodal forms; UI-explorer (Stage 2) results are reported in Table [1](https://arxiv.org/html/2606.18448#S3.T1).
## Appendix G Model and Compute Information
Claude Code CLI was used as the primary agent for experiments. The approximate API usage cost during the project was around $1,000.
We additionally used Qwen3.5-397B-A17B-FP8 for parts of the experimental pipeline. Training and evaluation experiments were conducted on 8 H100 GPUs for approximately 48 hours, corresponding to roughly 384 H100 GPU hours.
## Appendix H Qualitative Analysis: Supporting Figures
This appendix provides supporting figures for the qualitative analysis in Section [4.1](https://arxiv.org/html/2606.18448#S4.SS1). Figure [13](https://arxiv.org/html/2606.18448#A8.F13) illustrates case (i) on test task `maintenance_plan_orientation`: the text-only agent clicks the title bar of the Save As dialog (closing LibreOffice), while the Stage 2(b) worker’s captured screenshot, delivered to the multimodal agent through `load_topic`, pins the Save button at its actual location in the dialog footer.
(a) Text-only skill. Agent clicks (1011, 44) on the dialog title bar; LibreOffice closes.

(b) Multimodal skill. Reference screenshot delivered with the topic guide pins the correct Save button at the dialog footer.
Figure 13: Case (i), Save-button mis-location on test task `maintenance_plan_orientation`. Prose like “Save at the bottom of the dialog” is ambiguous; the reference screenshot removes the ambiguity.