
# MMSkills: Towards Multimodal Skills for General Visual Agents
Source: [https://arxiv.org/html/2605.13527](https://arxiv.org/html/2605.13527)
Shuai Shao, Qingyao Li, Jianghao Lin, Lingyue Fu, Shijian Wang, Wenxiang Jiao, Yuan Lu, Weiwen Liu, Weinan Zhang, Yong Yu

###### Abstract

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior primarily as textual prompts, executable code, or learned routines. For visual agents, however, procedural knowledge is inherently multimodal: reuse depends not only on what operation to perform, but also on recognizing the relevant state, interpreting visual evidence of progress or failure, and deciding what to do next. We formalize this requirement as *multimodal procedural knowledge* and address three practical challenges: (I) *what* a multimodal skill package should contain; (II) *where* such packages can be derived from public interaction experience; and (III) *how* agents can consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots. We introduce *MMSkills*, a framework for representing, generating, and using reusable multimodal procedures for runtime visual decision making. Each MMSkill is a compact, state-conditioned package that couples a textual procedure with runtime state cards and multi-view keyframes. To construct these packages, we develop an agentic trajectory-to-skill Generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. To use them, we introduce a branch-loaded multimodal skill agent: selected state cards and keyframes are inspected in a temporary branch, aligned with the live environment, and distilled into structured guidance for the main agent. Experiments across GUI and game-based visual-agent benchmarks show that MMSkills consistently improve both frontier and smaller multimodal agents, suggesting that external multimodal procedural knowledge complements model-internal priors.

## 1 Introduction

Skills have become one of the central abstractions for building useful agents: recent systems store reusable behaviors as prompts, code, execution graphs, or learned routines (Wang et al., [2023a](https://arxiv.org/html/2605.13527#bib.bib40); Zheng et al., [2025](https://arxiv.org/html/2605.13527#bib.bib61); Chen et al., [2026](https://arxiv.org/html/2605.13527#bib.bib6); Wang et al., [2026a](https://arxiv.org/html/2605.13527#bib.bib39)). Despite differences in implementation, these skills largely share a common representational assumption: reusable knowledge can be expressed as a textual or code-level specification of actions. This design is effective when the relevant state can be adequately abstracted in language, but it is insufficient for multimodal agents whose decisions depend on visual evidence. For such agents, reusable experience must specify not only what operation to perform, but also how to recognize the relevant state and how visual evidence should guide the next decision. A desktop agent may know the correct operation but fail to recognize that a dialog is not yet ready; a game agent may know the intended goal but still require visual cues to distinguish progress from completion. This observation is consistent with human procedural learning, where visual information can complement verbal explanations (Mayer, [2009](https://arxiv.org/html/2605.13527#bib.bib27)). Consequently, text-only skills become verbose yet underspecified, whereas demonstrations preserve visual context but are lengthy, instance-specific, and difficult to adapt.

This gap suggests the need for *multimodal procedural knowledge*: reusable guidance that binds action procedures to the visual evidence and state-dependent decisions required for applying them. Such knowledge is not simply a text skill with screenshots attached. To be reusable, it must specify what procedure is being reused, when the procedure should or should not be used, which visible cues matter, and which evidence verifies progress, failure, or completion. Turning this requirement into practical multimodal skill libraries raises three central challenges:

- **Representation.** What should a multimodal skill package contain, and how should it bind procedures, visible cues, and verification cues into a coherent reusable unit?
- **Generation.** Where can such packages be derived from, if they must use public non-evaluation interaction experience rather than hand-written examples or raw demonstration replay?
- **Utilization.** How can an agent consult multimodal skill evidence at inference time while avoiding excessive image context, distracting state descriptions, and over-anchoring to reference screenshots?

We propose *MMSkills*, a framework for representing, generating, and utilizing reusable multimodal procedures for runtime visual decision making. Each MMSkill couples a *textual procedure*, which describes the reusable action pattern, with *runtime state cards*, which encode when-to-use and when-not-to-use conditions, visible cues, verification cues, and available views, and *multi-view keyframes*, which ground critical states through full-frame, focused, and optional before/after views. The resulting package is not a text instruction with illustrative images attached. It is a state-conditioned procedure whose visual evidence helps the agent decide when to follow, skip, or verify the procedure.

![Refer to caption](https://arxiv.org/html/2605.13527v1/x1.png)

Figure 1: A concrete MMSkills example. A multimodal skill package combines a textual procedure, runtime state cards, and multi-view visual evidence. For the same chart-creation task, text-only guidance can miss the active sheet state, while branch-loaded MMSkills align skill evidence with the live screen and return state-aware guidance for the main agent.

To *generate* the multimodal skill package, we introduce an *automated trajectory-to-skill Generator* built around an agentic, meta-skill-guided pipeline. This generation problem is substantially harder than text-skill extraction: while prior pipelines can often compress successful rollouts, failure analyses, or accumulated traces into reusable instructions or action abstractions (Zheng et al., [2025](https://arxiv.org/html/2605.13527#bib.bib61); Wang et al., [2026a](https://arxiv.org/html/2605.13527#bib.bib39); Alzubi et al., [2026](https://arxiv.org/html/2605.13527#bib.bib3); Ma et al., [2026](https://arxiv.org/html/2605.13527#bib.bib26); Xia et al., [2026](https://arxiv.org/html/2605.13527#bib.bib48); Li et al., [2026b](https://arxiv.org/html/2605.13527#bib.bib19)), generating MMSkills must also identify reusable visual states, select diagnostic frames, and bind each visual cue to the decision rule it supports. Our Generator operates on public trajectories that are *separate from evaluation tasks*: it groups related workflows, induces candidate procedures, merges overlapping candidates, grounds them in real non-test trajectory frames, and audits the resulting packages with reusable multimodal-skill-factory meta-skills. This process converts public interaction data into compact visual procedural knowledge without storing raw demonstrations as the skill.

For effective *utilization*, we introduce *branch loading* to consult the multimodal skills without injecting the entire package into the main trajectory. Existing skill agents commonly insert retrieved skills directly into the main interaction context. This loading pattern becomes problematic for MMSkills: a single package may contain several state cards together with multi-view screenshots, so direct insertion creates substantial context pressure and makes reference images compete with the live observation. More importantly, the main agent can become visually anchored to superficially similar reference screenshots, planning around the skill example rather than the current environment. Branch loading addresses this issue as a multimodal form of progressive disclosure over skill evidence (Xu and Yan, [2026](https://arxiv.org/html/2605.13527#bib.bib51)). When the main agent considers a skill, it opens a temporary branch that selects the needed state cards and keyframe views, aligns them with the live screen or scene, and returns compact structured guidance with applicability judgments, subgoals, and next-step plans. The main trajectory receives distilled decision support rather than the full skill package, as illustrated by the example in Figure [1](https://arxiv.org/html/2605.13527#S1.F1).

We evaluate MMSkills across GUI and game-based visual agent tasks, including OSWorld (Xie et al., [2024](https://arxiv.org/html/2605.13527#bib.bib49)), macOSWorld (Yang et al., [2025b](https://arxiv.org/html/2605.13527#bib.bib54)), VAB-Minecraft from VisualAgentBench (Liu et al., [2024a](https://arxiv.org/html/2605.13527#bib.bib22)), and Super Mario Bros in LMGame-Bench (Hu et al., [2025](https://arxiv.org/html/2605.13527#bib.bib12)). Across frontier and smaller multimodal models, MMSkills improve performance over no-skill and text-only skill conditions, suggesting that external visual procedural knowledge complements model-internal priors.

Our main contributions are summarized as follows:

- To the best of our knowledge, we are the first to introduce the *multimodal skill package*, formulating reusable skills for general visual agents as multimodal procedural knowledge: compact, state-conditioned units that organize textual procedures, runtime state cards, and multi-view keyframes for visual decision making.
- We develop an agentic trajectory-to-skill *Generator* that turns public, non-evaluation trajectories into multimodal skill packages through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing.
- We propose *branch loading*, a runtime mechanism that selects and aligns multimodal skill evidence in a temporary branch before returning structured decision support to the main agent.
- We demonstrate significant gains across GUI and game-based visual-agent benchmarks and multiple model families, showing that external multimodal procedural knowledge complements model-internal priors.

## 2 Methods

### 2.1 Overview

MMSkills are designed around three components: a *multimodal skill package* that stores reusable visual procedural knowledge, a *Skill Generation pipeline* that constructs such packages from public trajectories, and a *branch-loaded multimodal skill agent* that isolates skill-environment grounding in a temporary branch and returns distilled decision support to the main trajectory at inference time. Figure [2](https://arxiv.org/html/2605.13527#S2.F2) gives the system overview.

![Refer to caption](https://arxiv.org/html/2605.13527v1/x2.png)

Figure 2: Overview of the MMSkills framework. A multimodal skill package stores a reusable textual procedure, runtime state cards, and multi-view keyframes. A meta-skill-guided Generator converts public non-test trajectories into a reusable multimodal skill library. At inference time, the main visual agent uses branch loading to inspect selected skill evidence in a temporary branch and receives compact structured guidance before acting.

At a high level, the Generator maps non-evaluation trajectories $\mathcal{T}=\{\tau_i\}$ into a multimodal skill library $\mathcal{M}=\{M_i\}_{i=1}^{N}$. Before an episode begins, the runtime agent pre-recalls a task-level candidate set $\mathcal{C}_I \subset \mathcal{M}$ from the instruction $I$ and compact skill descriptors. During execution, the main agent observes the current visual observation $O_t$, maintains a short history $H_t$, and either acts directly or consults a temporary skill branch for some $M_t \in \mathcal{C}_I$:

$$
\begin{aligned}
\text{direct}:\quad & A_t=\pi_{\text{main}}(O_t,H_t,\mathcal{C}_I),\\
\text{branch}:\quad & G_t=\mathrm{Branch}(O_t,H_t,M_t),\qquad A_t=\pi_{\text{main}}(O_t,H_t,\mathcal{C}_I,G_t).
\end{aligned}\tag{1}
$$

The branch output is a structured guidance tuple

$$
G_t=\big(\text{applicable}_t,\ \text{subgoal}_t,\ \text{plan}_t,\ \text{do\_not\_do}_t,\ \text{verify}_t\big),\tag{2}
$$

where the fields respectively give the applicability judgment, local subgoal, skill-conditioned plan, negative constraints, and visual verification check. The main agent uses $G_t$ as decision support, while executable action grounding remains tied to the live observation.
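
To make the control flow of Eqs. (1)-(2) concrete, the following is a minimal sketch of the runtime loop. All callables (`recall`, `choose_skill`, `branch`, `pi_main`) are hypothetical stand-ins for the paper's retrieval, gating, branch, and main-policy components (LLM calls in practice); this is an illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Guidance:
    """Structured branch output G_t from Eq. (2)."""
    applicable: bool        # applicability judgment
    subgoal: str            # local subgoal
    plan: list[str]         # skill-conditioned next steps
    do_not_do: list[str]    # negative constraints
    verify: str             # visual progress/completion check

def run_episode(env, instruction: str, library: list,
                recall: Callable, choose_skill: Callable,
                branch: Callable, pi_main: Callable,
                max_steps: int = 20) -> None:
    """One episode of the runtime loop in Eq. (1)."""
    candidates = recall(instruction, library)    # task-level pre-recall C_I
    history: list = []                           # short history H_t
    obs = env.reset(instruction)                 # visual observation O_t
    for _ in range(max_steps):
        skill = choose_skill(obs, history, candidates)  # None -> act directly
        guidance: Optional[Guidance] = (
            branch(obs, history, skill) if skill is not None else None
        )
        # Executable action grounding stays tied to the live observation.
        action = pi_main(obs, history, candidates, guidance)
        if action == "DONE":
            break
        obs = env.step(action)
        history.append(action)
```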

### 2.2 Multimodal Skill Package

We represent each MMSkill as a state-conditioned procedure package

$$
M=(D,P,S,K),\tag{3}
$$

where $D$ is a compact descriptor, $P$ is a reusable textual procedure, $S=\{S_j\}_{j=1}^{m}$ is a set of runtime state cards, and $K=\{K_j\}_{j=1}^{m}$ is a set of keyframe bundles aligned with those cards. Each pair $(S_j,K_j)$ corresponds to one decision-relevant procedural state. The procedure specifies the reusable workflow; the state card specifies when the workflow is valid or invalid; and the keyframes make the state visually recognizable at runtime.

A runtime state card is an agent-facing state node rather than an image caption. It links a point in the procedure to when-to-use conditions, when-not-to-use conditions, visible cues, verification cues, and available views:

$$
S_j=\big(\text{when\_to\_use}_j,\ \text{when\_not\_to\_use}_j,\ \text{visible\_cues}_j,\ \text{verification\_cue}_j,\ \mathcal{V}_j\big),\qquad \mathcal{V}_j=\text{available\_views}_j.\tag{4}
$$

The first two fields define when the state should be followed or skipped, $\text{visible\_cues}_j$ states what evidence to inspect, $\text{verification\_cue}_j$ defines the progress or completion check, and $\mathcal{V}_j$ lists which views may be loaded. This schema makes the skill useful for decision making: the agent can decide whether to follow, skip, or verify the procedure.

Each key state is grounded by a small multi-view bundle. Let

$$
\mathcal{V}=\{\text{full\_frame},\ \text{focus\_crop},\ \text{before},\ \text{after}\}.\tag{5}
$$

Then

$$
K_j=\{K_j^{v} : v\in\mathcal{V}_j,\ v\in\mathcal{V}\}.\tag{6}
$$

The full-frame view preserves global context, the focus crop localizes the visual cue, and optional before/after views expose useful transitions. These images are reference evidence, not coordinates to copy. Under this representation, a text-only skill is the degenerate package $(D,P,\emptyset,\emptyset)$; MMSkills extend it by binding procedure, decision conditions, and visual evidence into one reusable unit.
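
As a concrete reading of Eqs. (3)-(6), the sketch below encodes the package as plain Python dataclasses. The field names follow the paper's schema; the container types (lists, view-name-to-image-path dicts) are our assumption, not a released format.

```python
from dataclasses import dataclass, field

# View vocabulary V from Eq. (5).
VIEW_TYPES = ("full_frame", "focus_crop", "before", "after")

@dataclass
class StateCard:
    """Runtime state card S_j from Eq. (4)."""
    when_to_use: str
    when_not_to_use: str
    visible_cues: list[str]
    verification_cue: str
    available_views: list[str]   # V_j, a subset of VIEW_TYPES

@dataclass
class MMSkill:
    """Package M = (D, P, S, K) from Eq. (3)."""
    descriptor: str                                                # D
    procedure: str                                                 # P
    state_cards: list[StateCard] = field(default_factory=list)    # S
    # K: one bundle per state card, mapping view name -> image path.
    keyframes: list[dict[str, str]] = field(default_factory=list)

    def is_text_only(self) -> bool:
        """The degenerate package (D, P, {}, {}) is a plain text skill."""
        return not self.state_cards and not self.keyframes
```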

### 2.3 Skill Generator from Public Trajectories

We build MMSkills from public interaction trajectories that are separate from evaluation tasks. A trajectory is

$$
\tau_i=(I_i,\ O_{i,1:T_i},\ A_{i,1:T_i}),\tag{7}
$$

where $I_i$ is the task instruction, $O_{i,t}$ are visual observations, and $A_{i,t}$ are executed actions. The Generator is controlled by a reusable multimodal-skill-factory meta-skill $\mathcal{F}$:

$$
\mathcal{G}_{\mathcal{F}}:\mathcal{T}_d\mapsto\mathcal{M}_d,\tag{8}
$$

where $\mathcal{T}_d$ is the public trajectory pool for domain $d$ and $\mathcal{M}_d$ is the generated domain skill library. The pipeline comprises five stages:

$$
\mathcal{T}_d
\xrightarrow{\text{Phase 0: embed+cluster}}\mathcal{C}_d
\xrightarrow{\text{Phase 1: cluster plan}}\mathcal{A}_d
\xrightarrow{\text{Phase 2: merge}}\mathcal{R}_d
\xrightarrow{\text{Phase 3: text draft}}\widehat{\mathcal{M}}_d
\xrightarrow{\text{Phase 4: image ground+audit}}\mathcal{M}_d.\tag{9}
$$
- **Phase 0: task embedding and clustering.** The pipeline embeds task instructions and trajectory metadata, then groups a broad domain into semantically focused clusters $\mathcal{C}_d$.
- **Phase 1: cluster-level skill planning.** For each cluster, an LLM-based agent proposes atomic skills with workflow boundaries, completion conditions, and covered task ids, producing a domain planning table $\mathcal{A}_d$.
- **Phase 2: skill merging.** Cluster-level plans are deduplicated, merged, and generalized into merged skill specifications $\mathcal{R}_d$, while overly broad umbrella skills are rejected.
- **Phase 3: text-first drafting.** Without reading images, the Generator selects reference tasks and drafts the descriptor $D$, textual procedure $P$, and planned state cards, yielding $\widehat{\mathcal{M}}_d$.
- **Phase 4: image grounding and audit.** The Generator reads selected keyframes, grounds focus regions, constructs multi-view bundles, and audits the final packages.

For a merged skill $r\in\mathcal{R}_d$, finalization is written as

$$
\widehat{M}_r=(D_r,P_r,\widehat{S}_r,\widehat{K}_r)\xrightarrow{\text{ground+audit}}M_r=(D_r,P_r,S_r,K_r).\tag{10}
$$

The visual grounding policy is conservative: views are added only for state recognition, transition comparison, or completion verification, so the skill stores diagnostic states rather than replaying demonstrations. The meta-skill $\mathcal{F}$ supplies reusable scripts, schemas, and quality gates for the LLM-based Generator, while external services are limited to bounded support steps such as embedding/clustering and grounding.
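
A schematic of the five-phase pipeline in Eq. (9) follows. Each stage function is a hypothetical wrapper around the corresponding embedding or LLM step, so this is a sketch of the control flow under our reading of the phases, not the released Generator.

```python
def generate_skill_library(trajectories, embed, cluster, plan_cluster,
                           merge_plans, draft_text, ground_and_audit):
    """Map a domain trajectory pool T_d to a skill library M_d (Eq. (8))."""
    # Phase 0: embed task instructions/metadata, cluster the domain into C_d.
    vectors = [embed(t) for t in trajectories]
    clusters = cluster(vectors, trajectories)
    # Phase 1: per-cluster atomic-skill plans (workflow boundaries,
    # completion conditions, covered task ids) -> planning table A_d.
    plans = [plan_cluster(c) for c in clusters]
    # Phase 2: deduplicate, merge, and generalize; reject overly broad
    # umbrella skills -> merged specifications R_d.
    merged = merge_plans(plans)
    # Phase 3: text-first drafting of descriptor, procedure, and planned
    # state cards, without reading any images -> draft library.
    drafts = [draft_text(spec, trajectories) for spec in merged]
    # Phase 4: read selected keyframes, ground focus regions, build
    # multi-view bundles, and audit the final packages (Eq. (10)) -> M_d.
    return [ground_and_audit(d) for d in drafts]
```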

### 2.4 Branch-loaded Multimodal Skills Agent

Most skill-using agents load a retrieved skill directly into the main interaction context. For short text skills, this is reasonable: the skill is read as an additional instruction alongside the observation. For MMSkills, direct loading is brittle because state cards, multi-view keyframes, and transition examples add substantial context pressure, and irrelevant reference views can anchor the agent away from the live environment. Figure [2](https://arxiv.org/html/2605.13527#S2.F2)(C) illustrates the branch-loaded alternative, which moves skill-environment grounding out of the main trajectory.

**Stage 1: gated view selection.** Suppose the main agent calls $M_t=(D_t,P_t,S_t,K_t)\in\mathcal{C}_I$. The branch first selects which state cards and view types are relevant to the live observation:

$$
(J_t,R_t)=\mathrm{SelectViews}(O_t,H_{t-1},P_t,S_t),\qquad V_t=\{K_j^{v} : j\in J_t,\ v\in R_{t,j}\},\tag{11}
$$

where $J_t$ indexes selected state cards and $R_{t,j}\subseteq\mathcal{V}_j$ selects views for state $j$. The selector reads the live observation, recent history, textual procedure, and state-card descriptions before loading images. If text and state cards are sufficient, $R_{t,j}$ may be empty.

**Stage 2: branch planning.** The branch then aligns the selected evidence with the live state and returns structured guidance:

$$
G_t=\mathrm{PlanBranch}(O_t,H_{t-1},P_t,\{S_j : j\in J_t\},V_t),\tag{12}
$$

where $G_t$ follows Eq. [2](https://arxiv.org/html/2605.13527#S2.E2). The main agent does not execute $G_t$ mechanically; it uses $G_t$ as an intermediate planning signal and still chooses a grounded action from the live screenshot. This preserves procedural guidance without allowing reference images to override the current observation. Appendix [9](https://arxiv.org/html/2605.13527#S9) gives the full runtime loop in Algorithm [1](https://arxiv.org/html/2605.13527#alg1), and Appendix [10](https://arxiv.org/html/2605.13527#S10) reports the prompt templates used by the main agent and the two branch stages.
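
Putting Eqs. (11)-(12) together, here is a minimal sketch of the two-stage branch, assuming `select_views` and `plan_branch` are LLM calls wrapped as Python callables and reusing the hypothetical `MMSkill` structure sketched in Section 2.2:

```python
def branch_load(obs, history, skill, select_views, plan_branch):
    """Consult one skill in a temporary branch and return guidance G_t."""
    # Stage 1: gated view selection (Eq. (11)). The selector reads only text
    # (procedure + state cards) before any image is loaded, and may select
    # no views at all if the text already suffices.
    card_ids, view_choices = select_views(obs, history,
                                          skill.procedure, skill.state_cards)
    selected = {
        (j, v): skill.keyframes[j][v]
        for j in card_ids
        for v in view_choices.get(j, [])       # R_{t,j} may be empty
        if v in skill.keyframes[j]
    }
    # Stage 2: branch planning (Eq. (12)). Align the selected evidence with
    # the live state and distill it into the structured guidance of Eq. (2).
    cards = [skill.state_cards[j] for j in card_ids]
    return plan_branch(obs, history, skill.procedure, cards, selected)
```

Only the returned guidance enters the main context; the inspected state cards and keyframes stay inside the branch, which is what keeps context pressure bounded.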

## 3 Experiments

We evaluate whether MMSkills provide useful external procedural knowledge for visual agents. The experiments are organized around four research questions:

- **RQ1: Overall performance on GUI and game tasks.** Do MMSkills improve visual agents across realistic desktop environments and open-ended visual game tasks?
- **RQ2: Ablations of skill content and branch loading.** Which parts of MMSkills matter, and how do branch loading and view selection affect multimodal skill use?
- **RQ3: Skill usage and interaction dynamics.** How often are MMSkills invoked, how do they affect interaction length, and which visual views are selected at runtime?
- **RQ4: Behavioral shift analysis.** How do MMSkills change the agent's low-level action patterns beyond final success rate?

### 3.1 Experimental Setup

In all settings, agents plan from visual observations, namely desktop or game screenshots. We evaluate on OSWorld (Xie et al., [2024](https://arxiv.org/html/2605.13527#bib.bib49)), macOSWorld (Yang et al., [2025b](https://arxiv.org/html/2605.13527#bib.bib54)), VAB-Minecraft from VisualAgentBench (Liu et al., [2024a](https://arxiv.org/html/2605.13527#bib.bib22)), and Super Mario Bros from LMGame-Bench (Hu et al., [2025](https://arxiv.org/html/2605.13527#bib.bib12)), covering both realistic GUI tasks and open visual game environments. Detailed benchmark descriptions and test-case distributions are provided in Appendix [6](https://arxiv.org/html/2605.13527#S6); implementation details, evaluation protocols, model choices, and runtime variants are given in Appendix [8](https://arxiv.org/html/2605.13527#S8).

All skills are extracted from non-test data. We evaluate frontier and smaller multimodal models and compare *no-skill*, *text-only skill*, and *MMSkills* conditions, with direct-loading variants studied in the ablations. Dataset-specific skill sources, source statistics, and skill-package distributions are provided in Appendix [7](https://arxiv.org/html/2605.13527#S7).

### 3.2 RQ1: Overall Performance on GUI and Game Tasks

Table [1](https://arxiv.org/html/2605.13527#S3.T1) reports OSWorld application-level success rates, and Table [2](https://arxiv.org/html/2605.13527#S3.T2) reports the auxiliary GUI and game results. **MMSkills improve OSWorld overall performance across all evaluated model families.** Overall success increases for Gemini 3.1 Pro (44.08% → 50.11%), Gemini 3 Flash (36.65% → 47.97%), Qwen3-VL-235B (21.34% → 39.17%), GLM-5V, and Kimi-K2.6. Text-only skills help but are less stable across domains, suggesting that procedures alone are insufficient when skill use depends on visual state matching. **External multimodal procedural knowledge is especially valuable for weaker visual agents.** For Qwen3-VL-8B-Instruct, MMSkills raise OSWorld from 10.78% to 25.40% and VAB-Minecraft from 23.28% to 38.79%, indicating that explicit visual procedural knowledge can compensate for limited model-internal priors.

**The gains transfer beyond Ubuntu desktop tasks.** On macOSWorld, MMSkills improve the completed large-model runs, including Gemini 3 Flash and GLM-5V, while VAB-Minecraft shows consistent gains in both success rate and average score across all evaluated models. Super Mario Bros follows the same pattern in the completed runs, with higher total performance and reward under MMSkills. These results indicate that MMSkills are not specialized to a single GUI benchmark; the same state-conditioned skill format helps in visually grounded game settings where recurring states and action strategies can be reused.

Table 1: OSWorld application-level success rates. All entries are percentages. “Calc”, “Impress”, and “Writer” denote LibreOffice applications.

| Base model | Skill condition | Chrome | GIMP | Calc | Impress | Writer | Multi-app | OS | Mail | VLC | VS Code | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3.1 Pro | No skill | 53.47 | 34.62 | 57.45 | 40.43 | 47.82 | 31.97 | 54.17 | 40.00 | 35.29 | 56.52 | 44.08 |
| | Text-only | 44.35 | 34.62 | 38.30 | 40.34 | 56.52 | 22.38 | 70.83 | 66.67 | 41.18 | 56.52 | 40.76 |
| | MMSkills | 59.91 | 50.00 | 53.19 | 53.19 | 60.86 | 24.11 | 70.83 | 66.67 | 70.59 | 65.22 | 50.11 |
| Gemini 3 Flash | No skill | 37.78 | 50.00 | 38.30 | 29.73 | 52.17 | 21.51 | 54.17 | 66.67 | 52.39 | 47.83 | 36.65 |
| | Text-only | 51.02 | 23.08 | 38.30 | 34.00 | 56.52 | 19.16 | 54.17 | 60.00 | 58.82 | 52.17 | 40.27 |
| | MMSkills | 55.37 | 42.31 | 53.19 | 40.34 | 56.52 | 30.98 | 75.00 | 66.67 | 52.94 | 60.87 | 47.97 |
| Qwen3-VL-235B | No skill | 15.56 | 38.46 | 17.02 | 25.53 | 43.48 | 9.48 | 25.00 | 26.67 | 17.65 | 34.78 | 21.34 |
| | Text-only | 42.22 | 50.00 | 10.64 | 21.31 | 34.78 | 14.86 | 33.33 | 60.00 | 35.29 | 47.83 | 28.57 |
| | MMSkills | 59.91 | 69.23 | 23.40 | 32.01 | 47.82 | 19.35 | 41.67 | 73.33 | 41.18 | 56.52 | 39.17 |
| GLM-5V | No skill | 37.78 | 19.23 | 21.28 | 29.70 | 26.08 | 18.70 | 54.17 | 53.33 | 11.76 | 47.83 | 28.71 |
| | Text-only | 53.24 | 53.85 | 31.91 | 31.98 | 52.17 | 20.24 | 20.83 | 46.67 | 35.29 | 65.22 | 36.61 |
| | MMSkills | 51.02 | 53.85 | 31.91 | 31.83 | 43.47 | 22.26 | 66.67 | 40.00 | 23.53 | 65.22 | 38.51 |
| Kimi-K2.6 | No skill | 51.02 | 34.62 | 34.04 | 35.32 | 30.43 | 14.86 | 54.17 | 66.67 | 32.60 | 52.17 | 34.98 |
| | Text-only | 57.69 | 40.00 | 40.43 | 36.14 | 17.38 | 22.38 | 62.50 | 53.33 | 58.82 | 43.48 | 39.66 |
| | MMSkills | 57.69 | 42.31 | 40.43 | 48.92 | 60.86 | 23.40 | 79.17 | 73.33 | 41.18 | 69.57 | 46.59 |
| Qwen3-VL-8B-Instruct | No skill | 15.47 | 7.69 | 2.13 | 8.59 | 4.34 | 7.33 | 25.00 | 13.33 | 29.41 | 17.39 | 10.78 |
| | Text-only | 19.91 | 11.54 | 6.38 | 16.99 | 17.39 | 7.33 | 16.67 | 33.33 | 17.65 | 34.78 | 14.93 |
| | MMSkills | 39.91 | 42.31 | 8.51 | 23.37 | 17.39 | 13.43 | 25.00 | 60.00 | 29.41 | 47.83 | 25.40 |

*Note:* Due to the substantially higher inference cost and wall-clock time of Gemini 3.1 Pro and Kimi-K2.6, we report their full three-condition results only on OSWorld.

Table 2: Auxiliary GUI and game-based visual-agent results. macOSWorld reports domain-level and overall success rates; VAB-Minecraft (MC) reports success rate and average score; Super Mario Bros (SMB) reports total performance and total reward.

| Base model | Skill condition | macOS: File | macOS: Media | macOS: Prod. | macOS: Sys/IF | macOS: Apps | macOS: Overall | MC: Success | MC: Avg. score | SMB: Total perf. | SMB: Total reward |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini 3 Flash | No skill | 41.38 | 33.33 | 60.00 | 62.07 | 55.79 | 55.94 | 67.24 | 0.7462 | 411.00 | 766.67 |
| | Text-only | 31.03 | 25.00 | 62.86 | 75.86 | 55.26 | 53.85 | 68.96 | 0.7541 | 548.00 | 912.00 |
| | MMSkills | 58.62 | 50.00 | 77.14 | 65.52 | 65.73 | 65.73 | 73.28 | 0.7884 | 624.00 | 1081.33 |
| Qwen3-VL-235B | No skill | 31.03 | 58.33 | 51.43 | 58.62 | 44.74 | 47.55 | 52.59 | 0.6308 | 454.50 | 955.50 |
| | Text-only | 34.48 | 33.33 | 37.14 | 51.72 | 52.63 | 43.36 | 55.17 | 0.6634 | 610.50 | 1138.25 |
| | MMSkills | 37.93 | 33.33 | 54.29 | 62.07 | 57.89 | 51.75 | 62.07 | 0.7114 | 788.00 | 1514.25 |
| GLM-5V | No skill | 24.14 | 16.67 | 40.00 | 41.38 | 39.47 | 34.97 | 56.03 | 0.6701 | 612.75 | 1191.50 |
| | Text-only | 31.03 | 66.67 | 62.86 | 58.62 | 47.37 | 51.75 | 61.20 | 0.6938 | 794.50 | 1218.00 |
| | MMSkills | 44.83 | 66.67 | 48.57 | 58.62 | 50.00 | 51.75 | 68.10 | 0.7495 | 950.50 | 1384.50 |
| Qwen3-VL-8B-Instruct | No skill | 10.34 | 0.00 | 14.29 | 3.45 | 0.00 | 6.29 | 23.28 | 0.3017 | 415.25 | 928.75 |
| | Text-only | 0.00 | 8.33 | 2.86 | 3.45 | 10.53 | 4.90 | 29.31 | 0.3754 | 596.50 | 997.25 |
| | MMSkills | 6.90 | 8.33 | 8.57 | 3.45 | 5.26 | 6.29 | 38.79 | 0.4668 | 764.00 | 1128.75 |

![Refer to caption](https://arxiv.org/html/2605.13527v1/x3.png)

Figure 3: Ablation results for MMSkills components and branch loading. Bars report percentage-point gains over the no-skill baseline. Panel (A) removes runtime state cards or visual keyframes from the skill package. Panel (B) compares direct loading with branch loading and with or without view selection.
### 3.3 RQ2: Ablations of Skill Content and Branch Loading

Figure [3](https://arxiv.org/html/2605.13527#S3.F3) combines the skill-content and branch-loading ablations. Unless otherwise stated, skill variants use the branch-loaded agent; the main exception is *Direct load*, which inserts skill content into the main context. For skill content, we compare text-only skills, MMSkills without state cards, MMSkills without images, and the complete MMSkills package. **State cards and multi-view visual evidence both improve skill utility.** Text-only branch loading already improves over the no-skill baseline, but the complete MMSkills package is consistently stronger. Removing state cards weakens the agent's ability to distinguish relevant runtime states, while removing images preserves decision rules but removes visual grounding evidence. Both removals reduce performance on OSWorld and VAB-Minecraft, confirming that state cards and keyframes play complementary roles: one supports state discrimination, and the other helps the agent recognize the corresponding visual evidence. **Branch loading helps even for text-only skills.** The branch-loaded text-only variant is stronger than direct text loading in most model–benchmark pairs, indicating that the temporary branch improves skill interpretation even before multimodal evidence is introduced.

For branch loading, we ablate whether skill evidence is inspected in a temporary branch and whether Stage-1 view selection filters state cards and keyframes. **Branch loading and view selection address different failure modes.** Direct-full loading hurts performance because unfiltered images and state descriptions pollute the main context; view selection alone reduces this damage but stays near baseline. Branch loading already gives clear gains, and the full two-stage design performs best, indicating that separated evidence inspection and filtered visual evidence are both necessary.

### 3.4 RQ3: Skill Usage and Interaction Dynamics

Table [3](https://arxiv.org/html/2605.13527#S3.T3) analyzes when and how agents call skills. **MMSkills are invoked more often than text-only skills.** Invocation coverage increases on both OSWorld and VAB-Minecraft for Gemini 3 Flash and Qwen3-VL-235B, with the largest OSWorld change rising from 37.50% to 65.28% for Qwen3-VL-235B. This suggests that multimodal skills make external knowledge easier to recognize as relevant: state cards expose when-to-use and when-not-to-use conditions, and visual cues help the agent detect when its current observation matches a reusable procedural state.

Table 3: Skill invocation, interaction length, and selected views. “Invoked” is the percentage of cases with at least one skill call, and “Step $\Delta$” is relative to the no-skill baseline.

| Benchmark | Model | Skill condition | Invoked (%) | Calls/case | Steps | Step $\Delta$ | Views (Full/Focus/Before/After) |
|---|---|---|---|---|---|---|---|
| OSWorld | Gemini 3 Flash | No skill | – | – | 13.11 | 0.00 | – |
| | | Text-only | 41.11 | 0.7139 | 15.64 | +2.53 | – |
| | | MMSkills | 62.50 | 0.9556 | 11.86 | −1.25 | 79/241/8/24 |
| | Qwen3-VL-235B | No skill | – | – | 15.22 | 0.00 | – |
| | | Text-only | 37.50 | 0.4917 | 13.34 | −1.88 | – |
| | | MMSkills | 65.28 | 0.9222 | 9.87 | −5.35 | 40/27/17/13 |
| VAB-Minecraft | Gemini 3 Flash | No skill | – | – | 16.92 | 0.00 | – |
| | | Text-only | 68.97 | 1.8706 | 17.30 | +0.38 | – |
| | | MMSkills | 81.90 | 2.4310 | 13.75 | −3.17 | 105/205/15/12 |
| | Qwen3-VL-235B | No skill | – | – | 34.74 | 0.00 | – |
| | | Text-only | 54.31 | 1.5776 | 31.36 | −3.38 | – |
| | | MMSkills | 64.66 | 2.3534 | 27.07 | −7.67 | 98/196/13/10 |

**MMSkills shorten trajectories rather than merely adding extra consultation.** Text-only skills can add overhead when they provide procedural hints without visual grounding, but MMSkills reduce average steps in every setting, with the largest reductions appearing for Qwen3-VL-235B. These reductions indicate that multimodal skills help agents find shorter task-solving paths and avoid unnecessary exploration or repeated low-value actions. **Focus crops dominate selected visual evidence.** The branch does not load all views uniformly: focus crops are selected most frequently in three of four settings, while full-frame, before, and after views provide global context, transition evidence, and completion references when local crops alone are insufficient.
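
For reference, the Table 3 statistics can be computed from per-episode logs roughly as below; the log schema (`skill_calls` and `steps` fields) is our assumption for illustration, not the paper's evaluation harness.

```python
def usage_stats(episodes, baseline_avg_steps):
    """Invocation coverage, call rate, and step delta as reported in Table 3."""
    n = len(episodes)
    # "Invoked": percentage of cases with at least one skill call.
    invoked_pct = 100.0 * sum(e["skill_calls"] > 0 for e in episodes) / n
    calls_per_case = sum(e["skill_calls"] for e in episodes) / n
    avg_steps = sum(e["steps"] for e in episodes) / n
    return {
        "invoked_pct": invoked_pct,
        "calls_per_case": calls_per_case,
        "steps": avg_steps,
        "step_delta": avg_steps - baseline_avg_steps,  # vs. the no-skill runs
    }
```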

### 3.5 RQ4: Behavioral Shift Analysis

![Refer to caption](https://arxiv.org/html/2605.13527v1/x4.png)

Figure 4: Behavioral shifts induced by MMSkills on OSWorld. Panel (A) reports the distribution of executed action primitives. Panel (B) compares the average number of low-level primitives per task. Panel (C) measures repetitive behavior through exact repeated actions, repeated action modes, and the longest same-mode run normalized by the 20-step budget.

Figure [4](https://arxiv.org/html/2605.13527#S3.F4) shows that the effect of MMSkills is not merely a success-rate gain. **MMSkills reduce low-level action load.** Gemini 3 Flash uses substantially fewer primitives per task, and Qwen3-VL-235B shows a similar reduction, especially in click actions. This supports the view that multimodal state cards and visual evidence constrain the agent's search space: the agent performs fewer exploratory GUI operations before reaching a useful state. **The behavioral shift is strongest for Qwen3-VL-235B.** Its click share drops from 75.8% to 63.7%, while keyboard and DONE actions increase, suggesting that MMSkills help click-heavy agents move toward more structured input and stronger completion judgments.

**MMSkills suppress repetitive trajectories and improve completion awareness.** The effect is clearest for Qwen3-VL-235B: exact repeated actions fall from 21.8% to 6.2%, and the longest same-mode run decreases substantially. Gemini 3 Flash shows the same direction of change, though from a stronger baseline. MMSkills also increase DONE behavior for both models, indicating that state cards and verification cues help agents decide not only what to do next, but also when the task is complete. Overall, MMSkills reshape agent behavior from exploratory trial-and-error toward grounded, state-aware execution; Appendix [11](https://arxiv.org/html/2605.13527#S11) provides the GLM-5V and Kimi-K2.6 analysis.
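
Under our reading of the Figure 4(C) caption, the three repetition measures can be computed per trajectory roughly as follows; the exact definitions used in the paper may differ.

```python
def repetition_stats(actions, modes, budget=20):
    """Repetition measures in the spirit of Figure 4(C).

    actions: executed low-level actions; modes: their primitive type
    (e.g. "click", "keyboard", "done"). One-step definitions assumed:
    an exact repeat means the action equals the previous action; a mode
    repeat means the primitive type matches the previous step's type."""
    transitions = max(len(actions) - 1, 1)
    exact = sum(a == b for a, b in zip(actions, actions[1:]))
    mode = sum(m == n for m, n in zip(modes, modes[1:]))
    longest = run = (1 if modes else 0)
    for prev, cur in zip(modes, modes[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return {
        "exact_repeat_rate": exact / transitions,
        "mode_repeat_rate": mode / transitions,
        "longest_mode_run_norm": longest / budget,  # normalized by step budget
    }
```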

## 4 Related Work

#### Skills for agents.

Skill reuse has roots in temporal abstraction and motor primitives (Sutton et al., [1999](https://arxiv.org/html/2605.13527#bib.bib36); Ijspeert et al., [2013](https://arxiv.org/html/2605.13527#bib.bib13)), and recent LLM agents store reusable behavior as language, code, APIs, or learned libraries (Ahn et al., [2022](https://arxiv.org/html/2605.13527#bib.bib2); Liang et al., [2023](https://arxiv.org/html/2605.13527#bib.bib20); Yao et al., [2023](https://arxiv.org/html/2605.13527#bib.bib56); Shinn et al., [2023](https://arxiv.org/html/2605.13527#bib.bib35); Wang et al., [2023a](https://arxiv.org/html/2605.13527#bib.bib40); Zheng et al., [2025](https://arxiv.org/html/2605.13527#bib.bib61); Chen et al., [2026](https://arxiv.org/html/2605.13527#bib.bib6); Wang et al., [2026a](https://arxiv.org/html/2605.13527#bib.bib39); Alzubi et al., [2026](https://arxiv.org/html/2605.13527#bib.bib3); Ma et al., [2026](https://arxiv.org/html/2605.13527#bib.bib26); Xia et al., [2026](https://arxiv.org/html/2605.13527#bib.bib48)). A complementary line treats accumulated experience as long-term agent memory (Park et al., [2023](https://arxiv.org/html/2605.13527#bib.bib29); Packer et al., [2024](https://arxiv.org/html/2605.13527#bib.bib28)), while surveys and benchmarks evaluate skill relevance, selection, and safety (Xu and Yan, [2026](https://arxiv.org/html/2605.13527#bib.bib51); Li et al., [2026b](https://arxiv.org/html/2605.13527#bib.bib19); Wang et al., [2026b](https://arxiv.org/html/2605.13527#bib.bib41); Liu et al., [2026](https://arxiv.org/html/2605.13527#bib.bib24)). MMSkills follows this modular view but stores state-conditioned multimodal packages and uses branch loading instead of inserting full skill memory; Appendix [15](https://arxiv.org/html/2605.13527#S15) expands the discussion.

#### Visual agents.

Visual-agent benchmarks span web, mobile, desktop, and embodied environments (Deng et al., [2023](https://arxiv.org/html/2605.13527#bib.bib8); Zhou et al., [2024](https://arxiv.org/html/2605.13527#bib.bib62); Koh et al., [2024](https://arxiv.org/html/2605.13527#bib.bib15); He et al., [2024](https://arxiv.org/html/2605.13527#bib.bib10); Rawles et al., [2025](https://arxiv.org/html/2605.13527#bib.bib32); Xie et al., [2024](https://arxiv.org/html/2605.13527#bib.bib49); Yang et al., [2025b](https://arxiv.org/html/2605.13527#bib.bib54); Liu et al., [2024a](https://arxiv.org/html/2605.13527#bib.bib22)), and model and framework work improves screenshot grounding and GUI control (Cheng et al., [2024](https://arxiv.org/html/2605.13527#bib.bib7); Wu et al., [2024](https://arxiv.org/html/2605.13527#bib.bib47); Qin et al., [2025](https://arxiv.org/html/2605.13527#bib.bib30); Agashe et al., [2024](https://arxiv.org/html/2605.13527#bib.bib1); Hong et al., [2024](https://arxiv.org/html/2605.13527#bib.bib11); Zheng et al., [2024](https://arxiv.org/html/2605.13527#bib.bib60); Zhang et al., [2023](https://arxiv.org/html/2605.13527#bib.bib57); Lu et al., [2024](https://arxiv.org/html/2605.13527#bib.bib25)). Dedicated grounding benchmarks measure how reliably models localize UI elements from instructions (Li et al., [2025a](https://arxiv.org/html/2605.13527#bib.bib16); Gou et al., [2025](https://arxiv.org/html/2605.13527#bib.bib9); Wang et al., [2025b](https://arxiv.org/html/2605.13527#bib.bib44); Xu et al., [2025](https://arxiv.org/html/2605.13527#bib.bib52)). MMSkills builds on these capabilities but operates higher: it tells the agent which procedural state matters and what visual evidence confirms it.

Closest to our work, Mirage-1 introduces hierarchical multimodal skills, XSkill extracts skills from visually grounded experience, and CUA-Skill represents computer-use skills as parameterized procedures and execution graphs (Xie et al., [2025](https://arxiv.org/html/2605.13527#bib.bib50); Jiang et al., [2026](https://arxiv.org/html/2605.13527#bib.bib14); Chen et al., [2026](https://arxiv.org/html/2605.13527#bib.bib6)). MMSkills differs by organizing skills around runtime state cards and multi-view evidence, and by using branch loading to align selected evidence with the live observation before the main agent acts.

## 5 Conclusion and Limitations

We introduced *MMSkills*, a framework that represents reusable skills for visual agents as multimodal procedural knowledge. By combining textual procedures, runtime state cards, multi-view keyframes, and branch-loaded use, MMSkills improve GUI and game-based visual agents across model families. The main limitations are dependence on source-trajectory coverage, possible errors from skill generation or visual grounding, and extra inference cost from branch loading. Extending MMSkills to broader embodied or safety-critical settings will require stronger verification and online skill repair.

## References

- Agashe et al. (2024) Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent S: An open agentic framework that uses computers like a human, 2024. URL [https://arxiv.org/abs/2410.08164](https://arxiv.org/abs/2410.08164).
- Ahn et al. (2022) Michael Ahn, Anthony Brohan, Noah Brown, et al. Do as i can, not as i say: Grounding language in robotic affordances, 2022. URL [https://arxiv.org/abs/2204.01691](https://arxiv.org/abs/2204.01691).
- Alzubi et al. (2026) Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems, 2026. URL [https://arxiv.org/abs/2603.02766](https://arxiv.org/abs/2603.02766).
- Bai et al. (2025) Shuai Bai, Yuxuan Cai, Ruizhe Chen, et al. Qwen3-VL technical report, 2025. URL [https://arxiv.org/abs/2511.21631](https://arxiv.org/abs/2511.21631).
- Bai et al. (2024) Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding, 2024. URL [https://arxiv.org/abs/2308.14508](https://arxiv.org/abs/2308.14508).
- Chen et al. (2026) Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Leon Xu, Suzhen Zheng, Hao Fan, Pashmina Cameron, Justin Wagle, and Kazuhito Koishida. CUA-skill: Develop skills for computer using agent, 2026. URL [https://arxiv.org/abs/2601.21123](https://arxiv.org/abs/2601.21123).
- Cheng et al. (2024) Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing GUI grounding for advanced visual GUI agents. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics*, pages 9313–9332. Association for Computational Linguistics, 2024. doi:10.18653/V1/2024.ACL-LONG.505. URL [https://doi.org/10.18653/v1/2024.acl-long.505](https://doi.org/10.18653/v1/2024.acl-long.505).
- Deng et al. (2023) Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web, 2023. URL [https://arxiv.org/abs/2306.06070](https://arxiv.org/abs/2306.06070).
- Gou et al. (2025) Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents, 2025. URL [https://arxiv.org/abs/2410.05243](https://arxiv.org/abs/2410.05243).
- He et al. (2024) Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics*, pages 6864–6890. Association for Computational Linguistics, 2024. doi:10.18653/V1/2024.ACL-LONG.371. URL [https://doi.org/10.18653/v1/2024.acl-long.371](https://doi.org/10.18653/v1/2024.acl-long.371).
- Hong et al. (2024) Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for GUI agents, 2024. URL [https://arxiv.org/abs/2312.08914](https://arxiv.org/abs/2312.08914).
- Hu et al. (2025) Lanxiang Hu, Mingjia Huo, Yuxuan Zhang, Haoyang Yu, Eric P. Xing, Ion Stoica, Tajana Rosing, Haojian Jin, and Hao Zhang. lmgame-bench: How good are LLMs at playing games?, 2025. URL [https://arxiv.org/abs/2505.15146](https://arxiv.org/abs/2505.15146).
- Ijspeert et al. (2013) Auke Jan Ijspeert, Jun Nakanishi, Heiko Hoffmann, Peter Pastor, and Stefan Schaal. Dynamical movement primitives: Learning attractor models for motor behaviors. *Neural Computation*, 25(2):328–373, 2013. doi:10.1162/NECO_a_00393. URL [https://doi.org/10.1162/NECO_a_00393](https://doi.org/10.1162/NECO_a_00393).
- Jiang et al. (2026) Guanyu Jiang, Zhaochen Su, Xiaoye Qu, and Yi R. Fung. Xskill: Continual learning from experience and skills in multimodal agents, 2026. URL [https://arxiv.org/abs/2603.12056](https://arxiv.org/abs/2603.12056).
- Koh et al. (2024) Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics*, pages 881–905. Association for Computational Linguistics, 2024. doi:10.18653/V1/2024.ACL-LONG.50. URL [https://doi.org/10.18653/v1/2024.acl-long.50](https://doi.org/10.18653/v1/2024.acl-long.50).
- Li et al. (2025a) Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: GUI grounding for professional high-resolution computer use, 2025a. URL [https://arxiv.org/abs/2504.07981](https://arxiv.org/abs/2504.07981).
- Li et al. (2025b) Qingyao Li, Wei Xia, Kounianhua Du, Xinyi Dai, Ruiming Tang, Yasheng Wang, Yong Yu, and Weinan Zhang. Rethinkmcts: Refining erroneous thoughts in Monte Carlo tree search for code generation, 2025b. URL [https://arxiv.org/abs/2409.09584](https://arxiv.org/abs/2409.09584).
- Li et al. (2026a) Qingyao Li, Xinyi Dai, Weiwen Liu, Xiangyang Li, Yasheng Wang, Ruiming Tang, Yong Yu, and Weinan Zhang. ATGen: Adversarial reinforcement learning for test case generation. In *The Fourteenth International Conference on Learning Representations*, 2026a. URL [https://openreview.net/forum?id=Sxj4o3qXtl](https://openreview.net/forum?id=Sxj4o3qXtl).
- Li et al. (2026b) Xiangyi Li, Wenbo Chen, Yimin Liu, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks, 2026b. URL [https://arxiv.org/abs/2602.12670](https://arxiv.org/abs/2602.12670).
- Liang et al. (2023) Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In *IEEE International Conference on Robotics and Automation, ICRA 2023*, pages 9493–9500. IEEE, 2023. doi:10.1109/ICRA48891.2023.10160591. URL [https://doi.org/10.1109/ICRA48891.2023.10160591](https://doi.org/10.1109/ICRA48891.2023.10160591).
- Liu et al. (2023) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023. URL [https://arxiv.org/abs/2307.03172](https://arxiv.org/abs/2307.03172).
- Liu et al. (2024a) Xiao Liu, Tianjie Zhang, Yu Gu, et al. Visualagentbench: Towards large multimodal models as visual foundation agents, 2024a. URL [https://arxiv.org/abs/2408.06327](https://arxiv.org/abs/2408.06327).
- Liu et al. (2024b) Yifan Liu, Kangning Zhang, Xiangyuan Ren, Yanhua Huang, Jiarui Jin, Yingjie Qin, Ruilong Su, Ruiwen Xu, Yong Yu, and Weinan Zhang. Alignrec: Aligning and training in multimodal recommendations. In *Proceedings of the 33rd ACM International Conference on Information and Knowledge Management*, CIKM ’24, pages 1503–1512, New York, NY, USA, 2024b. Association for Computing Machinery. ISBN 9798400704369. doi:10.1145/3627673.3679626. URL [https://doi.org/10.1145/3627673.3679626](https://doi.org/10.1145/3627673.3679626).
- Liu et al. (2026) Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, and Shiyu Chang. How well do agentic skills work in the wild: Benchmarking LLM skill usage in realistic settings, 2026. URL [https://arxiv.org/abs/2604.04323](https://arxiv.org/abs/2604.04323).
- Lu et al. (2024) Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based GUI agent, 2024. URL [https://arxiv.org/abs/2408.00203](https://arxiv.org/abs/2408.00203).
- Ma et al. (2026) Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver, 2026. URL [https://arxiv.org/abs/2604.08377](https://arxiv.org/abs/2604.08377).
- Mayer (2009) Richard E. Mayer. *Multimedia Learning*. Cambridge University Press, 2009. doi:10.1017/CBO9780511811678. URL [https://doi.org/10.1017/CBO9780511811678](https://doi.org/10.1017/CBO9780511811678).
- Packer et al. (2024) Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards LLMs as operating systems, 2024. URL [https://arxiv.org/abs/2310.08560](https://arxiv.org/abs/2310.08560).
- Park et al. (2023) Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. URL [https://arxiv.org/abs/2304.03442](https://arxiv.org/abs/2304.03442).
- Qin et al. (2025) Yujia Qin, Yining Ye, Junjie Fang, et al. UI-TARS: Pioneering automated GUI interaction with native agents, 2025. URL [https://arxiv.org/abs/2501.12326](https://arxiv.org/abs/2501.12326).
- Rawles et al. (2023) Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control, 2023. URL [https://arxiv.org/abs/2307.10088](https://arxiv.org/abs/2307.10088).
- Rawles et al. (2025) Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. Androidworld: A dynamic benchmarking environment for autonomous agents, 2025. URL [https://arxiv.org/abs/2405.14573](https://arxiv.org/abs/2405.14573).
- Shao et al. (2026a) Shuai Shao, Yixiang Liu, Bingwei Lu, and Weinan Zhang. Monoscale: Scaling multi-agent system with monotonic improvement, 2026a. URL [https://arxiv.org/abs/2601.23219](https://arxiv.org/abs/2601.23219).
- Shao et al. (2026b) Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, and Jing Shao. Your agent may misevolve: Emergent risks in self-evolving LLM agents, 2026b. URL [https://arxiv.org/abs/2509.26354](https://arxiv.org/abs/2509.26354).
- Shinn et al. (2023) Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In *Advances in Neural Information Processing Systems 36*, 2023. URL [http://papers.nips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html](http://papers.nips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html).
- Sutton et al. (1999) Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. *Artificial Intelligence*, 112(1–2):181–211, 1999. doi:10.1016/S0004-3702(99)00052-1. URL [https://doi.org/10.1016/S0004-3702(99)00052-1](https://doi.org/10.1016/S0004-3702(99)00052-1).
- Team et al. (2026a) Kimi Team, Tongtong Bai, Yifan Bai, et al. Kimi k2.5: Visual agentic intelligence, 2026a. URL [https://arxiv.org/abs/2602.02276](https://arxiv.org/abs/2602.02276).
- Team et al. (2026b) V Team. GLM-5V-Turbo: Toward a native foundation model for multimodal agents, 2026b. URL [https://arxiv.org/abs/2604.26752](https://arxiv.org/abs/2604.26752).
- Wang et al. (2026a) Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, and Shumin Deng. SkillX: Automatically constructing skill knowledge bases for agents, 2026a. URL [https://arxiv.org/abs/2604.04804](https://arxiv.org/abs/2604.04804).
- Wang et al. (2023a) Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models, 2023a. URL [https://arxiv.org/abs/2305.16291](https://arxiv.org/abs/2305.16291).
- Wang et al. (2026b) Leye Wang, Zixing Wang, and Anjie Xu. SkillTester: Benchmarking utility and security of agent skills, 2026b. URL [https://arxiv.org/abs/2603.28815](https://arxiv.org/abs/2603.28815).
- Wang et al. (2026c) Shijian Wang, Jiarui Jin, Runhao Fu, Zexuan Yan, Xingjian Wang, Mengkang Hu, Eric Wang, Xiaoxi Li, Kangning Zhang, Li Yao, Wenxiang Jiao, Xuelian Cheng, Yuan Lu, and Zongyuan Ge. MuSEAgent: A multimodal reasoning agent with stateful experiences, 2026c. URL [https://arxiv.org/abs/2603.27813](https://arxiv.org/abs/2603.27813).
- Wang et al. (2025a) Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Huarong Chen, Zaida Zhou, Haotian Yao, Ziwei Chen, Qizheng Gu, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Flood Sung, Y. Charles, Zhilin Yang, and Tao Yu. OpenCUA: Open foundations for computer-use agents, 2025a. URL [https://arxiv.org/abs/2508.09123](https://arxiv.org/abs/2508.09123).
- Wang et al. (2025b) Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, and Wenhai Wang. MMBench-GUI: Hierarchical multi-platform evaluation framework for GUI agents, 2025b. URL [https://arxiv.org/abs/2507.19478](https://arxiv.org/abs/2507.19478).
- Wang et al. (2023b) Zihao Wang, Shaofei Cai, Anji Liu, Yonggang Jin, Jinbing Hou, Bowei Zhang, Haowei Lin, Zhaofeng He, Zilong Zheng, Yaodong Yang, Xiaojian Ma, and Yitao Liang. JARVIS-1: Open-world multi-task agents with memory-augmented multimodal language models, 2023b. URL [https://arxiv.org/abs/2311.05997](https://arxiv.org/abs/2311.05997).
- Wang et al. (2024) Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents, 2024. URL [https://arxiv.org/abs/2302.01560](https://arxiv.org/abs/2302.01560).
- Wu et al. (2024) Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: A foundation action model for generalist GUI agents, 2024. URL [https://arxiv.org/abs/2410.23218](https://arxiv.org/abs/2410.23218).
- Xia et al. (2026) Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning, 2026. URL [https://arxiv.org/abs/2602.08234](https://arxiv.org/abs/2602.08234).
- Xie et al. (2024) Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. URL [https://arxiv.org/abs/2404.07972](https://arxiv.org/abs/2404.07972).
- Xie et al. (2025) Yuquan Xie, Zaijing Li, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Dongmei Jiang, and Liqiang Nie. Mirage-1: Augmenting and updating GUI agent with hierarchical multimodal skills, 2025. URL [https://arxiv.org/abs/2506.10387](https://arxiv.org/abs/2506.10387).
- Xu and Yan (2026) Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward, 2026. URL [https://arxiv.org/abs/2602.12430](https://arxiv.org/abs/2602.12430).
- Xu et al. (2025) Yibin Xu, Liang Yang, Hao Chen, Hua Wang, Zhi Chen, and Yaohua Tang. DeskVision: Large scale desktop region captioning for advanced GUI agents, 2025. URL [https://arxiv.org/abs/2503.11170](https://arxiv.org/abs/2503.11170).
- Yang et al. (2025a) Jingyi Yang, Shuai Shao, Dongrui Liu, and Jing Shao. RiOSWorld: Benchmarking the risk of multimodal computer-use agents, 2025a. URL [https://arxiv.org/abs/2506.00618](https://arxiv.org/abs/2506.00618).
- Yang et al. (2025b) Pei Yang, Hai Ci, and Mike Zheng Shou. macOSWorld: A multilingual interactive benchmark for GUI agents, 2025b. URL [https://arxiv.org/abs/2506.04135](https://arxiv.org/abs/2506.04135).
- Yang et al. (2025c) Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang. AgentNet: Decentralized evolutionary coordination for LLM-based multi-agent systems, 2025c. URL [https://arxiv.org/abs/2504.00587](https://arxiv.org/abs/2504.00587).
- Yao et al. (2023) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In *The Eleventh International Conference on Learning Representations, ICLR 2023*. OpenReview.net, 2023. URL [https://openreview.net/forum?id=WE_vluYUL-X](https://openreview.net/forum?id=WE_vluYUL-X).
- Zhang et al. (2023) Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. AppAgent: Multimodal agents as smartphone users, 2023. URL [https://arxiv.org/abs/2312.13771](https://arxiv.org/abs/2312.13771).
- Zhang et al. (2024) Kangning Zhang, Yingjie Qin, Jiarui Jin, Yifan Liu, Ruilong Su, Weinan Zhang, and Yong Yu. DREAM: A dual representation learning model for multimodal recommendation, 2024. URL [https://arxiv.org/abs/2404.11119](https://arxiv.org/abs/2404.11119).
- Zhang et al. (2025) Kangning Zhang, Wenxiang Jiao, Kounianhua Du, Yuan Lu, Weiwen Liu, Weinan Zhang, and Yong Yu. LoopTool: Closing the data-training loop for robust LLM tool calls, 2025. URL [https://arxiv.org/abs/2511.09148](https://arxiv.org/abs/2511.09148).
- Zheng et al. (2024) Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. GPT-4V(ision) is a generalist web agent, if grounded, 2024. URL [https://arxiv.org/abs/2401.01614](https://arxiv.org/abs/2401.01614).
- Zheng et al. (2025) Boyuan Zheng, Michael Y. Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. SkillWeaver: Web agents can self-improve by discovering and honing skills, 2025. URL [https://arxiv.org/abs/2504.07079](https://arxiv.org/abs/2504.07079).
- Zhou et al. (2024) Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents, 2024. URL [https://arxiv.org/abs/2307.13854](https://arxiv.org/abs/2307.13854).


## 6 Benchmark Statistics

We use four visual-agent benchmarks. OSWorld is the primary GUI benchmark and contains Ubuntu desktop tasks across browsers, office software, creative tools, media applications, system settings, code editors, email, and multi-application workflows (Xie et al., [2024](https://arxiv.org/html/2605.13527#bib.bib49)). macOSWorld provides an auxiliary cross-operating-system GUI evaluation with file management, media, productivity, system/interface, and system-application tasks (Yang et al., [2025b](https://arxiv.org/html/2605.13527#bib.bib54)). VAB-Minecraft is the Minecraft subset of VisualAgentBench and evaluates item-acquisition tasks that require visual grounding, inventory tracking, recipe reasoning, tool use, and handling failed actions (Liu et al., [2024a](https://arxiv.org/html/2605.13527#bib.bib22)). LMGame-Bench evaluates game-playing agents through a unified interface (Hu et al., [2025](https://arxiv.org/html/2605.13527#bib.bib12)); we use Super Mario Bros because its recurring visual situations naturally align with reusable multimodal skills.

Table 4: Test-case distributions for OSWorld and macOSWorld. OSWorld contains 360 test cases; macOSWorld contains 143 test cases. “Share” is the percentage of test cases in each domain within the corresponding benchmark.

| Benchmark | Domain | Count | Share (%) | Snapshot-en | Snapshot-apps |
| --- | --- | --- | --- | --- | --- |
| OSWorld | Multi-app | 93 | 25.83 | – | – |
| OSWorld | LibreOffice Calc | 47 | 13.06 | – | – |
| OSWorld | LibreOffice Impress | 47 | 13.06 | – | – |
| OSWorld | Chrome | 45 | 12.50 | – | – |
| OSWorld | GIMP | 26 | 7.22 | – | – |
| OSWorld | OS | 24 | 6.67 | – | – |
| OSWorld | LibreOffice Writer | 23 | 6.39 | – | – |
| OSWorld | VS Code | 23 | 6.39 | – | – |
| OSWorld | VLC | 17 | 4.72 | – | – |
| OSWorld | Thunderbird | 15 | 4.17 | – | – |
| macOSWorld | File management | 29 | 20.28 | 29 | 0 |
| macOSWorld | Media | 12 | 8.39 | 0 | 12 |
| macOSWorld | Productivity | 35 | 24.48 | 16 | 19 |
| macOSWorld | System and interface | 29 | 20.28 | 29 | 0 |
| macOSWorld | System apps | 38 | 26.57 | 38 | 0 |
## 7 Skill Source Statistics

All MMSkills are extracted from non-test trajectories. For OSWorld and macOSWorld, we use the Ubuntu and macOS subsets of OpenCUA trajectories as GUI skill sources (Wang et al., [2025a](https://arxiv.org/html/2605.13527#bib.bib43)). For macOS, the raw OpenCUA trajectories do not directly follow the five macOSWorld categories; we therefore perform additional clustering and relevance filtering before assigning trajectories to the analysis categories below.

Table 5: OpenCUA trajectory statistics used for GUI skill extraction. “Tasks” counts source trajectories, “Share” is the within-platform percentage, and “Clusters” is the number of Phase-0 semantic trajectory clusters used for downstream skill planning.

| Platform | Domain | Tasks | Share (%) | Clusters |
| --- | --- | --- | --- | --- |
| Ubuntu | Chrome | 718 | 17.1 | 17 |
| Ubuntu | LibreOffice Impress | 605 | 14.4 | 11 |
| Ubuntu | VS Code | 605 | 14.4 | 4 |
| Ubuntu | OS | 497 | 11.8 | 2 |
| Ubuntu | GIMP | 492 | 11.7 | 14 |
| Ubuntu | LibreOffice Writer | 490 | 11.7 | 3 |
| Ubuntu | Thunderbird | 300 | 7.1 | 11 |
| Ubuntu | LibreOffice Calc | 298 | 7.1 | 3 |
| Ubuntu | VLC | 200 | 4.8 | 8 |
| macOS | Productivity | 1,424 | 45.1 | 20 |
| macOS | System apps | 768 | 24.3 | 11 |
| macOS | File management | 341 | 10.8 | 9 |
| macOS | Media | 315 | 10.0 | 7 |
| macOS | System and interface | 309 | 9.8 | 12 |

Table 6: OSWorld MMSkill package statistics. “#Skills” counts unique skill packages, while “Skills/Task” reports the average number of skill matches assigned to evaluation tasks and therefore need not equal #Skills/#Tasks. Word statistics are median/mean over skill procedures. “Full/Focus” and “Before/After” report counts of those view types; “Transition Cards” counts state cards with at least one before/after transition view, with percentages over state cards. The Total/Avg. row reports total counts and weighted averages; † marks a fitted value estimated from domain-level medians.

| Domain | #Tasks | #Skills | Skills/Task | Words Med/Mean | #Cards | Cards/Skill | #Views | Views/Card | Full/Focus | Before/After | Transition Cards |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chrome | 45 | 34 | 1.20 | 653 / 630.9 | 134 | 3.94 | 292 | 2.18 | 134/134 | 13/11 | 24 (17.9%) |
| GIMP | 26 | 26 | 1.19 | 470 / 400.2 | 77 | 2.96 | 190 | 2.47 | 77/77 | 14/22 | 36 (46.8%) |
| Calc | 47 | 26 | 1.36 | 278 / 278.1 | 79 | 3.04 | 184 | 2.33 | 79/79 | 7/19 | 26 (32.9%) |
| Impress | 47 | 20 | 1.32 | 498 / 466.2 | 60 | 3.00 | 140 | 2.33 | 60/60 | 1/19 | 20 (33.3%) |
| Writer | 23 | 23 | 1.13 | 264 / 289.2 | 71 | 3.09 | 144 | 2.03 | 71/71 | 1/1 | 2 (2.8%) |
| Multi-apps | 93 | 20 | 1.19 | 574 / 502.0 | 82 | 4.10 | 164 | 2.00 | 82/82 | 0/0 | 0 (0.0%) |
| OS | 24 | 37 | 1.21 | 544 / 539.8 | 139 | 3.76 | 283 | 2.04 | 139/139 | 5/0 | 5 (3.6%) |
| Thunderbird | 15 | 25 | 1.20 | 508 / 542.5 | 87 | 3.48 | 192 | 2.21 | 87/84 | 6/15 | 21 (24.1%) |
| VLC | 17 | 18 | 1.00 | 260 / 275.3 | 61 | 3.39 | 122 | 2.00 | 61/61 | 0/0 | 0 (0.0%) |
| VS Code | 23 | 18 | 1.09 | 391 / 389.3 | 89 | 4.94 | 187 | 2.10 | 89/89 | 9/0 | 9 (10.1%) |
| Total / Avg. | 360 | 247 | 1.21 | 498.0† / 447.8 | 879 | 3.56 | 1898 | 2.16 | 879/876 | 56/87 | 143 (16.3%) |

Table 7: macOSWorld MMSkill package statistics. “#Skills” counts unique skill packages, while “Skills/Task” reports the average number of skill matches assigned to evaluation tasks. Word statistics are median/mean over skill procedures. “Full/Focus” and “Before/After” report counts of those view types; “Transition Cards” counts state cards with at least one before/after transition view, with percentages over state cards.

| Domain | #Tasks | #Skills | Skills/Task | Words Med/Mean | #Cards | Cards/Skill | #Views | Views/Card | Full/Focus | Before/After | Transition Cards |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| File management | 29 | 30 | 1.03 | 358 / 374.5 | 62 | 2.07 | 128 | 2.06 | 62/62 | 4/0 | 4 (6.5%) |
| Media | 12 | 25 | 2.08 | 378 / 400.8 | 55 | 2.20 | 116 | 2.11 | 55/55 | 6/0 | 6 (10.9%) |
| Productivity | 35 | 59 | 1.69 | 324 / 330.2 | 125 | 2.12 | 261 | 2.09 | 125/125 | 11/0 | 11 (8.8%) |
| System/interface | 29 | 88 | 3.03 | 282 / 285.5 | 182 | 2.07 | 380 | 2.09 | 182/182 | 16/0 | 16 (8.8%) |
| System apps | 38 | 46 | 1.21 | 347 / 352.0 | 98 | 2.13 | 212 | 2.16 | 98/98 | 6/10 | 16 (16.3%) |
| Total / Avg. | 143 | 248 | 1.73 | 324 / 330.9 | 522 | 2.10 | 1097 | 2.10 | 522/522 | 43/10 | 53 (10.2%) |

Table 8: Game benchmark MMSkill package statistics. Word statistics are median/mean over skill procedures and plans. “Full/Focus” and “Before/After” report counts of those view types; “Transition Cards” counts state cards with at least one before/after transition view, with percentages over state cards. † marks a fitted value estimated from the available before/after view counts.

| Benchmark | #Skills | Skill Words Med/Mean | Plan Words Med/Mean | #Cards | Cards/Skill | #Views | Views/Card | Full/Focus | Before/After | Transition Cards |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VAB-Minecraft | 24 | 278.5 / 281.7 | 68.0 / 68.4 | 79 | 3.29 | 185 | 2.34 | 79/79 | 8/19 | 20 (25.3%) |
| Super Mario Bros | 10 | 374.0 / 370.8 | 280.0 / 271.0 | 34 | 3.40 | 48† | 1.41† | 34/0 | 5/9 | 14 (41.2%)† |
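As a reading aid, the count columns in Tables 6–8 are internally consistent: each row’s #Views equals the sum of its full, focus, before, and after view counts, and the “Transition Cards” percentage is taken over that row’s #Cards. A minimal sanity check of this arithmetic on the Chrome row of Table 6 (values copied from the table; the variable names are ours):

```python
# Chrome row of Table 6: Full/Focus = 134/134, Before/After = 13/11.
full, focus, before, after = 134, 134, 13, 11
cards, views, transition_cards = 134, 292, 24

assert full + focus + before + after == views            # matches the #Views column
assert round(views / cards, 2) == 2.18                   # matches Views/Card
assert round(100 * transition_cards / cards, 1) == 17.9  # matches the transition share
```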

For VAB-Minecraft, we use the official training set as the source for extracting multimodal skill packages. For Super Mario Bros from LMGame-Bench, MMSkills are extracted from multiple runs over four source cases. In both settings, the skill-source data are disjoint from the final evaluation cases.

## 8 Experiment Details

Across all evaluations, agents plan from visual environment observations rather than privileged state, using desktop screenshots for GUI tasks and game screenshots for game tasks. For OSWorld and macOSWorld, we run the full evaluations primarily on Amazon Web Services using the official benchmark images and task definitions. The agent interacts through the benchmark harness, and we use a maximum interaction budget of 20 steps for both GUI benchmarks. VAB-Minecraft and Super Mario Bros follow their official evaluation protocols.

For VAB-Minecraft, we use the official test set for evaluation. The training trajectories described in Appendix [7](https://arxiv.org/html/2605.13527#S7) are used only to generate reusable procedures, state cards, and keyframes; no test episodes are used during skill construction.

For Super Mario Bros from LMGame-Bench, we split the available game cases into disjoint source and evaluation subsets. The source cases are described in Appendix [7](https://arxiv.org/html/2605.13527#S7), while a separate set of four held-out cases is used for final evaluation. This separation ensures that the generated skills capture reusable game situations rather than memorizing the measured episodes.

We evaluate both frontier and smaller multimodal models: Gemini 3.1 Pro, Gemini 3 Flash (model card: [https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf](https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf)), Qwen3-VL-235B-A22B-Thinking (Bai et al., [2025](https://arxiv.org/html/2605.13527#bib.bib4)), GLM-5V-Turbo (Team et al., [2026b](https://arxiv.org/html/2605.13527#bib.bib38)), Kimi-K2.6 (Team et al., [2026a](https://arxiv.org/html/2605.13527#bib.bib37)), and Qwen3-VL-8B-Instruct (Bai et al., [2025](https://arxiv.org/html/2605.13527#bib.bib4)). For each base model, we compare *no-skill*, *text-only skill*, and *MMSkills* conditions. Unless otherwise stated, skill conditions use branch loading: text-only skills use the same branch mechanism without state cards or images, while MMSkills inspect selected state cards and multi-view keyframes before returning structured guidance to the main agent. Direct text-skill loading and direct multimodal loading are evaluated only as ablation variants.

## 9 Branch-Loaded Runtime Algorithm

Algorithm [1](https://arxiv.org/html/2605.13527#alg1) summarizes the branch-loaded runtime loop. Candidate skills are selected before task execution, while branch calls occur only when the main agent decides to consult a specific skill. The main trajectory receives the structured guidance $G_t$ rather than the full multimodal skill package.

Algorithm 1: Branch-loaded MMSkill Agent

Require: skill library $\mathcal{M}$, task instruction $I$, visual environment Env.

1: Initialize history $H_0 \leftarrow \emptyset$
2: Pre-recall candidate skills $\mathcal{C}_I \leftarrow \mathrm{PreRecall}(I, \mathcal{M})$
3: for $t = 1, 2, \ldots$ do
4: Observe current visual observation $O_t$ from Env
5: Main agent chooses either action $A_t$ or skill request $M_t \in \mathcal{C}_I$
6: if the main agent chooses action $A_t$ then
7: Execute $A_t$ in Env and update $H_t$
8: else
9: Unpack $M_t = (D_t, P_t, S_t, K_t)$
10: Stage 1: $(J_t, R_t) \leftarrow \mathrm{SelectViews}(O_t, H_{t-1}, P_t, S_t)$
11: Load $V_t \leftarrow \{K_j^v : j \in J_t,\ v \in R_{t,j}\}$
12: Stage 2: $G_t \leftarrow \mathrm{PlanBranch}(O_t, H_{t-1}, P_t, \{S_j : j \in J_t\}, V_t)$
13: Choose grounded action $A_t \leftarrow \pi_{\mathrm{main}}(O_t, H_{t-1}, G_t)$
14: Execute $A_t$ in Env and update $H_t$
15: end if
16: if the task is verified complete then
17: return success
18: end if
19: end for
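To make the control flow concrete, the following is a minimal Python sketch of this loop. It is an illustration under assumed interfaces, not the released implementation: `MMSkill`, `Decision`, the `env` object, and the four callables are hypothetical stand-ins for the algorithm's skill package, main-agent policy, PreRecall, SelectViews, and PlanBranch.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative containers for an MMSkill package; field names are assumptions,
# mirroring the (D, P, S, K) tuple unpacked in Algorithm 1.
@dataclass
class MMSkill:
    description: str    # D: when-to-use description
    procedure: str      # P: textual procedure
    state_cards: dict   # S: state_id -> runtime state card
    keyframes: dict     # K: state_id -> {view_type: image}

@dataclass
class Decision:
    kind: str                        # "action" or "skill"
    action: str = ""                 # e.g. pyautogui code, WAIT, DONE, FAIL
    skill: Optional[MMSkill] = None

def run_episode(main_policy, pre_recall, select_views, plan_branch,
                skill_library, instruction, env, max_steps=20):
    """Branch-loaded runtime loop. The callables stand in for the paper's
    main agent, PreRecall, Stage-1 SelectViews, and Stage-2 PlanBranch."""
    history = []                                         # H_t
    candidates = pre_recall(instruction, skill_library)  # C_I

    for _ in range(max_steps):
        obs = env.observe()                              # O_t
        decision = main_policy(obs, history, candidates, guidance=None)

        if decision.kind == "skill":                     # consult a branch
            skill = decision.skill
            # Stage 1: gate which state cards and keyframe views to inspect.
            state_ids, views_per_state = select_views(
                obs, history, skill.procedure, skill.state_cards)
            views = [skill.keyframes[j][v]
                     for j in state_ids for v in views_per_state[j]]
            # Stage 2: distill the inspected evidence into guidance G_t;
            # only this compact guidance re-enters the main trajectory.
            guidance = plan_branch(
                obs, history, skill.procedure,
                {j: skill.state_cards[j] for j in state_ids}, views)
            decision = main_policy(obs, history, candidates, guidance=guidance)

        feedback = env.execute(decision.action)          # run A_t in Env
        history.append((obs, decision.action, feedback)) # update H_t
        if env.task_complete():                          # verified complete
            return "success"
    return "failure"
```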

![Refer to caption](https://arxiv.org/html/2605.13527v1/x5.png)

Figure 5: Prompt surfaces used by the branch-loaded multimodal skill agent. The main agent prompt decides whether to act directly or consult a skill branch, Stage 1 selects the relevant state cards and keyframe views, and Stage 2 returns compact structured guidance to the main agent.
## 10 MMSkillAgent Prompt Templates

This section reports the prompt templates used by the branch-loaded MMSkillAgent. Dynamic fields are shown as placeholders such as {instruction}, {available_skills}, and {previous_steps}. The implementation instantiates these templates with the current screenshot, recent trajectory, execution feedback, candidate skills, state-card summaries, and selected keyframe views. The Stage-2 JSON contains a few implementation-facing fields beyond Eq. [2](https://arxiv.org/html/2605.13527#S2.E2); they are collapsed into $G_t$ in the method description.

Main-Agent Skill-Calling System Prompt

Role. Follow the user instruction to perform desktop computer tasks. You control the computer using Python code with pyautogui. At each step, you receive the current screenshot and recent visible trajectory history. Use the current screenshot to decide the next action; do not assume previous clicks succeeded.

Skill consultation policy.
- Task skills are optional procedural planners only.
- The final user message includes each non-exhausted skill's short description and minimal runtime state hints. Use these hints to judge whether a skill is genuinely relevant before calling LOAD_SKILL(...).
- Call LOAD_SKILL("<exact_skill_name>") only when the current screenshot, recent steps, and skill hints suggest that extra procedural guidance is useful.
- LOAD_SKILL(...) opens a temporary planner branch for extra skill-guided reasoning; it does not execute the action.
- Skill hints and planner notes are references only, never coordinate templates.
- Each skill may be consulted at most {consult_limit} times in one trajectory. Exhausted skills are removed from the available-skill list and must not be called again.

Available skills. {available_skills} lists non-exhausted candidate skills for the task.

Action rules.
- Use pyautogui only for GUI actions. Do not use pyautogui.locateCenterOnScreen or pyautogui.screenshot().
- Each response must be self-contained and must not rely on variables from previous steps.
- If a click does not work, revise the target from the new screenshot instead of repeating the same guess.
- Prefer short, direct, grounded actions over long speculative scripts; avoid repetitive unproductive loops.
- Before outputting DONE, verify that the full user instruction has been completed, not only a local subgoal.

Output interface. Return exactly one code block containing one of: Python code using pyautogui, WAIT, DONE, FAIL, or LOAD_SKILL("<exact_skill_name>"). Do not mix Python code with a skill call, do not load more than one skill, and do not return prose outside the code block. If returning Python, include concise # comments. Use WAIT only for loading UI, DONE only after full verification, and FAIL only when the task is truly impossible. Canonical outputs include LOAD_SKILL("Example_Skill_Name") and a single grounded action such as pyautogui.click(120, 54).

Coordinate and task context. Use the declared screen resolution for all pyautogui coordinates. The computer password is available as {client_password} when needed. The task is {instruction}.
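Because the main agent must return exactly one code block, a harness can route replies with a small parser. The sketch below is our illustration of such routing, not the paper's harness; the regular expressions and return labels are assumptions.

````python
import re

def route_main_agent_reply(reply: str):
    """Classify the single code block in a main-agent reply.
    Returns a (kind, payload) pair; the labels are illustrative."""
    m = re.search(r"```(?:python)?\s*(.*?)```", reply, re.DOTALL)
    if not m:
        return ("invalid", reply)          # contract violated: no code block
    body = m.group(1).strip()

    skill = re.fullmatch(r'LOAD_SKILL\("([^"]+)"\)', body)
    if skill:
        return ("load_skill", skill.group(1))  # open a planner branch
    if body in ("WAIT", "DONE", "FAIL"):
        return (body.lower(), None)            # control tokens
    return ("python_action", body)             # pyautogui code to execute
````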

Main-Agent Per-Step User Instruction

Decision request. Decide the next grounded response for the current screenshot. Return either the next GUI action or LOAD_SKILL(...) when extra procedural guidance is useful.

Per-step context.
- Instruction: {instruction}
- Available non-exhausted skills: {skills_with_state_previews}, including each skill name, short description, and minimal when-to-use state hints.
- Active planner memo: {active_memo}
- Planner notes returned in this step: {current_step_planner_summaries}
- Previous steps: {previous_steps}, including full model responses and action comments.
- Execution feedback: optional feedback for the current step and optional loop-warning diagnostics.
- Screen resolution: {screen_resolution_prompt}.

Grounding rules.
- Ground every action in the current screenshot.
- Planner notes are fallible references; re-decide the real action from the current screenshot, recent history, and execution feedback.
- Treat state hints, selected reference views, and planner notes as references only, never coordinate templates.
- If no listed skill is clearly useful, act directly from the current screenshot.
- If planner notes already exist for this step, use them before consulting another branch.
- If recent actions repeat without progress, change strategy.
- Before DONE, verify the full instruction; if returning Python, include concise comments.

Branch Stage 1 Prompt: Gated State-View Selection

Branch reference package. The branch receives the requested call LOAD_SKILL("{skill_name}"), the selected skill text, runtime state bundles, and compact state-card manifests. These materials are supplemental procedural references only. Stage 1 must decide whether visual reference images are needed at all and, if so, which state IDs and view types should be loaded. The main agent, not the branch, will choose the concrete GUI action.

Role. You are inside Stage 1 of a temporary state-view selection branch for a single desktop step. Decide whether visual reference images are needed before planner reasoning and which evidence goal they should serve.

View semantics.
- full_frame: global placement and window context.
- focus_crop: detailed control localization.
- before: pre-change state, useful for recognizing whether the UI is still before a change and for avoiding repeated toggles.
- after: target completion state, useful for verifying the result after save, enable, format, or apply operations.

Evidence goals.
- locate_control: request exactly one of full_frame or focus_crop.
- recognize_before: request before, optionally with full_frame.
- verify_after: request after, optionally with full_frame.
- compare_transition: request minimal transition evidence; avoid defaulting to the full_frame + focus_crop pair and prefer before/after when useful.

Visual gating policy. First decide visual_reference_needed. If the useful help is a generic shortcut, formula, file operation, stable menu path, or textual procedure, default to false. Load images only for state transitions, visual result verification, or complex UI-state recognition where text alone is likely insufficient. Keep the request minimal: at most {max_states} states and {max_views} total views.

Input fields. Stage 1 receives {instruction}, {previous_steps}, environment feedback from the previous step, loop warnings if present, the screen-resolution prompt, and the current screenshot.

Output interface. Return exactly one code block containing one LOAD_STATE_VIEWS(...) call. Its JSON payload contains:
- "visual_reference_needed": true or false;
- "why_not_text_only": why text-only is insufficient, or why no images are needed;
- "requests": a list of objects, each with exact "state_id", exact "views", "evidence_goal", and "reason".

When "visual_reference_needed" is false, "requests" must be empty. Do not return Python code, planner JSON, WAIT, DONE, FAIL, LOAD_SKILL, or prose outside the code block.

Canonical examples. A transition request sets "visual_reference_needed": true and requests a state with "views": ["before", "after"] under "evidence_goal": "compare_transition". A text-only branch sets "visual_reference_needed": false, gives a brief reason in "why_not_text_only", and returns "requests": [].
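To make the Stage-1 output contract concrete, here is one plausible LOAD_STATE_VIEWS payload written as a Python literal. The state ID and the wording of the reasons are invented for illustration; the field names and allowed values follow the schema above.

```python
# Hypothetical Stage-1 request: the branch asks for transition evidence
# on one state card, matching the schema described in the prompt.
stage1_payload = {
    "visual_reference_needed": True,
    "why_not_text_only": "The toggle's current on/off state is ambiguous "
                         "from text alone; before/after frames disambiguate it.",
    "requests": [
        {
            "state_id": "settings_toggle_state",  # invented ID for illustration
            "views": ["before", "after"],
            "evidence_goal": "compare_transition",
            "reason": "Verify whether the setting was already enabled "
                      "before clicking the toggle again.",
        }
    ],
}

# A text-only decision keeps "requests" empty:
stage1_text_only = {
    "visual_reference_needed": False,
    "why_not_text_only": "A stable menu path suffices; no images are needed.",
    "requests": [],
}
```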

Branch Stage 2 Prompt: Planner JSON

Selected evidence package. Stage 2 receives the Stage-1 selection record, including the evidence goal, selected states, requested view types, reasons, when-to-use conditions, verification cues, and any loaded keyframe views. Loaded views are supplemental references only and are never coordinate templates.

Role. You are inside Stage 2 of a temporary planner-only skill consultation branch for a single desktop step. Do not return a GUI action. Return a structured planner summary for the current state.

Branch rules.
- Do not return Python code, WAIT, DONE, FAIL, LOAD_SKILL, LOAD_SKILL_IMAGE, or LOAD_STATE_VIEWS. Do not request another skill in this branch.
- Use the current screenshot first. Skill text, runtime state bundles, Stage-1 decisions, and loaded reference views are supplemental only.
- If Stage 1 chose no visual references, respect that decision and avoid inventing image-based assumptions.
- If the skill is ineffective for the current state, say so clearly and avoid forcing the plan toward it.
- Treat reference views as state references, never as coordinate templates.

Planning requirements.
- subgoal: next immediate local milestone under the live UI.
- plan: longer-range route grounded in the current screenshot, including the relevant UI surface, the next 2–4 actions/checks/transitions, and the cue that means advance versus re-plan.
- do_not_do: the likely wrong path or skill-induced mistake to avoid.
- fallback_if_no_progress: a concrete alternate route if the skill-guided path stalls.
- expected_state: visible screenshot cues the main agent should aim to reveal next.
- completion_scope: whether the branch only advances a local step, still needs verification, or may be complete after verification.

Per-step input fields. Stage 2 receives {instruction}, {stage1_decision}, {selected_state_views}, {previous_steps}, environment feedback, optional loop warnings, the screen-resolution prompt, and the live screenshot, which is more authoritative than any skill reference view.

Output interface. Return exactly one code block containing one JSON object with keys: "skill_applicability", "subgoal", "plan", "do_not_do", "fallback_if_no_progress", "expected_state", and "completion_scope". The values of "skill_applicability" are "effective", "ineffective", or "uncertain"; the values of "completion_scope" are "local_only", "needs_verification", or "maybe_complete". Do not return prose outside the code block.

Canonical example shape. A valid planner object may mark the skill as "effective", set a local "subgoal" such as opening the visible settings surface, give a grounded multi-step "plan", block a likely repeated or irrelevant click through "do_not_do", provide a concrete fallback route, and describe the next visible "expected_state" with "completion_scope": "local_only".
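Analogously, a Stage-2 reply following the schema above might carry the planner object sketched below, written as a Python literal. All string values are invented examples; the keys and the allowed values of "skill_applicability" and "completion_scope" follow the prompt.

```python
# Hypothetical Stage-2 planner summary G_t; string values are illustrative.
planner_summary = {
    "skill_applicability": "effective",   # or "ineffective" / "uncertain"
    "subgoal": "Open the visible Settings pane from the current window.",
    "plan": "Click the gear icon in the toolbar, wait for the Settings pane, "
            "select the General tab, then locate the autosave checkbox; "
            "if the pane fails to open, re-plan from the next screenshot.",
    "do_not_do": "Do not re-click the toolbar icon repeatedly if the pane "
                 "is already opening.",
    "fallback_if_no_progress": "Use the application menu bar instead: "
                               "Edit > Preferences, then continue from there.",
    "expected_state": "A Settings pane with a General tab visible.",
    "completion_scope": "local_only",     # or "needs_verification" / "maybe_complete"
}
```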

## 11 Additional Behavioral Shift Analysis

Figure [6](https://arxiv.org/html/2605.13527#S11.F6) complements Figure [4](https://arxiv.org/html/2605.13527#S3.F4) with the same OSWorld behavioral analysis for GLM-5V and Kimi-K2.6.

![Refer to caption](https://arxiv.org/html/2605.13527v1/x6.png)

Figure 6: Behavioral shifts induced by MMSkills on OSWorld for GLM-5V and Kimi-K2.6. The panels follow the same metrics as Figure [4](https://arxiv.org/html/2605.13527#S3.F4): action primitive distribution, low-level primitives per task, and repetitive behavior statistics.
## 12 Interaction Case Studies

Figures [7](https://arxiv.org/html/2605.13527#S12.F7) and [8](https://arxiv.org/html/2605.13527#S12.F8) show two representative OSWorld interaction traces. The first case illustrates a LibreOffice Calc workflow in which the agent consults different spreadsheet skills at different stages of table construction. The second case illustrates a terminal file-organization task where branch guidance helps move past an initially brittle command and then verifies the final archive structure.

![Refer to caption](https://arxiv.org/html/2605.13527v1/x7.png)

Figure 7: Representative interaction case with branch-loaded MMSkills: LibreOffice Calc table construction. Colored turn labels distinguish direct GUI actions, skill loading, branch guidance, evidence-gated reasoning, and final completion.

![Refer to caption](https://arxiv.org/html/2605.13527v1/x8.png)

Figure 8: Representative interaction case with branch-loaded MMSkills: terminal file organization and compression. Colored turn labels distinguish direct GUI actions, skill loading, branch guidance, evidence-gated reasoning, and final completion.
## 13 Broader Impact

MMSkills are intended to make visual agents more reliable by externalizing reusable multimodal procedural knowledge. Potential benefits include improved desktop automation, reduced repeated trial-and-error interactions, better support for smaller models, and more reusable agent knowledge across GUI and game-like visual environments. At the same time, more capable visual agents may also increase the risk of unwanted automation, misuse in interactive software, or accidental actions in sensitive environments. Multimodal skill packages can also contain screenshots or cropped visual evidence, so their construction should avoid private or proprietary user data unless appropriate consent, filtering, and access controls are in place. In this work, we construct skills from public non-evaluation trajectories and store compact state evidence rather than raw demonstrations whenever possible. Future deployments should combine MMSkills with permission controls, task-level safety policies, sensitive-information filtering, and auditing of generated skill packages before they are made available to autonomous agents.

## 14 Use of LLMs

Large language models are used in this work as both research artifacts and research assistants. Methodologically, LLM-based agents are used in the skill-generation pipeline to process and filter trajectories, propose reusable procedures, draft state cards, and generate multimodal skill packages under human-designed schemas and quality checks. LLMs also serve as the evaluated visual agents in the benchmark results. In addition, LLM tools were used during manuscript preparation for editing, polishing, and organizing written content. The authors remained responsible for experimental design, result interpretation, citation checking, and final paper content.

## 15 Detailed Related Work

This section provides the expanded related-work discussion summarized in Section [4](https://arxiv.org/html/2605.13527#S4).

#### Skills for agents.

Skill reuse has a long history in temporal abstraction for reinforcement learning and motor primitives for robotics (Sutton et al., [1999](https://arxiv.org/html/2605.13527#bib.bib36); Ijspeert et al., [2013](https://arxiv.org/html/2605.13527#bib.bib13)). Recent LLM agents have made skills a practical interface for storing and composing procedural knowledge in language-conditioned environments. Early systems connected language models to action by grounding language in affordances (Ahn et al., [2022](https://arxiv.org/html/2605.13527#bib.bib2)), emitting executable programs (Liang et al., [2023](https://arxiv.org/html/2605.13527#bib.bib20)), or interleaving reasoning and acting (Yao et al., [2023](https://arxiv.org/html/2605.13527#bib.bib56)); adjacent code- and tool-agent work studies robust tool-call data loops, search-based code refinement, and adversarial test-case generation (Zhang et al., [2025](https://arxiv.org/html/2605.13527#bib.bib59); Li et al., [2025b](https://arxiv.org/html/2605.13527#bib.bib17), [2026a](https://arxiv.org/html/2605.13527#bib.bib18)). Reflection mechanisms then made agent behavior more persistent across attempts (Shinn et al., [2023](https://arxiv.org/html/2605.13527#bib.bib35)). In open-ended environments, systems such as DEPS, Voyager, and JARVIS-1 showed that large models can use language, stored experience, and self-generated programs to acquire or reuse behaviors over extended task horizons (Wang et al., [2024](https://arxiv.org/html/2605.13527#bib.bib46), [2023a](https://arxiv.org/html/2605.13527#bib.bib40), [2023b](https://arxiv.org/html/2605.13527#bib.bib45)). These works motivate our focus on procedural reuse, but their reusable knowledge is primarily textual, symbolic, or programmatic.

More recent work treats skills as an explicit substrate for agent improvement. SkillWeaver distills web exploration into reusable API-like skills (Zheng et al., [2025](https://arxiv.org/html/2605.13527#bib.bib61)); CUA-Skill builds a parameterized skill base with execution and composition graphs for computer-using agents (Chen et al., [2026](https://arxiv.org/html/2605.13527#bib.bib6)); SkillX automatically constructs hierarchical skill knowledge bases from agent experience (Wang et al., [2026a](https://arxiv.org/html/2605.13527#bib.bib39)); EvoSkill studies automated skill discovery through failure analysis in multi-agent settings (Alzubi et al., [2026](https://arxiv.org/html/2605.13527#bib.bib3)), where decentralized coordination and scalable improvement are also central concerns (Yang et al., [2025c](https://arxiv.org/html/2605.13527#bib.bib55); Shao et al., [2026a](https://arxiv.org/html/2605.13527#bib.bib33)); SkillClaw evolves shared skills from multi-user trajectories (Ma et al., [2026](https://arxiv.org/html/2605.13527#bib.bib26)); and SkillRL co-evolves a hierarchical skill library with reinforcement learning (Xia et al., [2026](https://arxiv.org/html/2605.13527#bib.bib48)). A recent survey frames agent skills as portable packages of instructions, code, and resources loaded through progressive disclosure (Xu and Yan, [2026](https://arxiv.org/html/2605.13527#bib.bib51)). A complementary perspective treats accumulated agent experience as long-term memory: Generative Agents maintain a memory stream that supports recall, reflection, and planning (Park et al., [2023](https://arxiv.org/html/2605.13527#bib.bib29)), while MemGPT introduces an OS-style memory hierarchy that pages information in and out of the model's working context (Packer et al., [2024](https://arxiv.org/html/2605.13527#bib.bib28)). MMSkills follows this broader move toward modular procedural knowledge, but changes the unit being stored: instead of treating skills mainly as text, code, APIs, or execution graphs, we define a skill package whose central evidence is a set of visually grounded runtime states. Branch loading also takes inspiration from memory-paging ideas, by inspecting selected multimodal evidence in a temporary branch rather than flooding the main context.

This emerging ecosystem has also motivated dedicated evaluation of skill utility. SkillsBench measures how skills affect agent performance across diverse tasks (Li et al., [2026b](https://arxiv.org/html/2605.13527#bib.bib19)), SkillTester evaluates utility and security risks of agent skills (Wang et al., [2026b](https://arxiv.org/html/2605.13527#bib.bib41)), and recent work studies skill usage under more realistic retrieval and adaptation settings (Liu et al., [2026](https://arxiv.org/html/2605.13527#bib.bib24)). These benchmarks show that skills are not automatically beneficial; their value depends on relevance, compactness, selection, and safe use, especially as self-evolving agents may introduce emergent risks (Shao et al., [2026b](https://arxiv.org/html/2605.13527#bib.bib34)). Our work addresses a complementary question for visual agents: what evidence should a skill expose, and how should that evidence be loaded, when correct use depends on the current visual state?

The closest line to our work is multimodal and GUI-specific skill augmentation. Mirage-1 introduces hierarchical multimodal skills for GUI agents and uses them with search to support long-horizon control (Xie et al., [2025](https://arxiv.org/html/2605.13527#bib.bib50)); XSkill continually extracts experiences and skills for multimodal agents from visually grounded rollouts (Jiang et al., [2026](https://arxiv.org/html/2605.13527#bib.bib14)); MuSEAgent studies stateful experiences for multimodal reasoning agents (Wang et al., [2026c](https://arxiv.org/html/2605.13527#bib.bib42)); and CUA-Skill builds computer-use skills as parameterized procedures and execution graphs (Chen et al., [2026](https://arxiv.org/html/2605.13527#bib.bib6)). MMSkills differs in emphasis: we define the skill artifact around reusable visual state evidence, not only around executable procedure structure or memory accumulation. Each skill is organized around when-to-use conditions, visible cues, verification cues, and multi-view state evidence, and the runtime first selects the relevant evidence before exposing it to the main agent. This makes the contribution a representation and loading mechanism for multimodal procedural cues, rather than another text skill library or GUI action graph.

#### Visual agents.

Visual agents have rapidly advanced from web navigation to general computer use. Benchmarks such as Mind2Web and WebArena established realistic web-agent evaluation beyond synthetic interfaces (Deng et al., [2023](https://arxiv.org/html/2605.13527#bib.bib8); Zhou et al., [2024](https://arxiv.org/html/2605.13527#bib.bib62)); VisualWebArena showed that many web tasks require visual grounding rather than text-only reasoning (Koh et al., [2024](https://arxiv.org/html/2605.13527#bib.bib15)); and WebVoyager demonstrated end-to-end web interaction with large multimodal models on real websites (He et al., [2024](https://arxiv.org/html/2605.13527#bib.bib10)). The same trend appears in mobile, desktop, and embodied settings: Android in the Wild and AndroidWorld study device control from visual UI observations (Rawles et al., [2023](https://arxiv.org/html/2605.13527#bib.bib31), [2025](https://arxiv.org/html/2605.13527#bib.bib32)), OSWorld and macOSWorld evaluate agents in real operating-system environments (Xie et al., [2024](https://arxiv.org/html/2605.13527#bib.bib49); Yang et al., [2025b](https://arxiv.org/html/2605.13527#bib.bib54)), RiOSWorld evaluates risks in multimodal computer-use agents (Yang et al., [2025a](https://arxiv.org/html/2605.13527#bib.bib53)), and VisualAgentBench includes VAB-Minecraft and VAB-OmniGibson for open-world and household embodied interaction (Liu et al., [2024a](https://arxiv.org/html/2605.13527#bib.bib22)).

Model and framework work has likewise moved toward visually grounded action, reflecting the shared multimodal objective of aligning visual and textual representations (Liu et al., [2024b](https://arxiv.org/html/2605.13527#bib.bib23); Zhang et al., [2024](https://arxiv.org/html/2605.13527#bib.bib58)). SeeClick trains GUI grounding for screenshot-only agents (Cheng et al., [2024](https://arxiv.org/html/2605.13527#bib.bib7)); CogAgent introduces a visual language model dedicated to GUI understanding and operation (Hong et al., [2024](https://arxiv.org/html/2605.13527#bib.bib11)); OS-ATLAS learns a foundation action model for GUI control (Wu et al., [2024](https://arxiv.org/html/2605.13527#bib.bib47)); UI-TARS develops native GUI agents that perceive screenshots and emit keyboard/mouse actions (Qin et al., [2025](https://arxiv.org/html/2605.13527#bib.bib30)); SeeAct builds web agents around general-purpose vision-language models (Zheng et al., [2024](https://arxiv.org/html/2605.13527#bib.bib60)); AppAgent learns smartphone skills from on-device demonstrations (Zhang et al., [2023](https://arxiv.org/html/2605.13527#bib.bib57)); OmniParser provides a pure-vision parser that turns screenshots into structured GUI elements (Lu et al., [2024](https://arxiv.org/html/2605.13527#bib.bib25)); and Agent S provides a general computer-use framework built around GUI interaction (Agashe et al., [2024](https://arxiv.org/html/2605.13527#bib.bib1)). These systems improve the agent's perceptual and action interface. MMSkills instead targets the external knowledge layer used by such agents. A stronger GUI action model may click more accurately, but it still benefits from knowing which procedural state matters, which visual cue confirms progress, and which state indicates that a skill should not be applied. MMSkills represents that knowledge as a compact, reusable multimodal skill package.

#### GUI grounding benchmarks.

Alongside task-completion benchmarks, a separate line of work measures how reliably GUI agents can localize UI elements from natural-language instructions. ScreenSpot-Pro extends earlier ScreenSpot evaluations to high-resolution, professional desktop environments, where target elements often occupy less than 0.1% of the screen and the strongest grounding models still fall well below human performance (Li et al., [2025a](https://arxiv.org/html/2605.13527#bib.bib16)). Gou et al. ([2025](https://arxiv.org/html/2605.13527#bib.bib9)) push toward universal visual grounding that lets agents identify GUI elements purely from screenshots, in the spirit of how humans navigate digital interfaces. MMBench-GUI organizes evaluation hierarchically, from content understanding and element grounding to task automation and multi-agent collaboration (Wang et al., [2025b](https://arxiv.org/html/2605.13527#bib.bib44)), and DeskVision contributes a large-scale desktop dataset and evaluation suite that broadens grounding research across operating systems (Xu et al., [2025](https://arxiv.org/html/2605.13527#bib.bib52)). These benchmarks isolate the perceptual layer of visual agents. MMSkills is complementary: rather than improving where to click, it provides procedural and visual evidence about which state matters at each step, and lets the underlying grounding capability translate that evidence into precise actions.

#### Long-context reliability.

Recent studies have shown that simply enlarging the context window does not guarantee that all evidence is used effectively. Liu et al. ([2023](https://arxiv.org/html/2605.13527#bib.bib21)) report that language models often fail to retrieve information placed in the middle of long contexts, and benchmarks such as LongBench reveal substantial degradation as the input grows in length and modality (Bai et al., [2024](https://arxiv.org/html/2605.13527#bib.bib5)). These observations motivate our branch-loaded design: rather than directly inserting state cards, multi-view keyframes, and transition examples into the main agent context, the runtime first inspects selected evidence in a temporary branch and returns a compact structured guidance tuple. This isolates expensive multimodal evidence reading from action generation, and avoids the long-context failure modes that arise when reference views and live observations compete for the same context window.
