@omarsar0: // Automating SKILL.md Generation // Increasingly, mining sessions is one of the best ways to improve your agents. Open…
Summary
This paper from MIT and Harvard explores automating SKILL.md generation by mining GUI interaction trajectories, finding that clusters are readable but do not improve policy performance across domains.
Cached at: 06/20/26, 04:19 PM
// Automating SKILL.md Generation //
Increasingly, mining sessions is one of the best ways to improve your agents.
OpenAI released something similar yesterday that lets Codex package skills from interactions.
(bookmark it)
This paper explains a related approach.
They run a three-stage pipeline that segments GUI trajectories, clusters them into candidate skills, and trains a skill-aware policy.
The clusters are genuinely readable, with five of eight hitting 0.95 or higher purity against ground-truth workflow labels.
But readability does not transfer. GRPO lifts skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ flat, and loses to trivial frequency priors.
The authors name the three culprits: a weak boundary detector, an orderless segment representation, and an offline reward model.
Paper: https://arxiv.org/abs/2606.20363
Learn to build effective AI agents in our academy: https://academy.dair.ai
Automating SKILL.md Generation for Computer-Using Agents via Interaction Trajectory Mining
Source: https://arxiv.org/html/2606.20363
Yuexing Hao (1), Xiaomin Li (2)
(1) Massachusetts Institute of Technology, (2) Harvard University
Abstract
Explicit skill libraries make computer-using agents easier to inspect, but it remains unclear whether such libraries can be mined from interaction data in a way that improves downstream policies. We study this question through a three-stage pipeline that segments GUI trajectories, clusters segments into candidate skills, and trains a skill-aware policy from the resulting annotations. The mined clusters are readable on the source benchmark: five of eight clusters have at least 0.95 purity against InteraSkill Workflows labels. However, readability does not imply transfer. GRPO improves IW skill-step accuracy only from 18.5% to 20.5%, leaves BrowseComp+ essentially unchanged, and underperforms trivial frequency priors on key source-domain metrics. We therefore present the method as a diagnostic study: trajectory mining can expose inspectable skill structure, but the current boundary detector, orderless segment representation, and offline reward model are insufficient for reliable cross-domain policy improvement.
1 Introduction
Computer-using agents (CUAs) act on graphical user interfaces (GUIs) by clicking, typing, scrolling, copying, and pasting [44,7,46]. As these agents move from single-step web tasks to longer workflows, repeated patterns become important. A user may need to search for a page, copy a value, switch applications, fill a form, or send a message. Agent systems often package these repeated routines as skills: named procedures that sit above primitive user-interface (UI) actions and below full task plans. We use SKILL.md to refer to this kind of explicit skill file. These files make behavior easier to inspect and debug, but they are usually written by hand.
Hand-written skills create a practical bottleneck. They must be named, scoped, documented, and updated as interfaces change. They also encode designer assumptions about which behaviors are reusable. A trajectory dataset contains another possible source of structure: if many users repeatedly perform similar action subsequences, those subsequences may reveal natural skill units. This paper asks whether we can build explicit skill libraries from such trajectories. The hard part is not merely clustering trajectories. The hard part is showing that the discovered clusters help a policy on new tasks. A cluster can be coherent without being useful. A policy can improve because it sees more action data, not because it learned a transferable skill vocabulary. Group Relative Policy Optimization (GRPO) can also optimize the reward model without improving benchmark accuracy. We therefore ask a narrow question: what transfers from mined skills?
Our pipeline has three steps. First, we cut trajectories at large action changes. Second, we cluster the resulting segments and refine the cluster embedding with pseudo-label contrastive learning. Third, we train Qwen3-8B with GRPO from the base model. The reward model scores full skill-aware responses, but the main reported Phase 3 evaluation is skill-sequence composition: predicting the next mined skill labels and comparing the resulting skill sequence with the reference sequence. We compare this policy to zero-shot baselines with the same skill-oriented format on IW, WebArena, and BrowseComp+. We additionally report WorkArena-NLP as a text-only diagnostic and a Mind2Web zero-shot baseline as context, but neither is used to claim current GRPO transfer.
The main results are mixed. On the source IW benchmark, the mined library is readable: five of eight clusters have at least 0.95 purity against IW ground-truth skills. As a downstream training signal, however, the current setup is weak. GRPO raises IW skill-step accuracy only from 18.5% to 20.5%, while BrowseComp+ skill-step accuracy changes from 43.5% to 43.3%. A trivial Frequency baseline is a stronger next-skill predictor on IW than the proposed multi-layer perceptron (MLP) and GRPO, and it has lower edit distance than Auto-SKILL.md at every data size. We therefore treat the result as evidence about the limits of the current pipeline. Trajectory mining can produce readable skill structure. Our current reward model, orderless segment representation, and GRPO setup do not turn that structure into a strong skill-composition policy, and several learned variants underperform a most-common-skill baseline.
This paper is a diagnostic study rather than a success claim. We make three contributions. First, we present a simple pipeline for mining explicit SKILL.md-style routines from GUI trajectories and show that it produces readable source-domain structure. Second, we evaluate whether these mined skills improve downstream skill composition under several baselines and transfer checks. Third, we report a negative result: the current learned components do not outperform trivial frequency priors, and verified cross-domain gains are absent or negative. These findings clarify which parts of trajectory-mined skill libraries are currently useful, and which parts remain unsolved. Our code is available in an anonymous GitHub project repository: https://anonymous.4open.science/r/CUA-1680.
2 Related Work
Modern CUAs use structured UI observations and fixed action spaces. WebShop [44] and Mind2Web [7] define common primitives, while WebArena [46], VisualWebArena [20], WorkArena [9], and OSWorld [41] test realistic web and operating-system tasks. Foundation CUA systems such as OpAgent [15], OpenCUA [39], and UltraCUA [43] push this direction with larger models and trajectory corpora. Our emphasis on inspectable skill files also connects to human-centered and rule-grounded AI work outside GUI-agent benchmarks. Hao et al. study AI systems for shared decision-making with older adult cancer patients [18], EHR-integrated LLM agents for prostate-cancer patient education [17], and physician–AI relevance alignment in medical question answering [16]; Li et al. study guideline-following medical decisions, rule-based data selection, and adaptive safety rules for reward modeling [24,26,25]. These systems motivate the same design principle we adopt here: automated agents should expose intermediate structure that humans can inspect, question, and correct.
Prior work already mines or synthesizes reusable workflow artifacts. Agent Workflow Memory (AWM) [40] induces routines from trajectories, SkillWeaver [45] distills website practice into reusable API-style skills, AutoManual [4] constructs environmental manuals, ICAL [32] distills demonstrations into cognitive abstractions, LearnAct [27] studies demonstration-based mobile GUI agents, and Open-World Skill Discovery uses action-prediction error for boundary detection [6]. The broader reinforcement-learning literature gives the formal background for temporal abstraction: options and option-critic methods [35,2], deep HRL systems [21,38,28,31,23], and meta-learned or offline primitives [12,1]. Unsupervised skill-discovery methods such as VIC, DIAYN, DADS, CIC, and actionable representations learn reusable behaviors through mutual-information, contrastive, or goal-conditioned objectives [14,10,33,22,13]; recent analysis cautions that mutual-information skills are not universally optimal for every downstream reward [11]. Our negative results are consistent with this caution: a coherent skill space is not automatically a useful cross-domain policy.
Recent GUI-agent training work informs our GRPO setup. DigiRL [3] and WebRL [30] optimize agents with online RL curricula; AgentTrek [42] and OS-Genesis [34] generate agent trajectories; Proposer-Agent-Evaluator (PAE) [47] uses evaluator feedback for autonomous skill discovery; and Skills-Coach [36] applies a GRPO-style loop to generated task suites. Our experiment is narrower: it uses an offline IW-derived reward over text skill plans, does not interact with a live GUI during RL, and does not train the reward on target-domain task success. We therefore interpret weak GRPO transfer as a pipeline-level result, not as evidence against GUI-agent RL broadly.
3 Problem Setup
We treat a trajectory as a sequence of UI observations and primitive actions. Primitive actions include click, type, scroll, copy, and paste. A skill label summarizes a contiguous action segment. Formally, the input dataset is
$$\mathcal{D}=\{\tau^{(n)}\}_{n=1}^{N},\qquad\tau^{(n)}=((o_{1},a_{1}),\ldots,(o_{T},a_{T})), \tag{1}$$
where $o_{t}$ is a GUI observation and $a_{t}\in\mathcal{A}_{\text{low}}$ is a primitive UI action. The goal is to induce a skill vocabulary $\mathcal{Z}$ and a segmentation of each trajectory into contiguous intervals, each assigned one skill $z\in\mathcal{Z}$. In hand-authored systems, $\mathcal{Z}$ is provided by designers. In our setting, $\mathcal{Z}$ is induced from trajectories.
The question is whether the induced vocabulary $\mathcal{Z}$ helps the policy. We therefore focus the main evaluation on skill composition: whether a model can choose the right mined skill sequence on held-out tasks and transfer settings. Primitive UI-action accuracy is reported only for the verified Mind2Web zero-shot diagnostic, where benchmark annotations directly support that metric; it is not the primary Phase 3 claim.
4 Method: Automated SKILL.md Generation
The pipeline has three stages. It segments trajectories, clusters the segments into skills, and trains a CUA policy with the resulting annotations. The first two stages build the skill library. The third stage tests whether the library helps. Figure 1 summarizes the study design. The equations below are operational definitions rather than standalone theoretical claims. Equation 2 decides where candidate skills begin and end; Equation 3 turns each variable-length segment into a fixed-length vector; Equation 4 turns those vectors into a distance matrix for clustering; and Equation 5 refines the resulting pseudo-labels into embeddings used by the sequence models. Methodologically, the pipeline combines simple offline change-point detection ideas [37], Gaussian optimal-transport geometry [8,29,5], and supervised contrastive representation learning [19]. The Results section organizes these tests around the main findings: boundary recall is easier than boundary precision, readable clusters remain source-bound, and the learned policies do not yet beat simple statistical priors.
Figure 1: Study design for automated SKILL.md generation. IW is the source dataset for trajectory segmentation, skill-library construction, and Phase 3 GRPO policy training; WebArena and BrowseComp+ are the completed held-out transfer checks. Mind2Web zero-shot and WorkArena-NLP are reported only as diagnostics, not as current GRPO transfer evidence. The paper evaluates boundary quality, cluster quality, auto-generated versus hand-crafted SKILL.md files, simple priors, and completed transfer checks.
4.1 Phase 1: Trajectory Segmentation (Skill Boundary Detection)
Given an action trajectory, we use adjacent action distance as a cheap change-point signal. For a trajectory $(a_{0},a_{1},\ldots,a_{T})$, we compute:
$$\Delta a_{t}=\|a_{t}-a_{t-1}\|_{2},\quad t\in\mathcal{B}\;\text{if}\;\Delta a_{t}>\theta, \tag{2}$$
where $\theta$ is selected on held-out IW data by sweeping empirical percentiles of $\Delta a_{t}$ and maximizing boundary F1. The boundary set $\mathcal{B}$ splits a trajectory into candidate skill segments. Each action vector has 15 normalized features: a 10-way primitive-action one-hot vector, screen coordinates $(x,y)\in[0,1]^{2}$, normalized timestamp, clipped text length, and clipped scroll amount. The Euclidean score is unweighted over these features and does not use DOM, screenshot, accessibility-tree, or language state. For transfer, we apply the IW-derived threshold directly; target-domain threshold sweeps are reported only as oracle diagnostics. On IW, F1 is stable over the 40th–60th percentile range (Appendix Table A6).
The rule is intentionally simple, but it can over-split skills: click-to-type transitions, typing-length changes, and large coordinate jumps can exceed $\theta$ even when the user remains inside one skill. A natural replacement would learn boundaries from action-prediction error, as in skill-boundary detection for unsegmented demonstrations [6]; we leave that comparison for future work.
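The thresholding rule in Equation 2 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names, the nearest-rank percentile convention, the toy two-feature action vectors, and the threshold value 0.5 are all assumptions made for the example (the paper uses 15 features and sweeps percentiles on held-out IW data).

```python
import math

def action_jumps(actions):
    """Euclidean distance between each action vector and its predecessor."""
    return [math.dist(actions[t], actions[t - 1]) for t in range(1, len(actions))]

def percentile(values, q):
    """Nearest-rank empirical percentile, q in [0, 100] (one of several conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def detect_boundaries(actions, theta):
    """Indices t where the jump from action t-1 to action t exceeds theta."""
    return [t for t, d in enumerate(action_jumps(actions), start=1) if d > theta]

def split_segments(actions, boundaries):
    """Cut the trajectory so that each boundary index starts a new segment."""
    cuts = [0] + list(boundaries) + [len(actions)]
    return [actions[s:e] for s, e in zip(cuts, cuts[1:]) if e > s]

# Toy trajectory with 2 features per action (hypothetical data).
toy = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1), (0.9, 0.9), (1.0, 0.9)]
theta = 0.5  # hypothetical threshold for this toy example
bounds = detect_boundaries(toy, theta)
segments = split_segments(toy, bounds)
print(bounds, [len(s) for s in segments])
```

Because the score is an unweighted distance, a single large coordinate jump is enough to trigger a cut, which is the over-splitting behavior the text describes.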
4.2 Phase 2: Skill Embedding (Skill Library Construction)
Step 1: Segment Representation.
Let segment $\tau_{i}$ contain $T_{i}$ action vectors $a_{i,1},\ldots,a_{i,T_{i}}\in\mathbb{R}^{d}$. We summarize the segment by the mean and diagonal variance of its action vectors:
$$\mu_{i}=\frac{1}{T_{i}}\sum_{t=1}^{T_{i}}a_{i,t},\qquad\Sigma_{i}=\mathrm{diag}\!\left(\frac{1}{T_{i}}\sum_{t=1}^{T_{i}}(a_{i,t}-\mu_{i})\odot(a_{i,t}-\mu_{i})+\epsilon\mathbf{1}\right), \tag{3}$$
where $\odot$ denotes element-wise multiplication and $\epsilon=10^{-4}$ prevents degenerate variances. This is a length-invariant bag-of-actions summary. It is not a Gaussian-mixture model and it does not preserve within-segment order. This choice is useful for cheap clustering because segments with similar action inventories are close even when they have different lengths, but it discards the sequential structure that makes many GUI skills executable: selecting before copying, copying before pasting, opening a menu before choosing an item, or navigating before filling a form. Thus a cluster can be readable while still missing the order information needed for reliable downstream composition. Despite this simplification, the representation produces clusters that align with IW ground-truth skills (Section 6.2); the downstream sequence results test whether that bag-of-actions abstraction is sufficient for policy learning.
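To make the orderless property of Equation 3 concrete, the sketch below computes the mean and diagonal variance for a toy segment and for the same segment reversed. The function name and the toy data are assumptions for illustration; only the formula and the value of epsilon come from the text.

```python
import math

def segment_summary(segment, eps=1e-4):
    """Per-feature mean and population variance (plus eps), as in Equation 3."""
    n = len(segment)
    d = len(segment[0])
    mu = [sum(a[k] for a in segment) / n for k in range(d)]
    var = [sum((a[k] - mu[k]) ** 2 for a in segment) / n + eps for k in range(d)]
    return mu, var

# Hypothetical 2-feature segment and the same actions in reverse order.
forward = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0)]
reverse = list(reversed(forward))

mu_f, var_f = segment_summary(forward)
mu_r, var_r = segment_summary(reverse)

# The two summaries coincide, so the representation cannot tell the orders apart.
same = all(math.isclose(x, y) for x, y in zip(mu_f + var_f, mu_r + var_r))
print(mu_f, var_f, same)
```

Any permutation of the actions yields the same summary, which is why a "copy then paste" segment and a "paste then copy" segment would land at the same point in this space.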
Step 2: Wasserstein Clustering.
Let $v_{i}=\mathrm{diag}(\Sigma_{i})$ denote the vector of diagonal variances. We group similar segments using the squared Bures distance between the diagonal Gaussians $\mathcal{N}(\mu_{i},\Sigma_{i})$:
$$D(\tau_{i},\tau_{j})=\|\mu_{i}-\mu_{j}\|_{2}^{2}+\left\|\sqrt{v_{i}}-\sqrt{v_{j}}\right\|_{2}^{2}=\|\mu_{i}-\mu_{j}\|_{2}^{2}+\sum_{k=1}^{d}\left(\sqrt{v_{i,k}}-\sqrt{v_{j,k}}\right)^{2}. \tag{4}$$
This is the closed-form squared 2-Wasserstein distance for diagonal Gaussian summaries, a special case of the Fréchet/Bures distance between Gaussian measures [8,29]. Because $v_{i}$ stores variances, the covariance term compares standard deviations, not raw variance vectors; using $\|v_{i}-v_{j}\|_{2}^{2}$ would be a different heuristic and is not the intended metric. The implementation follows Equation 4 by taking square roots of the diagonal variances before computing the covariance distance. We run average-linkage agglomerative clustering on this distance matrix and sweep $k$ from 8 to 16.
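A minimal sketch of Equation 4 follows, with the raw-variance variant shown next to it to illustrate why the text calls that a different heuristic. Function names and the toy summaries are assumptions; the formula is the one stated above.

```python
import math

def bures_w2_squared(mu_i, var_i, mu_j, var_j):
    """Squared 2-Wasserstein distance between diagonal Gaussians (Equation 4)."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    std_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_i, var_j))
    return mean_term + std_term

def raw_variance_heuristic(mu_i, var_i, mu_j, var_j):
    """Not the intended metric: compares variances directly instead of std devs."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    var_term = sum((a - b) ** 2 for a, b in zip(var_i, var_j))
    return mean_term + var_term

# Hypothetical segment summaries (mean vector, variance vector).
mu_a, var_a = (0.0, 0.0), (1.0, 4.0)
mu_b, var_b = (3.0, 4.0), (4.0, 9.0)

d_bures = bures_w2_squared(mu_a, var_a, mu_b, var_b)
d_raw = raw_variance_heuristic(mu_a, var_a, mu_b, var_b)
print(d_bures, d_raw)
```

A full pairwise matrix of these distances could then be passed to any average-linkage agglomerative routine, for example SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"` on the condensed distance vector; the paper does not say which library it uses.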
Step 3: Supervised-Contrastive Refinement.
Let $f_{\theta}:\mathbb{R}^{2d}\to\mathbb{R}^{d_{\text{skill}}}$ be an MLP encoder mapping a segment summary $x_{i}=[\mu_{i};\mathrm{diag}(\Sigma_{i})]$ to an $\ell_{2}$-normalized embedding $z_{i}=f_{\theta}(x_{i})$. Using the Wasserstein cluster assignments $c_{i}$ as pseudo-labels, we train $f_{\theta}$ with the supervised-contrastive loss [19]:
$$\mathcal{L}_{\text{sup-con}}=\frac{1}{|\mathcal{B}|}\sum_{i\in\mathcal{B}}\;\frac{-1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp(z_{i}^{\top}z_{p}/T)}{\sum_{a\in\mathcal{B}\setminus\{i\}}\exp(z_{i}^{\top}z_{a}/T)}, \tag{5}$$
where $\mathcal{B}$ is a mini-batch, $P(i)=\{p\in\mathcal{B}\setminus\{i\}:c_{p}=c_{i}\}$ is the set of same-pseudo-label positives for anchor $i$, $T=0.07$ is a temperature, and the denominator sums over all other batch elements. We use a class-balanced sampler so each batch contains roughly equal members per pseudo-label.
The encoder is an MLP over $D_{a}=15$ action-feature dimensions, so its input is the 30-dimensional vector $[\mu;\mathrm{diag}(\Sigma)]$. It maps $30\to 64\to 32\to d_{\text{skill}}=16$ with ReLU activations and an $\ell_{2}$-normalized output. We train for 200 epochs with AdamW (lr $=10^{-3}$, weight decay $10^{-4}$), batch size 256, temperature $T=0.07$, and an 80/20 stratified train/validation split by pseudo-label. The IW run takes less than five minutes on CPU.
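The loss in Equation 5 can be evaluated on a small batch without any deep-learning framework. The sketch below is an assumption-laden illustration: it skips anchors that have no positive in the batch and averages over the remaining anchors (Equation 5 leaves that case undefined), and the toy embeddings and labels are hypothetical.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, temperature=0.07):
    """Supervised-contrastive loss over one batch, following Equation 5."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # assumption: anchors without positives are skipped
        logits = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                  for a in range(n) if a != i}
        peak = max(logits.values())  # log-sum-exp trick for numerical stability
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total -= sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Two tight directions in 2-D; labels either agree with the geometry or not.
batch = [(1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (0.0, 3.0)]
aligned = supcon_loss(batch, [0, 0, 1, 1])
shuffled = supcon_loss(batch, [0, 1, 0, 1])
print(aligned, shuffled)
```

When pseudo-labels agree with the embedding geometry the loss is close to zero, and when they cut across it the loss is large, which is the signal the encoder is trained on.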
*Pseudo-supervision, not self-supervision.* Step 2 assigns labels by clustering. Step 3 trains an encoder to predict those labels through a supervised-contrastive objective. No human skill labels enter the clustering, but the encoder is still trained with pseudo-labels. This could simply overfit to the initial clusters. We test this with random labels and a k-means baseline in Section 6.2. Wasserstein pseudo-labels perform better than both.
4.3 Phase 3: Skill-Aware GRPO Training
After mining the skill library, we train models for skill-sequence composition. The lightweight baselines test whether the mined representation is useful without an LLM: an MLP receives the current skill embedding plus sequence position and predicts the next skill, while a 2-layer causal Transformer ($d_{\text{model}}=64$, 4 heads) attends over the skill sequence to test whether sequence context helps. The main policy is Qwen3-8B trained with GRPO from the base model on 1,275 prompts containing task context and mined skill names; its response format includes a skill label and a next-action field, but the reported IW/WebArena/BrowseComp+/WorkArena-NLP metrics evaluate the predicted skill sequence. A learned trajectory reward model scores the full prompt-response pair, and GRPO updates the base policy directly with no supervised warm start. We use 8 candidate responses per prompt at temperature 0.7, maximum completion length 192, learning rate $5\times 10^{-6}$, gradient accumulation 8, reward clipping at 5.0, and one training epoch. The completed run took 6,072 seconds on 4 NVIDIA H200 NVL GPUs with 143,771 MiB memory per GPU.
4.4 Transfer Evaluation
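For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage step in its commonly described form: rewards for the candidate responses to one prompt are normalized against that group's own mean and standard deviation. The paper does not spell out its exact variant, so the symmetric clipping to [-5, 5], the population standard deviation, and the function name are assumptions, and the reward values are invented for illustration.

```python
import statistics

def group_relative_advantages(rewards, clip=5.0, eps=1e-6):
    """Normalize clipped rewards within one prompt's group of candidates."""
    clipped = [max(-clip, min(clip, r)) for r in rewards]  # assumed symmetric clip
    mean = statistics.fmean(clipped)
    std = statistics.pstdev(clipped)
    return [(r - mean) / (std + eps) for r in clipped]

# Hypothetical reward-model scores for 8 candidates of a single prompt.
varied = group_relative_advantages([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
uniform = group_relative_advantages([2.0] * 8)
print(varied)
print(uniform)
```

When every candidate in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one way an offline reward model with little spread can limit what GRPO learns.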
We evaluate transfer with held-out benchmarks primarily through the IW-derived skill vocabulary. The claimed Phase 3 comparison covers IW as the source benchmark and WebArena/BrowseComp+ as held-out transfer checks. For Mind2Web, we separately report the verified zero-shot task-completion, action-accuracy, and element-accuracy diagnostics, but we do not use those numbers as evidence that the current GRPO policy improves primitive UI-action prediction. WorkArena-NLP is reported only as an auxiliary text-only planning diagnostic.
5 Experimental Setup
We evaluate the claimed pipeline on one source benchmark and two held-out transfer checks. IW is the source distribution: 2,000 synthetic enterprise-style trajectories with explicit skill boundaries and labels, used for segmentation, clustering, and policy training. WebArena and BrowseComp+ test whether the learned skill vocabulary transfers to external browsing traces. We also include two non-claim diagnostics: a verified Mind2Web zero-shot baseline for web-action context, and WorkArena-NLP, a text-only conversion of WorkArena task schemas into natural-language goals and structured JSON targets. Live WorkArena and current-run GRPO Mind2Web results are not part of the claimed evaluation in this submission. Appendix B gives dataset sizes, benchmark status, and diagnostic details.
For Phase 3 skill composition, the main comparison is Qwen3-8B zero-shot versus Qwen3-8B trained with GRPO from the base model. We also include zero-shot Llama-3.1-70B, DeepSeek-R1-Distill-Qwen-32B, Gemma-4-31B, OLMo-3-7B, GPT-5, Claude Sonnet 4.5, and Claude Haiku 4.5 where available, plus simple non-LLM baselines. Segmentation is scored by boundary precision, recall, and F1; clustering by NMI, silhouette, and purity; skill composition by per-step accuracy, exact sequence match, and normalized edit distance; and Mind2Web by task completion, action accuracy, and element accuracy. Additional model, metric, and hyperparameter details are in Appendix B and Appendix A.
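The three skill-composition metrics can be sketched as follows. The exact definitions live in the paper's appendix, which is not reproduced here, so two choices below are assumptions: per-step accuracy is computed over reference positions (missing predictions count as errors), and edit distance is normalized by the longer of the two sequences. The skill names are taken from the paper; the example sequences are invented.

```python
def step_accuracy(pred, ref):
    """Fraction of reference positions whose predicted skill label matches."""
    if not ref:
        return 0.0
    hits = sum(1 for p, r in zip(pred, ref) if p == r)
    return hits / len(ref)

def exact_match(pred, ref):
    """True only if the whole predicted sequence equals the reference."""
    return list(pred) == list(ref)

def normalized_edit_distance(pred, ref):
    """Levenshtein distance over skill labels, divided by the longer length."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n] / max(m, n)

ref = ["search_navigate", "data_transfer", "send_message"]
pred = ["search_navigate", "send_message"]
acc = step_accuracy(pred, ref)
em = exact_match(pred, ref)
ned = normalized_edit_distance(pred, ref)
print(acc, em, ned)
```

The example shows how the metrics can disagree: dropping one skill costs two positions in per-step accuracy (everything after the gap shifts) but only one edit in edit distance.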
6 Results
6.1 Action Jumps Find Boundaries but Also Split Skills
The first surprise is that the simplest signal works, but only in the wrong direction for a reusable skill library. Equation 2 asks whether large adjacent action changes mark true skill switches. On IW, the best threshold is $\theta=1.545$ (50th percentile), giving precision 0.419, recall 0.803, and F1 0.538 (Appendix Figure A2). High recall means most true skill boundaries do create visible action jumps. Low precision means the same jumps also happen inside a skill: users click before typing, move between distant UI elements, scroll during review, or paste text as one step of a longer data-transfer routine. The boundary detector therefore discovers many real transitions, but it over-splits ordinary within-skill behavior.
The second part of the finding is that the threshold is not domain stable. On WebArena, applying the IW-derived threshold $\theta=1.545$ gives precision 1.000, recall 0.100, and F1 0.119. A target-domain oracle sweep can choose $\theta=0.603$ and reach F1 0.851, but that uses WebArena boundary labels and is not a valid zero-shot transfer result. The oracle result only shows that WebArena map-navigation trajectories contain clear discontinuities under their own action scale. It does not show that an IW-calibrated boundary rule transfers, and it should not be read as a deployable fix for Phase 1.
6.2 The Mined Skills Are Readable but Source-Bound
The main positive result is narrower than expected: trajectory mining produces readable source-domain skills, but the structure does not automatically become a portable vocabulary. Equation 3 compresses each discovered segment into a mean/variance action profile, and Equation 4 clusters those profiles. With the Bures metric, agglomerative clustering reaches NMI = 0.650 at $k=8$ on IW (Appendix Table A7). NMI drops for larger $k$, while purity stays near 0.63, so we use $k=8$ in the main analysis.
The pseudo-label contrastive encoder strengthens this source-domain structure. After 200 epochs, KMeans in the 16-dimensional latent space reaches NMI = 0.862, silhouette = 0.554, and purity = 0.837, a 33% relative NMI gain over the Wasserstein baseline. Appendix Figure A3 shows the t-SNE projection, and Appendix C.5 reports the training curve. This result shows that the cluster-derived pseudo-labels are internally consistent enough to train an embedding model.
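The 33% figure follows directly from the two NMI values quoted above (0.650 for the Wasserstein clustering baseline and 0.862 after contrastive refinement):

```latex
\[
\frac{\mathrm{NMI}_{\text{refined}}-\mathrm{NMI}_{\text{baseline}}}{\mathrm{NMI}_{\text{baseline}}}
=\frac{0.862-0.650}{0.650}
=\frac{0.212}{0.650}
\approx 0.326 \approx 33\%.
\]
```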
The negative part is the transfer behavior. WebArena looks very different: with the IW threshold, segmentation F1 falls to 0.119, and the learned embedding is weak (NMI = 0.049, silhouette = $-0.255$). The target-tuned WebArena threshold can give F1 = 0.851, but that is an oracle diagnostic, not a transferable setting. The map-only trajectories do not contain enough skill diversity to form a rich library, and their discontinuity scale differs from IW. The mined taxonomy is therefore best read as a source-domain discovery result, not as evidence of a general GUI skill vocabulary.
Table 1 explains why the result is tempting but insufficient. Five of eight clusters have purity at least 0.95 against one IW skill. Their action profiles are also interpretable: document_edit contains click/format/save, data_transfer contains click/copy/paste, and organize_files contains click/right-click patterns. The weaker clusters, such as send_message and review_content, absorb broad click/type/scroll routines shared by several IW skills. The source-domain taxonomy is readable, but readability does not imply downstream usefulness. Appendix Figure A4 provides the full action-distribution heatmap.
Table 1: Selected $k=8$ cluster characterization on IW data. The Size column counts ground-truth segments assigned to each cluster, not primitive actions; the sizes sum to the 8,290 IW ground-truth segments. The action-share column lists nonzero within-cluster action percentages, and the trajectory column lists unique action-type sequences sorted by frequency. Clustering quality across different values of $k$ is reported in Appendix Table A7.
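Per-cluster purity, the statistic behind the "five of eight clusters" claim, is the share of a cluster's segments that carry its most common ground-truth label. The sketch below uses that standard definition; the function name and the toy assignments are assumptions, and the labels are only borrowed from the skill names mentioned in the text.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, true_labels):
    """Map each cluster id to the fraction of members with its majority label."""
    by_cluster = defaultdict(Counter)
    for c, y in zip(cluster_ids, true_labels):
        by_cluster[c][y] += 1
    return {c: max(cnt.values()) / sum(cnt.values()) for c, cnt in by_cluster.items()}

# Hypothetical assignments: cluster 0 mixes two skills, cluster 1 is clean.
clusters = [0, 0, 0, 0, 1, 1]
labels = ["send_message", "send_message", "send_message", "review_content",
          "data_transfer", "data_transfer"]
purity = cluster_purity(clusters, labels)
print(purity)
```

A purity of 1.0 says nothing about whether the cluster's label is predictable from context, which is the gap between readability and downstream usefulness that the section describes.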
6.3 Learned Skill Composition Does Not Beat Simple Priors
The strongest negative result appears after the readable clusters are used for prediction. The MLP and Transformer consume the learned segment embeddings $z_{i}$, while the GRPO policy consumes text prompts built from the same mined skill labels. Appendix Table A3 compares these pipeline variants on IW, but it is not a controlled architecture-only or reward-model ablation: input modality, model class, supervision signal, and optimizer all change at once. The comparison is still useful as a sanity check, because every variant is supposed to benefit from the mined skill structure.
The sanity check fails for the main learned policy. The GRPO-trained Qwen3-8B policy reaches 20.5% skill-step accuracy, below the MLP baseline (23.3%), the Transformer baseline (34.6%), and the trivial Frequency baseline in Appendix Table A2 (34.9%). Exact sequence match is 0% for the GRPO policy. The result should be interpreted as a failure of the current text-prompted GRPO setup as a whole, not as proof that the reward model alone is responsible. A reward-specific conclusion would require a modality-matched control, such as Qwen3-8B trained on the same text prompts with supervised next-skill loss, or an embedding-based policy trained with and without the same reward model.
One likely reason is that the useful part of a GUI skill is often ordered, while the segment embedding is not. The Phase 2 representation summarizes a segment by mean and variance, so it does not record whether a click happened before or after a type, copy, paste, or menu action. Per-skill accuracy follows this pattern: it ranges from 94.1% for search_navigate to 55.6% for presentation_edit (Table A8). Skills with distinct action signatures are easier to predict; skills that share click/type/scroll routines are harder. Appendix Figure A1 shows the per-position MLP and Transformer comparison.
6.4 GRPO Does Not Establish Transfer on Completed Checks
Table 2 gives the completed benchmark-level view. IW is the source benchmark; WebArena and BrowseComp+ are the held-out transfer checks used for the paper's policy-transfer claim. WorkArena-NLP is included in the same table only as an auxiliary text-only planning diagnostic, not as a substitute for live WorkArena. The API models are zero-shot format-constrained baselines. The current GRPO run improves IW skill-step accuracy only slightly over zero-shot Qwen3-8B (18.5% to 20.5%), decreases on WebArena (55.8% to 44.2%), is essentially unchanged on BrowseComp+ (43.5% to 43.3%), and matches zero-shot Qwen3-8B on WorkArena-NLP field accuracy (37.0% for both, with 0% exact match). We therefore limit the transfer conclusion to the completed WebArena and BrowseComp+ checks.
This is surprising because the training signal is built around the mined skills, but the verified transfer gains are absent or negative. Closed-source models improve the prompting baseline but do not dominate every benchmark: GPT-5 and Claude Sonnet 4.5 are strongest on WebArena, while OLMo-3-7B remains strongest on BrowseComp+. WorkArena-NLP completed runs range from OLMo-3-7B at 12.8% to GPT-5 at 40.6%, with DeepSeek-R1-Distill-32B at 31.2% and both Qwen3-8B variants at 37.0%; these numbers diagnose text-only schema recovery, not live enterprise automation. Appendix Table A2 reinforces the negative result: a trivial Frequency baseline is strong on IW, strictly outperforming both the MLP and GRPO policies on skill-step accuracy. The current learned skill-composition methods have therefore not shown value beyond dataset class imbalance.
Table 2: Completed model comparison for composing auto-generated skills. IW is the source benchmark; WebArena and BrowseComp+ are the held-out transfer checks used for the policy-transfer claim. Scores for these three columns are single-run skill-step percentages (%). WorkArena-NLP is an auxiliary text-only diagnostic and reports field accuracy on 100 structured-planning examples; it is not a live WorkArena substitute. API models are zero-shot format-constrained prompting baselines. The DeepSeek BrowseComp+ score is a verified near-zero result: 1/2,194 correct samples, or 0.0456%, rounded to 0.05. Best per-column scores are shown in bold.
Mind2Web is not included in Table 2 because the current GRPO run does not have a completed Mind2Web evaluation. Appendix B.5 reports only a zero-shot baseline for context.
6.5 Auto-SKILL.md Can Beat a Manual Table but Not Frequency
Figure 2: Data-efficiency comparison for generated SKILL.md. Lower normalized edit distance is better; higher exact match and mean position accuracy are better. The generated specification improves over the hand-written baseline at some sizes, including the largest setting, but it remains worse than the Frequency baseline across the evaluated sizes. This supports a cautious conclusion: trajectory-mined specifications can be competitive with simple manual tables, but the current pipeline is not yet a reliable replacement for trivial statistical baselines or manual design.
The final result concerns the artifact that motivated the pipeline: generated SKILL.md files. We run the system on IW subsets and generate skill descriptions, transition probabilities, workflows, and error-handling patterns, then compare these files with hand-written baselines. Appendix D.2 gives three same-task qualitative comparisons.
Figure 2 and Appendix Table A9 compare generated and hand-written SKILL.md files as the number of IW training trajectories changes. The important question is not whether the generated transition table looks different from the hand-written one, but whether it improves next-skill prediction. Relative to the hand-written table, the result is mixed: auto-generated SKILL.md is better at $N=100$, $250$, and $2{,}000$, but worse at $N=500$ and $1{,}000$.
The negative finding is sharper when the trivial baseline is included. Frequency has lower normalized edit distance at every evaluated training size; at $N=2{,}000$, Frequency reaches 0.485 while Auto-SKILL.md reaches 0.528. Thus automated generation can beat a simple hand-written table in some settings, but it does not yet beat the strongest trivial baseline. We therefore report the data-efficiency curve in Figure 2, which compares methods on normalized edit distance, exact sequence match, and mean position accuracy.
7 Discussion
Our results demonstrate that mined-skill methods should be evaluated against frequency and transition priors before being compared to large language models. In our experiments, the most-common-skill baseline is not a strawman: it captures class imbalance and repetitive workflow structure that learned systems can easily appear to exploit. Future work on GUI skill discovery should therefore report at least three controls: a frequency prior, a transition-memory prior, and a modality-matched supervised policy. Without these controls, improvements from mined skill labels may be mistaken for improvements from dataset imbalance or output-format adaptation.
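Two of the recommended controls are cheap to implement. The sketch below shows one plausible reading of a frequency prior (always predict the most common training skill) and a transition-memory prior (predict the most common successor of the previous skill, falling back to the frequency prior). The paper does not give its exact baseline definitions in this section, so the function names, the fallback rule, and the toy sequences are assumptions.

```python
from collections import Counter, defaultdict

def fit_frequency_prior(sequences):
    """Return the single most common skill label in the training sequences."""
    counts = Counter(skill for seq in sequences for skill in seq)
    return counts.most_common(1)[0][0]

def fit_transition_prior(sequences):
    """Map each skill to its most frequent successor in the training sequences."""
    table = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            table[prev][nxt] += 1
    return {prev: cnt.most_common(1)[0][0] for prev, cnt in table.items()}

def predict_next(prev_skill, transition, fallback):
    """Transition-memory prediction with a frequency-prior fallback."""
    return transition.get(prev_skill, fallback)

# Hypothetical training sequences using skill names from the paper.
train = [
    ["search_navigate", "data_transfer", "send_message"],
    ["search_navigate", "data_transfer"],
    ["search_navigate", "send_message"],
]
most_common = fit_frequency_prior(train)
transition = fit_transition_prior(train)
seen = predict_next("search_navigate", transition, most_common)
unseen = predict_next("organize_files", transition, most_common)
print(most_common, seen, unseen)
```

Neither prior looks at the task description or the GUI state, so any learned policy that fails to beat them has not yet demonstrated that it uses the mined skill structure.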
The small Transformer beats the GRPO-trained Qwen3-8B policy on IW, but that comparison changes input modality and optimization objective at the same time: the Transformer consumes continuous segment embeddings and is trained directly for sequence prediction, while Qwen3-8B consumes text trajectory prompts and is optimized through a learned trajectory reward model. These results show that the current pipeline can produce readable transition priors, but they do not show that the learned components capture reusable skill-composition structure beyond dataset imbalance.
The limitations point to concrete next steps. The contrastive encoder uses cluster-derived pseudo-labels, so the pipeline is not fully unsupervised; IW is synthetic and may not capture real enterprise complexity; and the Phase 1 $\ell_{2}$ boundary heuristic should be compared against learned action-prediction-error segmenters before being treated as robust. Stronger claims would require completed Mind2Web GRPO and live WorkArena evaluations; the present submission does not make those claims. Better follow-up experiments would mix IW with target-domain prompts, replace the bag-of-actions segment representation with an order-aware encoder, train a denser action-level or task-success reward model, and run factorial ablations that hold two stages fixed while replacing the third with ground-truth boundaries, oracle target-domain boundaries, or supervised controls. Because CUAs can automate useful repetitive work but can also be misused, generated SKILL.md artifacts should remain reviewable by humans before deployment.
8 Conclusion
We mined skill specifications from GUI trajectories and tested what transfers. The mined skills are readable on the source data: five of eight clusters reach at least 0.95 purity. The trained policy does not transfer well. GRPO from base Qwen3-8B reaches 20.5% IW skill-step accuracy versus 18.5% zero-shot, and 43.3% BrowseComp+ skill-step accuracy versus 43.5% zero-shot. The learned components also fail important sanity checks: a Frequency baseline beats the proposed MLP and GRPO policies on IW skill-step accuracy, and it beats Auto-SKILL.md on edit distance across all evaluated data sizes. Mined skills are useful as inspectable structure and reward-model training signal. They are not yet a reliable replacement for trivial statistical baselines, manual design, or a cross-domain policy.
References
- [1] (2021) OPAL: offline primitive discovery for accelerating offline reinforcement learning. In International Conference on Learning Representations.
- [2] P. Bacon, J. Harb, and D. Precup (2017) The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence.
- [3] H. Bai, Y. Zhou, M. Cemri, J. Pan, A. Suhr, S. Levine, and A. Kumar (2024) DigiRL: training in-the-wild device-control agents with autonomous reinforcement learning. In Advances in Neural Information Processing Systems.
- [4] M. Chen, Y. Li, Y. Yang, S. Yu, B. Lin, and X. He (2024) AutoManual: constructing instruction manuals by LLM agents via interactive environmental learning. In Advances in Neural Information Processing Systems.
- [5] M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transportation distances. In Advances in Neural Information Processing Systems, pp. 2292–2300.
- [6] J. Deng, Z. Wang, S. Cai, A. Liu, and Y. Liang (2025) Open-world skill discovery from unsegmented demonstration videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10708–10718.
- [7] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su (2023) Mind2Web: towards a generalist agent for the web. Advances in Neural Information Processing Systems 36, pp. 28091–28114.
- [8] D. C. Dowson and B. V. Landau (1982) The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis 12(3), pp. 450–455.
- [9] A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. Del Verme, T. Marty, L. Boisvert, M. Thakkar, Q. Cappart, D. Vazquez, N. Chapados, and A. Lacoste (2024) WorkArena: how capable are web agents at solving common knowledge work tasks?. In International Conference on Machine Learning.
- [10] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine (2019) Diversity is all you need: learning skills without a reward function. In International Conference on Learning Representations.
- [11] B. Eysenbach, R. Salakhutdinov, and S. Levine (2022) The information geometry of unsupervised reinforcement learning. In International Conference on Learning Representations.
- [12] K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman (2018) Meta learning shared hierarchies. In International Conference on Learning Representations.
- [13] D. Ghosh, A. Gupta, and S. Levine (2019) Learning actionable representations with goal-conditioned policies. In International Conference on Learning Representations.
- [14] K. Gregor, D. J. Rezende, and D. Wierstra (2017) Variational intrinsic control. In International Conference on Learning Representations.
- [15] Y. Guo, W. Yang, S. Yang, et al. (2026) OpAgent: operator agent for web navigation. arXiv preprint arXiv:2602.13559.
- [16] Y. Hao, K. Alhamoud, H. Jeong, H. Zhang, I. Puri, P. Torr, M. Schaekermann, A. D. Stern, and M. Ghassemi (2025) MedPAIR: measuring physicians and AI relevance alignment in medical question answering. arXiv preprint arXiv:2505.24040.
- [17] Y. Hao, J. Holmes, M. R. Waddle, B. J. Davis, N. Y. Yu, K. Vickers, H. Preston, D. Margolin, C. E. Lockenhoff, A. Vashistha, S. Kalantari, M. Ghassemi, and W. Liu (2025) Personalizing prostate cancer education for patients using an EHR-integrated LLM agent. npj Digital Medicine.
- [18] Y. Hao, Z. Liu, B. Riter, and S. Kalantari (2024) Advancing patient-centered shared decision-making with AI systems for older adult cancer patients. In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–19.
- [19] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. In Advances in Neural Information Processing Systems.
- [20] J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. C. Lim, P. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried (2024) VisualWebArena: evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.
- [21] T. D. Kulkarni, K. R. Narasimhan, A. Saeedi, and J. B. Tenenbaum (2016) Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems.
- [22] M. Laskin, H. Liu, X. B. Peng, D. Yarats, A. Rajeswaran, and P. Abbeel (2022) Unsupervised reinforcement learning with contrastive intrinsic control. In Advances in Neural Information Processing Systems.
- [23] S. Li, R. Wang, M. Tang, and C. Zhang (2019) Hierarchical reinforcement learning with advantage-based auxiliary rewards. In Advances in Neural Information Processing Systems.
- [24] X. Li, M. Gao, Y. Hao, T. Li, G. Wan, Z. Wang, and Y. Wang (2025) MedGUIDE: benchmarking clinical decision-making in large language models. arXiv preprint arXiv:2505.11613.
- [25] X. Li, M. Gao, Z. Zhang, J. Fan, and W. Li (2025) Data-adaptive safety rules for training reward models. arXiv preprint arXiv:2501.15453.
- [26] X. Li, M. Gao, Z. Zhang, C. Yue, and H. Hu (2024) Selection of LLM fine-tuning data based on orthogonal rules. arXiv preprint arXiv:2410.04715.
- [27] G. Liu, P. Zhao, L. Liu, Z. Chen, Y. Chai, S. Ren, H. Wang, S. He, and W. Meng (2025) LearnAct: few-shot mobile GUI agent with a unified demonstration benchmark. arXiv preprint arXiv:2504.13805.
- [28] O. Nachum, S. Gu, H. Lee, and S. Levine (2018) Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems.
- [29] G. Peyré and M. Cuturi (2019) Computational optimal transport: with applications to data science. Foundations and Trends in Machine Learning 11(5–6), pp. 355–607.
- [30] Z. Qi, X. Liu, I. L. Iong, H. Lai, X. Sun, W. Zhao, Y. Yang, X. Yang, J. Sun, S. Yao, T. Zhang, W. Xu, J. Tang, and Y. Dong (2025) WebRL: training LLM web agents via self-evolving online curriculum reinforcement learning. In International Conference on Learning Representations.
- [31] M. Riemer, M. Liu, and G. Tesauro (2018) Learning abstract options. In Advances in Neural Information Processing Systems.
- [32] G. Sarch, L. Jang, M. J. Tarr, W. W. Cohen, K. Marino, and K. Fragkiadaki (2024) VLM agents generate their own memories: distilling experience into embodied programs of thought. In Advances in Neural Information Processing Systems.
- [33] A. Sharma, S. Gu, S. Levine, V. Kumar, and K. Hausman (2020) Dynamics-aware unsupervised discovery of skills. In International Conference on Learning Representations.
- [34] Q. Sun, K. Cheng, Z. Ding, C. Jin, Y. Wang, F. Xu, Z. Wu, C. Jia, L. Chen, Z. Liu, B. Kao, G. Li, J. He, Y. Qiao, and Z. Wu (2025) OS-Genesis: automating GUI agent trajectory construction via reverse task synthesis. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, pp. 5555–5579.
- [35] R. S. Sutton, D. Precup, and S. Singh (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1-2), pp. 181–211.
- [36] Y. Tian, J. Chen, L. Zheng, M. Tao, X. Zeng, Z. Yin, H. Su, and X. Sun (2026) Skills-coach: a self-evolving skill optimizer via training-free GRPO. arXiv preprint arXiv:2604.27488.
- [37] C. Truong, L. Oudre, and N. Vayatis (2020) Selective review of offline change point detection methods. Signal Processing 167, pp. 107299.
- [38] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu (2017) FeUdal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 3540–3549.
- [39] X. Wang, B. Wang, D. Lu, et al. (2025) OpenCUA: open foundations for computer-use agents. arXiv preprint arXiv:2508.09123.
- [40] Z. Z. Wang, J. Mao, D. Fried, and G. Neubig (2024) Agent workflow memory. arXiv preprint arXiv:2409.07429.
- [41] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu (2024) OSWorld: benchmarking multimodal agents for open-ended tasks in real computer environments. In Advances in Neural Information Processing Systems.
- [42] Y. Xu, D. Lu, Z. Shen, J. Wang, Z. Wang, Y. Mao, C. Xiong, and T. Yu (2025) AgentTrek: agent trajectory synthesis via guiding replay with web tutorials. In International Conference on Learning Representations.
- [43] Y. Yang, Z. Yang, Z. Dou, et al. (2025) UltraCUA: a foundation model for computer use agents with hybrid action. arXiv preprint arXiv:2510.17790.
- [44] S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022) WebShop: towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems 35, pp. 20744–20757.
- [45] B. Zheng, M. Y. Fatemi, X. Jin, Z. Z. Wang, A. Gandhi, Y. Song, Y. Gu, J. Srinivasa, G. Liu, G. Neubig, and Y. Su (2025) SkillWeaver: web agents can self-improve by discovering and honing skills. arXiv preprint arXiv:2504.07079.
- [46] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, Y. Bisk, D. Fried, U. Alon, and G. Neubig (2024) WebArena: a realistic web environment for building autonomous agents. In International Conference on Learning Representations.
- [47] Y. Zhou, Q. Yang, K. Lin, M. Bai, X. Zhou, Y. Wang, S. Levine, and E. Li (2025) Proposer-agent-evaluator (PAE): autonomous skill discovery for foundation model internet agents. In Proceedings of the 42nd International Conference on Machine Learning.
Appendix A Appendix
A.1 Additional GRPO Training Sessions
We also run a scale-control GRPO session on Llama-3.1-70B-Instruct with quantized low-rank adaptation (QLoRA) adapters. This anti-collapse rerun starts from the base model, uses 2 candidate responses per prompt, maximum completion length 64, learning rate $10^{-5}$, gradient accumulation 16, and one epoch. The reward is a hand-weighted skill-format reward with components correct_skill (+1.0), format (+0.1), reasoning (+0.1), and invalid (-0.3). The run completes 1,149 optimizer steps in 224,341 seconds on 8 NVIDIA RTX A6000 GPUs with 49,140 MiB memory per GPU, with final training loss $-1.16\times 10^{-4}$, mean reward 0.529, reward standard deviation 0.460, and mean entropy 0.277.
A.2 Hyperparameters
Table A1 summarizes the model and optimizer settings used for the encoder, sequence baseline, and GRPO runs.
Table A1: Complete hyperparameters for all model configurations.
A.3 Assets, Licenses, and New Artifacts
We use third-party benchmarks and models only as cited research artifacts or through their public provider interfaces. The new artifacts introduced by this work are the synthetic IW trajectories, mined cluster assignments, generated SKILL.md specifications, evaluation prompts/predictions derived from our runs, and the WorkArena-NLP diagnostic conversion.
Section 5 documents the dataset composition and curation logic; Appendix B documents benchmark status and diagnostics; Appendix D documents generated skill specifications and examples. Code and public-release artifacts are available through the anonymous project repository linked in the main text.
A.4 GRPO Reward Model
The current GRPO run uses a learned trajectory reward model rather than a hand-weighted format reward. The reward model is a Qwen3-8B transformer with a scalar sequence-classification head, loaded through AutoModelForSequenceClassification(num_labels=1). For an input string $x$, it returns one real-valued logit $r_{\phi}(x)$. It is trained with the Transformer Reinforcement Learning (TRL) library's RewardTrainer on pairwise preferences over full skill plans. The input text is the same planning prompt used by the policy concatenated with a candidate completion of the form [Plan] skill_1 -> skill_2 -> ....
The reward-model data comes from the train and validation JSONL files under data/interaskill/conversations/. These files contain 1,275 train conversations and 225 validation conversations with task text, message history, and a ground-truth skill_flow. We keep examples with at least two skills. For each prompt, the ground-truth skill flow is the chosen completion. Rejected completions are synthetic hard negatives generated by adjacent swaps, single-skill replacements, deletions, insertions, global swaps, and duplicate/mutation operations. The training run uses three negatives per train prompt and two negatives per validation prompt, producing approximately 3,825 train preference pairs and 450 validation pairs.
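The hard-negative construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the skill vocabulary subset, the function names, and the sampling loop are assumptions, and only four of the listed perturbation types are shown.

```python
import random

# Illustrative subset of skill names; the real vocabulary has 12 skill types.
SKILLS = ["document_edit", "search_navigate", "send_message", "organize_files"]

def adjacent_swap(plan, rng):
    """Swap two neighbouring skills."""
    i = rng.randrange(len(plan) - 1)
    out = list(plan)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def replace_one(plan, rng):
    """Replace a single skill with a different one."""
    i = rng.randrange(len(plan))
    out = list(plan)
    out[i] = rng.choice([s for s in SKILLS if s != plan[i]])
    return out

def delete_one(plan, rng):
    """Drop a single skill."""
    i = rng.randrange(len(plan))
    return list(plan[:i]) + list(plan[i + 1:])

def insert_one(plan, rng):
    """Insert one extra skill at a random position."""
    i = rng.randrange(len(plan) + 1)
    return list(plan[:i]) + [rng.choice(SKILLS)] + list(plan[i:])

def make_negatives(plan, k, seed=0):
    """Sample up to k distinct near-miss plans for a plan with >= 2 skills."""
    rng = random.Random(seed)
    ops = [adjacent_swap, replace_one, delete_one, insert_one]
    negatives, attempts = [], 0
    while len(negatives) < k and attempts < 100 * k:
        attempts += 1
        cand = rng.choice(ops)(plan, rng)
        # Discard empty plans, no-op edits, and duplicates.
        if cand and cand != list(plan) and cand not in negatives:
            negatives.append(cand)
    return negatives

def format_plan(plan):
    """Render a plan in the completion format quoted in the text."""
    return "[Plan] " + " -> ".join(plan)
```

With three negatives per train prompt and two per validation prompt, 1,275 and 225 conversations give at most 3,825 and 450 pairs, which matches the approximate counts reported above.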
Preference margins are generated heuristically rather than from held-out benchmark correctness. A candidate plan is scored against the ground-truth skill flow by a weighted combination of prefix match (0.45), longest-common-subsequence overlap (0.30), unordered skill overlap (0.20), and length agreement (0.05). The chosen plan receives the self-score; each rejected plan receives its heuristic score, and the preference margin $m$ is clipped to $[0.05, 0.95]$. TRL's reward training objective is a pairwise logistic ranking loss over the chosen and rejected scalar logits, with the margin requiring the chosen score to exceed the rejected score by more than the heuristic gap:
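A small sketch of a heuristic score with the four stated weights may help make the margin concrete. The weights and the clipping range come from the text; how each component is normalized (by gold length, by the longer sequence, by Jaccard overlap) is an assumption made here for illustration.

```python
def prefix_match(pred, gold):
    """Fraction of the gold plan matched before the first disagreement."""
    n = 0
    for a, b in zip(pred, gold):
        if a != b:
            break
        n += 1
    return n / max(len(gold), 1)

def lcs_len(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def heuristic_score(pred, gold):
    """Weighted similarity in [0, 1]; normalizations are assumed, weights are from the text."""
    denom = max(len(pred), len(gold), 1)
    union = set(pred) | set(gold)
    prefix = prefix_match(pred, gold)
    lcs = lcs_len(pred, gold) / denom
    overlap = len(set(pred) & set(gold)) / len(union) if union else 1.0
    length = 1.0 - abs(len(pred) - len(gold)) / denom
    return 0.45 * prefix + 0.30 * lcs + 0.20 * overlap + 0.05 * length

def preference_margin(chosen, rejected, gold):
    """Gap between chosen and rejected scores, clipped to [0.05, 0.95]."""
    gap = heuristic_score(chosen, gold) - heuristic_score(rejected, gold)
    return min(0.95, max(0.05, gap))
```

Under these assumed normalizations the ground-truth plan scores 1.0 against itself, so the margin for a near-miss is one minus its heuristic score before clipping.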
$$\mathcal{L}_{\mathrm{RM}} = -\log\sigma\!\left(r_{\phi}(p, y^{+}) - r_{\phi}(p, y^{-}) - m\right), \tag{6}$$
where $p$ is the prompt, $y^{+}$ is the ground-truth skill-flow completion, and $y^{-}$ is a synthetic near-miss plan. Thus the reward model learns to rank IW-style skill plans by similarity to the annotated skill flow. It is not trained on WebArena, WorkArena, BrowseComp+, Mind2Web, or live task-success labels.
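For a single preference pair, the loss in Eq. (6) reduces to a softplus of the negated margin-adjusted logit gap. The scalar sketch below is only an illustration of that arithmetic, not the TRL implementation; the function name is hypothetical.

```python
import math

def margin_ranking_loss(r_chosen, r_rejected, margin):
    """-log(sigmoid(r_chosen - r_rejected - margin)), computed stably."""
    z = r_chosen - r_rejected - margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); split to avoid overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the logit gap exactly equals the margin, the loss is $\log 2$; it falls toward zero only once the chosen logit exceeds the rejected logit by more than $m$.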
There is no separate methodology for labeling "held-out benchmark correctness" for reward-model training: those labels are not used. Validation during reward-model training uses held-out IW conversations from the validation split in data/interaskill/conversations/ with the same synthetic preference construction as training. Held-out benchmark correctness is measured only after GRPO by running the policy evaluator on the completed IW, WebArena, and BrowseComp+ checks. This distinction is important because the learned reward is a proxy for IW skill-flow similarity, not an estimator of cross-domain task success.
Reward-model training uses Qwen/Qwen3-8B, maximum length 768, one epoch, per-device batch size 1, gradient accumulation 32 in the reported script, learning rate $10^{-5}$, weight decay 0.01, warmup ratio 0.1, cosine scheduling, bf16, and gradient checkpointing. The reported script can optionally load the reward backbone in 4-bit, but the final trajectory reward model used by GRPO is saved as a standard sequence-classification checkpoint. During GRPO, the reward function concatenates the policy prompt and completion, tokenizes to maximum length 1024, runs the reward model, and returns the scalar logit. The reported GRPO run clips this scalar to $[-5, 5]$ before within-group advantages are computed. This design avoids directly rewarding superficial format features, but it means reward quality is limited by the IW skill-flow preference construction and by the mismatch between text-plan rewards and downstream benchmark correctness.
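The clipping step and its interaction with group-relative advantages can be illustrated as below. The clip range is from the text; the mean-and-standard-deviation normalization is the commonly used GRPO formulation and is assumed here rather than confirmed for the reported run.

```python
from statistics import mean, pstdev

def clip_reward(r, lo=-5.0, hi=5.0):
    """Clip a raw reward-model logit to the stated range."""
    return max(lo, min(hi, r))

def group_advantages(raw_rewards, eps=1e-6):
    """Standardize clipped rewards within one prompt's candidate group (assumed form)."""
    clipped = [clip_reward(r) for r in raw_rewards]
    mu = mean(clipped)
    sigma = pstdev(clipped)
    return [(r - mu) / (sigma + eps) for r in clipped]
```

One consequence worth noting: if every candidate in a group receives the same clipped reward, all advantages are zero and that group contributes no learning signal.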
Appendix B Benchmark Status and Transfer Diagnostics
B.1 Detailed Experimental Setup
Datasets.
The claimed pipeline evaluation uses one source dataset and two held-out transfer checks, with two additional diagnostics. IW is the source dataset because it provides ground-truth skill boundaries and labels: 2,000 synthetic interaction trajectories spanning 12 skill types (e.g., document_edit, search_navigate, send_message), totaling 8,290 ground-truth segments and 40,774 primitive actions. We curate IW to cover repeated enterprise routines such as document editing, communication, file organization, status monitoring, and cross-application data transfer. WebArena [46] contributes 1,000 real map-navigation trajectories with 3,140 segments; it is useful as a stress test for realistic but homogeneous web behavior. BrowseComp+ contains trajectories derived from BrowseComp and tests complex browsing transfer with the same skill-aware response format as IW. Mind2Web [7] is used only for a verified zero-shot diagnostic in this submission. WorkArena [9] is the closest enterprise transfer benchmark, but live WorkArena results are not claimed; WorkArena-NLP is a text-only diagnostic constructed by converting WorkArena task schemas into natural-language goals paired with structured JSON targets across navigation, sorting, filtering, record creation, and service-catalog ordering.
Models and baselines.
For Phase 3 skill composition, we evaluate Qwen3-8B in zero-shot and GRPO-trained settings. We compare against zero-shot Llama-3.1-70B, DeepSeek-R1-Distill-Qwen-32B, Gemma-4-31B, OLMo-3-7B, GPT-5, Claude Sonnet 4.5, and Claude Haiku 4.5. We additionally include three non-LLM baselines: a manually authored SKILL.md transition table, AWM-style bigram transitions [40], and a most-frequent-skill predictor.
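The two statistical baselines are simple enough to sketch in a few lines. The code below is an assumed minimal form of a most-frequent-skill predictor and a bigram transition table; it is not the AWM implementation, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def fit_frequency(sequences):
    """Return the single most common skill across all training sequences."""
    counts = Counter(skill for seq in sequences for skill in seq)
    return counts.most_common(1)[0][0]

def fit_bigram(sequences):
    """Count next-skill transitions conditioned on the previous skill."""
    table = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            table[prev][nxt] += 1
    return table

def predict_next(prev, table, fallback):
    """Most likely next skill after prev; back off to the frequency prior."""
    if prev in table and table[prev]:
        return table[prev].most_common(1)[0][0]
    return fallback
```

Neither baseline looks at the task text, which is why beating them is a meaningful minimum bar for a learned skill-composition policy.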
Metrics.
For Phase 1 segmentation we report boundary precision, recall, and F1 against ground-truth skill boundaries. For Phase 2 clustering we report Normalized Mutual Information (NMI), silhouette score, and purity, both on the raw Wasserstein clusters and on the supervised-contrastive refined latent space. For Phase 3 skill composition we report per-position accuracy, exact sequence match rate, and normalized edit distance between predicted and ground-truth skill sequences. Per-position accuracy reveals error compounding along the sequence; edit distance captures the cost of fixing a predicted sequence in tokens. For the Mind2Web zero-shot diagnostic, a task is counted as complete under task completion rate (TCR) only when every predicted action matches the ground-truth action in sequence; we additionally report per-step action accuracy and element accuracy.
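The three Phase 3 metrics can be written down compactly. In this sketch the edit distance is the standard Levenshtein distance over skill labels; dividing by the longer sequence length is an assumed normalization, since the text does not state which one is used.

```python
def edit_distance(a, b):
    """Levenshtein distance between two skill sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred, gold):
    """Edit distance divided by the longer length (assumed normalization)."""
    denom = max(len(pred), len(gold))
    return edit_distance(pred, gold) / denom if denom else 0.0

def exact_match(pred, gold):
    """1.0 only when the whole predicted sequence equals the gold sequence."""
    return float(list(pred) == list(gold))

def per_position_accuracy(preds, golds):
    """Accuracy at each step index, over sequences long enough to have that step."""
    max_len = max(len(g) for g in golds)
    acc = []
    for t in range(max_len):
        pairs = [(p[t] if t < len(p) else None, g[t])
                 for p, g in zip(preds, golds) if t < len(g)]
        acc.append(sum(p == g for p, g in pairs) / len(pairs))
    return acc
```

A prediction that is wrong at one position still scores well on edit distance and per-position accuracy but fails exact match, which is why the three metrics are reported together.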
B.2 Additional Transfer Tables
Tables A2–A5 provide supporting transfer results that are useful for context but are not the headline claims in the main text. They also expose a negative result: on IW, the most-common-skill Frequency baseline is stronger than the proposed MLP and GRPO policies, so these learned methods should not be interpreted as solving skill composition.
Table A2: Auto-generated vs. hand-crafted skill specifications. IW Acc. is InteraSkill Workflows skill-sequence accuracy and WebArena (WA) Acc. is WebArena skill-sequence accuracy. Mean position accuracy is reported. The hand-crafted baseline is a simple transition table, not an optimized expert system. The Frequency baseline is trivial but strong, outperforming the proposed MLP and GRPO policies on IW skill-step accuracy.
Table A3: Skill composition on IW data using mined skills. This table is not a controlled input-modality or reward-model ablation: the MLP and Transformer operate on continuous segment embeddings, while the GRPO-trained Qwen3-8B policy operates on text trajectory prompts and is optimized through a learned reward model. The rows should therefore be read as pipeline variants, not as evidence that the reward model alone causes the performance gap.
Figure A1: Next-skill prediction accuracy for MLP versus Transformer on IW data. Step 2 predicts the second skill given the ground-truth first skill; the seeded first skill at Step 1 is omitted. Both models operate on learned skill embeddings from Phase 2. Accuracy degrades at later steps due to error compounding, with the Transformer's sequential attention providing consistent improvement.
Table A4: Verified Mind2Web zero-shot baseline. The current GRPO run requires a fresh Mind2Web evaluation before it can be compared here.
Table A5: Per-domain Mind2Web zero-shot results (test_task split).
B.3 WorkArena Diagnostic Status
Live WorkArena is not part of the claimed evaluation in this submission. It would be a stronger test of enterprise transfer than WebArena because its tasks involve knowledge-base lookup, record editing, forms, lists, dashboards, and service catalog workflows, which are close to the workflow types that IW tries to simulate.
We added an offline WorkArena evaluation path for future use. It uses the same conversation format as IW, WebArena, and BrowseComp+. It expects WorkArena rollouts under data/workarena/trajectories/, converts them to JSONL conversations under data/workarena/conversations/, and evaluates skill prediction with interaskill.eval_model. Because the rollout file has not been collected, no live WorkArena result is reported or used in the paper's transfer claims.
We also construct a text-only WorkArena-NLP variant for diagnosing whether models understand the enterprise task specifications independently of browser control. On the first 100 examples, completed runs reach 12.8–40.6% field accuracy and 0% exact match, showing that partial schema recovery is possible but full structured reconstruction remains difficult. This is not a replacement for live WorkArena: it removes visual grounding, state tracking, clicking, and form interaction. It instead tests instruction understanding and workflow planning. Appendix B.4 gives the conversion details.
B.4 Text-Only WorkArena Conversion
WorkArena is originally a live computer-using-agent benchmark. A model observes a ServiceNow page and must issue browser actions such as click, fill, select_option, scroll, and press. The ground truth is therefore an environment state, not a text label. For example, the basic menu-navigation task is correct only when the browser reaches the exact target ServiceNow URL; form and list tasks are correct only when the corresponding record, filter, sort order, or catalog order exists in the web application.
To separate task understanding from browser grounding, we define a derived text-only benchmark, WorkArena-NLP. We do not use screenshots, accessibility trees, or live browser state. Instead, we read WorkArena's configuration files and task schemas and convert each task into a pair $(g, y)$, where $g$ is the natural-language goal shown to the agent and $y$ is a structured JSON plan containing the fields that the live task evaluator would eventually check. The model is prompted with $g$ and must output $y$.
The conversion preserves the semantic target of each task family:
- Navigation tasks from all_menu.json become structured plans with task_type, application, module, target_url, and required_final_action.
- Sorting and filtering tasks expose the list name, filter kind, filter fields, filter values, sort fields, and sort directions.
- Record-creation tasks expose the ServiceNow table, requested fields, field labels, target values, and required_final_action=submit_record.
- Service-catalog orders expose the item, description, quantity, item configuration values, and required_final_action=order_item.
The generated dataset contains 2,210 examples: 1,000 navigation examples, 300 sort examples, 300 filter examples, 250 create-record examples, and 360 catalog-order examples. We score model outputs by parsing the returned JSON and comparing it with the expected plan. Exact match requires all flattened fields to match after simple string normalization. Field accuracy is the fraction of expected flattened fields that match. This metric is intentionally stricter than semantic similarity but easier to audit.
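The scoring rule described above (parse the JSON, flatten it, normalize strings, compare fields) might look like the following sketch. The dotted key paths, the lowercase-and-collapse-whitespace normalization, and the treatment of extra predicted fields (ignored here) are assumptions; the example plan values are hypothetical.

```python
import json

def normalize(value):
    """Simple string normalization: lowercase and collapse whitespace (assumed)."""
    return " ".join(str(value).strip().lower().split())

def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into {dotted.path: normalized_string}."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        flat[prefix.rstrip(".")] = normalize(obj)
    return flat

def score(model_output, expected):
    """Field accuracy and exact match of a JSON string against the expected plan."""
    try:
        predicted = flatten(json.loads(model_output))
    except (json.JSONDecodeError, TypeError):
        predicted = {}  # unparseable output matches no fields
    target = flatten(expected)
    hits = sum(predicted.get(key) == value for key, value in target.items())
    field_accuracy = hits / len(target) if target else 0.0
    exact = float(bool(target) and hits == len(target))
    return {"field_accuracy": field_accuracy, "exact_match": exact}
```

Because exact match requires every flattened field to agree, a model can recover most of the schema and still score 0% exact match, which is the pattern reported for the first 100 examples.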
WorkArena-NLP should be interpreted as a diagnostic benchmark. It answers whether a model can parse an enterprise workflow instruction into the right structured intent. It does not evaluate visual grounding, dynamic state tracking, login/session handling, menu expansion, clicking, form entry, or recovery from web-interface errors. Thus, WorkArena-NLP results are complementary to live WorkArena results rather than substitutes for them.
B.5 Mind2Web Diagnostic Status
Mind2Web would be a stronger test of cross-domain web action transfer than the skill-sequence checks in Table 2, but the current GRPO run does not have a completed Mind2Web evaluation. Table A4 therefore reports only the verified Mind2Web zero-shot baseline as context, not as evidence for the proposed policy.
Base Qwen3-8B reaches 9.5% task completion, 0.273 reward, 28.8% action accuracy, and 33.8% element accuracy on Mind2Web test_task. A future GRPO evaluation would need to beat this number before the paper could claim Mind2Web transfer. Zero-shot performance varies by domain: Shopping is easiest and Travel is hardest (Table A5). The present submission makes no Mind2Web or live WorkArena transfer claim.
The current result does not demonstrate whether the IW skill vocabulary transfers to Mind2Web. It only shows that the source-domain clusters are readable and that the current policy result is weak on the benchmarks we verified.
Appendix C Skill Discovery Diagnostics
C.1 Segmentation Threshold Sensitivity
Table A6 reports the source-domain sweep used to select the action-discontinuity threshold $\theta$.
Table A6: Sensitivity of segmentation F1 to the percentile used for $\theta$ on IW. Stable within a $\pm 10$-percentile window around the optimum.
Figure A2: IW action-discontinuity scores. The dashed line marks the selected source-domain threshold $\theta=1.545$.
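A thresholded discontinuity segmenter of the kind swept here can be sketched as follows. This is a guess at the general shape only: the per-step feature vectors, the use of an $\ell_2$ distance between consecutive steps as the discontinuity score, and the nearest-rank percentile rule are all assumptions, not the authors' Phase 1 code.

```python
import math

def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def discontinuity_scores(features):
    """Score each transition between consecutive steps."""
    return [l2(features[t], features[t + 1]) for t in range(len(features) - 1)]

def percentile_threshold(scores, q):
    """Nearest-rank q-th percentile of the scores (assumed percentile rule)."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def segment(features, theta):
    """Cut the trajectory wherever the discontinuity score exceeds theta."""
    cuts = [t + 1 for t, s in enumerate(discontinuity_scores(features)) if s > theta]
    edges = [0] + cuts + [len(features)]
    return list(zip(edges, edges[1:]))
```

For example, `segment(features, 1.545)` would return half-open `(start, end)` index pairs; sweeping `q` in `percentile_threshold` and re-scoring boundary F1 reproduces the kind of sensitivity analysis summarized in Table A6.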
C.2 Embedding Visualization
Figure A3 visualizes the learned segment embedding space used to inspect whether pseudo-label refinement separates source-domain skill types.
Figure A3: t-distributed stochastic neighbor embedding (t-SNE) projection of 16-dim skill embeddings on the IW benchmark. Colors and numbered badges indicate ground-truth skill types, with the embedded legend mapping each badge to a skill name. Well-separated groups confirm discriminative representation learning.
C.3 Wasserstein Cluster Count Sweep
Table A7 reports the unsupervised cluster-count sweep used to choose the $k=8$ clustering analyzed in the main text.
Table A7: Wasserstein clustering quality across different numbers of clusters $k$ on IW. We use $k=8$ in the main analysis because it gives the best NMI under the corrected diagonal 2-Wasserstein metric.
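For two Gaussians with diagonal covariances, the 2-Wasserstein distance has the closed form $W_2^2 = \lVert\mu_1-\mu_2\rVert_2^2 + \lVert\sigma_1-\sigma_2\rVert_2^2$, where $\sigma$ denotes per-dimension standard deviations. The sketch below implements that formula; summarizing each segment as a diagonal Gaussian is an assumption about how the metric is applied, and the function name is hypothetical.

```python
import math

def w2_diagonal(mu1, var1, mu2, var2):
    """2-Wasserstein distance between N(mu1, diag(var1)) and N(mu2, diag(var2))."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # With diagonal covariances the covariance term reduces to a squared
    # difference of per-dimension standard deviations.
    std_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2))
    return math.sqrt(mean_term + std_term)
```

The standard deviations, not the variances, enter the second term; comparing variances directly would give a different quantity.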
C.4 Action Distribution Diagnostic
Figure A4 gives the action-type distribution behind the qualitative cluster descriptions in Table 1.
Figure A4: Per-cluster action-type distributions for the 8 auto-discovered clusters on IW data. Each row is a cluster, labelled with its dominant ground-truth skill and purity; columns are action types.
C.5 Contrastive Training Curves
Figure A5 shows the supervised-contrastive optimization trajectory for the skill encoder.
Figure A5: Supervised-contrastive training and validation loss over 200 epochs. The horizontal axis denotes epoch. Smooth convergence with stable validation loss indicates no overfitting.
C.6 Per-Skill Accuracy and Confusion Matrix
Table A8 reports skill-level accuracy and Figure A6 shows how discovered clusters align with ground-truth skill labels. Here accuracy is the per-step next-skill, or tool-call, classification accuracy: among held-out IW steps whose ground-truth next skill is a given row label, it is the fraction for which the model predicts that same skill label. It is not primitive UI-action accuracy and not exact sequence match.
Table A8: Per-skill next-skill/tool-call prediction accuracy on IW data, sorted by accuracy. Std. is the binomial standard error over per-step correctness for the single held-out evaluation, and 95% CI is the Wilson interval.
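Both uncertainty columns follow from the count of correct steps and the total number of steps for a skill. The sketch below uses the standard formulas with $z=1.96$ for a 95% interval; the function names are illustrative.

```python
import math

def binomial_se(correct, n):
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    p = correct / n
    return math.sqrt(p * (1 - p) / n)

def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

Unlike the naive $p \pm z\cdot\mathrm{SE}$ interval, the Wilson interval stays inside $[0, 1]$ and remains informative for skills with very few held-out steps or with accuracy near 0 or 1.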
Figure A6:Normalized alignment between Wasserstein clustering assignments and ground-truth skill labels on IW data. Rows are discovered clusters and columns are ground-truth skill labels.
C.7Qualitative Skill Descriptions
For each of the 8 discovered clusters in Table1, we asked Claude Haiku 4.5 to describe the cluster given its top action types, dominant ground-truth skill label, purity, and three nearest-to-centroid exemplar segments. Thus these descriptions are not a blind semantic validation of the clusters: the LLM saw the plurality ground-truth label used in the table. We use them only as concise natural-language interpretations of the same cluster statistics and exemplars reported in Table1. High-purity clusters correspond to cleaner behavioral patterns; low-purity clusters should be read as mixed action motifs rather than clean skills.
- C0 (monitor_status, purity 0.66): Users inspect status-oriented interfaces through repeated click and scroll actions, often drilling into details and returning to nearby controls. The cluster is interpretable as monitoring behavior, but its moderate purity indicates overlap with other navigation-heavy skills.
- C1 (send_message, purity 0.28): Users perform broad click/type routines such as focusing a text field, entering content, and clicking again to submit or advance. Although the plurality label is send_message, the low purity shows that this cluster mixes several text-entry skills.
- C2 (document_edit, purity 1.00): Users perform document editing and formatting tasks by clicking to position the cursor, selecting and modifying text, applying formatting, and saving changes.
- C3 (data_transfer, purity 1.00): Users copy information from one application and paste it into another by selecting content, copying, switching applications, positioning the cursor, and inserting the data.
- C4 (organize_files, purity 1.00): Users organize files or folders by selecting an item, opening a context menu, and executing an organizational action.
- C5 (export_publish, purity 1.00): Users complete an export or publish workflow through a short sequence of confirmation clicks, consistent with finalizing an already configured artifact.
- C6 (review_content, purity 0.46): Users review or annotate displayed content by scrolling, selecting interface items, entering text, and moving through follow-up clicks. The cluster captures a reusable review-like action motif, but its sub-0.5 purity means it is not a clean one-skill cluster.
- C7 (presentation_edit, purity 1.00): Users edit presentation content by locating content, typing updates, and saving.
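The purity values quoted in the list above can be read as the fraction of a cluster's segments that carry its plurality ground-truth label. This section does not restate the paper's exact definition, so the following is only a minimal sketch under that assumption; the function name `cluster_purity` and the toy assignments (including the labels `search` and `form_fill`) are hypothetical and are not taken from the IW data.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, labels):
    """Map each cluster to (plurality_label, purity).

    Assumption: purity = share of segments in the cluster whose
    ground-truth label equals the cluster's most common label.
    """
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        counts[cluster][label] += 1

    result = {}
    for cluster, label_counts in counts.items():
        top_label, top_count = label_counts.most_common(1)[0]
        result[cluster] = (top_label, top_count / sum(label_counts.values()))
    return result


# Toy data (hypothetical): one clean cluster and one mixed cluster.
cluster_ids = ["C2", "C2", "C2", "C1", "C1", "C1", "C1"]
labels = [
    "document_edit", "document_edit", "document_edit",
    "send_message", "send_message", "search", "form_fill",
]

purity = cluster_purity(cluster_ids, labels)
for cluster, (label, score) in sorted(purity.items()):
    print(f"{cluster}: {label}, purity {score:.2f}")
```

In this toy case the mixed cluster still receives a plurality label, which is why a low-purity entry such as C1 should be read as a mixture rather than as evidence of a clean skill.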
Appendix D Generated Skill Specifications
This section supports the automated SKILL.md analysis in Section 6.5. The table gives the data-efficiency numbers behind Figure 2; the examples show what the generated skill descriptions look like against hand-written specifications.
D.1 Data-Efficiency Table
Table A9 gives the numeric data-efficiency results behind the main-text generated-SKILL.md comparison.
Table A9: Automated vs. hand-crafted SKILL.md generation on IW trajectories. Lower normalized edit distance is better. The hand-crafted baseline is a simple expert-authored transition table. Auto-SKILL.md improves over the hand-crafted table at some sizes but is worse than the Frequency baseline at every evaluated size.
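For readers unfamiliar with the metric, normalized edit distance can be sketched as a Levenshtein distance between a predicted and a reference skill sequence, divided by the longer length. The normalization used for Table A9 is not spelled out in this appendix, so the divisor below is an assumption, and the function names and example sequences are hypothetical.

```python
def edit_distance(pred, ref):
    """Levenshtein distance over sequence items (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(
                prev[j] + 1,             # delete p
                cur[j - 1] + 1,          # insert r
                prev[j - 1] + (p != r),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Assumed normalization: divide by the longer sequence length."""
    longest = max(len(pred), len(ref))
    return 0.0 if longest == 0 else edit_distance(pred, ref) / longest


# Hypothetical skill sequences: the prediction has one extra step.
predicted = ["search", "open", "edit", "save"]
reference = ["search", "open", "save"]
score = normalized_edit_distance(predicted, reference)
print(score)
```

Under this normalization the score lies in [0, 1], with 0 meaning the predicted sequence matches the reference exactly.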
D.2 Hand-Crafted vs. Auto-Generated Skill Examples
The following examples compare hand-crafted and auto-generated SKILL.md descriptions on realistic workflow tasks. They are qualitative examples, not additional metrics. The hand-crafted examples show the kind of detailed manual specification an engineer might write; the auto-generated examples instantiate trajectory-derived reusable structure with the concrete entities, fields, and goals from the task. Across the examples, the main difference is that hand-crafted skills include more expert validation and recovery rules, while auto-generated skills combine reusable action patterns with task-specific context.
Example 1: Service Ticket Update.
Task: In an IT service-management portal, find an open incident by ticket number, update the assignment group and priority, add a work note, and save the record.
Hand-crafted SKILL.md: update_service_ticket
- When: ticket, case, incident, or change request updates.
- Preconditions: logged in; search or list filter visible; ticket identifier known.
- Procedure: search exact ticket ID; open the matching record; verify record number; locate Assignment group, Priority, State, and Work notes; update requested fields only; save.
- Validation: record number unchanged; edited values visible; note appears in activity stream.
- Recovery: filter duplicate search results; stop on read-only fields; repair required-field errors before one retry.
Auto-generated SKILL.md: Cluster C0 mapped to record_field_update
- Observed pattern: click search/list field → type ID → open result → edit labeled fields → click save/update.
- Task context: target is an open incident; required fields are assignment group, priority, and work note.
- Reusable steps: locate the incident by ticket number; bind task values to nearby labels; enter only requested updates; commit changes.
- Generated validation: prefer exact record match; confirm edited labels still show requested values; final save/update terminates the skill.
- Generated caution: preserve fields not mentioned in the task.
Example 2: Spreadsheet-to-CRM Data Transfer.
Task: Copy a customer’s renewal amount and renewal date from a spreadsheet, switch to a CRM opportunity page, paste the values into the matching fields, and submit the update.
Hand-crafted SKILL.md: transfer_spreadsheet_value_to_crm
- When: move values from a spreadsheet, table, email, or report into a business system.
- Preconditions: source and destination are reachable; source row is identifiable; destination field names are known or inferable.
- Procedure: locate source row; select cell; copy; switch to CRM; verify account or opportunity; paste into matching field; repeat for each value; save.
- Validation: compare currency symbols, decimals, dates, and units; confirm destination record saved.
- Recovery: recopy on clipboard failure; accept destination reformatting only if semantically equivalent; disambiguate duplicate customer rows by renewal period.
Auto-generated SKILL.md: Cluster C6 mapped to data_transfer
- Observed pattern: click source cell → copy value → switch app/tab → click destination input → paste → confirm.
- Task context: source is the customer’s spreadsheet row; values are renewal amount and renewal date; destination is the CRM opportunity page.
- Reusable steps: identify row by customer; copy each required value; switch to CRM; bind destination by field label; paste amount and date; submit update.
- Generated validation: compare visible pasted strings with source values; allow formatting normalization only after checking semantic equivalence.
- Generated caution: app switching is expected and should not reset the task state.
Example 3: Shared Drive File Organization.
Task: In a shared drive, find the latest quarterly budget file, rename it with the approved naming convention, move it into the finance archive folder, and verify that the old copy is no longer in the working folder.
Hand-crafted SKILL.md: organize_shared_drive_file
- When: rename, move, archive, duplicate, or clean up files and folders.
- Preconditions: file browser open; source folder accessible; metadata sufficient to identify target file.
- Procedure: search or sort folder; identify latest matching file by name, date, and extension; select file; open context menu; rename; confirm; choose move; navigate to archive; confirm move.
- Validation: find new filename in destination; check extension and date; verify source folder no longer contains the moved file.
- Recovery: disambiguate equally recent files; avoid overwriting filename conflicts; fall back from drag-and-drop to context-menu move.
Auto-generated SKILL.md: Cluster C7 mapped to organize_files
- Observed pattern: click file-like item → open actions menu → choose rename/move operation → click destination or confirmation.
- Task context: target is the latest quarterly budget file; destination is the finance archive folder; operation is rename then move.
- Reusable steps: locate latest matching file; select it; rename with requested convention; open move action; choose archive folder; confirm.
- Generated validation: confirm the renamed file appears in the destination and no longer appears in the working folder.
- Generated caution: menu labels override positional assumptions; confirmation clicks often terminate the skill.
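The auto-generated examples share a fixed set of fields: an observed action pattern, task context, reusable steps, a validation rule, and a caution. As an illustration only, the sketch below renders such a description from a structured record, using the values of Example 3. The `MinedSkill` class and `render_skill` function are hypothetical names introduced here; they are not the authors' generator, and the plain-text layout is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MinedSkill:
    cluster: str                 # cluster identifier, e.g. "C7"
    name: str                    # skill name the cluster is mapped to
    observed_pattern: list[str]  # action motif mined from trajectories
    reusable_steps: list[str]    # procedure steps for the skill


def render_skill(skill: MinedSkill, task_context: str,
                 validation: str, caution: str) -> str:
    """Combine a mined skill with task-specific text into one description."""
    lines = [
        f"Cluster {skill.cluster} mapped to {skill.name}",
        "- Observed pattern: " + " → ".join(skill.observed_pattern) + ".",
        f"- Task context: {task_context}",
        "- Reusable steps: " + "; ".join(skill.reusable_steps) + ".",
        f"- Generated validation: {validation}",
        f"- Generated caution: {caution}",
    ]
    return "\n".join(lines)


skill = MinedSkill(
    cluster="C7",
    name="organize_files",
    observed_pattern=[
        "click file-like item", "open actions menu",
        "choose rename/move operation", "click destination or confirmation",
    ],
    reusable_steps=[
        "locate latest matching file", "select it",
        "rename with requested convention", "open move action",
        "choose archive folder", "confirm",
    ],
)
text = render_skill(
    skill,
    task_context="target is the latest quarterly budget file; "
                 "destination is the finance archive folder.",
    validation="confirm the renamed file appears in the destination.",
    caution="menu labels override positional assumptions.",
)
print(text)
```

Separating the mined fields from the per-task arguments mirrors the distinction drawn above between reusable action patterns and task-specific context.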