SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows

arXiv cs.AI Papers

Summary

Introduces SKILL.nb, a framework for governing reusable agent workflows through evidence-calibrated lifecycle policies, featuring selective formalization and gate-conditioned execution. It achieves significant improvements on web automation benchmarks and demonstrates resilience to environment drift.

arXiv:2606.08049v1 Announce Type: new

# Selective Formalization and Gated Execution for Durable Agent Workflows
Source: [https://arxiv.org/html/2606.08049](https://arxiv.org/html/2606.08049)
Amine El Hattami (1,2,3), Nicolas Chapados (1), Christopher Pal (1,2,3,4)

(1) ServiceNow Research, (2) Mila, (3) Polytechnique Montréal, (4) Canada CIFAR AI Chair

###### Abstract

AI agents increasingly convert past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse improves efficiency but creates a lifecycle reliability problem: artifacts that succeed once may fail under environment drift, underspecified tasks, or changing task distributions, especially in web automation. We introduce SKILL.nb, a framework for governing reusable agent workflows through evidence-calibrated lifecycle policies. Its key mechanism is *selective formalization*: execution evidence decides which workflow steps should become executable code, which should remain natural-language-guided, and when those choices should be revised. SKILL.nb stores workflows as auditable, versioned notebooks that interleave natural-language guidance, multi-language executable cells, validation gates, fallback paths, and multimodal evidence such as outputs, screenshots, and error traces. At runtime, SKILL.nb performs *gate-conditioned execution*: unlike all-or-nothing scripts, each step can execute code when its gates validate, or fall back locally to an NL procedure or step intent when drift invalidates the executable realization. Cell-level records of attempted realizations, gate outcomes, outputs, screenshots, and fallbacks make both workflow updates and executions auditable. On WebArena-Verified, SKILL.nb achieves 53.7% single-round success, improving over the strongest baseline by 3.9 percentage points. Across three re-executions, it retains 91.7% of initially successful tasks, 15.5 points above the next best method. Under bounded repair, it recovers 72.9% of subsequent failures while limiting post-repair regressions to 4.2%, compared with 15.0–17.0% regression rates for persistent baselines. It also leads the compared methods on Mind2Web cross-website and cross-domain splits. In a realistic GitLab migration test, SKILL.nb preserves performance when reusing frozen state learned on GitLab 15.7, with frozen-versus-fresh target-version gaps of only −1.7 points on GitLab 16.11 and +0.6 points on GitLab 18.9; the least-degraded persistent baseline drops by 10.6–11.1 points. These results identify lifecycle governance and gate-conditioned execution as reliability axes beyond one-shot task success. Code, data, and evaluation scripts are available at [https://github.com/Am1n3e/skill-nb.git](https://github.com/Am1n3e/skill-nb.git).

## 1 Introduction

AI agents increasingly produce and rely on durable external artifacts such as code, workflows, and procedural memories \[[22](https://arxiv.org/html/2606.08049#bib.bib12), [43](https://arxiv.org/html/2606.08049#bib.bib15), [41](https://arxiv.org/html/2606.08049#bib.bib30), [27](https://arxiv.org/html/2606.08049#bib.bib31)\]. As these artifacts are reused, the central question shifts from one-shot task completion to whether the artifact remains effective as conditions change, such as when the target site for a web agent changes its UI or data layout \[[19](https://arxiv.org/html/2606.08049#bib.bib11)\]. Recent memory and workflow systems treat agent experience as reusable \[[50](https://arxiv.org/html/2606.08049#bib.bib18), [48](https://arxiv.org/html/2606.08049#bib.bib38), [7](https://arxiv.org/html/2606.08049#bib.bib39), [41](https://arxiv.org/html/2606.08049#bib.bib30), [27](https://arxiv.org/html/2606.08049#bib.bib31)\], but repeated use introduces software-maintenance lifecycle concerns \[[31](https://arxiv.org/html/2606.08049#bib.bib48)\]: artifacts need versioning, validation, repair, regression control, and retirement when assumptions fail. Existing memory-centric systems often reuse experience as prompt context, while workflow and artifact systems provide execution, validation, versioning, or state-management primitives \[[41](https://arxiv.org/html/2606.08049#bib.bib30), [27](https://arxiv.org/html/2606.08049#bib.bib31), [23](https://arxiv.org/html/2606.08049#bib.bib24), [10](https://arxiv.org/html/2606.08049#bib.bib25), [20](https://arxiv.org/html/2606.08049#bib.bib26), [26](https://arxiv.org/html/2606.08049#bib.bib28)\]. These mechanisms are typically studied separately rather than as a joint lifecycle policy. For sustained automation, the question is not merely how to recall prior experience, but how accumulated execution evidence should govern reusable workflows: when to promote a candidate, when to formalize a step, when to repair or demote brittle realizations, and when to retire a workflow whose assumptions no longer hold.

We study this lifecycle problem in web environments because they provide a high-drift setting for evaluating learned artifacts. Site changes are outside the automation system’s control: unlike desktop applications that can often be version-pinned, or APIs that typically expose explicit contracts and announce breaking changes, web interfaces can drift without preserving automation anchors such as selectors or layouts. While APIs are often the preferred automation interface, many real workflows still depend on UI-level interaction \[[47](https://arxiv.org/html/2606.08049#bib.bib16)\]. Automating web workflows therefore forces a formalization choice: which steps should be hardened into code, and which should remain natural-language-guided (NL-guided) when interface drift would make repeated repair too costly. As a limiting case, mathematical autoformalization illustrates the fully formal endpoint: translating informal arguments into proof-assistant artifacts such as *Lean* yields terms checked by a well-defined logical kernel \[[3](https://arxiv.org/html/2606.08049#bib.bib49), [42](https://arxiv.org/html/2606.08049#bib.bib50), [14](https://arxiv.org/html/2606.08049#bib.bib51)\]. Durable agent workflows occupy an intermediate point on this spectrum. Leaving steps NL-guided preserves flexibility but weakens artifact-level control, while hardening them into code makes reuse more controlled but can become brittle under drift. Their validity therefore depends both on a task-level procedure and on an environment-dependent realization whose assumptions can break. For durable workflows, formalization is thus lifecycle-governed: the system must decide when to create or retire a workflow, which steps remain NL procedures versus executable code, and when execution evidence warrants revising those choices as interfaces drift.

We introduce SKILL.nb, a framework for learning evidence-based lifecycle policies over reusable agent workflows. Its central mechanism is *selective formalization*: using execution evidence to decide which steps remain NL-guided, which become executable code, and when those choices should be revised. SKILL.nb couples this with *gate-conditioned execution*: runtime gates decide whether to execute code or fall back to an NL procedure. Each workflow is represented as an auditable, versioned notebook that interleaves NL guidance, executable code, validation gates, fallback paths, and cell-level evidence. Appendix [A.1](https://arxiv.org/html/2606.08049#A1.SS1) visualizes how a provisional task notebook is promoted into a released workflow. Offline lifecycle learning aggregates traces into evidence counts for workflow creation and step formalization, and into accepted-repair signals, including repair counts and token-weighted repair burden as a proxy for instability in step demotion and workflow retirement. Thresholds are calibrated by replaying candidate lifecycle policies on historical workflows, selecting low-maintenance policies whose validation-failure rate stays within budget. This calibration uses accumulated execution evidence rather than hidden evaluator labels, and remains replay-relative rather than a guarantee under future interface shifts. Because web drift makes hardened code useful but brittle, the representation must preserve execution evidence, not just instructions. Otherwise, maintenance cannot tell whether code was merely suggested or actually ran, which gates validated it, what evidence it produced, or when execution fell back. This raises a natural question: why not use a conventional SKILL.md file with instructions and code snippets? Markdown can describe code, but it does not by itself record execution boundaries, gate outcomes, fallbacks, or cell-local evidence. SKILL.nb makes code execution first-class and auditable by storing these objects as versioned notebooks. Appendix [A.3](https://arxiv.org/html/2606.08049#A1.SS3) compares the two representations along four axes: executable workflow state, validation feedback, evidence retention, and failure localization.

Empirically, SKILL.nb improves one-shot task performance and controlled repeated-use reliability. On WebArena-Verified \[[5](https://arxiv.org/html/2606.08049#bib.bib29)\], it achieves the highest single-round success rate among compared methods (CodeAct \[[39](https://arxiv.org/html/2606.08049#bib.bib13)\], AWM$_{\mathrm{online}}$ \[[41](https://arxiv.org/html/2606.08049#bib.bib30)\], and ReasoningBank \[[27](https://arxiv.org/html/2606.08049#bib.bib31)\]) at 53.7%, outperforming the next-best method by 3.9 percentage points (paired McNemar, $p=0.029$). In controlled repeated execution, SKILL.nb retains 91.7% of initially successful WebArena-Verified tasks across three re-executions, 15.5 percentage points above the next best method. Under bounded repair, it recovers 72.9% of subsequent failures while limiting post-repair regressions to 4.2%, compared with 15.0–17.0% for persistent baselines under their native update paths. As a secondary transfer evaluation on Mind2Web \[[4](https://arxiv.org/html/2606.08049#bib.bib10)\], SKILL.nb leads the compared methods on all four metrics in both the cross-website and cross-domain splits, with step success reaching 38.1% and 39.7%, respectively. In a controlled GitLab migration test, SKILL.nb preserves target-version performance when reusing frozen state learned on GitLab 15.7, while persistent-memory baselines degrade by 10.6–14.4 points. These results show that governed workflow artifacts improve task completion, reuse, repair, and regression control under the evaluated protocols.

Our key contributions are summarized as follows:

- We formulate durable web-agent automation as *selective formalization* and lifecycle governance for reusable, executable artifacts that must be formalized, validated, repaired, and retired as assumptions fail.
- We present SKILL.nb, a notebook-native framework that couples selective formalization with gate-conditioned execution, enabling versioned workflows that mix NL guidance, executable cells, validation gates, fallback, and auditable multimodal execution evidence.
- We evaluate SKILL.nb on WebArena-Verified and Mind2Web, showing the highest single-round performance among compared methods and improved retention, repair recovery, and regression control in repeated WebArena-Verified execution. We also include a controlled GitLab version-drift protocol comparing fresh runs with old-state reuse after application migration.

## 2 Related Work

Agent Experience Memory. Memory is an essential module for agents \[[46](https://arxiv.org/html/2606.08049#bib.bib32)\], with representations ranging from virtual paging \[[28](https://arxiv.org/html/2606.08049#bib.bib33)\] and structured graphs \[[1](https://arxiv.org/html/2606.08049#bib.bib34), [44](https://arxiv.org/html/2606.08049#bib.bib35)\] to hierarchical working memory and consolidation mechanisms \[[12](https://arxiv.org/html/2606.08049#bib.bib36), [34](https://arxiv.org/html/2606.08049#bib.bib37), [51](https://arxiv.org/html/2606.08049#bib.bib19)\]. A complementary line of work focuses on reusing past experience for future tasks. For instance, Synapse retrieves raw trajectories as in-context exemplars \[[50](https://arxiv.org/html/2606.08049#bib.bib18)\], while ExPeL and MemP distill experience into procedural insights and memory \[[48](https://arxiv.org/html/2606.08049#bib.bib38), [7](https://arxiv.org/html/2606.08049#bib.bib39)\]. Extending this, Agent Workflow Memory (AWM) and ReasoningBank abstract execution traces into reusable workflows and reasoning strategies \[[41](https://arxiv.org/html/2606.08049#bib.bib30), [27](https://arxiv.org/html/2606.08049#bib.bib31)\]. Crucially, across all these systems, retrieved experience remains *advisory* prompt context: it guides generation but is not natively executed, verified, or version-controlled. SKILL.nb bridges this gap by operationalizing memory into governed, executable workflow artifacts validated by deterministic acceptance checks rather than stochastic LLM judgments.

Self-Evolving Agents. Continuous adaptation is a core desideratum for autonomous agents \[[8](https://arxiv.org/html/2606.08049#bib.bib40), [21](https://arxiv.org/html/2606.08049#bib.bib41)\]. To achieve cross-task evolution, systems maintain refined causal abstractions (CLIN \[[25](https://arxiv.org/html/2606.08049#bib.bib42)\]), construct training curricula \[[33](https://arxiv.org/html/2606.08049#bib.bib43)\], or consolidate transferable reasoning strategies (ICE \[[30](https://arxiv.org/html/2606.08049#bib.bib45)\], ChemAgent \[[35](https://arxiv.org/html/2606.08049#bib.bib46)\], Contextual Replay \[[24](https://arxiv.org/html/2606.08049#bib.bib44)\]). More closely related to our approach are Voyager and TroVE, which construct open-ended skill libraries of executable code through exploration \[[38](https://arxiv.org/html/2606.08049#bib.bib14), [40](https://arxiv.org/html/2606.08049#bib.bib47)\]. However, because these methods typically append distilled lessons or skills to memory without strict lifecycle gating, a single erroneous update can silently corrupt an agent’s future behavior. SKILL.nb mitigates this risk by enforcing governed evolution: candidate updates enter the repository only after offline verification and deterministic gate checks, enabling safe rollbacks if regressions occur.

Durable Agent Artifacts. As agents produce durable outputs, governing their creation, validation, and maintenance becomes critical. Systems typically manage this evolution through versioned execution logs and localized repair (ALAS \[[10](https://arxiv.org/html/2606.08049#bib.bib25)\]), git-like state checkpointing (AgentGit \[[20](https://arxiv.org/html/2606.08049#bib.bib26)\]), or iterative notebook refinement \[[45](https://arxiv.org/html/2606.08049#bib.bib20), [6](https://arxiv.org/html/2606.08049#bib.bib21)\]. At the runtime level, Atomix enforces strict transactional semantics for tool calls \[[26](https://arxiv.org/html/2606.08049#bib.bib28)\], while ReUseIt synthesizes guarded workflows from repeated attempts \[[23](https://arxiv.org/html/2606.08049#bib.bib24)\]. However, ReUseIt validates these guards via stochastic LLM screenshot analysis, and other complementary systems rely heavily on human-in-the-loop oversight during failures \[[29](https://arxiv.org/html/2606.08049#bib.bib27), [13](https://arxiv.org/html/2606.08049#bib.bib23)\] or external skill induction \[[36](https://arxiv.org/html/2606.08049#bib.bib22), [49](https://arxiv.org/html/2606.08049#bib.bib17)\]. SKILL.nb differentiates itself from these approaches by governing artifact lifecycles entirely offline, replacing stochastic runtime validation and human oversight with deterministic gate checks and trace-replay calibration.

Taken together, prior work supplies isolated primitives for memory, workflow induction, and artifact governance. SKILL.nb integrates these strands into a unified lifecycle, complementing memory-reuse frameworks, skill-induction systems, and artifact managers.

## 3 SKILL.nb

We present SKILL.nb, an online–offline framework for governing reusable workflow artifacts with an RLVR-inspired preference for execution-grounded evidence \[[15](https://arxiv.org/html/2606.08049#bib.bib1), [11](https://arxiv.org/html/2606.08049#bib.bib2)\]. Its core mechanism is *selective formalization*: deciding when steps remain NL-guided, become executable, are demoted after instability, or contribute to workflow retirement. We describe the artifact representation, lifecycle policy, adaptive thresholds, and runtime loop; complete algorithms are in Appendix [A.6](https://arxiv.org/html/2606.08049#A1.SS6).

### 3.1 Versioned Workflow Artifacts

SKILL.nb stores each reusable procedure as a versioned artifact in a repository $\mathcal{K}$, supporting retrieval, updates, and rollback. A workflow version $\mathcal{W}_v$, $v \in \{1, 2, \ldots\}$, and its steps are

$$
\mathcal{W}_v = \langle I, X, S, M^{W} \rangle, \qquad s_i = \langle I_i, P_i, C_i, \Gamma_i, M_i^{S} \rangle, \qquad \Gamma_i = (\gamma_{i,\mathrm{pre}}, \gamma_{i,\mathrm{post}}).
$$

Here $I$ is the workflow intent, $X$ the input schema and validation rules, $S = (s_1, \ldots, s_n)$ the ordered steps, and $M^{W}$ workflow metadata. Each step stores a local intent $I_i$, an NL procedure $P_i$, an optional executable realization $C_i$, executable pre/post gates $\Gamma_i$, and step metadata $M_i^{S}$.

Verification gates are predicates over environment-observable states and do not access benchmark evaluators or hidden success labels. Metadata supports retrieval at workflow and step level: $M^{W}$ indexes similar task flows, while $M_i^{S}$ indexes analogous or specialized steps. In our implementation, artifacts are Jupyter notebooks ([https://jupyter.org/](https://jupyter.org/)) interleaving these components. Appendix [A.1](https://arxiv.org/html/2606.08049#A1.SS1) shows an example.

### 3.2 Selective Formalization as a Lifecycle Policy

Selective formalization governs both workflow maturity and step representation. For workflow version $\mathcal{W}_v$, let

$$
y(\mathcal{W}_v) \in \{\texttt{provisional}, \texttt{released}, \texttt{retired}\}
$$

denote whether it is under validation, available for retrieval and execution, or removed from active retrieval while retained for rollback and analysis. For each step $s_i$, let $z_i \in \{0, 1\}$ indicate whether a validated executable realization $C_i$ is available. If $z_i = 0$, execution falls back to $P_i$ or, when necessary, the bare intent $I_i$; $z = (z_1, \ldots, z_n)$ is the workflow formalization pattern.
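To make the tuple notation concrete, the following Python sketch mirrors $\mathcal{W}_v$, $s_i$, $y$, and $z$ as dataclasses. It is an illustration under stated assumptions, not the SKILL.nb schema (the paper stores these components as Jupyter notebook cells): the class and field names, the dict-based `State`, and the rule that $z_i = 1$ whenever `code` is set are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# Hypothetical encoding of environment-observable state (e.g. page, DOM facts).
State = dict[str, Any]
Gate = Callable[[State], bool]  # predicate over observable state only


@dataclass
class Step:
    intent: str                                      # I_i: local intent
    procedure: str                                   # P_i: NL procedure
    pre_gate: Gate                                   # gamma_{i,pre}
    post_gate: Gate                                  # gamma_{i,post}
    code: Optional[Callable[[State], State]] = None  # C_i: optional realization
    metadata: dict[str, Any] = field(default_factory=dict)  # M_i^S

    @property
    def z(self) -> int:
        """Formalization bit z_i (assumed here: 1 iff code is present)."""
        return int(self.code is not None)


@dataclass
class Workflow:
    version: int                   # v
    intent: str                    # I
    input_schema: dict[str, type]  # X (validation rules omitted in this sketch)
    steps: list[Step]              # S
    metadata: dict[str, Any] = field(default_factory=dict)  # M^W
    status: str = "provisional"    # y in {provisional, released, retired}

    def pattern(self) -> tuple[int, ...]:
        """Workflow formalization pattern z = (z_1, ..., z_n)."""
        return tuple(step.z for step in self.steps)


# Toy workflow: one NL-guided step and one formalized step.
open_issue_form = Step(
    intent="open the new-issue form",
    procedure="Click the 'New issue' button on the project page.",
    pre_gate=lambda s: s.get("page") == "project",
    post_gate=lambda s: s.get("page") == "issue_form",
)
fill_title = Step(
    intent="fill in the issue title",
    procedure="Type the requested title into the title field.",
    pre_gate=lambda s: s.get("page") == "issue_form",
    post_gate=lambda s: bool(s.get("title")),
    code=lambda s: {**s, "title": s["requested_title"]},
)
wf = Workflow(version=1, intent="create an issue",
              input_schema={"requested_title": str},
              steps=[open_issue_form, fill_title])
print(wf.status, wf.pattern())
```

Keeping `code` optional means a step can be demoted by clearing one field while its intent, NL procedure, and gates stay in place, which is the property the fallback to $P_i$ or $I_i$ relies on.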

Ideally, a lifecycle controller would choose $(y, z)$ to trade offline maintenance cost against repeated runtime cost:

$$
\begin{aligned}
(y^{\star}, z^{\star}) &= \arg\min_{y,z}\; C_{\mathrm{maint}}(y, z \mid \mathcal{W}_v) + C_{\mathrm{run}}(y, z \mid \mathcal{W}_v) \\
\text{s.t.}\quad & J_{\mathrm{perf}}(y, z \mid \mathcal{W}_v) \leq J_{\mathrm{perf}}^{\mathrm{ref}} + \epsilon.
\end{aligned}
$$

Here $C_{\mathrm{maint}}$ is cumulative offline maintenance inference cost, including distillation, validation, repair, promotion, demotion, and retirement review. We approximate each maintenance event $e$ by token cost $c(e) = \mathrm{tok}_{\mathrm{in}}(e) + \mathrm{tok}_{\mathrm{out}}(e)$. The runtime term $C_{\mathrm{run}}$ includes online inference, fallbacks, retries, and browser/tool actions. The loss $J_{\mathrm{perf}}$ is estimated from offline replay validation and captures downstream failure, regressions on previously passing traces, or step-level degradation relative to $J_{\mathrm{perf}}^{\mathrm{ref}}$, with tolerance $\epsilon \geq 0$. Runtime gates do not access $J_{\mathrm{perf}}$ or hidden benchmark labels.

The full objective is not directly solved, since lifecycle actions change future repository state, trace distributions, and maintenance opportunities. SKILL.nb instead uses a restricted threshold policy $\pi_\theta$, with $\theta = (\tau_{\mathrm{create}}, \tau_{\mathrm{form}}, \tau_{\mathrm{demote}}, \tau_{\mathrm{retire}})$, that emits create, form, demote, or retire actions from logged evidence. Create, form, and demote use count thresholds; retirement uses a normalized repair burden in $[0, 1]$. Threshold crossings trigger maintenance review, and repository state changes only after the corresponding artifact update passes validation.

The thresholds use two evidence types. *Trace-support evidence* governs workflow creation and step formalization. For request $q$, let $\mathcal{C}(q)$ be the cluster of similar prior traces available before the lifecycle decision. The workflow-level count $|\mathcal{C}(q)|$ is compared with $\tau_{\mathrm{create}}$. For each step $s_i$, maintenance aligns traces in $\mathcal{C}(q)$ to the workflow sequence and counts validated execution segments supporting that step, giving $n_i^{\mathrm{evidence}}$, which is compared with $\tau_{\mathrm{form}}$.

*Repair evidence* governs demotion and retirement. Let $m_i^{\mathrm{repair}}$ be the number of accepted repairs affecting step $s_i$ in version $\mathcal{W}_v$; only repairs passing maintenance validation are counted. If $m_i^{\mathrm{repair}} \geq \tau_{\mathrm{demote}}$, the executable realization for $s_i$ is demoted to NL-guided execution. For retirement, repair count alone is too coarse, so we use a token-weighted burden. Let $\mathcal{R}_i(\mathcal{W}_v)$ be accepted repairs affecting step slot $i$ along the lineage ending at $\mathcal{W}_v$, let $\mathcal{R}_{\mathrm{cal}}$ be accepted repairs in calibration logs, and set $c_{\mathrm{ref}} = \max_{e \in \mathcal{R}_{\mathrm{cal}}} c(e)$. If $\mathcal{R}_{\mathrm{cal}}$ is empty, automatic repair-burden retirement is deferred; otherwise,

$$
\rho_{\mathrm{repair}}(\mathcal{W}_v) = \frac{1}{|S|} \sum_{i=1}^{|S|} \min\!\left(1, \frac{\sum_{e \in \mathcal{R}_i(\mathcal{W}_v)} c(e)}{c_{\mathrm{ref}}}\right) \in [0, 1].
$$

This averages capped per-step repair burden, preventing a single repeatedly repaired step from dominating the retirement signal.
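The repair-burden formula and the four threshold comparisons can be sketched in a few lines of Python. This is a minimal illustration, assuming each accepted repair is already reduced to its token cost $c(e)$; the function names, the `Thresholds` container, and all numeric values are hypothetical, and the comparison direction (signal at or above threshold) follows the admission rule $u_j \geq \tau$ used in Section 3.3. Outputs are review triggers only, since repository state changes only after validation.

```python
from dataclasses import dataclass
from typing import Optional


def repair_burden(repair_costs: list[list[int]], c_ref: int) -> float:
    """rho_repair: mean over step slots of min(1, total repair tokens / c_ref).

    repair_costs[i] holds token costs c(e) of accepted repairs for step slot i.
    """
    if c_ref <= 0:
        # Mirrors the empty-calibration case: retirement by burden is deferred.
        raise ValueError("c_ref must be positive")
    if not repair_costs:
        raise ValueError("workflow has no steps")
    capped = [min(1.0, sum(costs) / c_ref) for costs in repair_costs]
    return sum(capped) / len(capped)


@dataclass
class Thresholds:
    create: int    # tau_create, compared with |C(q)|
    form: int      # tau_form, compared with n_i^evidence
    demote: int    # tau_demote, compared with m_i^repair
    retire: float  # tau_retire, compared with rho_repair in [0, 1]


def review_triggers(theta: Thresholds, cluster_size: int,
                    n_evidence: list[int], m_repair: list[int],
                    rho: float) -> list[tuple[str, Optional[int]]]:
    """Return (action, step index or None) pairs that warrant maintenance review."""
    triggers: list[tuple[str, Optional[int]]] = []
    if cluster_size >= theta.create:
        triggers.append(("create", None))
    triggers += [("form", i) for i, n in enumerate(n_evidence) if n >= theta.form]
    triggers += [("demote", i) for i, m in enumerate(m_repair) if m >= theta.demote]
    if rho >= theta.retire:
        triggers.append(("retire", None))
    return triggers


# Three step slots; the third was repaired twice at the reference cost.
costs = [[500, 1500], [], [4000, 4000]]
rho = repair_burden(costs, c_ref=4000)  # (0.5 + 0.0 + 1.0) / 3
theta = Thresholds(create=3, form=2, demote=2, retire=0.6)  # illustrative values
triggers = review_triggers(theta, cluster_size=4,
                           n_evidence=[0, 3, 0], m_repair=[2, 0, 2], rho=rho)
print(rho, triggers)
```

In the toy data the third step accumulates twice the reference cost, yet the cap limits its contribution to 1, so the workflow-level burden stays at 0.5 and does not cross the illustrative retirement threshold.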

### 3.3 Adaptive Thresholds via Group Specialization

A single global $\theta$ is often too aggressive for sparse domains and too conservative for well-observed ones. SKILL.nb therefore treats groups as threshold-sharing units: workflow decisions (create, retire) use workflow groups $g^{W} \in \mathcal{G}^{W}$, while step decisions (form, demote) use step groups $g^{S} \in \mathcal{G}^{S}$. Groups are canonicalized from artifact metadata such as site family, task type, action type, and interface properties, excluding benchmark task IDs and hidden labels. When a statement applies to either workflow or step groups, we write $g \in \mathcal{G}_d$, where $d$ is the lifecycle decision.

Let $\mathcal{L}_{\mathrm{thr}}$ be historical execution logs, separate from final evaluation tasks. A threshold-estimation case $j$ for decision $d$ contains the signal $u_j$, candidate lifecycle action, attributable maintenance token cost, and offline validation outcome; cases missing any of these fields are excluded. Let $\mathcal{D}_{g,d} \subseteq \mathcal{L}_{\mathrm{thr}}$ be the usable cases for group $g$ and decision $d$, with $n_{g,d} = |\mathcal{D}_{g,d}|$. The signal is $|\mathcal{C}(q)|$ for creation, $n_i^{\mathrm{evidence}}$ for formalization, $m_i^{\mathrm{repair}}$ for demotion, and $\rho_{\mathrm{repair}}(\mathcal{W}_v)$ for retirement. Let $\mathcal{T}_d$ be the sorted unique signal values in the logs; threshold $\tau$ admits case $j$ when $u_j \geq \tau$.

For candidate $\tau$, $\hat{C}^{(d)}_{\mathrm{maint}}(g, \tau)$ is the replay-estimated maintenance token cost for decision $d$ on $\mathcal{D}_{g,d}$, omitting costs common to all thresholds. This optimizes maintenance compute rather than the full $C_{\mathrm{total}}$: runtime effects of suppressed actions are not reliably counterfactually observed; runtime efficiency, recovery, and regression are instead controlled through validation filters and measured downstream.

The validation-violation rate estimates how often $\tau$ would admit an unsafe lifecycle action:

$$
\hat{V}^{(d)}(g, \tau) = k_{g,d}(\tau) / n_{g,d}, \qquad n_{g,d} > 0,
$$

where $k_{g,d}(\tau)$ counts admitted cases whose offline validation loss exceeds $J_{\mathrm{perf}}^{\mathrm{ref}} + \epsilon$. If $n_{g,d} = 0$, no group-specific violation rate is estimated. To avoid treating small groups as safe because they have few observed violations, a threshold is feasible only when the one-sided Wilson upper confidence bound is below the decision-specific budget $V_{\max}^{(d)}$:

$$
\mathcal{F}_{g,d} = \left\{ \tau \in \mathcal{T}_d : \mathrm{WilsonUCB}_{1-\alpha}(k_{g,d}(\tau), n_{g,d}) \leq V_{\max}^{(d)} \right\}.
$$

The pooled feasible set $\mathcal{F}^{\mathrm{pool}}_d$ is computed analogously after pooling cases across groups and supplies the default for unseen or sparse groups. This replay filter uses only logged cases with observed outcomes; it does not simulate induced changes to later repository contents, traces, or repairs. Because thresholds are selected after sweeping many candidates, the Wilson bound is a conservative replay filter rather than a uniform post-selection or deployment-time safety guarantee.
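The feasibility filter can be sketched as follows. The paper defers its exact Wilson formula to Appendix A.7, so this example assumes the standard one-sided Wilson score upper bound with normal quantile $z_{1-\alpha}$; the function names, the `(signal, violated)` case encoding, and the toy data are hypothetical, and candidate thresholds are taken from the passed cases rather than from a separate $\mathcal{T}_d$.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ucb(k: int, n: int, alpha: float = 0.05) -> float:
    """One-sided Wilson score upper bound for k violations in n cases."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided normal quantile
    p = k / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return min(1.0, (centre + margin) / (1 + z * z / n))


def feasible_thresholds(cases: list[tuple[float, bool]], v_max: float,
                        alpha: float = 0.05) -> list[float]:
    """Thresholds whose Wilson bound on the violation rate is within v_max.

    Each case is (signal u_j, violated), where violated means the offline
    validation loss exceeded the reference plus tolerance.
    """
    n = len(cases)
    if n == 0:
        return []  # no group-specific estimate is possible
    feasible = []
    for tau in sorted({u for u, _ in cases}):
        # k(tau): admitted cases (u_j >= tau) that violated validation.
        k = sum(1 for u, violated in cases if u >= tau and violated)
        if wilson_ucb(k, n, alpha) <= v_max:
            feasible.append(tau)
    return feasible


# Toy logs: low-signal cases tend to violate, high-signal cases do not.
cases = [(1, True)] * 20 + [(3, False)] * 80
feasible = feasible_thresholds(cases, v_max=0.1)
print(feasible, round(wilson_ucb(0, 10), 3), round(wilson_ucb(0, 1000), 4))
```

With zero observed violations the bound reduces to $z^2 / (n + z^2)$, so ten clean cases still leave an upper bound above 0.2 while a thousand clean cases push it below 0.01. This is the behavior that keeps small groups from being treated as safe.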

If $\mathcal{F}^{\mathrm{pool}}_d \neq \emptyset$, the pooled threshold is

$$
\hat{\tau}^{\mathrm{pool}}_d = \arg\min_{\tau \in \mathcal{F}^{\mathrm{pool}}_d} \hat{C}^{(d)}_{\mathrm{maint}}(\mathrm{pool}, \tau).
$$

If $\mathcal{F}^{\mathrm{pool}}_d = \emptyset$, automatic thresholding for decision $d$ is deferred to maintenance review. For a group with $n_{g,d} > 0$ and $\mathcal{F}_{g,d} \neq \emptyset$,

$$
\hat{\tau}^{g}_d = \arg\min_{\tau \in \mathcal{F}_{g,d}} \hat{C}^{(d)}_{\mathrm{maint}}(g, \tau).
$$

Deployment shrinks small-group estimates toward the pooled threshold:

$$
\tau_d^{g} = \begin{cases}
\Pi_{\mathcal{F}_{g,d}}\!\left[\omega_{g,d}\,\hat{\tau}_d^{g} + (1 - \omega_{g,d})\,\hat{\tau}_d^{\mathrm{pool}}\right], & n_{g,d} > 0,\; \mathcal{F}_{g,d} \neq \emptyset,\; \mathcal{F}^{\mathrm{pool}}_d \neq \emptyset, \\[5.0pt]
\Pi_{\mathcal{F}^{\mathrm{pool}}_d}\!\left[\hat{\tau}_d^{\mathrm{pool}}\right], & n_{g,d} = 0,\; \mathcal{F}^{\mathrm{pool}}_d \neq \emptyset, \\[5.0pt]
\operatorname{defer}_d, & \text{otherwise}.
\end{cases}
$$

Here $\omega_{g,d} = n_{g,d} / (n_{g,d} + n_0)$ and $\Pi_{\mathcal{F}}$ projects to the nearest replay-supported feasible value. The value $\operatorname{defer}_d$ means no automatic thresholded action is taken, and the case is routed to maintenance review. At runtime, $\pi_\theta$ compares each lifecycle signal with the deployed group threshold; crossings trigger review, not direct writes to $\mathcal{K}$. Appendix [A.7](https://arxiv.org/html/2606.08049#A1.SS7) gives the Wilson formula, tie-breaking, empty-set handling, and full estimation procedure.
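The three-way deployment rule can be sketched as a small function. This is an illustration under assumptions: `deployed_threshold` and `nearest` are hypothetical names, `None` stands in for $\operatorname{defer}_d$, the prior strength `n0` is an arbitrary example value, and ties in the projection are broken toward the smaller threshold because the paper's tie-breaking rule is in Appendix A.7 and not reproduced here.

```python
from typing import Optional, Sequence


def nearest(feasible: Sequence[float], x: float) -> float:
    """Project x onto the feasible set (assumed tie-break: smaller value)."""
    return min(feasible, key=lambda t: (abs(t - x), t))


def deployed_threshold(tau_group: Optional[float],
                       feasible_group: Sequence[float],
                       n_group: int,
                       tau_pool: Optional[float],
                       feasible_pool: Sequence[float],
                       n0: float) -> Optional[float]:
    """Shrink a group threshold toward the pooled one; None means defer."""
    if not feasible_pool or tau_pool is None:
        return None  # no pooled fallback: route to maintenance review
    if n_group == 0:
        return nearest(feasible_pool, tau_pool)  # unseen group uses the pool
    if not feasible_group or tau_group is None:
        return None  # observed group with no feasible threshold: defer
    w = n_group / (n_group + n0)  # omega_{g,d}
    blended = w * tau_group + (1 - w) * tau_pool
    return nearest(feasible_group, blended)


# Group estimate 8 from 10 cases, pooled estimate 4, prior strength n0 = 10.
tau = deployed_threshold(tau_group=8, feasible_group=[4, 6, 8], n_group=10,
                         tau_pool=4, feasible_pool=[4, 5], n0=10)
unseen = deployed_threshold(None, [], 0, tau_pool=4, feasible_pool=[4, 5], n0=10)
deferred = deployed_threshold(None, [], 5, tau_pool=4, feasible_pool=[4, 5], n0=10)
print(tau, unseen, deferred)
```

With ten group cases and `n0 = 10` the weight is 0.5, so the blend of 8 and 4 is 6, which is already feasible for the group. As `n_group` grows the weight approaches 1 and the group's own estimate dominates.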

##### Threshold calibration versus end-to-end RL.

Recent RLVR methods update model parameters using verifiable task rewards \[[15](https://arxiv.org/html/2606.08049#bib.bib1), [11](https://arxiv.org/html/2606.08049#bib.bib2)\]. SKILL.nb adopts the same preference for externally checkable outcomes, but applies it to durable workflow artifacts rather than model weights. In web automation, failures arise from exogenous interface drift (changed DOM structure, selectors, or page flow \[[2](https://arxiv.org/html/2606.08049#bib.bib3), [16](https://arxiv.org/html/2606.08049#bib.bib4), [17](https://arxiv.org/html/2606.08049#bib.bib5), [32](https://arxiv.org/html/2606.08049#bib.bib6)\]), so the failing object is often the artifact, not the base model. SKILL.nb therefore updates versioned artifacts while keeping the base LLM fixed during runtime and offline maintenance.

Lifecycle decisions are sparse governance actions over accumulated evidence, not token-level policy updates. Release, formalization, demotion, and retirement depend on auditable signals such as trace support and accepted repair counts, while replay-estimated violation rates determine feasible thresholds. Thresholds expose these tradeoffs, support conservative replay filtering, and remain inspectable during review. Online RL would require deployment-sensitive exploration over durable artifacts, while richer offline controllers would require full-trajectory counterfactuals not present in the logs, reflecting support-mismatch issues in offline RL [[37](https://arxiv.org/html/2606.08049#bib.bib7), [18](https://arxiv.org/html/2606.08049#bib.bib8)]. Thus, execution evidence serves as RLVR-like feedback for artifact governance, with replay-relative rather than future-shift safety claims. Appendix [A.7.3](https://arxiv.org/html/2606.08049#A1.SS7.SSS3) summarizes the design contrast.

### 3.4 Runtime Execution and Maintenance Loop

Runtime handles a query $q$ by executing the latest released workflow or, if none exists, synthesizing a provisional workflow $\hat{\mathcal{W}}$ from task intent and current state. It maintains temporary per-run memory $\mathcal{M}$ alongside the authoritative repository $\mathcal{K}$; $\mathcal{M}$ is mutable and non-authoritative, storing transient observations, local repairs, and provisional routines.

Execution is gate-conditioned at the step level. For step $s_{i}$, runtime checks $\gamma_{i,\mathrm{pre}}(x_{t})$ on the current state $x_{t}$; if drift causes this check to fail, agents attempt local repair before execution. Runtime then follows the fallback cascade $C_{i}\to P_{i}\to I_{i}$ until one realization satisfies both gates. If all realizations fail within the retry budget, the run is unresolved and its trace is submitted to maintenance, but $\mathcal{K}$ is unchanged until offline validation accepts a repair or replacement. Only accepted repairs update repair evidence. Algorithm [1](https://arxiv.org/html/2606.08049#alg1) and Appendix [A.6](https://arxiv.org/html/2606.08049#A1.SS6) give the execution procedure, trigger criteria, and recovery behavior.
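The step-level control flow described above can be sketched in a few lines of Python. This is a simplified illustration, not the procedure of Algorithm 1: the `Step` and `Realization` types, the `repair` hook, the way the retry budget is applied (one attempt per realization), and the toy click functions are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]


@dataclass
class Realization:
    name: str                      # "code" (C_i), "procedure" (P_i), or "intent" (I_i)
    run: Callable[[State], State]  # performs the step and returns the resulting state


@dataclass
class Step:
    pre: Callable[[State], bool]      # precondition gate
    post: Callable[[State], bool]     # postcondition gate
    repair: Callable[[State], State]  # local repair hook (hypothetical)
    cascade: List[Realization]        # ordered fallback cascade


def run_step(step: Step, state: State, retry_budget: int = 3) -> Tuple[Optional[State], str]:
    """Return (new_state, realization_name), or (None, "unresolved")."""
    if not step.pre(state):
        state = step.repair(dict(state))  # try local repair before executing
        if not step.pre(state):
            return None, "unresolved"
    for realization in step.cascade[:retry_budget]:
        try:
            candidate = realization.run(dict(state))  # work on a copy of the state
        except Exception:
            continue  # e.g. a stale selector; fall through to the next realization
        if step.post(candidate):
            return candidate, realization.name
    return None, "unresolved"


def brittle_click(state: State) -> State:
    raise KeyError("selector #old-submit not found")  # simulated interface drift


def guided_click(state: State) -> State:
    state["submitted"] = True
    return state


submit = Step(
    pre=lambda s: s.get("logged_in", False),
    post=lambda s: s.get("submitted", False),
    repair=lambda s: s,  # no-op repair in this toy example
    cascade=[Realization("code", brittle_click), Realization("procedure", guided_click)],
)
result, used = run_step(submit, {"logged_in": True})
```

In this toy run the executable realization fails under simulated drift and the step completes through the procedure realization. An unresolved result would be handed to maintenance without touching the repository.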

Offline maintenance closes the loop. Each execution is distilled into non-authoritative temporary evidence in $\mathcal{M}$ and, when warranted, durable update proposals. This yields a retrieve → execute → distill → promote loop in which runtime proposes changes while maintenance agents, counted in maintenance cost, verify, refactor, de-duplicate, and promote artifacts into new repository versions. The thresholds from §[3.3](https://arxiv.org/html/2606.08049#S3.SS3) govern workflow release or retirement and step formalization or demotion.
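The separation between runtime proposals and promoted repository versions can be illustrated with a small sketch. The `Repository` class, the `maintain` function, and the `validate` callback are hypothetical names for this example; real validation would involve replay and the thresholds from §3.3 rather than a simple predicate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Repository:
    """Authoritative store K, modeled as an append-only list of versions."""
    versions: List[Dict[str, str]] = field(default_factory=lambda: [{}])

    @property
    def latest(self) -> Dict[str, str]:
        return self.versions[-1]


def maintain(
    repo: Repository,
    proposals: Dict[str, str],
    validate: Callable[[str, str], bool],
) -> int:
    """Promote validated proposals into a new version; return how many were accepted."""
    accepted = {name: body for name, body in proposals.items() if validate(name, body)}
    if accepted:
        # Earlier versions stay intact, so every promotion remains auditable.
        repo.versions.append({**repo.latest, **accepted})
    return len(accepted)


repo = Repository()
n_accepted = maintain(
    repo,
    {"login": "workflow-v2", "search": "broken"},
    validate=lambda name, body: body != "broken",  # stand-in for offline validation
)
```

Rejected proposals never reach the repository, and a maintenance pass that accepts nothing leaves the version history unchanged.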

## 4 Experiments

We evaluate SKILL.nb along three axes: fresh-start task performance on WebArena-Verified and directional transfer on Mind2Web (§[4.1](https://arxiv.org/html/2606.08049#S4.SS1)), repeated-use lifecycle reliability (§[4.2](https://arxiv.org/html/2606.08049#S4.SS2)), and real application-version drift (§[4.3](https://arxiv.org/html/2606.08049#S4.SS3)). To support these experiments, we report detailed ablations of isolated mechanism contributions (Appendix [C.1](https://arxiv.org/html/2606.08049#A3.SS1)) and threshold specialization (Appendix [C.2](https://arxiv.org/html/2606.08049#A3.SS2)).

Baselines. We compare SKILL.nb against baselines targeting different capabilities: CodeAct [[39](https://arxiv.org/html/2606.08049#bib.bib13)], AWM$_{\text{online}}$ [[41](https://arxiv.org/html/2606.08049#bib.bib30)], and ReasoningBank [[27](https://arxiv.org/html/2606.08049#bib.bib31)], representing executable action generation, workflow memory, and retrieved reasoning memory, respectively.

Benchmarks. To align with prior workflow- and reasoning-memory evaluations [[41](https://arxiv.org/html/2606.08049#bib.bib30), [27](https://arxiv.org/html/2606.08049#bib.bib31)], we evaluate on WebArena-Verified [[5](https://arxiv.org/html/2606.08049#bib.bib29)] and Mind2Web [[4](https://arxiv.org/html/2606.08049#bib.bib10)]. WebArena-Verified provides 812 tasks across five self-hosted websites. We use its 258-task hard subset for component and threshold ablations. Mind2Web is used for single-round generalization on the cross-task, cross-website, and cross-domain test splits.

Evaluation Protocol. All methods are re-evaluated in a shared harness for a fair comparison. For WebArena-Verified, we report 95% Wilson CIs and use a two-sided continuity-corrected McNemar test for the main paired success comparison; for Mind2Web, we report point estimates because its metrics mix macro-averaged and task-level quantities. Each experiment below states its specific setup; full protocol details are in Appendix [B.1](https://arxiv.org/html/2606.08049#A2.SS1).
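For readers who want to reproduce this style of reporting, the sketch below implements the standard Wilson score interval and a continuity-corrected McNemar test using only the Python standard library. The function names are chosen for this example, the statistic is clamped at zero when the discordant counts differ by at most one (a small convenience, not part of the textbook formula), and the counts in the usage lines are illustrative: 436 is simply the integer consistent with 53.7% of 812, and the discordant counts are made up.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% standard normal quantile


def wilson_interval(successes: int, n: int, z: float = Z_95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def mcnemar_cc_pvalue(b: int, c: int) -> float:
    """Two-sided McNemar test with continuity correction.

    b and c are the discordant pair counts (method A only succeeds,
    method B only succeeds). The statistic is compared against a
    chi-square distribution with one degree of freedom.
    """
    if b + c == 0:
        return 1.0
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


low, high = wilson_interval(436, 812)   # illustrative count, see lead-in
p_value = mcnemar_cc_pvalue(40, 20)     # hypothetical discordant counts
```

The paired test uses only tasks where the two methods disagree, which is why it can detect a difference that overlapping marginal intervals would not show.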

### 4.1 Benchmark Performance and Generalization

WebArena-Verified. We report single-round performance on all 812 WebArena-Verified tasks. Each method starts without task-specific persistent state and builds any repository or memory online during that round. The primary metric is task success rate (SR). Table [1](https://arxiv.org/html/2606.08049#S4.T1) reports overall and per-website SR. We do not compare raw step counts because methods introduce different step types by design.

Table 1: SKILL.nb achieves 53.7% success on WebArena-Verified, outperforming the next-best baseline by 3.9 points ($p=0.029$). Overall SR is the task-weighted success rate across all 812 tasks (95% Wilson CI in brackets). Site-specific columns report SR point estimates for each domain. All methods start without preloaded workflow or memory states.

SKILL.nb achieves the highest SR at 53.7%, outperforming ReasoningBank by 3.9 percentage points ($p=0.029$), AWM$_{\text{online}}$ by 7.3 points, and CodeAct by 15.4 points. The largest site-level gains appear on GitLab (+9.2 points over the next-best baseline) and Maps (+5.5 points). Run-log analysis indicates that GitLab gains stem from more stable access to projects and issues, while Maps gains reflect selective formalization that avoids brittle map and routing interactions. The performance lead on the Multi-site subset (17.5% vs. 11.3%) suggests that lifecycle governance is particularly effective for long-horizon tasks. More broadly, these fresh-start gains indicate that even during initial discovery, SKILL.nb benefits from executable formalization that provides more robust state-handling than baselines relying on natural-language reasoning or unstructured memories.

Table 2: SKILL.nb achieves the best overall performance on Mind2Web, leading on all metrics in the cross-website and cross-domain splits. Results provide directional evidence of transfer, especially at the step level. We show element accuracy (EA, $\uparrow$), action F$_1$ (AF$_1$, $\uparrow$), step success rate (SSR, $\uparrow$), and task success rate (SR, $\uparrow$). EA and AF$_1$ are macro-averaged; SSR and SR are micro-averaged.

Mind2Web. To test zero-shot generalization to unseen environments, we report performance on Mind2Web (Table [2](https://arxiv.org/html/2606.08049#S4.T2)). Absolute task success remains low across all methods, but SKILL.nb achieves the best values across all four metrics in the cross-website and cross-domain settings. Given the modest split sizes, we treat the overall pattern across metrics as more informative than any single cell-level gap. On the cross-task split, SKILL.nb closely matches ReasoningBank on task success but leads by 2.8 percentage points on action F$_1$. We interpret these results as directional evidence of transfer at the step level. In the cross-website setting, workflows transfer because high-level intents remain similar despite UI differences; in the cross-domain setting, step-level transfer remains viable because finer-grained interactions, such as search and form filling, recur across the web.

### 4.2 Lifecycle Dynamics: Reuse Consistency, Repair, and Regression

Beyond initial success, we investigate whether persistent workflow or memory states benefit agents on recurring tasks. We assess this through two lifecycle tests. The first measures round-level reuse (Figure [1](https://arxiv.org/html/2606.08049#S4.F1)(a)): methods execute five WebArena-Verified rounds, carrying persistent states across iterations while per-task environment states and transient contexts reset. Each round applies a uniform perturbation protocol across all methods: tasks are reshuffled, starting URLs are varied, and intent templates are paraphrased while preserving core slots and intents. The second test isolates artifact reuse under perturbation before and after update. We first snapshot initially successful states and re-execute them three times without updates to measure reuse consistency (Figure [1](https://arxiv.org/html/2606.08049#S4.F1)(b)). Failed snapshots then enter each method's native update path, where we measure recovery and update-induced regression (Figure [1](https://arxiv.org/html/2606.08049#S4.F1)(c)). Appendix [B.1](https://arxiv.org/html/2606.08049#A2.SS1) gives the full protocol.

![Refer to caption](https://arxiv.org/html/2606.08049v1/x1.png)

Figure 1: SKILL.nb improves task success over repeated rounds, maintains the highest reuse consistency, and achieves the best recovery–regression trade-off. (a) Task success over five perturbed rounds. (b) Reuse consistency: fraction of workflows surviving three perturbed re-executions without updates. (c) Recovery vs. regression under each method's native update path. Emphasized markers denote repair budget 2 (Table [10](https://arxiv.org/html/2606.08049#A2.T10)). Error bars indicate 95% Wilson CIs.

SKILL.nb is the only method to improve over repeated rounds, rising from 53.7% to 55.7% by round 5, while baselines decline by 4.6–7.0 percentage points (Figure [1](https://arxiv.org/html/2606.08049#S4.F1)(a)). This advantage reflects superior artifact-level stability: SKILL.nb maintains 91.7% reuse consistency, above ReasoningBank (76.2%) and AWM$_{\text{online}}$ (71.6%; Figure [1](https://arxiv.org/html/2606.08049#S4.F1)(b)). When artifacts fail and enter native update paths, SKILL.nb achieves the best recovery–regression trade-off at budget 2 (72.9% recovery, 4.2% regression), outperforming ReasoningBank (63.0%, 15.0%) and AWM$_{\text{online}}$ (58.0%, 17.0%) (Figure [1](https://arxiv.org/html/2606.08049#S4.F1)(c)). These results indicate that while ungated memories often absorb regressive updates, SKILL.nb's validation-gated promotion keeps persistent workflow state stable under repeated perturbation and repair.

Table 3: Main mechanism ablation summary on the 258-task WebArena-Verified hard subset. SR and regression are percentages; token cost is SKILL.nb-internal maintenance/update tokens per success. Full diagnostics are in Appendix [C.1](https://arxiv.org/html/2606.08049#A3.SS1).

Mechanism and overhead diagnostics. Table [3](https://arxiv.org/html/2606.08049#S4.T3) shows component ablations on the WebArena-Verified hard subset. The full system has the highest point SR and the lowest SKILL.nb-internal maintenance/update token cost per success. Removing executable formalization or disabling fallback lowers success, while removing gates increases update-induced regression. In a separate three-round threshold-policy ablation (Figure [13](https://arxiv.org/html/2606.08049#A3.F13)), group-specialized thresholds give the best success–regression–maintenance trade-off: by round 3, SKILL.nb reaches 38.3% success with 3.3% regression, whereas loose fixed thresholds over-promote and fall to 27.1% success with 22.0% regression. Maintenance overhead amortizes over reuse, with token usage per successful task falling to 69.2% of its round-1 value by round 5. Full ablations, threshold curves, and cost curves are in Appendices [C.1](https://arxiv.org/html/2606.08049#A3.SS1), [C.2](https://arxiv.org/html/2606.08049#A3.SS2), and [B.3](https://arxiv.org/html/2606.08049#A2.SS3).

### 4.3 Adapting to Real-World Environment Drift

We next test whether persistent state remains useful under real interface drift. We migrate all 180 WebArena-Verified GitLab tasks from GitLab 15 to two target versions: GitLab 16, the first major UI-changing release relative to the original deployment, and GitLab 18.9, the latest available version. Appendix Figure [12](https://arxiv.org/html/2606.08049#A2.F12) and Table [11](https://arxiv.org/html/2606.08049#A2.T11) illustrate the corresponding UI and DOM drift. Unlike synthetic perturbations, these updates change DOM structure, selectors, and page flow while preserving user intent. Each method is evaluated in five conditions: fresh-start runs on GitLab 15, 16, and 18, plus frozen-state reuse on GitLab 16 and 18 using the repository or memory built on GitLab 15. In the frozen-state conditions, we restore the GitLab 15 repository or memory snapshot before each target-version task and discard durable updates afterward. Thus, success measures whether persistent state learned on the source deployment helps or harms execution after migration, rather than cumulative relearning on the target version. Full protocol details are in Appendix [B.4](https://arxiv.org/html/2606.08049#A2.SS4).

![Refer to caption](https://arxiv.org/html/2606.08049v1/x2.png)

Figure 2: SKILL.nb avoids negative transfer from stale procedural state under real GitLab version drift. The x-axis shows fresh-start runs on GitLab 15, 16, and 18, followed by frozen-state reuse from 15 → 16 and 15 → 18. Within each condition, orange, green, and blue bars denote AWM$_{\text{online}}$, ReasoningBank, and SKILL.nb. Hatched bars denote frozen repository or memory reuse. Error bars indicate 95% Wilson CIs.

Newer GitLab versions are not intrinsically harder under fresh-start execution: all methods improve on newer versions. Qualitative log inspection suggests that newer GitLab versions expose more stable DOM locators for the Playwright-based harness. They may also better match the web-navigation knowledge that recent LLMs acquire during pre-training.

Persistent-memory baselines suffer negative transfer from frozen source-version state. On GitLab 18, AWM$_{\text{online}}$ drops from 88/180 fresh-start successes to 62/180 frozen-state successes (48.9% to 34.4%), a 14.4-point degradation. ReasoningBank similarly drops from 90/180 to 70/180 successes (50.0% to 38.9%), an 11.1-point degradation. Qualitative inspection suggests that frozen memories often retain interface-specific assumptions, such as brittle click identifiers or page-geometry-dependent procedures, that no longer match the migrated DOM. In contrast, SKILL.nb preserves target-version performance under frozen-state reuse. On GitLab 18, it obtains 110/180 fresh-start successes and 111/180 frozen-state successes (61.1% vs. 61.7%). The same pattern holds on GitLab 16, where frozen-state reuse reaches 108/180 successes compared with 111/180 fresh-start successes. This result is consistent with gate-conditioned selective formalization: when an executable realization no longer satisfies its pre/postcondition gates, runtime can fall back from code to the preserved NL procedure or step intent. Thus, under real GitLab version drift, SKILL.nb avoids the stale-state degradation observed in persistent-memory baselines.
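The frozen-versus-fresh gaps follow directly from the success counts quoted above. The short script below redoes that arithmetic; the dictionary layout and the `gap_points` helper are conveniences for this example, while the counts themselves are the ones reported in the text.

```python
TOTAL_TASKS = 180

# (fresh-start successes, frozen-state successes) as reported in the text
counts = {
    ("AWM_online", "GitLab 18"): (88, 62),
    ("ReasoningBank", "GitLab 18"): (90, 70),
    ("SKILL.nb", "GitLab 18"): (110, 111),
    ("SKILL.nb", "GitLab 16"): (111, 108),
}


def gap_points(fresh: int, frozen: int, total: int = TOTAL_TASKS) -> float:
    """Frozen minus fresh success rate, in percentage points (one decimal)."""
    return round(100 * (frozen - fresh) / total, 1)


gaps = {key: gap_points(*pair) for key, pair in counts.items()}
# gaps[("AWM_online", "GitLab 18")]    -> -14.4
# gaps[("ReasoningBank", "GitLab 18")] -> -11.1
# gaps[("SKILL.nb", "GitLab 18")]      -> 0.6
# gaps[("SKILL.nb", "GitLab 16")]      -> -1.7
```

The two SKILL.nb gaps correspond to differences of one and three tasks out of 180, which is small relative to the 20 to 26 task drops seen for the persistent-memory baselines.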

## 5 Conclusion and Limitations

We presented SKILL.nb, a lifecycle-governance framework for durable web-agent workflow artifacts. Its core mechanism, *selective formalization*, uses execution-grounded evidence to decide when workflows are released or retired, and when steps remain NL-guided, become executable, or are demoted after repair. Versioned notebooks provide the governed artifact: they bind NL procedures, executable realizations, validation gates, and maintenance history into a reusable object. Across WebArena-Verified, Mind2Web, and GitLab version drift, SKILL.nb improves single-round task performance, repeated-use reliability, repair recovery, and robustness to stale persistent state. These results suggest that durable agent artifacts should be treated not only as memories, but as lifecycle-managed objects whose promotion, repair, and retirement are governed by execution evidence.

Limitations. SKILL.nb's lifecycle decisions are bounded by logged execution evidence. The Wilson-UCB feasibility check bounds estimated violation rates on threshold-estimation cases, not arbitrary future workloads or interface shifts. The method also depends on reliable gates, metadata quality, group assignment, and recurring task structure; gate errors or sparse recurrence can lead to invalid executions or conservative pooled behavior. Our maintenance-cost proxy counts LLM inference tokens rather than total operational cost, such as wall-clock latency, storage, or human review. Finally, our evaluation studies reuse and lifecycle correction in controlled benchmark environments, including single-application GitLab version migration, which may differ from long-term production deployment. Appendix [A.8](https://arxiv.org/html/2606.08049#A1.SS8) discusses these limitations in detail.

## References

- [1] (2025). Mem0: building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413. [Link](https://arxiv.org/abs/2504.19413)
- [2] S. R. Choudhary, D. Zhao, H. Versee, and A. Orso (2011). WATER: web application test repair. In Proceedings of the First International Workshop on End-to-End Test Script Engineering.
- [3] L. de Moura and S. Ullrich (2021). The Lean 4 theorem prover and programming language. In Automated Deduction – CADE 28, pp. 625–635.
- [4] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su (2023-12). Mind2Web: towards a generalist agent for the web. Advances in Neural Information Processing Systems 36, pp. 28091–28114.
- [5] A. El Hattami, M. Thakkar, N. Chapados, and C. Pal (2025). WebArena Verified: reliable evaluation for web agents. In Workshop on Scalable and Efficient Agents at NeurIPS.
- [6] H. Elhashemy, Y. Lotfy, and Y. Tang (2025). Bridging the prototype-production gap: a multi-agent system for notebooks transformation. arXiv preprint arXiv:2511.07257. [Link](https://arxiv.org/abs/2511.07257)
- [7] R. Fang, Y. Liang, X. Wang, J. Wu, S. Qiao, P. Xie, F. Huang, H. Chen, and N. Zhang (2025). MemP: exploring agent procedural memory. arXiv preprint arXiv:2508.06433. [Link](https://arxiv.org/abs/2508.06433)
- [8] H. Gao, J. Geng, W. Hua, M. Hu, X. Juan, H. Liu, S. Liu, J. Qiu, X. Qi, Y. Wu, H. Wang, H. Xiao, Y. Zhou, S. Zhang, J. Zhang, J. Xiang, Y. Fang, Q. Zhao, D. Liu, Q. Ren, C. Qian, Z. Wang, M. Hu, H. Wang, Q. Wu, H. Ji, and M. Wang (2025). A survey of self-evolving agents: what, when, how, and where to evolve on the path to artificial super intelligence. arXiv preprint arXiv:2507.21046. [Link](https://arxiv.org/abs/2507.21046)
- [9] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2013). Bayesian data analysis. 3rd edition, Chapman and Hall/CRC.
- [10] L. Geng and E. Y. Chang (2025). ALAS: transactional and dynamic multi-agent LLM planning. arXiv preprint arXiv:2511.03094.
- [11] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025). DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [12] M. Hu, T. Chen, Q. Chen, Y. Mu, W. Shao, and P. Luo (2025). HiAgent: hierarchical working memory management for solving long-horizon agent tasks with large language model. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, pp. 32779–32798. [Link](https://aclanthology.org/2025.acl-long.1575/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1575)
- [13] F. Huq, Z. Z. Wang, F. F. Xu, T. Ou, S. Zhou, J. P. Bigham, and G. Neubig (2025). CowPilot: a framework for autonomous and human-agent collaborative web navigation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), Albuquerque, New Mexico, pp. 163–172. [Document](https://dx.doi.org/10.18653/v1/2025.naacl-demo.17), [Link](https://aclanthology.org/2025.naacl-demo.17/)
- [14] A. Q. Jiang, W. Li, S. Tworkowski, K. Czechowski, T. Odrzygóźdź, P. Miłoś, Y. Wu, and M. Jamnik (2023). Draft, sketch, and prove: guiding formal theorem provers with informal proofs. In International Conference on Learning Representations.
- [15] N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, Y. Gu, S. Malik, V. Graf, J. D. Hwang, J. Yang, R. Le Bras, O. Tafjord, C. Wilhelm, L. Soldaini, N. A. Smith, Y. Wang, P. Dasigi, and H. Hajishirzi (2024). Tülu 3: pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124.
- [16] M. Leotta, D. Clerissi, F. Ricca, and P. Tonella (2014). Visual vs. DOM-based web locators: an empirical study. In International Conference on Web Engineering.
- [17] M. Leotta, D. Clerissi, F. Ricca, and P. Tonella (2016). ROBULA+: an algorithm for generating robust XPath locators for web testing. Journal of Software: Evolution and Process.
- [18] S. Levine, A. Kumar, G. Tucker, and J. Fu (2020). Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
- [19] I. Levy, B. Wiesel, S. Marreed, A. Oved, A. Yaeli, and S. Shlomov (2025-05). ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents. arXiv preprint arXiv:2410.06703. [Document](https://dx.doi.org/10.48550/arXiv.2410.06703), [Link](http://arxiv.org/abs/2410.06703)
- [20] Y. Li, S. Ping, X. Chen, X. Qi, Z. Wang, Y. Luo, and X. Zhang (2025). AgentGit: a version control framework for reliable and scalable LLM-powered multi-agent systems. arXiv preprint arXiv:2511.00628.
- [21] X. Liang, Y. He, Y. Xia, X. Song, J. Wang, M. Tao, L. Sun, X. Yuan, J. Su, K. Li, J. Chen, J. Yang, S. Chen, and T. Shi (2024). Self-evolving agents with reflective and memory-augmented abilities. arXiv preprint arXiv:2409.00872. [Link](https://arxiv.org/abs/2409.00872)
- [22] J. Liu, K. Wang, Y. Chen, X. Peng, Z. Chen, L. Zhang, and Y. Lou (2024). Large language model-based agents for software engineering: A survey. arXiv preprint arXiv:2409.02977.
- [23] Y. Liu, M. Sra, J. P. Inala, and C. Wang (2025). ReUseIt: synthesizing reusable AI agent workflows for web automation. arXiv preprint arXiv:2510.14308.
- [24] Y. Liu, C. Si, K. R. Narasimhan, and S. Yao (2025). Contextual experience replay for self-improvement of language agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, pp. 14179–14198. [Link](https://aclanthology.org/2025.acl-long.694/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.694)
- [25] B. P. Majumder, B. D. Mishra, P. Jansen, O. Tafjord, N. Tandon, L. Zhang, C. Callison-Burch, and P. Clark (2023). CLIN: a continually learning language agent for rapid task adaptation and generalization. arXiv preprint arXiv:2310.10134. [Link](https://arxiv.org/abs/2310.10134)
- [26] B. Mohammadi, N. Potamitis, L. Klein, A. Arora, and L. Bindschaedler (2026). Atomix: timely, transactional tool use for reliable agentic workflows. arXiv preprint arXiv:2602.14849.
- [27] S. Ouyang, J. Yan, I. Hsu, Y. Chen, K. Jiang, Z. Wang, R. Han, L. T. Le, S. Daruki, X. Tang, V. Tirumalashetty, G. Lee, M. Rofouei, H. Lin, J. Han, C. Lee, and T. Pfister (2025). ReasoningBank: scaling agent self-evolving with reasoning memory. arXiv preprint arXiv:2509.25140.
- [28] C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez (2023). MemGPT: towards LLMs as operating systems. arXiv preprint arXiv:2310.08560. [Link](https://arxiv.org/abs/2310.08560)
- [29] Y. Piao, H. Min, H. Su, L. Zhang, L. Wang, Y. Yin, X. Wu, Z. Xu, L. Qu, H. Li, et al. (2025). AgentBay: a hybrid interaction sandbox for seamless human-AI intervention in agentic systems. arXiv preprint arXiv:2512.04367.
- [30] C. Qian, S. Liang, Y. Qin, Y. Ye, X. Cong, Y. Lin, Y. Wu, Z. Liu, and M. Sun (2024). Investigate-consolidate-exploit: a general strategy for inter-task agent self-evolution. arXiv preprint arXiv:2401.13996. [Link](https://arxiv.org/abs/2401.13996)
- [31] H. U. Rahman, A. Alzayed, M. I. Mohmand, A. M. Albarrak, and S. N. Qasem (2023). Application maintenance offshoring using HCI based framework and simple multi attribute rating technique (SMART). IEEE Access 11, pp. 107068–107084. [Document](https://dx.doi.org/10.1109/ACCESS.2023.3320941)
- [32] A. Stocco, M. Leotta, F. Ricca, and P. Tonella (2018). Visual web test repair. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
- [33] H. Su, R. Sun, J. Yoon, P. Yin, T. Yu, and S. O. Arik (2025). Learn-by-interact: a data-centric framework for self-adaptive agents in realistic environments. In The Thirteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=3UKOzGWCVY)
- [34] Z. Tan, J. Yan, I. Hsu, R. Han, Z. Wang, L. Le, Y. Song, Y. Chen, H. Palangi, G. Lee, A. R. Iyer, T. Chen, H. Liu, C. Lee, and T. Pfister (2025). In prospect and retrospect: reflective memory management for long-term personalized dialogue agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, pp. 8416–8439. [Link](https://aclanthology.org/2025.acl-long.413/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.413)
- [35] X. Tang, T. Hu, M. Ye, Y. Shao, X. Yin, S. Ouyang, W. Zhou, P. Lu, Z. Zhang, Y. Zhao, A. Cohan, and M. Gerstein (2025). ChemAgent: self-updating memories in large language models improves chemical reasoning. In The Thirteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=kuhIqeVg0e)
- [36] W. Tao, X. Xing, Y. Chen, L. Huang, and X. Xu (2025). TreeRAG: unleashing the power of hierarchical storage for enhanced knowledge retrieval in long documents. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp. 356–371. [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.20), [Link](https://aclanthology.org/2025.findings-acl.20/)
- [37] P. S. Thomas, G. Theocharous, and M. Ghavamzadeh (2015). High confidence policy improvement. In International Conference on Machine Learning.
- [38] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023-11). Voyager: an open-ended embodied agent with large language models. Transactions on Machine Learning Research. ISSN 2835-8856.
- [39] X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji (2024). Executable code actions elicit better LLM agents. In International Conference on Machine Learning.
- [40] Z. Wang, G. Neubig, and D. Fried (2024). TroVE: inducing verifiable and efficient toolboxes for solving programmatic tasks. In Forty-First International Conference on Machine Learning. [Link](https://openreview.net/forum?id=DCNCwaMJjI)
- [41] Z. Z. Wang, J. Mao, D. Fried, and G. Neubig (2025). Agent workflow memory. In International Conference on Learning Representations.
- [42] Y. Wu, A. Q. Jiang, W. Li, M. N. Rabe, C. Staats, M. Jamnik, and C. Szegedy (2022). Autoformalization with large language models. In Advances in Neural Information Processing Systems.
- [43] Z. Wu, C. Han, Z. Ding, Z. Weng, Z. Liu, S. Yao, T. Yu, and L. Kong (2025-03). OS-Copilot: towards generalist computer agents with self-improvement. In ICLR 2024 Workshop on Large Language Model (LLM) Agents.
- [44] W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang (2025). A-MEM: agentic memory for LLM agents. arXiv preprint arXiv:2502.12110. [Link](https://arxiv.org/abs/2502.12110)
- [45] Z. You, Y. Zhang, D. Xu, Y. Lou, Y. Yan, W. Wang, H. Zhang, and Y. Huang (2025). DatawiseAgent: a notebook-centric LLM agent framework for adaptive and robust data science automation. arXiv preprint arXiv:2503.07044. [Link](https://arxiv.org/abs/2503.07044)
- [46] Z. Zhang, X. Bo, C. Ma, R. Li, X. Chen, Q. Dai, J. Zhu, Z. Dong, and J. Wen (2024). A survey on the memory mechanism of large language model based agents. arXiv preprint arXiv:2404.13501. [Link](https://arxiv.org/abs/2404.13501)
- [47] Z. Zhang and A. Zhang (2024-06). You only look at screens: multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436. [Document](https://dx.doi.org/10.48550/arXiv.2309.11436)
- [48] A. Zhao, D. Huang, Q. Xu, M. Lin, Y. Liu, and G. Huang (2024). ExpeL: LLM agents are experiential learners. Proceedings of the AAAI Conference on Artificial Intelligence 38(17), pp. 19632–19642. [Document](https://dx.doi.org/10.1609/aaai.v38i17.29936), [Link](https://doi.org/10.1609/aaai.v38i17.29936)
- [49] B. Zheng, M. Y. Fatemi, X. Jin, Z. Z. Wang, A. Gandhi, Y. Song, Y. Gu, J. Srinivasa, G. Liu, G. Neubig, and Y. Su (2025). SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills. arXiv preprint. [Link](https://arxiv.org/abs/2504.07079)
- [50] L. Zheng, R. Wang, X. Wang, and B. An (2023-10). Synapse: trajectory-as-exemplar prompting with memory for computer control. In The Twelfth International Conference on Learning Representations.
- \[51\]W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang\(2024\-03\)MemoryBank: enhancing large language models with long\-term memory\.Proceedings of the AAAI Conference on Artificial Intelligence38\(17\),pp\. 19724–19731\(english\)\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v38i17.29946),ISSN 2374\-3468Cited by:[§2](https://arxiv.org/html/2606.08049#S2.p1.1)\.

## Appendix A Method Details

### A.1 From Provisional Trace to Released Workflow Artifact

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/workflows/668_prov_intent.png)

(a) Cell 1: provisional header cell with task-specific values.

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/workflows/668_rel_intent.png)

(b) Cell 1: released header cell with reusable inputs and metadata.

Figure 3: Header-cell generalization for the GitLab merge-request lifecycle example. The provisional notebook records the concrete task request in Cell 1, while the released artifact keeps the reusable workflow description, input schema, and metadata so future requests can instantiate the same procedure with different branches, projects, and reviewers.

This appendix illustrates how a task-specific provisional notebook is converted into a reusable released SKILL.nb workflow artifact. The example uses a GitLab merge-request task to show how concrete task values are lifted into an input schema, how notebook cells are aligned to workflow steps, and how execution evidence supports validation and promotion. Runtime gates use browser-observable state and task-provided expected values; notebook outputs, logs, screenshots, and cell status are used by offline maintenance during artifact review. The example does not use hidden benchmark labels or evaluator outputs.

The concrete example is Task 668, a GitLab task starting from [http://localhost:8023](http://localhost:8023/). The user request is to submit a merge request for source branch `redesign` in project `a11yproject.com`, merge it into the `main` branch, and assign `Roshan Jossy` as the reviewer. This task instance supplies concrete values for the source branch, target branch, project, and reviewer. During maintenance, these values are lifted into an input schema, while the task-specific request is rewritten as a generic merge-request workflow intent. This keeps the artifact reusable without storing the benchmark task ID as part of the executable workflow.

Figure [3](https://arxiv.org/html/2606.08049#A1.F3) shows the first promotion transformation. The task-specific request supplies concrete inputs: source branch `redesign`, target branch `main`, source project `a11yproject.com`, and reviewer `Roshan Jossy`. Maintenance distills this instance into a reusable header cell that records the workflow description, input schema $X$, and workflow metadata $M^{W}$ for retrieval.

After the provisional workflow executes, maintenance uses notebook evidence to align traces to the step cells in Figure [4](https://arxiv.org/html/2606.08049#A1.F4), then expands individual stages as in Figure [5](https://arxiv.org/html/2606.08049#A1.F5) to decide which steps should be formalized as executable cells with gates. In this example, trace evidence supports promotion from $y(\mathcal{W}_{v})=\texttt{provisional}$ to $y(\mathcal{W}_{v})=\texttt{released}$, while validated executable cells change the corresponding step indicators from $z_{i}=0$ to $z_{i}=1$. Each formalized step stores a local intent $I_{i}$, natural-language procedure $P_{i}$, executable realization $C_{i}$, executable pre/post gates $\Gamma_{i}$, and metadata $M_{i}^{S}$. The gates can check the current GitLab context, branch fields, reviewer selection, navigation state, and successful form submission. If later repair evidence shows that the workflow has become too costly or unstable, maintenance can move the same workflow lineage to $y(\mathcal{W}_{v})=\texttt{retired}$ while retaining it for rollback and analysis. Because creating a merge request is non-idempotent, maintenance validation controls promotion rather than blindly rerunning the same submission.
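The step tuple and lifecycle states described above can be pictured as a small data model. The following is a minimal sketch, not the authors' schema: the class names `Step`, `Workflow`, and `Lifecycle`, the `promote` helper, and the rule that a step counts as formalized once it has code and at least one gate are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, replace
from enum import Enum
from typing import Callable, Optional

# A gate is modeled here as a deterministic predicate over a state snapshot.
Gate = Callable[[dict], bool]


class Lifecycle(Enum):
    PROVISIONAL = "provisional"
    RELEASED = "released"
    RETIRED = "retired"


@dataclass
class Step:
    intent: str                                            # I_i
    procedure: str                                         # P_i
    code: Optional[str] = None                             # C_i (None until formalized)
    pre_gates: list[Gate] = field(default_factory=list)    # part of Gamma_i
    post_gates: list[Gate] = field(default_factory=list)   # part of Gamma_i
    metadata: dict = field(default_factory=dict)           # M_i^S

    @property
    def formalized(self) -> bool:
        # Assumed reading of z_i = 1: executable code plus at least one gate.
        return self.code is not None and bool(self.pre_gates or self.post_gates)


@dataclass
class Workflow:
    description: str
    input_schema: dict                                     # X
    metadata: dict                                         # M^W
    steps: list[Step]
    version: int = 1
    status: Lifecycle = Lifecycle.PROVISIONAL


def promote(workflow: Workflow, validated: bool) -> Workflow:
    """Return a released copy only when offline validation passed."""
    if not validated or workflow.status is not Lifecycle.PROVISIONAL:
        return workflow
    # The original object is left untouched so the provisional version stays addressable.
    return replace(workflow, version=workflow.version + 1, status=Lifecycle.RELEASED)
```

Returning a new object instead of mutating in place mirrors the idea that earlier versions remain available for rollback and analysis.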

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/workflows/668_prov_collapsed_steps.png)

(a) Cells 2–N: provisional workflow steps.

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/workflows/668_rel_collapsed_steps.png)

(b) Cells 2–N: released workflow steps with executable gates.

Figure 4: Step-cell transformation for the GitLab merge-request lifecycle example. Cells 2–N contain the executable workflow steps: setup, browser actions, checks, and submission. The released artifact preserves the reusable step structure while replacing task-specific traces with parameterized inputs and validation gates.

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/workflows/668_prov_step_1.png)

(a) Cells 2–3: provisional Step 1 with setup, checks, and red-boxed execution log.

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/workflows/668_rel_step_1.png)

(b) Cells 2–8: released Step 1 with parameterized setup and gates.

Figure 5: Expanded Step 1 transformation for the GitLab merge-request lifecycle example. The provisional step in (a), from the provisional notebook, is formalized into (b), from the released notebook, as $s_{1}=\langle I_{1},P_{1},C_{1},\Gamma_{1},M_{1}^{S}\rangle$ with intent, procedure, executable setup, gates, and metadata.

### A.2 Notebook Evidence and Debugging Affordances

The previous subsection used Task 668 to illustrate artifact promotion\. The next figures show additional notebook affordances used by maintenance across workflows: cell\-attached screenshots, interactive debugging state, localized failure evidence, and heterogeneous execution cells\.

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/workflows/668_prov_step_4.png)

Figure 6: Cell-attached screenshot evidence for dynamic form handling. When provisional creation or execution encounters ambiguous controls, such as repeated `Unassigned` dropdowns for assignee and reviewer, the agent attaches the observed UI to the relevant cell. The embedded screenshot is cropped for readability; the actual notebook can link to the full-page screenshot used by maintenance to formalize reusable gates.

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/workflows/668_prov_step_1_debug.png)

Figure 7: Interactive debugging support from the notebook representation. Because workflow realizations live in executable Jupyter cells, an agent or maintenance process can pause at a breakpoint, step through browser automation code, inspect task inputs and Playwright page state, and turn observed failures into more reliable executable gates.

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/workflows/668_prov_failed_post_check.png)

Figure 8: Cell-local failure evidence in a provisional notebook. When a gate fails, Jupyter stores the printed context, exception, and traceback directly under the failing cell, so an agent can retrieve the relevant code, state, and error in one localized artifact. A multi-component logging system could provide similar information, but would add logging infrastructure and synchronization points outside the workflow artifact.

Although the main walkthrough uses Task 668, the notebook representation also supports heterogeneous execution cells. Figure [9](https://arxiv.org/html/2606.08049#A1.F9) shows a separate Task 784 example using Bash and Python cells in the same artifact.

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/workflows/784_prov_multilang.png)

Figure 9: Mixed-language workflow cells in a provisional notebook. For Task 784, the agent can use a Bash cell to clone the GitLab repository `CellularPrivacy/Android-IMSI-Catcher-Detector`, then use Python cells to analyze commits on branch `master` and identify the contributor email. This lets each step use the most suitable tool while preserving the full workflow trace in one artifact.

### A.3 SKILL.nb as an Extension of SKILL.md

Agent Skills define a lightweight convention for packaging agent capabilities as folders with a SKILL.md file, optional scripts, references, and assets (see [https://agentskills.io/home](https://agentskills.io/home)). A SKILL.nb artifact can be viewed as a natural extension of this convention rather than a separate abstraction. It preserves the same core ingredients: a human-readable description, procedural instructions, optional executable resources, and supporting artifacts. The key difference is the execution contract. A SKILL.md file may include code blocks, but the markdown artifact alone does not guarantee that an agent executes those blocks exactly as written. Unless an external client or harness enforces verbatim execution, the agent may adapt the snippet, use it as guidance, or ignore it.

SKILL.nb makes this boundary explicit by placing instructions, executable cells, gates, observed outputs, and local evidence in one versioned notebook object. This does not remove the need for task-level execution controls, but it gives maintenance a single auditable artifact for checking what code was intended to run, where validation occurred, and what evidence was produced. The choice also avoids introducing a bespoke trace format: notebooks are a standard and widely supported representation for interleaving markdown, code, metadata, and outputs, and are therefore a familiar substrate for existing tools and LLM-backed agents. In our implementation, the loader supports progressive disclosure: it first reads the leading markdown cell and loads later cells only when execution, validation, or evidence inspection is needed.

Our implementation uses this continuity directly. It overloads the existing skill-loading path so that a notebook artifact can be exposed through the same discovery interface as a SKILL.md skill. At discovery time, the loader extracts the notebook-level name, intent, and description from the leading markdown cell. At activation time, it reads the relevant notebook cells and attached resources. This makes the notebook representation compatible with the same discovery pathway in our implementation while retaining executable cell boundaries, validation gates, and cell outputs.
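The two-phase loading described above might look roughly like the following sketch. It relies only on the standard notebook JSON layout (a top-level `cells` list whose entries carry `cell_type` and `source`); the function names `discover` and `activate`, and the assumed header layout (first line is the name, remaining lines are the description), are hypothetical and not taken from the authors' loader.

```python
import json


def discover(notebook_json: str) -> dict:
    """Discovery phase: look only at the leading markdown cell."""
    nb = json.loads(notebook_json)
    cells = nb.get("cells", [])
    if not cells or cells[0].get("cell_type") != "markdown":
        raise ValueError("expected a leading markdown header cell")
    source = cells[0].get("source", "")
    # Notebook JSON may store a cell's source as a string or a list of lines.
    text = "".join(source) if isinstance(source, list) else source
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Assumed layout: a '# Name' heading followed by description lines.
    name = lines[0].lstrip("# ").strip() if lines else ""
    description = " ".join(lines[1:])
    return {"name": name, "description": description}


def activate(notebook_json: str) -> list:
    """Activation phase: return the remaining cells for execution or inspection."""
    nb = json.loads(notebook_json)
    return nb.get("cells", [])[1:]
```

This sketch still parses the whole JSON document in both phases; it only limits what is surfaced to the agent at discovery time. A production loader could go further and stream the file so that later cells are never read until activation.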

Figure [10](https://arxiv.org/html/2606.08049#A1.F10) shows the corresponding SKILL.md representation for the released Step 1 notebook artifact in Figure [5](https://arxiv.org/html/2606.08049#A1.F5)(b), while Table [4](https://arxiv.org/html/2606.08049#A1.T4) summarizes the design differences.

The comparison is representational rather than a claim that notebooks are intrinsically safer. Execution still depends on the surrounding client and task environment. The benefit of SKILL.nb is that the executable cells, gates, outputs, screenshots, and logs are stored in one artifact, so offline maintenance can audit the intended realization and the evidence used to validate it.

Table 4: Comparison between conventional SKILL.md artifacts and SKILL.nb notebook artifacts.

The notebook representation has overhead. A notebook is stored as JSON rather than plain markdown, and it carries cell metadata and outputs. This increases artifact size and may make raw diffs less compact than a short SKILL.md file. We view this as an engineering tradeoff. The JSON format is standardized, widely supported, and easy for modern agents to parse. In return, the artifact can combine natural-language procedure, executable code in multiple languages, screenshots, validation logs, and rerun boundaries in one versioned object. This also improves local debugging: an agent can inspect the recorded outputs and validation logs in the same artifact, without requiring a separate tracing or observability system.

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/skillmd.png)

Figure 10: Markdown skill representation of Figure [5](https://arxiv.org/html/2606.08049#A1.F5)(b), Cells 2–8: released Step 1 with parameterized setup and gates. The SKILL.md version stores the same GitLab merge-request workflow as front matter, inputs, natural-language intent and procedure, metadata, and code snippets in a markdown document, whereas the corresponding SKILL.nb artifact stores these units as executable notebook cells with validation gates and cell-local outputs.

### A.4 Artifact Safety and Reproducibility Scope

The workflow artifact schema records intents, procedures, executable cells, gates, metadata, and version links. It does not include credentials, hidden evaluator labels, or benchmark final answers as durable workflow fields. Runtime traces and local repair memories are non-authoritative until offline maintenance promotes a validated workflow version into $\mathcal{K}$.

In our evaluation, executable notebook cells are run under the surrounding task or benchmark execution controls\. We therefore treat sandboxing, dependency pinning, credential scoping, and trace redaction as deployment and evaluation\-harness controls rather than as guarantees provided by the lifecycle policy itself\. Trace redaction, including redaction of stored visual state when it contains user\-provided secrets or credentials, is part of the surrounding deployment harness and outside the algorithmic claims of the lifecycle policy\. The method claims here rely on versioned artifacts, deterministic gate checks, offline validation, and rollback\-addressable repository state\.

### A\.5Validation and Evidence Signals

Table [5](https://arxiv.org/html/2606.08049#A1.T5) summarizes the signals used by SKILL.nb. Runtime gates are deterministic predicates over browser-observable state instantiated with task-provided expected values. They can inspect DOM structure, visibility, URLs, form values, page text, and counts of relevant page objects. Gates do not call benchmark evaluators, hidden success labels, or final-answer oracles.
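To make the notion of a deterministic gate concrete, the sketch below models browser-observable state as a plain dictionary snapshot and builds gates as closures over task-provided expected values. Everything here is illustrative: the snapshot keys (`url`, `form`, `page_text`), the helper names, and the URL fragment are assumptions, and a real gate would read these values from a live browser page instead of a dictionary.

```python
from typing import Callable

State = dict                      # hypothetical snapshot of browser-observable state
Gate = Callable[[State], bool]    # a gate is a pure predicate over that snapshot


def url_contains(fragment: str) -> Gate:
    return lambda state: fragment in state.get("url", "")


def field_equals(name: str, expected: str) -> Gate:
    return lambda state: state.get("form", {}).get(name) == expected


def text_visible(snippet: str) -> Gate:
    return lambda state: snippet in state.get("page_text", "")


def all_pass(gates: list[Gate], state: State) -> bool:
    return all(gate(state) for gate in gates)


def build_post_gates(inputs: dict) -> list[Gate]:
    """Instantiate gates with task-provided expected values only."""
    return [
        url_contains("/merge_requests/new"),   # illustrative URL fragment
        field_equals("source_branch", inputs["source_branch"]),
        field_equals("target_branch", inputs["target_branch"]),
    ]
```

Because each gate depends only on the snapshot and the task inputs, the same snapshot always yields the same verdict, and no evaluator or hidden label is consulted.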

Offline validation is broader than runtime gating\. It may use logged traces, cached regression checks, maintenance review, or calibration\-split benchmark labels when such labels are available\. These labels are offline\-only: they are never available to the runtime controller, and final evaluation tasks are disjoint from the calibration logs and labels used for threshold analysis\. Final benchmark trajectories, hidden labels, and task success outcomes are not used to estimate thresholds, assign groups, accept repairs, or promote workflows before evaluation\.

A repair is accepted only when it passes the triggering trace’s deterministic pre/post gates and offline maintenance validation for the affected workflow version. Accepted repairs update the workflow-version repair count used for demotion and the workflow-lineage repair burden used for retirement. A threshold violation is an offline calibration outcome where a candidate threshold would admit a lifecycle action whose downstream validation loss exceeds $J_{\mathrm{perf}}^{\mathrm{ref}}+\epsilon$.

Table 5: Validation and evidence signals used by SKILL.nb. Runtime never accesses benchmark labels. Calibration-split labels may be used only for offline threshold analysis; final evaluation labels are metrics-only.

### A.6 Runtime System Details

This section collects the operational details supporting §[3.4](https://arxiv.org/html/2606.08049#S3.SS4). It gives the existing-workflow execution algorithm together with the fuller runtime semantics for gate-conditioned execution, drift recovery, experience distillation, and asynchronous maintenance.

#### A.6.1 Execution Algorithms

Runtime execution is decomposed into two substantive procedures. Algorithm [1](https://arxiv.org/html/2606.08049#alg1) manages run state, proposal accumulation, and recovery escalation, while Algorithm [2](https://arxiv.org/html/2606.08049#alg2) encapsulates gate checking, local repair, and the realization cascade. Failure distillation and the recovery trigger are simple bookkeeping operations, so we describe them in prose rather than as separate algorithm floats.

**Algorithm 1** Top-level existing-workflow execution loop

**Input:** query $q$, knowledge repo $\mathcal{K}$, runtime memory $\mathcal{M}$, execution log $\mathcal{R}_{\mathrm{logs}}$, policy parameter $\tau_{\mathrm{recover}}$, env $\mathcal{E}$

**Output:** result $y$

- $\mathcal{W}_{v_{L}}\leftarrow\mathrm{Retrieve}(q,\mathcal{K})$; $S\leftarrow\mathcal{W}_{v_{L}}.S$
- $\mathcal{I}_{\mathrm{aff}}\leftarrow\varnothing$; $\mathrm{patches}\leftarrow\varnothing$; $\mathcal{B}_{\mathrm{mem}}\leftarrow\varnothing$
- **for** $i=1$ **to** $|S|$ **do**
  - $\mathcal{N}_{i}\leftarrow\mathrm{RetrieveMem}(q,x_{t},S[i],\mathcal{M})$
  - $(\mathrm{ok},p_{i},\rho_{i},r_{i})\leftarrow\textsc{PerformStep}(\mathcal{W}_{v_{L}},i,\mathcal{N}_{i},\mathcal{E})$
  - **if not** $\mathrm{ok}$ **then**
    - distill failure memory $\rho_{f}$ and proposal $\Pi_{f}$ from $(q,i,x_{t},\mathcal{R}_{\mathrm{logs}})$
    - add non-empty $\rho_{f}$ to $\mathcal{B}_{\mathrm{mem}}$; submit non-empty $\Pi_{f}$
    - $\mathcal{I}_{\mathrm{aff}}\leftarrow\mathcal{I}_{\mathrm{aff}}\cup\{i\}$
  - **else**
    - add non-empty $\rho_{i}$ to $\mathcal{B}_{\mathrm{mem}}$ and non-empty $p_{i}$ to $\mathrm{patches}$
    - **if** $r_{i}$ **or** $p_{i}\neq\varnothing$ **then**
      - $\mathcal{I}_{\mathrm{aff}}\leftarrow\mathcal{I}_{\mathrm{aff}}\cup\{i\}$
    - **end if**
  - **end if**
  - $\rho_{\mathrm{aff}}\leftarrow|\mathcal{I}_{\mathrm{aff}}|/|S|$
  - **if** $\rho_{\mathrm{aff}}\geq\tau_{\mathrm{recover}}$ **then**
    - $(p_{g},y_{g})\leftarrow\mathrm{GlobalRecoverAndExecute}(\mathcal{W}_{v_{L}},i,q,\mathcal{N}_{i},\mathcal{R}_{\mathrm{logs}},\mathcal{E},\mathcal{K})$
    - **if** $p_{g}\neq\varnothing$ **then**
      - $\mathrm{patches}\leftarrow\mathrm{patches}\cup\{p_{g}\}$
    - **end if**
    - **if** $y_{g}\neq\mathrm{success}$ **then**
      - **if** $\mathcal{B}_{\mathrm{mem}}\neq\varnothing$ **then**
        - $\mathcal{M}\leftarrow\mathrm{Consolidate}(\mathcal{M},\mathcal{B}_{\mathrm{mem}})$
      - **end if**
      - **return** $\mathrm{failure}$
    - **end if**
    - **break**
  - **else if not** $\mathrm{ok}$ **then**
    - **if** $\mathcal{B}_{\mathrm{mem}}\neq\varnothing$ **then**
      - $\mathcal{M}\leftarrow\mathrm{Consolidate}(\mathcal{M},\mathcal{B}_{\mathrm{mem}})$
    - **end if**
    - **return** $\mathrm{failure}$
  - **end if**
- **end for**
- **if** $\mathrm{patches}\neq\varnothing$ **then**
  - $\mathcal{W}^{\prime}\leftarrow\mathrm{BuildProposal}(\mathcal{W}_{v_{L}},\mathrm{patches},\mathcal{I}_{\mathrm{aff}})$
  - $\mathrm{Submit}(\mathcal{W}^{\prime})$
- **end if**
- **if** $\mathcal{B}_{\mathrm{mem}}\neq\varnothing$ **then**
  - $\mathcal{M}\leftarrow\mathrm{Consolidate}(\mathcal{M},\mathcal{B}_{\mathrm{mem}})$
- **end if**
- **return** $\mathrm{success}$

**Algorithm 2** Step execution with gate checking, local repair, and realization fallback

**Input:** workflow $\mathcal{W}_{v_{L}}$, step index $i$, retrieved memory $\mathcal{N}_{i}$, env $\mathcal{E}$

**Output:** $(\mathrm{ok},p_{i},\rho_{i},r_{i})$

- $p_{i}\leftarrow\varnothing$; $\rho_{i}\leftarrow\varnothing$; $r_{i}\leftarrow\mathrm{false}$
- **if not** $\gamma_{i,\mathrm{pre}}(x_{t})$ **then**
  - diagnose the mismatch between expected assumptions and current state
  - **if** the mismatch is attributable to environmental drift and a local repair succeeds **then**
    - $r_{i}\leftarrow\mathrm{true}$; record any induced patch $p_{i}$
  - **else**
    - **return** $(\mathrm{false},p_{i},\rho_{i},r_{i})$
  - **end if**
- **end if**
- attempt realizations in order $C_{i}\to P_{i}\to I_{i}$ until one satisfies the postcondition gate
- **if** the attempted realization is accepted **then**
  - distill any reusable local-repair memory into $\rho_{i}$
  - **return** $(\mathrm{true},p_{i},\rho_{i},r_{i})$
- **end if**
- **return** $(\mathrm{false},p_{i},\rho_{i},r_{i})$

Algorithm [1](https://arxiv.org/html/2606.08049#alg1) is the algorithmic realization of the runtime loop in §[3.4](https://arxiv.org/html/2606.08049#S3.SS4). It manages run-level state: retrieving the released workflow, maintaining the affected-step set $\mathcal{I}_{\mathrm{aff}}$, accumulating local patches, and deciding when to terminate, recover globally, or submit a workflow-level proposal.

Algorithm [2](https://arxiv.org/html/2606.08049#alg2) encapsulates gate-conditioned step execution. It checks the precondition gate, attempts local repair when the mismatch appears to be environmental drift, then attempts the code $\to$ procedure $\to$ intent cascade until one realization satisfies the postcondition. Its outputs are the success flag $\mathrm{ok}$, a local patch $p_{i}$, a reusable memory item $\rho_{i}$, and the repair indicator $r_{i}$.
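The control flow of Algorithm 2 can be sketched in a few lines. This is an illustrative reading under simplifying assumptions, not the authors' runtime: state is a plain dictionary, realizations are callables tried in cascade order, a raised exception counts as a failed realization, and the function additionally returns the new state (the algorithm itself returns only the four-tuple). The names `StepSpec` and `perform_step` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

State = dict


@dataclass
class StepSpec:
    pre_gate: Callable[[State], bool]
    post_gate: Callable[[State], bool]
    # Realizations in cascade order: code, then procedure, then intent.
    realizations: list[Callable[[State], State]]
    # Optional local repair: returns (repaired state, patch) or None on failure.
    local_repair: Optional[Callable[[State], Optional[tuple[State, str]]]] = None


def perform_step(step: StepSpec, state: State):
    """Return (ok, patch, memory, repaired, new_state)."""
    patch, memory, repaired = None, None, False
    if not step.pre_gate(state):
        # Precondition mismatch: try a local repair before giving up on the step.
        fixed = step.local_repair(state) if step.local_repair else None
        if fixed is None:
            return False, patch, memory, repaired, state
        state, patch = fixed
        repaired = True
    for realize in step.realizations:
        try:
            candidate = realize(state)
        except Exception:
            continue  # this realization failed; fall back to the next one
        if step.post_gate(candidate):
            if repaired:
                memory = f"repair reused: {patch}"  # stand-in for distilled memory
            return True, patch, memory, repaired, candidate
    return False, patch, memory, repaired, state
```

The fallback stays local to the step: a broken code realization does not abort the run as long as a later realization satisfies the postcondition gate.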

When a step fails, runtime extracts a failure-sourced memory item $\rho_{f}$ and a workflow-level proposal $\Pi_{f}$ from the current execution trace. The memory item may later be consolidated into $\mathcal{M}$, while the proposal is submitted for offline maintenance review. Recovery escalation uses the normalized affected-step ratio $\rho_{\mathrm{aff}}=|\mathcal{I}_{\mathrm{aff}}|/|S|$ and triggers global recovery when $\rho_{\mathrm{aff}}\geq\tau_{\mathrm{recover}}$; the next subsection gives the full criterion.

#### A.6.2 Gate-Conditioned Execution

Runtime handles a query $q$ in either *existing-workflow* mode or *no-workflow* mode. In existing-workflow mode, the latest released workflow is the sole authoritative runtime artifact. In no-workflow mode, the agent synthesizes a provisional workflow $\hat{\mathcal{W}}$ from the task intent and current observable state; similar released workflows, supporting artifacts, and same-site or same-domain logs may be used only as advisory context. Gates for either released or provisional steps are executable predicates over browser-observable state, and cannot query benchmark evaluators, final task labels, or hidden oracle state.

During either mode, runtime maintains a temporary per-run memory $\mathcal{M}$ alongside the authoritative repository $\mathcal{K}$. This memory is mutable but non-authoritative: it may store transient observations, local repair traces, and provisional routines for the current run, but it cannot update $\mathcal{K}$ directly. Candidate updates derived from $\mathcal{M}$ become durable only after offline review and promotion.

#### A.6.3 Runtime Recovery Criterion

During execution, runtime maintains a run-level instability signal to determine when local repair is no longer sufficient. A step is marked as affected when it fails locally, requires an accepted local repair, or produces a non-empty step-local patch $p_{i}$ indicating drift. Let

$$\mathcal{I}_{\mathrm{aff}}=\{i\mid\text{step }i\text{ has shown instability in the current run}\}$$

denote the set of unique affected steps in the current run. Each step contributes at most once, so repeated local difficulty on the same step does not artificially inflate the run-level signal. The normalized affected-step ratio is

$$\rho_{\mathrm{aff}}=\frac{|\mathcal{I}_{\mathrm{aff}}|}{|S|},$$

where $|S|$ is the number of steps in the current workflow.

Runtime escalates to workflow-level recovery when this ratio crosses the runtime recovery control $\tau_{\mathrm{recover}}$:

$$e_{i}=(\rho_{\mathrm{aff}}\geq\tau_{\mathrm{recover}}).$$

This normalization makes the recovery control comparable across workflows of different lengths. When $e_{i}$ holds, the agent invokes a global recovery routine that re-plans and executes the remaining steps using retrieved memories, execution history, and prior workflow versions as context [[27](https://arxiv.org/html/2606.08049#bib.bib31)]. If recovery cannot make progress under the available task context and permissions, the run terminates as a failure. When progress requires an external precondition unavailable to the agent, such as credentials or user approval, runtime pauses for operator input or terminates. This fallback is outside the autonomous lifecycle policy and is not counted as a method capability in evaluation.
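As a small worked illustration of the criterion above, the sketch below tracks affected steps in a set (so each step counts once) and compares the normalized ratio against the recovery control. The function names and the example value of `tau_recover` are hypothetical; the source does not state the deployed value.

```python
def affected_ratio(affected: set[int], num_steps: int) -> float:
    """rho_aff = |I_aff| / |S|, with I_aff held as a set of step indices."""
    if num_steps <= 0:
        raise ValueError("workflow must have at least one step")
    return len(affected) / num_steps


def should_escalate(affected: set[int], num_steps: int, tau_recover: float) -> bool:
    """e_i holds when the affected-step ratio reaches the recovery control."""
    return affected_ratio(affected, num_steps) >= tau_recover


# Example with a six-step workflow and an illustrative tau_recover of 0.5.
affected_steps: set[int] = set()
affected_steps.add(2)
affected_steps.add(2)            # repeated trouble on step 2 counts only once
first_check = should_escalate(affected_steps, 6, 0.5)    # ratio 1/6
affected_steps.update({3, 5})
second_check = should_escalate(affected_steps, 6, 0.5)   # ratio 3/6
```

In this example the first check stays below the control and the second reaches it, so escalation would trigger only after three distinct steps have shown instability.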

#### A.6.4 Experience Distillation

Runtime traces are distilled only into non-authoritative evidence in $\mathcal{M}$ or into candidate workflow-level proposals for offline review. These outputs may summarize reusable procedural fragments, local repairs, missing preconditions, guardrails, or drift signatures, but raw experiences never enter $\mathcal{K}$ directly; durable updates require offline verification and promotion.

#### A.6.5 Asynchronous Maintenance

Once runtime agents submit a candidate $\mathcal{W}^{\prime}$, offline maintenance agents verify, refactor, and promote validated workflows into $\mathcal{K}$, with their inference calls counted in maintenance cost. They also review the lessons and repair-derived workflow changes distilled from the run’s temporary memory for possible promotion into supporting repository artifacts.

Raw runtime experience is first distilled into non-authoritative per-run evidence in $\mathcal{M}$, and only then do candidate workflow changes or supporting artifacts enter gated promotion. This progression lets SKILL.nb preserve the adaptability of learned agent experience while keeping durable notebook updates and reusable assets under code-level governance.

**Verification and Promotion.** Upon receiving a proposal $\mathcal{W}^{\prime}$ or failure-sourced proposal $\Pi_{f}$, maintenance agents refine the candidate against execution traces, stripping exploratory detours and backtracking actions to yield a minimal, reproducible artifact. They resolve duplicates and conflicts against $\mathcal{K}$, then verify the candidate using logged traces, cached regression checks, or side-effect-controlled reruns where available. This consolidation is deliberately stricter than append-only memory: candidate changes must correspond to concrete workflow edits and pass deterministic gate checks before promotion. Gate definitions $\Gamma_{i}$ are updated where needed to reflect observed drift. Each promotion creates a new version and release in $\mathcal{K}$ and triggers downstream maintenance jobs.
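One way to picture gated promotion into a versioned, rollback-addressable repository is the toy model below. It is a sketch under strong simplifications: the `Repository` class, its methods, and the representation of gate checks as a list of booleans are all assumptions for illustration, and refinement, deduplication, and downstream maintenance jobs are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Toy stand-in for the authoritative repository K."""
    versions: list = field(default_factory=list)   # history of promoted versions

    def latest(self):
        return self.versions[-1] if self.versions else None

    def promote(self, candidate: dict, gate_checks: list) -> bool:
        # A candidate is promoted only when at least one deterministic check
        # was run and every check passed; otherwise K is left unchanged.
        if not gate_checks or not all(gate_checks):
            return False
        self.versions.append({**candidate, "version": len(self.versions) + 1})
        return True

    def rollback(self, version: int) -> dict:
        # Earlier versions stay addressable by their version number.
        if not 1 <= version <= len(self.versions):
            raise KeyError(f"unknown version {version}")
        return self.versions[version - 1]
```

Runtime code in this picture would hold only a read handle to `latest()`, while `promote` would be reachable from the offline maintenance path alone.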

### A\.7Adaptive Thresholds: Technical Details

This section records the deployed threshold\-estimation procedure supporting §[3\.3](https://arxiv.org/html/2606.08049#S3.SS3)\. The intent is reproducibility rather than a distribution\-free guarantee: thresholds are filtered by replay on logged decision opportunities, then shrunk toward pooled behavior when group evidence is sparse\.

Regularized interpretation. Viewed more formally, the pooled-plus-group procedure in §[3.3](https://arxiv.org/html/2606.08049#S3.SS3) can be read as an empirical-Bayes-style shrinkage heuristic. Under a local-quadratic approximation to the within-group costs, and when the pooled cost dominates the regularization pull, the operational procedure is well approximated by a conservative single-pass coordinate-descent update to the following objective:

$$\min_{\{\tau_{d}^{g}\},\,\{\tau_{d}^{\mathrm{pool}}\}}\sum_{d}\left[\hat{C}^{(d)}_{\mathrm{maint}}(\mathrm{pool},\tau_{d}^{\mathrm{pool}})+\sum_{g\in\mathcal{G}_{d}}\Bigl(\hat{C}^{(d)}_{\mathrm{maint}}(g,\tau_{d}^{g})\;+\;\lambda_{g,d}\,\bigl(\tau_{d}^{g}-\tau_{d}^{\mathrm{pool}}\bigr)^{2}\Bigr)\right] \tag{1}$$

subject to $\mathrm{WilsonUCB}_{1-\alpha}[\hat{V}^{(d)}(\mathrm{pool},\tau_{d}^{\mathrm{pool}})]\leq V_{\max}^{(d)}$ and $\mathrm{WilsonUCB}_{1-\alpha}[\hat{V}^{(d)}(g,\tau_{d}^{g})]\leq V_{\max}^{(d)}$ for all $d$ and $g\in\mathcal{G}_{d}$, with $\tau_{d}^{g},\tau_{d}^{\mathrm{pool}}\in\mathcal{T}_{d}$. Calibrating $\lambda_{g,d}\propto 1/n_{g,d}$ yields the shrinkage weight $\omega_{g,d}=n_{g,d}/(n_{g,d}+n_{0})$, where $n_{0}$ is a reference sample size controlling the transition from pooled to group-specialized behavior, and the unconstrained blend $\omega_{g,d}\,\hat{\tau}_{d}^{g}+(1-\omega_{g,d})\,\hat{\tau}_{d}^{\mathrm{pool}}$. This blend has the usual partial-pooling behavior [[9](https://arxiv.org/html/2606.08049#bib.bib9)]: small groups are shrunk toward the pooled default, while well-supported groups remain close to their own empirical optimum. Because shrinkage is continuous in $n_{g,d}$, no separate minimum-sample specialization gate is needed: sparse groups remain near pooled behavior automatically.
The runtime rule substitutes the constrained sweep output $\hat{\tau}_{d}^{g}$ for the unconstrained minimizer when the group has a nonempty feasible set. When the unconstrained minimum is infeasible but $\mathcal{F}_{g,d}$ is nonempty, this substitution is conservative because it shrinks from a feasible point. Section [A.7.1](https://arxiv.org/html/2606.08049#A1.SS7.SSS1) gives the deployed estimation procedure.
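As a hedged illustration of where the shrinkage weight comes from, suppose the within-group cost is locally quadratic around its replay optimum with curvature $a_{g,d}>0$ (a symbol introduced here for illustration only), hold the pooled threshold fixed as in a single coordinate-descent pass, and ignore the feasibility constraints. The group coordinate of the objective then has a closed-form minimizer:

$$\hat{C}^{(d)}_{\mathrm{maint}}(g,\tau)\approx\hat{C}^{(d)}_{\mathrm{maint}}(g,\hat{\tau}_{d}^{g})+a_{g,d}\,(\tau-\hat{\tau}_{d}^{g})^{2},$$

$$\tau^{\star}=\arg\min_{\tau}\Bigl[a_{g,d}\,(\tau-\hat{\tau}_{d}^{g})^{2}+\lambda_{g,d}\,(\tau-\hat{\tau}_{d}^{\mathrm{pool}})^{2}\Bigr]=\frac{a_{g,d}}{a_{g,d}+\lambda_{g,d}}\,\hat{\tau}_{d}^{g}+\frac{\lambda_{g,d}}{a_{g,d}+\lambda_{g,d}}\,\hat{\tau}_{d}^{\mathrm{pool}}.$$

$$\lambda_{g,d}=\frac{a_{g,d}\,n_{0}}{n_{g,d}}\quad\Longrightarrow\quad\omega_{g,d}=\frac{a_{g,d}}{a_{g,d}+\lambda_{g,d}}=\frac{n_{g,d}}{n_{g,d}+n_{0}}.$$

Under this assumption the curvature scale is absorbed into $n_{0}$, so the weight depends only on the group sample size.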

#### A.7.1 Threshold Estimation Procedure

Before threshold estimation, offline maintenance agents assign each workflow and step to a group using the versioned metadata maps stored with the artifact\. These group descriptors are maintenance\-produced artifact metadata, not a fixed benchmark taxonomy\. They may change when a workflow version is updated, repaired, or re\-reviewed\. Descriptors are selected from a constrained schema and canonicalized before threshold estimation; the schema and canonicalization rules are frozen for each calibration run\. For threshold calibration, group assignment is performed chronologically from metadata available at the time of the logged decision\. Later outcomes, benchmark labels, and final task success are excluded from the grouping input\. Rare or novel descriptors back off to coarser parents, and the exact group assignment used for each decision opportunity is logged with the workflow version\. Descriptors are restricted to reusable, observable properties such as site family, task type, action type, step type, and interface properties such as whether a form is dynamic\. They do not include benchmark task identifiers, hidden evaluator outputs, final answers, or success labels\. The resulting labels determine which logged decision opportunities contribute to group\-specific estimates\.

The retirement normalizer is fixed from calibration logs as $c_{\mathrm{ref}}=\max_{e\in\mathcal{R}_{\mathrm{cal}}}c(e)$, the maximum accepted repair-event token cost observed during calibration. If $\mathcal{R}_{\mathrm{cal}}$ is empty, automatic retirement by repair burden is deferred to maintenance review. For each decision $d$, the candidate set $\mathcal{T}_{d}$ is the sorted set of unique replay values of that decision's signal, not a hand-tuned grid. For each $\tau\in\mathcal{T}_{d}$, replay estimates maintenance cost $\hat{C}^{(d)}_{\mathrm{maint}}(g,\tau)$ and violation rate $\hat{V}^{(d)}(g,\tau)$ on threshold-estimation cases $\mathcal{D}_{g,d}$. A case is used only when the threshold-estimation log contains the candidate action with its validation outcome and maintenance-token cost. The sweep keeps candidates whose pointwise one-sided Wilson upper bound is at most $V_{\max}^{(d)}$ and selects the feasible candidate with lowest replay-estimated maintenance cost. The same sweep is run on pooled data and on each group with usable support.

For $k=k_{g,d}(\tau)$, $n=n_{g,d}$, $\hat{p}=k/n$, and $z=\Phi^{-1}(1-\alpha)$, the one-sided Wilson upper bound is

$$\mathrm{WilsonUCB}_{1-\alpha}(k,n)=\frac{\hat{p}+z^{2}/(2n)+z\sqrt{\hat{p}(1-\hat{p})/n+z^{2}/(4n^{2})}}{1+z^{2}/n}.$$

A threshold is feasible when this upper bound is at most $V_{\max}^{(d)}$. Because the final threshold is selected after sweeping over $\mathcal{T}_{d}$, this bound is used as a conservative pointwise filter rather than a formal uniform confidence guarantee after adaptive selection.
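The bound and the feasibility sweep described above can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the function names and the `(tau, violations, cases, est_cost)` tuple layout are assumptions made here, and the example counts are invented.

```python
from math import sqrt
from statistics import NormalDist


def wilson_ucb(k: int, n: int, alpha: float = 0.05) -> float:
    """One-sided Wilson upper bound for k violations observed in n replay cases."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1.0 - alpha)  # z = Phi^{-1}(1 - alpha)
    p_hat = k / n
    center = p_hat + z * z / (2 * n)
    spread = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (center + spread) / (1 + z * z / n)


def sweep(candidates, v_max: float = 0.05, alpha: float = 0.05):
    """Keep candidates whose UCB is within budget; pick the cheapest feasible one.

    candidates: iterable of (tau, violations, cases, est_cost) tuples
    (hypothetical layout). Returns (selected_tau_or_None, sorted_feasible_taus).
    """
    feasible = [
        c for c in candidates
        if c[2] > 0 and wilson_ucb(c[1], c[2], alpha) <= v_max
    ]
    if not feasible:
        return None, []
    best = min(feasible, key=lambda c: c[3])
    return best[0], sorted(c[0] for c in feasible)


# Invented replay summary: three candidate thresholds over 100 cases each.
tau_hat, feasible_set = sweep([
    (1, 0, 100, 900.0),
    (2, 1, 100, 700.0),
    (3, 6, 100, 400.0),  # cheapest, but its violation bound exceeds the budget
])
```

Here the cheapest candidate is filtered out and the sweep selects `tau_hat == 2` from the feasible set `[1, 2]`. As a matter of arithmetic on the formula, with $\alpha=0.05$ and $V_{\max}^{(d)}=0.05$ a candidate with zero observed violations passes the filter only once it has at least 52 replay cases.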

The deployed rule uses the sample-size shrinkage weight $\omega_{g,d}=n_{g,d}/(n_{g,d}+n_{0})$ from §[3.3](https://arxiv.org/html/2606.08049#S3.SS3). This is the usual partial-pooling form: groups with little replay support stay near the pooled threshold, while well-observed groups move toward their own replay estimate. The implementation is parameterized directly by $n_{0}$ and does not estimate a curvature term. After shrinkage, the blended threshold may not be one of the replay-feasible candidate values, so the deployed rule projects it back onto the finite feasible set:

$$\Pi_{\mathcal{F}}(x)=\arg\min_{\tau\in\mathcal{F}}|\tau-x|.$$

Ties are broken toward $\hat{\tau}_{d}^{\mathrm{pool}}$. If $n_{g,d}=0$, the method skips the specialized branch and uses the pooled branch when $\mathcal{F}^{\mathrm{pool}}_{d}$ is nonempty. If $n_{g,d}>0$ but $\mathcal{F}_{g,d}=\emptyset$, the method does not automatically apply the pooled threshold to that group. It routes the corresponding lifecycle action to maintenance review, because group-specific replay found no threshold satisfying the violation budget. If the pooled feasible set is also empty, the method returns $\operatorname{defer}_{d}$. This value is not a numeric threshold; it suppresses automatic thresholded action for decision $d$ until maintenance review. This avoids applying a threshold for which replay found no candidate satisfying the violation budget.
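The shrink-then-project rule and its fallbacks can be sketched as follows. This is an illustrative reading of the text, not the deployed code: the function names and the `"review"` / `"defer"` sentinels are hypothetical, projection is assumed to target the group's feasible set, and an empty pooled feasible set is assumed to defer in every case (the text states this explicitly only when the group set is also empty).

```python
def project(x, feasible, tau_pool):
    """Nearest feasible candidate to x; distance ties go to the one closer to tau_pool."""
    return min(feasible, key=lambda t: (abs(t - x), abs(t - tau_pool)))


def deployed_threshold(n_g, tau_g, feasible_g, tau_pool, feasible_pool, n0=2.0):
    """Return a numeric threshold, or a sentinel when no automatic action applies."""
    if not feasible_pool:
        return "defer"    # assumption: no pooled candidate passed the replay filter
    if n_g == 0:
        return tau_pool   # no group support: use the pooled branch
    if not feasible_g:
        return "review"   # group replay found no feasible threshold
    w = n_g / (n_g + n0)  # shrinkage weight omega_{g,d}
    blend = w * tau_g + (1 - w) * tau_pool
    return project(blend, feasible_g, tau_pool)


# Invented example: two group cases with n0 = 2 give w = 0.5 and blend = 4.5.
tau = deployed_threshold(n_g=2, tau_g=6, feasible_g=[3, 4, 6],
                         tau_pool=3, feasible_pool=[3, 5])
```

In this example the blend 4.5 is not a replay candidate, so it is projected to the nearest feasible value, 4.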

Runtime outcomes feed the threshold signals as follows. Successful executions add workflow-level evidence for creation and step-level evidence for formalization. Accepted local repairs add a count signal for demotion and a token-weighted burden signal for retirement. The same threshold mechanism applies to all four lifecycle decisions: `create` releases a sufficiently supported workflow, `form` adds executable realizations to stable steps, `demote` returns repaired or unstable steps to NL-guided execution, and `retire` removes a workflow when normalized repair burden is widespread.

Table 6: Lifecycle decisions governed by replay-calibrated thresholds.

#### A.7.2 Replay-Relative Scope of the Feasibility Check

Projection does not claim the threshold is safe under future shifts. It only ensures that the deployed value is one of the candidates that passed the replay filter on the logged opportunities used for estimation. The UCB check is replay-relative: it filters thresholds using $\hat{V}^{(d)}$, the violation rate estimated by changing one candidate threshold at historical decision points while keeping the rest of the logged repository trajectory fixed. It does not control the true violation rate under arbitrary future workload or interface shift, and it does not model second-order effects where a different threshold would have changed later repository contents, group assignments, or repair opportunities. For this reason, the paper describes the result as UCB-bounded replay feasibility rather than a deployment-time safety guarantee.

#### A.7.3 Why Threshold Control Rather Than End-to-End RL

Table [7](https://arxiv.org/html/2606.08049#A1.T7) summarizes why SKILL.nb uses thresholded lifecycle control rather than end-to-end RL or RLVR. The distinction is not that RLVR is unsuitable in general. Rather, the policy target here is a durable repository of workflow artifacts, and the available counterfactual evidence is local to logged lifecycle decisions.

Table 7: Why SKILL.nb uses thresholded lifecycle control rather than end-to-end RLVR. The policy governs durable workflow artifacts, not token generation or primitive browser actions.

### A.8 Limitations and Scope

The adaptive-threshold procedure has bounded scope in several important ways. Its UCB-bounded filtering is replay-relative through the estimator $\hat{V}^{(d)}$, rather than a blanket deployment-time guarantee under arbitrary distribution shift. Replay changes one candidate threshold at a time while holding the rest of the logged lifecycle fixed, so it does not model second-order effects on future repository contents, future group assignments, or later repair opportunities. The offline routine is therefore a practical single-pass approximation to the joint four-decision problem rather than a fully joint optimizer. We also do not evaluate alternative threshold estimators such as Bayesian optimization over the empirical replay breakpoints or learned predictors from group features. Direct offline partial pooling with UCB-constrained sweeps is chosen here for sample efficiency on finite replay candidate sets, interpretability, and explicit feasibility filtering. Appendix [A.7](https://arxiv.org/html/2606.08049#A1.SS7) provides the technical details.

The method also depends on the quality of the gates and metadata used to produce these signals. Evidence and repair signals are only as reliable as the gate specifications $(\gamma_{i,\mathrm{pre}},\gamma_{i,\mathrm{post}})$ and the offline metadata tags used to group workflows and steps. Gate false positives can admit an invalid step execution or repair, while false negatives can trigger unnecessary fallback, demotion, or repair. Poor gates or miscalibrated tags can therefore distort the signals driving the lifecycle decisions. Separately, the benefits of specialization depend on recurring workload structure. This assumption is plausible in deployment settings such as help-desk or enterprise automation, where many requests recur as variations of the same underlying workflows, but it may fail in more heterogeneous environments. If few groups accumulate enough support, shrinkage keeps their deployed thresholds close to the pooled branch and the method behaves more like a pooled retuning baseline than a strongly specialized policy. The maintenance-cost proxy $c(e)=\mathrm{tok}_{\mathrm{in}}(e)+\mathrm{tok}_{\mathrm{out}}(e)$ measures LLM inference tokens only. It does not optimize for wall-clock latency, human handoff cost, storage overhead for the repository and logs, or developer review effort. For retirement, token-weighted repair burden is a maintenance-effort proxy rather than a complete measure of semantic criticality, so a cheap but critical repair can still be underweighted.

The empirical evaluation also has bounded scope\. Benchmark evaluations provide snapshots of performance under fixed benchmark conditions, while the method’s distinctive correction capabilities, including demotion, retirement, and re\-formation after drift, operate over longer deployment timescales that benchmark runs do not naturally exercise\. As a result, the paper characterizes the growth phase of workflow creation and step formalization more thoroughly than the correction phase\.

Broader impacts. More auditable workflow artifacts can make web agents easier to inspect, reproduce, and roll back when they fail. This may reduce silent regressions in benign automation settings such as internal support or software-maintenance workflows. The same capability could also make undesired web automation more reliable, or preserve traces that contain sensitive visual state or credentials if deployed without redaction. For this reason, our claims are limited to controlled benchmarks, and deployment should pair lifecycle governance with credential scoping, trace redaction, rate limits, and task-level authorization controls.

## Appendix B Experimental Details

This appendix records protocol details for the main experiments in §[4](https://arxiv.org/html/2606.08049#S4). The main text summarizes the benchmark choices, shared harness, and statistical conventions, while this appendix pins the run configuration, persistent-state rules, re-execution protocol, repair protocol, and confidence-interval calculations. Additional component-removal and transfer diagnostics appear in Appendix [C](https://arxiv.org/html/2606.08049#A3).

### B.1 Evaluation Protocol

All reported experiments use identical infrastructure and a fixed execution configuration. All compared methods are run with the same GPT 5.3 Codex model, tool access, and execution budget in our evaluation harness for a fair comparison, so the main-text tables and figures report a unified re-evaluation rather than numbers copied directly from the original papers. For baseline methods, the shared harness wraps the released public implementations and preserves their native persistent-state and update paths rather than replacing them with a common re-implementation. On WebArena-Verified, the benchmark table in the main text reports a fresh-start full benchmark round in which each method starts without persistent learned state and builds any repository or memory online during that round. Separately, the lifecycle protocol runs for five rounds over the full 812-task benchmark. For persistent-state methods, the GitLab version-drift evaluation includes fresh-start runs on GitLab 15.7, GitLab 16.11, and GitLab 18.9, plus old-to-target conditions that build old-version state on GitLab 15.7 and reuse restored snapshots on GitLab 16.11 and GitLab 18.9. We choose GitLab 16.11 as an intermediate target because documented 16.x interface changes overlap with GitLab task families in WebArena-Verified, and use GitLab 18.9 as a longer-horizon target. Appendix [B.4](https://arxiv.org/html/2606.08049#A2.SS4) gives the detailed GitLab protocol and version-selection rationale. The benchmark environment is reset for each task execution. What persists varies by method: $\text{AWM}_{\text{online}}$ carries its induced workflow memory, ReasoningBank carries its distilled reasoning bank, and SKILL.nb carries its workflow repository $\mathcal{K}$, associated workflow state, event logs, and lifecycle-policy state.
CodeAct has no persistent learned state or released-artifact maintenance loop and is therefore omitted from the lifecycle figures and GitLab old-state reuse evaluation. We do not synchronize update frequencies across methods. Each persistent-state method updates its state according to its own design during the multi-round protocol.

Task order and online state. Each WebArena-Verified round uses a shuffled task order, and the same order for that round is applied to all methods. The order is reshuffled between rounds. After each task, transient agent context is cleared; persistent method state is not reset between tasks or rounds. We do not report multiple independent order seeds in this draft, so results should be interpreted as performance under the shared online-learning orders used here rather than as order-invariant estimates. Persistent state is allowed to store reusable task-family structure, natural-language procedures, executable workflow components, gates, and repair history. It is not allowed to store hidden evaluator outputs, benchmark success labels, final answers, or benchmark-specific task identifiers as reusable workflow fields. URLs may appear only when they are part of the observable task state or a reusable site-level navigation pattern.

Lifecycle perturbations. For each lifecycle round, we generate one perturbation set offline and apply it to every method. Eligible tasks receive a same-site start URL sampled by random walk; tasks that require a benchmark-specified non-default start URL keep that required condition. Language perturbations are generated at the WebArena-Verified `intent_template` level, e.g., "Get name(s) of reviewer(s) who mention {{description}} for the product on the current page," rather than by rewriting fully instantiated tasks. An LLM proposes paraphrases while preserving slot variables, intent, and task parameters. We filter candidates with a sentence encoder, require semantic similarity above 90%, and choose paraphrases with a penalty on word-level similarity so the maximum planned number of rounds uses diverse but semantically close wording. The paraphrase bank is generated once before evaluation and then reused across methods.

The exact template\-paraphrasing prompt is:

> You are paraphrasing WebArena-Verified intent templates for controlled evaluation. Rewrite the template in different wording while preserving the exact task intent, all required outputs, and every slot variable exactly as written. Do not add constraints, remove constraints, change entities, change the requested output type, or alter the success condition. Keep all {{slot}} placeholders unchanged. Return $N$ paraphrases, one per line.

For SKILL.nb, the experimental hyperparameters are fixed unless otherwise noted: $V_{\max}^{(d)}=0.05$ for all lifecycle decisions, $\alpha=0.05$, $n_{0}=2$, and $\tau_{\mathrm{recover}}=0.25$. These values define the replay-feasibility budget, UCB tail probability, shrinkage reference sample size, and runtime recovery trigger used throughout the experiments. The SKILL.nb runtime and maintenance agents are implemented using `opencode`.

Reproducibility record. Table [8](https://arxiv.org/html/2606.08049#A2.T8) records the fixed run configuration used for the reported experiments.

Table 8: Reproducibility record for the reported experiments. For baseline methods marked with ∗, temperature settings follow the corresponding original work where specified. All methods are re-evaluated with `gpt-5.3-codex`, and baseline runs use `high` reasoning effort for fair comparison.

For SKILL.nb, the persistent state specifically includes the workflow repository $\mathcal{K}$, workflow states (including provisional, released, and retired artifacts), supporting event logs, and the learned lifecycle-policy state.

Reuse consistency is measured before any update. For each method, the denominator is the set of tasks solved in the initial WebArena-Verified round. We freeze the method-specific learned state used for each such success: the released workflow for SKILL.nb, the associated workflow-memory state for $\text{AWM}_{\text{online}}$, and the retrieved reasoning-bank state for ReasoningBank. Each frozen artifact is re-executed three times under independent environment resets, shared start-URL perturbations, and shared intent-template paraphrases. A task is counted as maintaining reuse consistency only if all three re-executions succeed without editing the artifact.

Only artifacts that later fail under re-execution enter the repair protocol. Each failing artifact receives up to three repair attempts through that method's native state-update mechanism. We do not impose a synthetic cross-method validation gate. For SKILL.nb, a candidate repair must pass promotion before replacing the pinned workflow version. For $\text{AWM}_{\text{online}}$, repair attempts use its workflow-memory update path; for ReasoningBank, they use its memory extraction and reasoning-bank update path.

Recovery at budget $B$ is the fraction of failed artifacts restored to success within $B$ native update attempts. After each accepted update attempt, regression is evaluated on the cached-trace subset of tasks that were passing before the candidate update. This subset is method-specific: it is drawn from that method's previously passing tasks with reusable cached traces available at repair time and therefore varies by update event. Reuse consistency, recovery, and regression are therefore reported with method-conditional denominators, and the repair/regression comparison characterizes native update behavior rather than a matched validation-gate ablation.
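The two method-conditional rates defined above reduce to simple counting. The sketch below assumes a hypothetical log layout (dictionaries keyed by task or artifact id) and uses invented outcomes; it is meant only to make the denominators explicit.

```python
def reuse_consistency(reexec):
    """Fraction of initially solved tasks whose re-executions all succeed.

    reexec: {task_id: [bool, bool, bool]} for tasks solved in the initial round
    (hypothetical layout).
    """
    if not reexec:
        return None
    kept = sum(1 for runs in reexec.values() if all(runs))
    return kept / len(reexec)


def recovery_at_budget(first_fix, budget):
    """Fraction of failed artifacts restored within `budget` update attempts.

    first_fix: {artifact_id: 1-based attempt that restored success, or None}.
    """
    if not first_fix:
        return None
    fixed = sum(1 for a in first_fix.values() if a is not None and a <= budget)
    return fixed / len(first_fix)


# Invented outcomes for four initially solved tasks.
reexec = {
    "t1": [True, True, True],
    "t2": [True, False, True],
    "t3": [True, True, True],
    "t4": [False, False, False],
}
first_fix = {"t2": 1, "t4": 3}  # only t2 and t4 failed and entered repair
```

With these inputs, reuse consistency is 0.5, recovery at budget 2 is 0.5, and recovery at budget 3 is 1.0. Note that the recovery denominator is the set of failed artifacts, not the full task set.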

On Mind2Web, we report only a single benchmark round in which each method starts without task\-specific persistent state\. Methods may accumulate workflows or memories online as allowed by their design, but no re\-execution or repair is evaluated there\. All repeated\-execution and repair results in this paper are therefore specific to WebArena\-Verified under the shared evaluation setup\.

For statistical reporting on WebArena-Verified, individual method rates use Wilson 95% confidence intervals. The main overall success-rate comparison between SKILL.nb and the next-best baseline is additionally evaluated with a two-sided continuity-corrected McNemar test on per-task success outcomes over the shared 812-task benchmark. We do not report analogous paired tests for every per-site subset in this draft, so those breakdowns should be interpreted as descriptive point estimates.
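For readers who want to reproduce the paired test, a continuity-corrected McNemar statistic can be computed from the two discordant counts alone. The sketch below is a generic implementation using only the standard library, not the authors' analysis script; the function names and example counts are invented, and clamping the corrected difference at zero is a convention chosen here.

```python
from math import erfc, sqrt


def discordant_counts(a_success, b_success):
    """Count tasks solved by only one of two methods, from paired per-task outcomes."""
    b = sum(1 for x, y in zip(a_success, b_success) if x and not y)
    c = sum(1 for x, y in zip(a_success, b_success) if y and not x)
    return b, c


def mcnemar_cc(b: int, c: int):
    """Two-sided continuity-corrected McNemar test; returns (statistic, p_value)."""
    if b + c == 0:
        return 0.0, 1.0
    diff = max(abs(b - c) - 1, 0)  # continuity correction, clamped at zero
    stat = diff * diff / (b + c)
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Invented discordant counts: 40 tasks only method A solves, 20 only method B.
stat, p_value = mcnemar_cc(40, 20)
```

Tasks that both methods solve, or both fail, do not enter the statistic, which is why the test is computed on paired per-task outcomes rather than on the two marginal success rates.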

The round-2 dip for SKILL.nb in Figure [1](https://arxiv.org/html/2606.08049#S4.F1)(a) coincides with demotions of workflows created late in round 1. Because several task templates appear only once or twice, those workflows had limited evidence before the next perturbed round. We treat this as a protocol-specific diagnostic rather than a standalone claim about the method.

### B.2 Lifecycle Curve Confidence Intervals

Table [9](https://arxiv.org/html/2606.08049#A2.T9) reports the 95% Wilson confidence intervals for the per-round success rates shown in Figure [1](https://arxiv.org/html/2606.08049#S4.F1)(a). Each cell is computed over the full 812-task WebArena-Verified round for that method and round. The McNemar test reported in the main text is computed from paired per-task outcomes over the same 812 tasks, while the marginal Wilson intervals for the single-round WebArena-Verified benchmark are reported directly in Table [1](https://arxiv.org/html/2606.08049#S4.T1).

Table 9: Per-round WebArena-Verified success rates with 95% Wilson confidence intervals for Figure [1](https://arxiv.org/html/2606.08049#S4.F1)(a). Each entry is reported as SR [lower, upper], in percent.

Table 10: Budget-2 recovery and regression rates with 95% Wilson confidence intervals for Figure [1](https://arxiv.org/html/2606.08049#S4.F1)(c). Entries are reported as rate [lower, upper], in percent.

### B.3 Lifecycle Maintenance Costs

Figure [11](https://arxiv.org/html/2606.08049#A2.F11) reports the maintenance token usage per successful task across the five rounds of the lifecycle protocol. Because token counts are highly method-dependent, we report usage patterns normalized by each method's own round-1 maintenance cost (round 1 = 100%), highlighting within-method cost dynamics. We observe that SKILL.nb experiences a steady decline in maintenance cost per task as its workflow repository $\mathcal{K}$ stabilizes, reaching 69.2% of its initial round-1 cost by round 5. The small round-2 dip in success coincides with demotions of workflows created late in round 1, before enough evidence had accumulated to validate them reliably. We do not include other baselines in this analysis as they lack an explicit maintenance procedure that can be tracked and fairly measured against SKILL.nb.

![Refer to caption](https://arxiv.org/html/2606.08049v1/x3.png)

Figure 11: Maintenance token usage per successful task across five rounds, normalized to each method's round-1 cost (100%).

### B.4 GitLab Version Drift

We use controlled GitLab version drift because persistent web\-agent artifacts can fail when a maintained application changes outside the agent’s control\. This better matches how web\-agent artifacts can fail in practice than synthetic UI perturbations chosen by the experimenter, since the user’s underlying work remains similar while selectors, workflows, and stored assumptions may become stale\. The experiment therefore tests reuse across an application\-version change without treating it as evidence for broader drift settings\.

This appendix provides protocol support for the main GitLab version-drift experiment in Section [4.3](https://arxiv.org/html/2606.08049#S4.SS3). In Figure [2](https://arxiv.org/html/2606.08049#S4.F2), v15, v16, and v18 abbreviate GitLab 15.7, GitLab 16.11, and GitLab 18.9, respectively. The candidate pool is the 180 single-site GitLab tasks from WebArena-Verified. We exclude multi-site tasks that include GitLab because those tasks add cross-site coordination and non-GitLab state, which would make the measured change less specific to GitLab application-version drift. The source deployment is GitLab 15.7. We evaluate GitLab 16.11 as an intermediate target and GitLab 18.9 as a longer-horizon target. We construct each target deployment by applying GitLab's official upgrade procedure ([https://docs.gitlab.com/update/upgrade/](https://docs.gitlab.com/update/upgrade/)) to the source deployment data, preserving the task data while changing the GitLab system version.

We selected these versions to test two levels of controlled drift. GitLab 15.7 matches the source environment for the benchmark tasks. GitLab 16.11 is the intermediate target because GitLab 16.0 introduced the new navigation experience and made the new Web IDE the default multi-file editor ([https://docs.gitlab.com/releases/16/gitlab-16-0-released/](https://docs.gitlab.com/releases/16/gitlab-16-0-released/)), while GitLab 16.11 redesigned the project overview page so project information and links appear in a sidebar ([https://docs.gitlab.com/releases/16/gitlab-16-11-released/](https://docs.gitlab.com/releases/16/gitlab-16-11-released/)). These changes overlap with the task mix: non-exclusive keyword grouping of the 180 single-site GitLab tasks gives about 40 issue-related tasks, 17 merge-request or review tasks, 20 file-editor or template tasks, and 132 project, repository, group, or member tasks. GitLab 18.9 is retained as a longer-horizon target after additional accumulated application changes.

#### B.4.1 Example UI Drift in Project Navigation

Figure [12](https://arxiv.org/html/2606.08049#A2.F12) shows the same project overview across the three GitLab deployments. The red boxes mark the UI control used to open commit history. In GitLab 15.7, there is no dedicated history button on the project file toolbar. The agent must instead click the commit-count link in the project summary. In GitLab 16.11 and GitLab 18.9, GitLab exposes a dedicated `History` button, but its position changes as the project summary and action toolbar are reorganized. This example illustrates why old procedural state can become stale even when the underlying task intent and project data are unchanged.

The underlying DOM changes as well. GitLab 15.7 exposes the entry point as a project-stat link, roughly `a.nav-link.stat-link`, with a nested `svg[data-testid=commit-icon]` and visible text `48 Commits`. GitLab 16.11 exposes a button-style anchor with visible text `History`, but without a dedicated test identifier. GitLab 18.9 adds `data-testid=last-commit-history` to the corresponding anchor. This supports the qualitative observation in Section [4.3](https://arxiv.org/html/2606.08049#S4.SS3): newer GitLab versions can expose more stable DOM locators, and `data-testid` attributes are preferred by browser automation because they are less tied to layout or styling than class names.
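One way to picture how a stored step could tolerate this drift is an ordered list of candidate locators, tried from most to least stable. The sketch below is purely illustrative and is not SKILL.nb's locator logic: `CANDIDATES`, `resolve`, and the `page_has` callback are hypothetical names, and the fake pages stand in for a real browser query.

```python
# Candidate locators for the commit-history entry point, most stable first.
# Values come from the DOM description above; the ordering is an assumption.
CANDIDATES = [
    ("testid", "last-commit-history"),  # present on GitLab 18.9
    ("text", "History"),                # button text on GitLab 16.11 and 18.9
    ("css", "a.nav-link.stat-link"),    # approximate stat link on GitLab 15.7;
                                        # would need narrowing by the commit icon
]


def resolve(page_has):
    """Return the first candidate that page_has(kind, value) reports as present."""
    for kind, value in CANDIDATES:
        if page_has(kind, value):
            return kind, value
    return None  # a caller could treat this as a failed gate and fall back


# Fake pages standing in for the three deployments (illustrative only).
PAGES = {
    "v15": {("css", "a.nav-link.stat-link")},
    "v16": {("text", "History")},
    "v18": {("testid", "last-commit-history"), ("text", "History")},
}
resolved = {v: resolve(lambda k, s, page=page: (k, s) in page)
            for v, page in PAGES.items()}
```

Under these assumptions each deployment resolves to a different locator kind, which is the sense in which a selector learned on GitLab 15.7 goes stale on later versions even though the task itself is unchanged.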

Table 11: DOM-level differences for the same commit-history navigation target across GitLab versions.

(a) GitLab 15.7

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/gitlab_15.png)

\(b\) GitLab 16\.11

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/gitlab_16.png)

\(c\) GitLab 18\.9

![Refer to caption](https://arxiv.org/html/2606.08049v1/content/figures/gitlab_18.png)

Figure 12: Example project-level UI drift across GitLab versions. The screenshots show the same project view with the commit-history entry point highlighted in red. GitLab 15.7 exposes commit history through the commit-count link, while GitLab 16.11 and GitLab 18.9 expose a dedicated `History` button in different toolbar positions.

We define $\mathcal{D}_{\mathrm{mig}}$ as the subset of the 180 candidate tasks whose required task state and evaluator can be migrated or reconstructed on both target deployments. In the current migrated-run aggregate, $|\mathcal{D}_{\mathrm{mig}}|/180=180/180$. All success rates in this experiment are computed on $\mathcal{D}_{\mathrm{mig}}$.

We evaluate five conditions on the same $\mathcal{D}_{\mathrm{mig}}$ denominator. The source fresh condition starts each method from empty persistent state on GitLab 15.7. The two target fresh conditions start each method from empty persistent state on GitLab 16.11 and GitLab 18.9. The two old-to-target reuse conditions first start each method from empty persistent state on GitLab 15.7 and run on $\mathcal{D}_{\mathrm{mig}}$ to construct old-version state, then snapshot that state for target-version evaluation. For $\text{AWM}_{\text{online}}$, the snapshot is the induced workflow memory. For ReasoningBank, the snapshot is the distilled reasoning bank, memory pool, and retrieval index as used by the implementation. For SKILL.nb, the snapshot includes the workflow repository $\mathcal{K}$, workflow states, event logs needed by the lifecycle policy, and threshold and lifecycle state.

During target\-version reuse evaluation, each target task restores the old\-version snapshot or uses an isolated copy of that snapshot\. Transient context, browser state, and environment state reset per task in all five conditions\. Within\-task fallback or local repair is allowed only when it is part of the method’s normal execution\. Any durable workflow, repository, memory, retrieval\-index, event\-log, threshold, or lifecycle update written during one GitLab 16\.11 or GitLab 18\.9 reuse task is discarded before later target tasks are scored\. The fresh target\-version conditions control for target\-version difficulty, while the reuse conditions measure target\-version success under old\-state reuse rather than cumulative target\-version relearning\.
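The frozen-state rule above amounts to giving every target task its own copy of the snapshot and throwing that copy away afterwards. A minimal sketch, assuming the persistent state is a plain Python object and `run_task` is a hypothetical callable that may mutate the state it receives:

```python
import copy


def evaluate_with_frozen_state(snapshot, tasks, run_task):
    """Score each task against an isolated copy of the old-version snapshot."""
    results = {}
    for task in tasks:
        state = copy.deepcopy(snapshot)        # isolated copy for this task only
        results[task] = run_task(task, state)  # durable writes land in the copy
        # `state` is dropped here, so later tasks never see these writes.
    return results


# Illustrative stand-in: a "method" that records every task it runs.
def run_task(task, state):
    state["workflows"].append(f"learned-from-{task}")
    return len(state["workflows"])


snapshot = {"workflows": ["w_old"]}
results = evaluate_with_frozen_state(snapshot, ["task_a", "task_b"], run_task)
```

Both tasks see exactly one inherited workflow plus their own write, and `snapshot` is unchanged afterwards, which is what separates old-state reuse from cumulative target-version relearning.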

Table [12](https://arxiv.org/html/2606.08049#A2.T12) gives the appendix results. The fresh columns measure performance when the method starts empty on each version. The reuse columns measure target-version performance when initialized from the GitLab 15.7 persistent-state snapshot.

For each success-rate cell and each point in Figure [2](https://arxiv.org/html/2606.08049#S4.F2), we compute a two-sided 95% Wilson binomial confidence interval from the number of successful tasks $k$ out of $n=|\mathcal{D}_{\mathrm{mig}}|$ migrated tasks. With $\hat{p}=k/n$ and $z=1.96$, the interval is

$$\frac{\hat{p}+z^{2}/(2n)\pm z\sqrt{\hat{p}(1-\hat{p})/n+z^{2}/(4n^{2})}}{1+z^{2}/n}.$$

We multiply the endpoints by 100 when reporting percentages.
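The interval above is straightforward to compute directly. The following sketch mirrors the formula term by term; the function name and the example success count are invented for illustration and do not correspond to any reported cell.

```python
from math import sqrt


def wilson_interval_percent(k: int, n: int, z: float = 1.96):
    """Two-sided Wilson interval for k successes in n tasks, in percent."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = k / n
    center = p_hat + z * z / (2 * n)
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return 100 * (center - half) / denom, 100 * (center + half) / denom


# Invented count: 90 successes on the 180 migrated tasks.
lower, upper = wilson_interval_percent(90, 180)
```

With $n=180$ the interval around a 50% success rate spans roughly 43% to 57%, so differences of a point or two between cells fall well inside the sampling noise of a single cell.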

Table 12: Appendix support table for the GitLab version-drift evaluation on $\mathcal{D}_{\mathrm{mig}}$. Each persistent-state method is evaluated under source fresh start, target fresh starts, and target-version reuse from a GitLab 15.7 persistent-state snapshot. Cells report SR with 95% Wilson confidence intervals.

## Appendix C Additional Experiments

### C.1 Controlled Component Ablations

This section reports a supporting ablation that tests whether the final SKILL.nb behavior depends on its main components rather than on a single incidental design choice. On the 258-task WebArena-Verified hard subset, the experiment isolates selective code formalization with $\text{SKILL.nb}_{\text{NL-only}}$, which keeps only workflow intents and natural-language procedures; fallback with $\text{SKILL.nb}_{\text{code-only}}$, which disables fallback when executable code is unavailable or brittle; runtime validation with $\text{SKILL.nb}_{\text{no-gates}}$, which removes precondition and postcondition checks; and lifecycle cleanup with $\text{SKILL.nb}_{\text{no-demote}}$, which disables demotion and retirement. All variants use the same model, harness, task stream, repository schema, evaluation budget, and online maintenance setting; each variant maintains its own repository state. Threshold specialization is evaluated separately in Appendix [C.2](https://arxiv.org/html/2606.08049#A3.SS2).

Table [13](https://arxiv.org/html/2606.08049#A3.T13) reports the resulting success, SKILL.nb-internal token cost, fallback, repair, and regression diagnostics.

Table 13: Controlled component ablations on the 258-task WebArena-Verified hard subset. All rows are SKILL.nb variants, so tokens per success use the same SKILL.nb-internal maintenance and update accounting and are reported in thousands. SR is reported as percent with 95% Wilson confidence intervals. Fallback is the percent of tasks using natural-language or raw-intent fallback; repair is the percent of tasks with an accepted repair or repository update; regression is update-induced failure after accepted repair or update. SKILL.nb has the highest point SR and the lowest internal token cost per success. Removing code formalization increases cost and fallback use, removing fallback lowers SR and increases regression, and removing gates or lifecycle cleanup increases regression. Because this is a single-stream online ablation on the hard subset, we interpret the pattern as diagnostic evidence that the components are useful rather than as a benchmark-wide causal estimate.

### C.2 Adaptive Threshold Ablation

This section reports a diagnostic ablation that isolates the contribution of the learned lifecycle policy from the workflow representation itself. Figure [13](https://arxiv.org/html/2606.08049#A3.F13) reports this ablation on the 258-task WebArena-Verified hard subset over three rounds. It compares four SKILL.nb variants: the loose fixed variant uses one permissive fixed threshold vector shared across all groups and rounds, the strict fixed variant uses a more restrictive fixed vector, the pooled variant uses pooled thresholds from the offline replay estimator without group specialization, and SKILL.nb (ours) uses the full deployed group-specialized policy. In the current setup, the loose fixed vector is $(\tau_{\mathrm{create}},\tau_{\mathrm{form}},\tau_{\mathrm{demote}},\tau_{\mathrm{retire}})=(4,3,5,0.50)$ and the strict fixed vector is $(8,5,2,0.25)$. Panel (a) asks whether specialization improves task success, panel (b) asks whether ordinary lifecycle promotions introduce regressions on previously passing cached traces, and panel (c) asks whether those gains come with lower cumulative maintenance compute relative to the loose fixed baseline.

Read together, the three panels are consistent with the group-specialized lifecycle policy improving the success–regression–maintenance trade-off on this subset. By round 3, SKILL.nb reaches the highest success rate (38.3%) while maintaining the lowest update-induced regression (3.3%), and panel (c) shows that its cumulative maintenance compute remains below the loose fixed baseline. The fixed-threshold variants illustrate the failure modes avoided. The loose fixed variant over-promotes, driving success down from 32.0% to 27.1% while regression rises to 22.0%. The strict fixed variant under-promotes, producing stable but stagnant success near 31%. The pooled variant improves on both fixed policies but remains worse than SKILL.nb on all three metrics. The strict fixed variant’s low maintenance footprint in panel (c) reflects this low update activity rather than better efficiency per successful task. The brief round-2 dip for SKILL.nb mirrors the lifecycle result and is followed by the strongest round-3 recovery.

![Refer to caption](https://arxiv.org/html/2606.08049v1/x4.png)

Figure 13: Diagnostic adaptive-threshold ablation on the WebArena-Verified hard subset (258 tasks) over three rounds. The four SKILL.nb variants differ only in how lifecycle thresholds are chosen: loose fixed thresholds, strict fixed thresholds, pooled thresholds from the offline estimator without group specialization, and the deployed group-specialized policy (SKILL.nb, ours). (a) Success rate by round. (b) Update-induced regression on previously passing cached traces, by round. (c) Cumulative maintenance compute, normalized to the loose fixed variant at the same round (loose fixed $=100\%$).
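The normalization in panel (c) can be read as follows: each variant's compute is accumulated over rounds and then divided by the loose fixed variant's cumulative compute at the same round. The sketch below illustrates that reading under stated assumptions. The function name, the variant keys, and all numbers are hypothetical, and the paper does not specify its compute unit here.

```python
from itertools import accumulate


def normalized_cumulative_compute(
    per_round: dict[str, list[float]],
    baseline: str = "loose_fixed",
) -> dict[str, list[float]]:
    """Cumulative compute per variant, as a percent of the baseline's
    cumulative compute at the same round (baseline = 100% in every round).

    `per_round` maps a variant name to its per-round maintenance compute.
    All names here are illustrative assumptions.
    """
    base_cum = list(accumulate(per_round[baseline]))
    if any(b <= 0 for b in base_cum):
        raise ValueError("baseline cumulative compute must be positive")

    normalized = {}
    for name, costs in per_round.items():
        if len(costs) != len(base_cum):
            raise ValueError(f"{name} has a different number of rounds")
        cum = accumulate(costs)
        normalized[name] = [100.0 * c / b for c, b in zip(cum, base_cum)]
    return normalized


if __name__ == "__main__":
    # Hypothetical per-round compute values (arbitrary units), three rounds.
    example = {
        "loose_fixed": [10.0, 10.0, 10.0],
        "strict_fixed": [2.0, 1.0, 1.0],
    }
    for variant, series in normalized_cumulative_compute(example).items():
        print(variant, [round(v, 1) for v in series])
```

Because the baseline is re-normalized at every round, a curve below 100% means the variant has spent less maintenance compute in total up to that round, not that it is cheaper per successful task.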
