@dair_ai: Outstanding paper on computer-using agents. (bookmark it) Computer-using agents drive real software through the screen,…

X AI KOLs Following 06/17/26, 11:20 PM Papers

computer-using-agents state-machine replay efficiency ai-agents preact

Summary

PreAct compiles successful agent runs into small state-machine programs, enabling 8.5-13x faster replay on repeated tasks without per-step language model calls, with runtime screen checks to ensure correctness.

Outstanding paper on computer-using agents. (bookmark it) Computer-using agents drive real software through the screen, but they solve every task from scratch. Ask one to repeat a task, and it re-reads the screen and re-reasons every tap, paying the full cost again. PreAct compiles the first successful run into a small state-machine program, states that check the screen and transitions that act, then replays it directly on later runs. That runs 8.5 to 13x faster with no per-step language-model calls. Replay stays guarded. At each step, PreAct checks that the screen matches what the program expects before acting, and hands control back to the agent when reality diverges. Why does it matter? Most computer-use costs are repeated reasoning on tasks the agent has already solved. Amortizing that into a replayable program is a clean way to make agents faster the second time. Paper: https://arxiv.org/abs/2606.17929 Learn to build effective AI agents in our academy: https://academy.dair.ai

Original Article

View Cached Full Text

Cached at: 06/18/26, 12:00 AM

Outstanding paper on computer-using agents.

(bookmark it)

Computer-using agents drive real software through the screen, but they solve every task from scratch. Ask one to repeat a task, and it re-reads the screen and re-reasons every tap, paying the full cost again.

PreAct compiles the first successful run into a small state-machine program, states that check the screen and transitions that act, then replays it directly on later runs. That runs 8.5 to 13x faster with no per-step language-model calls.

Replay stays guarded. At each step, PreAct checks that the screen matches what the program expects before acting, and hands control back to the agent when reality diverges.

Why does it matter?

Most computer-use costs are repeated reasoning on tasks the agent has already solved. Amortizing that into a replayable program is a clean way to make agents faster the second time.

Paper: https://arxiv.org/abs/2606.17929

Learn to build effective AI agents in our academy: https://academy.dair.ai

Computer-Using Agents that Get Faster on Repeated Tasks

Source: https://arxiv.org/html/2606.17929

Abstract

Computer-using agents drive real software through the screen—clicking and typing—but they solve every task from scratch: asked to repeat a task, an agent re-reads the screen, re-reasons every tap, and pays the full cost again. We present PreAct, which lets such an agent get faster on tasks it has done before. The first time it succeeds, PreAct compiles the run into a small state-machine program—states that check the screen, transitions that act—and on later runs replays it directly instead of invoking the agent—8.58.5–13×13\timesfaster, with no per-step language-model calls. Replay is not blind: at each step PreAct checks that the screen matches what the program expects before acting, and hands control back to the agent the moment something is off.

PreAct applies the same discipline when deciding what to keep: a freshly compiled program enters the store only if, re-run from a clean state, an independent evaluator confirms it solved the task—catching programs that replay to their last step yet leave the task undone. Across a mobile, a desktop, and a web benchmark, this store-time check separates repeated runs that improve from ones that degrade as faulty programs accumulate—worth1.751.75–2.62.6tasks per benchmark, the same direction on all three; a fallback that explores afresh when no program fits brings PreAct level with a strong record-and-replay baseline. We also report what did not matter: prompt wording, runtime guardrails, and whether a language model or a plain embedding retriever selects which program to reuse.

1 Introduction

Computer-using agents—models that operate ordinary software by reading the screen and issuing clicks and keystrokes[59,1]—are now good enough to handle everyday chores like booking a meeting or filing an expense. They are also wasteful. Consider a task from the AndroidWorld benchmark[36]:add a new contact named “Emilia Gonzalez” with a given phone number. The agent reads a screenshot, decides where to tap, acts, and repeats, spending half a dozen rounds of vision-and-language inference to walk through the contacts app. Ask for the same task tomorrow and it starts over. The screen has not changed and neither have the taps, but the agent kept nothing it could simply run again, so it pays the full cost a second time.

This is not how people work. Someone opening Photoshop for the first time, or a traveler in China reaching for Alipay or WeChat Pay to settle a bill, is painfully slow at first—reading every menu, hunting for the right button. After a few repetitions they are nearly as fluent as a local: the deliberate, perception-heavy reasoning of the first attempt has been internalized into an almost automatic routine. Yet that fluency is not blind—they still glance at the screen and act only when it looks as they expect; the moment the network stalls or an unfamiliar dialog appears, they notice at once and slow back down to think. Fluency is prediction, not blindness. The animating question behind PreAct is whether a computer-using agent can acquire the same continual-learning capability—not merely solving a task it has seen before, but solving itfaster and more cheaplythe second time, much as a practiced human does. This efficiency dimension is largely absent from how agents are evaluated: most benchmarks score only whether a tasksucceeds, not how much perception and reasoning a repeated success costs, so an agent that re-derives every task from scratch looks no worse than one that has learned to skip the deliberation. We treat repeated-task efficiency as a first-class outcome and measure it directly (Figure1).

Refer to caption Figure 1:The cost of doing the same task twice. A standard computer-using agent re-derives a familiar task on every run, so its cumulative LLM cost grows with the number of runs (red). PreAct pays a one-time premium to compile and check the first success (the bump at run 1), then replays the stored program on every later run with no per-step language-model calls (blue), so its cumulative cost barely rises; replays also run8.58.5–13×13\timesfaster in wall-clock time. The more often a task recurs, the more this saves. Cost is normalised to one full agent solve; the wall-clock speedup and the one-time overhead are measured, not illustrative.Several lines of work try to reuse past behaviour, but each keeps a model in the execution loop or stores something it cannot reliably re-run. Skill libraries such as SAGE[25]save parameterised behaviours, yet a language model still carries them out at run time, so most of the per-step cost remains. Compilation systems[45,65]turn behaviour into deterministic code, but they target narrow business logic, or build a program by crawling an application rather than from an actual successful run. Record-and-replay systems[34,11,4]cache the sequence of actions from one run and replay it, falling back to the agent when the cache misses; the cached trace has no way to tell whether each step really landed, no room for branching, and no means of improving itself.

PreAct takes a different position: the thing the agent stores is exactly the thing it later runs. The first time the agent finishes “add Emilia Gonzalez,” PreAct turns the run into a small state machine. Each state carries a check on the screen—the contact form is open; the name field now reads the value we typed—and each transition carries an action, such as tappingCreate contactor typing the first name. The next time a matching task arrives, PreAct retrieves that state machine and walks it: at every state it confirms the screen looks right before doing anything, and at every transition it performs the action. No language model runs during a successful replay, which is where the order-of-magnitude saving comes from. This observe-first-then-act discipline is what makes direct replay trustworthy and what sets PreAct apart from a recorded macro: where an RPA script or a cached action trace fires its stored steps blindly, PreAct checks the screen before each one. When a check does fail—the app has changed, an unexpected dialog appears—PreAct hands control back to the full agent, which finishes the task and produces a fresh program. Over time the store grows on its own: the code that drives this loop is fixed, and what accumulates is a library of programs the agent can run directly (Figure2).

Refer to caption Figure 2:The PreAct harness is a verified compile–extend–replace loop. A goal is routed to theProgram Selector; a retrieved program is run by theReplayer, which walks the graph and checks each predicate against the live screen. On any miss the harness falls back to the full agent (CUA), whose trace is compiled into a new programP′P^{\prime}.P′P^{\prime}enters the corpus only through theVerify-before-Store Gate. The corpus is the only growing structure; the harness code is fixed.The same check runs a second time, at the moment PreAct decides what to remember—and getting this right took us the longest. A compiled program can replay all the way to its final step and still leave the task undone: it taps through the contact form and presses Save, but a stale field meant the name was never entered, so no contact exists. An agent that trusts such a program will fail the same way every time it reuses it, and the more it runs the worse it becomes. PreAct’s answer is to apply the run-time discipline once more before storage: right after compiling, it resets the environment, runs the new program from scratch, and asks the benchmark’s own evaluator whether the task was actually solved; only then is the program allowed into the store. The same principle thus operates at two timescales—before each action on replay, and before a program is kept—and the second is the difference between a system that gets better with use and one that quietly decays.

We test PreAct on three platforms—a phone (AndroidWorld), a desktop (OSWorld), and a website (WebArena)—where warm replays run8.58.5–13×13\timesfaster than re-deriving the task from scratch and make no per-step language-model calls. Turning the check on or off, with a fresh corpus or one the agent has already built, gives a consistent picture: with the check, repeated runs get better; without it, they get worse as faulty programs collect. The gate is worth1.751.75–2.62.6tasks across the cold→\towarm transition, pointing the same way on every platform. A second, smaller idea matters too—when no stored program fits a task, exploring it afresh rather than giving up keeps PreAct level with a strong record-and-replay baseline (Muscle-Mem). We are also careful to report what didnotchange the numbers, since it is easy to credit the wrong thing: the exact wording of the prompts, a handful of hand-written runtime guardrails, and whether a language model or a plain embedding retriever (which matches the agentic selector at100%100\%vs.75.6%75.6\%retrieval) picks which stored program to reuse all made little or no difference. What carries the result is the design itself—store the program you run, and never keep one you have not checked.

2 Related Work

PreAct sits at the intersection of six literatures: computer-using agents, the reason–act loop they run, skill and memory systems that try to reuse past behaviour, code as an agent representation, world models that learn environment dynamics, and systems that compile or replay workflows. We survey each in turn, then (Table1, Figure3) locate PreAct relative to the closest prior work.

Table 1:PreAct vs. related approaches. Refer to caption Figure 3:Where PreAct sits, along the two properties it is designed to combine. Many prior systems achieve one or the other.Record-and-replaysystems (Muscle-Mem, Workflow-Use) and classical RPA run their stored artifactwithoutan LLM, but only append it (top-left).Skill and memorysystems—Voyager, TroVE, ExpeL, Mem0/A-MEM—genuinely self-extend, and some (Voyager, TroVE) even self-verify what they add, but they invoke itthroughan LLM at run time (bottom-right). PreAct’s distinction is thecombination: a directly-executable artifact (no LLM in the loop at replay) grown through verified, in-place refinement. Axes are qualitative.Computer-using agents.A rapidly growing line of work builds agents that operate real software through its graphical interface—clicking, typing, and scrolling—rather than through bespoke APIs[1]. Early web agents learned from demonstrations or reinforcement on simulated sites[30,57,13], and benchmarks such as Mind2Web[10], WebArena[66], VisualWebArena[20], and WebVoyager[17]pushed toward realistic, open-ended websites. A parallel thread grounds vision-language models directly on screen pixels and accessibility trees: SeeAct[64], CogAgent[18], SeeClick[8], and UI-TARS[35]learn to localize and act on UI elements. The same paradigm now spans mobile[61,48,37,36], desktop and OS control[31,53,60,3,54], and software engineering[56,19], with reinforcement learning[2]and general-assistant benchmarks[29]measuring progress. PreAct is agnostic to which such agent fills its CUA-fallback slot; we use AndroidWorld’s T3A and Anthropic’s Computer-Use API, but the harness wraps any of them.

The reason–act loop and its cost.Almost all of these agents follow the observe–reason–act paradigm popularized by ReAct[59], layered on chain-of-thought and search-style reasoning[52,58,50], tool use[38], self-reflection[39], and autonomous task loops[40,41]. This loop is exactly what makes repeated execution expensive: every step pays for fresh perception and reasoning even when the task is one the agent has solved before[26]. PreAct attacks this cost directly by replacing the loop, on familiar tasks, with deterministic graph execution.

Skill libraries and experiential memory.A second response to repetition is to remember. Skill systems accumulate reusable behaviours—Voyager grows a library of code skills[47], SAGE and Skill-Pro parameterize learned skills[25,28], and ExpeL distills experience into rules[63]—but each skill still runs inside an LLM-bound loop at execution time. More broadly, agent-memory systems store and retrieve past interactions: generative-agent memory streams[33], operating-system-style memory[32], and production memory stacks[9,55](surveyed in[62]), typically consulted by retrieval-augmented generation[21]. PreAct’s corpus is a memory in this sense, but what it stores isexecutable: recall and action are the same step. User-as-Code[22]represents a user’s accumulated state as typed Python objects and the rules over it as executable constraints, so representing the user and reasoning about the user share one medium. The two papers differ in what the codeis: there it is typed state plus constraint predicates queried over a user model; here it is a verified state-machineaction programwhose execution is itself the recall. Scaling such reuse, rather than re-deriving behaviour, is consistent with the “bitter lesson” that general mechanisms outrun hand-engineered ones[44].

Code as the agent’s action and representation.A third thread expresses agent behaviour as code. PAL, Program-of-Thoughts, and CoCoGen offload reasoning to executable programs[12,7,27]; Code-as-Policies and CodeAct make code the agent’s action space[24,49]; Chain-of-Code emulates execution[23]; and TroVE and LATM induce reusable, verifiable toolboxes from solved tasks[51,5]. PreAct shares the conviction that code is the right substrate for reliable reuse, but its compiled artifact is averified state machine executed directly, not a flat script regenerated and re-interpreted by an LLM each time.

Internalized vs. externalized world models.A fourth thread learns the environment’sdynamics—a model that predicts the next observation, or a tool’s response, from the current state and a candidate action. Classical world models learn these dynamics in network parameters and plan inside the learned simulator[14,15]; the idea has since migrated to language agents that use the LLM itself as a world model to predict action outcomes for planning[16], including for web navigation, where an agent learns environment dynamics to anticipate the effect of a click before committing to it[6]. PreAct can be read as a world model of a different kind. Each stored program is anexternalized, verifiablemodel of one task’s dynamics—its state predicates are checkable claims about what the screen will be, and its transitions are the actions that move between those states—rather than dynamics baked into weights. The difference is where the learning lives and whether it can be trusted: an internalized world model predicts transitions implicitly and can hallucinate them with no way to check against the live environment, whereas PreAct’s verify-before-store gate (§3.3) is exactly a test that the learned dynamics still hold before a program is trusted, and a program can be inspected, replaced, or repaired without retraining. The price is generality: these are task-specific models grounded in concrete selectors, not one dense dynamics model, and they do not transfer across task families (§5.6).

Compiling and replaying workflows.Closest to PreAct are systems that compile or cache concrete workflows. Robotic process automation has long replayed recorded UI scripts deterministically[46]; Compiled-AI and ActionEngine compile agent behaviour into deterministic code, the latter generating flat Python from a state machine built by untargeted crawling[45,65]; and concurrent record-and-replay systems—Muscle-Mem[34], AgentRR[11], and Workflow-Use[4]—cache a linear action sequence and fall back to the agent on a cache miss. We compare against these directly. The gap PreAct closes is the same one at two levels: the stored artifact is never checked against the live UI—not per step as it replays (so a blind action sequence walks straight into a changed screen), and not before it is trusted as a stored behaviour (so the artifact is appended, never refined or revalidated). PreAct adds verification at both points. Table1summarizes the comparison.

Running the state machine directly.What distinguishes PreAct is the second column (it runs the state machine directly) combined with the fourth (recompile under the verify-gate). ActionEngine builds a state machine but emits flat Python from it; once the script is generated the state machine is thrown away, and what actually runs is that script. PreAct instead executes the state machine itself: each state’s verification predicate is evaluated against the live UI, and each transition is dispatched to the underlying action backend (XPath click, JSON action, etc.). This direct execution enables (a) per-state verification and graceful fallback, (b) in-place patching when a transition fails, and (c) the verify-before-store gate (which we will argue is empirically necessary for monotonicity).

What grows with experience.The deeper distinction concernswhat growsas the agent gains experience. In Muscle-Mem, AgentRR, and Workflow-Use, the cache or experience store is append-only: every captured trajectory adds a new entry without affecting the existing ones. In ActionEngine, the state machine is built once via crawling and the flat Python is regenerated wholesale on each rebuild; there is no notion of incremental refinement. PreAct’s compile-extend-replace loop, by contrast, allows the corpus tomutate: a fresh compile with the same dedup signature replaces an older program if it passes the verify-gate. The corpus does not just accumulate—it refines its representation of repeated tasks as the agent encounters them across different parameter values, environments, and sessions. This is the property we mean by “self-extending executable code corpus” (§1): not an append-only cache, but a mutable, verified, directly-executable code library.

3 System Architecture

PreAct rests on a single discipline applied at two timescales:predict, then check against the live screen before trusting the prediction. At run time the replayer checks each state’s predicate before it fires the transition out of that state—observe first, then act—which is what makes direct, model-free execution trustworthy (§3.2). At store time the same check is run once more, over the whole freshly compiled program, before it is allowed into the corpus—which is what keeps the growing corpus honest (§3.3). The per-step check is the more primitive of the two: the store-time gate is simply that same predicate-checking replay, run from a clean state and confirmed by an independent evaluator. We describe the stored artifact (§3.1), the loop that runs and extends it (§3.2), and the two points at which it verifies.

3.1 What the agent stores: a state-machine program

We start from the artifact, because in PreAct the agent runs that artifact directly rather than regenerating a script from it. Listing1is anactualprogram from the corpus, compiled while the agent ran theContactsAddContacttask from the AndroidWorld benchmark[36]—the trace was the instance “Create a new contact for Emilia Gonzalez.” It is a small graph, drawn in Figure4: eachstatecarries a verification predicate (an accessibility-tree pattern that must hold on the live screen for the agent to believe it is in that state) and eachtransitioncarries an action. The concrete values typed during the trace are lifted toparameters—herefirst_name,last_name,phone_number, listed in the metadata—so this single program serves every future “add a contact” request by binding the parameters to each new contact’s values.

{

“metadata”:{

“program_id”:“ab4390a9-1902-47a1-821b-ebddc7f4010a”,

“task_description”:“CreateanewcontactforEmiliaGonzalez.Theirnumberis+14240925675.”,

“application_context”:“com.google.android.contacts”,

“parameters”:[“first_name”,“last_name”,“phone_number”]

“states”:[

{“id”:“contacts_app_open”,

“verification”:{“type”:“expect_element”,“xpath”:“resource_id=com.google.android.contacts:id/floating_action_button”,“timeout_ms”:5000},

“description”:“ContactsappisopenshowingthemainlistwiththeCreate-ContactFABvisible”},

{“id”:“create_contact_form”,

“verification”:{“type”:“expect_element”,“xpath”:“class=android.widget.EditText&&hint=Firstname”,“timeout_ms”:3000},

“description”:“New-contactcreationformisopenwiththeFirstnamefieldvisible”},

{“id”:“first_name_entered”,

“verification”:{“type”:“expect_element”,“xpath”:“class=android.widget.EditText&&hint=Lastname”,“timeout_ms”:2000},

“description”:“Firstnamehasbeenentered;Lastnamefieldisavailable”},

{“id”:“last_name_entered”,

“verification”:{“type”:“expect_element”,“xpath”:“class=android.widget.EditText&&hint=Phone”,“timeout_ms”:2000},

“description”:“Lastnamehasbeenentered;Phonefieldisavailable”},

{“id”:“phone_entered”,

“verification”:{“type”:“expect_element”,“xpath”:“resource_id=android:id/text1&&text=Mobile”,“timeout_ms”:2000},

“description”:“Phonenumberhasbeenentered;phone-typeselector(Mobile)isvisible”},

{“id”:“phone_type_selected”,

“verification”:{“type”:“expect_element”,“xpath”:“resource_id=com.google.android.contacts:id/toolbar_button&&text=Save”,“timeout_ms”:2000},

“description”:“PhonetypeMobileselected;Savebuttonisvisibleinthetoolbar”},

{“id”:“contact_saved”,

“verification”:{“type”:“terminal_state”,“timeout_ms”:5000},

“description”:“Contacthasbeensavedsuccessfullyandthetaskiscomplete”}

“transitions”:[

{“from_state”:“contacts_app_open”,“to_state”:“create_contact_form”,

“action”:{“type”:“action_navigate”,“text”:“Contacts”}},

{“from_state”:“contacts_app_open”,“to_state”:“create_contact_form”,

“action”:{“type”:“action_click”,“target”:“resource_id=com.google.android.contacts:id/floating_action_button”}},

{“from_state”:“create_contact_form”,“to_state”:“first_name_entered”,

“action”:{“type”:“action_type”,“target”:“class=android.widget.EditText&&hint=Firstname”,“parameter_name”:“first_name”}},

{“from_state”:“first_name_entered”,“to_state”:“last_name_entered”,

“action”:{“type”:“action_type”,“target”:“class=android.widget.EditText&&hint=Lastname”,“parameter_name”:“last_name”}},

{“from_state”:“last_name_entered”,“to_state”:“phone_entered”,

“action”:{“type”:“action_type”,“target”:“class=android.widget.EditText&&hint=Phone”,“parameter_name”:“phone_number”}},

{“from_state”:“phone_entered”,“to_state”:“phone_type_selected”,

“action”:{“type”:“action_click”,“target”:“resource_id=android:id/text1&&text=Mobile”}},

{“from_state”:“phone_type_selected”,“to_state”:“contact_saved”,

“action”:{“type”:“action_click”,“target”:“resource_id=com.google.android.contacts:id/toolbar_button&&text=Save”}}

]

}

Listing 1:The complete compiled program for “add a contact,” taken verbatim from the corpus (the AndroidWorldContactsAddContacttask). Only the schema-defaultnullfields of each action object are omitted; everything else—state ids, verification predicates with their timeouts, per-state descriptions, and the full resource-id selectors—is reproduced exactly. Figure4draws the same program.The three values typed during the trace are recorded as the parametersfirst_name,last_name,phone_numberand re-bound to new values on every retrieval, so the one program generalizes across contacts. The two transitions out ofcontacts_app_open(anaction_navigatethat opens Contacts and theaction_clickon the create button) are a small redundancy the compiler emitted and the replayer tolerates; both reachcreate_contact_form.

Refer to caption Figure 4:The compiled “add a contact” program of Listing1drawn as a state machine (read top row left-to-right, then bottom row right-to-left). Each box is a state—its verification predicate, which must match the live screen, is given in Listing1—and each arrow is a transition holding an action. The harness executes this graph directly; it is not regenerated into a flat script.Formally a program is a tupleP=(S,T,M,V)P=(S,T,M,V): statesSS(id, description, verification predicate), transitionsTT(each(si,sj,a)(s_{i},s_{j},a)firing actionaato move fromsis_{i}tosjs_{j}), metadataMM(task description, app context, parameter schema, and adedup signaturethat identifies which task family the program serves), and data-extraction predicatesVVfor question-answering tasks (e.g.inspect_text). The full action vocabulary is in Appendix A. This graph form buys three things a flat script cannot:(1)per-state verification, so the agent knows immediately when an action did not have the intended effect;(2)branching, since one state can have several outgoing transitions chosen by predicate (a delete-confirmation dialog that appears only on some app versions); and(3)in-place extension, since a new state and transition can be spliced in without regenerating the whole program.

3.2 The loop: select, replay, fall back, verify

Figure2shows how the harness uses such programs. For a goalTT:

1.Select.The Program Selector queries the corpus withTTand returns either a candidate programPPor “no candidate.”
2.Replay.IfPPexists, the Replayer walks its graph—checking each state’s predicate against the live screen and firing each transition’s action. “Add Emilia Gonzalez” on Tuesday is solved here, in seconds, with no language-model call.
3.Fall back.If a predicate is false or an action errors, the harness hands control to the full agent (CUA), which continues from the live screen and records a fresh trace.
4.Verify and store.On success the trace is compiled into a new programP′P^{\prime}; the verify-before-store gate (§3.3) re-runsP′P^{\prime}from a clean state and stores it—replacing any program with the same dedup signature—only if it independently passes.

The static harness code defines this loop; the corpus is what grows, and step 4’s gate is what lets it growmonotonicallyrather than accumulate programs that run but do not work. Algorithm1makes the loop precise.

Replay verifies before it acts.The distinction from blind record-and-replay lives in step 2, and it is the property we most want to make explicit. A classical RPA script or a cached action trace fires its stored actions in order, trusting that the screen is wherever it left off; PreAct instead checks each state’s verification predicate against the live screenbeforeit performs the transition out of that state—observe first, then act. When the prediction holds—the expected element is present, the page is the one the program expects—the action fires immediately, with no language-model call, which is where the speed comes from. When it does not—an error dialog has appeared, an element is missing, the app has navigated somewhere unexpected—the predicate fails, replay halts instead of blindly clicking into a wrong state, and control passes to the CUA fallback (step 3). The agent finds a path from the live screen, and that path is compiled and, if it passes the gate, stored, so the program accrues new branches over time—success states, error-handling states, version-specific dialogs—covering more of the task’s real state space each time it is exercised. This verify-before-act discipline, applied ateverystep rather than only at the end, is what makes replay reliable enough to trust with no model in the loop; it is the difference between an agent that fluently re-executes a known task and one that blindly repeats a recording until it walks off a cliff. We isolate this run-time verification’s own contribution—against a flat-script runtime that strips it—in §4.6.

Refer to caption Figure 5:Where the cost goes. The first time PreAct sees a task it runs the full agent (top:≈N\approx Nrounds of observe–reason–act) and pays a one-time compile-and-verify premium. Every later run replays the stored program (bottom): no language-model calls, an order of magnitude faster, and already verified before it was stored.Algorithm 1PreAct verified compile-extend-replace loop1:Task goal

TT, environmentenv, program corpus

CC, replay-coverage threshold

τ\tau(default 0.5)

P←Selector(T,C)P\leftarrow\textsc{Selector}(T,C)⊳\trianglerightLLM-agentic; may return⊥\bot

r←Noner\leftarrow\textsc{None} 4:if

P≠⊥P\neq\botthen

r←Replay(P,env)r\leftarrow\textsc{Replay}(P,\textit{env})⊳\trianglerightwalk graph; verify each state predicate

6:if

r.success∧r.cov>τr.\textit{success}\land r.\textit{cov}>\tauthen

7:return

rr⊳\trianglerightwarm path: trusted replay

8:endif

9:endif

10:

r←CUA(T,env,r)r\leftarrow\textsc{CUA}(T,\textit{env},r)⊳\trianglerighthybrid: whenr≠Noner{\neq}\textsc{None}, CUA continues from the partial-replay state

11:if

¬r.success\neg r.\textit{success}thenreturn

rr 12:endif

13:

P′←Compile(r.trace)P^{\prime}\leftarrow\textsc{Compile}(r.\textit{trace})⊳\trianglerightLLM-driven trace→\tostate-machine

14:

env′←Reset(env,T)\textit{env}^{\prime}\leftarrow\textsc{Reset}(\textit{env},T)⊳\trianglerightindependent re-evaluation environment

15:

r′←Replay(P′,env′)r^{\prime}\leftarrow\textsc{Replay}(P^{\prime},\textit{env}^{\prime}) 16:

score′←Evaluate(env′,T)\textit{score}^{\prime}\leftarrow\textsc{Evaluate}(\textit{env}^{\prime},T) 17:if

r′.success∧score′≥1.0r^{\prime}.\textit{success}\land\textit{score}^{\prime}\geq 1.0then⊳\trianglerightdouble gate

18:

C←Upsert(C,P′)C\leftarrow\textsc{Upsert}(C,P^{\prime})⊳\trianglerightinsert by dedup signature; replace if collision

19:endif

20:return

Three properties of Algorithm1are worth highlighting. First,the harness’s only side effect onCCis line 16’sUpsertcall, gated by line 15’s double predicate: programs enter the corpus only after independent re-validation. Second,the corpus grows monotonically in expectation: eachUpsertcall adds a program that has independently re-passed both replay and evaluation, so in expectation the corpus quality does not decrease acrossUpsertevents (a stored program can still fail on a later parameter binding—§5.6—so this is an expectation, not a guarantee). Third,the corpus mutates viaUpsert, not just append: a fresh program with the same dedup signature replaces an older one, allowing the corpus to refine its representation of repeated tasks rather than just accumulate variants.

3.3 The verify-before-store gate

The gate is the run-time check of §3.2applied once more—this time to the whole program, at the moment of storage—and it is what separates PreAct from a blind cache. When the agent succeeds and the trace is compiled into a candidateP′P^{\prime}, the harness doesnottrust it. It resets the environment to the task’s clean initial state, replaysP′P^{\prime}there (the same predicate-checking replay used on every warm run), and asks the benchmark’s own evaluator whether the task is solved (Figure6).P′P^{\prime}is stored only ifbothhold: the replay ran to its terminal state without error,andthe evaluator scores it a pass.

Both halves earn their place. Checking the evaluator’s pass alone lets through programs whose actions silently no-op (a malformed action the backend swallows) while the environment happens to still hold passing state from an earlier step—so we additionally require the replay to report its own success. The converse and more important failure mode is a clean replay that nonetheless does not pass: the program executes every action to completion (100% coverage) yet the end state does not satisfy the evaluator—the contact form was filled and “Save” was tapped, but a stale field meant no contact was written. We call this thecov==100%/score==0 lossy replay, and §4.3shows it is exactly what the gate must catch.

One assumption is built into this design: because the gate verifies byre-runningthe program rather than inspecting it, the verify step performs the task’s actions a second time. This is safe in our benchmarks, where every task (and every verify-replay) starts from a scripted environment reset, but it presumes a resettable or otherwise idempotent task; tasks with irreversible side effects (sending a message, charging a card) would need a side-effect-free verification path instead. We return to this in limitation L6 (§5.7).

Refer to caption Figure 6:The verify-before-store gate. A freshly compiled program is re-run from a clean state and re-scored by an independent evaluator; it enters the corpus only if it both replays cleanlyandpasses. This is what rejects the “runs but doesn’t work” programs—the ones that replay to full coverage yet leave the task unsolved.

3.4 Selector and fallback

Two supporting components complete the loop. TheProgram Selectordecides which stored program (if any) applies to a new goal; by default it is an LLM that reads the goal and each program’s description and parameter schema and chooses via a tool call, but §Appendix F. Negative-Findings Tablesshows a plain embedding retriever does just as well—the choice of selector makes little difference to the result. TheCUA fallbackis the full agent, invoked whenever the selector finds no candidate or a replay fails; we use AndroidWorld’s T3A agent on mobile and Anthropic’s Computer-Use API on desktop and web. The production fallback runs theverbatimupstream agent prompt with no PreAct-specific additions (§4.6)—the harness, not the prompt, is what makes the system work.

4 Empirical Validation

4.1 Setup

Benchmarks.We evaluate on (a) AndroidWorld[36]“official-15” subset (15 tasks across 8 application domains; the curated subset used by the upstream T3A baseline), (b) OSWorld[54]“test_tiny” subset (6 tasks across Chrome, LibreOffice Calc, and LibreOffice Writer), and (c) a 12-task shopping_admin subset of WebArena[66](string-match answer-extraction tasks on a Magento e-commerce admin panel).

Models.For Android CUA, we use Gemini 3 Flash via the multi-provider port; this is roughly 10×\timescheaper than Claude Sonnet 4.6 at equivalent SR (we confirm this with cross-model replication in §4.2). For OSWorld and WebArena CUA, we use Claude Sonnet 4.6 via Anthropic’s Computer-Use API. For program compilation and selector reasoning, we use Claude Sonnet 4.6 throughout.

Container.AndroidWorld runs in the publicandroid_worldcontainer image, extended with a small patch that exposes the device’s screenshot and accessibility tree to the agent (Appendix C). OSWorld uses the standardosworld-dockerimage via its environment provider. WebArena uses the canonicalshopping_adminMagento Docker stack.

Multi-seed protocol.For each seed in {42, 100, 1337, 2024, 7777}, we set the existing corpus aside (preserving prior state) and start from a fresh empty corpus. We then run cold (empty corpus) followed by warm (cold-built corpus) on the same task list.

Cost and protocol.The full validation push ran in≈50\approx 50hours of autonomous container time for≈$30\approx\mathdollar 30–3535in LLM spend. Every claim below uses paired cold/warm runs: “cold” starts from an empty corpus; “warm” reuses the corpus the cold run built on the same tasks. This is the setting the rerun crisis lives in—the same user, the same tasks, a second time.

Metrics.We report task success rateandthe cost of achieving it—wall-clock time, language-model token spend, and replay coverage—rather than success alone. The three benchmarks we build on score only whether a task passes, which makes an agent that re-derives every repeated task indistinguishable from one that has learned to replay it. Because PreAct’s whole premise is that repeated work should get cheaper, not just stay correct (§1), we treat this efficiency dimension as a first-class outcome: warm-run speedup and the one-time compile-and-verify overhead are measured quantities (Figures1,8), not illustrative ones.

A one-paragraph map of what follows. Two harness mechanisms move the success rate: theverify-before-store gate(§4.3) and thecache-miss fallback(§4.4). The runtime add-ons we tried—prompt content, runtime guardrails, and the choice of selector—do not (§4.6); the one remaining knob, the step-budget cap, trades success rate for wall-clock time by design rather than adding any.The harness is what works.

4.2 Repeated runs get faster

The basic promise is that the second time the agent sees a task, it is faster and no worse. Figure7shows the success rate rising from cold to warm on AndroidWorld official-15 across three Gemini 3 Flash seeds (corpus reset per seed): every seed improves, mean+1.33+1.33tasks, withnoseed regressing. The gain is not free retrieval of luck—on the warm runs the agent handles a majority of tasks by replaying a stored program rather than re-deriving it. Of the 13 tasks that clear setup on seed 42, 8 shift from fresh-agent solving on cold to replay-or-hybrid on warm (versus all 13 re-derived by the fresh agent on cold), which is the corpus doing its job.

Refer to caption Figure 7:Cold→\towarm refinement is monotonic.AndroidWorld official-15 success rate, three Gemini 3 Flash seeds with the corpus reset per seed. Every seed improves (mean9.33→10.679.33\to 10.67of 15,Δ=+1.33\Delta=+1.33); none regresses. Per-task detail in Appendix B.The speedup is real.A replayed program carries no per-step language-model call, so warm runs are both cheaper and faster: a served replay makes essentially no model calls (its LLM cost collapses to near zero), and on WebArena completes8.58.5–13×13\timesfaster in wall-clock time than the corresponding fresh-agent solve (Figure1), while on Android a warm task that hits the corpus finishes in seconds. The one-time price is the compile-and-verify step, which costs roughly as much as one more solve (+162%+162\%wall time on Android,+217%+217\%on OSWorld per stored program; Figure8); it is repaid the first time the task recurs. Figure5traces where this cost goes: a one-time premium on the first solve, then near-free replays on every later run.

Refer to caption Figure 8:The gate’s one-time cost. Verifying a freshly compiled program (re-running it from a clean state) costs about as much as—or, on both platforms measured, somewhat more than—the original solve (+162%+162\%wall time on Android,+217%+217\%on OSWorld), the gap being largest on OSWorld where both the solve and the verify-replay are slowest. This is paid once per stored program and amortized over every warm run.The result is not model-specific.Swapping Gemini 3 Flash for Claude Sonnet 4.6 on the same subset gives the same73.3%73.3\%(11/15) and—tellingly—thesamethree stable failures (BrowserDraw, SystemBrightnessMax, SystemWifiTurnOn) across both models and all three seeds. These three are properties of the benchmark’s UI (a canvas not exposed in the accessibility tree; a brightness slider that rejects scroll; a stale status-bar Wi-Fi icon), not of the agent’s reasoning—they bound the benchmark, not the method (§5.5).

4.3 The verify-gate keeps the speedup from rotting

Reuse cuts both ways. The same corpus that makes warm runs cheap can make themwrong, if it stores a program that runs but does not work. Recall the lossy contact program from §3.3: it taps through the form and presses Save, replays at 100% coverage, reports success—and yet writes no contact, because a stale field left the name blank. Store that program, and every future “add a contact” silently fails. The verify-before-store gate exists to keep such programs out. To show it is doing real work, we turn it off.

Refer to caption Figure 9:The gate is what makes repeated runs improve.Mean cold→\towarm success-rate change with the gate on (blue) versus off (red),±1\pm 1s.d. On all three platforms the gate-on bar is higher: the gate makes the corpus improve the agent on mobile and desktop, and degrade it far less on the web (where most programs cannot be re-verified, §4.3); without it, the corpusdegradesthe agent everywhere as lossy programs accumulate. The gap between the bars (the gate’s marginal value) is1.751.75–2.62.6tasks and points the same way everywhere—this is a property of the harness, not of any one platform. Per-seed and per-rep data in Appendix B.We ran the same2×22{\times}2ablation—cold/warm×\timesgate on/off—on all three platforms: AndroidWorld official-15 (n=5n{=}5seeds, Gemini), OSWorld test_tiny (n=5n{=}5reps, Claude), and a 12-task WebArena shopping_admin subset (n=4+4n{=}4{+}4reps, Claude). Table2gives the per-platform cold→\towarm change in each condition; Figure9plots it.

Table 2:Verify-gate ablation across three platforms. Each cell is the mean cold→\towarm change in tasks solved (±1\pm 1s.d.). With the gateon, repeated runs hold or improve on mobile and desktop and regress less on the web; with itoff, every platform regresses further as lossy programs accumulate. “Gate’s gain” is the gap between the two—the cost of not verifying. The gate-onΔ\Deltais higher than gate-off on all three platforms, strictly so in 13 of the 14 paired runs (the 14th, WebArena rep 4, is a tie). Per-seed and per-rep tables are in Appendix B.The gate decides whether reuse helps or hurts.Read the table one row at a time. On mobile and desktop the gate-on agent improves or holds across repeated runs (Δ≥0\Delta\geq 0), while the gate-off agentlosesground—it does the same tasks a second time and solves fewer of them, because the corpus is now seeded with programs that replay but do not work. The effect points the same way on all three platforms and in every non-tied paired run (13 of 14; the 14th, WebArena rep 4, is a tie); per-platform sign tests givep≈0.031p\approx 0.031on Android and OSWorld andp=0.125p=0.125on WebArena’s three non-tied pairs. Pooling the consistent direction across platforms givesp≈10−4p\approx 10^{-4}under a no-effect null; because the within-platform repetitions are correlated rather than independent, we read that pooled figure as a directional lower bound, and rest the claim on the per-platform tests and the consistent direction. That the gate’s gain lands in the same1.751.75–2.62.6task band despite three platforms, three models, and three evaluator styles suggests it reflects how often thecompileremits a lossy program—a property of compilation, not of any platform (§5.3).

Why WebArena is low even with the gate on.On WebArena the gate-on agent still regresses (−4.0-4.0), because its tasks are answer-extraction (“what was the top-selling product in 2022?”): almost every compiled program reads a value off a page that has since changed, so the gate—correctly—rejects nearly all of them, leaving the corpus empty and nothing to replay. The gate is behaving exactly as designed (it never stores a program that fails re-verification); the missing piece is what to do when it empties the corpus, which is the subject of §4.4.

The failure, observed directly.The failure the gate intercepts is directly observable. Under gate-off we logged five distinct programs (Table3) that replay to 100% coverage, report success—and score 0 on the evaluator. The first is our running example: a stored “add a contact” program that taps through the whole form and presses Save, yet leaves no contact behind. On WebArena the same pattern is near-total: all 48 gate-off warm replays score 0. These are precisely what the gate exists to reject before storage.

Table 3:Five reproducible “runs-but-doesn’t-work” programs logged under gate-off: each replays its full action sequence (100% coverage) yet the evaluator scores it 0. The first is the running example. With the gate on, these are rejected at compile time and never enter the corpus.

4.4 The second mechanism: a fallback that closes the gap to record-and-replay

The gate exposes a question: when it empties the corpus for a task (as it does on most WebArena tasks), what should the harness do? To answer it we compared PreAct against the strongest record-and-replay baseline,Muscle-Mem[34]—which caches the linear action sequence, replays it blindly, and re-runs the full agent on any cache miss—and againstWorkflow-Use[4], which stores per-task scripts with no verification. Same harness, tasks, and backend (Claude Sonnet 4.6),n=4n{=}4reps each (Figure10, Table4).

Refer to caption Figure 10:Closing the gap to Muscle-Mem on WebArena.Warm−-cold success-rate change (±1\pm 1s.d.). As shipped, PreAct’s gate alone (−4.0-4.0) trails Muscle-Mem (−0.75-0.75): when the gate empties the corpus, PreAct had nothing to fall back to, whereas Muscle-Mem’s cache miss re-runs the full agent. Adding the same cache-miss fallback to PreAct (−1.0-1.0) makes the two statistically indistinguishable (p≈0.84p\approx 0.84). Workflow-Use and gate-off collapse to0warm, for the same reason gate-off does in Figure9.Table 4:Head-to-head baselines on WebArena shopping_admin 12-task subset,n=4n{=}4reps each. PreAct (gate-ON / gate-OFF) rows reproduced from §4.3; the bottom row is the cache-miss-to-CUA fallback harness fix (§4.4.1) that we propose and measure.The honest reading of Table4unfolds in two stages.*Stage 1 (PreAct as-is vs. baselines).*At matchedn=4n{=}4, the as-shippedPreAct gate-ON harness underperforms Muscle-Mem on this benchmark: Muscle-Mem’s warm-SR mean (6.25/12=52%6.25/12=52\%) is more than3×3\timesPreAct gate-ON’s (2.0/12=17%2.0/12=17\%). The warm-Δ\Deltagap is−0.75-0.75vs.−4.0-4.0, i.e. Muscle-Mem loses about3.253.25fewer tasks than PreAct on the cold-to-warm transition. Workflow-Use, by contrast, behaves like PreAct gate-OFF (warm SR0/120/12in every rep)—it stores compiled scripts without verification and, like gate-OFF, suffers complete warm-time collapse when the scripts encounter live-page selectors that don’t match.

Mechanism diagnosis: cache miss vs. verify-reject.WebArena’s two-run protocol replays thesame taskon Run 2, so the relevant question is what happens when the stored artifact does not match the live page. Muscle-Mem’s blind-replay attempt fails fast (typical∼30\sim 30s before timing out on a missing selector), and the harness then falls through to a fresh CUA run that re-solves the task with the original LLM cost. PreAct’s verify-before-store gate, by contrast, runs an independent verify-replay between Run 1 and Run 2; on these WebArena tasks the compiled state-machine predicates are brittle enough (e.g.state=reviews_index_pagematched via XPath patterns) that they fail re-verification even on the cold-pass task, so the gate deletes the program and Run 2 hasno artifactto fall back to.Muscle-Mem’s advantage on WebArena is not that its replay is more reliable—it isn’t, with reps showing cache miss rates of5050–67%67\%—but that its cache-miss path falls through cleanly to CUA, whereas PreAct’s gate-reject path leaves Run 2 with no artifact at all.

4.4.1 Closing the Gap: Cache-Miss-to-CUA Fallback

The diagnosis suggests a small change: when the gate rejects all compiles for a task, run a fresh CUA on Run 2 instead of returning “no artifact”. We add a fallback branch that triggers a fresh CUA exploration whenever the verify-gate has emptied the corpus for the current task; we additionally tighten the gate’s cleanup to delete every program added during the task’s compile-and-verify cycle (Run-1 compileandany programs the verify-replay’s own CUA recovery happens to add), so a stale verify-recovery program cannot suppress the cache-miss branch. With this change, Run 2 fires a fresh CUA exactly when Muscle-Mem would: the bottom row of Table4.

The result:warm SR rises from2.0/122.0/12to6.25/126.25/12, exactly matching Muscle-Mem’s6.25/126.25/12, with the warm-Δ\Deltagap closing from−4.0-4.0to−1.0-1.0.Across the four reps the warm-SR sequence is6,7,5,76,7,5,7(vs. Muscle-Mem5,6,7,75,6,7,7) and the per-rep warm-Δ\Deltavalues are0,+1,−3,−20,+1,-3,-2(vs. Muscle-Mem−2,0,+1,−2-2,0,+1,-2). Welch’s two-sidedtt-test on the warm-Δ\Deltadistributions gives|t|≈0.21|t|\approx 0.21,p≈0.84p\approx 0.84: under any conventional threshold we cannot reject equality of means, i.e. PreAct-with-fallback and Muscle-Mem are statistically indistinguishable on this benchmark. The cold-SR comparison is similar (7.257.25vs.7.07.0,p>0.5p>0.5).

This is a small-but-realpositivefinding for the architecture: PreAct’s gate-rejection signal is high quality (the rejections in the gate-ON-no-fallback condition were correct—the underlying compiled programs really were lossy), and the only thing missing was a fallback path. Critically, the fallback doesnotcompromise the gate’s Android/OSWorld value: the gate continues to reject lossy compilesbeforethey enter the corpus, preserving monotonicity (§4.3); the fallback only changes what happens on the cache-missbranchof Run 2 on the platform where the gate’s rejection rate is highest. The fallback also does not save tokens compared to Muscle-Mem on this benchmark—both pay the full CUA cost when the cache misses, so warm tokens are similar to cold (∼65\sim 65k for fallback vs.∼50\sim 50k cold; the slight increase comes from CUA re-exploring with a fresh browser session)—but it recovers the warm-SR property that the as-shipped harness was leaving on the table.

Implication.Across the three platforms tested, the harness needed two cooperating mechanisms: the verify-before-store gate to keep the corpus monotonic (§4.3; Android+2.6+2.6, OSWorld+2.6+2.6, WebArena+1.75+1.75task diff-of-deltas under the gate),anda cache-miss-to-CUA fallback when the gate empties the corpus on a particular task (this section; WebArenaΔ=−1.0\Delta=-1.0matching Muscle-Mem’s−0.75-0.75). The two-stage finding—“gate alone” loses to Muscle-Mem on WebArena, but “gate + cache-miss fallback” matches Muscle-Mem on WebArenaandretains the gate’s Android/OSWorld monotonicity—is the main architectural lesson of this section. PreAct enables this cache-miss fallback by default; disabling it recovers the gate-ON, no-fallback behavior reported as theΔ=−4.0\Delta=-4.0measurement.

4.5 Compile-LLM Robustness: Gemini vs Claude Compile

To probe whether the verify-gate’s compile-rejection rate is Claude-specific or a property of the compile step itself, we swap the compile-step LLM from Claude Sonnet 4.6 to Gemini 3 Flash, leaving the CUA loop and selector on Claude. We rann=3n{=}3WebArena reps on the same 12-task shopping_admin subset (gate-ON, no cache-miss fallback, for clean comparison with the Claude-compile baseline in §4.3); Table5and Figure11report the result.

Refer to caption Figure 11:The gate is mechanism-driven, not Claude-specific. Swapping the compile-step model from Claude to Gemini 3 Flash leaves the gate-rejection rate in the same band (83%83\%vs.89%89\%); warm success rate depends only on what survives the gate.Table 5:Compile-LLM robustness: gate behavior under Claude vs Gemini compile, WebArena 12-task shopping_admin gate-ON (no cache-miss fallback).Cold SR is statistically indistinguishable across compilers (6.06.0vs.6.336.33,p>0.5p>0.5), as expected since both use the same Claude Sonnet 4.6 CUA loop on Run 1. The gate’s rejection rate is in the same band (83%83\%with Claude vs.89%89\%with Gemini), and the dominant rejection mechanism is the same (cov==0%/score==0 or cov==17%/score==0 lossy replays). Qualitatively, Gemini compiles tend to fail at very-first-step replay (cov==0% modal) whereas Claude compiles run to completion but produce wrong answers (cov==100% modal); both failure modes are caught by the gate. The Gemini-compile warm SR is0/120/12on all three reps withσ=0\sigma{=}0—tighter than the Claude bimodal distribution (4,4,0,04,4,0,0)—reflecting Gemini’s higher rate of cov==0% compiles that the gate rejects with100%100\%consistency.The verify-gate is mechanism-driven, not Claude-specific: substituting Gemini 3 Flash at the compile step does not change the qualitative behavior (gate fires; lossy compiles get rejected; warm SR depends on what survives), and the quantitative rejection rate shifts only modestly (83→89%83\to 89\%). Cross-model replication closes the compile-step concern noted under L1 in Appendix E.

4.6 What DidNotChange the Result

The AB experiments that came backnegativeor marginal are just as useful: they rule out the obvious alternative explanations for why PreAct works. Detailed per-seed tables are in Appendix F.

Refer to caption Figure 12:A plain embedding retriever matches the LLM-based selector.On the 58-program corpus, both embedding backbones reach100%100\%correct-family retrieval (left) at0%0\%mis-routing of unrelated queries (right) once the similarity thresholdτ\tauis tuned, against the agentic selector’s75.6%75.6\%(dashed). The choice of selector thus barely affects retrieval quality—a tuned embedding retriever even edges out the agentic one, and the agentic selector’s own misses are conservative no-picks that fall back to CUA rather than wrong retrievals; we keep it as the default only for its interpretable logs and tuning-free operation. Full sweep in Appendix F.The choice of program selector barely matters.We had assumed an LLM-based selector was necessary—deciding whether a stored program applies to a new task looked like a reasoning problem. It is not: a plain embedding retriever (MiniLM-L6 or bge-large) matches or beats the agentic selector on the 58-program corpus, reaching100%100\%correct-family retrieval at0%0\%mis-routing once its threshold is tuned, versus the agentic selector’s75.6%75.6\%(Figure12). We keep the agentic selector as the default only for its interpretable reasoning logs and because it needs no threshold tuning—not because it retrieves better.

Prompt-level PreAct content is unused in production.Our implementation includes three optional PreAct-specific guideline bullets, but the production CUA path never passes them to the agent. The 73.3% Android SR is achieved with the verbatim upstream T3A prompt and zero PreAct prompt-level additions. The contribution on Android is the harness wrapping T3A—not the prompt.

Code-level runtime guardrails are aggregate-neutral atn=5n{=}5.A2×22{\times}2ablation of the guardrails (a double-tap before text entry, an image-task no-op cap, and scroll-exhaustion notes) givesidenticalcold means (10.2=10.210.2=10.2) and warm means11.011.0(on) vs.10.610.6(off), a0.40.4-task gap well within seed-level variance (Figure15; Table13). Per-task analysis shows the double-tap helps the audio-record-with-filename task (a populated filename dialog) but the gain is cancelled by minor timing penalties on Camera tasks.

Run-time verification (state-machine vs. flat-script): real but small on these benchmarks.This ablation isolates the run-time half of PreAct’s verification principle (§3.2): it strips the per-state predicate checks, leaving a flat script that fires its actions blindly the way a recorded macro would. State-machine vs. flat-script (Figure16and Table14in Appendix F) gives flat-script10.6±1.1410.6\pm 1.14(n=5n{=}5seeds) vs. the verified runtime11.67±0.5811.67\pm 0.58on the 3 matched seeds; matched-seedΔ=+0.67\Delta=+0.67tasks favoring verification, all 3 paired comparisons in the non-regressive direction (sign-testp=0.125p=0.125, suggestive). The wins concentrate on Camera tasks, where verification catches mismatches that linear execution walks past. The effect is small here for a telling reason: per-step verification pays off exactly when the screen deviates from the program’s expectation, and these short-horizon benchmark tasks rarely throw an anomaly mid-replay—so the mechanism shows its value more inhowa run fails (a clean halt-and-fall-back rather than clicking on into a wrong state) than in raw success rate. We therefore treat run-time verification as the enabling design choice that makes trustworthy direct replay possible, not as a large SR lever; the strongly-validated lever is the store-time gate (1.75–2.6-task diff-of-deltas).

Dynamic step-budget cap: a deliberate speed/accuracy tradeoff.Unlike the add-ons above, the step-budget cap isnotneutral—it is the one knob in this section that moves the success rate, and it moves it by choice rather than as a side effect of the harness. The cap—a budget of8×8\timesthe task’s step-complexity, clamped between 20 and 30 steps—vs. unbounded 60-step atn=5n{=}5gives9.8±0.849.8\pm 0.84(capped) vs.11.6±0.5511.6\pm 0.55(unbounded): the capcostsΔ=1.8±0.84\Delta=1.8\pm 0.84tasks of success rate (within the±2\pm 2pre-registered prediction interval) in exchange for≈30%\approx 30\%less wall time on the FAIL tail (the unbounded budget burns the extra time largely on SystemBrightnessMax, which is harness-deterministic and never solved either way). We ship the capped setting as the latency-favoring default; deployments that prioritize success rate over latency should raise or remove the cap. We list it here because it is the natural place to compare it against the genuinely-neutral add-ons, not because it is itself neutral.

Per-seed tables for these negative ablations are in Appendix F, and a complete map of which validity threat each experiment closes—twelve in all, eleven in scope and closed at the rigor noted per threat—is in Appendix H.

5 Discussion

5.1 Why the Verify-Gate Matters

The mechanism is: the verify-gate filters lossy compiles before they enter the corpus. “Lossy” here means: action sequences that are mechanically faithful (cov=100% on replay) but do not produce the live evaluator’s required end state. Without the gate, every CUA-success-then-compile cycle adds a new program to the corpus regardless of whether that program actually achieves the goal under deterministic re-execution. With the gate, only programs that have independently re-passed enter the corpus.

This matters because the corpus is the input to subsequent retrieval. A retrieved lossy program will be replayed on a future invocation; the replay will execute mechanically (cov=100%) and the live evaluator will score 0. Worse, the harness’s selector will continue to retrieve the lossy program until something explicitly removes it. The verify-gate prevents the corpus from accumulating these failures, which would otherwise reproduce the same wrong outcome on every future retrieval.

5.2 A Predictive View of Replay and Fallback

Replay and fallback can be read through the lens of prediction. Each state’s verification predicate is apredictionabout what the screen will look like once the previous action has taken effect, and the replayer checks that prediction against the live screen before it acts (§3.2). When prediction and observation agree, the agent proceeds automatically and without a language-model call—the regime that produces the speedup. When they disagree, the mismatch signals that the world has departed from the stored model, and the agent switches to deliberate reasoning (the CUA fallback) to recover and, in doing so, extend the model with a new branch. This is the same two-speed behaviour a practiced human shows (§1): fluent, near-automatic execution of a familiar workflow, punctuated by immediate attention the moment something looks wrong.

The resonance with predictive accounts of learning—in which an agent learns by predicting its environment and updates on the gap between predicted and observed state[43,42]—is deliberate, but the analogy is loose and we do not want to overstate it. PreAct carries no reward signal and does no value bootstrapping; its “error signal” is a discrete predicate mismatch thatroutes controlbetween fast replay and slow reasoning, not a continuous temporal-difference update that adjusts a value estimate. What the two genuinely share is the principle that prediction is what makes fast action safe: an agent may skip deliberation precisely because it is still checking, at every step, that the world matches what it expected. That per-step check is exactly what blind record-and-replay omits, and it is why a stored PreAct program can be trusted without a model in the loop while a cached action trace cannot—the difference between fluency and blind repetition.

5.3 The Verify-Gate’s Marginal Value is Structural

A notable result is that the verify-gate’s marginal value (Figure9) falls in thesame narrow band—1.75 to 2.6 tasks of diff-of-deltas—on three structurally different platforms with different LLMs, different task subsets, and different evaluator semantics: 2.6 tasks on Android (Gemini, UI-predicate evaluators), 2.6 tasks on OSWorld (Claude, desktop-state evaluators), and 1.75 tasks on WebArena (Claude, string-match evaluators on a 12-task shopping_admin subset). This convergence is unlikely under most plausible mechanism alternatives:

•If the gate’s value derived from a model-specific behavior (e.g., Gemini-particular flaws), Android’s diff-of-deltas would not match OSWorld’s or WebArena’s (both Claude).
•If it derived from task-specific properties (e.g., Markor-edit ambiguity), the desktop-task and web-task diff-of-deltas would differ from the mobile case.
•If it derived from benchmark-evaluator quirks (e.g., Android’s specific scoring rubric), OSWorld’s desktop-state evaluators and WebArena’s string-match evaluators would not produce the same magnitude.

Instead, the diff-of-deltas magnitude is structural: it is the rate at which the underlyingcompilestep produces lossy state-machine programs. Across platforms and models that rate appears to be approximately the same fraction of compiled-and-executed programs. Theabsoluteregression varies more (43 pp on OSWorld’s 6-task subset, 17 pp on Android’s 15-task subset, 15 pp on WebArena’s 12-task subset), but theabsolute number of tasksaffected falls in the same range. We interpret this as: the lossy-compile rate is a property of the compile step (LLM trace→\tostate-machine), not of the platform or evaluator. The cross-platform meta-test on the 14 paired observations across all three platforms yieldsp≈1.2×10−4p\approx 1.2\times 10^{-4}under the null hypothesis of no gate effect (a directional lower bound that treats within-platform reps as independent; see the correlation caveat in §4.3).

5.4 Per-Program Reproducibility of cov==100%/score==0

The five cov==100%/score==0 cases (Table3) are not equally reproducible across reps—the OSWorld Calc-formula program reproduces 5/5 reps, the Calc-chart 4/5, and the Chrome-history 2/5 (per-program breakdown in Appendix G).The aggregate regression, however, is consistent across every rep.All 5 OFF reps lose at least 2 tasks; no rep escapes the regression. The selector and replayer’s per-task behavior is partially nondeterministic, but the gate-OFF condition reliably produces≥2\geq 2-task regression on every rep. The paper’s claims rest on this aggregate-level determinism, not on per-program reproducibility.

5.5 Harness-Deterministic vs. LLM-Driven Failures

Three AndroidWorld tasks (BrowserDraw, SystemBrightnessMax, SystemWifiTurnOn) and one OSWorld task (LibreOffice Writer macro) fail across all five seeds, both LLM backends (Claude and Gemini), all four guardrail/budget configurations, and both verify-gate conditions. These failures areharness-deterministic: they are properties of the UI patterns themselves (status-bar stale reads, scroll-on-seekbar incompatibility, image-canvas-not-in-a11y-tree) rather than properties of the agent’s reasoning. They bound the benchmark, not the method.

5.6 Out-of-Distribution Generalization

We tested whether the corpus transfers beyond seen tasks by running a test_tiny-built corpus on the 33 non-test_tiny tasks of OSWorld test_small atn=3n{=}3reps (with the corpus copy reset between reps). Per-rep SR over the 30 OOD tasks that complete setup:21/3021/30,15/3015/30,14/3014/30, giving a warm-OOD mean of16.7/30=55.6%±12%16.7/30=55.6\%\pm 12\%. The cold-OOD baseline on the same task subset is a single (n=1n{=}1) pass at≈20/30=67%\approx 20/30=67\%; because it is unpaired and not multi-rep, the gap below should be read as indicative rather than a tightly-estimated effect.The corpus does not transfer; atn=3n{=}3the warm-OOD mean is≈11\approx 11percentage pointsbelowthis cold-OOD pass(Figure13), a small-but-real headwind rather than the neutral outcome a single pass had suggested. Inspection of the rep logs shows the selector occasionally retrieves a near-miss program (a Chrome task’s program retrieved for a Thunderbird task whose description shares the words “open” and “message”), and the brittle XPath/state predicates from the source domain fail on the OOD page, costing replay budget without recovering via CUA fast enough. The selector’s no-pick rate on OOD queries does buffer this somewhat—it returns “no candidate” on the majority of cross-family queries—but the queries it does pick are penalised. This sharpens the scope of the “self-extending corpus” claim:the corpus’s value is concentrated in repeated invocations of seen tasks, and corpora built on small task subsets actively work against out-of-distribution generalization on this evaluator. Full per-domain breakdown in Appendix D.

Refer to caption Figure 13:The corpus does not transfer cleanly. A corpus built on six OSWorld tasks, applied to 30 unseen tasks, lands≈11\approx 11pointsbelowthe cold baseline: a few mis-routed retrievals fire brittle predicates and burn replay budget. The corpus’s value is concentrated in repeated invocations of seen tasks.

5.7 Limitations

Beyond the eleven in-scope validity threats closed in Table16, we explicitly acknowledge six remaining limitations:

•(L1) Architectural baseline (partial).A direct AB against an ActionEngine-style flat-script baseline atn=5n{=}5Android seeds gives flat-script10.6±1.1410.6\pm 1.14vs. state-machine11.67±0.5811.67\pm 0.58(Δ=+0.67\Delta=+0.67,p=0.125p=0.125): suggestive but not significant.
•(L2) Benchmark coverage.The 6-task OSWorld test_tiny, 15-task AndroidWorld official-15, and 12-task WebArena shopping_admin subsets are a small sample of computer-using tasks; absolute pp magnitudes are denominator-dependent (43/17/15 pp for 1.75–2.6 absolute tasks).
•(L3) Compile cost.Verify-replay adds median +217% wall overhead on OSWorld and +162% on Android per successful CUA-then-compile cycle; the cost is amortized on warm runs but matters for high-throughput deployments.
•(L4) Compile-fidelity rate.The LLM compiler produces lossy state-machines on a non-trivial fraction of inputs (audit:∼28%\sim 28\%Android programs missnavigate_back;100%100\%of audited WebArena programs useinspect_screenshoton dynamic state); the gate filters this at storage but scales as the number of compile attempts, not stored programs.
•(L5) Selector nondeterminism.Verbatim retrieval is 73% exact-id (all 8 misses were semantically-equivalent picks); paraphrase robustness is 75.6% functional (47% exact-id + 29% equivalent), 24% no-pick,0% wrong-task-family—the selector is conservative-by-design rather than over-eager.
•(L6) Verification assumes a resettable or idempotent environment.The verify-before-store gate confirms a freshly compiled program byre-executingit from a reset environment and consulting the task’s evaluator, and the warm replay path likewise re-runs a stored program end-to-end. Both presuppose that performing the program again is safe. Our benchmarks provide this for free—each task starts from a scripted reset—but many real tasks are not idempotent: re-running “add contact Emilia Gonzalez,” “send the email,” or “submit the payment” duplicates the side effect rather than re-confirming the goal. Outside a benchmark, the gate would need a side-effect-free way to verify (a read-only check of the end state, a sandboxed dry-run, or a transactional rollback) in place of literal re-execution. We validate the gate as a benchmark mechanism; making it safe under irreversible side effects is future work.

Detailed measurements and analysis for L1–L5 in Appendix E.

5.8 Future Work

Four directions stand out.

Scaling the corpus.The current corpus (58 Android programs after multi-week experiments) is small. We have not characterized the corpus’s behavior at10310^{3}or10410^{4}programs: does the agentic selector still discriminate accurately when the candidate set is large? Does dedup-signature collision become an issue? Does the verify-replay step cost become prohibitive at scale?

Architectural baselines.The state-machine-as-executable claim should be tested directly against a flat-script baseline (e.g., an ActionEngine-style implementation that runs the same compile cycle but generates flat Python at the verify step). The current paper supports the claim by working implementation across two platforms but not by direct comparison.

Long-horizon tasks.Both AndroidWorld and OSWorld have task horizons of order 10–30 actions. Tasks requiring long-term planning, goal decomposition, or recovery from compound failures (e.g., entire workflow tasks lasting hundreds of actions) are not represented in our benchmarks. We expect the harness’s CUA-fallback mechanism to be tested most severely on such tasks; integration with explicit goal-decomposition planners is the natural next step.

Internalizing the corpus.PreAct externalizes task dynamics into inspectable programs; the converse is open—whether a model can be trained on its own verified corpus tointernalizethese programs, folding the externalized world model back into the weights while keeping the verify-gate as the signal that decides what is worth internalizing. This would trade PreAct’s inspectability for the generality an internalized world model affords, and the verified corpus is a natural source of supervision for it.

6 Conclusion

PreAct started from a simple observation: a computer-using agent should not re-derive a task it has already solved. Acting on it took two commitments. The first is about representation—what the agent remembers is a state-machine program it can run directly, not a transcript it has to re-interpret. The second is about discipline—the agent checks against the live screen before it trusts anything: before each action as it replays a program, and once more over the whole program, re-run from a clean state, before that program is allowed into its memory. Across three different platforms, these two ideas together let an agent get faster on repeated work while keeping its growing memory honest. The pieces we had expected to matter mostly did not: the runtime guardrails left the aggregate numbers unchanged, the prompt-level guidance was never exercised in production, and a small embedding retriever did as well as the agentic LLM selector. What carried the result was the pairing itself—an executable memory together with the refusal to keep anything unverified.

That refusal is the easy part to skip and the part that matters most. A memory that grows without verification does not make an agent better over time; it makes the agent repeat the same mistake on every reuse, retrieving a program that looks right, no longer works, and runs to its last step with nothing to show for it. As agents build up larger and longer-lived libraries of skills, the hard question stops being whether an agent can remember what it did and becomes whether it can tell which of its memories are still true.

Acknowledgements

We thank Hao Fu and Kai Chen for early discussions that inspired this work, and Yurun Jin and Junlong Ye for early implementations and experiments. Pine Copilot, Claude Code, and Claude Opus 4.8 were used during this research.

References

[1]Anthropic(2024-10)Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku.Note:https://www.anthropic.com/news/3-5-models-and-computer-useCited by:§1,§2.
[2]H. Bai, Y. Zhou, M. Cemri, J. Pan, A. Suhr, S. Levine, and A. Kumar(2024)DigiRL: training in-the-wild device-control agents with autonomous reinforcement learning.InAdvances in Neural Information Processing Systems (NeurIPS),Cited by:§2.
[3]R. Bonatti, D. Zhao, F. Bonacci, D. Dupont, S. Abdali, Y. Li, Y. Lu, J. Wagle, K. Koishida, A. Bucker, L. Jang, and Z. Hui(2024)Windows agent arena: evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264.Cited by:§2.
[4]Browser-Use(2025)Workflow-Use: create and run workflows (RPA 2.0).Note:https://github.com/browser-use/workflow-useCited by:§1,§2,§4.4.
[5]T. Cai, X. Wang, T. Ma, X. Chen, and D. Zhou(2023)Large language models as tool makers.arXiv preprint arXiv:2305.17126.Cited by:§2.
[6]H. Chae, N. Kim, K. T. Ong, M. Gwak, G. Song, J. Kim, S. Kim, D. Lee, and J. Yeo(2024)Web agents with world models: learning and leveraging environment dynamics in web navigation.arXiv preprint arXiv:2410.13232.Cited by:§2.
[7]W. Chen, X. Ma, X. Wang, and W. W. Cohen(2023)Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks.Transactions on Machine Learning Research (TMLR).Note:arXiv:2211.12588Cited by:§2.
[8]K. Cheng, Q. Sun, Y. Chu, F. Xu, Y. Li, J. Zhang, and Z. Wu(2024)SeeClick: harnessing gui grounding for advanced visual gui agents.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL),Cited by:§2.
[9]P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav(2025)Mem0: building production-ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413.Note:ECAI 2025Cited by:§2.
[10]X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su(2023)Mind2Web: towards a generalist agent for the web.InAdvances in Neural Information Processing Systems (NeurIPS),Cited by:§2.
[11]E. Feng, W. Zhou, Z. Liu, L. Chen, Y. Dong, C. Zhang, Y. Zhao, D. Du, Z. Hua, Y. Xia, and H. Chen(2025)Get experience from practice: LLM agents with record & replay.arXiv preprint arXiv:2505.17716.Cited by:§1,§2.
[12]L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig(2023)PAL: program-aided language models.InProceedings of the 40th International Conference on Machine Learning (ICML),Note:arXiv:2211.10435Cited by:§2.
[13]I. Gur, H. Furuta, A. Huang, M. Safdari, Y. Matsuo, D. Eck, and A. Faust(2024)A real-world webagent with planning, long context understanding, and program synthesis.InInternational Conference on Learning Representations (ICLR),Cited by:§2.
[14]D. Ha and J. Schmidhuber(2018)World models.arXiv preprint arXiv:1803.10122.Cited by:§2.
[15]D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap(2023)Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104.Cited by:§2.
[16]S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu(2023)Reasoning with language model is planning with world model.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP),Cited by:§2.
[17]H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, and D. Yu(2024)WebVoyager: building an end-to-end web agent with large multimodal models.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL),Cited by:§2.
[18]W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y. Wang, Z. Wang, Y. Dong, M. Ding, and J. Tang(2024)CogAgent: a visual language model for gui agents.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),Cited by:§2.
[19]C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan(2024)SWE-bench: can language models resolve real-world github issues?.InInternational Conference on Learning Representations (ICLR),Cited by:§2.
[20]J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. C. Lim, P. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried(2024)VisualWebArena: evaluating multimodal agents on realistic visual web tasks.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL),Cited by:§2.
[21]P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela(2020)Retrieval-augmented generation for knowledge-intensive nlp tasks.InAdvances in Neural Information Processing Systems (NeurIPS),Note:arXiv:2005.11401Cited by:§2.
[22]B. Li(2026)User as code: executable memory for personalized agents.arXiv preprint arXiv:2606.16707.Cited by:§2.
[23]C. Li, J. Liang, A. Zeng, X. Chen, K. Hausman, D. Sadigh, S. Levine, L. Fei-Fei, F. Xia, and B. Ichter(2024)Chain of code: reasoning with a language model-augmented code emulator.InProceedings of the 41st International Conference on Machine Learning (ICML),Note:arXiv:2312.04474Cited by:§2.
[24]J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng(2023)Code as policies: language model programs for embodied control.InIEEE International Conference on Robotics and Automation (ICRA),Note:arXiv:2209.07753Cited by:§2.
[25]X. Liang, B. Cao, S. Gu, Y. Li,et al.(2024)SAGE: self-evolving agents with reflective and memory-augmented abilities.arXiv preprint arXiv:2409.00872.Cited by:§1,§2.
[26]Y. Lu, S. Yang, C. Qian, G. Chen, Q. Luo, Y. Wu, H. Wang, X. Cong, Z. Zhang, Y. Lin, W. Liu, Y. Wang, Z. Liu, F. Liu, and M. Sun(2024)Proactive agent: shifting llm agents from reactive responses to active assistance.arXiv preprint arXiv:2410.12361.Cited by:§2.
[27]A. Madaan, S. Zhou, U. Alon, Y. Yang, and G. Neubig(2022)Language models of code are few-shot commonsense learners.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP),Note:arXiv:2210.07128Cited by:§2.
[28]Q. Mi, Z. Ma, M. Yang, H. Li, Y. Wang, H. Zhang, and J. Wang(2026)Skill-pro: learning reusable skills from experience via non-parametric ppo for llm agents.arXiv preprint arXiv:2602.01869.Note:ICML 2026Cited by:§2.
[29]G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom(2024)GAIA: a benchmark for general ai assistants.InInternational Conference on Learning Representations (ICLR),Cited by:§2.
[30]R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders,et al.(2021)WebGPT: browser-assisted question-answering with human feedback.arXiv preprint arXiv:2112.09332.Cited by:§2.
[31]R. Niu, J. Li, S. Wang, Y. Fu, X. Hu, X. Leng, H. Kong, Y. Chang, and Q. Wang(2024)ScreenAgent: a vision language model-driven computer control agent.InProceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI),Cited by:§2.
[32]C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez(2023)MemGPT: towards llms as operating systems.arXiv preprint arXiv:2310.08560.Cited by:§2.
[33]J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein(2023)Generative agents: interactive simulacra of human behavior.InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST),External Links:DocumentCited by:§2.
[34]Pig.dev(2025)Muscle-Mem: a cache for AI agents to learn and replay complex behaviors.Note:https://github.com/pig-dot-dev/muscle-memCited by:§1,§2,§4.4.
[35]Y. Qin, Y. Ye, J. Fang, H. Wang, S. Liang, S. Tian, J. Zhang, J. Li, Y. Li, S. Huang,et al.(2025)UI-TARS: pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326.Cited by:§2.
[36]C. Rawles, S. Clinckemaillie, Y. Chang, J. Waltz, G. Lau, M. Fair, A. Li, W. Bishop, W. Li, F. Campbell-Ajala,et al.(2025)AndroidWorld: a dynamic benchmarking environment for autonomous agents.InInternational Conference on Learning Representations (ICLR),External Links:2405.14573Cited by:§1,§2,§3.1,§4.1.
[37]C. Rawles, A. Li, D. Rodriguez, O. Riva, and T. Lillicrap(2023)Android in the wild: a large-scale dataset for android device control.InAdvances in Neural Information Processing Systems (NeurIPS),Cited by:§2.
[38]T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom(2023)Toolformer: language models can teach themselves to use tools.InAdvances in Neural Information Processing Systems (NeurIPS),Cited by:§2.
[39]N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao(2023)Reflexion: language agents with verbal reinforcement learning.InAdvances in Neural Information Processing Systems (NeurIPS),Note:arXiv:2303.11366Cited by:§2.
[40]Significant Gravitas(2023)Auto-gpt: an autonomous gpt-4 experiment.Note:https://github.com/Significant-Gravitas/AutoGPTCited by:§2.
[41]T. R. Sumers, S. Yao, K. Narasimhan, and T. L. Griffiths(2024)Cognitive architectures for language agents.Transactions on Machine Learning Research (TMLR).Cited by:§2.
[42]R. S. Sutton and A. G. Barto(2018)Reinforcement learning: an introduction.2nd edition,MIT Press.Cited by:§5.2.
[43]R. S. Sutton(1988)Learning to predict by the methods of temporal differences.Machine Learning3(1),pp. 9–44.Cited by:§5.2.
[44]R. Sutton(2019)The bitter lesson.Note:http://www.incompleteideas.net/IncIdeas/BitterLesson.htmlBlog post, March 13, 2019Cited by:§2.
[45]G. Trooskens, A. Karlsberg, A. Sharma, L. De Brouwer, M. Van Puyvelde, M. Young, J. Thickstun, and G. Alterovitz(2026)Compiled AI: deterministic code generation for LLM-based workflow automation.arXiv preprint arXiv:2604.05150.Cited by:§1,§2.
[46]W. M. P. van der Aalst, M. Bichler, and A. Heinzl(2018)Robotic process automation.Business & Information Systems Engineering60(4),pp. 269–272.Cited by:§2.
[47]G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar(2024)Voyager: an open-ended embodied agent with large language models.Transactions on Machine Learning Research (TMLR).Note:arXiv:2305.16291Cited by:§2.
[48]J. Wang, H. Xu, J. Ye, M. Yan, W. Shen, J. Zhang, F. Huang, and J. Sang(2024)Mobile-agent: autonomous multi-modal mobile device agent with visual perception.arXiv preprint arXiv:2401.16158.Cited by:§2.
[49]X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji(2024)Executable code actions elicit better llm agents.InProceedings of the 41st International Conference on Machine Learning (ICML),Note:arXiv:2402.01030Cited by:§2.
[50]X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou(2023)Self-consistency improves chain of thought reasoning in language models.InInternational Conference on Learning Representations (ICLR),Cited by:§2.
[51]Z. Wang, D. Fried, and G. Neubig(2024)TroVE: inducing verifiable and efficient toolboxes for solving programmatic tasks.arXiv preprint arXiv:2401.12869.Cited by:§2.
[52]J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou(2022)Chain-of-thought prompting elicits reasoning in large language models.InAdvances in Neural Information Processing Systems (NeurIPS),Cited by:§2.
[53]Z. Wu, C. Han, Z. Ding, Z. Weng, Z. Liu, S. Yao, T. Yu, and L. Kong(2024)OS-copilot: towards generalist computer agents with self-improvement.arXiv preprint arXiv:2402.07456.Cited by:§2.
[54]T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei,et al.(2024)OSWorld: benchmarking multimodal agents for open-ended tasks in real computer environments.InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track,External Links:2404.07972Cited by:§2,§4.1.
[55]W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang(2025)A-mem: agentic memory for llm agents.arXiv preprint arXiv:2502.12110.Note:NeurIPS 2025Cited by:§2.
[56]J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press(2024)SWE-agent: agent-computer interfaces enable automated software engineering.InAdvances in Neural Information Processing Systems (NeurIPS),Cited by:§2.
[57]S. Yao, H. Chen, J. Yang, and K. Narasimhan(2022)WebShop: towards scalable real-world web interaction with grounded language agents.InAdvances in Neural Information Processing Systems (NeurIPS),Cited by:§2.
[58]S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan(2023)Tree of thoughts: deliberate problem solving with large language models.InAdvances in Neural Information Processing Systems (NeurIPS),Cited by:§2.
[59]S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao(2023)ReAct: synergizing reasoning and acting in language models.InInternational Conference on Learning Representations (ICLR),External Links:2210.03629Cited by:§1,§2.
[60]C. Zhang, L. Li, S. He, X. Zhang, B. Qiao, S. Qin, M. Ma, Y. Kang, Q. Lin, S. Rajmohan, D. Zhang, and Q. Zhang(2024)UFO: a ui-focused agent for windows os interaction.arXiv preprint arXiv:2402.07939.Cited by:§2.
[61]C. Zhang, Z. Yang, J. Liu, Y. Han, X. Chen, Z. Huang, B. Fu, and G. Yu(2023)AppAgent: multimodal agents as smartphone users.arXiv preprint arXiv:2312.13771.Cited by:§2.
[62]Z. Zhang, X. Bo, C. Ma, R. Li, X. Chen, Q. Dai, J. Zhu, Z. Dong, and J. Wen(2024)A survey on the memory mechanism of large language model based agents.arXiv preprint arXiv:2404.13501.Cited by:§2.
[63]A. Zhao, D. Huang, Q. Xu, M. Lin, Y. Liu, and G. Huang(2024)ExpeL: llm agents are experiential learners.InProceedings of the AAAI Conference on Artificial Intelligence,Cited by:§2.
[64]B. Zheng, B. Gou, J. Kil, H. Sun, and Y. Su(2024)GPT-4v(ision) is a generalist web agent, if grounded.InProceedings of the 41st International Conference on Machine Learning (ICML),Cited by:§2.
[65]H. Zhong, F. Faisal, L. França, T. Leesatapornwongsa, A. Szekeres, K. Rong, and S. Nath(2026)ActionEngine: from reactive to programmatic GUI agents via state machine memory.arXiv preprint arXiv:2602.20502.Cited by:§1,§2.
[66]S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig(2024)WebArena: a realistic web environment for building autonomous agents.InInternational Conference on Learning Representations (ICLR),External Links:2307.13854Cited by:§2,§4.1.

Appendix A. Computer-Action Schema

Each transition’sactionfield must be one of the action types in Table6. Action types are translated to platform-specific backends at execution time:action_clickon Android maps to aJSONAction-formatted touch with the resolved element’s(x,y)(x,y)coordinates; on OSWorld it maps topyautogui.click(x, y). The cross-platform schema enables the same compiled state-machine program to run on either backend, although in practice each task’s compile is platform-specific because the verification predicates use platform-specific selector dialects (Android XPath vs. pyautogui screenshot regions).

Table 6:Computer action types in PreAct’s program schema. Each transition uses exactly one type.Thedescriptionfield on each transition (one short sentence) is consumed by the agentic Program Selector (§3) when judging whether a stored program applies to the current task. Selectors are written to be parameter-aware (e.g. “type the filename (parameterfilename)” rather than “type text”), since the program’sparameterslist is what the LLM uses to bind the program’s slots to the new task’s instance values at retrieval time.

Appendix B. Per-Seed and Per-Task Multi-Seed Data

This appendix gives the per-seed and per-rep detail behind the aggregate figures in §4.2(Figure7) and §4.3(Table2, Figure9). The verify-gate ablation is broken out per platform in Table8(Android), Table9(OSWorld), and Table10(WebArena).

Table 7:Cold→\towarm monotonicity, AndroidWorld official-15,n=3n{=}3Gemini seeds with corpus reset per seed (aggregated in Figure7). All three seeds refine monotonically. “Mode-shift” counts tasks that move from fresh-agent solving (cua) on cold to direct replay (rpa) or replay-then-fallback (hybrid) on warm, out of the tasks that clear setup.Table 8:Verify-gate ablation, AndroidWorld official-15,n=5n{=}5seeds with corpus reset per seed (Android row of Table2). All five gate-ON pairs are monotonic; all five gate-OFF pairs regress (zero inversions; paired sign testp≈0.031p\approx 0.031).SeedCold ONWarm ONΔ\DeltaONCold OFFWarm OFFΔ\DeltaOFF421011+11110−1-1100911+21110−1-11337910+1109−1-120241112+11110−1-177771011+1129−3-3Mean9.811.0+1.2±0.45+1.2\pm 0.4511.09.6−1.4±0.89-1.4\pm 0.89Table 9:Verify-gate ablation, OSWorld test_tiny,n=5n{=}5reps, Claude Sonnet 4.6 (OSWorld row of Table2). All five gate-OFF reps regress by≥2\geq 2tasks; paired sign testp≈0.031p\approx 0.031.Table 10:Verify-gate ablation, WebArena shopping_admin 12-task subset,n=4+4n{=}4{+}4reps, Claude Sonnet 4.6 CUA + Claude compile, gate-ON without the cache-miss fallback (WebArena row of Table2; SR out of 12 tasks). Every gate-OFF warm run collapses to0(all 48 warm replays score0); the gate-on column loses fewer tasks on every non-tied rep. Pairing rep-for-rep, gate-ON beats gate-OFF on reps 1–3 and ties on rep 4 (3 non-tied pairs, sign testp=0.125p=0.125).The gate-on warm SR is bimodal across reps (4,4,0,04,4,0,0): on reps 1–2 a compiled program survives the gate and serves a partial replay, whereas on reps 3–4 the gate empties the corpus and, with no cache-miss fallback in this condition, warm SR drops to0. This is the WebArena-specific pattern §4.4resolves by adding the cache-miss-to-CUA fallback. Even so, gate-on loses fewer tasks than gate-off in every non-tied rep, matching the cross-platform direction.

Table11reports per-task warm-pass results across the three Gemini 3 Flash seeds of the cold→\towarm monotonicity experiment (§4.2, Table7; seed totals 11/11/10 match that table exactly). The stable failure set (BrowserDraw, SystemBrightnessMax, SystemWifiTurnOn) fails across all three seeds; the remaining cross-seed variation falls on three sampling-sensitive tasks (BrowserMaze, CameraTakeVideo, ContactsAddContact), where the action sequence is sensitive to LLM sampling.

Table 11:Per-task warm-pass results across the three Gemini 3 Flash seeds on AndroidWorld official-15 (cold→\towarm monotonicity experiment, §4.2). Seed totals (11/11/10) match Table7.†Live evaluator passes but the verify-before-store gate rejected the task’s recompiled program on that seed (it failed re-verification); the warm run still passed via RPA/hybrid/CUA.

The complete per-task results across all 32 Android multi-seed runs and 8 OSWorld runs are also included in the project’s released artifacts.

Appendix C. Container Patch

Our container patch adds a state-reporting endpoint to the AndroidWorld image that returns a base64-encoded screenshot plus the serialized accessibility-tree elements on demand, sidestepping the upstream populated-AVD a11y-gRPC initialization wedge.

Appendix D. Out-of-Distribution Generalization (Details)

To test whether the corpus transfers beyond seen tasks, we built a corpus from the OSWorld test_tiny set (6 tasks across Chrome, LibreOffice Calc, LibreOffice Writer; 4 verify-gate-stored programs after compile filtering) and ran on the 33 OSWorld test_small tasks that arenotin test_tiny, spanning 9 domains (chrome×\times2, gimp×\times2, libreoffice_calc×\times1, libreoffice_impress×\times2, multi_apps×\times17, os×\times2, thunderbird×\times2, vlc×\times2, vs_code×\times3). The agentic Program Selector retrieves from the test_tiny corpus when the new task’s description plausibly matches.

Result atn=3n{=}3reps (corpus copy reset per rep): warm-OOD21/3021/30,15/3015/30,14/30=14/30=mean16.7/30=55.6%±12.016.7/30=55.6\%\pm 12.0. The cold-OOD baseline on the same 30 OOD tasks is a single (n=1n{=}1) pass at approximately20/30=66.7%20/30=66.7\%, so warm-OOD is roughly1111percentage pointsbelowcold-OOD (the cold side is unpaired and not replicated, so the magnitude is indicative).

The corpus does not transfer cleanly across task families. Inspecting the rep logs: the selector retrieves a near-miss program (a test_tiny Chrome history-clean program retrieved for a Thunderbird email task whose intent shares lexical surface with “clean”) on∼6/30\sim 6/30queries per rep on average; the source-domain XPath/state predicates then fail on the OOD page, consuming replay budget that the harness can no longer recover within the per-task wall budget. The remaining∼24/30\sim 24/30OOD queries are handled correctly by the selector returning “no candidate”—these run as pure CUA and approximately match cold-OOD on their own—so the aggregate1111pp regression is concentrated in the small fraction of mis-routed retrievals.*The corpus is not neutral on OOD; it is mildly harmful.*This sharpens the scope of the self-extending corpus claim: corpora built on small in-distribution task subsets should not be deployed cross-domain without retraining the selector or extending the corpus to cover the new domain.

Appendix E. Limitations (Extended)

(L1) Architectural baseline.We ran an ActionEngine-style flat-script baseline atn=5n{=}5Android seeds (Table14) by disabling per-state verification so that stored programs run as flat scripts. Flat-script mean (warm) is10.6±1.1410.6\pm 1.14across seeds 42/100/1337/7/2025, vs. state-machine11.67±0.5811.67\pm 0.58on the matched 3 seeds; the matched-seedΔ=+0.67\Delta=+0.67tasks in favor of state-machine, sign-testp=0.125p=0.125onn=3n{=}3paired comparisons (suggestive but not significant). The architectural commitment matters but is a smaller effect than the verify-gate’s 1.75–2.6-task diff-of-deltas, consistent with the paper’s framing that the gate is the bigger empirical contribution. We did not run an untargeted-exploration baseline. The Gemini-vs-Claude compile-LLM concern previously flagged here is now closed atn=3n{=}3(§4.5).

(L2) Benchmark coverage.OSWorld test_tiny has only 6 tasks, AndroidWorld official-15 has 15, WebArena shopping_admin subset has 12. While the verify-gate effect replicates cross-platform, the benchmarks themselves are a small sample of computer-using tasks. We acknowledge platform-coverage bias as out-of-scope for this paper. The relative percentage-point magnitudes are influenced by the denominator: 43 pp on OSWorld’s 6-task subset is mathematically larger than 17 pp on Android’s 15-task subset and 15 pp on WebArena’s 12-task subset, for absolute task counts that fall in a similar range (1.75–2.6 tasks).

(L3) Compile cost.The verify-before-store gate requires a full re-replay and re-evaluation of every compiled program. From timing measurements across our experiments (60 OSWorld tasks; 60 Android tasks under gate-ON), the verify-replay phase adds median 141 s on OSWorld and 76 s on Android per stored task, on top of the original CUA execution (median 65 s OSWorld; 47 s Android). This corresponds to+217% wall overhead on OSWorld and +162% on Android per successful CUA-then-compile cycle(verify-replay is a near-fixed cost roughly comparable to or larger than the original CUA execution on both platforms; it is proportionally largest on OSWorld, where both the absolute verify-replay time and the base CUA time are highest). The cost is amortized on warm runs where the corpus retrieves already-verified programs in seconds, but the compile-time overhead is real and may matter for high-throughput deployments.

(L4) Compile-fidelity rate.A separate audit of stored programs across platforms (Android: 58 programs; OSWorld: 16; WebArena: 7) found two recurring compile-fidelity risk patterns: missingnavigate_backon nav-heavy edit/delete tasks (affecting roughly 28% of Android programs), and dynamicinspect_text/inspect_screenshotreturns on pages whose state evolves between Run 1 and replay (100% of audited WebArena programs useinspect_screenshot, fully explaining the 48/48 smoking-gun rate observed under gate-OFF). The compiler itself is an LLM that produces lossy state-machines on a non-trivial fraction of inputs. The verify-gate filters this at storage time, but as the corpus scales, the gate’s verify-replay cost (proportional to compile attempts, not just stored programs) may become a bottleneck. Figure14breaks down the Android audit.

Refer to caption Figure 14:Compile-fidelity audit of the 58-program Android corpus. Roughly a quarter (16 programs) are lossy-risk—nav-heavy edit/move/delete tasks missing anavigate_back—which is the rate the gate must catch. Manual inter-classifier agreement was 5/5 per bucket.(L5) LLM nondeterminism in selector.The agentic Program Selector is itself an LLM call. We measured selector exact-id-match accuracy of 22/30 = 73% on the corpus’s stored programs, prompted with each program’s own task description verbatim. Inspecting the 8 misses, all were picks ofsemantically equivalentprograms (the same task, but a different stored copy accumulated across reps under different dedup signatures). Functional accuracy (correct task family selected) is therefore close to 100%, with the 73% exact-id figure reflecting the corpus’s accumulation of multiple matches per task.

To stress-test the selector beyond verbatim retrieval, we ran a paraphrase-robustness audit: for each of 15 stored Android programs, we generated 3 LLM-rewritten paraphrases (45 paraphrased queries total) and measured whether the selector still picked the original program. Result: 21/45 = 47% exact-id, 13/45 = 29% same-task-different-id (paraphrase picked an equivalent program), 11/45 = 24% no-pick (selector returned “no candidate”), and0/45 = 0% wrong-task-family(selector never picked an unrelated program).Functional accuracy(exact + equivalent) is 75.6% (34/45); the no-pick fraction is conservative behavior (the selector errs toward CUA fallback rather than retrieving an inapplicable program). At 75.6% functional retrieval and 0% mis-retrieval, the selector is conservative-by-design rather than over-eager.

Appendix F. Negative-Findings Tables

This appendix provides the per-seed tables for the negative findings summarized in §4.6, including the full selector-backend threshold sweep (Table12) behind Figure12.

Table 12:Selector backend ablation (full sweep behind Figure12). “Functional accuracy” is the fraction of 45 paraphrased Android queries retrieving a program in the correct task family; “No-pick” is the abstention rate on those queries; “False-pick” is the fraction of 15 unrelated-domain queries that wrongly retrieve any program. The wrong-task-family rate on paraphrases is0%0\%at every operating point. Bold rows mark each backbone’s optimalτ\tau.The interpretation: the hypothesis we entered with—“embedding-RAG is inferior because discrimination is a reasoning problem”—does not survive. With either backbone tuned to its safe operating point, embedding retrieval matches or beats the agentic selector at zero wrong-task and zero false-pick; bge-large holds100%100\%over a widerτ\tauwindow (0.650.65–0.850.85) than MiniLM (0.500.50–0.650.65), which matters as the corpus grows and intra-cluster similarity rises. We retain the agentic selector only for its interpretable logs and tuning-free operation, and flag that selectors should be re-tested at103+10^{3}{+}programs where these properties become consequential.

Refer to caption Figure 15:Runtime guardrails are aggregate-neutral. Cold means are identical with and without guardrails (10.2=10.210.2=10.2); the warm gap (+0.4+0.4) is well within seed variance. Detail in Table13.Table 13:Code-level guardrails ablation, AndroidWorld official-15,n=5n{=}5. Cold means areidentical(10.2=10.210.2=10.2). WarmΔ\Delta=+0.4+0.4is well within the standard deviations on each side.SeedCold ONWarm ONΔ\DeltaONCold OFFWarm OFFΔ\DeltaOFF421011+11012+2100111101110−1-11337911+21010020241211−1-1101007777911+21011+1Mean10.211.0+0.8±1.30+0.8\pm 1.3010.210.6+0.4±1.14+0.4\pm 1.14 Refer to caption Figure 16:State-machine vs. flat-script runtime (warm SR, AndroidWorld official-15). The verified graph form helps by a small, consistent margin (+0.67+0.67tasks, all three matched seeds non-regressive,p=0.125p=0.125)—real but far smaller than the gate’s1.751.75–2.62.6tasks.Table 14:Architectural ablation: state-machine vs. flat-script runtime, AndroidWorld official-15 warm SR. State-machinen=3n{=}3(matched seeds), flat-scriptn=5n{=}5(matched seeds + 7 + 2025).

Appendix G. Smoking-Gun Reproducibility (Per-Program)

The five smoking-gun cov==100%/score==0 cases (Table3) are not equally reproducible across reps. Table15reports the per-program reproducibility across the 5 OSWorld warm-OFF reps.

Table 15:Smoking-gun reproducibility across 5 OSWorld warm-OFF reps. Each row: how many of 5 reps replayed the program at cov==100% and got score==0.The mechanism is partially deterministic. For the most reliable case (the Calc formula program), the lossy program reproducibly fails at cov==100% on every warm rep. For the Calc chart program, hybrid-mode rescue occasionally salvaged the run via CUA fallback after partial replay (cov==8%). For the Chrome history program, the cold-stored program’s mechanical action sequence happened to match the evaluator’s required browser state on 3 of 5 reps—a partial-determinism that traces to environmental factors (Chrome’s session state, browsing history initialization) varying across environment resets even with identical task IDs.

Appendix H. Validity-Threat Closure Map

Table16maps each validity threat—the original nine from our pre-experiment validation plan, plus three added since the May 2026 revision—to its closure status and the experiment that closes it. Eleven of the twelve are in scope and closed, most with multi-seed runs and the remainder at the rigor noted in the Evidence column (e.g., codebase inspection for threat 4, manual inter-rater agreement for threat 6,n=3n{=}3for threats 11–12); threat 8 (platform coverage) is out of scope as it would require net-new benchmarks.

Table 16:Validity threats and closure status. Eleven in-scope threats closed at the rigor noted in the Evidence column (most multi-seed); threat 8 is out of scope.