# ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
Source: [https://arxiv.org/html/2605.14133](https://arxiv.org/html/2605.14133)
Peng Xia, Haonian Ji, Kaiwen Xiong, Kaide Zeng, Jiaqi Liu, Fang Wu, Jike Zhong, Zeyu Zheng, Cihang Xie, Huaxiu Yao
###### Abstract
Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts. We present ClawForge, a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching. We instantiate this framework as the ClawForge-Bench (17 scenarios, 6 ability categories). Results across seven frontier models show that the best model reaches only 45.3% strict accuracy, wrong-state replacement remains below 17% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that many failures are near-miss closures rather than early breakdowns, and that models exhibit qualitatively different failure styles under state conflict. Code and Data: [https://github.com/aiming-lab/ClawForge](https://github.com/aiming-lab/ClawForge)
## 1 Introduction

As Large Language Model (LLM) agents (Yao et al., [2022](https://arxiv.org/html/2605.14133#bib.bib40); Shinn et al., [2023](https://arxiv.org/html/2605.14133#bib.bib107)) move from one-shot prompting to persistent software workflows, benchmark construction becomes a bottleneck in its own right. A realistic workflow task is not just an instruction paired with an answer: it requires initialized state, valid and invalid pre-existing artifacts, expected side effects, and executable validation logic. These ingredients must also be reproducible and auditable, because small changes in state can turn an apparently equivalent task into a different decision problem. Hand-authored interactive tasks are therefore expensive to scale and difficult to revise once coverage or fairness issues emerge (Zheng et al., [2023](https://arxiv.org/html/2605.14133#bib.bib12); Chang et al., [2024](https://arxiv.org/html/2605.14133#bib.bib13)).

This problem is especially acute for command-line agents (OpenClaw, [2026](https://arxiv.org/html/2605.14133#bib.bib30); Merrill et al., [2026](https://arxiv.org/html/2605.14133#bib.bib8); Nous Research, [2026](https://arxiv.org/html/2605.14133#bib.bib9)). Real workflows require agents to inspect task boards, inbox threads, calendars, files, runtime configuration, weather state, and messaging surfaces before deciding whether to add, preserve, repair, or replace state. Many failures only appear after execution: an agent may create a duplicate task, leave stale state unresolved, choose the wrong branch under conflicting evidence, or stop before the final side effect is produced. Existing interactive benchmarks have advanced agent evaluation significantly (Qin et al., [2023](https://arxiv.org/html/2605.14133#bib.bib14); Liu et al., [2023](https://arxiv.org/html/2605.14133#bib.bib18); Zhou et al., [2023](https://arxiv.org/html/2605.14133#bib.bib20)), but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts (Ye et al., [2026](https://arxiv.org/html/2605.14133#bib.bib7); Ji et al., [2026](https://arxiv.org/html/2605.14133#bib.bib11); Li et al., [2026a](https://arxiv.org/html/2605.14133#bib.bib10)). This means a central part of workflow competence remains unmeasured: correctness that depends on the final state of a live environment, not only on a plausible textual answer or command sequence.

Figure 1: ClawForge-Bench benchmark coverage. Inner: 6 primary ability categories. Outer: 17 scenario families within each category.

To address these challenges, we present ClawForge, a generator-backed benchmark framework for executable state-conflict workflows. ClawForge compiles scenario templates, grounded variables, seeded environment state, reference trajectories, validators, and metadata into executable task specifications that can be regenerated, audited, and systematically extended. The execution environment then evaluates agents step by step through a CLI-style interface, scoring normalized workflow state and observable side effects rather than exact command imitation. Automatic generation is therefore not merely a scalable way to produce more tasks, but a mechanism for maintaining reproducibility, extensibility, and evaluation consistency in interactive benchmarks. We instantiate this framework as *ClawForge-Bench*, a suite of 17 scenarios organized into six ability categories ([Figure 1](https://arxiv.org/html/2605.14133#S1.F1)) that begin from partial, stale, or conflicting workflow state and target realistic failure modes, including duplicate-aware completion, stale-state repair, wrong-state replacement, multi-source branch resolution, and full workflow closure.

In summary, our primary contribution is ClawForge, a generator-backed benchmark framework for evaluating command-line agents under state-conflict workflows through automated task construction, stateful execution, and result-first evaluation. On ClawForge-Bench (17 scenarios spanning 6 ability categories), evaluations across seven frontier models show that the benchmark remains far from saturated: the best model achieves only 45.3% strict accuracy, wrong-state replacement stays below 17% for all models, and Interrupted Workflow Resume exhibits the largest model gap (17%–90%), largely depending on whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that many failures arise from near-miss workflow closure rather than early breakdowns, and that models exhibit qualitatively different failure behaviors under state conflict.
## 2 Task Generation and Execution

ClawForge is a generator-backed benchmark system rather than a prompt collection. The central design principle is that each generated task is an executable specification: the instruction, initialized state, reference trajectory, validators, and metadata are produced together and then evaluated through the same stateful runtime. This section describes how tasks are generated (§[2.1](https://arxiv.org/html/2605.14133#S2.SS1)) and executed (§[2.2](https://arxiv.org/html/2605.14133#S2.SS2)); [Section 3](https://arxiv.org/html/2605.14133#S3) defines the evaluation protocol.

### 2.1 Automated Benchmark Generation

Each task is generated from a scenario template together with grounded slots such as city, timezone, recipient, topic, due date, or calendar start time. This yields surface diversity while ensuring that every generated task maps back to a known scenario and can be evaluated by the same validators. As shown in [Figure 2](https://arxiv.org/html/2605.14133#S2.F2), the generation pipeline proceeds in stages: the scenario template is first grounded with concrete slot values, then a state mode (mock or real) is selected and the initial environment state is instantiated accordingly, followed by instruction rendering. The pipeline then synthesizes reference commands $C^{\star}$ by executing the intended workflow against the initialized state, generates validators that check required state transitions and side effects, and exports structured metadata (scenario family, primary ability, prompt style). The final output is a complete executable task object.

Figure 2: Automated benchmark generation pipeline. Scenario templates are compiled into executable task specifications $\tau = (x, S_0, C^{\star}, \mathcal{E}, m)$ through slot grounding, state initialization, instruction rendering, reference command synthesis, and validator generation.

A generated task is an executable specification with five coupled components:

$$\tau = (x, S_0, C^{\star}, \mathcal{E}, m), \tag{1}$$

where $x$ is the instruction, $S_0$ the initialized state, $C^{\star}$ a reference command trajectory, $\mathcal{E}$ the executable checks, and $m$ the structured metadata. Each $\tau$ is instantiated from a scenario family $\sigma$, a grounded slot assignment $z$, and a prompt policy $p$. The key design choice is that all five components are generated together from the same scenario specification, so the benchmark object is an executable workflow task rather than a prompt paired with an offline answer key.
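To make the shape of these task objects concrete, the sketch below shows one plausible Python rendering of Eq. (1). All names here (`TaskSpec`, `generate_task`, `make_check`) and the template fields are illustrative assumptions for exposition, not the released ClawForge API.

```python
# Minimal illustrative sketch of a generated task specification tau = (x, S0, C*, E, m).
# TaskSpec, generate_task, and make_check are hypothetical names, not ClawForge's API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TaskSpec:
    instruction: str                        # x: natural-language instruction
    initial_state: dict                     # S0: seeded workflow state
    reference_commands: list[str]           # C*: reference trajectory
    checks: list[Callable[[dict], float]]   # E: executable validators over S-hat
    metadata: dict = field(default_factory=dict)  # m: scenario family, ability, style

def make_check(spec: str, slots: dict) -> Callable[[dict], float]:
    """Turn a templated check spec into an executable validator."""
    wanted = spec.format(**slots)
    return lambda state: 1.0 if wanted in state.get("artifacts", []) else 0.0

def generate_task(template: dict, slots: dict) -> TaskSpec:
    """Compile one scenario template plus grounded slots into an executable task."""
    instruction = template["instruction"].format(**slots)
    state = {k: v.format(**slots) if isinstance(v, str) else v
             for k, v in template["initial_state"].items()}
    commands = [c.format(**slots) for c in template["reference_commands"]]
    checks = [make_check(spec, slots) for spec in template["check_specs"]]
    return TaskSpec(instruction, state, commands, checks,
                    metadata={"scenario": template["name"], **slots})
```

The key property the sketch preserves is that instruction, state, trajectory, and validators are all rendered from the same grounded slot assignment, so the evaluator always shares semantics with the task it grades.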
Because the generator is parameterized by scenario family, different scenarios can encode fundamentally different decision structures rather than only surface wording variations. Some are gap-completion tasks, some require explicit repair or replacement of stale state, and others require branch selection across several information sources. This structural diversity lets the benchmark separate state conflict, closure efficiency, and multi-source decision failures rather than measuring only surface command following ([Section 4](https://arxiv.org/html/2605.14133#S4)). Once generated, each task is executed through the interactive environment described next.
### 2.2 Interactive Environment

Generated tasks are executed in a stateful environment rather than graded as static prompts. ClawForge exposes a CLI-style interface over workflow surfaces such as tasks, calendar, email, messaging, files, runtime configuration, weather, and recurring checks. An episode begins with a natural-language instruction. At each step, the agent emits one command, the environment executes it, and the next observation is built from the resulting outputs and updated state. This execution-and-evaluation loop is summarized in [Figure 3](https://arxiv.org/html/2605.14133#S2.F3), which makes explicit that execution and result-first grading happen inside one coupled episode rather than as separate offline stages.

This loop is intentionally stateful: many failures only become visible after execution. An agent may recreate an existing object, leave stale state untouched, update the wrong entity, or stop before the final side effect is produced. ClawForge therefore treats execution state as a first-class object in both rollout and evaluation (see Appendix [B.1](https://arxiv.org/html/2605.14133#A2.SS1) for implementation details).

Figure 3: Interactive execution and evaluation loop. Agents emit commands step by step; the environment executes them, records state changes and effect traces, and merges everything into a normalized evaluation state $\hat{S}$ for result-first scoring.
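The step contract of this loop can be pictured as a small stateful class. The following is a minimal sketch under assumed names (`WorkflowEnv`, a toy `tasks add` handler); the actual runtime routes a much richer command set across surfaces, as described in Appendix A.3.

```python
# Illustrative stateful environment loop (hypothetical names, not the ClawForge runtime).
class WorkflowEnv:
    def __init__(self, task):
        self.state = dict(task.initial_state)   # persistent workflow surfaces
        self.effects = []                        # observable side effects
        self.history = []                        # executed commands

    def step(self, command: str) -> dict:
        """Execute one command, record its effects, and return the next observation."""
        stdout, exit_code, effects = self.execute(command)
        self.history.append(command)
        self.effects.extend(effects)             # e.g., created task, sent email
        return {"stdout": stdout, "exit_code": exit_code, "state": self.state}

    def execute(self, command: str):
        # Placeholder dispatch; the real system routes command families to
        # per-surface backends (tasks, calendar, email, files, ...).
        if command.startswith("tasks add "):
            title = command[len("tasks add "):]
            self.state.setdefault("tasks", []).append(title)
            return f"added: {title}", 0, [("task_created", title)]
        return "unknown command", 1, []
```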
## 3 Evaluation Protocol
Algorithm 1: ClawForge episode rollout

1: Load task $T$ and copy base state $S_0$ into an isolated state directory
2: Apply task-specific state overrides
3: Initialize routed backend $B$ and evaluator $E$
4: Initialize observation $o_0 \leftarrow$ instruction, config, gateway status, command hints
5: for $t = 1, \dots, H$ do
6: &nbsp;&nbsp; if the agent requests stop or emits DONE then break
7: &nbsp;&nbsp; Agent emits one command $a_t$
8: &nbsp;&nbsp; Execute $a_t$ through routed backend $B$
9: &nbsp;&nbsp; Record stdout/stderr, exit code, command metadata, and inferred effects
10: &nbsp;&nbsp; Construct next observation $o_t$
11: &nbsp;&nbsp; if $a_t \in \{\texttt{done}, \texttt{exit}, \texttt{quit}\}$, or $t = H$, or a rollout stopping rule triggers, then break
12: end for
13: Build normalized evaluation state $\hat{S}$ from command history, effects, latest outputs, config, and merged backend state
14: Return final result $E(\hat{S}, T)$
ClawForge evaluates tasks functionally rather than by comparing surface-form trajectories. The evaluator operates over normalized execution state and explicit effect traces, so multiple trajectories can pass as long as they produce the required state transitions and observable side effects. Because tasks are generated together with their validators, evaluation stays tied to the same scenario semantics that produced the task. This result-first design lets us distinguish between early breakdowns and structured near misses, such as preserving the wrong object, omitting one final side effect, or stopping after a partially correct repair.
### 3.1 Episode Rollout

Each episode is instantiated from a generated task containing an instruction, initialized state, reference trajectory, executable checks, and structured metadata. On reset, the environment materializes task state, applies any task-specific state overrides, initializes the selected backend, and builds the evaluator. The agent receives the instruction once and then emits exactly one command at each subsequent step. The environment executes that command, records state-changing effects, and returns the next observation. Algorithm [1](https://arxiv.org/html/2605.14133#alg1) summarizes this interaction. When a rollout ends, the environment builds a normalized state $\hat{S}$ from the command history, accumulated effect traces, the latest process outputs, and the merged backend state. The evaluator then applies task-defined checks over $\hat{S}$ rather than comparing against one exact command sequence. The default protocol uses multi mode, which routes commands to their corresponding workflow surfaces, with a configurable maximum step limit (see Appendix [B.1](https://arxiv.org/html/2605.14133#A2.SS1) for additional runtime details).
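For readers who prefer code to pseudocode, the rollout in Algorithm 1 could be driven by a loop shaped roughly like the sketch below. The `agent`, `env`, and `task` objects and their field names are assumptions for illustration, not the released interface.

```python
# Sketch of the episode rollout loop from Algorithm 1 (illustrative names only).
STOP_COMMANDS = {"done", "exit", "quit"}

def rollout(agent, env, task, max_steps: int = 25):
    """Run one episode, then score the normalized end state rather than the trajectory."""
    observation = {"instruction": task.instruction, "hints": task.metadata}
    for t in range(max_steps):
        command = agent.act(observation)          # one command per step
        if command in STOP_COMMANDS:
            break
        observation = env.step(command)           # execute and observe
    # Build the normalized evaluation state S-hat from history, effects,
    # latest outputs, and merged backend state, then apply task-defined checks.
    s_hat = {"state": env.state, "effects": env.effects, "history": env.history}
    results = [check(s_hat) for check in task.checks]
    return {"strict_pass": all(r >= 1.0 for r in results), "check_scores": results}
```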
### 3.2 Scoring

We report two complementary metrics. Let $\mathcal{D}$ denote the evaluation set. *Strict full-pass accuracy* counts the fraction of tasks where every required check passes:

$$\mathrm{Acc}(\mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \mathbf{1}\left[\tau \text{ passes all required checks}\right]. \tag{2}$$

*Partial-credit score* measures how much of each workflow was completed correctly. If a rollout has checks $\{c_i\}_{i=1}^{n}$, each with score $s_i \in [0,1]$ and weight $w_i > 0$, the per-task score is

$$\mathrm{Score}(\hat{S}) = \frac{\sum_{i=1}^{n} w_i s_i}{\sum_{i=1}^{n} w_i}, \tag{3}$$

and we report the average over $\mathcal{D}$. The two metrics use the same evaluator but summarize it differently: strict accuracy rewards complete closure, while partial credit distinguishes near-miss failures from early breakdowns. When agents are served by external providers, ClawForge also records provider-side failures and provider-impacted tasks as supplementary records.
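A direct transcription of the two metrics, with an invented two-check example to show how strict accuracy and partial credit diverge:

```python
# Minimal sketch of the metrics in Eqs. (2) and (3); function names are illustrative.
def strict_accuracy(task_results: list[list[float]]) -> float:
    """Fraction of tasks whose required checks all pass (Eq. 2)."""
    passed = sum(1 for scores in task_results if all(s >= 1.0 for s in scores))
    return passed / len(task_results)

def partial_credit(scores: list[float], weights: list[float]) -> float:
    """Weighted per-task score over checks with s_i in [0, 1] and w_i > 0 (Eq. 3)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Example: a rollout that retires a stale task (1.0) but misses the follow-up
# calendar event (0.0) fails strict accuracy yet retains partial credit.
print(partial_credit([1.0, 0.0], [1.0, 1.0]))   # 0.5
print(strict_accuracy([[1.0, 0.0], [1.0, 1.0]]))  # 0.5: one of two tasks fully passes
```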
## 4 Experiments

We evaluate the ClawForge-Bench (17 scenarios, 362 tasks, two prompt styles) across seven frontier models. Our experiments address the following questions: (1) Is the benchmark challenging and discriminative across strong models? (2) Which ability categories expose the largest capability gaps? (3) Do models differ in step efficiency, and does extra interaction translate into better closure? (4) Can partial-credit analysis separate near-miss failures from early stops?

### 4.1 Experimental Setup

All experiments follow the protocol in [Section 3](https://arxiv.org/html/2605.14133#S3) using multi mode, full interaction history, and a 25-step budget. We evaluate seven frontier models: Kimi-K2.5 (Team et al., [2026](https://arxiv.org/html/2605.14133#bib.bib32)), GPT-5.2 (OpenAI, [2025a](https://arxiv.org/html/2605.14133#bib.bib37)), OpenAI o3 (OpenAI, [2025c](https://arxiv.org/html/2605.14133#bib.bib96)), GPT-5.3-Codex (OpenAI, [2025b](https://arxiv.org/html/2605.14133#bib.bib36)), GPT-5.4 (OpenAI, [2026](https://arxiv.org/html/2605.14133#bib.bib35)), Claude Sonnet 4.6 (Anthropic, [2026b](https://arxiv.org/html/2605.14133#bib.bib34)), and Claude Opus 4.6 (Anthropic, [2026a](https://arxiv.org/html/2605.14133#bib.bib33)). We report strict full-pass accuracy and partial-credit score as defined in [Section 3.2](https://arxiv.org/html/2605.14133#S3.SS2). For Claude models, step counts in [Table 4](https://arxiv.org/html/2605.14133#S4.T4) are aggregated from partially overlapping trace subsets rather than a single uniform run; consequently, step-efficiency comparisons with other models should be interpreted cautiously (see Appendix [D.1](https://arxiv.org/html/2605.14133#A4.SS1)).
### 4.2 Benchmark Results

The generated ClawForge-Bench snapshot separates abilities that matter in stateful execution. [Tables 1](https://arxiv.org/html/2605.14133#S4.T1), [2](https://arxiv.org/html/2605.14133#S4.T2), [4](https://arxiv.org/html/2605.14133#S4.T4), and [3](https://arxiv.org/html/2605.14133#S4.T3) highlight three benchmark findings. First, aggregate performance remains far from saturation while still separating strong frontier models. Second, the hardest settings are not simply long workflows, but scenarios involving stale state, incorrect replacement, and incomplete workflow closure. Third, partial-credit and step-efficiency analyses expose meaningful behavioral differences—including near-miss completion and inefficient execution—that strict pass/fail accuracy alone would largely obscure.

#### 4.2.1 Overall Benchmark Separation

[Table 1](https://arxiv.org/html/2605.14133#S4.T1) shows that the benchmark is neither saturated nor dominated by one model family: the best strict accuracy is 45.3% (Claude Opus 4.6) and the weakest is 35.9% (Kimi-K2.5). Strict accuracy and partial credit rank models differently: GPT-5.3-Codex has the highest partial credit (0.8238) but ranks fourth on strict accuracy, while Claude Opus 4.6 leads strict accuracy but has lower partial credit (0.7868). This mismatch indicates that the suite distinguishes models that steadily accumulate correct intermediate state from models that convert a smaller subset of trajectories into complete workflow closure.

Table 1: Main results (strict accuracy and partial score). All entries use multi mode with full interaction history and a 25-step budget.

Table 2: Strict accuracy (%) by primary ability. Each row is a disjoint partition of the task set.

Table 3: Partial-credit score by primary ability. Higher scores indicate more complete workflow execution even when evaluation fails.

Table 4: Accuracy (%) and average executed steps by reference-trajectory length. Format: Acc / Steps. Shorter reference trajectories indicate structurally simpler workflows.
#### 4.2.2 Ability-Level Analysis

The clearest capability gap appears when the environment already contains partially wrong state ([Table 2](https://arxiv.org/html/2605.14133#S4.T2)). State repair remains below 34% for all models, and Wrong-State Replacement is the hardest shared slice: even the best model reaches only 17.4%, while several remain at 0.0%. These failures are qualitatively different from ordinary completion errors, as the model must identify stale state, preserve valid parts, retire or replace invalid parts, and still finish downstream follow-through.

[Table 3](https://arxiv.org/html/2605.14133#S4.T3) adds a partial-credit perspective. Many rollouts that fail the strict criterion still make substantial progress: for example, o3 leads state-repair score at 0.6757 despite only 33.7% strict accuracy. This reveals that many failures are near-miss closures rather than early breakdowns. Models often identify the correct repair direction but miss one final side effect or replacement step, or stop as if the workflow were already complete after only a partial repair. The score view supports a useful failure taxonomy: early breakdown, correct direction but incomplete closure, stale-state mishandling, and hallucinated completion.

#### 4.2.3 Step Efficiency

[Table 4](https://arxiv.org/html/2605.14133#S4.T4) pairs reference-trajectory length with average executed steps, revealing two distinct failure styles. GPT-5.3-Codex and GPT-5.4 are step-efficient, staying near eight executed steps on the longest reference buckets while maintaining strong accuracy. Claude Opus 4.6 achieves higher strict accuracy on several buckets but uses substantially more steps (16–20 average steps on GT-6 through GT-8 tasks). This distinguishes two failure modes: *step-efficient but imperfect* models reach compact, partially correct solutions but miss one final operation, while *high-step low-conversion* models keep exploring without improving closure proportionally. In stateful workflows, extra steps are not neutral: they often correspond to repeated inspection, redundant list queries, or additional state mutations that do not improve the final evaluator-visible state. A benchmark that only reports pass/fail would collapse these two failure styles together; the combined accuracy/steps view shows that workflow competence involves not just whether a model eventually passes, but how efficiently it converts interaction budget into closure.

#### 4.2.4 Scenario-Level Discrimination

[Figure 4](https://arxiv.org/html/2605.14133#S4.F4) shows that the benchmark’s difficulty is not uniform. At one extreme, Already-Done Skip and Duplicate Avoidance are near saturation (90–100% for most models) and are not shown in the figure. At the other extreme, Wrong-State Replacement remains below 17% for all models, and Release Gate stays between 3% and 17%, exposing persistent weaknesses in state-conflict resolution. Between these extremes, Interrupted Workflow Resume produces the widest model separation (17% to 90%), with Claude models strongly outperforming GPT models on this slice. This per-scenario view demonstrates that the aggregate accuracy gap across models is not driven by uniform difficulty but by a small set of genuinely discriminative scenario families targeting state repair, replacement, and multi-step closure.

Figure 4: Per-scenario strict accuracy. Duplicate-aware scenarios are near saturation, while state repair, release gating, and wrong-state replacement remain challenging across all models.
### 4.3 Case Studies

The aggregate results above show *where* models fail; this section illustrates *how* they fail through two representative tasks. We select these cases because they test complementary state-conflict capabilities: one requires identifying and replacing incorrect state, while the other requires recognizing and preserving correct state. Together they cover the two hardest judgment calls in stateful workflows.

Case 1: Wrong-State Replacement.

Task. hard_decision_workflow_261: “Something in the Seattle release setup was staged wrong. Check whether the next step is stale, replace the stale next step with the correct version, and leave the next step and sync correctly set.” Initial state. The environment contains a pre-existing task Existing Seattle release follow-up (low priority, stale). The replacement task and replacement sync are both missing. Required actions. Inspect the seeded task, determine it is stale, retire it, create the corrected Seattle release replacement next step, and add the matching Seattle replacement release sync calendar event. Evaluation. Strict evaluation requires *both* retirement of the stale task *and* creation of the replacement state. Adding the replacement without cleanup, or cleaning up without recreating the follow-through, both fail.

This task tests whether agents can identify and remove incorrect state before creating its replacement. In [Table 2](https://arxiv.org/html/2605.14133#S4.T2), Wrong-State Replacement remains below 17% strict accuracy for all models, making it the hardest shared slice in the benchmark. The partial-credit view ([Table 3](https://arxiv.org/html/2605.14133#S4.T3)) reveals that many models identify the stale object and attempt a replacement, but fail to complete the full retirement-and-recreation cycle, leaving the workflow in an inconsistent intermediate state. The common failure pattern is hallucinated completion: the model retires the stale task but stops before creating the replacement, or creates the replacement but leaves the stale task in place.

The second case tests the opposite judgment: rather than replacing wrong state, the agent must preserve correct state and only fill in what is missing.

Case 2: Interrupted Workflow Resume.

Task. hard_decision_workflow_307: “The New York release work already has pieces in place. Check the current model, the next step on the board, and the sync on the calendar, finish the missing handoff file, send manager@example.com a recap, and leave the existing setup alone.” Initial state. The environment already contains New York existing release next step, New York existing release sync, and drafted notes in /ops/release-review.txt. Only the handoff file and recap email are missing. Required actions. Resume from the partially completed state: read existing review notes, create the missing handoff file, preserve the staged task and sync (no duplicates), and send the recap email. Evaluation. The evaluator accepts only resume-style closure: the existing task and sync must remain, no duplicates may be created, the handoff file must appear, and the recap email must be sent.

This task produces the widest model separation in [Figure 4](https://arxiv.org/html/2605.14133#S4.F4): Claude Opus 4.6 reaches 90% while Kimi-K2.5 stays at 17%. The gap is driven by whether the model inspects existing state before acting or immediately begins creating objects from scratch. Models that skip the inspection step tend to recreate the task and calendar event, triggering the duplicate guard and failing despite producing otherwise correct artifacts.

These two cases illustrate the central design principle of the benchmark. Static or clean-state benchmarks would score both tasks as straightforward multi-step completions. The difficulty arises entirely from pre-existing state: in Case 1 the agent must judge what is *wrong* and replace it, while in Case 2 it must judge what is *right* and leave it alone. This state-aware judgment is what the ClawForge-Bench is designed to measure. The full scenario catalog with additional examples is provided in Appendix [F](https://arxiv.org/html/2605.14133#A6).
## 5 Related Work

Interactive agent benchmarks. Recent work increasingly evaluates LLM agents as sequential decision-makers that interleave reasoning, tool use, and action rather than produce a single final answer, with ReAct serving as a representative formulation (Yao et al., [2022](https://arxiv.org/html/2605.14133#bib.bib40)). This shift has motivated executable or semi-executable benchmarks for software repair, web interaction, planning, and tool use, including SWE-bench (Jimenez et al., [2023](https://arxiv.org/html/2605.14133#bib.bib19)), AgentBench (Liu et al., [2023](https://arxiv.org/html/2605.14133#bib.bib18)), WebArena (Zhou et al., [2023](https://arxiv.org/html/2605.14133#bib.bib20)), OSWorld (Xie et al., [2024](https://arxiv.org/html/2605.14133#bib.bib17)), GAIA (Mialon et al., [2023](https://arxiv.org/html/2605.14133#bib.bib16)), Claw-Eval (Ye et al., [2026](https://arxiv.org/html/2605.14133#bib.bib7)), and ClawsBench (Li et al., [2026a](https://arxiv.org/html/2605.14133#bib.bib10)). However, these benchmarks mainly evaluate task completion or trajectory correctness under clean or weakly constrained initial state, making it difficult to measure failures caused by stale state, conflicting artifacts, or incomplete workflows.

Persistent memory and long-horizon agents. A related line of work studies persistent memory and long-horizon behavior in LLM agents. Virtual-context approaches, including MemGPT (Packer et al., [2023](https://arxiv.org/html/2605.14133#bib.bib5)), MemoryOS (Kang et al., [2025](https://arxiv.org/html/2605.14133#bib.bib4)), and stream-based controllers (Wang et al., [2024](https://arxiv.org/html/2605.14133#bib.bib51)), extend interaction length through memory paging or context management. Structured and graph-based systems, such as MemoryBank (Zhong et al., [2024](https://arxiv.org/html/2605.14133#bib.bib3)), Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2605.14133#bib.bib91)), Zep (Rasmussen et al., [2025](https://arxiv.org/html/2605.14133#bib.bib2)), A-MEM (Xu et al., [2025](https://arxiv.org/html/2605.14133#bib.bib1)), and O-MEM (Wang et al., [2025](https://arxiv.org/html/2605.14133#bib.bib43)), instead organize memory into persistent representations for long-horizon retrieval and planning. These approaches primarily improve memory mechanisms, but provide limited evaluation of whether agents correctly preserve, repair, or update workflow state during extended execution.

Stateful workflow evaluation. ClawForge focuses on evaluating these workflow-state behaviors directly. Unlike existing benchmarks that mainly evaluate final task completion or exact trajectories (Li et al., [2026b](https://arxiv.org/html/2605.14133#bib.bib6); Ye et al., [2026](https://arxiv.org/html/2605.14133#bib.bib7)), ClawForge evaluates agents under persistent workflow state, where tasks may already be partially completed, inconsistent, stale, or duplicate-sensitive before execution begins. Agents must preserve valid state, detect outdated artifacts, resolve conflicting information, avoid redundant mutations, and complete workflows. Correctness is defined over normalized end states and observable side effects rather than exact action sequences, allowing multiple valid trajectories.
## 6 Conclusion

We presented ClawForge, a generator-backed benchmark framework that systematically evaluates command-line agents under state-conflict workflows through automated task construction, stateful execution, and result-first evaluation. The ClawForge-Bench (17 scenarios, 6 ability categories) exposes failure modes that static or clean-state evaluation cannot observe: wrong-state replacement remains below 17% strict accuracy for all seven frontier models evaluated, and the widest model separation (17% to 90% on Interrupted Workflow Resume) is driven by whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that models exhibit qualitatively different failure styles, from step-efficient near-miss closures to high-step low-conversion exploration. Expanding scenario coverage to new workflow domains and reducing the manual effort for defining scenario templates are promising directions for future work.
## References
- Anthropic (2024). The Claude 3 model family: Opus, Sonnet, Haiku. [https://www.anthropic.com/news/claude-3-family](https://www.anthropic.com/news/claude-3-family)
- Anthropic (2026a). Introducing Opus 4.6. [https://www.anthropic.com/news/claude-opus-4-6](https://www.anthropic.com/news/claude-opus-4-6)
- Anthropic (2026b). Introducing Sonnet 4.6. [https://www.anthropic.com/news/claude-sonnet-4-6](https://www.anthropic.com/news/claude-sonnet-4-6)
- Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, et al. (2024). A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3), pp. 1–45.
- P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav (2025). Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.
- H. Ji, K. Xiong, S. Han, P. Xia, S. Qiu, Y. Zhou, J. Liu, J. Li, B. Li, Z. Zheng, et al. (2026). ClawArena: Benchmarking AI agents in evolving information environments. arXiv preprint arXiv:2604.04202.
- C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan (2023). SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.
- J. Kang, M. Ji, Z. Zhao, and T. Bai (2025). Memory OS of AI agent. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 25972–25981.
- X. Li, K. W. Choe, Y. Liu, X. Chen, C. Tao, B. You, W. Chen, Z. Di, J. Sun, S. Zheng, et al. (2026a). ClawsBench: Evaluating capability and safety of LLM productivity agents in simulated workspaces. arXiv preprint arXiv:2604.05172.
- X. Li, M. Li, D. Xu, W. Chiang, I. Stoica, C. Hsieh, and T. Zhou (2026b). ClawEnvKit: Automatic environment generation for claw-like agents. arXiv preprint arXiv:2604.18543.
- X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, et al. (2023). AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688.
- M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, et al. (2026). Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868.
- G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom (2023). GAIA: A benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations.
- OpenAI (2025a). Introducing GPT-5.2. [https://openai.com/index/introducing-gpt-5-2](https://openai.com/index/introducing-gpt-5-2)
- OpenAI (2025b). Introducing GPT-5.3-Codex. [https://openai.com/index/introducing-gpt-5-3-codex/](https://openai.com/index/introducing-gpt-5-3-codex/)
- OpenAI (2025c). Introducing o3 and o4-mini. [https://openai.com/index/introducing-o3-and-o4-mini/](https://openai.com/index/introducing-o3-and-o4-mini/)
- OpenAI (2026). Introducing GPT-5.4. [https://openai.com/index/introducing-gpt-5-4/](https://openai.com/index/introducing-gpt-5-4/)
- OpenClaw (2026). OpenClaw: Personal AI assistant. [https://github.com/openclaw/openclaw](https://github.com/openclaw/openclaw)
- C. Packer, V. Fang, S. Patil, K. Lin, S. Wooders, and J. Gonzalez (2023). MemGPT: Towards LLMs as operating systems.
- Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al. (2023). ToolLLM: Facilitating large language models to master 16000+ real-world APIs. arXiv preprint arXiv:2307.16789.
- P. Rasmussen, P. Paliychuk, T. Beauvais, J. Ryan, and D. Chalef (2025). Zep: A temporal knowledge graph architecture for agent memory. arXiv preprint arXiv:2501.13956.
- Nous Research (2026). Hermes Agent: The agent that grows with you. [https://github.com/nousresearch/hermes-agent](https://github.com/nousresearch/hermes-agent)
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023). Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, pp. 8634–8652.
- K. Team, T. Bai, Y. Bai, Y. Bao, S. Cai, Y. Cao, Y. Charles, H. Che, C. Chen, G. Chen, et al. (2026). Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276.
- Y. Wang, R. Takanobu, Z. Liang, Y. Mao, Y. Hu, J. McAuley, and X. Wu (2025). Mem-α: Learning memory construction via reinforcement learning. arXiv preprint arXiv:2509.25911.
- Z. Z. Wang, J. Mao, D. Fried, and G. Neubig (2024). Agent workflow memory. arXiv preprint arXiv:2409.07429.
- T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, et al. (2024). OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37, pp. 52040–52094.
- W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang (2025). A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110.
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao (2022). ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.
- B. Ye, R. Li, Q. Yang, Y. Liu, L. Yao, H. Lv, Z. Xie, C. An, L. Li, L. Kong, et al. (2026). Claw-Eval: Toward trustworthy evaluation of autonomous agents. arXiv preprint arXiv:2604.06132.
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, pp. 46595–46623.
- W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang (2024). MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 19724–19731.
- S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, et al. (2023). WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854.
## Appendix A Benchmark Construction

This section documents the scenario taxonomy, generated task objects, and implementation details that support reproducibility. We first present the full scenario inventory (§[A.1](https://arxiv.org/html/2605.14133#A1.SS1)), then show a representative task instruction card (§[A.2](https://arxiv.org/html/2605.14133#A1.SS2)), and finally describe the implementation contract (§[A.3](https://arxiv.org/html/2605.14133#A1.SS3)).

### A.1 Scenario Inventory

Table [5](https://arxiv.org/html/2605.14133#A1.T5) lists all 17 scenario families in the ClawForge-Bench, together with their primary ability classification, overlapping ability tags, and a brief description of the required workflow.

Table 5: Full scenario inventory for hard_decision_workflow. Scenario names are rendered in paper form rather than as generator slugs; release counts are omitted because the generator allows per-scenario counts to be reconfigured.
### A.2 Example Task Instruction Card

To make the generated task format concrete, we show one complete instruction card from a gap-completion scenario. This illustrates how the five task components (instruction, initial state, required actions, checks, and difficulty) are coupled in a single specification.

Task ID. hard_decision_workflow_204 (completion_gap_followthrough)

Primary ability. Gap completion

Ability tags. gap_completion, state_inspection, workflow_completion

Instruction. “I already have part of the Berlin release work in motion, and the review sync is the missing piece. Review what’s there, read the release notes, refresh the handoff file, leave the next step on the board in place and add only the missing sync on the calendar, and leave the rest alone.”

Initial visible state. The initial state already contains a pending board item for the Berlin release next step and a drafted release-review note. The calendar sync is still missing, and the handoff artifact needs to be created or refreshed. This means the task does not begin from an empty workspace: the agent must preserve valid state while identifying the one missing closure path.

Required correction and closure. A semantically correct completion must keep the existing Berlin next-step task in place, create or refresh a handoff artifact, and add the missing Berlin release sync on the calendar. It must not create a duplicate next-step task. The reference trajectory uses the benchmark date anchor for Europe/Berlin, but the evaluator accepts alternative command traces as long as the final state closes the workflow without introducing duplicate board work.

Representative required checks. This task mixes state-preservation and side-effect checks: the existing Berlin board item must still be present; a handoff-style file creation or refresh effect must be visible; no duplicate next-step task may be created; and the missing calendar sync must exist by the end of the rollout.

Why this task is hard. The task is easy to understand semantically but easy to fail procedurally. Models can recognize that a review sync is missing and still fail by recreating the board item, omitting the handoff refresh, or stopping after scheduling the sync and incorrectly assuming the workflow is complete.
### A.3 Implementation Details

This subsection describes how the task specification above is realized at runtime. At task level, each hard_decision_workflow episode is an executable specification rather than a bare prompt. The specification bundles five pieces of information: a natural-language instruction, scenario metadata, task-specific initial_state_overrides, a reference command trajectory, and typed evaluation checks. In practice, the task object also stores instruction variants, prompt style, realism and scenario tags, public or hidden task notes, and provider-related metadata used for analysis. At reset time, the environment loads a base state, applies the overrides declared by the task, initializes the backend for that state directory, and then exposes the instruction plus compact execution hints to the agent. For hard_decision_workflow_204, the override layer inserts an already-existing Berlin next-step task and a drafted release-review note, so the evaluator can later distinguish correct gap completion from unnecessary re-creation.
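A minimal sketch of this reset-time contract, assuming a hypothetical helper name (`materialize_state`) and a simplified state layout:

```python
# Illustrative reset-time state construction (helper and field names are assumptions).
import copy

def materialize_state(base_state: dict, overrides: dict) -> dict:
    """Copy the base state, then apply the task's initial_state_overrides."""
    state = copy.deepcopy(base_state)            # isolate per-episode state
    for surface, items in overrides.items():     # e.g., "tasks", "notes"
        state.setdefault(surface, []).extend(items)
    return state

# For a task like hard_decision_workflow_204, the overrides could seed an
# existing board item before the agent ever sees the instruction:
base = {"tasks": [], "notes": []}
overrides = {"tasks": ["Berlin release next step (pending)"],
             "notes": ["drafted release-review note"]}
episode_state = materialize_state(base, overrides)
```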
During rollout, the environment records each command, its stdout/stderr, exit code, and inferred effects such as created files, completed tasks, or created calendar events. The rollout record therefore preserves not only what the agent typed, but what externally visible state transitions were actually produced. Scoring is then performed over a normalized result state rather than over exact trajectory imitation. This design matters for hard workflows because many tasks require a sequence of decisions before any mutation is safe: the agent often has to determine whether the relevant object already exists, whether it is stale, whether it should be preserved, and only then whether the correct action is to add, update, or replace state. In other words, the benchmark object is not just “what command should be typed next,” but “what workflow state should exist when the rollout is over.”
The default benchmark runtime is multi mode, which uses explicit routing rather than one monolithic executor. Command families such as openclaw, calendar, email, tasks, weather, file, and curl are dispatched to their corresponding skills or adapters. This routing layer is important for reproducibility because it preserves cross-surface interaction while keeping the execution contract stable across benchmark runs.
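The routing contract can be illustrated with a small dispatch table. The adapter behavior below is stubbed out, and the function and adapter names are assumptions for exposition rather than the registered ClawForge skills.

```python
# Sketch of multi-mode command routing (adapter names are illustrative stubs).
def route(command: str, adapters: dict):
    """Dispatch a command line to the adapter registered for its family."""
    parts = command.split(maxsplit=1)
    family = parts[0] if parts else ""           # e.g., "calendar", "email"
    handler = adapters.get(family)
    if handler is None:
        return f"unknown command family: {family}", 1, []
    return handler(command)                      # (stdout, exit_code, effects)

adapters = {
    "tasks":    lambda cmd: (f"ok: {cmd}", 0, []),
    "calendar": lambda cmd: (f"ok: {cmd}", 0, []),
    "email":    lambda cmd: (f"ok: {cmd}", 0, []),
}
stdout, code, effects = route("calendar add 'Seattle replacement release sync'", adapters)
```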
One useful way to think about the implementation is as a four-stage contract: (1) construct initialized workflow state, (2) execute one command at a time under routed backends, (3) infer explicit state-changing effects from those executions, and (4) merge the resulting traces into evaluator-facing state. The appendix carries these details so that the main paper can focus on what the benchmark reveals about model behavior rather than on runtime mechanics alone.
## Appendix B Evaluation Details

This section shows how result-first scoring works in practice. We first describe the execution modes and evaluator internals (§[B.1](https://arxiv.org/html/2605.14133#A2.SS1)), then present representative evaluation checks (§[B.2](https://arxiv.org/html/2605.14133#A2.SS2)), and finally explain how to read the full scenario examples (§[B.3](https://arxiv.org/html/2605.14133#A2.SS3)).

### B.1 Execution Mode and Evaluator Details

Algorithm [2](https://arxiv.org/html/2605.14133#alg2) gives the episode-level rollout contract used by the main experiments.
Algorithm 2: ClawForge episode rollout (detailed)

1: Load task $T$, base state $S_0$, and backend $B$
2: Apply task-specific state overrides and initialize evaluator $E$
3: Initialize observation $o_0 \leftarrow$ instruction, config, gateway status, command hints
4: for $t = 1, \dots, H$ do
5: &nbsp;&nbsp; Agent emits one command $a_t$
6: &nbsp;&nbsp; Execute $a_t$ through backend $B$
7: &nbsp;&nbsp; Update backend state and append inferred effects
8: &nbsp;&nbsp; Construct next observation $o_t$
9: &nbsp;&nbsp; if $a_t \in \{\texttt{done}, \texttt{exit}, \texttt{quit}\}$ or $t = H$ then break
10: end for
11: Build normalized evaluation state $\hat{S}$
12: Return final result $E(\hat{S}, T)$
ClawForge supports four execution modes with different fidelity and stability trade-offs:

- mock uses a single in-process mock backend for unit tests and deterministic debugging.
- multi is the default benchmark setting, with explicit command-family routing through registered skills and adapters.
- real replaces the openclaw branch with a subprocess-backed real backend while keeping the same routing architecture.
- hybrid extends real with gateway lifecycle management, including auto-start, port injection, and health polling.
Internally, the evaluator builds a normalized state $\hat{S}$ that merges configuration state, gateway state, bounded command history, the latest stdout/stderr/exit code, and explicit effect traces (created calendar events, sent emails, created files, completed tasks, etc.). This normalization is particularly important for the hard benchmark because many tasks are judged by the relationship between pre-existing and newly created state: the evaluator must determine not only that an artifact exists, but whether an existing object was preserved, a stale object was retired, and the required downstream side effects were produced.
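In code, this merge step might look like the sketch below; the field names and the specific history bound are assumptions for illustration rather than the evaluator's actual schema.

```python
# Illustrative construction of the normalized evaluation state S-hat.
def build_normalized_state(config: dict, history: list[str],
                           last_output: dict, effects: list[tuple],
                           backend_state: dict) -> dict:
    """Merge everything the evaluator needs into one evaluator-facing view."""
    return {
        "config": config,
        "history": history[-50:],      # bounded command history (bound assumed here)
        "last_output": last_output,    # latest stdout/stderr/exit code
        "effects": effects,            # created events, sent emails, files, ...
        **backend_state,               # merged per-surface state (tasks, calendar, ...)
    }
```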
### B.2 Evaluation Check Examples

Table [6](https://arxiv.org/html/2605.14133#A2.T6) shows representative result-first checks from two hard tasks: hard_decision_workflow_204 (gap completion) and hard_decision_workflow_263 (wrong-state replacement). The first task rewards preserving correct existing state while filling a single missing gap. The second additionally requires retiring stale state and putting the replacement plan in place. This is why replacement-style tasks are materially harder: they require preserving the valid part of the workflow, modifying the invalid part, and still achieving end-to-end closure.

Table 6: Representative task-level evaluation checks. These checks illustrate the benchmark’s result-first philosophy: trajectories may differ, but the rollout must leave the required state and effects behind.

Different checks can also carry different weights in the partial-credit score. In practice, this means the benchmark can distinguish between a rollout that gets the main task direction right but misses one final side effect, and a rollout that never reaches the correct repair or branch decision in the first place. The score layer is therefore not a soft alternative to evaluation, but an additional view on how close a failed rollout came to the intended state transition.
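As an illustration of weighted, result-first checks, the sketch below encodes validators in the spirit of the Case 1 replacement task; the check names, matching strings, and weights are invented for exposition and are not taken from the released validators.

```python
# Hypothetical weighted checks for a replacement-style task (all values illustrative).
def stale_task_retired(s_hat: dict) -> float:
    """Pass only if the stale seeded task is gone from the board."""
    return 0.0 if "Existing Seattle release follow-up" in s_hat.get("tasks", []) else 1.0

def replacement_created(s_hat: dict) -> float:
    """Pass only if the corrected next step now exists."""
    return 1.0 if "Seattle release replacement next step" in s_hat.get("tasks", []) else 0.0

def sync_effect_present(s_hat: dict) -> float:
    """Pass only if the downstream calendar side effect was produced."""
    wanted = ("calendar_created", "Seattle replacement release sync")
    return 1.0 if wanted in s_hat.get("effects", []) else 0.0

checks = [
    (stale_task_retired,   2.0),   # retiring stale state carries more weight
    (replacement_created,  2.0),
    (sync_effect_present,  1.0),   # final side effect
]

def partial_score(s_hat: dict) -> float:
    total = sum(w for _, w in checks)
    return sum(w * check(s_hat) for check, w in checks) / total
```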
### B.3 Execution Examples

The appendix now separates two example roles. The later full scenario catalog gives one long-form example for every released scenario family, while the present section explains how to read those examples. Each long example follows the same schema: a concrete instance, the initial visible state, the key repair or branch decision, a short representative action trace, and the evaluator-facing success condition. This keeps the examples comparable even when one scenario is mostly about duplicate avoidance and another is mostly about state replacement or multi-source resolution.

A useful reading strategy is to compare three things across the catalog: first, whether the task begins from empty or partially populated state; second, whether the required operation is additive, repair-oriented, or replacement-oriented; and third, whether final closure depends on one surface or on coordination across board, calendar, email, file, and cron state. That contrast is exactly what makes the ClawForge-Bench analytically useful.
## Appendix C Supplementary Results

This section provides additional experimental breakdowns that extend the main-text analysis: per-scenario accuracy slices (§[C.1](https://arxiv.org/html/2605.14133#A3.SS1)), recurring failure patterns (§[C.2](https://arxiv.org/html/2605.14133#A3.SS2)), and step-efficiency interpretation (§[C.3](https://arxiv.org/html/2605.14133#A3.SS3)).

### C.1 Scenario Slice Table

Table [7](https://arxiv.org/html/2605.14133#A3.T7) highlights six scenario families that are particularly useful for interpreting the benchmark. The selected slices include near-saturated checks (already_done_skip_followthrough, duplicate_avoidance_followthrough), a uniformly hard shared slice (release_gate_followthrough), the hardest replacement slice (wrong_state_replacement_followthrough), a strong separator (interrupted_workflow_resume), and a multi-source branch-selection slice (multi_source_decision_followthrough).

Table 7: Supplementary scenario-level full-pass accuracy (%) for selected hard slices. Values follow the same evaluated run set used in the main experiment section.

### C.2 Failure Pattern Notes

Beyond per-scenario accuracy, the benchmark reveals recurring failure patterns that cut across scenario families. The main text argues that many failures are structured rather than arbitrary; Table [8](https://arxiv.org/html/2605.14133#A3.T8) expands that claim with concrete pattern-level categories.

Table 8: Representative failure patterns that recur across the hard benchmark.

### C.3 Length and Efficiency Notes

The previous subsections examine what models get wrong; this subsection examines how efficiently they use their interaction budget. The length-stratified analysis in the main paper pairs reference-trajectory length with average executed steps. A useful interpretation is that GT length measures the structural size of the intended workflow, while executed steps measure how much interaction budget a model actually spends on that workflow. These quantities need not match closely. A model can be step-efficient but still miss one final operation, or it can consume many additional steps without improving final closure.

This distinction is particularly important for stateful tasks. In many ClawForge-Bench episodes, extra steps are not neutral: they often correspond to repeated inspection, redundant list queries, or additional state mutations that do not improve the final evaluator-visible state. The benchmark therefore treats step usage as a behavioral signal about workflow efficiency, not just as an implementation detail of the rollout trace.
## Appendix D Data and Reproducibility

### D.1 Data Provenance and Provider Impact

The main experimental tables report benchmark outcomes over the shared ClawForge-Bench snapshot. For Kimi-K2.5, GPT-5.2, OpenAI o3, GPT-5.3-Codex, GPT-5.4, and Claude Opus 4.6, the reported strict accuracy, partial-credit score, and length-bucket average steps are computed from task-level traces over the evaluated run set. The Claude Sonnet 4.6 aggregate accuracy is reported over the same task denominator, but the length-bucket average-step values are computed from the task-level traces selected for the merged available run.

Provider-aware reporting is included because benchmark runs over external model endpoints can introduce retries, filtered outputs, or fallback behavior that affect an otherwise valid interactive rollout. In the reporting scheme, provider_failures counts tasks that fail outright because a provider-side issue prevents a normal completion. provider_impacted_tasks counts tasks whose trajectories complete but are materially disturbed by provider-side effects such as compact fallbacks or filtered responses. These statistics are explanatory rather than primary: they do not replace full-pass accuracy or partial-credit score, but help readers judge whether serving noise is large enough to affect a comparison. The intended reading order is therefore to examine accuracy and score first, use the length table to interpret closure efficiency, and then use provider-aware statistics to determine whether a model’s execution trace was unusually noisy at the serving layer.
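A sketch of how such accounting could be tallied from per-task records follows; the record field names here are assumed for illustration, not the benchmark's actual schema.

```python
# Sketch of provider-aware accounting (record fields are assumptions, not the schema).
from collections import Counter

def provider_summary(records: list[dict]) -> Counter:
    """Tally provider-side failures and provider-impacted tasks."""
    summary = Counter()
    for rec in records:
        if rec.get("provider_error"):            # rollout never completed normally
            summary["provider_failures"] += 1
        elif rec.get("provider_disturbance"):    # e.g., compact fallback, filtered output
            summary["provider_impacted_tasks"] += 1
    return summary

records = [{"provider_error": False, "provider_disturbance": True},
           {"provider_error": True}]
print(provider_summary(records))  # one impacted task, one provider failure
```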
### D\.2Limitations and Scope
Several limitations bound the claims made in the main paper\. First, the benchmark is intentionally narrow in environment type: it studies executable command\-line workflow episodes over tasks, calendar, email, messaging, files, configuration, weather, and cron state, rather than arbitrary long\-horizon agent deployment\. The intended claim is therefore not that one benchmark exhausts real agent reliability, but that state\-conflict workflows expose failure modes that static prompting and lighter tool\-use tests often miss\.
Second, the reported model comparisons are benchmark evaluations over one shared ClawForge\-Bench snapshot rather than variance estimates over many independently repeated runs\. This matters especially for externally served agents, where provider\-side retries, filtered outputs, and compact\-fallback behavior can perturb a rollout even when the task specification is fixed\. The paper surfaces those effects through provider\-aware accounting, but it does not eliminate them\.
Third, the hard suite emphasizes result\-first closure under persistent state, so some conclusions are benchmark\-specific by construction\. The strongest separations in this paper come from duplicate\-sensitive completion, stale\-state repair, replacement, and branch resolution\. Other capabilities that matter in agent deployment, such as open\-ended research, web navigation, or multimodal perception, are outside the scope of this benchmark family\.
Fourth, the Sonnet 4.6 reporting line is not perfectly symmetric with the other entries. Its strict-accuracy value uses the shared task denominator, but some diagnostic step statistics rely on the merged best-available traces rather than one fully uniform trace set. The main text and [Section D.1](https://arxiv.org/html/2605.14133#A4.SS1) therefore treat those step counts as efficiency evidence, not as a clean latency benchmark.
### D.3 Statistical Significance
The paper reports deterministic benchmark outcomes on a fixed task snapshot and emphasizes exact full-pass accuracy, partial-credit score, and structured breakdown tables. It does not report confidence intervals, bootstrap uncertainty, or hypothesis tests. This omission is deliberate rather than hidden: the current paper is positioned as a benchmark-construction and empirical-separation study, not as a statistical comparison of nearly tied models under repeated random resampling.

The absence of formal significance reporting should shape how close comparisons are read. Large gaps tied to stable scenario families, such as the persistent difficulty of replacement-style tasks or the near-saturation of the already-done and duplicate-avoidance slices, are more important to the paper's argument than tiny differences between adjacent aggregate rankings. Future versions of the benchmark can add repeated-run variance estimation or bootstrap intervals over tasks, but that analysis is not part of the present submission.
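For readers who want to anticipate that analysis, a task-level percentile bootstrap over per-task pass/fail outcomes is the natural first step; a minimal sketch, assuming binary outcomes over the fixed task snapshot:

```python
import random

def bootstrap_accuracy_ci(passed, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for full-pass accuracy over a fixed task set.

    `passed` is a list of 0/1 outcomes, one per task. Resampling tasks
    with replacement estimates how sensitive the aggregate accuracy is
    to the particular task draw.
    """
    rng = random.Random(seed)
    n = len(passed)
    stats = sorted(
        sum(rng.choices(passed, k=n)) / n for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```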
### D.4 Compute Resources
The reported experiments do not train new models. The local benchmark workload is therefore orchestration-heavy rather than training-heavy: task reset, routed command execution, evaluator-state construction, and result aggregation all run through the ClawForge runtime in multi mode with full interaction history and a 25-step budget. In this setup, the benchmark runner itself is CPU-oriented and does not require local GPU execution for the paper's reported results.
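In outline, each episode follows a reset, act, and evaluate loop under the fixed step budget. The sketch below conveys only that orchestration shape; `reset_task`, `route_command`, `build_evaluator_state`, and `evaluate` are hypothetical placeholder interfaces, not the runtime's actual API.

```python
MAX_STEPS = 25  # per-episode interaction budget reported in the paper

def run_episode(task, agent, runtime):
    """One orchestrated benchmark episode; all interface names are illustrative."""
    state = runtime.reset_task(task)   # seed files, board, calendar, cron, inbox
    history = []                       # full interaction history kept across steps
    for _ in range(MAX_STEPS):
        command = agent.next_command(task.instruction, history)
        if command is None:            # agent declares the workflow closed
            break
        observation = runtime.route_command(state, command)
        history.append((command, observation))
    # Evaluation is over normalized end state and observable side effects,
    # not exact trajectory matching.
    end_state = runtime.build_evaluator_state(state)
    return runtime.evaluate(task, end_state), history
```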
## Appendix E: Broader Impact and Release
### E.1 Broader Impact
**Positive impact.** This work improves measurement quality for interactive agents. Many real-world failures arise from state conflicts rather than one-shot misunderstanding. By explicitly capturing failure modes such as duplication, stale state, incorrect branching, and incomplete workflow closure, the benchmark enables more precise diagnosis of where agents remain unreliable.

**Risks.** Improved benchmark coverage may indirectly accelerate deployment of agents in coordination, operations, or messaging workflows, where incorrect execution can incur real-world cost. Benchmark progress may also be misinterpreted as readiness for broader automation beyond the evaluated scenarios.

**Misuse and mitigation.** Workflow benchmarks can be used to optimize agents that manipulate calendars, messages, files, or system state, including in settings with limited oversight. To mitigate this, the benchmark is built on synthetic or seeded environments, emphasizes transparent reporting of failure modes, and frames evaluation as a diagnostic tool rather than evidence of safe autonomous deployment.
### E.2 Asset Credits and Terms
The paper uses two main classes of existing assets: prior benchmark literature and externally served foundation-model endpoints. Prior interactive-benchmark and agent-evaluation papers are cited in the introduction, method, experiment, and related-work sections. The evaluated models are likewise credited by provider and citation: Kimi-K2.5 (Team et al., [2026](https://arxiv.org/html/2605.14133#bib.bib32)) to the Kimi team, GPT (OpenAI, [2026](https://arxiv.org/html/2605.14133#bib.bib35)) to OpenAI, and Claude Code (Anthropic, [2024](https://arxiv.org/html/2605.14133#bib.bib86)) models to Anthropic.
These evaluated models are not treated as open-weight assets in this paper. They are proprietary provider endpoints accessed under the corresponding provider's service terms rather than redistributed model weights. The paper therefore makes a sharper distinction than many open-model evaluations: for these systems, the relevant usage condition is provider-controlled API or hosted-access terms, not an open-source model license embedded in the benchmark package itself.

The new benchmark artifacts described in this paper are ClawForge task specifications, seeded state, evaluator contracts, and reporting views. They are documented as new assets of the study rather than borrowed benchmark files relabeled as original contributions. The paper does not claim to redistribute third-party model weights, proprietary datasets, or third-party application data snapshots.
### E.3 Open-Access Release Status
This submission documents the benchmark protocol, task structure, evaluator contract, and reporting methodology in enough detail to support scientific interpretation, but it does not attach an anonymized open-access release bundle for the benchmark snapshot and code at submission time. That is a release-status fact, not a hidden omission.

The practical reason is that a faithful release package for this benchmark is more than a PDF appendix: it must include task specifications, state-initialization assets, command-routing runtime code, evaluator definitions, and clear instructions for running the shared hard snapshot against supported model providers. The paper already documents those components conceptually, but the anonymized distribution bundle is not part of the present submission package.
## Appendix F: Full Scenario Catalog
### F.1 Scenario Example Directory
Table 9: Directory of the full scenario example catalog.
### F.2 Full Scenario Example Catalog
The following catalog uses one concrete representative task instance for every released scenario family. Each box is grounded in one actual `hard_decision_workflow_xxx` specification rather than in a scenario-level summary, so the instruction wording, seeded state, representative commands, and evaluator-visible closure conditions all come from a real released task.
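To fix intuitions before the boxes, a specification of this kind can be pictured roughly as follows, using the Example A instance below. This is an illustrative Python rendering only; the field names and the seeded email subject are assumptions for exposition, not the released file format.

```python
# Illustrative shape of a hard_decision_workflow task specification.
# Field names (and the email subject) are assumptions, not the released schema.
example_spec = {
    "task_id": "hard_decision_workflow_1",
    "instruction": "The budget email thread in the inbox needs attention today. ...",
    "seeded_state": {
        # Pre-existing artifacts the agent must inspect before acting.
        "tasks": [{"title": "Berlin budget follow-up", "status": "pending"}],
        "emails": [{"id": "email_seed_3", "subject": "Budget thread"}],
    },
    "reference_trajectory": [
        "email search --query 'budget'",
        "email read --id email_seed_3",
        # ...
    ],
    "evaluator": {
        # Closure conditions checked on the normalized end state.
        "must_exist": [
            "file:/handoff/berlin-budget-followup.txt",
            "email_sent:alice@example.com",
        ],
        "must_not_exist": ["duplicate_task:Berlin budget follow-up"],
    },
}
```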
**Example A: Inbox (hard_decision_workflow_1)**

**Representative instance.** `hard_decision_workflow_1` asks: "The budget email thread in the inbox needs attention today. Then check the Berlin board and calendar; right now the next step is already on the board, and if it only needs async follow-up, leave the calendar alone, so fill only what is missing, note it in a handoff file, and send alice@example.com a quick update."

**Initial state.** The seeded board already contains the pending task *Berlin budget follow-up*. The task instruction also implies that the follow-through should stay async if the existing calendar state already supports that branch, so the missing work is the handoff note plus the external update rather than a new calendar block.

**Required correction.** Read the budget thread, verify that the staged Berlin task can stay in place, avoid adding a duplicate task or review event, create the handoff note under `/handoff`, and send the requested email update to alice@example.com.

**Representative actions.**

1. `email search --query 'budget'`
2. `email read --id email_seed_3`
3. `tasks list --status pending`
4. `calendar list`
5. `calendar today --timezone Europe/Berlin`
6. `file create --path '/handoff/berlin-budget-followup.txt' --content 'Berlin budget follow-up handoff note.'`
7. `email send --to alice@example.com --subject 'Budget follow-up for Berlin' --body 'Tracking the next step for Berlin.'`

**Evaluation outcome.** The evaluator accepts this rollout only if the handoff file is created, the update email is sent, the pre-existing Berlin follow-up task remains sufficient, and no duplicate task or calendar event is introduced. This makes the task a real inbox-grounded completion problem rather than a generic email summary.
**Example B: Release Recovery Runbook (hard_decision_workflow_7)**

**Representative instance.** `hard_decision_workflow_7` asks: "Please tighten up the Berlin release review. Check what's already on the board and calendar, use anthropic/claude-opus-4-6, refresh the handoff file, and add only the missing review slot or decision-log task."

**Initial state.** The override layer already stages a pending Berlin release next-step task, so the workflow is not empty. What is missing is the corrected review configuration under the required model path together with a refreshed handoff artifact.

**Required correction.** Confirm the current model, switch to `anthropic/claude-opus-4-6`, preserve the existing Berlin release next step, refresh the handoff file, and add only the missing Berlin review slot rather than recreating the release workflow from scratch.

**Representative actions.**

1. `openclaw config get agent.model`
2. `tasks list --status pending`
3. `calendar list`
4. `calendar today --timezone Europe/Berlin`
5. `openclaw models set anthropic/claude-opus-4-6`
6. `file create --path '/ops/release-handoff.txt' --content 'Berlin release handoff notes.'`
7. `calendar add-event --title 'Berlin release review' --start 2026-03-10T09:00`

**Evaluation outcome.** A passing rollout must visibly set the model to the required Opus path, create the handoff file, keep the existing decision-log reminder and board task intact, and place the Berlin review event on the calendar. The task fails if the model switch is skipped or if valid staged state is rebuilt unnecessarily.
**Example C: Channel Incident Recovery (hard_decision_workflow_13)**

**Representative instance.** `hard_decision_workflow_13` asks: "Please get the incident update to #general on discord, check the incident next step on the board first to see whether it is already there, add only the missing piece, and send alice@example.com a recap when it's handled."

**Initial state.** The board already contains the pending task *Incident escalation follow-up*. The missing work is therefore the externally visible incident post plus the recap email, not another board task.

**Required correction.** Audit the task board first, post the incident escalation update to Discord `#general`, avoid recreating the follow-up task that already exists, and send the requested recap to alice@example.com.

**Representative actions.**

1. `openclaw security audit`
2. `tasks list --status pending`
3. `openclaw channels login --channel discord`
4. `openclaw message send --channel discord --target #general --message 'Incident escalation started. Please acknowledge.'`
5. `openclaw channels list --json`
6. `email send --to alice@example.com --subject 'Incident escalation recap' --body 'The escalation is posted and being tracked.'`

**Evaluation outcome.** Strict evaluation requires both external side effects: the Discord message to `#general` and the recap email. It also checks that the existing follow-up task remains the only board artifact. This makes the task a clean example of information transfer plus duplicate-aware closure.
**Example D: Daily Operations Commitment Loop (hard_decision_workflow_19)**

**Representative instance.** `hard_decision_workflow_19` asks: "Can you check today's Europe/Berlin schedule for Berlin, look at the next step on the board and the recurring daily check, and only fill the missing piece?"

**Initial state.** The seeded state already contains the *Berlin existing ops next-step task*. No recurring daily-check cron is present yet, so only the recurring commitment is missing.

**Required correction.** Inspect the Berlin schedule, verify the existing next-step task, check the current cron state, and add only the missing recurring daily-check job without touching the valid board state.

**Representative actions.**

1. `weather forecast --location 'Berlin' --days 3`
2. `calendar today --timezone Europe/Berlin`
3. `tasks list --status pending`
4. `openclaw cron list`
5. `openclaw cron add --name hard-ops-01 --cron '0 9 * * *' --message 'Run Berlin daily ops check'`

**Evaluation outcome.** This rollout passes only if the Berlin next-step task remains in place and exactly one new cron job is created for the daily ops check. It fails when the model duplicates board work, adds the wrong recurring artifact, or stops after inspection without committing the missing scheduler state.
**Example E: Release Gate (hard_decision_workflow_43)**

**Representative instance.** `hard_decision_workflow_43` asks: "Please get the Singapore release gate sorted out. Confirm the current model, review the board and calendar first, then update the handoff file, the next step on the board, and the sync on the calendar."

**Initial state.** This instance does not preload the release artifacts through overrides, so the full gate closure must be assembled from inspection plus creation. The hidden constraints require a specific model path, a handoff file, a release-gate follow-through task, and a matching sync.

**Required correction.** Confirm the current model, inspect board and calendar state, switch to `anthropic/claude-3-7-sonnet-latest`, create the release handoff file, create the Singapore release-gate task, and place the release-gate sync on the calendar.

**Representative actions.**

1. `openclaw config get agent.model`
2. `tasks list --status pending`
3. `calendar list`
4. `calendar today --timezone Asia/Singapore`
5. `openclaw models set anthropic/claude-3-7-sonnet-latest`
6. `file create --path '/ops/release-handoff.txt' --content 'Singapore release gate handoff notes.'`
7. `tasks add --title 'Singapore release gate follow-through' --priority high --due 2026-03-08`
8. `calendar add-event --title 'Singapore release gate sync' --start 2026-03-10T09:00`
9. `calendar today --timezone Asia/Singapore`

**Evaluation outcome.** Evaluation checks all four closure outputs together: model set, handoff file created, task created, and calendar event created. The task therefore exposes whether the agent can carry a release-gate workflow through to a fully consistent final state rather than stopping after one local fix.
**Example F: Delivery Update (hard_decision_workflow_73)**

**Representative instance.** `hard_decision_workflow_73` asks: "Can you get the outage update out to #launch on discord? Fix the delivery path if needed, check what's already on the board, the next step may already be on the board, so leave it alone if it is there, if the target is shared, leave a short live block on the calendar, and send bob@example.com a quick recap."

**Initial state.** The board already contains *Outage delivery next step*. The required shared-target branch also expects a short live calendar block, but neither the Discord post nor the recap email has been sent yet.

**Required correction.** Leave the staged outage next step alone, post the outage update to Discord `#launch`, add the live follow-up calendar block, and send the recap to bob@example.com.

**Representative actions.**

1. `tasks list --status pending`
2. `openclaw channels list --json`
3. `openclaw channels login --channel discord`
4. `calendar add-event --title 'Outage delivery follow-up' --start 2026-03-12T14:00`
5. `openclaw message send --channel discord --target #launch --message 'Outage update posted and being tracked.'`
6. `email send --to bob@example.com --subject 'Outage delivery recap' --body 'The delivery path is recovered and the update is posted.'`

**Evaluation outcome.** The evaluator requires all three visible side effects while also enforcing that no duplicate board task is created. This task is representative because it mixes messaging, scheduling, and duplicate-aware state preservation inside one delivery workflow.
**Example G: Operations Review (hard_decision_workflow_83)**

**Representative instance.** `hard_decision_workflow_83` asks: "Can you look at the forecast and today's Europe/London calendar for London, review the next step on the board and the recurring daily check, and only add what's missing before you put the review on the calendar?"

**Initial state.** The seeded state already contains the *London existing ops next-step task*. The missing pieces are the recurring daily-check cron and the review event itself, both conditioned on current forecast and London calendar context.

**Required correction.** Read the forecast, inspect the London calendar, confirm the existing ops task, create the missing daily-check cron, and place the *London backup ops review* event on the calendar.

**Representative actions.**

1. `weather forecast --location 'London' --days 1`
2. `calendar today --timezone Europe/London`
3. `openclaw cron list`
4. `tasks list --status pending`
5. `openclaw cron add --name hard-ops-07-01 --cron '15 8 * * 1-5' --message 'Run London daily ops check'`
6. `calendar add-event --title 'London backup ops review' --start 2026-03-10T13:00`

**Evaluation outcome.** A passing rollout must keep the existing task intact while creating both the cron job and the review event. The task fails if the model schedules a review without completing the missing recurring setup or if it adds duplicate board state while trying to be safe.
**Example H: Existing State (hard_decision_workflow_113)**

**Representative instance.** `hard_decision_workflow_113` asks: "The Boston daily ops setup is only partway done. Look over the next step on the board, the review slot on the calendar, and the recurring daily check, then finish just the missing piece and send alice@example.com a short recap."

**Initial state.** The override layer preloads both the *Boston existing ops next-step task* and the *Boston existing ops review block*. The only missing piece is the recurring daily-check cron, after which a recap email must be sent.

**Required correction.** Audit the partially completed Boston ops setup, preserve the existing task and review block, create only the missing cron job, and send the recap to alice@example.com.

**Representative actions.**

1. `openclaw cron list`
2. `tasks list --status pending`
3. `calendar list`
4. `calendar today --timezone America/New_York`
5. `openclaw cron add --name existing-hard-ops-01 --cron '30 9 * * 1-5' --message 'Run Boston daily ops check'`
6. Send the Boston ops recap email to alice@example.com.

**Evaluation outcome.** Strict evaluation checks that the existing task and review block remain present, no duplicate task or calendar event appears, the cron job is created, and the recap email is sent. This makes the example a concrete existing-state completion task rather than a vague recurring-workflow summary.
**Example I: Duplicate Avoidance (hard_decision_workflow_123)**

**Representative instance.** `hard_decision_workflow_123` asks: "Some of the London ops setup is already in place. Check the board, forecast, and calendar, then add a backup review block on the calendar if the forecast looks risky; otherwise add a primary review block on the calendar, and avoid recreating the existing setup."

**Initial state.** The workflow already contains the *London existing ops next-step task* and an active recurring cron `london-existing-daily-ops-check`. The only missing artifact is the review block, whose title depends on the forecast branch.

**Required correction.** Inspect the existing board and cron state, read the London forecast and calendar, choose the backup review branch supported by the task instance, and add only the *London backup ops review block* without recreating the existing setup.

**Representative actions.**

1. `openclaw cron list`
2. `tasks list --status pending`
3. `calendar list`
4. `weather forecast --location 'London' --days 1`
5. `calendar today --timezone Europe/London`
6. `calendar add-event --title 'London backup ops review block' --start 2026-03-20T08:30`

**Evaluation outcome.** The evaluator rewards recognition that most of the workflow already exists. It checks that the board task and daily check stay in place, no duplicate task or cron job is added, and the review event is created with a backup-oriented title.
**Example J: Multi-Source Decision (hard_decision_workflow_133)**

**Representative instance.** `hard_decision_workflow_133` asks: "For Austin, check the review email note, the forecast, and the America/Chicago calendar, decide whether this stays live or shifts async, leave the next step on the board, keep the daily check scheduled, put the review on the calendar if live, and send leadership@example.com an update only if async."

**Initial state.** This Austin instance begins without task-specific overrides. The closure branch must be inferred from the review email note, the forecast, and the America/Chicago calendar, and the live branch requires creating the task, cron job, and calendar review event.

**Required correction.** Read the review email note, inspect the forecast and Chicago calendar, decide that the workflow stays live, create the Austin next-step task, keep the daily check scheduled, and place the *Austin primary ops review* event on the calendar.

**Representative actions.**

1. `weather forecast --location 'Austin' --days 1`
2. `calendar today --timezone America/Chicago`
3. `email search --query 'review'`
4. `email read --id email_seed_5`
5. `openclaw cron list`
6. `tasks list --status pending`
7. `tasks add --title 'Austin ops next step' --priority medium --due 2026-03-08`
8. `openclaw cron add --name multi-source-hard-ops-01 --cron '0 9 * * *' --message 'Run Austin daily ops check'`
9. `calendar add-event --title 'Austin primary ops review' --start 2026-03-10T09:00`

**Evaluation outcome.** The evaluator requires task creation, cron creation, and review-event creation together. A rollout that sends an async email instead of committing to the live branch, or that keeps inspecting without deciding, still fails even if the intermediate reasoning looks plausible.
**Example K: State Repair (hard_decision_workflow_153)**

**Representative instance.** `hard_decision_workflow_153` asks: "The Boston release setup has something stale in it. Review the next-step task on the board and the sync on the calendar, repair the stale next step, refresh the handoff file, and leave the next step and sync correctly set."

**Initial state.** The seeded state contains a stale low-priority task, *Existing Boston release follow-up*, under the `hard_wrong_task` setup. The correct release follow-through, handoff refresh, and Boston release sync do not yet exist.

**Required correction.** Detect that the seeded Boston follow-up is stale, refresh the handoff file, complete the outdated task, create the corrected high-priority task *Boston release follow-through*, and add the *Boston release sync* calendar event.

**Representative actions.**

1. `tasks list --status pending`
2. `calendar list`
3. `calendar today --timezone America/New_York`
4. `file create --path '/ops/release-handoff.txt' --content 'Boston refreshed release handoff notes.'`
5. `tasks complete --title 'Existing Boston release follow-up'`
6. `tasks add --title 'Boston release follow-through' --priority high --due 2026-03-16`
7. `calendar add-event --title 'Boston release sync' --start 2026-03-19T15:00`

**Evaluation outcome.** This rollout passes only if the stale object is actually retired and both replacement artifacts are created. It fails when the model notices the conflict but leaves the outdated task in place, adds the new task without cleanup, or stops after a partial repair and behaves as if the workflow were complete.
**Example L: Completion Gap (hard_decision_workflow_199)**

**Representative instance.** `hard_decision_workflow_199` asks: "Part of the Seattle release work is already handled, but the next step is the missing piece. Check what's there, read the release notes, refresh the handoff file, and leave the sync on the calendar in place and add only the missing next step on the board."

**Initial state.** The override layer preloads the review note `/ops/release-review.txt` and an existing Seattle release sync on the calendar. What is missing is the board next step plus the refreshed handoff file.

**Required correction.** Read the drafted release notes, preserve the existing Seattle sync, create the handoff file, and add only the missing high-priority task *Seattle release next step* on the board.

**Representative actions.**

1. `openclaw config get agent.model`
2. `tasks list --status pending`
3. `calendar list`
4. `calendar today --timezone America/Los_Angeles`
5. `file read --path '/ops/release-review.txt'`
6. `file create --path '/ops/release-handoff.txt' --content 'Seattle completion-gap release handoff notes.'`
7. `calendar list --from 2026-03-10 --to 2026-03-10`
8. `tasks add --title 'Seattle release next step' --priority high --due 2026-03-08`

**Evaluation outcome.** Evaluation checks exactly that structure: the handoff file must be created, the release sync must remain an already-existing artifact, no duplicate calendar event may be created, and the new task must appear. This is a real gap-completion example where the missing piece is the board task rather than the calendar sync.
**Example M: Branch Resolution (hard_decision_workflow_211)**

**Representative instance.** `hard_decision_workflow_211` asks: "The Sydney ops setup already has part of the work in place. Please review the forecast, today's Australia/Sydney calendar, the task already on the board, and the recurring daily check schedule before working out whether this stays live or moves async, and send bob@example.com the update if it moves async."

**Initial state.** The seeded state already contains the *Sydney existing ops next-step task* and an active daily-check cron. The task instance resolves to the async branch, so the live review path should not be scheduled.

**Required correction.** Inspect the Sydney forecast, calendar, board task, and cron state; decide that the workflow moves async; preserve the existing task and daily check; and send the backup-plan email to bob@example.com instead of creating a calendar review block.

**Representative actions.**

1. `weather forecast --location 'Sydney' --days 1`
2. `calendar today --timezone Australia/Sydney`
3. `openclaw cron list`
4. `tasks list --status pending`
5. Send the Sydney backup-plan email to bob@example.com.

**Evaluation outcome.** The evaluator checks the preserved task and cron state, the absence of duplicate board or cron artifacts, the email side effect, and the absence of a duplicate calendar event. This makes the task a concrete branch-resolution example instead of a generic live-vs-async summary.
**Example N: Already-Done Skip (hard_decision_workflow_251)**

**Representative instance.** `hard_decision_workflow_251` asks: "The New York release setup should already be in place. Please verify the current model, the next-step task on the board, and the sync on the calendar before you send finance@example.com a short recap and leave it alone if it is already right."

**Initial state.** Both key artifacts are already present through overrides: *New York existing release next step* on the board and *New York existing release sync* on the calendar. The only remaining obligation is to verify the state and send the finance recap.

**Required correction.** Confirm the current model, inspect the board and calendar, recognize that no repair is needed, and send the requested recap to finance@example.com without creating any new task or event.

**Representative actions.**

1. `openclaw config get agent.model`
2. `tasks list --status pending`
3. `calendar list`
4. `calendar today --timezone America/New_York`
5. Send the New York release recap email to finance@example.com.

**Evaluation outcome.** This task passes only when the model resists unnecessary mutation. The evaluator checks the model setting, the existing task and sync, the recap email, and the absence of duplicate task or calendar creations. It is therefore a real recognition-and-restraint task, not an empty no-op.
**Example O: Wrong-State Replacement (hard_decision_workflow_261)**

**Representative instance.** `hard_decision_workflow_261` asks: "Something in the Seattle release setup was staged wrong. Check whether the next step is stale, replace the stale next step with the correct version, and leave the next step and sync correctly set."

**Initial state.** Under `hard_wrong_task`, the workflow starts with *Existing Seattle release follow-up*, a low-priority stale task that should not survive in the final state. The replacement task and replacement sync are both still missing.

**Required correction.** Inspect the seeded Seattle task, determine that it is stale, complete it, create the corrected task *Seattle release replacement next step*, and add the matching *Seattle replacement release sync* event.

**Representative actions.**

1. `tasks list --status pending`
2. `calendar list`
3. `calendar today --timezone America/Los_Angeles`
4. `tasks complete --title 'Existing Seattle release follow-up'`
5. `tasks add --title 'Seattle release replacement next step' --priority high --due 2026-03-09`
6. `calendar add-event --title 'Seattle replacement release sync' --start 2026-03-10T13:00`

**Evaluation outcome.** Strict evaluation requires both retirement of the stale task and creation of the corrected replacement state. Adding the replacement without cleanup, or cleaning up without recreating the follow-through, still fails. That is why this slice remains one of the hardest state-conflict settings in the benchmark.
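A minimal sketch of the strict closure predicate this slice implies, written against a hypothetical normalized end state that exposes pending task titles and calendar event titles as sets:

```python
def wrong_state_replacement_passes(end_state):
    """Illustrative strict check for hard_decision_workflow_261.

    All three conditions must hold together: the stale task is retired
    AND the corrected replacement task exists AND the replacement sync
    is on the calendar. The end-state field names are assumptions.
    """
    pending = end_state["pending_task_titles"]    # set of open task titles
    events = end_state["calendar_event_titles"]   # set of calendar event titles
    stale_retired = "Existing Seattle release follow-up" not in pending
    replacement_created = "Seattle release replacement next step" in pending
    sync_created = "Seattle replacement release sync" in events
    return stale_retired and replacement_created and sync_created
```

The conjunction is the point: under strict scoring, retiring the stale task alone or creating the replacement alone cannot satisfy the check.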
**Example P: Interrupted Workflow Resume (hard_decision_workflow_307)**

**Representative instance.** `hard_decision_workflow_307` asks: "The New York release work already has pieces in place. Check the current model, the next step on the board, and the sync on the calendar, finish the missing handoff file, send manager@example.com a recap, and leave the existing setup alone."

**Initial state.** The seeded release state already contains *New York existing release next step*, *New York existing release sync*, and drafted notes in `/ops/release-review.txt`. The only missing artifact is the handoff file, after which a recap email is required.

**Required correction.** Resume from the partially completed New York release setup, read the existing review notes, create the missing handoff file, preserve the staged task and sync, and send the recap to manager@example.com.

**Representative actions.**

1. `openclaw config get agent.model`
2. `tasks list --status pending`
3. `tasks search --query 'New York release'`
4. `calendar list`
5. `calendar today --timezone America/New_York`
6. `file read --path '/ops/release-review.txt'`
7. `file create --path '/ops/release-handoff.txt' --content 'New York resumed release handoff notes.'`
8. Send the New York resume recap email to manager@example.com.

**Evaluation outcome.** The evaluator accepts only resume-style closure: the task and sync must remain existing artifacts, no duplicates may be created, the handoff file must appear, and the recap email must be sent. Restarting the workflow from scratch or mutating already-correct state still fails.
**Example Q: Contradictory Source Resolution (hard_decision_workflow_337)**

**Representative instance.** `hard_decision_workflow_337` asks: "Please reconcile the latest review email note, the forecast, and today's Europe/London calendar for London. Leave the next step on the board, keep the recurring daily check scheduled, and either put the review on the calendar or send bob@example.com the async update."

**Initial state.** This London instance begins without seeded overrides, so all action depends on reconciling the latest review email, the forecast, and the current London calendar. The correct branch here is async: the task and recurring check should be created, but no review event should be added.

**Required correction.** Read the latest review email note, inspect the forecast and calendar evidence, create *London contradictory-source next step*, add the recurring contradiction-check cron, and send the async update to bob@example.com while avoiding a calendar review block.

**Representative actions.**

1. `email search --query 'review'`
2. `email read --id email_seed_5`
3. `weather forecast --location 'London' --days 1`
4. `calendar today --timezone Europe/London`
5. `calendar list --from 2026-03-01 --to 2026-03-01`
6. `tasks list --status pending`
7. `tasks add --title 'London contradictory-source next step' --priority high --due 2026-03-11`
8. `openclaw cron add --name contradictory-source-01 --cron '0 18 * * 1-5' --message 'Run London daily contradiction check'`
9. Send the London async update email to bob@example.com.

**Evaluation outcome.** The evaluator checks exactly those branch-specific side effects: task created, cron created, email sent, and no duplicate calendar event. This makes the example a genuine contradictory-evidence resolution task rather than a generic multi-source lookup problem.