STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios


Summary

This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based computing environments, enabling scalable, state-based evaluation of LLM-powered agents.

arXiv:2606.10394v1

# Automated State-based Agent Benchmarking for Realistic Scenarios
Source: [https://arxiv.org/html/2606.10394](https://arxiv.org/html/2606.10394)
Sirui Liang¹,³,⁴,⁶, Bohan Yu¹,²,⁴,⁶, Peiyu Wang¹,³,⁴, Shiguang Guo⁶, Wenxing Hu⁶, Pengfei Cao¹,³, Jian Zhao⁴,⁵, Cao Liu⁶, Ke Zeng⁶, Xunliang Cai⁶, Kang Liu¹,³

¹ The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation; ² School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; ³ Chinese Academy of Sciences, Beijing, China, University of Chinese Academy of Sciences; ⁴ Zhongguancun Academy; ⁵ Zhongguancun Institute of Artificial Intelligence; ⁶ Meituan

{liangsirui2024, yubohan2025}@ia.ac.cn, kliu@nlpr.ia.ac.cn

###### Abstract

Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse scoring, which hinder scalability and limit progress toward reliable personal-agent evaluation. This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based personal-computing environments. Given a task hint, STAGE-Claw automatically creates and validates a realistic benchmark task with its environment, task prompts, ground truth, and related verification programs. Agents are then evaluated in realistic operating environments, where performance is measured by the correctness of the final system state rather than only the textual response. Using STAGE-Claw, this paper creates a benchmark with 40 challenging real-scenario agent tasks, evaluates 11 frontier models, and analyzes their task scores, costs, tool-call reliability, and common failure patterns. Overall, STAGE-Claw offers a scalable, state-based way to evaluate agents in realistic user scenarios. Code is available [here](https://github.com/LiangThree/STAGE-Claw.git).

Sirui Liang and Bohan Yu are co-first authors and contributed equally to this work. Kang Liu is the corresponding author.

![Refer to caption](https://arxiv.org/html/2606.10394v1/x1.png)

Figure 1: The workflow of STAGE-Claw. 1) Benchmark authoring: explore a task hint and generate the task. 2) Benchmark validation: check task correctness, difficulty, and reproducibility, revising if needed. 3) Agent execution: the target agent attempts the task in the environment. 4) State-based evaluation: score results by verifying system state.

## 1 Introduction

Large language models (LLMs) are increasingly being adopted as the reasoning backbones of autonomous agents (Wang et al., [2024](https://arxiv.org/html/2606.10394#bib.bib1); Xi et al., [2025](https://arxiv.org/html/2606.10394#bib.bib2); Yao et al., [2022](https://arxiv.org/html/2606.10394#bib.bib3)), as exemplified by systems such as Claude Code (Anthropic, [2025](https://arxiv.org/html/2606.10394#bib.bib4)) and OpenClaw (OpenClaw, [2026](https://arxiv.org/html/2606.10394#bib.bib5)), which augment LLMs with tool interfaces, execution environments, memory mechanisms, and control logic. This shift changes evaluation (Liu et al., [2024](https://arxiv.org/html/2606.10394#bib.bib10); Ye et al., [2026](https://arxiv.org/html/2606.10394#bib.bib14); Li et al., [2026](https://arxiv.org/html/2606.10394#bib.bib15)): agents must not only answer textual prompts but also plan over multiple steps, coordinate heterogeneous tools, and interact with environments. For example, when an agent is integrated with email, calendars, files, browsers, and other everyday applications (OpenClaw, [2026](https://arxiv.org/html/2606.10394#bib.bib5)), benchmarks must measure reliable action in persistent, cross-tool user scenarios (Li et al., [2026](https://arxiv.org/html/2606.10394#bib.bib15)).

Recent agent benchmarks have moved beyond text-only evaluation, covering tool use and multi-step reasoning, web instruction following, desktop interaction, and visually grounded web tasks (Mialon et al., [2024](https://arxiv.org/html/2606.10394#bib.bib7); Deng et al., [2023](https://arxiv.org/html/2606.10394#bib.bib8); Xie et al., [2024](https://arxiv.org/html/2606.10394#bib.bib9); Zhou et al., [2024](https://arxiv.org/html/2606.10394#bib.bib11); Koh et al., [2024](https://arxiv.org/html/2606.10394#bib.bib12)). However, they remain limited in three crucial aspects. First, most existing evaluations replace real application state with sandboxed artifacts. For example, PinchBench (Kilo AI, [2026](https://arxiv.org/html/2606.10394#bib.bib13)) evaluates calendar scheduling through generated .ics files and email-related tasks through synthetic inboxes stored as workspace text files. This file-based formulation simplifies the evaluation of task completion, but it can miss failures that arise in real scenarios, such as software-permission and tool-access errors. Consequently, it mainly evaluates artifact generation rather than an agent's ability to interact with the applications. Second, existing evaluation tasks are typically constructed manually and are therefore difficult to scale. GAIA (Mialon et al., [2024](https://arxiv.org/html/2606.10394#bib.bib7)) and Claw-Eval (Ye et al., [2026](https://arxiv.org/html/2606.10394#bib.bib14)) rely on fixed question-answer instances or human-verified tasks and rubrics. However, personal agents must adapt to diverse user preferences, goals, workflows, and evolving contexts, requiring evaluation tasks that scale across personalized and dynamic scenarios. Such coverage is difficult and costly to achieve with manually curated static benchmarks. Third, existing evaluations often lack process-aware diagnostics. Scoring or checking only the final result (Ma et al., [2024](https://arxiv.org/html/2606.10394#bib.bib17); Trivedi et al., [2024](https://arxiv.org/html/2606.10394#bib.bib18)) may fail to diagnose where errors occur within the completion workflow. For instance, an incorrect calendar operation in a conference-tracking task may result from errors in intermediate steps, such as time-zone conversion, conflict resolution, or reconciling inconsistent message sources, which final-result-only checkers cannot localize. Overall, these limitations motivate a state-based and scalable evaluation paradigm that assesses agents in realistic scenarios.

To address these limitations, this paper proposes STAGE-Claw (State-based, Tool-integrated, Agent task Generation and Evaluation), an automated framework for constructing and evaluating agent benchmarks in realistic environments. STAGE-Claw addresses these challenges from three aspects. State-based evaluation verifies whether an agent's actions produce the expected state changes in the environment, rather than merely checking outputs or artifacts. Automated construction generates and validates benchmark instances automatically, and process-aware diagnosis analyzes fine-grained metrics to localize the reasons for task failure. As shown in Figure [1](https://arxiv.org/html/2606.10394#S0.F1), STAGE-Claw automatically creates task instances from a task hint, validates their verifiability, difficulty, and reproducibility, executes target agents in the generated environments, and evaluates their performance by verifying persistent system-state changes across tools. This design shifts the evaluation from final-artifact checking to state-based assessment of agent behavior in realistic scenarios. We build 40 challenging tasks and conduct a detailed analysis of the results of 11 frontier models. Overall, our contributions are summarized as follows:

- This paper proposes STAGE-Claw, a framework that systematically automates the construction and validation of state-based agent evaluation instances in realistic scenarios.
- Using STAGE-Claw, we build a state-based benchmark of 40 challenging tasks grounded in five groups of realistic scenarios, covering workflows that require cross-source reasoning, tool state updates, and cross-tool consistency.
- We evaluate 11 frontier models on STAGE-Claw and conduct a detailed analysis of the task trajectories and results, which contributes insights for developing reliable, state-based, and extensible agent evaluation systems.

## 2 STAGE-Claw

In this section, we introduce STAGE-Claw, a four-stage automated framework for constructing and evaluating state-based agent benchmarks.

#### Formalized Definition\.

Each benchmark instance is formulated as a state\-transformation problem over a reconstructable real environment:

$$\mathcal{B} = (q, E_0, G, R, V),$$

where $q$ is a task prompt, $E_0$ is an initial tool environment, $G$ specifies a target final state, $R$ is a scoring rubric, and $V$ is an executable verifier. An agent observes only $(q, E_0)$ at the beginning, while $(G, R, V)$ is reserved for evaluation to prevent information leakage. Given an agent policy $\pi$, execution from the initial state $\boldsymbol{s}_0$ produces a trajectory

$$\tau_{\pi} = (\boldsymbol{s}_0, o_0, a_0, \ldots, a_{H-1}, o_H, \boldsymbol{s}_H),$$

where $o_t$ is the environment observation, $a_t$ is the agent's action for tool use, and $\boldsymbol{s}_H$ is the final environment state. The agent succeeds not by merely producing a final textual answer, but by transforming $E_0$ into a final state that satisfies $G$, with the verifier $V$ used to evaluate its correctness.

#### Implementation Environment\.

STAGE-Claw evaluates agents in real computing environments, where user requests often span multiple applications. The environment can be regarded as the collection of the states of all tools. The initial environment is $E_0 = (\mathcal{T}, \boldsymbol{s}_0)$, where $\mathcal{T}$ denotes the set of available tools and $\boldsymbol{s}_0$ denotes their initial joint state. The global environment state at step $t$ is represented as $\boldsymbol{s}_t = \{s_t^{\tau}\}_{\tau \in \mathcal{T}}$, where $s_t^{\tau}$ is the state of tool $\tau$. Agents receive user-level access to realistic tools, including the file system, browser, terminal, calendar, email, reminders, and notes, which support read, write, and execution operations over persistent application states. Tasks are modeled as transitions from the initial state $\boldsymbol{s}_0$ to an acceptable final-state set $\mathcal{S}_G$ specified by a target $G$, and are considered successful when $\boldsymbol{s}_H \in \mathcal{S}_G$, indicating that the agent has produced the required persistent changes while preserving relevant existing state.
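The formal definition above can be made concrete with a minimal Python sketch. It is illustrative only and not the authors' implementation: the class name `BenchmarkInstance`, the toy verifier, and the calendar example are all assumptions introduced here. The joint state is modeled as a dictionary from tool name to that tool's state, and success means the final state falls in the acceptable set defined by the hidden target.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Joint environment state s_t: one entry per tool (tool name -> tool state).
State = Dict[str, Any]


@dataclass
class BenchmarkInstance:
    """Hypothetical container mirroring B = (q, E_0, G, R, V)."""
    prompt: str                      # q: visible to the agent
    initial_state: State             # s_0 inside E_0: visible to the agent
    target: Dict[str, Any]           # G: hidden ground truth
    rubric: Dict[str, float]         # R: hidden scoring rubric
    verifier: Callable[[State, Dict[str, Any]], bool]  # V: hidden check


def toy_verifier(final_state: State, target: Dict[str, Any]) -> bool:
    """Accept s_H only if the required event was added and old events survive."""
    events = set(final_state.get("calendar", []))
    return target["must_add"] in events and set(target["must_keep"]) <= events


def succeeded(instance: BenchmarkInstance, final_state: State) -> bool:
    """Membership test s_H in S_G, delegated to the executable verifier."""
    return instance.verifier(final_state, instance.target)


# Toy instance (assumed example, not one of the 40 benchmark tasks).
instance = BenchmarkInstance(
    prompt="Add the project review to my calendar.",
    initial_state={"calendar": ["standup"], "notes": []},
    target={"must_add": "project review", "must_keep": ["standup"]},
    rubric={"calendar": 1.0},
    verifier=toy_verifier,
)
```

In this sketch, a final state that adds the new event but deletes the existing one is rejected, which mirrors the requirement that the agent preserve relevant existing state.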

![Refer to caption](https://arxiv.org/html/2606.10394v1/fig/task_category_pie.png)

Figure 2: Statistics of task categories.

### 2.1 Stage 1: Benchmark Authoring

The first stage automatically constructs executable benchmark instances from task hints. We manually curate 40 real assistant scenarios as task hints. For each task hint, STAGE-Claw invokes a benchmark-authoring agent to instantiate an executable task. The authoring agent acts only as a benchmark designer: it explores realistic user needs, imagines several challenging scenarios, and then selects a scenario with sufficient complexity. Based on the selected scenario, the authoring agent constructs a task instance containing a task prompt, an environment-construction guide, the corresponding ground truth, and an executable verification program aligned with the ground truth.

To ensure sufficient complexity, each task is designed to be multi-step, involve multiple tools or information sources, be reconstructable from a clean state, and support objective state-based evaluation. We also deliberately add operators of different difficulty types (illustrated in Table [5](https://arxiv.org/html/2606.10394#A1.T5) in Appendix [A.3](https://arxiv.org/html/2606.10394#A1.SS3)) to model the complexity of real-world tasks. In total, we organize the 40 tasks into 5 groups. Figure [2](https://arxiv.org/html/2606.10394#S2.F2) shows the task distribution.

### 2.2 Stage 2: Benchmark Validation

Before admission into the evaluation set, each task instance is checked by an independent validation agent that does not solve or modify the task. The checker verifies four properties:

- Structure: each task includes environment-construction guidance, a visible task prompt, hidden ground truth, and the corresponding executable verification program.
- Reproducibility: the environment is rebuilt twice from a clean state, and snapshots are compared across tool states such as files, calendars, reminders, notes, and email, ensuring that the task environment can be consistently reset to the same initial state.
- Verifiability: the scoring is objective and executable.
- Difficulty calibration: the task includes the intended difficulty types, such as cross-source conflicts, hidden dependencies, noisy data, entity alignment, tool state updates, and cross-tool consistency.
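The reproducibility check (rebuild twice, compare snapshots) can be sketched as follows. This is a minimal illustration under stated assumptions: snapshots are taken to be JSON-serializable dictionaries, and `snapshot_digest`, `is_reproducible`, and the two toy builders are hypothetical names, not part of STAGE-Claw.

```python
import hashlib
import itertools
import json
from typing import Any, Callable, Dict

Snapshot = Dict[str, Any]  # assumed shape: tool name -> serializable tool state


def snapshot_digest(snapshot: Snapshot) -> str:
    """Hash a snapshot in a key-order-independent way."""
    canonical = json.dumps(snapshot, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_reproducible(build_environment: Callable[[], Snapshot]) -> bool:
    """Rebuild the environment twice and compare the resulting snapshots."""
    first = snapshot_digest(build_environment())
    second = snapshot_digest(build_environment())
    return first == second


def stable_builder() -> Snapshot:
    """Toy builder that always yields the same initial state."""
    return {"files": {"todo.txt": "review draft"}, "calendar": ["standup"]}


_build_counter = itertools.count()


def flaky_builder() -> Snapshot:
    """Toy builder whose output changes on every rebuild (should fail the check)."""
    return {"files": {"todo.txt": f"draft {next(_build_counter)}"}}
```

Hashing a canonical serialization keeps the comparison cheap; a real checker would also need to normalize volatile fields such as timestamps before comparing.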

Each dimension receives a status of Pass, Fail, Blocked, or Warn. These statuses are aggregated into a weighted 100-point checker score. Instances scoring above a threshold (80 in our experiments) are accepted. Failed instances are returned to Stage 1 with diagnostics for targeted repair and re-submitted until they pass or reach the maximum number of repair attempts.
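The aggregation and repair loop described above might look like the sketch below. The paper gives the statuses, the 100-point scale, and the threshold of 80, but not the weights or the credit per status, so `WEIGHTS`, `STATUS_CREDIT`, and `max_repairs` are assumptions chosen only for illustration.

```python
from typing import Any, Callable, Dict, Optional

# Assumed values: the source does not specify weights or per-status credit.
WEIGHTS = {"structure": 25, "reproducibility": 25,
           "verifiability": 25, "difficulty": 25}      # sums to 100
STATUS_CREDIT = {"Pass": 1.0, "Warn": 0.5, "Fail": 0.0, "Blocked": 0.0}
THRESHOLD = 80  # from the text: instances scoring above 80 are accepted

Statuses = Dict[str, str]


def checker_score(statuses: Statuses) -> float:
    """Aggregate per-dimension statuses into a weighted 100-point score."""
    return sum(WEIGHTS[dim] * STATUS_CREDIT[statuses[dim]] for dim in WEIGHTS)


def accepted(statuses: Statuses) -> bool:
    return checker_score(statuses) > THRESHOLD


def author_with_repair(
    author: Callable[[Optional[Statuses]], Any],
    validate: Callable[[Any], Statuses],
    max_repairs: int = 3,
) -> Optional[Any]:
    """Author a task, then repair it with checker diagnostics until accepted."""
    diagnostics: Optional[Statuses] = None
    for _ in range(max_repairs + 1):
        task = author(diagnostics)       # first call: no diagnostics yet
        diagnostics = validate(task)
        if accepted(diagnostics):
            return task
    return None                          # repair budget exhausted
```

With these assumed weights, a single Warn still passes (87.5 points) while a single Fail does not (75 points); other weightings would shift that boundary.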

Columns: State-based, Multi-tool, Auditable, Auto Construction, Perturbation.

- AgentBench (Liu et al., [2024](https://arxiv.org/html/2606.10394#bib.bib10)): ✗ ✗
- GAIA (Mialon et al., [2024](https://arxiv.org/html/2606.10394#bib.bib7)): ✗ ✗ ✗ ✗
- τ-bench (Yao et al., [2024](https://arxiv.org/html/2606.10394#bib.bib24)): ✓ ✓ ✓ ✗ ✗
- WebArena (Zhou et al., [2024](https://arxiv.org/html/2606.10394#bib.bib11)): ✓ ✓ ✗
- OSWorld (Xie et al., [2024](https://arxiv.org/html/2606.10394#bib.bib9)): ✓ ✓ ✓ ✗
- ToolBench (Qin et al., [2024](https://arxiv.org/html/2606.10394#bib.bib32)): ✗ ✓ ✓ ✗
- Terminal-Bench (Merrill et al., [2026](https://arxiv.org/html/2606.10394#bib.bib23)): ✓ ✓ ✗ ✗
- PinchBench (Kilo AI, [2026](https://arxiv.org/html/2606.10394#bib.bib13)): ✗ ✗
- Claw-Eval (Ye et al., [2026](https://arxiv.org/html/2606.10394#bib.bib14)): ✓ ✓ ✓ ✓
- STAGE-Claw (Ours): ✓ ✓ ✓ ✓ ✓

Table 1: Comparison of agent evaluation benchmarks along the dimensions targeted by STAGE-Claw. State-based checks explicit environment states; Multi-tool requires coordinated tool use; Auditable supports traces, snapshots, rubrics, and checkers; Auto Construction enables automatic or programmatic task construction; and Perturbation introduces controlled noise, conflicts, or errors. Green checks, yellow circles, and red crosses denote full, partial, and no core support.

### 2.3 Stage 3: Agent Execution

This stage involves testing the qualified tasks\. For each validated benchmark, STAGE\-Claw resets the initial state, reconstructs the environment from the environment\-construction guide, and creates an isolated execution workspace containing only the visible task prompt, related documents, and tools\. The evaluated agent must complete the task within a predefined time budget using only the available instructions and tools\. During execution, it may interpret requirements, interact with tools and update tool states, inspect intermediate results, and produce final outputs\. Once the agent finishes, STAGE\-Claw records execution metadata, including run status, elapsed time, and tool\-interaction traces\. It also captures a snapshot of the final environment\. These recorded states are sent to the evaluation stage, where the evaluator assesses whether the agent’s outputs and final\-state record satisfy the task requirements\.

### 2.4 Stage 4: State-Based Evaluation

The final stage performs a state-based evaluation after the evaluated agent completes the task execution. After execution, STAGE-Claw runs the task-specific verifier to check whether the recorded states align with the hidden ground truth $G$. The evaluator checks file output, tool-state updates, and formatting constraints, then returns a structured evaluation report. LLM-assisted adjudication is used only as a guarded fallback. It is triggered when executable verification fails to execute, times out, or cannot access a required snapshot field because of evaluator-side tool instability. All evaluation paths enforce a unified total-score format for automatic parsing and aggregation. Finally, STAGE-Claw records the agent score, execution status, and evaluation metadata and links them with execution logs to support detailed analysis.
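The guarded fallback can be sketched as a thin wrapper around the verifier. This is an assumption-laden illustration: the function names, the exception types used to signal verifier-side failure, and the report fields are hypothetical. The key property it shows is that a low verifier score never triggers the LLM path; only a verifier that cannot run does.

```python
from typing import Any, Callable, Dict

Snapshot = Dict[str, Any]

# Assumed mapping of the three trigger conditions onto Python exceptions:
# verifier crash (RuntimeError), timeout (TimeoutError), and a missing
# snapshot field (KeyError).
FALLBACK_ERRORS = (RuntimeError, TimeoutError, KeyError)


def evaluate(
    snapshot: Snapshot,
    run_verifier: Callable[[Snapshot], float],
    llm_adjudicate: Callable[[Snapshot], float],
) -> Dict[str, Any]:
    """Prefer the executable verifier; use LLM adjudication only if it cannot run."""
    try:
        score = run_verifier(snapshot)
        path = "verifier"
    except FALLBACK_ERRORS as exc:
        score = llm_adjudicate(snapshot)
        path = f"llm_fallback:{type(exc).__name__}"
    # Unified total-score format shared by both evaluation paths.
    return {"total_score": float(score), "path": path}


def strict_verifier(snapshot: Snapshot) -> float:
    """Toy verifier: full credit only if the expected calendar event exists."""
    return 100.0 if "project review" in snapshot["calendar"] else 0.0
```

Both paths return the same report shape, so downstream aggregation does not need to know which one produced the score.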

Overall, we constructed 40 challenging tasks using STAGE\-Claw\. As summarized in Table[1](https://arxiv.org/html/2606.10394#S2.T1), STAGE\-Claw combines state\-based evaluation, multi\-tool workflows, auditable verification, automated construction, and controlled perturbation in one workflow\. This design makes STAGE\-Claw a scalable state\-based evaluation framework rather than a collection of manually written agent tasks\.

| Model | Score | First-Pass | Time (s) | Tokens (M) | Cost (USD) | Tool Calls |
| --- | --- | --- | --- | --- | --- | --- |
| Claude-Opus-4.7 | 77.1 | 80.0% | 422.8 | 1.242 | 6.55 | 40.6 |
| Claude-Sonnet-4.6 | 69.43 | 65.0% | 733.8 | 1.265 | 3.98 | 42.2 |
| GPT-5.5 | 69.19 | 65.0% | 316.4 | 0.405 | 1.33 | 28.9 |
| GPT-5.4 | 65.37 | 52.5% | 192.5 | 0.236 | 0.69 | 21.8 |
| Gemini-3.1-Pro | 65.5 | 67.5% | 742.4 | 0.642 | 2.65 | 52.6 |
| DeepSeek-V4-Pro | 59.78 | 60.0% | 744.4 | 0.620 | 0.28 | 28.3 |
| Doubao-Seed-2.0-Pro | 61.07 | 50.0% | 325.6 | 0.531 | 0.27 | 28.4 |
| GLM-5 | 67.40 | 60.0% | 423.7 | 0.694 | 0.73 | 40.5 |
| Kimi-K2.6 | 59.05 | 55.0% | 641.3 | 0.921 | 0.93 | 54.9 |
| MiniMax-M2.7 | 47.53 | 35.0% | 937.5 | 0.529 | 0.17 | 34.8 |
| Qwen3.5-Plus | 59.74 | 57.5% | 406.2 | 0.882 | 0.30 | 59.3 |

Table 2:Performance and efficiency comparison of large language models on STAGE\-Claw\. Score is the average valid run task score\. First\-Pass denotes the proportion of tasks scoring above 60 in the first round\. Time, Tokens, Cost, and Tool Calls report average run time, token usage, API cost, and tool invocations of each task\. Bold values indicate the best result for each metric, and underlined values indicate the second\-best result\.
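As a small worked example of the two headline metrics in the caption, the sketch below computes an average score and a First-Pass rate from per-task scores. The threshold of 60 comes from the caption; the function names and the four toy scores are assumptions and are not benchmark data.

```python
from typing import Sequence

PASS_THRESHOLD = 60  # caption: First-Pass counts tasks scoring above 60


def average_score(scores: Sequence[float]) -> float:
    """Mean task score over valid runs."""
    return sum(scores) / len(scores)


def first_pass_rate(scores: Sequence[float],
                    threshold: float = PASS_THRESHOLD) -> float:
    """Fraction of tasks whose first-round score is strictly above the threshold."""
    return sum(1 for s in scores if s > threshold) / len(scores)


# Toy first-round scores for four hypothetical tasks.
toy_scores = [92.0, 60.0, 75.5, 31.0]
```

Note that a score of exactly 60 does not count as a pass under the "above 60" wording. With 40 tasks, a First-Pass rate of 80.0% corresponds to 32 tasks.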

## 3 Evaluation

### 3.1 Task Construction

#### Construction Settings.

The benchmark authoring agent is instantiated with Claude-Sonnet-4.6 (Anthropic, [2026b](https://arxiv.org/html/2606.10394#bib.bib21)). The full prompts are shown in Appendix [C](https://arxiv.org/html/2606.10394#A3). Each candidate task is checked by the checker agent and accepted once its checker score exceeds 80 points, which requires an average of 2.67 repair iterations. All accepted tasks can be verified by executable verification programs without the need for LLM-assisted adjudication. Finally, human annotators check each accepted task for scenario realism, task completeness, instruction clarity, ground-truth correctness, and evaluator-rubric alignment. The annotation details are provided in Appendix [A.7](https://arxiv.org/html/2606.10394#A1.SS7). All task construction, repair, and checker-based validation are fully automated; human involvement is only used for quality review.

#### Construction Budget and Scope\.

Each STAGE-Claw task requires automated construction, verification, evaluation, and human audit. Building an accepted task costs about 35 to 40 USD and takes 1 to 2 hours, excluding model-evaluation runs. Evaluating one model on one task takes 3 to 15 minutes and costs 0.17 to 6.55 USD in API fees. Considering the substantial time and API costs of constructing and evaluating tasks, we build a set of 40 challenging tasks and position it as a high-quality pilot benchmark rather than an exhaustive coverage of all personal-agent scenarios. We evaluate each model on each task once in our experiments and include a limited repeated-run diagnostic in Appendix [A.6](https://arxiv.org/html/2606.10394#A1.SS6). Our goal is to demonstrate the feasibility and diagnostic value of automated state-based evaluation in realistic environments.
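The per-task figures above scale to the full benchmark with simple arithmetic, sketched here. The totals are derived quantities rather than numbers reported in the paper, and the hour range assumes tasks are built one after another.

```python
from typing import Tuple

NUM_TASKS = 40                 # benchmark size, from the text
NUM_MODELS = 11                # evaluated models, from the text
COST_PER_TASK_USD = (35, 40)   # construction cost range per accepted task
HOURS_PER_TASK = (1, 2)        # construction time range per accepted task


def scale_range(bounds: Tuple[int, int], count: int) -> Tuple[int, int]:
    """Multiply a (low, high) per-item range by the number of items."""
    low, high = bounds
    return (low * count, high * count)


construction_cost_usd = scale_range(COST_PER_TASK_USD, NUM_TASKS)
construction_hours = scale_range(HOURS_PER_TASK, NUM_TASKS)  # if built serially
evaluation_runs = NUM_MODELS * NUM_TASKS  # one run per model per task
```

Under these assumptions, construction comes to roughly 1,400 to 1,600 USD and 40 to 80 hours, and a single evaluation pass covers 440 runs.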

### 3.2 Evaluated Models

We evaluated 11 state-of-the-art models spanning multiple model families: Claude-Opus-4.7 (Anthropic, [2026a](https://arxiv.org/html/2606.10394#bib.bib34)), Claude-Sonnet-4.6 (Anthropic, [2026b](https://arxiv.org/html/2606.10394#bib.bib21)), DeepSeek-V4-Pro (DeepSeek, [2026](https://arxiv.org/html/2606.10394#bib.bib22)), Qwen3.5-Plus (Alibaba Cloud, [2026](https://arxiv.org/html/2606.10394#bib.bib25)), GPT-5.5 (OpenAI, [2026a](https://arxiv.org/html/2606.10394#bib.bib33)), GPT-5.4 (OpenAI, [2026b](https://arxiv.org/html/2606.10394#bib.bib26)), Gemini-3.1-pro-preview (Google, [2026](https://arxiv.org/html/2606.10394#bib.bib27)), Doubao-Seed-2.0-Pro (ByteDance Seed, [2026](https://arxiv.org/html/2606.10394#bib.bib28)), GLM-5 (GLM-5-Team et al., [2026](https://arxiv.org/html/2606.10394#bib.bib29)), Kimi-k2.6 (Moonshot AI, [2026](https://arxiv.org/html/2606.10394#bib.bib30)), and MiniMax-M2.7 (MiniMax, [2026](https://arxiv.org/html/2606.10394#bib.bib31)). All models were assessed on the 40 benchmark tasks constructed for this study.

### 3.3 Evaluation Settings

All models were tested within the framework of OpenClaw, with reasoning disabled and temperature set to 0 when configurable \(otherwise using the provider\-fixed temperature\)\. This setting minimizes sampling variability and improves reproducibility across runs\. Each benchmark task was executed in an isolated operating\-system environment, and a freshly initialized OpenClaw agent was instantiated for each run to avoid interference from persistent memory, cached state, or prior context\.

Scores in Table[2](https://arxiv.org/html/2606.10394#S2.T2)show each model’s first valid run per task\. A valid run requires a correctly reconstructed environment and a completed evaluation\. Given that tasks involve interactions with real tools, including calendar events, emails, and reminders, executions were performed serially to avoid conflicts between concurrent tasks\. This setup ensures that each trial reflects the agent’s ability to complete the task based solely on the visible prompt and accessible tool interfaces, providing a controlled and reproducible evaluation environment\.

### 3.4 Main Results

Table[2](https://arxiv.org/html/2606.10394#S2.T2)reports the performance and efficiency of different LLMs in STAGE\-Claw under the OpenClaw framework\. Overall, the results reveal a clear trade\-off between task performance and execution efficiency\.

Model performance and execution efficiency exhibit a clear trade-off. Claude-Opus-4.7 achieves the strongest overall performance, with the highest average score of 77.1 and the highest First-Pass rate of 80.0%, demonstrating the highest reliability among all evaluated models. Claude-Sonnet-4.6 obtains the second-best average score, while Gemini-3.1-Pro achieves the second-best First-Pass rate. However, these strong-performing models generally require higher computational overhead, such as longer execution time, larger token consumption, or higher API cost. In contrast, GPT-5.4 shows the best efficiency profile, achieving the lowest latency, the lowest token usage, and the fewest tool calls while maintaining a competitive score. GPT-5.5 further provides a strong balance between performance and efficiency, approaching Claude-Sonnet-4.6 in score while using substantially fewer resources.
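One way to make the trade-off concrete is to list the models that are not dominated on score and cost together. The sketch below is an illustrative analysis added here, not one performed in the paper: it copies the Score and Cost columns from Table 2 and computes a Pareto front, where a model is dominated if another model is at least as good on both axes and strictly better on one.

```python
from typing import Dict, List, Tuple

# (average score, average API cost per task in USD), copied from Table 2.
TABLE2: Dict[str, Tuple[float, float]] = {
    "Claude-Opus-4.7": (77.1, 6.55),
    "Claude-Sonnet-4.6": (69.43, 3.98),
    "GPT-5.5": (69.19, 1.33),
    "GPT-5.4": (65.37, 0.69),
    "Gemini-3.1-Pro": (65.5, 2.65),
    "DeepSeek-V4-Pro": (59.78, 0.28),
    "Doubao-Seed-2.0-Pro": (61.07, 0.27),
    "GLM-5": (67.40, 0.73),
    "Kimi-K2.6": (59.05, 0.93),
    "MiniMax-M2.7": (47.53, 0.17),
    "Qwen3.5-Plus": (59.74, 0.30),
}


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return models not dominated on (higher score, lower cost), cheapest first."""
    front = []
    for name, (score, cost) in points.items():
        dominated = any(
            s >= score and c <= cost and (s > score or c < cost)
            for other, (s, c) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front, key=lambda n: points[n][1])
```

On these eleven rows the front runs from MiniMax-M2.7 at the cheap end to Claude-Opus-4.7 at the expensive end, so each step up in score along the front costs more per task.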

Tool-use frequency does not directly determine task performance. Models such as Qwen3.5-Plus (59.3 tool calls per task) and Kimi-K2.6 (54.9) invoke tools far more frequently than GPT-5.4 (21.8), yet obtain lower average scores. Conversely, GPT-5.4 completes tasks with the fewest tool calls while outperforming several models with heavier tool usage. This suggests that effective tool use depends more on planning quality, tool selection, and result verification than on the raw number of tool invocations. Excessive tool calls may therefore indicate inefficient coordination rather than stronger task-solving ability.
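The claim can be checked numerically against Table 2 with a Pearson correlation between average tool calls and average score. This is a supplementary sketch, not an analysis from the paper; the helper name `pearson` is an assumption, and with only eleven models the coefficient is a rough indicator at best.

```python
import math
from typing import List, Sequence

# (average tool calls per task, average score), copied row by row from Table 2.
TOOL_CALLS: List[float] = [40.6, 42.2, 28.9, 21.8, 52.6, 28.3,
                           28.4, 40.5, 54.9, 34.8, 59.3]
SCORES: List[float] = [77.1, 69.43, 69.19, 65.37, 65.5, 59.78,
                       61.07, 67.40, 59.05, 47.53, 59.74]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r_tool_calls_vs_score = pearson(TOOL_CALLS, SCORES)
```

On these rows the coefficient comes out close to zero, which is consistent with the observation that more tool calls do not by themselves translate into higher scores.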

![Refer to caption](https://arxiv.org/html/2606.10394v1/fig/model_score_vs_unit_price.png)

Figure 3: Comparison of model API unit price and task score. The x-axis represents the unit price of the API, and the y-axis represents the average model task score.

Higher API unit price is positively correlated with stronger task performance, but does not fully explain model quality. As shown in Figure [3](https://arxiv.org/html/2606.10394#S3.F3), model average task score exhibits a strong positive linear correlation with model API price. This suggests that more expensive models tend to achieve better task performance on STAGE-Claw. Claude-Opus-4.7 is a representative example, combining a high API price with the highest average score. However, the correlation is not absolute. GLM-5 achieves a relatively high score at a low API price, while GPT-5.5 is more expensive but does not outperform Claude-Opus-4.7 or Claude-Sonnet-4.6. We speculate that GPT-5.5's relative underperformance may partly stem from its configuration with reasoning (thinking) modules turned off, which prioritizes execution efficiency over deliberative planning and multi-step reasoning. These observations indicate that price is an important but insufficient proxy for agent capability and that cost-effectiveness varies substantially across models.

![Refer to caption](https://arxiv.org/html/2606.10394v1/fig/failure_type_distribution.png)

![Refer to caption](https://arxiv.org/html/2606.10394v1/fig/failure_type_by_model.png)

![Refer to caption](https://arxiv.org/html/2606.10394v1/fig/figure_b_failure_cause_composition.png)

Figure 4: Failure-type analysis over non-passing outcomes. Categories are multi-label, so percentages do not sum to 100. a. Prevalence of each failure type among failed outcomes. b. Model-wise percentage of failed outcomes containing each type. c. Proportion of tool usage and the error types for the corresponding tools.

These findings show that STAGE-Claw evaluates not only final task success, but also how reliably and efficiently models operate in realistic scenarios. The results highlight the need to jointly consider task performance, latency, realized run cost, token usage, API unit price, and tool-use behavior when assessing LLM agents.

## 4 Analysis

### 4.1 Failure Analysis

We further analyze non\-passing runs to characterize common failure patterns\. Since a single run may exhibit multiple errors, our analysis is multi\-label rather than mutually exclusive\. Figure[4](https://arxiv.org/html/2606.10394#S3.F4)\.a summarizes the resulting distribution\.

The most prevalent failure mode is Tool Failure, appearing in 95.4% of non-passing runs. These failures involve missing, incomplete, or incorrectly routed writes to real tools such as calendar, notes, reminders, or email, indicating that agents often generate plausible intermediate outputs but fail to correctly update the underlying environment. The second most frequent category is Invalid Format (75.0%), including incorrect JSON format, missing required fields, incorrect filenames, or verification reports without required metadata. Trap and Reconciliation Failures (69.7%) and Verification Failures (59.2%) further show that agents struggle with source-of-truth selection, entity alignment, timezone and deadline reasoning, deduplication, noise filtering, and evidence-backed validation. Information Coverage Errors (52.6%), Incomplete Execution (49.3%), and Spurious Outputs (44.1%) are also common, suggesting that current agents have difficulty integrating reliable extraction, verification, and conservative state updates into a coherent workflow. Figure [4](https://arxiv.org/html/2606.10394#S3.F4).b shows, for each model, the percentage of failed outcomes containing each failure type.
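Because the labels are multi-label, each percentage is computed over the same set of non-passing runs, so the percentages can sum to more than 100. A minimal sketch of that computation follows; the function name and the four toy runs are assumptions, and the label names are borrowed from the text only for readability.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def failure_prevalence(runs: List[Set[str]]) -> Dict[str, float]:
    """Percentage of non-passing runs that carry each failure label."""
    counts: Counter = Counter()
    for labels in runs:
        counts.update(labels)          # each label counted once per run
    return {label: 100.0 * n / len(runs) for label, n in counts.items()}


# Toy non-passing runs (not benchmark data); a run may carry several labels.
toy_runs: List[Set[str]] = [
    {"Tool Failure", "Invalid Format"},
    {"Tool Failure"},
    {"Tool Failure", "Verification Failure"},
    {"Invalid Format"},
]

toy_prevalence = failure_prevalence(toy_runs)


def total_percentage(prevalence: Dict[str, float]) -> float:
    """Sum over labels; exceeds 100 whenever runs carry multiple labels."""
    return sum(prevalence.values())
```

In the toy data the three labels cover 75%, 50%, and 25% of runs, for a total of 150%, which is the same effect that lets the reported categories add up to well over 100%.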

### 4.2 Tool-Use Analysis

In this section, we analyze the frequency of tool calls and the errors that occur in the frequent tool calls\. Figure[4](https://arxiv.org/html/2606.10394#S3.F4)\.c reports both the tool call usage distribution and the failure cause composition, where failure causes are normalized over failed tool calls within each category\. Among all categorized tool calls, Shell/Session tools are used most frequently, accounting for 79\.8% of calls, followed by File I/O tools at 13\.4%\. State/Messaging tools are invoked much less often, representing only 6\.7% of calls\.

Beyond usage frequency, failure patterns also vary substantially across tool categories. Shell/Session failures are dominated by command or script errors (53.3%), followed by missing resources or paths (27.3%). This suggests that shell-based interaction primarily fails due to incorrect command construction, missing dependencies, or invalid assumptions about the runtime environment. File I/O failures exhibit a distinct pattern: missing resources or paths account for 59.5% of failed calls, while tool runtime errors account for 33.3%. This indicates that accurate resource localization and path validation are critical for reliable file operations. By contrast, all observed State/Messaging failures are attributed to tool runtime errors, suggesting that operations involving persistent application state or external communication channels remain particularly brittle.

Overall, tool failures are category\-specific rather than driven by a single generic error mode\. Shell/Session tools require more robust command generation and dependency checking\. File I/O tools depend heavily on reliable resource validation, and State/Messaging tools require stronger recovery mechanisms for persistent\-state updates\.
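The per-category normalization used in this analysis (failure causes expressed as a share of failed calls within each tool category) can be sketched as follows. The function name and the toy records are assumptions; the category and cause names are reused from the text only as labels.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def cause_shares(failed_calls: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """For each tool category, the percentage of its failed calls per cause."""
    per_category: Dict[str, Counter] = defaultdict(Counter)
    for category, cause in failed_calls:
        per_category[category][cause] += 1
    return {
        category: {
            cause: 100.0 * n / sum(causes.values())
            for cause, n in causes.items()
        }
        for category, causes in per_category.items()
    }


# Toy failed tool calls as (category, cause) pairs; not benchmark data.
toy_failed_calls: List[Tuple[str, str]] = [
    ("Shell/Session", "command or script error"),
    ("Shell/Session", "command or script error"),
    ("Shell/Session", "missing resource or path"),
    ("Shell/Session", "tool runtime error"),
    ("File I/O", "missing resource or path"),
    ("State/Messaging", "tool runtime error"),
]

toy_shares = cause_shares(toy_failed_calls)
```

Normalizing within each category keeps rarely used tools, such as State/Messaging at 6.7% of calls, from being drowned out by the much larger number of Shell/Session calls.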

### 4\.3 Effectiveness of State\-based Evaluation

To examine whether state\-based evaluation can better capture unexpected failures in realistic scenarios, we construct a virtual\-state variant of STAGE\-Claw\. In this setting, each task is modified to replace actual tool\-state changes with textual outputs that simulate the expected state updates\. We run this virtual setting on the same tasks using two representative models, DeepSeek\-V4\-Pro and Qwen3\.5\-Plus, and compare the results with the original state\-based evaluation\.

| Model | State\-based | Virtual | $\Delta$ |
|---|---|---|---|
| DeepSeek\-V4\-Pro | 59\.78 | 66\.71 | \+6\.93 |
| Qwen3\.5\-Plus | 59\.74 | 64\.57 | \+4\.83 |

| Cause category | \#Tasks | Proportion | Avg\. $\Delta$ |
|---|---|---|---|
| Execution failure | 7 | 8\.75% | \+76\.57 |
| Real\-state gap | 13 | 16\.25% | \+14\.05 |
| Other errors | 3 | 3\.75% | \-2\.67 |

Table 3: The upper section compares state\-based and virtual evaluation; $\Delta$ denotes Virtual \- State\-based\. The lower section breaks down the causes of these gaps\.

As shown in Table [3](https://arxiv.org/html/2606.10394#S4.T3), the virtual setting yields higher scores than the state\-based setting for both models\. However, this gap does not reflect better task completion\. Instead, it suggests that output\-only evaluation can overestimate agent performance by ignoring whether the agent actually modifies persistent tool states\. In real tool\-integrated environments, agents may fail due to execution failure and real\-state gap, such as invalid state writes, missing generated artifacts, and tool\-side effects\. These failures are naturally exposed by state\-based evaluation, but can be hidden when the task is reduced to textual state simulation\. Therefore, the comparison indicates that the state\-based evaluation provides a more faithful assessment of agent capability in realistic tool\-use scenarios\.
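The arithmetic behind Table 3 can be checked with a short script. The scores and cause counts are copied from the table; the denominator of 80 model\-task pairs (40 tasks for each of the 2 models) is an assumption inferred from the reported proportions, not a figure stated in the text.

```python
# (state-based score, virtual score) per model, copied from Table 3.
scores = {
    "DeepSeek-V4-Pro": (59.78, 66.71),
    "Qwen3.5-Plus": (59.74, 64.57),
}

# Delta is defined in the caption as Virtual - State-based.
deltas = {
    model: round(virtual - state_based, 2)
    for model, (state_based, virtual) in scores.items()
}

# Number of affected cases per cause category, copied from Table 3.
cause_counts = {
    "Execution failure": 7,
    "Real-state gap": 13,
    "Other errors": 3,
}

# ASSUMPTION: proportions are taken over 40 tasks x 2 models = 80 pairs.
TOTAL_PAIRS = 40 * 2
proportions = {
    cause: round(100 * count / TOTAL_PAIRS, 2)
    for cause, count in cause_counts.items()
}

print(deltas)
print(proportions)
```

Under that assumption the computed proportions match the reported 8\.75%, 16\.25%, and 3\.75%.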

### 4\.4 Memory Perturbation Analysis

| Task | Base | Noise | Misleading | Conflict |
|---|---|---|---|---|
| AI Conference Tracker | 77\.0 | $23.0_{-54.0}$ | $23.0_{-54.0}$ | $23.0_{-54.0}$ |
| Benchmark Administrator | 85\.0 | $56.0_{-29.0}$ | $35.0_{-50.0}$ | $49.0_{-36.0}$ |
| Competitor Radar | 62\.0 | $57.0_{-5.0}$ | $55.0_{-7.0}$ | $55.0_{-7.0}$ |
| Email Message Triage Center | 45\.0 | $40.0_{-5.0}$ | $22.0_{-23.0}$ | $22.0_{-23.0}$ |
| Average | 67\.3 | $\mathbf{44.0}_{\mathbf{-23.3}}$ | $\mathbf{33.8}_{\mathbf{-33.5}}$ | $\mathbf{37.3}_{\mathbf{-30.0}}$ |

Table 4: Qwen3\.5\-Plus scores under manually constructed memory perturbations on four STAGE\-Claw tasks\. Subscripts indicate score drops relative to the Base setting\.

Persistent memory can also affect state\-based agent execution, especially when it contains irrelevant, stale, or conflicting information\. We therefore conduct a small diagnostic study on four STAGE\-Claw tasks using Qwen3\.5\-Plus\. For each task, we move key task\-specific requirements from the visible instruction into the agent’s persistent memory and lightly revise the task prompt so that the task remains natural and memory\-dependent\. We then manually construct three perturbation types: Noise, which adds irrelevant memory entries around the key facts; Misleading, which injects plausible but incorrect rules or values; and Conflict, which introduces inconsistent versions of the same key facts\.

Table [4](https://arxiv.org/html/2606.10394#S4.T4) reports the resulting scores\. All three perturbation types reduce performance relative to the base memory\-dependent setting, with Misleading and Conflict causing larger drops in our sample\. These results are intended as diagnostic evidence rather than a comprehensive benchmark\-wide conclusion\. They suggest that memory quality can influence both agent decisions and downstream state updates, motivating future evaluation protocols and agent mechanisms that track memory provenance, recency, and conflicts in addition to tool\-execution reliability\.
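The averages and drops in Table 4 can be recomputed from the per\-task scores. The sketch below copies the scores from the table and keeps the results unrounded; the variable names are illustrative only.

```python
# Per-task Qwen3.5-Plus scores copied from Table 4.
table4 = {
    "AI Conference Tracker": {"Base": 77.0, "Noise": 23.0, "Misleading": 23.0, "Conflict": 23.0},
    "Benchmark Administrator": {"Base": 85.0, "Noise": 56.0, "Misleading": 35.0, "Conflict": 49.0},
    "Competitor Radar": {"Base": 62.0, "Noise": 57.0, "Misleading": 55.0, "Conflict": 55.0},
    "Email Message Triage Center": {"Base": 45.0, "Noise": 40.0, "Misleading": 22.0, "Conflict": 22.0},
}

SETTINGS = ("Base", "Noise", "Misleading", "Conflict")

# Unrounded column means over the four tasks.
means = {
    setting: sum(row[setting] for row in table4.values()) / len(table4)
    for setting in SETTINGS
}

# Average score drop of each perturbation relative to the Base setting.
drops = {setting: means["Base"] - means[setting] for setting in SETTINGS[1:]}

print(means)
print(drops)
```

The unrounded means are 67\.25, 44\.0, 33\.75, and 37\.25, which agree with the reported averages after rounding to one decimal place. Misleading produces the largest average drop, followed by Conflict and then Noise.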

## 5 Related Work

#### Agent Benchmarks for Tool Use and Interactive Environments\.

Recent agent benchmarks have expanded LLM evaluation beyond isolated text generation toward tool use, multi\-step reasoning, and interactive task completion\. General\-purpose benchmarks such as AgentBench Liu et al\. \([2024](https://arxiv.org/html/2606.10394#bib.bib10)\) and GAIA Mialon et al\. \([2024](https://arxiv.org/html/2606.10394#bib.bib7)\) evaluate planning, reasoning, web access, and external tool use, while function\-calling benchmarks such as BFCL Patil et al\. \([2025](https://arxiv.org/html/2606.10394#bib.bib19)\) focus on whether models can correctly select, parameterize, and invoke tools\. Another line of work evaluates agents in interactive environments\. Mind2Web Deng et al\. \([2023](https://arxiv.org/html/2606.10394#bib.bib8)\), WebArena Zhou et al\. \([2024](https://arxiv.org/html/2606.10394#bib.bib11)\), VisualWebArena Koh et al\. \([2024](https://arxiv.org/html/2606.10394#bib.bib12)\), and WorkArena Drouin et al\. \([2024](https://arxiv.org/html/2606.10394#bib.bib20)\) study web and enterprise workflows, and OSWorld Xie et al\. \([2024](https://arxiv.org/html/2606.10394#bib.bib9)\) evaluates agents in real operating\-system and desktop applications\. AppWorld Trivedi et al\. \([2024](https://arxiv.org/html/2606.10394#bib.bib18)\) and τ\-bench Yao et al\. \([2024](https://arxiv.org/html/2606.10394#bib.bib24)\) further emphasize controllable app or tool\-agent\-user interactions\. These benchmarks advance realistic interaction, but they are primarily designed around fixed tasks, simulated APIs, GUI control, or transactional tool use\.

#### Personal\-Agent Benchmarks and State\-based Evaluation\.

Recent benchmarks have moved closer to OpenClaw\-oriented personal agents, where LLMs interact with tools, memory, execution environments, and control logic OpenClaw \([2026](https://arxiv.org/html/2606.10394#bib.bib5)\); Li et al\. \([2026](https://arxiv.org/html/2606.10394#bib.bib15)\)\. PinchBench Kilo AI \([2026](https://arxiv.org/html/2606.10394#bib.bib13)\), WildClawBench WildClawBench \([2026](https://arxiv.org/html/2606.10394#bib.bib6)\), ClawsBench Li et al\. \([2026](https://arxiv.org/html/2606.10394#bib.bib15)\), and Claw\-Eval Ye et al\. \([2026](https://arxiv.org/html/2606.10394#bib.bib14)\) evaluate practical workflows such as scheduling, email, research, file management, and multi\-turn tool use\. Many existing agent benchmarks rely on manually curated tasks, sandboxed artifacts, simulated workspaces, or coarse rubrics, limiting their scalability and ability to diagnose persistent\-state errors\. STAGE\-Claw addresses these limitations by automatically generating benchmark instances from task hint words, instantiating them in real\-application environments, and evaluating agents via snapshots of the final system state\.

## Conclusion

This paper introduces STAGE\-Claw, an automated state\-based framework for constructing and evaluating realistic scenario benchmarks\. By assessing state\-based snapshots rather than textual responses, STAGE\-Claw enables more faithful evaluation of agent ability in multi\-tool workflows\. In addition, the automated pipeline provides a practical path toward scaling realistic scenario agent benchmarks while preserving reproducibility and auditable evaluation\. Experiments on 40 tasks show that current agents still struggle to complete complex tasks that involve multiple tool invocations in real scenarios\. These results highlight the need for state\-based evaluation and provide practical insights for building more reliable assistant agents\.

## Limitations

STAGE\-Claw is designed for realistic state\-based evaluation, but this design also brings limitations\. First, the benchmark currently contains 40 accepted tasks, and the main evaluation follows a first\-valid\-run protocol with only limited repeated\-run diagnostics\. Thus, the results should be viewed as a controlled snapshot rather than a statistically exhaustive leaderboard\. Second, constructing high\-quality state\-based tasks remains resource\-intensive\. Although task authoring, repair, and checker\-based validation are automated after human\-provided task hints, each accepted task still requires realistic scenario design, deterministic environment setup, executable verifier implementation, and quality checking, which together incur nontrivial time and API costs\. Third, evaluation in real personal\-computing environments means that scores may reflect not only model capability but also OpenClaw, tool wrappers, OS configuration, permissions, and runtime stability\. This improves realism but may introduce platform\-dependent failures\.

## References

- Alibaba Cloud \(2026\) Qwen3\.5: Towards native multimodal agents\. Note: [https://qwen\.ai/blog?id=qwen3\.5](https://qwen.ai/blog?id=qwen3.5)\.
- Anthropic \(2025\) Claude code overview\. Note: [https://code\.claude\.com/docs/en/overview](https://code.claude.com/docs/en/overview)\.
- Anthropic \(2026a\) Introducing Claude Opus 4\.7\. Note: [https://www\.anthropic\.com/news/claude\-opus\-4\-7](https://www.anthropic.com/news/claude-opus-4-7)\.
- Anthropic \(2026b\) Introducing Claude Sonnet 4\.6\. Note: [https://www\.anthropic\.com/news/claude\-sonnet\-4\-6](https://www.anthropic.com/news/claude-sonnet-4-6)\.
- ByteDance Seed \(2026\) Seed2\.0\. Note: [https://seed\.bytedance\.com/en/seed2](https://seed.bytedance.com/en/seed2)\.
- DeepSeek \(2026\) DeepSeek V4 Pro\. Note: [https://huggingface\.co/deepseek\-ai/DeepSeek\-V4\-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro)\.
- X\. Deng, Y\. Gu, B\. Zheng, S\. Chen, S\. Stevens, B\. Wang, H\. Sun, and Y\. Su \(2023\) Mind2web: towards a generalist agent for the web\. Advances in Neural Information Processing Systems 36, pp\. 28091–28114\.
- A\. Drouin, M\. Gasse, M\. Caccia, I\. H\. Laradji, M\. Del Verme, T\. Marty, L\. Boisvert, M\. Thakkar, Q\. Cappart, D\. Vazquez, et al\. \(2024\) Workarena: how capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403\.07718\.
- GLM\-5\-Team, A\. Zeng, X\. Lv, Z\. Hou, Z\. Du, Q\. Zheng, et al\. \(2026\) GLM\-5: From Vibe Coding to Agentic Engineering\. arXiv preprint arXiv:2602\.15763\. External Links: 2602\.15763, [Document](https://dx.doi.org/10.48550/arXiv.2602.15763), [Link](https://arxiv.org/abs/2602.15763)\.
- Google \(2026\) Gemini 3\.1 Pro Preview\. Note: [https://deepmind\.google/models/gemini/pro/](https://deepmind.google/models/gemini/pro/)\.
- Kilo AI \(2026\) PinchBench: real\-world benchmarks for ai coding agents\. Note: [https://pinchbench\.com/](https://pinchbench.com/)\.
- J\. Y\. Koh, R\. Lo, L\. Jang, V\. Duvvur, M\. Lim, P\. Huang, G\. Neubig, S\. Zhou, R\. Salakhutdinov, and D\. Fried \(2024\) Visualwebarena: evaluating multimodal agents on realistic visual web tasks\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 881–905\.
- X\. Li, K\. W\. Choe, Y\. Liu, X\. Chen, C\. Tao, B\. You, W\. Chen, Z\. Di, J\. Sun, S\. Zheng, et al\. \(2026\) ClawsBench: evaluating capability and safety of llm productivity agents in simulated workspaces\. arXiv preprint arXiv:2604\.05172\.
- X\. Liu, H\. Yu, H\. Zhang, Y\. Xu, X\. Lei, H\. Lai, Y\. Gu, H\. Ding, K\. Men, K\. Yang, et al\. \(2024\) Agentbench: evaluating llms as agents\. In International Conference on Learning Representations, Vol\. 2024, pp\. 52989–53046\.
- C\. Ma, J\. Zhang, Z\. Zhu, C\. Yang, Y\. Yang, Y\. Jin, Z\. Lan, L\. Kong, and J\. He \(2024\) Agentboard: an analytical evaluation board of multi\-turn llm agents\. Advances in Neural Information Processing Systems 37, pp\. 74325–74362\.
- M\. A\. Merrill, A\. G\. Shaw, N\. Carlini, B\. Li, H\. Raj, I\. Bercovich, L\. Shi, J\. Y\. Shin, T\. Walshe, E\. K\. Buchanan, et al\. \(2026\) Terminal\-bench: benchmarking agents on hard, realistic tasks in command line interfaces\. arXiv preprint arXiv:2601\.11868\.
- G\. Mialon, C\. Fourrier, T\. Wolf, Y\. LeCun, and T\. Scialom \(2024\) Gaia: a benchmark for general ai assistants\. In International Conference on Learning Representations, Vol\. 2024, pp\. 9025–9049\.
- MiniMax \(2026\) MiniMax M2\.7: Early Echoes of Self\-Evolution\. Note: [https://www\.minimax\.io/news/minimax\-m27\-en](https://www.minimax.io/news/minimax-m27-en)\.
- Moonshot AI \(2026\) Kimi K2\.6: Advancing Open\-Source Coding\. Note: [https://www\.kimi\.com/blog/kimi\-k2\-6](https://www.kimi.com/blog/kimi-k2-6)\.
- OpenAI \(2026a\) GPT\-5\.5\. Note: [https://openai\.com/zh\-Hans\-CN/index/introducing\-gpt\-5\-5/](https://openai.com/zh-Hans-CN/index/introducing-gpt-5-5/)\.
- OpenAI \(2026b\) Introducing GPT\-5\.4\. Note: [https://openai\.com/index/introducing\-gpt\-5\-4/](https://openai.com/index/introducing-gpt-5-4/)\.
- OpenClaw \(2026\) OpenClaw: personal ai assistant\. Note: [https://openclaw\.ai/](https://openclaw.ai/)\.
- S\. G\. Patil, H\. Mao, F\. Yan, C\. C\. Ji, V\. Suresh, I\. Stoica, and J\. E\. Gonzalez \(2025\) The berkeley function calling leaderboard \(bfcl\): from tool use to agentic evaluation of large language models\. In Forty\-second International Conference on Machine Learning\.
- Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian, et al\. \(2024\) Toolllm: facilitating large language models to master 16000\+ real\-world apis\. In International Conference on Learning Representations, Vol\. 2024, pp\. 9695–9717\.
- H\. Trivedi, T\. Khot, M\. Hartmann, R\. Manku, V\. Dong, E\. Li, S\. Gupta, A\. Sabharwal, and N\. Balasubramanian \(2024\) Appworld: a controllable world of apps and people for benchmarking interactive coding agents\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 16022–16076\.
- L\. Wang, C\. Ma, X\. Feng, Z\. Zhang, H\. Yang, J\. Zhang, Z\. Chen, J\. Tang, X\. Chen, Y\. Lin, et al\. \(2024\) A survey on large language model based autonomous agents\. Frontiers of Computer Science 18 \(6\), pp\. 186345\.
- WildClawBench \(2026\) WildClawBench\. Note: [https://github\.com/InternLM/WildClawBench](https://github.com/InternLM/WildClawBench)\.
- Z\. Xi, W\. Chen, X\. Guo, W\. He, Y\. Ding, B\. Hong, M\. Zhang, J\. Wang, S\. Jin, E\. Zhou, et al\. \(2025\) The rise and potential of large language model based agents: a survey\. Science China Information Sciences 68 \(2\), pp\. 121101\.
- T\. Xie, D\. Zhang, J\. Chen, X\. Li, S\. Zhao, R\. Cao, T\. J\. Hua, Z\. Cheng, D\. Shin, F\. Lei, et al\. \(2024\) Osworld: benchmarking multimodal agents for open\-ended tasks in real computer environments\. Advances in Neural Information Processing Systems 37, pp\. 52040–52094\.
- S\. Yao, N\. Shinn, P\. Razavi, and K\. Narasimhan \(2024\) τ\-bench: a benchmark for tool\-agent\-user interaction in real\-world domains\. arXiv preprint arXiv:2406\.12045\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2022\) React: synergizing reasoning and acting in language models\. arXiv preprint arXiv:2210\.03629\.
- B\. Ye, R\. Li, Q\. Yang, Y\. Liu, L\. Yao, H\. Lv, Z\. Xie, C\. An, L\. Li, L\. Kong, et al\. \(2026\) Claw\-eval: toward trustworthy evaluation of autonomous agents\. arXiv preprint arXiv:2604\.06132\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, et al\. \(2024\) Webarena: a realistic web environment for building autonomous agents\. In International Conference on Learning Representations, Vol\. 2024, pp\. 15585–15606\.

## Appendix A Appendix

### A\.1 Evaluation Separation and Blindness

To preserve evaluation validity, the benchmark construction protocol enforces a strict separation between the agent\-facing task prompt and the evaluator\-only materials\. The agent receives only the question prompt and the initialized environment specified by the environment\-construction document\. In contrast, the evaluator has access to ground truth, which may contain gold answers, scoring scripts, structured checklists, or other verification procedures\.

The agent\-facing prompt must not include any of the following: gold labels, expected final answers, hidden constraints, scoring rubrics, evaluator scripts, checksum values, or explicit references to the verification logic\. This constraint ensures that task success reflects the agent’s ability to operate in the environment rather than its ability to exploit leaked evaluation information\.
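A simple automated screen for this constraint might look like the sketch below. It is only an illustration of the idea: the pattern list, the labels, and the function name `find_leaks` are hypothetical, and a keyword scan of this kind would complement, not replace, the human audit described in the paper.

```python
import re

# HYPOTHETICAL patterns for evaluator-only material; not taken from the paper.
FORBIDDEN_PATTERNS = {
    "gold_label": r"\bgold(en)?\s+(label|answer)s?\b",
    "expected_answer": r"\bexpected\s+(final\s+)?answers?\b",
    "scoring_rubric": r"\b(scoring\s+)?rubrics?\b",
    "evaluator_script": r"\b(evaluator|verifier|checker)\s+scripts?\b",
    "checksum": r"\b(sha256|md5|checksum)\b",
}


def find_leaks(prompt: str) -> list:
    """Return the sorted labels of forbidden patterns found in an agent-facing prompt."""
    return sorted(
        label
        for label, pattern in FORBIDDEN_PATTERNS.items()
        if re.search(pattern, prompt, flags=re.IGNORECASE)
    )


clean_prompt = "Summarize the RSVP emails in ./inbox and write report.json."
leaky_prompt = "The expected final answer is 42; see the scoring rubric."

print(find_leaks(clean_prompt))
print(find_leaks(leaky_prompt))
```

A prompt that returns a non\-empty list would be sent back for revision before the task is accepted.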

### A\.2 Design Criteria

Each task produced under this protocol is required to satisfy the following criteria\.

1. Multi\-step execution\. The task must require at least three distinct operations, such as locating files, extracting information, transforming data, writing an output file, or interacting with an application\.
2. Cross\-tool grounding\. The task must require at least two different tools or information sources\. Examples include filesystem access and terminal commands, browser interaction and document parsing, or local data inspection and application state modification\.
3. Reproducibility\. The environment construction procedure must produce the same initial state on every run\. All seeded files, directories, and system configurations should be specified explicitly\.
4. Blind task execution\. The task prompt shown to the agent must not contain oracle information\. The agent should only know the user\-facing objective, the relevant working directory, and operational constraints\.
5. Objective gradability\. The final result must be verifiable without subjective human judgment\. Acceptable evaluation methods include deterministic scripts, exact\-match checks, structured rubrics with binary criteria, or reproducible checklist\-based validation\.
6. Time independence\. The task must not depend on real clocks, real\-time APIs, or mutable external states unless those states are explicitly mocked or seeded during environment construction\.
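The six criteria above can be encoded as a mechanical pre\-check on a task description. The sketch below is a hypothetical illustration: the `TaskSpec` fields and the `violated_criteria` function are invented for this example and do not describe the STAGE\-Claw implementation.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class TaskSpec:
    """Hypothetical summary of a candidate task (field names are assumptions)."""
    operations: List[str]            # distinct operations the task requires
    tools: Set[str]                  # tools or information sources involved
    setup_is_deterministic: bool     # same initial state on every run
    prompt_contains_oracle_info: bool
    has_objective_verifier: bool     # script, exact match, or binary checklist
    depends_on_unmocked_time: bool   # real clocks or live external state


def violated_criteria(spec: TaskSpec) -> List[str]:
    """Return the names of the design criteria that the spec fails."""
    violations = []
    if len(set(spec.operations)) < 3:
        violations.append("multi-step execution")
    if len(spec.tools) < 2:
        violations.append("cross-tool grounding")
    if not spec.setup_is_deterministic:
        violations.append("reproducibility")
    if spec.prompt_contains_oracle_info:
        violations.append("blind task execution")
    if not spec.has_objective_verifier:
        violations.append("objective gradability")
    if spec.depends_on_unmocked_time:
        violations.append("time independence")
    return violations


example = TaskSpec(
    operations=["locate files", "extract fields", "write report"],
    tools={"filesystem", "calendar"},
    setup_is_deterministic=True,
    prompt_contains_oracle_info=False,
    has_objective_verifier=True,
    depends_on_unmocked_time=False,
)
print(violated_criteria(example))
```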

### A\.3 Difficulty Categories

To simulate the difficulty and complexity of tasks in real situations, we add complicating mechanisms to each task\. The specific types of mechanisms are shown in Table [5](https://arxiv.org/html/2606.10394#A1.T5)\.

| Category | Difficulty Mechanism |
|---|---|
| Data Inconsistency | Cross\-source conflicts, heterogeneous timestamp formats, inconsistent status or field values |
| Dependency Reasoning | Hidden dependencies, topological ordering, prerequisite implementation requirements |
| Noise Filtering | Test records, drafts, deprecated items, dead URLs, empty or invalid entries |
| Entity Format | Aliases, email mappings, abbreviations, canonical IDs, category mapping |
| Output Precision | Strict field schemas, deterministic sorting, casing constraints, date formats, exact\-match error messages |

Table 5: Difficulty categories and included mechanisms\.

### A\.4 Snapshot Tools

| Environment Type | Snapshot Method | Recorded State |
|---|---|---|
| File system | Traverse the target environment directory using `find XXX_Env/ -type f`, sort the file list, and compute hashes for both file paths and file contents\. | A deterministic digest of the file hierarchy and all task\-related file contents\. |
| Calendar events | Query calendar events in a fixed future window using `icalbuddy eventsFrom:today to:+30d`, and compute a hash of the normalized output\. | A snapshot of scheduled calendar events within the evaluation window\. |
| Reminders | Use AppleScript to query reminder names, e\.g\., `osascript -e 'tell app "Reminders" to get name of reminders'`\. | The set of reminder entries visible to the system at snapshot time\. |
| Notes | Use AppleScript to query note titles, e\.g\., `osascript -e 'tell app "Notes" to get name of notes'`\. | The set of note records relevant to the initialized environment\. |
| Email state | Use AppleScript or mail\-client queries to count messages under task\-specific labels, including drafts and flagged messages\. | The number and status of task\-relevant email records, such as drafts, labels, or flags\. |

Table 6: Snapshot methods for validating reproducible benchmark environments\. Each snapshot records a deterministic representation of the corresponding environment state and can be compared before and after task execution\.
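The file\-system snapshot can be sketched in a few lines of Python. This is an illustrative equivalent of the traverse, sort, and hash procedure, not the authors' tooling: the function name `snapshot_digest`, the choice of SHA\-256, and the way paths and contents are combined into one digest are assumptions.

```python
import hashlib
from pathlib import Path


def snapshot_digest(root) -> str:
    """Return a deterministic hex digest of all regular files under `root`.

    Files are visited in sorted order of their relative paths, and both the
    relative path and the file contents feed into the digest.
    """
    root = Path(root)
    # Regular files only, mirroring `find <dir> -type f` (symlinks are skipped).
    files = [p for p in root.rglob("*") if p.is_file() and not p.is_symlink()]
    files.sort(key=lambda p: p.relative_to(root).as_posix())

    digest = hashlib.sha256()  # ASSUMPTION: the paper does not name a hash function
    for path in files:
        relative = path.relative_to(root).as_posix()
        digest.update(relative.encode("utf-8"))
        digest.update(b"\0")  # separator between the path and the content hash
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()
```

Calling `snapshot_digest` on the environment directory before and after a run yields two digests. Equal digests on two fresh initializations indicate a reproducible setup, while a changed digest after execution shows that the agent modified the file state.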
### A\.5 Self\-Review Checklist

Before a benchmark task is finalized, the author performs a self\-review using the following checklist\.

1. The environment builder can follow the environment\-construction documents without requiring additional clarification\.
2. The setup procedure is deterministic and reproducible\.
3. The agent\-facing question prompt contains no ground\-truth information\.
4. The evaluator can score the task using only the final state and the files in the ground\-truth documents\.
5. The success condition is singular, explicit, and unambiguous\.
6. Potential shortcuts or cheating strategies have been identified and blocked\.
7. The task admits a plausible harder variant for future benchmark scaling\.

### A\.6 Repeated\-run Diagnostic

The main results in Table [2](https://arxiv.org/html/2606.10394#S2.T2) are based on the first valid execution of each model\-task pair\. To examine run\-to\-run variability without incurring the full time and API cost of repeating all 11 models on all tasks, we conducted a limited three\-run diagnostic on three models: Qwen3\.5\-Plus, DeepSeek\-V4\-Pro, and GLM\-5\. The pass@3 results of these three models are shown in Figure [5](https://arxiv.org/html/2606.10394#A1.F5)\. This diagnostic analysis is intended to illustrate score variability under repeated executions and is not used to compute the aggregate results or model ranking in Table [2](https://arxiv.org/html/2606.10394#S2.T2)\.
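The per\-task score range shown in such a diagnostic can be computed as in the sketch below. The function name `score_ranges`, the input layout, the descending sort order, and the scores themselves are hypothetical; no values here come from the paper.

```python
from statistics import mean


def score_ranges(runs):
    """Summarize repeated runs for one model.

    `runs` maps a task name to its list of scores over repeated executions.
    Returns (task, min_score, max_score, mean_score) tuples, sorted by mean
    score in descending order (the sort direction is an assumption).
    """
    rows = [
        (task, min(scores), max(scores), mean(scores))
        for task, scores in runs.items()
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# HYPOTHETICAL three-run scores, invented for illustration only.
toy_runs = {
    "task_a": [60.0, 70.0, 80.0],
    "task_b": [90.0, 90.0, 90.0],
    "task_c": [10.0, 40.0, 10.0],
}
for task, low, high, average in score_ranges(toy_runs):
    print(task, low, high, average)
```

A wide gap between the minimum and maximum for a task signals that a single first\-valid\-run score should be read with caution.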

| Metric | Value |
|---|---|
| Candidate task instances | 46 |
| Retained benchmark tasks | 40 |
| Excluded after human audit | 6 |
| Retention rate | 87\.0% |
| Average repair iterations | 2\.67 |

Table 7: Statistics of the STAGE\-Claw construction pipeline\.

### A\.7 Human Audit and Filtering

We conduct human annotation to validate the quality of each constructed benchmark task before inclusion in the final benchmark\. We hired three task annotators, who review the complete task package, including the environment construction guide, the agent\-facing prompt, and the evaluator\-only ground\-truth materials\. The review focuses on whether the task is reproducible, clearly specified, blind to ground\-truth information, multi\-step, cross\-tool, and time\-independent\. Annotators also check for potential shortcuts, information leakage, or inconsistencies between the task prompt and the evaluation materials\. Tasks that fail any mandatory criterion are revised or excluded\. This process ensures that the retained tasks provide reliable and deterministic measurements of agent performance\. Table [7](https://arxiv.org/html/2606.10394#A1.T7) shows the number of tasks we reviewed and filtered\. The human annotators participated only in task auditing, not in task modification\.

![Refer to caption](https://arxiv.org/html/2606.10394v1/x2.png)

Figure 5: Range plot of three models across benchmark tasks\. Each horizontal line shows the score range across three evaluation rounds\. Tasks are sorted by the average score across the three models\.

## Appendix B Case Studies

#### Case 1: correct content in the wrong state channel\.

In Competitor Radar, one Qwen3\.5\-Plus run produced a high\-quality briefing and correctly resolved major data traps, including pricing reconciliation and exclusion of a rumored competitor\. However, it wrote calendar information to an \.ics file and notes to a Markdown file rather than creating real Apple Calendar events and an Apple Notes entry\. The final score remained non\-passing despite strong analytical content, illustrating why state\-based evaluation is necessary: file artifacts that describe the intended actions are not equivalent to durable tool\-state changes\.

#### Case 2: early timeout before state construction\.

In Project Progress Broadcaster, a GLM\-5 run timed out after reading the input files and before creating any required deliverables\. The evaluator found no progress summary, verification report, reminders, notes, or URL\-verification evidence\. This case represents an incomplete\-execution failure: the agent spent its budget on initial data gathering but did not transition to synthesis, validation, and state updates\. Such failures are difficult to diagnose from final text alone, but are exposed directly by executable checks over missing files and tool state\.

#### Case 3: ambiguous evidence and unsupported verification\.

In Citation Gap Finder, a DeepSeek\-V4\-Pro run correctly extracted many bibliography entries and identified several uncited or duplicate references, but failed the key ambiguous\-citation trap by matching a single\-author citation to an incompatible multi\-author bibliography item\. It also claimed URL verification without corresponding tool evidence and did not create the required Notes or Calendar artifacts\. The case combines reasoning, provenance, and state\-update failures: the report is superficially coherent, yet the evaluator shows that important claims are unsupported and required persistent actions are absent\.

## Appendix C Task Prompt

This appendix describes the prompt protocol used to construct reproducible, objectively scorable benchmark tasks for OpenClaw\. Each benchmark task is represented as a self\-contained directory containing environment construction instructions, an agent\-facing task prompt, and ground\-truth materials\. The process separates task authoring, environment initialization, task execution, and evaluation in order to reduce information leakage and ensure reproducibility\.

### C\.1 Benchmark Author Prompt

The benchmark author is instructed to design one reproducible, novel, and objectively scorable task\. The author is explicitly prohibited from solving or scoring the task\. A template is shown in Figure [6](https://arxiv.org/html/2606.10394#A3.F6)\.

`Benchmark Author Prompt`

Figure 6: Benchmark author prompt in STAGE\-Claw\.

### C\.2 Environment Construction Prompt

The environment construction prompt guides the environment builder in creating a deterministic task environment\. It specifies the target directory, initialization workflow, and required ground\-truth materials\. A template is shown in Figure [7](https://arxiv.org/html/2606.10394#A3.F7)\.

`Environment Construction Prompt`

Figure 7: Environment construction prompt in STAGE\-Claw\.

### C\.3 Agent Task Prompt

The agent\-facing prompt is the only instruction shown to OpenClaw during benchmark execution\. It describes the user objective and available task directory, but intentionally excludes all oracle information, scoring criteria, and ground\-truth answers\. A template is shown in Figure [8](https://arxiv.org/html/2606.10394#A3.F8)\.

`Agent Task Prompt`

Figure 8: Agent task prompt in STAGE\-Claw\.

## Appendix D Task Introduction

This section introduces the 40 STAGE\-Claw tasks\. The descriptions summarize user\-facing objectives and required task capabilities, without revealing hidden ground truth, scoring rubrics, or verifier logic\.

1. T01\. AI Conference Tracker\. This task requires the agent to consolidate conference schedules, email updates, user preferences, and venue information, produce an attendance plan, and create the corresponding calendar events, reminders, and notes\.
2. T02\. AI Lobster Academy Sparring Partner\. This task requires the agent to schedule three exam\-preparation study sessions for five members, assign practice topics, and create the associated calendar entries, notes, and reminders\.
3. T03\. API Migration Planner\. This task requires the agent to analyze migration materials for a transition from a v1 API to a v2 API, reconcile conflicting information, and produce a phased migration plan\.
4. T04\. AgentOS Personal Operating System\. This task requires the agent to integrate multi\-source information for a family gathering, resolve conflicts, compute the budget, and generate the final coordination plan\.
5. T05\. Benchmark Administrator\. This task requires the agent to process AgentOlympics 2026 team registration information, filter eligible teams, assign competition tracks, and record the relevant calendar and note entries\.
6. T06\. Browser Chaos Cleaner\. This task requires the agent to clean Firefox and Chrome bookmark exports by deduplicating entries, categorizing links, filtering invalid URLs, and generating a summary with follow\-up reminders\.
7. T07\. Citation Gap Finder\. This task requires the agent to audit in\-text citations and bibliography entries in a technical report, identifying missing, unused, duplicated, or inconsistent references\.
8. T08\. Code Skill Evolver\. This task requires the agent to revise a Python project according to 28 code\-review comments, run the relevant tests, and document the implemented changes and verification results\.
9. T09\. Competitor Radar\. This task requires the agent to integrate competitor information, verify feature links, generate a competitive\-intelligence brief, and schedule appropriate follow\-up actions\.
10. T10\. Config Drift Detector\. This task requires the agent to process meeting\-room booking requests from multiple sources, deduplicate entries, validate requests, resolve scheduling conflicts, and create valid calendar events\.
11. T11\. Cross\-Team Status Synthesizer\. This task requires the agent to synthesize status updates from multiple teams, resolve inconsistent reports, and produce an executive\-facing project status summary\.
12. T12\. Decision Log Keeper\. This task requires the agent to merge decision records from multiple sources, verify external references, produce a unified decision log, and create the corresponding calendar events\.
13. T13\. Dependency Upgrade Operator\. This task requires the agent to analyze dependency versions in a software project, identify required upgrades, formulate an upgrade plan, and record related system\-level follow\-up items\.
14. T14\. Docstring Debt Collector\. This task requires the agent to scan Python source files and API specifications, identify documentation debt, and create reminders for high\-priority documentation issues\.
15. T15\. Email Message Triage Center\. This task requires the agent to process RSVP emails for a team event, reconcile attendance status, and generate a verified participation report\.
16. T16\. Experiment Log Automation\. This task requires the agent to analyze machine\-learning experiment logs, identify the best completed run for each model family, and schedule a review calendar event\.
17. T17\. Grant Opportunity Radar\. This task requires the agent to integrate grant opportunity tables, email updates, and institutional requirements, verify deadlines, and generate a structured tracking result\.
18. T18\. Literature Review Generator\. This task requires the agent to select papers for a graduate\-level course, organize them into four weekly reading themes, and create the corresponding calendar and reminder entries\.
19. T19\. Local Dev Environment Fixer\. This task requires the agent to inspect project configuration files for tool\-version conflicts, verify the local development environment, and generate a repair report\.
20. T20\.Log\-to\-Incident Timeline\. This task requires the agent to parse production logs from multiple sources, filter noisy records, and construct an incident timeline\.
21. T21\.Meal Planner with Fridge Scan\. This task requires the agent to generate a five\-day meal plan based on fridge inventory, dietary restrictions, and the user’s calendar schedule\.
22. T22\.Medical Appointment Prep\. This task requires the agent to organize referral and medical documents, identify items requiring appointments, resolve conflicting information, and generate visit\-preparation materials\.
23. T23\.Meeting Action Operator\. This task requires the agent to extract action items from meeting transcripts, email summaries, and shared notes, then create the required system entries and reports\.
24. T24\.MemClaw Evaluation Scheduler\. This task requires the agent to validate evaluation\-task configurations, handle dependencies and resource allocation, schedule evaluation tasks, and generate a coordination report\.
25. T25\.Morning Command Center\. This task requires the agent to synthesize information from multiple sources, generate a morning briefing for 2026\-05\-04, and create the corresponding system state\.
26. T26\.Multi\-Agent Debate Arena\. This task requires the agent to aggregate multi\-round debate results, compute rankings, schedule the final round, and produce a verification report\.
27. T27\.Newsletter\-to\-Knowledge\. This task requires the agent to extract, deduplicate, and categorize articles from five technical newsletters, then build a knowledge base and weekly digest\.
28. T28\.Notebook Productionizer\. This task requires the agent to convert disorganized meeting notes into calendar events, reminders, and a readable structured report\.
29. T29\.Onboarding Repo Guide\. This task requires the agent to coordinate Project 53 meeting materials, reconcile source conflicts, and create a meeting calendar entry, reminder, and explanatory note\.
30. T30\.Paper Digest Pipeline\. This task requires the agent to merge paper data from multiple sources, deduplicate entries, filter records, verify links, and generate a weekly paper digest\.
31. T31\.Personal Health & Medical Assistant\. This task requires the agent to organize personal health records and produce a current medication list, question list, and medical summary\.
32. T32\.Project Progress Broadcaster\. This task requires the agent to integrate milestone data for the Phoenix Platform project, compute project status, and publish a structured progress update\.
33. T33\.Proposal Grant Writing Assistant\. This task requires the agent to process IRB approval information from multiple institutions, verify approval status and expiration dates, and generate a tracking report\.
34. T34\.Release Note Composer\. This task requires the agent to integrate release information from multiple sources, deduplicate overlapping items, verify external references, and produce accurate release notes\.
35. T35\.Research Project Manager\. This task requires the agent to process literature records, resolve metadata conflicts, verify DOIs, and generate final references together with follow\-up items\.
36. T36\.SSH Tunnel Guardian\. This task requires the agent to audit SSH tunnel configurations and runtime status, identify high\-priority maintenance items, and set monitoring reminders\.
37. T37\.School Notice Operator\. This task requires the agent to process emergency school\-closure information, notify relevant parties according to parent preferences, and create the necessary system records\.
38. T38\.Server Resource Dashboard\. This task requires the agent to analyze server alerts and resource metrics, filter invalid alerts, and generate a resource\-status and action report\.
39. T39\.Slack Thread Summarizer\. This task requires the agent to extract key decisions and action items from Slack discussion threads, create calendar and reminder entries, and produce a structured summary\.
40. T40\.Tech Community Intelligence Hub\. This task requires the agent to analyze multiple developer\-community snapshots, compute topic popularity, identify controversial topics, and generate follow\-up intelligence\.
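
Several of the tasks above share a merge, deduplicate, and filter step across multiple sources (for example T06, T27, and T30). As a purely illustrative sketch of that step for the bookmark case in T06, and not the benchmark's reference solution or verifier, the following Python assumes each export has already been parsed into a list of dictionaries with `title` and `url` keys. The function names, the record shape, and the normalization rules (lowercased scheme and host, trailing slash and fragment dropped, only `http`/`https` accepted) are assumptions made here for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form of an http(s) URL, or None if it is not usable."""
    parts = urlsplit(url.strip())
    if parts.scheme.lower() not in ("http", "https") or not parts.netloc:
        return None
    # Treat "/docs" and "/docs/" as the same page; an empty path becomes "/".
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; keep the query string unchanged.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def dedupe_bookmarks(*exports):
    """Merge bookmark lists, keeping the first entry seen per normalized URL.

    Returns (kept, rejected), where rejected holds entries whose URL could
    not be normalized (hypothetical definition of an "invalid" URL).
    """
    seen = {}
    rejected = []
    for export in exports:
        for bookmark in export:
            key = normalize_url(bookmark["url"])
            if key is None:
                rejected.append(bookmark)
            elif key not in seen:
                seen[key] = bookmark
    return list(seen.values()), rejected


if __name__ == "__main__":
    firefox = [
        {"title": "Docs", "url": "https://Example.com/docs/"},
        {"title": "Bad", "url": "javascript:void(0)"},
    ]
    chrome = [
        {"title": "Docs again", "url": "https://example.com/docs#intro"},
        {"title": "News", "url": "http://news.example.org"},
    ]
    kept, rejected = dedupe_bookmarks(firefox, chrome)
    print([b["title"] for b in kept])
    print([b["title"] for b in rejected])
```

Keeping the first occurrence is only one possible policy. A real task may instead require preferring the most recently added entry or merging folder tags, which this sketch does not attempt.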
