Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
Summary
Claw-SWE-Bench is a new benchmark and adapter protocol that standardizes evaluation conditions for comparing diverse coding agents on SWE-bench-style tasks, revealing that adapter design significantly impacts performance and cost.
Source: https://huggingface.co/papers/2606.12344 Published on Jun 10
Submitted by hankai (https://huggingface.co/hankaixyz) on Jun 11
#3 Paper of the day
Abstract
A new benchmark and adapter protocol called Claw-SWE-Bench enables fair comparison of diverse coding agents by standardizing evaluation conditions and revealing the importance of adapter design for effective code generation.
General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 pp and harness choice by 27.4 pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.
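To make the accuracy and cost axes concrete, the following is a minimal sketch of how Pass@1 and API cost might be tallied per system. It is not the benchmark's own evaluator: the `RunRecord` fields, the `summarize` and `implied_resolved` helpers, and the toy records are all hypothetical, and the sketch assumes Pass@1 is the share of instances resolved in a single attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One single-attempt run of a system on an instance (hypothetical schema)."""
    system: str
    instance_id: str
    resolved: bool
    api_cost_usd: float


def summarize(records):
    """Aggregate Pass@1 (in percent) and API cost for each system."""
    totals = {}
    for r in records:
        t = totals.setdefault(r.system, {"n": 0, "resolved": 0, "cost": 0.0})
        t["n"] += 1
        t["resolved"] += int(r.resolved)
        t["cost"] += r.api_cost_usd
    summary = {}
    for system, t in totals.items():
        summary[system] = {
            "pass_at_1": 100.0 * t["resolved"] / t["n"],
            "total_cost_usd": t["cost"],
            # Undefined when nothing was resolved, so report None instead.
            "cost_per_resolved_usd": (
                t["cost"] / t["resolved"] if t["resolved"] else None
            ),
        }
    return summary


def implied_resolved(pass_at_1_pct, n_instances):
    """Nearest whole number of resolved instances implied by a Pass@1 value."""
    return round(pass_at_1_pct / 100 * n_instances)


# Toy records, invented for illustration only (not benchmark results).
records = [
    RunRecord("claw-a", "repo__1", True, 0.5),
    RunRecord("claw-a", "repo__2", False, 0.5),
    RunRecord("claw-b", "repo__1", True, 1.5),
    RunRecord("claw-b", "repo__2", False, 2.5),
]
summary = summarize(records)
```

In this toy data both systems reach the same Pass@1 while differing fourfold in total cost, which is the kind of gap the abstract says accuracy alone can hide. Under the same assumption, `implied_resolved(19.1, 350)` and `implied_resolved(73.4, 350)` give roughly 67 and 257 resolved instances for the minimal and full adapters; these counts are inferred from the reported percentages and are not stated in the source.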
Similar Articles
I made a small open-source benchmark runner for testing OpenClaw agents on my own real workflows
A developer shares a personal open-source benchmark runner for testing OpenClaw agents on real, messy workflows. The tool allows users to define private evaluation cases, run agents in their actual workspace, and generate reports, aiming to provide more relevant signals than public benchmarks.
OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories
This paper introduces OpenClawBench, a large-scale dataset for benchmarking process-side anomalies in real-world AI agent execution trajectories. It reveals that task success can hide process failures, with 9.33% of oracle-passing executions containing anomalies, and provides structured supervision via a novel taxonomy.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
WildClawBench evaluates language and vision-language models on realistic long-horizon tasks using actual CLI environments with real tools. The benchmark reveals that even the best model achieves only 62.2% accuracy, indicating long-horizon agent evaluation remains challenging.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge is a generator-backed benchmark framework for executable command-line workflows under state conflict, evaluating LLM agents on tasks with pre-existing partial, stale, or conflicting artifacts across 17 scenarios.
@Ali_TongyiLab: https://x.com/Ali_TongyiLab/status/2067158015615041755
The AgentScope team introduces PawBench, a benchmark for evaluating the combined performance of models and agent harnesses, analyzing 4,050 test cells to show that harness choice can be as impactful as model upgrades.