Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Hugging Face Daily Papers

Summary

Claw-SWE-Bench is a new benchmark and adapter protocol that standardizes evaluation conditions for comparing diverse coding agents on SWE-bench-style tasks, revealing that adapter design significantly impacts performance and cost.

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw × nine-model sweep and a five-claw × two-model sweep, model choice changes Pass@1 by 29.4 pp and harness choice by 27.4 pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.
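To make the "patch and prediction contract" and the Pass@1 figures concrete, here is a minimal Python sketch. It is not the Claw-SWE-Bench adapter. It assumes the prediction record uses the field names of upstream SWE-bench predictions (`instance_id`, `model_name_or_path`, `model_patch`), and the names `Prediction`, `pass_at_1`, and `gap_pp` are hypothetical helpers introduced only for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Prediction:
    """One harness output, assuming upstream SWE-bench field names."""
    instance_id: str
    model_name_or_path: str
    model_patch: str  # unified diff extracted from the agent's workspace

    def to_jsonl(self) -> str:
        # One JSON object per line, as in a predictions file.
        return json.dumps(asdict(self), sort_keys=True)


def pass_at_1(resolved: dict[str, bool]) -> float:
    """Percentage of instances resolved with a single attempt each."""
    if not resolved:
        raise ValueError("no instances scored")
    return 100.0 * sum(resolved.values()) / len(resolved)


def gap_pp(a: float, b: float) -> float:
    """Difference between two percentages, in percentage points (pp)."""
    return round(a - b, 1)
```

With the numbers quoted above, `gap_pp(73.4, 19.1)` gives 54.3, which is the gap in percentage points between the full adapter and the minimal direct-diff adapter on the same backbone.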

Cached at: 06/11/26, 01:38 PM


Source: https://huggingface.co/papers/2606.12344

Published on Jun 10


Submitted by hankai (https://huggingface.co/hankaixyz) on Jun 11

#3 Paper of the day

Abstract

A new benchmark and adapter protocol called Claw-SWE-Bench enables fair comparison of diverse coding agents by standardizing evaluation conditions and revealing the importance of adapter design for effective code generation.




