@Ali_TongyiLab: https://x.com/Ali_TongyiLab/status/2067158015615041755
Summary
The AgentScope team introduces PawBench, a benchmark for evaluating the combined performance of models and agent harnesses, analyzing 4,050 test cells to show that harness choice can be as impactful as model upgrades.
Cached at: 06/17/26, 09:52 AM
What We Learned from Evaluating 4,050 Agent Runs
Agents are no longer research demos. Today they write code, browse the web, manipulate files, and complete multi-step workflows.
But when an agent fails, it’s surprisingly difficult to answer a simple question:
Was the model not capable enough—or did the harness fail to support it properly?
To answer that question, the AgentScope team introduced PawBench, a benchmark designed specifically for evaluating the combined performance of models and agent harnesses.
- Project: https://github.com/agentscope-ai/PawBench
- Leaderboard: https://agentscope-ai.github.io/PawBench/
- OpenJudge: https://github.com/agentscope-ai/OpenJudge
- OpenJudge Website: https://openjudge.me/
PawBench is part of the OpenJudge ecosystem. It inherits OpenJudge’s philosophy of evaluation-driven optimization, with a specific focus on the joint effect of LLM × Harness.
Evaluating Agents, Not Just Models
Most benchmarks evaluate models in isolation. Real-world agents, however, are never deployed that way.
In practice, the model determines what an agent could potentially do, while the harness determines whether that capability reliably translates into successful task execution.
Agent Performance = f(Model, Harness)
PawBench v1.0 contains:
- 150 real-world agent tasks
- 9 foundation models
- 3 production-grade harnesses: Hermes, OpenClaw, QwenPaw
The benchmark evaluates every combination:
9 Models × 3 Harnesses × 150 Tasks = 4,050 test cells
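The grid can be enumerated directly. The following is a minimal sketch, not PawBench code: the three harness names come from the article, while the model and task identifiers are placeholders invented for illustration.

```python
from itertools import product

# Harness names are from the article; model and task IDs are placeholders.
HARNESSES = ["Hermes", "OpenClaw", "QwenPaw"]
MODELS = [f"model-{i}" for i in range(1, 10)]      # 9 models
TASKS = [f"task-{i:03d}" for i in range(1, 151)]   # 150 tasks

# One test cell per (model, harness, task) combination.
cells = list(product(MODELS, HARNESSES, TASKS))
print(len(cells))  # 4050
```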
The 150 tasks were curated from six high-quality agent benchmarks: claweval, qwenclawbench, pinchbench, qwenpawbench, skillsbench, and wildclawbench.
Each task is tagged along five dimensions: application scenario, atomic capability, complexity, input modality, and runtime environment.
All tasks run inside Docker sandboxes and are fully traceable, making it possible to connect benchmark scores back to actual execution behavior.
The final score combines automated graders, including rule checks and sub-assertions, with LLM-as-judge for more semantic outputs. Scores are normalized to the 0–1 range and reported as percentages in this article.
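One way such a blended score could be computed is sketched below. The article does not state how PawBench weights its graders, so the function names and the `judge_weight` parameter are assumptions for illustration only.

```python
def combine_scores(rule_checks, judge_score=None, judge_weight=0.5):
    """Blend automated checks with an optional LLM-judge score into [0, 1].

    rule_checks: list of booleans, one per rule check or sub-assertion.
    judge_score: float from an LLM judge (clamped to [0, 1]), or None.
    judge_weight: hypothetical blending weight, not taken from PawBench.
    """
    if not rule_checks:
        raise ValueError("at least one rule check is required")
    # Fraction of automated checks that passed.
    auto = sum(bool(c) for c in rule_checks) / len(rule_checks)
    if judge_score is None:
        return auto
    judge = min(max(judge_score, 0.0), 1.0)
    return (1 - judge_weight) * auto + judge_weight * judge


def as_percentage(score):
    """Convert a normalized 0-1 score to a percentage with one decimal."""
    return round(score * 100, 1)


# Three of four checks pass, and the judge gives 0.8.
print(as_percentage(combine_scores([True, True, False, True], judge_score=0.8)))
```

Keeping the normalized score and the reported percentage as separate steps makes it easy to aggregate on the 0-1 scale and convert only for display.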
The Results: Agent Performance Depends on Both the Model and the Harness
Start with the text-task matrix.
- Models and harnesses both shape agent performance. Strong models remain robust: claude-opus-4.6 stays above 76 across all harnesses. But weaker models are more harness-sensitive: qwen3.6-35b-a3b moves from 57.9 on Hermes to 70.4 on QwenPaw.
- The harness gap is comparable to a model upgrade. On text tasks, QwenPaw (75.4) leads OpenClaw (74.8) and Hermes (70.0), creating a 5.5-point spread. In some cases, a better harness closes or even reverses model rank: for example, qwen3.6-plus + QwenPaw (76.5) outperforms qwen3.6-max-preview + Hermes (70.2).
The takeaway is simple:
Model capability still matters, but harnesses can introduce measurable performance differences.
The leaderboard is only the starting point. The more interesting question is:
Where do those missing points actually come from?
Slice Analysis: Understanding the Gap
Where Models Differ: Strengths and Weaknesses by Slice
Across 4,050 cells, a clear pattern emerges: models don’t just differ in score—they fail in different ways.
To focus on model-side differences first, we fix the harness to QwenPaw and slice the same submissions by task labels.
- Choose models by task type: claude-opus-4.6 sits in the top tier by aggregate score, with strong average performance and good stability. But once we break down by scenario, it only leads in 4 out of 11 task categories. Domain leadership shifts quickly: qwen3.6-max-preview leads in Manufacturing Engineering and Software Engineering, while qwen3.7-max leads in Data Analytics. In practice, model selection should start from workload, not leaderboard rank.
- Qwen3.6 35B/A3B vs. Max: the small-to-large gap shows up mainly in long-horizon tasks. Within Qwen, scaling matters less for simple Q&A and more for Math Computation, Planning, and Tool Use, where multi-step reasoning is required. Here, qwen3.6-max-preview is the most balanced, while qwen3.7-max is stronger in open-environment and data-analysis tasks.
- Multimodal remains a shared weak spot: under QwenPaw, every model scores lower on multimodal tasks than on text tasks, for example by 6.1 points (claude-opus-4.6), 8.0 (deepseek-v4-pro), and 12.4 (qwen3.6-35b-a3b). This points to a systemic challenge across image understanding, information extraction, cross-modal reasoning, and tool-chain handoff.
Fixing the harness clarifies model differences. The next question is how these gaps change across different harness designs.
Model × Harness Pairing: Three Interaction Slices
PawBench can slice the 4,050 cells by model size, modality, task type, skill domain, and more, then compare those slices against execution traces. This shows how model capability and harness behavior interact.
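Slicing of this kind amounts to grouping cells by a label and comparing averages. The sketch below uses toy records with made-up scores and placeholder names; it is not PawBench's data format or its actual results.

```python
from collections import defaultdict

# Toy records with made-up scores; these are NOT PawBench results.
cells = [
    {"model": "model-a", "scenario": "coding", "score": 0.75},
    {"model": "model-a", "scenario": "coding", "score": 0.5},
    {"model": "model-b", "scenario": "coding", "score": 0.5},
    {"model": "model-a", "scenario": "search", "score": 0.25},
    {"model": "model-b", "scenario": "search", "score": 1.0},
]


def mean_by(records, *keys):
    """Average score for each distinct combination of the given label keys."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[tuple(rec[k] for k in keys)].append(rec["score"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}


def leaders(records, slice_key):
    """Best model per slice value, ranked by mean score within the slice."""
    best = {}
    for (slice_value, model), avg in mean_by(records, slice_key, "model").items():
        if slice_value not in best or avg > best[slice_value][1]:
            best[slice_value] = (model, avg)
    return {k: v[0] for k, v in best.items()}


print(leaders(cells, "scenario"))  # {'coding': 'model-a', 'search': 'model-b'}
```

The same `mean_by` helper works for any label, so a harness column could be added to the records and passed as an extra key to compare model and harness pairs within a slice.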
Finding #1: Smaller Models Need the Harness to Stabilize Execution
Start with two extremes. claude-opus-4.6 is stable across harnesses (2.3-point spread), while qwen3.6-35b-a3b shifts by 11.5 points depending only on the harness.
This gap shows a clear pattern. Larger models can compensate for missing context: they infer paths, filter a larger tool list, and check whether artifacts were actually produced. Smaller models are more brittle: they lose track of the current working directory, misjudge whether a file was written, or choose the wrong first tool when the tool list is too large.
Trace analysis points to three common failure sources:
- Missing artifact-level validation: many harnesses rely on the model saying “done” without checking files, tests, or outputs. This makes premature completion easy.
- Loose path awareness and constraints: for example, Hermes did not clearly inject the current working directory into the prompt, nor did it strictly constrain write paths in tools like write_file. The model may believe it wrote the file successfully while the grader cannot find it in the standard workspace.
- Tool overload: tool counts vary heavily (Hermes ~65, OpenClaw ~30, QwenPaw ~15). Larger toolsets often hurt smaller models by increasing decision cost.
The takeaway is not that small models are weak, but that they rely more on harness structure.
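The first two failure sources can be addressed on the harness side. The sketch below shows one possible approach, assuming a single workspace directory. The `write_file` name echoes the tool mentioned above, but this is a hypothetical stand-in rather than any harness's real implementation, and `resolve_in_workspace` and `missing_artifacts` are invented helper names.

```python
from pathlib import Path


def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    """Resolve a model-supplied path and reject anything outside the workspace."""
    root = workspace.resolve()
    target = (root / requested).resolve()
    if not target.is_relative_to(root):  # requires Python 3.9+
        raise ValueError(f"write outside workspace rejected: {requested}")
    return target


def write_file(workspace: Path, requested: str, content: str) -> Path:
    """Hypothetical write tool that only writes inside the workspace."""
    target = resolve_in_workspace(workspace, requested)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return target


def missing_artifacts(workspace: Path, expected: list[str]) -> list[str]:
    """Return expected files that are absent or empty, whatever the model reports."""
    missing = []
    for rel in expected:
        path = workspace / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

A harness could call `missing_artifacts` before accepting a "done" message, so that completion depends on files in the workspace rather than on the model's own report.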
Finding #2: Skill Use Requires Harness Discovery and Model Follow-Through
Many developers store project-specific skills directly inside their workspace. PawBench simulates this setup to evaluate whether harnesses can discover and utilize them.
Across all three harnesses, Skill-related tasks were consistently more difficult than categories such as tool use, planning, or reasoning.
Two issues stand out:
- Harness-side discovery gap: only OpenClaw actively scans the workspace for local skills. The other two harnesses do not, causing agents to miss valuable task-specific guidance.
- Model-side execution gap: even when the harness injects the Skill and places the signpost, the model can still fail during complex reasoning or precise computation. The harness can point the way; the base model still has to follow the path.
So success requires both: the harness must surface Skills clearly (name, scope, usage), and the model must reliably decide to invoke them.
If either side breaks, the model bypasses the Skill and tries to solve the task with general reasoning.
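Harness-side discovery can be as simple as scanning the workspace and listing what was found in the prompt. The sketch below assumes each skill is a directory containing a `SKILL.md` marker file whose first line is a short summary. That layout is an assumption for illustration; the article does not specify how PawBench or the three harnesses store skills.

```python
from pathlib import Path


def discover_skills(workspace: Path, marker: str = "SKILL.md") -> list[dict]:
    """Find local skills, assuming one directory per skill with a marker file."""
    skills = []
    for marker_path in sorted(workspace.rglob(marker)):
        lines = marker_path.read_text(encoding="utf-8").strip().splitlines()
        first_line = lines[0] if lines else ""
        skills.append({
            "name": marker_path.parent.name,
            "path": str(marker_path.relative_to(workspace)),
            # Drop a leading markdown heading marker, if present.
            "summary": first_line.lstrip("# ").strip(),
        })
    return skills


def skills_prompt_section(skills: list[dict]) -> str:
    """Render discovered skills as a prompt section (name, location, usage hint)."""
    if not skills:
        return "No local skills were found in the workspace."
    lines = ["Local skills available in this workspace:"]
    lines += [f"- {s['name']} ({s['path']}): {s['summary']}" for s in skills]
    return "\n".join(lines)
```

Surfacing the name, location, and a one-line summary covers the harness half of the requirement; whether the model then invokes the skill remains a model-side question.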
Finding #3: Web Search Tasks Depend Heavily on Default Availability
Web Search tasks test the model’s ability to search the web, fetch content, and do deeper research. PawBench does not assume the best-case setup where every search API key is configured. Instead, it recreates the default developer experience: clone a fixed version, add the LLM key, and run.
What we found:
- Hermes includes web search capabilities, but requires external search API keys before those tools become available, so it underperformed on these tasks.
- OpenClaw has a better default experience: web_search can use keyless services such as DuckDuckGo, and web_fetch relies on built-in HTTP fetching.
- QwenPaw does not have a dedicated search tool, but its browser_use tool plus model knowledge can still handle basic web access.
Importantly, results reflect both the model’s capabilities and whether the harness makes search usable by default.
Trace behavior shows a split:
- Strong models adapt: they switch to terminal + curl, DOM inspection, and long-page extraction when search fails.
- Weaker models stall: they repeat navigation or conclude the task is infeasible.
In other words, strong models can route around missing tools; weaker models depend on the harness keeping the search path stable and explicit.
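A harness can make that fallback path explicit instead of leaving it to the model. The sketch below tries a list of retrieval strategies in order and records why each one failed. The strategy functions are stubs invented for illustration; they do not call any real search API or any harness's actual tools.

```python
from typing import Callable, Optional

Fetcher = Callable[[str], Optional[str]]


def first_working(query: str, strategies: list[tuple[str, Fetcher]]):
    """Try each strategy in order; return (strategy_name, result, error_log)."""
    errors = []
    for name, fetch in strategies:
        try:
            result = fetch(query)
        except Exception as exc:  # keep going, but remember why it failed
            errors.append(f"{name}: {exc}")
            continue
        if result:
            return name, result, errors
        errors.append(f"{name}: empty result")
    return None, None, errors


# Stub strategies standing in for real tools (hypothetical).
def keyed_search(query: str) -> str:
    raise RuntimeError("no API key configured")


def plain_http_fetch(query: str) -> str:
    return f"page text for {query!r}"


name, result, errors = first_working(
    "agent benchmarks",
    [("keyed_search", keyed_search), ("plain_http_fetch", plain_http_fetch)],
)
print(name, errors)
```

Returning the error log alongside the result lets the harness tell the model which tools are unavailable, rather than letting it retry them.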
Four Co-Design Principles for Models and Harnesses
Based on the benchmark results, we believe effective harnesses should follow four simple principles:
- Models cannot act on information they do not have. Tell the model where it is, what resources exist, and what outputs are expected. Never assume the model will infer these details.
- Tooling should be both sufficient and efficient. Provide the tools that matter and make sure critical tools are usable by default, but avoid overwhelming the model with unnecessary options.
- Do not rely solely on what the model says. Verify artifacts, files, outputs, and execution results rather than trusting agent self-reporting.
- Many failures are recoverable if the framework provides useful feedback and retry opportunities. Provide critical information such as current state, missing requirements, and existing artifacts, and give the model a structured opportunity to recover.
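The verification and recovery principles combine naturally into a loop: run the agent, check the results independently, and feed the specific gaps back into the next attempt. The sketch below is one hedged illustration; `agent_step` and `check` are hypothetical callables supplied by the harness, not part of any named framework.

```python
from typing import Callable


def run_with_feedback(
    agent_step: Callable[[str], None],
    check: Callable[[], list[str]],
    task: str,
    max_attempts: int = 3,
) -> tuple[bool, int]:
    """Run the agent, verify independently, and retry with concrete feedback.

    agent_step: runs one agent attempt for the given prompt.
    check: returns a list of problems found (empty list means success).
    Returns (succeeded, attempts_used).
    """
    prompt = task
    for attempt in range(1, max_attempts + 1):
        agent_step(prompt)
        problems = check()
        if not problems:
            return True, attempt
        # Tell the model exactly what is still missing.
        feedback = "\n".join(f"- {p}" for p in problems)
        prompt = f"{task}\n\nPrevious attempt was incomplete:\n{feedback}"
    return False, max_attempts
```

Capping the number of attempts keeps a persistently failing task from looping forever, and the feedback text gives the model the current state instead of a bare "try again".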
Contribute to PawBench
PawBench helps agent users identify the best model–harness combination for their workload, and gives harness developers a way to measure and improve their systems.
The most important result from PawBench isn’t which model ranks first. It’s that agent performance is not a property of the model alone.
PawBench v1.0 is fully open source, and we welcome new harnesses, models, tasks, and contributions from the community.
- Project: https://github.com/agentscope-ai/PawBench
- Leaderboard: https://agentscope-ai.github.io/PawBench/
- Full Blog: https://agentscope-ai.github.io/PawBench/en/blog/PAWBENCH_MODEL_HARNESS_BLOG_EN/