@0xLogicrw: Alibaba Tongyi Lab launches Agent Evaluation Benchmark PawBench v1.0, for the first time integrating base models and runtime frameworks into a unified evaluation system. The evaluation cross-tests 9 large models with three frameworks: Hermes, OpenClaw, and QwenPaw, covering 150 real-world tasks and 4050 ...

X AI KOLs Timeline 06/05/26, 11:25 AM Tools

agent-benchmark framework-evaluation model-evaluation ai-frameworks tongyi-lab qwen

Summary

Alibaba Tongyi Lab launches Agent Evaluation Benchmark PawBench v1.0, for the first time integrating base models and runtime frameworks into a unified evaluation system, covering 9 models and 3 frameworks with 150 tasks. It finds that framework design significantly affects agent performance, and proposes four design principles.

Alibaba Tongyi Lab has launched the agent evaluation benchmark PawBench v1.0, which for the first time brings base models and runtime frameworks into a single evaluation system. It cross-tests 9 large models with three frameworks (Hermes, OpenClaw, and QwenPaw) across 150 real-world tasks, for 4050 test units. The results show that runtime framework design directly determines whether agent capabilities can be deployed reliably. With the same model, the three frameworks show significant performance gaps: QwenPaw scores 76.4, OpenClaw 75.4, and Hermes only 70.4. The roughly 6-point gap between QwenPaw and Hermes is comparable to a major model version upgrade. Good design can even let smaller models score 'upset victories': in the Hermes framework, GLM 5.1 scores only 68.2, while in the QwenPaw framework the smaller Qwen3.6-35b-a3b scores 70.4. Analysis of runtime trajectories attributes the differences to a lack of substantive verification of workspace artifacts, loose tool path constraints, and an oversized tool table that increases the model's decision burden. Most frameworks also show clear deficiencies in actively discovering local specialized skills (Skill) and in zero-configuration availability of web search. The evaluation team proposes four fundamental principles for framework design. First, Inform Fully: clearly communicate environmental context such as cwd and workspace. Second, Equip on Demand: control the number of tools and ensure key tools like password-free search are available by default. Third, Monitor Actively: verify that task artifacts such as file writes are actually realized. Fourth, Recover Gracefully: provide opportunities for error correction and continuation when tools are abnormal or artifacts are missing.


Cached at: 06/05/26, 03:15 PM

Alibaba Tongyi Lab has released the agent evaluation benchmark PawBench v1.0, which for the first time integrates foundation models and execution frameworks into a unified evaluation system. The benchmark cross-tests 9 large models against three frameworks (Hermes, OpenClaw, and QwenPaw), covering 150 real-world tasks and 4,050 test units.
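The reported 4,050 test units are consistent with a full cross product of models, frameworks, and tasks (9 × 3 × 150). The sketch below only illustrates that arithmetic. The model and task identifiers are placeholders, since the source names the three frameworks but does not list all nine models or the task set, and it does not confirm that the benchmark builds its units this way.

```python
from itertools import product

# Framework names come from the article; model and task labels are hypothetical.
frameworks = ["Hermes", "OpenClaw", "QwenPaw"]
models = [f"model_{i}" for i in range(1, 10)]      # 9 placeholder model labels
tasks = [f"task_{i:03d}" for i in range(1, 151)]   # 150 placeholder task IDs

# Assumption: one test unit per (model, framework, task) combination.
test_units = list(product(models, frameworks, tasks))
print(len(test_units))  # 4050
```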

Results show that the design of the execution framework directly determines whether agent capabilities can be reliably deployed. Under the same model, the three frameworks exhibit significant performance gaps: QwenPaw scores 76.4, OpenClaw scores 75.4, while Hermes trails at only 70.4. The roughly 6-point gap between QwenPaw and Hermes (76.4 versus 70.4) is comparable to a major model version upgrade.

Excellent design can even enable smaller models to “punch above their weight”: on the Hermes framework, GLM 5.1 scores only 68.2, while on QwenPaw, the smaller Qwen3.6-35b-a3b achieves 70.4.

Analysis of execution traces reveals that the performance differences stem from a lack of substantive validation of workspace artifacts, loose tool-path constraints, and overly large tool tables that increase the model’s decision-making burden. Most frameworks also show clear shortcomings in proactive discovery of local specialized skills and zero-configuration availability of web search.
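One way to picture a tighter tool-path constraint is a guard that resolves every path a tool receives and rejects anything outside the workspace. This is a minimal sketch under assumptions, not the mechanism used by any of the three frameworks. The function name `resolve_in_workspace` is hypothetical, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path

def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    """Resolve a tool-supplied path and reject it if it escapes the workspace."""
    root = workspace.resolve()
    # Joining with an absolute path discards the root, so "/etc/passwd"
    # is caught by the containment check below, as is "../" traversal.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {requested}")
    return candidate
```

A framework could call such a guard before every file read or write, so that a wrong path produces an explicit error the model can react to rather than a silent write to the wrong location.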

The evaluation team proposes four fundamental principles for framework design:

Inform Fully – Clearly define context such as cwd and workspace.
Equip on Demand – Control the number of tools and ensure critical tools like password-free search are available by default.
Monitor Actively – Validate whether task artifacts such as file writes are actually realized.
Recover Gracefully – Provide opportunities for correction and continuation when tools fail or artifacts are missing.
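The four principles above can be sketched as small helpers in an agent loop. This is an illustrative sketch only: every function name here is hypothetical, the "non-empty file" check is just one possible notion of a realized artifact, and none of it is taken from PawBench or the evaluated frameworks.

```python
from pathlib import Path
from typing import Callable

def build_context(workspace: Path) -> str:
    """Inform Fully: state cwd and workspace explicitly for the model."""
    return f"cwd: {Path.cwd()}\nworkspace: {workspace}"

def select_tools(all_tools: dict[str, Callable], needed: list[str]) -> dict[str, Callable]:
    """Equip on Demand: expose only the tools this task needs."""
    return {name: all_tools[name] for name in needed if name in all_tools}

def artifact_ok(path: Path) -> bool:
    """Monitor Actively: confirm the file exists and is non-empty."""
    return path.is_file() and path.stat().st_size > 0

def run_with_recovery(step: Callable[[str], None], artifact: Path,
                      max_attempts: int = 3) -> bool:
    """Recover Gracefully: retry the step with feedback until the artifact is verified."""
    feedback = ""
    for _ in range(max_attempts):
        try:
            step(feedback)  # the step receives feedback from the previous attempt
        except Exception as exc:
            feedback = f"tool error: {exc}"
            continue
        if artifact_ok(artifact):
            return True
        feedback = f"expected artifact missing or empty: {artifact}"
    return False
```

In this sketch, `step` stands in for one model-plus-tool turn. The point of the design is that success is decided by inspecting the workspace, not by the step's own claim that it finished.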

Similar Articles

@RookieRicardoR: Domestic models break through again, matching top models like Claude 4.6 and Gemini 3.1 Pro. Just tested Qwen3.7-Max, sharing some real thoughts. Last night I topped up as soon as the API went live and chose three tasks (see video) to test Qwen3.7-Max's frontend capabilities…

@Sentdex: For anyone who isn't sure, this is how you release a model and talk about the performance. Not 3-5 cherry-picked benchm…

@intheworldofai: Qwen 3.7-Max is genuinely one of the most impressive agentic coding models I’ve tested in a while. I had it generate a …

Introducing BenchBench (5 minute read)

@yidabuilds: https://x.com/yidabuilds/status/2053409619641602286
