@0xLogicrw: Alibaba Tongyi Lab launches Agent Evaluation Benchmark PawBench v1.0, for the first time integrating base models and runtime frameworks into a unified evaluation system. The evaluation cross-tests 9 large models with three frameworks: Hermes, OpenClaw, and QwenPaw, covering 150 real-world tasks and 4050 ...

X AI KOLs Timeline Tools

Summary

Alibaba Tongyi Lab launches Agent Evaluation Benchmark PawBench v1.0, for the first time integrating base models and runtime frameworks into a unified evaluation system, covering 9 models and 3 frameworks with 150 tasks. It finds that framework design significantly affects agent performance, and proposes four design principles.

Alibaba Tongyi Lab has launched the agent evaluation benchmark PawBench v1.0, which for the first time brings base models and runtime frameworks into a single evaluation system. The evaluation cross-tests 9 large models against three frameworks (Hermes, OpenClaw, and QwenPaw), covering 150 real-world tasks and 4,050 test units. The results show that runtime framework design directly determines whether agent capabilities can be deployed stably. With the same model, the three frameworks show significant performance gaps: QwenPaw scores 76.4, OpenClaw 75.4, and Hermes only 70.4. The gap between QwenPaw and Hermes (6.0 points by these scores; the post cites 6.4) is comparable to a major model version upgrade. Good framework design can even let a smaller model score an upset: GLM 5.1 scores only 68.2 on Hermes, while the smaller Qwen3.6-35b-a3b scores 70.4 on QwenPaw. Analysis of runtime trajectories attributes the differences to a lack of substantive verification of workspace artifacts, loose tool path constraints, and an oversized tool table that increases the model's decision burden. Most frameworks also show clear deficiencies in proactively discovering local specialized skills (Skill) and in zero-configuration availability of web search. The evaluation team proposes four fundamental principles for framework design. First, Inform Fully: clearly communicate environmental context such as cwd and workspace. Second, Equip on Demand: control the number of tools and ensure key tools like password-free search are available by default. Third, Monitor Actively: verify that task artifacts such as file writes are actually realized. Fourth, Recover Gracefully: provide opportunities for error correction and continuation when tools fail or artifacts are missing.

Cached at: 06/05/26, 03:15 PM

Alibaba Tongyi Lab has released the agent evaluation benchmark PawBench v1.0, which for the first time integrates foundation models and execution frameworks into a unified evaluation system. The benchmark cross-tests 9 large models against three frameworks (Hermes, OpenClaw, and QwenPaw), covering 150 real-world tasks and 4,050 test units.
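The 4,050 figure matches a full cross of 9 models, 3 frameworks, and 150 tasks. The sketch below enumerates that grid under the assumption that one test unit is one (model, framework, task) combination; the post does not define the term, and the model and task identifiers here are placeholders, since the post does not list them.

```python
from itertools import product

NUM_MODELS = 9
FRAMEWORKS = ["Hermes", "OpenClaw", "QwenPaw"]
NUM_TASKS = 150

# Placeholder identifiers: the post names neither all nine models nor the tasks.
models = [f"model_{i}" for i in range(1, NUM_MODELS + 1)]
tasks = [f"task_{i:03d}" for i in range(1, NUM_TASKS + 1)]

# Assumption: a "test unit" is one (model, framework, task) triple.
test_units = list(product(models, FRAMEWORKS, tasks))

print(len(test_units))  # 4050
```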

Results show that the design of the execution framework directly determines whether agent capabilities can be reliably deployed. Under the same model, the three frameworks exhibit significant performance gaps: QwenPaw scores 76.4, OpenClaw scores 75.4, while Hermes trails at only 70.4. The gap between QwenPaw and Hermes (6.0 points by these scores; the post cites 6.4) is comparable to a major model version upgrade.

Excellent design can even enable smaller models to “punch above their weight”: on the Hermes framework, GLM 5.1 scores only 68.2, while on QwenPaw, the smaller Qwen3.6-35b-a3b achieves 70.4.
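The comparisons above reduce to simple differences between the quoted scores. The snippet below is only a worked check of that arithmetic using the numbers in the post; the variable names are illustrative. On these three framework scores the best-to-worst gap comes out to 6.0, so the 6.4 figure cited in the post cannot be reproduced from them alone.

```python
# Scores as quoted in the post.
framework_scores = {"QwenPaw": 76.4, "OpenClaw": 75.4, "Hermes": 70.4}

best = max(framework_scores, key=framework_scores.get)
worst = min(framework_scores, key=framework_scores.get)
# Round to one decimal place to avoid floating-point noise.
gap = round(framework_scores[best] - framework_scores[worst], 1)
print(best, worst, gap)  # QwenPaw Hermes 6.0

# The cross-framework "upset": smaller model on QwenPaw vs. GLM 5.1 on Hermes.
glm_on_hermes = 68.2
qwen_on_qwenpaw = 70.4
upset_margin = round(qwen_on_qwenpaw - glm_on_hermes, 1)
print(upset_margin)  # 2.2
```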

Analysis of execution traces reveals that the performance differences stem from a lack of substantive validation of workspace artifacts, loose tool-path constraints, and overly large tool tables that increase the model’s decision-making burden. Most frameworks also show clear shortcomings in proactive discovery of local specialized skills and zero-configuration availability of web search.

The evaluation team proposes four fundamental principles for framework design:

  1. Inform Fully – Clearly define context such as cwd and workspace.
  2. Equip on Demand – Control the number of tools and ensure critical tools like password-free search are available by default.
  3. Monitor Actively – Validate whether task artifacts such as file writes are actually realized.
  4. Recover Gracefully – Provide opportunities for correction and continuation when tools fail or artifacts are missing.
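The four principles can be illustrated with a toy runtime loop. This is a minimal sketch under stated assumptions, not how Hermes, OpenClaw, or QwenPaw are implemented: every function name here (`build_context`, `select_tools`, `artifact_exists`, `run_step`) is hypothetical, and the "tools" are plain Python functions standing in for real agent tools.

```python
import os
import tempfile
from typing import Callable, Dict, List, Optional

Tool = Callable[[str, str], None]  # (workspace, artifact_name) -> None

def write_file(workspace: str, name: str) -> None:
    """A working tool: actually writes the artifact."""
    with open(os.path.join(workspace, name), "w", encoding="utf-8") as f:
        f.write("report\n")

def flaky_write(workspace: str, name: str) -> None:
    """A faulty tool: returns without error but writes nothing."""
    return None

def build_context(workspace: str) -> str:
    # Inform Fully: state cwd and workspace explicitly for the model.
    return f"cwd={os.getcwd()}\nworkspace={workspace}"

def select_tools(all_tools: Dict[str, Tool], needed: List[str]) -> Dict[str, Tool]:
    # Equip on Demand: expose only the tools the task needs.
    return {name: all_tools[name] for name in needed if name in all_tools}

def artifact_exists(workspace: str, name: str) -> bool:
    # Monitor Actively: check the file on disk, not the tool's return value.
    path = os.path.join(workspace, name)
    return os.path.isfile(path) and os.path.getsize(path) > 0

def run_step(workspace: str, tools: Dict[str, Tool],
             order: List[str], artifact: str) -> Optional[str]:
    # Recover Gracefully: on an exception or a missing artifact, try the next tool.
    for tool_name in order:
        try:
            tools[tool_name](workspace, artifact)
        except Exception:
            continue
        if artifact_exists(workspace, artifact):
            return tool_name
    return None

registry = {"flaky_write": flaky_write, "write_file": write_file, "unused": write_file}
with tempfile.TemporaryDirectory() as ws:
    context = build_context(ws)
    tools = select_tools(registry, ["flaky_write", "write_file"])
    winner = run_step(ws, tools, ["flaky_write", "write_file"], "report.txt")

print(winner)  # write_file
```

The first tool returns without raising, so only the on-disk check reveals that nothing was written; the loop then falls through to the second tool, which produces the artifact.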
