@0xLogicrw: Alibaba Tongyi Lab launches Agent Evaluation Benchmark PawBench v1.0, for the first time integrating base models and runtime frameworks into a unified evaluation system. The evaluation cross-tests 9 large models with three frameworks: Hermes, OpenClaw, and QwenPaw, covering 150 real-world tasks and 4050 ...
Summary
Alibaba Tongyi Lab launches Agent Evaluation Benchmark PawBench v1.0, for the first time integrating base models and runtime frameworks into a unified evaluation system, covering 9 models and 3 frameworks with 150 tasks. It finds that framework design significantly affects agent performance, and proposes four design principles.
Cached at: 06/05/26, 03:15 PM
Alibaba Tongyi Lab has released the agent evaluation benchmark PawBench v1.0, which for the first time integrates foundation models and execution frameworks into a unified evaluation system. The benchmark performs cross-testing on 9 large models and three frameworks—Hermes, OpenClaw, and QwenPaw—covering 150 real-world tasks and 4,050 test units.
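The 4,050 figure is consistent with a full cross product: 9 models × 3 frameworks × 150 tasks. A minimal sketch of that grid follows, assuming one test unit per (model, framework, task) combination. The framework names come from the article; the model and task identifiers are placeholders, not PawBench's actual IDs.

```python
from itertools import product

# Framework names come from the article; model and task IDs are placeholders.
FRAMEWORKS = ["Hermes", "OpenClaw", "QwenPaw"]
MODELS = [f"model-{i}" for i in range(1, 10)]      # 9 hypothetical model IDs
TASKS = [f"task-{i:03d}" for i in range(1, 151)]   # 150 hypothetical task IDs


def build_test_units(models, frameworks, tasks):
    """Return every (model, framework, task) combination as one test unit."""
    return list(product(models, frameworks, tasks))


units = build_test_units(MODELS, FRAMEWORKS, TASKS)
print(len(units))  # 4050
```

Running every model on every framework is what lets the benchmark separate the framework's contribution from the model's.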
Results show that the design of the execution framework directly determines whether agent capabilities can be reliably deployed. With the same underlying model, the three frameworks show significant performance gaps: QwenPaw scores 76.4, OpenClaw scores 75.4, and Hermes trails at 70.4. The gap between the best and worst framework, about 6 points, is comparable to a major model version upgrade.
Good framework design can even let a smaller model “punch above its weight”: GLM 5.1 scores only 68.2 on the Hermes framework, while the smaller Qwen3.6-35b-a3b reaches 70.4 on QwenPaw.
Analysis of execution traces attributes the performance differences to three framework weaknesses: no substantive validation of workspace artifacts, loose constraints on tool paths, and oversized tool lists that add to the model’s decision-making burden. Most frameworks also fall short on proactively discovering locally installed specialized skills and on offering web search with zero configuration.
The evaluation team proposes four fundamental principles for framework design:
- Inform Fully – Clearly define context such as cwd and workspace.
- Equip on Demand – Limit the number of tools and ensure critical ones, such as web search that needs no credentials, are available by default.
- Monitor Actively – Verify that task artifacts, such as written files, actually exist.
- Recover Gracefully – Provide opportunities for correction and continuation when tools fail or artifacts are missing.
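The "Monitor Actively" and "Recover Gracefully" principles can be illustrated with a small wrapper that checks the file system after each tool call and retries when the artifact is missing. This is only a sketch under stated assumptions: `run_with_artifact_check`, `flaky_write`, and the retry count are hypothetical names and choices, not part of Hermes, OpenClaw, or QwenPaw.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_with_artifact_check(
    action: Callable[[], None],
    artifact: Path,
    max_attempts: int = 2,
) -> bool:
    """Run `action`, then confirm that `artifact` exists and is non-empty.

    Returns True once the artifact is verified, False if every attempt fails.
    """
    for _ in range(max_attempts):
        try:
            action()
        except OSError:
            # Recover Gracefully: a failed tool call gets another attempt.
            continue
        # Monitor Actively: check the file system, not the tool's own claim.
        if artifact.is_file() and artifact.stat().st_size > 0:
            return True
    return False


# Usage: a hypothetical flaky "tool" that only writes on its second call.
workspace = Path(tempfile.mkdtemp())  # Inform Fully: an explicit workspace
report = workspace / "report.txt"
calls = {"n": 0}


def flaky_write() -> None:
    calls["n"] += 1
    if calls["n"] >= 2:
        report.write_text("done")


ok = run_with_artifact_check(flaky_write, report)
print(ok, calls["n"])  # True 2
```

The first call returns without error but writes nothing, which is exactly the case a framework misses if it trusts the tool's return value instead of inspecting the workspace.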
Similar Articles
@RookieRicardoR: Domestic models break through again, matching top models like Claude 4.6 and Gemini 3.1 Pro. Just tested Qwen3.7-Max, sharing some real thoughts. Last night I topped up as soon as the API went live and chose three tasks (see video) to test Qwen3.7-Max's frontend capabilities…
The user tested Qwen3.7-Max and believes it matches top models like Claude 4.6 and Gemini 3.1 Pro in frontend, computing power, and agent capabilities. Its reasoning ability has improved significantly, and with a monthly iteration cadence it has become a first-tier domestic model.
@Sentdex: For anyone who isn't sure, this is how you release a model and talk about the performance. Not 3-5 cherry-picked benchm…
A tweet by Sentdex highlights Alibaba Qwen's transparent benchmark reporting for the Qwen3.7-Max model, contrasting it with others who cherry-pick benchmarks.
@intheworldofai: Qwen 3.7-Max is genuinely one of the most impressive agentic coding models I’ve tested in a while. I had it generate a …
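Alibaba has released Tongyi Qianwen (Qwen) 3.7 Max, a flagship coding model designed for the agent era. The model stands out in long-horizon autonomous execution, frontend generation, and 3D scene construction. On several benchmarks it matches or even surpasses top closed-source models, making it a Chinese model close to the frontier.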
Introducing BenchBench (5 minute read)
Introduces BenchBench, a benchmark that tests AI models' ability to create effective benchmarks for other models. GPT 5.2 is the only model to succeed so far, while frontier models like GPT 5.5 and Opus 4.6 struggled.
@yidabuilds: https://x.com/yidabuilds/status/2053409619641602286
The author conducted a comparative evaluation of four domestic AI models: DeepSeek V4, Kimi K2.6, GLM-5.1, and MiniMax M2.7. The analysis covers their strengths and weaknesses regarding cost, long-context processing, coding stability, and reasoning performance, offering specific recommendations on how to route tasks involving large document analysis, long-running background jobs, and bulk content generation.