Tag
Alibaba Tongyi Lab launches Agent Evaluation Benchmark PawBench v1.0, for the first time integrating base models and runtime frameworks into a unified evaluation system, covering 9 models and 3 frameworks with 150 tasks. It finds that framework design significantly affects agent performance, and proposes four design principles.