Tag
Alibaba Tongyi Lab launches PawBench v1.0, an agent evaluation benchmark that for the first time integrates base models and runtime frameworks into a unified evaluation system. It covers 9 models and 3 frameworks across 150 tasks, finds that framework design significantly affects agent performance, and proposes four design principles.
LlamaIndex founder Jerry Liu discusses the company's strategic pivot from a general AI framework to high-accuracy context extraction from enterprise documents such as PDFs and PowerPoints, targeting 95%+ accuracy for agentic workflows in legal, insurance, and finance.