APIEval-20
Summary
APIEval-20 is an open benchmark designed to evaluate AI agents' capabilities in testing APIs.
Similar Articles
An Empirical Study of Automating Agent Evaluation
This paper introduces EvalAgent, a system that automates the evaluation of AI agents by encoding domain-specific expertise, addressing the limitations of standard coding assistants in this task. It also presents AgentEvalBench, a benchmark for testing evaluation pipelines, and demonstrates significant improvements in evaluation reliability.
ProgramBench
ProgramBench is a new benchmark that evaluates AI agents' ability to reconstruct complete software projects from compiled binaries and documentation without access to source code or decompilation tools.
SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies
This paper introduces SWE-WebDevBench, a comprehensive 68-metric framework for evaluating AI-powered application development platforms as virtual software agencies. The study highlights critical gaps in current platforms regarding specification understanding, backend reliability, production readiness, and security.
@dair_ai: // Agents' Last Exam // Agents' Last Exam is a living benchmark of over 1,000 economically valuable tasks, built with 2…
Agents' Last Exam is a living benchmark of over 1,000 economically valuable tasks designed to evaluate AI agents on real-world workflows, with a current full pass rate of only 2.6% on its hardest tier.
@BraceSproul: I've been thinking a lot about the two different groups of evals you need in general agents/agents which handle broad t…
A Twitter thread arguing that general AI agents need two distinct evaluation suites: a lightweight benchmark eval for quick iteration and a comprehensive test-coverage eval for thorough validation across diverse user paths.