@SoHarshhh: Really happy to share that “ToolFailBench” got accepted at two ICML 2026 workshops, FAGEN and AIWILD. Most benchmarks e…
Summary
ToolFailBench, a diagnostic benchmark for tool-using agents, has been accepted at two ICML 2026 workshops, FAGEN and AIWILD.
View Cached Full Text
Cached at: 06/01/26, 11:20 AM
Really happy to share that “ToolFailBench” got accepted at two ICML 2026 workshops, FAGEN and AIWILD.
Most benchmarks evaluate tool-using agents with a single aggregate success rate, but that number can’t explain why a model actually fails. ToolFailBench is a diagnostic https://t.co/UCKA2H29Aw
Similar Articles
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
TOBench is a new benchmark for evaluating AI agents on real-world, task-oriented tool use with multimodal inputs and closed-loop verification. Experiments show top models like Qwen 3.5 Plus achieve only 41% success, far below the 94% human benchmark, highlighting a significant gap.
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
This paper introduces MLS-Bench, a benchmark designed to assess whether AI systems can invent generalizable and scalable machine learning methods rather than just performing engineering tuning.
ProgramBench (5 minute read)
ProgramBench is a new benchmark that evaluates AI agents' ability to reconstruct complete software projects from compiled binaries and documentation without access to source code or decompilation tools.
@KLieret: You can evaluate on ProgramBench yourself: https://github.com/facebookresearch/ProgramBench/… We will open the leaderbo…
ProgramBench is a new benchmark that tests AI agents' ability to reconstruct a complete codebase from a compiled binary and its documentation. The leaderboard will open for submissions soon.
Introducing BenchBench (5 minute read)
Introduces BenchBench, a benchmark that tests AI models' ability to create effective benchmarks for other models, with GPT 5.2 being the only successful winner so far while frontier models like GPT 5.5 and Opus 4.6 struggled.