@rohanpaul_ai: Today’s frontier agents are far less ready for real-world automation than their benchmark scores suggest. This paper pr…

X AI KOLs Following 06/11/26, 12:29 AM Papers

benchmark ai-agents real-world-automation evaluation frontier-models workflow-testing performance-gap

Summary

This paper introduces Agents' Last Exam, a benchmark that tests AI agents on real expert work across 55 digital work areas. Even the best current systems fail most tasks, with an average full pass rate of only 2.6% on the hardest tier, revealing a large gap between benchmark scores and real-world automation readiness.

Today’s frontier agents are far less ready for real-world automation than their benchmark scores suggest.

This paper proposes Agents’ Last Exam, a benchmark that asks AI agents to finish real expert work, and today’s agents mostly fail.

Even today’s strong agents are nowhere near reliable on the hardest real workflows, which means benchmark success has not yet become broad workplace capability.

So this paper shifts the question from “can AI answer hard questions?” to “can AI complete real work that people get paid to do?”

Most of today’s AI benchmarks show impressive scores, but they do not prove that agents can finish useful work in real jobs.

Agents’ Last Exam tries to fix this by testing agents on long tasks from 55 digital work areas, including engineering, finance, medicine, law, media, and science.

The tasks come from experts’ real completed projects, and the agent must use normal computer tools like files, browsers, command lines, and desktop software to produce a finished result.

The authors tested many current agent systems and models, then scored the finished work with automatic checks or strict rubrics instead of loose human opinions.

The main result is that today’s best systems still struggle badly, with an average full pass rate of only 2.6% on the hardest tier.

Link: arxiv.org/abs/2606.05405

Title: “Agents’ Last Exam”

Original Article

Cached at: 06/11/26, 03:39 PM

Similar Articles

@dair_ai: // Agents' Last Exam // Agents' Last Exam is a living benchmark of over 1,000 economically valuable tasks, built with 2…

@dawnsongtweets: Everyone says the latest AI agents will be "job-ready" soon, especially after the release of Fable 5 this week. But is …

@rohanpaul_ai: Arena just released a real-world agent leaderboard that ranks AI models by how well they complete actual user jobs, not…

@OkhayIea: Everyone's racing to build "AI scientists." So we asked a blunt question: Can today's best coding agents beat the publi…

HealthAgentBench: A Unified Benchmark Suite of Realistic Agentic Healthcare Environments for Challenging Frontier AI Agents
