Tag
This paper introduces Agents' Last Exam, a benchmark that tests AI agents on real expert work across 55 digital work areas. Current best agents fail most tasks, averaging only 2.6% pass rate on the hardest tier, revealing a large gap between benchmark scores and real-world automation readiness.
An analysis from LessWrong examining the performance gap between open-source and proprietary AI models.