Humanity's Last Exam: current benchmarks, thoughts?
Summary
Discussion of recent AI model scores on the 'Humanity's Last Exam' benchmark, noting the improvement from GPT-4o's 2.7% in May 2024 to around 45% by June 2026 and questioning how difficult the exam really is.
Similar Articles
Fable passes the "When A.I. Passes This Test, Look Out" test
Claude Fable achieves 53% on the 'Humanity's Last Exam' benchmark, passing the milestone expected for the end of 2025 ahead of projections and indicating rapid AI progress.
AI can finally pass the Turing Test better than a human, study warns
A new study published in PNAS shows that advanced LLMs like GPT-4.5 can pass the Turing Test, with participants finding them more human than actual humans, prompting a reevaluation of what the test measures.
Does anyone else feel like AI benchmarks are becoming less useful for predicting real-world performance?
The article discusses the growing disconnect between high AI benchmark scores and actual real-world performance, highlighting issues like consistency, latency, and context handling.
@omarsar0: The efficiency frontier! Where do you think GPT-5.6 will land?
Discussion of recent benchmark results for Claude Opus 4.8 and GPT-5.5 on DeepSWE Bench, with speculation about future GPT-5.6 performance and efficiency trends.
@rohanpaul_ai: Today’s frontier agents are far less ready for real-world automation than their benchmark scores suggest. This paper pr…
This paper introduces Agents' Last Exam, a benchmark that tests AI agents on real expert work across 55 digital work areas. The current best agents fail most tasks, averaging only a 2.6% pass rate on the hardest tier, which reveals a large gap between benchmark scores and real-world automation readiness.