Tag
Discussion of recent AI model scores on the "Humanity's Last Exam" benchmark, noting an improvement from GPT-4o's 2.7% in May 2024 to around 45% by June 2026 and questioning the exam's difficulty.
Alibaba's Qwen3.7-Max model autonomously optimized a production kernel on unfamiliar T-Head PPU hardware over 35 hours, making 1,158 tool calls and achieving a 10x speedup, demonstrating sustained autonomous agentic behavior without human guidance.
Campbell Brown, former Meta news chief, launches Forum AI to evaluate foundation model accuracy on high-stakes topics like geopolitics and mental health, aiming to improve AI truthfulness through expert-led benchmarks.
interfaze_ai claims that its new AI model outperforms leading models (Sonnet 4.6, Gemini 3 Flash, GPT-5.4 mini) on OCR, vision, and speech-to-text tasks.
The article discusses the growing disconnect between high AI benchmark scores and actual real-world performance, highlighting issues like consistency, latency, and context handling.
Andrew Ng proposes a new "Turing-AGI Test" to better measure artificial general intelligence by having systems perform real work tasks with internet access. He argues that the term AGI has become overhyped and needs a precise definition to avoid misleading stakeholders about AI capabilities.