ai-benchmarks

#ai-benchmarks

humanity's last exam current benchmarks thoughts?

Reddit r/singularity · yesterday

A discussion of recent AI model scores on the Humanity's Last Exam benchmark, noting the rise from GPT-4o's 2.7% in May 2024 to around 45% by June 2026 and questioning how difficult the exam really is.



Alibaba's Qwen3.7-Max Ran Autonomously for 35 Hours on Unfamiliar Hardware. It Still Kept Getting Better.

Reddit r/ArtificialInteligence · 2026-05-25

Alibaba's Qwen3.7-Max model autonomously optimized a production kernel on unfamiliar T-Head PPU hardware over 35 hours, making 1,158 tool calls and achieving a 10x speedup, demonstrating sustained autonomous agentic behavior without human guidance.



Who decides what AI tells you? Campbell Brown, once Meta’s news chief, has thoughts

TechCrunch AI · 2026-05-14

Campbell Brown, former Meta news chief, launches Forum AI to evaluate foundation model accuracy on high-stakes topics like geopolitics and mental health, aiming to improve AI truthfulness through expert-led benchmarks.



@aaron_epstein: New model just released that beats sonnet 4.6, gemini 3 flash, and gpt 5.4 mini on OCR, vision, and STT tasks @interfaz…

X AI KOLs Following · 2026-05-12

A new AI model from interfaze_ai claims to outperform leading models (sonnet 4.6, gemini 3 flash, gpt 5.4 mini) on OCR, vision, and speech-to-text tasks.



Does anyone else feel like AI benchmarks are becoming less useful for predicting real-world performance?

Reddit r/ArtificialInteligence · 2026-05-07

The post discusses the growing disconnect between high AI benchmark scores and real-world performance, highlighting issues such as consistency, latency, and context handling.



New Year Special! Hopes for 2026 from David Cox, Adji Bousso Dieng, Juan M. Lavista Ferres, Tanmay Gupta, Pengtao Xie, Sharon Zhou

The Batch · 2026-01-02

Andrew Ng proposes a new "Turing-AGI Test" to better measure artificial general intelligence by having systems perform real work tasks with internet access, arguing that the term AGI has become overhyped and needs precise definition to avoid misleading stakeholders about AI capabilities.
