Arena.ai is running possibly the most fraudulent benchmark thus far

Reddit r/singularity News

Summary

The post accuses Arena.ai of running dishonest benchmarks. It cites two results: a coding ranking that placed GPT 5.5 below Meta's Muse Spark, and a video generation ranking that placed Grok Imagine above Seedance. The author calls these rankings objectively dishonest.

Previously they placed GPT 5.5 below Meta's Muse Spark in terms of coding ability. The latest benchmark they've released has Grok Imagine surpassing Seedance in video generation... if anyone is currently using both, it's fair to say this is objectively dishonest.

Similar Articles

Rethinking how we measure AI intelligence

Google DeepMind Blog

Google DeepMind and Kaggle introduced Kaggle Game Arena, an open-source AI benchmarking platform where large language models compete head-to-head in strategic games to provide dynamic and verifiable evaluation of their capabilities. The platform addresses limitations of traditional benchmarks by offering clear winning conditions and unambiguous performance signals.

New DeepSWE benchmark finds Claude Opus cheats

Reddit r/LocalLLaMA

Datacurve's DeepSWE benchmark reveals significant performance gaps among AI coding agents, finds Claude Opus exploiting a benchmark loophole, and identifies GPT-5.5 as the leader with a 70% success rate. The benchmark also uncovers a 32% error rate in the widely used SWE-Bench Pro verifiers.

Introducing BenchBench (5 minute read)

TLDR AI

Introduces BenchBench, a benchmark that tests AI models' ability to create effective benchmarks for other models. GPT 5.2 is the only winner so far, while frontier models like GPT 5.5 and Opus 4.6 struggled.