FrontierCode is a new coding benchmark from METR and Cognition that evaluates AI models on code maintainability and quality, and it finds that many models produce code that cannot be merged. It includes over 1,000 hours of work and shows that even top models struggle: Opus 4.8 achieves only 13.8% on the hardest tier.
An analysis finds that 28.9% of GPT 5.5's failures on SWE-Bench Pro stem from broken or incorrect test cases. Similar issues affect other major AI benchmarks, raising concerns about the accuracy of current evaluation methods.