swebench

#swebench

@swyx: It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represe…

X AI KOLs Following ↗ · yesterday Cached

FrontierCode is a new coding benchmark from METR and Cognition that evaluates AI models on code maintainability and quality, revealing that many models produce unmergeable code. It includes over 1000 hours of work and shows that even top models struggle, with Opus 4.8 achieving only 13.8% on the hardest tier.

0 favorites 0 likes

#swebench

On SWEBench Pro, 68.5% of GPT 5.5’s failures were caused by broken or incorrect test cases, totaling 28.9% of the entire benchmark

Reddit r/ArtificialInteligence ↗ · 2026-05-26

An analysis reveals that 28.9% of GPT 5.5's failures on SWEBench Pro are due to broken or incorrect test cases, and similar issues affect other major AI benchmarks, raising concerns about the accuracy of current evaluation methods.

0 favorites 0 likes

swebench

@swyx: It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represe…

On SWEBench Pro, 68.5% of GPT 5.5’s failures were caused by broken or incorrect test cases, totaling 28.9% of the entire benchmark

Submit Feedback