metr

#metr

@swyx: It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represe…

X AI KOLs Following ↗ · yesterday

FrontierCode is a new coding benchmark from METR and Cognition that evaluates AI models on code maintainability and quality, and it finds that many models produce code that cannot be merged. The benchmark represents over 1,000 hours of work, and even top models struggle on it: Opus 4.8 achieves only 13.8% on the hardest tier.


#metr

The famous METR AI time horizons graph contains numerous severe errors [D]

Reddit r/MachineLearning ↗ · 2026-05-25

A detailed critique of the METR AI time horizons graph alleges numerous severe methodological errors, including biased human baselines, unmeasured data, and train-test contamination, which the author says undermine its conclusions about AI capabilities.
