FrontierCode is a new coding benchmark from METR and Cognition that evaluates AI models on code maintainability and quality, revealing that many models produce unmergeable code. The benchmark comprises over 1,000 hours of work and shows that even top models struggle: Opus 4.8 achieves only 13.8% on the hardest tier.
A detailed critique of the METR AI time horizons graph reveals numerous severe methodological errors, including biased human baselines, unmeasured data, and train-test contamination, which undermine the graph's conclusions about AI capabilities.