@swyx: It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represe…

X AI KOLs Following 06/08/26, 08:27 PM Tools

coding-benchmark ai-evaluation frontiercode swebench code-quality metr cognition

Summary

FrontierCode is a new coding benchmark from METR and Cognition that evaluates AI models on code maintainability and quality, revealing that many models produce unmergeable code. It includes over 1000 hours of work and shows that even top models struggle, with Opus 4.8 achieving only 13.8% on the hardest tier.

It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represents over 1000+ hours of maintainer validated software engineering work most frontier models cannot yet solve, much less solve with high quality. Cog had IOI Gold medalists and top code maintainers Look At The Data — FrontierCode includes 3000+ rubrics covering code quality and anticheat reward hacking plaguing other benchmarks. FC Diamond is so hard that Opus 4.8 scores 13.8%. Three eras of AI coding : Three eras of benchmarks 2021 • Autocomplete : HumanEval 2023 • Passing Tests: SWEBench, TerminalBench 2026 • Maintainable Code: FrontierCode to me the most beautiful chart when I requested a special historical run into all extant old models, the data was finding that the easiest third of FC tasks (in FC Extended) were rapidlly and suddenly solved over late 2025 - Opus almost doubled from a 41% pass rate to 74% in 4 months. This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails. My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027.... The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.

Original Article

View Cached Full Text

Cached at: 06/09/26, 08:58 AM

It’s finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represents over 1000+ hours of maintainer validated software engineering work most frontier models cannot yet solve, much less solve with high quality.

Cog had IOI Gold medalists and top code maintainers Look At The Data — FrontierCode includes 3000+ rubrics covering code quality and anticheat reward hacking plaguing other benchmarks.

FC Diamond is so hard that Opus 4.8 scores 13.8%.

Three eras of AI coding : Three eras of benchmarks

2021 • Autocomplete : HumanEval 2023 • Passing Tests: SWEBench, TerminalBench 2026 • Maintainable Code: FrontierCode

to me the most beautiful chart when I requested a special historical run into all extant old models, the data was finding that the easiest third of FC tasks (in FC Extended) were rapidlly and suddenly solved over late 2025 - Opus almost doubled from a 41% pass rate to 74% in 4 months.

This describes the “WTF happened in Dec 2025” vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley’s ralph loops or @bcherny’s /goals or @steipete’s “loops that prompt your agents” without fearing too much that things go off the rails.

My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I’ve already asked the team to prepare FrontierCode 2027….

The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.

previously on @cognition_labs

hey man im just a writer not a chart pornographer

good question :)

all combinations were run but the intent of this release is to measure models not harnesses. my chart puts all models on minisweagent

im on the FC team

i do. data a bit messier there. i dont think im cleared to share those ones rn but it’ll all be open sourced soon enough.

in the mean time this is public

@swyx: It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represe…

Similar Articles

FrontierCode

@dabit3: FrontierCode is the first eval to measure the metric that matters most in real software engineering: would you actually…

@denizbirlikci: To understand why we built FrontierCode, read @METR_Evals's blog post on why "many SWE-bench-passing PRs would not be m…

@scaling01: Opus 4.8 is the best coding model out there FrontierCode by Cognition is probably the highest quality coding benchmark …

@Murderlon: FrontierCode finally dropped, a coding agents benchmark for the real world. Human-verified through an extensive hardeni…

Submit Feedback

Similar Articles

@dabit3: FrontierCode is the first eval to measure the metric that matters most in real software engineering: would you actually…

@denizbirlikci: To understand why we built FrontierCode, read @METR_Evals's blog post on why "many SWE-bench-passing PRs would not be m…

@scaling01: Opus 4.8 is the best coding model out there FrontierCode by Cognition is probably the highest quality coding benchmark …

@Murderlon: FrontierCode finally dropped, a coding agents benchmark for the real world. Human-verified through an extensive hardeni…