@denizbirlikci: To understand why we built FrontierCode, read @METR_Evals's blog post on why "many SWE-bench-passing PRs would not be m…

X AI KOLs Following 06/08/26, 07:57 PM Tools

coding-eval benchmark code-quality agents swe-bench rubrics frontiercode

Summary

Cognition announces FrontierCode, a coding evaluation benchmark that goes beyond unit tests. Its rubrics grade scope, test correctness, and code quality (whether a human reviewer would approve the diff, as judged by an LLM on a human rubric), targeting agents that write sloppy code that passes tests but is not maintainable.

Original Article

Cached at: 06/09/26, 10:46 AM

To understand why we built FrontierCode, read @METR_Evals’s blog post on why “many SWE-bench-passing PRs would not be merged into main.” A bit old now, but the point still stands.

Agents often write more code — and more slop — than they should. But unit tests have no way to penalize those unnecessary changes; passing is passing, no matter how much junk came along with it.

FrontierCode actively tests for this with rubric types that grade on more ambiguous metrics, like:

SCOPE — did it change more than it should have?
Test correctness — do the agent’s own tests actually catch the bug?
Code quality — would a human reviewer approve this diff? (LLM-judge on human rubric)
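As a rough illustration of the first two rubric types, here is a minimal Python sketch. It is not FrontierCode's implementation: the names `RubricResult`, `check_scope`, and `check_test_correctness` are hypothetical, and the "allowed files" list and the buggy/fixed implementations are assumed inputs that a task author would supply. The scope check flags any file touched outside the expected set; the test-correctness check requires the agent's test to fail on the buggy code and pass on the fixed code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class RubricResult:
    """Outcome of one hypothetical rubric check."""
    name: str
    passed: bool
    detail: str


def check_scope(changed_files: Iterable[str], allowed_files: Iterable[str]) -> RubricResult:
    """Scope: did the change touch files outside the expected set? (assumed criterion)"""
    extra = sorted(set(changed_files) - set(allowed_files))
    if extra:
        return RubricResult("scope", False, "out-of-scope files: " + ", ".join(extra))
    return RubricResult("scope", True, "all changes within expected files")


def check_test_correctness(
    agent_test: Callable[[Callable], None],
    buggy_impl: Callable,
    fixed_impl: Callable,
) -> RubricResult:
    """Test correctness: the agent's test must fail on the bug and pass on the fix."""

    def passes(impl: Callable) -> bool:
        try:
            agent_test(impl)
        except AssertionError:
            return False
        return True

    catches_bug = not passes(buggy_impl)
    accepts_fix = passes(fixed_impl)
    ok = catches_bug and accepts_fix
    detail = f"catches_bug={catches_bug}, accepts_fix={accepts_fix}"
    return RubricResult("test_correctness", ok, detail)


# Toy example: an absolute-value function with a bug for negative inputs.
def buggy_abs(x):
    return x  # bug: negative inputs are returned unchanged


def fixed_abs(x):
    return -x if x < 0 else x


def agent_test(impl):
    assert impl(-3) == 3  # exercises the buggy path


def weak_test(impl):
    assert impl(2) == 2  # passes on buggy code too, so it cannot catch the bug


if __name__ == "__main__":
    print(check_scope(["src/abs.py", "README.md"], ["src/abs.py", "tests/test_abs.py"]))
    print(check_test_correctness(agent_test, buggy_abs, fixed_abs))
    print(check_test_correctness(weak_test, buggy_abs, fixed_abs))
```

In practice the changed-file list could come from something like `git diff --name-only`. The third rubric, code quality, relies on an LLM judge and is not sketched here.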

Cognition (@cognition): Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.

Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

Similar Articles

@dabit3: FrontierCode is the first eval to measure the metric that matters most in real software engineering: would you actually…

FrontierCode: a coding eval that raises the bar for difficulty & quality.

FrontierCode

@cognition: We’ve made improvements to the FrontierCode methodology and are releasing FrontierCode 1.1 with clearer guidelines for …

@Murderlon: FrontierCode finally dropped, a coding agents benchmark for the real world. Human-verified through an extensive hardeni…
