@scaling01: Opus 4.8 is the best coding model out there FrontierCode by Cognition is probably the highest quality coding benchmark …
Summary
Cognition introduces FrontierCode, a high-quality coding benchmark that goes beyond unit tests to measure code maintainability, regression safety, and quality, with 150 handcrafted tasks by open-source developers.
View Cached Full Text
Cached at: 06/09/26, 10:44 AM
Opus 4.8 is the best coding model out there
FrontierCode by Cognition is probably the highest quality coding benchmark we have seen so far
it moves beyond just using unit-testing for scoring, it also tests for regression safety, mechanical cleanliness, test correctness, scope and code quality
20+ open-source developers handcrafted 150 tasks, each of which took over 40 hours to construct
it also tests a more diverse set of programming languages
Cognition (@cognition): Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
Similar Articles
FrontierCode
FrontierCode is a new benchmark from Cognition AI that measures AI models' ability to write high-quality, maintainable code by evaluating mergeability. Results show even top models like Claude Opus 4.8 score only 13.4% on the hardest subset, highlighting a significant gap in code quality.
@denizbirlikci: To understand why we built FrontierCode, read @METR_Evals's blog post on why "many SWE-bench-passing PRs would not be m…
Cognition announces FrontierCode, a new coding evaluation benchmark that goes beyond unit tests to measure code quality, scope, test correctness, and human reviewer approval, addressing the issue of agents writing sloppy code that passes tests but is not maintainable.
@swyx: It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represe…
FrontierCode is a new coding benchmark from METR and Cognition that evaluates AI models on code maintainability and quality, revealing that many models produce unmergeable code. It includes over 1000 hours of work and shows that even top models struggle, with Opus 4.8 achieving only 13.8% on the hardest tier.
@dabit3: FrontierCode is the first eval to measure the metric that matters most in real software engineering: would you actually…
FrontierCode is a new coding evaluation benchmark that measures code mergeability, claiming 81% fewer misclassification errors than SWE-Bench Pro. Tasks were crafted by maintainers of open-source projects like Celery, uppy, and Mattermost.
@Murderlon: FrontierCode finally dropped, a coding agents benchmark for the real world. Human-verified through an extensive hardeni…
FrontierCode is a new benchmark for coding agents, human-verified with a continuous scoring model, designed to evaluate real-world performance.