@dabit3: FrontierCode is the first eval to measure the metric that matters most in real software engineering: would you actually…

X AI KOLs Following 06/08/26, 08:37 PM Papers

coding-eval benchmark frontiercode software-engineering ai-evaluation open-source merge-quality

Summary

FrontierCode is a new coding evaluation benchmark that measures code mergeability, claiming 81% fewer misclassification errors than SWE-Bench Pro. Tasks were crafted by maintainers of open-source projects like Celery, uppy, and Mattermost.

Original Article

Cached at: 06/08/26, 11:28 PM

FrontierCode is the first eval to measure the metric that matters most in real software engineering: would you actually merge this code?

The result is 81% fewer misclassification errors than SWE-Bench Pro, making FrontierCode the most accurate model ranking available today.

Every task was crafted by the actual maintainers of repos like Celery, uppy, and Mattermost. I also encourage you to dive into the blog post; it’s beautiful and interactive!

Cognition (@cognition): Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.

Models write sloppy code that works but isn’t maintainable. Our eval is the first to measure: would you actually merge this code?

Similar Articles

@denizbirlikci: To understand why we built FrontierCode, read @METR_Evals's blog post on why "many SWE-bench-passing PRs would not be m…

FrontierCode

@swyx: It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represe…

FrontierCode: a coding eval that raises the bar for difficulty & quality.

@Murderlon: FrontierCode finally dropped, a coding agents benchmark for the real world. Human-verified through an extensive hardeni…
