@dabit3: FrontierCode is the first eval to measure the metric that matters most in real software engineering: would you actually…


Summary

FrontierCode is a new coding evaluation benchmark that measures code mergeability, claiming 81% fewer misclassification errors than SWE-Bench Pro. Tasks were crafted by maintainers of open-source projects like Celery, uppy, and Mattermost.

FrontierCode is the first eval to measure the metric that matters most in real software engineering: would you actually merge this code? The result is 81% fewer misclassification errors than SWE-Bench Pro, making it the most accurate model ranking available today. Every task was crafted by the actual maintainers of repos like Celery, uppy, and Mattermost. I also encourage you to dive into the blog post; it's beautiful and interactive!

Cached at: 06/08/26, 11:28 PM

Cognition (@cognition): Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.

Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
