frontiercode

#frontiercode

@devindesktop: Try Kimi K3 in Devin Desktop and the Devin CLI.

X AI KOLs Following ↗ · yesterday Cached

Kimi K3, an open-source AI model that approaches frontier-level performance on FrontierCode 1.1, is now available in Devin Desktop and the Devin CLI.

@dabit3: FrontierCode is the first eval to measure the metric that matters most in real software engineering: would you actually…

X AI KOLs Following ↗ · 2026-06-08 Cached

FrontierCode is a new coding evaluation benchmark that measures code mergeability, claiming 81% fewer misclassification errors than SWE-Bench Pro. Tasks were crafted by maintainers of open-source projects like Celery, uppy, and Mattermost.

@swyx: It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represe…

X AI KOLs Following ↗ · 2026-06-08 Cached

FrontierCode is a new coding benchmark from METR and Cognition that evaluates AI models on code maintainability and quality, revealing that many models produce unmergeable code. It includes over 1000 hours of work and shows that even top models struggle, with Opus 4.8 achieving only 13.8% on the hardest tier.

@denizbirlikci: To understand why we built FrontierCode, read @METR_Evals's blog post on why "many SWE-bench-passing PRs would not be m…

X AI KOLs Following ↗ · 2026-06-08 Cached

Cognition announces FrontierCode, a new coding evaluation benchmark that goes beyond unit tests to measure code quality, scope, test correctness, and human reviewer approval, addressing the issue of agents writing sloppy code that passes tests but is not maintainable.

@scaling01: Opus 4.8 is the best coding model out there FrontierCode by Cognition is probably the highest quality coding benchmark …

X AI KOLs Timeline ↗ · 2026-06-08 Cached

Cognition introduces FrontierCode, a high-quality coding benchmark that goes beyond unit tests to measure code maintainability, regression safety, and quality, with 150 handcrafted tasks by open-source developers.

@SanthProject: Now this is a bench i can get behind not the rigged as fuck deepswe benchmark

X AI KOLs Following ↗ · 2026-06-08 Cached

SanthProject praises Cognition's new FrontierCode coding evaluation benchmark, calling it a fair alternative to the DeepSwe benchmark.
