@swyx: Finally! the first eval ship from cog!!!!!!!!!! To contextualize: @METR_Evals cap out at ~16 hours. Cog has private ent…
Summary
Cognition shipped its first eval for Devin. Its private enterprise evals cover tasks of up to 100 hours, and the company is backing them with a financial guarantee. The ground-truth dataset consists of 258 sessions from 126 users across enterprise customers, covering real-world Java/TypeScript/Python/C# feature development, bug fixes, and migrations, and aims to measure engineering productivity more accurately than existing benchmarks.
Cached at: 06/05/26, 11:21 PM
Finally! the first eval ship from cog!!!!!!!!!!
To contextualize: @METR_Evals cap out at ~16 hours.
Cog has private enterprise evals up to 100hrs, and is confident enough to put a financial guarantee on it
METR dataset: ML eng, GPU kernels, cybersecurity
“METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff on 34 sessions labeled on human ground truth”. rlog of 0.83
Cog dataset: real life java/typescript/python/c# feature dev, bugfixes, migrations
“We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers.” rlog of 0.74 on held out set
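For readers who want to see what a number like "rlog of 0.74" could mean, here is a minimal sketch. It assumes "rlog" is the Pearson correlation between the logarithm of the estimated human-equivalent time and the logarithm of the ground-truth time; neither quote spells out the definition, so treat that reading as an assumption. The function name `log_pearson_r` and the sample numbers are made up for illustration and are not from either dataset.

```python
import math


def log_pearson_r(estimated_hours, actual_hours):
    """Pearson correlation between log-transformed durations.

    Assumption: this is one plausible reading of "rlog"; the quoted
    sources do not define the metric.
    """
    if len(estimated_hours) != len(actual_hours) or len(estimated_hours) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    # Work in log space so a 2x miss on a 1-hour task counts the same
    # as a 2x miss on a 50-hour task.
    xs = [math.log(h) for h in estimated_hours]
    ys = [math.log(h) for h in actual_hours]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side is constant")

    return cov / math.sqrt(var_x * var_y)


# Hypothetical held-out sessions: model estimate vs. user-reported hours.
estimated = [0.5, 2.0, 8.0, 40.0]
actual = [1.0, 1.5, 12.0, 30.0]
r = log_pearson_r(estimated, actual)
print(f"r_log = {r:.2f}")
```

Working in log space is a natural choice when task lengths span minutes to ~100 hours, since a plain correlation on raw hours would be dominated by the few longest sessions.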
this is pioneering real world evals work and part 1 of a broader frontier code evals drop that I’m really looking forward to writing up. huge kudos to @annarmitchell and @ryanbai1412 for leading the unglamorous last mile data collection!!
Cognition (@cognition): AI should earn its keep. Introducing the AI Productivity Guarantee.
If Devin delivers less engineering value than you’re paying for, Cognition will fund your usage until it does, up to $10 million.
It’s time for the AI industry to stop maximizing tokens and start maximizing