@swyx: Finally! the first eval ship from cog!!!!!!!!!! To contextualize: @METR_Evals cap out at ~16 hours. Cog has private ent…

X AI KOLs Following 06/04/26, 07:03 PM Products

evals devin enterprise productivity-guarantee cognition code-ai real-world-evals

Summary

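Cognition shipped its first eval work for Devin alongside the AI Productivity Guarantee: private enterprise evals covering tasks of up to 100 hours, backed by a financial guarantee of up to $10 million. The ground-truth dataset consists of 258 sessions from 126 Devin users across enterprise customers, covering real-world Java/TypeScript/Python/C# feature development, bugfixes, and migrations. It aims to measure engineering productivity more accurately than existing benchmarks.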

Original Article

Cached at: 06/05/26, 11:21 PM

Finally! the first eval ship from cog!!!!!!!!!!

To contextualize: @METR_Evals cap out at ~16 hours.

Cog has private enterprise evals up to 100hrs, and is confident enough to put a financial guarantee on it

METR dataset: ML eng, GPU kernels, cybersecurity

“METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff on 34 sessions labeled on human ground truth”. rlog of 0.83

Cog dataset: real life java/typescript/python/c# feature dev, bugfixes, migrations

“We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers.” rlog of 0.74 on held out set

this is pioneering real world evals work and part 1 of a broader frontier code evals drop that I’m really looking forward to writing up. huge kudos to @annarmitchell and @ryanbai1412 for leading the unglamorous last mile data collection!!
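For readers unfamiliar with the "rlog" figures quoted above: the post does not define the term, so the sketch below assumes it means a Pearson correlation computed on log-transformed durations (estimated hours vs. human ground-truth hours). The function name `log_pearson_r` and the sample numbers are hypothetical illustrations, not data or code from METR or Cognition.

```python
import math


def log_pearson_r(estimated_hours, actual_hours):
    """Pearson correlation between log-transformed durations.

    Assumption: this is one plausible reading of "rlog"; the source
    does not spell out the exact definition.
    """
    if len(estimated_hours) != len(actual_hours) or len(estimated_hours) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    # Work in log space so a 1h-vs-2h miss counts the same as 50h-vs-100h.
    xs = [math.log(h) for h in estimated_hours]
    ys = [math.log(h) for h in actual_hours]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")

    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative numbers (hours), not taken from either dataset.
model_estimates = [0.5, 2.0, 8.0, 30.0, 90.0]
human_estimates = [1.0, 1.5, 12.0, 20.0, 100.0]

r_log = log_pearson_r(model_estimates, human_estimates)
print(f"log-space Pearson r: {r_log:.2f}")
```

Working in log space matters when task lengths span minutes to ~100 hours: without it, a handful of very long sessions would dominate the correlation.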

Cognition (@cognition): AI should earn its keep. Introducing the AI Productivity Guarantee.

If Devin delivers less engineering value than you’re paying for, Cognition will fund your usage until it does, up to $10 million.

It’s time for the AI industry to stop maximizing tokens and start maximizing

Similar Articles

@garrytan: This is the new standard for engineering evals

@swyx: It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represe…

CogScale: Scalable Benchmark for Sequence Processing

@denizbirlikci: To understand why we built FrontierCode, read @METR_Evals's blog post on why "many SWE-bench-passing PRs would not be m…

@dabit3: FrontierCode is the first eval to measure the metric that matters most in real software engineering: would you actually…
