@swyx: Finally! the first eval ship from cog!!!!!!!!!! To contextualize: @METR_Evals cap out at ~16 hours. Cog has private ent…


Summary

Cognition shipped its first public eval for Devin. Its private enterprise evals cover tasks of up to 100 hours, and the company backs them with a financial guarantee. The ground-truth dataset consists of 258 real-world Java/TypeScript/Python/C# sessions (feature development, bug fixes, migrations) from 126 users across enterprise customers, and aims to measure engineering productivity more accurately than existing benchmarks.


Cached at: 06/05/26, 11:21 PM

Finally! the first eval ship from cog!!!!!!!!!!

To contextualize: @METR_Evals cap out at ~16 hours.

Cog has private enterprise evals up to 100hrs, and is confident enough to put a financial guarantee on it

METR dataset: ML eng, GPU kernels, cybersecurity

“METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff on 34 sessions labeled on human ground truth.” r_log of 0.83

Cog dataset: real life java/typescript/python/c# feature dev, bugfixes, migrations

“We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers.” r_log of 0.74 on a held-out set

this is pioneering real world evals work and part 1 of a broader frontier code evals drop that I’m really looking forward to writing up. huge kudos to @annarmitchell and @ryanbai1412 for leading the unglamorous last mile data collection!!
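
The post does not define the r_log figures it quotes. One plausible reading, and it is only an assumption here, is a Pearson correlation between the logarithms of the estimated and the ground-truth human-equivalent times. A minimal sketch under that assumption follows; the function name `log_pearson_r` and the sample hours are made up for illustration and are not METR or Cognition data.

```python
import math


def log_pearson_r(estimated_hours, actual_hours):
    """Pearson correlation between log(estimated) and log(actual) durations.

    Assumption: this is one plausible meaning of "r_log"; the post itself
    does not spell out the definition.
    """
    if len(estimated_hours) != len(actual_hours):
        raise ValueError("inputs must have the same length")
    if len(estimated_hours) < 2:
        raise ValueError("need at least two sessions")

    # Work in log space so a 2x error counts the same at 1 hour as at 50 hours.
    xs = [math.log(v) for v in estimated_hours]
    ys = [math.log(v) for v in actual_hours]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")

    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-session hours: model estimate vs. human-reported ground truth.
estimated = [0.5, 2.0, 6.0, 20.0, 80.0]
reported = [1.0, 1.5, 9.0, 16.0, 60.0]
r_log = log_pearson_r(estimated, reported)
```

Working in log space means an estimate that is off by a constant factor (say, always 3x too high) still scores a correlation of 1, so a figure like this measures how well the ranking and scaling of task lengths are captured, not whether the absolute hours are right.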

Cognition (@cognition): AI should earn its keep. Introducing the AI Productivity Guarantee.

If Devin delivers less engineering value than you’re paying for, Cognition will fund your usage until it does, up to $10 million.

It’s time for the AI industry to stop maximizing tokens and start maximizing
