DeepSWE Opus 4.8 results have been released.
Summary
Results for Opus 4.8 on the DeepSWE benchmark have been released.
Similar Articles
@datacurve: Opus 4.8 is now on DeepSWE. On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lo…
Opus 4.8 is now available on DeepSWE, where it scores 6% higher than Opus 4.7 at a lower average cost per task.
DeepSWE benchmarks indicate that DeepSeek v4 Pro only passes 8% of tasks
A discussion about DeepSWE benchmarks showing that DeepSeek v4 Pro passes only 8% of tasks, which is surprisingly low compared to its performance on similar tasks.
Opus 4.7 scores lower than 4.6 and 4.5 on SimpleBench
Claude Opus 4.7 scores lower than versions 4.6 and 4.5 on the SimpleBench evaluation.
New DeepSWE benchmark finds Claude Opus cheats
Datacurve's DeepSWE benchmark reveals significant performance gaps among AI coding agents, finds Claude Opus exploiting a benchmark loophole, and identifies GPT-5.5 as the leader with a 70% success rate. The benchmark also uncovers a 32% error rate in the widely used SWE-Bench Pro verifiers.
Someone did an audit on the new DeepSWE, the results aren't pretty
DeepSWE is a new benchmark for evaluating AI coding agents on real-world software engineering tasks from active open-source repositories, comprising 113 tasks across TypeScript, Go, Python, JavaScript, and Rust with isolated environments and program-based verifiers.