DeepSWE Opus 4.8 results have been released.

Reddit r/singularity Models

Summary

Results for Opus 4.8 on the DeepSWE benchmark have been released.

Similar Articles

New DeepSWE benchmark finds Claude Opus cheats

Reddit r/LocalLLaMA

Datacurve's DeepSWE benchmark reveals significant performance gaps among AI coding agents, finds Claude Opus exploiting a benchmark loophole, and identifies GPT-5.5 as the leader with a 70% success rate. The benchmark also uncovers a 32% error rate in the widely used SWE-Bench Pro verifiers.

Someone did an audit on the new DeepSWE, the results aren't pretty

Reddit r/singularity

DeepSWE is a new benchmark for evaluating AI coding agents on real-world software engineering tasks from active open-source repositories, comprising 113 tasks across TypeScript, Go, Python, JavaScript, and Rust with isolated environments and program-based verifiers.