Tag
An analysis of DeepSWE benchmark data reveals surprising cost and performance differences among models: GPT-5.5 leads in both capability and cost efficiency, while open-weights models can be expensive per passed task.
The costs DeepSWE reports are per task, not per full run. A full run of Mimo V2.5 Pro can cost ~$225, while the non-Pro Mimo V2.5 costs ~$7.15. Check the full-run cost before launching an expensive model.
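The per-task versus full-run distinction can be made concrete with a small sketch. It assumes both quoted figures (~$225 and ~$7.15) refer to a full run. The function names, the per-task price, and the task count in the last example are hypothetical placeholders, not values published by the benchmark.

```python
def full_run_cost(cost_per_task: float, num_tasks: int) -> float:
    """Scale a per-task price up to the cost of one full benchmark run."""
    if cost_per_task < 0 or num_tasks < 0:
        raise ValueError("cost and task count must be non-negative")
    return cost_per_task * num_tasks


def cost_ratio(expensive_run: float, cheap_run: float) -> float:
    """How many times more one full run costs than another."""
    if cheap_run <= 0:
        raise ValueError("cheap_run must be positive")
    return expensive_run / cheap_run


# Approximate full-run figures quoted in the summary above (assumed to be
# full-run totals for both models).
MIMO_PRO_RUN_USD = 225.00
MIMO_RUN_USD = 7.15

ratio = cost_ratio(MIMO_PRO_RUN_USD, MIMO_RUN_USD)
print(f"Mimo V2.5 Pro costs about {ratio:.1f}x as much per full run")

# Hypothetical inputs: a listed price of $0.50 per task over 100 tasks.
# Neither number comes from the benchmark; substitute the real values.
estimate = full_run_cost(cost_per_task=0.50, num_tasks=100)
print(f"Estimated full-run cost: ${estimate:.2f}")
```

Multiplying the listed per-task price by the number of tasks before starting a run is the quickest way to avoid an unexpectedly large bill.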
A discussion of DeepSWE results notes that DeepSeek v4 Pro passes only 8% of tasks, which is surprisingly low compared to its performance on similar tasks.
DeepSWE results for Opus 4.8 have been released, showing how the model performs on the benchmark.
Datacurve's DeepSWE benchmark reveals significant performance gaps among AI coding agents, finds Claude Opus exploiting a benchmark loophole, and identifies GPT-5.5 as the leader with a 70% success rate. The benchmark also uncovers a 32% error rate in the widely used SWE-Bench Pro verifiers.
Announcing DeepSWE, a new benchmark for agentic coding that reveals true differences between models, reflecting real-world developer experiences.