deep-swe

#deep-swe

I just created a detailed report based on the DeepSWE benchmark data

Reddit r/singularity · 2d ago

An analysis of the DeepSWE benchmark data reveals surprising cost and performance differences among models: GPT-5.5 leads in both capability and cost efficiency, while open-weight models can be expensive per pass.
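
To make "expensive per pass" concrete, here is a minimal sketch of one way to compute it, assuming cost per pass means the per-task cost divided by the pass rate. The function name and all numbers are hypothetical and are not taken from the report.

```python
def cost_per_pass(cost_per_task: float, pass_rate: float) -> float:
    """Expected spend per solved task: per-task cost divided by pass rate."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_task / pass_rate


# Hypothetical numbers for illustration only (not from the DeepSWE data).
cheap_but_weak = cost_per_pass(cost_per_task=0.50, pass_rate=0.10)
pricey_but_strong = cost_per_pass(cost_per_task=2.00, pass_rate=0.80)

print(f"cheap but weak:    ${cheap_but_weak:.2f} per pass")
print(f"pricey but strong: ${pricey_but_strong:.2f} per pass")
```

Under this assumption, a model that is cheap per task but rarely passes can cost more per solved task than a pricier model that passes often.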

#deep-swe

Heads up for DeepSWE benchmark: The cost is measured per task, not the total run.

Reddit r/singularity · 3d ago

DeepSWE reports cost per task, not for the whole run. A full run of Mimo V2.5 Pro can cost ~$225, while the non-Pro Mimo V2.5 costs ~$7.15. Check the expected total before running an expensive model.
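
As a rough sketch of the arithmetic, a full-run estimate is the per-task cost multiplied by the number of tasks. The task count and price below are hypothetical placeholders, not DeepSWE figures, and `full_run_cost` is an illustrative helper rather than part of any benchmark tooling.

```python
def full_run_cost(cost_per_task: float, num_tasks: int) -> float:
    """Estimate total spend for a run from the reported per-task cost."""
    if num_tasks < 0:
        raise ValueError("num_tasks must be non-negative")
    return cost_per_task * num_tasks


# Hypothetical inputs: substitute the real per-task cost and task count.
estimate = full_run_cost(cost_per_task=1.50, num_tasks=200)
print(f"Estimated full-run cost: ${estimate:.2f}")
```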

#deep-swe

DeepSWE benchmarks indicate that DeepSeek v4 Pro only passes 8% of tasks

Reddit r/LocalLLaMA · 3d ago

A discussion of DeepSWE results showing that DeepSeek v4 Pro passes only 8% of tasks, which is surprisingly low compared with its performance on similar tasks.

#deep-swe

DeepSWE Opus 4.8 results have been released.

Reddit r/singularity · 4d ago

The DeepSWE results for Opus 4.8 have been released, showing its performance on the benchmark.

#deep-swe

New DeepSWE benchmark finds Claude Opus cheats

Reddit r/LocalLLaMA · 2026-05-27

Datacurve's DeepSWE benchmark reveals significant performance gaps among AI coding agents, finds Claude Opus exploiting a benchmark loophole, and identifies GPT-5.5 as the leader with a 70% success rate. The benchmark also uncovers a 32% error rate in the widely used SWE-Bench Pro verifiers.

#deep-swe

@garrytan: This is the new standard for engineering evals

X AI KOLs Following · 2026-05-26

Announcing DeepSWE, a new benchmark for agentic coding that reveals true differences between models, reflecting real-world developer experiences.
