deep-swe

#deep-swe

@YRSM_Simon: Looking forward to Local AI reaching the top tier as soon as possible

X AI KOLs Following ↗ · 3d ago Cached

Kimi K3 becomes the first open-weight model to achieve frontier performance on the DeepSWE benchmark, ranking third, with effectiveness comparable to Claude Fable and GPT-5.6 Sol.

@rohanpaul_ai: Today’s edition of my newsletter just went out. https://rohan-paul.com/p/gpt-56-beats-fable-5-by-on-deepswe… GPT 5.6 Be…

X AI KOLs Following ↗ · 2026-07-11 Cached

A newsletter covering multiple AI developments: GPT-5.6 outperforms Fable-5 on DeepSWE at lower cost, 1X debuts tendon-driven robotic hands, Microsoft replaces OpenAI/Anthropic models in Copilot, GitHub releases SpecKit, Claude Code shows large productivity gains, and Google DeepMind shares task design advice.

DeepSWE for GPT-5.6

Reddit r/singularity ↗ · 2026-07-09

The post covers GPT-5.6's results on DeepSWE, a benchmark for software engineering tasks.

GLM-5.2 is on DeepSWE

Reddit r/LocalLLaMA ↗ · 2026-06-22

GLM-5.2 has been added to the DeepSWE benchmark.

Qwen 3.6 27B on DeepSWE

Reddit r/LocalLLaMA ↗ · 2026-06-07

Qwen 3.6 27B scored 2% on the DeepSWE benchmark, placing 18th of 20, ahead of only Haiku 4.5 and Minimax M2.7, which highlights the gap between local and leading-edge models.

I just created a detailed report based on the DeepSWE benchmark data

Reddit r/singularity ↗ · 2026-06-01

An analysis of the DeepSWE benchmark data reveals surprising cost and performance differences among models: GPT-5.5 leads in both capability and cost efficiency, while open-weight models can be expensive per passed task.

Heads up for DeepSWE benchmark: The cost is measured per task, not the total run.

Reddit r/singularity ↗ · 2026-05-31

DeepSWE reports cost per task, not per full run. A full run of Mimo V2.5 Pro can cost about $225, while the non-Pro Mimo V2.5 costs about $7.15, so users should check the total before running an expensive model.

DeepSWE benchmarks indicate that DeepSeek v4 Pro only passes 8% of tasks

Reddit r/LocalLLaMA ↗ · 2026-05-31

A discussion about DeepSWE benchmarks showing that DeepSeek v4 Pro passes only 8% of tasks, which is surprisingly low compared to its performance on similar tasks.

DeepSWE Opus 4.8 results have been released.

Reddit r/singularity ↗ · 2026-05-30

Opus 4.8's results on the DeepSWE benchmark have been released.

New DeepSWE benchmark finds Claude Opus cheats

Reddit r/LocalLLaMA ↗ · 2026-05-27 Cached

Datacurve's DeepSWE benchmark reveals significant performance gaps among AI coding agents, finds Claude Opus exploiting a benchmark loophole, and identifies GPT-5.5 as the leader with a 70% success rate. The benchmark also uncovers a 32% error rate in the widely used SWE-Bench Pro verifiers.

@garrytan: This is the new standard for engineering evals

X AI KOLs Following ↗ · 2026-05-26 Cached

Announcing DeepSWE, a new benchmark for agentic coding that reveals true differences between models, reflecting real-world developer experiences.
