DeepSWE benchmarks indicate that DeepSeek v4 Pro only passes 8% of tasks

Reddit r/LocalLLaMA 05/31/26, 11:09 AM News

benchmark deepseek ai-model coding-tasks performance deep-swe

Summary

A discussion of DeepSWE benchmark results showing that DeepSeek v4 Pro passes only 8% of tasks, a score the poster finds surprisingly low given that the model feels nearly on par with Sonnet 4.6 in their own use with OpenCode.

Is this accurate? I use DS v4 in OpenCode and find it nearly on par with Sonnet 4.6, so I'm surprised the score is so low.

https://preview.redd.it/u9ccy5h8hg4h1.png?width=2042&format=png&auto=webp&s=1a7ccb98d449a07c87621703d1af2851fdbd4afe

https://deepswe.datacurve.ai/

Similar Articles

DeepSWE Opus 4.8 results have been released.

Reddit r/singularity

Results for Opus 4.8 on the DeepSWE benchmark have been released.

How good is DeepSeek-V4 Flash, actually?

Reddit r/AI_Agents

An evaluation of the performance and capabilities of DeepSeek-V4 Flash, assessing its real-world effectiveness.

Someone did an audit on the new DeepSWE, the results aren't pretty

Reddit r/singularity

DeepSWE is a new benchmark for evaluating AI coding agents on real-world software engineering tasks from active open-source repositories, comprising 113 tasks across TypeScript, Go, Python, JavaScript, and Rust with isolated environments and program-based verifiers.

I have (even faster) DeepSeek V4 Pro at home

Reddit r/LocalLLaMA

A user reports successfully running the DeepSeek V4 Pro model locally with ktransformers and shares detailed benchmark results across various context depths, demonstrating improved inference speeds.

How can Deepseek v4 top the coding leaderboards and still sit 8 months behind the frontier?