Someone audited the new DeepSWE benchmark, and the results aren't pretty

Reddit r/singularity Tools

Summary

DeepSWE is a new benchmark for evaluating AI coding agents on real-world software engineering tasks from active open-source repositories, comprising 113 tasks across TypeScript, Go, Python, JavaScript, and Rust with isolated environments and program-based verifiers.

The audit, posted on the DeepSWE benchmark's GitHub, focuses mainly on DeepSeek failing in many places where it shouldn't have, but it also exposes broader problems with how the benchmark was run. The benchmark appears to have been rushed out and needs considerably more work before it can serve as a reliable reference for the quality of the models it evaluated.

Cached at: 06/03/26, 09:47 PM

datacurve-ai/deep-swe

Source: https://github.com/datacurve-ai/deep-swe

DeepSWE

DeepSWE is a benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks drawn from active open-source repositories. The benchmark includes 113 tasks across TypeScript, Go, Python, JavaScript, and Rust, with isolated environments and program-based verifiers.

Task format

DeepSWE tasks use the Harbor task format:

task.toml         Metadata: repository, base commit, language, prebuilt image, resource limits
instruction.md    The prompt the agent sees
environment/      Dockerfile that reproduces the prebuilt image (fallback if the image is unavailable)
tests/            Verifier: test.sh (entry point) + test.patch (test additions, applied at grading time)
solution/         Reference solution (held out from the agent; for human and AI reviewers)
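
As a rough sanity check on the layout above, a small script can report which of the listed entries are missing from a task directory. This is only a sketch: the function name missing_task_entries is hypothetical, it is not part of Harbor or Pier, and it checks nothing beyond the presence of the paths named in the listing.

```python
from pathlib import Path

# Entries named in the task-format listing above.
EXPECTED_ENTRIES = (
    "task.toml",
    "instruction.md",
    "environment",
    "tests/test.sh",
    "tests/test.patch",
    "solution",
)


def missing_task_entries(task_dir):
    """Return the expected entries that are absent from task_dir.

    Hypothetical helper for illustration; it only tests that each
    path exists and does not validate file contents.
    """
    root = Path(task_dir)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]
```

An empty list means every entry from the listing is present; it says nothing about whether task.toml or the verifier are well-formed.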

The verifier exercises the behavior the prompt describes. It accepts any solution whose observable behavior is correct, regardless of internal symbol names or structure. The reference patch in solution/ is never used at grading time; it exists so reviewers can spot-check correctness offline.
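
To make the idea of behavior-based grading concrete, here is a toy sketch in Python. It is not DeepSWE's verifier (which is test.sh plus test.patch); the dedupe task and all function names are invented for illustration. The check looks only at inputs and outputs, so two implementations with different internals both pass, while one with wrong observable behavior fails.

```python
def behaves_correctly(dedupe):
    """Behavioral check: inspects only the inputs and outputs of the callable.

    The hypothetical task is "remove duplicates, keeping first-seen order".
    """
    cases = [
        ([], []),
        ([1, 1, 2], [1, 2]),
        (["b", "a", "b"], ["b", "a"]),
    ]
    return all(dedupe(list(given)) == expected for given, expected in cases)


def dedupe_with_set(items):
    # Implementation A: explicit loop with a "seen" set.
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def dedupe_with_dict(items):
    # Implementation B: different structure, same observable behavior.
    return list(dict.fromkeys(items))


def broken_dedupe(items):
    # Removes duplicates but loses first-seen order, so it should fail.
    return sorted(set(items))
```

Here behaves_correctly accepts both dedupe_with_set and dedupe_with_dict and rejects broken_dedupe, without ever referring to how each function is written or what its helpers are called.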

Quickstart

Use Pier to run the benchmark:

git clone https://github.com/datacurve-ai/deep-swe
uv tool install datacurve-pier

# Claude Opus 4.7 via mini-swe-agent
export ANTHROPIC_API_KEY=...
pier run -p deep-swe/tasks --agent mini-swe-agent --model anthropic/claude-opus-4-7

# GPT-5.5 via mini-swe-agent
export OPENAI_API_KEY=...
pier run -p deep-swe/tasks --agent mini-swe-agent --model openai/gpt-5.5

What is Pier

Pier is a Harbor-compatible framework for sandboxed coding-agent evals. It began as a fork of Harbor to support CLI agents in air-gapped tasks: Harbor blocks all outbound traffic in allow_internet = false tasks, including dependency installs and LLM API calls. Pier adds per-agent network allowlists, giving agents only the network access they need while keeping the task environment isolated.
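
The per-agent allowlist idea can be sketched as a simple hostname check. This is an illustration of the concept only, not Pier's implementation or configuration format: the ALLOWLISTS mapping, the hosts assigned to each agent, and the function is_request_allowed are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical per-agent allowlists; the mapping is illustrative only
# and does not reflect Pier's actual configuration.
ALLOWLISTS = {
    "claude-code": {"api.anthropic.com"},
    "codex": {"api.openai.com"},
}


def is_request_allowed(agent, url):
    """Return True only if the URL's host is on the agent's allowlist.

    Unknown agents get an empty allowlist, so everything is denied
    by default.
    """
    host = urlsplit(url).hostname
    return host is not None and host in ALLOWLISTS.get(agent, set())
```

The design point is deny-by-default: the task environment stays isolated, and each agent reaches only the hosts explicitly listed for it.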

Pier also adds more complete trajectory metadata, a better trajectory viewer, and pier critique run for analyzing agent trajectories. All leaderboard scores were produced with Pier running mini-swe-agent on Modal.

Agents and models

mini-swe-agent is model-agnostic. Pier also drives claude-code, codex, gemini-cli, and opencode directly. Pass --env modal to run in parallel sandboxes on Modal.

Subsets and single tasks

Deterministic random subset of the 113-task corpus:

pier run -p deep-swe/tasks --agent mini-swe-agent --n-tasks 10 --sample-seed 0
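
One common way to get a deterministic random subset is to sort the task IDs and sample with a seeded generator. The sketch below shows that technique under stated assumptions: sample_tasks is a hypothetical function, and nothing here claims to match how Pier implements --n-tasks and --sample-seed internally.

```python
import random


def sample_tasks(task_ids, n_tasks, seed):
    """Pick n_tasks IDs reproducibly from task_ids.

    Sorting first makes the result independent of the input order;
    a dedicated random.Random(seed) avoids touching global RNG state.
    """
    rng = random.Random(seed)
    return rng.sample(sorted(task_ids), n_tasks)
```

With the same seed and the same set of task IDs, repeated calls return the same subset, which is what makes results on a subset comparable across runs.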

Single task:

pier run -p deep-swe/tasks/<task-id> --agent mini-swe-agent
