Someone did an audit on the new DeepSWE, the results aren't pretty

Reddit r/singularity 06/03/26, 07:50 PM Tools

Summary

DeepSWE is a new benchmark for evaluating AI coding agents on real-world software engineering tasks from active open-source repositories, comprising 113 tasks across TypeScript, Go, Python, JavaScript, and Rust with isolated environments and program-based verifiers.

While this post on the DeepSWE Benchmark github is mainly focused on DeepSeek failing in many places where it shouldn't, it shows many problems with how the benchmark was conducted. It seems that the benchmark was rushed out the door and still needs a lot more work before it can be considered a reliable reference for the quality of the models they benchmarked.

Original Article

View Cached Full Text

Cached at: 06/03/26, 09:47 PM

datacurve-ai/deep-swe

Source: https://github.com/datacurve-ai/deep-swe

DeepSWE

DeepSWE is a benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks drawn from active open-source repositories. The benchmark includes 113 tasks across TypeScript, Go, Python, JavaScript, and Rust, with isolated environments and program-based verifiers.

Task format

DeepSWE tasks use the Harbor task format:

task.toml         Metadata: repository, base commit, language, prebuilt image, resource limits
instruction.md    The prompt the agent sees
environment/      Dockerfile that reproduces the prebuilt image (fallback if the image is unavailable)
tests/            Verifier: test.sh (entry point) + test.patch (test additions, applied at grading time)
solution/         Reference solution (held out from the agent; for human and AI reviewers)

The verifier exercises the behavior the prompt describes. It accepts any solution whose observable behavior is correct, regardless of internal symbol names or structure. The reference patch in solution/ is never used at grading time; it exists so reviewers can spot-check correctness offline.

Quickstart

Use Pier to run the benchmark:

git clone https://github.com/datacurve-ai/deep-swe
uv tool install datacurve-pier

# Claude Opus 4.7 via Claude Code
export ANTHROPIC_API_KEY=...
pier run -p deep-swe/tasks --agent mini-swe-agent --model anthropic/claude-opus-4-7

# GPT-5.5 via Codex
export OPENAI_API_KEY=...
pier run -p deep-swe/tasks --agent mini-swe-agent --model openai/gpt-5.5

What is Pier

Pier is a Harbor-compatible framework for sandboxed coding-agent evals. It began as a fork of Harbor to support CLI agents in air-gapped tasks: Harbor blocks all outbound traffic in allow_internet = false tasks, including dependency installs and LLM API calls. Pier adds per-agent network allowlists, giving agents only the network access they need while keeping the task environment isolated.

Pier also adds more complete trajectory metadata, a better trajectory viewer, and pier critique run for analyzing agent trajectories. All leaderboard scores were produced with Pier running mini-swe-agent on Modal.

Agents and models

mini-swe-agent is model-agnostic. Pier also drives claude-code, codex, gemini-cli, and opencode directly. Pass --env modal to run in parallel sandboxes on Modal.

Subsets and single tasks

Deterministic random subset of the 113-task corpus:

pier run -p deep-swe/tasks --agent mini-swe-agent --n-tasks 10 --sample-seed 0

Single task:

pier run -p deep-swe/tasks/<task-id> --agent mini-swe-agent

Someone did an audit on the new DeepSWE, the results aren't pretty

datacurve-ai/deep-swe

DeepSWE

Task format

Quickstart

What is Pier

Agents and models

Subsets and single tasks

Similar Articles

@OpenAI: We audited SWE-Bench Pro, one of the most widely used AI coding benchmarks, and found it no longer reliably measures fr…

New DeepSWE benchmark finds Claude Opus cheats

@garrytan: This is the new standard for engineering evals

@OpenAI: To audit SWE-Bench Pro, we used model-based investigator agents alongside independent reviews from five independent exp…

DeepSWE benchmarks indicate that DeepSeek v4 Pro only passes 8% of tasks

Submit Feedback

Similar Articles

@OpenAI: We audited SWE-Bench Pro, one of the most widely used AI coding benchmarks, and found it no longer reliably measures fr…

New DeepSWE benchmark finds Claude Opus cheats

@garrytan: This is the new standard for engineering evals

@OpenAI: To audit SWE-Bench Pro, we used model-based investigator agents alongside independent reviews from five independent exp…

DeepSWE benchmarks indicate that DeepSeek v4 Pro only passes 8% of tasks