Someone did an audit on the new DeepSWE, the results aren't pretty
Summary
DeepSWE is a new benchmark for evaluating AI coding agents on real-world software engineering tasks from active open-source repositories, comprising 113 tasks across TypeScript, Go, Python, JavaScript, and Rust with isolated environments and program-based verifiers.
Cached at: 06/03/26, 09:47 PM
datacurve-ai/deep-swe
Source: https://github.com/datacurve-ai/deep-swe
DeepSWE
DeepSWE is a benchmark for measuring frontier coding agents on original, long-horizon software engineering tasks drawn from active open-source repositories. The benchmark includes 113 tasks across TypeScript, Go, Python, JavaScript, and Rust, with isolated environments and program-based verifiers.
Task format
DeepSWE tasks use the Harbor task format:
task.toml: metadata (repository, base commit, language, prebuilt image, resource limits)
instruction.md: the prompt the agent sees
environment/: Dockerfile that reproduces the prebuilt image (fallback if the image is unavailable)
tests/: verifier, made up of test.sh (entry point) and test.patch (test additions, applied at grading time)
solution/: reference solution (held out from the agent; for human and AI reviewers)
The verifier exercises the behavior the prompt describes. It accepts any solution whose observable behavior is correct, regardless of internal symbol names or structure.
The reference patch in solution/ is never used at grading time; it exists so reviewers can spot-check correctness offline.
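The idea of a behavior-based verifier can be illustrated with a toy example. The sketch below is not taken from any DeepSWE task: slugify_a, slugify_b, and behavioral_check are hypothetical names. It shows two implementations with different internals passing the same black-box check, because the check looks only at inputs and outputs.

```python
def slugify_a(title):
    # Implementation 1: split on any whitespace, then join with dashes.
    return "-".join(title.lower().split())


def slugify_b(title):
    # Implementation 2: different internals, same observable behavior.
    words = [w for w in title.lower().split(" ") if w]
    return "-".join(words)


def behavioral_check(slugify):
    """Black-box check: inspects results only, never names or structure."""
    return slugify("Hello  World") == "hello-world" and slugify("A") == "a"
```

Both behavioral_check(slugify_a) and behavioral_check(slugify_b) return True, while a function with the wrong behavior, such as str.lower, fails the check.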
Quickstart
Use Pier to run the benchmark:
git clone https://github.com/datacurve-ai/deep-swe
uv tool install datacurve-pier
# Claude Opus 4.7 via mini-swe-agent
export ANTHROPIC_API_KEY=...
pier run -p deep-swe/tasks --agent mini-swe-agent --model anthropic/claude-opus-4-7
# GPT-5.5 via mini-swe-agent
export OPENAI_API_KEY=...
pier run -p deep-swe/tasks --agent mini-swe-agent --model openai/gpt-5.5
What is Pier
Pier is a Harbor-compatible framework for sandboxed coding-agent evals. It began as a fork of Harbor to support CLI agents in air-gapped tasks: in tasks with allow_internet = false, Harbor blocks all outbound traffic, including the dependency installs and LLM API calls that CLI agents rely on. Pier adds per-agent network allowlists, giving each agent only the network access it needs while keeping the task environment isolated.
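The per-agent allowlist idea can be sketched as a host lookup. This is an illustration of the concept only, not Pier's implementation or configuration format: AGENT_ALLOWLISTS, its entries, and is_allowed are all hypothetical, and a real sandbox would enforce the policy at the network layer rather than in application code.

```python
from urllib.parse import urlsplit

# Hypothetical allowlists for illustration; not Pier's real configuration.
AGENT_ALLOWLISTS = {
    "claude-code": {"api.anthropic.com"},
    "codex": {"api.openai.com"},
}


def is_allowed(agent, url):
    """Return True if the URL's host is on the agent's allowlist."""
    host = urlsplit(url).hostname
    # Unknown agents get an empty allowlist, so everything is denied.
    return host is not None and host in AGENT_ALLOWLISTS.get(agent, set())
```

Under these assumed lists, a request from claude-code to https://api.anthropic.com/v1/messages is allowed, while the same agent reaching any other host is denied.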
Pier also adds more complete trajectory metadata, an improved trajectory viewer, and the pier critique run command for analyzing agent trajectories. All leaderboard scores were produced with Pier running mini-swe-agent on Modal.
Agents and models
mini-swe-agent is model-agnostic. Pier also drives claude-code, codex, gemini-cli, and opencode directly. Pass --env modal to run in parallel sandboxes on Modal.
Subsets and single tasks
Deterministic random subset of the 113-task corpus:
pier run -p deep-swe/tasks --agent mini-swe-agent --n-tasks 10 --sample-seed 0
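A deterministic random subset means the same seed always selects the same tasks. The sketch below shows one common way to get that property; it is an assumption for illustration, not Pier's actual sampling code, and sample_tasks is a hypothetical helper.

```python
import random


def sample_tasks(task_ids, n_tasks, seed):
    """Pick n_tasks IDs reproducibly: same inputs give the same subset."""
    # Sort first so the result does not depend on directory listing order.
    ordered = sorted(task_ids)
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(ordered, n_tasks)
```

With this approach, two runs that pass the same task IDs, subset size, and seed evaluate the same subset, which keeps scores comparable across models.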
Single task:
pier run -p deep-swe/tasks/<task-id> --agent mini-swe-agent