@xdotli: mini-swe-agent is impressive. 100 lines, one bash tool, same prompt for every model tops on DeepSWE by @datacurve where…

X AI KOLs Timeline 06/12/26, 04:26 AM Tools

open-source swe-agent code-tool evaluation-harness command-line benchmarking

Summary

mini-swe-agent is a minimal, open-source SWE-agent implementation that tops DeepSWE benchmarks with just 100 lines of code and a single bash tool. The team also open-sourced mini-swe-code for interactive use and mini-swe-acp for evaluation harness across benchmarks.

mini-swe-agent is impressive. 100 lines, one bash tool, same prompt for every model tops on DeepSWE by @datacurve where it matches or beats the vendors' own harnesses. So we open-sourced two things around it: - mini-swe-code: play with it in @opencode's TUI, one command: mini-opencode --attach - mini-swe-acp: run it as an eval harness on any benchmark via @benchflow_ai (ACP) hats off to @KLieret @jyangballin @ArpandeepKhatua and the SWE-agent team. repo in and welcome our new MTS intern @bingran_bry who recently joined @benchflow_ai from quantum physics PhD program at Berkeley!

Original Article

View Cached Full Text

Cached at: 06/12/26, 06:54 AM

mini-swe-agent is impressive.

100 lines, one bash tool, same prompt for every model

tops on DeepSWE by @datacurve where it matches or beats the vendors’ own harnesses.

So we open-sourced two things around it:

mini-swe-code: play with it in @opencode’s TUI, one command: mini-opencode –attach
mini-swe-acp: run it as an eval harness on any benchmark via @benchflow_ai (ACP)

hats off to @KLieret @jyangballin @ArpandeepKhatua and the SWE-agent team. repo in

and welcome our new MTS intern @bingran_bry who recently joined @benchflow_ai from quantum physics PhD program at Berkeley!

Similar Articles

Someone did an audit on the new DeepSWE, the results aren't pretty

Reddit r/singularity

DeepSWE is a new benchmark for evaluating AI coding agents on real-world software engineering tasks from active open-source repositories, comprising 113 tasks across TypeScript, Go, Python, JavaScript, and Rust with isolated environments and program-based verifiers.

DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch

Hugging Face Daily Papers

DeNovoSWE is a large-scale dataset for training code agents to generate entire software repositories from documentation, using a sandboxed agentic workflow and difficulty-aware filtering. Fine-tuning Qwen3-30B-A3B on it boosts performance on the BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.

@garrytan: This is the new standard for engineering evals

X AI KOLs Following

Announcing DeepSWE, a new benchmark for agentic coding that reveals true differences between models, reflecting real-world developer experiences.

SWE Context Bench just proved something I think a lot of coding agent users already feel

Reddit r/AI_Agents

A new benchmark paper 'SWE Context Bench' tests whether coding agents can reuse knowledge across tasks, highlighting a gap in existing benchmarks that only evaluate isolated problem-solving. The author discusses solutions like external memory and mentions tools such as langmem, mem0, supermemory, and Greplica.

SWE-Explore: Benchmarking How Coding Agents Explore Repositories