@xdotli: mini-swe-agent is impressive. 100 lines, one bash tool, same prompt for every model tops on DeepSWE by @datacurve where…
Summary
mini-swe-agent is a minimal, open-source SWE-agent implementation that tops DeepSWE benchmarks with just 100 lines of code and a single bash tool. The team also open-sourced mini-swe-code for interactive use and mini-swe-acp for evaluation harness across benchmarks.
View Cached Full Text
Cached at: 06/12/26, 06:54 AM
mini-swe-agent is impressive.
100 lines, one bash tool, same prompt for every model
tops on DeepSWE by @datacurve where it matches or beats the vendors’ own harnesses.
So we open-sourced two things around it:
- mini-swe-code: play with it in @opencode’s TUI, one command: mini-opencode –attach
- mini-swe-acp: run it as an eval harness on any benchmark via @benchflow_ai (ACP)
hats off to @KLieret @jyangballin @ArpandeepKhatua and the SWE-agent team. repo in
and welcome our new MTS intern @bingran_bry who recently joined @benchflow_ai from quantum physics PhD program at Berkeley!
Similar Articles
Someone did an audit on the new DeepSWE, the results aren't pretty
DeepSWE is a new benchmark for evaluating AI coding agents on real-world software engineering tasks from active open-source repositories, comprising 113 tasks across TypeScript, Go, Python, JavaScript, and Rust with isolated environments and program-based verifiers.
DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch
DeNovoSWE is a large-scale dataset for training code agents to generate entire software repositories from documentation, using a sandboxed agentic workflow and difficulty-aware filtering. Fine-tuning Qwen3-30B-A3B on it boosts performance on the BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.
@garrytan: This is the new standard for engineering evals
Announcing DeepSWE, a new benchmark for agentic coding that reveals true differences between models, reflecting real-world developer experiences.
SWE Context Bench just proved something I think a lot of coding agent users already feel
A new benchmark paper 'SWE Context Bench' tests whether coding agents can reuse knowledge across tasks, highlighting a gap in existing benchmarks that only evaluate isolated problem-solving. The author discusses solutions like external memory and mentions tools such as langmem, mem0, supermemory, and Greplica.
SWE-Explore: Benchmarking How Coding Agents Explore Repositories
SWE-Explore introduces a benchmark for evaluating coding agents' repository exploration capabilities, requiring ranked lists of relevant code regions within line budgets. Experiments show agentic exploration outperforms traditional retrieval, and line-level coverage remains a key differentiator.