mini-swe-agent is a minimal, open-source SWE-agent implementation that tops DeepSWE benchmarks with just 100 lines of code and a single bash tool. The team also open-sourced mini-swe-code for interactive use and mini-swe-acp, an evaluation harness that runs across benchmarks.
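To make the "single bash tool" idea concrete, here is a minimal sketch of an agent loop whose only action is running a shell command. It is not the mini-swe-agent code: `run_bash`, `agent_loop`, and `scripted_policy` are hypothetical names, and the policy is a fixed script standing in for a language model.

```python
import subprocess
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (command, output) pairs seen so far


def run_bash(command: str, timeout: int = 30) -> str:
    """Run one shell command and return its stdout followed by its stderr."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr


def agent_loop(next_command: Callable[[History], str], max_steps: int = 10) -> History:
    """Ask the policy for a command, run it, and feed the output back.

    The loop stops when the policy returns "DONE" or max_steps is reached.
    """
    history: History = []
    for _ in range(max_steps):
        command = next_command(history)
        if command.strip() == "DONE":
            break
        history.append((command, run_bash(command)))
    return history


def scripted_policy(history: History) -> str:
    """Hypothetical stand-in for a language model: a fixed two-step script."""
    script = ["echo hello", "DONE"]
    return script[len(history)] if len(history) < len(script) else "DONE"


history = agent_loop(scripted_policy)
```

In a real agent the policy would be a model call that reads the history and proposes the next command; everything else in the loop can stay this small.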
AgentLens is a framework for process-level assessment of software engineering agent trajectories, revealing that over 10% of passing trajectories exhibit 'Lucky Pass' behavior. It introduces AgentLens-Bench, a dataset annotated with quality scores, and shows that ranking by quality score can shift model rankings significantly.
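A toy illustration of how a quality-based ranking can differ from an outcome-based one. This is only a sketch under assumptions: the baseline is taken to be pass rate, the quality score is taken to be a per-trajectory number in [0, 1] averaged per model, and the models and scores are invented for the example, not drawn from AgentLens-Bench.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    model: str
    passed: bool    # did the final patch pass the tests?
    quality: float  # hypothetical process-quality score in [0, 1]


def rank_models(trajs: List[Trajectory], score: Callable[[Trajectory], float]) -> List[str]:
    """Rank models by the mean of `score` over their trajectories, best first."""
    per_model: Dict[str, List[float]] = {}
    for t in trajs:
        per_model.setdefault(t.model, []).append(score(t))
    means = {m: sum(v) / len(v) for m, v in per_model.items()}
    return sorted(means, key=lambda m: means[m], reverse=True)


# Invented data: model_a passes more often but with low-quality processes.
trajectories = [
    Trajectory("model_a", True, 0.3),
    Trajectory("model_a", True, 0.2),
    Trajectory("model_a", False, 0.4),
    Trajectory("model_b", True, 0.9),
    Trajectory("model_b", False, 0.8),
    Trajectory("model_b", False, 0.7),
]

by_pass_rate = rank_models(trajectories, lambda t: 1.0 if t.passed else 0.0)
by_quality = rank_models(trajectories, lambda t: t.quality)
```

With this data the two lists come out in opposite orders, which is the kind of shift the summary describes.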