Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills
Summary
Socratic-SWE introduces a closed-loop self-evolution framework for software engineering agents that leverages historical solving traces to generate targeted repair tasks, achieving 50.40% on SWE-bench Verified after three iterations.
Cached at: 06/08/26, 03:30 AM
Source: https://huggingface.co/papers/2606.07412
Abstract
Socratic-SWE enables self-evolving software engineering agents by leveraging historical solving traces to generate targeted repair tasks that improve agent performance through iterative refinement.
LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typically create tasks through fixed mutation or bug-injection procedures, making the resulting distributions largely independent of the agent’s own weaknesses and training progress. We introduce Socratic-SWE, a closed-loop self-evolution framework that reuses the agent’s historical solving traces as a source of training signal. Rather than treating traces only as evidence for reward computation, Socratic-SWE distills them into structured agent skills that summarize recurring failures and effective repair patterns. These skills then guide the generation of targeted repair tasks in real repositories. Candidate tasks are checked through execution-based validation and scored with a solver-gradient alignment reward, so that the retained tasks are both verifiable and useful for improving the Solver. The updated Solver produces new traces, enabling the task curriculum to adapt over successive rounds. Across SWE-bench Verified, SWE-bench Lite, SWE-bench Pro, and Terminal-Bench 2.0, Socratic-SWE consistently improves over self-evolving baselines under the same compute budget, reaching 50.40% on SWE-bench Verified after three iterations. These results suggest that solving traces can serve as a scalable substrate for self-evolving SWE agents.
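The loop described in the abstract (traces, then skills, then validated and scored tasks) can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not define the skill format or the solver-gradient alignment reward, so every name here (Trace, Skill, Task, distill_skills, select_tasks) is hypothetical, skills are reduced to frequency counts of failure labels, and the reward is approximated by a plain cosine similarity between two stand-in gradient vectors.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Trace:
    task_id: str
    failure_tag: str  # hypothetical failure label; empty when the task was solved
    solved: bool


@dataclass(frozen=True)
class Skill:
    failure_tag: str
    frequency: int


@dataclass(frozen=True)
class Task:
    skill: Skill
    description: str
    gradient: Sequence[float]  # stand-in for the Solver gradient induced by this task


def distill_skills(traces: List[Trace], top_k: int = 2) -> List[Skill]:
    """Summarize recurring failures as the top_k most frequent failure labels."""
    counts = Counter(t.failure_tag for t in traces if not t.solved)
    return [Skill(tag, n) for tag, n in counts.most_common(top_k)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tasks(
    candidates: List[Task],
    validate: Callable[[Task], bool],
    reference_gradient: Sequence[float],
    threshold: float = 0.0,
) -> List[Task]:
    """Keep tasks that pass validation and align with the reference gradient.

    `validate` stands in for execution-based validation (e.g. running tests);
    the cosine score is an assumed proxy for the alignment reward.
    """
    kept = []
    for task in candidates:
        if not validate(task):
            continue
        if cosine(task.gradient, reference_gradient) > threshold:
            kept.append(task)
    return kept
```

In a full round, the retained tasks would be used to update the Solver, whose new traces feed the next call to distill_skills; that training step is omitted here because the abstract gives no detail on it.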
Similar Articles
SWE-chat: Coding Agent Interactions From Real Users in the Wild
SWE-chat introduces a 6,000-session dataset of real-world coding agent interactions, revealing that only 44% of agent-generated code survives in commits and highlighting inefficiencies and security issues in current AI-assisted development.
SWE Context Bench just proved something I think a lot of coding agent users already feel
A new benchmark paper 'SWE Context Bench' tests whether coding agents can reuse knowledge across tasks, highlighting a gap in existing benchmarks that only evaluate isolated problem-solving. The author discusses solutions like external memory and mentions tools such as langmem, mem0, supermemory, and Greplica.
@rohanpaul_ai: Brilliant new paper from Meta, CMU and other labs. Shows that coding agents improve faster by manufacturing their own s…
A new paper from Meta, CMU, and other labs presents Self-play SWE-RL, a method where coding agents train themselves by manufacturing and fixing bugs in real codebases, achieving significant gains on SWE-bench benchmarks without relying on human-written tasks.
Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents
TRACE is a skill-layer pipeline that mines user corrections from interactive coding agents to compile runtime checks, reducing repeated preference violations significantly better than memory alone, as demonstrated on ClawArena and MemoryArena tasks.
@sheriyuo: Every "self-evolving agent" paper this year has mutated text: prompts, skill files, workflow graphs, memory schemas. MO…
MOSS introduces source-level rewriting for self-evolving agents, enabling fixes to structural failures that text-layer evolution cannot reach. It lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle on OpenClaw without human intervention.