@bibryam: Autogenesis: A Self-Evolving Agent Protocol https://arxiv.org/abs/2604.15034 tldr: Treat every agent component like a v…


Summary

Introduces Autogenesis Protocol (AGP), a self-evolving agent protocol that decouples components from their evolution, enabling lifecycle management, version tracking, and safe rollback for prompts, agents, tools, environments, and memory in LLM-based multi-agent systems.

Autogenesis: A Self-Evolving Agent Protocol https://t.co/zY0mNMaIsj tldr: Treat every agent component like a version-controlled resource that can be automatically improved and rolled back. https://t.co/I1FCThoK0y

Cached at: 06/17/26, 04:01 PM



Autogenesis: A Self-Evolving Agent Protocol

Source: https://arxiv.org/html/2604.15034
Wentao Zhang1,∗, Zhe Zhao2,∗, Haibin Wen4,∗, Yingcheng Wu2, Cankun Guo5, Ming Yin3,†, Bo An1,†, Mengdi Wang3,†
1 Nanyang Technological University, 2 Stanford University, 3 Princeton University, 4 City University of Hong Kong, 5 University of Science and Technology of China
[email protected]@[email protected]
Project Code: https://github.com/DVampire/Autogenesis

Abstract

Recent advances in LLM-based agent systems have shown promise in tackling complex, long-horizon tasks. However, existing agent protocols (e.g., A2A and MCP) underspecify cross-entity lifecycle and context management, version tracking, and evolution-safe update interfaces, which encourages monolithic compositions and brittle glue code. We introduce Autogenesis Protocol (AGP), a self-evolution protocol that decouples what evolves from how evolution occurs. Its Resource Substrate Protocol Layer (RSPL) models prompts, agents, tools, environments, and memory as protocol-registered resources¹ with explicit state, lifecycle, and versioned interfaces. Its Self-Evolution Protocol Layer (SEPL) specifies a closed-loop operator interface for proposing, assessing, and committing improvements with auditable lineage and rollback. Building on AGP, we present Autogenesis System (AGS), a self-evolving multi-agent system that dynamically instantiates, retrieves, and refines protocol-registered resources during execution. We evaluate AGS on multiple challenging benchmarks that require long-horizon planning and tool use across heterogeneous resources. The results demonstrate consistent improvements over strong baselines, supporting the effectiveness of agent resource management and closed-loop self-evolution.

¹ Unless otherwise specified, resources refer to instances of the five RSPL entity types: prompt, agent, tool/MCP/skill, environment, memory with agent outputs/solutions. Tool refers to local code-based tools, MCP tools, and skills.

∗ Equal contribution. First Author Contact: [email protected] † Corresponding authors.

1 Introduction

Recent advances in LLM-based agent systems have demonstrated significant potential in tackling complex, long-horizon tasks (Yao et al., 2022; Wei et al., 2022; Brown et al., 2020), yet static designs often prove insufficient against the diversity and stochasticity of real-world environments. Endowing agents with self-evolution capabilities has thus emerged as a critical avenue toward robust autonomy. However, existing implementations remain largely fragmented and ad hoc: components such as prompts, tools, and memory are tightly coupled to agent logic, shared standards are absent, and the lack of explicit lifecycle management and safe update interfaces introduces significant risks of runtime instability, preventing self-evolution from being composable, auditable, or systematically reproducible.

Although protocols such as MCP (Anthropic, 2025b) and A2A (Google, 2025) have standardized connectivity for model-tool invocation and inter-agent communication, they operate solely at the level of invocation and message passing, leaving internal resource states opaque. Neither provides mechanisms for lifecycle management, version lineage, or controlled state mutation, which are precisely the requirements of a closed-loop evolutionary system. Bridging this gap calls for a dedicated protocol addressing three essential properties: Decoupling, so that resources such as prompts, tools, and memory are managed as independent entities rather than tightly coupled code; Safety & Auditability, through strict version control and rollback to ensure every evolutionary step is traceable and reversible; and Formalism, via standardized operators (e.g., reflect, propose, verify) that convert heuristic modifications into a rigorous control loop.

To address these challenges, we propose Autogenesis Protocol (AGP), a two-layer protocol architecture that formally decouples the evolutionary substrate from the evolutionary logic. The central design principle is to standardize resource representations, enabling uniform application of optimization algorithms (Yuksekgonul et al., 2025; Shao et al., 2024; Hu, 2025b) across heterogeneous agent components. The Resource Substrate Protocol Layer (RSPL) constitutes the substrate of evolution, modeling prompts, agents, tools, environments, and memory systems as protocol-registered resources endowed with explicit state, lifecycle, and versioned interfaces, thereby rendering them well-defined objects amenable to systematic observation and controlled manipulation. The Self-Evolution Protocol Layer (SEPL) establishes a closed-loop operator interface grounded in control theory, specifying a set of atomic operators that formally govern the evolution cycle and guarantee that every self-modification is fully auditable and subject to strict safety constraints. Building upon this protocol, we instantiate Autogenesis System (AGS), a self-evolving multi-agent system that dynamically registers, retrieves, and refines protocol resources at runtime. Empirical evaluation on a suite of challenging benchmarks, including GPQA (Rein et al., 2024), AIME, GAIA (Mialon et al., 2023), HLE (Phan et al., 2025), and LeetCode (LeetCode), demonstrates that AGS achieves consistent and substantial improvements over strong baselines, validating the efficacy of principled resource management and closed-loop self-evolution. The contributions of this work are threefold:

  • We propose Autogenesis Protocol (AGP), a two-layer self-evolution protocol decoupling evolutionary substrate from logic. RSPL endows resources with explicit state, lifecycle, and versioned interfaces; SEPL governs the evolution cycle via a closed-loop operator interface with auditable lineage and rollback.
  • We present Autogenesis System (AGS), a self-evolving multi-agent system that dynamically registers, retrieves, and refines protocol resources at runtime, demonstrating the practical viability of protocol-driven self-evolution.
  • We conduct empirical evaluation on five challenging benchmarks (GPQA, AIME, GAIA, HLE, and LeetCode), demonstrating consistent and substantial improvements over strong baselines and validating the efficacy of principled resource management and closed-loop evolution.

2 Related Work

2.1 LLM-based Agent Systems and Protocols

LLM-based agent systems have demonstrated strong capabilities in complex, long-horizon tasks requiring multi-step reasoning and external tool interaction (Rein et al., 2024; Mialon et al., 2023; Yao et al., 2022; Wei et al., 2022; Schick et al., 2023), with LLMs serving as centralized decision-making modules that decompose tasks and invoke tools to act on the environment. However, most existing frameworks treat prompts, tools, and memory as tightly coupled internal components: tools are manually curated fixed modules integrated directly into the agent pipeline (Qin et al., 2023; Schick et al., 2023; Chen et al., 2021), limiting systematic reuse and controlled adaptation as task requirements evolve. Efforts such as Anthropic's MCP (Anthropic, 2025a) and Google's A2A protocol have standardized model-tool interaction and inter-agent communication at the level of invocation and message passing, but leave the internal state of agents and resources opaque, providing no mechanisms for managing resource lifecycles, tracking version lineage, or constraining state mutations over time. In contrast, our approach treats prompts, agents, local code-based tools, MCP tools, and skills (Anthropic, 2025b) as protocol-registered entities endowed with explicit interfaces and versioned state, thereby supporting dynamic instantiation, controlled refinement, and auditable evolution throughout execution.

2.2 Self-Evolution and Optimization of Agent Components

A parallel line of work investigates iterative agent improvement via gradient-free methods such as TextGrad (Yuksekgonul et al., 2025), which treat natural language feedback as a gradient signal (Pryzant et al., 2023; Zhou et al., 2022), and reinforcement learning approaches such as Reinforce++ (Hu, 2025a) and GRPO (Shao et al., 2024), which frame agent components as policies optimized via evaluation rewards (Shinn et al., 2023; Madaan et al., 2023; Zelikman et al., 2022). More recent frameworks such as EvoAgentX (Wang et al., 2025) and Hermes Agent (NousResearch, 2025) further pursue self-evolving agent workflows, autonomously constructing and refining multi-agent pipelines or skill libraries from interaction history. Despite this progress, these approaches focus on optimizing a narrow subset of agent components, typically prompts or task workflows, and do not provide a unified abstraction for managing the full spectrum of agent-internal entities including prompts, tools, and environments. Updates are applied directly without lifecycle control, version tracking, or rollback, precluding safe and auditable evolution. Our approach addresses this limitation via a two-layer architecture that exposes all agent components as protocol-registered resources governed by a principled closed-loop operator interface.

3 Autogenesis Protocol

Figure 1: The Autogenesis protocol and system architecture.

This section describes the AGP specification, covering the resource substrate and the evolution operator interface. The concrete system instantiation built upon this protocol is presented in the next section. Despite growing interest in self-evolving agents (Gao et al., 2025), most systems remain engineered in an ad hoc manner and lack a shared protocol standard that makes evolution composable, auditable, and interoperable. As shown in Figure 1, we introduce AGP, a two-layer self-evolution protocol. The Resource Substrate Protocol Layer (RSPL) specifies the evolvable substrate, namely which resources may change and how they are represented, versioned, and accessed. The Self-Evolution Protocol Layer (SEPL) specifies the evolution logic, namely how updates are proposed, assessed, and committed through a safe operator interface. Inspired by interface standardization efforts in agent tooling, this separation cleanly decouples what evolves from how evolution occurs, enabling modularity, traceability, and safety-preserving evolution across components.

3.1 Layer 1: Resource Substrate Protocol Layer

The Resource Substrate Protocol Layer (RSPL) defines the evolvable substrate as a set of protocol-registered resources with explicit state, lifecycle, and version lineage. We identify five entity types as a minimal yet expressive common denominator across modern agent stacks, providing a uniform target space on which SEPL can operate: (i) instructions (Prompt), (ii) decision policies (Agent), (iii) actuation interfaces (Tool), encompassing local code-based tools, MCP tools (Anthropic, 2025a), and agent skills (Anthropic, 2025b), (iv) task/world dynamics (Environment), and (v) persistent state (Memory). Crucially, resources in RSPL are passive, meaning they encapsulate no optimization logic, cannot self-modify, and change state only through controlled operations mediated by interfaces and invoked by higher layers. This separation decouples agent logic from task-specific instructions and capability bundles (Wu et al., 2024; Hong et al., 2023; Chen et al., 2023), enabling the same policy to be deployed across tasks with different resource configurations.

3.1.1 Infrastructure Services

A self-evolution protocol requires reliable foundational support in which model access remains consistent as components are swapped, every state transition is traceable and reversible, resources persist across sessions and can be safely hot-swapped, and execution behavior is observable for diagnosis and improvement. To meet these requirements, RSPL provides four cross-cutting infrastructure services: (i) A model manager standardizes LLM API calls across heterogeneous providers, including Anthropic, OpenAI, Google, xAI, and OpenRouter, and supports routing and fallback to ensure consistent model access as resources evolve. (ii) A version manager maintains immutable snapshots and version lineage, enabling rollback, branching, and auditability at every state transition. (iii) A dynamic manager handles serialization and hot-swapping of resource configurations at runtime without restarting the agent system. (iv) A trace manager captures fine-grained execution traces for interpretability, debugging, and retrospective optimization.
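To make the version manager's role concrete, here is a minimal sketch of snapshot, rollback, and branching behavior. The names `Snapshot`, `VersionManager`, and the integer version ids are illustrative assumptions and are not taken from the AGS codebase.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Snapshot:
    """Immutable record of one resource state (hypothetical structure)."""
    version: int
    parent: Optional[int]  # version this snapshot was derived from
    state: Any


@dataclass
class VersionManager:
    """Hypothetical version manager: append-only history plus a movable head."""
    history: Dict[str, List[Snapshot]] = field(default_factory=dict)
    head: Dict[str, int] = field(default_factory=dict)

    def commit(self, name: str, state: Any) -> int:
        # Snapshots are never overwritten; a new one is appended each time.
        snaps = self.history.setdefault(name, [])
        snap = Snapshot(len(snaps), self.head.get(name), copy.deepcopy(state))
        snaps.append(snap)
        self.head[name] = snap.version
        return snap.version

    def current(self, name: str) -> Any:
        return copy.deepcopy(self.history[name][self.head[name]].state)

    def rollback(self, name: str, version: int) -> None:
        # Rollback only moves the head; later snapshots stay in the history.
        if not 0 <= version < len(self.history[name]):
            raise ValueError(f"unknown version {version} for {name!r}")
        self.head[name] = version

    def lineage(self, name: str) -> List[int]:
        # Follow parent pointers from the head back to the first snapshot.
        chain, v = [], self.head.get(name)
        while v is not None:
            chain.append(v)
            v = self.history[name][v].parent
        return chain
```

Committing after a rollback creates a branch: the new snapshot's parent is the restored version, while the abandoned snapshot remains available for audit.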

3.1.2 Core Entities

Definition 3.1 (Resource Entity).

A resource entity and its type-level collection are defined as:

\begin{align}
e_{\tau,i} &= (n_{\tau,i},\, d_{\tau,i},\, \phi_{\tau,i},\, g_{\tau,i},\, m_{\tau,i}), \tag{1} \\
\mathcal{E}_{\tau} &= \{\, e_{\tau,i} \mid i \in \mathcal{I}_{\tau} \,\}, \notag
\end{align}

where $\mathcal{T} = \{\textsc{Prompt}, \textsc{Agent}, \textsc{Tool}, \textsc{Env}, \textsc{Mem}\}$ is the set of RSPL entity types, $\tau \in \mathcal{T}$ indexes the type, $\mathcal{I}_{\tau}$ is the index set for instances of type $\tau$, and $i \in \mathcal{I}_{\tau}$ indexes an individual instance. $n_{\tau,i}$ is a unique name, $d_{\tau,i}$ a short description, $\phi_{\tau,i} : \mathcal{X}_{\tau} \rightarrow \mathcal{Y}_{\tau}$ an input-to-output mapping, $g_{\tau,i} \in \{0,1\}$ an evolvability marker, and $m_{\tau,i}$ an auxiliary metadata dictionary.
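The five-field tuple of Definition 3.1 maps naturally onto a small record type. The sketch below is one possible encoding; the class name `ResourceEntity`, its field names, and the example prompt are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# The five RSPL entity types from Definition 3.1.
ENTITY_TYPES = ("Prompt", "Agent", "Tool", "Env", "Mem")


@dataclass
class ResourceEntity:
    """Hypothetical encoding of e = (n, d, phi, g, m) for a given type tau."""
    entity_type: str                       # tau
    name: str                              # n: unique name
    description: str                       # d: short description
    mapping: Callable[[Any], Any]          # phi: input-to-output mapping
    evolvable: bool = True                 # g: evolvability marker
    metadata: Dict[str, Any] = field(default_factory=dict)  # m

    def __post_init__(self) -> None:
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")


# Illustrative instance: a prompt whose mapping wraps a question in instructions.
solver_prompt = ResourceEntity(
    entity_type="Prompt",
    name="math_solver_prompt",
    description="Wraps a question in solving instructions.",
    mapping=lambda question: f"Solve step by step: {question}",
)
```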

To support resource registration, unified management, and instantiation, RSPL stores a serializable registration record for each resource instance.

Definition 3.2 (Resource Registration Record).

A resource registration record and its type-level collection can be represented as:

\begin{align}
c_{\tau,i} &= (e_{\tau,i},\, v_{\tau,i},\, \eta_{\tau,i},\, \theta_{\tau,i},\, \mathcal{F}_{\tau,i}), \tag{2} \\
\mathcal{C}_{\tau} &= \{\, c_{\tau,i} \mid i \in \mathcal{I}_{\tau} \,\}, \notag
\end{align}

where $\tau \in \mathcal{T}$ indexes the entity type and $i \in \mathcal{I}_{\tau}$ indexes an individual instance. Here $e_{\tau,i}$ is the resource entity tuple defined in Definition 3.1, $v_{\tau,i} \in \mathbb{V}$ is a version string, $\eta_{\tau,i}$ is an implementation descriptor (e.g., import path, class definition, or source-code string), $\theta_{\tau,i}$ are instantiation parameters (e.g., constructor arguments), and $\mathcal{F}_{\tau,i}$ is a set of exported representations used by LLMs to interact with the resource (e.g., function-calling schema, plain text, and structured argument schema).
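As a rough illustration of why the record is kept serializable, the sketch below stores the implementation descriptor as an import path and rebuilds the object on demand. The `module:attribute` path format, the flattening of the entity fields into the record, and the use of `fractions.Fraction` as a stand-in implementation are all assumptions for this example.

```python
import importlib
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class RegistrationRecord:
    """Hypothetical JSON-friendly form of c = (e, v, eta, theta, F)."""
    entity_type: str           # tau
    name: str                  # part of e
    description: str           # part of e
    version: str               # v
    implementation: str        # eta, here an import path "module:attribute"
    params: Dict[str, Any]     # theta: constructor arguments
    exports: Dict[str, Any]    # F: representations shown to the LLM
    evolvable: bool = True     # g, part of e

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "RegistrationRecord":
        return cls(**json.loads(text))


def instantiate(record: RegistrationRecord) -> Any:
    """Resolve the import path and call it with the stored parameters."""
    module_name, attribute = record.implementation.split(":")
    factory = getattr(importlib.import_module(module_name), attribute)
    return factory(**record.params)


# Illustrative record; a standard-library class stands in for a real tool.
example = RegistrationRecord(
    entity_type="Tool",
    name="three_quarters",
    description="Stand-in tool built from a standard-library class.",
    version="1.0.0",
    implementation="fractions:Fraction",
    params={"numerator": 3, "denominator": 4},
    exports={"text": "Returns the fraction 3/4."},
)
```

Because the record holds only strings, numbers, and dictionaries, it survives a JSON round trip unchanged, which is what makes persistence across sessions and hot-swapping straightforward.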

Definition 3.3 (Protocol-registered resource).

For each entity type $\tau$, let $\mathcal{R}_{\tau}$ denote the type-specific registry of protocol-registered resources, and let $\mathcal{R} = \bigcup_{\tau} \mathcal{R}_{\tau}$ denote the corresponding global registry. RSPL associates each entity type $\tau$ with a dedicated context manager $\mathcal{M}_{\tau}$ and a server-exposed interface $\mathcal{A}_{\tau}$. We represent the type-level registered resource as

\[
r_{\tau} = (\mathcal{C}_{\tau},\; \mathcal{M}_{\tau},\; \mathcal{A}_{\tau}), \tag{3}
\]

where each $c_{\tau,i} \in \mathcal{C}_{\tau}$ denotes a registration record as defined in Definition 3.2. The context manager $\mathcal{M}_{\tau}$ maintains the record collection $\mathcal{C}_{\tau}$ and the version lineage associated with type $\tau$, while implementing lifecycle and update operations over these records. The server-exposed interface $\mathcal{A}_{\tau}$ encapsulates $\mathcal{M}_{\tau}$ and provides a unified external interface by delegating incoming requests to the corresponding context-manager routines.

Context manager and server interface. Each resource type is governed by a context manager, which serves as the management plane. It maintains a registry of materialized resources, preserves versioned histories for restoration, and supports contract generation by producing a consolidated capability specification. This specification reduces prompt bloat and enables context engineering through controlled injection. For tools, the contract takes a skills.md-style form (Anthropic, 2025b) that enumerates actions, arguments, and usage constraints. The context-manager API provides operators for lifecycle management (init, build), retrieval (list, get), versioning (update, restore), execution (run), and serialization (save_to_json, load_from_json, save_contract, load_contract). The server interface encapsulates this internal complexity behind a uniform set of endpoints with consistent request and response semantics, providing a single control plane for safe and version-aware interactions with RSPL resources. Full specifications are in Section E.2.2.
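The split between a per-type context manager and a delegating server interface can be sketched as follows. Only the operator names (build, list, get, update, restore, run) come from the text; the class names, signatures, and the `request` endpoint are assumptions, and serialization and contract operators are omitted for brevity.

```python
from typing import Any, Callable, Dict, List

# Subset of the context-manager operators named in the text.
LIFECYCLE_OPS = {"build", "list", "get", "update", "restore", "run"}


class ContextManager:
    """Hypothetical management plane for one resource type."""

    def __init__(self, entity_type: str) -> None:
        self.entity_type = entity_type
        self._versions: Dict[str, List[Callable[..., Any]]] = {}
        self._active: Dict[str, int] = {}

    def build(self, name: str, impl: Callable[..., Any]) -> int:
        if name in self._versions:
            raise ValueError(f"{name!r} is already registered")
        self._versions[name] = [impl]
        self._active[name] = 0
        return 0

    def list(self) -> List[str]:
        return sorted(self._versions)

    def get(self, name: str) -> Callable[..., Any]:
        return self._versions[name][self._active[name]]

    def update(self, name: str, impl: Callable[..., Any]) -> int:
        # Append a new version and make it active; older versions are kept.
        self._versions[name].append(impl)
        self._active[name] = len(self._versions[name]) - 1
        return self._active[name]

    def restore(self, name: str, version: int) -> None:
        if not 0 <= version < len(self._versions[name]):
            raise ValueError(f"unknown version {version} for {name!r}")
        self._active[name] = version

    def run(self, name: str, *args: Any, **kwargs: Any) -> Any:
        return self.get(name)(*args, **kwargs)


class ResourceServer:
    """Hypothetical uniform endpoint that delegates to context managers."""

    def __init__(self, managers: List[ContextManager]) -> None:
        self._managers = {m.entity_type: m for m in managers}

    def request(self, entity_type: str, op: str, *args: Any, **kwargs: Any) -> Any:
        if op not in LIFECYCLE_OPS:
            raise ValueError(f"unsupported operation: {op}")
        return getattr(self._managers[entity_type], op)(*args, **kwargs)
```

Callers only ever see `request`, so a resource can be updated or restored behind the endpoint without changing how it is invoked.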

3.2 Layer 2: Self-Evolution Protocol Layer (SEPL)

The Self-Evolution Protocol Layer (SEPL) formalizes agentic system evolution as a generalized optimization problem over a heterogeneous state space, modeling evolutionary dynamics as a state transition function governed by a strictly typed operator algebra. By mediating all state mutations through standardized RSPL interfaces, SEPL guarantees that evolution is traceable, reversible, and safe-by-construction. While this paper focuses on the reflection-driven optimizer as the primary instantiation, the same state manipulation primitives also accommodate textual-gradient methods such as TextGrad (Yuksekgonul et al., 2025) and reinforcement learning approaches such as GRPO (Shao et al., 2024) and Reinforce++ (Hu, 2025b).

3.2.1 Evolvable Variables

To transition from heuristic adaptation to a systematic evolution protocol, we introduce the concept of variable lifting. This abstraction projects discrete, heterogeneous RSPL resources (e.g., tool code, system prompts) onto a unified representation of evolvable variables. This formalism offers significant theoretical advantages by homogenizing the interaction surface for evolutionary operators and rigorously delineating the trainable subspace via an explicit learnability mask.

Definition 3.4 (Evolvable Variable Set).

We define the universal set of evolvable variables as $\mathcal{V}_{\text{evo}} = \bigl(\bigcup_{\tau \in \mathcal{T}} \mathcal{E}_{\tau}\bigr) \cup \{y\}$, where $\mathcal{E}_{\tau}$ denotes the set of resource entities of type $\tau$ governed by the RSPL. The element $y$ encapsulates execution artifacts, specifically final outputs and reasoning traces, which constitute the observational basis for retrospective optimization. Furthermore, each variable $v \in \mathcal{V}_{\text{evo}}$ is associated with a binary learnability constraint $g_{v} \in \{0,1\}$, thereby strictly defining the trainable parameter subspace $\Theta = \{ v \in \mathcal{V}_{\text{evo}} \mid g_{v} = 1 \}$.
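The learnability mask amounts to a filter over the lifted variables. A small sketch, with the `Variable` type and the example names being assumptions for illustration:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Variable:
    """Hypothetical lifted variable: a resource entity or the output y."""
    name: str
    kind: str        # "Prompt", "Agent", "Tool", "Env", "Mem", or "Output"
    learnable: bool  # g_v in Definition 3.4


def trainable_subspace(variables: List[Variable]) -> List[Variable]:
    """Theta = { v in V_evo | g_v = 1 }."""
    return [v for v in variables if v.learnable]


v_evo = [
    Variable("system_prompt", "Prompt", learnable=True),
    Variable("web_search", "Tool", learnable=False),  # frozen: operators must skip it
    Variable("y", "Output", learnable=True),
]
```

Here `trainable_subspace(v_evo)` keeps `system_prompt` and `y` and drops the frozen tool, so evolution operators never see it as a target.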

3.2.2 Operator Algebra

To systematically govern state transitions over $\mathcal{V}_{\text{evo}}$, we introduce the notion of a SEPL operator: a typed, composable function that reads the current evolvable state together with auxiliary signals, produces an updated state, and emits signals for downstream operators. Formalizing evolution as an algebra of such operators ensures that every modification is interface-mediated, auditable, and reversible, regardless of the specific optimization strategy instantiated.

Definition 3.5 (SEPL Operator).

Let $\mathcal{V}_{\text{evo}}$ be the evolvable variable set and $\mathcal{P}$ a message space carrying auxiliary signals (e.g., traces, hypotheses, gradients, or reward signals) passed between operators. A SEPL operator is a function

\[
f : \mathcal{V}_{\text{evo}} \times \mathcal{P}_{\text{in}} \;\rightarrow\; \mathcal{V}^{\prime}_{\text{evo}} \times \mathcal{P}_{\text{out}}, \tag{4}
\]

where $\mathcal{P}_{\text{in}}, \mathcal{P}_{\text{out}} \subseteq \mathcal{P}$ are the incoming and outgoing message types, and $\mathcal{V}^{\prime}_{\text{evo}}$ is the updated evolvable state. Operators are composable: the output $(\mathcal{V}^{\prime}_{\text{evo}}, \mathcal{P}_{\text{out}})$ of one operator serves as the input to the next, enabling the construction of an evolutionary pipeline $f_{n} \circ \cdots \circ f_{1}$.
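Composability in the sense of Definition 3.5 can be sketched with plain functions that take and return a (state, message) pair. The dictionary-based state and message types and the two toy operators are assumptions made for this example.

```python
from typing import Any, Callable, Dict, Tuple

State = Dict[str, Any]     # stands in for V_evo
Message = Dict[str, Any]   # stands in for an element of P
Operator = Callable[[State, Message], Tuple[State, Message]]


def compose(*operators: Operator) -> Operator:
    """Build f_n ∘ ... ∘ f_1 from operators given in application order."""
    def pipeline(state: State, message: Message) -> Tuple[State, Message]:
        for operator in operators:
            state, message = operator(state, message)
        return state, message
    return pipeline


def reflect(state: State, message: Message) -> Tuple[State, Message]:
    # Leaves the state untouched and emits a hypothesis for the next operator.
    return state, {"hypothesis": "prompt lacks an output format"}


def improve(state: State, message: Message) -> Tuple[State, Message]:
    # Returns a new state rather than mutating the input in place.
    new_state = dict(state)
    new_state["prompt"] = state["prompt"] + " Answer with a single integer."
    return new_state, {"applied": message["hypothesis"]}


pipeline = compose(reflect, improve)
```

Note that `compose(reflect, improve)` applies `reflect` first, which corresponds to `improve ∘ reflect` in the paper's notation.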

3.2.3 Evolutionary Loop

Given an initial evolvable state $\mathcal{V}_{\text{evo}}^{(0)}$ and an empty message $\mathcal{P}^{(0)} = \emptyset$, the evolutionary loop at each iteration $t$ applies a sequence of operators $f_{1}, \ldots, f_{n}$ in composition:

\[
\bigl(\mathcal{V}_{\text{evo}}^{(t+1)},\, \mathcal{P}^{(t+1)}\bigr) = (f_{n} \circ \cdots \circ f_{1})\bigl(\mathcal{V}_{\text{evo}}^{(t)},\, \mathcal{P}^{(t)}\bigr), \tag{5}
\]

where each $f_{i}$ reads the current state and incoming messages and produces an updated state and outgoing messages consumed by $f_{i+1}$. The loop repeats until convergence or budget exhaustion. By routing all state mutations through RSPL interfaces, each transition is versioned and reversible, guaranteeing that evolution is grounded in execution data, traceable through versioned updates, and safe-by-construction. For example, the reflection optimizer instantiates this loop with five operators: Reflect maps execution traces and current state to causal failure hypotheses, Select identifies target evolvable entities from the current state and hypotheses, generating concrete modification proposals, Improve applies proposals via RSPL interfaces to yield a candidate state, Evaluate scores the candidate against the objective and safety invariants, and Commit conditionally accepts or rolls back the transition. Full pseudocode for all instantiations is in Section E.3.
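The five-operator cycle can be illustrated with a deliberately tiny stand-in. This is not the paper's Algorithm 1: the "traces" are replaced by a list of required phrases, the state is a single prompt string, and every function body is an assumption chosen so the control flow (propose, score, then commit or roll back) is easy to follow.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Proposal = Tuple[str, str]  # (target variable, text to append)


def reflect(state: State, required: List[str]) -> List[str]:
    """Hypotheses: required phrases that the current prompt is missing."""
    return [phrase for phrase in required if phrase not in state["prompt"]]


def select(state: State, hypotheses: List[str]) -> List[Proposal]:
    """Turn the first hypothesis into one concrete modification proposal."""
    return [("prompt", hypotheses[0])] if hypotheses else []


def improve(state: State, proposals: List[Proposal]) -> State:
    """Apply proposals to a copy, leaving the committed state untouched."""
    candidate = dict(state)
    for target, addition in proposals:
        candidate[target] = f"{candidate[target]} {addition}."
    return candidate


def evolve(
    state: State,
    required: List[str],
    score_fn: Callable[[State], float],
    max_rounds: int = 3,
) -> Tuple[State, List[bool]]:
    """Run reflect, select, improve, evaluate, commit for up to max_rounds."""
    decisions: List[bool] = []
    for _ in range(max_rounds):
        hypotheses = reflect(state, required)
        if not hypotheses:  # convergence: nothing left to fix
            break
        candidate = improve(state, select(state, hypotheses))
        old_score, new_score = score_fn(state), score_fn(candidate)  # Evaluate
        accepted = new_score > old_score                             # Commit
        if accepted:
            state = candidate
        # Otherwise the candidate is discarded, i.e. the transition is rolled back.
        decisions.append(accepted)
    return state, decisions
```

With a score that counts required phrases, each round is accepted until nothing is missing; with a score that penalizes prompt length, every candidate is rejected and the original state survives unchanged.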

4 Autogenesis System

4.1 Autogenesis System Architecture

As shown in Figure 1, building on AGP, we instantiate the two-layer protocol into AGS, a self-evolving multi-agent system. A self-evolving system requires that agents, tools, and coordination structures remain dynamically modifiable at runtime, which is fundamentally incompatible with monolithic controllers or hard-wired pipelines that tightly couple execution logic to agent identity. To satisfy this requirement, we adopt a bus interaction model (Wu et al., 2024; Hong et al., 2023): the planning agent and all sub-agents register as first-class participants on a shared Agent Bus, and all inter-agent communication is mediated exclusively through standardized bus messages. This design enables loose coupling, transparent observability, and concurrent sub-agent execution, while allowing any participant to be replaced or evolved without disrupting the rest of the system. Throughout all configurations, prompts, tools, and agents are treated as first-class RSPL resources with explicit lifecycle and version lineage. The system operates through three interleaved mechanisms:

Orchestration via Plan Generation. Upon receiving a task via the bus, the planning agent is responsible solely for planning and coordination and does not execute subtasks directly. It produces a structured plan.md artifact comprising five components: the original task description, a to-do list of subtask steps each assigned to a designated sub-agent (e.g., deep researcher agent, browser-use agent, deep analyzer agent, and vibe coding agent), an execution flowchart, a running execution history, and a final result summary. The planning agent dispatches subtasks to the designated sub-agents via the bus, executing independent subtasks concurrently and dependent ones sequentially, and collects all results through the bus before proceeding to the next round.

Concurrent Sub-Agent Execution and Iterative Re-planning. Upon receiving a dispatched subtask, each sub-agent independently retrieves relevant prompt and tool resources from the RSPL registry, executes tool calls, and writes results and reasoning traces to shared memory. Multiple sub-agents execute concurrently, as the bus decouples dispatch from completion. Once a round concludes, the planning agent collects outputs via the bus, updates plan.md, and determines whether the task is complete or a further round of decomposition is required. This collect-and-replan loop continues until the termination condition is met. As a complementary pattern, AGS also supports agent-as-tool composition, in which a sub-agent is wrapped behind a standard RSPL tool schema and invoked directly by a tool-calling agent, enabling lightweight collaboration without bus-level orchestration.
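One round of bus-mediated dispatch and collection might look like the following sketch built on `asyncio`. The `AgentBus` class, its `register` and `dispatch` methods, and the two stub sub-agents are hypothetical; real sub-agents would retrieve RSPL resources and call tools instead of formatting a string.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

Handler = Callable[[str], Awaitable[str]]


class AgentBus:
    """Hypothetical bus: participants register by name and receive subtasks."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, name: str, handler: Handler) -> None:
        self._handlers[name] = handler

    async def dispatch(self, name: str, subtask: str) -> str:
        return await self._handlers[name](subtask)


async def run_round(bus: AgentBus, assignments: List[Tuple[str, str]]) -> List[str]:
    """Dispatch independent subtasks concurrently and collect every result."""
    # asyncio.gather returns results in the order the awaitables were passed.
    return await asyncio.gather(*(bus.dispatch(a, t) for a, t in assignments))


async def researcher(subtask: str) -> str:
    await asyncio.sleep(0)  # placeholder for tool calls
    return f"researched: {subtask}"


async def analyzer(subtask: str) -> str:
    await asyncio.sleep(0)  # placeholder for tool calls
    return f"analyzed: {subtask}"


def demo() -> List[str]:
    bus = AgentBus()
    bus.register("deep_researcher", researcher)
    bus.register("deep_analyzer", analyzer)
    plan = [("deep_researcher", "find sources"), ("deep_analyzer", "summarize data")]
    return asyncio.run(run_round(bus, plan))
```

Because the planner only knows participant names, swapping `researcher` for an evolved version is a single `register` call and does not touch the dispatch logic.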

Self-Evolution. Interleaved with the bus coordination loop, AGS invokes the SEPL evolutionary loop whenever execution traces signal correctable failures or suboptimal performance. The loop applies a sequence of SEPL operators to reflect, select, improve, evaluate, and commit resource modifications as versioned RSPL transitions with auditable lineage and rollback. As an example instantiation, the reflection optimizer (Algorithm 1) reflects on execution traces to derive causal failure hypotheses, generates modification proposals (e.g., prompt text, tool source code, MCP configurations, or skill definitions), and commits accepted updates only after evaluating candidates against the task objective. Successful updates are immediately available to all sub-agents in subsequent bus rounds, ensuring that evolution remains traceable throughout the agent lifetime.

Beyond the reflection optimizer, our implementation supports additional optimization strategies that map naturally onto the same SEPL operator interface. TextGrad (Yuksekgonul et al., 2025) instantiates the proposal and improvement operators as a gradient-informed text editor, treating natural-language feedback as a textual gradient applied to string variables. Reinforce++ / GRPO (Hu, 2025b; Shao et al., 2024; Ouyang et al., 2022; Ziegler et al., 2019; Schulman et al., 2017) adopt a reinforcement-learning perspective, treating evolvable variables as policies optimized via policy-gradient estimates against evaluation rewards. These strategies demonstrate that SEPL is sufficiently general to accommodate inference-time reflection optimization, textual-gradient-based string updates, and reward-driven policy optimization within a unified protocol.

5 Empirical Studies

In this section, we present empirical results of deploying AGS across various challenging benchmarks with AGP to demonstrate its comprehensive capabilities.

Benchmark Instruction. We organize our evaluation into three categories. (i) Scientific and Mathematical Benchmarks. GPQA-Diamond (198 questions) presents graduate-level STEM multiple-choice questions (biology, chemistry, and physics) under a closed-book, non-retrieval protocol, measuring deep scientific understanding and multi-step reasoning. AIME24 and AIME25 each consist of 30 competition-level mathematics problems requiring exact integer answers, measuring long-horizon symbolic reasoning and arithmetic precision. (ii) General Agent Benchmarks. GAIA (Mialon et al., 2023) includes a Validation split (165 tasks) and a Test split (300 tasks), each specifying a real-world, multi-step objective requiring planning and tool use (e.g., web browsing, document operations), measured by task-completion accuracy across three difficulty tiers. Humanity's Last Exam (HLE) (Phan et al., 2025) comprises extremely difficult expert-level questions spanning mathematics, science, and humanities, measuring the agent's capacity for deep reasoning at the boundary of human expert knowledge. (iii) Self-Evolving Code Agent Benchmark. Existing code benchmarks evaluate one-shot correctness under fixed model capability and therefore cannot measure an agent's self-evolution capability during inference. To directly assess this self-evolution capability, we construct an in-house LeetCode benchmark of 100 recently released problems across diverse algorithmic categories (e.g., arrays, trees, linked lists), with reduced data contamination. The agent solves each problem in multiple languages (Python, C++, Java, Go, Kotlin), and we report acceptance rate, test-case pass rate, runtime efficiency, and human-relative performance metrics.

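Table 1: Scientific and Mathematical Benchmarks.

| Model | Agent | GPQA | AIME24 | AIME25 |
|---|---|---|---|---|
| gpt-4o | Vanilla | 47.98 | 13.34 | 6.67 |
| gpt-4o | Prompt-Evo | 53.81 | 13.34 | 13.34 |
| gpt-4o | Solution-Evo | 53.53 | 16.67 | 13.34 |
| gpt-4o | PS-Joint-Evo | 58.08 | 16.67 | 13.34 |
| gpt-4o | Improvement (%) | 21.05 ↑ | 24.97 ↑ | 100 ↑ |
| gpt-4.1 | Vanilla | 65.15 | 23.34 | 20.00 |
| gpt-4.1 | Prompt-Evo | 68.68 | 33.33 | 23.33 |
| gpt-4.1 | Solution-Evo | 68.68 | 36.67 | 30.00 |
| gpt-4.1 | PS-Joint-Evo | 67.67 | 40.00 | 33.33 |
| gpt-4.1 | Improvement (%) | 3.87 ↑ | 71.38 ↑ | 66.65 ↑ |
| grok-4.1-fast | Vanilla | 83.33 | 96.67 | 90.00 |
| grok-4.1-fast | Prompt-Evo | 83.84 | 96.67 | 93.33 |
| grok-4.1-fast | Solution-Evo | 87.81 | 96.67 | 90.00 |
| grok-4.1-fast | PS-Joint-Evo | 89.34 | 96.67 | 96.67 |
| grok-4.1-fast | Improvement (%) | 7.21 ↑ | 0.00 | 7.41 ↑ |
| claude-sonnet-4.5 | Vanilla | 78.28 | 76.67 | 73.33 |
| claude-sonnet-4.5 | Prompt-Evo | 79.79 | 86.67 | 90.00 |
| claude-sonnet-4.5 | Solution-Evo | 80.30 | 80.00 | 90.00 |
| claude-sonnet-4.5 | PS-Joint-Evo | 81.44 | 86.67 | 90.00 |
| claude-sonnet-4.5 | Improvement (%) | 4.04 ↑ | 13.04 ↑ | 22.73 ↑ |
| gemini-3-flash-preview | Vanilla | 88.38 | 83.33 | 83.33 |
| gemini-3-flash-preview | Prompt-Evo | 88.89 | 93.33 | 86.67 |
| gemini-3-flash-preview | Solution-Evo | 87.88 | 93.33 | 90.00 |
| gemini-3-flash-preview | PS-Joint-Evo | 90.40 | 93.33 | 93.33 |
| gemini-3-flash-preview | Improvement (%) | 2.28 ↑ | 12.00 ↑ | 12.00 ↑ |

Table 2: GAIA Validation and Test Benchmarks.

| Split | Agent | Level 1 | Level 2 | Level 3 | Avg. |
|---|---|---|---|---|---|
| Validation | HF ODR (HuggingFace, 2024) | 67.92 | 53.49 | 34.62 | 55.15 |
| Validation | o3-DR (OpenAI, 2025) | 74.29 | 69.06 | 47.60 | 67.36 |
| Validation | DeSearch (Desearch-ai, 2024) | 90.57 | 72.01 | 38.46 | 72.73 |
| Validation | Co-Sight (Zhang et al., 2025a) | 86.79 | 73.26 | 42.31 | 72.73 |
| Validation | Manus (Shen and Yang, 2025) | 86.50 | 70.10 | 57.69 | 73.90 |
| Validation | AWorld (Yu et al., 2025) | 88.68 | 77.91 | 53.85 | 77.58 |
| Validation | Langfun (Google, 2024) | 88.68 | 80.23 | 57.69 | 79.39 |
| Validation | Skywork (Zhang et al., 2025b) | 92.45 | 83.72 | 57.69 | 82.42 |
| Validation | agent-2030 | 96.23 | 90.70 | 57.69 | 87.27 |
| Validation | Alita (Qiu et al., 2025) | 88.68 | 89.53 | 76.92 | 87.27 |
| Validation | Vanilla | 92.45 | 88.37 | 88.46 | 89.70 |
| Validation | Agent-Evo | 96.23 | 93.02 | 88.46 | 93.33 |
| Validation | Improvement (%) | 4.09 ↑ | 5.26 ↑ | 0.00 | 4.05 ↑ |
| Test | o4-mini-DR (OpenAI, 2025) | 67.59 | 59.10 | 44.28 | 59.30 |
| Test | JoyAgent (Liu et al., 2025) | 77.42 | 67.30 | 46.94 | 67.11 |
| Test | o3-DR (OpenAI, 2025) | 79.42 | 68.97 | 47.48 | 68.70 |
| Test | Langfun (Google, 2024) | 84.95 | 73.58 | 48.98 | 73.09 |
| Test | Alita (Qiu et al., 2025) | 92.47 | 71.70 | 55.10 | 75.42 |
| Test | DeSearch (Desearch-ai, 2024) | 91.40 | 75.47 | 61.22 | 78.07 |
| Test | h2o (H2O.ai, 2025) | 89.25 | 79.87 | 61.22 | 79.73 |
| Test | Su-Zero-Ultra | 93.55 | 77.36 | 65.31 | 80.40 |
| Test | AWorld (Yu et al., 2025) | 95.70 | 81.13 | 57.14 | 81.73 |
| Test | HALO (Hou et al., 2025) | 94.62 | 84.91 | 69.39 | 85.38 |
| Test | ToolOrchestra (Su et al., 2025) | 95.70 | 82.39 | 87.76 | 87.38 |
| Test | openJiuwen (openJiuwen, 2026) | 98.92 | 88.68 | 87.76 | 91.69 |
| Test | Vanilla | 91.40 | 77.36 | 61.22 | 79.07 |
| Test | Agent-Evo | 98.92 | 85.53 | 81.63 | 89.04 |
| Test | Improvement (%) | 8.23 ↑ | 10.56 ↑ | 33.34 ↑ | 12.61 ↑ |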

Figure 2: Performance comparison on the HLE full set benchmark (Zoom AI, 2025).

5.1 Experiments on Scientific and Mathematical Benchmarks

Experiment Setting. We evaluate AGS on GPQA-Diamond, AIME24, and AIME25, focusing on evolving prompts and agent outputs (problem solutions). Since these benchmarks primarily test reasoning capability rather than tool use, we exclude external tools and conduct a controlled comparison across three evolution strategies: Prompt-Evo, Solution-Evo, and Prompt-Solution-Joint-Evo. We evaluate across diverse mainstream language models using the reflection optimizer with up to 3 rounds, after which the final agent output is taken as the solution. Performance is measured by exact-match accuracy, requiring the selected option to match the ground-truth answer for GPQA-Diamond, and numerical output to exactly match the reference integer for AIME24 and AIME25.

Results and Analysis. Table 1 reveals four key observations. (i) Self-evolution yields consistent gains, with greater benefit for weaker models. Weaker models gain substantially: gpt-4.1 improves by 71.4% on AIME24 and 66.7% on AIME25 under PS-Joint-Evo, and claude-sonnet-4.5 gains 13.0% and 22.7% respectively. Stronger models also benefit, albeit more modestly: gemini-3-flash-preview (vanilla 88.4% GPQA-Diamond, 83.3% AIME24/25) improves by 2.3%, 12.0%, and 12.0%, consistent with diminishing headroom at higher baselines. (ii) PS-Joint-Evo consistently outperforms single-strategy evolution. For gpt-4.1 on AIME24, Prompt-Evo reaches 33.3% and Solution-Evo 36.7%, while PS-Joint-Evo reaches 40.0%, confirming that prompt and solution refinement address complementary failure modes. (iii) Math benchmarks benefit more than science QA. AIME24/25 show larger relative gains than GPQA-Diamond across all models: for gpt-4.1, AIME24 improves by 71.4% versus 3.9% on GPQA-Diamond. Long-horizon symbolic reasoning exposes more intermediate failure points amenable to reflection, whereas closed-book science QA relies more on factual recall. (iv) Ceiling effects limit gains on saturated benchmarks. grok-4.1-fast reaches 96.7% on AIME24 under vanilla, leaving negligible headroom and yielding no gain from evolution. On GPQA-Diamond and AIME25, where its baselines are lower (83.3% and 90.0%), it still improves by 7.2% and 7.4%, confirming that self-evolution is most effective when sufficient headroom exists. Overall, PS-Joint-Evo is the preferred strategy when inference budget permits, as it addresses complementary failure modes simultaneously. For cost-constrained deployment, evolution budgets are best allocated to weaker models, harder tasks, or low-confidence samples. In near-saturated settings, adaptive triggering based on confidence or task difficulty is more effective than fixed-budget evolution.
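The percentages quoted above are relative gains, not percentage-point differences. The reported figures appear consistent with the relative change from Vanilla to PS-Joint-Evo; that formula is inferred from the numbers rather than stated in the text, so the helper below should be read as an assumption that happens to reproduce them.

```python
def relative_improvement(vanilla: float, evolved: float) -> float:
    """Relative gain in percent, assuming (evolved - vanilla) / vanilla * 100."""
    return (evolved - vanilla) / vanilla * 100.0


# gpt-4.1 rows from the scientific and mathematical benchmark table.
aime24_gain = round(relative_improvement(23.34, 40.00), 2)  # Vanilla vs PS-Joint-Evo
gpqa_gain = round(relative_improvement(65.15, 67.67), 2)    # Vanilla vs PS-Joint-Evo
```

For gpt-4.1 this gives about 71.38 on AIME24 and 3.87 on GPQA-Diamond, matching the 71.4% and 3.9% cited in observation (iii), even though the absolute gains are 16.66 and 2.52 points.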

5.2 Experiments on General Agent Benchmarks

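Table 3: Model performance on the Self-Evolving Code Agent Benchmark with AGS self-evolution. Column groups, left to right: capability metrics, efficiency metrics, human metrics.

| Language | Model | PR | TLE | MLE | CE | RE | WA | TO | RpE | AR (ms) | AM (MB) | APC | ARB (%) | AMB (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Python3 | claude-4.5-sonnet | 42 | 37 | 2 | 0 | 0 | 10 | 8 | 1 | 880.98 | 45.16 | 702.64 | 61.06 | 22.12 |
| Python3 | claude-4.5-opus | 82 | 9 | 0 | 0 | 0 | 5 | 3 | 1 | 1559.87 | 70.77 | 749.45 | 64.77 | 32.70 |
| Python3 | gemini-3-flash-preview | 79 | 4 | 0 | 0 | 2 | 14 | 1 | 0 | 1376.19 | 56.59 | 750.89 | 73.28 | 36.62 |
| Python3 | + Solution-Evo | 87 | 3 | 0 | 0 | 1 | 9 | 0 | 0 | 1269.39 | 59.08 | 750.98 | 70.29 | 42.15 |
| Python3 | Improvement (%) | 10.1↑ | 25.0↑ | 0 | 0 | 50↑ | 35.7↑ | 100↑ | 0 | 7.8↑ | 4.4↓ | 0.0 | 4.1↓ | 15.1↑ |
| C++ | grok-4.1-fast | 79 | 9 | 0 | 1 | 0 | 5 | 2 | 4 | 428.32 | 223.68 | 748.61 | 58.57 | 46.67 |
| C++ | claude-4.5-sonnet | 41 | 42 | 2 | 0 | 1 | 9 | 2 | 3 | 379.68 | 179.86 | 710.59 | 56.17 | 50.84 |
| C++ | claude-4.5-opus | 85 | 6 | 0 | 0 | 0 | 6 | 1 | 2 | 382.45 | 184.22 | 758.21 | 64.06 | 55.58 |
| C++ | gemini-3-flash-preview | 84 | 2 | 0 | 2 | 1 | 10 | 0 | 1 | 266.04 | 168.93 | 743.31 | 68.02 | 59.24 |
| C++ | + Solution-Evo | 99 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 142.60 | 148.43 | 749.86 | 88.99 | 73.14 |
| C++ | Improvement (%) | 17.9↑ | 100↑ | 0 | 100↑ | 100↑ | 90↑ | 0 | 100↑ | 46.4↑ | 12.1↑ | 0.9↓ | 30.8↑ | 23.5↑ |
| Java | claude-4.5-opus | 87 | 4 | 1 | 0 | 0 | 6 | 1 | 1 | 188.63 | 134.27 | 748.63 | 59.54 | 55.61 |
| Java | gemini-3-flash-preview | 84 | 0 | 0 | 2 | 2 | 9 | 1 | 2 | 125.04 | 126.09 | 752.86 | 71.03 | 59.18 |
| Java | + Solution-Evo | 98 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 96.30 | 120.00 | 751.09 | 88.33 | 72.38 |
| Java | Improvement (%) | 16.7↑ | 0 | 0 | 100↑ | 100↑ | 88.9↑ | 100↑ | 100↑ | 23.0↑ | 4.8↑ | 0.2↑ | 24.4↑ | 22.3↑ |
| Go | grok-4.1-fast | 69 | 3 | 0 | 16 | 0 | 4 | 3 | 5 | 194.90 | 23.26 | 755.43 | 66.83 | 62.44 |
| Go | claude-4.5-sonnet | 44 | 41 | 0 | 0 | 0 | 13 | 0 | 2 | 222.64 | 19.71 | 712.55 | 57.09 | 53.32 |
| Go | claude-4.5-opus | 84 | 5 | 0 | 0 | 0 | 9 | 0 | 2 | 162.50 | 19.95 | 744.45 | 72.91 | 63.00 |
| Go | gemini-3-flash-preview | 82 | 1 | 0 | 9 | 0 | 7 | 0 | 1 | 139.22 | 22.01 | 739.46 | 76.22 | 63.48 |
| Go | + Solution-Evo | 95 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 111.64 | 18.35 | 754.17 | 81.52 | 67.94 |
| Go | Improvement (%) | 15.9↑ | 100↑ | 0 | 100↑ | 0 | 28.6↑ | 0 | 100↑ | 19.8↑ | 16.6↑ | 2.0↓ | 7.0↑ | 7.0↑ |
| Kotlin | claude-4.5-opus | 83 | 4 | 0 | 5 | 0 | 5 | 0 | 3 | 210.47 | 76.60 | 750.98 | 83.18 | 76.53 |
| Kotlin | gemini-3-flash-preview | 75 | 2 | 0 | 8 | 1 | 10 | 2 | 2 | 171.99 | 72.80 | 760.43 | 83.49 | 79.07 |
| Kotlin | + Solution-Evo | 95 | 1 | 0 | 0 | 0 | 4 | 0 | 0 | 122.83 | 77.88 | 749.38 | 83.58 | 67.21 |
| Kotlin | Improvement (%) | 26.7↑ | 50↑ | 0 | 100↑ | 100↑ | 60↑ | 100↑ | 100↑ | 28.6↑ | 7.0↓ | 1.5↓ | 0.1↑ | 15.0↓ |

Rows in which the PR through AR (ms) values are run together; the four values after the run are AM (MB), APC, ARB (%), and AMB (%):

Python3, deepseek-v3.2: 341102381361806.79 | 55.91 | 640.38 | 63.04 | 25.81
Python3, grok-4.1-fast: 73900313031860.90 | 56.02 | 741.60 | 49.92 | 30.15
C++, deepseek-v3.2: 11103006443158.73 | 163.59 | 605.82 | 73.11 | 74.05
Java, deepseek-v3.2: 1100471153572.91 | 143.63 | 481.45 | 57.37 | 32.71
Java, grok-4.1-fast: 7350501214227.45 | 136.80 | 746.23 | 52.98 | 41.97
Java, claude-4.5-sonnet: 41401111501161.49 | 130.54 | 679.41 | 58.04 | 46.22
Go, deepseek-v3.2: 7003900153112.71 | 12.59 | 709.57 | 62.73 | 54.36
Kotlin, deepseek-v3.2: 72000014859.29 | 62.27 | 793.57 | 58.33 | 74.56
Kotlin, grok-4.1-fast: 6220220824307.45 | 75.45 | 759.55 | 78.12 | 72.83
Kotlin, claude-4.5-sonnet: 42361811011192.62 | 78.49 | 757.64 | 81.59 | 77.79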

Figure 3: Performance comparison of evolving and vanilla AGS within inference.

Experiment Setting. For both GAIA and HLE, we focus on evolving agents (including agent prompts, tool implementations, and agent code), as these benchmarks primarily demand tool-augmented multi-step reasoning rather than pure deductive inference. Our system deploys a top-level planning agent ($m=30$) coordinating four specialized sub-agents (deep researcher, browser-use, deep analyzer, tool calling agent and vibe coding agent, each with $m=20$ using gemini-3.1-pro-preview), where $m$ denotes the maximum reasoning steps. Agent self-evolution is driven by the vibe coding agent, which iteratively refines agent prompts and code through the SEPL reflection optimizer, with evolved agents registered as versioned RSPL resources and reused across subsequent tasks. For GAIA, we report Pass@1 accuracy at each difficulty tier (Level 1–3) and the overall average on both validation and test splits. For HLE, we follow the official evaluation protocol using o3-mini as the judge.

Results and Analysis. Table 2 and Figure 2 reveal three key observations. (i) AGS achieves highly competitive performance among systems with comparable backbone models. On GAIA, AGS attains strong results on both Test and Validation, reaching 89.04% on Test and the best reported Validation score of 93.33% among all listed systems. Although openJiuwen achieves a higher GAIA Test score, it relies on substantially stronger backbone models, which are orthogonal to the evolution protocol itself. On HLE, AGS ranks second overall, outperforming all systems except Claude Mythos Preview, which similarly benefits from a more capable frontier backbone. These comparisons suggest that the remaining gaps are primarily associated with backbone strength rather than the proposed self-evolution protocol. (ii) Agent evolution yields the largest gains on hard tasks. On GAIA Test, Agent-Evo improves the vanilla baseline by 12.6% on average, with gains increasing with task difficulty. Improvements range from 8.2% on Level 1 to 33.3% on Level 3, indicating that harder tasks expose more correctable failure modes for iterative refinement, while easier tasks leave less room for improvement. This trend mirrors the headroom pattern observed in the math benchmarks. (iii) Self-evolution generalizes to open-ended agent tasks. GAIA requires coherent state management across multi-domain transitions, such as from browser retrieval to file analysis, while HLE demands expert-level multi-step reasoning. By registering prompts, agent code, and tools as versioned RSPL resources, AGS preserves task-critical state and reuses evolved capabilities across subsequent subtasks. Overall, these results show that AGP’s self-evolution protocol improves performance on difficult agent tasks, remains competitive among systems with comparable backbone models, and extends from closed-form reasoning to complex, tool-intensive agent scenarios.

5.3 Experiments on Self-Evolving Code Agent Benchmark

Experiment Setting. Existing code generation benchmarks evaluate one-shot generation and do not measure an agent’s ability to iteratively improve solutions during inference. To address this, we construct a benchmark based on the LeetCode online judge using 100 recently released problems to mitigate contamination (details in Appendix D). We compare a vanilla baseline against AGS with Solution-Evo enabled across five languages (Python3, C++, Java, Go, Kotlin), using gemini-3-flash-preview as the backbone and a reflection budget of 3 rounds. We report three groups of metrics covering functional correctness, runtime and memory efficiency, and human-referenced competitiveness, with full definitions provided in Table 6 in the appendix.

Results and Analysis. Table 3 and Figure 3 reveal three key findings for self-evolving code agents under execution-guided evaluation and human-submission comparison. (i) Self-evolution consistently improves functional correctness across languages. Solution-Evo increases pass rates by 10.1–26.7% across the five languages, with the largest gain in Kotlin and high solved counts in compiled languages, including 99 problems in C++ and 98 in Java. Execution-blocking errors, including compile, runtime, and answer errors, are reduced to near zero, suggesting that inference-time feedback effectively repairs both format- and logic-level failures. (ii) Execution-guided evolution improves efficiency beyond correctness. Average runtime decreases in all languages, with a 7.8% reduction in Python3 and larger reductions of 19.8–46.4% in compiled languages. These gains align with fewer time-limit-exceeded errors, indicating that the agent not only fixes invalid outputs but also discovers more efficient algorithms. Memory usage decreases in most compiled languages by 4.8–16.6%, while increasing modestly in Python3 and Kotlin, likely due to auxiliary data structures introduced for correctness or speed. (iii) Evolved solutions become more competitive relative to human submissions. Runtime beats improve in compiled languages by up to 30.8%, while Python3 shows a modest decrease, consistent with a memory-speed trade-off. Memory beats improve in four of the five languages by 7.0–23.5%, but decrease in Kotlin, suggesting that long-tail languages may favor correctness over memory efficiency. Overall, Solution-Evo provides a strong default strategy for algorithmic coding tasks by combining execution feedback, iterative self-repair, and measurable competitiveness against human submissions.
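
The Improvement (%) figures quoted above can be reproduced with a direction-aware relative change. The sketch below uses the Python3 values for gemini-3-flash-preview and + Solution-Evo from Table 3. The function name, the set of lower-is-better metrics, and the convention of returning 0 for a zero baseline are assumptions made for illustration.

```python
# Metrics where a smaller value is better (error counts, runtime, memory).
LOWER_IS_BETTER = {"TLE", "WA", "AR", "AM"}


def improvement(metric: str, vanilla: float, evolved: float) -> float:
    """Signed relative change in percent; positive means the evolved run is better."""
    if vanilla == 0:
        return 0.0  # assumed convention: nothing to improve on a zero baseline
    change = (evolved - vanilla) / vanilla * 100.0
    return -change if metric in LOWER_IS_BETTER else change


# Python3: pass rate, wrong answers, average runtime (ms), average memory (MB).
vanilla = {"PR": 79, "WA": 14, "AR": 1376.19, "AM": 56.59}
evolved = {"PR": 87, "WA": 9, "AR": 1269.39, "AM": 59.08}

row = {m: round(improvement(m, vanilla[m], evolved[m]), 1) for m in vanilla}
# row == {"PR": 10.1, "WA": 35.7, "AR": 7.8, "AM": -4.4}
```

The negative entry for AM corresponds to the modest memory increase in Python3 noted in finding (ii).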

6 Limitations and Impact Statement

First, self-evolution introduces additional inference rounds that increase latency and token consumption, and systematic analysis of the efficiency-effectiveness trade-off under strict budget constraints remains future work. Second, while AGP provides a unified protocol interface for all RSPL resource types, our experiments focus on Prompt-Evo, Solution-Evo, and Agent-Evo as primary comparison targets. Evolution of Environment and Memory resources has been implemented but not yet evaluated as independent ablation targets, and we leave this to future work. On the impact side, self-evolving agent systems may exhibit unintended behavioral drift if evolution objectives are misspecified or reward signals are noisy. The version control and rollback mechanisms in SEPL provide basic safeguards, but rigorous alignment verification remains an open challenge for broader deployment.

7 Conclusion

We presented AGP, a two-layer self-evolution protocol that decouples the evolutionary substrate from optimization logic, standardizing how agent resources are registered, versioned, and evolved. Instantiated as AGS, the protocol drives consistent improvements across scientific reasoning, open-ended agent tasks, and algorithmic code generation, demonstrating that a single evolution mechanism generalizes across task types and resource categories. We believe AGP offers a reusable foundation for future work on multi-agent collaboration, safe online adaptation, and human-aligned self-improvement in dynamic real-world environments.

References

  • Anthropic (2025a) Equipping agents for the real world with agent skills. Note: https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills Accessed October 2025. Cited by: §C.1, Appendix C, §E.2.2, §E.2, §E.3.6, §E.3.7, §2.1, §3.1.
  • Anthropic (2025b) Introduction to agent skills. Note: https://anthropic.skilljar.com/introduction-to-agent-skills Cited by: §E.2.2, §E.2, §E.3.6, §E.3.7, §1, §2.1, §3.1.2, §3.1.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901. Cited by: §1.
  • M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. Cited by: §2.1.
  • W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C. Chan, H. Yu, Y. Lu, Y. Hung, C. Qian, et al. (2023) Agentverse: facilitating multi-agent collaboration and exploring emergent behaviors. In The Twelfth International Conference on Learning Representations. Cited by: §3.1.
  • Desearch-ai (2024) desearch.py: Official Async Python SDK for the Desearch API. Note: https://github.com/Desearch-ai/desearch.py Cited by: Table 2.
  • H. Gao, J. Geng, W. Hua, M. Hu, X. Juan, H. Liu, S. Liu, J. Qiu, X. Qi, Y. Wu, et al. (2025) A survey of self-evolving agents: what, when, how, and where to evolve on the path to artificial super intelligence. arXiv preprint arXiv:2507.21046. Cited by: §3.
  • Google (2024) Langfun: Object-Oriented Programming for Language Models. Note: https://github.com/google/langfun Cited by: Table 2.
  • Google (2025) A2A: a new era of agent interoperability. Note: https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/ Google Developers Blog. Accessed: 2026-04-20. Cited by: §C.1, Appendix C, §1.
  • H2O.ai (2025) Enterprise h2oGPTe: Agentic AI for Generative and Predictive Intelligence. Note: https://h2o.ai/platform/enterprise-h2ogpte/ Cited by: Table 2.
  • S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, et al. (2023) MetaGPT: meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations. Cited by: §3.1, §4.1.
  • Z. Hou, J. Tang, and Y. Wang (2025) Halo: hierarchical autonomous logic-oriented orchestration for multi-agent llm systems. arXiv preprint arXiv:2505.13516. Cited by: Table 2.
  • J. Hu (2025a) Reinforce++: a simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262. Cited by: §2.2.
  • J. Hu (2025b) Reinforce++: a simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262. Cited by: §E.3, §1, §3.2, §4.1.
  • HuggingFace (2024) Open-source DeepResearch - Freeing Our Search Agents. Note: https://huggingface.co/blog/open-deep-research Cited by: Table 2.
  • [16] LeetCode. LeetCode online judge. Note: https://leetcode.com Accessed 2025. Cited by: §1.
  • J. Liu, S. Xu, S. Liu, Y. Li, W. Liu, M. Liu, X. Zhou, H. Wang, S. Jia, S. Tian, et al. (2025) JoyAgent-jdgenie: technical report on the gaia. arXiv preprint arXiv:2510.00510. Cited by: Table 2.
  • A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. (2023) Self-refine: iterative refinement with self-feedback. Advances in Neural Information Processing Systems 36, pp. 46534–46594. Cited by: §2.2.
  • G. Mialon, C. Fourrier, T. Wolf, Y. LeCun, and T. Scialom (2023) Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations. Cited by: §1, §2.1, §5.
  • NousResearch (2025) Hermes Agent: A Self-Improving Open-Source Agent Framework. Note: https://github.com/NousResearch/hermes-agent Cited by: §2.2.
  • OpenAI (2025) Introducing Deep Research. Note: https://openai.com/index/introducing-deep-research/ Cited by: Table 2.
  • openJiuwen (2026) openJiuwen Agent Platform. Note: https://openjiuwen.com/en/ Cited by: Table 2.
  • L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744. Cited by: §4.1.
  • L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. (2025) Humanity’s last exam. arXiv preprint arXiv:2501.14249. Cited by: §1, §5.
  • R. Pryzant, D. Iter, J. Li, Y. Lee, C. Zhu, and M. Zeng (2023) Automatic prompt optimization with “gradient descent” and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7957–7968. Cited by: §2.2.
  • Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, et al. (2023) Toolllm: facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789. Cited by: §2.1.
  • J. Qiu, X. Qi, T. Zhang, X. Juan, J. Guo, Y. Lu, Y. Wang, Z. Yao, Q. Ren, X. Jiang, et al. (2025) Alita: generalist agent enabling scalable agentic reasoning with minimal predefinition and maximal self-evolution. arXiv preprint arXiv:2505.20286. Cited by: Table 2.
  • D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2024) GPQA: a graduate-level google-proof q&a benchmark. In First Conference on Language Modeling. Cited by: §1, §2.1.
  • T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023) Toolformer: language models can teach themselves to use tools. Advances in Neural Information Processing Systems 36, pp. 68539–68551. Cited by: §2.1.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4.1.
  • Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: §E.3, §1, §2.2, §3.2, §4.1.
  • M. Shen and Q. Yang (2025) From Mind to Machine: The Rise of Manus AI as a Fully Autonomous Digital Agent. External Links: 2505.02024. Cited by: Table 2.
  • N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36, pp. 8634–8652. Cited by: §2.2.
  • H. Su, S. Diao, X. Lu, M. Liu, J. Xu, X. Dong, Y. Fu, P. Belcak, H. Ye, H. Yin, et al. (2025) Toolorchestra: elevating intelligence via efficient model and tool orchestration. arXiv preprint arXiv:2511.21689. Cited by: Table 2.
  • Y. Wang, S. Liu, J. Fang, and Z. Meng (2025) Evoagentx: an automated framework for evolving agentic workflows. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 643–655. Cited by: §2.2.
  • J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, pp. 24824–24837. Cited by: §1, §2.1.
  • Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, et al. (2024) Autogen: enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling. Cited by: §3.1, §4.1.
  • S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao (2022) React: synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations. Cited by: §1, §2.1.
  • C. Yu, S. Lu, C. Zhuang, D. Wang, Q. Wu, Z. Li, R. Gan, C. Wang, S. Hou, G. Huang, et al. (2025) Aworld: orchestrating the training recipe for agentic ai. arXiv preprint arXiv:2508.20404. Cited by: Table 2.
  • M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, P. Lu, Z. Huang, C. Guestrin, and J. Zou (2025) Optimizing generative ai by backpropagating language model feedback. Nature 639 (8055), pp. 609–616. Cited by: §E.3, §1, §2.2, §3.2, §4.1.
  • E. Zelikman, Y. Wu, J. Mu, and N. Goodman (2022) Star: bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems 35, pp. 15476–15488. Cited by: §2.2.
  • H. Zhang, J. Lu, S. Jiang, C. Zhu, L. Xie, C. Zhong, H. Chen, Y. Zhu, Y. Du, Y. Gao, L. Huang, B. Wang, F. Tan, and P. Zou (2025a) Co-sight: enhancing llm-based agents via conflict-aware meta-verification and trustworthy reasoning with structured facts. arXiv preprint arXiv:2510.21557. Cited by: Table 2.
  • W. Zhang, C. Cui, Y. Zhao, Y. Liu, and B. An (2025b) AgentOrchestra: a hierarchical multi-agent framework for general-purpose task solving. External Links: 2506.12508. Cited by: Table 2.
  • Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba (2022) Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations. Cited by: §2.2.
  • D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving (2019) Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593. Cited by: §4.1.
  • Zoom AI (2025) HLE Leaderboard. Note: https://huggingface.co/spaces/zoom-ai/hle-leaderboard Cited by: Figure 2.

Appendix A Notation

We summarize the main mathematical symbols and their meanings in Table 4. Symbols are organized into six functional categories (highlighted in grey) following the two-layer structure of AGP. The first three cover the RSPL substrate: indexing conventions, the resource entity tuple, and protocol-registered resources including the context manager $\mathcal{M}_{\tau}$ and server interface $\mathcal{A}_{\tau}$. The remaining three cover the SEPL layer: evolvable variables and the trainable subspace $\Theta$, the auxiliary spaces ($\mathcal{P}$, $\mathcal{Z}$, $\mathcal{H}$, $\mathcal{D}$, $\mathcal{G}$, $\mathcal{S}$) and five canonical reflection operators $\{\rho,\sigma,\iota,\varepsilon,\kappa\}$, and iteration-level variables of the evolutionary loop.

Table 4: Notation used in the paper. Grey rows indicate categories (shown here in bold).

| Symbol | Description |
| --- | --- |
| **Indexing and Sets** | |
| $\mathcal{T}$ | Set of RSPL entity types, $\{\textsc{Prompt},\textsc{Agent},\textsc{Tool},\textsc{Env},\textsc{Mem}\}$. |
| $\tau$ | Entity type index, $\tau\in\mathcal{T}$. |
| $\mathcal{I}_{\tau}$ | Index set of resource instances of type $\tau$. |
| $i$ | Instance index, $i\in\mathcal{I}_{\tau}$. |
| $\mathbb{V}$ | Space of version strings. |
| $\wp(\cdot)$ | Power set operator. |
| **RSPL Resource Entity (Def. E.1)** | |
| $e_{\tau,i}$ | Resource entity tuple $(n_{\tau,i},d_{\tau,i},\phi_{\tau,i},g_{\tau,i},m_{\tau,i})$. |
| $n_{\tau,i}$ | Unique resource name. |
| $d_{\tau,i}$ | Short description. |
| $\phi_{\tau,i}:\mathcal{X}_{\tau}\rightarrow\mathcal{Y}_{\tau}$ | Input-to-output mapping of the resource. |
| $g_{\tau,i}$ | Evolvability marker, $g_{\tau,i}\in\{0,1\}$, indicating whether the resource is evolvable. |
| $m_{\tau,i}$ | Auxiliary metadata dictionary. |
| $\mathcal{E}_{\tau}$ | Set of resource entities of type $\tau$. |
| **RSPL Registration Record (Def. E.2)** | |
| $c_{\tau,i}$ | Registration record $(e_{\tau,i},v_{\tau,i},\eta_{\tau,i},\theta_{\tau,i},\mathcal{F}_{\tau,i})$. |
| $\mathcal{C}_{\tau}$ | Set of registration records for type $\tau$. |
| $v_{\tau,i}$ | Version string of the resource instance. |
| $\eta_{\tau,i}$ | Implementation descriptor (e.g., import path, class, or source). |
| $\theta_{\tau,i}$ | Instantiation parameters (e.g., constructor arguments). |
| $\mathcal{F}_{\tau,i}$ | Exported representations for LLM interaction (schemas/text/structured args). |
| **Protocol-registered Resource (Def. E.3)** | |
| $\mathcal{R}_{\tau}$ | Type-specific registry of protocol-registered resources. |
| $\mathcal{R}$ | Global registry, $\bigcup_{\tau}\mathcal{R}_{\tau}$. |
| $\mathcal{M}_{\tau}$ | Context manager for type $\tau$ (maintains registry and version lineage). |
| $\mathcal{A}_{\tau}$ | Server-exposed interface for type $\tau$ (delegates to $\mathcal{M}_{\tau}$). |
| $r_{\tau}$ | Type-level registered resource triple $(\mathcal{C}_{\tau},\mathcal{M}_{\tau},\mathcal{A}_{\tau})$. |
| **SEPL Variables, Spaces, and Operators** | |
| $\mathcal{V}_{\text{evo}}$ | Universal set of evolvable variables (all managed entities plus execution artifacts). |
| $v$ | A variable in $\mathcal{V}_{\text{evo}}$. |
| $g_{v}$ | Learnability constraint for variable $v$ (binary). |
| $\Theta$ | Trainable subspace, $\{v\in\mathcal{V}_{\text{evo}}\mid g_{v}=1\}$. |
| $y$ | Execution artifacts (e.g., outputs and reasoning traces). |
| $\mathcal{P}$ | Message space carrying auxiliary signals (traces, hypotheses, gradients, rewards) between operators. |
| $\mathcal{P}_{\text{in}},\mathcal{P}_{\text{out}}$ | Incoming and outgoing message types of a SEPL operator. |
| $f$ | A SEPL operator, $f:\mathcal{V}_{\text{evo}}\times\mathcal{P}_{\text{in}}\rightarrow\mathcal{V}^{\prime}_{\text{evo}}\times\mathcal{P}_{\text{out}}$. |
| $\mathcal{Z}$ | Trace space (execution observations). |
| $\mathcal{H}$ | Hypothesis space (causal failure attributions). |
| $\mathcal{D}$ | Modification space (proposed resource changes). |
| $\mathcal{G}$ | Objective specification (task goals and safety invariants). |
| $\mathcal{S}$ | Evaluation space (performance metrics and safety status). |
| $\rho,\sigma,\iota,\varepsilon,\kappa$ | Reflect, Select, Improve, Evaluate, and Commit operators (reflection instantiation). |
| **Optimization Loop (Alg. 1)** | |
| $A$ | Agentic system. |
| $T$ | Optimization budget (number of iterations). |
| $t$ | Iteration index. |
| $\mathcal{V}_{\text{evo}}^{(t)}$ | Evolvable state at iteration $t$. |
| $\mathcal{P}^{(t)}$ | Message passed between operators at iteration $t$. |
| $\mathcal{Z}^{(t)}$ | Observational trace at iteration $t$. |
| $\mathcal{H}^{(t)}$ | Hypotheses at iteration $t$. |
| $\mathcal{D}^{(t)}$ | Proposed modifications at iteration $t$. |
| $\widetilde{\mathcal{V}}_{\text{evo}}^{(t+1)}$ | Candidate state after applying modifications. |
| $\mathcal{S}^{(t+1)}$ | Evaluation result for the candidate state. |
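
To make the entity tuple and registration record concrete, the following sketch maps them onto Python dataclasses. It is an illustration of the notation only, not the AGP implementation: all class, field, and resource names are hypothetical, and the trainable subspace is restricted here to registered resources, leaving out execution artifacts.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class ResourceEntity:
    """Entity tuple e = (n, d, phi, g, m)."""
    name: str                       # n: unique resource name
    description: str                # d: short description
    mapping: Callable[[Any], Any]   # phi: input-to-output mapping
    evolvable: bool                 # g: evolvability marker in {0, 1}
    metadata: dict = field(default_factory=dict)  # m: auxiliary metadata


@dataclass(frozen=True)
class RegistrationRecord:
    """Registration record c = (e, v, eta, theta, F)."""
    entity: ResourceEntity
    version: str                    # v: version string
    implementation: str             # eta: e.g. import path or source
    params: dict = field(default_factory=dict)    # theta: constructor arguments
    exports: dict = field(default_factory=dict)   # F: representations for the LLM


def trainable_subspace(records: list) -> list:
    """Names of registered resources with g = 1 (a subset of Theta)."""
    return [r.entity.name for r in records if r.entity.evolvable]


prompt = ResourceEntity("solver_prompt", "System prompt for the solver", str.strip, True)
tool = ResourceEntity("calculator", "Evaluates arithmetic", lambda x: x, False)
records = [
    RegistrationRecord(prompt, "1.0.0", "prompts/solver.txt"),
    RegistrationRecord(tool, "0.3.1", "tools.calculator:Calculator"),
]
```

Here only `solver_prompt` is marked evolvable, so an optimizer that respects the marker would leave `calculator` untouched.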

Appendix B Code, Prompts, and Resources

All source code, agent prompts, and optimizer prompts for AGP are organized as follows.

Agent and Optimizer Prompts.

All system prompts and task-specific prompts used by the agents, as well as the prompts used by each optimizer instantiation (Reflection Optimizer, TextGrad, Reinforce++, and GRPO), are provided in the autogenesis/ directory of the supplementary material. The directory is structured by component: each subdirectory corresponds to a distinct agent role or optimizer module and contains the associated prompt templates.

Self-Evolving Code Agent Benchmark Data.

The benchmark problems, test cases, and reference solutions for the Self-Evolving Code Agent Benchmark are provided in the data/ directory of the supplementary material. The dataset covers all collected LeetCode-derived problems across five programming languages (Python, C++, Java, JavaScript, and Go). The evaluation scripts are located in autogenesis/src/benchmark/.

Appendix C Comparison with Other Protocols

Table 5 provides a structured protocol-level comparison between AGP, Google A2A [9], and Anthropic MCP [1]. While A2A and MCP have standardized inter-agent communication and model-to-tool invocation respectively, both operate solely at the level of message passing and invocation, leaving internal resource states opaque and providing no primitives for lifecycle management, version lineage, or controlled state mutation. These are precisely the three properties (Decoupling, Safety & Auditability, and Formalism) that AGP is designed to provide. The comparison is organized into five dimensions (grey rows), with blue-highlighted entries marking capabilities that are prerequisite for closed-loop self-evolution but absent from communication- or invocation-centric protocols.

Table 5: Protocol-level comparison: Autogenesis Protocol (AGP) vs. Google A2A vs. Anthropic MCP across key dimensions for agentic systems and self-evolution. Symbols: ✓ = Supported, △ = Partial, × = Not supported. Highlighted rows (blue background) emphasize evolution-enabling capabilities.

| Dimension | AGP | A2A | MCP |
| --- | --- | --- | --- |
| **Basic Information** | | | |
| Proposer | Our work | Google | Anthropic |
| Protocol Focus | Self-evolution Agentic System | Multi-agent System Collaboration | Tool |
| Entity Scope | Prompt/Agent/Tool/Env/Memory | Agent/Tool | Tool |
| **Agent and System Capabilities** | | | |
| Agent First-Class | ✓ | ✓ | × |
| Multi-Agent | ✓ | ✓ | × |
| Tracer | ✓ | △ | × |
| Memory as Resource | ✓ | × | × |
| **Evolvable Resource Management** | | | |
| Lifecycle Ops | ✓ | △ | × |
| Versioning and Rollback | ✓ | × | × |
| Registry and Retrieval | ✓ | △ | △ |
| Contract Generation | ✓ | △ | × |
| **Self-Evolution Mechanism** | | | |
| Closed-Loop Evolution | ✓ | × | × |
| Operatorized Updates | ✓ | × | × |
| Auditability | ✓ | △ | △ |
| **General and Ecosystem** | | | |
| Model-Agnostic | ✓ | ✓ | ✓ |
| Scalability | $O(\log n)$ | $O(n^{2})$ | $O(n)$ |
| Open Ecosystem | ✓ | △ | △ |

C.1 Basic Information

Proposer. Google’s A2A [9] is introduced as a protocol for multi-agent communication, enabling agents to collaborate through standardized interaction primitives. Anthropic’s MCP [1] standardizes model-to-tool invocation interfaces. In contrast, AGP is proposed in this work as a self-evolution protocol for composable, auditable, and safely updateable agentic systems.

Protocol Focus. AGP focuses on closed-loop improvement of agentic systems by organizing resource updates through typed protocol operators and versioned state transitions. A2A primarily addresses inter-agent communication and task delegation. MCP primarily addresses standardized model-to-tool invocation.

Entity Scope. AGP governs heterogeneous entities, including prompts, agents, tools, environments, and memory, as protocol-registered resources with explicit state and version lineage. This design supports component-level evolution, including prompt refinement and tool code updates. A2A treats agents and tools as interaction endpoints without unified lifecycle management. MCP exposes tools as callable interfaces but does not model them as evolvable components with version lineage.

C.2 Agent and System Capabilities

Agent First-Class. AGP models agents as managed protocol components with explicit schemas, metadata, and lifecycle hooks. This enables registration, discovery, orchestration, and controlled updates. A2A is agent-centric but treats agents primarily as service endpoints without unified lifecycle management or version lineage. MCP does not define agents as protocol components and instead focuses on model-to-tool connectivity.

Multi-Agent. AGP supports multi-agent configurations as part of its system substrate, enabling coordinated execution with traceability and evolution-ready state. A2A directly supports agent-to-agent collaboration. MCP does not treat multi-agent orchestration as a protocol-level concern.

Execution Tracing. AGP provides protocol-level trace capture over inputs, outputs, intermediate decisions, and tool calls. These traces provide the learning signals required for auditable evolution. A2A and MCP leave tracing to application-level instrumentation, which can lead to inconsistent observability across deployments.

Memory as Resource. AGP models memory as a first-class protocol resource with explicit read and write interfaces, state, and version lineage. This enables persistent cross-task improvement and reproducible evolution. A2A and MCP do not prescribe a memory management protocol and instead delegate persistence to external systems.

C.3 Evolvable Resource Management

Lifecycle Ops. AGP provides standardized lifecycle operators for initialization, registration, construction, and decommissioning. These operators ensure that updates are applied to well-defined and protocol-governed targets. A2A offers partial lifecycle support for agents. MCP does not define lifecycle management across heterogeneous component types.

Versioning and Rollback. Version lineage and rollback form the safety foundation of closed-loop evolution. Each update produces an immutable snapshot that supports comparison, auditing, and restoration after regressions. AGP integrates versioning as a first-class protocol capability. A2A and MCP do not natively support version lineage over protocol-managed components, which limits systematic evolution.
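
A minimal sketch of immutable snapshots with rollback is given below. The class and method names are hypothetical, and the choice to record a rollback as a new snapshot (so that history is never erased) is an assumption about one reasonable design, not a description of the AGP codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Immutable record of one version of a resource."""
    version: int
    content: str
    note: str


class VersionedResource:
    """Append-only lineage: commits add snapshots, rollback re-commits an old one."""

    def __init__(self, content: str):
        self._lineage = [Snapshot(1, content, "initial")]

    @property
    def current(self) -> Snapshot:
        return self._lineage[-1]

    def commit(self, content: str, note: str) -> Snapshot:
        snap = Snapshot(self.current.version + 1, content, note)
        self._lineage.append(snap)
        return snap

    def rollback(self, version: int) -> Snapshot:
        """Restore an earlier version without deleting later snapshots."""
        old = next(s for s in self._lineage if s.version == version)
        return self.commit(old.content, f"rollback to v{version}")

    def versions(self) -> list:
        return [s.version for s in self._lineage]


prompt = VersionedResource("Solve the problem.")
prompt.commit("Solve the problem step by step.", "add reasoning hint")
prompt.rollback(1)  # the regression is undone, yet v2 stays in the lineage
```

After the rollback the lineage holds three snapshots, which keeps the regressed version available for later comparison and auditing.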

Registry and Retrieval. AGP maintains a unified registry of protocol-registered resources and supports semantic retrieval to reduce duplication and improve composability across tasks. A2A and MCP provide partial discovery mechanisms, but they do not define a unified management plane over heterogeneous component types.

Contract Generation. AGP supports automated generation of consolidated capability specifications that enumerate tool actions, arguments, preconditions, and usage constraints. This provides a principled form of context engineering that reduces prompt bloat and improves orchestration reliability. A2A and MCP rely on static descriptions or application-layer documentation without protocol-level contract aggregation.
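
As a rough illustration of contract aggregation, the sketch below derives a consolidated specification from plain Python functions using the standard `inspect` module. The two tools and the `build_contract` helper are hypothetical, and the sketch covers only actions, arguments, and descriptions; preconditions and usage constraints would need additional metadata.

```python
import inspect


def search_web(query: str, max_results: int = 5) -> list:
    """Search the web and return result snippets."""
    return []


def read_file(path: str) -> str:
    """Read a local file and return its text."""
    return ""


def build_contract(tools: list) -> str:
    """Aggregate tool names, typed arguments, and descriptions into one text block."""
    lines = []
    for tool in tools:
        params = inspect.signature(tool).parameters
        args = ", ".join(
            f"{name}: {p.annotation.__name__}"
            + ("" if p.default is inspect.Parameter.empty else f" = {p.default!r}")
            for name, p in params.items()
        )
        lines.append(f"{tool.__name__}({args}): {inspect.getdoc(tool)}")
    return "\n".join(lines)


contract = build_contract([search_web, read_file])
```

The resulting text has one line per tool, so a planner can be given a single compact specification instead of one prompt fragment per tool.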

C.4 Self-Evolution Mechanism

Closed-Loop Evolution. AGP is built around an iterative improvement loop consisting of execution, reflection, proposal generation, evaluation, and commitment. This loop enables sustained and evidence-grounded refinement rather than one-off adaptation. A2A and MCP do not provide native self-evolution primitives.

Operatorized Updates. AGP expresses state mutations as typed and composable SEPL operators with well-defined input and output contracts. This enables controlled and repeatable evolution. A2A and MCP do not define a composable operator interface for resource modification, leaving updates to application-specific logic.
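
The loop of execution, reflection, proposal, evaluation, and commitment can be sketched with one small function per operator. Everything below is a toy under stated assumptions: the objective (a prompt should contain two required phrases), the operator signatures, and the commit-only-on-gain rule are invented for illustration and are not taken from the AGP specification.

```python
from typing import Optional

REQUIRED = ["step by step", "final answer"]  # toy objective (assumption)


def execute(prompt: str) -> dict:
    """Run the system and return a trace: here, the phrases still missing."""
    return {"missing": [k for k in REQUIRED if k not in prompt]}


def reflect(trace: dict) -> list:
    """Reflect: turn the trace into failure hypotheses."""
    return [f"prompt lacks '{k}'" for k in trace["missing"]]


def select(hypotheses: list) -> Optional[str]:
    """Select: pick one hypothesis to act on, or None if nothing is wrong."""
    return hypotheses[0] if hypotheses else None


def improve(prompt: str, hypothesis: str) -> str:
    """Improve: propose a modified prompt that addresses the hypothesis."""
    phrase = hypothesis.split("'")[1]
    return f"{prompt} Include: {phrase}."


def evaluate(prompt: str) -> int:
    """Evaluate: score a candidate against the objective."""
    return sum(k in prompt for k in REQUIRED)


def evolve(prompt: str, budget: int = 3) -> tuple:
    """Closed loop with a commit step that accepts only strict improvements."""
    history = [prompt]
    for _ in range(budget):
        hypothesis = select(reflect(execute(prompt)))
        if hypothesis is None:
            break  # nothing left to fix
        candidate = improve(prompt, hypothesis)
        if evaluate(candidate) > evaluate(prompt):  # commit
            prompt = candidate
            history.append(prompt)
    return prompt, history


final_prompt, history = evolve("Solve the task.")
```

Keeping each operator as a separate function with a narrow input and output is what makes the update path repeatable: any single step can be swapped, logged, or tested in isolation.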

Auditability. AGP enforces auditability at the protocol level by recording each state transition, the execution evidence that motivated it, and the evaluation outcome that justified it. This audit trail is supported by version lineage and rollback. A2A and MCP provide only partial audit trails through external instrumentation and do not offer protocol-level guarantees.

C.5 General and Ecosystem

Model-Agnostic. This dimension assesses whether a protocol can operate across different LLM backends and providers. AGP is model-agnostic by design through a unified model interface layer. A2A and MCP are also broadly model-agnostic because they define interaction standards rather than binding the protocol to a specific model.

Scalability. Scalability characterizes how coordination and discovery behave as the number of components increases. AGP supports scalable management by treating heterogeneous components as registry-governed resources with retrieval mechanisms, enabling efficient lookup and controlled orchestration. A2A may incur increasing coordination overhead as interactions become denser in large multi-agent settings. MCP standardizes tool interfaces but may still require application-level orchestration for large tool or resource sets.

Open Ecosystem. Open ecosystem support refers to whether a protocol can enable reusable and interoperable components. AGP provides a protocol stack for managing, evolving, and auditing agentic components, which supports component sharing and safe integration. A2A and MCP provide partial ecosystem support through interoperability and tool interface standardization, but they typically require additional layers for evolution-ready resource management.

Appendix DDetails of the Self-Evolving Code Agent Benchmark

D.1Benchmark Design Rationale

Our benchmark is designed to evaluate self-evolving code agents under execution-grounded and human-referenced conditions. Unlike conventional code generation benchmarks that primarily assess final correctness, self-evolving agents can improve within a single inference episode by producing an initial solution, observing execution feedback, reflecting on failure modes, and revising the solution accordingly. This adaptive process requires a benchmark that measures not only whether the final submission is accepted, but also how performance evolves throughout refinement. Accordingly, our benchmark is motivated by three objectives: (i) evaluating inference-time self-evolution on executable code, (ii) calibrating agent performance against human submission distributions, and (iii) assessing cross-language robustness under long-tail language usage.

The first objective is to make self-evolution directly measurable during inference. In algorithmic coding tasks, execution feedback provides concrete and fine-grained signals, including compilation status, runtime errors, wrong answers, time-limit violations, memory-limit violations, and execution statistics for accepted submissions. These signals allow the agent to identify whether a failure stems from syntax errors, interface mismatches, corner-case logic, algorithmic inefficiency, or excessive resource usage. A benchmark for self-evolving agents should therefore expose such feedback at each refinement round and record the resulting improvement trajectory. This design distinguishes agents that solve problems through stable and efficient refinement from those that achieve correctness only through costly or unstable trial-and-error behavior.

The second objective is to evaluate coding performance relative to human submissions. Absolute pass rates are informative but insufficient for assessing practical coding competence, since they do not indicate whether an accepted solution is efficient compared with human-written solutions. We therefore build on the LeetCode online judge, which reports runtime and memory usage for accepted submissions, together with percentile-based runtime beats and memory beats statistics computed from human submission distributions. These human-referenced metrics provide an interpretable basis for assessing whether self-evolution improves not only correctness, but also competitiveness relative to human programmers.

The third objective is to evaluate robustness across programming languages, including long-tail languages. Many coding benchmarks are dominated by Python or other high-resource languages, which can obscure language-specific failures related to syntax, libraries, typing discipline, compilation, and runtime behavior. LeetCode provides standardized starter code across a broad set of languages, enabling the same problem to be evaluated under comparable interfaces in Python3, C++, Java, Go, and Kotlin. This design supports systematic analysis of whether self-evolution generalizes across languages and whether feedback-driven refinement remains effective across both high-resource and lower-resource programming ecosystems.

Overall, the benchmark provides a controlled setting for evaluating self-evolving code agents as dynamic problem solvers. By combining execution-based judging, iterative feedback, human-referenced efficiency statistics, and multi-language evaluation, it jointly measures functional correctness, resource efficiency, refinement dynamics, and human-relative competitiveness under a unified protocol.

D.2 Benchmark Construction

Data Collection. We collect the full set of 3,822 programming problems available on LeetCode at the time of crawling. For each problem, we extract the natural-language statement, official input and output examples, constraints, platform-provided difficulty label, topical tags, and language-specific starter code templates. The topical tags characterize the algorithmic concepts required by each problem, including arrays, trees, graphs, dynamic programming, greedy methods, binary search, and mathematics. These annotations support stratified analysis across difficulty levels, algorithmic categories, and programming languages. Figure 5 summarizes the tag and difficulty distributions of the selected problems.

Figure 4: Self-evolving code agent benchmark evaluation pipeline.

Figure 5: Problem distribution.

The collected data are normalized into a unified problem representation. Each instance contains a fixed task specification, official examples, a language-specific starter template, and metadata for difficulty and topic categories. We preserve the original interface required by the online judge so that generated code can be submitted without modifying function signatures or class definitions. This design ensures that performance differences arise from agent behavior rather than inconsistencies in task formatting or evaluation interfaces. We conduct quality checks by filtering malformed records, removing duplicates, and verifying that starter templates are available for all target languages. From the full pool, we select 100 recently released problems to mitigate training-data contamination. The selected problems span diverse topical categories and difficulty levels, and are instantiated across Python3, C++, Java, Go, and Kotlin to enable controlled cross-language evaluation.

Problem Characteristics. LeetCode-style algorithmic problems provide a controlled and challenging setting for evaluating code-agent competence. Each task specifies explicit constraints and precise input and output behavior, requiring instruction following, edge-case coverage, and faithful implementation under a fixed interface. The breadth of tags and difficulty levels evaluates algorithm selection, data-structure proficiency, and complexity-aware reasoning. Because evaluation is execution-based, brittle solutions can be exposed through concrete failures such as off-by-one errors, corner-case bugs, interface mismatches, and language-specific pitfalls. Standardized starter templates across languages further enable systematic cross-language comparison, including robustness analysis in long-tail languages. Since a solution can be revised within the same inference episode, this setting directly measures agentic capabilities such as self-repair, feedback-grounded hypothesis testing, and efficiency-aware optimization under runtime and memory constraints.

Problem Evaluation. For each problem, the agent receives a fixed input representation and submits generated code to the official execution-based judge, which evaluates functional correctness on hidden test cases and reports resource usage statistics. This protocol ensures all agents are assessed under identical task inputs, execution conditions, and scoring criteria. When evaluating agents with self-evolution capability, the agent is additionally allowed to iteratively refine its solution within the same problem-solving episode under a fixed budget of 3 rounds, using execution feedback from the online judge to reflect on failure causes, identify actionable error patterns, and propose targeted code revisions, while keeping the task specification, prompt schema, and evaluation interface unchanged.
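The bounded refinement episode described above can be sketched as a short loop. This is only an illustration under stated assumptions: `Verdict`, `refine`, `toy_judge`, and `toy_propose` are hypothetical names, the judge is a local stand-in rather than the LeetCode service, and the sketch assumes the 3-round budget counts every submission including the first.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    status: str        # judge outcome, e.g. "AC", "WA", "TLE", "CE"
    passed_cases: int  # hidden test cases passed before the first failure
    detail: str = ""   # feedback text returned to the agent


def refine(problem: str,
           propose: Callable[[str, List[Verdict]], str],
           judge: Callable[[str], Verdict],
           max_rounds: int = 3) -> Tuple[str, List[Verdict]]:
    """Submit, read judge feedback, and revise for at most max_rounds rounds."""
    history: List[Verdict] = []
    code = ""
    for _ in range(max_rounds):
        code = propose(problem, history)  # the agent sees every earlier verdict
        verdict = judge(code)
        history.append(verdict)
        if verdict.status == "AC":        # stop as soon as the judge accepts
            break
    return code, history


# Toy stand-ins for the online judge and the agent (both hypothetical).
def toy_judge(code: str) -> Verdict:
    if "abs(" in code:
        return Verdict("AC", passed_cases=10)
    return Verdict("WA", passed_cases=4, detail="wrong answer on negative input")


def toy_propose(problem: str, history: List[Verdict]) -> str:
    if not history:
        return "def solve(x): return x"
    return "def solve(x): return abs(x)"


final_code, trajectory = refine("Return the absolute value of x.", toy_propose, toy_judge)
print([v.status for v in trajectory])
```

Keeping the full verdict history, rather than only the last verdict, is what makes the per-round trajectory available for later analysis.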

D.3 Evaluation Metrics

Table 6: Evaluation metrics for the algorithmic coding benchmark.

| Metric | Description |
|---|---|
| Capability metrics | |
| PR | Number of problems passing all hidden test cases within time and memory limits. |
| TLE | Number of problems exceeding the allowed execution time limit. |
| MLE | Number of problems exceeding the allowed memory usage. |
| CE | Number of problems where generated code failed to compile. |
| RE | Number of problems encountering a runtime error during execution. |
| WA | Number of problems producing incorrect output. |
| TO | Number of problems where the model failed to respond within the timeout. |
| RpE | Number of problems where the model returned an invalid or unparseable response. |
| Efficiency metrics | |
| AR | Mean runtime in milliseconds over accepted solutions. |
| AM | Mean memory usage in megabytes over accepted solutions. |
| APC | Mean number of test cases passed before failure. |
| Human-referenced metrics | |
| ARB | Percentage of accepted solutions whose runtime outperforms human submissions. |
| AMB | Percentage of accepted solutions whose memory usage outperforms human submissions. |

We report three groups of metrics that capture complementary aspects of code-agent performance. Capability metrics evaluate functional correctness and diagnose execution-blocking failure modes. PR measures the number of fully accepted problems, while TLE, MLE, CE, RE, WA, TO, and RpE identify whether failures arise from algorithmic inefficiency, excessive memory use, compilation errors, runtime exceptions, incorrect logic, response timeout, or invalid output formatting. These metrics are particularly important for self-evolving agents because different failure modes correspond to different refinement opportunities.

Efficiency metrics characterize the computational quality of accepted solutions and the progress made by partially correct submissions. AR and AM summarize runtime and memory usage over accepted solutions, which allows us to assess whether self-evolution improves efficiency rather than merely increasing pass rate. APC measures how many test cases are passed before failure and provides a fine-grained signal for unsuccessful submissions. This metric is useful when an agent progresses from early failure to passing most hidden tests, even if the final solution is not accepted.

Human-referenced metrics situate accepted agent solutions within the empirical distribution of human submissions. ARB measures the fraction of accepted human submissions whose runtime is slower than the agent solution, while AMB measures the analogous fraction for memory usage. These metrics provide an interpretable basis for comparing evolved agent solutions with human-written solutions and help determine whether an accepted solution is merely correct or also competitive.

For self-evolving agents, all metrics can be computed at each refinement round as well as for the final submission. This enables trajectory-level evaluation of inference-time improvement, including whether correctness increases across rounds, whether runtime and memory usage improve or degrade, and whether human-relative competitiveness changes after reflection and revision. The benchmark therefore evaluates both endpoint performance and the refinement process through which an agent reaches that endpoint.
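As a rough illustration of round-level reporting, the sketch below aggregates the Table 6 metrics from per-submission records. The record layout (`round`, `status`, `runtime_ms`, `memory_mb`, `runtime_beats`, `memory_beats`, `passed_cases`) is hypothetical, the numbers are made up, and two choices are assumptions rather than the benchmark's definition: APC is averaged over failed submissions only, and ARB/AMB are taken as the mean of the judge-reported beats percentiles.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Optional

FAILURE_CODES = ("TLE", "MLE", "CE", "RE", "WA", "TO", "RpE")


def _mean_or_none(values: List[float]) -> Optional[float]:
    return mean(values) if values else None


def summarize(records: List[Dict]) -> Dict[str, Optional[float]]:
    """Aggregate capability, efficiency, and human-referenced metrics."""
    counts = Counter(r["status"] for r in records)
    accepted = [r for r in records if r["status"] == "AC"]
    failed = [r for r in records if r["status"] != "AC" and "passed_cases" in r]
    summary: Dict[str, Optional[float]] = {c: counts.get(c, 0) for c in FAILURE_CODES}
    summary.update({
        "PR": len(accepted),
        "AR": _mean_or_none([r["runtime_ms"] for r in accepted]),
        "AM": _mean_or_none([r["memory_mb"] for r in accepted]),
        "APC": _mean_or_none([r["passed_cases"] for r in failed]),  # assumption: failures only
        "ARB": _mean_or_none([r["runtime_beats"] for r in accepted]),
        "AMB": _mean_or_none([r["memory_beats"] for r in accepted]),
    })
    return summary


# Made-up records for two problems over two refinement rounds.
records = [
    {"round": 1, "status": "WA", "passed_cases": 30},
    {"round": 1, "status": "AC", "runtime_ms": 48.0, "memory_mb": 22.0,
     "runtime_beats": 40.0, "memory_beats": 80.0},
    {"round": 2, "status": "AC", "runtime_ms": 12.0, "memory_mb": 18.0,
     "runtime_beats": 90.0, "memory_beats": 60.0},
    {"round": 2, "status": "AC", "runtime_ms": 48.0, "memory_mb": 22.0,
     "runtime_beats": 40.0, "memory_beats": 80.0},
]

by_round = {
    rnd: summarize([r for r in records if r["round"] == rnd])
    for rnd in sorted({r["round"] for r in records})
}
print(by_round[1]["PR"], by_round[2]["PR"])
```

Comparing `by_round[1]` with `by_round[2]` gives the trajectory-level view: here the pass count rises while mean runtime and runtime beats change with the newly accepted solution.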

Appendix E Details of Self-Evolution Protocol

E.1 Design Motivation

Existing LLM-based agent systems pursue self-improvement in largely ad hoc ways: prompts, tools, and memory are tightly coupled to agent logic, updated without version control, and impossible to roll back when an update degrades performance. This architectural fragility motivates the two-layer design of AGP. We outline the core motivating principles below.

  • Decoupling substrate from logic. In most existing frameworks, what an agent operates over (prompts, tools, memory) and how it evolves them (optimization algorithms, feedback loops) are interleaved in a single codebase. This coupling makes it difficult to swap, reuse, or safely update individual components. AGP separates the evolvable substrate (RSPL) from the evolution logic (SEPL), so that any compliant optimizer can be applied to any registered resource without modifying the component itself. This modularity is essential for principled, reproducible self-evolution.
  • Safety and auditability through lifecycle management. Self-evolving agents modify their own components at runtime, which introduces risks of cascading failures or undetectable regressions. Without explicit version control and rollback support, a single bad update can silently degrade system behavior. RSPL endows every resource with versioned state and a controlled mutation interface, ensuring that each evolutionary step is traceable, reversible, and subject to explicit commit or rollback decisions. SEPL enforces that updates proceed only after formal evaluation, making every change auditable by design.
  • Formalism over heuristics. Prior self-evolution approaches apply modifications heuristically, for example, by prompting a model to “improve itself” or by directly patching code, with no standardized interface governing what constitutes a valid update cycle. This informality makes it impossible to guarantee safety or reason about correctness across runs. SEPL formalizes the update cycle as a closed-loop operator interface, transforming ad hoc modifications into a rigorous protocol with well-defined pre- and post-conditions. This formalism enables algorithm-agnostic instantiation: the same operator interface supports prompt optimization, reinforcement learning, and gradient-free search.
  • Uniform abstraction across heterogeneous components. Agent systems compose heterogeneous entities, including LLM instructions, external tool scripts, MCP services, and in-context memory, that are typically managed through disparate, component-specific mechanisms. AGP provides a single, unified resource entity abstraction that encompasses all five types (Prompt, Agent, Tool, Environment, Memory), enabling SEPL to apply the same evolution operators uniformly across all components without special-casing.

Together, these principles ground the two-layer architecture in a coherent design philosophy: RSPL provides a stable, typed, versioned substrate that renders agent internals observable and controllable, while SEPL provides a safe, formal operator interface that governs how those internals are updated. The remainder of this section specifies each layer in detail.

E.2 Layer 1: Resource Substrate Protocol Layer

The Resource Substrate Protocol Layer (RSPL) defines the evolvable substrate as a set of protocol-registered resources with explicit state, lifecycle, and version lineage. In this paper, these resources comprise (i) instructions (Prompt), (ii) decision policies (Agent), (iii) actuation interfaces (Tool), which encompass native tool scripts, MCP tools [1], and agent skills [2], (iv) task/world dynamics (Environment), and (v) persistent state (Memory). Crucially, resources in RSPL are passive: they encapsulate no optimization logic and cannot self-modify; all observations and state transitions occur only through controlled, interface-mediated operations invoked by higher layers.

E.2.1 Core Entities

We focus on these five entity types as a minimal yet expressive substrate for agentic systems. This choice is not intended to be exhaustive, but rather to identify a common denominator across modern agent stacks and provide a uniform target space on which SEPL can operate.

Definition E.1 (Resource Entity).

A resource entity of type $\tau$ and its type-level collection can be represented as:

$$
\begin{aligned}
e_{\tau,i} &= (n_{\tau,i},\, d_{\tau,i},\, \phi_{\tau,i},\, g_{\tau,i},\, m_{\tau,i}), \\
\mathcal{E}_{\tau} &= \{\, e_{\tau,i} \mid i \in \mathcal{I}_{\tau} \,\},
\end{aligned}
\tag{6}
$$

where $\mathcal{T} = \{\textsc{Prompt}, \textsc{Agent}, \textsc{Tool}, \textsc{Env}, \textsc{Mem}\}$ denotes the set of RSPL entity types, $\tau \in \mathcal{T}$ indexes the entity type, $\mathcal{I}_{\tau}$ is the index set of resource instances of type $\tau$, and $i \in \mathcal{I}_{\tau}$ indexes an individual instance. Here $n_{\tau,i}$ is a unique resource name, $d_{\tau,i}$ is a short description, $\phi_{\tau,i} : \mathcal{X}_{\tau} \rightarrow \mathcal{Y}_{\tau}$ is an input-to-output mapping, $g_{\tau,i} \in \{0,1\}$ is an evolvability marker, and $m_{\tau,i}$ is an auxiliary metadata dictionary.
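One way to read Definition E.1 is as a small immutable record plus a per-type collection. The sketch below is illustrative only: the class name `ResourceEntity`, its field names, and the helper `collection` are hypothetical and are not taken from the Autogenesis codebase.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable

ENTITY_TYPES = {"Prompt", "Agent", "Tool", "Env", "Mem"}  # the set T


@dataclass(frozen=True)
class ResourceEntity:
    type: str                       # tau, one of ENTITY_TYPES
    name: str                       # n: unique resource name
    description: str                # d: short description
    mapping: Callable[[Any], Any]   # phi: input-to-output mapping
    evolvable: bool                 # g: evolvability marker
    metadata: Dict[str, Any] = field(default_factory=dict)  # m

    def __post_init__(self) -> None:
        if self.type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.type}")


def collection(entities: Iterable[ResourceEntity], tau: str) -> Dict[str, ResourceEntity]:
    """Build E_tau: all entities of type tau, keyed by their unique name."""
    return {e.name: e for e in entities if e.type == tau}


entities = [
    ResourceEntity("Prompt", "system_prompt", "Base instructions",
                   mapping=lambda task: f"You are a careful solver. Task: {task}",
                   evolvable=True),
    ResourceEntity("Tool", "adder", "Adds two integers",
                   mapping=lambda args: args[0] + args[1],
                   evolvable=False),
]

tools = collection(entities, "Tool")
print(tools["adder"].mapping((2, 3)))
```

Freezing the dataclass mirrors the passivity requirement: the entity carries no update logic, so any change has to produce a new record through a higher layer.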

A key motivation for making prompt, tool, and memory explicit RSPL resources is decoupling. Many agent systems package prompts, tools, and memory as internal components of an agent, which entangles agent logic with task-specific instructions and capability bundles, increasing maintenance and limiting transfer. By externalizing them as first-class, versioned resources with standardized interfaces, the same tool-calling agent policy can be paired with different prompts and tool sets, and deployed unchanged across tasks and environments.

To support resource registration, unified management, and instantiation, RSPL stores a serializable registration record for each resource instance.

Definition E.2 (Resource Registration Record).

A resource registration record and its type-level collection can be represented as:

$$
\begin{aligned}
c_{\tau,i} &= (e_{\tau,i},\, v_{\tau,i},\, \eta_{\tau,i},\, \theta_{\tau,i},\, \mathcal{F}_{\tau,i}), \\
\mathcal{C}_{\tau} &= \{\, c_{\tau,i} \mid i \in \mathcal{I}_{\tau} \,\},
\end{aligned}
\tag{7}
$$

where $\tau \in \mathcal{T}$ indexes the entity type and $i \in \mathcal{I}_{\tau}$ indexes an individual instance. Here $e_{\tau,i}$ is the resource entity tuple defined in Definition E.1, $v_{\tau,i} \in \mathbb{V}$ is a version string, $\eta_{\tau,i}$ is an implementation descriptor (e.g., import path, class definition, or source-code string), $\theta_{\tau,i}$ are instantiation parameters (e.g., constructor arguments), and $\mathcal{F}_{\tau,i}$ is a set of exported representations used by LLMs to interact with the resource (e.g., function-calling schema, plain text, and structured argument schema).
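To make the serializable record concrete, the following sketch stores a record as plain data and attaches two exported representations. Everything here is an assumption for illustration: the class `RegistrationRecord`, the helper `export_for_llm`, the import path `tools.math:add`, and the generic JSON-Schema-like layout of the function schema are hypothetical. The entity's callable is deliberately not stored; the implementation descriptor is what would let a manager rebuild it.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict


@dataclass
class RegistrationRecord:
    name: str                  # from the entity tuple e
    description: str           # from the entity tuple e
    version: str               # v: version string
    implementation: str        # eta: import path or source-code string
    params: Dict[str, Any]     # theta: instantiation parameters
    exports: Dict[str, Any] = field(default_factory=dict)  # F: LLM-facing views


def export_for_llm(record: RegistrationRecord, arg_schema: Dict[str, Any]) -> RegistrationRecord:
    """Fill F with a plain-text view and a function-calling style schema."""
    record.exports = {
        "text": f"{record.name} (v{record.version}): {record.description}",
        "function_schema": {
            "name": record.name,
            "description": record.description,
            "parameters": arg_schema,
        },
    }
    return record


record = export_for_llm(
    RegistrationRecord(
        name="adder",
        description="Adds two integers",
        version="1.0.0",
        implementation="tools.math:add",   # hypothetical import path
        params={"saturate": False},
    ),
    arg_schema={
        "type": "object",
        "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
        "required": ["a", "b"],
    },
)

payload = json.dumps(asdict(record))                     # serialize for persistence
restored = RegistrationRecord(**json.loads(payload))     # rebuild from stored data
print(restored.exports["text"])
```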

Definition E.3 (Protocol-registered resource).

For each entity type $\tau$, let $\mathcal{R}_{\tau}$ denote the type-specific registry of protocol-registered resources, and let $\mathcal{R} = \bigcup_{\tau} \mathcal{R}_{\tau}$ denote the global registry. RSPL binds each entity type $\tau$ to a dedicated context manager $\mathcal{M}_{\tau}$ and a server-exposed interface $\mathcal{A}_{\tau}$. We represent the type-level registered resource as

$$
r_{\tau} = (\mathcal{C}_{\tau},\; \mathcal{M}_{\tau},\; \mathcal{A}_{\tau}),
\tag{8}
$$

where each $c_{\tau,i} \in \mathcal{C}_{\tau}$ is a registration record in Definition E.2. The context manager $\mathcal{M}_{\tau}$ maintains the collection $\mathcal{C}_{\tau}$ and the version lineage for type $\tau$, and implements lifecycle and update operations over these records; the server-exposed interface $\mathcal{A}_{\tau}$ encapsulates $\mathcal{M}_{\tau}$ and exposes a unified external interface by delegating requests to the corresponding context-manager routines.

E.2.2 Context Manager

The context manager implements the management plane for each resource type. Beyond lifecycle control and dependency constraints, it maintains (i) an active registry of materialized resources and (ii) a versioned history for restoration. Its exported API exposes operators for lifecycle (init, build), retrieval (list, get), versioning (update, restore), execution (run), and serialization (save_to_json, load_from_json, save_contract, load_contract). The manager explicitly supports contract generation, producing a consolidated capability and constraint specification for the managed entities, which provides stable, up-to-date descriptions that improve reliability and reduce prompt bloat, enabling systematic context engineering via controlled prompt injection. For instance, for tools (which may be native tool scripts, MCP-connected tools [1], or agent skills) the contract can take a skills.md-style form [2] that enumerates tool actions, arguments, preconditions, and usage constraints. The management interface implemented by $\mathcal{M}_{\tau}$ and exposed by $\mathcal{A}_{\tau}$ is as follows:

Table 7: Operator set of Context Manager and Server Interface.

| Operator | Description |
|---|---|
| Lifecycle & Registration | |
| init | Auto-discover resources and register the resource configuration to the registry. |
| build | Build a resource instance from code and configuration. |
| register | Register a new resource instance with a unique name and version. |
| unregister | Unregister a resource instance from the active registry and version history. |
| Retrieval & Inspection | |
| get | Retrieve a resource instance by name from the active registry. |
| get_info | Retrieve a resource configuration by name from the active registry. |
| list | List all registered resource names. |
| retrieve | Retrieve similar resources via semantic search when supported. |
| get_state | Get the current state of a resource instance when supported. |
| Versioning | |
| update | Update a resource implementation and generate a new version. |
| copy | Duplicate a resource with an optional new name and version. |
| restore | Restore a specific historical version by name and version string. |
| get_variables | Expose resource code/configuration as evolvable variables. |
| set_variables | Update resource variables and generate a new version. |
| Execution | |
| run | Run a resource instance with structured input. |
| Serialization | |
| save_to_json | Serialize configurations and version history to a JSON file. |
| load_from_json | Deserialize configurations and version history from a JSON file. |
| save_contract | Save the contract of a resource instance to a file. |
| load_contract | Load the contract of a resource instance from a file. |
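A small subset of these operators can be sketched for the tool type as follows. This is a toy under stated assumptions, not the project's implementation: the class `ToolContextManager`, its method signatures, the `constraints` argument, and the Markdown layout of the generated contract are all hypothetical, and versioning and serialization are left out.

```python
import inspect
from typing import Any, Callable, Dict, List


class ToolContextManager:
    """Toy manager for one resource type: registration, retrieval, run, contract."""

    def __init__(self) -> None:
        self._active: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, fn: Callable[..., Any], description: str,
                 version: str = "1.0.0", constraints: str = "") -> None:
        if name in self._active:
            raise KeyError(f"resource already registered: {name}")
        self._active[name] = {"fn": fn, "description": description,
                              "version": version, "constraints": constraints}

    def unregister(self, name: str) -> None:
        del self._active[name]

    def list(self) -> List[str]:
        return sorted(self._active)

    def get(self, name: str) -> Callable[..., Any]:
        return self._active[name]["fn"]

    def get_info(self, name: str) -> Dict[str, str]:
        info = self._active[name]
        return {"name": name, "description": info["description"],
                "version": info["version"]}

    def run(self, name: str, **kwargs: Any) -> Any:
        return self.get(name)(**kwargs)

    def contract(self) -> str:
        """Consolidate names, versions, arguments, and constraints into one text."""
        lines = ["# Tool contract"]
        for name in self.list():
            info = self._active[name]
            lines.append(f"## {name} (v{info['version']})")
            lines.append(info["description"])
            lines.append(f"Arguments: {inspect.signature(info['fn'])}")
            if info["constraints"]:
                lines.append(f"Constraints: {info['constraints']}")
        return "\n".join(lines)


def add(a: int, b: int) -> int:
    return a + b


manager = ToolContextManager()
manager.register("add", add, "Adds two integers", constraints="integers only")
print(manager.contract())
```

Because the contract is generated from the registry, it stays in step with what is actually registered instead of drifting like a hand-written tool description.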

E.2.3 Server Interface

The server is introduced to encapsulate the context manager’s internal complexity and present a stable, simplified interface for external callers. It packages heterogeneous management routines behind a uniform set of endpoints with consistent request/response semantics, while delegating the implementation details to the context manager. This separation isolates clients from internal design changes, reduces coupling, and provides a single control plane through which the protocol mediates safe, version-aware interactions with RSPL resources.

E.2.4 Infrastructure Services

RSPL further includes cross-cutting services that support reliable evolution, including reproducibility, safe deployment, and versioned recovery:

Model manager. A unified model-API layer that standardizes calls across providers (e.g., OpenAI, Anthropic, Google, and OpenRouter), while supporting routing, fallback, and cost-aware selection to keep model access consistent as components evolve.

Version manager. Maintains version lineage for each resource, enabling rollback, branching, and diffing. Versions are auto-incremented identifiers (e.g., semantic versions) assigned on register or update, each referencing an immutable snapshot of the configuration record and associated artifacts for auditability and reproducibility.
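The lineage, rollback, and diffing behavior can be illustrated with a minimal sketch. The names `VersionManager` and `bump_patch` are hypothetical, snapshots are plain strings standing in for configuration records, and the policy of bumping the patch component of the latest version is an assumption, since the text only says versions are auto-incremented. Branching is not modeled.

```python
import difflib
from typing import Dict, List


def bump_patch(version: str) -> str:
    """Increment the patch component of a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


class VersionManager:
    def __init__(self) -> None:
        self._snapshots: Dict[str, Dict[str, str]] = {}  # name -> version -> content
        self._lineage: Dict[str, List[str]] = {}         # name -> versions in order
        self._current: Dict[str, str] = {}               # name -> active version

    def register(self, name: str, content: str, version: str = "1.0.0") -> str:
        self._snapshots[name] = {version: content}
        self._lineage[name] = [version]
        self._current[name] = version
        return version

    def update(self, name: str, content: str) -> str:
        # Bump from the newest version so a restore never reuses an identifier.
        version = bump_patch(self._lineage[name][-1])
        self._snapshots[name][version] = content
        self._lineage[name].append(version)
        self._current[name] = version
        return version

    def restore(self, name: str, version: str) -> str:
        content = self._snapshots[name][version]  # KeyError if version is unknown
        self._current[name] = version
        return content

    def current(self, name: str) -> str:
        return self._current[name]

    def lineage(self, name: str) -> List[str]:
        return list(self._lineage[name])

    def diff(self, name: str, old: str, new: str) -> List[str]:
        before = self._snapshots[name][old].splitlines()
        after = self._snapshots[name][new].splitlines()
        return list(difflib.unified_diff(before, after, fromfile=old,
                                         tofile=new, lineterm=""))


vm = VersionManager()
vm.register("system_prompt", "You are helpful.")
v2 = vm.update("system_prompt", "You are helpful and concise.")
rolled_back = vm.restore("system_prompt", "1.0.0")
print(vm.lineage("system_prompt"), vm.current("system_prompt"))
```

Restoring changes only the active pointer; the later snapshot stays in the lineage, which is what keeps a rollback itself auditable.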

Dynamic manager. Handles serialization and deserialization of resource configurations for persistence and transfer, enabling safe hot-swapping of resource configurations at runtime without restarting the agent system.

Trace manager. Captures fine-grained execution traces (e.g., inputs, outputs, intermediate decisions, and tool interactions) for interpretability and debugging, and as training signals for dataset synthesis and retrospective improvement.

E.3 Layer 2: Self-Evolution Protocol Layer

The Self-Evolution Protocol Layer (SEPL) formalizes agentic system evolution as a generalized optimization problem over a heterogeneous state space, modeling evolutionary dynamics as a state transition function governed by a strictly typed operator algebra. By mediating all state mutations through standardized RSPL interfaces, SEPL guarantees that evolution is traceable, reversible, and safe-by-construction. While this paper focuses on the reflection-driven optimizer as the primary instantiation, the same state manipulation primitives also accommodate textual-gradient methods such as TextGrad [40] and reinforcement learning approaches such as GRPO [31] and Reinforce++ [14].

E.3.1 Evolvable Variables

To transition from heuristic adaptation to a systematic evolution protocol, SEPL introduces the concept of variable lifting: projecting discrete, heterogeneous RSPL resources (e.g., tool code, system prompts, memory modules, and environment configurations) onto a unified representation of evolvable variables. This homogenizes the interaction surface for all evolutionary operators and rigorously delineates the trainable subspace via an explicit learnability mask.

Definition E.4 (Evolvable Variable Set).

We define the universal set of evolvable variables as

$$
\mathcal{V}_{\text{evo}} = \Bigl(\bigcup_{\tau \in \mathcal{T}} \mathcal{E}_{\tau}\Bigr) \cup \{y\},
\tag{9}
$$

where $\mathcal{E}_{\tau}$ denotes the set of resource entities of type $\tau$ governed by RSPL, and $y$ encapsulates execution artifacts (final outputs and reasoning traces) that constitute the observational basis for retrospective optimization. Each variable $v \in \mathcal{V}_{\text{evo}}$ is associated with a binary learnability constraint $g_{v} \in \{0,1\}$, strictly defining the trainable parameter subspace

$$
\Theta = \{\, v \in \mathcal{V}_{\text{evo}} \mid g_{v} = 1 \,\}.
\tag{10}
$$

The evolvability marker $g_{v}$ allows SEPL to operate selectively: frozen components (e.g., a fixed tool API) are excluded from the trainable subspace, while designated evolvable resources (e.g., system prompts, tool implementations) are exposed for modification. This explicit masking ensures that only intended components are mutated during evolution.
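The masking in Eq. (10) amounts to a filter plus a guard on writes. The sketch below assumes a hypothetical `Variable` record and helper names (`trainable_subspace`, `apply_update`); it is only meant to show how a frozen variable stays out of reach of an optimizer.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Variable:
    name: str
    value: str
    evolvable: bool  # g_v: True means the variable belongs to Theta


def trainable_subspace(variables: List[Variable]) -> List[Variable]:
    """Theta: the variables whose learnability flag is set."""
    return [v for v in variables if v.evolvable]


def apply_update(variables: List[Variable], name: str, new_value: str) -> List[Variable]:
    """Return a new variable list with one value changed; frozen variables are rejected."""
    updated = []
    for v in variables:
        if v.name == name:
            if not v.evolvable:
                raise PermissionError(f"{name} is frozen and cannot be modified")
            v = replace(v, value=new_value)
        updated.append(v)
    return updated


variables = [
    Variable("system_prompt", "Solve the task.", evolvable=True),
    Variable("search_api", "GET /search?q=...", evolvable=False),  # fixed tool API
    Variable("y", "last answer and reasoning trace", evolvable=True),
]

theta = trainable_subspace(variables)
evolved = apply_update(variables, "system_prompt", "Solve the task. Check edge cases.")
print([v.name for v in theta])
```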

E.3.2 Operator Algebra

Definition E.5 (SEPL Operator).

Let $\mathcal{V}_{\text{evo}}$ be the evolvable variable set and $\mathcal{P}$ a message space carrying auxiliary signals (e.g., traces, hypotheses, gradients, or reward signals) passed between operators. A SEPL operator is a function

$$
f : \mathcal{V}_{\text{evo}} \times \mathcal{P}_{\text{in}} \;\rightarrow\; \mathcal{V}^{\prime}_{\text{evo}} \times \mathcal{P}_{\text{out}},
\tag{11}
$$

where $\mathcal{P}_{\text{in}}, \mathcal{P}_{\text{out}} \subseteq \mathcal{P}$ are the incoming and outgoing message types, and $\mathcal{V}^{\prime}_{\text{evo}}$ is the updated evolvable state. Operators are composable: the output $(\mathcal{V}^{\prime}_{\text{evo}}, \mathcal{P}_{\text{out}})$ of one operator serves as the input to the next, enabling the construction of an evolutionary pipeline $f_{n} \circ \cdots \circ f_{1}$. All mutations to $\mathcal{V}_{\text{evo}}$ must be routed through RSPL interfaces, ensuring every state transition is versioned, auditable, and reversible regardless of the specific optimizer instantiation.

The auxiliary spaces used by operators are: trace space $\mathcal{Z}$ (execution observations), hypothesis space $\mathcal{H}$ (causal failure attributions), modification space $\mathcal{D}$ (proposed resource changes), objective specification $\mathcal{G}$ (task goals and safety invariants), and evaluation space $\mathcal{S}$ (performance metrics and safety status). The five canonical operators of the reflection instantiation are $\{\rho, \sigma, \iota, \varepsilon, \kappa\}$, corresponding to Reflect, Select, Improve, Evaluate, and Commit, operating over these spaces in sequence. Other instantiations (TextGrad, GRPO, Reinforce++) reuse the same operator interface but replace the internal logic of individual operators, as detailed in the method-specific subsections below.
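The composability property in Definition E.5 can be sketched with ordinary functions that take and return a (state, message) pair. The aliases `State`, `Message`, and `Operator`, the helper `compose`, and the two toy operators are hypothetical; for brevity the toy mutates a copied dictionary directly instead of routing the change through RSPL interfaces as the protocol requires.

```python
from typing import Any, Callable, Dict, Tuple

State = Dict[str, str]      # stand-in for the evolvable variable set
Message = Dict[str, Any]    # stand-in for the message space P
Operator = Callable[[State, Message], Tuple[State, Message]]


def compose(*ops: Operator) -> Operator:
    """Build f_n after ... after f_1; operators are applied in the order given."""
    def pipeline(state: State, msg: Message) -> Tuple[State, Message]:
        for op in ops:
            state, msg = op(state, msg)
        return state, msg
    return pipeline


def diagnose(state: State, msg: Message) -> Tuple[State, Message]:
    """Leave the state untouched and emit hypotheses as an outgoing message."""
    hypotheses = [] if "units" in state["prompt"] else ["answer omits units"]
    return state, {**msg, "hypotheses": hypotheses}


def patch_prompt(state: State, msg: Message) -> Tuple[State, Message]:
    """Consume hypotheses and return an updated copy of the state."""
    if not msg.get("hypotheses"):
        return state, msg
    new_state = {**state, "prompt": state["prompt"] + " Always state units."}
    return new_state, {**msg, "changed": ["prompt"]}


step = compose(diagnose, patch_prompt)
s0: State = {"prompt": "Answer briefly."}
s1, m1 = step(s0, {})
s2, m2 = step(s1, {})   # a second pass finds nothing left to fix
print(s1["prompt"])
```

Because every operator shares one signature, swapping the internals of a single stage leaves the rest of the pipeline unchanged.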

E.3.3 Evolutionary Loop

Given an initial evolvable state $\mathcal{V}_{\text{evo}}^{(0)}$ and an empty message $\mathcal{P}^{(0)} = \emptyset$, the evolutionary loop at each iteration $t$ applies a sequence of operators $f_{1}, \ldots, f_{n}$ in composition:

$$
\bigl(\mathcal{V}_{\text{evo}}^{(t+1)},\, \mathcal{P}^{(t+1)}\bigr) = (f_{n} \circ \cdots \circ f_{1})\bigl(\mathcal{V}_{\text{evo}}^{(t)},\, \mathcal{P}^{(t)}\bigr),
\tag{12}
$$

where each $f_{i}$ reads the current state and incoming messages, and produces an updated state and outgoing messages consumed by $f_{i+1}$. The loop repeats until convergence or budget exhaustion. By routing all state mutations through RSPL interfaces, each transition is versioned and reversible, guaranteeing that evolution is grounded in execution data, traceable through versioned updates, and safe-by-construction.

The specific operator sequence instantiated by each method determines the behavior of the loop. The reflection optimizer instantiates this loop with five operators: Reflect maps execution traces and current state to causal failure hypotheses, Select identifies target evolvable entities and generates concrete modification proposals, Improve applies proposals via RSPL interfaces to yield a candidate state, Evaluate scores the candidate against the objective and safety invariants, and Commit conditionally accepts or rolls back the transition. TextGrad, GRPO, and Reinforce++ reuse the same loop structure but replace the internal logic of individual operators, as detailed in the method-specific subsections below.

E.3.4 Reflection Optimizer

Evolvable Variables. In the reflection-driven instantiation, the evolvable state is given by the lifted variable set $\mathcal{V}_{\text{evo}}$ introduced above. Concretely, $\mathcal{V}_{\text{evo}}$ includes RSPL-managed resources (e.g., prompts, tools, memories, and agent components) together with execution artifacts (e.g., the produced answer and reasoning trace). A binary learnability mask specifies which variables may be modified, allowing the optimizer to target only authorized components while keeping non-learnable resources fixed.

Operator Algebra. We instantiate SEPL with the canonical reflection-driven operator suite. The operator signatures and their intended roles are as follows.

  • Reflect ($\rho$). Defined as $\rho : \mathcal{Z} \times \mathcal{V}_{\text{evo}} \rightarrow \wp(\mathcal{H})$, this operator bridges the gap between raw observation and optimization direction. It approximates the “semantic gradient” of the system by mapping high-dimensional execution traces to specific, causal failure hypotheses within the variable space.
  • Select ($\sigma$). Formulated as $\sigma : \mathcal{V}_{\text{evo}} \times \wp(\mathcal{H}) \rightarrow \wp(\mathcal{D})$, this operator acts as the targeting policy. It identifies which evolvable entities within $\mathcal{V}_{\text{evo}}$ are implicated by the diagnostic hypotheses, then generates concrete modification proposals $\mathcal{D}$ targeting those entities, subject to structural constraints.
  • Improve ($\iota$). The mutation operator, $\iota : \mathcal{V}_{\text{evo}} \times \wp(\mathcal{D}) \rightarrow \mathcal{V}^{\prime}_{\text{evo}}$, executes the physical state transition. It applies discrete updates $\mathcal{D}$ via standardized RSPL interfaces to yield a provisional candidate state.
  • Evaluate ($\varepsilon$). Specified as $\varepsilon : \mathcal{V}^{\prime}_{\text{evo}} \times \mathcal{G} \rightarrow \mathcal{S}$, this operator serves as the objective function. It maps the candidate state and goal specification to the evaluation space $\mathcal{S}$ (comprising quantitative scores and strict safety invariants).
  • Commit ($\kappa$). Operating as $\kappa : \mathcal{V}^{\prime}_{\text{evo}} \times \mathcal{S} \rightarrow \mathcal{V}_{\text{evo}}$, this function acts as a conditional gating mechanism. It utilizes the evaluation signals in $\mathcal{S}$ to govern state transition, rigorously enforcing safety invariants and performance monotonicity by accepting the candidate $\mathcal{V}^{\prime}_{\text{evo}}$ only when specific success criteria are met.

The Evolutionary Loop. These operators are composed into the reflection-driven closed-loop procedure shown in Algorithm 1. Starting from an initial lifted state $\mathcal{V}_{\text{evo}}^{(0)}$, the agent first executes to collect an observational trace $\mathcal{Z}$ (tool outputs, intermediate decisions, failures, and progress signals). The reflect operator $\rho$ maps $\mathcal{Z}$ to a set of causal hypotheses $\mathcal{H}$, which are then translated by $\sigma$ into concrete modification primitives $\mathcal{D}$ (e.g., prompt edits, tool adjustments, or memory updates) over the learnable subset of $\mathcal{V}_{\text{evo}}$. The improve operator $\iota$ applies $\mathcal{D}$ via RSPL interfaces to obtain a candidate state, which is evaluated by $\varepsilon$ to produce $\mathcal{S}$ capturing both performance metrics and safety constraints. Finally, the commit operator $\kappa$ gates the transition by accepting only candidates that satisfy the predefined criteria, recording each accepted change as a versioned resource update with auditable lineage and enabling rollback when necessary.
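
The versioned commit with lineage and rollback described above can be sketched as follows. This is an illustration under stated assumptions, not the RSPL interface: the class `VersionedResource`, its methods, and the choice to record a rollback as a new version are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedResource:
    """Hypothetical resource wrapper that keeps an append-only version lineage."""
    name: str
    versions: list = field(default_factory=list)  # entries: (version, value, note)

    def commit(self, value: str, note: str) -> int:
        """Record a new version and return its 1-based version number."""
        version = len(self.versions) + 1
        self.versions.append((version, value, note))
        return version

    @property
    def current(self) -> str:
        """Value of the most recently committed version."""
        return self.versions[-1][1]

    def rollback(self, to_version: int) -> int:
        """Restore an earlier value by committing it again, so no history is lost."""
        _, value, _ = self.versions[to_version - 1]
        return self.commit(value, f"rollback to v{to_version}")


prompt = VersionedResource("planner_prompt")
prompt.commit("Plan step by step.", "initial")
prompt.commit("Plan step by step and cite sources.", "accepted edit")
restored = prompt.rollback(1)
```

Recording the rollback as a third version, instead of truncating the list, keeps every accepted and reverted change auditable.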

Algorithm 1 Reflection Optimizer Evolutionary Loop
Input: Agentic System $\mathcal{A}$, Objective $\mathcal{G}$, Budget $T$
Output: Optimized state $\mathcal{V}_{\text{evo}}^{*}$
Initialization:
  $\mathcal{V}_{\text{evo}}^{(0)}\leftarrow\text{VariableLifting}(\mathcal{A})$ ⊳ Project resources to optimization manifold
  $\mathcal{Z}^{(0)}\leftarrow\text{Execute}(\mathcal{A},\mathcal{V}_{\text{evo}}^{(0)})$ ⊳ Trace: tool I/O, failures, latencies, progress
Optimization Cycle:
for $t=0,1,\ldots,T-1$ do
  // Phase 1: Diagnosis & Proposal
  $\mathcal{H}^{(t)}\leftarrow\rho(\mathcal{Z}^{(t)},\mathcal{V}_{\text{evo}}^{(t)})$ ⊳ Reflect: attribute failures / inefficiencies
  $\mathcal{D}^{(t)}\leftarrow\sigma(\mathcal{V}_{\text{evo}}^{(t)},\mathcal{H}^{(t)})$ ⊳ Select: propose edits over learnable variables
  // Phase 2: Mutation & Verification
  $\widetilde{\mathcal{V}}_{\text{evo}}^{(t+1)}\leftarrow\iota(\mathcal{V}_{\text{evo}}^{(t)},\mathcal{D}^{(t)})$ ⊳ Improve: apply proposed updates (candidate)
  $\mathcal{S}^{(t+1)}\leftarrow\varepsilon(\widetilde{\mathcal{V}}_{\text{evo}}^{(t+1)},\mathcal{G})$ ⊳ Evaluate: metrics + safety invariants
  // Phase 3: Gating & Transition
  if $\text{Accept}(\mathcal{S}^{(t+1)})$ then
    // Accept: safe & non-degrading
    $\mathcal{V}_{\text{evo}}^{(t+1)}\leftarrow\kappa(\widetilde{\mathcal{V}}_{\text{evo}}^{(t+1)},\mathcal{S}^{(t+1)})$ ⊳ Commit: versioned update
  else
    // Reject: rollback / keep previous state
    $\mathcal{V}_{\text{evo}}^{(t+1)}\leftarrow\mathcal{V}_{\text{evo}}^{(t)}$
  end if
  // Phase 4: Next Iteration
  $\mathcal{Z}^{(t+1)}\leftarrow\text{Execute}(\mathcal{A},\mathcal{V}_{\text{evo}}^{(t+1)})$ ⊳ Re-run under updated resources
  if $\text{Converged}(\mathcal{S}^{(t+1)})$ then
    break
  end if
end for
return $\mathcal{V}_{\text{evo}}^{(t)}$
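
The control flow of Algorithm 1 can be exercised on a toy problem. The sketch below is an assumption-laden illustration: the five operators are replaced by trivial string functions, the state is a one-entry dictionary, and the acceptance and convergence rules (strict score improvement, all requirements met) are choices made here, not criteria stated in the paper.

```python
REQUIRED = ("cite sources", "answer concisely")  # toy goal specification G


def execute(state):
    """Toy Execute: the trace lists requirements the prompt does not mention."""
    return [req for req in REQUIRED if req not in state["prompt"]]


def reflect(trace, state):
    """rho: map the trace to failure hypotheses (here, the missing requirements)."""
    return list(trace)


def select(state, hypotheses):
    """sigma: propose at most one edit, as (variable name, text to append)."""
    return [("prompt", f" Always {hypotheses[0]}.")] if hypotheses else []


def improve(state, edits):
    """iota: apply the edits to a copy, producing a candidate state."""
    candidate = dict(state)
    for name, suffix in edits:
        candidate[name] += suffix
    return candidate


def evaluate(candidate):
    """epsilon: the score is the number of satisfied requirements."""
    return sum(req in candidate["prompt"] for req in REQUIRED)


def reflection_loop(state, budget):
    log = []
    best = evaluate(state)
    trace = execute(state)
    for _ in range(budget):
        candidate = improve(state, select(state, reflect(trace, state)))
        score = evaluate(candidate)
        if score > best:                 # assumed acceptance rule: strict improvement
            state, best = candidate, score
            log.append(("accept", score))
        else:                            # reject: keep the previous state
            log.append(("reject", score))
        trace = execute(state)           # re-run under the current state
        if best == len(REQUIRED):        # assumed convergence test
            break
    return state, log


final_state, log = reflection_loop({"prompt": "You are a helpful agent."}, budget=5)
```

In a real instantiation each operator would call an LLM or an RSPL interface; only the gate-then-re-execute structure carries over from this sketch.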

E.3.5 TextGrad Optimizer

Evolvable Variables. In the TextGrad instantiation, the evolvable variables are restricted to a subset of prompt variables marked as optimizable and lifted into TextGrad variables with explicit role descriptions. In our implementation, each optimizable prompt module is represented as a TextGrad variable whose value is the current prompt text and whose role description specifies the prompt’s function, enabling the optimizer to condition updates on its intended semantics.

Operator Algebra. TextGrad instantiates SEPL with a prompt-level operatorization in which “gradients” are natural-language critiques produced by an LLM evaluator and updates are implemented as constrained prompt rewrites. Following the standard TextGrad view, we express the method with five core operators, namely Execute, Loss, Backward, Improve, and Commit, where the “gradient” is a piece of text (a critique) rather than a numeric vector:

  • Execute ($\chi_{\mathrm{tg}}$). $\chi_{\mathrm{tg}}:(\mathcal{A},\mathcal{V}_{\text{evo}},x,f)\rightarrow\mathcal{Z}$ runs the agent under the current prompt variables and produces an execution trace/outcome.
  • Loss ($\lambda_{\mathrm{tg}}$). $\lambda_{\mathrm{tg}}:\mathcal{Z}\rightarrow\mathcal{G}_{\mathrm{tg}}$, where $\mathcal{G}_{\mathrm{tg}}$ is a space of natural-language critiques (textual gradients). In our implementation, $\lambda_{\mathrm{tg}}$ is realized by TextLoss, which queries an evaluator LLM and returns critique feedback.
  • Backward ($\beta_{\mathrm{tg}}$). $\beta_{\mathrm{tg}}:\mathcal{V}_{\text{evo}}\times\mathcal{G}_{\mathrm{tg}}\rightarrow\mathcal{V}_{\text{evo}}$ assigns textual gradients to optimizable prompt variables by storing the critique (optionally with context) in a per-variable gradient buffer. In our current implementation, we distribute the same critique to each optimizable prompt variable for stability.
  • Improve ($\iota_{\mathrm{tg}}$). $\iota_{\mathrm{tg}}:\mathcal{V}_{\text{evo}}\rightarrow\mathcal{V}^{\prime}_{\text{evo}}$ rewrites prompt variables via a textual-gradient-descent step: it constructs an update instruction from each variable’s role description, current value, and accumulated textual gradients, then queries an optimizer LLM and extracts the improved variable text from a constrained output format.
  • Commit ($\kappa_{\mathrm{tg}}$). $\kappa_{\mathrm{tg}}:\mathcal{V}^{\prime}_{\text{evo}}\rightarrow\mathcal{V}_{\text{evo}}$ synchronizes the updated prompt variables back into the running agent and clears caches, completing the state transition.

The Evolutionary Loop. Algorithm 2 presents the full TextGrad optimization cycle in operator form. At each iteration, the agent is executed under the current prompt variables to obtain a trace $\mathcal{Z}$ via $\chi_{\mathrm{tg}}$, an LLM-based evaluator produces a natural-language critique $g\in\mathcal{G}_{\mathrm{tg}}$ via $\lambda_{\mathrm{tg}}$, the critique is assigned as a textual gradient to the optimizable prompt variables via $\beta_{\mathrm{tg}}$, the prompt variables are improved via $\iota_{\mathrm{tg}}$ using textual-gradient-descent, and the candidate state is committed via $\kappa_{\mathrm{tg}}$ to synchronize the updated prompts back into the running agent (and clear caches) before the next iteration.

Algorithm 2 TextGrad Prompt Optimization Loop
Input: Agentic System $\mathcal{A}$, task $x$, attachments $f$ (optional), Budget $K$, evaluator/optimizer LLMs $M_{\text{eval}},M_{\text{opt}}$
Output: Updated state $\mathcal{V}_{\text{evo}}^{*}$ (prompt variables updated via TextGrad)
// Phase 0: Setup
Set backward engine to $M_{\text{eval}}$ ⊳ Evaluator used by TextLoss
$\mathcal{V}_{\text{evo}}^{(0)}\leftarrow\text{VariableLifting}(\mathcal{A})$ ⊳ Lift optimizable prompts to TextGrad variables
Initialize textual optimizer with $M_{\text{opt}}$ ⊳ TextualGradientDescent over prompt vars
// Optimization Cycle
for $k=0,1,\ldots,K-1$ do
  // Phase 1: Execute (Forward)
  $\mathcal{Z}^{(k)}\leftarrow\chi_{\mathrm{tg}}(\mathcal{A},\mathcal{V}_{\text{evo}}^{(k)},x,f)$ ⊳ Run agent with current prompts
  // Phase 2: Loss (Textual Gradient)
  Build evaluation instruction from $\mathcal{Z}^{(k)}$ ⊳ Condition on success/error
  $g^{(k)}\leftarrow\lambda_{\mathrm{tg}}(\mathcal{Z}^{(k)})$ ⊳ TextLoss produces critique string
  // Phase 3: Backward (Assign Gradients)
  $\mathcal{V}_{\text{evo}}^{(k)}\leftarrow\beta_{\mathrm{tg}}(\mathcal{V}_{\text{evo}}^{(k)},g^{(k)})$ ⊳ Assign critique to gradient buffers
  // Phase 4: Improve (Textual Gradient Descent)
  $\widetilde{\mathcal{V}}_{\text{evo}}^{(k+1)}\leftarrow\iota_{\mathrm{tg}}(\mathcal{V}_{\text{evo}}^{(k)})$ ⊳ Rewrite prompts via textual GD
  // Phase 5: Commit & Next Iteration
  $\mathcal{V}_{\text{evo}}^{(k+1)}\leftarrow\kappa_{\mathrm{tg}}(\widetilde{\mathcal{V}}_{\text{evo}}^{(k+1)})$ ⊳ Sync back; clear caches
  if $\text{Converged}(g^{(k)})$ then
    break
  end if
end for
return $\mathcal{V}_{\text{evo}}^{(k)}$
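
The phase ordering of Algorithm 2 can be sketched without any LLM or library dependency. This sketch does not use the `textgrad` package API: `TextVariable`, `textgrad_loop`, and the three stub callables are hypothetical stand-ins for the lifted prompt variables, the loop, and the agent, $M_{\text{eval}}$, and $M_{\text{opt}}$. Treating an empty critique as the convergence signal is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TextVariable:
    """Hypothetical stand-in for a lifted prompt variable."""
    value: str
    role_description: str
    gradients: list = field(default_factory=list)  # per-variable critique buffer


def textgrad_loop(variables, run_agent, evaluator, optimizer, budget):
    for _ in range(budget):
        # Execute: run the agent under the current prompt texts.
        trace = run_agent({name: var.value for name, var in variables.items()})
        # Loss: the evaluator returns a critique string (the textual gradient).
        critique = evaluator(trace)
        # Backward: distribute the same critique to every optimizable variable.
        for var in variables.values():
            var.gradients.append(critique)
        # Improve + Commit: rewrite each prompt, then clear its gradient buffer.
        for var in variables.values():
            var.value = optimizer(var.role_description, var.value, var.gradients)
            var.gradients.clear()
        if critique == "":  # assumed convergence signal: nothing left to criticize
            break
    return variables


def run_agent(prompts):
    """Stub agent: the trace is simply the system prompt it was given."""
    return prompts["system"]


def evaluator(trace):
    """Stub evaluator LLM: asks for step-by-step reasoning until it is present."""
    return "" if "step by step" in trace else "Ask the agent to reason step by step."


def optimizer(role, value, gradients):
    """Stub optimizer LLM: appends one instruction when any critique is non-empty."""
    return value + " Reason step by step." if any(gradients) else value


result = textgrad_loop(
    {"system": TextVariable("Solve the task.", "system prompt for the solver")},
    run_agent, evaluator, optimizer, budget=3,
)
```

The loop stops on the second iteration, once the evaluator returns an empty critique and the optimizer leaves the prompt unchanged.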

E.3.6 Reinforce++ Optimizer

Evolvable Variables. Reinforce++ optimizes a trainable subset of RSPL resources, focusing on prompt variables and tool implementations (native scripts, MCP tools [1], and agent skills [2]), and optionally refining the produced solution text. Our implementation follows a two-stage structure: (i) update trainable variables that govern behavior (e.g., prompts and tools), and (ii) update the solution itself when enabled.

Operator Algebra. Reinforce++ is characterized by a clipped objective with an explicit penalty to a reference solution, while using reflection to translate RL signals into concrete edits. We group the method into a small set of core operators:

  • Sample ($\chi_{\mathrm{rpp}}$). $\chi_{\mathrm{rpp}}:(\mathcal{A},\mathcal{V}_{\text{evo}},x,f)\rightarrow\mathcal{Z}$ samples a rollout under the current resources and yields an execution trace containing the produced answer.
  • Reward ($\varepsilon_{\mathrm{rpp}}$). $\varepsilon_{\mathrm{rpp}}:(y^{(t)},y^{(t-1)},y^{*},y_{\mathrm{sft}})\rightarrow(r^{(t)},A^{(t)},J^{(t)},\pi^{(t)})$ computes the RL signal tuple from the current solution $y^{(t)}$. Here $r^{(t)}$ is a task reward comparing $y^{(t)}$ with $y^{*}$, and $\pi^{(t)}$ is a policy ratio surrogate approximated via text similarity $\eta(\cdot,\cdot)$ as $\pi^{(t)}\triangleq\eta(y^{(t-1)},y^{(t)})$ (since token-level probability ratios are unavailable in inference-only LLM settings). We define a penalty to a reference solution $y_{\mathrm{sft}}$ as $\mathrm{pen}^{(t)}\triangleq\beta\,\bigl|\log\max(\eta(y_{\mathrm{sft}},y^{(t)}),\epsilon_{0})\bigr|$ and set $A^{(t)}\triangleq r^{(t)}-\mathrm{pen}^{(t)}$. The clipped Reinforce++ objective is
    $$J^{(t)}\triangleq\min\bigl(\pi^{(t)}A^{(t)},\;\bar{\pi}^{(t)}A^{(t)}\bigr),\quad\bar{\pi}^{(t)}\triangleq\mathrm{clip}(\pi^{(t)},1-\epsilon,1+\epsilon).$$
  • Diagnose ($\delta_{\mathrm{rpp}}$). $\delta_{\mathrm{rpp}}:(\mathcal{Z},\mathcal{V}_{\text{train}},r^{(t)},A^{(t)},J^{(t)},\pi^{(t)})\rightarrow\mathcal{H}$ produces an edit-oriented diagnosis that is explicitly conditioned on the RL metrics and the execution trace.
  • Improve ($\iota_{\mathrm{rpp}}$). $\iota_{\mathrm{rpp}}:(\mathcal{V},\mathcal{H})\rightarrow\mathcal{V}^{\prime}_{\text{evo}}$ applies RL-informed edits to either (i) the trainable resources $\mathcal{V}_{\text{train}}$ such as prompts and tools, or (ii) the solution variable itself when solution refinement is enabled, yielding a candidate state.
  • Commit ($\kappa_{\mathrm{rpp}}$). $\kappa_{\mathrm{rpp}}:\mathcal{V}^{\prime}_{\text{evo}}\rightarrow\mathcal{V}_{\text{evo}}$ applies accepted updates back to RSPL resources, completing the state transition.
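
The Reward operator's arithmetic can be checked with a short sketch. Several pieces are assumptions made for illustration and are not specified in the text: `difflib.SequenceMatcher` as the similarity $\eta$, an exact-match task reward, and the default values of `beta`, `eps`, and `eps0`.

```python
import math
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Text similarity eta in [0, 1]; SequenceMatcher is an assumed choice."""
    return SequenceMatcher(None, a, b).ratio()


def reinforce_pp_signals(y_t, y_prev, y_star, y_sft, beta=0.1, eps=0.2, eps0=1e-6):
    """Return (r, A, J, pi) following the definitions of the Reward operator."""
    r = 1.0 if y_t.strip() == y_star.strip() else 0.0   # assumed exact-match reward
    pi = similarity(y_prev, y_t)                        # policy ratio surrogate
    pen = beta * abs(math.log(max(similarity(y_sft, y_t), eps0)))
    adv = r - pen                                       # A = r - pen
    pi_bar = min(max(pi, 1.0 - eps), 1.0 + eps)         # clip(pi, 1 - eps, 1 + eps)
    j = min(pi * adv, pi_bar * adv)                     # clipped objective
    return r, adv, j, pi


# Unchanged, correct answer that matches the reference: no penalty, ratio of 1.
same = reinforce_pp_signals("42", "42", "42", "42")
# Wrong answer that drifted from both the previous and the reference solution.
r, adv, j, pi = reinforce_pp_signals("41", "42", "42", "42")
```

With a similarity bounded by 1, as assumed here, the upper clip bound $1+\epsilon$ never binds; clipping only takes effect when the new solution differs strongly from the previous one ($\pi<1-\epsilon$) and the advantage is negative.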

The Evolutionary Loop. Algorithm 3 summarizes the Reinforce++ loop in a phased form. Each iteration (i) computes Reinforce++ signals via the clipped objective and the penalty to the reference solution, (ii) improves trainable resources through RL-conditioned reflection and edits, (iii) optionally improves the solution text, and (iv) applies an early-stopping evaluation.

Algorithm 3 Reinforce++ Optimization Loop
Input: Agentic System $\mathcal{A}$, task $x$, attachments $f$ (optional), ground truth $y^{*}$, reference solution $y_{\mathrm{sft}}$, Budget $T$
Output: Final solution $y^{(t)}$ and updated trainable resources $\mathcal{V}_{\text{train}}$
// Initialization
$\mathcal{V}_{\text{evo}}^{(0)}\leftarrow\text{VariableLifting}(\mathcal{A})$ ⊳ Lift trainable resources
$\mathcal{Z}^{(0)}\leftarrow\chi_{\mathrm{rpp}}(\mathcal{A},\mathcal{V}_{\text{evo}}^{(0)},x,f)$ ⊳ Sample once
Extract solution $y^{(0)}$ from $\mathcal{Z}^{(0)}$
$y^{(-1)}\leftarrow y^{(0)}$ ⊳ Initialize previous solution
for $t=0,1,\ldots,T-1$ do
  // Phase 1: Reinforce++ reward and objective
  $(r^{(t)},A^{(t)},J^{(t)},\pi^{(t)})\leftarrow\varepsilon_{\mathrm{rpp}}(y^{(t)},y^{(t-1)},y^{*},y_{\mathrm{sft}})$ ⊳ Reward, penalty, clipped objective
  // Phase 2: Improve trainable resources (prompt and tool)
  $\mathcal{V}_{\text{train}}^{(t)}\leftarrow\text{GetTrainables}(\mathcal{V}_{\text{evo}}^{(t)})$
  $\mathcal{H}_{\text{train}}^{(t)}\leftarrow\delta_{\mathrm{rpp}}(\mathcal{Z}^{(t)},\mathcal{V}_{\text{train}}^{(t)},r^{(t)},A^{(t)},J^{(t)},\pi^{(t)})$ ⊳ Diagnose conditioned on RL signals
  $\widetilde{\mathcal{V}}_{\text{train}}^{(t+1)}\leftarrow\iota_{\mathrm{rpp}}(\mathcal{V}_{\text{train}}^{(t)},\mathcal{H}_{\text{train}}^{(t)})$ ⊳ Apply edits to trainables (candidate)
  $\mathcal{V}_{\text{train}}^{(t+1)}\leftarrow\kappa_{\mathrm{rpp}}(\widetilde{\mathcal{V}}_{\text{train}}^{(t+1)})$ ⊳ Commit updates
  // Phase 3: Re-run under updated resources
  $\mathcal{Z}^{(t+1)}\leftarrow\chi_{\mathrm{rpp}}(\mathcal{A},\mathcal{V}_{\text{evo}}^{(t)}\cup\mathcal{V}_{\text{train}}^{(t+1)},x,f)$
  Extract solution $y^{(t+1)}$ from $\mathcal{Z}^{(t+1)}$
  // Phase 4: Optional solution refinement
  $\mathcal{H}_{\text{sol}}^{(t)}\leftarrow\delta_{\mathrm{rpp}}(\mathcal{Z}^{(t+1)},\{y^{(t+1)}\},r^{(t)},A^{(t)},J^{(t)},\pi^{(t)})$ ⊳ Diagnose solution quality
  $\widetilde{y}^{(t+1)}\leftarrow\iota_{\mathrm{rpp}}(y^{(t+1)},\mathcal{H}_{\text{sol}}^{(t)})$ ⊳ Edit solution text (candidate)
  $y^{(t+1)}\leftarrow\kappa_{\mathrm{rpp}}(\widetilde{y}^{(t+1)})$ ⊳ Commit solution update
  // Phase 5: Early stopping
  if $\text{Satisfied}(\mathcal{Z}^{(t+1)})$ then
    break
  end if
  $y^{(t)}\leftarrow y^{(t+1)}$ ⊳ Advance current solution
end for
return $y^{(t)}$

E.3.7 GRPO Optimizer

Evolvable Variables. GRPO optimizes a trainable subset of RSPL resources, focusing on prompt variables and tool implementations (native scripts, MCP tools [1], and agent skills [2]), and optionally refining the produced solution text. Similar to Reinforce++, our implementation follows a two-stage structure: (i) update trainable variables that govern behavior (e.g., prompts and tools), and (ii) update the solution itself when enabled.

Operator Algebra. GRPO is characterized by sampling multiple candidate solutions per step and using group-normalized advantages with a clipped objective. We formalize the method with the following core operators:

  • Sample ($\chi_{\mathrm{grpo}}$). $\chi_{\mathrm{grpo}}:(\mathcal{A},\mathcal{V}_{\text{evo}},x,f,K)\rightarrow\{\mathcal{Z}_{i}\}_{i=1}^{K}$ samples $K$ independent rollouts under the current resources, yielding $K$ execution traces each containing a candidate solution $y_{i}$.
  • Reward ($\varepsilon_{\mathrm{grpo}}$). $\varepsilon_{\mathrm{grpo}}:(\{y_{i}\}_{i=1}^{K},y^{*},y^{(t-1)})\rightarrow(\{r_{i}\}_{i=1}^{K},\{A_{i}\}_{i=1}^{K},\{J_{i}\}_{i=1}^{K},\{\pi_{i}\}_{i=1}^{K})$ computes RL signals for all $K$ candidates. For each candidate $y_{i}$, we compute a task reward $r_{i}$ comparing $y_{i}$ with $y^{*}$, a policy ratio surrogate $\pi_{i}\triangleq\eta(y^{(t-1)},y_{i})$ approximated via text similarity $\eta(\cdot,\cdot)$ (since token-level probability ratios are unavailable in inference-only LLM settings), and a group-normalized advantage $A_{i}$ obtained by normalizing rewards across the candidate set: $A_{i}=(r_{i}-\bar{r})/\sigma_{r}$, where $\bar{r}$ and $\sigma_{r}$ are the mean and standard deviation of $\{r_{i}\}_{i=1}^{K}$. The GRPO clipped objective for each candidate is
    $$J_{i}\triangleq\min\bigl(\pi_{i}A_{i},\;\bar{\pi}_{i}A_{i}\bigr),\quad\bar{\pi}_{i}\triangleq\begin{cases}\min(\pi_{i},1+\epsilon)&\text{if }A_{i}\geq 0\\ \max(\pi_{i},1-\epsilon)&\text{if }A_{i}<0\end{cases}.$$
  • Diagnose ($\delta_{\mathrm{grpo}}$). $\delta_{\mathrm{grpo}}:(\{\mathcal{Z}_{i}\}_{i=1}^{K},\mathcal{V}_{\text{train}},\{r_{i},A_{i},J_{i},\pi_{i}\}_{i=1}^{K})\rightarrow\mathcal{H}$ produces an edit-oriented diagnosis that is explicitly conditioned on the multiple candidate solutions and their RL metrics, enabling the optimizer to identify patterns across candidates.
  • Improve ($\iota_{\mathrm{grpo}}$). $\iota_{\mathrm{grpo}}:(\mathcal{V},\mathcal{H})\rightarrow\mathcal{V}^{\prime}_{\text{evo}}$ applies RL-informed edits to either (i) the trainable resources $\mathcal{V}_{\text{train}}$ such as prompts and tools, or (ii) the solution variable itself when solution refinement is enabled, yielding a candidate state.
  • Commit ($\kappa_{\mathrm{grpo}}$). $\kappa_{\mathrm{grpo}}:\mathcal{V}^{\prime}_{\text{evo}}\rightarrow\mathcal{V}_{\text{evo}}$ applies accepted updates back to RSPL resources, completing the state transition.
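
The group-normalized advantage and the per-candidate clipped objective can be computed as in the sketch below. It makes two assumptions that the text leaves open: $\sigma_{r}$ is taken as the population standard deviation, and a group with identical rewards is assigned zero advantages to avoid dividing by zero. The function names and the default `eps` are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """A_i = (r_i - mean) / std over the candidate group."""
    r_bar = mean(rewards)
    sigma = pstdev(rewards)          # assumed: population standard deviation
    if sigma == 0:                   # assumed guard: identical rewards carry no signal
        return [0.0 for _ in rewards]
    return [(r - r_bar) / sigma for r in rewards]


def clipped_objective(pi, adv, eps=0.2):
    """J_i = min(pi * A, pi_bar * A) with the sign-dependent clip from the text."""
    pi_bar = min(pi, 1.0 + eps) if adv >= 0 else max(pi, 1.0 - eps)
    return min(pi * adv, pi_bar * adv)


# Four candidates, two of which solved the task.
advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
# Surrogate ratios for each candidate (illustrative values).
objectives = [clipped_objective(pi, a) for pi, a in zip([0.9, 0.5, 1.0, 0.7], advantages)]
```

Because the advantages are centered within the group, candidates are scored relative to their siblings and not against an absolute baseline, which is what lets the diagnosis step contrast successful and failed rollouts.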

The Evolutionary Loop. Algorithm 4 summarizes the GRPO loop in a phased form. Each iteration (i) samples $K$ candidate solutions, (ii) computes GRPO signals via group-normalized advantages and clipped objectives, (iii) improves trainable resources through multi-candidate-conditioned reflection and edits, (iv) optionally improves the solution text, and (v) applies an early-stopping evaluation.

Algorithm 4 GRPO Optimization Loop
Input: Agentic System $\mathcal{A}$, task $x$, attachments $f$ (optional), ground truth $y^{*}$, Budget $T$, number of candidates $K$
Output: Final solution $y^{(t)}$ and updated trainable resources $\mathcal{V}_{\text{train}}$
// Initialization
$\mathcal{V}_{\text{evo}}^{(0)}\leftarrow\text{VariableLifting}(\mathcal{A})$ ⊳ Lift trainable resources
$\mathcal{Z}^{(0)}\leftarrow\chi_{\mathrm{grpo}}(\mathcal{A},\mathcal{V}_{\text{evo}}^{(0)},x,f,1)$ ⊳ Sample initial solution
Extract solution $y^{(0)}$ from $\mathcal{Z}^{(0)}$
$y^{(-1)}\leftarrow y^{(0)}$ ⊳ Initialize previous solution
for $t=0,1,\ldots,T-1$ do
  // Phase 1: Sample multiple candidates
  $\{\mathcal{Z}_{i}^{(t)}\}_{i=1}^{K}\leftarrow\chi_{\mathrm{grpo}}(\mathcal{A},\mathcal{V}_{\text{evo}}^{(t)},x,f,K)$ ⊳ Sample $K$ rollouts
  Extract candidate solutions $\{y_{i}^{(t)}\}_{i=1}^{K}$ from $\{\mathcal{Z}_{i}^{(t)}\}_{i=1}^{K}$
  // Phase 2: GRPO reward and objective
  $(\{r_{i}^{(t)}\}_{i=1}^{K},\{A_{i}^{(t)}\}_{i=1}^{K},\{J_{i}^{(t)}\}_{i=1}^{K},\{\pi_{i}^{(t)}\}_{i=1}^{K})\leftarrow\varepsilon_{\mathrm{grpo}}(\{y_{i}^{(t)}\}_{i=1}^{K},y^{*},y^{(t-1)})$ ⊳ Group-normalized advantages, clipped objectives
  // Phase 3: Improve trainable resources (prompt and tool)
  $\mathcal{V}_{\text{train}}^{(t)}\leftarrow\text{GetTrainables}(\mathcal{V}_{\text{evo}}^{(t)})$
  $\mathcal{H}_{\text{train}}^{(t)}\leftarrow\delta_{\mathrm{grpo}}(\{\mathcal{Z}_{i}^{(t)}\}_{i=1}^{K},\mathcal{V}_{\text{train}}^{(t)},\{r_{i}^{(t)},A_{i}^{(t)},J_{i}^{(t)},\pi_{i}^{(t)}\}_{i=1}^{K})$ ⊳ Diagnose conditioned on multi-candidate RL signals
  $\widetilde{\mathcal{V}}_{\text{train}}^{(t+1)}\leftarrow\iota_{\mathrm{grpo}}(\mathcal{V}_{\text{train}}^{(t)},\mathcal{H}_{\text{train}}^{(t)})$ ⊳ Apply edits to trainables (candidate)
  $\mathcal{V}_{\text{train}}^{(t+1)}\leftarrow\kappa_{\mathrm{grpo}}(\widetilde{\mathcal{V}}_{\text{train}}^{(t+1)})$ ⊳ Commit updates
  // Phase 4: Re-run under updated resources
  $\mathcal{Z}^{(t+1)}\leftarrow\chi_{\mathrm{grpo}}(\mathcal{A},\mathcal{V}_{\text{evo}}^{(t)}\cup\mathcal{V}_{\text{train}}^{(t+1)},x,f,1)$
  Extract solution $y^{(t+1)}$ from $\mathcal{Z}^{(t+1)}$
  // Phase 5: Optional solution refinement
  $\mathcal{H}_{\text{sol}}^{(t)}\leftarrow\delta_{\mathrm{grpo}}(\{\mathcal{Z}_{i}^{(t)}\}_{i=1}^{K},\{y^{(t+1)}\},\{r_{i}^{(t)},A_{i}^{(t)},J_{i}^{(t)},\pi_{i}^{(t)}\}_{i=1}^{K})$ ⊳ Diagnose solution quality using multi-candidate context
  $\widetilde{y}^{(t+1)}\leftarrow\iota_{\mathrm{grpo}}(y^{(t+1)},\mathcal{H}_{\text{sol}}^{(t)})$ ⊳ Edit solution text (candidate)
  $y^{(t+1)}\leftarrow\kappa_{\mathrm{grpo}}(\widetilde{y}^{(t+1)})$ ⊳ Commit solution update
  // Phase 6: Early stopping
  if $\text{Satisfied}(\mathcal{Z}^{(t+1)})$ then
    break
  end if
  $y^{(t)}\leftarrow y^{(t+1)}$ ⊳ Advance current solution
end for
return $y^{(t)}$
