MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution

arXiv cs.LG Papers

Summary

MetaEvo proposes a two-stage framework for continual evolution of LLM-based agents, using preference-based optimization to enhance principle abstraction and modular architecture for experience reuse, outperforming strong baselines on reasoning benchmarks.

arXiv:2606.07603v1 Announce Type: new

# MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution
Source: [https://arxiv.org/html/2606.07603](https://arxiv.org/html/2606.07603)
Bowen Ren¹, Heyan Huang¹,², Yinghao Li¹, Yang Gao¹,²

¹ School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China

² Beijing Institute of Technology Southeast Academy of Information Technology, Putian, China

{bwren-bit, hhy63, yhli, gyang}@bit.edu.cn

###### Abstract

Large language models (LLMs) exhibit strong reasoning capabilities, yet most LLM-based agents are statically deployed and unable to improve through task interactions. Existing experience-driven methods often rely on memory or heuristics without enhancing the model's ability to learn, treating it as a passive executor and leading to early performance plateaus and limited long-term improvement. To address this issue, we propose MetaEvo, a two-stage framework for continual agent evolution that focuses on improving how the model learns from task experience, rather than solely on what it stores. MetaEvo first applies preference-based optimization to enhance the model's ability to abstract principles, then enables the accumulation and reuse of these principles within a modular agent architecture. Experimental results on diverse reasoning benchmarks demonstrate that MetaEvo consistently outperforms strong baselines and maintains reliable improvement across iterations. These findings validate the effectiveness of meta-optimization in enabling agents to learn from experience and continually enhance their reasoning capabilities.

## 1 Introduction

Large Language Models (LLMs) have demonstrated strong performance across a wide range of natural language processing tasks (Brown et al., [2020](https://arxiv.org/html/2606.07603#bib.bib2); Touvron et al., [2023](https://arxiv.org/html/2606.07603#bib.bib3); Vaswani et al., [2017](https://arxiv.org/html/2606.07603#bib.bib1)). However, most LLM-based agents are statically deployed and cannot accumulate or reuse knowledge learned from past successes and failures, collectively known as experience, leading to repeated reasoning or planning errors (Madaan et al., [2023](https://arxiv.org/html/2606.07603#bib.bib11); Shinn et al., [2023](https://arxiv.org/html/2606.07603#bib.bib10); Li et al., [2023](https://arxiv.org/html/2606.07603#bib.bib26); Gou et al., [2024](https://arxiv.org/html/2606.07603#bib.bib12); Yang et al., [2023](https://arxiv.org/html/2606.07603#bib.bib4); Chen et al., [2024](https://arxiv.org/html/2606.07603#bib.bib7)). Recent research has begun to explore experience-driven agent evolution by distilling accumulated task interaction data into structured and memory-based knowledge, enabling agents to progressively refine their behavior and support continual self-evolution. As illustrated in Figure [1](https://arxiv.org/html/2606.07603#S1.F1), many existing methods represent accumulated task experience as high-level textual principles that provide guidelines for correcting model behavior and are explicitly injected into the context during inference to guide reasoning and decision-making (Gao et al., [2024](https://arxiv.org/html/2606.07603#bib.bib6); Zhao et al., [2024](https://arxiv.org/html/2606.07603#bib.bib5); Cai et al., [2025](https://arxiv.org/html/2606.07603#bib.bib46)).

![Refer to caption](https://arxiv.org/html/2606.07603v1/x1.png)

Figure 1: An instance of principle-guided generation.

While such principles provide an effective way to summarize past mistakes and guide future behavior correction, they are not continually optimized and exhibit unstable quality, positioning the LLM primarily as a passive executor rather than an active adapter during agent evolution. The fundamental limitation is that existing approaches treat learning from experience as a static procedure, lacking a mechanism to optimize the capacity for learning itself, which often leads to limited improvement during evolution and prevents sustained performance gains.

To address this issue, we treat the process of principle extraction as a learnable and optimizable capability, framing it as an evolving meta\-ability\. To achieve this, we propose MetaEvo, a framework that shifts the focus from optimizing performance outcomes to dynamically enhancing the learning process\. The framework consists of two stages: meta\-optimization for improving principle abstraction, and principle accumulation, an iterative evolutionary cycle\.

Specifically, in the first stage, we enhance the model's meta-ability to learn how to abstract high-quality principles by distilling insights from stronger and more abstract alternatives, referred to as meta-optimization. To achieve this, we leverage a more capable external model to construct a preference dataset and then apply Direct Preference Optimization (DPO) (Rafailov et al., [2023](https://arxiv.org/html/2606.07603#bib.bib29)) to align the model's outputs with the preferred principles. In the second stage, we build a principle library using the meta-optimized model and apply it to guide generation during the inference phase. Through multiple iterations of this procedure, the principle library is progressively refined, enabling it to provide increasingly effective guidance for the model.

The framework is implemented through a modular agent system consisting of plan, memory, and execution modules. Within the plan module, to address the issue that principles derived by current methods are often overly generic and misaligned with concrete reasoning errors, we design a Contrast-Driven Abstraction (CDA) method that enables the model to generate targeted and actionable principles by contrasting fine-grained differences in answers. This module is applied in both stages of MetaEvo. In addition, the memory module maintains a structured repository of principles, while the execution module is responsible for retrieving and applying these principles; both modules are applied in the second stage.

We evaluate MetaEvo across diverse reasoning benchmarks and observe consistent improvements over strong baselines, validating the effectiveness of meta\-ability optimization\. Moreover, meta\-ability enables sustained performance improvement through iterative self\-evolution, rather than short\-lived gains\. Our contributions can be summarized as follows:

- We propose MetaEvo, an experience-driven framework featuring meta-optimization and principle accumulation, and evaluate its effectiveness across diverse reasoning benchmarks, where it consistently outperforms competitive baselines.
- We introduce meta-optimization as a learning paradigm for agent systems, showing that optimizing intermediate capabilities leads to more effective self-improvement than directly optimizing final outputs.
- We propose a contrast-driven principle extraction method, CDA, which ensures that the derived principles directly address underlying strategic errors and provide actionable guidance.

## 2 Related Work

### 2.1 Experience-Driven Evolution

Recent work has explored self-evolving agents that improve through continual learning from experience (Gao et al., [2025](https://arxiv.org/html/2606.07603#bib.bib30)). A major line of research focuses on identifying model failures and distilling corrective principles or reusable knowledge, which are incorporated via auxiliary supervision or external memory to guide future behavior (Sun et al., [2023](https://arxiv.org/html/2606.07603#bib.bib15); Yang et al., [2023](https://arxiv.org/html/2606.07603#bib.bib4); Madaan et al., [2023](https://arxiv.org/html/2606.07603#bib.bib11)). Related methods enhance adaptability through retrieval-augmented inference and memory-based prompting, enabling models to reference prior corrections during generation. Additional efforts address scalability via structured experience replay and memory management (Gao et al., [2024](https://arxiv.org/html/2606.07603#bib.bib6); Zhao et al., [2024](https://arxiv.org/html/2606.07603#bib.bib5); Gong et al., [2024](https://arxiv.org/html/2606.07603#bib.bib9); Liu et al., [2025](https://arxiv.org/html/2606.07603#bib.bib42); Ouyang et al., [2025](https://arxiv.org/html/2606.07603#bib.bib43); Xu et al., [2025a](https://arxiv.org/html/2606.07603#bib.bib44)). Experience-driven evolution has also been extended to embodied agents that refine actions through real-world feedback (Li et al., [2024](https://arxiv.org/html/2606.07603#bib.bib45)).

Separately, meta-level optimization has gained increasing attention, where the learning process itself becomes the object of improvement. Some approaches introduce secondary evaluators to shape learning rewards (xiong2025mpoboostingllmagents), while others design higher-order planning mechanisms that reason over planning strategies (wu2025meta). These methods share MetaEvo's high-level goal of enhancing model performance by optimizing beyond direct task execution; MetaEvo, in particular, enables the model to learn how to derive better principles for future self-improvement.

### 2.2 Memory-based Agent System

Equipping LLM agents with external memory has emerged as a core approach for enabling continual learning, behavioral adaptation, and long-horizon reasoning. Early studies primarily rely on experience replay, storing intermediate reasoning steps or error-feedback pairs to guide future decisions (Yang et al., [2023](https://arxiv.org/html/2606.07603#bib.bib4); Li and Qiu, [2023](https://arxiv.org/html/2606.07603#bib.bib8); Gao et al., [2024](https://arxiv.org/html/2606.07603#bib.bib6)). Complementary to this line, structural memory methods organize stored information into semantically or functionally structured segments, improving interpretability and retrieval efficiency (Zeng et al., [2024](https://arxiv.org/html/2606.07603#bib.bib23); Zhao et al., [2024](https://arxiv.org/html/2606.07603#bib.bib5)). More recent work shifts attention to memory management and evolution for long-term agent behavior. MemoryBank (Zhong et al., [2023](https://arxiv.org/html/2606.07603#bib.bib40)) and Task-Core Memory (Huai et al., [2025](https://arxiv.org/html/2606.07603#bib.bib39)) study retention and consolidation mechanisms to mitigate forgetting in continual settings. Beyond static storage, A-MEM (Xu et al., [2025b](https://arxiv.org/html/2606.07603#bib.bib38)) and Mem0 (Chhikara et al., [2025](https://arxiv.org/html/2606.07603#bib.bib41)) explicitly model memory evolution, enabling agents to refine and reorganize stored knowledge over time.

Collectively, these mechanisms represent a paradigm shift from static retrieval\-based memory to dynamically evolving knowledge architectures\.

![Refer to caption](https://arxiv.org/html/2606.07603v1/x2.png)

Figure 2: Illustration of the pipeline of the MetaEvo framework. (1) Meta Optimization: We first train a model to enhance its core meta-ability through preference-based learning on principles. (2) Principle Accumulation: The enhanced model then abstracts and accumulates a refined set of principles into a structured memory module, which can be iteratively expanded. At inference time, the agent retrieves the most relevant principles from memory to steer its final response.

## 3 Methodology

This section presents MetaEvo, a meta-optimization framework implemented as a modular agent system that enables experience-driven and principle-guided evolution through its three core modules: plan, memory, and execution. Accordingly, we first present the overall workflow of the framework, and then delve into the specifics of each constituent module.

### 3.1 Framework Pipeline

#### 3.1.1 Meta-Optimization

In this stage, we fine-tune the base LLM to enhance its meta-ability, the capacity to learn how to abstract and internalize actionable and instructive principles from experience. Rather than learning to solve tasks directly, the model is optimized to develop a preference for principles that provide clearer and more operational corrective guidance, formulated as a meta-optimization problem over revision knowledge.

For each query $q_i$, the base model first produces an initial answer $a_i$. A principle abstraction process then takes $(q_i, a_i, y_i)$, where $y_i$ is the reference answer, to derive a revision principle. We perform this abstraction twice: once using the base model to obtain a less instructive principle $p_i^{-}$, and once using a more capable external LLM to obtain a preferred principle $p_i^{+}$. This yields a preference pair $(p_i^{+}, p_i^{-})$ for each query, forming the meta-optimization dataset:

$$\mathcal{D}_{\text{meta}} = \left\{ \left(q_i, p_i^{+}, p_i^{-}\right) \right\}_{i=1}^{N}. \tag{1}$$

We optimize the base model on $\mathcal{D}_{\text{meta}}$ using Direct Preference Optimization (DPO), which incorporates pairwise preference supervision by minimizing the expected loss:

$$\min_{\theta}\, \mathbb{E}_{(q, p^{+}, p^{-}) \sim \mathcal{D}_{\text{meta}}} \left[ \mathcal{L}_{\text{meta}}(\pi_{\theta}; q, p^{+}, p^{-}) \right], \tag{2}$$

where $\pi_{\theta}$ denotes the model parameterized by $\theta$. The loss encourages higher likelihood for the preferred principle $p^{+}$ over the dispreferred $p^{-}$:

$$\mathcal{L}_{\text{meta}} = -\mathbb{E}_{(q, p^{+}, p^{-})} \left[ \log \sigma\Big( \beta \big( \log \pi_{\theta}(p^{+} \mid q) - \log \pi_{\theta}(p^{-} \mid q) \big) \Big) \right], \tag{3}$$

where $\sigma(\cdot)$ is the sigmoid function and $\beta$ is a temperature parameter.

This process yields a meta-enhanced LLM with an improved capacity for abstracting and internalizing corrective principles, serving as the foundation for the subsequent principle accumulation stage.
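To make the objective in Eq. (3) concrete, the following is a minimal sketch that evaluates the loss from precomputed sequence log-probabilities. The names `PreferencePair`, `log_sigmoid`, and `meta_loss`, as well as the default `beta=0.1`, are illustrative assumptions and not part of the paper; a real training run would compute the log-probabilities with the model and backpropagate through them.

```python
import math
from typing import NamedTuple, Sequence


class PreferencePair(NamedTuple):
    """Log-probabilities of both principles for one query (hypothetical container)."""
    logp_preferred: float     # log pi_theta(p+ | q)
    logp_dispreferred: float  # log pi_theta(p- | q)


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def meta_loss(pairs: Sequence[PreferencePair], beta: float = 0.1) -> float:
    """Mean of -log sigma(beta * (log pi(p+|q) - log pi(p-|q))), following Eq. (3).

    beta=0.1 is an assumed default, not a value reported in the paper.
    """
    if not pairs:
        raise ValueError("need at least one preference pair")
    total = 0.0
    for pair in pairs:
        margin = beta * (pair.logp_preferred - pair.logp_dispreferred)
        total -= log_sigmoid(margin)
    return total / len(pairs)
```

When both principles are equally likely the margin is zero and the loss equals $\log 2$; it decreases as the model assigns relatively more probability to the preferred principle.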

#### 3.1.2 Principle Accumulation

The objective of this stage is to leverage the meta-optimized model to construct a rich and structured repository of high-quality principles. Using the same labeled dataset, the execution module retrieves task-relevant principles from memory to guide answer generation, producing a collection of queries, gold answers, and model-generated responses. The plan module then extracts candidate principles from these outputs.

The extracted principles are systematically organized and stored in the memory module, typically indexed by semantic representations of their corresponding tasks, thereby forming a task-oriented knowledge base.

This process is inherently iterative. In the first iteration ($t=1$), the memory is empty, and generation proceeds without guidance. At iteration $t$, the principle library $\text{Mem}_{t-1}$ from the previous iteration is used for retrieval. Newly generated principles $\mathcal{P}^{t}_{\text{new}}$ are then consolidated with $\text{Mem}_{t-1}$ to produce the updated memory $\text{Mem}_{t}$, which serves as the retrieval source for iteration $t+1$.

Benefiting from meta\-optimization, the model progressively identifies previously overlooked issues and generates corrective principles at each iteration\. As this process repeats, the memory expands and the model’s capabilities steadily improve, enabling continual evolution\.

Upon completion of the Principle Accumulation phase, the agent is equipped with a structured memory of reasoning principles\. For a new input, the agent retrieves the most relevant principles and integrates them into the model context as actionable guidance to steer the final generation\.
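The iterative cycle described above can be sketched as a small loop. This is a simplified illustration under several assumptions that are not taken from the paper: the callables `generate`, `abstract`, and `describe` are hypothetical stand-ins for the execution module, the plan module, and the task-descriptor generator; retrieval is an exact match on the descriptor; consolidation is plain deduplication; and principles are only extracted when the response differs from the gold answer.

```python
from typing import Callable, Dict, List, Tuple

Memory = Dict[str, List[str]]  # task descriptor -> list of principles


def accumulate(
    dataset: List[Tuple[str, str]],              # (query, gold answer) pairs
    generate: Callable[[str, List[str]], str],   # execution module (hypothetical)
    abstract: Callable[[str, str, str], str],    # plan module (hypothetical)
    describe: Callable[[str], str],              # query -> task descriptor (hypothetical)
    iterations: int,
) -> Memory:
    memory: Memory = {}
    for _ in range(iterations):
        # Retrieval at iteration t reads the library from iteration t-1.
        previous = {task: list(ps) for task, ps in memory.items()}
        for query, gold in dataset:
            task = describe(query)
            guidance = previous.get(task, [])    # empty in the first iteration
            response = generate(query, guidance)
            if response == gold:
                continue  # assumption: only failures yield new principles
            principle = abstract(query, response, gold)
            bucket = memory.setdefault(task, [])
            if principle not in bucket:          # simplistic consolidation
                bucket.append(principle)
    return memory
```

Keeping a snapshot of the previous library separates what is retrieved during an iteration from what is being written, which mirrors the distinction between $\text{Mem}_{t-1}$ and $\text{Mem}_{t}$.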

### 3.2 Agent System Components

#### 3.2.1 Plan Module

The plan module serves as the central reasoning component of the agent system, responsible for identifying deficiencies in a given response and abstracting generalizable principles for improvement. We achieve this through a systematic method termed Contrast-Driven Abstraction (CDA).

CDA operates through a two-step pipeline: (1) Discrepancy Analysis and (2) Principle Abstraction, as illustrated in Figure [3](https://arxiv.org/html/2606.07603#S3.F3).

In the first step, Discrepancy Analysis, the LLM is prompted with the user query $q$, the base model's response $x$, and an expert reference answer $y$. The goal of this stage is to conduct a fine-grained comparative analysis and produce a structured discrepancy representation, denoted as $\Delta$.

The resulting structure $\Delta$ enumerates the identified discrepancies as a list of entries, each comprising four elements: the aspect under comparison, a high-quality excerpt from the reference answer, the corresponding deficiency in the model-generated response, and an explanation characterizing the nature of the difference.

In the second step, Principle Abstraction, the structured discrepancy representation $\Delta$ is provided as input to the LLM for abstraction. The model synthesizes the detailed comparisons and distills them into a single, high-quality revision principle $p$. This principle is expressed as a concise and actionable natural-language directive that captures the core lesson revealed by the contrastive analysis.
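The two-step CDA pipeline can be sketched as follows. The `Discrepancy` fields mirror the four elements listed above; everything else is an assumption made for illustration: the prompt wording is hypothetical (the paper's prompts are not reproduced here), and the `analyze` and `abstract` callables stand in for LLM calls plus output parsing supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Discrepancy:
    """One entry of the structured discrepancy representation (Delta)."""
    aspect: str               # aspect under comparison
    reference_excerpt: str    # high-quality excerpt from the reference answer
    response_deficiency: str  # corresponding deficiency in the model response
    explanation: str          # nature of the difference


def analysis_prompt(query: str, response: str, reference: str) -> str:
    # Hypothetical wording for the Discrepancy Analysis step.
    return (
        "Compare the response with the reference answer and list each discrepancy.\n"
        f"Query: {query}\nResponse: {response}\nReference: {reference}"
    )


def abstraction_prompt(delta: List[Discrepancy]) -> str:
    # Hypothetical wording; serializes Delta for the Principle Abstraction step.
    lines = [
        f"- {d.aspect}: reference says '{d.reference_excerpt}'; "
        f"response issue: {d.response_deficiency} ({d.explanation})"
        for d in delta
    ]
    return "Distill one concise, actionable revision principle from:\n" + "\n".join(lines)


def contrast_driven_abstraction(
    query: str,
    response: str,
    reference: str,
    analyze: Callable[[str], List[Discrepancy]],  # LLM call + parsing (assumed)
    abstract: Callable[[str], str],               # LLM call returning principle text (assumed)
) -> str:
    delta = analyze(analysis_prompt(query, response, reference))  # step 1
    return abstract(abstraction_prompt(delta)).strip()            # step 2
```

Passing the model calls in as callables keeps the sketch independent of any particular LLM client and makes the two steps easy to test in isolation.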

![Refer to caption](https://arxiv.org/html/2606.07603v1/x3.png)

Figure 3: Overview of the agent architecture in MetaEvo. The Plan Module derives revision principles via contrastive difference analysis. The Memory Module stores principles in a task-oriented structure and consolidates newly generated principles using three integration strategies. The Execution Module retrieves task-relevant principles to guide generation.

#### 3.2.2 Memory Module

The memory module serves as the agent's long-term knowledge repository, responsible for systematically storing and managing the principles generated by the plan module. We formalize the memory $Mem$ as a key–value structure in which each key corresponds to a task descriptor $t$, and each value is a set of task-relevant principles $\mathcal{P}_t$:

$$Mem = \{\, t \mapsto \mathcal{P}_t \mid t \in \mathcal{T} \,\} \tag{4}$$

Here, $\mathcal{T}$ denotes the set of all task descriptors. Each descriptor $t$ is a concise natural-language summary of an input query $q$, generated by an LLM to capture the query's core intent.

To ensure both the quality and efficiency of the repository, the memory is actively curated rather than passively appended. When a new principle $p_{\text{new}}$ is generated for a task $t$, it is validated against the existing principle set $\mathcal{P}_t$. This validation process, conducted by an LLM-based evaluator, examines semantic redundancy and logical consistency. For each existing principle $p \in \mathcal{P}_t$, the evaluator compares the pair $(p_{\text{new}}, p)$ and applies the following update rules:

- Redundant: If $p_{\text{new}}$ is deemed a paraphrase or minor variant of an existing principle, it replaces the older version, ensuring that the memory reflects the most current reasoning.
- Conflict: If $p_{\text{new}}$ contradicts an existing principle, both $p_{\text{old}}$ and $p_{\text{new}}$ are used to guide generation, and their correctness is evaluated. The superior principle is retained, while the other is discarded.
- Irrelevant: If $p_{\text{new}}$ is determined to be non-redundant and non-conflicting, it is directly incorporated into the memory for task $t$.

This curation mechanism prevents the accumulation of redundant or erroneous information, enabling the memory to evolve into a compact, coherent, and high\-quality knowledge base that supports effective principle\-guided generation\.
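The three update rules can be sketched as a single curation function. This is an illustration under stated assumptions: `judge` and `better` are hypothetical callables standing in for the LLM-based evaluator and for the generation-based comparison of conflicting principles, the verdict strings are invented labels, and stopping at the first redundant or conflicting match is a simplification not specified in the paper.

```python
from typing import Callable, Dict, List

Memory = Dict[str, List[str]]  # task descriptor -> list of principles


def curate(
    memory: Memory,
    task: str,
    p_new: str,
    judge: Callable[[str, str], str],   # (p_new, p_old) -> "redundant" | "conflict" | "irrelevant"
    better: Callable[[str, str], str],  # (p_old, p_new) -> the principle to keep
) -> None:
    """Insert p_new into memory[task], applying the three update rules in place."""
    principles = memory.setdefault(task, [])
    for i, p_old in enumerate(principles):
        verdict = judge(p_new, p_old)
        if verdict == "redundant":
            principles[i] = p_new                 # newer paraphrase replaces the older one
            return
        if verdict == "conflict":
            principles[i] = better(p_old, p_new)  # keep whichever guides generation better
            return
    principles.append(p_new)                      # neither redundant nor conflicting
```

Because each rule either replaces one slot or appends one entry, the principle set for a task never grows by more than one item per new principle.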

#### 3.2.3 Execution Module

The execution module is the agent's action-taking component, responsible for generating the final, principle-guided response during inference. It leverages the curated knowledge within the memory module to inform its generation process, which unfolds in two phases: Principle Retrieval and Guided Generation.

The retrieval phase begins with a new query $q$. The module first generates its task semantic description $s_q$, which serves as the semantic key for searching the memory. The key $s_q$ is then compared against all stored task descriptions $\mathcal{S} \subset \mathcal{M}$ to identify the most semantically similar entry, denoted as $s^{*}$. If the similarity score of the best match exceeds a predefined threshold $\tau_r$, the entire set of principles $\mathcal{P}^{*}$ associated with $s^{*}$ is retrieved.

In the Guided Generation phase, the retrieved principles $\mathcal{P}^{*}$ are incorporated into the context for a generation LLM, $\pi_{\text{meta}}$. These principles act as explicit, context-aware instructions or constraints, steering the model's output. The final response is thus generated with the benefit of proven, task-relevant strategies: $\pi_{\text{meta}}(q, \mathcal{P}^{*})$.

By dynamically retrieving and applying relevant knowledge, the execution module allows the agent to generalize from past experiences to new, unseen problems, ensuring its responses are not only accurate but also strategically sound.
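The retrieval and guided-generation phases can be sketched as below. Several details are assumptions made for illustration, since the paper does not specify them here: task descriptions are assumed to be already embedded as vectors, cosine similarity is used as the similarity score, and the prompt template in `guided_prompt` is hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],               # embedding of s_q (assumed precomputed)
    key_vecs: Dict[str, Sequence[float]],     # stored task description -> embedding
    memory: Dict[str, List[str]],             # stored task description -> principles
    tau_r: float,                             # retrieval threshold
) -> List[str]:
    """Return all principles of the best-matching entry if its score exceeds tau_r."""
    best_key, best_score = None, float("-inf")
    for key, vec in key_vecs.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_key, best_score = key, score
    if best_key is None or best_score <= tau_r:
        return []
    return list(memory.get(best_key, []))


def guided_prompt(query: str, principles: List[str]) -> str:
    """Hypothetical template that injects retrieved principles into the context."""
    if not principles:
        return query
    bullet_list = "\n".join(f"- {p}" for p in principles)
    return f"Follow these principles:\n{bullet_list}\n\nQuestion: {query}"
```

When no stored description clears the threshold, the sketch falls back to the unguided query, which matches the behavior described for the first iteration, when the memory is empty.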

## 4 Experiments

### 4.1 Experimental Setup

##### Datasets

We conduct our experiments and evaluate our model on three categories of datasets: (1) Arithmetic Reasoning: This category includes GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.07603#bib.bib16)), SVAMP (Patel et al., [2021](https://arxiv.org/html/2606.07603#bib.bib17)), and MATH (Hendrycks et al., [2021](https://arxiv.org/html/2606.07603#bib.bib18)). (2) Knowledge Reasoning: We select the MMLU (Hendrycks et al., [2020](https://arxiv.org/html/2606.07603#bib.bib19)) benchmark, focusing on a subset of high-relevance tasks in which the reasoning process is grounded in recognizing and applying known concepts, theories, or definitions. Specifically, we include College Biology, College Chemistry, College Computer Science, College Mathematics, College Medicine, College Physics, and Computer Security. (3) Complex Reasoning: This involves the Big-Bench Hard (BBH) subset (Suzgun et al., [2023](https://arxiv.org/html/2606.07603#bib.bib20)). We focus on four reasoning-intensive tasks, namely Logical Deduction, Tracking Shuffled Objects, Reasoning about Colored Objects, and Causal Judgement. These tasks require consistent state tracking, rule-based inference, or causal chain reasoning, and their difficulty arises from the complexity of the reasoning process itself.

In the meta-optimization stage, we use the training sets of GSM8K, SVAMP, and MATH to generate the DPO data for arithmetic reasoning tasks. For the knowledge and complex reasoning tasks, we utilize the CoT-Collection (Kim et al., [2023](https://arxiv.org/html/2606.07603#bib.bib34)), which is a consolidated dataset that unifies nine diverse tasks released in FLAN into the Chain-of-Thought format, to construct the DPO data.

##### Models

We conduct experiments using two backbone language models: LLaMA 3.1-8B-Instruct (Dubey et al., [2024](https://arxiv.org/html/2606.07603#bib.bib21)) and Qwen 2.5-14B-Instruct (Yang et al., [2024](https://arxiv.org/html/2606.07603#bib.bib22)). These models are used as the initial agent foundation for generation, principle abstraction, and memory interaction. We use DeepSeek-R1 (DeepSeek-AI et al., [2025](https://arxiv.org/html/2606.07603#bib.bib28)) as the strong language model for constructing preference pairs used in meta-optimization.

### 4.2 Baselines

We employ the following baseline methods:

- Base Model: the original model without additional optimization.
- Self-Refine (Madaan et al., [2023](https://arxiv.org/html/2606.07603#bib.bib11)): an iterative framework that optimizes model outputs through a recursive loop of autonomous feedback and self-correction.
- Self-ICL (Chen et al., [2023](https://arxiv.org/html/2606.07603#bib.bib36)): a zero-shot framework that prompts the model to generate its own pseudo-inputs and pseudo-labels, which are then prepended as few-shot demonstrations to guide the final inference.
- SE-GPT (Gao et al., [2024](https://arxiv.org/html/2606.07603#bib.bib6)): a framework that develops task-specific expertise by retrieving past experiences, practicing on self-generated tasks, and inducing new strategies into a persistent memory library.
- Self-Discover (Zhou et al., [2024](https://arxiv.org/html/2606.07603#bib.bib37)): a framework that enables LLMs to autonomously construct explicit, task-level reasoning paths by selecting, adapting, and composing atomic reasoning modules into an executable logical structure, which is then used to solve specific task instances.
- MetaEvo w/o MO: our framework using the base LLM directly to run principle accumulation, without applying meta-optimization.
- MetaEvo w/o CDA: our framework generating principles directly via prompting, without applying the Contrast-Driven Abstraction (CDA) method.
- MetaEvo: the full framework combining meta-optimization with CDA to enable experience-driven self-evolution.

Table 1: Performance (%) of reasoning and self-improvement methods across benchmarks. Bold numbers represent the best performance on each dataset, while underlined numbers denote the second-best results.

## 5 Analysis

### 5.1 Main Result

Table [1](https://arxiv.org/html/2606.07603#S4.T1) reports the evaluation results of our method across five benchmark datasets. Compared with the corresponding base models, our approach consistently achieves performance gains on all benchmarks.

On GSM8K and SVAMP, which emphasize numerical reasoning, MetaEvo yields substantial accuracy improvements, indicating enhanced error correction and generalization capabilities\. On MATH, the framework demonstrates strong robustness in solving complex multi\-step reasoning problems\. Further gains on MMLU and BBH validate the generalizability of MetaEvo across diverse, knowledge\-intensive tasks\.

Overall, these results highlight the effectiveness of MetaEvo in improving both task\-specific performance and cross\-task generalization\.

![Refer to caption](https://arxiv.org/html/2606.07603v1/x4.png)

Figure 4: Comparison of two training methods on GSM8K. Our method achieves higher performance with fewer training samples.

### 5.2 Meta Ability Enhances the Model's Capacity for Self-Improvement

Meta-optimization is critical for enabling effective principle abstraction and principle-guided generation. To assess the impact of meta-optimization, we compare the full MetaEvo framework with a variant that excludes it (w/o MO in Table [1](https://arxiv.org/html/2606.07603#S4.T1)). Although w/o MO achieves competitive performance on GSM8K and SVAMP, it consistently underperforms MetaEvo across all benchmarks, highlighting the role of meta-level optimization in refining corrective principles and improving generalization.

Further analysis indicates that, without meta optimization, the extracted principles are often overly generic and lack task\-specific utility, leading to redundant or inconsistent guidance\. In contrast, MetaEvo produces more targeted and actionable principles aligned with observed failure modes, enabling precise revisions and effective self\-correction\.

Figure [4](https://arxiv.org/html/2606.07603#S5.F4) compares MetaEvo with standard supervised fine-tuning (SFT) on GSM8K and SVAMP. While SFT directly trains on labeled examples, MetaEvo learns and accumulates intermediate principles to guide generation. Despite using fewer training samples, MetaEvo consistently achieves higher performance, demonstrating the sample-efficiency gains enabled by meta-optimization.

![Refer to caption](https://arxiv.org/html/2606.07603v1/x5.png)

Figure 5: Principle Count Achieved Across Five Training Iterations.

Enhanced Meta-Ability in Principle Abstraction Raises the Performance Ceiling of Iterative Improvement.

We further examine the effect of iterative evolution in MetaEvo by performing up to three iterations on GSM8K and SVAMP, using LLaMA3\.1\-8B\-Instruct and Qwen2\.5\-14B\-Instruct as base models\. Each iteration consists of principle accumulation followed by principle\-guided generation, where outputs from the previous iteration are reused as inputs\. Across iterations, both the principle memory and response quality improve steadily\.

As shown in Figure[6](https://arxiv.org/html/2606.07603#S5.F6), MetaEvo consistently improves across iterations and outperforms other self\-evolving baselines\. Performance increases monotonically over three iterations, indicating that progressive principle refinement directly enhances reasoning quality\. After three iterations, MetaEvo achieves absolute gains of 9\.6

These results demonstrate that meta\-ability optimization enables sustained self\-improvement by prioritizing the acquisition of improvement strategies over direct answer prediction\.

![Refer to caption](https://arxiv.org/html/2606.07603v1/x6.png)

Figure 6: Accuracy (%) comparison across iterations for MetaEvo and baseline methods. The figure reports iterative performance on two tasks of BBH: Logical Deduction (LD) and Tracking Shuffled Objects (TSO).

Meta-Optimization enhances the intrinsic reasoning ability of the model, even without explicit reuse of previous experience. Our results demonstrate that directly strengthening the model's capacity for principle abstraction via meta-optimization leads to consistent performance gains. As shown in Table [2](https://arxiv.org/html/2606.07603#S5.T2), models trained with preference-based supervision outperform their base counterparts across multiple reasoning tasks, confirming that abstract principle alignment can improve general reasoning behavior independent of memory retrieval or prior instance reuse.

We argue that this abstraction process represents a meta\-level capability that operates above specific instances, enabling the model to generalize from error patterns and revise its own behavior accordingly\. Rather than memorizing task\-specific corrections, the model learns to organize and apply high\-level strategies that support more coherent, self\-aware reasoning\. This improvement feeds back into downstream performance, as the model internalizes a transferable scaffold for decision\-making\. In essence, principle abstraction serves not just as a mechanism for fixing past mistakes, but as a foundation for building more adaptive and generalizable reasoning behavior\.
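For reference, this section does not restate the training objective behind the preference\-based supervision. Assuming it follows Direct Preference Optimization \(Rafailov et al\., 2023\), which the paper cites, the standard objective is the one below. Reading $x$ as the abstraction prompt, $y_w$ as the preferred principle, and $y_l$ as the dispreferred principle is our assumption, not something this section states.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Here $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the logistic function, and $\beta$ controls how far the trained model may drift from the reference.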

Table 2: Performance comparison with and without meta\-optimization \(MO\) on arithmetic reasoning tasks\.

Table 3: Accuracy \(%\) on GSM8K and SVAMP under different principle supervision strategies\.

### 5\.3 Contrastive Analysis Drives Precise and Actionable Principle Abstraction

Effective evolution relies on high\-quality principles, and contrastive analysis is key to extracting effective principles\. To assess the impact of different principle generation strategies, we compare three settings: \(1\) Random Principles, which introduce task\-irrelevant noise; \(2\) Direct Abstraction, in which principles are generated without contrastive analysis, by directly prompting the model using a predefined abstraction template; and \(3\) MetaEvo w/ CDA, which employs contrastive\-driven abstraction to derive principles\.
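To make the difference between the three settings concrete, the sketch below shows how the input to principle generation might differ in each case. It is only an illustration: the templates, function names, and the choice to contrast a weaker response with a stronger one are assumptions, since the actual prompts are not given in this section.

```python
import random
from typing import List

# Hypothetical templates; the paper's real prompts are not shown here.
DIRECT_TEMPLATE = (
    "Question: {question}\n"
    "State one general principle for solving problems like this."
)

CONTRASTIVE_TEMPLATE = (
    "Question: {question}\n"
    "Response A (weaker): {rejected}\n"
    "Response B (stronger): {preferred}\n"
    "Explain what B does better than A, then state one general principle "
    "that would lead to B."
)


def random_principle(pool: List[str], rng: random.Random) -> str:
    """Setting (1): pick a principle with no regard to the task."""
    return rng.choice(pool)


def direct_abstraction_prompt(question: str) -> str:
    """Setting (2): ask for a principle from the question alone."""
    return DIRECT_TEMPLATE.format(question=question)


def contrastive_abstraction_prompt(
    question: str, rejected: str, preferred: str
) -> str:
    """Setting (3): ground the principle in a comparison of two responses."""
    return CONTRASTIVE_TEMPLATE.format(
        question=question, rejected=rejected, preferred=preferred
    )


if __name__ == "__main__":
    q = "A train travels 60 km in 1.5 hours. What is its speed?"
    print(direct_abstraction_prompt(q))
    print(contrastive_abstraction_prompt(q, "60 * 1.5 = 90", "60 / 1.5 = 40"))
```

The design point of the contrastive variant is that the model sees a concrete difference between two responses before it abstracts, rather than producing a principle from the question alone.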

As shown in Table[3](https://arxiv.org/html/2606.07603#S5.T3), random principles significantly degrade performance, confirming that irrelevant guidance can mislead the model’s reasoning process\. Direct abstraction produces unguided principles, which may introduce noise or even conflict with the model’s original reasoning trajectory\. In contrast, principles derived through contrastive analysis yield the highest and most consistent performance, achieving 92\.4% on GSM8K and 91\.9% on SVAMP, demonstrating their effectiveness in guiding generation\. These results validate the crucial role of contrastive analysis in extracting reliable, high\-quality principles that serve as effective guidance for principle\-guided generation\.

## 6Conclusion

In this paper, we introduce MetaEvo, a meta\-optimization framework that facilitates principle\-guided evolution in large language models\. By enhancing the model’s meta\-ability, MetaEvo shifts the objective from direct answer optimization to learning how to revise\. The framework integrates meta\-optimization with an agent system that extracts, stores, and reuses high\-quality revision principles\. Experimental results across multiple reasoning benchmarks demonstrate that MetaEvo consistently improves performance, supports iterative self\-improvement, and enhances generalization\.

## References

- T\. B\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. M\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei \(2020\)Language models are few\-shot learners\.CoRRabs/2005\.14165\.External Links:[Link](https://arxiv.org/abs/2005.14165),2005\.14165Cited by:[§1](https://arxiv.org/html/2606.07603#S1.p1.1)\.
- Y\. Cai, Y\. Hao, J\. Zhou, H\. Yan, Z\. Lei, R\. Zhen, Z\. Han, Y\. Yang, J\. Li, Q\. Pan, T\. Huai, Q\. Chen, X\. Li, K\. Chen, B\. Zhang, X\. Qiu, and L\. He \(2025\)Building self\-evolving agents via experience\-driven lifelong learning: a framework and benchmark\.External Links:2508\.19005,[Link](https://arxiv.org/abs/2508.19005)Cited by:[§1](https://arxiv.org/html/2606.07603#S1.p1.1)\.
- Grimoire is all you need for enhancing large language models\.CoRRabs/2401\.03385\.External Links:[Link](https://doi.org/10.48550/arXiv.2401.03385),[Document](https://dx.doi.org/10.48550/ARXIV.2401.03385),2401\.03385Cited by:[§1](https://arxiv.org/html/2606.07603#S1.p1.1)\.
- W\. Chen, C\. Wu, Y\. Chen, and H\. Chen \(2023\)Self\-icl: zero\-shot in\-context learning with self\-generated demonstrations\.External Links:2305\.15035,[Link](https://arxiv.org/abs/2305.15035)Cited by:[§4\.2](https://arxiv.org/html/2606.07603#S4.SS2.p1.1.3)\.
- P\. Chhikara, D\. Khant, S\. Aryan, T\. Singh, and D\. Yadav \(2025\)Mem0: building production\-ready ai agents with scalable long\-term memory\.External Links:2504\.19413,[Link](https://arxiv.org/abs/2504.19413)Cited by:[§2\.2](https://arxiv.org/html/2606.07603#S2.SS2.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.CoRRabs/2110\.14168\.External Links:[Link](https://arxiv.org/abs/2110.14168),2110\.14168Cited by:[§4\.1](https://arxiv.org/html/2606.07603#S4.SS1.SSS0.Px1.p1.1)\.
- DeepSeek\-AI, D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao, A\. Liu, B\. Xue, B\. Wang, B\. Wu, B\. Feng, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan, D\. Dai, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Bao, H\. Xu, H\. Wang, H\. Ding, H\. Xin, H\. Gao, H\. Qu, H\. Li, J\. Guo, J\. Li, J\. Wang, J\. Chen, J\. Yuan, J\. Qiu, J\. Li, J\. L\. Cai, J\. Ni, J\. Liang, J\. Chen, K\. Dong, K\. Hu, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Zhao, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, M\. Zhang, M\. Zhang, M\. Tang, M\. Li, M\. Wang, M\. Li, N\. Tian, P\. Huang, P\. Zhang, Q\. Wang, Q\. Chen, Q\. Du, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. J\. Chen, R\. L\. Jin, R\. Chen, S\. Lu, S\. Zhou, S\. Chen, S\. Ye, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. S\. Li, S\. Zhou, S\. Wu, S\. Ye, T\. Yun, T\. Pei, T\. Sun, T\. Wang, W\. Zeng, W\. Zhao, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, W\. L\. Xiao, W\. An, X\. Liu, X\. Wang, X\. Chen, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yang, X\. Li, X\. Su, X\. Lin, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Sun, X\. Wang, X\. Song, X\. Zhou, X\. Wang, X\. Shan, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. Zhang, Y\. Xu, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Yu, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Ou, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Y\. X\. Zhu, Y\. Xu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Y\. Tang, Y\. Zha, Y\. Yan, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Ma, Z\. Yan, Z\. Wu, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Pan, Z\. Huang, Z\. Xu, Z\. Zhang, and Z\. 
Zhang \(2025\)DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.External Links:2501\.12948,[Link](https://arxiv.org/abs/2501.12948)Cited by:[§4\.1](https://arxiv.org/html/2606.07603#S4.SS1.SSS0.Px2.p1.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Rozière, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. M\. Kloumann, I\. Misra, I\. Evtimov, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, and et al\. \(2024\)The llama 3 herd of models\.CoRRabs/2407\.21783\.External Links:[Link](https://doi.org/10.48550/arXiv.2407.21783),[Document](https://dx.doi.org/10.48550/ARXIV.2407.21783),2407\.21783Cited by:[§4\.1](https://arxiv.org/html/2606.07603#S4.SS1.SSS0.Px2.p1.1)\.
- H\. Gao, J\. Geng, W\. Hua, M\. Hu, X\. Juan, H\. Liu, S\. Liu, J\. Qiu, X\. Qi, Y\. Wu, H\. Wang, H\. Xiao, Y\. Zhou, S\. Zhang, J\. Zhang, J\. Xiang, Y\. Fang, Q\. Zhao, D\. Liu, Q\. Ren, C\. Qian, Z\. Wang, M\. Hu, H\. Wang, Q\. Wu, H\. Ji, and M\. Wang \(2025\)A survey of self\-evolving agents: on path to artificial super intelligence\.External Links:2507\.21046,[Link](https://arxiv.org/abs/2507.21046)Cited by:[§2\.1](https://arxiv.org/html/2606.07603#S2.SS1.p1.1)\.
- J\. Gao, X\. Ding, Y\. Cui, J\. Zhao, H\. Wang, T\. Liu, and B\. Qin \(2024\)Self\-evolving GPT: A lifelong autonomous experiential learner\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),pp\. 6385–6432\.External Links:[Link](https://doi.org/10.18653/v1/2024.acl-long.346),[Document](https://dx.doi.org/10.18653/V1/2024.ACL-LONG.346)Cited by:[§1](https://arxiv.org/html/2606.07603#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.07603#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.07603#S2.SS2.p1.1),[§4\.2](https://arxiv.org/html/2606.07603#S4.SS2.p1.1.4)\.
- R\. Gong, Q\. Huang, X\. Ma, Y\. Noda, Z\. Durante, Z\. Zheng, D\. Terzopoulos, L\. Fei\-Fei, J\. Gao, and H\. Vo \(2024\)MindAgent: emergent gaming interaction\.InFindings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, June 16\-21, 2024,K\. Duh, H\. Gómez\-Adorno, and S\. Bethard \(Eds\.\),pp\. 3154–3183\.External Links:[Link](https://doi.org/10.18653/v1/2024.findings-naacl.200),[Document](https://dx.doi.org/10.18653/V1/2024.FINDINGS-NAACL.200)Cited by:[§2\.1](https://arxiv.org/html/2606.07603#S2.SS1.p1.1)\.
- Z\. Gou, Z\. Shao, Y\. Gong, Y\. Shen, Y\. Yang, N\. Duan, and W\. Chen \(2024\)CRITIC: large language models can self\-correct with tool\-interactive critiquing\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,External Links:[Link](https://openreview.net/forum?id=Sx038qxjek)Cited by:[§1](https://arxiv.org/html/2606.07603#S1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2020\)Measuring massive multitask language understanding\.CoRRabs/2009\.03300\.External Links:[Link](https://arxiv.org/abs/2009.03300),2009\.03300Cited by:[§4\.1](https://arxiv.org/html/2606.07603#S4.SS1.SSS0.Px1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the MATH dataset\.CoRRabs/2103\.03874\.External Links:[Link](https://arxiv.org/abs/2103.03874),2103\.03874Cited by:[§4\.1](https://arxiv.org/html/2606.07603#S4.SS1.SSS0.Px1.p1.1)\.
- T\. Huai, J\. Zhou, Y\. Cai, Q\. Chen, W\. Wu, X\. Wu, X\. Qiu, and L\. He \(2025\)Task\-core memory management and consolidation for long\-term continual learning\.External Links:2505\.09952,[Link](https://arxiv.org/abs/2505.09952)Cited by:[§2\.2](https://arxiv.org/html/2606.07603#S2.SS2.p1.1)\.
- S\. Kim, S\. J\. Joo, D\. Kim, J\. Jang, S\. Ye, J\. Shin, and M\. Seo \(2023\)The cot collection: improving zero\-shot and few\-shot learning of language models via chain\-of\-thought fine\-tuning\.arXiv preprint arXiv:2305\.14045\.Cited by:[§4\.1](https://arxiv.org/html/2606.07603#S4.SS1.SSS0.Px1.p2.1)\.
- C\. Li, J\. Wang, Y\. Zhang, K\. Zhu, W\. Hou, J\. Lian, F\. Luo, Q\. Yang, and X\. Xie \(2023\)Large language models understand and can be enhanced by emotional stimuli\.External Links:2307\.11760,[Link](https://arxiv.org/abs/2307.11760)Cited by:[§1](https://arxiv.org/html/2606.07603#S1.p1.1)\.
- J\. Li, P\. Chen, S\. Wu, C\. Zheng, H\. Xu, and J\. Jia \(2024\)RoboCoder: robotic learning from basic skills to general tasks with large language models\.External Links:2406\.03757,[Link](https://arxiv.org/abs/2406.03757)Cited by:[§2\.1](https://arxiv.org/html/2606.07603#S2.SS1.p1.1)\.
- X\. Li and X\. Qiu \(2023\)MoT: memory\-of\-thought enables chatgpt to self\-improve\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6\-10, 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),pp\. 6354–6374\.External Links:[Link](https://doi.org/10.18653/v1/2023.emnlp-main.392),[Document](https://dx.doi.org/10.18653/V1/2023.EMNLP-MAIN.392)Cited by:[§2\.2](https://arxiv.org/html/2606.07603#S2.SS2.p1.1)\.
- Y\. Liu, C\. Si, K\. Narasimhan, and S\. Yao \(2025\)Contextual experience replay for self\-improvement of language agents\.External Links:2506\.06698,[Link](https://arxiv.org/abs/2506.06698)Cited by:[§2\.1](https://arxiv.org/html/2606.07603#S2.SS1.p1.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, S\. Gupta, B\. P\. Majumder, K\. Hermann, S\. Welleck, A\. Yazdanbakhsh, and P\. Clark \(2023\)Self\-refine: iterative refinement with self\-feedback\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2606.07603#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.07603#S2.SS1.p1.1),[§4\.2](https://arxiv.org/html/2606.07603#S4.SS2.p1.1.2)\.
- S\. Ouyang, J\. Yan, I\. Hsu, Y\. Chen, K\. Jiang, Z\. Wang, R\. Han, L\. T\. Le, S\. Daruki, X\. Tang, V\. Tirumalashetty, G\. Lee, M\. Rofouei, H\. Lin, J\. Han, C\. Lee, and T\. Pfister \(2025\)ReasoningBank: scaling agent self\-evolving with reasoning memory\.External Links:2509\.25140,[Link](https://arxiv.org/abs/2509.25140)Cited by:[§2\.1](https://arxiv.org/html/2606.07603#S2.SS1.p1.1)\.
- A\. Patel, S\. Bhattamishra, and N\. Goyal \(2021\)Are NLP models really able to solve simple math word problems?\.InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL\-HLT 2021, Online, June 6\-11, 2021,K\. Toutanova, A\. Rumshisky, L\. Zettlemoyer, D\. Hakkani\-Tür, I\. Beltagy, S\. Bethard, R\. Cotterell, T\. Chakraborty, and Y\. Zhou \(Eds\.\),pp\. 2080–2094\.External Links:[Link](https://doi.org/10.18653/v1/2021.naacl-main.168),[Document](https://dx.doi.org/10.18653/V1/2021.NAACL-MAIN.168)Cited by:[§4\.1](https://arxiv.org/html/2606.07603#S4.SS1.SSS0.Px1.p1.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.Advances in neural information processing systems36,pp\. 53728–53741\.Cited by:[§1](https://arxiv.org/html/2606.07603#S1.p4.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2606.07603#S1.p1.1)\.
- Z\. Sun, Y\. Shen, Q\. Zhou, H\. Zhang, Z\. Chen, D\. D\. Cox, Y\. Yang, and C\. Gan \(2023\)Principle\-driven self\-alignment of language models from scratch with minimal human supervision\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/0764db1151b936aca59249e2c1386101-Abstract-Conference.html)Cited by:[§2\.1](https://arxiv.org/html/2606.07603#S2.SS1.p1.1)\.
- M\. Suzgun, N\. Scales, N\. Schärli, S\. Gehrmann, Y\. Tay, H\. W\. Chung, A\. Chowdhery, Q\. V\. Le, E\. H\. Chi, D\. Zhou, and J\. Wei \(2023\)Challenging big\-bench tasks and whether chain\-of\-thought can solve them\.InFindings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9\-14, 2023,A\. Rogers, J\. L\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),pp\. 13003–13051\.External Links:[Link](https://doi.org/10.18653/v1/2023.findings-acl.824),[Document](https://dx.doi.org/10.18653/V1/2023.FINDINGS-ACL.824)Cited by:[§4\.1](https://arxiv.org/html/2606.07603#S4.SS1.SSS0.Px1.p1.1)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar, A\. Rodriguez, A\. Joulin, E\. Grave, and G\. Lample \(2023\)LLaMA: open and efficient foundation language models\.CoRRabs/2302\.13971\.External Links:[Link](https://doi.org/10.48550/arXiv.2302.13971),[Document](https://dx.doi.org/10.48550/ARXIV.2302.13971),2302\.13971Cited by:[§1](https://arxiv.org/html/2606.07603#S1.p1.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, L\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.External Links:1706\.03762Cited by:[§1](https://arxiv.org/html/2606.07603#S1.p1.1)\.
- H\. Xu, J\. Hu, K\. Zhang, L\. Yu, Y\. Tang, X\. Song, Y\. Duan, L\. Ai, and B\. Shi \(2025a\)SEDM: scalable self\-evolving distributed memory for agents\.External Links:2509\.09498,[Link](https://arxiv.org/abs/2509.09498)Cited by:[§2\.1](https://arxiv.org/html/2606.07603#S2.SS1.p1.1)\.
- W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang \(2025b\)A\-mem: agentic memory for llm agents\.External Links:2502\.12110,[Link](https://arxiv.org/abs/2502.12110)Cited by:[§2\.2](https://arxiv.org/html/2606.07603#S2.SS2.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2024\)Qwen2\.5 technical report\.CoRRabs/2412\.15115\.External Links:[Link](https://doi.org/10.48550/arXiv.2412.15115),[Document](https://dx.doi.org/10.48550/ARXIV.2412.15115),2412\.15115Cited by:[§4\.1](https://arxiv.org/html/2606.07603#S4.SS1.SSS0.Px2.p1.1)\.
- Z\. Yang, P\. Li, and Y\. Liu \(2023\)Failures pave the way: enhancing large language models through tuning\-free rule accumulation\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6\-10, 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),pp\. 1751–1777\.External Links:[Link](https://doi.org/10.18653/v1/2023.emnlp-main.109),[Document](https://dx.doi.org/10.18653/V1/2023.EMNLP-MAIN.109)Cited by:[§1](https://arxiv.org/html/2606.07603#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.07603#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.07603#S2.SS2.p1.1)\.
- R\. Zeng, J\. Fang, S\. Liu, and Z\. Meng \(2024\)On the structural memory of llm agents\.External Links:2412\.15266,[Link](https://arxiv.org/abs/2412.15266)Cited by:[§2\.2](https://arxiv.org/html/2606.07603#S2.SS2.p1.1)\.
- A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang \(2024\)ExpeL: LLM agents are experiential learners\.InThirty\-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty\-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, February 20\-27, 2024, Vancouver, Canada,M\. J\. Wooldridge, J\. G\. Dy, and S\. Natarajan \(Eds\.\),pp\. 19632–19642\.External Links:[Link](https://doi.org/10.1609/aaai.v38i17.29936),[Document](https://dx.doi.org/10.1609/AAAI.V38I17.29936)Cited by:[§1](https://arxiv.org/html/2606.07603#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.07603#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2606.07603#S2.SS2.p1.1)\.
- W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang \(2023\)MemoryBank: enhancing large language models with long\-term memory\.External Links:2305\.10250,[Link](https://arxiv.org/abs/2305.10250)Cited by:[§2\.2](https://arxiv.org/html/2606.07603#S2.SS2.p1.1)\.
- P\. Zhou, J\. Pujara, X\. Ren, X\. Chen, H\. Cheng, Q\. V\. Le, E\. H\. Chi, D\. Zhou, S\. Mishra, and H\. S\. Zheng \(2024\)Self\-discover: large language models self\-compose reasoning structures\.External Links:2402\.03620,[Link](https://arxiv.org/abs/2402.03620)Cited by:[§4\.2](https://arxiv.org/html/2606.07603#S4.SS2.p1.1.5)\.
