
# CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution
Source: https://arxiv.org/html/2604.15840

###### Abstract

Reinforcement learning for LLM agents is typically conducted on a static data distribution, which fails to adapt to the agent's evolving behavior and leads to poor coverage of complex environment interactions. To address these challenges, we propose CoEvolve, an agent-data mutual evolution framework that enables LLM agents to improve through closed-loop, interaction-driven training. Specifically, CoEvolve extracts feedback signals such as forgetting and uncertainty from rollout trajectories to identify failure-prone interaction patterns, and utilizes them to guide LLM-based task synthesis. The synthesized tasks are validated through environment interaction and utilized to update the data distribution, enabling joint adaptation of the agent and its data. Extensive experiments on AppWorld and BFCL across Qwen2.5-7B, Qwen3-4B, and Qwen3-30B-A3B demonstrate consistent and significant improvements over strong base models, yielding absolute gains of 19.43%, 15.58%, and 18.14%, respectively.

Shidong Yang*, Ziyu Ma*, Tongwen Huang*, Yiming Hu, Yong Wang†, Xiangxiang Chu
AMAP, Alibaba Group
https://github.com/AMAP-ML/CoEvolve

*Equal contribution.
†Project lead and corresponding author.

## 1 Introduction

The rapid advancement of large language models (LLMs) has driven the development of LLM-based agents, which have been widely applied to scenarios such as web information retrieval, software engineering, web navigation, and personal assistance. Reinforcement learning (RL) has emerged as the dominant approach for training these agents with complex interactive capabilities, offering a general solution for acquiring adaptive behaviors in open-ended environments.

However, current agent RL training methods heavily rely on human-written demonstrations, where experts manually interact with the environment to construct trajectory datasets. These curated trajectories are then used to train the agent's policy, as illustrated in Fig. 1(a). While effective on simple tasks, this reliance on manually curated data introduces several critical limitations: (1) Collecting interaction data in real environments is prohibitively expensive, with a single trajectory often requiring several minutes or more of human expert effort. Given the limited availability of expert time, broad exploration of the environment becomes difficult. (2) More fundamentally, these expert demonstrations represent static snapshots of interaction patterns and fail to cover the long-tail variations found in real-world settings. As a result, agents trained on such data struggle to generalize beyond the observed distribution. For instance, a web navigation agent may fail entirely if a button label changes from "Book Now" to "Reserve Now".

The challenge of insufficient and static data has led to significant interest in synthetic data generation. A typical pipeline, illustrated in Fig. 1(b), prompts a large language model (LLM) with environment descriptions and task specifications to explore the environment. By leveraging its world knowledge and reasoning capabilities, the LLM generates synthetic trajectories that are subsequently used to train the agent. While synthetic data reduces reliance on human annotation, it is typically generated through random exploration guided solely by the LLM's world knowledge, without any feedback from the agent's actual performance or interaction signals. As a result, environment exploration remains shallow and incomplete and fails to cover diverse environment configurations. Moreover, the generated data still constitutes a static corpus that cannot adapt to the agent's evolving capabilities, leading to inefficient training that neither targets specific weaknesses nor supports continual improvement.

To address these issues, we propose CoEvolve, an agent-data mutual evolution framework in which the agent and its training distribution evolve jointly through interaction-driven feedback, as shown in Fig. 1(c). Our core idea is to use feedback signals, such as forgetting signals, to identify failure-prone interaction patterns and guide LLM-based task discovery accordingly. Unlike previous methods that rely on static datasets, CoEvolve synthesizes new tasks targeting the agent's current weaknesses, validates them in the environment, and integrates them into training without human supervision. This closed loop allows the agent to reshape its learning distribution (data evolving) while continually overcoming its limitations (agent evolving).

We evaluate CoEvolve on two representative benchmarks, AppWorld and BFCL, using Qwen2.5-7B, Qwen3-4B, and Qwen3-30B-A3B as backbones. By continuously synthesizing new tasks from training-time feedback, CoEvolve improves average performance by 19.43%, 15.58%, and 18.14%, respectively, demonstrating strong scalability and generalization across models and environments. Our contributions can be summarized as follows:

- We propose CoEvolve, an agent-data mutual evolution framework that alternates between agent optimization and data distribution updates without any human supervision.
- Unlike previous synthetic data generation based on unguided random exploration, we incorporate feedback signals (e.g., forgetting signals) into LLM-based environment exploration.
- CoEvolve yields large gains over baseline models (e.g., Qwen3-4B) across interactive benchmarks (e.g., AppWorld), demonstrating its effectiveness in complex environments.

## 2 Related Work

**Large Language Model Agents.** Recent work has shown that large language models (LLMs) can be instantiated as autonomous agents capable of long-horizon reasoning and action through iterative interaction with environments. Early frameworks such as ReAct and Reflexion demonstrate that coupling reasoning, tool use, and feedback enables LLMs to solve complex multi-step tasks, while later systems further enhance planning and memory for more persistent behaviors. Despite these advances, most existing LLM agents are trained via imitation learning on static collections of expert trajectories, which fundamentally limits exploration and constrains learning to the coverage of pre-collected data. In contrast, our work departs from this static paradigm by enabling agents to learn in a dynamic, self-evolving training process without relying on fixed expert demonstrations.

**Trajectory Synthesis for Agent Training.** To reduce reliance on expert demonstrations, recent work explores synthetic trajectory generation for training LLM agents. Most prior approaches generate trajectories in an *offline* or weakly adaptive manner, including open-loop synthesis with reflection or correction, as well as large-scale pipelines based on tutorials, scripted exploration, simulators, and self-training. Recent extensions introduce more autonomous exploration or structured curricula, yet trajectory generation remains largely *open-loop*, loosely coupled to the agent's evolving failure modes. In contrast, our method closes this loop by using environment feedback to synthesize trajectories on demand, enabling continuous adaptation of the training distribution. Conceptually, CoEvolve also differs from recent self-improving or curriculum-style frameworks that refine trajectories for a fixed pool of queries or generate variants around seed tasks. Our feedback is used to drive the agent back into the interactive environment to discover new executable queries and states, so data evolution is not limited to rewriting or filtering an offline query set.

## 3 Method

We propose CoEvolve, an agent-data co-evolution framework for training LLM agents without human supervision. In this section, we first introduce agent training on synthetic tasks and the extraction of weakness signals from rollout trajectories (Section 3.1). Then, Section 3.2 details how these signals are used as feedback to prompt LLM-based re-exploration for new task discovery. Finally, Section 3.3 describes how the discovered interactions are abstracted into executable tasks, validated, and incorporated into training. The overall framework is illustrated in Fig. 2.
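
To make the structure of this loop concrete, the following Python sketch outlines one plausible organization of the agent-data mutual evolution cycle. It is not the authors' implementation: the callable names (`train_step`, `extract_signals`, `synthesize_tasks`, `validate_task`) and the overall decomposition are assumptions made for illustration.

```python
from typing import Any, Callable, Iterable, List


def coevolve(
    train_step: Callable[[List[Any]], List[Any]],        # optimize the agent on the task pool, return rollout trajectories
    extract_signals: Callable[[List[Any]], List[Any]],   # mine forgetting / boundary / rare signals from rollouts
    synthesize_tasks: Callable[[List[Any]], List[Any]],  # LLM-based task synthesis guided by the signals
    validate_task: Callable[[Any], bool],                 # environment-based executability check
    initial_tasks: Iterable[Any],                         # D_0: tasks from unguided LLM exploration
    num_iterations: int,
) -> List[Any]:
    """Closed loop in which the agent and its task distribution evolve jointly."""
    task_pool: List[Any] = list(initial_tasks)
    for _ in range(num_iterations):
        rollouts = train_step(task_pool)          # agent evolving (Section 3.1)
        signals = extract_signals(rollouts)       # weakness signals (Section 3.1)
        candidates = synthesize_tasks(signals)    # feedback-guided re-exploration (Section 3.2)
        # data evolving (Section 3.3): keep only tasks that execute successfully in the environment
        task_pool.extend(x for x in candidates if validate_task(x))
    return task_pool
```

Passing the stages in as callables keeps the loop itself independent of any particular environment or RL library; each stage maps directly onto one of Sections 3.1-3.3.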

### 3.1 Training and Signal Extraction

#### Training on Synthetic Tasks

At training iteration $t$, we maintain a task set $\mathcal{D}_t$ consisting of executable synthetic tasks. The initial task set $\mathcal{D}_0$ is obtained via unguided exploration by a large language model interacting with the environment. As training proceeds, newly synthesized and validated tasks (described in later stages) are appended to $\mathcal{D}_t$, allowing the task distribution to evolve together with the agent.

For a task $x \in \mathcal{D}_t$, we sample a group of $K$ trajectories $\{\tau_k\}_{k=1}^K \sim \pi_\theta(\cdot|x)$ and assign each trajectory a scalar reward $R(\tau_k)$. The agent is optimized using Group Relative Policy Optimization (GRPO) by maximizing:

$$\mathcal{J}(\theta) = \frac{1}{\sum_{k=1}^K|\tau_k|}\sum_{k=1}^K\sum_{t=1}^{|\tau_k|}\text{CLIP}(r_{k,t}(\theta),\hat{A}_k,\epsilon) - \beta \cdot \mathbb{D}_{\text{KL}}\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right],$$

where $r_{k,t}(\theta) = \frac{\pi_\theta(a_t^k|s_t^k)}{\pi_{\theta_{\text{old}}}(a_t^k|s_t^k)}$ is the importance ratio, and $\text{CLIP}(r,A,\epsilon) = \min[r \cdot A, \text{clip}(r,1-\epsilon,1+\epsilon) \cdot A]$. Here $\hat{A}_k$ denotes the group-relative advantage, $\pi_{\text{ref}}$ is a fixed reference policy, and $\beta$ weights the KL regularization term.
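
For concreteness, a minimal PyTorch-style sketch of this objective (written as a loss to minimize) is shown below. It assumes per-token log-probabilities have already been gathered for the $K$ rollouts of one task and padded to a common length; the function name, the padding handling, and the KL estimator are illustrative assumptions rather than the paper's implementation.

```python
import torch


def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """Clipped, group-relative surrogate with KL regularization (negated for minimization).

    logp_new, logp_old, logp_ref: (K, T) per-token log-probs under the current,
        rollout-time, and fixed reference policies (a real implementation would
        also mask padding tokens).
    rewards: (K,) scalar trajectory rewards R(tau_k).
    """
    # Group-relative advantage: standardize rewards within the group of K rollouts.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)        # (K,)
    adv = adv.unsqueeze(1)                                           # broadcast over tokens

    # Token-level importance ratio r_{k,t}(theta).
    ratio = torch.exp(logp_new - logp_old)                           # (K, T)

    # Clipped surrogate, averaged over all tokens in the group.
    clipped = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # KL(pi_theta || pi_ref) via the low-variance estimator commonly used with GRPO.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0

    return -(clipped - beta * kl).mean()
```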

#### Signal Extraction

Beyond policy optimization, rollout trajectories generated during training contain instances of agent underperformance. To identify such weaknesses, we analyze these trajectories and define three types of behavioral signals: forgetting signals, boundary signals, and rare signals.

**(1) Forgetting Signals**

Following prior work, we use forgetting signals to detect cases where the agent previously succeeded on a task but now fails under the current policy. Let $s_{\text{now}} \in [0,1]$ denote the task-level score of the current trajectory $\tau_{\text{now}}$, computed from the environment's terminal reward or task-specific evaluation signal. For each task (or task type), we maintain a sliding window of recent scores:

$$\mathcal{H}_{\text{recent}} = \{s_{t-W+1}, \ldots, s_t\},$$

where $W$ is the window size. A forgetting signal is triggered if

$$\exists\, s_i \in \mathcal{H}_{\text{recent}} \text{ such that } s_i \geq 0.5 \quad \text{and} \quad s_{\text{now}} < 0.5.$$

This condition indicates that the agent has previously succeeded on the task but now fails under the current policy.
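
The check itself is straightforward to implement. Below is a small Python sketch of a sliding-window detector for this condition; the class name, default window size, and per-task keying are assumptions for illustration (the 0.5 threshold follows the condition above).

```python
from collections import defaultdict, deque


class ForgettingDetector:
    """Sliding-window forgetting-signal check sketched from Section 3.1.

    Keeps the last W task-level scores per task and flags a forgetting signal
    when some recent score was a success (>= 0.5) but the current rollout
    fails (< 0.5).
    """

    def __init__(self, window_size: int = 8, threshold: float = 0.5):
        self.thresh = threshold
        self.history = defaultdict(lambda: deque(maxlen=window_size))

    def update(self, task_id: str, score_now: float) -> bool:
        recent = self.history[task_id]
        # Forgetting: the agent succeeded at least once in the window
        # but fails under the current policy.
        triggered = score_now < self.thresh and any(s >= self.thresh for s in recent)
        recent.append(score_now)  # the current score joins the window for later checks
        return triggered
```

During training, `update` would be called once per rollout with the task identifier and the task-level score $s_{\text{now}}$; a returned `True` marks the task for feedback-guided re-exploration.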
