Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration
Summary
This paper proposes a method to train LLM agents with intrinsic meta-evolution capabilities, enabling spontaneous self-improvement without external rewards at inference time. Applied to Qwen3-30B and Seed-OSS-36B, the approach yields a 20% performance boost on web navigation benchmarks, with a 14B model outperforming Gemini-2.5-Flash.
Cached at: 04/21/26, 07:20 AM
Source: https://huggingface.co/papers/2604.18131
Abstract
Agents equipped with intrinsic meta-evolution capabilities demonstrate improved performance on web navigation tasks through self-generated world knowledge without external supervision.
Most agents today "self-evolve" by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops. In this work, we train agents to possess an intrinsic meta-evolution capability to spontaneously learn about unseen environments prior to task execution. To instill this ability, we design an outcome-based reward mechanism that measures how much an agent's self-generated world knowledge improves its success rate on downstream tasks. This reward signal is used exclusively during the training phase to teach the model how to explore and summarize effectively. At inference time, the agent requires no external rewards or human instructions. It spontaneously performs native self-evolution to adapt to unknown environments using its internal parameters. When applied to Qwen3-30B and Seed-OSS-36B, this shift to native evolution yields a 20% performance increase on WebVoyager and WebWalker. Most strikingly, the generated world knowledge even enables a compact 14B Qwen3 model to outperform the unassisted Gemini-2.5-Flash, establishing a new paradigm for truly evolving agents.
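The outcome-based reward described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes the simplest reading: the reward is the difference between the downstream success rate with the agent's self-generated world knowledge and the success rate without it. The function names `success_rate` and `knowledge_reward` and the sample outcomes are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of downstream task attempts that succeeded."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return sum(outcomes) / len(outcomes)


def knowledge_reward(
    with_knowledge: Sequence[bool],
    without_knowledge: Sequence[bool],
) -> float:
    """Hypothetical reward: success-rate gain attributable to world knowledge.

    Assumption: a plain difference of success rates. The paper may use a
    different normalization or estimator.
    """
    return success_rate(with_knowledge) - success_rate(without_knowledge)


# Made-up outcomes for four downstream tasks in one environment.
baseline = [True, False, False, False]   # agent acting without world knowledge
assisted = [True, True, True, False]     # same tasks with world knowledge in context

print(knowledge_reward(assisted, baseline))  # 0.5
```

Under this reading, a positive value means the explored and summarized knowledge helped, and a negative value means it hurt. Per the abstract, such a signal would be computed only during training; at inference time the agent explores without it.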
Similar Articles
CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution
CoEvolve proposes an agent-data mutual evolution framework for training LLM agents through closed-loop, interaction-driven learning that adapts both the agent and its training data distribution. The method extracts feedback signals from rollout trajectories to guide LLM-based task synthesis, demonstrating significant improvements (15-19% absolute gains) across multiple Qwen models on AppWorld and BFCL benchmarks.
OpenSkill: Open-World Self-Evolution for LLM Agents
OpenSkill is a framework for LLM agents to self-evolve skills and verification signals from open-world resources without target-task supervision, achieving high performance across benchmarks.
EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents
EvolveMem introduces a self-evolving memory architecture for LLM agents that optimizes retrieval configurations through LLM-powered diagnosis and iterative research cycles, achieving significant performance improvements on benchmarks like LoCoMo and MemBench.
MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution
MetaEvo proposes a two-stage framework for continual evolution of LLM-based agents, using preference-based optimization to enhance principle abstraction and modular architecture for experience reuse, outperforming strong baselines on reasoning benchmarks.
Rethinking Continual Experience Internalization for Self-Evolving LLM Agents
This paper investigates why LLM agents suffer from progressive capability collapse under multi-iteration experience internalization and proposes a robust recipe addressing experience granularity, injection patterns, and training regime. Key findings include that principle-level experience, step-wise injection, and off-policy context-distillation yield more stable and sustainable continual learning.