Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

Hugging Face Daily Papers Papers

Summary

This paper proposes a method to train LLM agents with intrinsic meta-evolution capabilities, enabling spontaneous self-improvement without external rewards at inference time. Applied to Qwen3-30B and Seed-OSS-36B, the approach yields a 20% performance boost on web navigation benchmarks, with a 14B model outperforming Gemini-2.5-Flash.

Most agents today ``self-evolve'' by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops. In this work, we train agents to possess an intrinsic meta-evolution capability to spontaneously learn about unseen environments prior to task execution. To instill this ability, we design an outcome-based reward mechanism that measures how much an agent's self-generated world knowledge improves its success rate on downstream tasks. This reward signal is used exclusively during the training phase to teach the model how to explore and summarize effectively. At inference time, the agent requires no external rewards or human instructions. It spontaneously performs native self-evolution to adapt to unknown environments using its internal parameters. When applied to Qwen3-30B and Seed-OSS-36B, this shift to native evolution yields a 20% performance increase on WebVoyager and WebWalker. Most strikingly, the generated world knowledge even enables a compact 14B Qwen3 model to outperform the unassisted Gemini-2.5-Flash, establishing a new paradigm for truly evolving agents.
Original Article
View Cached Full Text

Cached at: 04/21/26, 07:20 AM

Paper page - Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

Source: https://huggingface.co/papers/2604.18131

Abstract

Agents equipped with intrinsic meta-evolution capabilities demonstrate improved performance on web navigation tasks through self-generated world knowledge without external supervision.

Most agents today ``self-evolve’’ by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops. In this work, we train agents to possess an intrinsicmeta-evolutioncapability to spontaneously learn about unseen environments prior to task execution. To instill this ability, we design anoutcome-based reward mechanismthat measures how much an agent’s self-generatedworld knowledgeimproves its success rate ondownstream tasks. This reward signal is used exclusively during the training phase to teach the model how to explore and summarize effectively. At inference time, the agent requires no external rewards or human instructions. It spontaneously performsnative self-evolutionto adapt to unknown environments using its internal parameters. When applied toQwen3-30BandSeed-OSS-36B, this shift to native evolution yields a 20% performance increase onWebVoyagerandWebWalker. Most strikingly, the generatedworld knowledgeeven enables a compact 14B Qwen3 model to outperform the unassistedGemini-2.5-Flash, establishing a new paradigm for truly evolving agents.

View arXiv pageView PDFGitHub1Add to collection

Get this paper in your agent:

hf papers read 2604\.18131

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2604.18131 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2604.18131 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2604.18131 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

arXiv cs.CL

CoEvolve proposes an agent-data mutual evolution framework for training LLM agents through closed-loop, interaction-driven learning that adapts both the agent and its training data distribution. The method extracts feedback signals from rollout trajectories to guide LLM-based task synthesis, demonstrating significant improvements (15-19% absolute gains) across multiple Qwen models on AppWorld and BFCL benchmarks.

OpenSkill: Open-World Self-Evolution for LLM Agents

Hugging Face Daily Papers

OpenSkill is a framework for LLM agents to self-evolve skills and verification signals from open-world resources without target-task supervision, achieving high performance across benchmarks.

Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

arXiv cs.CL

This paper investigates why LLM agents suffer from progressive capability collapse under multi-iteration experience internalization and proposes a robust recipe addressing experience granularity, injection patterns, and training regime. Key findings include that principle-level experience, step-wise injection, and off-policy context-distillation yield more stable and sustainable continual learning.