FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

arXiv cs.AI Papers

Summary

FORGE is a protocol that lets LLM agents evolve prompt-injected memory through population broadcast, without weight updates, by converting failed trajectories into reusable knowledge artifacts. On the CybORG CAGE-2 network-defense task, it improves average evaluation return by 1.7-7.7× over a zero-shot baseline and by 29-72% over a Reflexion baseline across four LLM families.

arXiv:2605.16233v1. Announce type: new.

Abstract: Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7× over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (returns below -100) to as low as ~1%. We find that (1) population broadcast is the critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with ~40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
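The two-loop structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Instance`, `run_stage`, `broadcast`, and `forge_sketch` are hypothetical, the environment and reflection agent are passed in as plain callables rather than LLM calls, and both the failure test (return below a threshold) and the graduation criterion (stage-average return at or above a threshold) are assumptions, since the abstract does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One agent instance with its own prompt-injected memory (hypothetical type)."""
    memory: list = field(default_factory=list)
    graduated: bool = False
    last_score: float = float("-inf")


def run_stage(instances, run_episode, reflect, episodes_per_stage, failure_threshold):
    """Inner loop: each active instance acts, then reflects on failed episodes."""
    for inst in instances:
        if inst.graduated:
            continue  # frozen instances keep their memory and stop learning
        returns = []
        for _ in range(episodes_per_stage):
            ret, trajectory = run_episode(inst.memory)
            returns.append(ret)
            if ret < failure_threshold:  # assumed definition of a "failed" trajectory
                inst.memory.append(reflect(trajectory, inst.memory))
        inst.last_score = sum(returns) / len(returns)


def broadcast(instances):
    """Outer loop: copy the best-scoring instance's memory to every active instance."""
    best = max(instances, key=lambda inst: inst.last_score)
    for inst in instances:
        if inst is not best and not inst.graduated:
            inst.memory = list(best.memory)  # copy, so memories evolve independently
    return best


def forge_sketch(run_episode, reflect, population=4, stages=3, episodes_per_stage=5,
                 failure_threshold=-100.0, graduation_score=-20.0):
    """Run staged learning with broadcast and an assumed threshold-based graduation."""
    instances = [Instance() for _ in range(population)]
    for _ in range(stages):
        if all(inst.graduated for inst in instances):
            break  # nothing left to train
        run_stage(instances, run_episode, reflect, episodes_per_stage, failure_threshold)
        broadcast(instances)
        for inst in instances:
            if inst.last_score >= graduation_score:  # assumed graduation criterion
                inst.graduated = True
    return max(instances, key=lambda inst: inst.last_score)
```

In this sketch, `run_episode(memory)` is expected to return a `(return, trajectory)` pair and `reflect(trajectory, memory)` a new memory item as a string. Removing the graduation step leaves the broadcast path unchanged, which mirrors the shape of the no-graduation ablation mentioned in the abstract.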
