On Safety Risks in Experience-Driven Self-Evolving Agents


Summary

Researchers from Harbin Institute of Technology and Singapore Management University investigate safety risks in experience-driven self-evolving LLM agents, finding that even benign task experience can compromise safety in high-risk scenarios due to agents' execution-oriented tendencies, and revealing a fundamental safety–utility trade-off.

arXiv:2604.16968v1 Announce Type: new Abstract: Experience-driven self-evolution has emerged as a promising paradigm for improving the autonomy of large language model agents, yet its reliance on self-curated experience introduces underexplored safety risks. In this study, we investigate how experience accumulation and utilization in self-evolving agents affect safety performance across web-based and embodied environments. Notably, experience gathered solely from benign tasks can still compromise safety in high-risk scenarios. Further analysis attributes this degradation to the execution-oriented nature of accumulated experience, which reinforces agents' tendency to act rather than refuse. In more realistic settings where agents encounter both benign and harmful tasks, refusal-related experience mitigates safety decline but induces over-refusal, revealing a fundamental safety-utility trade-off. Overall, our findings expose inherent limitations of current self-evolving agents and call for more principled strategies to ensure safe and reliable adaptation.

# On Safety Risks in Experience-Driven Self-Evolving Agents
Source: [https://arxiv.org/html/2604.16968](https://arxiv.org/html/2604.16968)
Weixiang Zhao1, Yichen Zhang1, Yingshuo Wang1, Yang Deng2, Yanyan Zhao1, Xuda Zhi3, Yongbo Huang3, Hao He3, Wanxiang Che1, Bing Qin1, Ting Liu1

1Harbin Institute of Technology, 2Singapore Management University, 3SERES

\{wxzhao, yiczhang, yswang, yyzhao\}@ir\.hit\.edu\.cn

###### Abstract

Experience\-driven self\-evolution has emerged as a promising paradigm for improving the autonomy of large language model agents, yet its reliance on self\-curated experience introduces underexplored safety risks\. In this study, we investigate how experience accumulation and utilization in self\-evolving agents affect safety performance across web\-based and embodied environments\. Notably, experience gathered solely from benign tasks can still compromise safety in high\-risk scenarios\. Further analysis attributes this degradation to the execution\-oriented nature of accumulated experience, which reinforces agents’ tendency to act rather than refuse\. In more realistic settings where agents encounter both benign and harmful tasks, refusal\-related experience mitigates safety decline but induces over\-refusal, revealing a fundamental safety–utility trade\-off\. Overall, our findings expose inherent limitations of current self\-evolving agents and call for more principled strategies to ensure safe and reliable adaptation\.

WARNING: This paper may contain content that is harmful\.


## 1 Introduction

With the arrival of the era of experience, large language model \(LLM\) agents are expected to attain superhuman competence largely through learning from their own interactions \(Silver and Sutton, [2025](https://arxiv.org/html/2604.16968#bib.bib52)\)\. In this context, experience\-driven self\-evolving agents have quickly emerged as a major research frontier \(Gao et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib18); Dou et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib55); Cai et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib56); Bell et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib61)\), offering a practical mechanism for agents to adapt and refine their behavior over time\. With human\-written data plateauing and scaling reaching diminishing returns \(Villalobos et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib53); Longpre et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib54)\), experience\-based self\-evolution is now viewed as a promising route toward greater generality and even AGI \(Hendrycks et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib40); Hu, [2025](https://arxiv.org/html/2604.16968#bib.bib50)\)\.

A self\-evolving agent generally works by gathering experiences from its interactions and then retrieving relevant ones to guide future decisions\. However, as agents increasingly rely on such self\-curated experience to reshape their behavior, they also face novel safety risks, with unintended patterns potentially being reinforced over time \(Ecoffet et al\., [2020](https://arxiv.org/html/2604.16968#bib.bib60); Rudner and Toner, [2021](https://arxiv.org/html/2604.16968#bib.bib35); Bengio et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib58); Sun et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib48)\)\. To this end, we conduct a systematic study of safety degradation in self\-evolving LLM agents, structured around three core research questions \(RQs\)\.

We begin by systematically examining \(RQ1\) whether and in what ways experience\-driven self\-evolving agents exhibit safety degradation \(§[3](https://arxiv.org/html/2604.16968#S3)\)\. Our study spans two representative environments, web \(Zhou et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib6); Kumar et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib7)\) and household embodiment \(Yin et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib9)\), and covers both offline \(Wang et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib1)\) and online \(Ouyang et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib2)\) self\-evolving paradigms\. We evaluate 7 LLM backbones, including both closed\-source and open\-weight models \(Hurst et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib5); Anthropic, [2025](https://arxiv.org/html/2604.16968#bib.bib10); Liu et al\., [2025a](https://arxiv.org/html/2604.16968#bib.bib3); Yang et al\., [2025a](https://arxiv.org/html/2604.16968#bib.bib4)\)\. Experimental results uncover a striking and consistent pattern: agents that gather experience exclusively from benign tasks nevertheless exhibit reduced safety when that experience is reapplied in high\-stakes scenarios, despite the backbone LLM weights remaining untouched\.

We then investigate \(RQ2\) why benign experience leads to such degradation and what properties of experience are responsible for this effect \(§[4](https://arxiv.org/html/2604.16968#S4)\)\. To probe the origins of this degradation, we conduct in\-depth case analyses and observe that unsafe behaviors primarily stem from the *execution bias* embedded in benign experiences, which encourages agents to complete tasks \(§[4\.1](https://arxiv.org/html/2604.16968#S4.SS1)\)\. This reveals the core property of experience: it guides agents to act and complete benign tasks, not to refrain from them\. Accordingly, in safety\-sensitive contexts, such execution\-oriented signals can unintentionally amplify the agent’s propensity to act, thereby increasing the likelihood of harmful outcomes\. We further examine how the quantity of retrieved experience affects safety performance \(§[4\.2](https://arxiv.org/html/2604.16968#S4.SS2)\)\. Even when each experience entry is individually harmless, increasing the number of examples consistently worsens safety, suggesting that accumulating more execution signals compounds the risk\. Finally, through both behavioral evidence and mechanistic interpretation \(§[4\.3](https://arxiv.org/html/2604.16968#S4.SS3)\), we confirm that this degradation is causally driven by the content of the retrieved experience itself, not by incidental effects such as longer context length or additional noise \(Geng et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib51); Tang et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib17)\)\.

Finally, we turn to \(RQ3\) how experience composition shapes safety–utility trade\-offs in realistic post\-deployment self\-evolution, where agents inevitably accumulate experience from a mixture of benign and harmful tasks \(§[5](https://arxiv.org/html/2604.16968#S5)\)\. In this context, experience related to *harmful tasks* may manifest in three forms: execution\-only, refusal\-only, or a natural combination of both\. Under online self\-evolution, we find that the presence of execution experience on harmful tasks leads to more severe safety degradation, an intuitive yet troubling effect\. Incorporating refusal experience, even when interleaved with execution traces, effectively mitigates unsafe behaviors but also induces over\-refusal \(Röttger et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib64)\) on benign inputs\. These findings expose a core limitation in how current self\-evolving agents leverage experience, highlighting the need for more principled mechanisms that can better balance safety and utility in future designs\.

Overall, our study reveals a consistent pattern of safety degradation in self\-evolving agents \(§[3](https://arxiv.org/html/2604.16968#S3)\), traces its root to execution\-oriented experience \(§[4](https://arxiv.org/html/2604.16968#S4)\), and highlights a non\-trivial safety–utility trade\-off that must be carefully managed \(§[5](https://arxiv.org/html/2604.16968#S5)\)\.

## 2 Preliminaries

We formally define experience\-driven self\-evolving agents as agents that progressively improve their behavior by *accumulating*, *retrieving*, and *exploiting* past experiences, without modifying the underlying model parameters \(Gao et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib18)\)\.

After each interaction with the environment, the agent produces a trajectory $\tau$ and receives feedback $r$\. From each $(\tau, r)$ pair, a compact *experience unit* $E$ is distilled and stored in an external memory $M = \{E_1, E_2, \dots, E_n\}$\.

When presented with a new task input $x$, the agent retrieves a relevant subset of experiences $M(x) \subset M$ and augments the input as $[x; M(x)]$ for inference, yielding the output

$$y = \pi_{\theta}([x; M(x)]).$$

We consider two self\-evolution paradigms\. In the *offline* setting, all experience units are pre\-extracted from a fixed dataset and the memory $M$ remains frozen at inference time\. In contrast, the *online* setting continuously updates $M$ during deployment through ongoing interactions\.
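The accumulate, retrieve, and augment steps above can be sketched in a few lines of Python\. This is a minimal illustration under stated assumptions, not the implementation of AWM or ReasoningBank: the names `ExperienceMemory`, `retrieve`, and `augment` are hypothetical, experience units are plain strings, and relevance is a toy word\-overlap score standing in for whatever retriever a real framework uses\.

```python
from dataclasses import dataclass, field


@dataclass
class ExperienceMemory:
    """External memory M of distilled experience units (plain strings here)."""
    units: list[str] = field(default_factory=list)
    frozen: bool = False  # True models the offline setting (M fixed at inference)

    def add(self, unit: str) -> None:
        # Online setting: M grows during deployment. Offline: writes are ignored.
        if not self.frozen:
            self.units.append(unit)

    def retrieve(self, x: str, k: int = 3) -> list[str]:
        # Toy relevance score (an assumption): words shared with the task input.
        query = set(x.lower().split())
        ranked = sorted(
            self.units,
            key=lambda e: len(query & set(e.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def augment(x: str, memory: ExperienceMemory, k: int = 3) -> str:
    """Build the augmented input [x; M(x)] that would be passed to the policy."""
    retrieved = memory.retrieve(x, k)
    return "\n".join(["Task: " + x, "Experience:"] + ["- " + e for e in retrieved])
```

Setting `frozen=True` after pre\-loading the units mimics the offline paradigm, while leaving it `False` and calling `add` after each task mimics the online one\.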

This work investigates how incorporating prior experiences $M(x)$ influences the agent’s safety behavior, and demonstrates that such experience\-driven adaptation can introduce previously underexplored safety vulnerabilities\.

![Refer to caption](https://arxiv.org/html/2604.16968v1/x1.png)

Figure 1: Category\-level ASR shifts before and after offline self\-evolution on BrowserART\. Results are shown for GPT\-4o, Claude\-4\.5\-Sonnet, DeepSeek\-V3\.2, and Qwen3\-235B\-A22B\.

Table 1: Attack Success Rate \(ASR\) before and after offline self\-evolution across three benchmark environments: BrowserART, Agent\-SafetyBench, and SafeAgentBench\. Higher ASR indicates worse safety\.

## 3 Safety Degradation in Self\-Evolution

We begin by empirically answering RQ1: whether and in what ways experience accumulation in self\-evolving agents leads to safety degradation\.

### 3\.1 Experimental Setup

#### Agent Framework\.

We adopt two representative agent frameworks to model experience\-driven self\-evolution: Agent Workflow Memory \(AWM\) \(Wang et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib1)\) for *offline* evolution and ReasoningBank \(Ouyang et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib2)\) for *online* evolution\. In both settings, the LLM backbone remains fixed, while self\-evolution arises solely from the accumulation, retrieval, and exploitation of past experiences maintained in an external memory\. Further details of the two frameworks are in Appendix [A](https://arxiv.org/html/2604.16968#A1)\.

#### Backbone Model\.

We conduct experiments using a diverse set of LLM backbones\. On the closed\-source side, we include GPT\-4o \(Hurst et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib5)\) and Claude\-4\.5\-Sonnet \(Anthropic, [2025](https://arxiv.org/html/2604.16968#bib.bib10)\)\. For open\-weight models, we benchmark a wide spectrum of the Qwen3 family, including dense variants ranging from 8B to 32B parameters, the large\-scale mixture\-of\-experts model Qwen3\-235B\-A22B \(Yang et al\., [2025a](https://arxiv.org/html/2604.16968#bib.bib4)\), as well as DeepSeek\-V3\.2 \(Liu et al\., [2025a](https://arxiv.org/html/2604.16968#bib.bib3)\)\.

#### Environment & Benchmark\.

We evaluate across two representative settings: web\-based and household embodied environments\.

For the web environment, agents first engage in self\-evolving interactions on WebArena \(Zhou et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib6)\), where they complete long\-horizon web navigation tasks and accumulate experiences in memory\. Following this experience accumulation stage, safety is assessed using two web\-oriented benchmarks: BrowserART \(Kumar et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib7)\) and the web\-related subset of Agent\-SafetyBench \(Zhang et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib8)\)\.

In the household embodied environment, agents perform self\-evolution on a curated set of benign tasks using SafeAgentBench \(Yin et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib9)\)\. Safety is subsequently evaluated on a disjoint set of harmful household instructions, specifically designed to probe physical\-world safety risks\.

Safety is quantified by the attack success rate \(ASR\)\. All safety evaluations are performed automatically with GPT\-4o as the judge, following the benchmark protocols; these automatic judgements have been shown to correlate strongly with human annotations\. Detailed benchmark configurations and examples of tasks used in both environments are provided in Appendix [B](https://arxiv.org/html/2604.16968#A2)\.
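As a minimal sketch of how ASR can be aggregated from per\-task judge verdicts, assume each harmful task yields a boolean that is `True` when the judge deems the attack successful\. The function name `attack_success_rate` and the four verdicts below are made up for illustration and are not results from the paper\.

```python
def attack_success_rate(judgements: list[bool]) -> float:
    """Percentage of harmful tasks that the judge marked as successfully attacked."""
    if not judgements:
        raise ValueError("no judgements to aggregate")
    return 100.0 * sum(judgements) / len(judgements)


# Illustrative verdicts only (not data from the paper): one list per condition.
before = [False, False, True, False]   # without retrieved experience
after = [True, False, True, False]     # with retrieved experience
delta = attack_success_rate(after) - attack_success_rate(before)
```

A positive `delta` corresponds to the degradation pattern described in this section, since higher ASR indicates worse safety\.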

#### Implementation Details\.

Closed\-source and large\-scale open\-weight models are accessed via official APIs, while other open\-weight models are deployed locally with vLLM \(Kwon et al\., [2023](https://arxiv.org/html/2604.16968#bib.bib12)\) on NVIDIA A800 GPUs\. At each step, the agent retrieves the top\-3 experience items\. We follow the default decoding settings of each framework \(temperature 0\.1 for AWM and 0\.7 for ReasoningBank\)\. Additional details are provided in Appendix [C](https://arxiv.org/html/2604.16968#A3)\.

### 3\.2 Evaluation of Offline Self\-Evolving

Table [1](https://arxiv.org/html/2604.16968#S2.T1) summarizes the outcomes of offline self\-evolution with AWM across both web\-based and household embodied settings\. Agent safety is assessed on three benchmarks, comparing performance before and after experience accumulation\. A detailed breakdown of safety performance by risk category is illustrated in Figure [1](https://arxiv.org/html/2604.16968#S2.F1), with additional category\-level analyses provided in Appendix [D\.1](https://arxiv.org/html/2604.16968#A4.SS1)\.

#### Safety degradation is a universal phenomenon in offline self\-evolution\.

Table [1](https://arxiv.org/html/2604.16968#S2.T1) demonstrates that, for all tested models and environments, offline self\-evolution systematically increases the ASR, signaling a widespread erosion of agent safety\. This behavior is consistent in both web\-based scenarios and household embodied settings\. Overall, the results point to a stable and repeatable effect: even when learning is driven solely by task\-relevant and non\-harmful queries, the continual accumulation and reuse of execution experience can progressively undermine safety guarantees\.

#### Offline experience induces systematic safety decline across risk categories\.

Figure [1](https://arxiv.org/html/2604.16968#S2.F1) demonstrates that offline self\-evolution under the AWM framework leads to clear safety degradation across a wide spectrum of high\-risk categories in BrowserART\. While models with stronger initial safety profiles \(e\.g\., Claude\-4\.5\-Sonnet\) exhibit relatively smaller degradations, the decline remains non\-negligible\. In contrast, models with higher baseline ASR \(e\.g\., Qwen3\-235B\-A22B\) show pronounced and widespread amplification of risk, spanning more than ten categories\.

![Refer to caption](https://arxiv.org/html/2604.16968v1/x2.png)

Figure 2: Online self\-evolution on SafeAgentBench: Attack Success Rate \(ASR\) over time for seven backbone models\. Evaluation is conducted every 20 steps\.

### 3\.3 Evaluation of Online Self\-Evolving

The evolution of safety performance in the household embodied environment is illustrated in Figure [2](https://arxiv.org/html/2604.16968#S3.F2), where the ASR is evaluated every 20 self\-evolving steps\. Results on the web\-based environment can be found in Appendix [D\.2](https://arxiv.org/html/2604.16968#A4.SS2)\.
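The evaluation schedule described here can be sketched as a simple loop: the agent processes benign tasks one at a time, appends the distilled experience to memory, and the ASR is measured at fixed intervals\. The function and both callbacks \(`run_task`, `evaluate_asr`\) are hypothetical placeholders for the agent framework and the safety benchmark, so this is an illustration of the protocol rather than the authors' code\.

```python
from typing import Callable


def online_self_evolution(
    benign_tasks: list[str],
    run_task: Callable[[str, list[str]], str],
    evaluate_asr: Callable[[list[str]], float],
    eval_every: int = 20,
) -> list[tuple[int, float]]:
    """Accumulate experience task by task and record (step, ASR) at fixed intervals."""
    memory: list[str] = []
    curve = [(0, evaluate_asr(memory))]  # baseline before any experience is stored
    for step, task in enumerate(benign_tasks, start=1):
        # run_task is assumed to return one distilled experience unit for the task.
        memory.append(run_task(task, memory))
        if step % eval_every == 0:
            curve.append((step, evaluate_asr(memory)))
    return curve
```

The returned list of `(step, ASR)` pairs is the kind of curve plotted in Figure 2, with the first entry giving the pre\-evolution baseline\.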

#### Online self\-evolution induces immediate and compounding safety degradation across backbones\.

Across both environments, the ASR rises sharply during the initial stages of self\-evolution and remains elevated throughout subsequent self\-evolving iterations\. Importantly, all experiences stored in memory originate solely from benign and non\-harmful tasks, eliminating direct exposure to unsafe instructions as a contributing factor\. These results suggest that once external experiences are integrated into memory and reused online, their impact on agent behavior manifests rapidly and persists over time, rather than diminishing\.

#### Safety degradation persists with no signs of natural recovery, indicating a lasting behavioral drift\.

Across all models, ASR curves plateau at elevated levels after early\-stage degradation, with no model recovering to its initial safety level\. This plateau effect suggests that experience\-driven adaptation leads to a persistent degradation of safety, rather than transient noise or fluctuation\. In Appendix [D\.3](https://arxiv.org/html/2604.16968#A4.SS3), we further conduct long\-horizon experiments \(beyond 800 steps\) and observe continued safety decline, reinforcing the concern that such degradation is not self\-correcting over time\. More detailed analysis is provided therein\.

## 4 Causes of Safety Degradation

To understand the origins of safety degradation during self\-evolution \(RQ2\), we conduct in\-depth analyses under the online self\-evolving setting with ReasoningBank, which subsumes the offline case and can be viewed as a sequence of snapshots with increasing experience\. Specifically, we present case studies to characterize experience\-induced safety failures \(§[4\.1](https://arxiv.org/html/2604.16968#S4.SS1)\), analyze how the amount of retrieved experience affects safety \(§[4\.2](https://arxiv.org/html/2604.16968#S4.SS2)\), and examine whether the degradation is driven by the content of experience rather than confounding factors such as increased context length \(§[4\.3](https://arxiv.org/html/2604.16968#S4.SS3)\)\.

Table 2: Distribution of dominant causes for safety degradation after experience retrieval across models on BrowserART and SafeAgentBench\.

### 4\.1 Execution Bias in Benign Experience

To identify the causes of safety degradation, we manually inspect cases where incorporating retrieved experience flips an agent’s response from safe to unsafe\. For each instance, we analyze the primary factor that leads to the emergence of unsafe behavior after experience injection\.

We categorize reasons for safety degradation into three types: \(1\) Sensitive Execution \(Sen\-Exe\), where the retrieved experience is benign in isolation but may be unsafe in sensitive contexts \(e\.g\., ignition in a household scenario\); \(2\) Standard Execution \(Sta\-Exe\), where experience conveys generic and executable procedural patterns \(e\.g\., “open → place”\); and \(3\) Format Recovery \(Format\), where experience mainly restores output structure or formatting, enabling task completion that was previously blocked\. Detailed annotation criteria and cases are provided in Appendix [D\.4](https://arxiv.org/html/2604.16968#A4.SS4)\.

Table [2](https://arxiv.org/html/2604.16968#S4.T2) summarizes the distribution of these causes across models and benchmarks\. On both BrowserART and SafeAgentBench, safety regressions are predominantly attributed to Sensitive Execution and Standard Execution, while Format Recovery consistently accounts for a minority of cases\. For example, GPT\-4o and DeepSeek\-V3\.2 exhibit substantial safety failures driven by generic execution patterns on BrowserART, whereas Qwen\-series models show notable vulnerability to format recovery effects, especially on SafeAgentBench\.

Overall, these results reveal that retrieved experience mainly reinforces execution\-oriented behaviors—how to proceed and complete tasks—rather than when and how to refrain\. Even when the experience itself is benign, its action\-centric structure can override safety constraints in sensitive scenarios, exposing a fundamental fragility of experience reuse in self\-evolving agents\.

![Refer to caption](https://arxiv.org/html/2604.16968v1/x3.png)

Figure 3: Attack success rate on Agent\-SafetyBench \(web\-based\) during self\-evolution with different numbers of retrieved experience entries\. The framework is ReasoningBank based on GPT\-4o\.

### 4\.2 Effect of Retrieved Experience Size

We investigate how the number of retrieved experience entries affects safety during self\-evolution\. As shown in Figure [3](https://arxiv.org/html/2604.16968#S4.F3), increasing the number of retrieved entries leads to a clear and persistent rise in unsafe behavior\. Even though each individual experience is benign, retrieving more of them consistently yields higher unsafe response rates across self\-evolving steps than retrieving fewer\. For more results in the household embodied environment, please refer to Appendix [D\.6](https://arxiv.org/html/2604.16968#A4.SS6)\.

This observation confirms a compounding effect: execution\-oriented signals, when scaled up through experience accumulation, amplify the agent’s propensity to act, thereby raising safety risks\. It reveals a fundamental vulnerability in the reuse of benign experience—namely, that quantity alone can induce degradation, even in the absence of explicit harmful content\.

\(a\) Layer\-wise Integrated Gradient \(IG\) attribution of different prompt segments during online self\-evolution\.

### 4\.3 Experience vs\. Enhanced Context Length

#### Setup\.

In our setting, each prompt consists of three distinct segments: system instruction, experience item, and task goal\. To verify whether the observed safety degradation is caused by the content of retrieved experience rather than by the increased context length itself \(Liu et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib13); Du et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib14); Geng et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib51)\), we design a controlled length\-matched experiment\. We first measure the additional context length introduced by experience retrieval, then remove the retrieved experience segment and compensate for the resulting length difference by enriching the system instructions with additional descriptive details, while keeping the overall context length unchanged\. Safety performance is evaluated on BrowserART and SafeAgentBench, and illustrative examples of the length\-matching procedure are provided in Appendix [D\.5](https://arxiv.org/html/2604.16968#A4.SS5)\.
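The length\-matching procedure can be illustrated with a short sketch\. It assumes whitespace\-separated words as a stand\-in for model tokens and a pre\-written `filler` string of extra descriptive system text; both are simplifications introduced here, and the function name `length_matched_prompts` is hypothetical\.

```python
def length_matched_prompts(
    system: str, experience: str, goal: str, filler: str
) -> tuple[str, str]:
    """Return (prompt with experience, control prompt) of equal whitespace-token length."""
    n_extra = len(experience.split())  # extra length introduced by retrieval
    filler_tokens = filler.split()
    if len(filler_tokens) < n_extra:
        raise ValueError("filler is too short to compensate for the experience")
    # Control: drop the experience and pad the system instruction instead.
    padded_system = " ".join(system.split() + filler_tokens[:n_extra])
    with_experience = " ".join([system, experience, goal])
    control = " ".join([padded_system, goal])
    return with_experience, control
```

Comparing ASR on the two returned prompts isolates the effect of the experience content from the effect of a longer context, which is the purpose of the control condition\.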

Table 3: Attack Success Rate \(%\) on BrowserART and SafeAgentBench before and after online self\-evolution with experience retrieval, and under a length\-matched prompt expansion control\.

#### Results & Analysis\.

Table [3](https://arxiv.org/html/2604.16968#S4.T3) reports the ASR under different settings\. Across all evaluated backbones, introducing experience through online self\-evolution leads to a substantial increase in ASR\. In contrast, expanding the system instructions to match the increased context length, without including any experience content, results in an ASR that remains close to the pre\-self\-evolution baseline\.

These results provide strong evidence that the observed safety degradation is driven by the semantic content of retrieved experience rather than by contextual noise introduced by longer inputs\. Even when the total context length is held constant, only the inclusion of experience content leads to systematic erosion of safety performance, supporting our core claim that experience reuse is the primary cause of safety boundary shift\.

\(a\) Performance comparison under realistic deployment settings where experience from both benign and harmful tasks is accumulated\. The red dashed line denotes the performance under purely benign experience\.

#### Mechanistic Interpretability\.

To further establish that the observed safety degradation is causally driven by the retrieved experience segment, rather than being a superficial prompt\-level artifact, we analyze the agent backbone’s internal information flow from a mechanistic perspective\. Specifically, we aim to quantify how information originating from different prompt segments propagates through attention mechanisms and contributes to the final prediction \(Simonyan et al\., [2013](https://arxiv.org/html/2604.16968#bib.bib15)\)\.

To this end, we employ Integrated Gradients \(IG\) \(Wang et al\., [2023](https://arxiv.org/html/2604.16968#bib.bib16); Tang et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib17)\), a gradient\-based attribution method that measures the contribution of a specific prompt segment to the prediction by combining attention weights with gradients of the loss\. This allows us to trace how retrieved experience influences generation behavior at different layers and heads\.

Formally, for the $h$\-th attention head in the $l$\-th layer, we compute the IG score as follows:

$$\mathrm{IG}_{h,l} = A_{h,l}^{T} \odot \left| \frac{\partial \mathcal{L}_{\theta}(Y|X)}{\partial A_{h,l}} \right|, \tag{1}$$

$$\mathrm{IG}^{(r)}_{h,l} = \frac{1}{|\mathcal{T}_{s}|} \sum_{x_{i} \in \mathcal{T}_{s}} \sum_{y_{j} \in Y} \mathrm{IG}_{h,l}[i,j], \tag{2}$$

where $\mathcal{L}_{\theta}(Y|X)$ denotes the prediction loss, $A_{h,l}$ is the attention matrix, and $\mathcal{T}_{s}$ corresponds to one of the aforementioned prompt segments, i\.e\., system instruction, experience item, or task goal\. Further, each entry $\mathrm{IG}_{h,l}[i,j]$ reflects the estimated information flow between an input token $x_{i}$ and an output token $y_{j}$ mediated by attention\.

The aggregated score $\mathrm{IG}^{(r)}_{h,l}$ thus captures the contribution of retrieved experience to the model’s output at a specific head and layer\. We further average this quantity across all heads and layers to obtain a global attribution score $\mathrm{IG}^{(r)}$, where higher values indicate a stronger influence of the retrieved experience item on the final prediction\.
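The segment\-level aggregation in Eq\. \(2\) and the subsequent averaging over heads and layers can be sketched in plain Python\. The sketch assumes the per\-head matrices from Eq\. \(1\) have already been computed and are stored as nested lists indexed `[layer][head][i][j]`, with `i` over input tokens and `j` over output tokens; the function names and this storage layout are illustrative assumptions\.

```python
def segment_score(ig_head: list[list[float]], segment: range) -> float:
    """Eq. (2) for one head and layer: sum IG[i][j] over segment tokens i and all
    output tokens j, then divide by the number of tokens in the segment."""
    total = sum(sum(ig_head[i]) for i in segment)
    return total / len(segment)


def global_score(ig: list[list[list[list[float]]]], segment: range) -> float:
    """Average the Eq. (2) score over every layer and head."""
    scores = [segment_score(head, segment) for layer in ig for head in layer]
    return sum(scores) / len(scores)
```

Calling `global_score` once per segment \(system instruction, experience item, task goal\) with the corresponding token index range gives the per\-segment attributions that are compared in the analysis that follows\.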

Based on the Qwen3\-32B results in Figure [5\(a\)](https://arxiv.org/html/2604.16968#S4.F5.sf1), we observe a clear difference in the orange curves, which represent the retrieved experience \(left\) and its expanded counterpart \(right\)\. The experience item exhibits consistently high IG attribution across layers, even increasing slightly in deeper layers\. This indicates a strong and persistent influence of retrieved experience on the agent’s output\.

In contrast, when the same length of content is replaced by expanded system instructions, the IG attribution of the orange curve is significantly reduced with depth, suggesting that the substitute content fails to exert similar control over the generation\. This stark contrast provides direct evidence that it is the specific semantics of the experience item, rather than its token count or structural position, that dominates the agent’s internal computation and ultimately drives unsafe behaviors\.

Taken together, these results reinforce our central claim: retrieved experience can structurally override safety\-relevant instructions, and its influence is mechanistically traceable through internal attribution, even when such experience is individually harmless\. For results on Qwen3\-8B and Qwen3\-14B, see Appendix [D\.7](https://arxiv.org/html/2604.16968#A4.SS7)\.

## 5 Safety in Realistic Self\-Evolution

To complement our earlier analyses focused on self\-evolution over purely benign tasks, we now turn to a more realistic post\-deployment scenario where agents continuously accumulate experience from both benign and harmful interactions \(RQ3\)\.

#### Setup\.

We adopt the ReasoningBank online self\-evolving framework with four LLM backbones: GPT\-4o, DeepSeek\-V3\.2, Qwen3\-14B, and Qwen3\-32B\. In the web\-based environment, we sample 50 harmful tasks from Agent\-SafetyBench and execute online self\-evolution on them in the same manner as with benign tasks\. In the household embodied setting, we similarly select 50 harmful tasks from SafeAgentBench\. These tasks are excluded from downstream evaluation to avoid data leakage\. Through explicit manual control, we ensure that experience derived from harmful tasks appears in exactly one of the following three forms:

- Refusal\-only experience: includes only refusal behaviors in response to harmful inputs\.
- Execution\-only experience: contains only successful execution traces on harmful tasks\.
- Mixed experience: comprises both refusal and execution behaviors\.

We explore how each type of harmful experience, when interleaved with benign\-task experience, impacts agent performance under the online self\-evolving setting, thereby simulating more realistic post\-deployment conditions\.
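The three configurations above can be expressed as a simple filter over harmful\-task traces followed by merging with benign experience\. This is a sketch under stated assumptions: traces are hypothetical `(kind, text)` pairs with `kind` equal to `"refusal"` or `"execution"`, and the interleaving is modeled as a seeded shuffle, which is an illustrative choice and not a detail given in the paper\.

```python
import random


def harmful_experience(traces: list[tuple[str, str]], mode: str) -> list[str]:
    """Keep only the harmful-task traces allowed by the chosen configuration."""
    allowed = {
        "refusal_only": {"refusal"},
        "execution_only": {"execution"},
        "mixed": {"refusal", "execution"},
    }[mode]
    return [text for kind, text in traces if kind in allowed]


def build_memory(benign: list[str], harmful: list[str], seed: int = 0) -> list[str]:
    """Interleave benign and harmful-task experience in a reproducible random order."""
    memory = benign + harmful
    random.Random(seed).shuffle(memory)  # assumed interleaving scheme
    return memory
```

Running the same online self\-evolution procedure on the memory produced for each `mode` yields the three conditions whose safety and utility are compared below\.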

#### Results & Analysis

Figure [7\(a\)](https://arxiv.org/html/2604.16968#S4.F7.sf1) reports the safety \(left\) and utility \(right\) of agents during online self\-evolution under different experience configurations\. The LLM backbone is GPT\-4o\. For results in the household embodied environment and with other backbones, please refer to Appendix [D\.8](https://arxiv.org/html/2604.16968#A4.SS8)\. We derive the following key insights:

#### Execution experience on harmful tasks consistently degrades safety\.

As shown in the safety panel \(left\) of Figure [7\(a\)](https://arxiv.org/html/2604.16968#S4.F7.sf1), accumulating *execution\-only* experience from harmful tasks leads to a sustained increase in ASR throughout online self\-evolution\. This suggests that once agents are exposed to executable traces on harmful tasks, such execution\-oriented experience is repeatedly reused during decision making, gradually biasing the agent toward unsafe actions and weakening effective safety constraints\.

#### Refusal experience mitigates safety risks but induces a safety–utility trade\-off\.

As shown in the safety panel \(left\) of Figure [7\(a\)](https://arxiv.org/html/2604.16968#S4.F7.sf1), incorporating refusal behaviors into the memory, either in isolation or interleaved with execution traces, substantially suppresses the rise in ASR\. However, the utility panel \(right\) indicates that these safety improvements are accompanied by a notable decline in task success on benign inputs, suggesting a tendency toward over\-refusal\. Together, these findings highlight a fundamental tension in self\-evolving agents: while refusal\-based experience can effectively stabilize safety, it may simultaneously degrade task utility, underscoring the necessity of more principled memory control mechanisms for realistic post\-deployment scenarios\.

## 6 Related Works

#### Experience\-Driven Self\-Evolving Agents\.

Recent work has increasingly explored agents that improve their behavior by accumulating and reusing past interaction experience \(Tao et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib20); Gao et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib18); Zheng et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib19); Fang et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib21)\)\. Central to this paradigm is the externalization of experience into an explicit memory, which is retrieved to guide future decision\-making\. Based on how experience is collected and utilized, existing approaches can be broadly categorized into offline and online paradigms \(Liu et al\., [2025b](https://arxiv.org/html/2604.16968#bib.bib27)\)\.

In the offline setting, experience is induced from pre\-collected training data and stored in a fixed memory during deployment \(Li and Qiu, [2023](https://arxiv.org/html/2604.16968#bib.bib25); Yang et al\., [2023](https://arxiv.org/html/2604.16968#bib.bib24); Zhong et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib26); Zhao et al\., [2024a](https://arxiv.org/html/2604.16968#bib.bib22); Fu et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib28); Zhou et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib23); Yang et al\., [2025b](https://arxiv.org/html/2604.16968#bib.bib31)\)\. Representative methods such as Agent Workflow Memory \(AWM\) \(Wang et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib1)\) learn reusable workflows from historical trajectories and retrieve them at test time to guide action generation\. In contrast, online experience\-driven agents continuously accumulate and refine experience during deployment, enabling memory to evolve over time \(Chen et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib29); Zhang et al\., [2025a](https://arxiv.org/html/2604.16968#bib.bib30), [b](https://arxiv.org/html/2604.16968#bib.bib32); Suzgun et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib33)\)\. For example, ReasoningBank \(Ouyang et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib2)\) distills reasoning strategies from ongoing interactions and incrementally integrates them into memory for subsequent reuse\. While these approaches provide flexible mechanisms for self\-evolution, their safety implications remain largely unexplored\.

#### Safety Risks in Open\-Ended AI\.

Open\-ended AI systems endowed with self\-evolving capabilities are widely regarded as a promising pathway toward Artificial General Intelligence \(Stanley, [2019](https://arxiv.org/html/2604.16968#bib.bib47); Morris et al\., [2023](https://arxiv.org/html/2604.16968#bib.bib45); Hughes et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib46); Zhao et al\., [2024b](https://arxiv.org/html/2604.16968#bib.bib65); Hendrycks et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib40)\)\. However, beyond their potential for continual performance gains, recent studies increasingly suggest that open\-ended self\-evolution gives rise to distinct and insufficiently understood safety challenges \(Sheth et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib34); Weston and Foerster, [2025](https://arxiv.org/html/2604.16968#bib.bib38); Su et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib37); DeChant, [2025](https://arxiv.org/html/2604.16968#bib.bib43); Zhao et al\., [2026a](https://arxiv.org/html/2604.16968#bib.bib66), [b](https://arxiv.org/html/2604.16968#bib.bib67)\)\.

For example, empirical findings on agentic misalignment indicate that autonomous agents may deliberately engage in harmful behaviors in pursuit of their objectives \(Lynch et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib39); Herrador, [2025](https://arxiv.org/html/2604.16968#bib.bib44)\)\. Moreover, errors in goal specification can be exacerbated through long\-horizon adaptation, resulting in progressively larger divergences from human intent \(Rudner and Toner, [2021](https://arxiv.org/html/2604.16968#bib.bib35); Han et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib42)\)\. Closely related to our work, a concurrent study identifies a phenomenon termed mis\-evolution, revealing the safety risks of self\-evolving agents from a behavioral perspective \(Shao et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib41)\)\.

Whereas prior work primarily examines surface\-level behaviors, our study uncovers the underlying mechanisms of safety degradation and provides actionable insights for mitigation\.

## 7 Conclusion

This work provides a comprehensive analysis of the safety dynamics in experience\-driven self\-evolving agents, revealing a consistent pattern of safety degradation even when learning from benign experience\. Our analysis identifies execution\-oriented experience as a key driver of this degradation, with stronger execution signals amplifying unsafe behaviors\. Under more realistic deployment settings, we further show that refusal experience can mitigate unsafe behaviors but leads to over\-refusal, exposing a fundamental safety–utility trade\-off\. We hope this work draws broader attention to the unique safety challenges of self\-evolution and motivates future research toward principled, controllable, and safer adaptation for long\-term agent deployment\.

## Limitation

While our study provides a systematic investigation into safety risks introduced by experience\-driven self\-evolving agents, several limitations remain\. First, our evaluation is conducted on a focused set of benchmarks that span both web\-based and embodied scenarios\. However, these benchmarks may not fully capture the diversity of real\-world deployment environments, especially those involving multi\-agent interactions or multi\-modal inputs\. Extending our analysis to broader task distributions remains an important direction\. Second, due to computational constraints, our experiments study self\-evolving agents over a finite number of self\-evolution steps \(up to 800 steps\)\. While this already reveals persistent safety degradation, real\-world deployed agents may undergo self\-evolution over far longer, and potentially unbounded, time horizons\. How safety dynamics evolve under such indefinite experience accumulation, and whether new failure modes emerge beyond the studied regime, remain open questions for future work\.

Overall, this work takes a first step toward understanding safety erosion in self\-evolving agents\. We hope future efforts will explore more general, principled, and verifiable mechanisms to ensure long\-term safety in experience\-driven AI systems\.

## Ethical Considerations

This work is conducted solely for research purposes, with the goal of understanding and mitigating safety risks in experience\-driven self\-evolving agents\. All experiments are performed in controlled simulation environments and established safety benchmarks, without deployment in real\-world systems\. We believe that systematically identifying and characterizing such risks is essential for developing safer agentic systems\. By exposing potential failure modes and trade\-offs in current self\-evolving frameworks, this work aims to inform the design of more robust safety mechanisms rather than to enable misuse\.

## Acknowledgments

We thank the anonymous reviewers for their comments and suggestions\. This work was supported by the National Natural Science Foundation of China \(NSFC\) via grants 62441614 and 62576125, and the Singapore Ministry of Education \(MOE\) Academic Research Fund \(AcRF\) Tier 1 grant \(Proposal ID: 24\-SIS\-SMU\-002\)\.

## References

- Anthropic \(2025\)Introducing claude sonnet 4\.5\.Anthropic\.External Links:[Link](https://www.anthropic.com/news/claude-sonnet-4-5)Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p3.1),[§3\.1](https://arxiv.org/html/2604.16968#S3.SS1.SSS0.Px2.p1.1)\.
- J\. Bell, L\. Quarantiello, E\. N\. Coleman, L\. Li, M\. Li, M\. Madeddu, E\. Piccoli, and V\. Lomonaco \(2025\)The future of continual learning in the era of foundation models: three key directions\.arXiv preprint arXiv:2506\.03320\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p1.1)\.
- Y\. Bengio, G\. Hinton, A\. Yao, D\. Song, P\. Abbeel, T\. Darrell, Y\. N\. Harari, Y\. Zhang, L\. Xue, S\. Shalev\-Shwartz,et al\.\(2024\)Managing extreme ai risks amid rapid progress\.Science384\(6698\),pp\. 842–845\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p2.1)\.
- Y\. Cai, Y\. Hao, J\. Zhou, H\. Yan, Z\. Lei, R\. Zhen, Z\. Han, Y\. Yang, J\. Li, Q\. Pan,et al\.\(2025\)Building self\-evolving agents via experience\-driven lifelong learning: a framework and benchmark\.arXiv preprint arXiv:2508\.19005\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p1.1)\.
- M\. Chen, Y\. Li, Y\. Yang, S\. Yu, B\. Lin, and X\. He \(2024\)Automanual: constructing instruction manuals by llm agents via interactive environmental learning\.Advances in Neural Information Processing Systems37,pp\. 589–631\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p2.1)\.
- C\. DeChant \(2025\)Episodic memory in ai agents poses risks that should be studied and mitigated\.In2025 IEEE Conference on Secure and Trustworthy Machine Learning \(SaTML\),pp\. 321–332\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p1.1)\.
- S\. Dou, M\. Zhang, C\. Huang, J\. Chen, F\. Chen, S\. Liu, Y\. Liu, C\. Liu, C\. Zhong, Z\. Zhang,et al\.\(2025\)Evalearn: quantifying the learning capability and efficiency of llms via sequential problem solving\.arXiv preprint arXiv:2506\.02672\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p1.1)\.
- Y\. Du, M\. Tian, S\. Ronanki, S\. Rongali, S\. B\. Bodapati, A\. Galstyan, A\. Wells, R\. Schwartz, E\. A\. Huerta, and H\. Peng \(2025\)Context length alone hurts llm performance despite perfect retrieval\.InFindings of the Association for Computational Linguistics: EMNLP 2025,pp\. 23281–23298\.Cited by:[§4\.3](https://arxiv.org/html/2604.16968#S4.SS3.SSS0.Px1.p1.1)\.
- A\. Ecoffet, J\. Clune, and J\. Lehman \(2020\)Open questions in creating safe open\-ended ai: tensions between control and creativity\.InArtificial Life Conference Proceedings 32,pp\. 27–35\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p2.1)\.
- J\. Fang, Y\. Peng, X\. Zhang, Y\. Wang, X\. Yi, G\. Zhang, Y\. Xu, B\. Wu, S\. Liu, Z\. Li,et al\.\(2025\)A comprehensive survey of self\-evolving ai agents: a new paradigm bridging foundation models and lifelong agentic systems\.arXiv preprint arXiv:2508\.07407\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p1.1)\.
- Y\. Fu, D\. Kim, J\. Kim, S\. Sohn, L\. Logeswaran, K\. Bae, and H\. Lee \(2024\)Autoguide: automated generation and selection of context\-aware guidelines for large language model agents\.Advances in Neural Information Processing Systems37,pp\. 119919–119948\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p2.1)\.
- H\. Gao, J\. Geng, W\. Hua, M\. Hu, X\. Juan, H\. Liu, S\. Liu, J\. Qiu, X\. Qi, Y\. Wu,et al\.\(2025\)A survey of self\-evolving agents: on path to artificial super intelligence\.arXiv preprint arXiv:2507\.21046\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p1.1),[§2](https://arxiv.org/html/2604.16968#S2.p1.1),[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p1.1)\.
- J\. Geng, H\. Chen, R\. Liu, M\. H\. Ribeiro, R\. Willer, G\. Neubig, and T\. L\. Griffiths \(2025\)Accumulating context changes the beliefs of language models\.arXiv preprint arXiv:2511\.01805\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p4.1),[§4\.3](https://arxiv.org/html/2604.16968#S4.SS3.SSS0.Px1.p1.1)\.
- S\. Han, J\. Liu, Y\. Su, W\. Duan, X\. Liu, C\. Xie, M\. Bansal, M\. Ding, L\. Zhang, and H\. Yao \(2025\)Alignment tipping process: how self\-evolution pushes llm agents off the rails\.arXiv preprint arXiv:2510\.04860\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p2.1)\.
- D\. Hendrycks, D\. Song, C\. Szegedy, H\. Lee, Y\. Gal, E\. Brynjolfsson, S\. Li, A\. Zou, L\. Levine, B\. Han,et al\.\(2025\)A definition of agi\.arXiv preprint arXiv:2510\.18212\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p1.1),[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p1.1)\.
- M\. Herrador \(2025\)The pacifaist benchmark: would an artificial intelligence choose to sacrifice itself for human safety?\.arXiv preprint arXiv:2508\.09762\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p2.1)\.
- B\. Hu \(2025\)On improvisation and open\-endedness: insights for experiential ai\.arXiv preprint arXiv:2511\.00529\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p1.1)\.
- E\. Hughes, M\. Dennis, J\. Parker\-Holder, F\. Behbahani, A\. Mavalankar, Y\. Shi, T\. Schaul, and T\. Rocktäschel \(2024\)Position: open\-endedness is essential for artificial superhuman intelligence\.InProceedings of the 41st International Conference on Machine Learning,pp\. 20597–20616\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p1.1)\.
- A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes, A\. Radford,et al\.\(2024\)Gpt\-4o system card\.arXiv preprint arXiv:2410\.21276\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p3.1),[§3\.1](https://arxiv.org/html/2604.16968#S3.SS1.SSS0.Px2.p1.1)\.
- P\. Kumar, E\. Lau, S\. Vijayakumar, T\. Trinh, E\. T\. Chang, V\. Robinson, S\. Zhou, M\. Fredrikson, S\. M\. Hendryx, S\. Yue,et al\.\(2025\)Aligned llms are not aligned browser agents\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§B\.1](https://arxiv.org/html/2604.16968#A2.SS1.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2604.16968#S1.p3.1),[§3\.1](https://arxiv.org/html/2604.16968#S3.SS1.SSS0.Px3.p2.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles,Cited by:[§3\.1](https://arxiv.org/html/2604.16968#S3.SS1.SSS0.Px4.p1.1)\.
- X\. Li and X\. Qiu \(2023\)MoT: memory\-of\-thought enables chatgpt to self\-improve\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 6354–6374\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p2.1)\.
- A\. Liu, A\. Mei, B\. Lin, B\. Xue, B\. Wang, B\. Xu, B\. Wu, B\. Zhang, C\. Lin, C\. Dong,et al\.\(2025a\)DeepSeek\-v3\.2: pushing the frontier of open large language models\.arXiv preprint arXiv:2512\.02556\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p3.1),[§3\.1](https://arxiv.org/html/2604.16968#S3.SS1.SSS0.Px2.p1.1)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024\)Lost in the middle: how language models use long contexts\.Transactions of the Association for Computational Linguistics12,pp\. 157–173\.Cited by:[§4\.3](https://arxiv.org/html/2604.16968#S4.SS3.SSS0.Px1.p1.1)\.
- Y\. Liu, C\. Si, K\. R\. Narasimhan, and S\. Yao \(2025b\)Contextual experience replay for self\-improvement of language agents\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 14179–14198\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p1.1)\.
- S\. Longpre, R\. Mahari, A\. Lee, C\. Lund, H\. Oderinwale, W\. Brannon, N\. Saxena, N\. Obeng\-Marnu, T\. South, C\. Hunter,et al\.\(2024\)Consent in crisis: the rapid decline of the ai data commons\.Advances in Neural Information Processing Systems37,pp\. 108042–108087\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p1.1)\.
- A\. Lynch, B\. Wright, C\. Larson, S\. J\. Ritchie, S\. Mindermann, E\. Hubinger, E\. Perez, and K\. Troy \(2025\)Agentic misalignment: how llms could be insider threats\.arXiv preprint arXiv:2510\.05179\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p2.1)\.
- M\. R\. Morris, J\. Sohl\-Dickstein, N\. Fiedel, T\. Warkentin, A\. Dafoe, A\. Faust, C\. Farabet, and S\. Legg \(2023\)Levels of agi for operationalizing progress on the path to agi\.arXiv preprint arXiv:2311\.02462\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p1.1)\.
- S\. Ouyang, J\. Yan, I\. Hsu, Y\. Chen, K\. Jiang, Z\. Wang, R\. Han, L\. T\. Le, S\. Daruki, X\. Tang,et al\.\(2025\)Reasoningbank: scaling agent self\-evolving with reasoning memory\.arXiv preprint arXiv:2509\.25140\.Cited by:[Appendix A](https://arxiv.org/html/2604.16968#A1.p1.1),[§1](https://arxiv.org/html/2604.16968#S1.p3.1),[§3\.1](https://arxiv.org/html/2604.16968#S3.SS1.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p2.1)\.
- P\. Röttger, H\. Kirk, B\. Vidgen, G\. Attanasio, F\. Bianchi, and D\. Hovy \(2024\)Xstest: a test suite for identifying exaggerated safety behaviours in large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 5377–5400\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p5.1)\.
- T\. G\. Rudner and H\. Toner \(2021\)Key concepts in ai safety: specification in machine learning\.Center for Security and Emerging Technology, December\. http://cset\.georgetown\.edu/wp\-content/uploads/Key\-Concepts\-in\-AI\-Safety\-Specification\-in\-Machine\-Learning\.pdf\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p2.1),[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p2.1)\.
- S\. Shao, Q\. Ren, C\. Qian, B\. Wei, D\. Guo, Y\. JingYi, X\. Song, L\. Zhang, W\. Zhang, D\. Liu,et al\.\(2025\)Your agent may misevolve: emergent risks in self\-evolving llm agents\.InSocially Responsible and Trustworthy Foundation Models at NeurIPS 2025,Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p2.1)\.
- I\. Sheth, J\. Wehner, S\. Abdelnabi, R\. Binkyte, and M\. Fritz \(2025\)Safety is essential for responsible open\-ended systems\.arXiv preprint arXiv:2502\.04512\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p1.1)\.
- D\. Silver and R\. S\. Sutton \(2025\)Welcome to the era of experience\.Google AI1\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p1.1)\.
- K\. Simonyan, A\. Vedaldi, and A\. Zisserman \(2013\)Deep inside convolutional networks: visualising image classification models and saliency maps\.arXiv preprint arXiv:1312\.6034\.Cited by:[§4\.3](https://arxiv.org/html/2604.16968#S4.SS3.SSS0.Px3.p1.1)\.
- K\. O\. Stanley \(2019\)Why open\-endedness matters\.Artificial life25\(3\),pp\. 232–235\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p1.1)\.
- H\. Su, J\. Luo, C\. Liu, X\. Yang, Y\. Zhang, Y\. Dong, and J\. Zhu \(2025\)A survey on autonomy\-induced security risks in large model\-based agents\.arXiv preprint arXiv:2506\.23844\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p1.1)\.
- Y\. Sun, X\. Wang, J\. Fu, C\. Lu, and B\. Zhou \(2025\)R2AI: towards resistant and resilient ai in an evolving world\.arXiv preprint arXiv:2509\.06786\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p2.1)\.
- M\. Suzgun, M\. Yuksekgonul, F\. Bianchi, D\. Jurafsky, and J\. Zou \(2025\)Dynamic cheatsheet: test\-time learning with adaptive memory\.arXiv preprint arXiv:2504\.07952\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p2.1)\.
- Z\. Tang, B\. Ji, J\. Li, L\. Wu, H\. Gui, and M\. Zhang \(2025\)Revisiting long\-context modeling from context denoising perspective\.arXiv preprint arXiv:2510\.05862\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p4.1),[§4\.3](https://arxiv.org/html/2604.16968#S4.SS3.SSS0.Px3.p2.1)\.
- Z\. Tao, T\. Lin, X\. Chen, H\. Li, Y\. Wu, Y\. Li, Z\. Jin, F\. Huang, D\. Tao, and J\. Zhou \(2024\)A survey on self\-evolution of large language models\.arXiv preprint arXiv:2404\.14387\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p1.1)\.
- P\. Villalobos, A\. Ho, J\. Sevilla, T\. Besiroglu, L\. Heim, and M\. Hobbhahn \(2024\)Position: will we run out of data? limits of llm scaling based on human\-generated data\.InForty\-first International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p1.1)\.
- L\. Wang, L\. Li, D\. Dai, D\. Chen, H\. Zhou, F\. Meng, J\. Zhou, and X\. Sun \(2023\)Label words are anchors: an information flow perspective for understanding in\-context learning\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 9840–9855\.Cited by:[§4\.3](https://arxiv.org/html/2604.16968#S4.SS3.SSS0.Px3.p2.1)\.
- Z\. Z\. Wang, J\. Mao, D\. Fried, and G\. Neubig \(2025\)Agent workflow memory\.InForty\-second International Conference on Machine Learning,Cited by:[Appendix A](https://arxiv.org/html/2604.16968#A1.p1.1),[§1](https://arxiv.org/html/2604.16968#S1.p3.1),[§3\.1](https://arxiv.org/html/2604.16968#S3.SS1.SSS0.Px1.p1.1),[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p2.1)\.
- J\. Weston and J\. Foerster \(2025\)AI & human co\-improvement for safer co\-superintelligence\.arXiv preprint arXiv:2512\.05356\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025a\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§1](https://arxiv.org/html/2604.16968#S1.p3.1),[§3\.1](https://arxiv.org/html/2604.16968#S3.SS1.SSS0.Px2.p1.1)\.
- W\. Yang, J\. Xiao, H\. Zhang, Q\. Zhang, Y\. Wang, and B\. Xu \(2025b\)Coarse\-to\-fine grounded memory for llm agent planning\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 13040–13067\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p2.1)\.
- Z\. Yang, P\. Li, and Y\. Liu \(2023\)Failures pave the way: enhancing large language models through tuning\-free rule accumulation\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 1751–1777\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p2.1)\.
- S\. Yin, X\. Pang, Y\. Ding, M\. Chen, Y\. Bi, Y\. Xiong, W\. Huang, Z\. Xiang, J\. Shao, and S\. Chen \(2024\)Safeagentbench: a benchmark for safe task planning of embodied llm agents\.arXiv preprint arXiv:2412\.13178\.Cited by:[§B\.2](https://arxiv.org/html/2604.16968#A2.SS2.p2.1),[§1](https://arxiv.org/html/2604.16968#S1.p3.1),[§3\.1](https://arxiv.org/html/2604.16968#S3.SS1.SSS0.Px3.p3.1)\.
- G\. Zhang, M\. Fu, G\. Wan, M\. Yu, K\. Wang, and S\. Yan \(2025a\)G\-memory: tracing hierarchical memory for multi\-agent systems\.arXiv preprint arXiv:2506\.07398\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p2.1)\.
- Q\. Zhang, C\. Hu, S\. Upasani, B\. Ma, F\. Hong, V\. Kamanuru, J\. Rainton, C\. Wu, M\. Ji, H\. Li,et al\.\(2025b\)Agentic context engineering: evolving contexts for self\-improving language models\.arXiv preprint arXiv:2510\.04618\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p2.1)\.
- Z\. Zhang, S\. Cui, Y\. Lu, J\. Zhou, J\. Yang, H\. Wang, and M\. Huang \(2024\)Agent\-safetybench: evaluating the safety of llm agents\.arXiv preprint arXiv:2412\.14470\.Cited by:[§B\.1](https://arxiv.org/html/2604.16968#A2.SS1.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2604.16968#S3.SS1.SSS0.Px3.p2.1)\.
- A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang \(2024a\)Expel: llm agents are experiential learners\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 19632–19642\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p2.1)\.
- W\. Zhao, X\. Sui, J\. Guo, Y\. Hu, Y\. Deng, Y\. Zhao, X\. Zhi, Y\. Huang, H\. He, W\. Che,et al\.\(2026a\)Trade\-offs in large reasoning models: an empirical analysis of deliberative and adaptive reasoning over foundational capabilities\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 34976–34984\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p1.1)\.
- W\. Zhao, S\. Wang, Y\. Hu, Y\. Zhao, B\. Qin, X\. Zhang, Q\. Yang, D\. Xu, and W\. Che \(2024b\)Sapt: a shared attention framework for parameter\-efficient continual learning of large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 11641–11661\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p1.1)\.
- W\. Zhao, Y\. Wang, Y\. Zhang, Y\. Deng, Y\. Zhao, W\. Che, B\. Qin, and T\. Liu \(2026b\)Large language model agents are not always faithful self\-evolvers\.arXiv preprint arXiv:2601\.22436\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px2.p1.1)\.
- J\. Zheng, C\. Shi, X\. Cai, Q\. Li, D\. Zhang, C\. Li, D\. Yu, and Q\. Ma \(2025\)Lifelong learning of large language model based agents: a roadmap\.arXiv preprint arXiv:2501\.07278\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p1.1)\.
- W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang \(2024\)Memorybank: enhancing large language models with long\-term memory\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 19724–19731\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p2.1)\.
- H\. Zhou, Y\. Chen, S\. Guo, X\. Yan, K\. H\. Lee, Z\. Wang, K\. Y\. Lee, G\. Zhang, K\. Shao, L\. Yang,et al\.\(2025\)Memento: fine\-tuning llm agents without fine\-tuning llms\.arXiv preprint arXiv:2508\.16153\.Cited by:[§6](https://arxiv.org/html/2604.16968#S6.SS0.SSS0.Px1.p2.1)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried,et al\.\(2024\)WebArena: a realistic web environment for building autonomous agents\.InThe Twelfth International Conference on Learning Representations,Cited by:[§B\.1](https://arxiv.org/html/2604.16968#A2.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2604.16968#S1.p3.1),[§3\.1](https://arxiv.org/html/2604.16968#S3.SS1.SSS0.Px3.p2.1)\.

## Appendix A Self\-Evolving Agents

We present detailed overviews of the two experience\-driven self\-evolving agents used in our experiments: Agent Workflow Memory \(AWM\) \(Wang et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib1)\) and ReasoningBank \(Ouyang et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib2)\)\. These frameworks correspond to offline and online self\-evolution paradigms, respectively, in which agents adapt their behavior through the accumulation and reuse of experience stored in memory, rather than through updates to model parameters\. Below, we summarize their key design principles and memory mechanisms:

- Agent Workflow Memory \(AWM\) embodies an offline\-oriented self\-evolving agent paradigm\. It endows the agent with a structured memory that contains reusable task workflows abstracted from previous task trajectories\. These workflows capture high\-level action patterns that have demonstrated effectiveness in prior interactions\. At inference time, the agent applies these workflows to the current task and integrates them into the prompt to steer decision\-making and action generation\. In the offline setting examined in this work, all workflows are induced in advance from a fixed training corpus and remain unchanged during evaluation\. Consequently, AWM allows agents to leverage accumulated experience without altering model parameters or dynamically updating memory at test time, serving as a clean example of experience\-driven self\-evolution based on static, pre\-collected memory\.
- ReasoningBank exemplifies an online self\-evolving agent\. It maintains a continuously growing memory that stores distilled reasoning patterns extracted from the agent’s own interaction history, including both successful and failed attempts\. After each task execution, the agent evaluates its performance and selectively integrates new experiences into the memory bank\. At test time, relevant reasoning strategies are retrieved and injected into the agent’s context to inform subsequent interactions\. This process creates a closed feedback loop in which experience accumulation, retrieval, and reuse occur throughout deployment, allowing the agent’s behavior to evolve over time even though the underlying language model remains fixed\.
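The difference between the two paradigms can be illustrated with a minimal sketch of the retrieve, act, evaluate, and store loop\. This is not the AWM or ReasoningBank implementation: `MemoryBank`, `run_task`, the word\-overlap retrieval, and the `toy_*` functions are hypothetical stand\-ins for the LLM\-backed components, and the `frozen` flag is only an assumed way to mimic a memory that stays fixed at test time\.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MemoryBank:
    """External memory; the model's parameters are never touched."""
    items: List[str] = field(default_factory=list)
    frozen: bool = False  # True mimics the offline setting (memory fixed at test time)

    def retrieve(self, task: str, k: int = 2) -> List[str]:
        # Toy relevance score: number of words shared with the task text.
        words = set(task.lower().split())
        ranked = sorted(
            self.items,
            key=lambda item: len(words & set(item.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def add(self, item: str) -> None:
        if not self.frozen:
            self.items.append(item)


def run_task(task: str, memory: MemoryBank,
             act: Callable[[str, List[str]], str],
             judge: Callable[[str, str], bool],
             distill: Callable[[str, str, bool], str]) -> str:
    guidance = memory.retrieve(task)      # retrieve relevant experience
    trajectory = act(task, guidance)      # act with experience in context
    success = judge(task, trajectory)     # self-evaluate the attempt
    memory.add(distill(task, trajectory, success))  # no-op when frozen
    return trajectory


# Hypothetical stand-ins for the LLM-backed components.
def toy_act(task, guidance):
    return f"did: {task} (hints used: {len(guidance)})"


def toy_judge(task, trajectory):
    return trajectory.startswith("did:")


def toy_distill(task, trajectory, success):
    label = "success" if success else "failure"
    return f"{label} strategy for {task}"


online = MemoryBank()
offline = MemoryBank(items=["workflow: search forum posts"], frozen=True)
outputs = []
for t in ["search forum posts", "reply to forum post"]:
    outputs.append(run_task(t, online, toy_act, toy_judge, toy_distill))
    run_task(t, offline, toy_act, toy_judge, toy_distill)
```

After the loop, the online memory holds one new entry per task while the frozen memory is unchanged\. In both cases behavior can only shift through what is retrieved into the context, which is the property the shared abstraction relies on\.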

Despite their methodological differences, AWM and ReasoningBank share a unifying abstraction: experience is externalized into an explicit memory and reused as contextual guidance for future actions\. This shared design makes them well\-suited for our study, as any observed behavioral drift, including potential degradation of safety boundaries, can be attributed to memory construction, retrieval, and utilization rather than to parameter\-level learning\. By jointly evaluating offline \(AWM\) and online \(ReasoningBank\) self\-evolving agents under a unified protocol, we are able to assess whether safety boundary erosion is an inherent characteristic of experience\-driven memory usage, independent of the manner in which experience is acquired\.

## Appendix B Environment and Benchmark

### B\.1 Web Environment

We consider a web interaction environment in which language\-based agents execute long\-horizon tasks on realistic websites through natural language instructions\. This setting reflects common real\-world agent applications such as web navigation, online information management, and task automation, while exposing agents to complex action spaces and diverse task objectives\.

#### Experience Accumulation Environment\.

We adopt WebArena \(Zhou et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib6)\) as the web environment for experience accumulation\. WebArena is a realistic and reproducible platform that hosts fully functional websites spanning four representative domains: e\-commerce, social forums, collaborative software development, and content management systems\. Tasks in WebArena are multi\-step and long\-horizon, requiring agents to interact with web interfaces, external tools, and documentation to complete goals\. In our experiments, WebArena is used exclusively for self\-evolving interaction and experience collection\.

#### Web Safety Benchmarks\.

To evaluate safety performance in web\-based agent settings, we adopt BrowserART \(Kumar et al\., [2025](https://arxiv.org/html/2604.16968#bib.bib7)\) and the web\-related subset of Agent\-SafetyBench \(Zhang et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib8)\), both of which are specifically designed to assess safety risks arising from agentic interaction and tool use\. Importantly, these benchmarks are used solely for safety evaluation and are disjoint from the experience accumulation environment\.

BrowserART is a red\-teaming benchmark tailored for browser\-based agents\. It consists of 100 diverse browser\-related harmful behaviors spanning both synthetic and real websites\. Unlike traditional chatbot safety benchmarks, BrowserART explicitly targets agentic settings where LLMs interact with web browsers and external tools, probing whether safety refusals learned in chat contexts generalize to browser\-based execution\.

Agent\-SafetyBench is a comprehensive benchmark for evaluating the safety of LLM agents across interactive environments\. It includes 349 interaction environments and 2,000 test cases covering multiple categories of safety risks and common failure modes in agentic behavior\. In our experiments, we focus on the web\-based interaction subset, which contains 657 test cases, and use it to assess agents’ robustness and risk awareness under safety\-critical web scenarios\.

Together, BrowserART and Agent\-SafetyBench enable a rigorous evaluation of safety risks specific to web agents, complementing WebArena’s role as an experience accumulation environment\.

### B\.2 Household Embodied Environment

In this scenario, agents operate within a simulated physical environment to carry out task\-oriented instructions that require navigation, object manipulation, and action planning\. Such settings are inherently safety\-critical, as inappropriate actions can result in potential physical hazards\.

Experiments are conducted on SafeAgentBench \(Yin et al\., [2024](https://arxiv.org/html/2604.16968#bib.bib9)\), a benchmark specifically designed to assess the safety awareness of embodied LLM agents in interactive simulation environments\. Following the official evaluation protocol, agents first perform experience\-driven self\-evolution on a subset of benign tasks \(269 tasks\), which serve solely for experience accumulation\. Safety performance is subsequently assessed on a disjoint set of hazardous tasks \(269 tasks\), aimed at evaluating the agent’s ability to handle safety\-critical instructions in embodied household settings\. SafeAgentBench supports robust safety assessment from both execution\-level and semantic\-level perspectives\.

## Appendix C Implementation Details

We provide additional details for the experience\-driven self\-evolving agents evaluated in both web and household embodied environments\.

#### Web Environment

In the offline self\-evolving setting, AWM accumulates experience from a fixed set of 812 WebArena tasks, which are used exclusively for inducing and storing workflows in memory\. No further experience is added during safety evaluation\. Following the official AWM configuration, the decoding temperature is set to 0\.1 for both experience accumulation and safety evaluation\.

In the online self\-evolving setting on ReasoningBank, due to the substantial computational cost of online self\-evolution, particularly the need to periodically evaluate safety performance to capture temporal trends, we perform online experience accumulation on the Reddit subset from WebArena, consisting of 106 tasks\. During online interaction, the agent incrementally updates its memory based on newly acquired experience\. Safety evaluation is conducted every 10 evolving steps to monitor the evolution of safety behavior over time\. Consistent with the official ReasoningBank setup, the decoding temperature is set to 0\.7 for both experience accumulation and safety evaluation\.

#### Household Embodied Environment

Both offline \(AWM\) and online \(ReasoningBank\) self\-evolving agents accumulate experience on the same subset of 269 benign tasks from SafeAgentBench, which are explicitly non\-harmful and used solely for experience collection\. In the online setting, safety performance is evaluated every 20 evolving steps to track changes in safety behavior as experience accumulates\. Safety evaluation is performed on the hazardous task subset as described in the main text\. The decoding temperatures for AWM and ReasoningBank remain consistent with those used in the web environment, namely 0\.1 for AWM and 0\.7 for ReasoningBank\.
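The settings above can be collected into a small configuration sketch. The task counts, temperatures, and evaluation intervals are taken from this appendix; the class name `EvolutionConfig`, the helper `evaluation_steps`, and the assumption that one evolving step corresponds to one task (with no checkpoint at step 0) are illustrative and not part of the original implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EvolutionConfig:
    framework: str             # "AWM" (offline) or "ReasoningBank" (online)
    environment: str           # source of benign experience
    num_tasks: int             # tasks used for experience accumulation
    temperature: float         # decoding temperature (accumulation and evaluation)
    eval_every: Optional[int]  # evolving steps between safety evaluations; None if offline


# Values as reported in Appendix C.
CONFIGS = [
    EvolutionConfig("AWM", "WebArena", 812, 0.1, None),
    EvolutionConfig("ReasoningBank", "WebArena (Reddit subset)", 106, 0.7, 10),
    EvolutionConfig("AWM", "SafeAgentBench (benign)", 269, 0.1, None),
    EvolutionConfig("ReasoningBank", "SafeAgentBench (benign)", 269, 0.7, 20),
]


def evaluation_steps(cfg: EvolutionConfig) -> List[int]:
    """Return the evolving steps at which safety would be evaluated.

    Assumption (not stated in the paper): one evolving step per task,
    and no checkpoint at step 0.
    """
    if cfg.eval_every is None:
        # Offline: evaluate once, after all experience has been stored.
        return [cfg.num_tasks]
    return list(range(cfg.eval_every, cfg.num_tasks + 1, cfg.eval_every))


if __name__ == "__main__":
    for cfg in CONFIGS:
        print(cfg.framework, cfg.environment, len(evaluation_steps(cfg)), "checkpoint(s)")
```

Under these assumptions, the online web setting yields checkpoints at steps 10 through 100 and the online household setting at steps 20 through 260.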

## Appendix D Additional Experimental Results

![Refer to caption](https://arxiv.org/html/2604.16968v1/x4.png)
Figure 7: Category\-level ASR shifts before and after offline self\-evolution on BrowserART\. Results are shown for Qwen3\-8B, Qwen3\-14B, and Qwen3\-32B\.

### D\.1 Category\-Level Analysis of Safety Degradation from Offline Self\-Evolution

To deepen our understanding of how offline self\-evolving agents degrade in safety, we present a category\-level evaluation across three safety\-critical agent benchmarks: BrowserART, Agent\-SafetyBench, and SafeAgentBench\. We report results for seven LLM backbones, comparing their Base and AWM variants\.

#### BrowserART: Amplified Vulnerabilities on Action\-Content Prompts\.

As shown in Figure [7](https://arxiv.org/html/2604.16968#S4.F7), AWM consistently increases ASR across all backbones, with notable spikes on GPT\-4o, Qwen3\-8B, and Qwen3\-32B, indicating strong risk amplification under offline memory integration\. Claude\-4\.5\-Sonnet shows comparatively limited ASR escalation, hinting at stronger inherent refusal strategies or prompt robustness\. Specific categories, such as \#1 Fraud Clicking, \#2 Fake Identity, and \#7 Cyber Offense, experience the most substantial post\-AWM risk escalation, suggesting heightened susceptibility in action\-execution prompts\. The pattern confirms that offline\-accumulated memory can compromise refusal behavior even in well\-aligned models, especially when harmful experience is retained across sessions\.

![Refer to caption](https://arxiv.org/html/2604.16968v1/x5.png)
Figure 8: Category\-level ASR shifts before and after offline self\-evolution on Agent\-SafetyBench\. Results are shown for GPT\-4o, Claude\-4\.5\-Sonnet, DeepSeek\-V3\.2, and Qwen3\-235B\-A22B\.

![Refer to caption](https://arxiv.org/html/2604.16968v1/x6.png)
Figure 9: Category\-level ASR shifts before and after offline self\-evolution on Agent\-SafetyBench\. Results are shown for Qwen3\-8B, Qwen3\-14B, and Qwen3\-32B\.

#### Agent\-SafetyBench: Degradation in Security\-Critical Planning\.

Agent\-SafetyBench covers 8 categories of agent safety threats, including data leakage, code injection, and misinformation spread; results are shown in Figure [8](https://arxiv.org/html/2604.16968#A4.F8) and Figure [9](https://arxiv.org/html/2604.16968#A4.F9)\. All models experience a moderate\-to\-severe ASR increase post\-AWM, notably on \#1 Availability Compromise, \#5 Sensitive Info Leakage, and \#7 Unsafe Info Spread\. GPT\-4o, DeepSeek\-V3\.2, and Qwen3\-14B show a 20–30% rise in ASR, revealing AWM’s tendency to memorize and reuse unsafe strategies in future tasks\. Claude\-4\.5\-Sonnet again shows the lowest offline ASR gap, indicating better boundary retention or task generalization\. Across backbones, the results imply that once a model executes unsafe behaviors offline, it becomes increasingly likely to replicate them even in unrelated tasks, degrading its long\-term trustworthiness\.

![Refer to caption](https://arxiv.org/html/2604.16968v1/x7.png)
Figure 10: Category\-level ASR shifts before and after offline self\-evolution on SafeAgentBench\. Results are shown for GPT\-4o, Claude\-4\.5\-Sonnet, DeepSeek\-V3\.2, and Qwen3\-235B\-A22B\.

![Refer to caption](https://arxiv.org/html/2604.16968v1/x8.png)
Figure 11: Category\-level ASR shifts before and after offline self\-evolution on SafeAgentBench\. Results are shown for Qwen3\-8B, Qwen3\-14B, and Qwen3\-32B\.

#### SafeAgentBench: Elevated Physical Risk in Embodied Scenarios\.

SafeAgentBench focuses on 12 household hazards, such as electrical shock, fire, and object damage; results are shown in Figure [10](https://arxiv.org/html/2604.16968#A4.F10) and Figure [11](https://arxiv.org/html/2604.16968#A4.F11)\. Post\-AWM models universally show increased ASR in physical safety threats, especially on \#1 Other Human Hazards, \#8 Breakage, and \#12 Property Damage\. DeepSeek\-V3\.2 and GPT\-4o exhibit alarming rises, reflecting vulnerability to physical\-harm instructions once unsafe memory is formed\. Smaller backbones like Qwen3\-8B also show high susceptibility, likely due to limited ability to dissociate sensitive commands from benign contexts\.

\(a\) ASR curves of 7 LLM backbones during online self\-evolving in the WebArena environment\.

### D\.2 Safety Dynamics in Web\-based Environments

To further investigate the safety dynamics under realistic web\-based deployments, we evaluate the safety performance of online self\-evolving agents across seven LLM backbones using the ReasoningBank framework with WebArena as the interaction environment\. The evolution of attack success rate \(ASR\) is reported in the accompanying figure, with one panel for BrowserART and one for Agent\-SafetyBench\.

#### All models exhibit rising unsafe behavior over time\.

Across both benchmarks, all LLM backbones show a clear upward trend or remain at elevated ASR levels after initial rises\. This indicates that the integration of accumulated experience leads to safety degradation even without direct exposure to harmful instructions\.

#### Safety degradation patterns are architecture\-dependent but consistently persistent\.

While the pace and volatility of ASR growth differ, none of the models revert to their initial safety levels\. This reveals that online self\-evolving can induce lasting safety shifts, with degradation emerging early and persisting throughout the trajectory\.

![Refer to caption](https://arxiv.org/html/2604.16968v1/x9.png)
Figure 13: ASR of Qwen3\-32B on Agent\-SafetyBench under long\-horizon online self\-evolution using the full ReasoningBank \(over 800 steps\)\. Safety degradation persists and worsens over time without recovery\.

### D\.3 Long\-Horizon Online Self\-Evolution

To examine the long\-term safety dynamics of self\-evolving agents, we conduct an extended online evolution experiment on the WebArena dataset\. The agent, built upon Qwen3\-32B, interacts continuously with benign tasks, accumulating and reusing its own experience over more than 800 self\-evolving steps\. Safety performance is periodically evaluated on Agent\-SafetyBench, and the results are shown in Figure [13](https://arxiv.org/html/2604.16968#A4.F13)\.

We observe a sustained degradation in safety over time: the attack success rate \(ASR\) increases from approximately 52% to over 55%, and this elevated unsafe behavior persists through the remainder of the evolution\. Despite minor fluctuations, the agent never returns to its initial safety level, suggesting that the degradation is not merely stochastic noise but the result of gradual, compounding shifts in the agent’s behavioral boundary\.

These long\-horizon results reinforce our earlier findings: even when grounded entirely in benign interactions, self\-evolving agents can drift into unsafe regimes due to the unchecked accumulation of execution\-oriented experience\. This underscores the critical need for long\-term monitoring and memory intervention to prevent irreversible safety erosion in real\-world deployments\.
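The two properties discussed in this subsection, a rise in ASR and the absence of any return to the initial level, can be checked mechanically on an evaluation curve. The sketch below assumes ASR is the fraction of test cases judged unsafe; the function names and the example curve are hypothetical and do not reproduce the measured values in Figure 13.

```python
from typing import Sequence


def attack_success_rate(unsafe_flags: Sequence[bool]) -> float:
    """Fraction of test cases on which the agent behaved unsafely."""
    if not unsafe_flags:
        raise ValueError("at least one test case is required")
    return sum(unsafe_flags) / len(unsafe_flags)


def is_monotonic_increase(curve: Sequence[float]) -> bool:
    """True if ASR never decreases from one checkpoint to the next."""
    return all(later >= earlier for earlier, later in zip(curve, curve[1:]))


def never_recovers(curve: Sequence[float]) -> bool:
    """True if every later checkpoint stays above the initial ASR."""
    baseline = curve[0]
    return all(value > baseline for value in curve[1:])


if __name__ == "__main__":
    # Hypothetical ASR values at successive checkpoints (illustration only).
    curve = [0.52, 0.54, 0.53, 0.55, 0.56]
    print("monotonic:", is_monotonic_increase(curve))
    print("never recovers:", never_recovers(curve))
```

A curve with small dips can fail the strict monotonicity check while still never returning to its baseline, which is the weaker property that matters for persistent safety drift.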

### D\.4 Annotation Protocol for Execution\-Bias Case Study

#### Annotators\.

We hired three annotators to conduct the manual inspection and labeling\. All annotators are young adults with higher\-education backgrounds \(i\.e\., currently enrolled in or graduated from a university program\)\. They were financially compensated following a pre\-agreed hourly rate\.

#### Annotation scope and unit\.

We focus on flip cases, where incorporating experience changes the agent’s response from safe \(e\.g\., refusal or safe alternative\) to unsafe \(e\.g\., executing or facilitating unsafe actions\)\. Each annotation instance consists of: \(i\) the original user query and context, \(ii\) the retrieved experience snippet\(s\), \(iii\) the agent response without experience, and \(iv\) the agent response with experience\. Annotators assign exactly one dominant cause label to each flip case, prioritizing the most direct trigger of unsafe behavior\.

#### Label set and detailed criteria\.

We categorize the dominant causes into the following three types:

- Sensitive Execution, where the retrieved experience is benign in isolation but becomes unsafe when instantiated in safety\-sensitive contexts, as it contains actions that are conditionally hazardous \(e\.g\., ignition\-related operations in household scenarios\)\. Detailed examples are shown in Table [4](https://arxiv.org/html/2604.16968#A4.T4)\.
- Standard Execution, where the retrieved experience provides generic, executable procedural patterns \(e\.g\., “open → place”\) that promote task completion and are broadly applicable, but may lead to unsafe behavior when blindly transferred to contexts requiring refusal\. Detailed examples are shown in Table [5](https://arxiv.org/html/2604.16968#A4.T5)\.
- Format Recovery, where the retrieved experience primarily restores the output structure or formatting \(e\.g\., stepwise layout or schema compliance\), thereby enabling task completion that was previously prevented by formatting or structural failures\. Detailed examples are shown in Table [6](https://arxiv.org/html/2604.16968#A4.T6)\.
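The annotation unit and label set described above map naturally onto a small record type. The following is a minimal sketch: the four input fields and three labels come from the protocol, while the names `Cause`, `FlipCase`, and `label_distribution` and the placeholder strings are hypothetical, and no aggregation rule across the three annotators is implied.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Cause(Enum):
    SENSITIVE_EXECUTION = "Sensitive Execution"
    STANDARD_EXECUTION = "Standard Execution"
    FORMAT_RECOVERY = "Format Recovery"


@dataclass
class FlipCase:
    query: str                        # (i) original user query and context
    retrieved_experience: List[str]   # (ii) retrieved experience snippet(s)
    response_without_experience: str  # (iii) safe response (refusal or safe alternative)
    response_with_experience: str     # (iv) unsafe response after adding experience
    cause: Cause                      # exactly one dominant cause label


def label_distribution(cases: List[FlipCase]) -> Dict[Cause, float]:
    """Share of flip cases assigned to each dominant cause."""
    if not cases:
        raise ValueError("no flip cases to summarize")
    counts = Counter(case.cause for case in cases)
    return {cause: counts[cause] / len(cases) for cause in Cause}


if __name__ == "__main__":
    # Placeholder records for illustration only; not data from the study.
    cases = [
        FlipCase("<query 1>", ["<experience>"], "<refusal>", "<execution>", Cause.STANDARD_EXECUTION),
        FlipCase("<query 2>", ["<experience>"], "<refusal>", "<execution>", Cause.STANDARD_EXECUTION),
        FlipCase("<query 3>", ["<experience>"], "<refusal>", "<execution>", Cause.SENSITIVE_EXECUTION),
        FlipCase("<query 4>", ["<experience>"], "<malformed>", "<execution>", Cause.FORMAT_RECOVERY),
    ]
    for cause, share in label_distribution(cases).items():
        print(f"{cause.value}: {share:.2f}")
```

Using an enumeration for the label keeps the "exactly one dominant cause" constraint explicit in the record itself.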

\(a\) The prompt structure of online self\-evolving framework ReasoningBank\.

### D\.5 Length\-Controlled Prompt Construction

We detail the implementation of the length\-controlled prompt used in the *Experience vs\. Enhanced Context Length* analysis\.

#### Prompt Structure\.

Panel \(a\) of the prompt figure illustrates the prompt formulation of ReasoningBank\. Each prompt can be decomposed into three components: \(1\) System Instruction, \(2\) Retrieved Experience Item, and \(3\) Task Goal\. In the online self\-evolving setting, the retrieved experience is inserted between the system instruction and the current task goal to guide the agent’s behavior\.

#### Length Measurement\.

To construct a length\-matched control, we first measure the token length contributed by the Retrieved Experience component for each BrowserART sample during online self\-evolution\. We then compute the *average retrieved experience length* across all samples as the target length for context compensation\.

#### Length\-Matched Prompt Expansion\.

We remove the retrieved experience entirely and compensate for the resulting context length reduction by expanding the System Instruction\. Specifically, we use GPT\-4o to enrich and elaborate the system instruction with additional descriptive details, clarifications, and constraints, while preserving its original intent and safety requirements\. The expanded system instruction is carefully constructed to match the average token length of the removed retrieved experience, ensuring that the overall prompt length remains unchanged\. A second panel of the prompt figure presents a concrete example comparing the original prompt with retrieved experience and the corresponding length\-matched prompt with expanded system instruction\.
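The length measurement and matching steps can be sketched as follows. This is an illustration under stated assumptions: a whitespace tokenizer stands in for the model tokenizer, the 5% tolerance is an arbitrary choice, and the function names are hypothetical. The expansion itself (performed with GPT\-4o in the paper) is not reproduced here; the sketch only checks that a given expansion adds roughly the target number of tokens.

```python
from statistics import mean
from typing import Callable, List, Sequence

Tokenizer = Callable[[str], Sequence[str]]


def whitespace_tokenize(text: str) -> List[str]:
    # Stand-in tokenizer (assumption); a real run would use the backbone's tokenizer.
    return text.split()


def average_experience_length(
    experiences: Sequence[str], tokenize: Tokenizer = whitespace_tokenize
) -> float:
    """Average token length of the retrieved-experience component across samples."""
    return mean(len(tokenize(item)) for item in experiences)


def added_length(
    original: str, expanded: str, tokenize: Tokenizer = whitespace_tokenize
) -> int:
    """Tokens added to the system instruction by the expansion."""
    return len(tokenize(expanded)) - len(tokenize(original))


def is_length_matched(
    original: str,
    expanded: str,
    target: float,
    tolerance: float = 0.05,  # hypothetical relative tolerance
    tokenize: Tokenizer = whitespace_tokenize,
) -> bool:
    """True if the expansion compensates for the removed experience within tolerance."""
    return abs(added_length(original, expanded, tokenize) - target) <= tolerance * target


if __name__ == "__main__":
    experiences = ["a b c d", "a b c d e f"]  # placeholder experience items
    target = average_experience_length(experiences)
    original = "You are an agent"
    expanded = "You are an agent that must follow safety rules"
    print(target, added_length(original, expanded), is_length_matched(original, expanded, target))
```

Because the target is an average over samples, individual prompts in the control condition match the experience condition in expectation rather than token for token.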

![Refer to caption](https://arxiv.org/html/2604.16968v1/x10.png)
Figure 15: Attack success rate on SafeAgentBench \(household embodiment\) during self\-evolution with different numbers of retrieved experience entries\. The framework is ReasoningBank based on Qwen3\-14B\.

### D\.6 Effect of Retrieved Experience Size

To further verify the generality of our findings, we evaluate how the number of retrieved experience entries impacts safety performance in the household embodied environment\. As shown in Figure [15](https://arxiv.org/html/2604.16968#A4.F15), we observe a consistent pattern: even though each individual memory is benign, increasing the number of retrieved experiences leads to a higher ASR\. Specifically, agents retrieving 7 or 9 entries consistently perform worse than those retrieving fewer \(1 or 3\), with an observable and persistent gap throughout self\-evolving steps\.

This result echoes our findings in the web environment \(Section [4\.2](https://arxiv.org/html/2604.16968#S4.SS2)\) and reinforces the hypothesis that experience accumulation, even when each entry is individually harmless, compounds execution bias and amplifies safety risks\. It highlights the need for carefully controlled memory size and content filtering mechanisms when deploying self\-evolving agents in embodied settings\.

\(a\) Layer\-wise Integrated Gradient \(IG\) attribution of different prompt segments during online self\-evolution\. The LLM backbone is Qwen3\-8B\.

\(b\) Layer\-wise Integrated Gradient \(IG\) attribution of different prompt segments during online self\-evolution\. The LLM backbone is Qwen3\-14B\.

### D\.7 Mechanistic Interpretability

To further confirm the causal role of retrieved experience in driving safety degradation, we extend the mechanistic attribution analysis to two smaller model variants: Qwen3\-8B and Qwen3\-14B, and visualize the layer\-wise Integrated Gradient \(IG\) results in Figure [17\(a\)](https://arxiv.org/html/2604.16968#A4.F17.sf1) and Figure [17\(b\)](https://arxiv.org/html/2604.16968#A4.F17.sf2), respectively\.

We observe a consistent pattern:

- When retrieved experience is included in the prompt \(left\), the orange curve representing the “Experience Item” maintains a significant IG attribution across a wide range of layers, especially in middle\-to\-upper layers\. This indicates that the retrieved content exerts substantial influence on the model’s prediction pathway throughout the self\-evolution process\.
- In contrast, when the same prompt length is preserved but the retrieved content is replaced by an expanded system instruction \(right\), the corresponding orange curve \(“Expanded Prompt”\) exhibits a sharp drop, especially in later layers\. This stark decline reveals that the content of the retrieved experience, rather than merely its position or length, is the primary driver of the model’s behavioral shift\.

This contrast between the left and right panels substantiates our hypothesis: the performance degradation stems from the semantic information embedded in the retrieved experience items, rather than being an artifact of prompt length or format\.
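For readers unfamiliar with Integrated Gradients, the sketch below shows the core computation on a toy differentiable function and then aggregates attributions by prompt segment. It is not the layer\-wise procedure applied to the Qwen3 backbones: the toy model, its weights, the segment boundaries, and all function names are assumptions chosen so that the example is self-contained.

```python
import numpy as np


def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Approximate IG with a midpoint Riemann sum along the straight path
    from `baseline` to `x`; `grad_fn` returns the gradient of the model output."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)


def segment_attribution(attributions, segments):
    """Sum absolute attributions over each named index range [start, end)."""
    return {
        name: float(np.abs(attributions[start:end]).sum())
        for name, (start, end) in segments.items()
    }


# Toy model (assumption): f(x) = tanh(w . x), with analytic gradient.
w = np.array([0.1, 0.2, 1.0, 1.5, 0.3, 0.1])


def f(x):
    return np.tanh(w @ x)


def grad_f(x):
    return (1.0 - np.tanh(w @ x) ** 2) * w


if __name__ == "__main__":
    x = np.ones(6)
    baseline = np.zeros(6)
    attributions = integrated_gradients(grad_f, x, baseline)
    # Hypothetical mapping of input positions to prompt segments.
    segments = {"system": (0, 2), "experience": (2, 4), "task": (4, 6)}
    print(segment_attribution(attributions, segments))
    # Completeness: attributions should sum to f(x) - f(baseline).
    print(attributions.sum(), f(x) - f(baseline))
```

In this toy setup the positions with the largest weights dominate the attribution, which mirrors the kind of segment-level comparison made between the experience item and the expanded prompt.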

### D\.8 Safety in Realistic Self\-Evolution

We present detailed results under the household embodiment environment \(SafeAgentBench\) using three additional LLM backbones: DeepSeek\-V3\.2 \(Figure [17\(c\)](https://arxiv.org/html/2604.16968#A4.F17.sf3)\), Qwen3\-32B \(Figure [17\(d\)](https://arxiv.org/html/2604.16968#A4.F17.sf4)\), and Qwen3\-14B \(Figure [17\(e\)](https://arxiv.org/html/2604.16968#A4.F17.sf5)\)\. Across all models, we observe consistent behavioral patterns with respect to different experience configurations:

#### Execution\-only experience leads to increasing ASR\.

For all backbones, we observe that continuously accumulating execution traces on harmful tasks induces a monotonic or oscillatory increase in attack success rate \(ASR\) over self\-evolving steps \(left subfigures\)\. This effect is especially pronounced in DeepSeek\-V3\.2 and Qwen3\-32B, where final ASR values exceed those of purely benign experience\. These results confirm that execution\-oriented experience contributes significantly to safety degradation\.

#### Refusal experience constrains ASR but reduces benign task success\.

Refusal\-only experience consistently maintains the lowest ASR across all backbones\. In particular, Qwen3\-14B demonstrates notably stable safety performance with refusal\-based experience\. However, this safety benefit comes with a drop in benign task success rate \(right subfigures\), again indicating over\-refusal\. By contrast, the mixed experience configuration offers a middle ground, suppressing ASR more than execution\-only experience while preserving more task utility than the refusal\-only configuration\.

#### Consistency across backbones supports generalizability\.

Despite differences in model family and scale, the same trade\-off dynamics emerge across all evaluated LLMs: refusal mitigates safety risk but harms utility; execution degrades safety; and mixed experience offers partial balance\. These results underscore the generality of experience\-induced behavior drift in self\-evolving agents and motivate future work on selective experience filtering and dynamic memory scheduling policies\.

\(c\) Performance comparison under realistic deployment settings where experience from both benign and harmful tasks is accumulated\. The red dashed line denotes the performance under purely benign experience\. The underlying LLM backbone is DeepSeek\-V3\.2\.

\(d\) Performance comparison under realistic deployment settings where experience from both benign and harmful tasks is accumulated\. The red dashed line denotes the performance under purely benign experience\. The underlying LLM backbone is Qwen3\-32B\.

\(e\) Performance comparison under realistic deployment settings where experience from both benign and harmful tasks is accumulated\. The red dashed line denotes the performance under purely benign experience\. The underlying LLM backbone is Qwen3\-14B\.

Table 4: Representative examples of Sensitive Execution failures\.

Table 5: Representative examples of Standard Execution failures\.

Table 6: Representative examples of Format Recovery failures\.
