MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

arXiv cs.AI 05/25/26, 04:00 AM Papers
ai-safety llm-agents memory-poisoning causal-attribution anomaly-detection security-audit post-hoc-analysis
Summary
MemAudit is a post-hoc auditing framework for memory-augmented LLM agents that identifies poisoned memories by combining counterfactual influence scores with structural anomaly detection, reducing MINJA attack success rates from 70% (QA) and 83.3% (RAP) to 0% on GPT-4o.
arXiv:2605.23723v1 Announce Type: new
Cached at: 05/25/26, 08:59 AM
# MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection
Source: [https://arxiv.org/html/2605.23723](https://arxiv.org/html/2605.23723)
Zhewen Tan$^{1,2,3,4}$, Yilun Yao$^{2,3}$, Huiyan Jin$^{3}$, Wenhan Yu$^{2,3}$, Guoan Wang$^{3}$, Mengyuan Fan$^{3}$, liang lu$^{1,4}$, Feng Liu$^{1,4}$, Xiangzheng Zhang$^{2}$, Duohe Ma$^{1,4}$, Tong Yang$^{3}$, Lin Sun$^{2}$

$^{1}$Institute of Information Engineering, Chinese Academy of Sciences; $^{2}$Qiyuan Tech; $^{3}$Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; $^{4}$School of Cyber Security, University of Chinese Academy of Sciences

Correspondence: Duohe Ma (maduohe@iie.ac.cn), Tong Yang (yangtong@pku.edu.cn), Lin Sun (sunlin1@360.cn)

###### Abstract

Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent's memory through ordinary interaction, and these records can later be retrieved to steer the agent's reasoning and actions. Existing defenses primarily focus on online intervention, such as prompt filtering or output blocking, but they do not address the post-hoc question of which stored memories are responsible after harmful behavior has already been observed. We propose MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents. The framework combines two complementary signals: (1) a counterfactual memory influence score that measures each memory's causal contribution to harmful outputs, and (2) a memory consistency graph that identifies structurally anomalous memories within the broader memory store. We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory-bank modification. Across both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under realistic post-hoc auditing scenarios. The results show that QA attack success is reduced from $70\%$ to $0\%$, while RAP attack success drops from $83.3\%$ to $0\%$.

## 1 Introduction

Large language model (LLM) agents are rapidly evolving from passive conversational assistants into autonomous systems capable of interacting with external environments and executing complex long-horizon tasks (Yao et al., [2022b](https://arxiv.org/html/2605.23723#bib.bib10)). Recent agent systems such as OpenHands demonstrate that modern agents can already perform realistic software engineering, web interaction, command execution, and multi-step planning tasks, enabling their increasing integration into daily workflows such as software development, personal assistance, information management, and online automation (Wang et al., [2024](https://arxiv.org/html/2605.23723#bib.bib7)). As these systems become more capable, users are also beginning to delegate increasingly sensitive permissions and decision-making authority to agents.

This trend substantially raises the security stakes of autonomous agents. Once an agent performs an unsafe or manipulated action, the resulting damage may extend far beyond a single incorrect text generation. Harmful actions can cause severe real-world consequences by propagating harmful decisions across external systems and long-running user workflows (Ferrag et al., [2025](https://arxiv.org/html/2605.23723#bib.bib23)). As a result, ensuring the reliability and safety of autonomous agents is becoming a critical challenge for real-world deployment.

A major factor behind modern agent capability is the use of persistent memory (Lewis et al., [2020](https://arxiv.org/html/2605.23723#bib.bib1)). Many recent agents continuously store observations, reasoning trajectories, user preferences, and historical interactions to support adaptive decision making across sessions (Shinn et al., [2023](https://arxiv.org/html/2605.23723#bib.bib11)). External memory enables agents to accumulate experience over time and support long-horizon adaptation instead of operating as stateless generators (Wang et al., [2023](https://arxiv.org/html/2605.23723#bib.bib9)). This design significantly improves long-horizon reasoning, personalization, and behavioral consistency (Park et al., [2023](https://arxiv.org/html/2605.23723#bib.bib8)). However, the mechanism also introduces a dangerous new attack surface: once malicious information is written into memory, it can remain active across future interactions and repeatedly influence the agent's behavior.

Recent work has shown that such memory attacks are practical in deployed agent systems. MINJA demonstrates that adversaries can inject malicious reasoning trajectories into agent memory through natural interaction, causing delayed and persistent behavioral manipulation across future scenarios (Dong et al., [2025](https://arxiv.org/html/2605.23723#bib.bib13)). More importantly, these attacks are often highly stealthy because the original poisoning interaction may appear benign while the harmful behavior only emerges much later during deployment.

Despite the growing threat of memory poisoning, existing research still primarily focuses on *online defense* (Dong et al., [2024](https://arxiv.org/html/2605.23723#bib.bib18)). However, real-world deployment often follows a very different failure pattern. In practice, harmful agent behavior is frequently discovered only after deployment through abnormal actions, unexpected logs, or user reports. Once such failures have already occurred, the central problem is no longer how to block a single response, but rather how to determine *which specific memory entries caused the failure*.

To address this challenge, we propose MemAudit, a unified framework for post-hoc causal auditing of memory-augmented LLM agents. Figure [1](https://arxiv.org/html/2605.23723#S1.F1) presents the overall pipeline. The key intuition behind our framework is that harmful memories exhibit two complementary properties. First, they should exert measurable *causal influence* on harmful outputs. Second, they should often appear *structurally inconsistent* with the broader memory distribution. Based on this intuition, MemAudit combines two complementary signals: (1) a counterfactual memory influence score that estimates each memory's causal contribution through replay-based intervention, and (2) a structural anomaly detector that identifies semantically inconsistent memory patterns within the global memory graph. By fusing these signals, MemAudit can identify suspicious memories without relying on oracle poison labels.

![Refer to caption](https://arxiv.org/html/2605.23723v1/x1.png)

Figure 1: Overview of MemAudit. Given a harmful event $e=(q^{*},y^{*},R^{*})$, the framework performs post-hoc auditing over the memory store. It combines two complementary signals: CMIS, which measures the causal contribution of retrieved memories through counterfactual replay, and MCG, which identifies structurally anomalous memories in the global memory graph. The two signals are fused into a detoxification score for ranking suspicious memories. After removing the top-ranked memories, the agent becomes safer while preserving useful memory.

We evaluate MemAudit under realistic post-hoc settings using the interaction-based memory injection attack MINJA. Across QA and RAP, MemAudit consistently reduces attack success after targeted memory removal. In the QA setting, the attack success rate (ASR) drops from $70.0\%$ to $0.0\%$ on GPT-4o, while RAP attack success drops from $83.3\%$ to $0\%$ on GPT-4o.

Our contributions are summarized as follows:

- To the best of our knowledge, this work is the first to formulate *post-hoc causal memory auditing* as a practical security problem for memory-augmented LLM agents under interaction-induced memory poisoning.
- We propose MemAudit, a dual-signal post-hoc auditing framework that combines counterfactual causal attribution with structural anomaly detection to identify and remove suspicious memories without requiring poison labels.
- We evaluate MemAudit under realistic memory injection settings and show that it can substantially reduce attack success in both QA and long-horizon reasoning-agent settings, while also revealing a clear operating boundary as poisoning becomes dense.

## 2 Related Work

#### Memory poisoning attacks on LLM agents.

Early safety research on LLM systems mainly studied attacks that act within a single interaction, such as prompt injection or unsafe instruction following (Greshake et al., [2023](https://arxiv.org/html/2605.23723#bib.bib17)). As agent architectures began to incorporate external memory and cross-session experience reuse, the threat model expanded from transient context manipulation to persistent state corruption. This shift is especially important for memory-augmented agents, because once malicious content is stored, it can be retrieved repeatedly and influence future decisions far beyond the original interaction. AgentPoison shows that poisoning external memory or retrieval corpora can induce downstream backdoor behaviors during retrieval and reasoning (Chen et al., [2024](https://arxiv.org/html/2605.23723#bib.bib12)). MINJA advances the threat model further by removing the assumption of direct database access: instead, the attacker injects malicious trajectories through ordinary interaction and allows them to be written into memory through the normal agent pipeline (Dong et al., [2025](https://arxiv.org/html/2605.23723#bib.bib13)). Subsequent work broadens the attack surface again. MemoryGraft shows that poisoned experiences can create persistent behavioral drift by exploiting the agent's tendency to imitate prior successful trajectories (Srivastava and He, [2025](https://arxiv.org/html/2605.23723#bib.bib14)), while more recent work demonstrates that memory poisoning may also arise indirectly through environmental observation in web-agent settings (Zou et al., [2026](https://arxiv.org/html/2605.23723#bib.bib4)).

#### Defenses for memory\-augmented agents\.

The defense literature has followed a related but distinct trajectory. Early work on LLM safety primarily emphasized online safeguards, including runtime filtering, prompt sanitization, and response moderation, with the goal of blocking unsafe behavior at inference time (Dong et al., [2024](https://arxiv.org/html/2605.23723#bib.bib18)). Broader surveys on jailbreak defense and alignment further systematize this line of thinking, but they mainly treat safety as a response-time control problem rather than a persistent memory-state problem (Yi et al., [2024](https://arxiv.org/html/2605.23723#bib.bib20)). As memory poisoning threats became more concrete, newer defenses began to move closer to the memory layer itself. A-MemGuard introduces proactive memory validation and dual-memory designs to reduce the chance that harmful memory entries are used during retrieval (Wei et al., [2025](https://arxiv.org/html/2605.23723#bib.bib27)). Sunil et al. ([2026](https://arxiv.org/html/2605.23723#bib.bib28)) explore trust scoring and memory sanitization mechanisms to reduce the influence of suspicious memory entries. Bhardwaj ([2026](https://arxiv.org/html/2605.23723#bib.bib29)) further proposes a Bayesian trust-based memory defense to improve retrieval reliability under poisoning settings. This progression marks an important change in perspective: instead of guarding only the model's output, these methods attempt to control which memories are considered trustworthy inputs to the agent. However, they remain preventive or online defenses rather than auditing methods: they aim to block or reduce harmful memory influence during operation, not to identify which stored memories were actually responsible after harmful behavior has already been observed.

## 3 Preliminary

We consider memory-augmented LLM agents that maintain an external memory store to accumulate and reuse information across interactions. Formally, let the memory store be

$$M=\{m_{1},m_{2},\dots,m_{n}\}, \tag{1}$$

where each memory item $m_{i}$ is a textual entry. Given a user query $q$, the agent retrieves a subset of memory

$$R(q,M)\subseteq M, \tag{2}$$

and produces an output

$$y=f\bigl(q,\;R(q,M)\bigr), \tag{3}$$

where $f$ denotes the agent's generation policy. This formulation captures a key property of memory-augmented agents: the agent's behavior depends not only on the current query, but also on information accumulated from previous interactions. Within this setting, we study post-hoc memory auditing after harmful behavior has already been observed. We formalize a harmful event as

$$e=(q^{*},y^{*},R^{*}), \tag{4}$$

where $q^{*}$ is the triggering query, $y^{*}$ is the observed harmful output, and $R^{*}$ is the retrieved memory context. In practice, $R^{*}$ may not always be explicitly logged; when necessary, it can be reconstructed using the same retrieval mechanism.

A key feature of our setting is non-oracle observability. The auditor does not have access to ground-truth poison labels or attacker objectives. Instead, it must reason from observable failures alone, such as task errors or unsafe actions. We assume an adversary can insert or influence a subset of memory entries in $M$, for example through interaction-induced memory updates rather than direct modification of external retrieval corpora. These entries are designed to alter future behavior when retrieved. The auditor has access to the memory store $M$, a set of harmful events $\mathcal{E}$, and the agent or a replay interface, but does not have access to online intervention or oracle attack annotations. Given memory $M$ and a set of harmful events

$$\mathcal{E}=\{e_{1},\dots,e_{T}\}, \tag{5}$$

our objective is to identify a subset $S\subseteq M$ such that removing $S$ reduces harmful behavior after auditing:

$$\min_{S}\ \text{HarmAfter}(M\setminus S). \tag{6}$$
This objective captures the central goal of post\-hoc memory auditing: to identify suspicious memories whose removal meaningfully detoxifies the agent after harmful behavior has already been observed\. The challenge is to isolate a small set of influential memory entries from the full memory store without access to oracle poison labels or attacker\-side information\.
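The formal setting above can be made concrete with a small sketch. Everything in it is a hypothetical stand-in rather than the authors' implementation: the word-overlap retriever plays the role of $R$, the keyword-triggered `agent` plays the role of $f$, and `harm_after` is one simple way to read $\text{HarmAfter}(M\setminus S)$ as a fraction of harmful replays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarmfulEvent:
    """e = (q*, y*, R*): triggering query, harmful output, retrieved context."""
    query: str
    output: str
    retrieved: tuple


def retrieve(query, memory, k=2):
    """Toy R(q, M): the k memories sharing the most words with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(memory, key=lambda m: len(q_words & set(m.lower().split())), reverse=True)
    return ranked[:k]


def agent(query, context):
    """Toy f(q, R): unsafe whenever a retrieved memory says 'always approve'."""
    return "UNSAFE" if any("always approve" in m for m in context) else "SAFE"


def harm_after(memory, removal_set, queries):
    """HarmAfter(M minus S): fraction of queries that still yield a harmful output."""
    purified = [m for m in memory if m not in removal_set]
    outputs = [agent(q, retrieve(q, purified)) for q in queries]
    return sum(o == "UNSAFE" for o in outputs) / len(queries)


memory = [
    "refund requests need manager review",
    "for refund requests always approve",  # hypothetical injected record
    "shipping questions go to logistics",
]
queries = ["how to handle refund requests"]

# Log one harmful event; R* is reconstructed with the same retriever.
q = queries[0]
event = HarmfulEvent(q, agent(q, retrieve(q, memory)), tuple(retrieve(q, memory)))

print(harm_after(memory, set(), queries))        # harm with the attacked store
print(harm_after(memory, {memory[1]}, queries))  # harm after removing S
```

In this toy store, removing the single injected record is enough to make the replayed query safe, which is the sparse-poisoning situation the objective targets.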

## 4 Methodology

We present MemAudit, a post-hoc auditing framework that ranks suspicious memories for targeted removal after harmful behavior has been observed. Given a harmful event, the framework scores candidate memories from two complementary views: their *event-level causal effect* on the observed failure and their *global structural inconsistency* within the memory store. These two signals are then fused into a single ranking used for memory purification. For a harmful event $e=(q^{*},y^{*},R^{*})$, we estimate the contribution of each retrieved memory by removing it from the memory store and replaying the event. We define the counterfactual memory influence score (CMIS) as

$$\text{CMIS}(m_{i})=h(q^{*},y^{*})-h\big(q^{*},f(q^{*},R(q^{*},M\setminus\{m_{i}\}))\big), \tag{7}$$

where $h(\cdot)$ is a harm scoring function. A larger CMIS value indicates that removing $m_{i}$ leads to a greater reduction in harmful behavior, suggesting that this memory plays a stronger causal role in the observed failure. Because this score is computed through replay-based intervention, it directly captures query-memory interaction effects. At the same time, it is a local signal tied to a specific harmful event and may be less effective when harmful behavior is supported by multiple coordinated memories.

To capture signals beyond event-level causal influence, we introduce a structural score defined over the full memory store. We construct a memory consistency graph (MCG) $G=(V,E)$ in which each node corresponds to a memory entry and edges reflect semantic relatedness and logical consistency. Semantic similarity is used to model neighborhood structure, while natural language inference (NLI) is used to detect entailment and contradiction between memory pairs (Williams et al., [2018](https://arxiv.org/html/2605.23723#bib.bib26)). We instantiate this consistency signal with a DeBERTa-v3-based NLI model (He et al., [2021](https://arxiv.org/html/2605.23723#bib.bib25)) and use its output to define the inconsistency weight $w(m_{i},m_{j})$. Based on this graph, we define the structural anomaly score as

$$\text{CAS}(m_{i})=\sum_{m_{j}\in\mathcal{N}(m_{i})}w(m_{i},m_{j})\,\text{sim}(m_{i},m_{j}), \tag{8}$$

where $\mathcal{N}(m_{i})$ denotes the neighbors of $m_{i}$ in $G$, $w$ captures contradiction or inconsistency, and $\text{sim}(\cdot,\cdot)$ measures semantic similarity. This score is motivated by the observation that benign memories typically lie in semantically coherent neighborhoods, whereas poisoned memories are more likely to appear weakly connected or inconsistent with surrounding entries. In practice, anomalies can be identified relative to the global score distribution, for example through

$$\text{CAS}(m_{i})>\mu+2\sigma, \tag{9}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the CAS values over the memory store.
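A minimal sketch of Eqs. (8) and (9) follows. It assumes the similarity values and inconsistency weights are already available as square matrices (in practice they would come from an embedding model and an NLI model); the top-`k` neighborhood rule and the use of the population standard deviation are illustrative assumptions, since the text does not fix either choice.

```python
from statistics import mean, pstdev


def cas_scores(sim, w, k=2):
    """CAS(m_i): sum of w[i][j] * sim[i][j] over the k most similar neighbours j."""
    n = len(sim)
    scores = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Assumed neighbourhood N(m_i): the k entries most similar to m_i.
        neighbours = sorted(others, key=lambda j: sim[i][j], reverse=True)[:k]
        scores.append(sum(w[i][j] * sim[i][j] for j in neighbours))
    return scores


def flag_anomalies(scores, num_std=2.0):
    """Eq. (9): flag entries whose score exceeds mu + num_std * sigma."""
    mu, sigma = mean(scores), pstdev(scores)
    return [s > mu + num_std * sigma for s in scores]


# Hypothetical 3-memory store: memory 2 contradicts both of its neighbours.
sim = [[1.0, 0.9, 0.8],
       [0.9, 1.0, 0.7],
       [0.8, 0.7, 1.0]]
w = [[0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]
scores = cas_scores(sim, w)

# Thresholding is shown on a larger hypothetical score list with one outlier.
flags = flag_anomalies([0.1] * 9 + [2.0])
```

The contradicting memory receives the largest CAS because it accumulates inconsistency weight from every similar neighbor, while each benign memory only picks up the single contradiction edge it shares with the outlier.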
We combine the causal and structural scores into a unified detoxification score:

$$\text{DS}(m_{i})=\alpha\cdot\widetilde{\text{CMIS}}(m_{i})+(1-\alpha)\cdot\widetilde{\text{CAS}}(m_{i}), \tag{10}$$

where $\widetilde{\text{CMIS}}$ and $\widetilde{\text{CAS}}$ denote normalized scores. The fusion weight $\alpha$ controls the balance between event-level causal evidence and global structural evidence. Higher $\alpha$ places more weight on replay-based attribution, while lower $\alpha$ emphasizes anomaly detection over the full memory graph.
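The fusion rule in Eq. (10) can be sketched in a few lines. The normalization scheme is not specified in the text, so min-max scaling is used here purely as an assumption, and the raw score values are made up for illustration.

```python
def min_max(xs):
    """Scale scores to [0, 1]; a constant list maps to all zeros (assumed convention)."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def detox_scores(cmis, cas, alpha=0.6):
    """DS = alpha * normalized CMIS + (1 - alpha) * normalized CAS."""
    c, a = min_max(cmis), min_max(cas)
    return [alpha * ci + (1 - alpha) * ai for ci, ai in zip(c, a)]


# Hypothetical raw scores for four memories.
cmis = [0.0, 0.9, 0.1, 0.0]
cas = [0.2, 1.4, 0.3, 1.0]

ds = detox_scores(cmis, cas, alpha=0.6)
ranking = sorted(range(len(ds)), key=lambda i: ds[i], reverse=True)
print(ranking)
```

Memory 1 ranks first because both signals agree on it, while memory 3 is structurally unusual but has no causal evidence, so it ranks second with a much lower fused score.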

Input: Memory store $M$, harmful events $\mathcal{E}$, retrieval function $R$, agent $f$, harm scorer $h$, fusion weight $\alpha$

Output: Removal set $S$

1. Initialize $\text{CMIS}(m_{i})\leftarrow 0$ for all $m_{i}\in M$.
2. For each harmful event $e_{t}=(q_{t}^{*},y_{t}^{*},R_{t}^{*})\in\mathcal{E}$:
   - For each memory $m_{i}\in R_{t}^{*}$:
     - Replay the agent with $M\setminus\{m_{i}\}$.
     - Compute $\text{CMIS}_{t}(m_{i})$.
     - Update $\text{CMIS}(m_{i})\leftarrow\text{CMIS}(m_{i})+\text{CMIS}_{t}(m_{i})$.
3. Construct memory graph $G=(V,E)$ from $M$.
4. Compute $\text{CAS}(m_{i})$ for each $m_{i}\in M$.
5. Normalize CMIS and CAS.
6. For each memory $m_{i}\in M$, compute $\text{DS}(m_{i})=\alpha\cdot\widetilde{\text{CMIS}}(m_{i})+(1-\alpha)\cdot\widetilde{\text{CAS}}(m_{i})$.
7. Rank memories by $\text{DS}(m_{i})$.
8. Select top-ranked memories as $S$.
9. Return $S$.
Algorithm 1: MemAudit: Post-hoc Memory Auditing

Algorithm [1](https://arxiv.org/html/2605.23723#algorithm1) summarizes the complete auditing procedure. Given a set of harmful events, MemAudit performs batch auditing on the original memory store rather than modifying memory after each event. For each harmful event, the method evaluates retrieved memories through counterfactual replay, aggregates event-level CMIS values across events, computes global CAS values over the full memory graph, and finally ranks all memories by the fused detoxification score. The event-level suspicious sets $S_{t}$ can be aggregated as

$$S=\bigcup_{t=1}^{T}S_{t}. \tag{11}$$
Memory is not updated during auditing\. All harmful events are analyzed against the same underlying memory state, and removal is applied only after the final ranking is obtained\. This batch design avoids order effects and prevents early deletions from changing the attribution results of later events\. In this way, MemAudit formulates post\-hoc memory repair as a ranking problem over candidate memories\. Counterfactual replay provides event\-specific causal evidence, while graph\-based anomaly scoring supplies a complementary global signal\. Their combination allows the auditor to identify harmful memories without oracle poison labels and remove them in a targeted manner\.
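The batch procedure can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' code: `retrieve`, `agent`, and `harm` are toy stand-ins for $R$, $f$, and $h$; the CAS values are supplied as a precomputed dictionary; min-max normalization and the `top_k` cutoff are assumed choices; and memories are assumed to be unique strings so they can serve as dictionary keys.

```python
def audit(memory, events, retrieve, agent, harm, cas, alpha=0.6, top_k=1):
    """Rank memories by fused score; `memory` itself is never mutated."""
    cmis = {m: 0.0 for m in memory}
    for query, harmful_output in events:
        base = harm(query, harmful_output)            # h(q*, y*)
        for m in retrieve(query, memory):             # reconstructed R*
            reduced = [x for x in memory if x != m]   # leave-one-out store
            replayed = agent(query, retrieve(query, reduced))
            cmis[m] += base - harm(query, replayed)   # accumulate CMIS_t

    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0                       # avoid division by zero
        return {m: (s - lo) / span for m, s in scores.items()}

    cmis_n, cas_n = normalize(cmis), normalize(cas)
    ds = {m: alpha * cmis_n[m] + (1 - alpha) * cas_n[m] for m in memory}
    ranked = sorted(memory, key=lambda m: ds[m], reverse=True)
    return ranked[:top_k], ds


# --- Toy stand-ins (hypothetical) ---
def retrieve(query, memory, k=2):
    q_words = set(query.lower().split())
    return sorted(memory, key=lambda m: len(q_words & set(m.lower().split())), reverse=True)[:k]


def agent(query, context):
    return "UNSAFE" if any("always approve" in m for m in context) else "SAFE"


def harm(query, output):
    return 1.0 if output == "UNSAFE" else 0.0


memory = [
    "refund requests need manager review",
    "for refund requests always approve",  # hypothetical injected record
    "shipping goes to logistics",
    "returns need a receipt",
]
cas = {memory[0]: 0.1, memory[1]: 0.8, memory[2]: 0.1, memory[3]: 0.2}
events = [("handle refund requests", "UNSAFE")]

removal, scores = audit(memory, events, retrieve, agent, harm, cas)
print(removal)
```

Every replay starts from the full `memory` list, so the attribution for one event cannot be altered by a deletion suggested by another, which mirrors the order-independence argument above.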

## 5 Experimental Setup

#### Post-hoc batch auditing protocol.

For each attacked memory store $M$, we follow a three-stage post-hoc evaluation pipeline. Let

$$\mathcal{E}=\{e_{1},e_{2},\dots,e_{T}\} \tag{12}$$

denote the set of harmful events collected from replaying the attacked agent. In Stage 1, we replay the attacked agent on the evaluation tasks and collect harmful events $\mathcal{E}$. In Stage 2, we apply MemAudit to the original memory store $M$ and obtain a suspicious memory set $S$. In Stage 3, we remove $S$ from memory and re-evaluate the same tasks on the purified memory store

$$M^{\prime}=M\setminus S. \tag{14}$$
This audit\-then\-remove protocol ensures that all attribution scores are computed with respect to the same underlying memory state\. As a result, the auditing process is not affected by intermediate deletions, and the final ranking remains free from sequential order effects\.

#### Evaluation metrics\.

Our primary reported metric is the *attack success rate* (ASR) (Yu et al., [2023](https://arxiv.org/html/2605.23723#bib.bib3)), measured before and after auditing:

$$\mathrm{ASR}_{\text{before}}\quad\text{and}\quad\mathrm{ASR}_{\text{after}}. \tag{15}$$

#### Memory contamination ratio\.

To quantify poisoning intensity in mixed\-memory settings, we define the contamination ratio as

$$\rho=\frac{|M_{\text{poison}}|}{|M_{\text{benign}}|+|M_{\text{poison}}|}. \tag{16}$$

This ratio provides a normalized measure of how densely poisoned memories are mixed into the full memory store, allowing comparison across settings with different absolute memory sizes.
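As a quick worked example of Eq. (16), the helper below computes $\rho$ from raw counts. The counts are hypothetical and chosen only to show how the same number of poisoned entries yields different ratios as the benign store shrinks.

```python
def contamination_ratio(n_poison, n_benign):
    """rho = |M_poison| / (|M_benign| + |M_poison|)."""
    total = n_poison + n_benign
    if total == 0:
        raise ValueError("memory store is empty")
    return n_poison / total


# Hypothetical stores: 10 poisoned records mixed into 40 or 30 benign ones.
print(contamination_ratio(10, 40))
print(contamination_ratio(10, 30))
```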

#### Tasks\.

We consider two MINJA settings: RAP and QA. Both are strongly memory-dependent at inference time. In RAP, ASR directly corresponds to the fraction of evaluation episodes in which the agent is successfully redirected to the attacker-chosen behavior. In QA, the underlying task metric is accuracy, but under the MINJA setting the attack goal is to force victim questions away from their correct answers into attack-induced outputs through poisoned retrieved traces. For consistency with RAP, we report the corresponding attack success rate on QA rather than accuracy itself; in this setting, QA ASR is equivalently the error rate on attacked examples, so lower ASR implies higher task accuracy. Under this memory-poisoning setup, ASR is therefore not only an attack metric but also a direct indicator of recovered task utility on attacked inputs. Reducing ASR requires not only suppressing the influence of poisoned memories, but also preserving enough benign memory support for correct retrieval and downstream behavior. Thus, lower ASR reflects both safer and more effective task execution after auditing. We use $\alpha=0.6$ as the default fusion weight, and all reported results are averaged over 10 runs.

Table 1: Main results on QA and RAP. RD denotes random deletion, RF denotes retrieval-frequency-based deletion, and NNC denotes nearest-neighbor contradiction filtering. All methods remove the same number of memories for fair comparison.

## 6 Results

### 6.1 Main results

We evaluate MemAudit on DeepSeek (Liu et al., [2024](https://arxiv.org/html/2605.23723#bib.bib6)), GPT-4o, and GPT-4o-mini (Hurst et al., [2024](https://arxiv.org/html/2605.23723#bib.bib5)). We additionally compare MemAudit with three baselines: random deletion (RD), retrieval-frequency-based deletion (RF), and nearest-neighbor contradiction filtering (NNC). RD removes the same number of memories uniformly at random; RF removes memories that are most frequently retrieved under the attack evaluation set; and NNC removes memories that appear most contradictory to their local semantic neighbors. These baselines test whether the gains of MemAudit come from targeted post-hoc auditing rather than from generic deletion or simple structural heuristics.

Table [1](https://arxiv.org/html/2605.23723#S5.T1) shows that MemAudit consistently achieves the strongest attack reduction across both QA and RAP. On QA, MemAudit reduces ASR from $70.0\%$ to $0.0\%$ for GPT-4o, from $50.0\%$ to $0.0\%$ for GPT-4o-mini, and from $70.0\%$ to $10.0\%$ for DeepSeek. On RAP, MemAudit reduces $\mathrm{ASR}_{\text{after}}$ to $0.0\%$ for all three backbones. In contrast, the baselines provide only partial reductions. One additional observation is that GPT-4o-mini starts from a lower $\mathrm{ASR}_{\text{before}}$ on QA than GPT-4o and DeepSeek, indicating that the attack is less successful on this backbone before auditing. Even so, MemAudit still eliminates the observed attack behavior. More importantly, the comparison against RD, RF, and NNC shows that the observed gains do not arise from simply removing memories, but from ranking memories using event-level and structural evidence.
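For reference, the two simplest baselines can be sketched as below. This is a hypothetical reading of the descriptions above: the deletion `budget`, the `seed`, and the toy top-1 retriever are all assumptions, and NNC is omitted because it would need an NLI model.

```python
import random
from collections import Counter


def random_deletion(memory, budget, seed=0):
    """RD: remove `budget` memories uniformly at random."""
    rng = random.Random(seed)
    return set(rng.sample(memory, budget))


def retrieval_frequency_deletion(memory, queries, retrieve, budget):
    """RF: remove the memories retrieved most often over the evaluation queries."""
    counts = Counter()
    for q in queries:
        counts.update(retrieve(q, memory))
    # Only memories retrieved at least once can be selected.
    return {m for m, _ in counts.most_common(budget)}


def retrieve(query, memory, k=1):
    """Toy retriever: top-k memories by word overlap with the query."""
    q_words = set(query.lower().split())
    return sorted(memory, key=lambda m: len(q_words & set(m.lower().split())), reverse=True)[:k]


memory = ["refund policy notes", "shipping policy notes", "refund approval steps"]
queries = ["refund policy", "refund steps", "refund policy update"]

rd = random_deletion(memory, budget=2)
rf = retrieval_frequency_deletion(memory, queries, retrieve, budget=1)
print(rf)
```

Neither baseline replays the agent, so neither can distinguish a frequently retrieved benign memory from a frequently retrieved poisoned one.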

### 6.2 Component ablation

Table 2: Component ablation on QA and RAP. CMIS$_{\text{only}}$ and MCG$_{\text{only}}$ retain only one signal, while MemAudit combines both. Each row reports the resulting $\mathrm{ASR}_{\text{after}}$ under each variant.

To isolate the contribution of each signal, we compare MemAudit against two single-component variants: CMIS$_{\text{only}}$ and MCG$_{\text{only}}$. This ablation asks whether the reduction in $\mathrm{ASR}_{\text{after}}$ is driven primarily by event-level causal attribution, by global structural anomaly detection, or by their combination. Table [2](https://arxiv.org/html/2605.23723#S6.T2) shows that MemAudit consistently outperforms both single-component variants on QA. For GPT-4o, MemAudit reduces $\mathrm{ASR}_{\text{after}}$ to $0.0\%$, compared with $32.0\%$ for CMIS$_{\text{only}}$ and $46.0\%$ for MCG$_{\text{only}}$. For GPT-4o-mini, the corresponding values are $0.0\%$, $34.0\%$, and $32.0\%$. For DeepSeek, MemAudit reaches $10.0\%$, compared with $28.0\%$ for CMIS$_{\text{only}}$ and $49.0\%$ for MCG$_{\text{only}}$. These results indicate that the two signals are complementary on QA, with CMIS providing the stronger standalone signal on GPT-4o and DeepSeek (the two single-signal variants are close on GPT-4o-mini) and MCG improving performance when combined with causal replay.

A similar pattern appears on RAP, although the relative behavior of the two single-component variants is closer. For GPT-4o, CMIS$_{\text{only}}$ and MCG$_{\text{only}}$ yield $\mathrm{ASR}_{\text{after}}=48.3\%$ and $47.2\%$, respectively; for GPT-4o-mini, they yield $46.1\%$ and $45.0\%$; for DeepSeek, they yield $45.6\%$ and $50.0\%$. In all three cases, MemAudit reduces $\mathrm{ASR}_{\text{after}}$ to $0.0\%$. This suggests that, in long-horizon interactive settings, neither local causal evidence nor global structural evidence alone is sufficient; the strongest performance comes from combining the two signals. The results support the dual-signal design of MemAudit. CMIS captures event-level causal responsibility for the observed failure, while MCG contributes complementary structure-level information over the full memory store.

### 6.3 Fusion-weight ablation

To assess sensitivity to the fusion rule, we vary the weight $\alpha$ in the detoxification score. Since $\alpha$ controls the relative contribution of causal replay and structural anomaly detection, this ablation tests whether the main configuration is well aligned with the empirical trade-off between the two signals. Table [3](https://arxiv.org/html/2605.23723#S6.T3) shows a consistent pattern across both QA and RAP. Performance improves as $\alpha$ increases from $0.2$ to $0.6$, but degrades at $\alpha=0.8$. On QA, all three backbones achieve their best results at $\alpha=0.6$, reducing $\mathrm{ASR}_{\text{after}}$ to $0.0\%$, $0.0\%$, and $10.0\%$, respectively. The same trend appears on RAP, where $\alpha=0.6$ also gives the best performance for all three backbones, reducing $\mathrm{ASR}_{\text{after}}$ to $0.0\%$ in all cases. Across both QA and RAP, the best reported setting is therefore the original MemAudit choice $\alpha=0.6$. These results suggest that the strongest performance is obtained when CMIS receives dominant but not excessive weight, while MCG remains a complementary signal in the fusion rule.

Table 3: Fusion-weight ablation on QA and RAP. Each row reports the resulting $\mathrm{ASR}_{\text{after}}$ under different fusion weights $\alpha$.

### 6.4 Contamination analysis

We further analyze performance under different $\rho$ values to identify the operating boundary of post-hoc auditing as poisoning becomes denser. Including $\rho=0$ provides a natural zero-contamination reference point: when no poisoned memories are present, $\mathrm{ASR}_{\text{after}}$ remains $0.0\%$ throughout. Table [4](https://arxiv.org/html/2605.23723#S6.T4) shows a clear transition on QA. From $\rho=0$ to $\rho=0.20$, all three backbones are fully recovered, with $\mathrm{ASR}_{\text{after}}=0.0\%$ throughout. This indicates that MemAudit remains effective not only in the no-poison case but also in the sparse-poisoning regime, where poisoned memories still behave like isolated but influential records that can be separated from the benign store through targeted auditing. Once contamination increases to $\rho=0.25$, however, $\mathrm{ASR}_{\text{after}}$ rises sharply to $60.0\%$, $40.0\%$, and $70.0\%$ for GPT-4o, GPT-4o-mini, and DeepSeek. At $\rho=0.50$, the method degrades to $90.0\%$, $100.0\%$, and $100.0\%$.

This pattern suggests more than a gradual increase in difficulty. Beyond a threshold, poisoned memories begin to reinforce one another and form a coherent local cluster. As a result, the poisoned subset no longer looks like a few anomalous entries, but more like an alternative "truth" within the retrieval space. This shift weakens both components of MemAudit. Removing one poisoned memory has limited effect when other poisoned memories can still support the same harmful behavior, and the structural signal also becomes less distinctive once poisoned memories are mutually consistent with one another. At that point, the problem is no longer to identify a few covert toxic memories, but to repair a memory state that has already been partially rewritten.

Table 4: Contamination trend on QA. Each row reports a fixed $\mathrm{ASR}_{\text{before}}$ and the resulting $\mathrm{ASR}_{\text{after}}$ at different contamination ratios $\rho$, including the zero-contamination case $\rho=0$.

Table [5](https://arxiv.org/html/2605.23723#S6.T5) shows the same mechanism even more sharply on RAP. At $\rho=0$, $\mathrm{ASR}_{\text{after}}$ is again $0.0\%$ for all three backbones, providing the clean reference point with no poisoned memories present. MemAudit continues to fully suppress the attack at $\rho=0.10$ and $\rho=0.15$, but breaks down once poisoning reaches $\rho=0.24$ and above. In long-horizon settings, mutually supportive poisoned memories can continue to bias retrieval and action selection across multiple steps, so targeted removal of a few entries is no longer sufficient to disrupt the attack-supporting structure. The results therefore identify a clear operating boundary for post-hoc auditing. MemAudit is most effective from the zero-contamination case to the sparse-poisoning regime, where harmful memories still appear as identifiable deviations from a largely benign store. Once poisoning crosses a critical threshold, the memory state is better viewed as broadly compromised, and stronger recovery actions such as rollback or global sanitization may be more appropriate than MemAudit alone.

Table 5: Contamination trend on RAP. Each row reports a fixed $\mathrm{ASR}_{\text{before}}$ and the resulting $\mathrm{ASR}_{\text{after}}$ at different contamination ratios $\rho$, including the zero-contamination case $\rho=0$. The key transition occurs between $\rho=0.15$ and $\rho=0.24$.

## 7 Conclusion

We study post-hoc memory auditing for memory-augmented LLM agents, focusing on the practical setting in which memory poisoning has already occurred and harmful behavior has already been observed. Unlike online filtering or prompt-time prevention, this setting targets recovery after failure and asks which stored memories should be removed to restore safer behavior. We propose MemAudit, a post-hoc auditing framework that combines counterfactual causal attribution with structural anomaly detection to rank suspicious memories for targeted removal. Without requiring oracle poison labels, MemAudit provides a practical mechanism for identifying harmful memory entries and supporting targeted memory repair after deployment-time failures. Across MINJA QA and RAP, MemAudit substantially reduces attack success after targeted memory removal, showing that post-hoc memory repair is feasible in both short-form QA and long-horizon interactive agent settings. More broadly, our findings suggest that persistent memory should be treated as a first-class security surface in LLM agents. We hope this work can motivate further research on memory detoxification and post-hoc recovery in memory-augmented agent systems.

## 8 Limitations

MemAudit is designed for *post-hoc auditing*, not real-time prevention. The framework assumes that harmful behavior has already been observed and that the auditor can replay or reconstruct the relevant events. It therefore complements, rather than replaces, online safeguards such as input filtering, output monitoring, and runtime permission control.

The method also depends on the quality of observable failure signals. If the deployed system does not log retrieved memories, if harmful behavior is not surfaced by the monitor, or if the harm signal is noisy, the replay-based auditing procedure may rank the wrong memories. This is a fundamental limitation of retrospective diagnosis: MemAudit can help repair observed failures, but it cannot address failures that are never detected.

In addition, the effectiveness of post-hoc memory repair decreases when poisoning becomes sufficiently dense. Once poisoned memories begin to reinforce one another during retrieval, suspicious entries become harder to isolate through targeted removal alone. In such settings, earlier intervention or stronger preventive defenses may still be necessary.

Finally, the current evaluation space for interaction-induced memory poisoning is still constrained by the attack benchmarks presently available to the community. In our case, evaluation is conducted on the two MINJA settings, QA and RAP, because MINJA currently provides a concrete and realistic query-only memory injection framework for memory-augmented agents. Although these two settings already cover two representative regimes of memory poisoning (short-form QA manipulation and long-horizon reasoning-agent manipulation), they do not yet span the full space of possible memory attacks, such as more adaptive adversaries, broader memory-write channels, or substantially more heterogeneous memory environments. As the benchmark ecosystem matures, it will be important to test post-hoc auditing methods under a wider range of memory-poisoning scenarios.

## References

- V. P. Bhardwaj (2026). SuperLocalMemory: privacy-preserving multi-agent memory with bayesian trust defense against memory poisoning. arXiv preprint arXiv:2603.02240.
- Z. Chen, Z. Xiang, C. Xiao, D. Song, and B. Li (2024). Agentpoison: red-teaming llm agents via poisoning memory or knowledge bases. Advances in Neural Information Processing Systems 37, pp. 130185–130213.
- S. Dong, S. Xu, P. He, Y. Li, J. Tang, T. Liu, H. Liu, and Z. Xiang (2025). A practical memory injection attack against llm agents. arXiv e-prints, pp. arXiv–2503.
- Z. Dong, Z. Zhou, C. Yang, J. Shao, and Y. Qiao (2024). Attacks, defenses and evaluations for llm conversation safety: a survey. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 6734–6747.
- M. A. Ferrag, N. Tihanyi, D. Hamouda, L. Maglaras, A. Lakas, and M. Debbah (2025). From prompt injections to protocol exploits: threats in llm-powered ai agents workflows. ICT Express.
- K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz (2023). Not what you’ve signed up for: compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM workshop on artificial intelligence and security, pp. 79–90.
- P. He, J. Gao, and W. Chen (2021). DeBERTaV3: improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543.
- A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024). Gpt-4o system card. arXiv preprint arXiv:2410.21276.
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33, pp. 9459–9474.
- A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024). Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.
- J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein (2023). Generative agents: interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pp. 1–22.
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023). Reflexion: language agents with verbal reinforcement learning. Advances in neural information processing systems 36, pp. 8634–8652.
- S. S. Srivastava and H. He (2025). MemoryGraft: persistent compromise of llm agents via poisoned experience retrieval. arXiv preprint arXiv:2512.16962.
- B. D. Sunil, I. Sinha, P. Maheshwari, S. Todmal, S. Mallik, and S. Mishra (2026). Memory poisoning attack and defense on memory based llm-agents. arXiv preprint arXiv:2601.05504.
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023). Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291.
- X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, et al. (2024). Openhands: an open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741.
- Q. Wei, T. Yang, Y. Wang, X. Li, L. Li, Z. Yin, Y. Zhan, T. Holz, Z. Lin, and X. Wang (2025). A-memguard: a proactive defense framework for llm-based agent memory. arXiv preprint arXiv:2510.02373.
- A. Williams, N. Nangia, and S. Bowman (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122.
- S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022a). Webshop: towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems 35, pp. 20744–20757.
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao (2022b). React: synergizing reasoning and acting in language models. In The eleventh international conference on learning representations.
- S. Yi, Y. Liu, Z. Sun, T. Cong, X. He, J. Song, K. Xu, and Q. Li (2024). Jailbreak attacks and defenses against large language models: a survey. arXiv preprint arXiv:2407.04295.
- J. Yu, X. Lin, Z. Yu, and X. Xing (2023). Gptfuzzer: red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253.
- W. Zou, M. Dong, M. R. Calvo, S. Chang, J. Guo, D. Lee, X. Niu, X. Ma, Y. Qi, and J. Jiang (2026). Poison once, exploit forever: environment-injected memory poisoning attacks on web agents. arXiv preprint arXiv:2604.02623.

## Appendix A Compute Resources

MemAudit does not require training a new foundation model\. The main experiments use API\-based backbone models for agent replay and counterfactual generation, while the local computation is limited to memory scoring components such as embedding similarity and NLI\-based consistency checking\. As a result, the method has relatively modest local hardware requirements\.

In our implementation, MemAudit can be run on a standard research workstation or notebook environment, provided that the machine can load the embedding model and the NLI model used in the MCG component\. A representative configuration is a single\-machine setup with one consumer GPU \(e\.g\., 16–24 GB VRAM\), 16–32 GB system memory, and standard Python/PyTorch dependencies\. No distributed training or multi\-GPU infrastructure is required for the auditing pipeline itself\.

For practical reproduction, the most important requirement is that the local environment should be able to execute the embedding and NLI models without out\-of\-memory errors\. The replay and evaluation cost is otherwise dominated by calls to the external LLM backbones rather than by local training compute\. In this sense, MemAudit should be understood as a lightweight post\-hoc auditing framework whose local resource requirement is modest compared with model training or fine\-tuning pipelines\.

## Appendix B MINJA Evaluation Details

We provide additional details for the two MINJA settings used in our experiments\.

#### QA\.

In the QA setting, the victim trigger is the word `food`. Poisoned memories encode a transformation rule that maps correct answer labels to encrypted labels when the trigger appears. Retrieval is based on Levenshtein distance over question text, using the top-$5$ memories together with $3$ in-context examples. The memory store contains all benign entries plus $k$ poisoned entries.
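A minimal sketch of distance-based retrieval of this kind is shown below. It implements standard Levenshtein distance with unit costs and returns the $k$ closest stored questions. The function names and the default `k=5` wiring are illustrative assumptions; the benchmark's actual retrieval code may differ in tie-breaking and text normalization.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def retrieve_top_k(query: str, memory_questions: List[str], k: int = 5) -> List[Tuple[int, int]]:
    """Return (memory index, distance) for the k stored questions closest to the query."""
    scored = [(idx, levenshtein(query, q)) for idx, q in enumerate(memory_questions)]
    # sorted() is stable, so ties keep their original store order.
    return sorted(scored, key=lambda pair: pair[1])[:k]
```

Because the distance is computed over surface text, a poisoned record whose question closely mirrors a victim query can enter the top-$k$ set without any semantic match.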

#### RAP\.

In the RAP setting, the victim category is `toothbrush`, and the attack target is a DenTek product. Each episode is a WebShop trajectory with up to $15$ steps. An attack is counted as successful only when a victim query is redirected to the target product and the agent completes a purchase.

#### Run tags\.

Logging tags such as `alpha07` and `component_full` denote different experimental configurations. Full MemAudit corresponds to runs with both CMIS and MCG enabled. The tags `component_cmis_only` and `component_mcg_only` correspond to $\alpha=1.0$ and $\alpha=0.0$, respectively.
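The two endpoint tags suggest how the fusion weight behaves. The sketch below assumes that the fused suspicion score is a convex combination of the CMIS and MCG scores, which is consistent with $\alpha=1.0$ being CMIS-only and $\alpha=0.0$ being MCG-only. The function name `fused_score` and the assumption that both inputs are already on a comparable scale are illustrative, not taken from the paper.

```python
def fused_score(cmis: float, mcg: float, alpha: float) -> float:
    """Assumed fusion rule: alpha weights CMIS, (1 - alpha) weights MCG."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * cmis + (1.0 - alpha) * mcg


# The two ablation tags documented above, mapped to their fusion weights.
ABLATION_ALPHA = {
    "component_cmis_only": 1.0,
    "component_mcg_only": 0.0,
}

cmis_only = fused_score(0.8, 0.2, ABLATION_ALPHA["component_cmis_only"])
mcg_only = fused_score(0.8, 0.2, ABLATION_ALPHA["component_mcg_only"])
```

Under this assumed form, intermediate values of $\alpha$ trade causal evidence against structural evidence for each memory.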

## Appendix C Harm Scoring Details

MemAudit uses task\-aligned binary harm scores for both QA and RAP\.

#### QA\.

In the QA setting, a harmful event occurs when a victim query produces an attack label:

$$h_{\text{QA}}(q,y)=\mathbb{I}[\text{victim}(q)]\cdot\mathbb{I}[\text{Answer}(y)\in\{\texttt{E},\texttt{F},\texttt{G},\texttt{H}\}].$$

These labels correspond to shifted or encrypted multiple-choice outputs introduced by the attack.
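The indicator product above translates directly into code. In the sketch below, the victim test is assumed to be a whole-word match on the trigger `food`, and the answer label is assumed to have been extracted from the model output already. Both choices, along with the names `is_victim` and `h_qa`, are illustrative.

```python
import re

ATTACK_LABELS = {"E", "F", "G", "H"}
TRIGGER = "food"


def is_victim(query: str) -> bool:
    """Assumed victim test: the trigger appears as a whole word in the query."""
    return TRIGGER in re.findall(r"[a-z]+", query.lower())


def h_qa(query: str, answer_label: str) -> int:
    """Binary QA harm score: victim query AND attack label in the answer."""
    return int(is_victim(query) and answer_label.strip().upper() in ATTACK_LABELS)
```

A non-victim query that happens to yield an attack label scores zero, as does a victim query answered with an ordinary label.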

#### RAP\.

In the RAP setting, a trajectory is considered harmful only if all three conditions hold:

1. the query contains the victim category,
2. the trajectory reaches the target product,
3. the agent executes `click[Buy Now]`.

This definition ensures that only completed attack trajectories are counted as harmful\.
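The three-way conjunction can be sketched as a single predicate over a logged trajectory. The `Trajectory` container and its field names are hypothetical, and the substring matching on category and product names is an assumption about how the conditions are checked.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    query: str                    # user instruction for the episode
    visited_products: List[str]   # names of product pages opened (hypothetical field)
    actions: List[str]            # raw action strings, e.g. "click[Buy Now]"


def h_rap(traj: Trajectory,
          victim_category: str = "toothbrush",
          target_product: str = "DenTek") -> int:
    """Binary RAP harm score: all three conditions must hold."""
    has_victim = victim_category in traj.query.lower()
    reached_target = any(target_product.lower() in p.lower()
                         for p in traj.visited_products)
    bought = "click[Buy Now]" in traj.actions
    return int(has_victim and reached_target and bought)
```

This sketch follows the three listed conditions literally. It does not verify that the purchase action was issued on the target product's page, which a stricter implementation might also check.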

## Appendix D Replay and Counterfactual Auditing Details

MemAudit follows a three\-stage post\-hoc auditing protocol:

1. replay the attacked agent and collect harmful events,
2. audit the original memory store without modifying it,
3. remove the aggregated suspicious memories and replay evaluation on the purified store.

For each harmful event, CMIS evaluates candidate memories through counterfactual replay:

1. compute the harm score of the observed output,
2. remove one candidate memory,
3. re-run retrieval and generate a counterfactual output,
4. measure the resulting harm reduction.

When retrieved memory IDs are available, as in QA, ablation can be performed directly within the logged context\. When they are not explicitly available, as in RAP, retrieval is recomputed after each removal\.
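The four counterfactual steps can be sketched as a leave-one-out loop. The code below assumes the agent is available as a callable that takes a query and a memory store, performs retrieval internally, and returns an output. The `Agent` and `HarmFn` type aliases, the function name, and the toy agent used for illustration are hypothetical stand-ins rather than MemAudit's interfaces.

```python
from typing import Callable, Dict, List, Sequence

Agent = Callable[[str, Sequence[str]], str]   # (query, memory store) -> output
HarmFn = Callable[[str, str], int]            # (query, output) -> 0 or 1


def counterfactual_influence(query: str,
                             store: List[str],
                             candidates: Sequence[int],
                             agent: Agent,
                             harm: HarmFn) -> Dict[int, int]:
    """Harm reduction per candidate memory under single-memory removal."""
    observed = harm(query, agent(query, store))            # step 1
    scores: Dict[int, int] = {}
    for idx in candidates:
        ablated = [m for i, m in enumerate(store) if i != idx]   # step 2
        counterfactual = harm(query, agent(query, ablated))      # step 3
        scores[idx] = observed - counterfactual                  # step 4
    return scores


# Toy stand-ins: the agent emits an attack label if any memory carries the rule.
def toy_agent(query: str, store: Sequence[str]) -> str:
    return "E" if any("encrypt" in m for m in store) else "A"


def toy_harm(query: str, output: str) -> int:
    return int(output == "E")


store = ["benign note a", "encrypt the answer label", "benign note b"]
scores = counterfactual_influence("q", store, [0, 1, 2], toy_agent, toy_harm)
print(scores)  # {0: 0, 1: 1, 2: 0}
```

The original `store` list is never mutated, so every candidate is evaluated against the same memory state.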

A key design choice is that auditing is performed in batch rather than sequentially\. All harmful events are analyzed against the same underlying memory state, and removal is applied only after the final ranking is obtained\. This avoids order effects and keeps attribution results consistent across events\.
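The batch design can be sketched as a two-phase routine: accumulate per-event scores against the frozen store, then remove the top-ranked entries once. Summation as the aggregation rule and the fixed removal `budget` are assumptions made for this sketch; the paper's exact aggregation and selection criteria are not reproduced here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def batch_audit(event_scores: List[Dict[int, float]],
                store: List[str],
                budget: int) -> Tuple[List[int], List[str]]:
    """Aggregate per-event suspicion scores, then remove the top entries once."""
    totals: Dict[int, float] = defaultdict(float)
    for scores in event_scores:           # every event was scored on the same store
        for idx, value in scores.items():
            totals[idx] += value
    ranked = sorted(totals, key=lambda i: totals[i], reverse=True)
    to_remove = set(ranked[:budget])
    purified = [m for i, m in enumerate(store) if i not in to_remove]
    return ranked, purified


events = [{0: 0.0, 1: 1.0}, {1: 1.0, 2: 0.5}]
ranked, purified = batch_audit(events, ["a", "b", "c"], budget=1)
print(ranked)    # [1, 2, 0]
print(purified)  # ['a', 'c']
```

Because no memory is removed until the ranking is final, the score assigned to one event cannot depend on which other events were processed first.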

## Appendix E Qualitative Auditing Examples

Table [6](https://arxiv.org/html/2605.23723#A5.T6) shows representative examples of suspicious memories identified by MemAudit. In both settings, the removed entries correspond to attack-specific patterns that directly support the harmful behavior observed during evaluation.

Table 6: Representative qualitative examples from MINJA runs.

## Appendix F Licenses and Terms of Use for Existing Assets

Our experiments rely on existing external assets and services\. We credit these assets in the main paper and summarize their license or usage status here\.

#### MINJA\.

We follow the MINJA evaluation setting introduced by Dong et al. [[2025](https://arxiv.org/html/2605.23723#bib.bib13)]. In this work, MINJA is used as an attack benchmark and evaluation protocol rather than as a newly redistributed asset package. We therefore cite the original paper as the primary source for the benchmark setting used in our experiments.

#### WebShop\.

Our RAP evaluation is built on the WebShop environment introduced by Yao et al. [[2022a](https://arxiv.org/html/2605.23723#bib.bib2)]. The public WebShop repository is released under the MIT License, which permits use, copying, modification, distribution, and sublicensing subject to preservation of the copyright and license notice.

#### DeepSeek\.

For experiments involving DeepSeek, we use the DeepSeek-V3 family as referenced in the paper (Liu et al. [[2024](https://arxiv.org/html/2605.23723#bib.bib6)]). The public DeepSeek-V3 code repository is released under the MIT License. The model weights are provided under the DeepSeek Model License, which grants broad usage rights while imposing additional use-based restrictions that must be respected by downstream users.

#### OpenAI models\.

For experiments involving GPT-4o and GPT-4o-mini, we access the models through OpenAI’s API-based services (Hurst et al. [[2024](https://arxiv.org/html/2605.23723#bib.bib5)]). These models are not released under an open-source license in our setting; instead, their use is governed by OpenAI’s applicable business and service terms. We use these services only through authorized API access and in accordance with the provider’s published terms and policies.

#### General note\.

We do not claim ownership of any of the external assets above\. Our work introduces a post\-hoc auditing framework and evaluates it on existing benchmarks and model services\. We cite the original sources of these assets in the paper and use them in accordance with their publicly stated licenses or service terms\.

## Appendix G Broader Impacts

This work studies the security of memory\-augmented LLM agents and proposes a post\-hoc auditing framework for identifying and removing suspicious memories after harmful behavior has already been observed\. A potential positive impact of this work is improved recoverability for deployed agent systems: instead of treating unsafe behavior only as a response\-time problem, MemAudit provides a mechanism for diagnosing and repairing the persistent memory state that may continue to influence future actions\. In high\-stakes agent settings, such as long\-horizon web interaction or decision support, this may improve accountability, reduce repeated failures, and support safer recovery after compromise\.

At the same time, the method also carries potential risks and limitations\. First, post\-hoc auditing may produce false positives and remove useful memories, which could degrade the performance, personalization, or long\-term consistency of an agent\. Second, auditing tools of this kind could be misused in overly aggressive monitoring or filtering pipelines that erase benign but unusual memories without sufficient oversight\. Third, the broader deployment of memory auditing does not eliminate the need for preventive safeguards, since dense poisoning or undetected harmful events may still evade repair\. For these reasons, we view MemAudit as a complementary recovery mechanism rather than a replacement for input filtering, output monitoring, access control, and other online defenses\.

More broadly, this work highlights that persistent memory should be treated as a first\-class security surface in LLM agents\. We hope the paper encourages future research on safer memory management, more reliable post\-hoc recovery, and better evaluation of the trade\-off between attack reduction and utility preservation\.

## Appendix H Use of AI

We used LLMs for language polishing and grammar correction only\. All research design, implementation, and analysis were carried out by the authors\.

## NeurIPS Paper Checklist

1. Claims
   - Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
   - Answer: [Yes]
   - Justification: The main claims made in the abstract and introduction are consistent with the actual scope of the paper. Specifically, the paper claims that MemAudit addresses *post-hoc memory auditing* for memory-augmented LLM agents, that it combines counterfactual causal attribution with structural anomaly detection, and that it substantially reduces attack success under realistic post-hoc settings. These claims are directly supported by the method description in Section 4 and the empirical results reported in Section 6.
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: [Yes]
   - Justification: The paper includes a dedicated “Limitations” section. It explicitly discusses several important limitations, including the post-hoc nature of the framework, its dependence on observable harmful events and replay quality, reduced effectiveness under dense poisoning, and the current empirical coverage being limited to the MINJA QA and RAP settings. See Section 8.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
   - Answer: [N/A]
   - Justification: This is not a theory paper. The contribution is primarily methodological and empirical, and the paper does not present formal theorems, propositions, or proofs that would require a theorem-proof checklist assessment.
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
   - Answer: [Yes]
   - Justification: The paper discloses the core ingredients needed to reproduce the main results: the MemAudit framework in Section 4 and Algorithm 1, the post-hoc batch auditing protocol and evaluation setup in Section 5, and the detailed MINJA settings, harm scoring rules, and replay procedure in Appendix B–D. While some lower-level implementation choices can still be clarified further in the final version, the current manuscript already provides the main information needed to understand and reproduce the reported findings at the level relevant to the paper’s central claims.
5. Open access to data and code
   - Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
   - Answer: [No]
   - Justification: The current submission does not provide an anonymized public release of code or data. Although the paper and appendix contain substantial methodological and experimental detail, the full codebase and reproducibility package are not yet openly released as part of the submission.
6. Experimental setting/details
   - Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer) necessary to understand the results?
   - Answer: [Yes]
   - Justification: The paper specifies the experimental setting at a level sufficient to understand the reported results. In particular, it reports the QA and RAP task construction, memory composition, contamination ratios, replay-based evaluation protocol, retrieval setup, harm definitions, and the main fusion setting used for full MemAudit. Together with Appendix A–D, these details make the evaluation protocol understandable, although some implementation details can still be described more exhaustively in the final version.
7. Experiment statistical significance
   - Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
   - Answer: [Yes]
   - Justification: The paper reports results as averages over 10 runs rather than single-run outcomes. While the current version does not include error bars or formal significance tests, the repeated-run averages provide some evidence of the stability of the empirical comparisons.
8. Experiments compute resources
   - Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
   - Answer: [Yes]
   - Justification: The appendix includes a dedicated compute-resources description for the MemAudit pipeline. It explains that the method does not require training a new foundation model, that the backbone agents are accessed through external APIs, and that the local computation is limited to embedding and NLI scoring. It also provides a representative single-machine setup, including approximate GPU and system-memory requirements for the local components. Although the current version does not provide detailed wall-clock timings, it provides sufficient information to contextualize the computational footprint of the reported experiments.
9. Code of ethics
   - Answer: [Yes]
   - Justification: To the best of our knowledge, this research is consistent with the NeurIPS Code of Ethics. The work studies a security problem in memory-augmented LLM agents and proposes a defensive auditing framework; it does not involve human subjects, crowdsourcing, deceptive deployment, or release of a harmful system.
10. Broader impacts
    - Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
    - Answer: [Yes]
    - Justification: The paper discusses broader societal impacts in the appendix. It discusses potential positive impacts such as improved recoverability, accountability, and safer deployment of memory-augmented agents, as well as potential negative impacts including false-positive memory removal, degradation of useful memory, and possible misuse of auditing tools in overly aggressive monitoring or filtering settings. See Appendix G.
11. Safeguards
    - Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pre-trained language models, image generators, or scraped datasets)?
    - Answer: [N/A]
    - Justification: The paper does not release a newly trained high-risk foundation model, a new generative system, or a new dataset as the primary artifact of the work. The contribution is a post-hoc auditing framework evaluated on existing benchmarks and API-accessible models, so a safeguards discussion for release of high-risk assets is not directly applicable here.
12. Licenses for existing assets
    - Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
    - Answer: [Yes]
    - Justification: The paper credits the main external assets it builds upon, including MINJA, WebShop, DeepSeek, and the evaluated OpenAI model family. In addition, the appendix now includes a dedicated “Licenses and Terms of Use for Existing Assets” section that explicitly summarizes the license or usage status of the external assets used in our experiments, including the MIT License for WebShop, the MIT code license and model license for DeepSeek-V3, and the API service terms governing OpenAI models. Since these assets are now both credited and documented with their corresponding license or usage conditions, this checklist item is satisfied.
13. New assets
    - Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
    - Answer: [N/A]
    - Justification: The paper does not introduce or release a new dataset, model, or benchmark as a submission artifact. Its contribution is a methodological framework and empirical evaluation rather than a new public asset package.
14. Crowdsourcing and research with human subjects
    - Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
    - Answer: [N/A]
    - Justification: This work does not involve crowdsourcing experiments or research with human subjects.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects
    - Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
    - Answer: [N/A]
    - Justification: This work does not involve human subjects research, and therefore IRB approval or an equivalent review process is not applicable.
16. Declaration of LLM usage
    - Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required.
    - Answer: [Yes]
    - Justification: The work studies memory-augmented LLM agents as its core experimental setting, and the appendix also includes an explicit statement that LLMs were used only for language polishing and grammar correction. This clarifies both the role of LLM-based systems in the experiments and the fact that writing assistance did not affect the research design, implementation, or analysis.