The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements

arXiv cs.AI Papers

Summary

This paper audits LangChain, AutoGPT, and OpenAI Agents SDK for architectural safety guarantees and finds no native compliance with containment principles, demonstrating that memory poisoning can cause persistent failures; it introduces lightweight mechanisms to eliminate such attacks.

arXiv:2606.12797v1 Announce Type: new Abstract: Agentic large language model systems that autonomously invoke tools, maintain persistent memory, and execute multi-step plans are increasingly deployed in public-facing domains, including government services, healthcare triage, and financial advising. We ask whether the frameworks used to build these systems provide architectural-level structural safety guarantees. Applying six containment principles derived from a compositional model of agentic architectures, we audit three dominant frameworks (LangChain, AutoGPT, and OpenAI Agents SDK) and find no native compliance in any of them. Memory integrity, a defense against one of the most prevalent vulnerability classes, is not observed in any of the three evaluated frameworks. We validate these findings empirically: in a simulated government benefits agent built on LangChain, a single memory-poisoning write induces persistent targeted corruption across all tested seeds and backends, increasing the wrongful denial rate for targeted applicants to 88.9%. Under a complex five-factor policy, the same attack preserves aggregate accuracy while increasing targeted wrongful denials by 3.5x, rendering the corruption difficult to detect through standard monitoring. We then introduce two lightweight containment mechanisms: a memory integrity validator and a policy gate, which eliminate both attack vectors with sub-millisecond overhead (<0.2ms per call). We conclude that the current agentic framework ecosystem may not yet meet secure-by-default expectations for public-facing deployments and outline priority architectural interventions to enable trustworthy deployment in high-stakes, socially impactful applications.

# How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements
Source: [https://arxiv.org/html/2606.12797](https://arxiv.org/html/2606.12797)
###### Abstract

Agentic large language model systems that autonomously invoke tools, maintain persistent memory, and execute multi-step plans are increasingly deployed in public-facing domains, including government services, healthcare triage, and financial advising. We ask whether the frameworks used to build these systems provide architectural-level structural safety guarantees. Applying six containment principles derived from a compositional model of agentic architectures, we audit three dominant frameworks (LangChain, AutoGPT, and OpenAI Agents SDK) and find no native compliance in any of them. Memory integrity, a defense against one of the most prevalent vulnerability classes, is not observed in any of the three evaluated frameworks. We validate these findings empirically: in a simulated government benefits agent built on LangChain, a single memory-poisoning write induces persistent targeted corruption across all tested seeds and backends, increasing the wrongful denial rate for targeted applicants to 88.9%. Under a complex five-factor policy, the same attack preserves aggregate accuracy while increasing targeted wrongful denials by $3.5\times$, rendering the corruption difficult to detect through standard monitoring. We then introduce two lightweight containment mechanisms: a memory integrity validator and a policy gate, which eliminate both attack vectors with sub-millisecond overhead ($<0.2\,\text{ms}$ per call). We conclude that the current agentic framework ecosystem may not yet meet secure-by-default expectations for public-facing deployments and outline priority architectural interventions to enable trustworthy deployment in high-stakes, socially impactful applications.

Agentic AI, LLM Security, Trustworthy AI, Framework Audit, Containment

## 1 Introduction

Agentic AI systems are increasingly deployed in public-facing domains such as government services, healthcare, and finance (Xu et al., [2024](https://arxiv.org/html/2606.12797#bib.bib29)). Unlike traditional LLM chatbots, these systems invoke tools, maintain persistent memory, and act autonomously over multi-step horizons (Yao et al., [2022](https://arxiv.org/html/2606.12797#bib.bib1)). A single corrupted reasoning cycle can propagate through tool execution into memory, poisoning subsequent interactions and potentially leading to persistent, system-level failures with real-world consequences.

The AI safety community has primarily focused on what models *say*, such as output toxicity, bias, and hallucination, while the trustworthy AI community has emphasized behavioral evaluation and fairness. However, neither has systematically addressed a more fundamental question: *Do the frameworks used to build agentic AI systems provide structural safety guarantees at the architectural level?* This question is orthogonal to model-level safety; it concerns whether the surrounding system enforces reliable boundaries between perception and the core stages (reasoning, execution, and memory) through which every agentic action propagates. While prior work primarily catalogs attack types in LLM agents, it remains unclear why these vulnerabilities consistently persist across different frameworks and model backends. We argue that the root cause is structural: the absence of enforced containment at architectural boundaries.

This paper makes four contributions. First, we present, to the best of our knowledge, the first audit methodology that operationalizes formal containment principles into a reusable compliance matrix for agentic frameworks (Section [3](https://arxiv.org/html/2606.12797#S3)). Second, an audit of LangChain (LangChain AI, [2024](https://arxiv.org/html/2606.12797#bib.bib26)), AutoGPT (Significant Gravitas, [2024](https://arxiv.org/html/2606.12797#bib.bib27)), and the OpenAI Agents SDK (OpenAI, [2024](https://arxiv.org/html/2606.12797#bib.bib28)) finds no native compliance with any of the six principles (Section [4](https://arxiv.org/html/2606.12797#S4)). Third, we show that a single memory poisoning write can induce targeted corruption across five backends and, under a five-factor policy, can remain difficult to detect through aggregate metrics (Section [5](https://arxiv.org/html/2606.12797#S5)). Fourth, we show that two deterministic interventions substantially reduce attack success rates with sub-millisecond overhead (Section [5](https://arxiv.org/html/2606.12797#S5)).

## 2 Background: The Composition Problem

### 2.1 Agentic Systems as Compositional Pipelines

An agentic LLM system composes four functional stages in a recursive loop (Yao et al., [2022](https://arxiv.org/html/2606.12797#bib.bib1); Masterman et al., [2024](https://arxiv.org/html/2606.12797#bib.bib4)): a *perception* function $P$ that processes external inputs, a *reasoning* (behavior) function $B$ that plans actions using the current input and persistent memory $m_t$, an *execution* function $E$ that invokes tools, and a *memory update* function $\mathcal{U}$ that writes outcomes back to persistent state. The decision-to-action mapping at each timestep is:

$$\Phi(o_t, m_t) = E\!\left(B\!\left(P(o_t),\, m_t\right)\right), \tag{1}$$

where $\Phi$ produces the executed action, and the resulting state is subsequently incorporated via $\mathcal{U}$.

Individual stages may be secure in isolation, but their composition becomes vulnerable without inter-layer isolation (Christodorescu et al., [2025](https://arxiv.org/html/2606.12797#bib.bib5)). Corrupted outputs propagate across stages, from perception to reasoning, execution, and memory, affecting subsequent cycles. Figure [1](https://arxiv.org/html/2606.12797#S2.F1) illustrates this pipeline with containment gates at each boundary.
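The loop in Equation (1) and the way an unvalidated write feeds later cycles can be sketched in a few lines of Python. This is a toy illustration only: the `AgentLoop` class, `toy_reason`, and the string-based memory are hypothetical names and simplifications, not part of any audited framework.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentLoop:
    """Toy loop: Phi(o_t, m_t) = E(B(P(o_t), m_t)), then U appends the result."""
    perceive: Callable[[str], str]            # P
    reason: Callable[[str, List[str]], str]   # B
    execute: Callable[[str], str]             # E
    memory: List[str] = field(default_factory=list)  # m_t

    def step(self, observation: str) -> str:
        percept = self.perceive(observation)
        plan = self.reason(percept, self.memory)
        result = self.execute(plan)
        self.memory.append(result)  # U: write-back with no validation
        return result


def toy_reason(percept: str, memory: List[str]) -> str:
    # Hypothetical B: any stored note containing "deny" overrides the default.
    return "deny" if any("deny" in note for note in memory) else "approve"


loop = AgentLoop(perceive=str.strip, reason=toy_reason,
                 execute=lambda plan: f"executed:{plan}")
first = loop.step(" claim A ")
loop.memory.append("policy note: deny all claims")  # simulated poisoned write
second = loop.step(" claim B ")
third = loop.step(" claim C ")
print(first, second, third)
```

In this sketch the poisoned note is written once, yet every later call to `step` reads it, which is the persistence property the composition argument relies on.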

![Refer to caption](https://arxiv.org/html/2606.12797v1/new_fig_1.png)

Figure 1: Compositional agentic architecture with containment gates (G1–G3) at layer boundaries. External input $O_t$ and memory state $m_t$ flow through perception, reasoning, execution, and memory update. Gates enforce the six containment principles (P1–P6) at each transition. Runtime monitoring (P6) spans all stages.

### 2.2 Execution Containment

Security requires that $\Phi(o_t, m_t) \in \mathcal{C}$ for all $t$, where $\mathcal{C} \subseteq \mathcal{A}$ is the policy-constrained safe subset of the action space $\mathcal{A}$ (Saltzer and Schroeder, [1975](https://arxiv.org/html/2606.12797#bib.bib22)). We call this condition *execution containment*. When $E$ directly forwards $B$'s output to the runtime without first projecting it onto $\mathcal{C}$, the system operates in a state of *autonomy without containment*, analogous to executing user-space code with kernel privileges (Klein et al., [2009](https://arxiv.org/html/2606.12797#bib.bib25)). This parallels the reference monitor concept in systems security and the projection operator in constrained control theory.
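One way to read the projection onto the safe action set is as a wrapper around the executor. The sketch below is an assumption-laden illustration: the `contain` helper, the action names, and the choice of substituting a `noop` fallback (rather than raising or blocking) are hypothetical and are not taken from the paper's implementation.

```python
from typing import Callable, FrozenSet, List


def contain(execute: Callable[[str], str],
            safe_actions: FrozenSet[str],
            fallback: str = "noop") -> Callable[[str], str]:
    """Wrap an executor so that only actions in the safe set ever reach it."""
    if fallback not in safe_actions:
        raise ValueError("fallback must itself be a safe action")

    def gated(action: str) -> str:
        # Projection: keep the action if it is in the safe set, else substitute.
        projected = action if action in safe_actions else fallback
        return execute(projected)

    return gated


executed: List[str] = []


def run(action: str) -> str:
    executed.append(action)  # stand-in for a real tool runtime
    return action


gated_run = contain(run, frozenset({"approve", "deny", "noop"}))
gated_run("approve")
gated_run("delete_records")  # outside the safe set, replaced by the fallback
print(executed)
```

Because the runtime `run` is only reachable through `gated_run`, every executed action is a member of the safe set by construction.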

### 2.3 Six Containment Principles

We identify six containment principles that ensure safety when applied at layer interfaces. These form the foundation of our audit framework:

1. Reasoning-Execution Separation (P1): A policy gate $\pi$ sits between planning and execution, so the agent cannot carry out every plan it devises. Only actions that satisfy $E(\pi(u_t)) \in \mathcal{C}$ pass the gate.
2. Capability Scoping (P2): Each session receives a bounded token $T_k$ that defines which tools can be used, the permitted parameter ranges, rate limits, and expiry times. The agent has no means of exceeding the boundaries set by the token.
3. Memory Integrity (P3): An integrity function $\mathcal{I}$ checks the validity of every write before it reaches long-term memory. Writes that fail the check are discarded.
4. Layer-Transition Validation (P4): Security checks are performed at every interface the data traverses ($P \to B$, $B \to E$, $E \to \mathcal{U}$), not only at the input interface. An adversary who passes one interface is therefore not guaranteed to pass the others.
5. Authenticated Communication (P5): All messages exchanged between agents should carry verifiable credentials such as digital signatures. Messages without valid credentials are quarantined.
6. Runtime Monitoring (P6): An anomaly detector monitors execution paths as they develop. If it detects an anomaly, it activates containment to mitigate the effects.
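As an illustration of capability scoping (P2), a session token can be modeled as a small object that every tool call must consult. The sketch below is hypothetical: the `CapabilityToken` name, its fields, and the tool names are assumptions, and parameter-range checks are omitted for brevity.

```python
import time
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class CapabilityToken:
    """Hypothetical per-session token bounding tools, call count, and lifetime."""
    allowed_tools: FrozenSet[str]
    max_calls: int
    expires_at: float  # Unix timestamp after which the token is invalid
    calls_made: int = 0

    def authorize(self, tool: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        if now >= self.expires_at:
            return False  # token expired
        if tool not in self.allowed_tools:
            return False  # tool outside the declared scope
        if self.calls_made >= self.max_calls:
            return False  # rate limit exhausted
        self.calls_made += 1
        return True


token = CapabilityToken(allowed_tools=frozenset({"lookup_claim"}),
                        max_calls=2, expires_at=100.0)
results = [
    token.authorize("lookup_claim", now=0.0),
    token.authorize("send_email", now=1.0),
    token.authorize("lookup_claim", now=2.0),
    token.authorize("lookup_claim", now=3.0),
]
print(results)
```

Passing `now` explicitly keeps the example deterministic; a deployment would presumably rely on the wall clock and on a token the agent cannot modify.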

We formalize the relationship between these principles and containment below\.

###### Theorem 1 (Containment Sufficiency).

If an agentic system satisfies P1 (policy-gated execution) and P3 (validated memory writes), then no single-step memory-poisoning attack can induce a persistent policy violation.

###### Proof sketch.

P3 guarantees that $\mathcal{I}$ rejects any adversarial write $\delta$, so that $m_{t+1} = m_t$ remains unaltered. P1 guarantees that the executed action $E(\pi(u_t))$ belongs to $\mathcal{C}$. Since memory is intact and execution is constrained, no safety-violating effect propagates along any single-step trajectory. Our experiments confirm this: with both mechanisms active, corruption drops from 1.000 to 0.000 across all backends (Table [6](https://arxiv.org/html/2606.12797#S5.T6)). ∎

###### Proposition 1 (Need for Combined Enforcement).

Neither P1 nor P3 alone is enough for containment\. Dropping P3 while keeping P1 allows memory corruption, leading to future biased inputs into reasoning\. Dropping P1 while keeping P3 allows the execution of unsafe actions in one cycle\.

Experiments 1 and 3 provide empirical evidence: removing memory validation \(P3\) yields complete corruption \(Table 3\), while removing the policy gate \(P1/P2\) yields complete tool bypass \(Table 4\)\.

## 3 Audit Methodology

Framework selection. We consider three of the most commonly used frameworks for deploying agents: LangChain Agents (LangChain AI, [2024](https://arxiv.org/html/2606.12797#bib.bib26)), AutoGPT (Significant Gravitas, [2024](https://arxiv.org/html/2606.12797#bib.bib27)), and the OpenAI Agents SDK (OpenAI, [2024](https://arxiv.org/html/2606.12797#bib.bib28)).

Evidence sources. Our evidence comprises official documentation, source code examination, and published security studies (Christodorescu et al., [2025](https://arxiv.org/html/2606.12797#bib.bib5); Ferrag et al., [2025](https://arxiv.org/html/2606.12797#bib.bib15)).

Scoring rubric. ✓ = native default (enabled without configuration); ✓\* = requires explicit configuration; ✗ = absent. We focus on *default behavior* (Christodorescu et al., [2025](https://arxiv.org/html/2606.12797#bib.bib5); Raza et al., [2025](https://arxiv.org/html/2606.12797#bib.bib8)).

Reliability. Two raters scored all 18 framework-principle pairs (Cohen's $\kappa = 0.81$). Disagreements between raters were resolved through discussion involving a third rater.
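For readers unfamiliar with the agreement statistic, Cohen's kappa compares observed agreement $p_o$ with chance agreement $p_e$ as $\kappa = (p_o - p_e)/(1 - p_e)$. The sketch below computes it for two label sequences. The example ratings are made up for illustration and are not the audit's actual scores.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical ratings on four items (not the paper's data).
a = ["absent", "absent", "config", "config"]
b = ["absent", "absent", "config", "absent"]
kappa = cohens_kappa(a, b)
print(kappa)
```

Here $p_o = 0.75$ and $p_e = 0.5$, so the toy example yields $\kappa = 0.5$.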

Limitations. The rubric captures mechanism *presence*, not implementation depth or runtime effectiveness. The evaluation is point-in-time, and frameworks may evolve. Runtime testing is presented in Section [5](https://arxiv.org/html/2606.12797#S5).

## 4 Results: The Compliance Matrix

Table [1](https://arxiv.org/html/2606.12797#S4.T1) presents the full compliance matrix. The audit reveals four systemic patterns.

Table 1: Compliance matrix: production agent frameworks against six containment principles. ✓ = native default; ✓\* = requires configuration; ✗ = absent. No framework achieves ✓ on any principle.

Pattern 1: Zero native compliance. Across the three evaluated frameworks, no principle meets the ✓ criterion. Each containment mechanism is either absent or must be explicitly enabled.

Pattern 2: Universal memory integrity failure. P3 (Memory Integrity) scores ✗ across all three frameworks, despite memory poisoning being one of the most widely documented vulnerability types in recent surveys of agentic AI security (Deng et al., [2025](https://arxiv.org/html/2606.12797#bib.bib20); Patlan et al., [2025](https://arxiv.org/html/2606.12797#bib.bib12); Wu et al., [2025](https://arxiv.org/html/2606.12797#bib.bib21)). This represents the most critical gap in public-facing systems.

Pattern 3: Safety is optional. Existing safeguards require explicit configuration rather than being enabled by default, violating the "secure by default" principle (Saltzer and Schroeder, [1975](https://arxiv.org/html/2606.12797#bib.bib22)) and creating a predictable deployment gap, especially for non-expert developers building public-facing systems.

Pattern 4: Autonomy inversely correlates with compliance. AutoGPT is non-compliant on five principles, LangChain on two, and the OpenAI Agents SDK on one. The architecture designed for the highest autonomy has the fewest safety barriers: its design trade-off favors autonomy over confinement, so safety constraints are enforced less rigorously.

## 5 Experimental Validation

The audit in Section [4](https://arxiv.org/html/2606.12797#S4) reveals missing security mechanisms. We now verify that their absence enables attacks on two fronts: memory integrity (P3) and the separation between reasoning and execution (P1/P2).

![Refer to caption](https://arxiv.org/html/2606.12797v1/fig2_attack_flow.png)

Figure 2: Attack propagation in the agentic pipeline. Top: Benign path, actions remain in $\mathcal{S}$ and memory is intact. Middle: Without containment, perturbation $\delta$ propagates across stages, poisoning memory and causing downstream drift. Bottom: With containment, the policy gate (G2) blocks out-of-scope actions before execution or memory updates.

### 5.1 Experimental Setup

Scenario. A LangChain-based conversational agent processes welfare benefit claims and makes approve-or-deny decisions based on income and household size. The containment principles under test reflect classical security concepts such as least privilege and defense in depth.

Dataset. 250 synthetic welfare claims across five regions (50 per region), with a deterministic eligibility rule (income $<$ \$40,000 *and* household size $>$ 2 $\Rightarrow$ approve). An additional 200 adversarial entries target two attack surfaces: 100 memory-poisoning payloads and 100 tool-access attacks.
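The two-factor ground-truth rule is simple enough to state as a one-line predicate. The sketch below is a minimal reading of the rule as written, with strict inequalities on both factors; the function and argument names are hypothetical.

```python
def is_eligible(income: float, household_size: int) -> bool:
    """Two-factor rule: approve only if income < $40,000 AND household size > 2."""
    return income < 40_000 and household_size > 2


# Boundary cases: both inequalities are strict.
print(is_eligible(35_000, 3))   # both conditions hold
print(is_eligible(40_000, 3))   # income is not strictly below the threshold
print(is_eligible(35_000, 2))   # household size is not strictly above 2
```

A deterministic label of this kind is what allows each agent decision to be scored as correct, wrongful denial, or wrongful approval.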

Model. Qwen-2.5 3B-Instruct is served locally via Ollama. All experiments use three random seeds (42, 7, 123) for reproducibility.

Baseline. Table [2](https://arxiv.org/html/2606.12797#S5.T2) shows that the agent achieves high accuracy under clean conditions across all seeds, establishing that subsequent performance drops are attributable to the attacks rather than model error.

Table 2: Clean baseline performance (no attack). The deterministic rule backend confirms that the agent pipeline is correct before adversarial intervention. Seed 7's lower baseline (0.750) reflects the inherent stochasticity of the 3B model on borderline cases; notably, the attack still achieves a corruption rate of 1.000 on this seed, confirming that the vulnerability is independent of baseline performance.

Interventions. We implement two lightweight containment mechanisms: (1) a *memory integrity validator* (P3) that interposes on ConversationBufferMemory.save_context, checking source provenance, schema conformance, and demographic-targeting patterns via deterministic regex, with rejected writes silently dropped; (2) a *tool-call policy gate* (P1/P2) that enforces a deny-all allowlist over tool names and path canonicalization over file arguments before execution. Both interventions are specified in Algorithms [1](https://arxiv.org/html/2606.12797#alg1) and [2](https://arxiv.org/html/2606.12797#alg2). Importantly, neither algorithm relies on an LLM; both are deterministic and impose sub-millisecond overhead.

Algorithm 1 Memory Integrity Validator (P3)

Input: Write candidate $(x_{\text{in}}, x_{\text{out}}, s)$ where $s \in \{\textsc{agent}, \textsc{external}, \textsc{unknown}\}$
Output: Accept or Reject with rule identifier

1: if $s = \textsc{external}$ then
2:   for all pattern $p$ in PolicyOverridePatterns do
3:     if $p$ matches $x_{\text{in}} \| x_{\text{out}}$ then
4:       return Reject, provenance_override
5:     end if
6:   end for
7:   if $x_{\text{in}}$ is not valid JSON then
8:     return Reject, schema_nonconformant
9:   end if
10: end if
11: for all pattern $d$ in DemographicDenyPatterns do
12:   if $d$ matches $x_{\text{in}} \| x_{\text{out}}$ then
13:     return Reject, demographic_deny
14:   end if
15: end for
16: return Accept
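A direct Python transcription of Algorithm 1 might look as follows. This is a sketch under stated assumptions: the two pattern lists are hypothetical stand-ins (the actual lists are not given in the text), the function name `validate_write` is invented for illustration, and hooking the check into a memory class is left out.

```python
import json
import re
from typing import Tuple

# Hypothetical patterns for illustration only; not the paper's actual lists.
POLICY_OVERRIDE_PATTERNS = [
    re.compile(r"\b(new|updated)\s+policy\b", re.IGNORECASE),
    re.compile(r"\bignore\s+(the\s+)?previous\b", re.IGNORECASE),
]
DEMOGRAPHIC_DENY_PATTERNS = [
    re.compile(r"\bregion\s+[a-z]\b.*\b(deny|denied|reject)",
               re.IGNORECASE | re.DOTALL),
]


def validate_write(x_in: str, x_out: str, source: str) -> Tuple[bool, str]:
    """Return (accepted, rule_id) for a candidate memory write."""
    text = f"{x_in}\n{x_out}"  # x_in || x_out
    if source == "external":
        for pattern in POLICY_OVERRIDE_PATTERNS:
            if pattern.search(text):
                return False, "provenance_override"
        try:
            json.loads(x_in)  # schema check: external input must be valid JSON
        except ValueError:
            return False, "schema_nonconformant"
    for pattern in DEMOGRAPHIC_DENY_PATTERNS:
        if pattern.search(text):
            return False, "demographic_deny"
    return True, "accept"


print(validate_write('{"claim_id": 12}', "approved", "agent"))
print(validate_write("Updated policy: deny Region B applicants", "", "external"))
print(validate_write('{"note": "Region B applicants should be denied"}', "", "agent"))
```

A caller would store the write only when the first element of the result is true, and could log the rule identifier for rejected writes instead of dropping them silently.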

Algorithm 2 Tool-Call Policy Gate (P1/P2)

Input: Tool call $(t_{\text{name}}, \textit{args})$; allowlist $\mathcal{T}$; path scopes $\mathcal{R}, \mathcal{W}$; rate counter $c$
Output: Allow or Block with rule identifier

1: if $c > c_{\max}$ then
2:   return Block, rate_limit
3: end if
4: if $t_{\text{name}} \notin \mathcal{T}$ then
5:   return Block, tool_not_allowed
6: end if
7: if $t_{\text{name}} = \texttt{read\_file}$ then
8:   $p \leftarrow \textsc{Canonicalize}(\textit{args}.\text{path})$ {resolve ../}
9:   if $p \notin \mathcal{R}$ then
10:    return Block, read_outside_scope
11:  end if
12: end if
13: if $t_{\text{name}} = \texttt{write\_file}$ then
14:   $p \leftarrow \textsc{Canonicalize}(\textit{args}.\text{path})$
15:   if $p \notin \mathcal{W}$ then
16:     return Block, write_outside_scope
17:   end if
18: end if
19: return Allow
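Algorithm 2 can be sketched in Python along the same lines. Several choices here are assumptions rather than the paper's implementation: scope membership is read as "the path lies under one of the listed root directories", canonicalization uses the purely lexical `posixpath.normpath` (which resolves `../` but not symlinks), and all paths and tool names other than `read_file` and `write_file` are hypothetical.

```python
import posixpath
from typing import Dict, Iterable, Set, Tuple


def _in_scope(path: str, roots: Iterable[str]) -> bool:
    """True if the canonical path equals a root or lies beneath it."""
    for root in roots:
        root = posixpath.normpath(root)
        prefix = root if root.endswith("/") else root + "/"
        if path == root or path.startswith(prefix):
            return True
    return False


def policy_gate(tool_name: str, args: Dict[str, str], allowlist: Set[str],
                read_roots: Iterable[str], write_roots: Iterable[str],
                call_count: int, max_calls: int) -> Tuple[bool, str]:
    """Return (allowed, rule_id) for a proposed tool call."""
    if call_count > max_calls:
        return False, "rate_limit"
    if tool_name not in allowlist:
        return False, "tool_not_allowed"
    if tool_name == "read_file":
        path = posixpath.normpath(args["path"])  # lexical resolution of ../
        if not _in_scope(path, read_roots):
            return False, "read_outside_scope"
    if tool_name == "write_file":
        path = posixpath.normpath(args["path"])
        if not _in_scope(path, write_roots):
            return False, "write_outside_scope"
    return True, "allow"


ALLOWED = {"read_file", "write_file"}
READ, WRITE = ["/data/claims"], ["/data/out"]
print(policy_gate("read_file", {"path": "/data/claims/../../etc/passwd"},
                  ALLOWED, READ, WRITE, 0, 10))
print(policy_gate("read_file", {"path": "/data/claims/2024/a.json"},
                  ALLOWED, READ, WRITE, 0, 10))
```

Canonicalizing before the scope check is what defeats the traversal attempt: the first call resolves to `/etc/passwd`, which is outside the read scope. A production gate would likely also resolve symlinks, for example with `os.path.realpath`.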

The simple eligibility criterion enables controlled measurement. The attack does not depend on the complexity of the rule, because the weakness lies in the corrupted memory rather than in the rule itself. In a realistic system, the same attack may produce subtler deviations.

### 5.2 Results

Experiment 1: Memory poisoning attack. A single adversarial memory write, a fake policy note stating that Region B applicants with income $<$ \$30k should be denied, is injected at claim 11 of 40. Table [3](https://arxiv.org/html/2606.12797#S5.T3) reports the results. The attack achieves a corruption rate of 1.000 across all three seeds: every eligible Region B applicant processed after the poison write is wrongfully denied. Mean accuracy collapses from 0.908 to 0.558 (a 35 percentage point drop), and the Region B wrongful denial rate rises to 0.889. Figure [3](https://arxiv.org/html/2606.12797#S5.F3) shows the temporal dynamics: accuracy is stable before the poison write and degrades monotonically afterward, consistent with the failure amplification expected from the compositional pipeline.

Table 3: Memory poisoning results (Exp. 1) and validator recovery (Exp. 2). Corruption rate measures the fraction of targeted post-poison decisions that flip from the clean baseline.

![Refer to caption](https://arxiv.org/html/2606.12797v1/figure_memory_rolling_accuracy.png)

Figure 3: Rolling accuracy (window=20) across claim positions. The poison write at claim 11 (dotted line) causes monotonic accuracy degradation in the unprotected agent (red). The validator-equipped agent (green dashed) tracks the clean baseline (blue), confirming that the corrupted write was intercepted and discarded.

Experiment 2: Memory validator recovery. With the memory integrity validator active, the poisoned write is intercepted and rejected before entering the conversation memory. Table [3](https://arxiv.org/html/2606.12797#S5.T3) (right columns) shows the result: the corruption rate drops from 1.000 to 0.000 across all seeds, and accuracy recovers from 0.558 to 0.967. The validator adds a mean overhead of 0.016 ms per call (Table [5](https://arxiv.org/html/2606.12797#S5.T5)).
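The corruption rate defined in the Table 3 caption can be computed from paired decision logs. The sketch below is one possible reading: it assumes decisions are aligned by claim position and that "post-poison" means positions at or after the injection index. The function name and the sample data are hypothetical, not the experiment's logs.

```python
from typing import Sequence


def corruption_rate(clean: Sequence[str], attacked: Sequence[str],
                    targeted: Sequence[bool], poison_index: int) -> float:
    """Fraction of targeted post-poison decisions that differ from the clean run."""
    indices = [i for i in range(len(clean))
               if i >= poison_index and targeted[i]]
    if not indices:
        return 0.0  # no targeted claims after the poison write
    flipped = sum(1 for i in indices if clean[i] != attacked[i])
    return flipped / len(indices)


# Hypothetical five-claim run with the poison write before position 1.
clean = ["approve", "approve", "deny", "approve", "approve"]
attacked = ["approve", "deny", "deny", "deny", "approve"]
targeted = [True, True, False, True, False]
rate = corruption_rate(clean, attacked, targeted, poison_index=1)
print(rate)
```

Restricting the denominator to targeted claims is what lets this metric reach 1.000 even when overall accuracy moves only modestly.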

Experiment 3: Tool policy bypass. Without the policy gate, all 100 adversarial tool call prompts succeed: 50 path traversal attacks read files outside the authorized directory, 25 unauthorized API calls reach external endpoints, and 25 restricted writes modify protected files, yielding a bypass rate of 1.000. With the policy gate enforcing a deny-all allowlist and path canonicalization, the bypass rate drops to 0.000 across all three attack types; every attack is blocked at the gate layer with a mean overhead of 0.129 ms per call.

Table 4: Tool-access policy gate results by attack type. Without the gate, all attack types achieve 100% bypass; with the gate, all are blocked.

Table 5: Headline results and intervention overhead.

Summary. Table [4](https://arxiv.org/html/2606.12797#S5.T4) presents the breakdown of results by attack type. Both attack vectors achieve 100% success without containment and 0% success with containment. The interventions are deterministic (no LLM calls), incur sub-millisecond overhead, and require no changes to the upstream framework; they wrap existing LangChain abstractions. These results validate the audit finding that the structural gaps are not merely theoretical: they are readily exploitable, and the corresponding fixes are lightweight.

### 5.3 Multi-Backend Generalization

The experiments in Sections 5.1 and 5.2 intentionally employ a small local model (Qwen-2.5 3B-Instruct) to isolate the influence of framework design from model performance. The natural next step is to check whether the same attacks work on larger, more realistic models that may possess implicit protection from alignment training. We repeat the entire pipeline (Experiments 0–3) on two additional backends, Claude Haiku 4.5 (Anthropic) and GPT-4o (OpenAI), with seed 42 and $n = 40$ claims for each experiment. In total, five backends are evaluated: Qwen-2.5 3B, Claude Haiku 4.5, and GPT-4o for the simple policy (Sections 5.2–5.3), and Claude Sonnet 4.6 and GPT-4o-mini for the complex policy (Section 5.4).

#### Baseline\.

Both commercial models achieve perfect accuracy on clean claims \(1\.000\), compared with 0\.908 for Qwen\-2\.5 averaged across three seeds \(Table[6](https://arxiv.org/html/2606.12797#S5.T6)\)\. This establishes that any degradation under attack is attributable to the poisoning mechanism rather than to baseline model error\.

#### Memory poisoning generalizes across model scale\.

Table [6](https://arxiv.org/html/2606.12797#S5.T6) presents the central result. Despite their significantly higher parameter counts and their respective alignment pipelines, both Claude Haiku 4.5 and GPT-4o reach a corruption rate of 1.000 under memory poisoning, exactly the rate observed for the local 3B model. Poisoned accuracy for both commercial backends falls to 0.875, compared with 0.558 for Qwen-2.5 (mean across seeds). The higher residual accuracy of the commercial backends reflects their better performance on non-targeted claims; the attack itself, overriding the policy with one malicious memory write, succeeds entirely, independent of scale.

#### Containment interventions remain effective\.

With the memory integrity validator active, the corruption rate drops to 0.000 across all three backends, and accuracy recovers to 1.000 for both commercial models (Table [6](https://arxiv.org/html/2606.12797#S5.T6), right columns). Validator overhead remains sub-millisecond: 0.006 ms (Claude Haiku) and 0.008 ms (GPT-4o). The tool-access policy gate similarly blocks 100% of bypass attempts across all backends, with gate overhead of 0.098 ms and 0.095 ms respectively. These results confirm that the containment mechanisms introduced in Section 5 are backend-agnostic: they operate at the framework layer and are effective regardless of which model sits behind it.
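Per-call overhead figures of this kind are typically obtained by timing many repeated calls and dividing by the call count. The sketch below shows one such measurement with `time.perf_counter`; the measurement procedure, the `mean_overhead_ms` helper, and the `toy_check` function are assumptions for illustration, since the text does not describe how the reported timings were collected.

```python
import time
from typing import Callable, Tuple


def mean_overhead_ms(check: Callable[..., object], args: Tuple,
                     repeats: int = 10_000) -> float:
    """Mean wall-clock time per call, in milliseconds, over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        check(*args)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / repeats


def toy_check(text: str) -> bool:
    # Stand-in for a deterministic validator: a single substring test.
    return "deny" not in text.lower()


overhead = mean_overhead_ms(toy_check, ("Region B policy note",))
print(f"{overhead:.6f} ms per call")
```

Absolute numbers depend on hardware and interpreter, so a measurement like this is best used to compare a guarded and an unguarded path on the same machine.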

#### Summary\.

Figure [4](https://arxiv.org/html/2606.12797#S5.F4) confirms the pattern: all backends are consistently vulnerable in our evaluated setting without containment and substantially protected with containment enabled. Because alignment acts at the model level while the attack targets the memory upstream of it, even RLHF-tuned models remain equally susceptible.

Table 6: Cross-model comparison of attack impact and containment effectiveness. Corruption and bypass rates are $1.000$ without guards and $0.000$ with guards across all backends, confirming that the vulnerability is architectural.

![Refer to caption](https://arxiv.org/html/2606.12797v1/fig_all_5_models.png)

Figure 4: Cross-model comparison across five backends (simple and complex experiments). (a) Memory poisoning drops accuracy for all models; the validator restores it. (b) Corruption rate is 1.000 without the validator and 0.000 with it across all backends (GPT-4o-mini: 0.17 residual under complex rule). (c) Region B wrongful denial rates under poisoning are consistently high across all backends.

### 5.4 Complex Policy Generalization

We replace the two-factor rule with a five-factor conjunctive policy: income $<$ \$38,000, household size $\geq 3$, dependent under 18, no prior benefits in 12 months, and residency $\geq 24$ months. We generate 250 synthetic claims (68 ambiguous, score $\geq 2$), sample $n = 80$ per condition, and test on Claude Sonnet 4.6 and GPT-4o-mini (seed 42).
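The five-factor policy is a conjunction, so a claim is approved only when every factor holds. A minimal sketch of the rule as stated follows; the `ComplexClaim` record and its field names are hypothetical, and the "ambiguous" score mentioned above is not modeled because its definition is not given here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ComplexClaim:
    income: float
    household_size: int
    has_dependent_under_18: bool
    prior_benefits_last_12_months: bool
    residency_months: int


def is_eligible_complex(claim: ComplexClaim) -> bool:
    """Five-factor conjunctive rule: all conditions must hold to approve."""
    return (claim.income < 38_000
            and claim.household_size >= 3
            and claim.has_dependent_under_18
            and not claim.prior_benefits_last_12_months
            and claim.residency_months >= 24)


base = ComplexClaim(income=30_000, household_size=3,
                    has_dependent_under_18=True,
                    prior_benefits_last_12_months=False,
                    residency_months=24)
print(is_eligible_complex(base))
print(is_eligible_complex(replace(base, residency_months=23)))
```

With five factors, many claims fail for reasons unrelated to region, which is one plausible reason a targeted denial pattern is harder to spot in aggregate accuracy.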

Table 7: Complex five-factor policy: cross-model results. Overall accuracy is nearly unchanged under poisoning, concealing targeted harm.

The results are shown in Table [7](https://arxiv.org/html/2606.12797#S5.T7). In both cases, corruption of targeted claims is $100\%$ in the unguarded condition. There are two major outcomes: (1) *overall concealment*: accuracy stays close to baseline values while Region B experiences a $3$–$3.5\times$ increase in wrongful denials; (2) the validator lowers corruption to 0.000 (Claude) and 0.167 (GPT-4o-mini) with 0.013–0.015 ms overhead. Policy complexity increases the risk posed by corruption by making it targeted and covert, as illustrated in Figure [5](https://arxiv.org/html/2606.12797#S5.F5).

![Refer to caption](https://arxiv.org/html/2606.12797v1/fig_complex_head_to_head.png)

Figure 5: Accuracy stays near baseline under attack (masking harm), while Region B wrongful denials spike; the validator restores baseline, reducing corruption from 1.000 to 0.000 (Claude) / 0.17 (GPT-4o-mini).

### 5.5 Limitations of Lightweight Containment

The regular-expression memory validator detects demographic targeting and policy override strategies identified in previous studies. Still, it is fragile to adversarial rephrasing in natural language. A validator that considers the semantics of the input would be more resilient, though it would incur the latency of calling a large language model and carry its own risks of failure. The list of approved tools implicitly restricts tool use, which is well-suited for resource-constrained deployment scenarios in government agencies and the healthcare industry, but might not generalize to a dynamic multi-agent system. Neither mechanism considers compound attacks, where seemingly harmless operations create adverse consequences when executed together; detecting these requires trajectory analysis (P6), which remains an open research question. Finally, our experimental validation uses LangChain as the only runtime system. Although the audit covers three frameworks, empirical replication on AutoGPT and the OpenAI Agents SDK remains future work. These results should be examined in future research under real-world deployment conditions, adaptive adversaries, and multi-agent systems.

## 6 Discussion and Recommendations

The aggregate concealment in Section [5.4](https://arxiv.org/html/2606.12797#S5.SS4) has direct equity implications. A memory poisoning attack that preserves overall accuracy while increasing wrongful denials for a targeted subgroup by $3.5\times$ would evade standard monitoring. For a benefits application system processing 50,000 applications per month, assuming 20% meet the target criteria, this would result in approximately 8,900 erroneous denials per month. Those affected come primarily from marginalized groups with lower levels of technological literacy and fewer resources to correct errors, and such a system falls within the "high-risk" category of the EU AI Act. Thus, the compliance matrix we have developed (Table [1](https://arxiv.org/html/2606.12797#S4.T1)) is directly applicable to conformity assessments.
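The 8,900 figure appears consistent with applying the 88.9% targeted wrongful denial rate from Experiment 1 to the targeted 20% of applications. That reading is an assumption, since the calculation is not spelled out, and it further assumes every targeted applicant is eligible. The arithmetic under those assumptions:

```python
applications_per_month = 50_000
targeted_share = 0.20          # share of applications meeting the target criteria
wrongful_denial_rate = 0.889   # assumed: the Experiment 1 rate for targeted applicants

targeted = applications_per_month * targeted_share
erroneous_denials = targeted * wrongful_denial_rate
print(round(targeted), round(erroneous_denials))
```

This gives 10,000 targeted applications and roughly 8,890 erroneous denials per month, which rounds to the stated 8,900.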

Agentic frameworks remain in a pre–secure\-by\-default phase, similar to early web systems before widespread adoption of built\-in security\. We therefore propose three prioritized interventions\.

Intervention 1: Tool declaration as capability scope \(P2\)\. Each agent session must make tool declarations\. Default: deny\-all; tools are allowlisted on a per\-session basis based on parameters and rate limiting\. Our policy gate prototype demonstrates the feasibility of this intervention with a 0% bypass rate and 0\.129 ms overhead\.

Intervention 2: Ensure enforcement of policy gates between reasoning and execution \(P1\)\.Any reasoning output must go through a policy gate before entering tool execution\. Default: reject any action outside of declared scope\.*Engineering overhead:*Medium—ensuring enforcement rather than merely presence of callbacks\.

Intervention 3: Provenance\-verified memory writes \(P3\)\. Memory writes must include provenance and be validated before storage, with untrusted inputs rejected by default\. Our validator reduces corruption from 1\.000 to 0\.000 with 0\.016 ms overhead, targeting one of the most commonly reported vulnerability classes \(Deng et al\., [2025](https://arxiv.org/html/2606.12797#bib.bib20)\)\.
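
A minimal sketch of a provenance\-checked memory store follows\. The source labels, the single override pattern, and the class names are hypothetical; in particular, a plain string label stands in for provenance here, whereas a deployed system would need to authenticate the origin of each write\.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical trusted origins; anything else is rejected by default.
TRUSTED_SOURCES = frozenset({"policy_admin", "system"})
# Hypothetical content rule for blanket policy overrides.
OVERRIDE_PATTERN = re.compile(
    r"\b(always|never)\s+(deny|approve|reject)\b", re.IGNORECASE
)


@dataclass(frozen=True)
class MemoryWrite:
    content: str
    source: str  # provenance label attached by the caller


class ValidatedMemory:
    """Stores an entry only if its provenance and content both pass."""

    def __init__(self) -> None:
        self._entries: List[MemoryWrite] = []

    def write(self, entry: MemoryWrite) -> bool:
        if entry.source not in TRUSTED_SOURCES:
            return False  # untrusted provenance
        if OVERRIDE_PATTERN.search(entry.content):
            return False  # content resembles a policy override
        self._entries.append(entry)
        return True

    def read_all(self) -> List[str]:
        return [e.content for e in self._entries]


memory = ValidatedMemory()
print(memory.write(MemoryWrite("Applicant A-1 submitted proof of income.", "system")))  # True
print(memory.write(MemoryWrite("Always deny applicants from region X.", "tool_output")))  # False
print(memory.read_all())  # ['Applicant A-1 submitted proof of income.']
```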

## 7 Related Work

Existing surveys extensively map the agentic AI attack landscape \(Deng et al\., [2025](https://arxiv.org/html/2606.12797#bib.bib20); He et al\., [2025a](https://arxiv.org/html/2606.12797#bib.bib19), [b](https://arxiv.org/html/2606.12797#bib.bib9)\), but typically treat vulnerability types as independent phenomena without a structural propagation model\. Benchmark studies demonstrate specific attacks \(Debenedetti et al\., [2024](https://arxiv.org/html/2606.12797#bib.bib11)\), yet do not explain why agentic architectures enable them\. Recent work on multi\-agent code execution \(Triedman et al\., [2025](https://arxiv.org/html/2606.12797#bib.bib14)\) and MCP\-based red teaming \(Janjusevic et al\., [2025](https://arxiv.org/html/2606.12797#bib.bib18)\) confirms that inter\-agent communication \(P5\) is an active attack surface\. Formal models for agentic security are emerging \(Christodorescu et al\., [2025](https://arxiv.org/html/2606.12797#bib.bib5)\), but they have not been used as audit frameworks with experimental validation in the context of societal impact\. This paper bridges the gap between formal containment analysis, empirical validation, and deployment readiness in public\-facing systems\.

## 8 Conclusion

We do not observe native structural containment guarantees in any of the evaluated frameworks\. Our experiments show that the resulting risks are practical: a single write that poisons the agent’s memory can trigger persistent targeted corruption, wrongfully denying benefits to 88\.9% of eligible applicants within the targeted geographical area\. Two lightweight, deterministic containment mechanisms eliminate both attacks in our experiments, with an overhead of less than a millisecond\. Taken together, these findings suggest that secure\-by\-default safeguards are not yet prioritized in current framework design\. Moreover, our complex five\-factor policy experiment illustrates that practical decision\-making logic renders the attack even more effective, because aggregate accuracy remains intact while the targeted damage stays obscured\. This demonstrates that these weaknesses are not artifacts of simplified experimental settings\. Populations who rely on publicly deployed AI systems, including welfare recipients, patients, and financial consumers, are unable to assess the safety of the underlying technology\. Agentic AI must be built on an inherently secure infrastructure if it is to serve the common good\. Our source code is available at [https://anonymous\.4open\.science/r/containment\-ai\-agent\-sec\-13B6](https://anonymous.4open.science/r/containment-ai-agent-sec-13B6)\.

## Impact Statement

This paper audits the structural safety properties of deployed agentic AI frameworks and identifies gaps that could cause harm when these systems are used in public\-facing domains\. Our goal is to motivate framework designers to adopt “secure by default” practices\. We see no negative societal consequences of this work; all identified vulnerabilities are documented in prior literature, and we provide no novel attack capabilities\. The containment interventions we demonstrate are defensive in nature\.

## References

- M\. Christodorescu, E\. Fernandes, A\. Hooda, S\. Jha, J\. Rehberger, and K\. Shams \(2025\) Systems security foundations for agentic computing\. arXiv preprint arXiv:2512\.01295\.
- E\. Debenedetti, J\. Zhang, M\. Balunović, L\. Beurer\-Kellner, M\. Fischer, and F\. Tramèr \(2024\) AgentDojo: a dynamic environment to evaluate prompt injection attacks and defenses for LLM agents\. In Proceedings of NeurIPS\.
- Z\. Deng, Y\. Guo, C\. Han, W\. Ma, J\. Xiong, S\. Wen, and Y\. Xiang \(2025\) AI agents under threat: a survey of key security challenges and future pathways\. ACM Computing Surveys 57\(7\), pp\. 1–36\.
- M\. A\. Ferrag, D\. Hamouda, and M\. Debbah \(2025\) From prompt injections to protocol exploits: threats in LLM\-powered AI agents workflows\. arXiv preprint arXiv:2506\.23260\.
- F\. He, T\. Zhu, D\. Ye, B\. Liu, W\. Zhou, and P\. S\. Yu \(2025a\) The emerged security and privacy of LLM agent: a survey with case studies\. ACM Computing Surveys 58\(6\), pp\. 1–36\.
- P\. He, Y\. Xing, S\. Dong, et al\. \(2025b\) Comprehensive vulnerability analysis is necessary for trustworthy LLM\-MAS\. arXiv preprint arXiv:2506\.01245\.
- S\. Janjusevic, A\. Baron Garcia, and S\. Kazerounian \(2025\) Hiding in the AI traffic: abusing MCP for LLM\-powered agentic red teaming\. arXiv preprint arXiv:2511\.15998\.
- G\. Klein, K\. Elphinstone, G\. Heiser, et al\. \(2009\) seL4: formal verification of an OS kernel\. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles, pp\. 207–220\.
- LangChain AI \(2024\) LangChain framework documentation\. [https://docs\.langchain\.com](https://docs.langchain.com/)\. Accessed 2025\.
- T\. Masterman, S\. Besen, M\. Sawtell, and A\. Chao \(2024\) The landscape of emerging AI agent architectures for reasoning, planning, and tool calling: a survey\. arXiv preprint arXiv:2404\.11584\.
- OpenAI \(2024\) OpenAI agents SDK documentation\. [https://platform\.openai\.com/docs/guides/agents](https://platform.openai.com/docs/guides/agents)\. Accessed 2025\.
- A\. S\. Patlan, P\. Sheng, S\. A\. Hebbar, P\. Mittal, and P\. Viswanath \(2025\) Real AI agents with fake memories: fatal context manipulation attacks on Web3 agents\. arXiv preprint arXiv:2503\.16248\.
- S\. Raza, R\. Sapkota, M\. Karkee, and C\. Emmanouilidis \(2025\) TRiSM for agentic AI: a review of trust, risk, and security management\. arXiv preprint arXiv:2506\.04133\.
- J\. H\. Saltzer and M\. D\. Schroeder \(1975\) The protection of information in computer systems\. Proceedings of the IEEE 63\(9\), pp\. 1278–1308\.
- Significant Gravitas \(2024\) AutoGPT: build & use AI agents\. [https://github\.com/Significant\-Gravitas/AutoGPT](https://github.com/Significant-Gravitas/AutoGPT)\. Accessed 2025\.
- H\. Triedman, R\. Jha, and V\. Shmatikov \(2025\) Multi\-agent systems execute arbitrary malicious code\. arXiv preprint arXiv:2503\.12188\.
- Y\. Wu, S\. Liang, C\. Zhang, et al\. \(2025\) From human memory to AI memory: a survey on memory mechanisms in the era of LLMs\. arXiv preprint arXiv:2504\.15965\.
- F\. F\. Xu, Y\. Song, B\. Li, et al\. \(2024\) TheAgentCompany: benchmarking LLM agents on consequential real world tasks\. arXiv preprint arXiv:2412\.14161\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2022\) ReAct: synergizing reasoning and acting in language models\. arXiv preprint arXiv:2210\.03629\.
