SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

Summary

SafeHarbor is a novel framework for LLM agent safety that uses hierarchical memory and self-evolution to balance safety and utility, achieving state-of-the-art performance on benign and malicious tasks.

With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into executing tools to generate harmful content. While existing defensive mechanisms are effective, they frequently suffer from the over-refusal problem, where increased safety strictness compromises the agent's utility on benign tasks. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.
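
To make "hierarchical memory" and "dynamic rule injection" concrete, here is a minimal sketch of one way such a store could work. It is not taken from the SafeHarbor code: the class `MemoryNode`, the function `inject_rules`, the category paths, and the example rules are all hypothetical, and the real system may retrieve rules quite differently (for example by semantic similarity rather than exact path keys).

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One node of a hypothetical rule hierarchy (all names are illustrative)."""
    name: str
    rules: list[str] = field(default_factory=list)
    children: dict[str, "MemoryNode"] = field(default_factory=dict)

    def add(self, path: list[str], rule: str) -> None:
        """Store a rule at the node reached via `path`, creating nodes as needed."""
        node = self
        for key in path:
            node = node.children.setdefault(key, MemoryNode(key))
        node.rules.append(rule)

    def retrieve(self, path: list[str]) -> list[str]:
        """Collect rules from the root down to the deepest node matching `path`."""
        node, found = self, list(self.rules)
        for key in path:
            node = node.children.get(key)
            if node is None:
                break  # no more specific node; keep the general rules found so far
            found.extend(node.rules)
        return found


def inject_rules(system_prompt: str, rules: list[str]) -> str:
    """Append retrieved rules to the prompt text; no model weights are changed."""
    if not rules:
        return system_prompt
    bullets = "\n".join(f"- {rule}" for rule in rules)
    return f"{system_prompt}\n\nSafety rules for this request:\n{bullets}"


# Assumed example data: one general rule and one tool-specific rule.
root = MemoryNode("root", rules=["Refuse requests whose stated goal is to harm a person."])
root.add(["email", "bulk_send"],
         "Allow newsletters to opted-in lists; refuse unsolicited mass messaging.")

rules = root.retrieve(["email", "bulk_send"])
prompt = inject_rules("You are a tool-using assistant.", rules)
print(prompt)
```

Because the rules travel in the prompt, a guardrail of this shape stays training-free and can be attached to any agent, which matches the plug-and-play property described above. More specific nodes add rules on top of general ones, so a narrow rule can refine a boundary without loosening or tightening unrelated requests.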

Source: https://huggingface.co/papers/2605.05704

Published on May 7

Submitted by ZheLiu (https://huggingface.co/ljjDL) on May 14

Abstract

SafeHarbor is a novel framework for LLM agents that establishes precise decision boundaries through context-aware defense rules, featuring a hierarchical memory system and self-evolution mechanism to balance safety and utility.

With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into executing tools to generate harmful content. While existing defensive mechanisms are effective, they frequently suffer from the over-refusal problem, where increased safety strictness compromises the agent's utility on benign tasks. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.
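
The abstract does not say which distribution the entropy is computed over or how split and merge decisions are triggered. As a rough illustration only, the sketch below assumes each memory node stores cases labeled `"benign"` or `"malicious"`, splits a node when its labels are highly mixed, and merges two sibling nodes when their combined labels are nearly uniform. The function names and both thresholds are hypothetical and are not taken from the SafeHarbor repository.

```python
import math
from collections import Counter


def shannon_entropy(labels: list[str]) -> float:
    """Shannon entropy, in bits, of the empirical label distribution."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def should_split(labels: list[str], threshold: float = 0.9) -> bool:
    """Assumed rule: a node mixing benign and malicious cases is too coarse."""
    return shannon_entropy(labels) > threshold


def should_merge(labels_a: list[str], labels_b: list[str],
                 threshold: float = 0.3) -> bool:
    """Assumed rule: siblings whose union is nearly pure add no distinction."""
    return shannon_entropy(labels_a + labels_b) < threshold


# A node holding an even mix of labels has the maximum entropy of 1 bit.
mixed_node = ["benign"] * 5 + ["malicious"] * 5
print(shannon_entropy(mixed_node), should_split(mixed_node))

# Two siblings that both hold only malicious cases are merge candidates.
print(should_merge(["malicious"] * 4, ["malicious"] * 3))
```

Under these assumptions, splitting targets exactly the nodes where benign and harmful requests look alike, which is where over-refusal tends to arise, while merging keeps the hierarchy from growing without bound.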

Similar Articles

State Contamination in Memory-Augmented LLM Agents

arXiv cs.AI

This paper identifies and studies 'memory laundering' in LLM agents, where toxic or adversarial context compressed into memory summaries evades standard toxicity detectors while still influencing future generations. It introduces the sub-threshold propagation gap (SPG) to measure hidden downstream influence and shows that sanitizing toxic state before summarization is more effective than post-hoc cleaning.

PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation

arXiv cs.LG

PropGuard is a propagation-aware framework for safeguarding LLM-based multi-agent systems (LLM-MAS) from malicious instructions that propagate across agents and rounds. It constructs a dual-view spatio-temporal graph and uses a GE-GRPO trained inspector to detect and remediate suspicious propagation subgraphs.

HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents

arXiv cs.CL

HeLa-Mem is a bio-inspired memory architecture for LLM agents that models memory as a dynamic graph using Hebbian learning dynamics, featuring episodic and semantic memory stores to improve long-term coherence. Experiments on LoCoMo show superior performance across question categories while using fewer context tokens.