SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
Summary
SafeHarbor is a novel framework for LLM agent safety that uses hierarchical memory and self-evolution to balance safety and utility, achieving state-of-the-art performance on benign and malicious tasks.
Source: https://huggingface.co/papers/2605.05704 Published on May 7
Submitted by ZheLiu (https://huggingface.co/ljjDL) on May 14
Abstract
SafeHarbor is a novel framework for LLM agents that establishes precise decision boundaries through context-aware defense rules, featuring a hierarchical memory system and self-evolution mechanism to balance safety and utility.
With the rapid evolution of foundation models, Large Language Model (LLM) agents have demonstrated increasingly powerful tool-use capabilities. However, this proficiency introduces significant security risks, as malicious actors can manipulate agents into executing tools to generate harmful content. While existing defensive mechanisms are effective, they frequently suffer from the over-refusal problem, where increased safety strictness compromises the agent’s utility on benign tasks. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.
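The abstract does not spell out how the entropy-based splitting and merging criterion is computed. As a rough illustration only, the sketch below assumes each memory node stores a list of safety labels for the cases routed to it, and uses Shannon entropy over those labels to decide when a node is too mixed (split) or when two nodes are similar enough to combine (merge). The function names, the label strings, and the 0.9-bit threshold are all hypothetical and are not taken from the SafeHarbor paper or repository.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy, in bits, of the label distribution in one node."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # p * log2(1/p) summed over labels; equals 0.0 for a single-label node.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def should_split(labels, threshold=0.9):
    """Hypothetical rule: split a node whose labels are too mixed."""
    return shannon_entropy(labels) > threshold


def should_merge(labels_a, labels_b, threshold=0.9):
    """Hypothetical rule: merge two nodes if the combined node stays coherent."""
    return shannon_entropy(labels_a + labels_b) <= threshold


# Illustrative data, not from the paper.
pure_node = ["harmful"] * 8
mixed_node = ["harmful"] * 4 + ["benign"] * 4

print(should_split(pure_node))             # a single-label node has zero entropy
print(should_split(mixed_node))            # an even two-way mix has 1 bit of entropy
print(should_merge(pure_node, pure_node))  # combining two identical pure nodes stays pure
```

Under these assumptions, a node holding only one label is never split, while an evenly mixed node exceeds the threshold and would be divided into more specific children. The actual mechanism in SafeHarbor may use a different signal or threshold.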
Similar Articles
State Contamination in Memory-Augmented LLM Agents
This paper identifies and studies 'memory laundering' in LLM agents, where toxic or adversarial context compressed into memory summaries evades standard toxicity detectors while still influencing future generations. It introduces the sub-threshold propagation gap (SPG) to measure hidden downstream influence and shows that sanitizing toxic state before summarization is more effective than post-hoc cleaning.
H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure
H-Mem is a novel memory mechanism for LLM-based agents that uses a hybrid structure combining a temporal and semantic tree with a knowledge graph to model memory evolution and improve retrieval, achieving state-of-the-art performance on QA benchmarks.
PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation
PropGuard is a propagation-aware framework for safeguarding LLM-based multi-agent systems (LLM-MAS) from malicious instructions that propagate across agents and rounds. It constructs a dual-view spatio-temporal graph and uses a GE-GRPO-trained inspector to detect and remediate suspicious propagation subgraphs.
Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching
This paper proposes a memory-augmented multi-agent architecture using nested learning, continuum memory systems, and semantic caching to mitigate hallucination in LLM pipelines, achieving significant reductions in factual errors while improving operational efficiency.
HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents
HeLa-Mem is a bio-inspired memory architecture for LLM agents that models memory as a dynamic graph using Hebbian learning dynamics, featuring episodic and semantic memory stores to improve long-term coherence. Experiments on LoCoMo show superior performance across question categories while using fewer context tokens.