LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
Summary
LiSA (Lifelong Safety Adaptation) is a framework that improves AI agent safety guardrails after deployment. It converts occasional failures into reusable policy abstractions and applies evidence-aware confidence gating, so the guardrail keeps improving when feedback is sparse and noisy.
Cached at: 05/15/26, 12:25 PM
Source: https://huggingface.co/papers/2605.14454
Abstract
LiSA enables adaptive safety guardrails for AI agents by converting occasional failures into reusable policy abstractions and using evidence-aware confidence gating to improve performance under sparse and noisy feedback conditions.
As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency-performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
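The abstract does not give the exact form of the posterior lower bound, so the following is only an illustrative sketch of how evidence-aware gating could work, not the paper's implementation. It assumes a Beta-Bernoulli model with a uniform prior over each stored rule's correctness and reuses a rule only when a lower quantile of that posterior clears a threshold. The function names, the `delta` quantile level, and the `threshold` value are all hypothetical.

```python
from math import comb


def beta_cdf(x: float, a: int, b: int) -> float:
    """CDF of Beta(a, b) at x for positive integer a, b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(a, n + 1))


def posterior_lower_bound(successes: int, failures: int, delta: float = 0.05) -> float:
    """delta-quantile of the Beta(1 + successes, 1 + failures) posterior.

    Assumption: uniform Beta(1, 1) prior; the paper may use a different bound.
    """
    a, b = successes + 1, failures + 1
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection on the monotone CDF
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < delta:
            lo = mid
        else:
            hi = mid
    return lo


def should_reuse(successes: int, failures: int,
                 threshold: float = 0.7, delta: float = 0.05) -> bool:
    """Gate memory reuse on the lower bound, not on the raw accuracy."""
    return posterior_lower_bound(successes, failures, delta) >= threshold


# A rule that was right 2 out of 2 times has perfect empirical accuracy but
# little evidence; a rule that was right 45 out of 50 times has lower accuracy
# but far more evidence behind it.
few = should_reuse(successes=2, failures=0)
many = should_reuse(successes=45, failures=5)
print(few, many)
```

Under these assumptions the gate rejects the 2-for-2 rule and accepts the 45-for-50 rule, which illustrates the abstract's point that reuse should scale with accumulated evidence rather than empirical accuracy alone.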
Get this paper in your agent:
hf papers read 2605.14454
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints
Proposes LILAC+, a framework for safe continual reinforcement learning under nonstationarity that uses three adaptive safety mechanisms: context-based safety constraints, adaptation-speed constraints, and budget-to-state safety enforcement. Evaluations in simulated driving environments show reduced safety violations under distribution shift while maintaining competitive performance.
Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation
This paper introduces OPSA, an on-policy self-distillation method for LLM safety alignment that reduces the safety tax by training on the model's own rollouts and using a teacher flip rate to activate latent safety reasoning, achieving stronger safety-reasoning tradeoffs across multiple model scales.
SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
SafeHarbor is a novel framework for LLM agent safety that uses hierarchical memory and self-evolution to balance safety and utility, achieving state-of-the-art performance on benign and malicious tasks.
Learning Agentic Policy from Action Guidance
The paper proposes ActGuide-RL, a method for training agentic policies in LLMs by using human action data as guidance to overcome exploration barriers in reinforcement learning without extensive supervised fine-tuning.
On Safety Risks in Experience-Driven Self-Evolving Agents
Researchers from Harbin Institute of Technology and Singapore Management University investigate safety risks in experience-driven self-evolving LLM agents, finding that even benign task experience can compromise safety in high-risk scenarios due to agents' execution-oriented tendencies, and revealing a fundamental safety–utility trade-off.