LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

Hugging Face Daily Papers

Summary

LiSA (Lifelong Safety Adaptation) is a framework that enhances AI agent safety guardrails by converting occasional failures into reusable policy abstractions and using evidence-aware confidence gating to perform well under sparse and noisy feedback, addressing the critical need for adaptive safety in real-world deployments.

As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency-performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
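The abstract's idea of gating memory reuse on a posterior lower bound, rather than on empirical accuracy alone, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `MemoryRule` record, the Beta(1, 1) prior over a rule's reliability, the normal approximation used for the lower bound, and the 0.6 threshold. The paper's actual bound and data structures may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryRule:
    """Hypothetical memory entry: a proposed decision plus feedback counts."""
    label: str              # decision the rule proposes, e.g. "block" or "allow"
    confirmed: int = 0      # feedback reports that agreed with the rule
    contradicted: int = 0   # feedback reports that disagreed with the rule


def posterior_lower_bound(rule: MemoryRule, z: float = 1.645) -> float:
    """Approximate lower bound on the rule's reliability.

    Assumes a Beta(1, 1) prior updated with the feedback counts and uses
    mean - z * std of the resulting Beta posterior as a simple stand-in
    for an exact posterior quantile.
    """
    a = 1 + rule.confirmed
    b = 1 + rule.contradicted
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return max(0.0, mean - z * math.sqrt(var))


def gated_decision(rule: MemoryRule, base_decision: str,
                   threshold: float = 0.6) -> str:
    """Reuse the memory rule only when its lower bound clears the threshold;
    otherwise fall back to the fixed base guardrail's decision."""
    if posterior_lower_bound(rule) >= threshold:
        return rule.label
    return base_decision


# A rule confirmed 2 out of 2 times has perfect empirical accuracy but
# little evidence; a rule confirmed 40 out of 50 times has lower accuracy
# but far more evidence behind it.
fresh = MemoryRule("block", confirmed=2, contradicted=0)
seasoned = MemoryRule("block", confirmed=40, contradicted=10)
print(gated_decision(fresh, base_decision="allow"))
print(gated_decision(seasoned, base_decision="allow"))
```

Under these assumptions the 2-of-2 rule is not yet trusted and the base guardrail's decision is kept, while the 40-of-50 rule passes the gate. This is the sense in which reuse scales with accumulated evidence: a short perfect record does not outrank a longer, slightly imperfect one.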

Cached at: 05/15/26, 12:25 PM


Source: https://huggingface.co/papers/2605.14454



Similar Articles

Safe Continual Reinforcement Learning under Nonstationarity via Adaptive Safety Constraints

arXiv cs.LG

Proposes LILAC+, a framework for safe continual reinforcement learning under nonstationarity that uses three adaptive safety mechanisms: context-based safety constraints, adaptation-speed constraints, and budget-to-state safety enforcement. Evaluations in simulated driving environments show reduced safety violations under distribution shift while maintaining competitive performance.

Learning Agentic Policy from Action Guidance

arXiv cs.CL

The paper proposes ActGuide-RL, a method for training agentic policies in LLMs by using human action data as guidance to overcome exploration barriers in reinforcement learning without extensive supervised fine-tuning.

On Safety Risks in Experience-Driven Self-Evolving Agents

arXiv cs.CL

Researchers from Harbin Institute of Technology and Singapore Management University investigate safety risks in experience-driven self-evolving LLM agents, finding that even benign task experience can compromise safety in high-risk scenarios due to agents' execution-oriented tendencies, and revealing a fundamental safety–utility trade-off.