Tag
This paper proposes CR4T, a model-agnostic safeguarding framework that rewrites unsafe or refusal-style LLM outputs into developmentally appropriate, guidance-oriented responses for adolescents, offering a more human-centered alternative to traditional refusal-centric guardrails.
This paper introduces open-book benign rewriting (OBBR) as a proactive defense against backdoor attacks on LLMs, showing that it neutralizes harmful content by projecting to benign prompts and improves safety by 51% over state-of-the-art defenses.