Tag
This paper analyzes why LLM safety alignment is fragile, attributing it to 'autoregressive consistency'—the tendency of next-token prediction to extend the current response trajectory—which concentrates alignment updates on early tokens. The authors introduce a 'random insertion attack' exploiting this property and propose an adversarial safety alignment framework to address it.
Proposes COMPASS, a cognitive MCTS-guided process alignment framework to enhance safety in LLM-powered search agents by synthesizing attack trajectories and isolating risky actions, achieving a favorable safety-utility trade-off with less training data.
This paper introduces the Configurable Safety Reward Model (CSRM), a reward model that can be configured to accommodate heterogeneous and evolving safety requirements for LLM alignment. CSRM achieves state-of-the-art results on configurable safety benchmarks and improves the helpfulness-safety tradeoff.
This paper proposes a hybrid framework combining first-order safety alignment with zeroth-order refinement to enhance the robustness of LLM safety alignment against post-alignment perturbations. Theoretical and empirical results show that only a few refinement steps can improve robustness while preserving safety.
This paper proposes Staged-Competence, a curriculum learning framework for DPO-based safety alignment that organizes preference data by difficulty, improving robustness and data efficiency while preserving general capabilities.
Palette proposes a modular framework for selectively relaxing safety refusal behaviors in LLMs for authorized professional domains, using multi-objective search and lightweight adaptation to avoid costly retraining.
This paper introduces OPSA, an on-policy self-distillation method for LLM safety alignment that reduces the safety tax by training on the model's own rollouts and using a teacher flip rate to activate latent safety reasoning, achieving stronger safety-reasoning tradeoffs across multiple model scales.
GradShield introduces a principled filtering method to preserve LLM safety alignment during fine-tuning by computing a Finetuning Implicit Harmfulness Score and using adaptive thresholding to remove harmful data, achieving low attack success rates while maintaining utility.
This paper introduces MOCI, a novel framework for inferring shared constraints and individual preferences from heterogeneous expert demonstrations in reinforcement learning, outperforming existing baselines in predictive performance and computational efficiency.