safety-alignment

When Autoregressive Consistency Hurts Safety Alignment

arXiv cs.LG · yesterday

This paper analyzes why LLM safety alignment is fragile. It attributes the fragility to 'autoregressive consistency', the tendency of next-token prediction to extend the current response trajectory, which concentrates alignment updates on early tokens. The authors introduce a 'random insertion attack' that exploits this property and propose an adversarial safety alignment framework to address it.

COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents

arXiv cs.AI · 4d ago

Proposes COMPASS, a cognitive MCTS-guided process alignment framework to enhance safety in LLM-powered search agents by synthesizing attack trajectories and isolating risky actions, achieving a favorable safety-utility trade-off with less training data.

Configurable Reward Model for Balanced Safety Alignment

arXiv cs.CL · 4d ago

This paper introduces the Configurable Safety Reward Model (CSRM), a reward model that can be configured to accommodate heterogeneous and evolving safety requirements for LLM alignment. CSRM achieves state-of-the-art results on configurable safety benchmarks and improves the helpfulness-safety tradeoff.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

arXiv cs.AI · 2026-05-29

This paper proposes a hybrid framework combining first-order safety alignment with zeroth-order refinement to enhance the robustness of LLM safety alignment against post-alignment perturbations. Theoretical and empirical results show that only a few refinement steps can improve robustness while preserving safety.

Curriculum Learning for Safety Alignment

arXiv cs.LG · 2026-05-27

This paper proposes Staged-Competence, a curriculum learning framework for DPO-based safety alignment that organizes preference data by difficulty, improving robustness and data efficiency while preserving general capabilities.

Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

arXiv cs.AI · 2026-05-26

Palette proposes a modular framework for selectively relaxing safety refusal behaviors in LLMs for authorized professional domains, using multi-objective search and lightweight adaptation to avoid costly retraining.

Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

arXiv cs.LG · 2026-05-18

This paper introduces OPSA, an on-policy self-distillation method for LLM safety alignment that reduces the safety tax by training on the model's own rollouts and using a teacher flip rate to activate latent safety reasoning, achieving stronger safety-reasoning tradeoffs across multiple model scales.

GradShield: Alignment Preserving Finetuning

arXiv cs.CL · 2026-05-15

GradShield introduces a principled filtering method to preserve LLM safety alignment during fine-tuning by computing a Finetuning Implicit Harmfulness Score and using adaptive thresholding to remove harmful data, achieving low attack success rates while maintaining utility.

Multi-Objective Constraint Inference using Inverse reinforcement learning

arXiv cs.AI · 2026-05-11

This paper introduces MOCI, a novel framework for inferring shared constraints and individual preferences from heterogeneous expert demonstrations in reinforcement learning, outperforming existing baselines in predictive performance and computational efficiency.
