adversarial-perturbations

#adversarial-perturbations

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

arXiv cs.AI · 2026-05-29

This paper proposes a hybrid framework combining first-order safety alignment with zeroth-order refinement to enhance the robustness of LLM safety alignment against post-alignment perturbations. Theoretical and empirical results show that only a few refinement steps can improve robustness while preserving safety.


#adversarial-perturbations

Channel-Level Semantic Perturbations: Unlearnable Examples for Diverse Training Paradigms

arXiv cs.LG · 2026-05-08

This paper systematically investigates unlearnable examples under diverse training paradigms, revealing that pretrained weights weaken existing methods, and proposes Shallow Semantic Camouflage (SSC) to maintain unlearnability by generating perturbations in a semantically valid subspace.

