adversarial-perturbations

#adversarial-perturbations

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

arXiv cs.AI · 2026-05-29 Cached

This paper proposes a hybrid framework that combines first-order safety alignment with zeroth-order refinement to make LLM safety alignment more robust to post-alignment perturbations. Its theoretical and empirical results indicate that a few refinement steps suffice to improve robustness while preserving safety.

Channel-Level Semantic Perturbations: Unlearnable Examples for Diverse Training Paradigms

arXiv cs.LG · 2026-05-08 Cached

This paper systematically investigates unlearnable examples under diverse training paradigms and finds that pretrained weights weaken existing methods. It proposes Shallow Semantic Camouflage (SSC), which maintains unlearnability by generating perturbations in a semantically valid subspace.
