Tag
GRZO is a zeroth-order optimization method for fine-tuning large language models that reduces variance through group-relative normalization, achieving higher accuracy and better memory efficiency than MeZO.
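The summary does not spell out the update rule, so the following is only a minimal sketch of what a group-relative zeroth-order step might look like, not GRZO's actual algorithm. It assumes that each step probes the loss along a small group of random directions, normalizes the resulting losses by the group's mean and standard deviation, and moves against the directions that scored worse than average. The names `grzo_step`, `group_size`, `eps`, and `lr`, and the toy quadratic `loss`, are all hypothetical.

```python
import random
import statistics


def loss(theta):
    # Toy objective standing in for a fine-tuning loss (assumption):
    # squared distance to the all-ones vector.
    return sum((t - 1.0) ** 2 for t in theta)


def grzo_step(theta, rng, group_size=8, eps=1e-2, lr=0.05):
    """One hypothetical group-relative zeroth-order step (loss evaluations only)."""
    directions, losses = [], []
    for _ in range(group_size):
        # Random Gaussian probe direction. Stored here for clarity; a
        # memory-conscious version could regenerate it from a seed instead.
        z = [rng.gauss(0.0, 1.0) for _ in theta]
        directions.append(z)
        losses.append(loss([t + eps * zi for t, zi in zip(theta, z)]))

    mean = statistics.fmean(losses)
    std = statistics.pstdev(losses)
    # Group-relative normalization: score each probe against its own group.
    scores = [(value - mean) / (std + 1e-8) for value in losses]

    # Step away from directions whose loss was above the group average.
    return [
        t - lr * sum(s * z[i] for s, z in zip(scores, directions)) / group_size
        for i, t in enumerate(theta)
    ]


def run(steps=300, dim=5, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * dim
    start = loss(theta)
    for _ in range(steps):
        theta = grzo_step(theta, rng)
    return start, loss(theta)
```

Calling `run()` returns the toy loss before and after optimization. Because the scores are normalized within each group, the step size in this sketch does not depend on the raw scale of the loss, which is one plausible reading of how such normalization could tame variance.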
This paper proposes a hybrid framework that combines first-order safety alignment with zeroth-order refinement to make LLM safety alignment more robust to post-alignment perturbations. Theoretical and empirical results show that a few refinement steps suffice to improve robustness while preserving safety.