This paper proposes methods for protecting large language models against unauthorized knowledge distillation: reasoning traces are rewritten to degrade their usefulness as distillation training data while preserving answer correctness, and verifiable watermarks are embedded so that they surface in distilled student models. The approach uses instruction-based and gradient-based rewriting techniques to achieve these anti-distillation effects without compromising the teacher model's own performance.
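To make the instruction-based rewriting idea concrete, here is a minimal sketch under stated assumptions: `query_teacher` is a hypothetical placeholder for whatever LLM API is available, `extract_answer` assumes the final answer sits on the trace's last line, and the prompt text is illustrative, not the paper's actual prompt. The gradient-based variant and the watermarking scheme are not shown.

```python
# Hedged sketch of instruction-based trace rewriting for anti-distillation.
# All names below are hypothetical; the paper's prompts and checks may differ.

REWRITE_PROMPT = (
    "Rewrite the reasoning below so that the final answer is unchanged, but "
    "the intermediate steps are verbose, reordered, and stylistically "
    "idiosyncratic:\n\n{trace}"
)

def query_teacher(prompt: str) -> str:
    """Placeholder for a call to the teacher LLM (any chat-completion API)."""
    raise NotImplementedError

def extract_answer(trace: str) -> str:
    """Assumes the final answer appears on the trace's last line."""
    return trace.strip().splitlines()[-1]

def rewrite_trace(trace: str) -> str:
    """Return a rewritten trace, falling back if the answer changed."""
    rewritten = query_teacher(REWRITE_PROMPT.format(trace=trace))
    # Serve the rewrite only if correctness (the final answer) is preserved,
    # matching the paper's stated goal of degrading training usefulness
    # without degrading answer quality.
    if extract_answer(rewritten) == extract_answer(trace):
        return rewritten
    return trace
```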
This paper demonstrates that deep neural networks are catastrophically vulnerable to flipping the sign bits of a minimal number of parameters, introducing the DNL and 1P-DNL methods, which identify these critical parameters without requiring data or optimization. The vulnerability spans image classification, object detection, instance segmentation, and language models, with practical implications for model security.
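The attack surface itself is easy to demonstrate: in IEEE-754 float32, flipping bit 31 exactly negates the value, so a single sign-bit flip maps a weight w to -w. The toy sketch below illustrates this on a small PyTorch model; note that targeting the largest-magnitude weight is an illustrative heuristic of ours, not necessarily the selection rule used by DNL or 1P-DNL.

```python
import torch
import torch.nn as nn

# Toy demonstration of a single sign-bit flip (w -> -w) in one parameter.
# The largest-magnitude-weight heuristic below is an assumption for
# illustration; the paper's DNL / 1P-DNL criteria may differ.

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
x = torch.randn(4, 8)

with torch.no_grad():
    before = model(x)

    # Pick the single parameter with the largest magnitude in the model.
    name, param = max(
        model.named_parameters(), key=lambda np: np[1].abs().max().item()
    )
    flat = param.view(-1)
    idx = flat.abs().argmax()

    # Flip its sign bit (equivalent to negation for finite float32 values).
    flat[idx] = -flat[idx]

    after = model(x)

print(f"flipped one weight in {name}; max output shift:",
      (after - before).abs().max().item())
```

Even on this toy network the outputs shift, which hints at why a single well-chosen flip in a large pretrained model can be catastrophic.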