Tag
BYORn is a backdoor-robust fine-tuning framework for vision-language models that identifies and replaces poisoned responses with model-generated alternatives, improving robustness to backdoor attacks while maintaining clean-task performance.
A discussion of whether open-weight AI models could be secretly trained with backdoors that activate on trigger phrases or dates, potentially allowing data exfiltration through tool-use harnesses.
This paper introduces a framework that connects randomized smoothing to differential privacy through privacy profiles, enabling tight provable robustness guarantees against backdoor attacks that jointly affect training and inference. The approach is instantiated for DP-SGD and Deep Partition Aggregation with experiments on MNIST and CIFAR-10.
This paper introduces open-book benign rewriting (OBBR) as a proactive defense against backdoor attacks on LLMs, showing that it neutralizes harmful content by projecting to benign prompts and improves safety by 51% over state-of-the-art defenses.