Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
Summary
This paper introduces open-book benign rewriting (OBBR) as a proactive defense against backdoor attacks on LLMs. It shows that OBBR neutralizes harmful content by projecting training samples to the space of benign prompts, and that it improves safety performance by an average of 51% over state-of-the-art backdoor defenses.
Cached at: 05/20/26, 10:40 PM
Source: https://huggingface.co/papers/2605.19147
Abstract
Open-book benign rewriting effectively defends large language models against backdoor attacks by neutralizing harmful content through benign prompt projection, outperforming existing defenses while maintaining computational efficiency and natural language task performance.
Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples--termed open-book benign rewriting (OBBR)--the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average of 51% compared to state-of-the-art BA defenses and 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger-based data poisoning attacks.
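The open-book idea in the abstract can be illustrated with a minimal sketch: each training sample is passed to a rewriting model together with a handful of known-benign samples, and the rewritten text replaces the original before fine-tuning. Everything below is an assumption made for illustration and is not taken from the paper: the prompt wording, the function names `build_open_book_prompt` and `rewrite_dataset`, and the `rewriter` callable, which stands in for whatever LLM performs the rewriting.

```python
from typing import Callable, List, Sequence


def build_open_book_prompt(sample: str, benign_examples: Sequence[str]) -> str:
    """Assemble a rewriting prompt that exposes benign reference samples.

    Hypothetical prompt template; the paper's actual prompt is not given here.
    """
    refs = "\n".join(f"- {ex}" for ex in benign_examples)
    return (
        "Rewrite the text below so that it reads like the benign "
        "reference samples.\n"
        f"Benign reference samples:\n{refs}\n"
        f"Text to rewrite:\n{sample}\n"
        "Rewritten text:"
    )


def rewrite_dataset(
    samples: Sequence[str],
    benign_examples: Sequence[str],
    rewriter: Callable[[str], str],
) -> List[str]:
    """Rewrite every training sample with the open-book prompt.

    `rewriter` is a placeholder for an LLM call: it takes a prompt string
    and returns the rewritten text.
    """
    return [
        rewriter(build_open_book_prompt(sample, benign_examples))
        for sample in samples
    ]
```

A closed-book variant, in this sketch, would simply omit the reference samples from the prompt. Passing the rewriter in as a callable keeps the example independent of any particular model or inference library.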
Similar Articles
Protecting Language Models Against Unauthorized Distillation through Trace Rewriting
Researchers propose trace rewriting methods to prevent unauthorized LLM knowledge distillation while preserving answer correctness and embedding detectable watermarks.
State Contamination in Memory-Augmented LLM Agents
This paper identifies and studies 'memory laundering' in LLM agents, where toxic or adversarial context compressed into memory summaries evades standard toxicity detectors while still influencing future generations. It introduces the sub-threshold propagation gap (SPG) to measure hidden downstream influence and shows that sanitizing toxic state before summarization is more effective than post-hoc cleaning.
Chainwash: Multi-Step Rewriting Attacks on Diffusion Language Model Watermarks
This research paper introduces Chainwash, a multi-step rewriting attack that effectively removes statistical watermarks from diffusion language model (LLaDA-8B-Instruct) outputs, reducing detection rates from 87.9% to 4.86% after five chained rewrites.
Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs
This paper analyzes the reconstruction-concealment tradeoff in intent-obfuscation jailbreak attacks on Multimodal Large Language Models (MLLMs). It proposes concealment-aware variant construction and keyword-related distractor images to exploit model vulnerabilities more effectively.
Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting
The paper introduces CITA, a framework for generating implicit toxicity attacks in Chinese to evaluate and improve LLM toxicity detectors, finding high attack success rates across tested models.