Statistically we are cooked
Summary
Argues that because LLMs must encode harmful content to identify it and jailbreaks are always statistically possible given large user bases, there is a non-zero chance of harm; the author therefore advocates against censorship to ensure good actors have the same tools as bad actors.
Similar Articles
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
Researchers introduce HarDBench, a benchmark exposing how LLMs can be jailbroken via malicious drafts in collaborative writing, and propose a preference-optimization defense that cuts harmful outputs without hurting co-authoring utility.
Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code
This paper reveals that grammar-constrained decoding (GCD) can be exploited as a jailbreak attack (CodeSpear) to induce LLMs to generate malicious code, and proposes a defense (CodeShield) that preserves safety under such attacks.
How Far Will They Go? Red-Teaming Online Influence with Large Language Models
This paper introduces a red-teaming framework that measures the 'Overton Window' of political opinions open-source LLMs can express and evaluates how simple jailbreaks expand that range, finding systematic left-leaning biases and vulnerabilities across 30+ models.
Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs
Researchers from CUHK-Shenzhen introduce a jailbreak method using fanfiction subgenres from Archive of Our Own as attack carriers, embedding harmful content within creative writing scenes. Their method achieves a mean attack success rate of 0.731 on eight aligned LLMs, with a multi-turn extension (Saga-A4) reaching 0.924 ASR, outperforming existing methods.
What political censorship looks like inside an LLM's weights (109 minute read)
This mechanistic interpretability study of Qwen 3.5 uncovers the specific circuit responsible for political censorship, demonstrating how it can be identified, analyzed, and even turned off by steering internal directions. The findings reveal that the model's factual knowledge remains intact, with censorship behavior layered on top.