Statistically we are cooked

Reddit r/artificial 06/15/26, 07:44 PM News

Summary

Argues that because LLMs must encode harmful content to identify it and jailbreaks are always statistically possible given large user bases, there is a non-zero chance of harm; the author therefore advocates against censorship to ensure good actors have the same tools as bad actors.

For an LLM to identify harmful content, that content must be encoded in the model's weights. If you train a model on data that omits this information, it may naively regurgitate harmful content provided by a human user without knowing that it is harmful.

If harmful content is encoded in LLMs, and it is also true that jailbreaking an LLM is always technically possible (because LLMs are not deterministic), then in theory every model has the capacity to cause great harm.

Even though labs like Anthropic are exceptionally good at LLM alignment and make jailbreaking very difficult, there will always be a non-zero chance of a jailbreak. Given that LLMs have millions of users, it is statistically likely that at least one attempt will succeed.

All that said, this means the people with the most conviction will have jailbroken models while the rest of the world doesn't, which is a scary thought. That's why I personally believe LLMs should not be censored: one bad actor needs to be able to be taken down by a good actor with the same tools.

I'm very open to having my mind changed in the comments.
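The "at least one will succeed" step can be made concrete with a rough sketch. It assumes every attempt is independent and has the same success probability p, which is a simplification the post does not state; under that assumption the chance of at least one success in n attempts is 1 - (1 - p)^n. The function name and the sample values of p and n below are hypothetical, chosen only for illustration, and are not measured jailbreak rates.

```python
import math


def prob_at_least_one_success(p: float, n: int) -> float:
    """Chance of at least one success in n independent attempts.

    Assumption (not from the post): each attempt succeeds with the same
    probability p, independently of the others.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    if p == 1.0:
        # log1p(-1.0) is undefined, so handle certain success directly.
        return 1.0 if n > 0 else 0.0
    # 1 - (1 - p)**n, written with log1p/expm1 so that very small p
    # does not get lost to floating-point rounding.
    return -math.expm1(n * math.log1p(-p))


if __name__ == "__main__":
    # Hypothetical per-attempt success rates and attempt counts.
    for p in (1e-9, 1e-6, 1e-3):
        for n in (1_000, 1_000_000, 100_000_000):
            chance = prob_at_least_one_success(p, n)
            print(f"p={p:g}, n={n:,}: {chance:.6f}")
```

With a one-in-a-million chance per attempt and a million attempts, this comes out to roughly 0.63, so the intuition holds whenever n is large relative to 1/p. The result depends heavily on the independence assumption and on what p actually is, neither of which the post establishes.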

Similar Articles

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

arXiv cs.CL

Researchers introduce HarDBench, a benchmark exposing how LLMs can be jailbroken via malicious drafts in collaborative writing, and propose a preference-optimization defense that cuts harmful outputs without hurting co-authoring utility.

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

Hugging Face Daily Papers

This paper reveals that grammar-constrained decoding (GCD) can be exploited as a jailbreak attack (CodeSpear) to induce LLMs to generate malicious code, and proposes a defense (CodeShield) that preserves safety under such attacks.

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

arXiv cs.CL

This paper introduces a red-teaming framework that measures the 'Overton Window' of political opinions open-source LLMs can express and evaluates how simple jailbreaks expand that range, finding systematic left-leaning biases and vulnerabilities across 30+ models.

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

arXiv cs.CL

Researchers from CUHK-Shenzhen introduce a jailbreak method using fanfiction subgenres from Archive of Our Own as attack carriers, embedding harmful content within creative writing scenes. Their method achieves a mean attack success rate of 0.731 on eight aligned LLMs, with a multi-turn extension (Saga-A4) reaching 0.924 ASR, outperforming existing methods.

What political censorship looks like inside an LLM's weights (109 minute read)

TLDR AI

This mechanistic interpretability study of Qwen 3.5 uncovers the specific circuit responsible for political censorship, demonstrating how it can be identified, analyzed, and even turned off by steering internal directions. The findings reveal that the model's factual knowledge remains intact, with censorship behavior layered on top.
