Statistically we are cooked

Reddit r/artificial News

Summary

Argues that because LLMs must encode harmful content to identify it and jailbreaks are always statistically possible given large user bases, there is a non-zero chance of harm; the author therefore advocates against censorship to ensure good actors have the same tools as bad actors.

For an LLM to identify harmful content, that content must be encoded in the model's weights. If you train a model on data that omits this information, it may naively regurgitate harmful content provided by a human user without knowing that it is harmful.

If harmful content is encoded in LLMs, and it is also true that jailbreaking LLMs is always technically possible (because LLMs are not deterministic), then in theory every model has the capacity to cause great harm.

Even though labs like Anthropic are exceptionally good at LLM alignment and make jailbreaking very difficult, there will always be a non-zero chance of a jailbreak. Given that LLMs have millions of users, it is statistically likely that at least one attempt will succeed.

All that said, this means the people with the most conviction will have jailbroken models while the rest of the world doesn't, which is a scary thought. That's why I personally believe LLMs should not be censored: one bad actor needs to be able to be taken down by a good actor with the same tools.

I'm very open to having my mind changed in the comments.
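
The statistical step in the argument above can be made concrete. This is only a rough sketch, assuming every jailbreak attempt succeeds independently with the same small probability p, which is an idealization; the post gives no actual figures. Under that assumption, the chance that at least one of n attempts succeeds is 1 - (1 - p)^n. The function name and the example values (p = 1e-6, n = 1,000,000) are hypothetical and chosen purely for illustration.

```python
import math


def prob_at_least_one_success(p: float, n: int) -> float:
    """Return 1 - (1 - p)**n, the chance that at least one of n
    independent attempts succeeds when each succeeds with probability p.

    Independence and a constant p are simplifying assumptions.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    if p == 1.0:
        # log1p(-1) is undefined, so handle certain success directly.
        return 1.0 if n > 0 else 0.0
    # log1p and expm1 keep precision when p is very small.
    return -math.expm1(n * math.log1p(-p))


# Hypothetical numbers: a one-in-a-million chance per attempt,
# and one million attempts in total.
example = prob_at_least_one_success(1e-6, 1_000_000)
print(f"{example:.3f}")
```

With these illustrative numbers the result is roughly 0.63, so even a one-in-a-million chance per attempt becomes more likely than not to succeed somewhere across a million attempts. The conclusion depends entirely on the assumed p and on the independence assumption, neither of which comes from measured data.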

Similar Articles

What political censorship looks like inside an LLM's weights (109 minute read)

TLDR AI

This mechanistic interpretability study of Qwen 3.5 uncovers the specific circuit responsible for political censorship, demonstrating how it can be identified, analyzed, and even turned off by steering internal directions. The findings reveal that the model's factual knowledge remains intact, with censorship behavior layered on top.