Is it ever possible to have a malicious LLM with a backdoor

Reddit r/LocalLLaMA 06/29/26, 11:52 AM News

Summary

Discusses the possibility of LLMs containing backdoors triggered by secret sentences or conditions, and the relative risks of closed vs open-source models.

I was just brainstorming of possibilities that the LLMs behave differently than normal if trained to recognize a specific secret sentence, and then unlocks a backdoor of malicious behavior. This sounds to me very possible at first glance. Don't get me wrong, the risk is relevant for ALL LLMs (closed & open ones), as long as we don't know the training data. I'm just trying to get the community ideas about such possibility and what are our lines of defense as long as we get the LLM having access to critical resources. My opinion is that closed source is riskier in this regards, because they can ultimately even change the behavior intentionally from the source. For local LLMs, since we're not exposing the LLM externally (i.e. we're the only prompters) it would limit the backdoor injection risks, but not entirely, because the LLM my have a sleeping trigger trained on (e.g. only wakes up when the date/time is matching a specific value). What do you think about such possibilities?

Original Article

Similar Articles

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

arXiv cs.AI

This paper identifies a shared latent mechanism across diverse backdoor behaviors in LLMs, using sparse autoencoders to detect and causally suppress these features, enabling unified backdoor detection and mitigation across models and attack types.

The only ethical way to use LLMs for research is with a closed-loop LLM Knowledge Base.

Reddit r/artificial

The article argues that using LLMs for research requires a closed-loop system like Karpathy's LLM Wiki or the Recall AI knowledge base to prevent hallucinations, ensuring all outputs are grounded in trusted source documents.

Estimating worst case frontier risks of open weight LLMs

OpenAI Blog

OpenAI researchers study worst-case frontier risks of releasing open-weight LLMs through malicious fine-tuning (MFT) in biology and cybersecurity domains, finding that open-weight models underperform frontier closed-weight models and don't substantially advance harmful capabilities.

Giving LLMs exec() power is a security nightmare. I built a open-source AST-based guardrail to stop malicious agent execution.

Reddit r/AI_Agents

Introduces ast-guard, an open-source AST-based security tool that prevents malicious code execution from LLM-generated Python strings by parsing them into an abstract syntax tree and applying node-level whitelisting and context-aware safety checks.

Have we reached the point where open-source LLMs are “just good enough”?

Reddit r/LocalLLaMA

A discussion on whether open-source LLMs are now 'just good enough' for most use cases, questioning the added value of proprietary models and the cost-benefit tradeoffs.