Is it ever possible to have a malicious LLM with a backdoor

Reddit r/LocalLLaMA News

Summary

Discusses the possibility of LLMs containing backdoors triggered by secret sentences or conditions, and the relative risks of closed vs open-source models.

I was just brainstorming of possibilities that the LLMs behave differently than normal if trained to recognize a specific secret sentence, and then unlocks a backdoor of malicious behavior. This sounds to me very possible at first glance. Don't get me wrong, the risk is relevant for ALL LLMs (closed & open ones), as long as we don't know the training data. I'm just trying to get the community ideas about such possibility and what are our lines of defense as long as we get the LLM having access to critical resources. My opinion is that closed source is riskier in this regards, because they can ultimately even change the behavior intentionally from the source. For local LLMs, since we're not exposing the LLM externally (i.e. we're the only prompters) it would limit the backdoor injection risks, but not entirely, because the LLM my have a sleeping trigger trained on (e.g. only wakes up when the date/time is matching a specific value). What do you think about such possibilities?
Original Article

Similar Articles

Estimating worst case frontier risks of open weight LLMs

OpenAI Blog

OpenAI researchers study worst-case frontier risks of releasing open-weight LLMs through malicious fine-tuning (MFT) in biology and cybersecurity domains, finding that open-weight models underperform frontier closed-weight models and don't substantially advance harmful capabilities.