Is it ever possible to have a malicious LLM with a backdoor?
Summary
Discusses whether LLMs can contain backdoors triggered by secret sentences or conditions, and the relative risks of closed-source versus open-source models.
Similar Articles
Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs
This paper identifies a shared latent mechanism across diverse backdoor behaviors in LLMs, using sparse autoencoders to detect and causally suppress these features, enabling unified backdoor detection and mitigation across models and attack types.
The only ethical way to use LLMs for research is with a closed-loop LLM Knowledge Base.
The article argues that using LLMs for research requires a closed-loop system like Karpathy's LLM Wiki or the Recall AI knowledge base to prevent hallucinations, ensuring all outputs are grounded in trusted source documents.
Estimating worst case frontier risks of open weight LLMs
OpenAI researchers study worst-case frontier risks of releasing open-weight LLMs through malicious fine-tuning (MFT) in biology and cybersecurity domains, finding that open-weight models underperform frontier closed-weight models and don't substantially advance harmful capabilities.
Giving LLMs exec() power is a security nightmare. I built an open-source AST-based guardrail to stop malicious agent execution.
Introduces ast-guard, an open-source AST-based security tool that prevents malicious code execution from LLM-generated Python strings by parsing them into an abstract syntax tree and applying node-level whitelisting and context-aware safety checks.
Have we reached the point where open-source LLMs are “just good enough”?
A discussion on whether open-source LLMs are now 'just good enough' for most use cases, questioning the added value of proprietary models and the cost-benefit tradeoffs.