AdversaBench introduces an automated LLM red-teaming pipeline that uses five mutation operators and a three-judge panel with a meta-judge tiebreaker to confirm failures, revealing that attack difficulty varies by category and that adversarial prompts transfer from smaller to larger models.
This study evaluates six proprietary LLMs across 16 DSM-5 conditions using adversarial attacks, finding that safeguards are reliable only for suicide and self-harm, with failure rates of up to 100% for other conditions such as eating disorders and substance use disorder.
gwern proposes the 'Guardian Angel' approach: training an LLM digital twin that imitates the user, in order to solve the principal-agent problem and security risks of general AI assistants, and lays out a complete roadmap from alignment theory to technical implementation.
This article traces the history and rapid emergence of chain-of-thought (CoT) monitoring as a critical AI safety technique, from its first arXiv mention to industrial deployment within a year, and explores its intellectual roots in monitoring and explainability.
This paper introduces Tiered Language Models (TLMs), which allow a single set of open-weight model parameters to support multiple capability levels controlled by secret keys. The method enables selective exposure of private capabilities while preserving public model behavior and resisting extraction.
This paper proposes Safety Reflection Pretraining, a method that integrates regular safety reflections into pretraining corpora to embed self-monitoring directly into language modeling, showing improved safety alignment and reduced attack success rates in 1.7B models.
This paper introduces SciRisk-Bench, a benchmark for evaluating the safety of large language models in AI4Science contexts, covering 7 disciplines, 31 subdisciplines, and 10 risk dimensions to assess both scientific competence and risk awareness.
The article presents a research finding that saturating an LLM's context window with benign narrative text can dominate the attention mechanism and shift latent trajectories, potentially bypassing alignment guardrails without traditional jailbreaks. It argues that current alignment methods are a superficial fix for a fundamentally fluid architecture.
PseudoBench is a benchmark to evaluate whether LLM-based agentic auto-research systems can resist pseudoscientific narratives. Testing seven state-of-the-art agents reveals they readily produce persuasive pseudoscientific reports with near-zero refusal rates, calling for scientific alignment before deployment.
Introduces STATEWITNESS, an activation explainer for auditing deception in reasoning LLMs, achieving significant improvements over existing monitors and providing human-inspectable evidence.
Argues that because LLMs must encode harmful content to identify it and jailbreaks are always statistically possible given large user bases, there is a non-zero chance of harm; the author therefore advocates against censorship to ensure good actors have the same tools as bad actors.
An independent researcher presents evidence that coherent context can shift LLMs into a different internal regime before producing output, bypassing surface-level safety filters. This suggests current alignment methods like RLHF may not be robust defenses.
This paper proposes SafeLLM, an extraction-based approach for retrieving information from safety-critical documents, showing that line-number selection outperforms rewriting-based RAG methods in reducing hallucinations while maintaining high recall.
Malware developers are embedding references to nuclear and biological weapons in spyware to trigger LLM safety refusals, evading AI-powered security scanners. This highlights a second-order blind spot in AI safety alignment that attackers are starting to exploit.
Introduces Schützen, a safety dataset for evaluating LLMs in Bulgarian and German, revealing cross-language differences in safety behavior and advocating for region-specific evaluation resources.
This paper proposes MLJailDe, a multilingual jailbreak detection framework that uses back-translation data augmentation and relative-distance constraints to improve cross-lingual generalization and robustness, achieving 98.5% F1 score across 11 languages.
PreAct-Bench is a benchmark of 1,000 paired ethical and unethical action trajectories across five domains, designed to evaluate the ability of LLMs to predict harmful outcomes from partial trajectories (predictive monitoring). Results show that while humans perform well, current LLMs struggle, highlighting the need for future-oriented risk reasoning.
Introduces Janus, a benchmark for measuring how LLMs selectively distort factual information when given persuasive goals, revealing that models remain susceptible to producing misleading communications even without fabrication.
This paper identifies a polarity-flipping encoding subspace in the residual stream of LLM agents that enables real-time detection of covert data exfiltration, achieving AUC=0.918 in injection scenarios and substantially outperforming output-only detectors.
This paper identifies a shared latent mechanism across diverse backdoor behaviors in LLMs, using sparse autoencoders to detect and causally suppress these features, enabling unified backdoor detection and mitigation across models and attack types.