OpenAI launches GPT-5.6 Sol with enhanced safety features including real-time protections against high-risk cyber activity, human red teaming, and extensive GPU-hour testing.
This paper presents a red teaming framework for LLMs that uses a multi-role architecture to systematically uncover vulnerabilities, particularly in faithfulness. The framework demonstrated a 7.9% increase in attack success rate on QA tasks, and the results suggest that architectural choices affect model safety more than parameter scaling does.
AdversaBench introduces an automated LLM red-teaming pipeline that uses five mutation operators and a three-judge panel with a meta-judge tiebreaker to confirm failures, revealing that attack difficulty varies by category and that adversarial prompts transfer from smaller to larger models.
RIFT-Bench is a new benchmark methodology for dynamically red-teaming agentic AI systems, using a graph representation to unify security evaluations across diverse architectures and enabling automated discovery and scanning of vulnerabilities.
Zico Kolter and Matt Fredrikson, leaders at Gray Swan and experts in AI security, discuss the state of AI red-teaming and indirect prompt injection, a critical vulnerability for AI agents. They explain why AI security requires a different mindset, how automated red-teaming can beat humans, and introduce tools like Shade for adversarial testing.
Gray Swan AI cofounders Zico Kolter and Matt Fredrikson discuss on the Latent Space podcast why AI security is a distinct discipline, covering prompt injection, automated red teaming, and the emerging vulnerabilities from AI agents.
TROPT is an open-source framework that unifies discrete text-trigger optimization, standardizing development and execution across domains like LLM jailbreaking and model interpretability. It includes over 15 optimizers and 30 recipes, lowering barriers for adoption and advancement.
BugTraceAI Apex is a fully local 26B Mixture-of-Experts model fine-tuned via DPO for red teaming and bug hunting, trained on elite bug reports and evasion techniques. It runs on consumer GPUs via quantization.
This article details various jailbreak techniques for large language models, including Crescendo, role-playing, encoding, hidden prompts, and indirect injection, and offers security recommendations for developers.
Anthropic's Claude Fable 5 safety guardrails were bypassed within 48 hours using techniques like Unicode substitution and multi-turn decomposition, highlighting weaknesses in stateless classifiers and the need for continuous adversarial testing.
Blue41 disclosed an indirect prompt injection vulnerability in Bunq's AI assistant, where a small bank transfer with a malicious transaction description could turn the assistant into a spearphishing vector, highlighting a broader architectural challenge for financial AI agents.
This article warns about the Crescendo attack, a multi-turn prompt injection that evades single-message defenses by poisoning an AI agent's context over several turns. It introduces Bendex Arc, a tool that tracks behavioral trajectory across sessions to catch such attacks before they execute.
This paper demonstrates that allowing attackers to strategically choose when to attack (attack selection) in agentic AI control evaluations significantly reduces measured safety, suggesting that current evaluations may overestimate safety against selective attackers.
OBLITERATUS releases Gemma-4-12B-OBLITERATED, the first abliterated model achieving zero refusal without benchmark regression, using a novel two-pass surgery pipeline for alignment research.
Researchers from CUHK-Shenzhen introduce a jailbreak method that uses fanfiction subgenres from Archive of Our Own as attack carriers, embedding harmful content within creative writing scenes. Their method achieves a mean attack success rate (ASR) of 0.731 on eight aligned LLMs, and a multi-turn extension (Saga-A4) reaches an ASR of 0.924, outperforming existing methods.
This paper demonstrates that LLM safety vulnerabilities extend beyond 'shallow safety' (first-token alignment) to any point during generation, showing that short token injections mid-sequence can redirect models toward harmful outputs. The authors propose training on generation trajectories with simulated mid-sequence perturbations to improve robustness.
In a post on X, the head of Anthropic's Frontier Red Team predicts that current AI myths will be disproven within 6-12 months.
This paper presents a multi-domain red teaming framework for evaluating safety, robustness, and fairness of medical LLMs across 690 clinically grounded scenarios. Results show that high aggregate accuracy can mask critical failures, and hybrid evaluation with clinician oversight is necessary for credible safety assessment.
Researchers rapidly removed safety protections from widely deployed AI models, eliciting dangerous outputs and raising concerns about robustness and release practices.
This paper evaluates whether wrapping untrusted content in mock tool calls improves LLM robustness against adversarial inputs, finding it does not broadly help and sometimes increases attack success rates.