red-teaming

#red-teaming

@OpenAI: GPT‑5.6 Sol launches with our most robust safety stack yet. We strengthened real-time protections against high-risk cyb…

X AI KOLs · yesterday

OpenAI launches GPT-5.6 Sol with enhanced safety features including real-time protections against high-risk cyber activity, human red teaming, and extensive GPU-hour testing.

A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation

arXiv cs.CL · 2d ago

This paper presents a red teaming framework for LLMs that uses a multi-role architecture to systematically uncover vulnerabilities, particularly in faithfulness. The framework demonstrated a 7.9% increase in attack success rate in QA tasks and highlights the impact of architectural choices over parameter scaling on model safety.

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

arXiv cs.AI · 3d ago

AdversaBench introduces an automated LLM red-teaming pipeline that uses five mutation operators and a three-judge panel with a meta-judge tiebreaker to confirm failures, revealing that attack difficulty varies by category and that adversarial prompts transfer from smaller to larger models.
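The confirmation step described in this summary can be pictured with a minimal sketch. Everything here is an assumption for illustration, not AdversaBench's actual code: the verdict labels ("fail", "pass", "unsure"), the function names, and the rule that the meta-judge is consulted only when no verdict wins a strict majority.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical label set; the paper's real verdict scheme is not given in the summary.
Verdict = str  # "fail", "pass", or "unsure"
Judge = Callable[[str], Verdict]
MetaJudge = Callable[[str, Sequence[Verdict]], Verdict]


def always(verdict: Verdict) -> Judge:
    """Build a stub judge that returns a fixed verdict (stands in for an LLM judge)."""
    return lambda _response: verdict


def confirm_failure(response: str, judges: Sequence[Judge], meta_judge: MetaJudge) -> bool:
    """Return True only when the panel confirms the response as a failure.

    A strict majority of the panel decides. If no verdict has a strict
    majority, the meta-judge sees the response and the votes and decides.
    """
    votes = [judge(response) for judge in judges]
    top_verdict, top_count = Counter(votes).most_common(1)[0]
    if top_count > len(votes) / 2:
        return top_verdict == "fail"
    return meta_judge(response, votes) == "fail"


if __name__ == "__main__":
    panel = [always("fail"), always("pass"), always("unsure")]
    # Three-way split, so the (stub) meta-judge breaks the tie.
    print(confirm_failure("model output", panel, lambda r, votes: "fail"))
```

Requiring agreement before counting a failure trades some recall for fewer false positives from any single judge, which is presumably the motivation for a multi-judge design.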

RIFT-Bench: Dynamic Red-teaming For Agentic AI Systems

arXiv cs.AI · 3d ago

RIFT-Bench is a new benchmark methodology for dynamically red-teaming agentic AI systems, using a graph representation to unify security evaluations across diverse architectures and enabling automated discovery and scanning of vulnerabilities.

Insights on Indirect Prompt Injection (12 minute read)

TLDR AI · 3d ago

Zico Kolter and Matt Fredrikson, leaders at Gray Swan and experts in AI security, discuss the state of AI red-teaming and indirect prompt injection, a critical vulnerability for AI agents. They explain why AI security requires a different mindset, how automated red-teaming can beat humans, and introduce tools like Shade for adversarial testing.

@Wing_VC: New listen: @GraySwanAI cofounders @zicokolter and Matt Fredrikson sat down with @swyx on @latentspacepod to unpack why…

X AI KOLs Following · 5d ago

Gray Swan AI cofounders Zico Kolter and Matt Fredrikson discuss on the Latent Space podcast why AI security is a distinct discipline, covering prompt injection, automated red teaming, and the emerging vulnerabilities from AI agents.

TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization

Hugging Face Daily Papers · 5d ago

TROPT is an open-source framework that unifies discrete text-trigger optimization, standardizing development and execution across domains like LLM jailbreaking and model interpretability. It includes over 15 optimizers and 30 recipes, lowering barriers for adoption and advancement.

@0x0SojalSec: A fully local 26B MoE model was built for red teaming and bug hunting. Trained on elite bug reports and real evasion ta…

X AI KOLs Timeline · 2026-06-17

BugTraceAI Apex is a fully local 26B Mixture-of-Experts model fine-tuned via DPO for red teaming and bug hunting, trained on elite bug reports and evasion techniques. It runs on consumer GPUs via quantization.

@wquguru: If you want to trick Fable into doing a security audit, try this. Looks like our AI overlord has a bit of empathy.

X AI KOLs Timeline · 2026-06-13

An article detailing various jailbreak techniques for large language models, including Crescendo, role-playing, encoding, hidden prompts, and indirect injection, along with security recommendations for developers.

Fable 5's guardrails got bypassed in 48 hours. Here's what that actually means for anyone building customer-facing AI.

Reddit r/artificial · 2026-06-12

Anthropic's Claude Fable 5 safety guardrails were bypassed within 48 hours using techniques like Unicode substitution and multi-turn decomposition, highlighting weaknesses in stateless classifiers and the need for continuous adversarial testing.
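To illustrate why character substitution can slip past a per-message text filter, here is a minimal sketch. It is an assumption-laden toy, not the classifier or the bypass described in the post: the blocklist, the function names, and the choice of fullwidth letters as the substitution are all hypothetical. It shows one common mitigation (Unicode NFKC normalization before matching) and one case that normalization does not catch.

```python
import unicodedata

# Hypothetical blocklist for illustration only.
BLOCKLIST = {"exploit"}


def naive_flag(text: str) -> bool:
    """Flag text by matching blocklist terms against the raw, lowercased string."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def normalized_flag(text: str) -> bool:
    """Apply NFKC normalization and case folding before matching.

    NFKC maps compatibility characters such as fullwidth Latin letters
    back to their ASCII forms, so those substitutions no longer hide a term.
    """
    folded = unicodedata.normalize("NFKC", text).casefold()
    return any(term in folded for term in BLOCKLIST)


# "exploit" written with fullwidth Latin letters (U+FF41 block).
FULLWIDTH = "\uff45\uff58\uff50\uff4c\uff4f\uff49\uff54"
# "exploit" with a Cyrillic "е" (U+0435) in place of the Latin "e".
CYRILLIC = "\u0435xploit"

if __name__ == "__main__":
    print(naive_flag(FULLWIDTH), normalized_flag(FULLWIDTH))
    print(normalized_flag(CYRILLIC))
```

Normalization handles compatibility forms but not cross-script lookalikes such as the Cyrillic example, and it does nothing about requests split across several turns, which is consistent with the post's call for continuous adversarial testing rather than a single fixed filter.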

A €0.01 bank transfer could compromise a banking AI agent

Hacker News Top · 2026-06-10

Blue41 disclosed an indirect prompt injection vulnerability in Bunq's AI assistant, where a small bank transfer with a malicious transaction description could turn the assistant into a spearphishing vector, highlighting a broader architectural challenge for financial AI agents.

Your AI agent just got hijacked. You have no idea it happened.

Reddit r/artificial · 2026-06-10

This article warns about the Crescendo attack, a multi-turn prompt injection that evades single-message defenses by poisoning an AI agent's context over several turns. It introduces Bendex Arc, a tool that tracks behavioral trajectory across sessions to catch such attacks before they execute.

Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety

arXiv cs.AI · 2026-06-08

This paper demonstrates that allowing attackers to strategically choose when to attack (attack selection) in agentic AI control evaluations significantly reduces measured safety, suggesting that current evaluations may overestimate safety against selective attackers.

OBLITERATUS/Gemma-4-12B-OBLITERATED

Hugging Face Models Trending · 2026-06-05

OBLITERATUS releases Gemma-4-12B-OBLITERATED, the first abliterated model achieving zero refusal without benchmark regression, using a novel two-pass surgery pipeline for alignment research.

Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

arXiv cs.CL · 2026-06-04

Researchers from CUHK-Shenzhen introduce a jailbreak method using fanfiction subgenres from Archive of Our Own as attack carriers, embedding harmful content within creative writing scenes. Their method achieves a mean attack success rate of 0.731 on eight aligned LLMs, with a multi-turn extension (Saga-A4) reaching 0.924 ASR, outperforming existing methods.

Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

arXiv cs.AI · 2026-06-04

This paper demonstrates that LLM safety vulnerabilities extend beyond 'shallow safety' (first-token alignment) to any point during generation, showing that short token injections mid-sequence can redirect models toward harmful outputs. The authors propose training on generation trajectories with simulated mid-sequence perturbations to improve robustness.

Head of the Frontier Red Team at Anthropic: Mythos will look dumb in 6-12 months.

Reddit r/ArtificialInteligence · 2026-06-03

Anthropic's Frontier Red Team head predicts, in a post on X, that Mythos will look dumb within 6-12 months.

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

arXiv cs.CL · 2026-06-02

This paper presents a multi-domain red teaming framework for evaluating safety, robustness, and fairness of medical LLMs across 690 clinically grounded scenarios. Results show that high aggregate accuracy can mask critical failures, and hybrid evaluation with clinician oversight is necessary for credible safety assessment.

AI guardrails stripped from Meta and Google models in minutes

Reddit r/ArtificialInteligence · 2026-06-01

Researchers rapidly removed safety protections from widely deployed AI models, eliciting dangerous outputs and raising concerns about robustness and release practices.

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

arXiv cs.CL · 2026-06-01

This paper evaluates whether wrapping untrusted content in mock tool calls improves LLM robustness against adversarial inputs, finding it does not broadly help and sometimes increases attack success rates.
