Tag
AdversaBench introduces an automated LLM red-teaming pipeline that uses five mutation operators and a three-judge panel with a meta-judge tiebreaker to confirm failures, revealing that attack difficulty varies by category and that adversarial prompts transfer from smaller to larger models.
A discussion of whether ML teams actually test for model security risks such as extraction and poisoning in production, noting that security review for models lags behind that for conventional software.
Personal lessons on evaluating AI agents in production, including mapping symptoms to layers, using trajectory evaluation, calibrating LLM judges, converting failures to test cases, and performing adversarial testing.
A practical guide outlining seven prioritized security layers for AI agents before production, including hardening system prompts, adversarial testing, input/output scanning, and multi-turn session tracking, based on findings that 73% of production AI deployments have prompt injection exposure.
An experimental arena where AI agents review each other's code reveals patterns such as a bimodal score distribution and harsher reviews of security code. The author shares findings from 561 reviews across 114 submissions.
This paper proposes a multi-method audit pipeline to evaluate how well frontier AI models follow their written behavioral specifications (Anthropic's constitution and OpenAI's Model Spec) under adversarial multi-turn pressure, finding that newer models show significantly lower violation rates (e.g., Claude Sonnet 4.6 at 2.0% vs. Sonnet 4 at 15.0%).
ScenePilot proposes a feasibility-guided, boundary-driven framework for generating safety-critical scenarios for autonomous driving, using constrained multi-objective reinforcement learning to produce physically valid yet failure-inducing scenarios.
This paper introduces LogiHard, a framework that uses combinatorial hardening to expose compositional failures in frontier LLMs, demonstrating significant accuracy drops in logical reasoning tasks.
OpenAI publishes a white paper detailing its approach to external red teaming for AI models, outlining methods for selecting diverse red team members, determining model access levels, providing testing infrastructure, and synthesizing feedback to improve AI safety and policy coverage.