adversarial-testing

#adversarial-testing

AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability

arXiv cs.AI · 6d ago

AdversaBench introduces an automated LLM red-teaming pipeline that uses five mutation operators and a three-judge panel with a meta-judge tiebreaker to confirm failures, revealing that attack difficulty varies by category and that adversarial prompts transfer from smaller to larger models.

Are model security risks (extraction, poisoning) actually being tested in production? [R]

Reddit r/MachineLearning · 2026-06-23

A discussion of whether ML teams actually test model security risks such as extraction and poisoning in production, noting that security review for models lags behind that for conventional software.

things i wish i knew before evaluating AI agents in production

Reddit r/AI_Agents · 2026-06-16

Personal lessons on evaluating AI agents in production, including mapping symptoms to layers, using trajectory evaluation, calibrating LLM judges, converting failures to test cases, and performing adversarial testing.

7 layers of security every AI agent needs before going to production

Reddit r/artificial · 2026-06-15

A practical guide to seven prioritized security layers for AI agents before production, including system prompt hardening, adversarial testing, input/output scanning, and multi-turn session tracking. The guide is based on findings that 73% of production AI deployments are exposed to prompt injection.

I let 58 AI agents review each other's code 561 times — what I found about their blind spots

Reddit r/artificial · 2026-06-12

An experimental arena in which AI agents review each other's code reveals patterns such as a bimodal score distribution and harsher reviews of security code. The author shares findings from 561 reviews across 114 submissions.

How Well Do Models Follow Their Constitutions?

arXiv cs.AI · 2026-05-26

This paper proposes a multi-method audit pipeline to evaluate how well frontier AI models follow their written behavioral specifications (Anthropic's constitution and OpenAI's Model Spec) under adversarial multi-turn pressure, finding that newer models show significantly lower violation rates (e.g., Claude Sonnet 4.6 at 2.0% vs. Sonnet 4 at 15.0%).

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

arXiv cs.AI · 2026-05-22

ScenePilot proposes a feasibility-guided, boundary-driven framework for generating safety-critical scenarios for autonomous driving, using constrained multi-objective reinforcement learning to produce physically valid yet failure-inducing scenarios.

From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs

arXiv cs.CL · 2026-05-11

This paper introduces LogiHard, a framework that uses combinatorial hardening to expose compositional failures in frontier LLMs, demonstrating significant accuracy drops in logical reasoning tasks.

Advancing red teaming with people and AI

OpenAI Blog · 2024-11-21

OpenAI publishes a white paper detailing its approach to external red teaming for AI models. The paper outlines methods for selecting diverse red team members, determining model access levels, providing testing infrastructure, and synthesizing feedback to improve AI safety and policy coverage.
