GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

arXiv cs.CL Papers

Summary

This paper introduces GAMBIT, a benchmark for evaluating imposter detectors in multi-agent LLM collectives. It pairs an adaptive, co-evolving imposter with three evaluation modes (two zero-shot modes and one few-shot recalibration mode) to expose weaknesses that existing shallow, static evaluations miss.


# GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives
Source: [https://arxiv.org/abs/2605.09027](https://arxiv.org/abs/2605.09027)
[View PDF](https://arxiv.org/pdf/2605.09027)

> Abstract: In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: [this https URL](https://anonymous.4open.science/r/gambit).
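
The contrast the abstract draws between zero-shot detection and few-shot recalibration can be illustrated with a toy example. The sketch below is not GAMBIT's scoring code or data: the threshold detector, the synthetic suspicion scores, and every function name (`f1_score`, `predict`, `recalibrate_threshold`, `synthetic_batch`) are hypothetical, chosen only to show how a detector tuned on an older attack can score poorly zero-shot on a novel one and then recover after adapting on 20 labeled examples.

```python
import random


def f1_score(y_true, y_pred):
    """F1 for the positive (imposter = 1) class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def predict(scores, threshold):
    """Flag an agent as an imposter when its suspicion score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def recalibrate_threshold(scores, labels):
    """Pick the observed score that maximizes F1 on the small labeled set."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_score(labels, predict(scores, t)))


def synthetic_batch(n, imposter_mean, rng, honest_mean=0.3, spread=0.05):
    """Hypothetical data: alternating honest/imposter labels with Gaussian scores."""
    labels = [i % 2 for i in range(n)]
    scores = [rng.gauss(imposter_mean if y else honest_mean, spread) for y in labels]
    return scores, labels


rng = random.Random(0)

# Assumption: the detector was tuned on an older attack whose scores sat near 0.9.
old_threshold = 0.8

# Assumption: the novel attack is stealthier, with imposter scores near 0.6.
few_scores, few_labels = synthetic_batch(20, 0.6, rng)      # 20 labeled examples
test_scores, test_labels = synthetic_batch(200, 0.6, rng)   # held-out evaluation set

f1_before = f1_score(test_labels, predict(test_scores, old_threshold))
new_threshold = recalibrate_threshold(few_scores, few_labels)
f1_after = f1_score(test_labels, predict(test_scores, new_threshold))

print(f"zero-shot F1: {f1_before:.3f}, recalibrated F1: {f1_after:.3f}")
```

Two detectors could share the same `f1_before` here and still differ widely in how much `f1_after` improves for a given number of labeled examples, which is the kind of gap the abstract says only the recalibration mode reveals.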

## Submission history

From: Alexandre Le Mercier \[[view email](https://arxiv.org/show-email/b985ade1/2605.09027)\] **\[v1\]** Sat, 9 May 2026 16:07:23 UTC (3,912 KB)

Similar Articles

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

arXiv cs.LG

This paper presents an evaluation methodology for LLM security detectors that addresses systematic weaknesses like per-dataset threshold tuning and undisclosed operating points. The framework uses cross-validation across 16 benchmarks, selects a single global operating point, and includes multiple diagnostics for generalization.

AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

arXiv cs.CL

This paper introduces AgentCollabBench, a diagnostic benchmark for multi-agent systems that evaluates behavioral risks like instruction decay and context leakage across four major LLMs. It argues that communication topology is a critical factor in multi-agent reliability, often overshadowing raw model capability.