GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives
Summary
This paper introduces GAMBIT, a benchmark for evaluating adversarial robustness in multi-agent LLM collectives, featuring adaptive imposters and recalibration modes to address the limitations of existing shallow evaluations.
# GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

Source: [https://arxiv.org/abs/2605.09027](https://arxiv.org/abs/2605.09027) [View PDF](https://arxiv.org/pdf/2605.09027)

> Abstract: In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: [this https URL](https://anonymous.4open.science/r/gambit).

## Submission history

From: Alexandre Le Mercier

**[v1]** Sat, 9 May 2026 16:07:23 UTC (3,912 KB)
Similar Articles
Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems
This paper introduces MAC-Bench, a dynamic adversarial benchmark for evaluating procedural compliance in multi-agent systems. It proposes the SERV pipeline to generate contamination-free scenarios and new metrics like Compliance-Weighted Success Rate (CSR) and Machiavellian Gap (MG).
Gate AI: LLM Security Benchmark Evaluation Methodology and Results
This paper presents an evaluation methodology for LLM security detectors that addresses systematic weaknesses like per-dataset threshold tuning and undisclosed operating points. The framework uses cross-validation across 16 benchmarks, selects a single global operating point, and includes multiple diagnostics for generalization.
AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators
This paper introduces AgentCollabBench, a diagnostic benchmark for multi-agent systems that evaluates behavioral risks like instruction decay and context leakage across four major LLMs. It argues that communication topology is a critical factor in multi-agent reliability, often overshadowing raw model capability.
Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking
This paper addresses the challenge of robust checkpoint selection for multimodal LLMs under evaluation uncertainty, proposing a multi-stage framework that integrates curated real-world data, LLM-based judgment, and ranking protocols with confidence estimation.
CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement
CollabBench is a new benchmark for evaluating and training LLM agents in cooperative games, featuring diverse player simulation and a collaborative training paradigm. Experiments show 19.5% higher efficiency and 24.4% improved affective performance over base models.