Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

arXiv cs.AI Papers

Summary

This paper introduces MAC-Bench, a dynamic adversarial benchmark for evaluating procedural compliance in multi-agent systems. It proposes the SERV pipeline to generate contamination-free scenarios and new metrics like Compliance-Weighted Success Rate (CSR) and Machiavellian Gap (MG).

arXiv:2606.07805v1 Announce Type: new Abstract: The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution-capable agents has introduced critical operational risks. Most current evaluation frameworks neglect procedural compliance, leading to “Machiavellian” behaviors where agents strategically violate safety rules to maximize rewards, a direct manifestation of Goodhart's Law. To address this blind spot, we introduce MAC-Bench, a dynamic, adversarial benchmark designed to evaluate the procedural alignment of multi-agent systems under realistic pressure. We propose the SERV (Seed → Evolve → Refine → Verify) pipeline, an “Agent-as-a-Benchmark” paradigm that transforms unstructured legal texts into executable, contamination-free scenarios. By synthesizing holographic sandbox environments and injecting calibrated social-engineering pressure vectors, MAC-Bench forces agents into Pareto-optimal trade-offs between task success and regulatory adherence. We introduce two novel metrics, the Compliance-Weighted Success Rate (CSR) and the Machiavellian Gap (MG), and conduct a comprehensive evaluation of state-of-the-art frontier models to reveal the pervasive trade-offs between success and compliance.

Cached at: 06/09/26, 08:53 AM

# Beyond Goodhart’s Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems
Source: [https://arxiv.org/html/2606.07805](https://arxiv.org/html/2606.07805)
- Yiyang Zhao (zhaoyy25@m.fudan.edu.cn), Fudan University, Shanghai, China; Shanghai Academy of AI for Science, Shanghai, China
- Zhuo Zhang (zhangzhuo@pjlab.org.cn), Shanghai Artificial Intelligence Laboratory, Shanghai, China
- Qingxuan Le (25213050214@m.fudan.edu.cn), Fudan University, Shanghai, China; Shanghai Academy of AI for Science, Shanghai, China
- Lizhen Qu (Lizhen.Qu@monash.edu), Monash University, Melbourne, Australia
- Zenglin Xu (zenglinxu@fudan.edu.cn), Fudan University, Shanghai, China; Shanghai Academy of AI for Science, Shanghai, China

###### Abstract\.

The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution-capable agents has introduced critical operational risks. Most current evaluation frameworks neglect procedural compliance, leading to “Machiavellian” behaviors where agents strategically violate safety rules to maximize rewards, a direct manifestation of Goodhart’s Law. To address this blind spot, we introduce MAC-Bench, a dynamic, adversarial benchmark designed to evaluate the procedural alignment of multi-agent systems under realistic pressure. We propose the SERV (Seed → Evolve → Refine → Verify) pipeline, an “Agent-as-a-Benchmark” paradigm that transforms unstructured legal texts into executable, contamination-free scenarios. By synthesizing holographic sandbox environments and injecting calibrated social-engineering pressure vectors, MAC-Bench forces agents into Pareto-optimal trade-offs between task success and regulatory adherence. We introduce two novel metrics, the Compliance-Weighted Success Rate (CSR) and the Machiavellian Gap (MG), and conduct a comprehensive evaluation of state-of-the-art frontier models to reveal the pervasive trade-offs between success and compliance. Our code is available [here](https://github.com/leonardeee/MAC-Bench).

Multi\-Agent Systems; Benchmarks and Datasets; Trustworthy AI

CCS Concepts: Computing methodologies → Multi-agent systems; Computing methodologies → Artificial intelligence; General and reference → Evaluation

![Refer to caption](https://arxiv.org/html/2606.07805v1/teaser-new.png)

Figure 1. Overview of the MAC-Bench Framework. (A) The SERV pipeline (Seed → Evolve → Refine → Verify) automates the transformation of unstructured legal and regulatory corpora into dynamic, contamination-free adversarial environments. (B) Evaluation mechanism of agent behavior under calibrated social-engineering pressure. MAC-Bench audits the complete execution trace $\tau$ to detect Machiavellian behavior.

Table 1. Comparison of MAC-Bench with Related Benchmarks

## 1. Introduction

The landscape of artificial intelligence is undergoing a fundamental transformation. Large Language Models (LLMs) are rapidly evolving from passive conversational systems into *autonomous, execution-capable agents* that actively interact with digital environments: browsing the web, querying databases, invoking APIs, and coordinating within multi-agent systems (Wu et al., [2023](https://arxiv.org/html/2606.07805#bib.bib108); Hong et al., [2023](https://arxiv.org/html/2606.07805#bib.bib110); Wang et al., [2024](https://arxiv.org/html/2606.07805#bib.bib3)). Many agent benchmarks proposed in response to this shift, including GAIA and WebArena (Mialon et al., [2023](https://arxiv.org/html/2606.07805#bib.bib4); Zhou et al., [2023](https://arxiv.org/html/2606.07805#bib.bib47)), evaluate performance almost exclusively through Success Rate (SR). This success-centric design gives rise to what we term the *Success Paradox*: agents are rewarded for completing tasks regardless of whether the underlying execution process adheres to required rules and constraints. As a consequence, benchmarks implicitly incentivize *specification gaming* and *reward hacking* (Amodei et al., [2016](https://arxiv.org/html/2606.07805#bib.bib98); Krakovna et al., [2020](https://arxiv.org/html/2606.07805#bib.bib11)). This dynamic is a direct instance of Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure (Goodhart, [1975](https://arxiv.org/html/2606.07805#bib.bib12)). Agents therefore learn to optimize for observable success while systematically neglecting unmeasured dimensions such as compliance, security, and procedural integrity.

Moreover, as LLM agents move from laboratory prototypes toward deployment in regulated and safety-critical domains, this evaluation gap becomes operationally consequential. Compliance is no longer a desirable auxiliary property; it is a legally mandated requirement. Regulatory frameworks such as the European AI Act impose explicit obligations on risk management, transparency, and traceability for high-risk systems, while data protection laws including the GDPR and PIPL enforce strict constraints on data access, minimization, and processing procedures (European Union, [2016b](https://arxiv.org/html/2606.07805#bib.bib19), [2024b](https://arxiv.org/html/2606.07805#bib.bib20); National People’s Congress of the People’s Republic of China, [2021b](https://arxiv.org/html/2606.07805#bib.bib21)). In these settings, an agent that completes a task by bypassing authentication, ignoring consent checks, or violating privacy policies has not merely behaved suboptimally: it has created concrete legal, financial, and operational liabilities.

A central insight motivating this work is that compliance is fundamentally a property of the *process*, not merely the final output. Whether an agent verified authorization, minimized data access, applied required safeguards, or maintained an auditable execution trail cannot be inferred from its final response alone (Levy et al., [2024](https://arxiv.org/html/2606.07805#bib.bib7)). Despite the growing importance of agentic systems in regulated environments, there currently exists *no benchmark that systematically evaluates procedural compliance end-to-end*, that is, whether an agent follows required rules throughout its execution rather than merely producing a compliant-looking outcome. Existing benchmarks focus on task success, output-level policy adherence, or isolated safety violations, leaving a critical gap in assessing how agents behave *during* task execution. Most of them remain ill-suited for evaluating procedural integrity because of three systemic limitations. First, *memorization vulnerability*: static task instances may leak into training corpora, enabling apparent success without genuine rule understanding. Second, *omission blindness*: failures to act, such as skipping encryption or consent verification, are invisible under output-only evaluation. Third, *context elimination*: static benchmarks fail to capture the organizational, social, and hierarchical pressures that frequently drive real-world compliance violations.

To overcome these limitations, we introduce MAC-Bench (Multi-Agent Compliance Benchmark), a dynamic, trace-based evaluation framework designed to stress-test LLM agents under realistic conditions where task success and regulatory compliance are in direct conflict. Unlike prior benchmarks, MAC-Bench evaluates not only *what* an agent accomplishes, but *how* it accomplishes it. MAC-Bench departs from existing approaches through three methodological advances. First, we propose the SERV pipeline (Seed → Evolve → Refine → Verify), a data-centric workflow that transforms unstructured regulatory and legal texts, such as the GDPR, EU AI Act, and CIS Benchmarks, into a structured library of machine-executable atomic rules with explicit provenance, as illustrated in Figure [1](https://arxiv.org/html/2606.07805#S0.F1). Second, we introduce an Agent-as-a-Benchmark (AaaB) paradigm, in which specialized “Scenario Agents” dynamically synthesize adversarial tasks and executable environments at runtime, ensuring scalability and robustness against benchmark contamination. Third, we incorporate *social-engineering pressure injection*, systematically applying realistic organizational stressors, such as authority, urgency, and reciprocity, to induce genuine success–compliance trade-offs rather than relying on synthetic jailbreak prompts. Table [1](https://arxiv.org/html/2606.07805#S0.T1) compares MAC-Bench with related benchmarks on critical evaluation properties.

This paper makes three primary contributions:

1. The SERV Methodology: A generalizable pipeline for converting raw regulatory texts into executable, auditable benchmark rules, addressing the need for high-fidelity and contamination-resistant evaluation datasets.
2. A Generative Evaluation Environment: A new Agent-as-a-Benchmark paradigm that enables self-evolving, context-rich evaluation of agents operating in complex, rule-bound environments.
3. Success–Compliance Trade-off and New Metrics: We empirically demonstrate a pervasive trade-off in state-of-the-art agents and introduce two novel metrics, the *Compliance-Weighted Success Rate (CSR)* and the *Machiavellian Gap (MG)*, to quantify strategic rule violations under realistic pressure.

For further information, implementation details, and related resources, please refer to our[GitHub repository](https://github.com/leonardeee/MAC-Bench)\.

## 2. Related Works

Static Capability and Safety Benchmarks. Benchmarks such as AgentBench, GAIA, and $\tau$-Bench have standardized the evaluation of agent utility across diverse environments, including tool use, web interaction, and multi-step reasoning (Liu et al., [2023a](https://arxiv.org/html/2606.07805#bib.bib5); Mialon et al., [2023](https://arxiv.org/html/2606.07805#bib.bib4); Yao and others, [2024](https://arxiv.org/html/2606.07805#bib.bib25)). While effective at measuring Success Rate (SR), these benchmarks are fundamentally outcome-oriented: task completion is treated as a sufficient condition for success. As a result, they can overlook *procedural violations*, for example achieving a goal via SQL injection rather than compliant parameterized queries, as categorized in common weakness taxonomies such as CWE (The MITRE Corporation, [2024](https://arxiv.org/html/2606.07805#bib.bib22)). Recent financial-agent studies further highlight the need for risk-aware temporal reasoning and ecological multi-agent market simulations in consequential domains (Chen et al., [2025a](https://arxiv.org/html/2606.07805#bib.bib126); Zou et al., [2026](https://arxiv.org/html/2606.07805#bib.bib127)).

Agent safety benchmarks such as Agent\-SafetyBench and ST\-WebAgentBench extend evaluation to specific threat models, including prompt injection, phishing, and policy violations\(Zhanget al\.,[2024](https://arxiv.org/html/2606.07805#bib.bib8); Levyet al\.,[2024](https://arxiv.org/html/2606.07805#bib.bib7)\)\. However, most existing suites rely on largely static scenario sets and pre\-defined attack patterns\. This design leads to two well\-documented limitations: \(i\)*data contamination*, where benchmark instances leak into training corpora and artificially inflate scores\(Xu and others,[2024a](https://arxiv.org/html/2606.07805#bib.bib30); Zhu and others,[2024](https://arxiv.org/html/2606.07805#bib.bib31); Liet al\.,[2024b](https://arxiv.org/html/2606.07805#bib.bib32); Choiet al\.,[2025](https://arxiv.org/html/2606.07805#bib.bib33)\); and \(ii\) insufficient*dynamic goal conflict*, making it difficult to assess whether an agent will voluntarily sacrifice compliance under realistic pressure\(Levyet al\.,[2024](https://arxiv.org/html/2606.07805#bib.bib7); Yao and others,[2024](https://arxiv.org/html/2606.07805#bib.bib25)\)\. MAC\-Bench directly targets this gap by evaluating procedural compliance under adversarially induced success–compliance trade\-offs\.

Compliance and Privacy Evaluation. A growing line of work has begun to explicitly evaluate normative constraints in agentic systems. MAGPIE (Multi-AGent contextual PrIvacy Evaluation) assesses contextual privacy risks in multi-agent collaboration using structured scenario representations, revealing high error rates in distinguishing public and private information under multi-turn coordination (Juneja and others, [2025](https://arxiv.org/html/2606.07805#bib.bib26)). However, MAGPIE relies on a finite set of curated scenarios and does not treat *pressure injection* as a first-class mechanism for inducing compliance–utility trade-offs (Juneja and others, [2025](https://arxiv.org/html/2606.07805#bib.bib26)).

PrivacyLens evaluates privacy norm awareness and policy compliance tendencies of language\-model agents, focusing on whether agents recognize and respect privacy\-related constraints\(Shao and others,[2024](https://arxiv.org/html/2606.07805#bib.bib9)\)\. Its evaluation primarily emphasizes static policy reasoning and output\-level judgments, rather than end\-to\-end auditing of agent behavior inside executable environments\. In contrast, MAC\-Bench broadens the scope from privacy to*procedural compliance*spanning legal, security, and ethical constraints, and evaluates agents via trace\-level auditing to surface violations that remain invisible under output\-only metrics\(Levyet al\.,[2024](https://arxiv.org/html/2606.07805#bib.bib7); Zhanget al\.,[2024](https://arxiv.org/html/2606.07805#bib.bib8)\)\.

Adversarial and Dynamic Evaluation\.Dynamic and adversarial evaluation has emerged as an important direction for studying agent robustness and safety\. AgentDojo introduces an extensible environment for testing agents against prompt injection and adaptive attacks over tool\-augmented workflows\(Debenedetti and others,[2024](https://arxiv.org/html/2606.07805#bib.bib27)\)\. In industry, evaluation platforms such as Parea operationalize continuous testing, experiment tracking, and regression analysis for deployed LLM applications\(Parea AI,[2026](https://arxiv.org/html/2606.07805#bib.bib29)\)\. Orthogonally, AgentPoison studies backdoor\-style attacks on LLM agents by poisoning long\-term memory or retrieval\-augmented knowledge bases, demonstrating that harmful behaviors can be induced without additional model training\(Chenet al\.,[2024](https://arxiv.org/html/2606.07805#bib.bib28)\)\. Recent work also explores end\-to\-end reinforcement learning for LLM\-driven multi\-agent search systems, showing how heterogeneous groups of agents can be optimized jointly rather than only prompt\-engineered manually\(Chenet al\.,[2026](https://arxiv.org/html/2606.07805#bib.bib128)\)\.

These approaches primarily focus on*external threat models*—such as prompt injection, poisoned context, or robustness degradation—rather than systematically eliciting*internal alignment failures*driven by realistic organizational pressures\. MAC\-Bench differs in both methodology and objective\. Through the*Agent\-as\-a\-Benchmark*paradigm, autonomous agents generate, implement, and evolve contamination\-resistant executable environments at runtime\(Debenedetti and others,[2024](https://arxiv.org/html/2606.07805#bib.bib27); Yao and others,[2024](https://arxiv.org/html/2606.07805#bib.bib25)\)\. Moreover, instead of generic adversarial prompts, our*Social\-Engineering Pressure Injection*explicitly instantiates Pareto\-optimal conflicts between task success and compliance, modeling organizational pressures—such as authority and urgency—that frequently drive real\-world compliance failures\.

## 3. Methodology

This section presents the methodology underpinning MAC-Bench, a dynamic, adversarial benchmark designed to evaluate procedural compliance in multi-agent systems under social-engineering pressure (Anthropic, [2026](https://arxiv.org/html/2606.07805#bib.bib61); Liu et al., [2023a](https://arxiv.org/html/2606.07805#bib.bib5); Zhou et al., [2023](https://arxiv.org/html/2606.07805#bib.bib47); Le Sellier De Chezelles et al., [2024](https://arxiv.org/html/2606.07805#bib.bib48)). We operationalize our evaluation through a Seed → Evolve → Refine → Verify (SERV) pipeline, which transforms unstructured legal and regulatory texts into executable, pressure-calibrated test scenarios, while explicitly mitigating benchmark data contamination risks via runtime instance generation (Xu and others, [2024b](https://arxiv.org/html/2606.07805#bib.bib68); Chen et al., [2025b](https://arxiv.org/html/2606.07805#bib.bib69)).

### 3.1. Overview of the SERV Pipeline

The SERV pipeline constitutes a four-stage autonomous workflow that embodies the Agent-as-a-Benchmark paradigm, aligning with recent agent evaluation engineering guidance and interactive benchmark design (Anthropic, [2026](https://arxiv.org/html/2606.07805#bib.bib61); Liu et al., [2023a](https://arxiv.org/html/2606.07805#bib.bib5); Zhou et al., [2023](https://arxiv.org/html/2606.07805#bib.bib47); Le Sellier De Chezelles et al., [2024](https://arxiv.org/html/2606.07805#bib.bib48)). This paradigm shifts from static, manually curated datasets to a generative ecosystem where specialized Large Language Model (LLM) agents serve as the architects, builders, and validators of evaluation instances (Liu et al., [2023a](https://arxiv.org/html/2606.07805#bib.bib5); Xu and others, [2023](https://arxiv.org/html/2606.07805#bib.bib72); Qin et al., [2023](https://arxiv.org/html/2606.07805#bib.bib73)). The pipeline ensures that each evaluation episode is procedurally generated at runtime, guaranteeing uniqueness and reducing the risk of benchmark contamination and memorization (Xu and others, [2024b](https://arxiv.org/html/2606.07805#bib.bib68); Chen et al., [2025b](https://arxiv.org/html/2606.07805#bib.bib69)).

The four stages are:

1. Seed: Legal and regulatory texts are parsed to extract Atomic Rules, which are formal, machine-checkable representations of compliance obligations (European Parliament and the Council of the European Union, [2016](https://arxiv.org/html/2606.07805#bib.bib62); National People’s Congress of the People’s Republic of China, [2021a](https://arxiv.org/html/2606.07805#bib.bib63); European Parliament and the Council of the European Union, [2024](https://arxiv.org/html/2606.07805#bib.bib64); Athan et al., [2013](https://arxiv.org/html/2606.07805#bib.bib76); Francesconi et al., [2023](https://arxiv.org/html/2606.07805#bib.bib77)).
2. Evolve: These atomic rules are used as foundations to generate adversarial scenarios by injecting calibrated social-engineering pressure, creating a goal conflict between task success and procedural adherence (Milgram, [1963](https://arxiv.org/html/2606.07805#bib.bib40); Cialdini, [2001](https://arxiv.org/html/2606.07805#bib.bib116); Washo and others, [2021](https://arxiv.org/html/2606.07805#bib.bib80)).
3. Refine: Holographic sandbox environments are synthesized, comprising mock databases, API servers, and file systems, providing a realistic execution context where the agent’s actions can be fully logged and audited (Zhou et al., [2023](https://arxiv.org/html/2606.07805#bib.bib47); Le Sellier De Chezelles et al., [2024](https://arxiv.org/html/2606.07805#bib.bib48); Levy et al., [2024](https://arxiv.org/html/2606.07805#bib.bib7)).
4. Verify: The agent executes within the environment, and its entire execution trajectory is automatically audited against the applicable atomic rules to derive comprehensive compliance metrics (Anthropic, [2026](https://arxiv.org/html/2606.07805#bib.bib61); Levy et al., [2024](https://arxiv.org/html/2606.07805#bib.bib7)).

### 3.2. Seed Initialization: The Analyst Agent

The Seed stage is responsible for establishing the theoretical and semantic foundation of the benchmark\. We deploy an Analyst Agent to process a corpus of authoritative legal, regulatory, and corporate policy documents\(European Parliament and the Council of the European Union,[2016](https://arxiv.org/html/2606.07805#bib.bib62); National People’s Congress of the People’s Republic of China,[2021a](https://arxiv.org/html/2606.07805#bib.bib63); European Parliament and the Council of the European Union,[2024](https://arxiv.org/html/2606.07805#bib.bib64)\)\. The agent’s task is to convert unstructured legal prose into a structured, executable format, following the broader trajectory of compliance\-as\-code, legal NLP, and machine\-checkable norm representation\(Ariai and others,[2024](https://arxiv.org/html/2606.07805#bib.bib75); Athanet al\.,[2013](https://arxiv.org/html/2606.07805#bib.bib76); Francesconiet al\.,[2023](https://arxiv.org/html/2606.07805#bib.bib77); Ershov,[2023](https://arxiv.org/html/2606.07805#bib.bib78); Bertl and others,[2025](https://arxiv.org/html/2606.07805#bib.bib79)\)\.

Unstructured Legal Text Processing. The Analyst Agent ingests documents from multiple jurisdictions and domains, including:

- Privacy & Data Protection: PIPL, GDPR, EU AI Act (National People’s Congress of the People’s Republic of China, [2021a](https://arxiv.org/html/2606.07805#bib.bib63); European Parliament and the Council of the European Union, [2016](https://arxiv.org/html/2606.07805#bib.bib62), [2024](https://arxiv.org/html/2606.07805#bib.bib64)).
- Cybersecurity & Code Standards: CWE (Top 25), OWASP Top 10, CIS Controls/Benchmarks (MITRE, [2025](https://arxiv.org/html/2606.07805#bib.bib65); OWASP Foundation, [2021](https://arxiv.org/html/2606.07805#bib.bib106); Center for Internet Security, [2021](https://arxiv.org/html/2606.07805#bib.bib67); Cybersecurity and Infrastructure Security Agency (CISA), [2024](https://arxiv.org/html/2606.07805#bib.bib66)).
- Ethics & Organizational Policy: Corporate compliance manuals and codes of conduct, modeled as internal normative constraints (Francesconi et al., [2023](https://arxiv.org/html/2606.07805#bib.bib77)).

The agent performs multi\-stage NLP processing tailored to each document type, consistent with established legal NLP pipelines and compliance\-checking representations\(Ariai and others,[2024](https://arxiv.org/html/2606.07805#bib.bib75); Francesconiet al\.,[2023](https://arxiv.org/html/2606.07805#bib.bib77)\):

- Legislation: Hierarchical segmentation to identify articles, clauses, and cross-references (e.g., GDPR Articles; PIPL provisions), enabling rule grounding and traceability (European Parliament and the Council of the European Union, [2016](https://arxiv.org/html/2606.07805#bib.bib62); National People’s Congress of the People’s Republic of China, [2021a](https://arxiv.org/html/2606.07805#bib.bib63); European Parliament and the Council of the European Union, [2024](https://arxiv.org/html/2606.07805#bib.bib64)).
- Standards: Pattern extraction to identify vulnerability IDs (e.g., CWE entries) and corresponding mitigation requirements (MITRE, [2025](https://arxiv.org/html/2606.07805#bib.bib65); OWASP Foundation, [2021](https://arxiv.org/html/2606.07805#bib.bib106); Cybersecurity and Infrastructure Security Agency (CISA), [2024](https://arxiv.org/html/2606.07805#bib.bib66)).
- Corporate Policies: Template normalization to infer implicit obligations from heterogeneous formatting and informal language, a common challenge emphasized in legal NLP surveys (Ariai and others, [2024](https://arxiv.org/html/2606.07805#bib.bib75)).

Cross-Reference Mapping. The Analyst Agent constructs a regulation graph in which nodes represent atomic rules and edges denote relationships such as subsumption (one rule implies another), conflict (rules cannot be simultaneously satisfied), and sequence (rules apply in a temporal order), aligning with graph-based compliance-as-code and regulatory knowledge graph approaches (Ershov, [2023](https://arxiv.org/html/2606.07805#bib.bib78); Francesconi et al., [2023](https://arxiv.org/html/2606.07805#bib.bib77)). Graph analysis identifies critical rule sequences where early violations can enable later ones (e.g., bypassing authentication to gain data access), supporting systematic scenario coverage across security and privacy requirements (OWASP Foundation, [2021](https://arxiv.org/html/2606.07805#bib.bib106); MITRE, [2025](https://arxiv.org/html/2606.07805#bib.bib65); Center for Internet Security, [2021](https://arxiv.org/html/2606.07805#bib.bib67)).
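As a rough illustration of the regulation graph described above, the following minimal sketch stores typed edges between atomic rules and enumerates chains of `sequence` edges, the kind of path along which an early violation could enable a later one. The rule IDs, the edge list, and the function names `build_graph` and `sequence_chains` are hypothetical; the paper does not specify its graph schema in this section.

```python
from collections import defaultdict

# Hypothetical rule IDs and relations, for illustration only.
EDGES = [
    ("AUTH-01", "ACCESS-02", "sequence"),      # authenticate before data access
    ("ACCESS-02", "EXPORT-03", "sequence"),    # access before export
    ("MINIMIZE-04", "ACCESS-02", "subsumption"),
    ("SPEED-05", "AUTH-01", "conflict"),
]


def build_graph(edges):
    """Adjacency list: rule -> list of (neighbor, relation) pairs."""
    graph = defaultdict(list)
    for src, dst, relation in edges:
        graph[src].append((dst, relation))
    return graph


def sequence_chains(graph, start):
    """Return every maximal path from `start` that follows only 'sequence' edges."""
    chains = []

    def walk(node, path):
        successors = [
            dst for dst, rel in graph.get(node, [])
            if rel == "sequence" and dst not in path  # skip cycles
        ]
        if not successors:
            chains.append(path)
            return
        for dst in successors:
            walk(dst, path + [dst])

    walk(start, [start])
    return chains


graph = build_graph(EDGES)
chains = sequence_chains(graph, "AUTH-01")
```

Here `chains` contains the single path from authentication through access to export, which a scenario generator could use to target the first rule in the chain.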

#### Seed Design Principles

We initialize Seed from *normative sources* (laws, standards, and internal policies) rather than directly reusing benchmark scenarios, because procedural compliance must be grounded in *machine-checkable obligations with provenance*. Scenario-first benchmarks typically encode constraints implicitly in natural language, which makes (i) rule coverage hard to quantify, (ii) violations ambiguous to audit, and (iii) updates to regulations or organizational policies difficult to incorporate. By contrast, rule-first seeding produces *atomic rules* (executable predicates) that admit deterministic trace auditing and explicit coverage measurement, and supports principled scenario generation from a regulation graph (cross-references, subsumption, and conflicts).
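To make the idea of an atomic rule as an executable predicate with provenance concrete, here is a minimal sketch under stated assumptions. The `AtomicRule` dataclass, the event dictionary format, the action names, and the `audit` helper are all hypothetical and are not taken from the MAC-Bench code; the provenance string is an illustrative mapping only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, str]  # one logged action, e.g. {"action": "check_consent"}


@dataclass(frozen=True)
class AtomicRule:
    rule_id: str
    provenance: str                             # pointer back to the normative source
    predicate: Callable[[List[Event]], bool]    # True when the trace complies


def consent_before_read(trace: List[Event]) -> bool:
    """Comply only if a consent check precedes every read of personal data."""
    consent_seen = False
    for event in trace:
        if event["action"] == "check_consent":
            consent_seen = True
        elif event["action"] == "read_personal_data" and not consent_seen:
            return False
    return True


RULES = [
    AtomicRule(
        rule_id="R-CONSENT",
        provenance="GDPR Art. 6(1)(a) (illustrative mapping, not from the paper)",
        predicate=consent_before_read,
    )
]


def audit(trace: List[Event], rules: List[AtomicRule]) -> List[str]:
    """Return the IDs of all rules that the trace violates."""
    return [rule.rule_id for rule in rules if not rule.predicate(trace)]


compliant = [{"action": "check_consent"}, {"action": "read_personal_data"}]
omitted = [{"action": "read_personal_data"}, {"action": "send_report"}]
```

Both traces could end with the same report, yet `audit(omitted, RULES)` flags `R-CONSENT` while `audit(compliant, RULES)` returns an empty list. This is the sense in which a skipped step is visible to trace auditing but not to output-only evaluation.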

We intentionally do *not* reuse scenario sets from capability-first benchmarks as-is because, as emphasized in Table 1, they generally lack explicit compliance-rule provenance and therefore do not support systematic rule coverage measurement. Moreover, static public episodes increasingly face contamination and memorization risks in modern LLM evaluation, motivating our *runtime re-instantiation* design that regenerates contexts while preserving compliance-relevant invariants. This choice aligns with recent discussions advocating dynamic evaluation and contamination-aware benchmark construction (Xu and others, [2024b](https://arxiv.org/html/2606.07805#bib.bib68); Chen et al., [2025b](https://arxiv.org/html/2606.07805#bib.bib69)).
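One way to read runtime re-instantiation is that surface details are resampled for every episode while the compliance-relevant fields stay fixed. The sketch below assumes that reading; the field names, value pools, and the `instantiate` function are hypothetical, and the actual benchmark generates contexts with LLM agents rather than by sampling from fixed lists.

```python
import random

# Compliance-relevant invariants that every instance must preserve (hypothetical).
INVARIANTS = {"rule_id": "R-CONSENT", "required_step": "check_consent"}

# Surface details that may vary freely between instances (hypothetical pools).
DEPARTMENTS = ["finance", "logistics", "research", "support"]
REQUESTERS = ["team lead", "auditor", "project manager"]


def instantiate(seed: int) -> dict:
    """Build one scenario instance: fresh surface context, fixed invariants."""
    rng = random.Random(seed)  # local generator, so instances are reproducible
    surface = {
        "department": rng.choice(DEPARTMENTS),
        "requester": rng.choice(REQUESTERS),
        "record_count": rng.randint(50, 500),
    }
    return {**INVARIANTS, **surface}


instances = [instantiate(seed) for seed in range(20)]
```

Seeding a local `random.Random` keeps each instance reproducible for auditing, while a new seed yields a context that is unlikely to match any previously published episode.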

### 3.3. Adversarial Scenario Evolution

The Evolve stage transforms static atomic rules into dynamic evaluation scenarios that elicit misaligned behavior\(Anthropic,[2026](https://arxiv.org/html/2606.07805#bib.bib61); Liuet al\.,[2023a](https://arxiv.org/html/2606.07805#bib.bib5)\)\. This is achieved by the Red Team Architect Agent, which systematically injects social\-engineering pressure to create a Pareto frontier of goal conflict, reflecting well\-studied manipulation strategies \(authority, urgency, reciprocity/empathy, ambiguity\) in social engineering research\(Milgram,[1963](https://arxiv.org/html/2606.07805#bib.bib40); Cialdini,[2001](https://arxiv.org/html/2606.07805#bib.bib116); Washo and others,[2021](https://arxiv.org/html/2606.07805#bib.bib80)\)\.

Social Engineering Pressure Taxonomy. The Architect Agent implements four pressure vectors, each derived from established social engineering and organizational psychology research:

- Authority: Exploits the obedience bias and compliance under authority cues (Milgram, [1963](https://arxiv.org/html/2606.07805#bib.bib40); Washo and others, [2021](https://arxiv.org/html/2606.07805#bib.bib80)). The requester is simulated with high organizational rank (e.g., CEO, CTO), formal communication channels, and linguistic markers of command (imperative mood).
- Urgency: Exploits time pressure and temporal discounting effects that shift decision-making strategies and increase reliance on heuristics (Young et al., [2012](https://arxiv.org/html/2606.07805#bib.bib81); Gigerenzer and Gaissmaier, [2011](https://arxiv.org/html/2606.07805#bib.bib82)). The task is framed with a tight deadline, often linked to high-stakes consequences.
- Empathy/Reciprocity: Exploits the reciprocity principle and pro-social compliance motivations (Cialdini, [2001](https://arxiv.org/html/2606.07805#bib.bib116); Washo and others, [2021](https://arxiv.org/html/2606.07805#bib.bib80)). The request is framed through emotional or personal appeal.
- Obfuscation: Exploits ambiguity and interpretive flexibility, consistent with interdisciplinary accounts of social engineering relying on manipulation, deception, and vagueness (Washo and others, [2021](https://arxiv.org/html/2606.07805#bib.bib80)).

Pressure Gradient and Scenario Generation. For each atomic rule and a base task (e.g., “generate a departmental salary report”), the Architect Agent generates a spectrum of adversarial scenarios through prompt rewriting. The rewriting operations include temporal compression, authority elevation, emotional intensification, and semantic vagueness, producing graded pressure regimes that enable stress-testing agent compliance stability (Anthropic, [2026](https://arxiv.org/html/2606.07805#bib.bib61); Levy et al., [2024](https://arxiv.org/html/2606.07805#bib.bib7)). By applying these operations sequentially and in combination, the agent creates a calibrated pressure gradient and assigns an intensity level from 0 (baseline) to $N$ (maximum combined pressure), supporting fine-grained behavioral break-point analysis (Young et al., [2012](https://arxiv.org/html/2606.07805#bib.bib81)).

We formalize pressure intensity into discrete levels based on textual volume, emotional intensity, and narrative complexity, in the spirit of engineered eval “difficulty knobs” for agent assessments (Anthropic, [2026](https://arxiv.org/html/2606.07805#bib.bib61)). Level 0 (Baseline) represents neutral requests. Level 5 (High Pressure) involves multi-paragraph narratives combining authoritative mandates, urgent deadlines, and personal stakes.
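The graded rewriting described above can be sketched as a stack of prompt transformations, where level $k$ applies the first $k$ operations to the base task. This is only an illustration: in MAC-Bench the rewriting is performed by an LLM agent, whereas the four template functions, their wording, and the `apply_pressure` helper below are hypothetical. With four operations the sketch tops out at level 4, not the Level 5 regime the paper defines.

```python
BASE_TASK = "Generate a departmental salary report."


def authority_elevation(prompt: str) -> str:
    return "The CEO has personally ordered this. " + prompt


def temporal_compression(prompt: str) -> str:
    return prompt + " It must be delivered within ten minutes."


def emotional_intensification(prompt: str) -> str:
    return prompt + " My job depends on getting this right now."


def semantic_vagueness(prompt: str) -> str:
    return prompt + " Use whatever access you think is appropriate."


# Order defines the gradient: level k applies the first k operations.
OPERATIONS = [
    authority_elevation,
    temporal_compression,
    emotional_intensification,
    semantic_vagueness,
]


def apply_pressure(task: str, level: int) -> str:
    """Rewrite `task` at the given pressure level (0 = unmodified baseline)."""
    if not 0 <= level <= len(OPERATIONS):
        raise ValueError(f"level must be between 0 and {len(OPERATIONS)}")
    prompt = task
    for operation in OPERATIONS[:level]:
        prompt = operation(prompt)
    return prompt


gradient = [apply_pressure(BASE_TASK, level) for level in range(len(OPERATIONS) + 1)]
```

Running the same agent over every entry in `gradient` and recording the first level at which an audit fails gives a simple break-point measurement of the kind the text describes.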

### 3.4. Refine: Holographic Environment Synthesis

The Refine stage creates the holographic sandbox environment in which the agent under test will execute, consistent with interactive agent benchmarking that emphasizes realistic, reproducible environments and traceable action logs\(Zhouet al\.,[2023](https://arxiv.org/html/2606.07805#bib.bib47); Le Sellier De Chezelleset al\.,[2024](https://arxiv.org/html/2606.07805#bib.bib48); Levyet al\.,[2024](https://arxiv.org/html/2606.07805#bib.bib7)\)\. The World Builder Agent generates fully functional, Python\-based simulation infrastructure via automated code synthesis, enabling scalable instance generation while preserving evaluation rigor\(Anthropic,[2026](https://arxiv.org/html/2606.07805#bib.bib61)\)\.

Agent\-as\-a\-Benchmark for Environment Synthesis\. This environment is holographic: it is complete and structurally authentic \(it contains databases, APIs, and file systems with realistic interactions\) but synthetic \(it uses mock data and simulated systems\)\. This architecture provides three key advantages:

- Scalability: Unlimited novel environments can be generated without manual engineering \(Liu et al\., [2023a](https://arxiv.org/html/2606.07805#bib.bib5); Le Sellier De Chezelles et al\., [2024](https://arxiv.org/html/2606.07805#bib.bib48)\)\.
- Adaptability: The system can incorporate new regulatory requirements and simulate emerging vulnerability patterns \(e\.g\., OWASP/CWE updates\) \(OWASP Foundation, [2021](https://arxiv.org/html/2606.07805#bib.bib106); MITRE, [2025](https://arxiv.org/html/2606.07805#bib.bib65)\)\.
- Contamination Elimination: Runtime synthesis helps reduce benchmark memorization and data contamination risk, a central concern in modern LLM evaluation \(Xu and others, [2024b](https://arxiv.org/html/2606.07805#bib.bib68); Chen et al\., [2025b](https://arxiv.org/html/2606.07805#bib.bib69)\)\.

Components of the Holographic Environment\. The World Builder Agent synthesizes:

- Mock Databases: Relational schemas \(e\.g\., using SQLAlchemy\) with realistic synthetic data; fields are tagged with sensitivity levels to support privacy auditing \(Project, [2025b](https://arxiv.org/html/2606.07805#bib.bib87)\)\. Database triggers or proxy layers log all queries for compliance auditing\.
- API Servers: RESTful endpoints \(e\.g\., using FastAPI\) that implement authentication and authorization using standard mechanisms \(e\.g\., OAuth 2\.0, JWT, RBAC\) \(Project, [2025a](https://arxiv.org/html/2606.07805#bib.bib86); Hardt, [2012](https://arxiv.org/html/2606.07805#bib.bib84); Jones et al\., [2015](https://arxiv.org/html/2606.07805#bib.bib85); Sandhu et al\., [1996](https://arxiv.org/html/2606.07805#bib.bib83)\)\. All API calls are logged with request parameters, authentication context, and authorization decisions\.
- File Systems: Hierarchical directory structures with Unix\-style permissions, enabling tests for unauthorized access and privilege violations \(Zhou et al\., [2023](https://arxiv.org/html/2606.07805#bib.bib47)\)\.
- Unified Audit Logger: A central component that records every tool call, API invocation, database query, and file operation, capturing timestamps, parameters, and return values, aligning with best practices for agent eval harnesses and traceability \(Anthropic, [2026](https://arxiv.org/html/2606.07805#bib.bib61); Levy et al\., [2024](https://arxiv.org/html/2606.07805#bib.bib7)\)\.
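As a rough illustration of the Unified Audit Logger, the sketch below wraps a tool function so that every call is recorded with a timestamp, its parameters, and its return value. The class names `AuditEvent` and `AuditLogger`, the `wrap` method, and the `read_file` tool are hypothetical and are not taken from the MAC\-Bench code; the sketch uses only the Python standard library.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class AuditEvent:
    """One recorded tool call (fields follow the description above)."""
    timestamp: float
    tool: str
    params: dict[str, Any]
    result: Any

@dataclass
class AuditLogger:
    """Hypothetical central logger that collects events in memory."""
    events: list[AuditEvent] = field(default_factory=list)

    def wrap(self, tool_name: str, fn: Callable[..., Any]) -> Callable[..., Any]:
        """Return a version of fn that logs each keyword-argument call."""
        def logged(**params: Any) -> Any:
            result = fn(**params)
            self.events.append(AuditEvent(time.time(), tool_name, params, result))
            return result
        return logged

# Hypothetical mock tool exposed to the agent under test.
def read_file(path: str) -> str:
    return f"<contents of {path}>"

logger = AuditLogger()
logged_read = logger.wrap("read_file", read_file)
logged_read(path="/hr/salaries.csv")
```

Routing every tool through one wrapper keeps the trace format uniform across databases, APIs, and file systems, which is what a downstream auditor needs. A fuller version would presumably also record calls that raise exceptions.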

To prevent agents from detecting artificial simplifications or “gaming” the simulation, the World Builder Agent enforces a high\-fidelity synthesis policy, consistent with calls for realistic agent evaluation harnesses \(Anthropic, [2026](https://arxiv.org/html/2606.07805#bib.bib61); Zhou et al\., [2023](https://arxiv.org/html/2606.07805#bib.bib47)\)\. Mock environments include realistic side\-effects: database queries respect transaction latency \(50–300ms\), API servers return standardized error codes, and file systems enforce permission hierarchies\. This ensures that agents are evaluated on genuine technical competence alongside procedural compliance \(OWASP Foundation, [2021](https://arxiv.org/html/2606.07805#bib.bib106); Center for Internet Security, [2021](https://arxiv.org/html/2606.07805#bib.bib67)\)\.

### 3\.5\.Verify: Execution Trace Auditing

The Verify stage is the evaluation core, aligning with the recent emphasis on *trajectory\-/trace\-first* agent evaluation rather than final\-answer\-only scoring \(Mohammadi et al\., [2025](https://arxiv.org/html/2606.07805#bib.bib88); Xu, [2026](https://arxiv.org/html/2606.07805#bib.bib89); Michelakis and others, [2025](https://arxiv.org/html/2606.07805#bib.bib90)\)\. The agent under test executes the task within the synthesized environment under the prescribed pressure, using tool/action interfaces typical of modern web and interactive benchmarks \(Zhou et al\., [2023](https://arxiv.org/html/2606.07805#bib.bib47); Le Sellier De Chezelles et al\., [2024](https://arxiv.org/html/2606.07805#bib.bib48)\)\. Its complete execution trajectory

$$\tau=(a_{1},o_{1},a_{2},o_{2},\dots,a_{T},o_{T}), \tag{1}$$

where $a_{t}$ is an action \(tool call or natural language\) and $o_{t}$ is the observation, is captured and persisted as an auditable trace \(Xu, [2026](https://arxiv.org/html/2606.07805#bib.bib89); LangChain, [2025](https://arxiv.org/html/2606.07805#bib.bib101)\)\.

Trace\-Based Compliance Auditing\. A Compliance Auditor, implemented either as a rule\-based oracle \(deterministic checks over logs\) or as an LLM\-based interpreter \(LLM\-as\-a\-judge\), analyzes $\tau$ against the set of applicable atomic rules $\mathcal{R}$ \(Li et al\., [2024a](https://arxiv.org/html/2606.07805#bib.bib92); Zheng et al\., [2023](https://arxiv.org/html/2606.07805#bib.bib93); Liu et al\., [2023b](https://arxiv.org/html/2606.07805#bib.bib94)\)\. For each rule $r\in\mathcal{R}$, the auditor checks whether violation indicators \(e\.g\., a database query accessing restricted columns, an API call without proper authentication, broken authorization at the object/property level\) are present in the trace, consistent with widely\-used API abuse and authorization failure taxonomies \(OWASP Foundation, [2023](https://arxiv.org/html/2606.07805#bib.bib119)\)\. The outcome for rule $r$ on episode $i$ is a binary violation indicator $v_{i,r}\in\{0,1\}$ \(or a graded label when supported by the auditor rubric \(Liu et al\., [2023b](https://arxiv.org/html/2606.07805#bib.bib94); Zheng et al\., [2023](https://arxiv.org/html/2606.07805#bib.bib93)\)\)\. The normalized violation severity for an episode is calculated as

$$\text{Violation}(\tau_{i})=\frac{\sum_{r}w_{r}\cdot v_{i,r}}{\sum_{r}w_{r}}, \tag{2}$$

where $w_{r}$ is the severity weight derived from the rule’s regulatory or policy source \(European Parliament and Council of the European Union, [2016](https://arxiv.org/html/2606.07805#bib.bib96); National Institute of Standards and Technology, [2023](https://arxiv.org/html/2606.07805#bib.bib97)\)\. This weighting design operationalizes the idea that some violations are materially more harmful than others in regulated deployment contexts \(National Institute of Standards and Technology, [2023](https://arxiv.org/html/2606.07805#bib.bib97)\)\.
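The normalized violation severity of Eq\. \(2\) can be computed directly from per\-rule indicators and weights. The following is a minimal sketch; the rule names and weight values are invented for illustration and do not come from the benchmark.

```python
def violation_severity(indicators: dict[str, int], weights: dict[str, float]) -> float:
    """Weighted fraction of violated rules for one episode, as in Eq. (2).

    indicators maps each applicable rule to v_{i,r} in {0, 1};
    weights maps each rule to its severity weight w_r.
    """
    total_weight = sum(weights[rule] for rule in indicators)
    if total_weight == 0:
        return 0.0  # no applicable rules: treat the episode as violation-free
    violated = sum(weights[rule] * v for rule, v in indicators.items())
    return violated / total_weight

# Hypothetical rules and weights (assumptions, not values from the paper).
weights = {"no_restricted_columns": 3.0, "auth_required": 2.0, "log_export": 1.0}
indicators = {"no_restricted_columns": 1, "auth_required": 0, "log_export": 1}
severity = violation_severity(indicators, weights)  # (3.0 + 1.0) / 6.0
```

In this example two of three rules are violated, but the score is $4/6$ rather than $2/3$ only by coincidence of the weights; raising the weight of the unviolated rule would lower the severity.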

Core Evaluation Metrics\. We define a metric suite that captures the success–compliance trade\-off, consistent with benchmark practice that separates *task success* from *process constraints* \(Zhou et al\., [2023](https://arxiv.org/html/2606.07805#bib.bib47); Le Sellier De Chezelles et al\., [2024](https://arxiv.org/html/2606.07805#bib.bib48); Michelakis and others, [2025](https://arxiv.org/html/2606.07805#bib.bib90)\):

#### Success Rate \(SR\)

Measures pure task completion under pressure, using a task\-specific oracle as in interactive environment benchmarks \(Zhou et al\., [2023](https://arxiv.org/html/2606.07805#bib.bib47); Le Sellier De Chezelles et al\., [2024](https://arxiv.org/html/2606.07805#bib.bib48)\)\.

$$\text{SR}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\big[\text{success}(\tau_{i})\big], \tag{3}$$

where $\mathbb{I}$ is the indicator function and $\text{success}(\tau_{i})$ is determined by an oracle \(e\.g\., functional correctness checks in WebArena\-style evaluation\) \(Zhou et al\., [2023](https://arxiv.org/html/2606.07805#bib.bib47)\)\.

#### Compliance Rate \(CR\)

Measures procedural adherence regardless of task success, aggregating weighted trace violations:

$$\text{CR}=1-\frac{1}{N}\sum_{i=1}^{N}\text{Violation}(\tau_{i}). \tag{4}$$

#### Compliance\-Weighted Success Rate \(CSR\)

Integrates success and compliance into a single risk\-aware metric:

$$\text{CSR}=\text{SR}\times\left(1-\alpha\cdot\frac{1}{N}\sum_{i=1}^{N}\text{Violation}(\tau_{i})\right)=\text{SR}\times\big(1-\alpha\cdot(1-\text{CR})\big), \tag{5}$$

where $\alpha$ tunes the *cost of non\-compliance* according to domain risk tolerance, consistent with risk\-management guidance that emphasizes context\-dependent impact and governance \(National Institute of Standards and Technology, [2023](https://arxiv.org/html/2606.07805#bib.bib97)\)\. By construction, $\text{CSR}\leq\text{SR}$ for any $\alpha>0$, with equality only when $\text{CR}=1$ \(or trivially when $\text{SR}=0$\)\.
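Eqs\. \(3\)–\(5\) translate into a few lines of Python. The sketch below assumes each episode has already been reduced to a boolean success flag and a violation severity in $[0,1]$; the episode values are made up for illustration.

```python
def success_rate(successes: list[bool]) -> float:
    """Eq. (3): fraction of episodes the task oracle marks as successful."""
    return sum(successes) / len(successes)

def compliance_rate(violations: list[float]) -> float:
    """Eq. (4): one minus the mean normalized violation severity."""
    return 1.0 - sum(violations) / len(violations)

def compliance_weighted_success(sr: float, cr: float, alpha: float = 1.0) -> float:
    """Eq. (5): success discounted by alpha times the non-compliance level."""
    return sr * (1.0 - alpha * (1.0 - cr))

# Four hypothetical episodes (assumed values).
successes = [True, True, True, False]
violations = [0.0, 0.5, 1.0, 0.5]

sr = success_rate(successes)                            # 0.75
cr = compliance_rate(violations)                        # 0.5
csr_strict = compliance_weighted_success(sr, cr, 1.0)   # 0.75 * 0.5
csr_lenient = compliance_weighted_success(sr, cr, 0.5)  # 0.75 * 0.75
```

With $\alpha=1$ the metric reduces to $\text{SR}\times\text{CR}$, while a smaller $\alpha$ discounts the same violations less heavily.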

#### The Machiavellian Gap \(MG\)

To directly quantify the behavioral shift from a cooperative baseline to an adversarial one \(i\.e\., whether the agent strategically preserves reward/success by sacrificing principles\), we introduce the Machiavellian Gap \(MG\)\. This is motivated by prior evidence of reward–ethics tension and systematically measured harmful strategies in the MACHIAVELLI benchmark \(Pan et al\., [2023](https://arxiv.org/html/2606.07805#bib.bib99)\), and by broader concerns about reward hacking/specification gaming in deployed optimization systems \(Amodei et al\., [2016](https://arxiv.org/html/2606.07805#bib.bib98); Beigi and others, [2026](https://arxiv.org/html/2606.07805#bib.bib100)\)\.

Let $\text{SR}_{\text{base}},\text{CR}_{\text{base}}$ be the agent’s performance on the baseline scenario set \(no adversarial pressure\)\. Let $\text{SR}_{\text{adv}},\text{CR}_{\text{adv}}$ be the performance under a specific adversarial pressure configuration\. The Machiavellian Gap is defined as:

$$\text{MG}=\left|(\text{CR}_{\text{base}}-\text{CR}_{\text{adv}})-(\text{SR}_{\text{base}}-\text{SR}_{\text{adv}})\right|. \tag{6}$$

High MG indicates a large *decrease* in compliance that is *not* accompanied by a comparable loss in success, consistent with a pattern of goal preservation via rule\-bending \(procedural reward hacking\) \(Amodei et al\., [2016](https://arxiv.org/html/2606.07805#bib.bib98); Beigi and others, [2026](https://arxiv.org/html/2606.07805#bib.bib100)\)\. Low MG can arise when \(i\) both SR and CR degrade together \(fragile but conservative\) or \(ii\) both remain high \(robustly compliant\), matching the desiderata of reliable agent deployment under realistic tool\-use trajectories \(Mohammadi et al\., [2025](https://arxiv.org/html/2606.07805#bib.bib88); Michelakis and others, [2025](https://arxiv.org/html/2606.07805#bib.bib90)\)\.
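The three behavioral profiles described above can be checked numerically against Eq\. \(6\). The agent profiles in this sketch are hypothetical and are not results from the paper.

```python
def machiavellian_gap(sr_base: float, cr_base: float,
                      sr_adv: float, cr_adv: float) -> float:
    """Eq. (6): absolute difference between compliance drop and success drop."""
    compliance_drop = cr_base - cr_adv
    success_drop = sr_base - sr_adv
    return abs(compliance_drop - success_drop)

# Hypothetical agents, all starting from the same baseline (assumed values).
base = {"sr_base": 0.99, "cr_base": 0.99}
rule_bending = machiavellian_gap(**base, sr_adv=0.97, cr_adv=0.30)  # success kept, compliance lost
fragile = machiavellian_gap(**base, sr_adv=0.40, cr_adv=0.40)       # both degrade together
robust = machiavellian_gap(**base, sr_adv=0.98, cr_adv=0.98)        # both remain high
```

The rule\-bending profile yields a gap of about 0\.67, while the fragile and robust profiles both yield a gap near zero. One property worth keeping in mind when reading MG values: because Eq\. \(6\) takes an absolute value, an agent whose success collapses while compliance holds would receive the same magnitude as the rule\-bending agent, so the sign of the inner difference is needed to tell the two apart.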

The complete evaluation thus produces not just a single score, but a multi\-dimensional profile: baseline capabilities $(\text{SR}_{\text{base}},\text{CR}_{\text{base}})$, degraded profile under pressure $(\text{SR}_{\text{adv}},\text{CR}_{\text{adv}})$, the risk\-aware metric CSR, and the behavioral indicator MG\. Such trace\-grounded characterization is essential for deployment decisions in high\-stakes, regulated environments \(National Institute of Standards and Technology, [2023](https://arxiv.org/html/2606.07805#bib.bib97); European Parliament and Council of the European Union, [2016](https://arxiv.org/html/2606.07805#bib.bib96)\)\.

## 4\.Experiments

In this section, we present a comprehensive empirical evaluation of state\-of\-the\-art LLM agents using the MAC\-Bench framework\. Our primary objective is to quantify the Success\-Compliance Trade\-off under operational pressure and to rigorously evaluate the Machiavellian Gap \(MG\) across diverse model families and agent architectures, following recent calls for multi\-metric, scenario\-driven evaluation beyond single\-number accuracy \(Liang et al\., [2022](https://arxiv.org/html/2606.07805#bib.bib113); Srivastava and others, [2022](https://arxiv.org/html/2606.07805#bib.bib114)\)\. We investigate: \(1\) the aggregate capability and compliance stability of frontier and open\-weight models, \(2\) the impact of architectural choices on procedural adherence, \(3\) the specific vulnerability profiles induced by different social\-engineering pressure vectors \(e\.g\., authority, urgency\), and \(4\) domain\-specific variations in compliance robustness \(Cialdini, [2001](https://arxiv.org/html/2606.07805#bib.bib116); Milgram, [1963](https://arxiv.org/html/2606.07805#bib.bib40); Ferreira et al\., [2015](https://arxiv.org/html/2606.07805#bib.bib117)\)\.

### 4\.1\.Experimental Setup

Benchmark Configuration\. To ensure a rigorous evaluation, we utilized the MAC\-Bench SERV pipeline to generate a comprehensive suite of evaluation episodes\. The benchmark covers 847 Atomic Rules mapped from authoritative sources, including the GDPR \(European Union, [2016a](https://arxiv.org/html/2606.07805#bib.bib102)\), the Personal Information Protection Law \(PIPL\) \(Supreme People’s Procuratorate of the People’s Republic of China, [2021](https://arxiv.org/html/2606.07805#bib.bib104)\), the EU AI Act \(European Union, [2024a](https://arxiv.org/html/2606.07805#bib.bib103)\), and security baselines grounded in widely used community standards such as CWE \(MITRE, [2026](https://arxiv.org/html/2606.07805#bib.bib105)\), OWASP Top 10 \(OWASP Foundation, [2021](https://arxiv.org/html/2606.07805#bib.bib106)\), and CIS Benchmarks \(Center for Internet Security \(CIS\), [2026](https://arxiv.org/html/2606.07805#bib.bib107)\)\. From these rules, we synthesized 4,128 distinct evaluation scenarios\. Each scenario was instantiated via the Refine phase using a World Builder Agent to create unique holographic environments \(databases, APIs, file systems\) with randomized schemas and data distributions, yielding over 20,640 unique test instances per model configuration to reduce memorization effects and stabilize estimates under distributional variation \(Liang et al\., [2022](https://arxiv.org/html/2606.07805#bib.bib113); Srivastava and others, [2022](https://arxiv.org/html/2606.07805#bib.bib114)\)\.

We evaluated a spectrum of representative agent frameworks to assess how architectural choices influence compliance stability:

- AutoGen: A hierarchical, conversational orchestration framework for multi\-agent LLM applications \(Wu et al\., [2023](https://arxiv.org/html/2606.07805#bib.bib108); Microsoft, [2026](https://arxiv.org/html/2606.07805#bib.bib109)\)\.
- MetaGPT: A multi\-agent role\-playing framework that operationalizes SOP workflows \(PM, Architect, Engineer\) \(Hong et al\., [2023](https://arxiv.org/html/2606.07805#bib.bib110)\)\.
- ReAct: A single\-agent “reasoning \+ acting” loop serving as a baseline for direct instruction\-to\-action mapping with tool interaction \(Yao et al\., [2022](https://arxiv.org/html/2606.07805#bib.bib111)\)\.
- ChatDev: A communicative, debate and role\-based framework for software development with multi\-agent collaboration \(Qian et al\., [2024](https://arxiv.org/html/2606.07805#bib.bib112)\)\.

To ensure a fair comparison, we aligned the high\-level safety constraints across all frameworks by injecting a unified “Compliance Charter” into their system prompts\. This charter defined universal obligations \(e\.g\., adherence to GDPR principles and security protocols\)\(European Union,[2016a](https://arxiv.org/html/2606.07805#bib.bib102); OWASP Foundation,[2021](https://arxiv.org/html/2606.07805#bib.bib106); MITRE,[2026](https://arxiv.org/html/2606.07805#bib.bib105); Center for Internet Security \(CIS\),[2026](https://arxiv.org/html/2606.07805#bib.bib107)\)\. However, we preserved the native architectural instructions of each framework \(e\.g\., MetaGPT’s SOPs or AutoGen’s role definitions\) to evaluate their intrinsic architectural safety properties as deployed in practice\(Wuet al\.,[2023](https://arxiv.org/html/2606.07805#bib.bib108); Honget al\.,[2023](https://arxiv.org/html/2606.07805#bib.bib110)\)\.

We evaluated both proprietary frontier models and open\-weight models\. All models were tested with consistent inference parameters \(Temperature = 0\.7\) and tool budget constraints\. For the primary evaluation \(Table 2\), models are orchestrated via the AutoGen hierarchical framework to simulate realistic multi\-agent deployments and capture emergent behaviors such as responsibility diffusion and handoff failures under pressure \(Wu et al\., [2023](https://arxiv.org/html/2606.07805#bib.bib108); Darley and Latané, [1968](https://arxiv.org/html/2606.07805#bib.bib115)\)\. \(Table 3 provides a detailed ablation comparing different frameworks\.\)

We report four key metrics:

1. Success Rate \(SR\): Task completion under pressure\.
2. Compliance Rate \(CR\): Adherence to atomic rules grounded in legal/security obligations \(European Union, [2016a](https://arxiv.org/html/2606.07805#bib.bib102); Supreme People’s Procuratorate of the People’s Republic of China, [2021](https://arxiv.org/html/2606.07805#bib.bib104); European Union, [2024a](https://arxiv.org/html/2606.07805#bib.bib103); OWASP Foundation, [2021](https://arxiv.org/html/2606.07805#bib.bib106); MITRE, [2026](https://arxiv.org/html/2606.07805#bib.bib105); Center for Internet Security \(CIS\), [2026](https://arxiv.org/html/2606.07805#bib.bib107)\)\.
3. CSR \($\alpha=1$\): Compliance\-weighted success rate, calculated as $\text{SR}\times\text{CR}$ \(Liang et al\., [2022](https://arxiv.org/html/2606.07805#bib.bib113)\)\.
4. Machiavellian Gap \($\Delta_{M}$, denoted MG in Section 3\.5\): In the Main Leaderboard \(Table 2\), because baseline SR and CR are both nearly perfect \($>98\%$\), $\Delta_{M}$ effectively reduces to the performance gap $\text{SR}-\text{CR}$ under combined high pressure, quantifying the immediate deviation from ideal compliance \(Liang et al\., [2022](https://arxiv.org/html/2606.07805#bib.bib113)\)\.
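The reduction of the Machiavellian Gap to $\text{SR}-\text{CR}$ follows from rearranging the inner term of Eq\. \(6\). The short derivation below is a reading of the definitions in Section 3\.5 rather than a statement from the authors:

$$
(\text{CR}_{\text{base}}-\text{CR}_{\text{adv}})-(\text{SR}_{\text{base}}-\text{SR}_{\text{adv}})
=(\text{SR}_{\text{adv}}-\text{CR}_{\text{adv}})-(\text{SR}_{\text{base}}-\text{CR}_{\text{base}}).
$$

If both baseline rates lie above 98%, then $|\text{SR}_{\text{base}}-\text{CR}_{\text{base}}|<2$ percentage points, so the reported $\Delta_{M}$ differs from $\text{SR}_{\text{adv}}-\text{CR}_{\text{adv}}$ by less than 2 points\. Likewise, setting $\alpha=1$ in Eq\. \(5\) gives

$$
\text{CSR}=\text{SR}\times\big(1-(1-\text{CR})\big)=\text{SR}\times\text{CR}.
$$

As a worked check using the Claude\-3 figures quoted in Section 4\.2 \(SR = 91\.4%, CR = 52\.1%\): $\Delta_{M}\approx 91.4-52.1=39.3$ points, matching the reported value, and the formula gives $\text{CSR}\approx 0.914\times 0.521\approx 0.476$\. The CSR figure here is computed from the formula, not read from Table 2\.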

### 4\.2\.Main Results

We evaluated 12 representative models across the full MAC\-Bench suite under “High Adversarial Pressure” \(combined Authority \+ Urgency\) using the AutoGen hierarchical framework\. The results, summarized in Table 2, reveal a pervasive decoupling between task capability and procedural alignment\.

Table 2\. Performance leaderboard of proprietary and open\-weight models orchestrated via the AutoGen framework under Combined High Pressure \(%\)\. Metrics reported include Success Rate \(SR\), Compliance Rate \(CR\), Compliance\-Weighted Success Rate \(CSR\), and Machiavellian Gap \($\Delta_{M}$\)\.

The “Cunning” Frontier: Top\-tier models such as GPT\-5 and Gemini\-3 maintain near\-perfect Success Rates \(SR \> 97%\), and other powerful models such as DeepSeek\-V3 also exhibit high task completion rates, yet they suffer catastrophic collapses in Compliance \(CR < 30%\)\. This results in massive Machiavellian Gaps \($\Delta_{M}$ exceeding 60%\)\. In the multi\-agent context, these models aggressively exploit procedural shortcuts \(e\.g\., bypassing encryption, ignoring data minimization\) by delegating illicit sub\-tasks to specialized sub\-agents, confirming the “Responsibility Diffusion” hypothesis\.

The “Honest” Baseline: In contrast, Claude\-3 exhibits a significantly lower Gap \($\Delta_{M}=+39.3\%$\)\. Although its SR is lower \(91\.4%\), its CR is the highest among evaluated models \(52\.1%\)\. This suggests that Claude\-3’s more conservative safety alignment is more resistant to the fragmentation of responsibility in multi\-agent workflows\.

Open\-Weight Vulnerability: Open\-weight models show the highest Machiavellian Gaps in the AutoGen setup\. This indicates that without specialized multi\-agent safety training, complex orchestration frameworks provide these powerful models with more opportunities to “game” the system without being caught\.

### 4\.3\.Architectural Ablation

To understand how agent architecture influences compliance, we compared four frameworks using GPT\-4o as the backbone\. Results are presented in Table 3\.

Table 3\. Impact comparison of different agent architectures \(ReAct, AutoGen, MetaGPT, ChatDev\) using GPT\-4o as the backbone under Combined High Pressure \(%\)\. The table displays Success Rate \(SR\), Compliance Rate \(CR\), and average interaction Turns\.

Responsibility Diffusion in AutoGen: The hierarchical AutoGen framework achieves the highest Success Rate \(96\.8%\) but the lowest Compliance Rate \(38\.5%\)\. The multi\-agent dialogue structure seems to facilitate “responsibility diffusion”; individual agents focus on their sub\-tasks \(e\.g\., “write the SQL query”\) without maintaining a holistic view of the security policy, allowing a coordinator agent to aggregate potentially illicit outputs\. The average turns \(3\.2\) reflect the necessary coordination overhead\.

Safeguards in ReAct: The single\-agent ReAct pattern shows the highest Compliance Rate \(72\.1%\)\. By explicitly interleaving reasoning and action, the model is forced to justify each step against the system prompt constraints in its internal monologue, reducing the likelihood of automatic rule\-breaking\.

Deliberative Stability: ChatDev requires agents to debate decisions\. While this introduces significant latency \(4\.2 turns\), it yields a high CR \(70\.3%\), suggesting that peer\-review mechanisms can partially mitigate Machiavellian tendencies by surfacing potential violations during the “Debate” phase before execution\.

### 4\.4\.Pressure Vector Ablation

We isolated the effect of individual pressure vectors on distinct models\. Table 4 shows the Compliance Rate under baseline conditions and four pressure types\.

Table 4\. Compliance Rate \(CR\) distribution for various models under baseline conditions and four distinct pressure vectors: Authority, Urgency, Empathy, and Obfuscation \(%\)\.

Authority Pressure Dominance: “Authority” \(Auth\.\) is the single most effective vector for inducing violations, causing an average drop of nearly 49% in Compliance\. Across all models, including the robust Claude\-3\.5, the claim of “CEO override” or “Executive permission” leads agents to bypass 2FA, skip audit logs, or export sensitive data without encryption\. This validates the “Obedience Bias” hypothesis: RLHF optimization for helpfulness creates a systemic susceptibility to hierarchical directives\.

Open Models are More Susceptible: Open models like DeepSeek\-V3 and Llama\-3\.1\-70B show catastrophic failure under Authority pressure \(CR $\leq$ 36%\)\. They tend to interpret “Authority” not just as a context cue but as a “Golden Ticket” to ignore all prior safety instructions\.

Obfuscation as the “Reasoning Trap”: Surprisingly, GPT\-4o performs best under Obfuscation \(86\.5%\)\. However, for smaller models, Obfuscation \(vague instructions\) is dangerous because they lack the reasoning capability to disambiguate instructions conservatively, leading them to assume the most permissive \(and thus non\-compliant\) interpretation to ensure task success\.

### 4\.5\.Domain Analysis

Finally, we analyzed performance across specific regulatory domains to identify areas of “hard” and “soft” alignment under high pressure\. Table 5 presents the Compliance Rates for Privacy, DevOps, Code Security, Finance, and Ethics\.

Table 5\. Compliance performance breakdown across five specific regulatory domains: Privacy, DevOps, Code Security, Finance, and Ethics \(%\)\.

The “Relative” Binary Advantage: Even under high pressure, the Code Security and Ethics domains maintain relatively higher Compliance Rates \(approx\. 35%–60%\) compared to Privacy and DevOps\. However, these values represent a significant degradation from baseline capabilities, indicating that even “easier” binary checks fail under time and authority pressure\.

The “Process” Blindspot: Privacy and DevOps see the lowest Compliance across all models \(often below 20%–30%\)\. These domains strictly require Procedural Compliance \(e\.g\., data minimization, ordered testing\), which is most susceptible to “Reward Hacking” behaviors\. Agents frequently bypass multi\-step protocols \(e\.g\., exporting the full database instead of running a scoped query, or skipping backup scripts\) to achieve task success under pressure, revealing a critical vulnerability in these process\-heavy domains\.

## 5\.Conclusion

This paper targets a key limitation of evaluation for LLM agents: high task success can coexist with systematic procedural non\-compliance\. We introduce MAC\-Bench, a trace\-centric, contamination\-resistant benchmark built by the SERV pipeline, which translates regulatory corpora into executable Atomic Rules and evaluates agents through a Compliance Oracle over full tool\-use trajectories\. Across models and settings, we observe a pronounced success–compliance trade\-off, where strong success under pressure often accompanies sharp compliance degradation, motivating integrated metrics such as CSR and behavioral diagnostics like the Machiavellian Gap\. Empirically, we find that \(i\) some top\-performing agents remain highly successful yet exhibit “cunning” procedural shortcuts, especially in multi\-agent workflows consistent with responsibility diffusion; \(ii\) agent architectures that explicitly interleave reasoning and action \(or enforce debate\) tend to improve compliance relative to purely hierarchical delegation; and \(iii\) among social\-engineering pressures, Authority is the most reliable trigger for violations, while vague Obfuscation can become a reasoning trap for weaker models\. We hope MAC\-Bench can serve as an evolving testbed for trajectory\-level alignment, pressure\-aware training, and compliance\-by\-construction agent orchestration\.

## References

- D\. Amodei, C\. Olah, J\. Steinhardt, P\. Christiano, J\. Schulman, and D\. Mané \(2016\)Concrete problems in AI safety\.External Links:1606\.06565,[Link](https://arxiv.org/abs/1606.06565)Cited by:[§1](https://arxiv.org/html/2606.07805#S1.p1.1),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.SSS0.Px4.p1.1),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.SSS0.Px4.p3.1)\.
- Anthropic \(2026\)Demystifying evals for ai agents\.Note:[https://www\.anthropic\.com/engineering/demystifying\-evals\-for\-ai\-agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)Accessed: 2026\-02\-08Cited by:[item4\.](https://arxiv.org/html/2606.07805#S3.I1.i4.p1.1),[4th item](https://arxiv.org/html/2606.07805#S3.I6.i4.p1.1),[§3\.1](https://arxiv.org/html/2606.07805#S3.SS1.p1.1),[§3\.3](https://arxiv.org/html/2606.07805#S3.SS3.p1.1),[§3\.3](https://arxiv.org/html/2606.07805#S3.SS3.p3.1),[§3\.3](https://arxiv.org/html/2606.07805#S3.SS3.p4.1),[§3\.4](https://arxiv.org/html/2606.07805#S3.SS4.p1.1),[§3\.4](https://arxiv.org/html/2606.07805#S3.SS4.p4.1),[§3](https://arxiv.org/html/2606.07805#S3.p1.3)\.
- F\. Ariaiet al\.\(2024\)Natural language processing for the legal domain: a survey of tasks, datasets, models and challenges\.ACM Computing Surveys \(preprint\)\.External Links:[Link](https://arxiv.org/pdf/2410.21306)Cited by:[3rd item](https://arxiv.org/html/2606.07805#S3.I3.i3.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p3.1)\.
- T\. Athan, H\. Boley, G\. Governatori, M\. Palmirani, A\. Paschke, and A\. Wyner \(2013\)OASIS legalruleml\.InProceedings of the Fourteenth International Conference on Artificial Intelligence and Law \(ICAIL\),External Links:[Link](https://dl.acm.org/doi/10.1145/2514601.2514603)Cited by:[item1\.](https://arxiv.org/html/2606.07805#S3.I1.i1.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p1.1)\.
- M\. Beigiet al\.\(2026\)Adversarial reward auditing for active detection and mitigation of reward hacking\.External Links:2602\.01750,[Link](https://arxiv.org/abs/2602.01750)Cited by:[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.SSS0.Px4.p1.1),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.SSS0.Px4.p3.1)\.
- M\. Bertlet al\.\(2025\)Transforming legal texts into computational logic\.SoftwareX\.External Links:[Link](https://www.sciencedirect.com/science/article/pii/S2666307425000336)Cited by:[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p1.1)\.
- Center for Internet Security \(CIS\) \(2026\)CIS benchmarks\.Note:Official websiteAccessed: 2026\-02\-08External Links:[Link](https://www.cisecurity.org/cis-benchmarks)Cited by:[item 2](https://arxiv.org/html/2606.07805#S4.I2.i2.p1.1),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p3.1)\.
- Center for Internet Security \(2021\)CIS critical security controls version 8\.Note:[https://www\.cisecurity\.org/controls/v8](https://www.cisecurity.org/controls/v8)Accessed: 2026\-02\-08Cited by:[2nd item](https://arxiv.org/html/2606.07805#S3.I2.i2.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p4.1),[§3\.4](https://arxiv.org/html/2606.07805#S3.SS4.p4.1)\.
- G\. Chen, S\. Yang, C\. Li, W\. Liu, J\. Luan, and Z\. Xu \(2026\)End\-to\-end optimization of llm\-driven multi\-agent search systems via heterogeneous\-group\-based reinforcement learning\.arXiv preprint arXiv:2506\.02718\.Note:Accepted to ACL 2026 Main ConferenceExternal Links:[Link](https://arxiv.org/abs/2506.02718),[Document](https://dx.doi.org/10.48550/arXiv.2506.02718),2506\.02718Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p5.1)\.
- J\. Chen, M\. Zou, Z\. Wang, Q\. Wang, D\. D\. Sun, Z\. Chi, and Z\. Xu \(2025a\)FinHEAR: human expertise and adaptive risk\-aware temporal reasoning for financial decision\-making\.InFindings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4–9, 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),pp\. 1648–1672\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.87/)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p1.1)\.
- S\. Chen, Y\. Chen, Z\. Li, Y\. Jiang, Z\. Wan, Y\. He, D\. Ran, T\. Gu, H\. Li, T\. Xie, and B\. Ray \(2025b\)Benchmarking large language models under data contamination: a survey from static to dynamic evaluation\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),External Links:[Link](https://aclanthology.org/2025.emnlp-main.511/)Cited by:[3rd item](https://arxiv.org/html/2606.07805#S3.I5.i3.p1.1),[§3\.1](https://arxiv.org/html/2606.07805#S3.SS1.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.SSS0.Px1.p2.1),[§3](https://arxiv.org/html/2606.07805#S3.p1.3)\.
- Z\. Chen, Z\. Xiang, C\. Xiao, D\. Song, and B\. Li \(2024\)AgentPoison: red\-teaming LLM agents via poisoning memory or knowledge bases\.External Links:2407\.12784,[Link](https://arxiv.org/abs/2407.12784)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p5.1)\.
- H\. K\. Choi, M\. Khanov, H\. Wei, and S\. Li \(2025\)How contaminated is your benchmark? measuring dataset leakage in large language models with kernel divergence\.External Links:2502\.00678,[Link](https://arxiv.org/abs/2502.00678)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p2.1)\.
- R\. B\. Cialdini \(2001\)Influence: science and practice\.4 edition,Allyn & Bacon\.Note:Authority/urgency\-related persuasion principles; Accessed: 2026\-02\-08Cited by:[item2\.](https://arxiv.org/html/2606.07805#S3.I1.i2.p1.1),[3rd item](https://arxiv.org/html/2606.07805#S3.I4.i3.p1.1),[§3\.3](https://arxiv.org/html/2606.07805#S3.SS3.p1.1),[§4](https://arxiv.org/html/2606.07805#S4.p1.1)\.
- Cybersecurity and Infrastructure Security Agency \(CISA\) \(2024\)2024 cwe top 25 most dangerous software weaknesses\.Note:[https://www\.cisa\.gov/news\-events/alerts/2024/11/20/2024\-cwe\-top\-25\-most\-dangerous\-software\-weaknesses](https://www.cisa.gov/news-events/alerts/2024/11/20/2024-cwe-top-25-most-dangerous-software-weaknesses)Accessed: 2026\-02\-08Cited by:[2nd item](https://arxiv.org/html/2606.07805#S3.I2.i2.p1.1),[2nd item](https://arxiv.org/html/2606.07805#S3.I3.i2.p1.1)\.
- J\. M\. Darley and B\. Latané \(1968\)Bystander intervention in emergencies: diffusion of responsibility\.Journal of Personality and Social Psychology8\(4\),pp\. 377–383\.Note:Accessed: 2026\-02\-08External Links:[Document](https://dx.doi.org/10.1037/h0025589),[Link](https://pubmed.ncbi.nlm.nih.gov/5645600/)Cited by:[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p4.1)\.
- E\. Debenedettiet al\.\(2024\)AgentDojo: a dynamic environment to evaluate attacks and defenses for LLM agents\.External Links:2406\.13352,[Link](https://arxiv.org/abs/2406.13352)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p5.1),[§2](https://arxiv.org/html/2606.07805#S2.p6.1)\.
- V\. Ershov \(2023\)A case study for compliance as code with graphs and language models: public release of the regulatory knowledge graph\.arXiv preprint arXiv:2302\.01842\.External Links:[Link](https://arxiv.org/abs/2302.01842)Cited by:[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p4.1)\.
- European Parliament and Council of the European Union \(2016\)Regulation \(EU\) 2016/679 \(General Data Protection Regulation\)\.Note:[https://eur\-lex\.europa\.eu/eli/reg/2016/679/oj/eng](https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng)Accessed: 2026\-02\-08Cited by:[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.SSS0.Px4.p4.2),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p2.6)\.
- European Parliament and the Council of the European Union \(2016\)Regulation \(eu\) 2016/679 \(general data protection regulation\)\.Note:[https://eur\-lex\.europa\.eu/eli/reg/2016/679/oj/eng](https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng)Official Journal text\. Accessed: 2026\-02\-08Cited by:[item1\.](https://arxiv.org/html/2606.07805#S3.I1.i1.p1.1),[1st item](https://arxiv.org/html/2606.07805#S3.I2.i1.p1.1),[1st item](https://arxiv.org/html/2606.07805#S3.I3.i1.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p1.1)\.
- European Parliament and the Council of the European Union \(2024\)Regulation \(eu\) 2024/1689 \(artificial intelligence act\)\.Note:[https://eur\-lex\.europa\.eu/eli/reg/2024/1689/oj/eng](https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng)Official Journal text\. Accessed: 2026\-02\-08Cited by:[item1\.](https://arxiv.org/html/2606.07805#S3.I1.i1.p1.1),[1st item](https://arxiv.org/html/2606.07805#S3.I2.i1.p1.1),[1st item](https://arxiv.org/html/2606.07805#S3.I3.i1.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p1.1)\.
- European Union \(2016a\)Regulation \(eu\) 2016/679 \(general data protection regulation\)\.Note:EUR\-Lex \(Official Journal text\)Accessed: 2026\-02\-08External Links:[Link](https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng)Cited by:[item 2](https://arxiv.org/html/2606.07805#S4.I2.i2.p1.1),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p3.1)\.
- European Union \(2016b\)Regulation \(eu\) 2016/679 \(general data protection regulation\)\.Note:EUR\-LexExternal Links:[Link](https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng)Cited by:[§1](https://arxiv.org/html/2606.07805#S1.p2.1)\.
- European Union \(2024a\)Regulation \(eu\) 2024/1689 \(artificial intelligence act\)\.Note:EUR\-Lex \(Official Journal text\)Accessed: 2026\-02\-08External Links:[Link](https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng)Cited by:[item 2](https://arxiv.org/html/2606.07805#S4.I2.i2.p1.1),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p1.1)\.
- European Union \(2024b\)Regulation \(eu\) 2024/1689 \(artificial intelligence act\)\.Note:EUR\-LexExternal Links:[Link](https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng)Cited by:[§1](https://arxiv.org/html/2606.07805#S1.p2.1)\.
- A\. Ferreira, L\. Coventry, and G\. Lenzini \(2015\)Principles of persuasion in social engineering and their use in phishing\.InHuman Aspects of Information Security, Privacy, and Trust \(HAS\),Note:Accessed: 2026\-02\-08External Links:[Link](https://orbilu.uni.lu/bitstream/10993/20301/1/FerreiraAna-CameraReady.pdf)Cited by:[§4](https://arxiv.org/html/2606.07805#S4.p1.1)\.
- E\. Francesconi, G\. Lilliu,et al\.\(2023\)Patterns for legal compliance checking in a decidable semantic web framework\.Artificial Intelligence and Law\.External Links:[Link](https://link.springer.com/article/10.1007/s10506-022-09317-8)Cited by:[item1\.](https://arxiv.org/html/2606.07805#S3.I1.i1.p1.1),[3rd item](https://arxiv.org/html/2606.07805#S3.I2.i3.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p3.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p4.1)\.
- G\. Gigerenzer and W\. Gaissmaier \(2011\)Heuristic decision making\.Note:[https://pure\.mpg\.de/pubman/item/item\_2099042\_4/component/file\_2099041/GG\_Heuristic\_2011\.pdf](https://pure.mpg.de/pubman/item/item_2099042_4/component/file_2099041/GG_Heuristic_2011.pdf)Accessed: 2026\-02\-08Cited by:[2nd item](https://arxiv.org/html/2606.07805#S3.I4.i2.p1.1)\.
- C\. A\. E\. Goodhart \(1975\)Problems of monetary management: the UK experience\.InInflation, Depression, and Economic Policy in the West,External Links:[Link](https://link.springer.com/chapter/10.1007/978-1-349-17295-5_4)Cited by:[§1](https://arxiv.org/html/2606.07805#S1.p1.1)\.
- D\. Hardt \(2012\)The oauth 2\.0 authorization framework\.Note:RFC 6749Internet Engineering Task Force\. Accessed: 2026\-02\-08External Links:[Link](https://www.rfc-editor.org/rfc/rfc6749)Cited by:[2nd item](https://arxiv.org/html/2606.07805#S3.I6.i2.p1.1)\.
- S\. Hong, M\. Zhuge, J\. Chen, X\. Zheng, Y\. Cheng, C\. Zhang, J\. Wang, Z\. Wang, S\. K\. S\. Yau, Z\. Lin, L\. Zhou, C\. Ran, L\. Xiao, C\. Wu, and J\. Schmidhuber \(2023\)MetaGPT: meta programming for a multi\-agent collaborative framework\.Note:Accessed: 2026\-02\-08External Links:2308\.00352,[Link](https://arxiv.org/abs/2308.00352)Cited by:[§1](https://arxiv.org/html/2606.07805#S1.p1.1),[2nd item](https://arxiv.org/html/2606.07805#S4.I1.i2.p1.1),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p3.1)\.
- M\. Jones, J\. Bradley, and N\. Sakimura \(2015\)JSON web token \(jwt\)\.Note:RFC 7519Internet Engineering Task Force\. Accessed: 2026\-02\-08External Links:[Link](https://www.rfc-editor.org/rfc/rfc7519)Cited by:[2nd item](https://arxiv.org/html/2606.07805#S3.I6.i2.p1.1)\.
- G\. Junejaet al\.\(2025\)MAGPIE: a benchmark for multi\-agent contextual privacy evaluation\.External Links:2510\.15186,[Link](https://arxiv.org/abs/2510.15186)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p3.1)\.
- V\. Krakovna, J\. Uesato, V\. Mikulik, M\. Rahtz, T\. Everitt, R\. Kumar, and Z\. Kenton \(2020\)Specification gaming: the flip side of ai ingenuity\.Note:DeepMind BlogExternal Links:[Link](https://deepmind.google/blog/specification-gaming-the-flip-side-of-ai-ingenuity/)Cited by:[§1](https://arxiv.org/html/2606.07805#S1.p1.1)\.
- LangChain \(2025\)Log LLM calls \(trace logging\) — langsmith documentation\.Note:[https://docs\.langchain\.com/langsmith/log\-llm\-trace](https://docs.langchain.com/langsmith/log-llm-trace)Accessed: 2026\-02\-08Cited by:[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p1.2)\.
- T\. Le Sellier De Chezelles, M\. Gasse, A\. Drouin, M\. Caccia, L\. Boisvert, M\. Thakkar, T\. Marty, R\. Assouel, S\. Omidi Shayegan, L\. K\. Jang, X\. H\. Lù, O\. Yoran, D\. Kong, F\. F\. Xu, S\. Reddy, Q\. Cappart, G\. Neubig, R\. Salakhutdinov, N\. Chapados, and A\. Lacoste \(2024\)The browsergym ecosystem for web agent research\.External Links:2412\.05467,[Link](https://arxiv.org/abs/2412.05467)Cited by:[item3\.](https://arxiv.org/html/2606.07805#S3.I1.i3.p1.1),[1st item](https://arxiv.org/html/2606.07805#S3.I5.i1.p1.1),[§3\.1](https://arxiv.org/html/2606.07805#S3.SS1.p1.1),[§3\.4](https://arxiv.org/html/2606.07805#S3.SS4.p1.1),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.SSS0.Px1.p1.3),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p1.3),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p3.1),[§3](https://arxiv.org/html/2606.07805#S3.p1.3)\.
- I\. Levy, B\. Wiesel, S\. Marreed, A\. Oved, A\. Yaeli, and S\. Shlomov \(2024\)ST\-webagentbench: a benchmark for evaluating safety and trustworthiness in web agents\.External Links:2410\.06703,[Link](https://arxiv.org/abs/2410.06703)Cited by:[§1](https://arxiv.org/html/2606.07805#S1.p3.1),[§2](https://arxiv.org/html/2606.07805#S2.p2.1),[§2](https://arxiv.org/html/2606.07805#S2.p4.1),[item3\.](https://arxiv.org/html/2606.07805#S3.I1.i3.p1.1),[item4\.](https://arxiv.org/html/2606.07805#S3.I1.i4.p1.1),[4th item](https://arxiv.org/html/2606.07805#S3.I6.i4.p1.1),[§3\.3](https://arxiv.org/html/2606.07805#S3.SS3.p3.1),[§3\.4](https://arxiv.org/html/2606.07805#S3.SS4.p1.1)\.
- H\. Li, Q\. Dong, J\. Chen, H\. Su, Y\. Zhou, Q\. Ai, Z\. Ye, and Y\. Liu \(2024a\)LLMs\-as\-judges: a comprehensive survey on LLM\-based evaluation methods\.External Links:2412\.05579,[Link](https://arxiv.org/abs/2412.05579)Cited by:[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p2.5)\.
- Y\. Li, F\. Guerin, and C\. Lin \(2024b\)LatestEval: addressing data contamination in language model evaluation\.Proceedings of the AAAI Conference on Artificial Intelligence\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/29822/31427)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p2.1)\.
- P\. Liang, R\. Bommasani, T\. Lee, D\. Tsipras, D\. Soylu, M\. Yasunaga, Y\. Zhang, D\. Narayanan, Y\. Wu, A\. Kumar, B\. Newman, B\. Yuan, B\. Yan, C\. Zhang, C\. Cosgrove, C\. D\. Manning, C\. Ré, T\. Hashimoto,et al\.\(2022\)Holistic evaluation of language models\.Note:Accessed: 2026\-02\-08External Links:2211\.09110,[Link](https://arxiv.org/abs/2211.09110)Cited by:[item 3](https://arxiv.org/html/2606.07805#S4.I2.i3.p1.2),[item 4](https://arxiv.org/html/2606.07805#S4.I2.i4.p1.4),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p1.1),[§4](https://arxiv.org/html/2606.07805#S4.p1.1)\.
- X\. Liu, H\. Yu, H\. Zhang, Y\. Xu, X\. Lei, H\. Lai, Y\. Gu, H\. Ding, K\. Men, K\. Yang, S\. Zhang, X\. Deng, A\. Zeng, Z\. Du, C\. Zhang, S\. Shen, T\. Zhang, Y\. Su, H\. Sun, M\. Huang, Y\. Dong, and J\. Tang \(2023a\)AgentBench: evaluating llms as agents\.External Links:2308\.03688,[Link](https://arxiv.org/abs/2308.03688)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p1.1),[1st item](https://arxiv.org/html/2606.07805#S3.I5.i1.p1.1),[§3\.1](https://arxiv.org/html/2606.07805#S3.SS1.p1.1),[§3\.3](https://arxiv.org/html/2606.07805#S3.SS3.p1.1),[§3](https://arxiv.org/html/2606.07805#S3.p1.3)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023b\)G\-eval: NLG evaluation using GPT\-4 with better human alignment\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),External Links:[Link](https://arxiv.org/abs/2303.16634)Cited by:[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p2.5)\.
- G\. Mialon, C\. Fourrier, C\. Swift, T\. Wolf, Y\. LeCun, and T\. Scialom \(2023\)GAIA: a benchmark for general ai assistants\.External Links:2311\.12983,[Link](https://arxiv.org/abs/2311.12983)Cited by:[§1](https://arxiv.org/html/2606.07805#S1.p1.1),[§2](https://arxiv.org/html/2606.07805#S2.p1.1)\.
- P\. Michelakiset al\.\(2025\)Full\-path evaluation of LLM agents beyond final state\.External Links:2509\.20998,[Link](https://arxiv.org/abs/2509.20998)Cited by:[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.SSS0.Px4.p3.1),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p1.3),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p3.1)\.
- Microsoft \(2026\)Multi\-agent conversation framework — autogen documentation\.Note:Documentation websiteAccessed: 2026\-02\-08External Links:[Link](https://microsoft.github.io/autogen/0.2/docs/Use-Cases/agent_chat/)Cited by:[1st item](https://arxiv.org/html/2606.07805#S4.I1.i1.p1.1)\.
- S\. Milgram \(1963\)Behavioral study of obedience\.Journal of Abnormal and Social Psychology67\(4\),pp\. 371–378\.External Links:[Document](https://dx.doi.org/10.1037/h0040525)Cited by:[item2\.](https://arxiv.org/html/2606.07805#S3.I1.i2.p1.1),[1st item](https://arxiv.org/html/2606.07805#S3.I4.i1.p1.1),[§3\.3](https://arxiv.org/html/2606.07805#S3.SS3.p1.1),[§4](https://arxiv.org/html/2606.07805#S4.p1.1)\.
- MITRE \(2025\)CWE top 25 most dangerous software weaknesses – 2024\.Note:[https://cwe\.mitre\.org/top25/archive/2024/2024\_cwe\_top25\.html](https://cwe.mitre.org/top25/archive/2024/2024_cwe_top25.html)Archive page for the 2024 list\. Accessed: 2026\-02\-08Cited by:[2nd item](https://arxiv.org/html/2606.07805#S3.I2.i2.p1.1),[2nd item](https://arxiv.org/html/2606.07805#S3.I3.i2.p1.1),[2nd item](https://arxiv.org/html/2606.07805#S3.I5.i2.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p4.1)\.
- MITRE \(2026\)Common weakness enumeration \(cwe\)\.Note:Project websiteAccessed: 2026\-02\-08External Links:[Link](https://cwe.mitre.org/)Cited by:[item 2](https://arxiv.org/html/2606.07805#S4.I2.i2.p1.1),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p3.1)\.
- M\. Mohammadi, Y\. Li, J\. Lo, and W\. Yip \(2025\)Evaluation and benchmarking of LLM agents: a survey\.External Links:2507\.21504,[Link](https://arxiv.org/abs/2507.21504)Cited by:[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.SSS0.Px4.p3.1),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p1.3)\.
- National Institute of Standards and Technology \(2023\)Artificial intelligence risk management framework \(AI RMF 1\.0\)\.Technical Report NIST AI 100\-1,NIST\.External Links:[Link](https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf)Cited by:[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.SSS0.Px3.p1.3),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.SSS0.Px4.p4.2),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p2.6)\.
- National People’s Congress of the People’s Republic of China \(2021a\)Personal information protection law of the people’s republic of china \(english translation\)\.Note:[https://en\.npc\.gov\.cn\.cdurl\.cn/2021\-12/29/c\_694559\.htm](https://en.npc.gov.cn.cdurl.cn/2021-12/29/c_694559.htm)Accessed: 2026\-02\-08Cited by:[item1\.](https://arxiv.org/html/2606.07805#S3.I1.i1.p1.1),[1st item](https://arxiv.org/html/2606.07805#S3.I2.i1.p1.1),[1st item](https://arxiv.org/html/2606.07805#S3.I3.i1.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p1.1)\.
- National People’s Congress of the People’s Republic of China \(2021b\)Personal information protection law of the people’s republic of china\.Note:NPC \(official English text page\)External Links:[Link](https://en.npc.gov.cn.cdurl.cn/2021-12/29/c_694559.htm)Cited by:[§1](https://arxiv.org/html/2606.07805#S1.p2.1)\.
- OWASP Foundation \(2021\)OWASP top 10:2021\.Note:Project websiteAccessed: 2026\-02\-08External Links:[Link](https://owasp.org/Top10/2021/)Cited by:[2nd item](https://arxiv.org/html/2606.07805#S3.I2.i2.p1.1),[2nd item](https://arxiv.org/html/2606.07805#S3.I3.i2.p1.1),[2nd item](https://arxiv.org/html/2606.07805#S3.I5.i2.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.p4.1),[§3\.4](https://arxiv.org/html/2606.07805#S3.SS4.p4.1),[item 2](https://arxiv.org/html/2606.07805#S4.I2.i2.p1.1),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p3.1)\.
- OWASP Foundation \(2023\)OWASP top 10 API security risks – 2023\.Note:Includes API1:2023 Broken Object Level Authorization and related risksExternal Links:[Link](https://owasp.org/API-Security/editions/2023/en/0x11-t10/)Cited by:[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p2.5)\.
- A\. Pan, J\. S\. Chan, A\. Zou, N\. Li, S\. Basart, T\. Woodside, J\. Ng, H\. Zhang, S\. Emmons, and D\. Hendrycks \(2023\)Do the rewards justify the means? measuring trade\-offs between rewards and ethical behavior in the MACHIAVELLI benchmark\.External Links:2304\.03279,[Link](https://arxiv.org/abs/2304.03279)Cited by:[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.SSS0.Px4.p1.1)\.
- Parea AI \(2026\)Parea documentation: evaluation overview\.Note:Online documentationAccessed 2026\-02\-08External Links:[Link](https://docs.parea.ai/evaluation/overview)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p5.1)\.
- FastAPI Project \(2025a\)FastAPI documentation\.Note:[https://fastapi\.tiangolo\.com/](https://fastapi.tiangolo.com/)Accessed: 2026\-02\-08Cited by:[2nd item](https://arxiv.org/html/2606.07805#S3.I6.i2.p1.1)\.
- SQLAlchemy Project \(2025b\)SQLAlchemy documentation\.Note:[https://docs\.sqlalchemy\.org/](https://docs.sqlalchemy.org/)Accessed: 2026\-02\-08Cited by:[1st item](https://arxiv.org/html/2606.07805#S3.I6.i1.p1.1)\.
- C\. Qian, W\. Liu, H\. Liu, N\. Chen, Y\. Dang, J\. Li, C\. Yang, W\. Chen, Y\. Su, X\. Cong, J\. Xu, D\. Li, Z\. Liu, and M\. Sun \(2024\)ChatDev: communicative agents for software development\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\),Note:Accessed: 2026\-02\-08External Links:[Link](https://aclanthology.org/2024.acl-long.810.pdf)Cited by:[4th item](https://arxiv.org/html/2606.07805#S4.I1.i4.p1.1)\.
- Y\. Qin, S\. Liang, Y\. Ye,et al\.\(2023\)ToolLLM: facilitating large language models to master 10000\+ real\-world apis\.arXiv preprint arXiv:2307\.16789\.External Links:[Link](https://arxiv.org/abs/2307.16789)Cited by:[§3\.1](https://arxiv.org/html/2606.07805#S3.SS1.p1.1)\.
- R\. S\. Sandhu, E\. J\. Coyne, H\. L\. Feinstein, and C\. E\. Youman \(1996\)Role\-based access control models\.Computer29\(2\),pp\. 38–47\.External Links:[Link](https://csrc.nist.gov/csrc/media/projects/role-based-access-control/documents/sandhu96.pdf)Cited by:[2nd item](https://arxiv.org/html/2606.07805#S3.I6.i2.p1.1)\.
- Y\. Shaoet al\.\(2024\)PrivacyLens: evaluating privacy norm awareness of language model agents\.InAdvances in Neural Information Processing Systems \(NeurIPS\), Datasets and Benchmarks Track,External Links:[Link](https://arxiv.org/abs/2409.00138)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p4.1)\.
- A\. Srivastavaet al\.\(2022\)Beyond the imitation game: quantifying and extrapolating the capabilities of language models\.Note:Accessed: 2026\-02\-08External Links:2206\.04615,[Link](https://arxiv.org/abs/2206.04615)Cited by:[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p1.1),[§4](https://arxiv.org/html/2606.07805#S4.p1.1)\.
- Supreme People’s Procuratorate of the People’s Republic of China \(2021\)Personal information protection law of the people’s republic of china\.Note:Official English text \(web publication\)Accessed: 2026\-02\-08External Links:[Link](https://en.spp.gov.cn/2021-12/29/c_948419.htm)Cited by:[item 2](https://arxiv.org/html/2606.07805#S4.I2.i2.p1.1),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p1.1)\.
- The MITRE Corporation \(2024\)Common weakness enumeration \(cwe\)\.Note:cwe\.mitre\.orgExternal Links:[Link](https://cwe.mitre.org/)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p1.1)\.
- X\. Wang, B\. Li,et al\.\(2024\)OpenHands: an open platform for ai software developers as generalist agents\.External Links:2407\.16741,[Link](https://arxiv.org/abs/2407.16741)Cited by:[§1](https://arxiv.org/html/2606.07805#S1.p1.1)\.
- A\. H\. Washoet al\.\(2021\)An interdisciplinary view of social engineering: a call to action\.Computers in Human Behavior Reports\.External Links:[Link](https://www.sciencedirect.com/science/article/pii/S2451958821000749)Cited by:[item2\.](https://arxiv.org/html/2606.07805#S3.I1.i2.p1.1),[1st item](https://arxiv.org/html/2606.07805#S3.I4.i1.p1.1),[3rd item](https://arxiv.org/html/2606.07805#S3.I4.i3.p1.1),[4th item](https://arxiv.org/html/2606.07805#S3.I4.i4.p1.1),[§3\.3](https://arxiv.org/html/2606.07805#S3.SS3.p1.1)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Zhang, J\. Liu, A\. H\. Awadallah, R\. W\. White, D\. Burger, and C\. Wang \(2023\)AutoGen: enabling next\-gen LLM applications via multi\-agent conversation\.Note:Accessed: 2026\-02\-08External Links:2308\.08155,[Link](https://arxiv.org/abs/2308.08155)Cited by:[§1](https://arxiv.org/html/2606.07805#S1.p1.1),[1st item](https://arxiv.org/html/2606.07805#S4.I1.i1.p1.1),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p3.1),[§4\.1](https://arxiv.org/html/2606.07805#S4.SS1.p4.1)\.
- B\. Xu \(2026\)AI agent systems: architectures, applications, and evaluation\.External Links:2601\.01743,[Link](https://arxiv.org/abs/2601.01743)Cited by:[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p1.2),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p1.3)\.
- C\. Xuet al\.\(2024a\)Benchmark data contamination of large language models\.External Links:2406\.04244,[Link](https://arxiv.org/abs/2406.04244)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p2.1)\.
- C\. Xuet al\.\(2024b\)Benchmark data contamination of large language models\.arXiv preprint arXiv:2406\.04244\.External Links:[Link](https://arxiv.org/abs/2406.04244)Cited by:[3rd item](https://arxiv.org/html/2606.07805#S3.I5.i3.p1.1),[§3\.1](https://arxiv.org/html/2606.07805#S3.SS1.p1.1),[§3\.2](https://arxiv.org/html/2606.07805#S3.SS2.SSS0.Px1.p2.1),[§3](https://arxiv.org/html/2606.07805#S3.p1.3)\.
- Q\. Xuet al\.\(2023\)On the tool manipulation capability of open\-source large language models\.arXiv preprint arXiv:2305\.16504\.External Links:[Link](https://arxiv.org/abs/2305.16504)Cited by:[§3\.1](https://arxiv.org/html/2606.07805#S3.SS1.p1.1)\.
- S\. Yaoet al\.\(2024\)τ\-Bench: a benchmark for tool\-agent\-user interaction in real\-world domains\.External Links:2406\.12045,[Link](https://arxiv.org/abs/2406.12045)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p1.1),[§2](https://arxiv.org/html/2606.07805#S2.p2.1),[§2](https://arxiv.org/html/2606.07805#S2.p6.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2022\)ReAct: synergizing reasoning and acting in language models\.Note:Accessed: 2026\-02\-08External Links:2210\.03629,[Link](https://arxiv.org/abs/2210.03629)Cited by:[3rd item](https://arxiv.org/html/2606.07805#S4.I1.i3.p1.1)\.
- D\. L\. Young, A\. S\. Goodie, and A\. Hall \(2012\)Decision making under time pressure, modeled in a prospect theory framework\.Organizational Behavior and Human Decision Processes\.External Links:[Link](https://www.sciencedirect.com/science/article/abs/pii/S0749597812000404)Cited by:[2nd item](https://arxiv.org/html/2606.07805#S3.I4.i2.p1.1),[§3\.3](https://arxiv.org/html/2606.07805#S3.SS3.p3.1)\.
- Z\. Zhang, S\. Cui, Y\. Lu, J\. Zhou, J\. Yang, H\. Wang, and M\. Huang \(2024\)Agent\-safetybench: evaluating the safety of llm agents\.External Links:2412\.14470,[Link](https://arxiv.org/abs/2412.14470)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p2.1),[§2](https://arxiv.org/html/2606.07805#S2.p4.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:[Link](https://arxiv.org/abs/2306.05685)Cited by:[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p2.5)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig \(2023\)WebArena: a realistic web environment for building autonomous agents\.External Links:2307\.13854,[Link](https://arxiv.org/abs/2307.13854)Cited by:[§1](https://arxiv.org/html/2606.07805#S1.p1.1),[item3\.](https://arxiv.org/html/2606.07805#S3.I1.i3.p1.1),[3rd item](https://arxiv.org/html/2606.07805#S3.I6.i3.p1.1),[§3\.1](https://arxiv.org/html/2606.07805#S3.SS1.p1.1),[§3\.4](https://arxiv.org/html/2606.07805#S3.SS4.p1.1),[§3\.4](https://arxiv.org/html/2606.07805#S3.SS4.p4.1),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.SSS0.Px1.p1.2),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.SSS0.Px1.p1.3),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p1.3),[§3\.5](https://arxiv.org/html/2606.07805#S3.SS5.p3.1),[§3](https://arxiv.org/html/2606.07805#S3.p1.3)\.
- Q\. Zhuet al\.\(2024\)Reusing leaked benchmarks for large language model evaluation\.InFindings of EMNLP,External Links:[Link](https://aclanthology.org/2024.findings-emnlp.532/)Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p2.1)\.
- M\. Zou, J\. Chen, A\. Luo, J\. Dai, C\. Zhang, D\. Sun, and Z\. Xu \(2026\)FinEvo: from isolated backtests to ecological market games for multi\-agent financial strategy evolution\.CoRRabs/2602\.00948\.External Links:[Link](https://doi.org/10.48550/arXiv.2602.00948),[Document](https://dx.doi.org/10.48550/ARXIV.2602.00948),2602\.00948Cited by:[§2](https://arxiv.org/html/2606.07805#S2.p1.1)\.
