
# Randomness is sometimes necessary for coordination
Source: [https://arxiv.org/html/2605.06825](https://arxiv.org/html/2605.06825)
Rohan Patil∗, Jai Malegaonkar, Henrik I. Christensen
Department of Computer Science and Engineering, University of California San Diego, San Diego, CA 92093
{rpatil, jmalegaonkar, hichristensen}@ucsd.edu

###### Abstract

Full parameter sharing is standard in cooperative multi-agent reinforcement learning (MARL) for homogeneous agents. Under permutation-symmetric observations, however, a shared deterministic policy outputs identical action distributions for every agent, making role differentiation impossible. This failure can theoretically be resolved using symmetry breaking among anonymous identical processors, which requires randomness. We propose Diamond Attention, a cross-attention architecture in which each agent samples a scalar random number per timestep, inducing a transient rank ordering that masks lower-ranked peers from agent-to-agent attention while leaving task attention fully unmasked. This realizes a random-bit coordination protocol in a single broadcast round, and the set-based attention enables zero-shot deployment to teams of different sizes. We evaluate across three regimes that isolate when structured randomness matters. On the perfectly symmetric XOR game, our method achieves $1.0$ success while all deterministic baselines plateau near $0.5$. On control coordination tasks, a policy trained on $N=4$ generalizes zero-shot to $N\in[2,8]$. On SMACLite cross-scenario transfer, we achieve zero-shot transfer where standard baselines cannot transfer due to structural limitations. Furthermore, replacing the structured mask with standard dropout-based randomness results in a 0% win rate, confirming that protocol-space structure, not stochastic noise, is the operative ingredient. [https://anonymous.4open.science/r/randomness-137A/](https://anonymous.4open.science/r/randomness-137A/)

## 1 Introduction

A widely adopted design choice in cooperative multi-agent reinforcement learning (MARL) is full parameter sharing, where all homogeneous agents execute copies of a single learned policy (Gupta et al., [2017](https://arxiv.org/html/2605.06825#bib.bib9); Terry et al., [2020](https://arxiv.org/html/2605.06825#bib.bib26)). Combined with the centralized training and decentralized execution (CTDE) paradigm (Lowe et al., [2017](https://arxiv.org/html/2605.06825#bib.bib15)) and value decomposition methods such as VDN (Sunehag et al., [2017](https://arxiv.org/html/2605.06825#bib.bib25)) and QMIX (Rashid et al., [2020](https://arxiv.org/html/2605.06825#bib.bib21)), this approach reduces sample complexity, simplifies training infrastructure, and scales naturally with team size. It has become the default substrate for cooperative MARL (Sunehag et al., [2017](https://arxiv.org/html/2605.06825#bib.bib25); Rashid et al., [2020](https://arxiv.org/html/2605.06825#bib.bib21); Gronauer & Diepold, [2022](https://arxiv.org/html/2605.06825#bib.bib8); Oroojlooy & Hajinezhad, [2023](https://arxiv.org/html/2605.06825#bib.bib18)). This default fails in tasks with multi-modal reward structures, where multiple equally optimal joint strategies exist. Under such conditions, all agents produce identical action distributions when observations are structurally identical, making role differentiation impossible. Fu et al. ([2022](https://arxiv.org/html/2605.06825#bib.bib7)) use the XOR game—where two agents must choose opposite actions to receive any reward—to show that a shared deterministic policy outputs the same action for both agents, guaranteeing zero reward regardless of training duration, illustrating that this is not an empirical pathology that better optimization can resolve.

Angluin ([1980](https://arxiv.org/html/2605.06825#bib.bib1)) proved that symmetry breaking among anonymous processors is impossible deterministically; Case et al. ([2005](https://arxiv.org/html/2605.06825#bib.bib4)) extended this to cooperative games, establishing that access to shared random bits is *necessary*, not merely beneficial, for coordination when agents lack unique identifiers. Section [2](https://arxiv.org/html/2605.06825#S2) develops these theoretical foundations and surveys how prior work falls short on each axis.

Existing approaches address this problem on two axes, but no prior method satisfies both simultaneously. Methods that modify parameter sharing or optimization dynamics improve coordination in symmetric settings but remain deterministic at execution time, failing the theoretical randomness requirement. Methods that achieve symmetry breaking through sequential execution resolve the symmetry problem but require $\mathcal{O}(N)$ inference rounds and an externally imposed execution order—which, in a truly decentralized homogeneous setting, is itself the symmetry-breaking problem.

We operationalize the theoretical prescription of Case et al. ([2005](https://arxiv.org/html/2605.06825#bib.bib4)) in a practical MARL architecture. Our proposed *Diamond Attention* is a cross-attention mechanism between agent and task embeddings that incorporates structured random masking. At each timestep, every agent samples a scalar random number and shares it with the team via a single broadcast round. These scalars induce a strict rank ordering over agents, generating asymmetric attention masks over the agent dimension: each agent masks all agents ranked below it and attends only to agents at equal or higher rank. The mask applies only to agent-to-agent attention; task attention remains fully unmasked. The resulting asymmetry creates a dynamic, per-timestep hierarchy in which high-ranked agents attend to few peers and act largely independently, while low-ranked agents condition their behavior on what higher-ranked agents are doing. Because Diamond Attention operates over sets of agent and task embeddings, the architecture accepts any number of agents and any number of tasks without modification: policies trained on one team size generalize zero-shot to others as an inherent property of computing attention over variable-length sequences.

Our contributions are as follows:

- **Theory.** We formalize the equivalence between XOR coordination and random-bit sharing, providing the theoretical bridge that motivates each architectural component of Diamond Attention and grounds the structured mask in the necessity result of Case et al. (2005).
- **Architecture.** We propose Diamond Attention, which realizes this protocol via structured random masking in a single broadcast round while retaining the set-based scalability of standard cross-attention. The architecture requires no sequential execution and no external agent identifiers.
- **Empirical validation.** We validate across three regimes: Diamond Attention is the only approach to achieve $1.0$ success on the XOR game where all deterministic baselines plateau at the random-action floor; a policy trained on $N=4$ generalizes zero-shot to $N\in[2,8]$ on VMAS continuous coordination tasks; and the architecture achieves zero-shot transfer from easier to harder SMACLite scenarios where all baselines and ablations fail entirely.

## 2 Related Work

Parameter sharing (PS) has become the dominant paradigm in cooperative MARL (Gupta et al., [2017](https://arxiv.org/html/2605.06825#bib.bib9); Terry et al., [2020](https://arxiv.org/html/2605.06825#bib.bib26)), especially when combined with the CTDE framework (Lowe et al., [2017](https://arxiv.org/html/2605.06825#bib.bib15)) and value decomposition methods such as VDN (Sunehag et al., [2017](https://arxiv.org/html/2605.06825#bib.bib25)) and QMIX (Rashid et al., [2020](https://arxiv.org/html/2605.06825#bib.bib21)). Under symmetric observations a shared deterministic policy produces identical outputs for all agents, and gradient updates that improve one agent's strategy identically affect all others (Fu et al., [2022](https://arxiv.org/html/2605.06825#bib.bib7)), trapping the system in symmetric equilibria. Recent work mitigates this while preserving PS efficiency: Kaleidoscope (Li et al., [2024](https://arxiv.org/html/2605.06825#bib.bib12)) introduces learnable sparse masks to induce per-agent heterogeneity at training time, GradPS (Qin et al., [2025](https://arxiv.org/html/2605.06825#bib.bib19)) resolves gradient conflicts that arise from opposing update signals, and pH-MARL (Sebastián et al., [2025](https://arxiv.org/html/2605.06825#bib.bib23)) leverages port-Hamiltonian geometric priors to enforce valid distributed coordination structures. However, Kaleidoscope's learned masks are deterministic once trained; GradPS resolves gradient conflicts during optimization but produces no execution-time differentiation between agents; and pH-MARL's geometric priors enforce coordination structure without introducing the stochasticity that theory demands. These methods address symptoms of the symmetry problem rather than its theoretical root.

A related challenge is zero-shot scalability: deploying a trained policy to teams of varying size without retraining (Liu et al., [2024](https://arxiv.org/html/2605.06825#bib.bib13)). UPDeT (Hu et al., [2021](https://arxiv.org/html/2605.06825#bib.bib10)) achieves scalability by treating each agent's observation as part of a variable-length sequence, but agents act simultaneously from independent observations, leaving symmetry breaking unaddressed in multi-modal reward settings. The autoregressive models MAT (Wen et al., [2022](https://arxiv.org/html/2605.06825#bib.bib28)) and Sable (Mahjoub et al., [2024](https://arxiv.org/html/2605.06825#bib.bib16)) achieve symmetry breaking by construction through sequential execution, but at the cost of $\mathcal{O}(N)$ inference latency and an externally imposed execution order—establishing that order requires a coordination mechanism, and in a truly decentralized homogeneous setting, agreeing on who executes first is equivalent to solving the symmetry-breaking problem, making the approach circularly dependent on an assumption it cannot supply. We make no claim of superior coordination quality where a fixed ordering is externally available; our contribution is orthogonal, addressing the gap that UPDeT and MAT each leave open on a different axis.

The theoretical underpinnings of our approach trace to distributed computing. Angluin ([1980](https://arxiv.org/html/2605.06825#bib.bib1)) proved that symmetry breaking in anonymous networks is impossible deterministically. Fischer et al. ([1985](https://arxiv.org/html/2605.06825#bib.bib6)) established that consensus is unattainable in asynchronous systems even with a single fault, reinforcing that deterministic coordination protocols are fragile even under mild adversarial conditions. Case et al. ([2005](https://arxiv.org/html/2605.06825#bib.bib4)) showed that shared random bits resolve this impossibility for cooperative games, making coordination achievable with bounded failure probability. The MP-MAB literature (Liu & Zhao, [2010](https://arxiv.org/html/2605.06825#bib.bib14); Shi & Shen, [2021](https://arxiv.org/html/2605.06825#bib.bib24)) reaches the same conclusion through collision models that are structurally identical to coordination failures in multi-modal MARL: agents assigned to the same arm and agents selecting the same action face the same orthogonalization problem, and in both cases deterministic policies cannot escape.

Attention-based methods such as UPDeT achieve zero-shot scalability over variable team sizes but leave symmetry breaking unaddressed in multi-modal reward settings. Autoregressive methods such as MAT and Sable achieve symmetry breaking by construction but require $\mathcal{O}(N)$ sequential rounds and an externally imposed execution order. Diamond Attention addresses the intersection: structured randomness, grounded in the theoretical necessity established by Case et al. ([2005](https://arxiv.org/html/2605.06825#bib.bib4)), enables coordination in a single broadcast round while retaining the set-based scalability of standard cross-attention. No prior MARL architecture has realized this protocol as an architectural primitive.

## 3 Motivation & Model Architecture

We begin by formalizing the core limitation that motivates our approach.

###### Definition 3.1 (Symmetry Breaking).

In a cooperative multi-agent system with $n$ homogeneous agents sharing a single policy $\pi_{\theta}$, *symmetry breaking* is the ability of agents to produce differentiated action distributions despite receiving structurally identical observations. Formally, if the observations $o_i$ and $o_j$ of agents $i$ and $j$ are equal under permutation of agent indices, then $\pi_{\theta}(o_i)=\pi_{\theta}(o_j)$ by parameter sharing, so differentiated behavior must arise from differentiated internal state rather than from the policy itself.

We next detail the XOR game, which provides the theoretical grounding for our architecture, and derive how each component of the resulting protocol is realized in the Diamond Attention mechanism.

### 3.1 XOR Game

The XOR game (Fu et al., [2022](https://arxiv.org/html/2605.06825#bib.bib7)) is a single-step cooperative game where two players each select from two actions; both receive reward $1$ if their actions differ, and $0$ otherwise ([Table 1](https://arxiv.org/html/2605.06825#S3.T1)). Its generalization to $n$ players and $k$ actions ($n\leq k$) yields reward only when all players select distinct actions, mirroring collision models in the multi-agent multi-armed bandit literature (Liu & Zhao, [2010](https://arxiv.org/html/2605.06825#bib.bib14)).

Table 1: Payoff matrix for the 2-player XOR game.

| | $a_2=0$ | $a_2=1$ |
| --- | --- | --- |
| $a_1=0$ | 0 | 1 |
| $a_1=1$ | 1 | 0 |

While an autoregressive approach can solve XOR by having agents act sequentially and condition on predecessors' actions (Fu et al., [2022](https://arxiv.org/html/2605.06825#bib.bib7)), the required communication rounds grow linearly with team size. More fundamentally, in a truly decentralized homogeneous setting, establishing the execution order itself requires random bits, which is equivalent to solving the generalized XOR game.
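For concreteness, a minimal sketch of the generalized payoff rule described above (the function name is our own, not taken from the released code):

```python
import numpy as np

def xor_reward(actions: np.ndarray) -> float:
    """Generalized XOR payoff: all n players must select distinct
    actions (out of k >= n) for the team to receive any reward."""
    return 1.0 if len(set(actions.tolist())) == len(actions) else 0.0

# Recovers Table 1 for the 2-player game: reward 1 iff actions differ.
assert xor_reward(np.array([0, 1])) == 1.0
assert xor_reward(np.array([1, 1])) == 0.0
```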

Case et al. ([2005](https://arxiv.org/html/2605.06825#bib.bib4)) demonstrate that coordination games like XOR can be solved in homogeneous settings by sharing random bits in a single communication round: each player generates a string of bits, and with bounded probability all strings are unique, enabling coordination through any fixed total ordering on bit strings. Without randomness, agents provably fail to obtain the optimal payoff in a truly decentralized setting. To formalize this, we model players as machines aligned with the Inexhaustible Interactive Turing Machine framework (Küsters et al., [2013](https://arxiv.org/html/2605.06825#bib.bib11)). In our formulation, machines do not engage in point-to-point communication, as they do in Küsters et al. ([2013](https://arxiv.org/html/2605.06825#bib.bib11)), but rather broadcast messages. While the source of a broadcast is indistinguishable, agents can determine how many machines are broadcasting the same input—realizable through a frequency-modulated receiver where signal strength indicates the number of broadcasting machines.
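A minimal simulation of this single-round protocol (function name and the integer ordering are our own choices; any fixed total ordering over bit strings works equally well):

```python
import numpy as np

def random_bit_round(n: int, k: int, rng: np.random.Generator):
    """One broadcast round of the Case et al. (2005) protocol: each of
    n anonymous players samples k bits and broadcasts them; a fixed
    total ordering on the bit strings then assigns distinct actions.
    Returns None when two strings collide (coordination fails)."""
    strings = rng.integers(0, 2, size=(n, k))
    values = strings.dot(1 << np.arange(k))     # bit string -> integer
    if len(np.unique(values)) < n:              # collision: identical strings
        return None
    actions = np.empty(n, dtype=int)
    actions[np.argsort(values)] = np.arange(n)  # rank -> unique action
    return actions

rng = np.random.default_rng(0)
trials = [random_bit_round(3, 8, rng) for _ in range(10_000)]
print(sum(t is not None for t in trials) / len(trials))  # ~0.988 for k=8
```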

###### Definition 3.2 (Player).

A player is a machine initialized with a finite input that may execute four routines, transitioning between them based on the input: *Compute* (run any terminating Turing machine on the current tape, whose final state becomes the new input); *Broadcast* (transmit a portion of the input to other agents); *Receive* (append incoming broadcasts to the input); and *Sample* (append freshly sampled uniform random bits to the input). Each non-compute routine returns to Compute. The final input state on halt is the player's output.

###### Definition 3.3 (Homogeneity).

Two players $A$ and $B$ are *homogeneous* if, for any input $I$ and any string of random bits $r$ supplied to both players' Sample routines, $A$ and $B$ produce the same final output, both consume exactly the bits in $r$, and neither generates random bits beyond $r$.

It is trivial to see that homogeneity is transitive.

###### Theorem 3.4.

The probability that $n$ homogeneous players solve the generalized XOR game ($n$ players, $n$ actions) equals the probability of agreeing on an execution order when each player generates and shares $k$ bits.

###### Proof.

Deferred to [Appendix A](https://arxiv.org/html/2605.06825#A1).

### 3.2 From Theory to Model Architecture

[Theorem 3.4](https://arxiv.org/html/2605.06825#S3.Thmtheorem4) establishes that solving the generalized XOR game reduces to each player generating and sharing random bits to agree on an execution order. [Table 2](https://arxiv.org/html/2605.06825#S3.T2) maps each component of this theoretical protocol to its realization in Diamond Attention. The key insight is that attention masking serves as the mechanism by which random scalars induce differentiated behavior. Even though all agents share identical network parameters, different random numbers produce different effective attention patterns, yielding distinct latent representations and consequently distinct action distributions. This resolves the symmetry identified in [Definition 3.1](https://arxiv.org/html/2605.06825#S3.Thmtheorem1) at the attention level rather than through parameter differences.

Table 2: Mapping from the theoretical coordination protocol of [Theorem 3.4](https://arxiv.org/html/2605.06825#S3.Thmtheorem4) to the Diamond Attention architecture.

![Refer to caption](https://arxiv.org/html/2605.06825v1/x1.png)
Figure 1: Calculation of $V_A$ and $V_T$ from agent embeddings $A$ and task embeddings $T$. The weight matrix follows the standard attention mechanism; the matrix multiplication uses the weight matrix resulting from masking the calculated scores.

![Refer to caption](https://arxiv.org/html/2605.06825v1/x2.png)
Figure 2: Mask generation: each agent generates a random number. For each column, the system computes whether an agent's number exceeds others'; if so, the entry is masked. Hollow squares represent masked entries in the task × agent mask matrices.

We develop a cross-attention architecture that assigns distinct sub-goals to different agents. Our aim is an architecture that maps any number of agents to any number of tasks without modification. [Equation 1](https://arxiv.org/html/2605.06825#S3.E1) shows the standard attention mechanism, where $Q,K,V$ are the query, key, and value matrices and $d_k$ is the key dimension (Vaswani et al., [2017](https://arxiv.org/html/2605.06825#bib.bib27)). We employ cross-attention between agent embeddings and task embeddings, termed *Diamond Attention* ([Figure 2](https://arxiv.org/html/2605.06825#S3.F2)).

$$\text{Attention}(Q,K,V)=\text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \qquad (1)$$
![Refer to caption](https://arxiv.org/html/2605.06825v1/x3.png)
Figure 3: The complete architecture consists of three Diamond Attention blocks. The first computes attention between tasks and agents; agent-specific data constructs a fixed-length vector for action generation. Each *linear projection* consists of multiple linear layers with intermediate activation. The depicted mask is simplified to illustrate that an agent with a lower random number is masked. Note the mask usage during Diamond Attention between agent-weighted tasks and the agent-specific token.

#### Structured mask construction.

Each agent $i$ appends a scalar $r_i\sim\mathrm{Uniform}[0,1]$ to its observation at every timestep, sampled independently across agents and steps. These scalars induce a transient rank ordering: agent $i$ holds higher rank than agent $j$ iff $r_i > r_j$. Agent $i$ masks out its *lower-ranked* peers from its cross-attention over agent embeddings, while leaving attention over task embeddings fully unmasked. Formally, for the $k$-th key position in agent $i$'s attention:

$$\mathbf{M}_{i,k}=\begin{cases}0 & \text{if } r_k \geq r_i\\ -\infty & \text{otherwise.}\end{cases} \qquad (2)$$
Under standard parameter sharing without masking, all agents process identical network weights and, in symmetric configurations, produce identical action distributions. The structured mask breaks this symmetry at the attention level: agent $i$ with $r_i=0.7$ sees a different effective attention pattern than agent $j$ with $r_j=0.3$, because $i$'s mask suppresses attention to lower-ranked agents while $j$ attends to the full peer set. This produces differentiated latent representations, and consequently differentiated actions, even under fully shared parameters.
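A minimal PyTorch sketch of the mask in Eq. (2), assuming it is materialized as a dense agent-by-agent matrix (the task × agent masks of Figure 2 broadcast the same column pattern per agent):

```python
import torch

def structured_mask(r: torch.Tensor) -> torch.Tensor:
    """Attention mask of Eq. (2) from the broadcast scalars r ([N]).
    M[i, k] = 0 where r_k >= r_i (peer k stays visible to agent i),
    -inf otherwise, so agent i attends only to equal- or higher-ranked
    peers; the highest-ranked agent sees no peers but itself."""
    n = r.numel()
    mask = torch.full((n, n), float("-inf"))
    mask[r.unsqueeze(0) >= r.unsqueeze(1)] = 0.0  # entry (i, k): r_k >= r_i
    return mask

r = torch.rand(4)           # one Uniform[0,1] scalar per agent, per step
M = structured_mask(r)      # resampled (and rebuilt) every timestep
```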

[Figure 2](https://arxiv.org/html/2605.06825#S3.F2) illustrates mask generation for three agents and three tasks. (We refer to subtasks as tasks, considering only cases where subtasks can be created.) The mask is computed independently and in parallel for each agent, requiring no sequential communication. The agent with the lowest $r_i$ attends to all peers and all tasks; agents with progressively higher ranks attend to a shrinking peer neighborhood, with the highest-ranked agent attending to no peers at all, conditioning its action solely on task structure. The mask is asymmetric across agents—no two agents suppress the same subset of peers—which is the minimal structural condition required to break the homogeneity that causes coordination failures in XOR. Task attention is intentionally left unmasked so that every agent, regardless of rank, maintains full access to the goal structure.

#### Task embedding construction.

Task embeddings are constructed by projecting raw task features through a learned two-layer MLP with SiLU activation. In Simple Spread, each landmark's 2D relative position constitutes the task feature vector. In Food Collection, the relative positions of food items serve as task features. In SMACLite, enemy unit features (relative position, health, unit type) are projected into the task embedding space. Agent embeddings are constructed analogously from each agent's local observation (position, velocity) and other agents' relative positions and velocities. The architecture is scenario-agnostic: swapping the feature extractor suffices to apply Diamond Attention to a new domain.
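A sketch of such a feature projector, with an illustrative embedding width of our own choosing (the paper does not state dimensions in this section):

```python
import torch.nn as nn

class FeatureEmbed(nn.Module):
    """Two-layer MLP with SiLU that projects raw task (or agent)
    features into the shared embedding space described above."""
    def __init__(self, in_dim: int, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, x):      # x: [num_items, in_dim]
        return self.net(x)     # -> [num_items, embed_dim]

# e.g. Simple Spread: each landmark's 2-D relative position is one task.
task_encoder = FeatureEmbed(in_dim=2)
```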

#### Complete architecture.

Given agent embeddings $A$ and subtask embeddings $T$, we calculate two sets of weighted embeddings $V_A$ and $V_T$ as defined in [Equation 3](https://arxiv.org/html/2605.06825#S3.E3), where $W_A$ and $W_T$ are learned weight matrices, $M$ is the mask (masked entries $-\infty$, otherwise $0$), and $\oplus$ denotes concatenation.

$$V_A=\mathrm{softmax}\!\left(\frac{TA^{\top}}{\sqrt{d_k}}+M\right)(W_A A)\oplus A\,;\qquad V_T=\mathrm{softmax}\!\left(\frac{TA^{\top}}{\sqrt{d_k}}+M\right)^{\!\top}(W_T T)\oplus T \qquad (3)$$
The complete architecture ([Figure 3](https://arxiv.org/html/2605.06825#S3.F3)) comprises three Diamond Attention blocks. The first computes cross-attention between tasks and agents. Agent-specific data then constructs a fixed-length vector from which the final action is generated. This design accepts fixed-length embeddings per agent and task, rendering it independent of the total agent or task count.
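A sketch of one block following Eq. (3) as printed. Note that the residual concatenations ($\oplus A$, $\oplus T$) only type-check when the agent and task streams have equal length, so this sketch assumes equal counts; the released code may resolve the shapes differently:

```python
import torch
import torch.nn as nn

class DiamondAttention(nn.Module):
    """One Diamond Attention block per our reading of Eq. (3):
    masked cross-attention between task embeddings T and agent
    embeddings A, returning the concatenated value streams."""
    def __init__(self, d: int):
        super().__init__()
        self.W_A = nn.Linear(d, d, bias=False)  # learned W_A of Eq. (3)
        self.W_T = nn.Linear(d, d, bias=False)  # learned W_T of Eq. (3)
        self.scale = d ** -0.5

    def forward(self, T, A, M):
        # T: [n_tasks, d], A: [n_agents, d], M: [n_tasks, n_agents]
        weights = ((T @ A.transpose(0, 1)) * self.scale + M).softmax(-1)
        V_A = torch.cat([weights @ self.W_A(A), A], dim=-1)
        V_T = torch.cat([weights.transpose(0, 1) @ self.W_T(T), T], dim=-1)
        return V_A, V_T   # each [*, 2d], consumed by the next block
```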

## 4 Experiments

We evaluate Diamond Attention across three settings that isolate distinct coordination pathologies:

- **The XOR Game:** tests symmetry breaking in perfectly symmetric reward landscapes where deterministic parameter sharing theoretically fails.
- **Continuous Coordination (VMAS):** tests zero-shot generalization to variable agent counts without retraining, in both static and non-stationary tasks.
- **StarCraft Multi-Agent Challenge:** tests coordination against an active opponent and cross-scenario transfer with variable enemy counts.

#### Implementation details.

We modify the PPO (Schulman et al., [2017](https://arxiv.org/html/2605.06825#bib.bib22)) implementation from Stable-Baselines3 (Raffin et al., [2021](https://arxiv.org/html/2605.06825#bib.bib20)) to process observations from all agents and produce actions using a shared policy, as described in [Table 2](https://arxiv.org/html/2605.06825#S3.T2). We employ a common critic and train using the global team reward (Yu et al., [2022](https://arxiv.org/html/2605.06825#bib.bib29)). Baselines are described per experiment below.

### 4.1 XOR Game: Isolating Symmetry Breaking

In an $n$-player, $k$-action XOR game, homogeneous agents sharing a deterministic policy inevitably output identical action distributions, producing collisions that yield zero reward (Fu et al., [2022](https://arxiv.org/html/2605.06825#bib.bib7)). [Table 3](https://arxiv.org/html/2605.06825#S4.T3) reports success rates. As predicted by theory, deterministic baselines and the no-mask ablation collapse to the random-action floor: $0.5$ in the $n=k=2$ setting and $2/3^{2}\approx 0.22$ in the $n=2,k=3$ (train), $n=3,k=3$ (eval) setting. In contrast, our approach achieves a $1.0$ success rate across $n=k=2$ and $n=k=3$ when trained directly on those configurations, under both greedy ($\pi^{*}$) and stochastic ($\pi$) action selection. The structured random mask serves as an implicit rank-assignment mechanism, allowing agents to assume distinct roles without explicit communication. The $n=2,k=3\to n=3,k=3$ generalization column tells a different story: our method fails to generalize when trained on a strict subset of the action space and then deployed at full team size. We discuss this generalization brittleness, and its contrast with our generalization behavior in dynamic environments, in [Section 5](https://arxiv.org/html/2605.06825#S5). MAT likewise fails to capture the full structure of the game, as its greedy policy fails; sampling from its policy nonetheless yields a good success rate, indicating that part of the solution is retained in the action distribution.

Table 3: Average success rates on the XOR game. *Train* indicates the training configuration, *Eval* the evaluation configuration. $\pi^{*}$ uses greedy action selection, $\pi$ samples from the policy. Our method and MAT are the only approaches to achieve $1.0$ success on directly-trained configurations, but via fundamentally different mechanisms: MAT breaks symmetry through autoregressive decoding conditioned on a fixed agent ordering, while ours uses structured random masking that is agnostic to team size. The $n=2,k=3\to n=3,k=3$ column tests generalization to a larger team and is discussed in [Section 5](https://arxiv.org/html/2605.06825#S5).
### 4.2 Continuous Coordination: Isolating Scalability

We use the Vectorized Multi-Agent Simulator (VMAS) (Bettini et al., [2022](https://arxiv.org/html/2605.06825#bib.bib2)) to test zero-shot scalability of a policy trained on $N=4$ agents deployed on $N\in[2,8]$ without retraining. Two scenarios isolate different facets:

**Simple Spread (Static Topology):** A cooperative navigation task where $N$ agents must occupy $N$ landmarks in a 2D space $[-1,1]^2$. Tests spatial distribution strategies in a static environment.

**Food Collection (Dynamic Topology):** A custom foraging environment where $N_a$ agents collect $N_f$ food items that respawn at random locations on collection. Tests adaptation to non-stationary targets. Agents receive a reward of $+20$ per collection, with penalties for collisions.

#### Baselines and setup.

We compare against GSA and pH-MARL (Sebastián et al., [2025](https://arxiv.org/html/2605.06825#bib.bib23)), MAT (Wen et al., [2022](https://arxiv.org/html/2605.06825#bib.bib28)), MAPPO (Yu et al., [2022](https://arxiv.org/html/2605.06825#bib.bib29)), QMIX (Rashid et al., [2020](https://arxiv.org/html/2605.06825#bib.bib21)), IPPO (De Witt et al., [2020](https://arxiv.org/html/2605.06825#bib.bib5)), and MASAC (Lowe et al., [2017](https://arxiv.org/html/2605.06825#bib.bib15)) via BenchMARL (Bettini et al., [2024](https://arxiv.org/html/2605.06825#bib.bib3)), all trained on $N=4$. GSA and pH-MARL support variable-size evaluation; the remaining baselines do not. Evaluation runs over 400 timesteps across 64 seeds using the per-agent normalized cumulative reward $R=\frac{1}{N}\sum_{t=1}^{T}r(s,a)$ (Sebastián et al., [2025](https://arxiv.org/html/2605.06825#bib.bib23)); values are negative, *less negative is better*.

#### Scalability across team sizes.

[Figure 4(a)](https://arxiv.org/html/2605.06825#S4.F4.sf1) reports Simple Spread results. The monotonic reward decrease with $N$ is expected, as per-agent penalties accumulate with team size. Our method maintains near-optimal performance across the full range $N\in[2,8]$ from a single training run, with notably low variance throughout. GSA and pH-MARL degrade substantially at unseen team sizes, performing poorly even at small $N$ and worsening monotonically as $N$ grows—confirming that graph-based and physics-informed inductive biases do not substitute for the decoupling from agent count that set-based cross-attention provides. MAT appears only at $N=4$ for the architectural reasons discussed in [Section 2](https://arxiv.org/html/2605.06825#S2).

[Figure 4(b)](https://arxiv.org/html/2605.06825#S4.F4.sf2) shows Food Collection, where non-stationarity makes the task more challenging. Our method outperforms GSA and pH-MARL across all team sizes; both degrade sharply from $N=2$ onward, while our cross-attention mechanism remains stable across the full range $N\in[2,8]$. The fixed-team baselines MAPPO and IPPO are competitive at the training configuration $N=4$ but cannot be deployed elsewhere. The *Ours w/o mask* ablation performs comparably to the full model here—a predicted consequence of our analysis, not a failure mode. When the environment supplies sufficient asymmetry through observation variance (stochastic food respawning differentiates agents at each step), the protocol-space mask is redundant within distribution. The mask's contribution becomes essential precisely when that environmental signal shifts at deployment, as the SMAC zero-shot results in [Section 4.3](https://arxiv.org/html/2605.06825#S4.SS3) confirm. [Appendix C](https://arxiv.org/html/2605.06825#A3) shows performance across $N_a$ and $N_f$.

![Refer to caption](https://arxiv.org/html/2605.06825v1/x4.png)
(a) Simple Spread
![Refer to caption](https://arxiv.org/html/2605.06825v1/x5.png)
(b) Food Collection

Figure 4: Mean and standard deviation of normalized cumulative reward for Simple Spread (a) and Food Collection (b) (less negative is better; "better" arrow on left axis). Methods with fixed-size architectures (MAPPO, QMIX, IPPO, MASAC, MAT) appear only at the training configuration $N=4$, since they cannot be deployed at other team sizes without retraining. Methods with variable-size architectures (GSA, pH-MARL, Ours, Ours w/o Mask, Ours Dropout) are evaluated across $N\in[2,8]$. Our method maintains robust performance across the full range despite training only on $N=4$.

### 4.3 SMAC: Isolating Adversarial Robustness

We employ SMACLite (Michalski et al., [2023](https://arxiv.org/html/2605.06825#bib.bib17)) to test coordination in a high-dimensional adversarial environment with rigid action spaces that typically prevent transfer. Unlike VMAS, SMAC introduces an active opponent (the built-in AI), and the discrete action space depends on the number of enemy units, making cross-scenario transfer non-trivial.

#### Architectural modification for transfer.

We train on the 2s3z scenario and evaluate zero-shot on 3s5z. To enable transfer across action spaces, we modify the architecture's final output layer: the final Diamond Attention block is replaced with two linear layers sandwiching an activation, projecting agent embeddings into opponent-dependent logits. This allows the model to handle the changing action space (where actions depend on the number of enemies). We do not change the size of the model. Given that the task is adversarial, we hypothesize that the degradation on the training task arises because the model with structured randomness is learning a more general solution.
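A hedged sketch of one way such a head could be realized. The pairwise agent-enemy scoring, so that the number of attack logits tracks the enemy count, is our reading of "opponent-dependent logits"; how the released code combines these with fixed move/stop logits is an assumption:

```python
import torch
import torch.nn as nn

class TransferHead(nn.Module):
    """Two linear layers sandwiching an activation that score each
    (agent, enemy) embedding pair; the hidden width is illustrative."""
    def __init__(self, d: int, hidden: int = 64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, agent: torch.Tensor, enemies: torch.Tensor):
        # agent: [d], enemies: [n_enemies, d] -> logits: [n_enemies]
        pairs = torch.cat(
            [agent.expand(enemies.shape[0], -1), enemies], dim=-1)
        return self.score(pairs).squeeze(-1)
```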

| Scenario | MAPPO | IPPO | QMIX | MAT | Ours | Ours w/o mask | Ours (dropout) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2s3z (Train) | 1.000 | 1.000 | 1.000 | 1.000 | 0.832 | 1.000 | 1.000 |
| 3s5z (Zero-shot) | – | – | – | – | 0.497 | 0.000 | 0.000 |

Table 4: Win rates on SMAC scenarios. Standard methods achieve $100\%$ on the training task but cannot transfer to 3s5z due to fixed-head action-space constraints. Our method is capable of zero-shot transfer ($49.7\%$); removing the structured mask collapses transfer to $0.0\%$, and replacing the mask with dropout produces the same collapse.

#### Results.

[Table 4](https://arxiv.org/html/2605.06825#S4.T4) reports win rates. The deterministic baselines (MAPPO, IPPO, QMIX) achieve perfect performance on 2s3z but cannot transfer to 3s5z. Our method achieves $49.7\%$ zero-shot transfer. Critically, the two ablations isolate *what* provides this transferability: removing the mask entirely (*Ours w/o mask*) drops transfer to $0\%$; replacing the mask with dropout does likewise, showing that unstructured stochasticity is also insufficient. Only the *structured* mask survives the scenario shift, establishing that what enables transferable coordination is not noise but the specific protocol-space structure of the random rank ordering.

## 5 Discussion, Limitations, and Impact

The findings in this paper rest on a single distinction: coordination among homogeneous agents requires asymmetry, and that asymmetry can be supplied either by the environment or by the protocol itself. The two are interchangeable within distribution but not across it. Environmental asymmetry—spatial variance, opponent-induced state shifts, observation noise—is sufficient when training and deployment statistics align, and our no-mask ablation on Food Collection shows this explicitly. Protocol-space asymmetry, in the form of structured random rank ordering, is what survives when those statistics shift, as the SMAC zero-shot transfer results demonstrate. Diamond Attention is the architectural realization of this protocol-space asymmetry, requiring only a single broadcast round and remaining decoupled from the specific number of agents in deployment. The contrast between dropout and structured masking on SMAC transfer ($0\%$ vs. $49.7\%$) sharpens the claim: it is not noise that enables coordination across distribution shift, but the specific structure of the random ordering.

#### What the ablations isolate.

*Ours w/o mask* matches the full method on Food Collection ([Figure 4(b)](https://arxiv.org/html/2605.06825#S4.F4.sf2)) but fails on XOR ([Table 3](https://arxiv.org/html/2605.06825#S4.T3)) and collapses to $0\%$ transfer on SMAC ([Table 4](https://arxiv.org/html/2605.06825#S4.T4)). The dropout ablation's $0\%$ confirms that the operative ingredient is the protocol-space ordering, not stochasticity itself: neither a deterministic policy nor unstructured noise suffices when no environmental symmetry-breaking signal is available, or when that signal shifts at deployment.

#### Coordination capacity vs. learning capacity.

Our XOR results empirically validate the theoretical bounds of Case et al. ([2005](https://arxiv.org/html/2605.06825#bib.bib4)): deterministic architectures cannot exceed the random-action floor in symmetric coordination tasks, and structured randomness lifts it. A separate question is whether the architecture can *learn* to coordinate at scale. In an $n$-player, $n$-action XOR setting the probability of non-zero reward under random play is $n!/n^n$, which vanishes rapidly; at $n=5$ training does not converge even with reward shaping. Given MAT's performance, this is a learning bottleneck as much as an architectural one. The same architecture that solves $n=k=2$ and $n=k=3$ generalizes effectively to up to eight agents in VMAS and SMAC, where scenario complexity—spatial proximities, continuous feedback, observable opponent positions—provides the richer signals PPO can exploit. The coordination protocol works in both regimes; the difference is whether PPO can find it.
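The vanishing exploration signal is easy to see numerically:

```python
from math import factorial

# P(non-zero reward under uniform random play) = n!/n^n
for n in range(2, 8):
    print(n, round(factorial(n) / n**n, 4))
# 2 0.5, 3 0.2222, 4 0.0938, 5 0.0384, 6 0.0154, 7 0.0061
```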

#### Deployment.

Zero-shot generalization to varying agent counts implies that a single trained policy can be deployed across fleets of fluctuating size without retraining. Because coordination does not rely on synchronized global communication at execution time, the system tolerates node failures and communication latency in ways that fixed-permutation autoregressive methods cannot.

#### Limitations.

Three limitations are worth naming. First, the SMAC transfer result requires replacing the final Diamond Attention block with a linear projection to handle variable action spaces, trading representational capacity on the training task ($0.832$ on 2s3z against $1.000$ for fixed-head baselines) for the ability to generalize at all. The interplay between structured randomness and the tradeoff between generalization and model capacity also warrants further study. Adaptive projection layers that preserve coordination capacity under variable action spaces are a natural next step. Second, the architecture's ability to learn coordination at scale is bottlenecked by reward sparsity under PPO: convergence fails for $n>5$ even with reward shaping, and off-policy training or curriculum approaches may circumvent this. Third, the protocol assumes a broadcast model in which agents share scalar values within a single timestep; strict point-to-point deployments would require an explicit consensus layer, which would in turn require its own coordination mechanism.

#### Impact.

This work contributes a coordination primitive whose operative state is generated internally and stochastically per step rather than read from environmental signals. Decentralized coordination without reliance on global synchronization or fixed agent identities offers value in time-critical deployments where infrastructure is unreliable—disaster response, search-and-rescue—and the internally-sampled coordination state is harder for adversaries to predict or jam than schemes that rely on observable environmental cues. The same properties carry real dual-use risk: the resilience that benefits civilian applications also makes the approach attractive for autonomous adversarial swarms or military systems operating under contested communication. We do not see a clean mitigation here, as the underlying mechanism is general.

## References

- Angluin (1980) Angluin, D. Local and global properties in networks of processors. In *Proceedings of the Twelfth Annual ACM Symposium on Theory of Computing*, pp. 82–93, 1980.
- Bettini et al. (2022) Bettini, M., Kortvelesy, R., Blumenkamp, J., and Prorok, A. VMAS: A vectorized multi-agent simulator for collective robot learning. In *International Symposium on Distributed Autonomous Robotic Systems*, pp. 42–56. Springer, 2022.
- Bettini et al. (2024) Bettini, M., Prorok, A., and Moens, V. BenchMARL: Benchmarking multi-agent reinforcement learning. *Journal of Machine Learning Research*, 25(217):1–10, 2024.
- Case et al. (2005) Case, J., Jain, S., Montagna, F., Simi, G., and Sorbi, A. On learning to coordinate: Random bits help, insightful normal forms, and competency isomorphisms. *Journal of Computer and System Sciences*, 71(3):308–332, 2005.
- De Witt et al. (2020) De Witt, C. S., Gupta, T., Makoviichuk, D., Makoviychuk, V., Torr, P. H., Sun, M., and Whiteson, S. Is independent learning all you need in the StarCraft multi-agent challenge? *arXiv preprint arXiv:2011.09533*, 2020.
- Fischer et al. (1985) Fischer, M. J., Lynch, N. A., and Paterson, M. S. Impossibility of distributed consensus with one faulty process. *Journal of the ACM (JACM)*, 32(2):374–382, 1985.
- Fu et al. (2022) Fu, W., Yu, C., Xu, Z., Yang, J., and Wu, Y. Revisiting some common practices in cooperative multi-agent reinforcement learning. In *International Conference on Machine Learning*, pp. 6863–6877. PMLR, 2022.
- Gronauer & Diepold (2022) Gronauer, S. and Diepold, K. Multi-agent deep reinforcement learning: a survey. *Artificial Intelligence Review*, 55(2):895–943, 2022.
- Gupta et al. (2017) Gupta, J. K., Egorov, M., and Kochenderfer, M. Cooperative multi-agent control using deep reinforcement learning. *Autonomous Agents and Multi-Agent Systems*, pp. 66–83, 2017.
- Hu et al. (2021) Hu, S., Zhu, F., Chang, X., and Liang, X. UPDeT: Universal multi-agent reinforcement learning via policy decoupling with transformers. In *International Conference on Learning Representations*, 2021.
- Küsters et al. (2013) Küsters, R., Tuengerthal, M., and Rausch, D. The IITM model: a simple and expressive model for universal composability. *Cryptology ePrint Archive*, 2013.
- Li et al. (2024) Li, X., Pan, L., and Zhang, J. Kaleidoscope: Learnable masks for heterogeneous multi-agent reinforcement learning. In *Advances in Neural Information Processing Systems*, 2024.
- Liu et al. (2024) Liu, D., Ren, F., Yan, J., Su, G., Gu, W., and Kato, S. Scaling up multi-agent reinforcement learning: An extensive survey on scalability issues. *IEEE Access*, 12:94610–94631, 2024.
- Liu & Zhao (2010) Liu, K. and Zhao, Q. Distributed learning in multi-armed bandit with multiple players. *IEEE Transactions on Signal Processing*, 58(11):5667–5681, 2010.
- Lowe et al. (2017) Lowe, R., Wu, Y. I., Tamar, A., Harb, J., Pieter Abbeel, O., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. *Advances in Neural Information Processing Systems*, 30, 2017.
- Mahjoub et al. (2024) Mahjoub, O., Abramowitz, S., de Kock, R., Khlifi, W., Toit, S. d., Daniel, J., Nessir, L. B., Beyers, L., Formanek, C., Clark, L., et al. Sable: a performant, efficient and scalable sequence model for MARL. *arXiv preprint arXiv:2410.01706*, 2024.
- Michalski et al. (2023) Michalski, A., Christianos, F., and Albrecht, S. V. SMAClite: A lightweight environment for multi-agent reinforcement learning. *arXiv preprint arXiv:2305.05566*, 2023.
- Oroojlooy & Hajinezhad (2023) Oroojlooy, A. and Hajinezhad, D. A review of cooperative multi-agent deep reinforcement learning. *Applied Intelligence*, 53(11):13677–13722, 2023.
- Qin et al. (2025) Qin, H., Liu, Z., Lin, C., Ma, C., Mei, S., Shen, S., and Wang, C. GradPS: Resolving futile neurons in parameter sharing network for multi-agent reinforcement learning. In *Forty-Second International Conference on Machine Learning*, 2025.
- Raffin et al. (2021) Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., and Dormann, N. Stable-Baselines3: Reliable reinforcement learning implementations. *Journal of Machine Learning Research*, 22(268):1–8, 2021. URL [http://jmlr.org/papers/v22/20-1364.html](http://jmlr.org/papers/v22/20-1364.html).
- Rashid et al. (2020) Rashid, T., Samvelyan, M., De Witt, C. S., Farquhar, G., Foerster, J., and Whiteson, S. Monotonic value function factorisation for deep multi-agent reinforcement learning. *Journal of Machine Learning Research*, 21(178):1–51, 2020.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. *arXiv preprint arXiv:1707.06347*, 2017.
- Sebastián et al. (2025) Sebastián, E., Duong, T., Atanasov, N., Montijano, E., and Sagüés, C. Physics-informed multi-agent reinforcement learning for distributed multi-robot problems. *IEEE Transactions on Robotics*, 2025.
- Shi & Shen (2021) Shi, C. and Shen, C. Multi-player multi-armed bandits with collision-dependent reward distributions. *IEEE Transactions on Signal Processing*, 69:4385–4402, 2021.
- Sunehag et al. (2017) Sunehag, P., Lever, G., Gruslys, A., Czarnecki, W. M., Zambaldi, V., Jaderberg, M., Lanctot, M., Sonnerat, N., Leibo, J. Z., Tuyls, K., et al. Value-decomposition networks for cooperative multi-agent learning. *arXiv preprint arXiv:1706.05296*, 2017.
- Terry et al. (2020) Terry, J. K., Black, B., Jayakumar, M., Hari, A., Santos, L., Dieffendahl, C., Williams, N., Lokesh, Y., Horsch, C., and Ravi, P. Revisiting parameter sharing in multi-agent deep reinforcement learning. *arXiv preprint arXiv:2005.13625*, 2020.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. *Advances in Neural Information Processing Systems*, 30, 2017.
- Wen et al. (2022) Wen, M., Kuba, J., Lin, R., Zhang, W., Wen, Y., Wang, J., and Yang, Y. Multi-agent transformer. In *Advances in Neural Information Processing Systems*, volume 35, pp. 8107–8120, 2022.
- Yu et al. (2022) Yu, C., Velu, A., Vinitsky, E., Gao, J., Wang, Y., Bayen, A., and Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. *Advances in Neural Information Processing Systems*, 35:24611–24624, 2022.

## Appendix A Proof of Theorem

###### Proof.

In an $n$-homogeneous player system, each player can receive at most $(n-1)k$ bits, assuming each transmits $k$ bits.

Let $M$ be a player that plays the XOR game optimally. Assume at least one player transmits $k$ bits (otherwise consider a system transmitting $k-1$ bits, which is valid since players may use Routine 4 to transition to states transmitting fewer bits). Construct $M'$ that runs $M$ while tracking transmitted and received bits. If $M$ stops having transmitted fewer than $k$ bits, $M'$ broadcasts $0$s to reach $k$, ensuring $k$ bits are transmitted and an opportunity exists to receive $k$ bits from others. $M'$ is homogeneous to $M$ by definition.

Play the game with $n$ copies of $M'$, and let $k_1,\dots,k_n$ be the $k$-length bit strings transmitted. Let $\pi(k_i,k_1,\dots,k_n)$ denote player $i$'s action distribution. If any two strings are identical ($k_i=k_j$), the corresponding distributions are identical and a collision occurs. If all strings are distinct, $\pi$ can assign probability $1$ to a unique action per player. Since homogeneous agents differ only through random bits, the collision probability is minimized under uniform sampling, giving:

$$P(\text{No Collision}) \;\leq\; \frac{\binom{2^k}{n}\cdot n!}{2^{nk}}$$

Now consider $n$ homogeneous players each generating $k$ uniform random bits and sharing them, then using a fixed strict ordering over $k$-bit strings to determine execution order. Homogeneity ensures all players use the same ordering. A conflict arises only if two strings collide, and the probability all $n$ strings are distinct is exactly:

$$P(\text{Different Strings}) \;=\; \frac{\binom{2^k}{n}\cdot n!}{2^{nk}}$$

The two expressions are equal, establishing the equivalence. ∎
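This quantity is cheap to evaluate directly; a short worked check of the expression appearing on both sides of the equivalence:

```python
from math import comb, factorial

def p_distinct(n: int, k: int) -> float:
    """P(all n k-bit strings are distinct) = C(2^k, n) * n! / 2^(n*k)."""
    return comb(2**k, n) * factorial(n) / 2**(n * k)

print(p_distinct(3, 8))   # ~0.988: 8 shared bits already suffice for n=3
```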

## Appendix B Hyperparameters and Implementation details

All training was performed on an NVIDIA Titan X GPU.

Table 5: Comparison of general hyperparameters across all scripts.
## Appendix C Generalization in Food Collection

[Figure 5](https://arxiv.org/html/2605.06825#A3.F5) reports our method's performance across joint variations in agent count $N_a$ and food count $N_f$. Two trends emerge. For fixed $N_a$, normalized reward decreases as $N_f$ grows, an artifact of the reward function where the cumulative distance-to-food penalty scales linearly with $N_f$. For fixed $N_f$, $R$ improves with $N_a$ as a denser agent population covers the area more effectively, increasing the probability that a respawned food item appears near an agent. Reward variance also decreases as $N_a$ grows, indicating that the team learns a stable decentralized coverage strategy rather than relying on stochastic individual successes.

![Refer to caption](https://arxiv.org/html/2605.06825v1/x6.png)
Figure 5: Mean and standard deviation of $R$ for Food Collection across 64 runs, varying both agent count $N_a$ (x-axis) and food count $N_f$ (y-axis). For fixed $N_a$, $R$ degrades as $N_f$ grows due to per-food distance penalties. For fixed $N_f$, $R$ improves with $N_a$, and variance decreases, indicating learned decentralized coverage rather than stochastic individual success.
## Appendix D Attention Mechanism Visualizations

This appendix presents visual snapshots of the per-agent cross-attention weights in evaluation mode, offering deeper insight into the dynamic masking process.

Figures [6](https://arxiv.org/html/2605.06825#A4.F6) and [7](https://arxiv.org/html/2605.06825#A4.F7) illustrate the environment state alongside the cross-attention heatmaps for a 4-agent team in the Simple Spread and Food Collection scenarios, respectively, at different timesteps. The dynamic nature of the randomized structured mask is clearly visible: at any given timestep, the attention matrices are structurally asymmetric across the team. Agents assigned a higher transient rank by the masking protocol exhibit highly concentrated attention, effectively acting as temporary "leaders" for specific subtasks, while lower-ranked agents exhibit broader attention across their peers. As the random numbers are resampled, this fluid hierarchy shifts continuously throughout the episode, successfully breaking coordination symmetry without requiring explicit communication.

![Refer to caption](https://arxiv.org/html/2605.06825v1/assets/spread_snap.png)
Figure 6: Cross-attention snapshots for 4 agents in the Simple Spread scenario. The top row displays the spatial distribution of agents and landmarks at various timesteps. The subsequent rows display the corresponding attention weights for Agents 0 through 3. The heatmaps confirm the application of the structured mask, revealing the transient, asymmetric attention hierarchies that enable dynamic coordination.

![Refer to caption](https://arxiv.org/html/2605.06825v1/assets/food_snap.png)
Figure 7: Cross-attention snapshots for 4 agents in the Food Collection scenario. Similar to the Simple Spread task, the top row displays the spatial distribution of agents (blue) and food items (green), with the rows below showing the shifting, asymmetric attention weights that drive decentralized coordination over the course of the non-stationary episode.
