Beyond Partner Diversity: An Influence-Based Team Steering Framework for Zero-Shot Human-Machine Teaming

arXiv cs.AI Papers

Summary

This paper proposes Influence-Based Team Steering (IBTS), a framework for zero-shot human-machine teaming that uses influence shaping to discover diverse interaction patterns and steer trajectories toward stronger coordination. Experiments on Overcooked-AI with two-agent and three-agent settings, including a 30-subject human study, show IBTS improves team performance over baselines.

arXiv:2605.15400v1 Announce Type: new Abstract: While AI agents are rapidly advancing from isolated tools to interactive collaborators, data-driven human-machine teaming (HMT) methods remain costly in their reliance on human interaction data across domains, teammates, and team sizes. Zero-shot coordination (ZSC) addresses this bottleneck by simulating diverse partner populations to approximate how unseen partners might behave. However, partner coverage alone is insufficient as team settings scale and communication becomes degraded. To remedy this deficiency, we propose Influence-Based Team Steering (IBTS), a framework that uses influence shaping to incentivize agents to discover diverse, high-performing team interaction patterns and further steers ongoing trajectories toward stronger learned coordination modes. We assess IBTS on Overcooked-AI in both two-agent and three-agent settings, allowing us to test whether learned coordination structure transfers beyond dyadic interaction. Our evaluation includes simulated partners, synthetic partner-style variation, and, to our knowledge, the first 30-subject Overcooked-AI HMT study involving two real human teammates and one machine teammate. Across these evaluations, IBTS improves team performance against competing baselines, highlighting the need for scaled ZSC to combine sparse-reward coordination mechanisms with partner-variation coverage rather than relying on diversity alone.


# An Influence-Based Team Steering Framework for Zero-Shot Human-Machine Teaming
Source: [https://arxiv.org/html/2605.15400](https://arxiv.org/html/2605.15400)
Wei Sheng, Department of Computer Science, Purdue University, shengw@purdue\.edu; Rohan Paleja, Department of Computer Science, Purdue University, rpaleja@purdue\.edu

###### Abstract

While AI agents are rapidly advancing from isolated tools to interactive collaborators, data\-driven human\-machine teaming \(HMT\) methods remain costly in their reliance on human interaction data across domains, teammates, and team sizes\. Zero\-shot coordination \(ZSC\) addresses this bottleneck by simulating diverse partner populations to approximate how unseen partners might behave\. However, partner coverage alone is insufficient as team settings scale and communication becomes degraded\. To remedy this deficiency, we propose Influence\-Based Team Steering \(IBTS\), a framework that uses influence shaping to incentivize agents to discover diverse, high\-performing team interaction patterns and further steers ongoing trajectories toward stronger learned coordination modes\. We assess IBTS on Overcooked\-AI in both two\-agent and three\-agent settings, allowing us to test whether learned coordination structure transfers beyond dyadic interaction\. Our evaluation includes simulated partners, synthetic partner\-style variation, and, to our knowledge, the first 30\-subject Overcooked\-AI HMT study involving two real human teammates and one machine teammate\. Across these evaluations, IBTS improves team performance against competing baselines, highlighting the need for scaled ZSC to combine sparse\-reward coordination mechanisms with partner\-variation coverage rather than relying on diversity alone\.

## 1 Introduction

Recent surging investment in embodied machines for human\-proximate work, such as Apptronik’s Apollo \[[15](https://arxiv.org/html/2605.15400#bib.bib20)\], a humanoid robot designed to work alongside people and assist them with physically demanding tasks, underscores the growing need for human\-machine teaming \(HMT\) \[[44](https://arxiv.org/html/2605.15400#bib.bib51),[34](https://arxiv.org/html/2605.15400#bib.bib2),[16](https://arxiv.org/html/2605.15400#bib.bib4)\]\. Gathering representative human interaction data for each task offers a direct way to address many dyadic HMT challenges, but this approach becomes burdensome when even one additional human joins the team\. The added teammate expands the interaction space beyond individual human–machine adaptation, requiring the machine to account for individual preferences \[[47](https://arxiv.org/html/2605.15400#bib.bib21)\], emerging human–human coordination, and trust\-mediated task\-allocation dynamics under limited communication and ambiguous intent cues \[[31](https://arxiv.org/html/2605.15400#bib.bib28),[2](https://arxiv.org/html/2605.15400#bib.bib23)\]\. This barrier motivates our focus on zero\-shot coordination \(ZSC\) \[[42](https://arxiv.org/html/2605.15400#bib.bib25)\] and our extension from the common one\-human–one\-machine setup to a two\-human–one\-machine setup as a minimal step toward more realistic human\-group collaboration\.

One standard solution to the out\-of\-distribution \(OOD\) dilemma in ZSC is to train agents against populations of simulated partners and learn best responses to them \[[43](https://arxiv.org/html/2605.15400#bib.bib9)\], with recent methods further improving robustness by amplifying partner diversity within the training population \[[36](https://arxiv.org/html/2605.15400#bib.bib49),[55](https://arxiv.org/html/2605.15400#bib.bib12)\]\. Yet as teammate populations broaden, partner diversity alone remains inadequate because simulation cannot cover human behavior exhaustively \[[7](https://arxiv.org/html/2605.15400#bib.bib13)\], and best\-response learning may converge toward robust but static generalist policies that are poor at sustaining effective coordination patterns \[[29](https://arxiv.org/html/2605.15400#bib.bib24)\]\. This inspires us to seek criteria beyond diversity for identifying high\-performing coordination patterns, and to strengthen the learning signal used during best\-response training so that ongoing interactions can be steered toward these patterns\.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/overview.png)

Figure 1: IBTS overview\. Stage 1 constructs a diverse team pool using influence\-shaped coordination rewards and behavioral diversity\. Stage 2 trains a predictor that maps recent trajectory to a coordination embedding and a team\-similarity distribution\. Stage 3 uses the predicted team similarity and team scores to define a steering reward for training a best\-response policy\.

Here, we propose Influence\-Based Team Steering \(IBTS\), a novel HMT framework that augments partner diversity with learned coordination guidance\. IBTS first promotes supportive behavior patterns during team generation through influence shaping, then learns a trajectory predictor that recognizes these patterns from interaction histories\. Finally, IBTS uses this real\-time recognition as a steering signal, guiding the machine teammate toward stronger learned coordination patterns\.

We instantiate IBTS in both standard two\-agent Overcooked\-AI settings \[[5](https://arxiv.org/html/2605.15400#bib.bib1)\] and extended three\-agent Overcooked\-AI settings\. We evaluate IBTS with simulated learned partners, synthetic LLM partners, and real human teammates\. Across these evaluations, IBTS outperforms strong diversity\-focused baselines, demonstrating that learned team\-performance structure provides a useful optimization signal beyond partner coverage in both 2\-agent and 3\-agent teams\.

In summary, the contributions of this paper are:

- We introduce IBTS, an HMT framework that couples partner diversity with learned team\-performance structure\. During training, IBTS uses influence shaping to discover supportive coordination patterns and predictor\-guided steering to train best\-response policies\. At deployment, the policy uses the learned trajectory representation to recognize the current coordination mode with unseen teammates and guide interaction toward stronger learned modes\.
- We extend Overcooked\-AI evaluation with reusable three\-agent layouts and a personality\-conditioned synthetic\-AI protocol, showing that IBTS improves over strong diversity\-focused baselines across simulated task evaluation and synthetic partner\-style variation\.
- We conduct a 30\-participant human study and release, to our knowledge, the first 90\-trajectory dataset of 2\-human–1\-AI Overcooked\-AI collaboration, supporting future research on scaled HMT\.

## 2 Preliminaries and Related Work

In this section, we review the foundations needed to contextualize IBTS, covering HMT, diversity\-driven ZSC, influence shaping, and the Dec\-POMDP formulation used throughout the paper\.

Human\-machine teaming\. Our work studies human\-machine teaming \(HMT\), where autonomous agents serve as interdependent collaborators that contribute to shared objectives while adapting to human behavior during interaction \[[30](https://arxiv.org/html/2605.15400#bib.bib26),[40](https://arxiv.org/html/2605.15400#bib.bib46)\]\. A canonical testbed for studying such interaction is Overcooked\-AI \[[5](https://arxiv.org/html/2605.15400#bib.bib1),[9](https://arxiv.org/html/2605.15400#bib.bib18)\], where the team receives a shared reward for completing cooking tasks that require coordination, alignment, and role allocation\. However, prior Overcooked\-AI\-based HMT evaluation centers on dyadic one\-human–one\-AI interaction \[[6](https://arxiv.org/html/2605.15400#bib.bib27),[8](https://arxiv.org/html/2605.15400#bib.bib19),[46](https://arxiv.org/html/2605.15400#bib.bib15)\], whereas real collaborative settings often involve mixed teams in which a machine teammate must work with multiple human partners \[[31](https://arxiv.org/html/2605.15400#bib.bib28)\]\. Scaling HMT beyond dyads introduces additional complexity because the machine must adapt to individual human preferences while also preserving the coordination already emerging among human teammates, all under increasingly limited communication and ambiguous intent cues\. Data\-driven HMT solutions, such as learning from human demonstrations \[[41](https://arxiv.org/html/2605.15400#bib.bib48),[27](https://arxiv.org/html/2605.15400#bib.bib29)\], can be onerous to deploy across tasks and team compositions because representative human interaction data and task\-specific interfaces are often costly or unavailable\. This burden motivates scalable HMT agents that can generalize to unseen human partners without access to target human data during training\.

Zero\-Shot Coordination\. Learning how agents can collaborate with previously unseen teammates at test time without additional adaptation is the central goal of zero\-shot coordination \(ZSC\) \[[3](https://arxiv.org/html/2605.15400#bib.bib47),[33](https://arxiv.org/html/2605.15400#bib.bib57),[50](https://arxiv.org/html/2605.15400#bib.bib10)\]\. Self\-play \(SP\), where agents learn by interacting with copies of themselves, often induces brittle conventions that generalize poorly to novel partners \[[5](https://arxiv.org/html/2605.15400#bib.bib1),[12](https://arxiv.org/html/2605.15400#bib.bib11)\]\. Population\-based methods \[[13](https://arxiv.org/html/2605.15400#bib.bib31),[25](https://arxiv.org/html/2605.15400#bib.bib30),[51](https://arxiv.org/html/2605.15400#bib.bib55)\] address this limitation by exposing agents to a broader set of simulated teammates, including Fictitious Co\-Play \(FCP\) \[[43](https://arxiv.org/html/2605.15400#bib.bib9)\], which trains agents against a population of historical partners, Maximum Entropy Population\-Based Training \(MEP\) \[[55](https://arxiv.org/html/2605.15400#bib.bib12)\], which adds an entropy bonus to promote diversity among teammate policies, and GAMMA \[[21](https://arxiv.org/html/2605.15400#bib.bib14)\], which models heterogeneous partner behavior through generative teammate representations\. However, these methods can still struggle with coverage as the number of agents scales, because agents may adopt a broader range of role assignments, timing conventions, and interaction patterns \[[53](https://arxiv.org/html/2605.15400#bib.bib58)\]\. Recent latent\-strategy approaches \[[11](https://arxiv.org/html/2605.15400#bib.bib5)\], such as TALENTS \[[19](https://arxiv.org/html/2605.15400#bib.bib3)\], learn implicit structured representations of collaborative behavior to expand the scope of diversity and outperform prior population\-based baselines\. Yet they remain fragile when reward feedback is too weak to reveal useful coordination behavior\. IBTS therefore builds on latent\-strategy design to span this diversity, while further injecting coordination cues beyond environment reward to reinforce high\-performing patterns\.

Influence shaping\. Generating diverse teams with coordinated behavior requires role specializations, which can emerge through cooperative multi\-agent reinforcement learning \(MARL\) \[[48](https://arxiv.org/html/2605.15400#bib.bib42),[23](https://arxiv.org/html/2605.15400#bib.bib41)\]\. A common backbone is centralized training with decentralized execution \(CTDE\), where centralized information can stabilize joint optimization while each agent still acts from local observations at test time \[[24](https://arxiv.org/html/2605.15400#bib.bib45),[1](https://arxiv.org/html/2605.15400#bib.bib32)\]\. PPO\-based \[[38](https://arxiv.org/html/2605.15400#bib.bib50)\] CTDE methods such as Multi\-Agent Proximal Policy Optimization \(MAPPO\) \[[52](https://arxiv.org/html/2605.15400#bib.bib6)\] have shown strong performance across multi\-agent benchmarks \[[35](https://arxiv.org/html/2605.15400#bib.bib8),[45](https://arxiv.org/html/2605.15400#bib.bib52),[18](https://arxiv.org/html/2605.15400#bib.bib7)\], making them a natural foundation for discovering coordinated team behavior\. However, larger teams and longer task dependencies make sparse shared rewards less informative, since rare successes may not reveal which interactions produced coordination \[[22](https://arxiv.org/html/2605.15400#bib.bib34),[20](https://arxiv.org/html/2605.15400#bib.bib33)\]\. Influence\-based shaping mitigates this issue by rewarding actions that affect teammates’ future behavior, thereby biasing optimization toward interaction\-relevant consequences rather than isolated individual progress \[[14](https://arxiv.org/html/2605.15400#bib.bib16),[49](https://arxiv.org/html/2605.15400#bib.bib35)\]\. This mechanism can expose reusable coordination structure during self\-play, but because raw influence need not be task\-beneficial, IBTS links the resulting interaction patterns to learned team performance so that best\-response training can steer teams toward higher\-return coordination modes\.

Markov Decision Process\. To model HMT, we formalize the cooperative task as a decentralized partially observable Markov decision process \(Dec\-POMDP\) \[[4](https://arxiv.org/html/2605.15400#bib.bib36),[32](https://arxiv.org/html/2605.15400#bib.bib37)\]\. A Dec\-POMDP consists of a finite set of agents $N=\{1,\ldots,n\}$, global states $s\in\mathcal{S}$, per\-agent action spaces $\mathcal{A}^{i}$ with joint action $\mathbf{a}=(a^{1},\ldots,a^{n})\in\mathcal{A}=\prod_{i}\mathcal{A}^{i}$, and local observations $o^{i}\in\Omega^{i}$ generated from the underlying state\. At each time step $t$, each agent chooses an action $a_{t}^{i}$ using only its local information, the environment evolves according to $\mathcal{T}(s_{t+1}\mid s_{t},\mathbf{a}_{t})$, and the team receives a shared reward $r_{t}=R(s_{t},\mathbf{a}_{t})$\. The goal is to learn decentralized policies $\pi_{i}(a^{i}\mid o^{i})$ that maximize the finite\-horizon discounted return $\mathbb{E}\left[\sum_{t=0}^{H-1}\gamma^{t}r_{t}\right]$, where $H$ denotes the episode horizon and $\gamma\in[0,1]$ is the discount factor\.
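As a small illustration of the objective above, the following Python sketch computes the finite-horizon discounted return for one episode of shared team rewards. The function name `discounted_return` and the reward values are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Finite-horizon discounted return: sum over t of gamma**t * r_t."""
    total = 0.0
    # Accumulate from the last step backwards (Horner's rule),
    # so each step needs a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical 5-step episode: a shared reward of 20 arrives at steps 2 and 4.
shared_rewards = [0.0, 0.0, 20.0, 0.0, 20.0]
print(discounted_return(shared_rewards, gamma=0.5))
```

With `gamma=0.5`, the two rewards contribute 20 × 0.25 and 20 × 0.0625, giving 6.25, which shows how later successes are discounted relative to earlier ones.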

## 3 Method

Here, we formalize the three stages of IBTS used to train a deployable machine teammate, as illustrated in Figure [1](https://arxiv.org/html/2605.15400#S1.F1)\.

### 3\.1 Diverse Team Pool Construction

This subsection explains the two intrinsic terms used in the team\-pool construction phase\. Section [3\.1\.1](https://arxiv.org/html/2605.15400#S3.SS1.SSS1) introduces the influence\-based collaboration\-shaping term, which encourages stronger coordination patterns by rewarding actions that create opportunities for teammate follow\-up\. Section [3\.1\.2](https://arxiv.org/html/2605.15400#S3.SS1.SSS2) presents the behavioral diversity term, which helps maintain distinct conventions across teams in the pool\. Algorithm [1](https://arxiv.org/html/2605.15400#alg1) summarizes the full team\-pool construction procedure\.

#### 3\.1\.1 Influence Shaping for Collaborative Behavior

Environment reward alone may not reliably induce coordinated behavior\. Appendix [C\.1](https://arxiv.org/html/2605.15400#A3.SS1) illustrates a failure mode in a three\-agent layout, where a standard MAPPO baseline struggles to discover simple assembly\-line behavior\. To bias the team pool toward such interaction\-enabling behaviors, we augment the self\-play objective with an influence\-shaping term to reward actions that create opportunities for teammates to follow up\.

Let $\tilde{o}_{t}=(o_{t}^{i})_{i\in N}$ denote the joint observation at time $t$, and let $\Phi(\tilde{o}_{t})$ encode a minimal set of general collaborative features extracted from $\tilde{o}_{t}$, including agent locations, object locations, and object possession\. We then define a salient action $a_{t}^{*}$ as the action expected to induce the largest change in the collaborative state, as in Eq\. [1](https://arxiv.org/html/2605.15400#S3.E1)\.

$$a_{t}^{*}\in\arg\max_{a\in\mathcal{A}}\ \mathbb{E}\left[\left\lVert\Phi(\tilde{o}_{t+1})-\Phi(\tilde{o}_{t})\right\rVert_{2}\;\middle|\;\tilde{o}_{t},\ a_{t}=a\right]\tag{1}$$
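Eq. 1 does not specify how the expectation is estimated. One possible reading, sketched here purely as an assumption, is a Monte Carlo average over sampled next-step feature vectors for each candidate action. The names `expected_feature_change`, `salient_action`, and all feature values are hypothetical.

```python
import math


def expected_feature_change(phi_now, phi_next_samples):
    """Monte Carlo estimate of E[ ||Phi(o_{t+1}) - Phi(o_t)||_2 ] for one action."""
    distances = [math.dist(phi_now, phi_next) for phi_next in phi_next_samples]
    return sum(distances) / len(distances)


def salient_action(phi_now, samples_by_action):
    """Return the action whose sampled outcomes change the features the most."""
    return max(
        samples_by_action,
        key=lambda a: expected_feature_change(phi_now, samples_by_action[a]),
    )


# Hypothetical 2-D collaborative features and sampled outcomes (assumed values).
phi_now = [0.0, 0.0]
samples_by_action = {
    "stay": [[0.0, 0.0], [0.0, 0.0]],
    "move": [[1.0, 0.0], [0.0, 1.0]],
    "interact": [[3.0, 4.0], [0.0, 0.0]],
}
print(salient_action(phi_now, samples_by_action))
```

In this toy setup, `interact` has the largest average feature change (2.5 versus 1.0 for `move`), so it is selected as the salient action.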
Crucially, the effect of $a_{t}^{*}$ may be delayed when teammates need to reposition or continue a multi\-step chain before the induced response becomes visible\. Therefore, our mechanism intentionally avoids scoring only the immediate next action and instead uses a short\-horizon event label as in Eq\. [2](https://arxiv.org/html/2605.15400#S3.E2)\.

$$y_{j,t}^{(K)}=\mathbf{1}\Big\{\exists\,\tau\in\{1,\dots,K\}\text{ such that }a_{t+\tau}^{j}=a_{t}^{*}\Big\}\tag{2}$$

Here, $y_{j,t}^{(K)}=1$ indicates that agent $j$ realizes salient action $a_{t}^{*}$ within the next $K$ steps\. We use a binary indicator to capture whether the follow\-up occurs at all, rather than counting repeated executions of the same target action within one window, which prevents local repetitions from being treated as separate influence events when they may simply reflect transient dynamics\.
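The event label in Eq. 2 can be sketched in a few lines of Python. This is a minimal illustration assuming each agent's actions are stored as a list indexed by time step; the function name and the action names are hypothetical.

```python
def follow_up_label(target_actions, t, salient_action, K):
    """Eq. 2: 1 if agent j takes `salient_action` at any step in t+1..t+K, else 0."""
    window = target_actions[t + 1 : t + 1 + K]
    # Membership test is binary, so repeated executions inside the
    # window still count as a single follow-up event.
    return int(salient_action in window)


# Hypothetical action log for target agent j (action names are assumptions).
actions_j = ["stay", "left", "interact", "interact", "up"]
print(follow_up_label(actions_j, t=0, salient_action="interact", K=1))
print(follow_up_label(actions_j, t=0, salient_action="interact", K=3))
```

With `K=1` the window contains only `left`, so the label is 0. With `K=3` the window contains `interact` twice, yet the label is 1 rather than 2, matching the binary design described above.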

We then use the event label $y_{j,t}^{(K)}$ to define the $K$\-step directed influence reward $r^{\mathrm{inf}}_{i,t}$ in Eq\. [3](https://arxiv.org/html/2605.15400#S3.E3)\. The reward is computed over ordered pairs of distinct agents $(i,j)$, where $i$ is the source agent receiving the intrinsic reward for shaping teammate behavior, and $j$ is the target teammate whose future follow\-up event may be affected by agent $i$’s action\. For each pair, we compare an observation\-only baseline $\omega_{j}(\cdot\mid\tilde{o}_{t})$ with an influence\-conditioned predictor $q_{i\rightarrow j}(\cdot\mid\tilde{o}_{t},a_{t}^{i})$\. The baseline estimates how much the current joint observation alone predicts the event $y_{j,t}^{(K)}$, while the influence\-conditioned predictor estimates how this prediction changes after conditioning on agent $i$’s action $a_{t}^{i}$\. We model their difference to estimate the contribution of $a_{t}^{i}$, and take the non\-negative part so that actions are rewarded only when they increase the predicted likelihood of teammate follow\-up\.

$$r^{\mathrm{inf}}_{i,t}:=\frac{1}{n-1}\sum_{j\neq i}\max\!\left(q_{i\rightarrow j}\!\left(y_{j,t}^{(K)}=1\mid\tilde{o}_{t},a_{t}^{i}\right)-\omega_{j}\!\left(y_{j,t}^{(K)}=1\mid\tilde{o}_{t}\right),\,0\right)\tag{3}$$

Both predictors are trained online from rollouts collected under the current policies\. During each PPO iteration, we execute the decentralized actors to collect tuples $(\tilde{o}_{t},\mathbf{a}_{t},\tilde{o}_{t+1})$, construct the event label $y_{j,t}^{(K)}$ from the following $K$ steps, and update $q_{i\rightarrow j}$ and $\omega_{j}$ with binary cross\-entropy\.
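To make Eq. 3 concrete, the sketch below computes the influence reward for one source agent from already-predicted probabilities. It assumes the two predictors have been evaluated elsewhere and simply passes their outputs in as dictionaries keyed by agent index; the function name and the probability values are hypothetical.

```python
def influence_reward(i, q_probs, omega_probs):
    """Eq. 3 for source agent i.

    q_probs[j]     : q_{i->j}(y=1 | joint obs, a_i), action-conditioned predictor
    omega_probs[j] : omega_j(y=1 | joint obs), observation-only baseline
    """
    targets = [j for j in omega_probs if j != i]
    # Keep only positive gains: an action is rewarded when it raises the
    # predicted chance of a teammate follow-up, never penalized otherwise.
    gains = [max(q_probs[j] - omega_probs[j], 0.0) for j in targets]
    return sum(gains) / len(targets)  # len(targets) == n - 1


# Hypothetical 3-agent team, source agent 0 (probabilities are assumed values).
omega = {0: 0.5, 1: 0.5, 2: 0.5}
q_from_0 = {1: 0.75, 2: 0.25}
print(influence_reward(0, q_from_0, omega))
```

Here agent 0's action raises teammate 1's predicted follow-up probability by 0.25 and lowers teammate 2's. The negative change is clipped to zero, so the averaged reward is 0.125.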

The directed pairwise form in Eq\. [3](https://arxiv.org/html/2605.15400#S3.E3) intentionally averages over all target teammates $j\neq i$, rather than restricting influence to spatially nearby agents or local contribution\. This design matters in coordination chains, for instance in a streamline\-style strategy, where an early action may immediately affect the nearest teammate, but it can also change whether a farther downstream teammate later performs a salient follow\-up\. Importantly, this design differs from directly rewarding a handcrafted event\. As shown in Appendix [C\.3](https://arxiv.org/html/2605.15400#A3.SS3), reward\-hacking baselines that directly incentivize handoffs can produce unstable shortcuts, such as repeated local exchanges that increase shaped reward without improving task progress\. This collaboration\-shaping component is combined with the diversity\-promoting MEP term described next\.

#### 3\.1\.2 Behavioral Diversity

While influence shaping encourages coordination\-producing behavior, it does not prevent the population from collapsing to a single dominant convention\. Motivated by MEP \[[55](https://arxiv.org/html/2605.15400#bib.bib12)\], we thus encourage diversity across team policies by comparing action distributions of corresponding agents across teams, rather than encouraging agents within the same team to differ from one another\. To this end, we add a behavior diversity term that rewards actions that are less typical under the current population\.

Let $\bar{\pi}^{i}(\cdot\mid o_{t}^{i})$ denote the mean action distribution of agent index $i$ across the current population of teams\. We define the behavior diversity reward as in Eq\. [4](https://arxiv.org/html/2605.15400#S3.E4), where $\epsilon>0$ is a small stability constant\.

$$r_{i,t}^{\mathrm{div}}=-\log\!\Big(\max\big(\bar{\pi}^{\,i}(a_{t}^{i}\mid o_{t}^{i}),\,\epsilon\big)\Big)\tag{4}$$

This reward is large when the chosen action has low probability under the current population mean\. As a result, the term encourages different teams to realize distinct conventions rather than collapsing to a single shared mode\.
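A minimal sketch of Eq. 4 follows, assuming each team's policy for agent index $i$ is available as a list of action probabilities at the current observation. The helper names, the default `eps=1e-8`, and the probability values are assumptions made for illustration.

```python
import math


def mean_action_distribution(team_action_probs):
    """Average the per-action probabilities of one agent index across teams."""
    n_teams = len(team_action_probs)
    n_actions = len(team_action_probs[0])
    return [sum(p[a] for p in team_action_probs) / n_teams for a in range(n_actions)]


def diversity_reward(mean_probs, action, eps=1e-8):
    """Eq. 4: negative log of the population-mean probability, clipped at eps."""
    return -math.log(max(mean_probs[action], eps))


# Hypothetical population of two teams with three actions (assumed values).
population = [[0.5, 0.5, 0.0], [0.5, 0.25, 0.25]]
mean_probs = mean_action_distribution(population)
print(mean_probs)
print(diversity_reward(mean_probs, action=0), diversity_reward(mean_probs, action=2))
```

Action 2 is rare under the population mean (probability 0.125), so it earns a larger bonus than the common action 0 (probability 0.5). The `eps` clip keeps the reward finite when an action has zero mean probability.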

Finally, we integrate the influence and diversity rewards into the CTDE training objective as shown in Eq\. [5](https://arxiv.org/html/2605.15400#S3.E5)\. For each agent $i$, the actor\-side shaped reward combines the environment reward, the influence reward from Eq\. [3](https://arxiv.org/html/2605.15400#S3.E3), and the behavioral diversity reward from Eq\. [4](https://arxiv.org/html/2605.15400#S3.E4)\.

$$r_{i,t}=r_{t}^{\mathrm{env}}+\lambda_{\mathrm{inf}}\,r_{i,t}^{\mathrm{inf}}+\lambda_{\mathrm{div}}\,r_{i,t}^{\mathrm{div}}\tag{5}$$
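As a worked instance of Eq. 5, the short sketch below combines the three terms for a single agent and time step. The coefficient values for $\lambda_{\mathrm{inf}}$ and $\lambda_{\mathrm{div}}$ are placeholders chosen for easy arithmetic; the excerpt does not report the values used in the paper.

```python
def shaped_reward(r_env, r_inf, r_div, lambda_inf, lambda_div):
    """Eq. 5: shared environment reward plus per-agent intrinsic terms."""
    return r_env + lambda_inf * r_inf + lambda_div * r_div


# Assumed values: the environment reward is shared, while the influence and
# diversity terms are specific to agent i at this time step.
print(shaped_reward(r_env=20.0, r_inf=0.125, r_div=2.0, lambda_inf=0.5, lambda_div=0.25))
```

Because only the environment term is shared, two agents in the same team can receive different shaped rewards at the same step.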

### 3\.2 Trajectory\-Conditioned Team Predictor

To anchor ongoing trajectories to the learned team pool during deployment, we train a transformer\-based predictor on rollout histories collected from the pool of $M$ teams\.

Let $h_{t}$ denote a trajectory history up to time $t$, formed from features available in agents’ observations and actions, including each agent’s location, facing direction, held object, and executed action\. The transformer encoder $g_{\psi}$, parameterized by weights $\psi$, maps $h_{t}$ to a continuous embedding $c_{t}=g_{\psi}(h_{t})$ to represent the current coordination pattern\. A learned classification head with parameters $\eta=\{W_{\eta},b_{\eta}\}$ then projects $c_{t}$ to a distribution over the $M$ teams as shown in Eq\. [6](https://arxiv.org/html/2605.15400#S3.E6)\.

$$p_{t}=\mathrm{softmax}(W_{\eta}c_{t}+b_{\eta})\in\mathbb{R}^{M}\tag{6}$$

The encoder parameters $\psi$ and classifier\-head parameters $\eta$ are optimized jointly with cross\-entropy on the ground\-truth team label $m$ that generated history $h_{t}$ as shown in Eq\. [7](https://arxiv.org/html/2605.15400#S3.E7)\.

$$\mathcal{L}_{\mathrm{pred}}=\mathbb{E}_{(h_{t},m)}\left[-\log p_{t}(m)\right]\tag{7}$$

Optimizing Eq\. [7](https://arxiv.org/html/2605.15400#S3.E7) encourages the transformer encoder to learn embeddings $c_{t}$ that distinguish among the coordination styles represented in the team pool\. Intuitively, $c_{t}$ represents the implicit coordination pattern encoded from recent team trajectories, while $p_{t}(m)$ estimates how closely the ongoing trajectory history resembles each team in the pool\.
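The classification head and loss in Eqs. 6 and 7 can be sketched in plain Python. The transformer encoder is deliberately omitted: the embedding `c` below is a hypothetical stand-in for $c_{t}=g_{\psi}(h_{t})$, and the weights are assumed values rather than learned parameters.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def team_distribution(c, W, b):
    """Eq. 6: p_t = softmax(W c + b), one logit per team in the pool."""
    logits = [sum(w * x for w, x in zip(row, c)) + bias for row, bias in zip(W, b)]
    return softmax(logits)


def prediction_loss(p, team_label):
    """Eq. 7 for a single sample: negative log-probability of the true team."""
    return -math.log(p[team_label])


# Hypothetical 2-D embedding and a head over M = 3 teams (assumed values).
c = [1.0, 0.0]
W = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0]
p = team_distribution(c, W, b)
print(p, prediction_loss(p, team_label=0))
```

The embedding aligns with the first row of `W`, so team 0 receives the highest probability and the lowest loss. In the full method this gradient would also flow back into the encoder.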

### 3\.3 Team Steering

The goal of this stage is to learn a best\-response policy $\pi(\cdot\mid o_{t},c_{t})$, conditioned on the learned latent pattern $c_{t}$, against the team pool constructed in Section [3\.1](https://arxiv.org/html/2605.15400#S3.SS1)\. Section [3\.3\.1](https://arxiv.org/html/2605.15400#S3.SS3.SSS1) describes the integration of the predictor distribution $p_{t}(m)$ into an intrinsic steering reward, which encourages ongoing trajectories to move toward higher\-quality historical coordination modes\. Section [3\.3\.2](https://arxiv.org/html/2605.15400#S3.SS3.SSS2) presents an offline distillation step that consolidates these specialized best\-response policies into a single shared policy for deployment\. The full procedure is summarized in Alg\. [2](https://arxiv.org/html/2605.15400#alg2)\.

#### 3\.3\.1 Trajectory\-Quality Steering

The central idea of team steering is to reward a policy when its actions move the ongoing team trajectory closer to higher\-performing teams in the learned pool\. Therefore, we treat the historical team pool as a reference set of coordination modes with different quality levels\. For each team $m$, we evaluate its performance over fixed\-length rollouts and normalize the resulting score to obtain a quality score $S(m)\in[0,1]$\. Given that $p_{t}(m)$ measures how strongly the current trajectory history resembles historical team $m$, we define the resulting quality\-weighted trajectory score as in Eq\. [8](https://arxiv.org/html/2605.15400#S3.E8)\. Intuitively, $Q(h_{t})$ is a soft quality estimate of the current trajectory under the reference team pool, which is large when the predictor places high probability on stronger historical teams given the current history\.

$$Q(h_{t})=\sum_{m=1}^{M}p_{t}(m)\,S(m)\tag{8}$$
We then define a short\-horizon steering reward by measuring whether the trajectory moves toward a higher\-quality region over the next $\Delta$ steps as in Eq\. [9](https://arxiv.org/html/2605.15400#S3.E9)\.

$$r_{t}^{\mathrm{steer}}=\max\big(Q(h_{t+\Delta})-Q(h_{t}),\,0\big)\tag{9}$$

This reward is positive when the future trajectory becomes more similar to higher\-quality historical coordination modes\. We use the non\-negative form to maintain the stability of the shaping signal because the goal of steering is to reinforce improvements toward stronger historical coordination patterns, rather than to penalize every local deviation from the current score estimate\.
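Eqs. 8 and 9 reduce to a weighted sum and a clipped difference, sketched below. The text states only that team scores are normalized to $[0,1]$; the min-max normalization used here is an assumption, as are the function names and all numeric values.

```python
def normalize_scores(raw_scores):
    """Assumed min-max normalization of raw team scores into [0, 1]."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        return [1.0 for _ in raw_scores]  # degenerate pool: all teams equal
    return [(s - lo) / (hi - lo) for s in raw_scores]


def quality_score(p, S):
    """Eq. 8: Q(h_t) = sum over teams of p_t(m) * S(m)."""
    return sum(p_m * s_m for p_m, s_m in zip(p, S))


def steering_reward(p_now, p_future, S):
    """Eq. 9: positive part of the quality change over the next Delta steps."""
    return max(quality_score(p_future, S) - quality_score(p_now, S), 0.0)


# Hypothetical pool of three teams with assumed raw scores.
S = normalize_scores([200.0, 150.0, 100.0])
p_now = [0.25, 0.25, 0.5]      # predictor output at time t
p_future = [0.5, 0.25, 0.25]   # predictor output at time t + Delta
print(steering_reward(p_now, p_future, S))
print(steering_reward(p_future, p_now, S))
```

Shifting probability mass from the weakest team to the strongest raises the quality score from 0.375 to 0.625, so the reward is 0.25. The reverse shift yields 0 because negative changes are clipped.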

Finally, we use this steering term to shape the training reward as shown in Eq\. [10](https://arxiv.org/html/2605.15400#S3.E10), where $r_{t}^{\mathrm{env}}$ is the environment reward and $\alpha$ controls the strength of the steering bonus\.

$$r_{t}^{\mathrm{total}}=r_{t}^{\mathrm{env}}+\alpha\,r_{t}^{\mathrm{steer}}\tag{10}$$

In practice, for each agent index $i$, the best\-response policy is trained as a specialized teacher $\pi_{i}^{\mathrm{teacher}}$ while keeping both the trajectory predictor and the sampled partner policies from the team pool fixed\. Although $r_{t}^{\mathrm{steer}}$ is computed from the full team trajectory, only the active teacher policy $\pi_{i}^{\mathrm{teacher}}$ is updated during its training loop, while the remaining partner policies are frozen\. This objective thus trains each teacher not only to achieve task reward, but also to gradually shift unfolding trajectories toward coordination patterns associated with stronger historical teams\.

#### 3\.3\.2 Shared\-Student Distillation

Although the previous steering stage produces effective teacher policies, each teacher $\pi_{i}^{\mathrm{teacher}}$ is trained only as one target agent while the remaining positions are occupied by frozen partners\. As a result, a single teacher may cover only the ego\-centric states induced by its own agent index and partner contexts, making it insufficient as a general deployable policy\. To consolidate these specialized behaviors into one policy with broader coverage, we perform an offline distillation step that pools supervision from all learned teachers\.

Concretely, for each agent index $i\in\{1,\dots,n\}$, we freeze the corresponding teacher policy $\pi_{i}^{\mathrm{teacher}}$ and roll it out as agent $i$ while filling the remaining agent positions with sampled frozen partners from the team pool\. At each timestep, we export only the acting teacher’s ego\-centric sample, consisting of the local observation $o_{t}^{i}$, the online latent $c_{t}$, and the teacher action $a_{t}^{i}$\. Pooling these samples across all teacher indices and partner configurations yields the offline dataset in Eq\. [11](https://arxiv.org/html/2605.15400#S3.E11)\.

$$\mathcal{D}=\bigcup_{i=1}^{n}\left\{\left(o_{t}^{i},\,c_{t},\,a_{t}^{i}\right)\right\}\tag{11}$$

Here, the superscript $i$ indicates which teacher index generated the sample, but the student itself does not take the agent index as an explicit input\.

We then train a single shared student policy $\pi^{\mathrm{student}}(\cdot\mid o_{t},c_{t})$ by behavior cloning on the pooled dataset\. The objective is standard cross\-entropy over teacher actions as defined in Eq\. [12](https://arxiv.org/html/2605.15400#S3.E12)\.

$$\mathcal{L}_{\mathrm{BC}}=\mathbb{E}_{(o,c,a)\sim\mathcal{D}}\left[-\log\pi^{\mathrm{student}}(a\mid o,c)\right]\tag{12}$$

By aggregating demonstrations from teachers trained under different indices and partner configurations, the student receives supervision over a broader set of ego\-centric situations and thus serves as the final shared best\-response policy at deployment\.
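The pooling in Eq. 11 and the behavior-cloning loss in Eq. 12 can be illustrated with a small sketch. It assumes each teacher's rollout is already available as a list of `(observation, latent, action)` tuples and that the student is any callable returning action probabilities. All names and data are hypothetical, and no gradient step is shown.

```python
import math


def pool_teacher_samples(rollouts_by_teacher):
    """Eq. 11: union of (obs, latent, action) samples; the teacher index is dropped."""
    dataset = []
    for samples in rollouts_by_teacher.values():
        dataset.extend(samples)
    return dataset


def bc_loss(student_probs, dataset):
    """Eq. 12: mean negative log-likelihood of teacher actions under the student."""
    losses = [-math.log(student_probs(o, c)[a]) for o, c, a in dataset]
    return sum(losses) / len(losses)


# Hypothetical samples from two teachers (placeholder observations and latents).
rollouts_by_teacher = {
    0: [("obs_a", "c_a", 1), ("obs_b", "c_b", 3)],
    1: [("obs_c", "c_c", 0)],
}
dataset = pool_teacher_samples(rollouts_by_teacher)


def uniform_student(obs, latent):
    """Untrained stand-in student: uniform over four actions."""
    return [0.25, 0.25, 0.25, 0.25]


print(len(dataset), bc_loss(uniform_student, dataset))
```

A uniform student over four actions gives a loss of $\log 4$, which serves as a simple reference point: a student that imitates the teachers should drive the loss below this value.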

## 4 Experiments

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/user_study_layouts.png)

Figure 2: Overcooked\-AI layouts used in the human study\. We evaluate 2\-agent and 3\-agent settings across Forced Coordination \(FC\), Pipeline \(PL\), and Asymmetric Advantages \(AA\)\. These layouts are selected because high performance requires coordinated interaction\.

We study whether IBTS improves coordination under broad teammate variation in Overcooked\-AI layouts where high performance requires collaborative behavior rather than independent solo routines\. Specifically, we use Forced Coordination \(FC\), Pipeline \(PL\), and Asymmetric Advantages \(AA\), as shown in Figure [2](https://arxiv.org/html/2605.15400#S4.F2)\. We denote each layout by its abbreviation and team size, e\.g\., FC\-2 refers to the 2\-agent Forced Coordination setting and FC\-3 refers to its 3\-agent counterpart\. These layouts respectively emphasize dependency resolution, sequential task flow, and uneven resource access\.

In this section, we assess IBTS in simulated settings, covering both in-distribution task competence and out-of-distribution robustness to synthetic partner-style variation. Baselines include MAPPO-based self-play (SP) \[[52](https://arxiv.org/html/2605.15400#bib.bib6)\], FCP \[[43](https://arxiv.org/html/2605.15400#bib.bib9)\], MEP \[[55](https://arxiv.org/html/2605.15400#bib.bib12)\], and GAMMA building on MEP (GAMMA) \[[21](https://arxiv.org/html/2605.15400#bib.bib14)\], covering standard self-play and diversity-based approaches for ZSC. We discuss additional comparison-design choices in Appendix [A.5](https://arxiv.org/html/2605.15400#A1.SS5).

### 4.1 In-Distribution Simulated Evaluation

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/in_distribution_summary.png)

Figure 3: In-distribution simulated evaluation across 2-agent and 3-agent Overcooked layouts. Bars report mean normalized task score across 3 random seeds, with error bars showing standard deviation.

We first examine standard task competence under in-distribution simulation, where training and test episodes use the same layout family and agent interface without introducing new partner styles. Figure [3](https://arxiv.org/html/2605.15400#S4.F3) reports mean task scores across both 2-agent and 3-agent layouts.

The clearest gains for our framework appear in PL, where the first meaningful task reward requires agents to complete a longer coordination chain before onions reach a pot. This sparse-reward structure makes it harder for standard baselines to discover useful collaboration from task return alone. In this setting, IBTS achieves the best performance for both 2-agent and 3-agent teams, improving over MEP by +16.5% in the 2-agent case (231.7 vs. 198.8) and by +17.1% in the 3-agent case (148.3 vs. 126.6). By contrast, the advantage is less pronounced when early rewards are easier to discover. In AA, especially the 2-agent setting where an onion source is close to the pot, SP remains strong and achieves the highest score (800.9), while IBTS remains competitive (775.7). Overall, these results suggest that influence-based steering is most beneficial under sparse reward, where useful intermediate coordination steps may otherwise receive little direct feedback before task reward is observed, while preserving task competence when reward feedback is already more accessible.
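
The percentages quoted above follow from the reported mean scores. The short sketch below reproduces that arithmetic; the helper name `relative_improvement` is an illustrative assumption, and the AA-2 gap to SP is derived here from the two reported scores, not stated in the text.

```python
def relative_improvement(score, baseline):
    """Percentage gain of `score` over `baseline`."""
    return 100.0 * (score - baseline) / baseline

pl2 = relative_improvement(231.7, 198.8)  # IBTS vs. MEP, PL-2
pl3 = relative_improvement(148.3, 126.6)  # IBTS vs. MEP, PL-3
aa2 = relative_improvement(775.7, 800.9)  # IBTS vs. SP, AA-2 (derived, not quoted)

print(f"{pl2:+.1f}% {pl3:+.1f}% {aa2:+.1f}%")  # +16.5% +17.1% -3.1%
```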

### 4.2 Synthetic LLM Partner-Style Evaluation

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/synthetic_llm_heatmap.png)

Figure 4: Synthetic LLM partner-style evaluation. Each panel corresponds to one partner personality. Columns denote layout and team-size combinations, rows denote IBTS improvements over FCP, MEP, and GAMMA, and cell values show relative improvement over the corresponding baseline. Positive values indicate improvement.

We then assess robustness to partners outside the training population. We use GPT-5-mini partners conditioned with three personality profiles: Agreeable, Extraverted, and Neurotic, following prior work showing that personality induction can produce distinct decision-making patterns in LLM-based agents \[[28](https://arxiv.org/html/2605.15400#bib.bib44)\]. The profiles are designed to induce different low-level coordination tendencies, such as cooperative handoffs, assertive task initiation, and cautious waiting behavior. Full game instructions, personality prompts, and the rationale for this synthetic-partner design are provided in Appendix [D](https://arxiv.org/html/2605.15400#A4).

Figure [4](https://arxiv.org/html/2605.15400#S4.F4) reports relative improvement over each corresponding baseline across personality profiles, layouts, and team sizes. Although the gains are not uniform across all layout-personality pairs, IBTS improves over FCP in all 18 settings and over MEP in 13 of 18 settings. Averaged across the synthetic evaluation, IBTS achieves a mean score of 186.9, compared with 133.7 for MEP, 110.6 for FCP, and 158.6 for GAMMA. The strongest gains appear in PL, where IBTS improves over MEP by +106.8% in the 2-agent setting and +75.8% in the 3-agent setting, indicating that the method remains effective as coordination requires longer interaction chains and larger teams.

## 5 Human Study

Here, we present the details of our human\-subjects study evaluating whether IBTS transfers to real human teammates\. We explore the question:

Q1. Does IBTS improve real-human team performance over diversity-only baselines in both dyadic HMT and two-human–one-machine group HMT?

We review the study conditions, participant protocol, evaluation measures, statistical analysis, and task\-score results below\.

Study Conditions and Procedure. We evaluate two team-size conditions: one-human–one-machine HMT and two-human–one-machine group HMT. In both conditions, participants interact with three machine partners, MEP, GAMMA, and IBTS, which were the three strongest learned-agent methods evaluated in Section [4](https://arxiv.org/html/2605.15400#S4). Participants played three Overcooked layouts, FC, PL, and AA, shown in Figure [2](https://arxiv.org/html/2605.15400#S4.F2). The layout order was fixed as FC, PL, and AA, while the machine partner order was randomized.

Before the formal trials, participants received unlimited practice time in PL or AA to familiarize themselves with the controls, task mechanics, and reward structure. During the formal study, participants completed one FC–PL–AA block with a given machine partner, filled out a post-game questionnaire, and repeated this process for the remaining machine partners. Each game was capped at 400 environment timesteps and used synchronous stepping, so the environment advanced only after actions were received from all active human players and machine agents. Participants were not given information about the machine partner policies beyond the shared game rules and task objective. Verbal communication was prohibited in both team-size conditions, while natural nonverbal cues were not explicitly controlled. Full gameplay and questionnaire procedures are provided in Appendix [E.1](https://arxiv.org/html/2605.15400#A5.SS1), Appendix [E.2](https://arxiv.org/html/2605.15400#A5.SS2), and Appendix [E.3](https://arxiv.org/html/2605.15400#A5.SS3).
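
As a rough illustration of synchronous stepping under a 400-timestep cap, the sketch below advances a toy environment only after every player has returned an action. `ToyEnv`, its reward schedule, and the player callables are hypothetical stand-ins and do not reflect the study software.

```python
HORIZON = 400  # per-game cap on environment timesteps, as stated above

class ToyEnv:
    """Hypothetical stand-in environment: +20 reward on every 50th joint step."""
    def __init__(self):
        self.t = 0

    def step(self, joint_action):
        self.t += 1
        return 20 if self.t % 50 == 0 else 0

def play_game(env, players):
    """Run one game; the environment steps once per complete joint action."""
    score = 0
    for _ in range(HORIZON):
        # Each callable returns only once that player has chosen an action,
        # so the joint action is complete before the environment advances.
        joint_action = [choose_action() for choose_action in players]
        score += env.step(joint_action)
    return score

# Two hypothetical humans and one hypothetical machine agent.
players = [lambda: "stay", lambda: "interact", lambda: "up"]
score = play_game(ToyEnv(), players)  # total reward accumulated over the game
```

In this toy setup the game length is counted in environment timesteps, not wall-clock time, so a slow player delays the game without costing the team any steps.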

Measures and Statistical Analysis. Our primary measure is team task score, computed as the total environment reward accumulated within a 400-timestep game. For each team-size condition and layout, we compare IBTS against the learned-agent baselines using paired Wilcoxon signed-rank tests. The paired unit is the participant in the one-human–one-machine condition and the two-human team in the two-human–one-machine condition. Before conducting the nonparametric tests, we used Shapiro-Wilk tests on paired score differences to assess normality and Levene’s tests with median centering to inspect variance differences.

For each layout, we test two planned comparisons: IBTS versus MEP and IBTS versus GAMMA. We apply Holm correction within each layout to account for these two comparisons. Significance markers in the result figure correspond to the Holm-corrected Wilcoxon $p$-values, with $\ast$ for $p<0.05$, $\ast\ast$ for $p<0.01$, and $\ast\ast\ast$ for $p<0.001$. Error bars report standard error of the mean.
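
The paired test and the Holm step can be sketched in pure Python as follows. This is an illustrative version under stated assumptions: it uses the exact (enumeration-based) two-sided Wilcoxon signed-rank test, whereas the text does not say whether an exact or asymptotic variant was used, and the scores are made-up numbers, not study data.

```python
from itertools import product

def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test on paired samples."""
    d = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to tied |differences|
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
    stat = min(w_plus, total - w_plus)
    # Enumerate all 2^n sign assignments under the null hypothesis.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= stat + 1e-12:
            hits += 1
    return stat, hits / 2 ** n

def holm(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running
    return adjusted

# Hypothetical paired scores for 10 units (not data from the study).
ibts = [260, 240, 255, 270, 230, 250, 265, 245, 235, 275]
mep = [220, 225, 210, 240, 215, 200, 230, 238, 205, 225]
stat, p_mep = wilcoxon_exact(ibts, mep)
adjusted = holm([p_mep, 0.03])  # 0.03 is a placeholder second raw p-value
```

With 10 paired units, the smallest attainable exact two-sided $p$-value is $2/2^{10}\approx 0.00195$, so the smallest Holm-corrected value for two comparisons is about $0.0039$. The enumeration grows as $2^{n}$ and is only practical for small samples.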

Participants also completed post-game questionnaires measuring perceived teammate quality. Since task performance is the primary outcome of this human study, we report the questionnaire results separately in Appendix [E.5](https://arxiv.org/html/2605.15400#A5.SS5).

Q1 Results. We recruited 30 participants aged 20–35 years (mean 25.1, SD 3.3; 23 male, 7 female) under an IRB-approved protocol. Ten participants completed the one-human–one-machine condition, and the remaining 20 participants formed 10 two-human teams for the two-human–one-machine condition. We display the task-score results in Figure [5](https://arxiv.org/html/2605.15400#S5.F5).

Overall, the statistically significant results support that IBTS can improve real-human team performance over learned-agent baselines in coordination-intensive settings where useful task progress is harder to discover from sparse reward alone. This pattern is consistent with the simulated and synthetic evaluations in Section [4](https://arxiv.org/html/2605.15400#S4), where the clearest gains also appear in settings requiring longer interaction chains. In the one-human–one-machine condition, IBTS significantly outperforms MEP in PL-2 (251.4 vs. 217.0; Holm-corrected $p=0.039$) and significantly outperforms GAMMA in PL-2 (251.4 vs. 173.7; Holm-corrected $p=0.0039$). In the two-human–one-machine condition, IBTS significantly outperforms both MEP and GAMMA in FC-3 (479.1 vs. 435.1 and 423.0, respectively; Holm-corrected $p=0.035$ for both comparisons). Together, these findings indicate that the benefits of IBTS transfer to both dyadic and group HMT.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/user_study_scores.png)

Figure 5: Human-study scores across one-human–one-machine and two-human–one-machine settings. Bars report mean team score with standard error. Significance markers denote paired Wilcoxon signed-rank tests comparing IBTS against each baseline, with Holm correction within each layout.

## 6 Conclusion

This paper introduces Influence\-Based Team Steering \(IBTS\), a framework for scalable zero\-shot HMT in settings where diversity\-only approaches struggle to identify useful coordination behavior under sparse rewards and longer interaction chains\. IBTS mitigates this issue by using influence shaping to strengthen coordination feedback during population generation and predictor\-guided steering to reuse high\-performing team modes during best\-response training\. Across simulated, synthetic\-partner, and real\-human evaluations, IBTS improves over strong learned baselines in most settings, including scaled two\-human–one\-AI teams\.

Beyond the specific Overcooked\-AI setting, our results suggest a broader design principle for scalable HMT: machines should not only be trained to tolerate diverse partners, but also to recognize and reinforce productive coordination patterns as they emerge during interaction\. This perspective shifts ZSC from coverage alone toward transferable coordination structure, providing a path for using MARL\-discovered behaviors to support human groups when direct human data collection is limited\.

Limitations. Our approach has several limitations: 1) IBTS depends heavily on the quality of the learned team-performance structure: if that structure is poor, the embedding space may not contain the cooperative behaviors that humans would prefer, which limits what the steering objective can recover. Influence shaping helps mitigate this issue by encouraging actions that induce useful teammate responses, but it does not fully solve the sparse-reward problem. For example, in PL-3, learned agents remain below the intuitive human-designed heuristic shown in Figure [7](https://arxiv.org/html/2605.15400#A3.F7). 2) The pairwise influence formulation in Section [3.1.1](https://arxiv.org/html/2605.15400#S3.SS1.SSS1) also introduces scalability costs because it evaluates directed source-target influence terms across agent pairs. This design is intentional, since IBTS aims to capture whether one agent’s behavior induces useful follow-up from each teammate, including nonlocal responses that unfold downstream in a coordination chain rather than only immediate nearby interactions, but more efficient approximations will be needed for much larger teams. 3) Our human study is limited in size and scope, and the survey results in Figure [9](https://arxiv.org/html/2605.15400#A5.F9) show that machine teammates still trail human teammates in trust, suggesting that higher task performance alone is not sufficient to close the subjective trust gap in group human-machine coordination.

Future work. Future work should improve the scalability of pairwise influence estimation, extend IBTS to richer communication and continuous-control domains, and combine team steering with stronger sparse-reward MARL techniques.

## References

- \[1\]\(2024\)An introduction to centralized training for decentralized execution in cooperative multi\-agent reinforcement learning\.arXiv preprint arXiv:2409\.03052\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p4.1)\.
- \[2\]R\. Basappa, C\. Lancaster, R\. Mallick, C\. Flathmann, and N\. McNeese\(2025\)Mind the gaps: how ai shortcomings and human concerns may disrupt team cognition in human\-ai teams \(hats\)\.InProceedings of the Human Factors and Ergonomics Society Annual Meeting,Vol\.69,pp\. 354–359\.Cited by:[§1](https://arxiv.org/html/2605.15400#S1.p1.1)\.
- \[3\]A\. Bauer, D\. Wollherr, and M\. Buss\(2008\)Human–robot collaboration: a survey\.International Journal of Humanoid Robotics5\(01\),pp\. 47–66\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p3.1)\.
- \[4\]D\. S\. Bernstein, R\. Givan, N\. Immerman, and S\. Zilberstein\(2002\)The complexity of decentralized control of markov decision processes\.Mathematics of operations research27,pp\. 819–840\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p5.13)\.
- \[5\]M\. Carroll, R\. Shah, M\. K\. Ho, T\. Griffiths, S\. Seshia, P\. Abbeel, and A\. Dragan\(2019\)On the utility of learning about humans for human\-ai coordination\.Advances in Neural Information Processing Systems32\.Cited by:[8th item](https://arxiv.org/html/2605.15400#A1.I2.i8.p1.1),[§1](https://arxiv.org/html/2605.15400#S1.p4.1),[§2](https://arxiv.org/html/2605.15400#S2.p2.1),[§2](https://arxiv.org/html/2605.15400#S2.p3.1)\.
- \[6\]R\. Charakorn, P\. Manoonpong, and N\. Dilokthanakul\(2020\)Investigating partner diversification methods in cooperative multi\-agent deep reinforcement learning\.InInternational Conference on Neural Information Processing,pp\. 395–402\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p2.1)\.
- \[7\]R\. Charakorn, P\. Manoonpong, and N\. Dilokthanakul\(2024\)Diversity is not all you need: training a robust cooperative agent needs specialist partners\.Advances in Neural Information Processing Systems37,pp\. 56401–56423\.Cited by:[§1](https://arxiv.org/html/2605.15400#S1.p2.1)\.
- \[8\]M\. Fontaine, Y\. Hsu, Y\. Zhang, B\. Tjanaka, and S\. Nikolaidis\(2021\-07\)On the importance of environments in human\-robot coordination\.InProceedings of Robotics: Science and Systems,Virtual\.External Links:[Document](https://dx.doi.org/10.15607/RSS.2021.XVII.038)Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p2.1)\.
- \[9\]T\. Gessler, T\. Dizdarevic, A\. Calinescu, B\. Ellis, A\. Lupu, and J\. N\. Foerster\(2025\)OvercookedV2: rethinking overcooked for zero\-shot coordination\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p2.1)\.
- \[10\]G\. Hoffman\(2019\)Evaluating fluency in human–robot collaboration\.IEEE Transactions on Human\-Machine Systems49\(3\),pp\. 209–218\.Cited by:[§E\.5](https://arxiv.org/html/2605.15400#A5.SS5.p1.1)\.
- \[11\]J\. Hong, S\. Levine, and A\. Dragan\(2023\)Learning to influence human behavior with offline reinforcement learning\.Advances in Neural Information Processing Systems36,pp\. 36094–36105\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p3.1)\.
- \[12\]H\. Hu, A\. Lerer, A\. Peysakhovich, and J\. Foerster\(2020\)“Other\-Play” for zero\-shot coordination\.InInternational Conference on Machine Learning,pp\. 4399–4410\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p3.1)\.
- \[13\]M\. Jaderberg, V\. Dalibard, S\. Osindero, W\. M\. Czarnecki, J\. Donahue, A\. Razavi, O\. Vinyals, T\. Green, I\. Dunning, K\. Simonyan,et al\.\(2017\)Population based training of neural networks\.arXiv preprint arXiv:1711\.09846\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p3.1)\.
- \[14\]N\. Jaques, A\. Lazaridou, E\. Hughes, C\. Gulcehre, P\. Ortega, D\. Strouse, J\. Z\. Leibo, and N\. De Freitas\(2019\)Social influence as intrinsic motivation for multi\-agent deep reinforcement learning\.InInternational conference on machine learning,pp\. 3040–3049\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p4.1)\.
- \[15\]L\. Kolodny\(2026\-02\)Apptronik raises $520 million to beat chinese humanoids, tesla optimus to market\.Note:CNBCCited by:[§1](https://arxiv.org/html/2605.15400#S1.p1.1)\.
- \[16\]D\. Kontogiorgos and H\. R\. M\. Pelikan\(2020\)Towards adaptive and least\-collaborative\-effort social robots\.InCompanion of the 2020 ACM/IEEE International Conference on Human\-Robot Interaction,pp\. 311–313\.Cited by:[§1](https://arxiv.org/html/2605.15400#S1.p1.1)\.
- \[17\]J\. G\. Kuba, R\. Chen, M\. Wen, Y\. Wen, F\. Sun, J\. Wang, and Y\. Yang\(2022\)Trust region policy optimisation in multi\-agent reinforcement learning\.InInternational Conference on Learning Representations,Cited by:[§C\.1](https://arxiv.org/html/2605.15400#A3.SS1.p3.1)\.
- \[18\]K\. Kurach, A\. Raichuk, P\. Stańczyk, M\. Zając, O\. Bachem, L\. Espeholt, C\. Riquelme, D\. Vincent, M\. Michalski, O\. Bousquet,et al\.\(2020\)Google research football: a novel reinforcement learning environment\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.34,pp\. 4501–4510\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p4.1)\.
- \[19\]B\. Li, S\. Shi, L\. Romero, H\. Li, Y\. Xie, W\. Kim, S\. Nikolaidis, C\. M\. Lewis, K\. P\. Sycara, and S\. Stepputtis\(2025\)Adaptively coordinating with novel partners via learned latent strategies\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§A\.5](https://arxiv.org/html/2605.15400#A1.SS5.p1.1),[§E\.5](https://arxiv.org/html/2605.15400#A5.SS5.p1.1),[§2](https://arxiv.org/html/2605.15400#S2.p3.1)\.
- \[20\]J\. Li, K\. Kuang, B\. Wang, X\. Li, F\. Wu, J\. Xiao, and L\. Chen\(2023\)Two heads are better than one: a simple exploration framework for efficient multi\-agent reinforcement learning\.Advances in neural information processing systems36,pp\. 20038–20053\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p4.1)\.
- \[21\]Y\. Liang, D\. Chen, A\. Gupta, S\. S\. Du, and N\. Jaques\(2024\)Learning to cooperate with humans using generative agents\.Advances in Neural Information Processing Systems37,pp\. 60061–60087\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p3.1),[§4](https://arxiv.org/html/2605.15400#S4.p2.1)\.
- \[22\]I\. Liu, U\. Jain, R\. A\. Yeh, and A\. Schwing\(2021\)Cooperative exploration for multi\-agent deep reinforcement learning\.InInternational conference on machine learning,pp\. 6826–6836\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p4.1)\.
- \[23\]Y\. Liu, Y\. Li, X\. Xu, Y\. Dou, and D\. Liu\(2022\)Heterogeneous skill learning for multi\-agent tasks\.Advances in neural information processing systems35,pp\. 37011–37023\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p4.1)\.
- \[24\]R\. Lowe, Y\. I\. Wu, A\. Tamar, J\. Harb, O\. Pieter Abbeel, and I\. Mordatch\(2017\)Multi\-agent actor\-critic for mixed cooperative\-competitive environments\.Advances in neural information processing systems30\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p4.1)\.
- \[25\]A\. Lupu, B\. Cui, H\. Hu, and J\. Foerster\(2021\)Trajectory diversity for zero\-shot coordination\.InInternational Conference on Machine Learning,pp\. 7204–7213\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p3.1)\.
- \[26\]R\. R\. McCrae and P\. T\. Costa Jr\(1999\)A five\-factor theory of personality\.Handbook of personality: Theory and research2\(1999\),pp\. 139–153\.Cited by:[Appendix D](https://arxiv.org/html/2605.15400#A4.p4.1)\.
- \[27\]D\. Mukherjee, K\. Gupta, L\. H\. Chang, and H\. Najjaran\(2022\)A survey of robot learning strategies for human\-robot collaboration in industrial settings\.Robotics and Computer\-Integrated Manufacturing73,pp\. 102231\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p2.1)\.
- \[28\]L\. Newsham and D\. Prince\(2025\)Personality\-driven decision making in llm\-based autonomous agents\.InProceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems,AAMAS,Detroit, MI, USA,pp\. 1538–1547\.Cited by:[Appendix D](https://arxiv.org/html/2605.15400#A4.p7.1),[§4\.2](https://arxiv.org/html/2605.15400#S4.SS2.p1.1)\.
- \[29\]A\. Ni, S\. Stepputtis, S\. Nikolaidis, M\. Lewis, K\. P\. Sycara, and W\. Kim\(2026\)Theory of mind guided strategy adaptation for zero\-shot coordination\.InProceedings of the International Conference on Autonomous Agents and Multiagent Systems,Cited by:[§1](https://arxiv.org/html/2605.15400#S1.p2.1)\.
- \[30\]T\. O’neill, N\. McNeese, A\. Barron, and B\. Schelble\(2022\)Human–autonomy teaming: a review and analysis of the empirical literature\.Human factors64\(5\),pp\. 904–938\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p2.1)\.
- \[31\]I\. Obi, R\. Wang, W\. Jo, and B\. Min\(2025\)Investigating the impact of trust in multi\-human multi\-robot task allocation\.InProceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\),Hangzhou, China\.Cited by:[§1](https://arxiv.org/html/2605.15400#S1.p1.1),[§2](https://arxiv.org/html/2605.15400#S2.p2.1)\.
- \[32\]F\. A\. Oliehoek, C\. Amato,et al\.\(2016\)A concise introduction to decentralized pomdps\.Vol\.1,Springer\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p5.13)\.
- \[33\]R\. Paleja, M\. Ghuy, N\. Ranawaka Arachchige, R\. Jensen, and M\. Gombolay\(2021\)The utility of explainable ai in ad hoc human\-machine teaming\.Advances in neural information processing systems34,pp\. 610–623\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p3.1)\.
- \[34\]R\. Paleja, M\. Munje, K\. C\. Chang, R\. Jensen, and M\. Gombolay\(2024\)Designs for enabling collaboration in human\-machine teaming via interactive and explainable systems\.Advances in Neural Information Processing Systems37,pp\. 64942–64969\.Cited by:[§1](https://arxiv.org/html/2605.15400#S1.p1.1)\.
- \[35\]M\. Samvelyan, T\. Rashid, C\. S\. De Witt, G\. Farquhar, N\. Nardelli, T\. G\. J\. Rudner, C\. Hung, P\. H\. S\. Torr, J\. Foerster, and S\. Whiteson\(2019\)The starcraft multi\-agent challenge\.arXiv preprint arXiv:1902\.04043\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p4.1)\.
- \[36\]B\. Sarkar, A\. Shih, and D\. Sadigh\(2023\)Diverse conventions for human\-AI collaboration\.InThirty\-seventh Conference on Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2605.15400#S1.p2.1)\.
- \[37\]T\. Schaul, J\. Quan, I\. Antonoglou, and D\. Silver\(2015\)Prioritized experience replay\.arXiv preprint arXiv:1511\.05952\.Cited by:[§A\.5](https://arxiv.org/html/2605.15400#A1.SS5.p2.1)\.
- \[38\]J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov\(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p4.1)\.
- \[39\]B\. Shacklett, L\. G\. Rosenzweig, Z\. Xie, B\. Sarkar, A\. Szot, E\. Wijmans, V\. Koltun, D\. Batra, and K\. Fatahalian\(2023\)An extensible, data\-oriented architecture for high\-performance, many\-world simulation\.ACM Transactions on Graphics \(TOG\)42\(4\),pp\. 1–13\.Cited by:[8th item](https://arxiv.org/html/2605.15400#A1.I2.i8.p1.1)\.
- \[40\]H\. C\. Siu, J\. D\. Pena, E\. Chen, Y\. Zhou, V\. Lopez, K\. Palko, K\. C\. Chang, and R\. E\. Allen\(2021\)Evaluation of human\-AI teams for learned and rule\-based agents in hanabi\.InAdvances in Neural Information Processing Systems,A\. Beygelzimer, Y\. Dauphin, P\. Liang, and J\. W\. Vaughan \(Eds\.\),Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p2.1)\.
- \[41\]V\. Sreeramdass, R\. R\. Paleja, L\. Chen, S\. van Waveren, and M\. Gombolay\(2025\)Generalized behavior learning from diverse demonstrations\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p2.1)\.
- \[42\]P\. Stone, G\. Kaminka, S\. Kraus, and J\. Rosenschein\(2010\)Ad hoc autonomous agent teams: collaboration without pre\-coordination\.InProceedings of the AAAI conference on artificial intelligence,Vol\.24,pp\. 1504–1509\.Cited by:[§1](https://arxiv.org/html/2605.15400#S1.p1.1)\.
- \[43\]D\. Strouse, K\. McKee, M\. Botvinick, E\. Hughes, and R\. Everett\(2021\)Collaborating with humans without human data\.Advances in Neural Information Processing Systems34,pp\. 14502–14515\.Cited by:[§1](https://arxiv.org/html/2605.15400#S1.p2.1),[§2](https://arxiv.org/html/2605.15400#S2.p3.1),[§4](https://arxiv.org/html/2605.15400#S4.p2.1)\.
- \[44\]M\. Tomasello\(2009\)Why we cooperate\.MIT press\.Cited by:[§1](https://arxiv.org/html/2605.15400#S1.p1.1)\.
- \[45\]O\. Vinyals, I\. Babuschkin, W\. M\. Czarnecki, M\. Mathieu, A\. Dudzik, J\. Chung, D\. H\. Choi, R\. Powell, T\. Ewalds, P\. Georgiev,et al\.\(2019\)Grandmaster level in starcraft ii using multi\-agent reinforcement learning\.nature575\(7782\),pp\. 350–354\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p4.1)\.
- \[46\]H\. Wang, Z\. Tian, Y\. Song, X\. Zhang, and Z\. Cai\(2024\)Beyond single stationary policies: meta\-task players as naturally superior collaborators\.InAdvances in Neural Information Processing Systems,Vol\.37,pp\. 78836–78862\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p2.1)\.
- \[47\]R\. Wang, D\. Zhao, D\. Suh, Z\. Yuan, G\. Chen, and B\. Min\(2025\)Personalization in human\-robot interaction through preference\-based action representation learning\.InProceedings of the IEEE International Conference on Robotics and Automation,pp\. 7377–7384\.External Links:[Document](https://dx.doi.org/10.1109/ICRA55743.2025.11128756)Cited by:[§1](https://arxiv.org/html/2605.15400#S1.p1.1)\.
- \[48\]T\. Wang, H\. Dong, V\. Lesser, and C\. Zhang\(2020\)ROMA: multi\-agent reinforcement learning with emergent roles\.InProceedings of the 37th International Conference on Machine Learning,ICML,pp\. 9876–9886\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p4.1)\.
- \[49\]T\. Wang, J\. Wang, Y\. Wu, and C\. Zhang\(2020\)Influence\-based multi\-agent exploration\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p4.1)\.
- \[50\]X\. Wang, S\. Zhang, W\. Zhang, W\. Dong, J\. Chen, Y\. Wen, and W\. Zhang\(2024\)Zsc\-eval: an evaluation toolkit and benchmark for multi\-agent zero\-shot coordination\.Advances in Neural Information Processing Systems37,pp\. 47344–47377\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p3.1)\.
- \[51\]P\. Xu, J\. Zhang, and K\. Huang\(2024\)Population\-based diverse exploration for sparse\-reward multi\-agent tasks\.InIJCAI,pp\. 283–291\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p3.1)\.
- \[52\]C\. Yu, A\. Velu, E\. Vinitsky, J\. Gao, Y\. Wang, A\. Bayen, and Y\. Wu\(2022\)The surprising effectiveness of ppo in cooperative multi\-agent games\.Advances in Neural Information Processing Systems35,pp\. 24611–24624\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p4.1),[§4](https://arxiv.org/html/2605.15400#S4.p2.1)\.
- \[53\]L\. Yuan, L\. Li, Z\. Zhang, F\. Chen, T\. Zhang, C\. Guan, Y\. Yu, and Z\. Zhou\(2023\)Learning to coordinate with anyone\.InProceedings of the Fifth International Conference on Distributed Artificial Intelligence,pp\. 1–9\.Cited by:[§2](https://arxiv.org/html/2605.15400#S2.p3.1)\.
- \[54\]C\. Zhang, K\. Yang, S\. Hu, Z\. Wang, G\. Li, Y\. Sun, C\. Zhang, Z\. Zhang, A\. Liu, S\. Zhu,et al\.\(2024\)Proagent: building proactive cooperative agents with large language models\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 17591–17599\.Cited by:[Appendix D](https://arxiv.org/html/2605.15400#A4.p7.1)\.
- \[55\]R\. Zhao, J\. Song, Y\. Yuan, H\. Hu, Y\. Gao, Y\. Wu, Z\. Sun, and W\. Yang\(2023\)Maximum entropy population\-based training for zero\-shot human\-ai coordination\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.37,pp\. 6145–6153\.Cited by:[§A\.5](https://arxiv.org/html/2605.15400#A1.SS5.p2.1),[§1](https://arxiv.org/html/2605.15400#S1.p2.1),[§2](https://arxiv.org/html/2605.15400#S2.p3.1),[§3\.1\.2](https://arxiv.org/html/2605.15400#S3.SS1.SSS2.p1.1),[§4](https://arxiv.org/html/2605.15400#S4.p2.1)\.

## Appendix A Experimental Details and Hyperparameters

### A.1 Reward Shaping

In all layouts, agents must coordinate to prepare and serve onion soup under standard Overcooked dynamics\.

- Sparse task reward. Delivery of a completed three-onion soup yields 20.
- Intermediate shaping. Placing an onion into a pot yields +3; picking up a cooked soup yields +5; dish pickup shaping is 0.

This shaping eases exploration while preserving long\-horizon coordination requirements, since high return still requires multi\-step coordinated sequences to finish and deliver soups\.
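
To make the reward scale concrete, the sketch below totals the reward for one complete soup under the values listed above. The event names and the `episode_return` helper are hypothetical labels chosen for illustration; they are not identifiers from the Overcooked-AI codebase.

```python
# Reward values from the list above; event names are illustrative assumptions.
REWARDS = {
    "onion_in_pot": 3,    # intermediate shaping
    "dish_pickup": 0,     # shaping set to zero
    "soup_pickup": 5,     # intermediate shaping
    "soup_delivery": 20,  # sparse task reward
}

def episode_return(events):
    """Sum the reward of every event in an episode's event log."""
    return sum(REWARDS[event] for event in events)

# One three-onion soup carried through to delivery.
one_soup = ["onion_in_pot"] * 3 + ["dish_pickup", "soup_pickup", "soup_delivery"]
total = episode_return(one_soup)                    # 3*3 + 0 + 5 + 20 = 34
shaped_share = (total - REWARDS["soup_delivery"]) / total
```

Under these values, shaping contributes 14 of the 34 points per delivered soup, so most of the return still depends on completing the full delivery sequence.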

### A.2 Hyperparameter Rationale

- •Environment and rollout\.All training phases usenenvs=64n\_\{\\text\{envs\}\}=64parallel environments and an episode horizon ofH=400H=400\. The parallel environments provide stable rollout statistics while maintaining practical throughput, and the fixed horizon preserves delayed coordination effects\. PPO collectsnsteps=1024n\_\{\\text\{steps\}\}=1024steps per environment before each update\.
- •PPO optimization\.PPO uses discount factorγ=0\.99\\gamma=0\.99and GAE parameterλ=0\.95\\lambda=0\.95to retain long\-horizon returns while controlling variance\. The clip range is fixed at0\.150\.15, the entropy coefficient is0\.050\.05, and the learning rate is1×10−41\\times 10^\{\-4\}across the implemented IBTS training phases\. We use66PPO epochs per update and batch size10241024\.
- Influence shaping and diversity. The influence event horizon is fixed at $K=4$, providing a short future window for detecting delayed teammate follow-up. The influence coefficient is $\lambda_{\mathrm{inf}}=5.0$. The online posterior models are trained with learning rate $1\times 10^{-4}$, batch size 2048, and one epoch per update. Behavioral diversity uses MEP coefficient $\lambda_{\mathrm{div}}=0.01$ with numerical constant $\epsilon=10^{-8}$.
- Team-pool training. The team pool contains $M=5$ teams. Each team is trained for 30M environment steps using a deterministic round-robin schedule, with chunks of 5M steps per team. The diversity objective is activated after the first full population cycle so that the population reference is based on nontrivial policies.
- Trajectory predictor. The team predictor uses trajectory windows of length 20 with the features $(x, y, \text{facing}, \text{held}, \text{action})$ for each selected agent. It is trained as a Transformer classifier over team identifiers using cross-entropy loss. We use $d_{\mathrm{model}}=64$, 4 attention heads, 2 Transformer layers, feedforward dimension 128, dropout 0.1, batch size 256, learning rate $10^{-3}$, weight decay $10^{-4}$, and early stopping patience 25 over a maximum of 500 epochs.
- Predictor-guided steering. Teacher training uses the frozen trajectory predictor and scalar team-quality scores from the team-pool evaluation summary. The steering reward uses coefficient $\alpha=0.5$ and temporal offset $\Delta=10$; invalid future windows receive zero steering reward. Each teacher is trained for 50M environment steps using the same PPO settings as above.
- Compute resources. All models were trained on NVIDIA A30, A40, or A100 GPUs with 32GB GPU memory. Training one team pool took approximately 16 hours, and training a predictor-guided best-response agent against a fixed pool took approximately 3 days on a single GPU worker.
- Implementation platform. We extend the standard two-agent Overcooked-AI setting \[[5](https://arxiv.org/html/2605.15400#bib.bib1)\] to three-agent layouts in order to study two-human–one-machine HMT. The three-agent environments preserve the same low-level Overcooked action interface and task mechanics, including movement, interaction, object pickup and placement, pot cooking, and soup delivery. For computational efficiency, we implement these environments using the Madrona many-world simulation framework \[[39](https://arxiv.org/html/2605.15400#bib.bib40)\]. This implementation affects simulation throughput but does not change the action interface used by learned agents or human participants.
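
The round-robin schedule described in the team-pool training item can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `round_robin_schedule` and its return format are hypothetical, and it assumes that "first full population cycle" means every team has completed its first 5M-step chunk.

```python
def round_robin_schedule(num_teams=5, steps_per_team=30_000_000, chunk_steps=5_000_000):
    """Yield (team_index, start_step, end_step, diversity_on) training chunks.

    Teams are visited in a fixed order; each visit trains one chunk.
    Assumption: the diversity objective turns on once every team has
    finished its first chunk (that is, from the second cycle onward).
    """
    assert steps_per_team % chunk_steps == 0
    num_cycles = steps_per_team // chunk_steps
    for cycle in range(num_cycles):
        for team in range(num_teams):
            start = cycle * chunk_steps
            yield team, start, start + chunk_steps, cycle >= 1


schedule = list(round_robin_schedule())
```

With the stated settings this gives 6 cycles over 5 teams, so 30 chunks in total, of which the first 5 run without the diversity term.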

Table [1](https://arxiv.org/html/2605.15400#A1.T1) summarizes the primary hyperparameters used across the implemented IBTS training phases.

Table 1: Key hyperparameters used across IBTS training phases.

### A.3 Choice of Intrinsic Scaling Coefficients

The influence and diversity terms are integrated as intrinsic shaping rewards in the team-pool construction objective in Eq. [5](https://arxiv.org/html/2605.15400#S3.E5). The coefficients $\lambda_{\mathrm{inf}}$ and $\lambda_{\mathrm{div}}$ control the relative strength of the coordination and diversity incentives. We set $\lambda_{\mathrm{inf}}=5.0$ and $\lambda_{\mathrm{div}}=0.01$.

Our choice of $\lambda_{\mathrm{inf}}=5.0$ is primarily a reward-scale calibration. The influence reward in Eq. [3](https://arxiv.org/html/2605.15400#S3.E3) is a baseline-adjusted probability lift, averaged over target teammates and computed over a short-horizon coordination event. Because it is formed as a difference between predicted event probabilities and then averaged across targets, the per-step influence term is naturally small, typically on the order of $10^{-3}$ after averaging over $j\neq i$. Multiplying by $\lambda_{\mathrm{inf}}=5.0$ keeps the shaped influence contribution on the order of a few $\times 10^{-3}$, which is large enough to provide coordination guidance when task reward is sparse, but small enough to remain secondary once environment reward becomes informative.

This scale choice is especially important early in training. At initialization, deliveries are rare and the average per-step environment reward can be very small, so a modest influence term can help reinforce salient actions that increase the probability of short-horizon teammate follow-up. As learning begins, the extrinsic reward rises into a regime where task progress becomes more frequent, while the scaled influence term remains at an auxiliary scale. Thus, $\lambda_{\mathrm{inf}}=5.0$ provides a meaningful coordination bias during the sparse-reward phase without turning influence maximization into the primary optimization target.

For the behavioral diversity term, we use $\lambda_{\mathrm{div}}=0.01$ because the unscaled reward in Eq. [4](https://arxiv.org/html/2605.15400#S3.E4) typically lies between roughly 0.5 and 1.7 in our runs. Scaling this term by 0.01 yields an auxiliary contribution of approximately 0.005 to 0.017, which is comparable to the scaled influence contribution and small relative to task reward once training becomes productive. This keeps diversity as a regularizer that discourages collapse to a single convention without overwhelming the environment objective or the influence-shaping signal.

For predictor-guided steering, we use $\alpha=0.5$. The team-quality scores $S(m)$ are normalized to lie in $[0,1]$, and the predictor distribution $p_t(m)$ is also normalized over the reference team pool. Therefore, the soft trajectory-quality estimate $Q(h_t)=\sum_{m=1}^{M}p_t(m)S(m)$ also lies in $[0,1]$. The steering reward in Eq. [9](https://arxiv.org/html/2605.15400#S3.E9) is based on the short-horizon improvement $Q(h_{t+\Delta})-Q(h_t)$, so its raw magnitude is usually much smaller than the absolute quality score and is often on the order of $10^{-2}$. Multiplying by $\alpha=0.5$ keeps the steering contribution at a comparable auxiliary scale to the influence and diversity shaping terms, while preventing the teacher from optimizing the predictor score at the expense of task reward.
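
To make the steering scale concrete, the sketch below computes the soft quality estimate $Q(h_t)$ and a steering reward from a sequence of predictor distributions. It is an illustration under stated assumptions: the exact form of Eq. 9 is not reproduced in this appendix, so the code assumes the reward is simply $\alpha\,(Q(h_{t+\Delta})-Q(h_t))$, with zero reward when the future window is unavailable. The function names are hypothetical.

```python
def soft_quality(p, scores):
    """Q(h_t) = sum_m p_t(m) * S(m) for one predictor distribution p_t."""
    assert len(p) == len(scores)
    return sum(pm * sm for pm, sm in zip(p, scores))


def steering_rewards(pred_dists, scores, alpha=0.5, delta=10):
    """Assumed steering reward: alpha * (Q(h_{t+delta}) - Q(h_t)).

    pred_dists[t] is the predictor distribution over the team pool at step t.
    Steps whose future window falls outside the rollout receive zero reward.
    """
    q = [soft_quality(p, scores) for p in pred_dists]
    horizon = len(q)
    rewards = []
    for t in range(horizon):
        if t + delta < horizon:
            rewards.append(alpha * (q[t + delta] - q[t]))
        else:
            rewards.append(0.0)
    return rewards


# Toy example with three reference teams and an offset of one step.
scores = [0.0, 0.5, 1.0]
dists = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
toy_rewards = steering_rewards(dists, scores, alpha=0.5, delta=1)
```

Because both $p_t$ and $S$ are normalized, each $Q$ stays in $[0,1]$, so the reward magnitude is bounded by $\alpha$ and is typically far smaller when the predictor distribution shifts gradually.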

### A.4 Sensitivity to Event Horizon $K$

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/kstep_sensitivity_mappo_is.png)

Figure 6: Sensitivity of MAPPO+IS to the event horizon $K$ on PL-3 and AA-3. Bars show mean final episode return over 12 seeds, and error bars show $\pm 1$ standard deviation.

We ablate the event horizon $K$ in the event-level influence label while holding the MAPPO+IS pipeline fixed. We compare $K\in\{1,4,7\}$ on the 3-agent Pipeline (PL-3) and Asymmetric Advantages (AA-3) layouts, ranging from immediate next-step labeling to a longer timing tolerance.

Figure [6](https://arxiv.org/html/2605.15400#A1.F6) shows that a medium horizon improves performance over immediate next-step labeling in both layouts. In Pipeline, moving from $K=1$ to $K=4$ yields a $31.7\%$ gain in mean return, while extending to $K=7$ produces a $32.1\%$ drop relative to $K=4$. In Asymmetric Advantages, $K=4$ improves mean return by $9.1\%$ over $K=1$, and $K=7$ reduces performance by $6.7\%$ relative to $K=4$.

These trends match the timing structure of Overcooked coordination. With $K=1$, many legitimate follow-ups cannot occur on the next step because teammates may need to move, avoid collisions, or reach an interaction tile before responding. A short future window therefore gives the influence label enough tolerance to capture delayed teammate responses. However, overly long horizons can dilute attribution: as $K$ increases, positive labels become less selective and may occur under typical dynamics even when the initiating action did not meaningfully cause the follow-up. This weakens directed influence estimation and is consistent with the drop at $K=7$, especially in Pipeline.

We therefore use $K=4$ as the default in the main experiments, balancing tolerance to short coordination delays with keeping the event label tied to the initiating action. Overall, these results suggest that influence shaping is most effective when the event horizon captures near-term teammate responses without making the label too permissive.
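
The effect of $K$ on label selectivity can be illustrated with a small sketch. Eq. 2 is not reproduced in this appendix, so the code assumes a simple reading of the event-level label: $y_{j,t}^{(K)}=1$ if teammate $j$ performs a follow-up event at any of the steps $t+1,\ldots,t+K$. The function name and the boolean event encoding are hypothetical.

```python
def k_step_labels(events, K=4):
    """Binary K-step follow-up labels for one teammate.

    events[t] is True when the teammate performs a follow-up event at step t.
    Assumed label: 1 if any event occurs in steps t+1 .. t+K, else 0.
    """
    horizon = len(events)
    return [int(any(events[t + 1 : t + 1 + K])) for t in range(horizon)]


# A single follow-up event at step 3 of a six-step rollout.
events = [False, False, False, True, False, False]
labels_k1 = k_step_labels(events, K=1)
labels_k4 = k_step_labels(events, K=4)
```

With $K=1$ only the step immediately before the event is labeled positive, while $K=4$ also credits the two earlier steps. Larger windows credit even more initiating steps, which is the loss of selectivity discussed above.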

### A.5 Comparison Design Choices

We exclude TALENTS \[[19](https://arxiv.org/html/2605.15400#bib.bib3)\], although it is a closer recent baseline that outperforms MEP and GAMMA, because its pre-designed high-level action space introduces task-specific abstractions, whereas our comparisons keep all methods on the same low-level Overcooked action interface. This choice avoids giving one method additional task knowledge through hand-designed abstractions.

For the same reason, during best-response training we sample frozen partners uniformly from the team pool rather than using priority-based partner sampling \[[37](https://arxiv.org/html/2605.15400#bib.bib39)\]. In MEP-style best-response training, partners that are harder to collaborate with can be assigned higher sampling priority \[[55](https://arxiv.org/html/2605.15400#bib.bib12)\], shifting training away from a pure average-case objective and toward a smooth approximation of maximizing worst-case performance over the partner population. While this can improve robustness, it also introduces an additional design choice about which partners should be emphasized during training. In our setting, such curated partner exposure could inject extra prior knowledge into the comparison. We therefore use uniform partner sampling to keep the evaluation focused on the predictor-guided steering objective rather than on a manually shaped partner-selection curriculum.
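
The difference between the two sampling schemes can be sketched as follows. The uniform rule matches the description above. The priority rule is a hypothetical stand-in, not the scheme from the cited work: it uses a softmax over negative returns, which is one common way to emphasize harder partners and smoothly approximate a worst-case objective. The `temperature` parameter and function names are assumptions made for illustration.

```python
import math
import random


def uniform_partner_probs(num_partners):
    """Uniform sampling: every frozen partner is equally likely."""
    return [1.0 / num_partners] * num_partners


def priority_partner_probs(returns, temperature=1.0):
    """Hypothetical priority rule: softmax over negative returns.

    Partners with lower collaborative return get higher probability.
    As temperature shrinks, sampling concentrates on the worst partner.
    """
    lowest = min(returns)
    weights = [math.exp(-(r - lowest) / temperature) for r in returns]
    total = sum(weights)
    return [w / total for w in weights]


def sample_partner(probs, rng):
    """Draw one partner index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


uniform = uniform_partner_probs(3)
priority = priority_partner_probs([10.0, 50.0, 90.0], temperature=20.0)
```

Uniform sampling needs no return estimates and no temperature, which is why it adds no extra design choice to the comparison.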

## Appendix B Training Algorithms

This section provides implementation-level pseudocode for the main IBTS training procedures. Algorithm [1](https://arxiv.org/html/2605.15400#alg1) summarizes influence-shaped team-pool construction, corresponding to Section [3.1](https://arxiv.org/html/2605.15400#S3.SS1). Algorithm [2](https://arxiv.org/html/2605.15400#alg2) summarizes predictor-guided teacher learning and shared-student distillation, corresponding to Section [3.3](https://arxiv.org/html/2605.15400#S3.SS3).

Algorithm 1: Team-Pool Construction with Influence Shaping and Behavioral Diversity

1: Initialize $M$ teams $\{\Pi^{(m)}\}_{m=1}^{M}$, where each team is $\Pi^{(m)}=\{\pi_{1}^{(m)},\ldots,\pi_{n}^{(m)}\}$ with $n$ agents

2: while not converged do

3: Sample a team $\Pi$, keep remaining teams frozen

4: for each policy update do

5: Sample joint action $\mathbf{a}_{t}\sim\Pi(\cdot\mid\tilde{o}_{t})$ at each timestep $t$, step environment and collect rollout

6: for each ordered pair $(i,j)$ with $j\neq i$ do

7: Construct $y_{j,t}^{(K)}$ using Eq. [2](https://arxiv.org/html/2605.15400#S3.E2), update $q_{i\rightarrow j}$ and $\omega_{j}$ using binary cross-entropy on the rollout

8: end for

9: Compute $r_{i,t}^{\mathrm{inf}}$ using Eq. [3](https://arxiv.org/html/2605.15400#S3.E3) and $r_{i,t}^{\mathrm{div}}$ using Eq. [4](https://arxiv.org/html/2605.15400#S3.E4)

10: for each agent $i\in\{1,\dots,n\}$ and timestep $t$ do

11: Form actor reward $r_{i,t}$ using Eq. [5](https://arxiv.org/html/2605.15400#S3.E5)

12: end for

13: Reduce per-agent rewards to team reward $r_{t}^{\mathrm{team}}=\frac{1}{n}\sum_{i=1}^{n}r_{i,t}$

14: Compute GAE / returns for actor and critic, and update only the active team $\Pi$

15: end for

16: end while
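
Steps 11 and 13 of Algorithm 1 can be illustrated with a short sketch. Eq. 5 is not reproduced in this appendix, so the additive form below (environment reward plus scaled influence and diversity terms) is an assumption based on the description of the two terms as intrinsic shaping rewards. The function names are hypothetical.

```python
def actor_reward(r_env, r_inf, r_div, lam_inf=5.0, lam_div=0.01):
    """Assumed additive shaping: r_env + lam_inf * r_inf + lam_div * r_div."""
    return r_env + lam_inf * r_inf + lam_div * r_div


def team_reward(per_agent_rewards):
    """Step 13: average the per-agent actor rewards into one team reward."""
    return sum(per_agent_rewards) / len(per_agent_rewards)


# One timestep for a three-agent team with no environment reward yet.
influence = [1e-3, 2e-3, 0.0]   # raw per-agent influence terms (illustrative values)
diversity = [1.0, 1.0, 1.0]     # raw per-agent diversity terms (illustrative values)
shaped = [actor_reward(0.0, ri, rd) for ri, rd in zip(influence, diversity)]
r_team = team_reward(shaped)
```

In this toy timestep each shaped reward is between 0.010 and 0.020, matching the auxiliary scale discussed in Appendix A.3.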

Algorithm 2: Predictor-Guided Teacher Learning and Shared-Student Distillation

1: for $\pi_{i}^{\mathrm{teacher}}$ where $i\in\{1,\dots,n\}$ do

2: while not converged do

3: Compute predictor outputs $c_{t}$ and $p_{t}(m)$ from recent history $h_{t}$ at each timestep $t$

4: Sample target action $a_{t}^{i}\sim\pi_{i}^{\mathrm{teacher}}(\cdot\mid o_{t}^{i},c_{t})$, sample frozen-partner actions for the remaining agent positions, step environment, and collect rollout

5: Compute $Q(h_{t})$ using Eq. [8](https://arxiv.org/html/2605.15400#S3.E8), $r_{t}^{\mathrm{steer}}$ using Eq. [9](https://arxiv.org/html/2605.15400#S3.E9), and $r_{t}^{\mathrm{total}}$ using Eq. [10](https://arxiv.org/html/2605.15400#S3.E10)

6: Compute GAE / returns for actor and critic, and update only $\pi_{i}^{\mathrm{teacher}}$

7: end while

8: end for

9: Roll out $\pi_{i}^{\mathrm{teacher}}$ as agent $i$ with sampled frozen partners and export $o_{t}^{i},c_{t},a_{t}^{i}$ to $\mathcal{D}$ in Eq. [11](https://arxiv.org/html/2605.15400#S3.E11)

10: while not converged do

11: Sample minibatches from $\mathcal{D}$

12: Update shared student policy $\pi^{\mathrm{student}}$ using Eq. [12](https://arxiv.org/html/2605.15400#S3.E12)

13: end while

## Appendix C Ablation and Case Study on Influence Shaping

This appendix provides additional evidence for the influence-shaping component used to construct the self-play team pool. Throughout this section, we refer to the influence-shaping variant as IS.

### C.1 Motivating Case Study in Three-Agent Overcooked-AI

We include this case study to motivate the use of influence shaping for constructing the self\-play team pool\. The goal is not to argue that hand\-designed heuristics are a practical solution, but rather to illustrate that standard cooperative MARL objectives can fail to discover simple interaction patterns even when such patterns are clearly beneficial\. This motivates adding an explicit shaping signal that encourages agents to create opportunities for teammate follow\-up behavior\.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/passing.png)

(a) Pipeline layout with passing-style heuristic behavior

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/failing.png)

(b) Standard MARL baselines underperform the heuristic

Figure 7: Case study showing that standard cooperative MARL baselines can struggle to discover passing-style coordination in a three-agent Pipeline layout.

Figure [7](https://arxiv.org/html/2605.15400#A3.F7) shows a simple three-agent collaborative scenario, which we refer to as the Pipeline layout. The layout admits an intuitive passing-style strategy in which agents specialize into complementary roles and move task objects through the workspace in a coordinated sequence. Although this behavior is simple to specify as a heuristic, it requires agents to learn that an action may be valuable because it enables a teammate's later response, rather than because it immediately increases reward.

In this setting, standard CTDE baselines such as MAPPO and Heterogeneous-Agent Proximal Policy Optimization (HAPPO) \[[17](https://arxiv.org/html/2605.15400#bib.bib38)\] fail to match the hand-designed passing heuristic, achieving 68% and 63% lower reward, respectively. This gap suggests that sparse shared reward alone may be insufficient to induce the interaction structure required for efficient collaboration, even in relatively simple multi-agent layouts. In particular, agents must discover not only useful individual actions, but also the temporal dependencies between one agent's behavior and another agent's subsequent follow-up.

### C.2 Scaling Across Two-, Three-, and Four-Agent Layouts

To examine whether the influence-shaping signal remains useful beyond a single case study, we evaluate it across two-, three-, and four-agent Overcooked-AI layouts. Figure [8](https://arxiv.org/html/2605.15400#A3.F8) summarizes the layouts used in these settings, and Table [2](https://arxiv.org/html/2605.15400#A3.T2) reports the full numerical results across all layouts and team sizes.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/two_agent_settings.png)

2-agent

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/three_agent_settings.png)

3-agent

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/four_agent_settings.png)

4-agent

Figure 8: Layout overview for the 2-agent, 3-agent, and 4-agent settings. In the 2-agent setting, the top row (left to right) shows Pipeline and Asymmetric Advantages, and the bottom row (left to right) shows Forced Coordination, Cramped Room, and Coordination Ring. In the 3-agent setting, the top row (left to right) shows Pipeline, Asymmetric Advantages, and Forced Coordination, and the bottom row (left to right) shows Open Room, Coordination Ring, and Cramped Room. In the 4-agent setting, the top row (left to right) shows Pipeline, Asymmetric Advantages, and Forced Coordination.

Table 2: Mean $\pm$ std over 12 seeds, with the maximum in parentheses, for all layouts and team sizes. The best mean is bolded.

### C.3 Reward-Hacking Ablation

A natural alternative to influence shaping is to directly reward a hand\-designed coordination event\. In Overcooked\-AI, the most obvious such event is a handoff, where one agent drops an item and another agent picks it up within a short temporal window\. To separate the contribution of influence shaping from task\-specific reward engineering, we evaluate a reward\-hacking baseline that adds a dense handoff bonus directly to the environment reward and trains the resulting policy with MAPPO\.

Table 3: Reward-hacking ablation in three-agent Overcooked-AI. Results show mean $\pm$ standard deviation over 12 random seeds.

Table [3](https://arxiv.org/html/2605.15400#A3.T3) shows that directly rewarding handoffs does not reliably improve task performance. The high variance of the reward-hacking baseline suggests unstable behavior across seeds. Some runs can become trapped in poor local optima, where agents repeatedly exchange items to obtain dense immediate reward without improving task progress, while other runs largely ignore the added bonus.

This result highlights an important distinction between handcrafted event rewards and influence shaping\. The reward\-hacking baseline optimizes the frequency of one predefined local event, whereas IS rewards actions that increase the likelihood of task\-relevant teammate follow\-up behavior\. As a result, IS can support a broader class of downstream coordination patterns, such as placing ingredients into pots or reorganizing teammate roles, without hard\-coding a specific handoff behavior\.

### C.4 Generality Beyond Overcooked-AI

We also include a preliminary generality check in Google Research Football \(GRF\)\. Unlike Overcooked\-AI, coordination in GRF cannot be reduced to a single event type\. Depending on the game state, useful teammate responses may include short passes, long passes, shots, or off\-ball repositioning\. We use a setting with three controlled attackers against three defenders and one goalkeeper, with attackers initialized closer to midfield to make the strategy more varied\.

Using the same MAPPO backbone and 30M training steps, we evaluate over 10 deterministic episodes\. Passing ratio denotes the fraction of possession actions that are pass actions, and ball hold time denotes the total number of timesteps in which the controlled team holds the ball without passing\.

Table 4: GRF evaluation of influence shaping.

As shown in Table [4](https://arxiv.org/html/2605.15400#A3.T4), MAPPO+IS scores more goals and increases the passing ratio relative to MAPPO. It also reduces ball hold time without passing by 42.9%, which is consistent with a policy that relies more on teammate interaction and less on prolonged single-agent dribbling. While this experiment is preliminary, it suggests that influence shaping can provide a useful coordination bias beyond Overcooked-AI.
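
The two GRF metrics can be computed from a per-step log as sketched below. This is one possible reading of the definitions given above, not the authors' evaluation code: it assumes that each logged step records whether the controlled team has the ball and which action was taken, and it counts every possession step whose action is not a pass toward ball hold time. The action labels in `PASS_ACTIONS` are hypothetical placeholders.

```python
PASS_ACTIONS = {"short_pass", "long_pass", "high_pass"}  # hypothetical action labels


def passing_metrics(steps):
    """Return (passing_ratio, ball_hold_time) for one episode.

    steps is a list of (has_ball, action) pairs for the controlled team.
    passing_ratio: fraction of possession actions that are passes.
    ball_hold_time: number of possession steps without a pass (assumed reading).
    """
    possession = [action for has_ball, action in steps if has_ball]
    passes = sum(action in PASS_ACTIONS for action in possession)
    ratio = passes / len(possession) if possession else 0.0
    hold_time = len(possession) - passes
    return ratio, hold_time


episode = [
    (True, "dribble"),
    (True, "short_pass"),
    (False, "idle"),
    (True, "shot"),
    (True, "long_pass"),
]
ratio, hold_time = passing_metrics(episode)
```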

## Appendix D Synthetic LLM Evaluation Prompts

This section describes the prompt structure used for the personality-conditioned synthetic LLM partner evaluation. Each LLM partner receives a shared game manual followed by a personality-specific behavior prompt. The model is instructed to output exactly one low-level Overcooked action from \{north, south, east, west, stay, interact\} at each timestep, with no explanation or extra text.

Shared game manual. The shared game manual describes the Overcooked task objective, reward structure, action semantics, object types, and interaction rules. It specifies that agents should work with teammates to maximize team reward by preparing and serving onion soup. The reward structure is: $+3$ for placing an onion into a pot, $+5$ for taking cooked soup out of a pot using a dish, and $+20$ for serving a completed cooked three-onion soup. The manual also explains that agents act in the same kitchen, can hold only one object at a time, must face an object or station before using interact, and may use movement actions to change facing direction even when movement is blocked.

The manual provides task guidance but does not prescribe a complete high\-level strategy\. In particular, it tells the model to make useful progress toward higher reward, avoid repetitive non\-progressing behavior, use dishes when pots are cooking or ready, and choose an action consistent with its assigned personality when multiple actions appear reasonable\. This design keeps the LLM partner at the low\-level action\-selection interface rather than giving it an explicit planner or skill hierarchy\.
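
Because the LLM partner must emit exactly one token from a fixed action set, its raw reply needs to be validated before it is passed to the environment. The sketch below shows one way to do this. It is an assumption for illustration only: the appendix does not state how malformed replies were handled, so the `stay` fallback and the function name `parse_action` are hypothetical.

```python
VALID_ACTIONS = ("north", "south", "east", "west", "stay", "interact")


def parse_action(reply, fallback="stay"):
    """Map a raw LLM reply to one valid low-level action.

    The reply is trimmed and lowercased. Anything that is not exactly one
    of the six allowed actions falls back to a no-op (assumed policy).
    """
    token = reply.strip().lower()
    return token if token in VALID_ACTIONS else fallback


clean = parse_action(" Interact\n")
verbose = parse_action("I will move north")
```

A strict exact-match check keeps the partner at the low-level interface, since any explanatory text is rejected rather than interpreted.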

Agreeable partner. The Agreeable profile follows the Five-Factor view of agreeableness as cooperative and empathetic \[[26](https://arxiv.org/html/2605.15400#bib.bib54)\]. The prompt describes the partner as cooperative, considerate, accommodating, generous, and willing to support others. It instructs the agent to pay close attention to teammates, prefer actions that support smooth teamwork, help through handoffs or continuation of teammate-started work, give way when another teammate is already committed to an object or route, avoid blocking or duplicating effort, and choose complementary tasks when teammates are already making useful progress.

Extraverted partner. The Extraverted profile describes the partner as active, assertive, energetic, and comfortable taking initiative. The prompt instructs the agent to act early, commit quickly to useful tasks, prefer actions that directly move the task forward, keep momentum, use handoffs when clearly efficient, and take the lead in ambiguous situations rather than only reacting to teammates. When several actions appear similarly useful, the agent is encouraged to choose the more assertive, faster, and more directly productive action.

Neurotic partner. The Neurotic profile describes the partner as anxious, cautious, hesitant, and sensitive to uncertainty. The prompt instructs the agent to be less likely to commit quickly when the best action is unclear, prefer safer and more certain actions, avoid uncertain handoffs or tightly timed coordination unless intent is clear, hesitate before taking initiative without an established coordination pattern, and choose actions that are less risky, less disruptive, and easier to control. The prompt still instructs the agent to attend to teammates and the team objective, but allows uncertainty to make the partner slower to commit.

Purpose of the synthetic profiles. The synthetic profiles are not intended to create optimal LLM teammates. Instead, they provide controlled low-level partner variation that complements the real-human HMT evaluation. We use personality-conditioned GPT-5-mini partners to induce distinct action-selection tendencies, such as cooperative handoffs, assertive task initiation, and cautious waiting behavior, while preserving the same game manual, action space, and observation interface across partners. This follows prior work showing that personality induction can produce systematic differences in LLM-based agent decision making \[[28](https://arxiv.org/html/2605.15400#bib.bib44)\]. We do not use skill-conditioned LLM agents such as ProAgent \[[54](https://arxiv.org/html/2605.15400#bib.bib53)\], because our goal is not to construct an explicitly planned high-skill teammate with additional task knowledge or planning mechanisms. Instead, the synthetic profiles are designed to simulate limited-communication HMT settings in which the machine teammate must infer partner intent from observed actions. This lets us evaluate whether IBTS remains robust when teammate behavior differs from the learned-agent population, while keeping the evaluation focused on partner-style variation rather than on the capabilities of a separately engineered LLM planner.

## Appendix E Human Study Protocol

### E.1 Study Design and Conditions

The study used a within\-condition design crossing three Overcooked layouts with three AI partners\. Each participant completed nine formal games in total\. The layouts were Forced Coordination \(FC\), Pipeline \(PL\), and Asymmetric Advantages \(AA\), with layout order fixed as FC, PL, and AA\. The AI partners were MEP, GAMMA, and IBTS, and their order was randomized to reduce ordering effects\. Each game was capped at 400 environment timesteps\.

We evaluated two team\-size conditions\. The 2\-agent condition consisted of one human participant and one AI teammate\. The 3\-agent condition consisted of two human participants and one AI teammate\. Thus, the 3\-agent condition evaluates two\-human–one\-AI collaboration under the same layout and AI\-partner structure as the 2\-agent condition\.

### E.2 Study Procedure

Before the formal trials, participants were given unlimited practice time in PL or AA to familiarize themselves with the controls, task mechanics, and reward structure\. During the formal study, participants completed one FC–PL–AA block with a given AI partner, filled out a team\-effectiveness questionnaire, and then repeated the same process for the remaining AI partners\. Thus, each questionnaire response summarizes the participant’s experience with one AI partner across all three layouts in the corresponding team\-size condition\.

The study was conducted under an IRB\-approved protocol with 30 participants in total\. Most sessions were completed in person: 28 participants completed the study in person, while one two\-human team, consisting of 2 participants, completed the study online through remote control of the study laptop\.

A single game typically lasted between three and six minutes depending on human decision time, and a full session lasted approximately one hour\. Each participant received a $10 Amazon gift card after completing the study\.

### E.3 Gameplay Constraints

The game used synchronous stepping: the environment advanced only after actions were received from all active human players and AI agents\. This prevented AI agents from moving faster than humans simply because the model could act more quickly\.
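
Synchronous stepping can be sketched as a small action buffer that releases a joint action only once every player has submitted. This is an illustrative sketch, not the study software: the class name `SyncStepBuffer`, the player identifiers, and the choice to let a later submission overwrite an earlier one within the same step are all assumptions.

```python
class SyncStepBuffer:
    """Collect one action per player and release them together."""

    def __init__(self, player_ids):
        self.player_ids = tuple(player_ids)
        self.pending = {}

    def submit(self, player_id, action):
        """Record an action; return the joint action once all players have acted.

        Returns None while any player is still missing, so the environment
        cannot advance early when the AI agent responds faster than humans.
        """
        if player_id not in self.player_ids:
            raise KeyError(player_id)
        self.pending[player_id] = action
        if len(self.pending) < len(self.player_ids):
            return None
        joint = tuple(self.pending[p] for p in self.player_ids)
        self.pending = {}
        return joint


buffer = SyncStepBuffer(["human_1", "human_2", "ai"])
first = buffer.submit("ai", "stay")
second = buffer.submit("human_1", "north")
third = buffer.submit("human_2", "interact")
```

The joint action is ordered by the player list rather than by arrival time, so the environment sees the same layout of actions regardless of who responded first.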

To approximate degraded or limited\-communication settings, verbal communication was prohibited during gameplay in both the 2\-agent and 3\-agent conditions\. This design makes the study a test of implicit coordination, since participants must infer machine behavior, and in the 3\-agent condition the other human’s intent, from observed actions alone\.

### E.4 Questionnaire Materials

The questionnaire battery consisted of three parts: a team-effectiveness questionnaire, a personality questionnaire, and a workload questionnaire based on NASA-TLX. The team-effectiveness questionnaire, shown in Figure [10](https://arxiv.org/html/2605.15400#A5.F10), measured perceived team fluency, trust, work balance, and satisfaction using 5-point Likert-scale items. The personality questionnaire, shown in Figure [11](https://arxiv.org/html/2605.15400#A5.F11), was included as supplementary information for future analysis of whether real participant personality traits align with or help explain the personality-conditioned synthetic AI evaluations. The NASA-TLX questionnaire, shown in Figure [12](https://arxiv.org/html/2605.15400#A5.F12), measured perceived workload during the task.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/survey_plot.png)

Figure 9: Post-game questionnaire ratings averaged over fluency, trust, satisfaction, and work balance. The 3-agent plot includes human teammates as a reference.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/teq_p1.png)

(a) Page 1.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/teq_p2.png)

(b) Page 2.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/teq_p3.png)

(c) Page 3.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/teq_p4.png)

(d) Page 4.

Figure 10: Team-effectiveness questionnaire used in the human study. Items measure perceived fluency, trust, work balance, and satisfaction on 5-point Likert scales.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/five_pers_p1.png)

(a) Page 1.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/five_pers_p2.png)

(b) Page 2.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/five_pers_p3.png)

(c) Page 3.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/five_pers_p4.png)

(d) Page 4.

Figure 11: Personality questionnaire used in the human study. These responses are collected as supplementary information for analyzing whether participant personality traits align with behavior patterns observed in the synthetic personality-conditioned AI evaluation.

![Refer to caption](https://arxiv.org/html/2605.15400v1/figs/nasa_tlx.png)

Figure 12: NASA-TLX workload questionnaire used to measure participants' perceived workload during the human study.

### E.5 Post-Game Questionnaire Results

In addition to task scores, participants completed post-game questionnaires assessing perceived teammate quality along four dimensions: fluency, trust, satisfaction, and work balance. The fluency items are adapted from prior work on evaluating fluency in human–robot collaboration \[[10](https://arxiv.org/html/2605.15400#bib.bib56)\], while the trust and satisfaction items extend the user-study questionnaire design used in TALENTS \[[19](https://arxiv.org/html/2605.15400#bib.bib3)\]. Figure [9](https://arxiv.org/html/2605.15400#A5.F9) reports the average ratings for each AI teammate across the 2-agent and 3-agent human-study settings. In the 3-agent setting, ratings for the human teammate are also included as a reference point.

The questionnaire results provide complementary evidence about subjective teammate perception\. While human partners remain a strong subjective reference in group play, the learned\-agent ratings suggest that IBTS is generally perceived favorably relative to the learned baselines across several dimensions\. These results support the main task\-score findings by showing that the performance improvements of IBTS do not come at the cost of substantially degraded perceived teammate quality\.

## Appendix F Societal Impact

This work studies how to train machine teammates that can coordinate with diverse human partners as mixed human-machine teams scale beyond dyadic interaction. A positive impact of this research is that it may help future assistive agents, robots, or collaborative decision-support systems support groups of people in shared tasks, especially when communication is limited or team composition changes.

At the same time, systems that steer team coordination may also shape human behavior in unintended ways\. A poorly aligned AI teammate could disrupt human\-human coordination, over\-optimize for task reward at the expense of user preferences, or encourage interaction patterns that are efficient but not desirable for all participants\. Our method also does not guarantee alignment with human values or individual preferences\. Before deployment in real\-world settings, such agents should be evaluated with human preference data, safety constraints, and domain\-specific oversight to ensure that coordination improvements do not come at the cost of user autonomy, fairness, or trust\.
