Searching for Synergy in Shared Workspace Human-AI Collaboration

arXiv cs.AI Papers

Summary

This paper studies human-AI team coordination in shared workspaces using the Collaborative Gym and DiscoveryBench tasks, finding that adding collaborators can lower performance without proper structure. Scaffolding with shared group memory and human-in-the-loop gates improves performance, especially in three-person teams.

arXiv:2606.18413v1 Announce Type: new Abstract: Automated AI agents are increasingly capable, yet many scientific and professional tasks require human judgment and contextual expertise. We study shared-workspace human-AI teams, where AI agents and human collaborators must coordinate responsibilities before submitting a final answer. Using the Collaborative Gym environment with DiscoveryBench tasks, we examine when adding simulated human collaborators improves performance and when process loss turns additional collaborators into coordination overhead. Across 1,482 sessions, adding relevant collaborators can lower performance when teams lack structure to coordinate their contributions. We then evaluate scaffolding that combines shared group memory with simulated human-in-the-loop (HITL) gates, where selected actions require approval from a designated simulated participant. This scaffolding yields higher mean performance, most clearly in three-person teams, with clearer responsibility signals and stronger routing of expertise to team actions. Overall, how human-AI teams coordinate and integrate expertise matters as much as the capability available to them.

# 1 Introduction
Source: [https://arxiv.org/html/2606.18413](https://arxiv.org/html/2606.18413)

Searching for Synergy in Shared Workspace Human\-AI Collaboration

Nachiket Kotalwar¹, Rohini Das¹, Carolyn Rosé¹

¹Carnegie Mellon University\. Correspondence to: Nachiket Kotalwar <nkotalwa@cs\.cmu\.edu\>\.

ICML’26 Workshop on Human\-AI Co\-Creativity, Seoul, South Korea\. Copyright 2026 by the author\(s\)\.

###### Abstract

Automated AI agents are increasingly capable, yet many scientific and professional tasks require human judgment and contextual expertise\. We study shared\-workspace human–AI teams, where AI agents and human collaborators must coordinate responsibilities before submitting a final answer\. Using the Collaborative Gym environment with DiscoveryBench tasks, we examine when adding simulated human collaborators improves performance and when process loss turns additional collaborators into coordination overhead\. Across 1,482 sessions, adding relevant collaborators can lower performance when teams lack structure to coordinate their contributions\. We then evaluate scaffolding that combines shared group memory with simulated human\-in\-the\-loop \(HITL\) gates, where selected actions require approval from a designated simulated participant\. This scaffolding yields higher mean performance, most clearly in three\-person teams, with clearer responsibility signals and stronger routing of expertise to team actions\. Overall, how human–AI teams coordinate and integrate expertise matters as much as the capability available to them\.

Most evaluations of AI agents ask whether an autonomous agent can complete a task\. Human–AI collaboration asks a different question: can a team turn complementary expertise into a better joint outcome? The distinction matters because collaboration introduces new ways for a team to fail\. For instance, in data\-analysis tasks, a collaborator with domain expertise may recognize the relevant variable or notice weak evidence, but for that expertise to help, the team must surface it at the right time, route it to the right decision, and carry it into the final product\. If this coordination breaks down, adding collaborators with relevant expertise can increase interaction without improving the outcome\.

\[Figure 1 diagram\. Archaeology data\-analysis task\. Datasets: time\_series\_data\.csv, capital\.csv, pollen\_…\.csv, … Query: When do house sizes and daggers significantly decrease together for the second time since the start of the data? Submit: a hypothesis with the right century, criterion, variables, and evidence\. Default team \(unstructured collaboration; lanes: AI agent, Data specialist, Researcher\): Load data, Criterion suggestion, Apply criterion, Edit hypothesis, Submit\. Scaffolded team \(structured collaboration; same lanes\): Load data, Map evidence, Compute, Check criterion, Edit hypothesis, Submit\.\]

Figure 1: How collaboration structure changes the path to submission\. The task requires a submitted hypothesis with the relevant variables, temporal criterion, and supporting evidence\. The Data specialist and Researcher lanes denote simulated\-human collaborator profiles\. In the Default trace, collaborator input does not follow up with the agent’s edit, so a criterion suggestion is misapplied and carried into an unsupported hypothesis edit\. In the Scaffolded trace, shared group memory and simulated HITL gates route the relevant checks before computation and editing\. The panels summarize events from the full traces\.

Classic group\-process research calls this process loss: teams fail to convert member resources into productivity when coordination is ineffective \(Steiner, [1972](https://arxiv.org/html/2606.18413#bib.bib5)\)\. Coordination theory frames collaboration as managing dependencies among activities \(Malone and Crowston, [1994](https://arxiv.org/html/2606.18413#bib.bib6)\), and coordination neglect describes how teams underweight the integration work that interdependent contributions require \(Heath and Staudenmayer, [2000](https://arxiv.org/html/2606.18413#bib.bib7)\)\. Human–AI teams show analogous patterns: adding AI or human expertise is not always beneficial on average \(Vaccaro et al\., [2024](https://arxiv.org/html/2606.18413#bib.bib1)\), people may over\-rely on or misread AI recommendations despite explanations \(Bansal and others, [2021](https://arxiv.org/html/2606.18413#bib.bib2)\), and professional AI assistance can have limited average benefit when humans and AI do not combine their signals effectively \(Agarwal et al\., [2023](https://arxiv.org/html/2606.18413#bib.bib3); Yu et al\., [2024](https://arxiv.org/html/2606.18413#bib.bib4)\)\.

We therefore use group\-process research as a design lens for shared\-workspace human–AI teams\. We evaluate two collaboration structures: shared group memory, which externalizes who knows what, what evidence has been established, and what checks remain; and simulated HITL gates, which require selected actions to be approved by a designated participant\. These structures test whether making expertise, responsibility, and evidence checks explicit can reduce process loss in collaborative agent trajectories\. Because such failures may appear in the interaction before they appear in the final answer, we evaluate both submitted hypotheses and process traces\.

We focus on shared\-workspace collaboration, where participants share tools, messages, and artifacts and decide during the task who does what \(Shao et al\., [2024](https://arxiv.org/html/2606.18413#bib.bib10)\)\. Within this framework, we study collaborative data analysis using archaeology tasks from DiscoveryBench \(Majumder and others, [2025](https://arxiv.org/html/2606.18413#bib.bib17)\), a multi\-step discovery benchmark where tasks require domain interpretation, variable mapping, and evidence construction \(Figure [1](https://arxiv.org/html/2606.18413#S1.F1)\)\.

The key empirical pattern is counterintuitive: adding profiled simulated collaborators can lower performance when teams lack the structure to coordinate contributions\. Trace diagnostics point to unassigned responsibilities and weak evidence handoff rather than missing capability alone\. We then evaluate a scaffolded setting that combines shared group memory with simulated HITL gates; with this scaffolding in place, teams show more distributed initiative, clearer responsibility signals, and higher mean performance, most clearly in the three\-person setting\.

Our contributions are:

1. We extend Collaborative Gym for controlled human–AI team studies on DiscoveryBench, enabling systematic variation in team composition and collaboration structure\.
2. We show that adding profiled simulated collaborators can lower performance, and use trace\-level diagnostics to examine how coordination failures prevent useful intermediate work from reaching the submitted hypothesis\.
3. We evaluate scaffolded collaboration with shared group memory and simulated HITL gates, showing higher mean performance, most clearly in the three\-person setting, alongside trace\-level evidence of responsibility assignment, routed checks, and stronger evidence grounding\.

## 2 Related Work

#### Collaborative\-agent benchmarks\.

Recent interactive\-agent environments study collaboration beyond isolated answer generation\. Some emphasize social simulation and believable role\-played interaction \(Park et al\., [2023](https://arxiv.org/html/2606.18413#bib.bib22); Zhou and others, [2024](https://arxiv.org/html/2606.18413#bib.bib23)\); others test proactive assistance through active user simulation \(Nathani and others, [2026](https://arxiv.org/html/2606.18413#bib.bib12)\), or workflow orchestration under coupling, asynchrony, temporal constraints, and time\-efficiency objectives \(Masters and others, [2025](https://arxiv.org/html/2606.18413#bib.bib16); Sun and others, [2025](https://arxiv.org/html/2606.18413#bib.bib11); Gonzalez\-Pumariega et al\., [2025](https://arxiv.org/html/2606.18413#bib.bib13); Lin et al\., [2024](https://arxiv.org/html/2606.18413#bib.bib14); Zhang and others, [2025](https://arxiv.org/html/2606.18413#bib.bib15)\)\. These largely study turn\-based or task\-orchestration settings rather than open shared\-workspace coordination\. Collaborative Gym instead supports shared\-workspace human–AI collaboration with communication, tool use, artifact editing, and flexible non\-turn\-taking interaction \(Shao et al\., [2024](https://arxiv.org/html/2606.18413#bib.bib10)\), which connects to workspace\-awareness research on how collaborators stay informed of one another’s actions in joint work \(Gutwin and Greenberg, [2002](https://arxiv.org/html/2606.18413#bib.bib9)\)\. We build on Collaborative Gym and DiscoveryBench \(Majumder and others, [2025](https://arxiv.org/html/2606.18413#bib.bib17)\) by varying team size and giving simulated collaborators distinct private guidance, then asking whether teams turn distributed evidence into a supported scientific hypothesis\.

#### Group memory and shared understanding\.

Our group\-memory scaffold targets process loss by making distributed knowledge easier to locate and use \(Steiner, [1972](https://arxiv.org/html/2606.18413#bib.bib5)\)\. This motivation also matches classic findings on biased information sampling: groups tend to discuss shared information more readily than unshared information, limiting the value of distributed expertise \(Stasser and Titus, [1985](https://arxiv.org/html/2606.18413#bib.bib8)\)\. It draws on transactive memory systems, which describe how teams coordinate distributed knowledge through shared awareness of who knows what and how that knowledge should be retrieved during work \(Wegner, [1987](https://arxiv.org/html/2606.18413#bib.bib24); Moreland, [1999](https://arxiv.org/html/2606.18413#bib.bib28); Lewis, [2003](https://arxiv.org/html/2606.18413#bib.bib25); Lewis and Herndon, [2011](https://arxiv.org/html/2606.18413#bib.bib27); Argote and Ren, [2012](https://arxiv.org/html/2606.18413#bib.bib26)\)\. Shared mental models similarly connect team effectiveness to common expectations about the task, teammates, and workflow, including in human–AI teams \(Mathieu et al\., [2000](https://arxiv.org/html/2606.18413#bib.bib29); Andrews et al\., [2023](https://arxiv.org/html/2606.18413#bib.bib30)\)\.

#### Simulated HITL gates and collaborative control\.

Simulated HITL gates draw on collaboration scripts, which use external structure to guide participation in joint work through responsibilities, activity sequences, and interaction moves \(Kollar et al\., [2006](https://arxiv.org/html/2606.18413#bib.bib32); Fischer et al\., [2013](https://arxiv.org/html/2606.18413#bib.bib31)\)\. They also relate to mixed\-initiative and automation\-control work on when automated systems should act, defer, or solicit human input \(Horvitz, [1999](https://arxiv.org/html/2606.18413#bib.bib33); Parasuraman et al\., [2000](https://arxiv.org/html/2606.18413#bib.bib34); Amershi and others, [2019](https://arxiv.org/html/2606.18413#bib.bib35)\)\. Recent agent evaluations use HITL\-style abstractions without live human participants in the evaluated loop, including targeted gates in autonomous research systems and an `ask_human()` tool for selective escalation \(Liu and others, [2026](https://arxiv.org/html/2606.18413#bib.bib38); Elfeki and others, [2026](https://arxiv.org/html/2606.18413#bib.bib39)\)\. In the AutoResearchClaw HITL ablation, targeted intervention outperformed dense step\-by\-step oversight \(Liu and others, [2026](https://arxiv.org/html/2606.18413#bib.bib38)\), consistent with the idea that the routing of human input matters, not only its frequency\. We adapt this gating pattern so that the team itself decides which actions need sign\-off and from whom\.

## 3 Problem Setup

We follow the tabular\-analysis setup from Collaborative Gym \(Shao et al\., [2024](https://arxiv.org/html/2606.18413#bib.bib10)\)\. Each task instance is a tuple

$$x = (\mathcal{D}, q, y^{\star}),$$

where $\mathcal{D}$ is a set of CSV files, $q$ is a natural\-language query, and $y^{\star}$ is the benchmark reference hypothesis\. A team inspects the data, communicates in a shared workspace, can run analysis code, and submits a hypothesis through the result editor\. We denote this submitted hypothesis by $\hat{y}$; it should identify the relevant task context, map the query to the correct variables, state the target relationship, and justify the claim with evidence from the data\.

A session has participants $\mathcal{U}$ and produces an ordered interaction trace

$$\tau = \bigl((u_t, a_t, o_t)\bigr)_{t=1}^{T},$$

where $u_t \in \mathcal{U}$ is the participant acting at event $t$, $a_t$ is the action, and $o_t$ is the resulting observation\. Each participant $u$ has fixed private guidance $\pi_u$ that defines its collaborator profile, such as data\-analysis or researcher guidance\. When the acting participant is $u = u_t$, its action is

$$a_t = f_u\!\left(x, o_{t-1}, M_{<t}, A^{u}_{<t}, \pi_u\right),$$

where $M_{<t}$ is the chat history and $A^{u}_{<t}$ is the action history visible to participant $u$\. In our implementation, the AI agent sees the team\-level action history, while simulated human collaborators see only their own action history rather than the full team history \(Nathani and others, [2026](https://arxiv.org/html/2606.18413#bib.bib12)\)\. We therefore evaluate both the hypothesis $\hat{y}$ and the process trace $\tau$, since useful intermediate work in $\tau$ may never reach $\hat{y}$ \(Section [4\.6](https://arxiv.org/html/2606.18413#S4.SS6)\)\.
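The setup above can be pictured with a few plain data structures. This is a minimal sketch, not the Collaborative Gym implementation: the class names, field names, participant ids, and the choice to treat messages as separate from the action history are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """x = (D, q, y*). Field names are hypothetical."""
    datasets: tuple[str, ...]  # D: names of the CSV files
    query: str                 # q: natural-language query
    reference: str             # y*: benchmark reference hypothesis


@dataclass(frozen=True)
class Event:
    """One trace element (u_t, a_t, o_t)."""
    actor: str                 # u_t
    action: str                # a_t
    observation: str           # o_t
    is_message: bool = False   # assumption: messages feed M, other actions feed A


def chat_history(trace):
    """M_{<t}: message events, shared by everyone."""
    return [e for e in trace if e.is_message]


def visible_action_history(trace, participant, ai_agent="agent"):
    """A^u_{<t}: the AI agent sees team-level actions, simulated humans only their own."""
    actions = [e for e in trace if not e.is_message]
    if participant == ai_agent:
        return actions
    return [e for e in actions if e.actor == participant]


# Hypothetical three-event trace.
trace = [
    Event("agent", "load_data", "ok"),
    Event("data_specialist", "send_message", "delivered", is_message=True),
    Event("data_specialist", "run_code", "ok"),
]
```

The asymmetry in `visible_action_history` is the point of the sketch: the same trace yields a different view for each participant, which is one way useful intermediate work can fail to reach the submitted hypothesis.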

## 4 Experimental Setup

### 4\.1 Overview

We vary team composition and collaboration structure to test whether teams turn available expertise into well\-supported hypotheses\. Sections [4\.2](https://arxiv.org/html/2606.18413#S4.SS2)–[4\.5](https://arxiv.org/html/2606.18413#S4.SS5) define the tasks, participants, structures, and team\-composition variants\. We evaluate submissions with the task Performance score and trace\-level activity/process metrics \(Section [4\.6](https://arxiv.org/html/2606.18413#S4.SS6)\); Section [5](https://arxiv.org/html/2606.18413#S5) reports aggregate results\.

### 4\.2 Task Suite

We use the complete 38\-task archaeology subset of DiscoveryBench \(Majumder and others, [2025](https://arxiv.org/html/2606.18413#bib.bib17)\)\. This subset is well suited to collaborative data analysis because its tasks require domain\-specific interpretation, careful variable mapping, and multi\-step evidence construction\. Collaborators can help by clarifying archaeological terms, selecting variables, reviewing temporal criteria, or verifying support for a proposed hypothesis\. We use one complete domain subset rather than pooling all DiscoveryBench domains so that task semantics, data conventions, and evaluator expectations remain comparable across team compositions and collaboration structures\.

### 4\.3 Participants

Each session includes one AI agent and zero to two simulated human collaborators, depending on team composition\. All participants use DeepSeek V3\.2 with Collaborative Gym’s ReAct\-style action loop and private scratchpad memory \(Yao and others, [2023](https://arxiv.org/html/2606.18413#bib.bib19); Shao et al\., [2024](https://arxiv.org/html/2606.18413#bib.bib10)\); Shao et al\. \([2024](https://arxiv.org/html/2606.18413#bib.bib10)\) report that these simulated collaborators reproduce key behavioral patterns of real participants\. All participants share one model and interface, and only the profile guidance $\pi_u$ differs\. Differences across variants therefore reflect team composition and collaboration structure, not model or interface\.

### 4\.4 Collaboration Structures

Our main comparison is between Default and Scaffolded collaboration\. Default is the original Collaborative Gym shared\-workspace setup without additional coordination mechanisms\. Scaffolded collaboration combines shared group memory with simulated HITL gates: the team first builds a shared record of expertise, responsibilities, work plan, and evidence criteria, then uses that record to decide which actions require approval and who owns those approvals\. We also evaluate two diagnostic variants: shared group memory only, and preassigned simulated HITL gates whose owners are configured by action type rather than chosen by the team\.

These structures scaffold coordination without changing the task suite, underlying LLM, action interface, profile guidance, or domain knowledge\. Simulated HITL gates add explicit approval responsibility for selected actions, while shared group memory adds a shared coordination artifact\. The scaffolds deliberately change how much context participants share and how often they can communicate\.

#### Simulated HITL gates\.

Simulated HITL gates designate selected actions as requiring approval from a specific participant before they take effect \(Figure [2](https://arxiv.org/html/2606.18413#S4.F2)a\)\. The approving participant is simulated in our experimental setup\. Not all actions require approval: in the Scaffolded setting, the team decides which actions need gates and who should approve them, based on the expertise and responsibilities they have mapped; actions without a designated gate owner proceed normally\. This mirrors how collaborative work already operates, where code review, clinical sign\-off, and AI coding assistants route consequential operations through a designated approver while letting routine work flow\. As a diagnostic variant, we also evaluate preassigned simulated HITL gates where owners are configured by action type rather than team\-chosen\.

#### Shared group memory\.

Shared group memory adds a pre\-task build phase based on transactive memory systems in group\-process research \(Wegner, [1987](https://arxiv.org/html/2606.18413#bib.bib24); Moreland, [1999](https://arxiv.org/html/2606.18413#bib.bib28); Lewis and Herndon, [2011](https://arxiv.org/html/2606.18413#bib.bib27)\)\. The team records who knows what, who should be trusted for what, how work should be coordinated, and what evidence criteria the final answer must satisfy \(Figure [2](https://arxiv.org/html/2606.18413#S4.F2)b\)\. Participants inspect the task context, discuss and revise entries, and agree on this map\. Unlike each participant’s private memory \(Section [4\.3](https://arxiv.org/html/2606.18413#S4.SS3)\), it is shared team state: a single expertise\-and\-responsibility map available to every participant\. Once the build phase ends, the memory becomes a fixed reference that participants consult but do not update\.

#### Scaffolded collaboration\.

The two scaffolds work together: the group\-memory build phase is where the team decides which actions need approval and who owns each one \(leaving some actions ungated\), and the gates then enforce those decisions during the task\. When a designated action is proposed, the chosen owner must approve or reject it before it takes effect, and the agreed expertise map and criteria provide context for that decision\.
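The interplay between the two scaffolds can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's code: `GroupMemory`, `build_group_memory`, `run_action`, the action-type strings, and the `approve` callback are hypothetical names. The memory is made read-only after the build phase, and only actions with a designated owner pass through a gate.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class GroupMemory:
    """Shared team state fixed at the end of the build phase (hypothetical fields)."""
    expertise: Mapping[str, str]        # participant -> what they know
    evidence_criteria: tuple[str, ...]  # criteria the final answer must satisfy
    gate_owners: Mapping[str, str]      # action type -> approving participant


def build_group_memory(expertise, evidence_criteria, gate_owners):
    """Freeze the agreed map so it can be consulted but not updated."""
    return GroupMemory(
        MappingProxyType(dict(expertise)),
        tuple(evidence_criteria),
        MappingProxyType(dict(gate_owners)),
    )


def run_action(action_type, execute, memory, approve):
    """Execute ungated actions directly; gated ones only if the owner approves."""
    owner = memory.gate_owners.get(action_type)
    if owner is None:
        return execute()          # no designated owner: proceeds normally
    if approve(owner, action_type):
        return execute()          # approved: action takes effect
    return None                   # rejected: action is not run


memory = build_group_memory(
    expertise={"data_specialist": "variable mapping", "researcher": "query semantics"},
    evidence_criteria=["second simultaneous decrease is identified"],
    gate_owners={"edit_hypothesis": "researcher"},
)
```

In the preassigned-gates diagnostic variant, the `gate_owners` mapping would be supplied by configuration rather than agreed during the build phase; the enforcement step is the same in this sketch.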

\[Figure 2 diagram\. \(a\) Simulated HITL gates: selected actions require approval from a designated participant\. A proposed action \(edit or submit\) goes to the gate owner; on approve, the action is executed \(action accepted\); on reject, it is cancelled \(action not run\)\. \(b\) Shared group memory: before task work, the team maps expertise and assigns responsibilities\. A build step \(inspect task, share expertise\) produces the group memory, which is available to all during task work\. Panel label: “Scaffolded only”\.\]

Figure 2: Collaboration structures used in the study\. \(a\) Simulated HITL gates require approval from a designated participant before selected actions take effect\. \(b\) Shared group memory is built through pre\-task discussion; the team maps expertise, responsibilities, evidence criteria, and a work plan\. In the Scaffolded setting, this shared memory helps determine which actions are gated and who owns each gate in \(a\); in the diagnostic preassigned\-gates variant, owners are configured by action type\.

### 4\.5 Team Compositions

Table [1](https://arxiv.org/html/2606.18413#S4.T1) lists the evaluated team variants\. Suffixes denote simulated\-human profile guidance, not guaranteed competence: D is a data\-analysis collaborator profile focused on variable mapping, filters, computation, and numeric evidence; R is a researcher collaborator profile focused on query semantics, relation wording, ambiguity, caveats, and evidential support; and DR denotes a team with both simulated\-human collaborators, one with each profile\. Labels join structure and suffix, such as Default\-D or Scaffolded\-DR\. Appendix [B](https://arxiv.org/html/2606.18413#A2) gives the fuller profile descriptions\.

Table 1: Team\-composition and collaboration\-structure design\. Single\-agent has no collaborator profile; each collaborative structure is evaluated with D, R, and DR profile variants\. Diagnostic variants isolate parts of the Scaffolded setting but are not symmetric ablations because preassigned gates use externally configured owners\. *Variant legend:* D = data\-analysis profile; R = researcher profile; DR = two collaborators, one with each profile\.

### 4\.6 Evaluation Metrics

The metrics separate properties of the submitted hypothesis $\hat{y}$ from properties of the interaction trace $\tau$\. The outcome metric is Collaborative Gym’s normalized Task Performance score, reported as *Performance*, where higher is better \(Shao et al\., [2024](https://arxiv.org/html/2606.18413#bib.bib10)\); evaluator model details are in Appendix [A](https://arxiv.org/html/2606.18413#A1)\. Activity metrics count trace actions: Human Work \($W_{\mathrm{human}}$\) is the number of non\-message actions by simulated humans; Total Work \($W_{\mathrm{total}}$\) is the number of non\-message actions by all participants; and Team Messages \($M_{\mathrm{team}}$\) is the number of message actions by all participants\.
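The three activity metrics are simple counts over the trace. A minimal sketch, assuming each trace event is reduced to a hypothetical `(actor, is_message)` pair and that the set of simulated-human ids is known:

```python
def activity_metrics(trace, human_ids):
    """Count Human Work, Total Work, and Team Messages for one session.

    trace: list of (actor, is_message) pairs (an assumed, simplified encoding).
    human_ids: set of participant ids that are simulated humans.
    """
    w_human = sum(1 for actor, is_msg in trace if not is_msg and actor in human_ids)
    w_total = sum(1 for _, is_msg in trace if not is_msg)
    m_team = sum(1 for _, is_msg in trace if is_msg)
    return {"W_human": w_human, "W_total": w_total, "M_team": m_team}


# Hypothetical session: three non-message actions (one by a simulated human), two messages.
example = [
    ("agent", False),
    ("data_specialist", True),
    ("data_specialist", False),
    ("agent", True),
    ("agent", False),
]
metrics = activity_metrics(example, human_ids={"data_specialist", "researcher"})
```

Because messages and non-message actions are counted separately, a team can raise Team Messages without raising Total Work, which is the pattern the results later associate with more interaction but no better outcome.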

For initiative structure, we use Collaborative Gym’s Initiative Entropy \(Shao et al\., [2024](https://arxiv.org/html/2606.18413#bib.bib10)\)\. We report the normalized form, $H_{\mathrm{init,norm}}$, so scores are comparable across team sizes:

$$H_{\mathrm{init,norm}} = \frac{-\sum_{u \in \mathcal{U}} p_u \log p_u}{\log |\mathcal{U}|},$$

where $\mathcal{U}$ is the team and $p_u$ is the fraction of initiative events attributed to participant $u$\. Higher values indicate more evenly distributed initiative\.
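The normalized entropy can be computed directly from a list of who initiated each event. This sketch follows the formula above; how initiative events are identified in a trace is not shown here, and returning 0.0 for a one-person team or an empty list is an assumption made to avoid dividing by log 1 = 0.

```python
import math
from collections import Counter


def normalized_initiative_entropy(initiators, team):
    """H_init,norm = -sum_u p_u log p_u / log |U|.

    initiators: participant id for each initiative event (assumed to be team members).
    team: list of all participant ids in the session.
    """
    if len(team) < 2 or not initiators:
        return 0.0  # assumption: undefined cases are reported as 0
    counts = Counter(initiators)
    total = len(initiators)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(team))


team = ["agent", "data_specialist", "researcher"]
even = normalized_initiative_entropy(["agent", "data_specialist", "researcher"], team)
one_sided = normalized_initiative_entropy(["agent"] * 4, team)
```

Participants with no initiative events contribute nothing to the sum, so a session dominated by one participant scores near 0 while an even split across the whole team scores 1, regardless of team size.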

#### Reference workflow graphs\.

The graph\-dependent process metrics use a reference workflow graph for each task,

$$G_i = (V_i, E_i, \mathcal{P}_i, \mathcal{C}_i),$$

where $V_i$ contains reference reasoning and evidence nodes, $E_i$ their dependencies, $\mathcal{P}_i$ acceptable solution paths, and $\mathcal{C}_i$ completion criteria for a supported submission\. We construct each graph from the task data, query, and benchmark reference hypothesis, adding alternative evidence routes where applicable\. Before human validation, we check each graph for schema validity, executable evidence references, consistency with the reference hypothesis, and well\-formed dependencies\. The reference workflow graphs and validation annotations are released with our code; the repository link is in Appendix [B](https://arxiv.org/html/2606.18413#A2)\.
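One of the pre-validation checks, well-formed dependencies, can be illustrated on a small graph. The paper does not spell out what the check covers, so this sketch assumes it means that every edge and path refers to a known node and that the dependencies contain no cycle. The `WorkflowGraph` class and the node names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkflowGraph:
    """G_i = (V_i, E_i, P_i, C_i) with hypothetical field names."""
    nodes: set[str]                # V_i
    edges: set[tuple[str, str]]    # E_i as (prerequisite, dependent)
    paths: list[list[str]]         # P_i: acceptable solution paths
    criteria: list[str]            # C_i: completion criteria


def dependencies_well_formed(g):
    """True if edges and paths use known nodes and the dependencies are acyclic."""
    if any(a not in g.nodes or b not in g.nodes for a, b in g.edges):
        return False
    if any(n not in g.nodes for path in g.paths for n in path):
        return False
    # Kahn's algorithm: repeatedly remove nodes with no unmet prerequisites.
    indegree = {n: 0 for n in g.nodes}
    for _, dependent in g.edges:
        indegree[dependent] += 1
    ready = [n for n, d in indegree.items() if d == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for prerequisite, dependent in g.edges:
            if prerequisite == node:
                indegree[dependent] -= 1
                if indegree[dependent] == 0:
                    ready.append(dependent)
    return removed == len(g.nodes)  # leftover nodes imply a cycle


graph = WorkflowGraph(
    nodes={"load_data", "map_variables", "compute_decreases"},
    edges={("load_data", "map_variables"), ("map_variables", "compute_decreases")},
    paths=[["load_data", "map_variables", "compute_decreases"]],
    criteria=["century is supported by computed decreases"],
)
```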

An LLM judge \(model details in Appendix [A](https://arxiv.org/html/2606.18413#A1)\) labels trace\-level events given the session trace, task information, and validated graph\. Workflow Coverage \($C_{\mathrm{wf}}$\) measures how much of an acceptable reference path appears in the trace, with nodes labeled satisfied, partially satisfied, or missing and scored as 1, 0\.5, or 0 under $\mathcal{P}_i$\. Hypothesis Support \($S_{\mathrm{hyp}}$\) measures whether $\hat{y}$’s relation, scope, and evidential grounding are supported by trace evidence under $\mathcal{C}_i$\. Profile Alignment \($A_{\mathrm{profile}}$\), computed only when profiled collaborators are present, measures whether their contributions match the profile and the workflow’s needs, not how much they participated\.
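The node scoring for Workflow Coverage can be sketched once the judge's labels are available. The 1 / 0.5 / 0 scores come from the text; the aggregation is an assumption, since the text does not state it: here each acceptable path gets the mean of its node scores, and the best path is reported. The label strings and node names are hypothetical.

```python
NODE_SCORES = {"satisfied": 1.0, "partial": 0.5, "missing": 0.0}


def workflow_coverage(paths, labels):
    """Best mean node score over the acceptable paths.

    paths: list of node-id lists (the acceptable solution paths).
    labels: judge label per node id; unlabeled nodes count as missing.
    Assumption: mean within a path, max across paths.
    """
    best = 0.0
    for path in paths:
        if not path:
            continue
        score = sum(NODE_SCORES[labels.get(node, "missing")] for node in path) / len(path)
        best = max(best, score)
    return best


paths = [["load", "map", "rolling_window"], ["load", "direct_compare"]]
labels = {"load": "satisfied", "map": "partial", "direct_compare": "satisfied"}
coverage = workflow_coverage(paths, labels)
```

Taking the maximum over paths means a team is not penalized for following one acceptable evidence route rather than another; a different aggregation would change the numbers but not the labeling scheme.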

## 5 Results

### 5\.1 Overview

We ask three questions\. First, does adding collaborators with relevant expertise improve team performance, or does process loss limit the benefit? Second, do collaboration structures drawn from group\-process research recover performance by shifting coordination patterns? Third, what do these structures change at the process level \(initiative structure, evidence handoff, or both\)?

For each team variant, we run every task with three independent seeds and report means with standard errors over the resulting runs\. Table [2](https://arxiv.org/html/2606.18413#S5.T2) reports the headline aggregate results: the single\-agent baseline, Default shared\-workspace teams, and Scaffolded teams\. Table [3](https://arxiv.org/html/2606.18413#S5.T3) reports diagnostic variants that isolate shared group memory and preassigned simulated HITL gates\.
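The run design and the reported statistics can be checked with a short calculation. The variant count below is inferred rather than stated: one single-agent baseline plus four collaborative structures (Default, Scaffolded, shared group memory only, preassigned gates), each with D, R, and DR profiles. The standard-error definition (sample standard deviation divided by the square root of the number of runs) is likewise an assumption, and the scores passed to `mean_and_se` are made-up values.

```python
import math
import statistics


def mean_and_se(scores):
    """Mean and standard error of the mean over per-run scores."""
    mean = statistics.fmean(scores)
    if len(scores) < 2:
        return mean, float("nan")  # SE is undefined for a single run
    return mean, statistics.stdev(scores) / math.sqrt(len(scores))


TASKS = 38             # archaeology subset of DiscoveryBench
SEEDS = 3              # independent seeds per task
VARIANTS = 1 + 4 * 3   # inferred: single-agent + 4 structures x {D, R, DR}

sessions = TASKS * SEEDS * VARIANTS
print(sessions)  # 1482, matching the session count reported in the abstract

# Illustrative only: three made-up run scores.
mean, se = mean_and_se([0.5, 0.7, 0.9])
```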

Table 2: Headline aggregate metrics by team composition and collaboration structure\. Cells report mean \(SE\) across runs\. Activity counts are raw; process and Performance metrics are on $[0,1]$\. Dashes mark non\-applicable metrics; diagnostic variants appear in Table [3](https://arxiv.org/html/2606.18413#S5.T3)\.

Datasets: time\_series\_data\.csv, capital\.csv, pollen\_…\.csv, …

| CE | HouseSize | Dagger | … |
| --- | --- | --- | --- |
| \-4100 | \-1\.35 | \-0\.58 | … |
| \-4000 | \-1\.35 | \-0\.58 | … |
| \-3900 | \-1\.35 | \-0\.58 | … |
| … | … | … | … |

Goal: In which century did house sizes and daggers significantly decrease simultaneously for the second time since the start of the observational data?

\(a\) Task specification

![Refer to caption](https://arxiv.org/html/2606.18413v1/figures/qualitative_instances/metadata_27/04_swimlane_polished_opus.png)

\(b\) Selected\-event trace panel

Figure 3: Selected trace pair illustrating routed checks in a three\-person task\. In the Default trace, input from both data\-analysis and researcher collaborators is present, but a rolling\-window criterion still reaches the submitted hypothesis\. In the Scaffolded trace, the team routes variable mapping, evidence validation, and answer–query alignment through designated gate owners before editing\.

### 5\.2 Default Teams Show Process\-Loss Patterns

Default teams do not improve on the single\-agent baseline, and the three\-person Default\-DR team performs worst\. The single\-agent baseline reaches 0\.71 Performance; Default\-D and Default\-R are slightly lower \(0\.69 and 0\.68\), and Default\-DR falls to 0\.63\. The largest drop occurs when both profiled collaborators are present, a pattern consistent with process loss as more distinct expertise must be coordinated\.

The drop is not explained by lower activity\. Default\-DR has more human work and team messages than the single\-profile Default teams, yet lower mean Performance; the teams interact more but produce a less\-supported hypothesis\. Default teams also show lower Hypothesis Support: $S_{\mathrm{hyp}}$ falls from 0\.28 for Single\-agent to 0\.18–0\.19 for profiled Default teams\.

### 5\.3 Scaffolded Teams Raise Mean Performance Across Team Compositions

Mean Performance is higher for Scaffolded than for matched Default in every team composition, with the gain concentrated in DR \(\+0\.13\); the D and R gains \(\+0\.03 and \+0\.05\) are small relative to the standard errors\. The largest increase is in Scaffolded\-DR, the same composition that had the largest Default drop\. Because profiles and models are held fixed across these variants, the main design difference between matched Default and Scaffolded teams is the collaboration structure\.

The main process shift is in how initiative is distributed\. From Default\-DR to Scaffolded\-DR, $W_{\mathrm{total}}$ changes only from 7\.6 to 7\.9, while $W_{\mathrm{human}}$ rises from 1\.6 to 2\.2\. Across profiles, $H_{\mathrm{init,norm}}$ increases by \+0\.31 to \+0\.43\. In DR, Scaffolded also has higher means on Workflow Coverage, Hypothesis Support, and Profile Alignment\. Total work is roughly unchanged; the teams distribute it differently\.

### 5\.4 Diagnostic Variants Suggest Complementarity

Table [3](https://arxiv.org/html/2606.18413#S5.T3) places the diagnostic variants within each collaborator profile, with Default and Scaffolded repeated as shaded anchors\. The shared\-group\-memory\-only variant removes the gates; the preassigned\-gates variant fixes owners by action type\. Because neither variant includes the team\-led ownership step used in Scaffolded, these comparisons should be read as diagnostic decompositions rather than clean ablations\.

The shared\-group\-memory\-only variant raises $H_{\mathrm{init,norm}}$ by \+0\.27 to \+0\.40 and Team Messages by roughly \+5 to \+9 across profiles\. Its initiative\-entropy values are close to those of Scaffolded, suggesting that the shared\-memory component accounts for most of the initiative\-distribution shift\. However, initiative and communication alone are not sufficient: in R teams, the shared\-group\-memory\-only variant increases both while mean Performance falls from 0\.68 to 0\.64\.

Preassigned simulated HITL gates align more directly with Hypothesis Support, raising mean $S_{\mathrm{hyp}}$ in all three profiles, with the largest increases in D and R\. In the three\-person DR team, neither component alone matches the full Scaffolded Performance mean \(Table [3](https://arxiv.org/html/2606.18413#S5.T3)\)\. This is consistent with complementarity: shared group memory gives the team a basis for choosing responsibilities, and simulated HITL gates turn those responsibilities into binding approval requirements\. This pattern is clearest in the three\-person DR team: Default performs worst there, and the full Scaffolded setting produces the largest improvement\.

Figure [3](https://arxiv.org/html/2606.18413#S5.F3) grounds this complementarity in one three\-person task; Section [5\.5](https://arxiv.org/html/2606.18413#S5.SS5) discusses this example alongside two additional trace pairs that show Default process loss and differences in evidential grounding\.

Table 3: Diagnostic variants by collaborator profile\. Default and Scaffolded rows are repeated from Table [2](https://arxiv.org/html/2606.18413#S5.T2) as shaded anchors; the unshaded rows show shared group memory only and preassigned simulated HITL gates\. Cells report mean \(SE\) across runs; activity counts are raw, while process and Performance metrics are on $[0,1]$\. The preassigned\-gates variant fixes gate owners by action type rather than letting the team choose them, so these are diagnostic comparisons rather than symmetric ablations\.

### 5\.5 Qualitative Trace Analysis

To connect the aggregate results to trace\-level mechanisms, we inspect three trace pairs\. Each figure summarizes selected events from the raw traces rather than the full log\.

Default process loss despite profiled\-collaborator participation\. Figure [4](https://arxiv.org/html/2606.18413#A3.F4) compares Default\-D and Scaffolded\-D traces for a copper\-peak task\. In the Default trace, the data\-analysis collaborator asks for broader evidence and answer–query alignment, but those requests never become a binding check on the analysis criterion: the agent follows a smoothed peak\-detection path and records an unsupported 9th\-century BCE hypothesis, while the reference points to the earlier 35th\-century BCE peak\. In the Scaffolded trace for the same task, the collaborator’s criterion check precedes comparison and a revised computation before the hypothesis is edited\.

Three\-person routed checks\. Figure [3](https://arxiv.org/html/2606.18413#S5.F3) contrasts Default\-DR and Scaffolded\-DR on a task requiring the second simultaneous decrease of two archaeological signals\. In the Default trace, both collaborator lanes are active, but a rolling\-window criterion still flows into the editor and the final hypothesis\. In the Scaffolded trace, checks precede finalization: the data\-analysis collaborator maps and later validates the evidence, the researcher checks setup and answer–query alignment, and the agent computes candidate events before editing\. The contrast illustrates the intended mechanism: existing expertise is made operational before finalization, rather than changing the team composition\.
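
To make "computing candidate events" concrete, here is a minimal sketch of one possible criterion for a simultaneous decrease. It is an assumption for illustration only: the paper does not specify the computation in this passage, the function name is hypothetical, and the two series are made-up toy values rather than the archaeological data. An event is taken to be the first step of each run in which both signals fall relative to the previous step.

```python
def joint_decrease_events(times, a, b):
    """Return the start time of each run where both signals decrease."""
    events = []
    in_run = False
    for i in range(1, len(times)):
        both_down = a[i] < a[i - 1] and b[i] < b[i - 1]
        if both_down and not in_run:
            events.append(times[i])  # first step of a new joint decrease
        in_run = both_down
    return events


# Toy series, not taken from the task data.
times = list(range(8))
signal_a = [1.0, 2.0, 1.5, 1.0, 2.0, 3.0, 2.5, 3.0]
signal_b = [5.0, 6.0, 5.5, 5.8, 6.0, 7.0, 6.5, 6.0]

events = joint_decrease_events(times, signal_a, signal_b)
second = events[1] if len(events) > 1 else None
print(events, second)
```

Listing all candidate events before picking the second one gives a reviewer something inspectable to check, whereas a smoothed or rolling-window criterion can shift or merge events without that being visible in the final answer.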

Reviewed computation and evidential grounding\. Figure [5](https://arxiv.org/html/2606.18413#A3.F5) compares Default and Scaffolded traces for a pottery\-decoration task\. The Default team finds the relevant signal but anchors the hypothesis to the first elevated period rather than the start of the highest sustained plateau\. In the Scaffolded trace, the team computes and validates the later maximum plateau before submission, producing a more directly supported hypothesis, though it still does not explicitly contrast the lower and highest plateaus\.
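
The difference between the two criteria can be shown on a toy series. The sketch below is illustrative only: the threshold, the window length, the function names, and the values are assumptions chosen here, not the task data or the teams' actual computation. The first function returns the first period above a threshold; the second returns the start of the fixed-length window with the highest mean.

```python
def first_elevated(times, values, threshold):
    """First period whose value exceeds the threshold."""
    for t, v in zip(times, values):
        if v > threshold:
            return t
    return None


def highest_sustained_start(times, values, min_len=3):
    """Start of the min_len-long window with the highest mean value."""
    best_start, best_mean = None, float("-inf")
    for i in range(len(values) - min_len + 1):
        window_mean = sum(values[i:i + min_len]) / min_len
        if window_mean > best_mean:  # strict: ties keep the earliest window
            best_start, best_mean = times[i], window_mean
    return best_start


# Toy series with a lower plateau (periods 2-4) and a higher one (6-8).
times = list(range(9))
values = [-1.9, -0.1, 0.6, 0.7, 0.6, 0.1, 1.4, 1.5, 1.4]

print(first_elevated(times, values, threshold=0.5))
print(highest_sustained_start(times, values, min_len=3))
```

On this toy input the two criteria disagree: the threshold rule stops at the lower plateau, while the windowed maximum points to the later, higher one, which mirrors the contrast between the Default and Scaffolded traces.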

## 6 Concluding Discussion

Our results point toward a reorientation: human–AI collaboration research should identify and correct process limitations explicitly, drawing on what is already known about how teams fail\. Decades of group\-process research have produced concrete findings about how teams fail to convert distributed expertise into collective products \(Steiner, [1972](https://arxiv.org/html/2606.18413#bib.bib5); Stasser and Titus, [1985](https://arxiv.org/html/2606.18413#bib.bib8); Heath and Staudenmayer, [2000](https://arxiv.org/html/2606.18413#bib.bib7)\)\. In the simulated hybrid teams we study in Collaborative Gym, we observe analogous failures – unassigned responsibilities, unrouted expertise, and weak evidence handoff – alongside the process\-loss pattern in our metrics\.

This matters for how the field builds and evaluates collaborative agents\. Training and inference pipelines need to attend to the collaborative processes within trajectories \(responsibility assignment, evidence handoff, and review routing\), because these mechanisms determine whether complementary expertise actually reaches the team’s final product\. The group\-process literature offers a starting point: responsibility assignment, transactive memory, and structured review are well\-studied interventions for human teams, and our results show that adaptations of these ideas shift coordination patterns in simulated human–AI teams as well\.

The same structural lens may apply to all\-human, hybrid, and all\-agent teams, with different failure modes in each; treating these as variants of one coordination problem lets the field reuse existing insights rather than rediscovering coordination failures from scratch\.

#### Limitations and Future Work\.

The scaffolded setting we evaluate is a first hand\-designed probe, not a final account of how human–AI teams should coordinate\. Simulated collaborators remain approximations of real people: they build on prior human\-agent and active\-user\-simulation frameworks \(Shao et al\., [2024](https://arxiv.org/html/2606.18413#bib.bib10); Nathani et al\., [2026](https://arxiv.org/html/2606.18413#bib.bib12)\), but may show simulator\-specific artifacts or underrepresent variation in proactiveness and strategy\. Recent work on scientific deep\-research agents has begun collecting expert feedback on intermediate actions, showing that user\-preferred actions are highly user\-specific and better predicted from interaction histories \(Balepur et al\., [2026](https://arxiv.org/html/2606.18413#bib.bib40)\)\. Our setting needs analogous data for longer shared\-workspace, multi\-participant collaboration: real human traces can validate these simulated collaborators and support learning populations of simulators with varied expertise, proactiveness, and work strategies \(Mehri et al\., [2026](https://arxiv.org/html/2606.18413#bib.bib36); Chopra et al\., [2026](https://arxiv.org/html/2606.18413#bib.bib37)\)\. Long multi\-participant trajectories can be sensitive to early choices \(Laban et al\., [2025](https://arxiv.org/html/2606.18413#bib.bib18)\); additional seeds, larger task suites, and variance\-reduction sampling would support future causal claims\. The archaeology subset provides domain\-specific interpretation challenges, but future work should test generalization to other domains\. Finally, we examine explicit collaboration structures rather than learned collaborative behavior\.
With validated simulators and richer traces, future work can move from evaluating fixed scaffolds to optimizing collaboration policies using reinforcement learning, for example by training agents to discover responsibilities, request expertise, route checks, and adapt scaffolding online \(Wu et al\., [2025](https://arxiv.org/html/2606.18413#bib.bib20); Zhou et al\., [2025](https://arxiv.org/html/2606.18413#bib.bib21)\)\.

## Impact Statement

This work studies collaboration structure in simulated human–AI data\-analysis teams\. The scaffolds we evaluate \(shared group memory and simulated HITL gates\) preserve human\-role approval points in the simulated loop rather than treating collaboration as full automation\. These findings should be validated with real users before high\-stakes deployment\. More broadly, collaborative AI systems should preserve human agency and augment human judgment rather than replace it\.

## References

- Combining human expertise with artificial intelligence: experimental evidence from radiology\. Technical Report 31422, National Bureau of Economic Research\.
- S\. Amershi et al\. \(2019\) Guidelines for human\-AI interaction\. In Proceedings of the Conference on Human Factors in Computing Systems \(CHI\)\.
- R\. W\. Andrews, J\. M\. Lilly, D\. K\. Srivastava, and K\. M\. Feigh \(2023\) The role of shared mental models in human\-AI teams: a theoretical review\. Theoretical Issues in Ergonomics Science 24\(2\), pp\. 129–175\.
- L\. Argote and Y\. Ren \(2012\) Transactive memory systems: a microfoundation of dynamic capabilities\. Journal of Management Studies 49\(8\), pp\. 1375–1382\.
- N\. Balepur et al\. \(2026\) DRACULA: hunting for the actions users want deep research agents to execute\. CoRR abs/2604\.23815\.
- G\. Bansal et al\. \(2021\) Does the whole exceed its parts? the effect of AI explanations on complementary team performance\. In Proceedings of the Conference on Human Factors in Computing Systems \(CHI\)\.
- H\. Chopra, K\. Ghate, A\. Caliskan, T\. Kohno, C\. Shah, and N\. Jaques \(2026\) Beyond cooperative simulators: generating realistic user personas for robust evaluation of LLM agents\. CoRR abs/2605\.12894\.
- M\. Elfeki et al\. \(2026\) HiL\-Bench \(human\-in\-loop benchmark\): do agents know when to ask for help? CoRR abs/2604\.09408\.
- F\. Fischer, I\. Kollar, K\. Stegmann, and C\. Wecker \(2013\) Toward a script theory of guidance in computer\-supported collaborative learning\. Educational Psychologist 48\(1\), pp\. 56–66\.
- G\. Gonzalez\-Pumariega, L\. S\. Yean, N\. Sunkara, and S\. Choudhury \(2025\) Robotouille: an asynchronous planning benchmark for LLM agents\. In Proceedings of the International Conference on Learning Representations \(ICLR\)\.
- C\. Gutwin and S\. Greenberg \(2002\) A descriptive framework of workspace awareness for real\-time groupware\. Comput\. Support\. Cooperative Work 11\(3\-4\), pp\. 411–446\.
- C\. Heath and N\. Staudenmayer \(2000\) Coordination neglect: how lay theories of organizing complicate coordination in organizations\. Research in Organizational Behavior 22, pp\. 153–191\.
- E\. Horvitz \(1999\) Principles of mixed\-initiative user interfaces\. In Proceedings of the Conference on Human Factors in Computing Systems \(CHI\)\.
- I\. Kollar, F\. Fischer, and F\. W\. Hesse \(2006\) Collaboration scripts: a conceptual analysis\. Educational Psychology Review 18, pp\. 159–185\.
- P\. Laban, H\. Hayashi, Y\. Zhou, and J\. Neville \(2025\) LLMs get lost in multi\-turn conversation\. CoRR abs/2505\.06120\.
- K\. Lewis and B\. Herndon \(2011\) Transactive memory systems: current issues and future research directions\. Organ\. Sci\. 22\(5\), pp\. 1254–1265\.
- K\. Lewis \(2003\) Measuring transactive memory systems in the field: scale development and validation\. Journal of Applied Psychology 88\(4\), pp\. 587–604\.
- F\. Lin, E\. L\. Malfa, V\. Hofmann, E\. M\. Yang, A\. G\. Cohn, and J\. B\. Pierrehumbert \(2024\) Graph\-enhanced large language models in asynchronous plan reasoning\. In Proceedings of the International Conference on Machine Learning \(ICML\)\.
- J\. Liu et al\. \(2026\) AutoResearchClaw: self\-reinforcing autonomous research with human\-AI collaboration\. CoRR abs/2605\.20025\.
- B\. P\. Majumder et al\. \(2025\) DiscoveryBench: towards data\-driven discovery with large language models\. In Proceedings of the International Conference on Learning Representations \(ICLR\)\.
- T\. W\. Malone and K\. Crowston \(1994\) The interdisciplinary study of coordination\. ACM Comput\. Surv\. 26\(1\), pp\. 87–119\.
- C\. Masters et al\. \(2025\) Orchestrating human\-AI teams: the manager agent as a unifying research challenge\. CoRR abs/2510\.02557\.
- J\. E\. Mathieu, T\. S\. Heffner, G\. F\. Goodwin, E\. Salas, and J\. A\. Cannon\-Bowers \(2000\) The influence of shared mental models on team process and performance\. Journal of Applied Psychology 85\(2\), pp\. 273–283\.
- S\. Mehri et al\. \(2026\) Measuring and mitigating the distributional gap between real and simulated user behaviors\. CoRR abs/2605\.07847\.
- R\. L\. Moreland \(1999\) Transactive memory: learning who knows what in work groups and organizations\. In Shared Cognition in Organizations: The Management of Knowledge\.
- D\. Nathani et al\. \(2026\) Proactive agent research environment: simulating active users to evaluate proactive assistants\. CoRR abs/2604\.00842\.
- R\. Parasuraman, T\. B\. Sheridan, and C\. D\. Wickens \(2000\) A model for types and levels of human interaction with automation\. IEEE Trans\. Syst\. Man Cybern\. Part A 30\(3\), pp\. 286–297\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\) Generative agents: interactive simulacra of human behavior\. In Proceedings of the ACM Symposium on User Interface Software and Technology \(UIST\)\.
- Y\. Shao, V\. Samuel, Y\. Jiang, J\. Yang, and D\. Yang \(2024\) Collaborative Gym: A framework for enabling and evaluating human\-agent collaboration\. CoRR abs/2412\.15701\.
- G\. Stasser and W\. Titus \(1985\) Pooling of unshared information in group decision making: biased information sampling during discussion\. Journal of Personality and Social Psychology 48\(6\), pp\. 1467–1478\.
- I\. D\. Steiner \(1972\) Group process and productivity\. Academic Press\.
- H\. Sun et al\. \(2025\) Collab\-Overcooked: benchmarking and evaluating large language models as collaborative agents\. In Proceedings of the Conference on Empirical Methods in Natural Language Processing \(EMNLP\)\.
- M\. Vaccaro, A\. Almaatouq, and T\. W\. Malone \(2024\) When combinations of humans and AI are useful: a systematic review and meta\-analysis\. Nature Human Behaviour 8, pp\. 2293–2303\.
- D\. M\. Wegner \(1987\) Transactive memory: a contemporary analysis of the group mind\. In Theories of Group Behavior\.
- S\. Wu et al\. \(2025\) CollabLLM: from passive responders to active collaborators\. In Proceedings of the International Conference on Machine Learning \(ICML\)\.
- S\. Yao et al\. \(2023\) ReAct: synergizing reasoning and acting in language models\. In Proceedings of the International Conference on Learning Representations \(ICLR\)\.
- F\. Yu, A\. Moehring, O\. Banerjee, T\. Salz, N\. Agarwal, and P\. Rajpurkar \(2024\) Heterogeneity and predictors of the effects of AI assistance on radiologists\. Nature Medicine 30, pp\. 837–849\.
- S\. Zhang et al\. \(2025\) ParaCook: on time\-efficient planning for multi\-agent systems\. CoRR abs/2510\.11608\.
- X\. Zhou et al\. \(2024\) SOTOPIA: interactive evaluation for social intelligence in language agents\. In Proceedings of the International Conference on Learning Representations \(ICLR\)\.
- Y\. Zhou et al\. \(2025\) SWEET\-RL: training multi\-turn LLM agents on collaborative reasoning tasks\. CoRR abs/2503\.15478\.

## Appendix A Evaluator Model Details

We use the Collaborative Gym default evaluator \(NVIDIA\-Nemotron\-3\-Super\-120B\) for the tabular\-analysis Performance score and for the initiative\-event classifier used to compute $H_{\mathrm{init,norm}}$, to remain comparable with prior work\. A separate strong labeler \(Claude Sonnet 4\.6\) scores the graph\-dependent process metrics: Workflow Coverage, Hypothesis Support, and Profile Alignment\. We generate reference workflow graphs with Claude Code using Claude Opus 4\.6\. The generation agent receives the task data, query, benchmark reference hypothesis, and instructions to identify acceptable solution paths, including alternative evidence routes where applicable\. We then programmatically check and human\-validate the resulting graphs as described in Section [4](https://arxiv.org/html/2606.18413#S4)\.

## Appendix B Simulated\-Human Profile Guidance

We implement the simulated\-human collaborator profiles by adding a static private guidance block $\pi_{u}$ to the otherwise shared Collaborative Gym simulated\-human prompt\. This isolates the profile manipulation: both collaborator profiles use the same model, workspace, action loop, communication channel, and own\-history visibility described in Section [4](https://arxiv.org/html/2606.18413#S4)\. The profile guidance provides additional background meant to simulate differences in collaborator expertise, work experience, and attention, without assuming perfect task performance\. Full code, prompts, profile configuration files, and the reference workflow graphs and validation annotations are available in our code repository at [https://github\.com/nachiketdk/scaffolded\-human\-ai\-collaboration](https://github.com/nachiketdk/scaffolded-human-ai-collaboration)\.

#### Data\-analysis collaborator\.

The data\-analysis profile emphasizes table\-grounded evidence work: mapping query concepts to dataset columns, identifying variable families, applying filters or temporal windows, computing extrema or first/second events, validating results, and communicating inspectable evidence\.

#### Researcher collaborator\.

The researcher profile emphasizes domain interpretation and evidential alignment: preserving distinctions such as first versus second, peak versus highest or lowest, increase versus decrease, and bounded\-period language; reviewing variable choice, time conversion, ambiguity, and support; noticing missing column mappings or numeric evidence; and shaping final hypotheses around context, variables, result, evidence, and caveats\.
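
A minimal sketch of this prompt composition is given below, under stated assumptions: the prompt strings are short paraphrases written for illustration, and `SHARED_PROMPT`, `PROFILE_GUIDANCE`, and `build_prompt` are hypothetical names, not the contents or identifiers of the released configuration files. The point it illustrates is that the two profiles share everything except the appended private guidance block.

```python
# Hypothetical stand-in for the shared simulated-human prompt.
SHARED_PROMPT = (
    "You are a simulated human collaborator in a shared workspace. "
    "Coordinate with your teammates before a final answer is submitted."
)

# Hypothetical paraphrases of the two private guidance blocks.
PROFILE_GUIDANCE = {
    "data_analysis": (
        "Focus on table-grounded evidence: map query concepts to dataset "
        "columns, apply filters or temporal windows, and validate results."
    ),
    "researcher": (
        "Focus on domain interpretation and evidential alignment: preserve "
        "distinctions such as first versus second and review variable choice."
    ),
}


def build_prompt(profile):
    """Append the profile's private guidance to the shared prompt."""
    guidance = PROFILE_GUIDANCE[profile]
    return f"{SHARED_PROMPT}\n\n[Private guidance]\n{guidance}"


for name in PROFILE_GUIDANCE:
    print(name, len(build_prompt(name)))
```

Keeping the shared portion identical means any behavioral difference between the two simulated collaborators can be attributed to the guidance block rather than to the model or workspace.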

## Appendix C Additional Qualitative Trace Figures

These additional trace figures provide the other selected\-event pairs referenced in Section [5\.5](https://arxiv.org/html/2606.18413#S5.SS5)\.

Datasets: time\_series\_data\.csv, capital\.csv, pollen\_…\.csv, …

Columns shown: CE, Copper, Gold, Copper\_inter…; sample rows begin \(\-3600, \-0\.50, …\), \(\-3500, 0\.69, …\), \(\-3400, 0\.69, …\)\.

Goal: In which century did copper have its first peak?

\(a\) Task specification

![Refer to caption](https://arxiv.org/html/2606.18413v1/figures/qualitative_instances/metadata_07/04_swimlane_polished_opus.png)

\(b\) Validated selected\-event trace panel

Figure 4: Validated process view for the copper\-peak task, using selected events from the raw trace\. The compact tags remove low\-level trace text while preserving participant lanes and left\-to\-right event order\. The Default trace carries a wrong criterion into an unsupported hypothesis; the Scaffolded trace shows the data\-analysis collaborator’s criterion check feeding into comparison and a supported hypothesis\.

Datasets: time\_series\_data\.csv, capital\.csv, pollen\_…\.csv, …

Columns shown: CE, PotteryDec\., Decor\_inter…; sample rows begin \(\-4100, \-1\.92, 0, …\), \(\-4000, \-0\.11, 0, …\), \(\-3900, \-0\.26, 0, …\)\.

Goal: In which century does Diversity in Pottery Decoration begin to show its highest sustained values?

\(a\) Task specification

![Refer to caption](https://arxiv.org/html/2606.18413v1/figures/qualitative_instances/metadata_31/04_swimlane_polished_opus.png)

\(b\) Validated selected\-event trace panel

Figure 5: Validated process view for the pottery\-decoration task\. The task asks when Diversity in Pottery Decoration begins its highest sustained values\. The Default trace anchors the hypothesis to the first elevated plateau rather than the start of the highest sustained plateau\. The Scaffolded trace shows teammate checking preceding a more direct computation of the maximum plateau and a more directly supported hypothesis\.
