CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

arXiv cs.CL Papers

Summary

This paper introduces CHI-Bench, a benchmark for evaluating AI agents on end-to-end automation of complex healthcare workflows that require policy-grounded decisions, multi-role composition, and multilateral interactions. Experimental results show that the best agent achieves only 28% task resolution, highlighting significant gaps in current agent capabilities for policy-dense enterprise domains.

arXiv:2605.16679v1 Announce Type: new Abstract: End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density: decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition: a single task requires the agent to play multiple roles with handoffs; and multilateral interaction: intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce $\chi$-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status through tool calls and writing the role's artifacts, guided by a 1,279-document managed-care operations handbook skill. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.

Cached at: 05/19/26, 06:35 AM

# Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
Source: [https://arxiv.org/html/2605.16679](https://arxiv.org/html/2605.16679)
Deon Metelski ([actAVA.ai](https://actava.ai/)); Leon Qi ([actAVA.ai](https://actava.ai/)); Tao Xia ([actAVA.ai](https://actava.ai/)); Joonyul Lee ([actAVA.ai](https://actava.ai/)); Steve Brown ([actAVA.ai](https://actava.ai/)); Kevin Riley ([actAVA.ai](https://actava.ai/)); Frank Wang ([actAVA.ai](https://actava.ai/)); T. Y. Alvin Liu, MD (Johns Hopkins Medicine); Hank Capps, MD (Wellstar Health System); Zeyu Tang (Stanford University); Xiangchen Song (CMU); Lingjing Kong (CMU); Fan Feng (UCSD); Tianyi Zeng (Yale School of Medicine); Zhiwei Liu (Salesforce AI Research); Zixian Ma (University of Washington); Hang Jiang (Northeastern University); Fangli Geng (Brown University); Yuan Yuan (Boston College); Chenyu You (Stony Brook University); Qingsong Wen (University of Oxford); Hua Wei (Arizona State University); Yanjie Fu (Arizona State University); Yue Zhao (University of Southern California); Carl Yang (Emory University); Biwei Huang (UCSD); Kun Zhang (CMU, MBZUAI); Caiming Xiong (Recursive Superintelligence); Sanmi Koyejo (Stanford University); Eric P. Xing (MBZUAI, CMU); Philip S. Yu (University of Illinois at Chicago); Weiran Yao ([actAVA.ai](https://actava.ai/))

###### Abstract

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density, where decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition, where a single task requires the agent to play multiple roles with handoffs; and multilateral interaction, where intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce χ-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status through tool calls and writing the role's artifacts, guided by a 1,279-document managed-care operations handbook skill. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.

![Refer to caption](https://arxiv.org/html/2605.16679v1/x1.png)

Figure 1: χ-Bench: Clinical Healthcare In-Situ Environment and Evaluation Benchmark.

## 1 Introduction

The U.S. healthcare system is an administrative nightmare \[[11](https://arxiv.org/html/2605.16679#bib.bib11), [42](https://arxiv.org/html/2605.16679#bib.bib42)\]. Prior authorization (PA), where providers (e.g., hospitals) prepare clinical documents for payers (e.g., insurers) to review in order to justify a service or medication, is one of the most common yet inefficient workflows \[[43](https://arxiv.org/html/2605.16679#bib.bib43), [45](https://arxiv.org/html/2605.16679#bib.bib45), [1](https://arxiv.org/html/2605.16679#bib.bib1)\]. Care management (CM), a long-term patient-assistance program, follows a similar arc \[[25](https://arxiv.org/html/2605.16679#bib.bib25), [10](https://arxiv.org/html/2605.16679#bib.bib10), [23](https://arxiv.org/html/2605.16679#bib.bib23)\]: referrals queue for weeks, staff spend hours on patient outreach, and coordination across roles buries nurses in work they didn't sign up for. These are long-horizon, policy-grounded tasks where every handoff is a chance for things to stall. AI agents are increasingly proposed as a way to assist or partially automate such work. Frontier agents already sustain hundreds of tool calls over hours of execution, automating long-horizon tasks that were out of reach a year ago.

However, end-to-end automation of realistic healthcare workflows tells a different story, posing three underexplored challenges that warrant rigorous stress-testing:

![Refer to caption](https://arxiv.org/html/2605.16679v1/x2.png)

Figure 2: Illustration of the three challenges: policy retrieval, multi-role composition (intake clerk → nurse → MD reviewer → peer-to-peer coordinator), and clinician outreach, all occurring in a single utilization management task. More examples can be found at [https://actava.ai/benchmarks](https://actava.ai/benchmarks).

1) Policy density. Every agent decision must be grounded in policy, e.g., medical guidelines, insurance rules, and operational procedures that vary across providers and payers and shift over time. Agents must navigate a large policy library, interpret conditions correctly, and adhere to them across long tool-call chains.

2) Multi-role composition. An end-to-end workflow is divided among roles such as clinician, coordinator, UM nurse, medical director, and RN care manager. An agent must possess all of their domain knowledge and switch context and goals as the case moves. Handoffs are terminal: once a step is submitted or routed, it cannot be edited or re-run.

3) Multilateral interactions. Some steps are not tool calls but multi-turn conversations, such as payer-provider peer-to-peer review, requests for information, or care manager outreach to patients. Agents must shift from background execution to live dialog, collect information incrementally from humans, and carry the results back to the workflow.

These challenges are not edge cases; they are the daily reality of managed\-care operations, where the bulk of work centers on prior authorization, utilization management review, and care management\.

Motivated by these challenges, we introduce χ-Bench, a benchmark that evaluates frontier agents in these three realistic, end-to-end healthcare workflow settings. As shown in [Figure 1](https://arxiv.org/html/2605.16679#S0.F1), each task hands the agent a case (a provider PA, a payer UM review, or an RN care management case) in a high-fidelity simulator of 20 healthcare apps exposed via MCP. The agent must drive the case to a terminal status by issuing tool calls and writing the role's artifacts (submission packets, review notes, letters, care plans), guided by a managed-care operations handbook skill (1,279 markdown documents) covering workflows, platform usage, and medical/insurance policy. The resulting world state, artifacts, and event trail are scored in-situ by a composite verifier that combines deterministic checks with a rubric-based LLM judge.

![Refer to caption](https://arxiv.org/html/2605.16679v1/x3.png)

Figure 3: pass@1 across the three χ-Bench environments of frontier proprietary LLMs with their first-party agent harness. Error bars are task-level percentile bootstrap 95% confidence intervals.

We evaluated 30 agent harness/model configurations spanning major frontier models and strong agent stacks. As shown in [Figure 3](https://arxiv.org/html/2605.16679#S1.F3), χ-Bench is far from solved. The best configuration, Claude Code + Claude Opus 4.6, resolves only 28.0% of tasks at pass@1; no agent clears 20% under the strict pass^3 reliability metric; the marathon run, where agents execute all tasks in a single session, drops to 3.8%; and the end-to-end provider–payer arena collapses the best prior auth agents to 0%. These results suggest that the long-horizon capabilities frontier agents demonstrate on coding-style benchmarks do not generalize well to realistic healthcare workflows, and we expect similar gaps in other policy-dense, role-composed, irreversible enterprise domains.

![Refer to caption](https://arxiv.org/html/2605.16679v1/x4.png)Figure 4:Comparing strengths and weaknesses of Codex GPT\-5\.5 and Claude Code Opus 4\.6 across PA, UM, and CM domains\. Higher bars = more trials failing that check\.
## 2 Related Work

##### Healthcare AI Benchmarks\.

Prior healthcare benchmarks evaluate one of: factual medical knowledge \[[20](https://arxiv.org/html/2605.16679#bib.bib20), [40](https://arxiv.org/html/2605.16679#bib.bib40), [21](https://arxiv.org/html/2605.16679#bib.bib21), [51](https://arxiv.org/html/2605.16679#bib.bib51), [56](https://arxiv.org/html/2605.16679#bib.bib56), [62](https://arxiv.org/html/2605.16679#bib.bib62)\], broad clinical LLM proficiency \[[7](https://arxiv.org/html/2605.16679#bib.bib7), [5](https://arxiv.org/html/2605.16679#bib.bib5)\], EHR querying \[[29](https://arxiv.org/html/2605.16679#bib.bib29), [26](https://arxiv.org/html/2605.16679#bib.bib26), [48](https://arxiv.org/html/2605.16679#bib.bib48), [52](https://arxiv.org/html/2605.16679#bib.bib52), [53](https://arxiv.org/html/2605.16679#bib.bib53)\], short-horizon clinical agents \[[18](https://arxiv.org/html/2605.16679#bib.bib18), [44](https://arxiv.org/html/2605.16679#bib.bib44), [32](https://arxiv.org/html/2605.16679#bib.bib32), [58](https://arxiv.org/html/2605.16679#bib.bib58)\], or narrower administrative interactions \[[18](https://arxiv.org/html/2605.16679#bib.bib18), [8](https://arxiv.org/html/2605.16679#bib.bib8)\]. χ-Bench is the first to combine, in a single task, long-horizon tool calls, explicit dense policy retrieval, irreversible workflow state, hidden multilateral interaction, and in-situ verification against persisted simulator state. HealthAdminBench \[[8](https://arxiv.org/html/2605.16679#bib.bib8)\], the closest peer, focuses on GUI interaction over payer portals via pixel/DOM browsing, whereas χ-Bench exposes apps via structured MCP tools and a large explicit policy handbook skill. We also add the care management domain with patient outreach.

Table 1: Coverage matrix of nine evaluation axes across 29 healthcare and long-horizon agent benchmarks, characterizing the task surface each one targets; per-axis definitions and per-benchmark cell-by-cell justifications are in the appendix. ✓ = supported, ❍ = partially supported, ✗ = not supported.

| Benchmark | Healthcare | API Tools | Long-horiz. | Policy Density | Multi-role Comp. | Multilateral | Hidden state | In-Situ | LLM judge |
|---|---|---|---|---|---|---|---|---|---|
| MedQA \[[20](https://arxiv.org/html/2605.16679#bib.bib20)\] | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| MedMCQA \[[40](https://arxiv.org/html/2605.16679#bib.bib40)\] | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| PubMedQA \[[21](https://arxiv.org/html/2605.16679#bib.bib21)\] | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| BioASQ \[[51](https://arxiv.org/html/2605.16679#bib.bib51)\] | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| MIRAGE \[[56](https://arxiv.org/html/2605.16679#bib.bib56)\] | ✓ | ❍ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| MedCalc-Bench \[[26](https://arxiv.org/html/2605.16679#bib.bib26)\] | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| EHRSQL \[[29](https://arxiv.org/html/2605.16679#bib.bib29)\] | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| BioCoder \[[48](https://arxiv.org/html/2605.16679#bib.bib48)\] | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ❍ | ✗ |
| BioDSBench \[[52](https://arxiv.org/html/2605.16679#bib.bib52)\] | ✓ | ✓ | ❍ | ✗ | ✗ | ✗ | ✗ | ❍ | ✗ |
| EHRSHOT \[[53](https://arxiv.org/html/2605.16679#bib.bib53)\] | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| MedHELM \[[7](https://arxiv.org/html/2605.16679#bib.bib7)\] | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| MedXpertQA \[[62](https://arxiv.org/html/2605.16679#bib.bib62)\] | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| HealthBench \[[5](https://arxiv.org/html/2605.16679#bib.bib5)\] | ✓ | ✗ | ❍ | ✗ | ✗ | ❍ | ✗ | ✗ | ✓ |
| MedAgentsBench \[[49](https://arxiv.org/html/2605.16679#bib.bib49)\] | ✓ | ❍ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| AgentClinic \[[44](https://arxiv.org/html/2605.16679#bib.bib44)\] | ✓ | ❍ | ❍ | ✗ | ❍ | ✓ | ✓ | ✗ | ❍ |
| MedChain \[[32](https://arxiv.org/html/2605.16679#bib.bib32)\] | ✓ | ✓ | ✓ | ✗ | ✗ | ❍ | ✓ | ✗ | ❍ |
| MedAgentBench \[[18](https://arxiv.org/html/2605.16679#bib.bib18)\] | ✓ | ✓ | ❍ | ✗ | ✗ | ✗ | ✗ | ❍ | ✗ |
| MedAgentGym \[[58](https://arxiv.org/html/2605.16679#bib.bib58)\] | ✓ | ✓ | ❍ | ✗ | ✗ | ✗ | ✗ | ❍ | ✗ |
| HealthAdminBench \[[8](https://arxiv.org/html/2605.16679#bib.bib8)\] | ✓ | ✗ | ✓ | ❍ | ✓ | ✗ | ❍ | ✓ | ✓ |
| SWE-Bench \[[19](https://arxiv.org/html/2605.16679#bib.bib19)\] | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ❍ | ✗ |
| WebArena \[[61](https://arxiv.org/html/2605.16679#bib.bib61)\] | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ❍ | ✗ |
| OSWorld \[[55](https://arxiv.org/html/2605.16679#bib.bib55)\] | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| WorkArena \[[13](https://arxiv.org/html/2605.16679#bib.bib13)\] | ✗ | ✗ | ✓ | ❍ | ✗ | ✗ | ✗ | ✓ | ✗ |
| AppWorld \[[50](https://arxiv.org/html/2605.16679#bib.bib50)\] | ✗ | ✓ | ✓ | ✗ | ❍ | ❍ | ❍ | ✓ | ✗ |
| Terminal-Bench \[[33](https://arxiv.org/html/2605.16679#bib.bib33)\] | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| Toolathlon \[[30](https://arxiv.org/html/2605.16679#bib.bib30)\] | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ |
| SkillsBench \[[31](https://arxiv.org/html/2605.16679#bib.bib31)\] | ❍ | ✓ | ✓ | ❍ | ✗ | ✗ | ✗ | ✓ | ✗ |
| τ/τ²-Bench \[[59](https://arxiv.org/html/2605.16679#bib.bib59), [6](https://arxiv.org/html/2605.16679#bib.bib6)\] | ✗ | ✓ | ✓ | ❍ | ❍ | ✓ | ✓ | ✓ | ✗ |
| TheAgentCompany \[[57](https://arxiv.org/html/2605.16679#bib.bib57)\] | ✗ | ❍ | ✓ | ❍ | ❍ | ✓ | ❍ | ✓ | ❍ |
| χ-Bench (ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

##### Long\-Horizon Agent Benchmarks\.

General-purpose benchmarks cover GUI control \[[61](https://arxiv.org/html/2605.16679#bib.bib61), [55](https://arxiv.org/html/2605.16679#bib.bib55), [13](https://arxiv.org/html/2605.16679#bib.bib13)\], long-horizon code \[[19](https://arxiv.org/html/2605.16679#bib.bib19), [33](https://arxiv.org/html/2605.16679#bib.bib33)\], and broad tool use \[[50](https://arxiv.org/html/2605.16679#bib.bib50), [30](https://arxiv.org/html/2605.16679#bib.bib30), [31](https://arxiv.org/html/2605.16679#bib.bib31)\], but rarely model multi-actor workflows. τ/τ²-Bench \[[59](https://arxiv.org/html/2605.16679#bib.bib59)\] and TheAgentCompany \[[57](https://arxiv.org/html/2605.16679#bib.bib57)\] are closest in interaction structure, pairing agents with simulated stakeholders under policy constraints; neither targets healthcare or the long-horizon, policy-dense, information-asymmetric setting that defines prior authorization.

See the appendix for cell-by-cell details of Table [1](https://arxiv.org/html/2605.16679#S2.T1).

## 3 χ-Bench: High-Fidelity Healthcare Environment and Benchmark

χ-Bench evaluates AI agents on clinical healthcare workflows in-situ (χ), automating prior-authorization (PA), utilization-management (UM), and care-management (CM) tasks for U.S. providers and payers. It spans three long-horizon domains, each requiring grounded navigation of a large policy library: (1) Provider PA submission: verify coverage, gather evidence, submit the packet, and work the response (RFIs, peer-to-peers, appeals) to terminal status; (2) Payer UM review: intake the request, check plan policy, escalate through nurse and physician reviewers, and issue a determination; (3) RN care management: review the chart, contact the patient, administer assessments, and author a care plan.

![Refer to caption](https://arxiv.org/html/2605.16679v1/x5.png)

Figure 5: χ-World Engine: Simulated Worlds for Clinical Healthcare In-Situ Workflows.

### 3.1 χ-World Engine: Simulated Worlds for Clinical Healthcare In-Situ Workflows

Healthcare workflows involve four stakeholders: patients, clinicians (provider), payers, and care management entities, and a faithful benchmark must represent each and their interactions. χ-World Engine ([Figure 5](https://arxiv.org/html/2605.16679#S3.F5)) is a local, high-fidelity simulator of 20 day-to-day healthcare apps, operable via 151 REST APIs and 87 MCP tools across 3 MCP servers, populated with ∼5,000 chart activities for 50 simulated patients and ∼90 healthcare workers. Agents operate the apps autonomously through MCP servers, the local database, and the file system.

#### 3.1.1 Realistic Healthcare Software Environments

We implement the apps (using FastAPI, SQLite, SQLModel, and MCP over streamable HTTP) across three domains: provider PA, payer UM, and care management. Built in ∼115K lines of Python, the simulator captures features absent from general-purpose benchmarks: case state machines with 29 statuses and explicit legal transitions; reviewer-independence constraints across nurse, medical-director, and peer-to-peer review; channel-specific submission semantics; and document authorship, signing, and FHIR-grade encounter linkage. Actions trigger consistent cross-app effects atomically: a provider-side submission spawns a payer intake record, advances the event log, and may produce routing assignments, pend notifications, and outbound letters.
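To make the idea of a case state machine with explicit legal transitions concrete, here is a minimal sketch. It is not the simulator's code: the status names (other than `pended_action_required`, which appears in the paper's example task), the transition table, and the `Case` class are hypothetical and cover only a handful of the 29 statuses.

```python
from dataclasses import dataclass, field

# Hypothetical subset of statuses and legal transitions; the simulator's
# actual 29 statuses and its transition table are not listed in the paper text.
LEGAL_TRANSITIONS = {
    "draft": {"submitted"},
    "submitted": {"in_nurse_review"},
    "in_nurse_review": {"approved", "pended_action_required", "in_md_review"},
    "in_md_review": {"approved", "denied", "pended_action_required"},
    "pended_action_required": {"in_nurse_review"},
    "approved": set(),  # terminal
    "denied": set(),    # terminal
}


class IllegalTransition(Exception):
    """Raised when a requested status change is not in the transition table."""


@dataclass
class Case:
    case_id: str
    status: str = "draft"
    event_log: list = field(default_factory=list)

    def transition(self, new_status: str, actor: str) -> None:
        # Reject any move that the table does not explicitly allow.
        if new_status not in LEGAL_TRANSITIONS[self.status]:
            raise IllegalTransition(f"{self.status} -> {new_status}")
        # Record the change before applying it, mimicking an event trail.
        self.event_log.append((self.status, new_status, actor))
        self.status = new_status
```

In this sketch, calling `transition("approved", ...)` on a draft case raises `IllegalTransition`, which is the kind of guard that makes skipped workflow steps visible to a verifier.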

![Refer to caption](https://arxiv.org/html/2605.16679v1/x6.png)

Figure 6: Healthcare apps across three task domains. (a) Payer – Utilization Management (10 apps); shown: nurse clinical review. (b) Provider – Prior Authorization (5 apps); shown: service-request step. (c) Population Healthcare – Care Management (5 apps); shown: patient outreach.

We expose 87 of the 151 backend APIs as MCP tools, manually selected to mirror UI operations available to human users. See the appendix for the MCP server and tool details.

#### 3.1.2 Encoding Healthcare Workflows with Managed-Care Operations Handbook Skill

We complement MCP servers with Agent Skills \[[60](https://arxiv.org/html/2605.16679#bib.bib60), [22](https://arxiv.org/html/2605.16679#bib.bib22)\] to teach agents the specialized healthcare workflows. To realistically simulate how a healthcare worker handles a case, skills must encode the entire operational workflow, external software usage patterns, and the medical and insurance policies that govern each decision (e.g., payer medical-policy criteria, insurance coverage and eligibility).

- `managed-care-operations-handbook/`: `SKILL.md` (top-level index → routes by role), `references/`
- `provider-pa/` (PA specialist): `SKILL.md`, `01-workflow.md`, `pa-requirements/` (CPT/HCPCS), `forms/` (by coverage), `examples/`
- `payer-um/` (UM reviewer): `SKILL.md`, `01-workflow.md`, `operational/` (5 chapters), `examples/` (letters)
- `care-manager/` (RN care manager): `SKILL.md`, `01-workflow.md`, `operational/` (6 chapters)
- `medical-library/` (shared appendix): 1,000+ medical-policy documents, drug auth. criteria, clinical guidelines
- `platform/` (shared appendix): object semantics, surface ownership, workspace boundaries (per role)

Figure 7: The Managed-Care Operations Handbook Skill is organized as a progressive-disclosure manual. The top-level `SKILL.md` acts as a table of contents that routes the agent to one of three role sub-skills (`provider-pa`, `payer-um`, `care-manager`); the two shared appendices, `medical-library` (clinical lookup) and `platform` (role-specific tutorials), are reachable from any sub-skill via the dashed access bus.

In this paper, we propose a core skill, the Managed-Care Operations Handbook, with 1,279 markdown documents in a skill/sub-skill structure, developed with clinicians and operations leaders at Johns Hopkins Medicine to ensure clinical fidelity and alignment with real-world workflows. We treat skill authoring as writing the onboarding guide for a new hire. As shown in [Figure 7](https://arxiv.org/html/2605.16679#S3.F7), we organize the skill as a wiki manual, where a top-level skill routes the agent to one of three role sub-skills (PA specialist, UM reviewer, care manager), each opening with a workflow chapter before diving into role-specific chapters and templates. Two appendices are shared across roles: a medical library of policies, drug criteria, and guidelines curated and validated with subject-matter experts, and platform tutorials on how to use MCP for specialized workflows. Although skill context can in theory be unbounded, the largest skills published to date are, to our knowledge, a handful of files; this is the first time agents with skills have been evaluated at the scale of a real healthcare operational workflow library. The handbook details, together with provenance and licensing information, are in the appendix.

### 3.2 The χ-Bench Construction

#### 3.2.1 Task Definition

A χ-Bench task is a quadruple: instructions, the containerized χ-World environment, role-scoped tool surfaces, and a two-layer verifier. We formalize it as a hierarchical POMDP \[[24](https://arxiv.org/html/2605.16679#bib.bib24)\] $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{O},P,Z,R,\rho_{0};\mathcal{H})$, where the latent state $\mathcal{S}$ spans patient charts, payer/provider records, workflow status, communications, artifacts, and event history; $\mathcal{A}$ comprises role-scoped MCP and default-agent tool actions; $\mathcal{O}$ comprises the role-scoped observations returned through MCP outputs, messages, policy passages, and shared-workspace files; $P$ and $Z$ are the transition and observation kernels induced by the environment and its tools; $R$ is the verifier-induced reward; and $\rho_{0}$ is the distribution over initial task states. The hierarchy $\mathcal{H}:=(\mathcal{G},\nu,\mathcal{W})$ uses role-agent specifications $\mathcal{G}:=\{(G_{i},u_{i},\mathcal{K}_{i})\}_{i=1}^{N}$, where $G_{i}$ is a role agent, $u_{i}$ its instruction, and $\mathcal{K}_{i}$ its available skill set; $\nu$ defines the handoff order and $\mathcal{W}$ the shared workspace. Each $\mathcal{K}_{i}$ is a set of options \[[47](https://arxiv.org/html/2605.16679#bib.bib47)\], i.e., temporally extended procedures (e.g., *nurse criterion review*: policy retrieval $\rightarrow$ chart read $\rightarrow$ structured-payload write). Instructions specify role, case, workspace, and rules; procedural detail must be recovered from the handbook. Handoffs are irreversible: outgoing commits to $\mathcal{W}$ become incoming input for the next role, and the accumulated state and event log determine the reward ([Section 3.2.3](https://arxiv.org/html/2605.16679#S3.SS2.SSS3)).
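The hierarchy of role agents, handoff order, and shared workspace can be pictured with a small sketch. This is an illustration under assumptions rather than the benchmark's implementation: `RoleSpec`, `Workspace`, `run_handoffs`, and the `act` callback are hypothetical names, and irreversibility is modeled simply as an append-only commit list.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class RoleSpec:
    """One element (G_i, u_i, K_i) of the role-agent specification set."""
    name: str                # G_i: the role agent
    instruction: str         # u_i: its instruction
    skills: Tuple[str, ...]  # K_i: labels of available options (hypothetical)


class Workspace:
    """Append-only shared workspace W: commits can be read but never edited."""

    def __init__(self) -> None:
        self._commits: List[Tuple[str, str]] = []

    def commit(self, role: str, artifact: str) -> None:
        self._commits.append((role, artifact))

    def history(self) -> Tuple[Tuple[str, str], ...]:
        # Immutable snapshot, so a downstream role cannot rewrite earlier work.
        return tuple(self._commits)


def run_handoffs(
    roles: List[RoleSpec],
    workspace: Workspace,
    act: Callable[[RoleSpec, tuple], str],
) -> tuple:
    """Run roles in the fixed handoff order; each role sees only prior commits."""
    for role in roles:
        incoming = workspace.history()
        workspace.commit(role.name, act(role, incoming))
    return workspace.history()
```

Passing a list such as intake clerk, nurse, and MD reviewer to `run_handoffs` yields one commit per role in order, with each role's input fixed by what earlier roles committed.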

New submission for a 45-year-old female with malignant neoplasm of right female breast and family history of malignant neoplasm of breast...

| Resource | Location |
|---|---|
| Case data, tools | healthverse MCP |
| Handbook | `/workspace/skills` |
| Working files | `<case>/payer/` |
| ... | ... |

- Use only payer namespaces: `payer_intake_hub`, `triage`, `review`, ...
- ...

(a) Instruction the agent reads at task entry.

    {
      "expected_target_status": "pended_action_required",
      "stage_ground_truth": [
        {"stage": "nurse_review",
         "expected_fields": {
           "recommendation": "escalate"}},
        ...
      ],
      "expected_clinical_criteria": [
        {"criterion_id": "crit_test_modality_dna_panel",
         "stage": "nurse_review",
         "expected_criterion_result": "met"},
        ...
      ],
      "expected_service_request": {
        "required_diagnosis_codes": ["C50.911", ...],
        ...
      }
    }

(b) The per-stage ground-truth bundle.

Figure 8: Example of a Payer UM task for hereditary breast-cancer genomic sequencing.
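As an illustration of how a ground-truth bundle like the one in Figure 8(b) could be consumed by deterministic checks, consider the sketch below. The bundle keys are taken from the figure; the layout of the `persisted` dictionary and the function name `check_contract` are assumptions made for this example and are not described in the paper.

```python
from typing import Dict


def check_contract(ground_truth: dict, persisted: dict) -> Dict[str, bool]:
    """Compare a (hypothetical) persisted case record with the ground truth.

    Returns one boolean per check so failures can be inspected individually.
    """
    results: Dict[str, bool] = {}

    # 1. The case must end in the expected terminal status.
    results["target_status"] = (
        persisted.get("status") == ground_truth["expected_target_status"]
    )

    # 2. Each stage must carry the expected structured fields.
    for stage in ground_truth.get("stage_ground_truth", []):
        actual = persisted.get("stages", {}).get(stage["stage"], {})
        for name, expected in stage["expected_fields"].items():
            results[f"{stage['stage']}.{name}"] = actual.get(name) == expected

    # 3. Each clinical criterion must have the expected result.
    for crit in ground_truth.get("expected_clinical_criteria", []):
        actual = persisted.get("criteria", {}).get(crit["criterion_id"])
        results[f"criterion.{crit['criterion_id']}"] = (
            actual == crit["expected_criterion_result"]
        )

    # 4. All required diagnosis codes must appear on the service request.
    required = set(
        ground_truth.get("expected_service_request", {})
        .get("required_diagnosis_codes", [])
    )
    submitted = set(persisted.get("diagnosis_codes", []))
    results["diagnosis_codes"] = required <= submitted

    return results
```

Returning per-check booleans instead of a single verdict keeps the output usable both for a strict pass/fail decision and for a diagnostic breakdown.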
#### 3.2.2 Task Construction and Composition

Each task annotation consists of a sampled initial state $s_{0}\sim\rho_{0}$, a role assignment over $\mathcal{G}$, and a ground-truth trajectory clicked through the χ-World UI.

Step 1 – Case generation. The pipeline first samples a terminal world state for a case, then uses Claude Opus 4.7 with structured JSON sampling, conditioned on the relevant system state graph and the matching section of the *Managed-Care Operations Handbook*, to emit the upstream artifacts, including chart specifications, submission packets or personas, and per-stage rubric prompts, each of which is anchored to an explicit policy or state-graph citation.

Step 2 – Human walkthrough. An annotator works each case candidate end-to-end on the live χ-World UI with the handbook. The recorded trajectories, database states, workspace commits, and role handoffs become the ground truth.

![Refer to caption](https://arxiv.org/html/2605.16679v1/x7.png)

Figure 9: Task breakdown. Inner: Domain; Middle: PA/UM terminal state, CM patient persona; Outer: clinical/service category.

Step 3 – Multi-reviewer review. Each trajectory is reviewed by at least 1 practicing healthcare worker and 5 authors for clinical precision, and must clear a residual-PHI scan and a clinical-realism check before admission. The detailed human validation protocols are described in the appendix.

The annotation pipeline has produced 523 tasks, each assigned a difficulty band based on tool-call length and decision-tree depth. Candidates are retained only when every expected action resolves to a cited policy section and the chart and rubric entail each other without leaking the chosen path. We filter down to 75 representative, long-horizon tasks for quality and diversity, on which a human annotator needs 21 steps on average and at most 40 steps to finish. The task categories are depicted in [Figure 9](https://arxiv.org/html/2605.16679#S3.F9).

#### 3.2.3 Reward

The verifier ([Figure 10](https://arxiv.org/html/2605.16679#S3.F10)) scores each trial from the record the simulator itself persisted (world store, event log, and multi-turn transcripts), combining a deterministic contract with a rubric LLM judge into a binary reward $R=\mathrm{DeterministicPass}\land\mathrm{JudgePass}$, with a fractional scorecard for diagnostics.

![Refer to caption](https://arxiv.org/html/2605.16679v1/x8.png)

Figure 10: Verification pipeline. Each trial emits a persisted record to a *deterministic contract* verifier and a *rubric-based LLM judge* under strict-majority vote. A trial passes only when both layers pass.
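The two-layer reward can be sketched in a few lines. The conjunction and the strict-majority rule follow the description above; everything else is an assumption for illustration: the judge is represented as a plain list of boolean votes, and the fractional scorecard is taken to be the share of deterministic checks that passed, which the paper does not specify.

```python
from typing import Dict, List, Tuple


def strict_majority(votes: List[bool]) -> bool:
    """True only if more than half of the votes are passes (ties fail)."""
    return 2 * sum(votes) > len(votes)


def score_trial(checks: Dict[str, bool], judge_votes: List[bool]) -> Tuple[int, float]:
    """Return (binary reward, diagnostic scorecard) for one trial."""
    # Layer 1: every deterministic check must pass (an empty contract fails).
    deterministic_pass = bool(checks) and all(checks.values())
    # Layer 2: the rubric judge passes under a strict-majority vote.
    judge_pass = strict_majority(judge_votes)
    # R = DeterministicPass AND JudgePass.
    reward = int(deterministic_pass and judge_pass)
    # Hypothetical diagnostic: share of deterministic checks that passed.
    scorecard = sum(checks.values()) / len(checks) if checks else 0.0
    return reward, scorecard
```

Because the reward is a conjunction, a trial with a perfect scorecard still scores zero if the judge vote is tied or lost, which is one reason a fractional diagnostic is useful alongside the binary outcome.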

## 4 Experiments

### 4.1 Experiment Setup

We evaluate 30 agent harness/model configurations across two stacks: a *proprietary stack* pairing each frontier lab's first-party CLI (Claude Code \[[3](https://arxiv.org/html/2605.16679#bib.bib3)\], OpenAI Codex \[[37](https://arxiv.org/html/2605.16679#bib.bib37)\], Gemini CLI \[[15](https://arxiv.org/html/2605.16679#bib.bib15)\]) with that lab's closed-weight models \[[4](https://arxiv.org/html/2605.16679#bib.bib4), [38](https://arxiv.org/html/2605.16679#bib.bib38), [16](https://arxiv.org/html/2605.16679#bib.bib16)\], plus an *open-source stack* sweeping four agent frameworks (OpenClaw \[[39](https://arxiv.org/html/2605.16679#bib.bib39)\], Hermes \[[35](https://arxiv.org/html/2605.16679#bib.bib35)\], OpenAI Agents SDK \[[36](https://arxiv.org/html/2605.16679#bib.bib36)\] (*OAI Agents*), and DeepAgents \[[28](https://arxiv.org/html/2605.16679#bib.bib28)\]) over five OpenRouter-served open-weight models \[[12](https://arxiv.org/html/2605.16679#bib.bib12), [14](https://arxiv.org/html/2605.16679#bib.bib14), [27](https://arxiv.org/html/2605.16679#bib.bib27), [41](https://arxiv.org/html/2605.16679#bib.bib41), [54](https://arxiv.org/html/2605.16679#bib.bib54)\], plus an additional OpenClaw + Claude Opus 4.7 reference cell. For each task we run 3 independent trials and report pass@1 \[[9](https://arxiv.org/html/2605.16679#bib.bib9)\], pass@3, and pass^3 \[[59](https://arxiv.org/html/2605.16679#bib.bib59)\]. The evaluation protocol is shown in [Figure 10](https://arxiv.org/html/2605.16679#S3.F10). Detailed configurations such as sandbox, judge, and runtime are deferred to the appendix.
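With exactly 3 trials per task, the three metrics reduce to simple per-task quantities: pass@1 is the fraction of successful trials, pass@3 is 1 if any trial succeeds, and pass^3 is 1 only if all three succeed. The sketch below computes them along with a task-level percentile bootstrap interval of the kind reported with the results. The function names, the number of resamples, and the seed are assumptions for illustration, not the paper's evaluation code.

```python
import random
from typing import Dict, List, Tuple


def per_task_scores(trials: Dict[str, List[bool]]) -> Dict[str, List[float]]:
    """Per-task values for each metric, given 3 boolean trial outcomes per task."""
    return {
        "pass@1": [sum(t) / len(t) for t in trials.values()],  # mean success
        "pass@3": [float(any(t)) for t in trials.values()],    # at least one success
        "pass^3": [float(all(t)) for t in trials.values()],    # all trials succeed
    }


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def bootstrap_ci(
    values: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean, resampling tasks with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Resampling at the task level (rather than the trial level) keeps the three trials of a task together, so the interval reflects variation across tasks.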

### 4.2 χ-Bench Results

[Table 2](https://arxiv.org/html/2605.16679#S4.T2) summarizes benchmark performance across agent harnesses and frontier models.

Table 2: χ-Bench results across agent harnesses and frontier models. Per-column maxima are bolded. The three Overall columns show task-level bootstrap 95% CIs in $\mathrm{value}_{-\mathrm{lo}}^{+\mathrm{hi}}$ form; per-domain pass cells (PA = Prior Authorization, UM = Utilization Management, CM = Care Management) show the mean only, and the two *Efficiency* columns (Steps, Cost) are averaged over all 225 trials per row.

| Agent Harness | Model | Overall pass@1 | Overall pass@3 | Overall pass^3 | PA pass@1 | PA pass@3 | PA pass^3 | UM pass@1 | UM pass@3 | UM pass^3 | CM pass@1 | CM pass@3 | CM pass^3 | Steps | Cost (USD) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Codex | GPT-5.5 | $20.9_{-7.6}^{+8.4}$ | $30.7_{-10.7}^{+10.7}$ | $9.3_{-5.3}^{+8.0}$ | 29.3 | 40.0 | 16.0 | 32.0 | 48.0 | 12.0 | 1.3 | 4.0 | 0.0 | 54 | 1.29 |
| Codex | GPT-5.4 | $16.0_{-6.7}^{+7.1}$ | $25.3_{-9.3}^{+9.3}$ | $8.0_{-5.3}^{+6.7}$ | 24.0 | 32.0 | 16.0 | 17.3 | 24.0 | 8.0 | 6.7 | 20.0 | 0.0 | 58 | 1.30 |
| Codex | GPT-5.4 Mini | $8.4_{-4.0}^{+4.4}$ | $20.0_{-9.3}^{+10.7}$ | $0.0_{-0.0}^{+0.0}$ | 10.7 | 24.0 | 0.0 | 13.3 | 32.0 | 0.0 | 1.3 | 4.0 | 0.0 | 58 | 0.27 |
| Claude Code | Claude Opus 4.7 | $24.4_{-8.0}^{+8.4}$ | $\bm{41.3}_{-12.0}^{+12.0}$ | $10.7_{-6.7}^{+8.0}$ | 24.0 | 32.0 | 16.0 | 17.3 | 28.0 | 8.0 | 32.0 | 64.0 | 8.0 | 68 | 9.91 |
| Claude Code | Claude Opus 4.6 | $\bm{28.0}_{-8.4}^{+8.9}$ | $38.7_{-10.7}^{+10.7}$ | $\bm{18.7}_{-8.0}^{+9.3}$ | 18.7 | 24.0 | 12.0 | 41.3 | 44.0 | 40.0 | 24.0 | 48.0 | 4.0 | 76 | 6.47 |
| Claude Code | Claude Sonnet 4.6 | $26.2_{-8.0}^{+7.6}$ | $\bm{41.3}_{-10.7}^{+10.7}$ | $12.0_{-6.7}^{+8.0}$ | 24.0 | 28.0 | 20.0 | 34.7 | 52.0 | 16.0 | 20.0 | 44.0 | 0.0 | 82 | 1.30 |
| Claude Code | Claude Haiku 4.5 | $6.2_{-4.0}^{+5.3}$ | $10.7_{-6.7}^{+8.0}$ | $2.7_{-2.7}^{+4.0}$ | 0.0 | 0.0 | 0.0 | 14.7 | 24.0 | 8.0 | 4.0 | 8.0 | 0.0 | 41 | 0.16 |
| Gemini CLI | Gemini 3.1 Pro | $7.1_{-4.0}^{+4.9}$ | $13.3_{-6.7}^{+8.0}$ | $1.3_{-1.3}^{+2.7}$ | 14.7 | 24.0 | 4.0 | 6.7 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | 82 | 2.11 |
| Gemini CLI | Gemini 3 Flash | $12.4_{-6.2}^{+7.1}$ | $17.3_{-8.0}^{+9.3}$ | $8.0_{-5.3}^{+6.7}$ | 18.7 | 28.0 | 8.0 | 18.7 | 24.0 | 16.0 | 0.0 | 0.0 | 0.0 | 142 | 0.33 |
| OpenClaw | Claude Opus 4.7 | $17.3_{-5.8}^{+6.2}$ | $37.3_{-12.0}^{+10.7}$ | $4.0_{-4.0}^{+5.3}$ | 18.7 | 28.0 | 8.0 | 13.3 | 32.0 | 4.0 | 20.0 | 52.0 | 0.0 | 41 | 11.48 |
| OpenClaw | Kimi K2.6 | $10.2_{-4.9}^{+5.3}$ | $18.7_{-8.0}^{+9.3}$ | $2.7_{-2.7}^{+4.0}$ | 12.0 | 20.0 | 4.0 | 18.7 | 36.0 | 4.0 | 0.0 | 0.0 | 0.0 | 72 | 0.91 |
| OpenClaw | DeepSeek V4 Pro | $11.1_{-4.9}^{+5.3}$ | $24.0_{-9.3}^{+9.3}$ | $1.3_{-1.3}^{+2.7}$ | 14.7 | 28.0 | 4.0 | 12.0 | 28.0 | 0.0 | 6.7 | 16.0 | 0.0 | 42 | 0.53 |
| OpenClaw | GLM-5.1 | $16.9_{-6.7}^{+7.1}$ | $30.7_{-10.7}^{+10.7}$ | $6.7_{-5.3}^{+6.7}$ | 13.3 | 24.0 | 4.0 | 26.7 | 36.0 | 16.0 | 10.7 | 32.0 | 0.0 | 116 | 0.96 |
| OpenClaw | Qwen 3.6 Max | $4.9_{-3.1}^{+4.0}$ | $10.7_{-6.7}^{+8.0}$ | $0.0_{-0.0}^{+0.0}$ | 10.7 | 24.0 | 0.0 | 4.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 79 | 2.80 |
| OpenClaw | Grok 4.3 | $0.4_{-0.4}^{+0.9}$ | $1.3_{-1.3}^{+2.7}$ | $0.0_{-0.0}^{+0.0}$ | 1.3 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 65 | 2.66 |
| OAI Agents | Kimi K2.6 | $15.1_{-6.2}^{+7.1}$ | $22.7_{-9.3}^{+9.3}$ | $8.0_{-5.3}^{+6.7}$ | 17.3 | 28.0 | 12.0 | 25.3 | 36.0 | 12.0 | 2.7 | 4.0 | 0.0 | 60 | 0.43 |
| OAI Agents | DeepSeek V4 Pro | $14.2_{-6.2}^{+7.1}$ | $22.7_{-9.3}^{+9.3}$ | $9.3_{-5.3}^{+6.7}$ | 10.7 | 16.0 | 8.0 | 28.0 | 40.0 | 20.0 | 4.0 | 12.0 | 0.0 | 52 | 0.25 |
| OAI Agents | GLM-5.1 | $18.7_{-8.0}^{+8.4}$ | $26.7_{-9.3}^{+10.7}$ | $12.0_{-6.7}^{+8.0}$ | 18.7 | 24.0 | 12.0 | 33.3 | 44.0 | 24.0 | 4.0 | 12.0 | 0.0 | 58 | 0.27 |
| OAI Agents | Qwen 3.6 Max | $15.6_{-6.7}^{+8.0}$ | $22.7_{-9.3}^{+10.7}$ | $9.3_{-5.3}^{+6.7}$ | 16.0 | 20.0 | 12.0 | 26.7 | 36.0 | 16.0 | 4.0 | 12.0 | 0.0 | 48 | 0.58 |
| OAI Agents | Grok 4.3 | $5.8_{-3.6}^{+4.4}$ | $10.7_{-6.7}^{+8.0}$ | $1.3_{-1.3}^{+2.7}$ | 0.0 | 0.0 | 0.0 | 16.0 | 28.0 | 4.0 | 1.3 | 4.0 | 0.0 | 32 | 1.54 |
| Hermes | Kimi K2.6 | $15.6_{-6.7}^{+7.6}$ | $24.0_{-9.3}^{+10.7}$ | $6.7_{-5.3}^{+6.7}$ | 18.7 | 24.0 | 12.0 | 21.3 | 36.0 | 8.0 | 6.7 | 12.0 | 0.0 | 31 | 1.07 |

HermesDeepSeek V4 
Pro13\.8−6\.2\+7\.113\.8\_\{\\scriptscriptstyle\-6\.2\}^\{\\scriptscriptstyle\+7\.1\}22\.7−9\.3\+9\.322\.7\_\{\\scriptscriptstyle\-9\.3\}^\{\\scriptscriptstyle\+9\.3\}8\.0−5\.3\+6\.78\.0\_\{\\scriptscriptstyle\-5\.3\}^\{\\scriptscriptstyle\+6\.7\}8\.016\.04\.025\.332\.020\.08\.020\.00\.026$2\.19HermesGLM\-5\.118\.7−7\.1\+8\.018\.7\_\{\\scriptscriptstyle\-7\.1\}^\{\\scriptscriptstyle\+8\.0\}28\.0−9\.3\+10\.728\.0\_\{\\scriptscriptstyle\-9\.3\}^\{\\scriptscriptstyle\+10\.7\}10\.7−6\.7\+8\.010\.7\_\{\\scriptscriptstyle\-6\.7\}^\{\\scriptscriptstyle\+8\.0\}10\.716\.08\.034\.744\.024\.010\.724\.00\.030$1\.04HermesQwen 3\.6 Max16\.4−6\.7\+6\.716\.4\_\{\\scriptscriptstyle\-6\.7\}^\{\\scriptscriptstyle\+6\.7\}28\.0−10\.7\+10\.728\.0\_\{\\scriptscriptstyle\-10\.7\}^\{\\scriptscriptstyle\+10\.7\}5\.3−4\.0\+5\.35\.3\_\{\\scriptscriptstyle\-4\.0\}^\{\\scriptscriptstyle\+5\.3\}9\.316\.04\.026\.736\.012\.013\.332\.00\.029$4\.12HermesGrok 4\.34\.4−3\.1\+4\.44\.4\_\{\\scriptscriptstyle\-3\.1\}^\{\\scriptscriptstyle\+4\.4\}8\.0−5\.3\+6\.78\.0\_\{\\scriptscriptstyle\-5\.3\}^\{\\scriptscriptstyle\+6\.7\}1\.3−1\.3\+2\.71\.3\_\{\\scriptscriptstyle\-1\.3\}^\{\\scriptscriptstyle\+2\.7\}0\.00\.00\.013\.324\.04\.00\.00\.00\.032$1\.05DeepAgentsKimi K2\.63\.1−2\.2\+3\.13\.1\_\{\\scriptscriptstyle\-2\.2\}^\{\\scriptscriptstyle\+3\.1\}8\.0−5\.3\+6\.78\.0\_\{\\scriptscriptstyle\-5\.3\}^\{\\scriptscriptstyle\+6\.7\}0\.0−0\.0\+0\.00\.0\_\{\\scriptscriptstyle\-0\.0\}^\{\\scriptscriptstyle\+0\.0\}8\.020\.00\.01\.34\.00\.00\.00\.00\.039$0\.55DeepAgentsDeepSeek V4 
Pro10\.7−4\.9\+5\.810\.7\_\{\\scriptscriptstyle\-4\.9\}^\{\\scriptscriptstyle\+5\.8\}18\.7−8\.0\+9\.318\.7\_\{\\scriptscriptstyle\-8\.0\}^\{\\scriptscriptstyle\+9\.3\}2\.7−2\.7\+4\.02\.7\_\{\\scriptscriptstyle\-2\.7\}^\{\\scriptscriptstyle\+4\.0\}14\.724\.04\.010\.720\.04\.06\.712\.00\.015$0\.21DeepAgentsGLM\-5\.111\.1−5\.8\+6\.211\.1\_\{\\scriptscriptstyle\-5\.8\}^\{\\scriptscriptstyle\+6\.2\}17\.3−8\.0\+8\.017\.3\_\{\\scriptscriptstyle\-8\.0\}^\{\\scriptscriptstyle\+8\.0\}5\.3−4\.0\+5\.35\.3\_\{\\scriptscriptstyle\-4\.0\}^\{\\scriptscriptstyle\+5\.3\}17\.324\.012\.010\.716\.04\.05\.312\.00\.021$0\.26DeepAgentsQwen 3\.6 Max9\.3−4\.9\+5\.89\.3\_\{\\scriptscriptstyle\-4\.9\}^\{\\scriptscriptstyle\+5\.8\}16\.0−8\.0\+9\.316\.0\_\{\\scriptscriptstyle\-8\.0\}^\{\\scriptscriptstyle\+9\.3\}4\.0−4\.0\+5\.34\.0\_\{\\scriptscriptstyle\-4\.0\}^\{\\scriptscriptstyle\+5\.3\}12\.016\.08\.010\.716\.04\.05\.316\.00\.018$0\.57DeepAgentsGrok 4\.32\.2−1\.8\+2\.72\.2\_\{\\scriptscriptstyle\-1\.8\}^\{\\scriptscriptstyle\+2\.7\}5\.3−4\.0\+5\.35\.3\_\{\\scriptscriptstyle\-4\.0\}^\{\\scriptscriptstyle\+5\.3\}0\.0−0\.0\+0\.00\.0\_\{\\scriptscriptstyle\-0\.0\}^\{\\scriptscriptstyle\+0\.0\}0\.00\.00\.05\.312\.00\.01\.34\.00\.021$1\.43

![Refer to caption](https://arxiv.org/html/2605.16679v1/x9.png) (a) ROI quadrants.
![Refer to caption](https://arxiv.org/html/2605.16679v1/x10.png) (b) Reliability degradation.

Figure 11: ([11(a)](https://arxiv.org/html/2605.16679#S4.F11.sf1)) Each marker is one row of [Table 2](https://arxiv.org/html/2605.16679#S4.T2): $x$ = mean per-trial cost in USD (log scale), $y$ = Overall pass@1. Dashed cross-hairs at the median cost and median pass@1 split the plane into four quadrants (Sweet Spot, Premium, Budget, Overpriced); the Pareto-optimal frontier is connected with a dark line. ([11(b)](https://arxiv.org/html/2605.16679#S4.F11.sf2)) pass@$k$ (dotted) and pass^$k$ (solid) for $k \in \{1,2,3\}$, pooled across all 75 tasks.

##### Performance, Reliability and ROI.

Claude Code paired with Claude Opus 4.6 tops Overall pass@1 at 28.0%, with Sonnet 4.6 (26.2%), Opus 4.7 (24.4%), and Codex + GPT-5.5 (20.9%) close behind; the best domain-level rows are split across Opus 4.6 for UM (41.3%), Opus 4.7 for CM (32.0%), and Codex + GPT-5.5 for PA (29.3%). Reliability further collapses on repeat trials ([Figure 11(b)](https://arxiv.org/html/2605.16679#S4.F11.sf2)): pass^3 sits well below pass@1 for the main cells (Opus 4.6 28.0 → 18.7, GPT-5.5 20.9 → 9.3, OAI Agents + GLM-5.1 18.7 → 12.0, Hermes + Grok 4.3 4.4 → 1.3), exposing run-to-run inconsistency that any production deployment would need to close.
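To make the relationship between the three Overall metrics concrete, here is a minimal sketch under stated assumptions: each task is run for 3 trials, pass@1 is the per-task success rate averaged over tasks, pass@3 counts tasks with at least one passing trial, and pass^3 counts tasks where all three trials pass. The confidence interval uses a plain percentile bootstrap over tasks; the paper reports task-level bootstrap CIs but does not specify the exact variant, so the function names, the resample count, and the toy data below are illustrative assumptions rather than the authors' scoring code.

```python
import random

def pass_metrics(outcomes):
    """Return (pass@1, pass@k, pass^k) for per-task lists of trial booleans."""
    n_tasks = len(outcomes)
    # pass@1: per-task success rate, averaged over tasks
    pass_at_1 = sum(sum(trials) / len(trials) for trials in outcomes) / n_tasks
    # pass@k (k = trials per task): at least one trial passed
    pass_at_k = sum(any(trials) for trials in outcomes) / n_tasks
    # pass^k: every trial passed
    pass_hat_k = sum(all(trials) for trials in outcomes) / n_tasks
    return pass_at_1, pass_at_k, pass_hat_k

def bootstrap_ci(outcomes, metric_index=0, n_resamples=2000, seed=0):
    """Percentile 95% CI obtained by resampling whole tasks with replacement."""
    rng = random.Random(seed)
    n_tasks = len(outcomes)
    stats = sorted(
        pass_metrics([outcomes[rng.randrange(n_tasks)] for _ in range(n_tasks)])[metric_index]
        for _ in range(n_resamples)
    )
    return stats[int(0.025 * n_resamples)], stats[int(0.975 * n_resamples) - 1]

# Toy data (made up for illustration): 4 tasks x 3 trials
toy = [[True, True, True], [True, False, False], [False, False, False], [False, True, True]]
p1, p3, phat3 = pass_metrics(toy)
lo, hi = bootstrap_ci(toy)
print(f"pass@1={p1:.3f} pass@3={p3:.3f} pass^3={phat3:.3f} CI=({lo:.3f}, {hi:.3f})")
```

Because pass^3 requires every trial of a task to succeed, it can never exceed pass@1, which in turn can never exceed pass@3; the gap between them is what the reliability discussion above measures.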

The ROI quadrants in [Figure 11(a)](https://arxiv.org/html/2605.16679#S4.F11.sf1) separate absolute capability from cost-normalized value: high-performing configurations (e.g. Claude Code + Opus 4.6) sit in *Premium*, while OAI Agents + GLM-5.1 stands out as a strong cost-normalized point, anchoring the *Sweet Spot* and the low-cost end of the Pareto frontier. The *Overpriced* quadrant collects all Grok 4.3 cells, OpenClaw + Qwen 3.6 Max, and Gemini 3.1 Pro + Gemini CLI; the *Budget* quadrant contains low-cost rows whose savings come with below-median completion rates.
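The quadrant assignment can be reproduced from the Overall pass@1 and Cost columns of Table 2. The sketch below is an illustration, not the authors' plotting code: the tie-breaking rules (a row at or above the median pass@1 counts as high-performing, a row strictly below the median cost counts as cheap) and the dominance-based `pareto_frontier` helper are assumptions, and the frontier it returns may differ from the line drawn in the figure.

```python
from statistics import median

# (harness, model, Overall pass@1 in %, mean cost per trial in USD), copied from Table 2
ROWS = [
    ("Codex", "GPT-5.5", 20.9, 1.29),
    ("Codex", "GPT-5.4", 16.0, 1.30),
    ("Codex", "GPT-5.4 Mini", 8.4, 0.27),
    ("Claude Code", "Claude Opus 4.7", 24.4, 9.91),
    ("Claude Code", "Claude Opus 4.6", 28.0, 6.47),
    ("Claude Code", "Claude Sonnet 4.6", 26.2, 1.30),
    ("Claude Code", "Claude Haiku 4.5", 6.2, 0.16),
    ("Gemini CLI", "Gemini 3.1 Pro", 7.1, 2.11),
    ("Gemini CLI", "Gemini 3 Flash", 12.4, 0.33),
    ("OpenClaw", "Claude Opus 4.7", 17.3, 11.48),
    ("OpenClaw", "Kimi K2.6", 10.2, 0.91),
    ("OpenClaw", "DeepSeek V4 Pro", 11.1, 0.53),
    ("OpenClaw", "GLM-5.1", 16.9, 0.96),
    ("OpenClaw", "Qwen 3.6 Max", 4.9, 2.80),
    ("OpenClaw", "Grok 4.3", 0.4, 2.66),
    ("OAI Agents", "Kimi K2.6", 15.1, 0.43),
    ("OAI Agents", "DeepSeek V4 Pro", 14.2, 0.25),
    ("OAI Agents", "GLM-5.1", 18.7, 0.27),
    ("OAI Agents", "Qwen 3.6 Max", 15.6, 0.58),
    ("OAI Agents", "Grok 4.3", 5.8, 1.54),
    ("Hermes", "Kimi K2.6", 15.6, 1.07),
    ("Hermes", "DeepSeek V4 Pro", 13.8, 2.19),
    ("Hermes", "GLM-5.1", 18.7, 1.04),
    ("Hermes", "Qwen 3.6 Max", 16.4, 4.12),
    ("Hermes", "Grok 4.3", 4.4, 1.05),
    ("DeepAgents", "Kimi K2.6", 3.1, 0.55),
    ("DeepAgents", "DeepSeek V4 Pro", 10.7, 0.21),
    ("DeepAgents", "GLM-5.1", 11.1, 0.26),
    ("DeepAgents", "Qwen 3.6 Max", 9.3, 0.57),
    ("DeepAgents", "Grok 4.3", 2.2, 1.43),
]

def quadrant(pass1, cost, med_pass, med_cost):
    """Assumed tie rules: >= median pass@1 is 'high', < median cost is 'cheap'."""
    if pass1 >= med_pass:
        return "Sweet Spot" if cost < med_cost else "Premium"
    return "Budget" if cost < med_cost else "Overpriced"

def pareto_frontier(rows):
    """Rows not dominated by any cheaper-or-equal row with higher pass@1."""
    frontier, best = [], float("-inf")
    for row in sorted(rows, key=lambda r: (r[3], -r[2])):  # cheapest first
        if row[2] > best:
            frontier.append(row)
            best = row[2]
    return frontier

med_pass = median(r[2] for r in ROWS)
med_cost = median(r[3] for r in ROWS)
labels = {(h, m): quadrant(p, c, med_pass, med_cost) for h, m, p, c in ROWS}
overpriced = {key for key, label in labels.items() if label == "Overpriced"}
frontier = [(h, m) for h, m, _, _ in pareto_frontier(ROWS)]
print(sorted(overpriced))
```

Under these tie rules the Overpriced set is exactly the four Grok 4.3 rows, OpenClaw + Qwen 3.6 Max, and Gemini CLI + Gemini 3.1 Pro, matching the description above.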

### 4.3 χ-Bench-Arena: Can Prior Authorization Workflows be Automated End-to-End?

| Configuration | pass@1 |
| --- | --- |
| PA provider-only (23 tasks) | 30.4 |
| E2E two-agent | 0.0 |

Table 3: E2E two-agent PA vs. same-tasks single-agent baseline.

The arena runs a provider agent and a payer agent, both using Codex + GPT-5.5 (our best PA configuration), as a two-player game end-to-end on 23 PA tasks (two tasks not applicable to the two-agent setting are excluded). Each agent holds its own role-scoped MCPs and state, and the two exchange information only through MCP tools. Each side is scored independently; a trial passes only when every check on both sides passes. Pass@1 collapses from 30.4% to 0% once the payer agent and cross-role checks join: 2 tasks were never submitted, 18 did not reach an MD decision, and 5 failed the final judge. Peer-to-peer (P2P) handling fails on both sides: no P2P request appears on the 5 P2P-required tasks, while 2 spontaneous P2Ps occur.
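The conjunctive scoring rule (each side scored independently, a trial passing only if every check on both sides passes) can be sketched as follows. The `SideResult` type and the check names are hypothetical, chosen only to mirror the failure stages mentioned above; the benchmark's actual check list is not given in this section.

```python
from dataclasses import dataclass, field

@dataclass
class SideResult:
    """Checks recorded for one role (provider or payer) in a single trial."""
    role: str
    checks: dict = field(default_factory=dict)  # check name -> passed?

    @property
    def passed(self) -> bool:
        # A side with no recorded checks is treated as failed, not vacuously passed.
        return bool(self.checks) and all(self.checks.values())

def trial_passes(provider: SideResult, payer: SideResult) -> bool:
    """Two-agent trial passes only when both sides pass all of their checks."""
    return provider.passed and payer.passed

# Hypothetical check names, for illustration only
provider = SideResult("provider", {"packet_submitted": True, "final_judge": True})
payer_ok = SideResult("payer", {"md_decision_recorded": True, "final_judge": True})
payer_stalled = SideResult("payer", {"md_decision_recorded": False, "final_judge": True})

print(trial_passes(provider, payer_ok))       # both sides clean
print(trial_passes(provider, payer_stalled))  # payer never reaches an MD decision
```

A conjunction over two independently scored roles is strictly harder than either role alone, which is one reason a provider-only pass rate does not carry over to the end-to-end setting.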

### 4.4 χ-Bench-Marathon: Can Long-Running Agents Stay on Track Across All 25 Tasks?

χ-Bench-Marathon stress-tests long-horizon capabilities by loading all 25 tasks of a domain into a shared χ-World. The agent is instructed to finish all tasks within one agent run; it lists them via MCP tools and attempts them in any order. Context compaction follows the harness's default setting. Each case is scored individually after the agent reports completion. We evaluate Claude Code + Opus 4.7 and Codex + GPT-5.5. Pass@1 slumps for both configurations regardless of baseline ([Table 4](https://arxiv.org/html/2605.16679#S4.T4)). On PA, neither agent submits a single authorization across any of the 25 queued cases, despite touching most cases via write-side tool calls. On UM and CM, agents reach a finalized determination or care plan on only 3–8 of 25 cases per session. Codex + GPT-5.5 reaches its context window and auto-compacts 4–6 times per PA session and 1–2 times on UM; Claude Code + Opus 4.7, with a 1M-token context, never compacts yet completes a similar number of cases. In both cases, the agents fan out across the queue, save partial work, and fail to drive most cases to a terminal action.

Table 4: χ-Bench-Marathon pass@1 vs. the per-task baseline. Marathon = all 25 tasks queued in a single agent session, pass@1 averaged over 3 independent sessions; Per-task = isolated single-task trials from [Table 2](https://arxiv.org/html/2605.16679#S4.T2). Δ = Marathon − Per-task (percentage points). PA = Prior Authorization, UM = Utilization Management, CM = Care Management.

| Agent Harness | Model | PA Marathon | PA Per-task | PA Δ | UM Marathon | UM Per-task | UM Δ | CM Marathon | CM Per-task | CM Δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Codex | GPT-5.5 | 8.0 | 29.3 | −21.3 | 2.7 | 32.0 | −29.3 | 0.0 | 1.3 | −1.3 |
| Claude Code | Claude Opus 4.7 | 8.0 | 24.0 | −16.0 | 1.3 | 17.3 | −16.0 | 2.7 | 32.0 | −29.3 |
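As a quick arithmetic check of the Δ columns, the snippet below recomputes Marathon minus Per-task in percentage points from the values in Table 4. The dictionary layout and variable names are illustrative only.

```python
# (marathon pass@1, per-task pass@1) in %, copied from Table 4
TABLE_4 = {
    ("Codex", "GPT-5.5"): {"PA": (8.0, 29.3), "UM": (2.7, 32.0), "CM": (0.0, 1.3)},
    ("Claude Code", "Claude Opus 4.7"): {"PA": (8.0, 24.0), "UM": (1.3, 17.3), "CM": (2.7, 32.0)},
}

# Delta = Marathon - Per-task, rounded to one decimal to match the table
deltas = {
    config: {domain: round(marathon - per_task, 1) for domain, (marathon, per_task) in row.items()}
    for config, row in TABLE_4.items()
}

for config, row in deltas.items():
    print(config, row)
```

Every recomputed delta is negative, so the single-session setting lowers pass@1 in all three domains for both configurations.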

### 4.5 Effects of Handbook Skills Components

![Refer to caption](https://arxiv.org/html/2605.16679v1/x11.png)

Figure 12: Pass@1 under trimmed skills.

We trimmed the 1,279-document *Managed-Care Operations Handbook Skill* three ways (−*Domain* drops the domain handbook, −*Medical* drops the medical library, −*Both* drops both), ran all tasks with Codex + GPT-5.5, and found that the handbook's effect is domain-dependent ([Figure 12](https://arxiv.org/html/2605.16679#S4.F12)). UM is handbook-bound: −*Domain* collapses pass@1 from 32.0 to 17.3, while −*Medical* barely moves it. PA inverts: −*Both* modestly beats the other two trimming settings because, with one handbook present, the agent enters an exhaustive verification mode and refuses to submit when uncertain; with no handbook, it commits and the verifier accepts the packet. CM stays near the floor regardless: the difficulty lies in driving the conversation, not in policy. The finding is that large skills can help policy-heavy reviews, but can also induce over-verification, refusal, or cognitive overload.

### 4.6 MCP vs. CLI for Healthcare Agent Workflows

| Domain | MCP | CLI | Δ |
| --- | --- | --- | --- |
| PA | 29.3 | 28.0 | −1.3 |
| UM | 32.0 | 25.3 | −6.7 |
| CM | 1.3 | 4.0 | +2.7 |

Table 5: pass@1 of MCP vs. CLI.

As an exploratory probe, we re-surface every MCP tool as a CLI bash command via MCPorter \[[46](https://arxiv.org/html/2605.16679#bib.bib46)\] and re-run Codex + GPT-5.5 on the 75-task suite with 3 trials per task. [Table 5](https://arxiv.org/html/2605.16679#S4.T5) shows a small PA regression, a clear UM drop, and a small CM gain. On this configuration, MCPorter-style CLI re-surfacing is neutral-to-worse rather than uniformly beneficial. We hypothesize that the effect of tool surface format is neutral for out-of-distribution (OOD) tasks like healthcare workflows.

### 4.7 Failure Mode Analysis

We analyze all 5,886 failed trials with the two-layer taxonomy defined in the appendix: first-level categories capture the broad failure source, while second-level modes specify how the failure occurred. [Figure 13](https://arxiv.org/html/2605.16679#S4.F13) reports the first-level distribution, separating non-agent *Harness-Fault* (1.0%) from agent-side failures: *Clinical-Reasoning* (35.4%, medical or protocol judgment errors), *Workflow-Completion* (23.3%, a required terminal action was never invoked), *Abstain-or-Stuck* (15.6%, wall-clock timeouts, looping, premature closes, and explicit refusal to act), *Policy-Compliance* (13.2%, predominantly literal misreading of cited criterion text), *Tool-Use-Error* (10.7%, concentrated in DeepAgents, where a single malformed tool call escalates into a trial-fatal exit), and *Hallucination* (0.8%).

![Refer to caption](https://arxiv.org/html/2605.16679v1/x12.png)

Figure 13: Failure-mode distribution sorted by overall pass@1.

*Abstain-or-Stuck* concentrates in PA/CM and in DeepAgents + Kimi K2.6 and OpenClaw-based configurations. Nearly half of these trials simply exhaust the 1800 s wall-clock cap, and the rest are loops, premature closes, or refusals to act. We therefore read this category as a reliability and termination problem, whereas *Policy-Compliance* captures completed decisions based on misread criteria.

[Figure 14](https://arxiv.org/html/2605.16679#S4.F14) shows that the dominant second-level modes are *criteria misapplication*, where agents see the relevant evidence but make the wrong medical or protocol judgment, *skipped required steps* (18.7%), and *policy criteria misreading* (13.2%). We distinguish policy criteria misreading from criteria misapplication by the locus of error: the former misreads the rule text itself, while the latter applies the correct rule or evidence to the case incorrectly. A separate CM-specific mode, *illegitimate consent* (337 failures, 5.7%), captures concern-mining: the agent repeatedly reframes and expands care program scopes until an initially refusing member says "yes," instead of using autonomy-first engagement. Detailed failure-mode definitions, analysis, and case examples are in the appendix.
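The reported shares can be sanity-checked with a few lines of arithmetic. The snippet below only restates figures quoted in this section (the seven first-level percentages, the 5,886 failed trials, and the 337 illegitimate-consent failures); the variable names are illustrative.

```python
FAILED_TRIALS = 5886

# First-level failure categories and their share of failed trials (%), as reported
first_level_share = {
    "Harness-Fault": 1.0,
    "Clinical-Reasoning": 35.4,
    "Workflow-Completion": 23.3,
    "Abstain-or-Stuck": 15.6,
    "Policy-Compliance": 13.2,
    "Tool-Use-Error": 10.7,
    "Hallucination": 0.8,
}

# The first-level categories should partition the failed trials
total_share = round(sum(first_level_share.values()), 1)

# Share of the CM-specific illegitimate-consent mode among all failed trials
illegitimate_consent_share = round(100 * 337 / FAILED_TRIALS, 1)

print(total_share, illegitimate_consent_share)
```

The first-level shares add up to 100%, consistent with a partition of the failed trials, and 337 of 5,886 rounds to the quoted 5.7%.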

![Refer to caption](https://arxiv.org/html/2605.16679v1/x13.png)

Figure 14: Second-level failure modes. Percentages are over failed trials; colors show first-level categories.

## 5 Conclusion

We developed χ-Bench, a high-fidelity benchmark that evaluates agents on long-horizon healthcare operations (prior authorization, utilization management, and care management), grounded in a 1,279-document managed-care operations handbook. The strongest agent (Claude Code + Opus 4.6) resolves only 28.0% of tasks at pass@1, and no agent exceeds 20% at pass^3. Our analysis attributes most failures to three first-level categories: *Clinical-Reasoning* (35.4%), *Workflow-Completion* (23.3%), and *Policy-Compliance* (13.2%). Second-level modes such as *criteria misapplication*, *skipped required steps*, and *policy criteria misreading* show that failures arise from distinct bottlenecks. The CM-specific *illegitimate consent* mode further shows that an agent can advance the workflow while violating autonomy-first engagement, so completion alone is not an adequate safety criterion.

Limitations. χ-Bench evaluates language-only agents; real-world healthcare operations often require multimodal reasoning over imaging and speech. Additionally, while χ-World workflows are high-impact, the healthcare industry encompasses hundreds of long-tail workflows with practical value. Extending coverage along both axes is our immediate next step. Finally, Opus 4.7 is the only judge model, and the effect of using different judge models has yet to be studied.

Broader Impacts. χ-Bench is intentionally a stress test: a 28% pass@1 on a static benchmark suggests that deploying such agents in live patient care would be risky. The failures our analysis surfaces translate directly into clinical, financial, and regulatory harm if left unchecked. We release χ-Bench to expose these gaps and to encourage caution before agents are deployed on irreversible workflows where the affected party is a patient.

## References

- American Medical Association \[2024\] American Medical Association. 2024 AMA prior authorization physician survey. Presented at the Annual Meeting of the American Medical Association, Chicago, IL, 2024. URL [https://www.ama-assn.org/system/files/prior-authorization-survey.pdf](https://www.ama-assn.org/system/files/prior-authorization-survey.pdf).
- Anthropic \[2024\] Anthropic. Introducing the Model Context Protocol. [https://www.anthropic.com/news/model-context-protocol](https://www.anthropic.com/news/model-context-protocol), 2024. Accessed: 2026-04-30.
- Anthropic \[2025\] Anthropic. Claude Code. [https://github.com/anthropics/claude-code](https://github.com/anthropics/claude-code), 2025. Accessed: 2026-04-30.
- Anthropic \[2026\] Anthropic. Claude Opus 4.7 system card. [https://www.anthropic.com/system-cards](https://www.anthropic.com/system-cards), 2026. Accessed: 2026-04-30. Covers Claude Opus 4.7, Sonnet 4.6, and Haiku 4.5.
- Arora et al. \[2025\] R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, J. Heidecke, and K. Singhal. Healthbench: Evaluating large language models towards improved human health, 2025. URL [https://arxiv.org/abs/2505.08775](https://arxiv.org/abs/2505.08775).
- Barres et al. \[2025\] V. Barres, H. Dong, S. Ray, X. Si, and K. Narasimhan. $\tau^{2}$-bench: Evaluating conversational agents in a dual-control environment, 2025. URL [https://arxiv.org/abs/2506.07982](https://arxiv.org/abs/2506.07982).
- Bedi et al. \[2025\] S. Bedi, H. Cui, M. Fuentes, A. Unell, M. Wornow, J. M. Banda, N. Kotecha, T. Keyes, Y. Mai, M. Oez, et al. Medhelm: Holistic evaluation of large language models for medical tasks. *arXiv preprint arXiv:2505.23802*, 2025.
- Bedi et al. \[2026\] S. Bedi, R. Welch, E. Steinberg, M. Wornow, T. M. Kim, H. Ahmed, P. Sterling, B. Purohit, Q. Akram, A. Acosta, et al. Healthadminbench: Evaluating computer-use agents on healthcare administration tasks. *arXiv preprint arXiv:2604.09937*, 2026.
- Chen et al. \[2021\] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. *arXiv preprint arXiv:2107.03374*, 2021.
- Cuellar et al. \[2018\] A. Cuellar, A. H. Krist, L. M. Nichols, and A. J. Kuzel. Facilitators and barriers to care coordination in patient-centered medical homes (PCMHs) from coordinators' perspectives. *Journal of the American Board of Family Medicine*, 31(1):90–101, 2018. doi:10.3122/jabfm.2018.01.170133. PMC4809054.
- Cutler et al. \[2012\] D. Cutler, E. Wikler, and P. Basch. Reducing administrative costs and improving the health care system. *New England Journal of Medicine*, 367(20):1875–1878, 2012. doi:10.1056/NEJMp1209711.
- DeepSeek-AI \[2026\] DeepSeek-AI. DeepSeek-V4 Pro model card. [https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro), 2026. Accessed: 2026-04-30.
- Drouin et al. \[2024\] A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. D. Verme, T. Marty, L. Boisvert, M. Thakkar, Q. Cappart, D. Vazquez, N. Chapados, and A. Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? In *Proceedings of the 41st International Conference on Machine Learning (ICML)*, 2024. URL [https://arxiv.org/abs/2403.07718](https://arxiv.org/abs/2403.07718).
- GLM-5 Team \[2026\] GLM-5 Team. GLM-5: From vibe coding to agentic engineering. *arXiv preprint arXiv:2602.15763*, 2026.
- Google \[2025\] Google. Gemini CLI. [https://github.com/google-gemini/gemini-cli](https://github.com/google-gemini/gemini-cli), 2025. Accessed: 2026-04-30.
- Google DeepMind \[2026\] Google DeepMind. Gemini 3.1 Pro model card. [https://deepmind.google/models/model-cards/gemini-3-1-pro/](https://deepmind.google/models/model-cards/gemini-3-1-pro/), 2026. Accessed: 2026-04-30. Covers Gemini 3.1 Pro and Gemini 3 Flash.
- Harbor Framework \[2026\] Harbor Framework. Harbor: A framework for agent evaluations and RL environments. [https://github.com/harbor-framework/harbor](https://github.com/harbor-framework/harbor), 2026. Accessed: 2026-04-30.
- Jiang et al. \[2025\] Y. Jiang, K. C. Black, G. Geng, D. Park, J. Zou, A. Y. Ng, and J. H. Chen. Medagentbench: a virtual ehr environment to benchmark medical llm agents. *NEJM AI*, 2(9):AIdbp2500144, 2025.
- Jimenez et al. \[2024\] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URL [https://arxiv.org/abs/2310.06770](https://arxiv.org/abs/2310.06770).
- Jin et al. \[2021\] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. *Applied Sciences*, 11(14):6421, 2021. URL [https://arxiv.org/abs/2009.13081](https://arxiv.org/abs/2009.13081).
- Jin et al. \[2019\] Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu. PubMedQA: A dataset for biomedical research question answering. In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)*, pages 2567–2577. Association for Computational Linguistics, 2019. URL [https://arxiv.org/abs/1909.06146](https://arxiv.org/abs/1909.06146).
- Jones and Kelly \[2025\] A. Jones and C. Kelly. Code execution with mcp: Building more efficient agents, 2025.
- Ju \[2022\] H.-H. Ju. Improving care coordination of patients with chronic diseases. *The Journal for Nurse Practitioners*, 18(9):926–929, 2022. doi:10.1016/j.nurpra.2022.07.005.
- Kaelbling et al. \[1998\] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. *Artificial intelligence*, 101(1-2):99–134, 1998.
- Karam et al. \[2026\] M. Karam, M.-C. Chouinard, M. Kevork, R. Fleming, and A. Duhoux. Nurses' and patients' perspectives on care coordination across health care and social services sectors: A qualitative study. *SAGE Open Nursing*, 2026. doi:10.1177/08445621251395347.
- Khandekar et al. \[2024\] N. Khandekar, Q. Jin, G. Xiong, S. Dunn, S. S. Applebaum, Z. Anwar, M. Sarfo-Gyamfi, C. W. Safranek, A. A. Anwar, A. Zhang, A. Gilson, M. B. Singer, A. Dave, A. Taylor, A. Zhang, Q. Chen, and Z. Lu. MedCalc-Bench: Evaluating large language models for medical calculations. In *Advances in Neural Information Processing Systems 37: Datasets and Benchmarks Track*, 2024. URL [https://arxiv.org/abs/2406.12036](https://arxiv.org/abs/2406.12036).
- Kimi Team \[2025\] Kimi Team. Kimi K2: Open agentic intelligence. *arXiv preprint arXiv:2507.20534*, 2025.
- LangChain \[2025\] LangChain. DeepAgents. [https://github.com/langchain-ai/deepagents](https://github.com/langchain-ai/deepagents), 2025. Accessed: 2026-04-30.
- Lee et al. \[2022\] G. Lee, H. Hwang, S. Bae, Y. Kwon, W. Shin, S. Yang, M. Seo, J.-Y. Kim, and E. Choi. EHRSQL: A practical text-to-SQL benchmark for electronic health records. In *Advances in Neural Information Processing Systems 35: Datasets and Benchmarks Track*, 2022. URL [https://arxiv.org/abs/2301.07695](https://arxiv.org/abs/2301.07695).
- Li et al. \[2025\] J. Li, W. Zhao, J. Zhao, W. Zeng, H. Wu, X. Wang, R. Ge, Y. Cao, Y. Huang, W. Liu, et al. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution. *arXiv preprint arXiv:2510.25726*, 2025.
- Li et al. \[2026\] X. Li, W. Chen, Y. Liu, S. Zheng, X. Chen, Y. He, Y. Li, B. You, H. Shen, J. Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks. *arXiv preprint arXiv:2602.12670*, 2026.
- Liu et al. \[2025\] J. Liu, W. Wang, Z. Ma, G. Huang, Y. Su, K.-J. Chang, W. Chen, H. Li, L. Shen, and M. R. Lyu. MedChain: Bridging the gap between LLM agents and clinical practice through interactive sequential benchmarking. In *Advances in Neural Information Processing Systems 38: Datasets and Benchmarks Track*, 2025. URL [https://arxiv.org/abs/2412.01605](https://arxiv.org/abs/2412.01605).
- Merrill et al. \[2026\] M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. *arXiv preprint arXiv:2601.11868*, 2026.
- Modal Labs \[2025\] Modal Labs. Modal: High-performance serverless infrastructure for AI and data. [https://modal.com](https://modal.com/), 2025. Accessed: 2026-04-30.
- Nous Research \[2026\] Nous Research. Hermes Agent: The agent that grows with you. [https://github.com/NousResearch/hermes-agent](https://github.com/NousResearch/hermes-agent), 2026. Accessed: 2026-04-30.
- OpenAI \[2025a\] OpenAI. OpenAI Agents SDK (python). [https://github.com/openai/openai-agents-python](https://github.com/openai/openai-agents-python), 2025a. Accessed: 2026-04-30.
- OpenAI \[2025b\] OpenAI. OpenAI Codex CLI. [https://github.com/openai/codex](https://github.com/openai/codex), 2025b. Accessed: 2026-04-30.
- OpenAI \[2026\] OpenAI. GPT-5.5 system card. [https://openai.com/index/gpt-5-5-system-card/](https://openai.com/index/gpt-5-5-system-card/), 2026. Accessed: 2026-04-30. Covers the GPT-5.5, GPT-5.4, and GPT-5.4 Mini family.
- OpenClaw \[2025\] OpenClaw. OpenClaw: Your own personal ai assistant. [https://github.com/openclaw/openclaw](https://github.com/openclaw/openclaw), 2025. Accessed: 2026-04-30.
- Pal et al. \[2022\] A. Pal, L. K. Umapathi, and M. Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In *Proceedings of the Conference on Health, Inference, and Learning (CHIL)*, volume 174 of *Proceedings of Machine Learning Research*, pages 248–260. PMLR, 2022. URL [https://arxiv.org/abs/2203.14371](https://arxiv.org/abs/2203.14371).
- Qwen Team \[2025\] Qwen Team. Qwen3 technical report. *arXiv preprint arXiv:2505.09388*, 2025.
- Sahni et al. \[2023\] N. R. Sahni, P. Gupta, M. Peterson, and D. M. Cutler. Active steps to reduce administrative spending associated with financial transactions in US health care. *Health Affairs Scholar*, 1(5):qxad053, 2023. doi:10.1093/haschl/qxad053.
- Sahni et al. \[2024\] N. R. Sahni, B. Istvan, and D. M. Cutler. Perceptions of prior authorization burden and solutions. *Health Affairs Scholar*, 2(9):qxae096, 2024. doi:10.1093/haschl/qxae096.
- Schmidgall et al. \[2024\] S. Schmidgall, R. Ziaei, C. Harris, E. Reis, J. Jopling, and M. Moor. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments, 2024. URL [https://arxiv.org/abs/2405.07960](https://arxiv.org/abs/2405.07960).
- Sinsky et al. \[2016\] C. A. Sinsky, L. Colligan, L. Li, M. Prgomet, S. Reynolds, L. Goeders, J. Westbrook, M. Tutty, and G. Blike. Allocation of physician time in ambulatory practice: A time and motion study in 4 specialties. *Annals of Internal Medicine*, 165(11):753–760, 2016. doi:10.7326/M16-0961.
- Steinberger \[2025\] P. Steinberger. MCPorter: TypeScript runtime and CLI for connecting to MCP servers. [https://github.com/steipete/mcporter](https://github.com/steipete/mcporter), 2025. npm package mcporter; accessed 2026-05-03.
- Sutton et al. \[1999\] R. S. Sutton, D. Precup, and S. Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. *Artificial intelligence*, 112(1-2):181–211, 1999.
- Tang et al. \[2024\] X. Tang, B. Qian, R. Gao, J. Chen, X. Chen, and M. Gerstein. BioCoder: a benchmark for bioinformatics code generation with large language models. *Bioinformatics*, 40(Supplement\_1):i266–i276, 2024. doi:10.1093/bioinformatics/btae230. URL [https://arxiv.org/abs/2308.16458](https://arxiv.org/abs/2308.16458).
- Tang et al. \[2025\] X. Tang, D. Shao, J. Sohn, J. Chen, J. Zhang, J. Xiang, F. Wu, Y. Zhao, C. Wu, W. Shi, A. Cohan, and M. Gerstein. MedAgentsBench: Benchmarking thinking models and agent frameworks for complex medical reasoning, 2025. URL [https://arxiv.org/abs/2503.07459](https://arxiv.org/abs/2503.07459).
- Trivedi et al. \[2024\] H. Trivedi, T. Khot, M. Hartmann, R. Manku, V. Dong, E. Li, S. Gupta, A. Sabharwal, and N. Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)*. Association for Computational Linguistics, 2024. URL [https://arxiv.org/abs/2407.18901](https://arxiv.org/abs/2407.18901).
- Tsatsaronis et al. \[2015\] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, Y. Almirantis, J. Pavlopoulos, N. Baskiotis, P. Gallinari, T. Artières, A.-C. N. Ngomo, N. Heino, E. Gaussier, L. Barrio-Alvers, M. Schroeder, I. Androutsopoulos, and G. Paliouras. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. *BMC Bioinformatics*, 16(1):138, 2015. doi:10.1186/s12859-015-0564-6. URL [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0564-6](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0564-6).
- Wang et al. \[2024\] Z. Wang, B. Danek, Z. Yang, Z. Chen, and J. Sun. Can large language models replace data scientists in biomedical research?, 2024. URL [https://arxiv.org/abs/2410.21591](https://arxiv.org/abs/2410.21591).
- Wornow et al. \[2023\] M. Wornow, R. Thapa, E. Steinberg, J. A. Fries, and N. H. Shah. EHRSHOT: An EHR benchmark for few-shot evaluation of foundation models. In *Advances in Neural Information Processing Systems 36: Datasets and Benchmarks Track*, 2023. URL [https://arxiv.org/abs/2307.02028](https://arxiv.org/abs/2307.02028).
- xAI \[2025\] xAI. Grok 4 model card. [https://data.x.ai/2025-08-20-grok-4-model-card.pdf](https://data.x.ai/2025-08-20-grok-4-model-card.pdf), 2025. Accessed: 2026-04-30.
- Xie et al. \[2024\] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In *Advances in Neural Information Processing Systems 37: Datasets and Benchmarks Track*, 2024. URL [https://arxiv.org/abs/2404.07972](https://arxiv.org/abs/2404.07972).
- Xiong et al. \[2024\] G. Xiong, Q. Jin, Z. Lu, and A. Zhang. Benchmarking retrieval-augmented generation for medicine. In *Findings of the Association for Computational Linguistics: ACL 2024*. Association for Computational Linguistics, 2024. URL [https://arxiv.org/abs/2402.13178](https://arxiv.org/abs/2402.13178).
- Xu et al. \[2024\] F. F. Xu, Y. Song, B. Li, Y. Tang, K. Jain, M. Bao, Z. Z. Wang, X. Zhou, Z. Guo, M. Cao, M. Yang, H. Y. Lu, A. Martin, Z. Su, L. Maben, R. Mehta, W. Chi, L. Jang, Y. Xie, S. Zhou, and G. Neubig. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks, 2024. URL [https://arxiv.org/abs/2412.14161](https://arxiv.org/abs/2412.14161).
- Xu et al. \[2026\] R. Xu, Y. Zhuang, Y. Zhong, Y. Yu, Z. Wang, X. Tang, H. Wu, M. D. Wang, J. C. Ho, Y. Xiao, W. Shi, and C. Yang. MedAgentGym: A scalable agentic training environment for code-centric reasoning in biomedical data science. In *International Conference on Learning Representations (ICLR)*, 2026. URL [https://arxiv.org/abs/2506.04405](https://arxiv.org/abs/2506.04405).
- Yao et al\. \[2024\]S\. Yao, N\. Shinn, P\. Razavi, and K\. Narasimhan\.τ\-bench: A benchmark for tool\-agent\-user interaction in real\-world domains\.*arXiv preprint arXiv:2406\.12045*, 2024\.
- Zhang et al\. \[2025\]B\. Zhang, K\. Lazuka, and M\. Murag\.Equipping agents for the real world with agent skills\.*Anthropic Engineering Blog*, 2025\.
- Zhou et al\. \[2024\]S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig\.WebArena: A realistic web environment for building autonomous agents\.In*International Conference on Learning Representations \(ICLR\)*, 2024\.URL[https://arxiv\.org/abs/2307\.13854](https://arxiv.org/abs/2307.13854)\.
- Zuo et al\. \[2025\]Y\. Zuo, S\. Qu, Y\. Li, Z\. Chen, X\. Zhu, E\. Hua, K\. Zhang, N\. Ding, and B\. Zhou\.MedXpertQA: Benchmarking expert\-level medical reasoning and understanding\.*arXiv preprint arXiv:2501\.18362*, 2025\.
