This paper introduces CHI-Bench, a benchmark for evaluating AI agents on end-to-end automation of complex healthcare workflows that require policy-grounded decisions, multi-role composition, and multilateral interactions. In the experiments, the best agent resolves only 28% of tasks, which points to significant gaps in current agent capabilities in policy-dense enterprise domains.