CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

May 15, 2026
Authors: Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao
cs.AI

Abstract

End-to-end automation of realistic healthcare operations stresses three capabilities that are underrepresented in current benchmarks: policy density, where decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition, where a single task requires the agent to play multiple roles with handoffs between them; and multilateral interaction, where intermediate workflow steps are multi-turn dialogs such as peer-to-peer review and patient outreach. We introduce χ-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools. The agent must drive the case to a terminal status through tool calls and by writing the role's artifacts, guided by a managed-care operations handbook skill of 1,290+ documents. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on the strict pass^3 metric, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.
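
The abstract does not define pass^3. As a minimal sketch, assuming it follows the pass^k convention used in earlier agent benchmarks (a task is credited only when all k independent trials succeed), the metric could be estimated as below. The function names, task names, and trial outcomes are hypothetical and are not taken from χ-Bench.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials drawn without replacement all succeed.

    Assumed definition (not confirmed by the abstract): C(c, k) / C(n, k),
    where n is the number of trials and c the number of successful trials.
    """
    if k < 1 or k > num_trials:
        raise ValueError("k must be between 1 and the number of trials")
    # math.comb returns 0 when k > num_successes, so the estimate is 0 then.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over every task in the benchmark."""
    scores = [pass_hat_k(len(runs), sum(runs), k) for runs in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes: three trials per task, task names are made up.
example = {
    "prior_auth_case": [True, True, True],
    "util_mgmt_case": [True, False, True],
    "care_mgmt_case": [False, False, False],
    "outreach_case": [True, True, False],
}

strict = benchmark_pass_hat_k(example, k=3)  # only the first task counts
single = benchmark_pass_hat_k(example, k=1)  # ordinary per-trial success rate
print(f"pass^3 = {strict:.3f}, pass^1 = {single:.3f}")
```

Under this assumed definition, a task that succeeds in two of three trials contributes nothing to pass^3, which is one way a strict score can sit well below the single-trial resolution rate.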