CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
May 15, 2026
Authors: Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao
cs.AI
Abstract
End-to-end automation of realistic healthcare operations stresses three capabilities that are underrepresented in current benchmarks: policy density (decisions must be grounded in a large library of medical, insurance, and operational rules); multi-role composition (a single task requires the agent to play multiple roles with handoffs); and multilateral interaction (intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach). We introduce χ-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools. The agent must drive the case to a terminal status through tool calls and by writing the role's artifacts, guided by a managed-care operations handbook skill of 1,290+ documents. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.
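The abstract reports a strict pass^3 score without defining it. The sketch below shows one common reading of a pass^k metric, under the assumption that a task counts only when all k independent trials succeed, estimated per task as C(c, k) / C(n, k) for c successes in n trials. The function names, task names, and trial outcomes are hypothetical illustrations, not values or code from χ-Bench.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials drawn without replacement all succeed."""
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie between 0 and num_trials")
    if not 1 <= k <= num_trials:
        raise ValueError("k must lie between 1 and num_trials")
    # comb(c, k) is 0 whenever c < k, so a single failure zeroes the score when k == n.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks in the benchmark."""
    scores = [pass_hat_k(len(trials), sum(trials), k) for trials in results.values()]
    return sum(scores) / len(scores)


# Hypothetical outcomes: three trials per task (assumed, not from the paper).
results = {
    "prior_auth_case": [True, True, True],
    "utilization_review_case": [True, False, True],
    "care_management_case": [False, False, False],
}

pass_1 = benchmark_pass_hat_k(results, k=1)  # average single-trial success rate
pass_3 = benchmark_pass_hat_k(results, k=3)  # only tasks solved in all three trials count
print(f"pass^1 = {pass_1:.3f}, pass^3 = {pass_3:.3f}")
```

Under this reading, the task solved in two of three trials contributes to pass^1 but nothing to pass^3, which is why a strict pass^3 score can sit well below the single-run resolution rate.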