CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Dense Healthcare Workflows?

Abstract

End-to-end automation of realistic healthcare operations stresses three capabilities that are underrepresented in current benchmarks: policy density, where decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition, where a single task requires the agent to play multiple roles with handoffs between them; and multilateral interaction, where intermediate workflow steps are multi-turn dialogs such as peer-to-peer review and patient outreach. We introduce χ-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools. The agent must drive the case to a terminal status by making tool calls and writing each role's artifacts, guided by a managed-care operations handbook skill of more than 1,290 documents. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.