CHI-Bench: Can AI Agents Automate Long-Horizon, End-to-End, Policy-Rich Healthcare Workflows?

Abstract

End-to-end automation of realistic healthcare operations stresses three capabilities that current benchmarks underrepresent: policy density, where decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition, where a single task requires the agent to play multiple roles with handoffs between them; and multilateral interaction, where intermediate workflow steps are multi-turn dialogs such as peer-to-peer review and patient outreach. We introduce χ-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools. The agent must drive the case to a terminal status by making tool calls and writing each role's artifacts, guided by a managed-care operations handbook skill of more than 1,290 documents. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent exceeds 20% under the strict pass^3 criterion, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.
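
The abstract does not define pass^3. A minimal sketch follows, assuming it uses the common pass^k convention from agent benchmarks: a task counts only if all k independently sampled trials succeed, estimated per task as C(c, k) / C(n, k) for c successes in n trials. The function name `pass_hat_k` and the success counts are hypothetical illustrations, not data or code from the paper.

```python
from math import comb

def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k as the mean over tasks of C(c, k) / C(n, k).

    successes_per_task: number of successful trials c for each task
    n_trials: trials run per task (n); k: trials that must all succeed
    Assumed convention; the abstract itself does not spell it out.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    denom = comb(n_trials, k)
    # comb(c, k) is 0 whenever c < k, so such tasks contribute nothing.
    per_task = [comb(c, k) / denom for c in successes_per_task]
    return sum(per_task) / len(per_task)

# Hypothetical success counts for five tasks, three trials each.
successes = [3, 2, 0, 3, 1]
pass1 = pass_hat_k(successes, n_trials=3, k=1)  # average single-trial success
pass3 = pass_hat_k(successes, n_trials=3, k=3)  # all three trials must succeed
print(f"pass^1 = {pass1:.2f}, pass^3 = {pass3:.2f}")
```

Under this convention pass^3 can never exceed pass^1, which is consistent with the stricter criterion yielding lower scores than the single-trial resolution rate.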