PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Abstract

We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records, which agents access through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) that capture distinct stages of completion and are graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves a success rate (pass@1) of only 46%, while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic and execution-grounded benchmark for measuring progress toward autonomous clinical agents.
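The relationship between per-task checkpoints and a benchmark-level pass@1 figure can be illustrated with a minimal sketch. The abstract does not state how checkpoint outcomes are aggregated into task success, so the rule used here (a task counts as solved only when every one of its checkpoints passes) is an assumption, and `TaskResult`, `task_passed`, `pass_at_1`, and `checkpoint_rate` are hypothetical names that do not come from the benchmark's code. The sample data is made up purely to exercise the functions.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one agent attempt on one task (hypothetical structure)."""
    task_id: str
    checkpoints: list[bool]  # one entry per checkpoint, True if its check passed


def task_passed(result: TaskResult) -> bool:
    # Assumption: a task is solved only when all of its checkpoints pass.
    return len(result.checkpoints) > 0 and all(result.checkpoints)


def pass_at_1(results: list[TaskResult]) -> float:
    """Fraction of tasks solved, given exactly one attempt per task."""
    if not results:
        raise ValueError("no results to score")
    return sum(task_passed(r) for r in results) / len(results)


def checkpoint_rate(results: list[TaskResult]) -> float:
    """Fraction of all checkpoints passed, pooled across tasks."""
    total = sum(len(r.checkpoints) for r in results)
    if total == 0:
        raise ValueError("no checkpoints to score")
    return sum(sum(r.checkpoints) for r in results) / total


# Illustrative, made-up outcomes for two tasks (not benchmark data).
results = [
    TaskResult("task-a", [True, True, True]),
    TaskResult("task-b", [True, False, True, True]),
]
print(pass_at_1(results))        # share of fully solved tasks
print(checkpoint_rate(results))  # share of checkpoints passed overall
```

Under this assumed rule, an agent can pass most checkpoints and still score low on pass@1, since a single failed checkpoint fails the whole task.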
