PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
May 4, 2026
作者: Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler, Kavita Renduchintala, Ashwin Nayak, Prasantha L. Vemu, Shivam C. Vedak, Kameron C. Black, John L. Havlik, Isaac Ogunmola, Stephen P. Ma, Roopa Dhatt, Jonathan H. Chen
cs.AI
Abstract
We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) that capture distinct stages of completion and are graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves only a 46% success rate (pass@1), while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic and execution-grounded benchmark for measuring progress toward autonomous clinical agents.
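To make the evaluation design concrete, the sketch below illustrates what execution-grounded checkpoint grading could look like. It is a minimal illustration, not the authors' released harness: the FHIR-style endpoint, the `medication_ordered` checkpoint, and the rule that a task counts toward pass@1 only when all of its checkpoints pass are our assumptions, since the abstract names neither the API standard nor the exact success criterion.

```python
import requests

# Hypothetical sandbox endpoint; the abstract says tasks use "the same standard
# APIs used by commercial EHR vendors", which we assume here to mean FHIR REST.
FHIR_BASE = "http://localhost:8080/fhir"

def medication_ordered(patient_id: str, rxnorm_code: str) -> bool:
    """Execution-grounded check: inspect the EHR state the agent left behind
    rather than trusting its stated intent."""
    resp = requests.get(
        f"{FHIR_BASE}/MedicationRequest",
        params={
            "subject": f"Patient/{patient_id}",
            "code": rxnorm_code,
            "status": "active",
        },
        timeout=10,
    )
    resp.raise_for_status()
    # FHIR searchset Bundles typically report a match count in "total".
    return resp.json().get("total", 0) > 0

def grade_task(checkpoints: list) -> float:
    """Score one task as the fraction of its structured checkpoints passed.
    Each checkpoint is a zero-argument callable returning True/False."""
    return sum(1 for check in checkpoints if check()) / len(checkpoints)

def pass_at_1(task_scores: list) -> float:
    """Assumed success criterion: a single attempt counts as a success only
    if every checkpoint of the task passes (score == 1.0)."""
    return sum(score == 1.0 for score in task_scores) / len(task_scores)

# Example: one task with two checkpoints, one of them the EHR-state query above.
# checkpoints = [
#     lambda: medication_ordered("example-patient-id", "197361"),  # hypothetical RxNorm code
#     lambda: True,  # e.g., a script that checks the written clinical note
# ]
# print(grade_task(checkpoints))
```

The key design point this sketch tries to capture is that grading queries the environment after the agent acts, so an agent that merely claims to have placed an order, without the corresponding record appearing in the EHR, receives no credit for that checkpoint.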