PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
May 4, 2026
Authors: Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler, Kavita Renduchintala, Ashwin Nayak, Prasantha L. Vemu, Shivam C. Vedak, Kameron C. Black, John L. Havlik, Isaac Ogunmola, Stephen P. Ma, Roopa Dhatt, Jonathan H. Chen
cs.AI
Abstract
We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) that capture distinct stages of completion and are graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves only a 46% success rate (pass@1), while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic, execution-grounded benchmark for measuring progress toward autonomous clinical agents.
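The abstract does not name the "standard APIs used by commercial EHR vendors," but the interoperability standard those vendors expose is the HL7 FHIR REST API. Below is a minimal sketch of how an agent tool might retrieve a patient's lab results over a FHIR R4 server; the base URL, patient ID, and function names are hypothetical placeholders, and a real deployment would sit behind OAuth2 authorization (e.g., SMART on FHIR).

```python
import requests

# Hypothetical FHIR R4 endpoint; commercial EHRs expose equivalents,
# typically behind OAuth2 / SMART on FHIR authorization.
FHIR_BASE = "https://ehr.example.org/fhir"  # placeholder base URL

def get_lab_observations(patient_id: str, loinc_code: str) -> list[dict]:
    """Fetch a patient's lab results for one LOINC code, newest first.

    Uses standard FHIR Observation search parameters
    (patient, code, category, _sort) and returns raw resource dicts.
    """
    resp = requests.get(
        f"{FHIR_BASE}/Observation",
        params={
            "patient": patient_id,
            "code": f"http://loinc.org|{loinc_code}",
            "category": "laboratory",
            "_sort": "-date",
        },
        timeout=30,
    )
    resp.raise_for_status()
    bundle = resp.json()  # a FHIR Bundle resource
    return [entry["resource"] for entry in bundle.get("entry", [])]

# Example: HbA1c results (LOINC 4548-4) for a hypothetical patient.
for obs in get_lab_observations("example-patient-1", "4548-4"):
    value = obs.get("valueQuantity", {})
    print(obs.get("effectiveDateTime"), value.get("value"), value.get("unit"))
```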
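The abstract also describes tasks decomposed into checkpoints graded by task-specific scripts, with pass@1 as the headline metric. The sketch below illustrates one plausible scoring scheme under the assumption (not stated in the abstract) that a task counts as a success only when every checkpoint verifier passes against the final EHR state; all names and the toy checkpoints are illustrative, not the authors' actual harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Checkpoint:
    """One verifiable stage of a task, e.g. 'correct medication order placed'."""
    name: str
    verify: Callable[[dict], bool]  # task-specific script run against EHR state

def grade_task(checkpoints: list[Checkpoint], ehr_state: dict) -> dict:
    """Run each checkpoint verifier against the post-rollout EHR state."""
    results = {cp.name: cp.verify(ehr_state) for cp in checkpoints}
    return {
        "checkpoint_score": sum(results.values()) / len(results),
        "success": all(results.values()),  # assumed: all checkpoints must pass
        "results": results,
    }

def pass_at_1(task_grades: list[dict]) -> float:
    """pass@1 over one rollout per task: fraction of fully successful tasks."""
    return sum(g["success"] for g in task_grades) / len(task_grades)

# Toy example: two checkpoints for a hypothetical prescribing task.
checkpoints = [
    Checkpoint("retrieved_latest_hba1c",
               lambda s: s.get("hba1c_reviewed", False)),
    Checkpoint("metformin_order_placed",
               lambda s: any(o["drug"] == "metformin"
                             for o in s.get("orders", []))),
]
grade = grade_task(checkpoints, {"hba1c_reviewed": True, "orders": []})
print(grade["checkpoint_score"], grade["success"])  # 0.5 False
```

Reporting the per-checkpoint score alongside the strict success flag mirrors how checkpoint-style benchmarks credit partial progress while keeping pass@1 as the all-or-nothing headline number.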