PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Abstract

We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment populated with real patient records, which agents access through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), and require an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) that capture distinct stages of completion and are graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves a success rate (pass@1) of only 46%, while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic, execution-grounded benchmark for measuring progress toward autonomous clinical agents.
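As a rough illustration of checkpoint-based, execution-grounded grading, the sketch below scores a task by running scripted checks against the final environment state and then computes pass@1 over a set of single attempts. It is not the benchmark's actual grader. The names `Checkpoint`, `grade_task`, `task_succeeded`, and `pass_at_1`, the shape of the state dictionary, and the rule that a task succeeds only when every checkpoint passes are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Checkpoint:
    """One scripted check (hypothetical structure, not the benchmark's schema)."""
    name: str
    check: Callable[[Dict], bool]  # inspects the environment state after the agent run


def grade_task(checkpoints: List[Checkpoint], env_state: Dict) -> Dict[str, bool]:
    """Run every checkpoint script against the final environment state."""
    return {cp.name: bool(cp.check(env_state)) for cp in checkpoints}


def task_succeeded(results: Dict[str, bool]) -> bool:
    """ASSUMPTION: a task counts as solved only if all of its checkpoints pass."""
    return len(results) > 0 and all(results.values())


def pass_at_1(outcomes: List[bool]) -> float:
    """Fraction of tasks solved when each task is attempted exactly once."""
    if not outcomes:
        raise ValueError("no task outcomes to score")
    return sum(outcomes) / len(outcomes)


# Hypothetical final state: the orders placed and notes written by the agent.
example_state = {
    "orders": [{"type": "lab", "name": "example-lab-order"}],
    "notes": [],
}

# Hypothetical checkpoints that verify executed actions, not stated intent.
example_checkpoints = [
    Checkpoint("lab_ordered",
               lambda s: any(o["type"] == "lab" for o in s["orders"])),
    Checkpoint("note_written",
               lambda s: len(s["notes"]) > 0),
]

example_results = grade_task(example_checkpoints, example_state)
```

Because each check reads the environment state rather than the agent's transcript, an agent that only announces an order without placing it would fail `lab_ordered` in this sketch.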
