PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Abstract

We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) that capture distinct stages of completion and are graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves a success rate of only 46% (pass@1), while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic and execution-grounded benchmark for measuring progress toward autonomous clinical agents.
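
The checkpoint-based scoring described above can be illustrated with a minimal sketch. It is not the benchmark's actual grading code: the `TaskResult` structure and the function names are hypothetical, and it assumes that a task counts as solved only when every one of its checkpoints passes, which the abstract does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One agent attempt on one task (hypothetical structure for illustration)."""
    task_id: str
    checkpoints: list[bool]  # one entry per checkpoint; True if its grading script passed


def task_passed(result: TaskResult) -> bool:
    # Assumption: a task is solved only if it has checkpoints and all of them pass.
    return len(result.checkpoints) > 0 and all(result.checkpoints)


def pass_at_1(results: list[TaskResult]) -> float:
    # Fraction of tasks solved, given exactly one attempt per task.
    if not results:
        return 0.0
    return sum(task_passed(r) for r in results) / len(results)


def checkpoint_rate(results: list[TaskResult]) -> float:
    # Fraction of all checkpoints passed, pooled across tasks (partial credit view).
    total = sum(len(r.checkpoints) for r in results)
    passed = sum(sum(r.checkpoints) for r in results)
    return passed / total if total else 0.0


if __name__ == "__main__":
    # Made-up example data, not results from the benchmark.
    demo = [
        TaskResult("task-a", [True, True]),
        TaskResult("task-b", [True, False, False]),
    ]
    print(pass_at_1(demo), checkpoint_rate(demo))
```

Under this assumption, pass@1 is a strict all-or-nothing measure, while the pooled checkpoint rate shows how far agents get on tasks they do not fully complete.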
