Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

Abstract

Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model delivers care dynamically across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who portray clinical cases consistently, enabling realistic practice and objective, script-based assessment. Here we introduce MedSP1000, an interactive, SP-derived benchmark for evaluating clinical agents, comprising 1,638 SP cases and 24,602 peer-reviewed, trajectory-level rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios, each with a defined SP case script, a clinical environment context, and human-validated structured rubrics. In each simulated evaluation run, a clinical agent interacts in a closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against the expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably transfer to these educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, while the strongest medically specialized model reaches 40.0%; increasing test-time compute yields no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into real clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.