Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Cases

Abstract

Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, comprising 1,638 SP cases with 24,602 peer-reviewed, trajectory-level rubric items. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubrics. In each simulated evaluation run, a clinical agent interacts in a closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against the expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.