Evaluating Large Language Models in Dynamic Clinical Decision-Making Using Standardized Patient Cases

Abstract

Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, comprising 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubrics. In each simulation evaluation run, a clinical agent interacts in a closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably transfer to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, the strongest medically specialized model reaches only 40.0%, and increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.
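The closed-loop evaluation described above can be illustrated with a minimal sketch. It is not the MedSP1000 implementation: every name here (`Scenario`, `RubricItem`, `run_encounter`, `rubric_completion`) is hypothetical, the patient agent is a toy keyword lookup rather than an LLM, the turn limit stands in for the environment controller, and keyword matching stands in for however the benchmark actually checks rubric items. The score is simply the fraction of rubric items met over the whole trajectory.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class RubricItem:
    """One expert-defined criterion, checked against the whole transcript."""
    description: str
    is_met: Callable[[List[str]], bool]


@dataclass
class Scenario:
    """Hypothetical executable scenario: a patient script plus rubric items."""
    patient_script: Dict[str, str]  # keyword -> scripted patient reply
    rubric: List[RubricItem]
    max_turns: int = 5  # crude stand-in for an environment controller


def patient_agent(scenario: Scenario, message: str) -> str:
    """Toy patient: answers from the script when a keyword matches."""
    for keyword, reply in scenario.patient_script.items():
        if keyword in message.lower():
            return reply
    return "I'm not sure."


def run_encounter(scenario: Scenario, clinical_agent) -> List[str]:
    """Closed loop: the agent acts, the patient replies, until the agent
    stops (returns None) or the turn limit is reached."""
    transcript: List[str] = []
    observation = "Patient enters the room."
    for _ in range(scenario.max_turns):
        action: Optional[str] = clinical_agent(observation, transcript)
        if action is None:
            break
        transcript.append(f"clinician: {action}")
        observation = patient_agent(scenario, action)
        transcript.append(f"patient: {observation}")
    return transcript


def rubric_completion(scenario: Scenario, transcript: List[str]) -> float:
    """Fraction of rubric items satisfied by the full trajectory."""
    met = sum(1 for item in scenario.rubric if item.is_met(transcript))
    return met / len(scenario.rubric)


def clinician_mentions(keyword: str) -> Callable[[List[str]], bool]:
    """Assumed check: some clinician turn contains the keyword."""
    return lambda t: any(
        line.startswith("clinician:") and keyword in line.lower() for line in t
    )


def scripted_agent(observation: str, transcript: List[str]) -> Optional[str]:
    """Stand-in for an LLM clinical agent: asks two fixed questions, then stops."""
    questions = ["Do you have any chest pain?", "Do you have any allergies?"]
    asked = len(transcript) // 2
    return questions[asked] if asked < len(questions) else None


scenario = Scenario(
    patient_script={
        "chest pain": "Yes, since this morning.",
        "allergies": "Penicillin.",
    },
    rubric=[
        RubricItem("Asks about chest pain", clinician_mentions("chest pain")),
        RubricItem("Asks about allergies", clinician_mentions("allergies")),
        RubricItem("Asks about smoking history", clinician_mentions("smok")),
    ],
)
transcript = run_encounter(scenario, scripted_agent)
score = rubric_completion(scenario, transcript)
print(f"rubric completion: {score:.1%}")
```

In this toy run the scripted agent never asks about smoking, so it meets two of three rubric items. Scoring the whole transcript rather than a single final answer is what makes the measure process-level.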