Evaluating Large Language Models in Dynamic Clinical Decision-Making Using Standardized Patient Cases

Abstract

Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, comprising 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubrics. In each simulation evaluation run, a clinical agent interacts in a closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.
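
To make the closed-loop setup concrete, the following is a minimal Python sketch of one evaluation run, not the MedSP1000 implementation. Every name here (`RubricItem`, `run_encounter`, `patient_agent`, `environment_controller`, `mentions`, `completion_rate`) is hypothetical, the agents are scripted stubs rather than LLMs, and keyword matching stands in for whatever rubric-grading procedure the benchmark actually uses. It only illustrates the idea of scoring a whole trajectory as the fraction of rubric items completed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    description: str
    # Hypothetical checker: True if the trajectory satisfies this item.
    check: Callable[[List[str]], bool]


def make_scripted_clinician(actions):
    """Stand-in for an LLM clinical agent: replays a fixed list of actions."""
    remaining = iter(actions)

    def clinician(observation):
        return next(remaining)

    return clinician


def patient_agent(action):
    """Stub patient: answers from a tiny made-up script."""
    return "It started an hour ago." if "onset" in action else "I am not sure."


def environment_controller(action):
    """Stub environment: returns a placeholder result for ordered tests."""
    return "ECG result (placeholder)." if "ECG" in action else "No result available."


def run_encounter(clinician, n_turns):
    """Closed loop: each clinician action is routed to the environment or the
    patient, and the reply becomes the clinician's next observation."""
    trajectory = []
    observation = "A patient presents with chest pain."
    for _ in range(n_turns):
        action = clinician(observation)
        trajectory.append("clinician: " + action)
        if action.startswith("order:"):
            observation = environment_controller(action)
            trajectory.append("environment: " + observation)
        else:
            observation = patient_agent(action)
            trajectory.append("patient: " + observation)
    return trajectory


def mentions(keyword):
    """Build a checker that looks for a keyword in clinician lines only."""
    def check(trajectory):
        return any(
            line.startswith("clinician:") and keyword in line
            for line in trajectory
        )
    return check


def completion_rate(trajectory, rubric):
    """Fraction of rubric items satisfied anywhere in the trajectory."""
    return sum(item.check(trajectory) for item in rubric) / len(rubric)


actions = ["ask: When was the onset of the pain?", "order: ECG", "plan: give aspirin"]
trajectory = run_encounter(make_scripted_clinician(actions), n_turns=len(actions))
rubric = [
    RubricItem("Asks about onset", mentions("onset")),
    RubricItem("Orders an ECG", mentions("ECG")),
    RubricItem("Gives aspirin", mentions("aspirin")),
    RubricItem("Asks about allergies", mentions("allerg")),
]
rate = completion_rate(trajectory, rubric)
print(f"{rate:.1%}")  # 3 of 4 items satisfied -> 75.0%
```

In this toy run the scripted clinician never asks about allergies, so it completes three of four items. The reported figures (for example 60.4%) are read here as the same kind of ratio over expert-defined items, which is an interpretation of the abstract rather than a confirmed scoring formula.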