Evolving Diagnostic Agents in a Virtual Clinical Environment
October 28, 2025
Authors: Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Yun Yue, Qianrui Fan, Shuai Zhen, Jian Wang, Jinjie Gu, Yanfeng Wang, Ya Zhang, Weidi Xie
cs.AI
Abstract
In this paper, we present a framework for training large language models
(LLMs) as diagnostic agents with reinforcement learning, enabling them to
manage multi-turn diagnostic processes, adaptively select examinations, and
commit to final diagnoses. Unlike instruction-tuned models trained on static
case summaries, our method acquires diagnostic strategies through interactive
exploration and outcome-based feedback. Our contributions are fourfold: (i) We
present DiagGym, a diagnostics world model trained on electronic health
records that emits examination outcomes conditioned on patient history and
the recommended examination, serving as a virtual clinical environment for
realistic diagnosis training and evaluation; (ii) We train DiagAgent via
end-to-end, multi-turn reinforcement learning to learn diagnostic policies that
optimize both information yield and diagnostic accuracy; (iii) We introduce
DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated
examination recommendations and 99 cases annotated with 973 physician-written
rubrics on the diagnostic process; (iv) We demonstrate superior performance across
diverse diagnostic settings. DiagAgent significantly outperforms 10
state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two
prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34%
higher diagnostic accuracy and a 44.03% improvement in examination recommendation
hit ratio. In end-to-end settings, it delivers a 15.12% increase in diagnostic
accuracy and a 23.09% boost in examination recommendation F1 score. In
rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by
7.1% in weighted rubric score. These findings indicate that learning policies
in interactive clinical environments confers dynamic and clinically meaningful
diagnostic management abilities unattainable through passive training alone.
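
The multi-turn loop described in the abstract (recommend an examination, observe its result, then either continue or commit to a diagnosis) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: ToyEnvironment stands in for a learned world model by using a fixed lookup table, toy_policy is a hand-written rule rather than a trained agent, and exam_f1 is one plausible set-based reading of an examination recommendation F1 score, which the abstract does not define.

```python
from dataclasses import dataclass


@dataclass
class ToyEnvironment:
    """Hypothetical stand-in for a virtual clinical environment.

    A learned world model would generate results; here they come
    from a fixed lookup table supplied by the caller.
    """
    results: dict

    def run_exam(self, history: str, exam: str) -> str:
        # Return the stored result for the requested examination.
        return self.results.get(exam, "no result available")


def toy_policy(history, observations):
    """Hypothetical rule-based policy (not a trained agent).

    Returns ("exam", name) to request an examination,
    or ("diagnose", label) to commit to a final diagnosis.
    """
    if "cbc" not in observations:
        return ("exam", "cbc")
    if "ct_abdomen" not in observations:
        return ("exam", "ct_abdomen")
    return ("diagnose", "appendicitis")


def run_episode(env, policy, history, max_turns=5):
    """Alternate between policy and environment until a diagnosis or the turn limit."""
    observations = {}
    for _ in range(max_turns):
        action, value = policy(history, observations)
        if action == "diagnose":
            return list(observations), value
        observations[value] = env.run_exam(history, value)
    # Turn budget exhausted without a final diagnosis.
    return list(observations), None


def exam_f1(recommended, reference):
    """Set-based F1 between recommended and reference examinations (assumed definition)."""
    rec, ref = set(recommended), set(reference)
    overlap = len(rec & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(rec)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    env = ToyEnvironment({"cbc": "WBC elevated", "ct_abdomen": "inflamed appendix"})
    exams, diagnosis = run_episode(env, toy_policy, "right lower quadrant pain")
    print(exams, diagnosis)
    print(exam_f1(exams, ["cbc", "ct_abdomen", "urinalysis"]))
```

In a reinforcement learning setup, the final diagnosis and a score such as exam_f1 could feed an outcome-based reward, but the reward design used for DiagAgent is not specified in the abstract, so none is sketched here.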