Evolving Diagnostic Agents in a Virtual Clinical Environment

October 28, 2025
Authors: Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Yun Yue, Qianrui Fan, Shuai Zhen, Jian Wang, Jinjie Gu, Yanfeng Wang, Ya Zhang, Weidi Xie
cs.AI

Abstract

In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostic world model trained on electronic health records that emits examination outcomes conditioned on the patient history and the recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on the diagnostic process; (iv) We demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and a 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers a 15.12% increase in diagnostic accuracy and a 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.
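To make the multi-turn setup concrete, below is a minimal sketch of the agent-environment interaction loop the abstract describes: an agent repeatedly either orders an examination from a simulated clinical environment or commits to a diagnosis and receives outcome-based feedback. The names (ClinicalEnv, run_episode, scripted_policy), the canned examination results, and the reward weights are illustrative assumptions, not the paper's actual API; in the paper the environment is a learned EHR-conditioned generator (DiagGym) and the policy is an LLM trained with multi-turn reinforcement learning.

```python
# Hypothetical sketch of a multi-turn diagnostic episode; all names and
# reward weights are illustrative assumptions, not the paper's implementation.

from dataclasses import dataclass


@dataclass
class ClinicalEnv:
    """Stand-in for a DiagGym-style world model: given the accumulated patient
    history and a requested examination, it returns a simulated result."""
    chief_complaint: str
    ground_truth_diagnosis: str
    exam_results: dict  # examination name -> result text

    def reset(self):
        return {"history": [self.chief_complaint]}

    def step(self, state, exam_name):
        # The real system would generate this with a learned EHR-conditioned model;
        # here we simply look up a canned result.
        result = self.exam_results.get(exam_name, "unremarkable")
        state["history"].append(f"{exam_name}: {result}")
        return state


def run_episode(env, policy, max_turns=8, exam_penalty=0.05):
    """Roll out one multi-turn diagnostic episode.

    `policy` maps the history so far to either ("order", exam_name)
    or ("diagnose", diagnosis_text).
    """
    state = env.reset()
    for _ in range(max_turns):
        action, arg = policy(state["history"])
        if action == "diagnose":
            correct = arg.strip().lower() == env.ground_truth_diagnosis.lower()
            # Outcome-based reward trading diagnostic accuracy against the number
            # of examinations ordered (a crude proxy for information yield).
            reward = (1.0 if correct else 0.0) - exam_penalty * (len(state["history"]) - 1)
            return state["history"], arg, reward
        state = env.step(state, arg)
    return state["history"], None, -1.0  # never committed to a diagnosis


# Example rollout with a trivial scripted policy.
env = ClinicalEnv(
    chief_complaint="45-year-old with chest pain on exertion",
    ground_truth_diagnosis="stable angina",
    exam_results={"ECG": "ST depression on stress testing"},
)

def scripted_policy(history):
    return ("order", "ECG") if len(history) == 1 else ("diagnose", "stable angina")

trajectory, diagnosis, reward = run_episode(env, scripted_policy)
print(diagnosis, reward)
```

In this toy version the reward is returned only at the end of the episode; the abstract's end-to-end, multi-turn reinforcement learning would optimize an LLM policy against many such rollouts so that which examination to order next, and when to stop and diagnose, are learned rather than scripted.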