End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
August 21, 2025
Authors: Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang, Weidi Xie
cs.AI
Abstract
Accurate diagnosis with medical large language models is hindered by
knowledge gaps and hallucinations. Retrieval and tool-augmented methods help,
but their impact is limited by weak use of external knowledge and poor
feedback-reasoning traceability. To address these challenges, we introduce
Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement
learning (RL) that enables steerable, traceable retrieval-augmented reasoning for
medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical
retrieval corpus comprising patient records and reliable medical knowledge
sources to support retrieval-aware reasoning across diagnostic scenarios. More
crucially, we frame the LLM as the core agent and the retrieval corpus as its
environment, using tailored rewards on format, retrieval, reasoning structure,
and diagnostic accuracy, thereby evolving the agentic RAG policy from
large-scale data through RL.
Experiments demonstrate that our end-to-end agentic RL training framework
consistently outperforms prompt-engineering and training-free RAG approaches
across multiple data centers. After training, Deep-DxSearch achieves
substantial gains in diagnostic accuracy, surpassing strong diagnostic
baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks
for both common and rare disease diagnosis under in-distribution and
out-of-distribution settings. Moreover, ablation studies on reward design and
retrieval corpus components confirm their critical roles, underscoring the
uniqueness and effectiveness of our approach compared with traditional
implementations. Finally, case studies and interpretability analyses highlight
improvements in Deep-DxSearch's diagnostic policy, providing deeper insight
into its performance gains and supporting clinicians in delivering more
reliable and precise preliminary diagnoses. See
https://github.com/MAGIC-AI4Med/Deep-DxSearch.
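
The abstract names four reward terms (format, retrieval, reasoning structure, and diagnostic accuracy) but does not give their form. The following is a minimal sketch of how such a composite reward could be scored for a single rollout, not the authors' implementation. The tag vocabulary (`<think>`, `<search>`, `<result>`, `<answer>`), the weights, the function names, and the term-matching rule for retrieval are all assumptions made for illustration.

```python
import re

# Hypothetical weights: the abstract lists the four terms but not their values.
WEIGHTS = {"format": 0.1, "retrieval": 0.2, "structure": 0.2, "accuracy": 0.5}

# Hypothetical tag vocabulary for a rollout; the real action format may differ.
TAG_PATTERN = re.compile(r"<(think|search|result|answer)>(.*?)</\1>", re.DOTALL)


def parse_rollout(rollout):
    """Return the (tag, text) segments found in a rollout, in order."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(rollout)]


def composite_reward(rollout, gold_diagnosis, relevant_terms):
    """Score one rollout on the four terms and return them with a weighted total."""
    segments = parse_rollout(rollout)
    tags = [tag for tag, _ in segments]

    # Format: exactly one answer segment, and it comes last.
    r_format = 1.0 if tags.count("answer") == 1 and tags[-1:] == ["answer"] else 0.0

    # Retrieval: fraction of returned results mentioning any relevant term.
    results = [text.lower() for tag, text in segments if tag == "result"]
    hits = [any(term.lower() in text for term in relevant_terms) for text in results]
    r_retrieval = sum(hits) / len(hits) if hits else 0.0

    # Structure: some reasoning is present and every search is followed by a result.
    searches_answered = all(
        i + 1 < len(tags) and tags[i + 1] == "result"
        for i, tag in enumerate(tags)
        if tag == "search"
    )
    r_structure = 1.0 if "think" in tags and searches_answered else 0.0

    # Accuracy: case-insensitive exact match on the final answer.
    answers = [text for tag, text in segments if tag == "answer"]
    r_accuracy = 1.0 if answers and answers[-1].lower() == gold_diagnosis.lower() else 0.0

    scores = {
        "format": r_format,
        "retrieval": r_retrieval,
        "structure": r_structure,
        "accuracy": r_accuracy,
    }
    scores["total"] = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return scores
```

In an RL loop, a scalar like `scores["total"]` would be the signal used to update the policy after each rollout against the retrieval environment. Exact-match accuracy is the simplest possible choice here; a real diagnostic setting would likely need synonym or ontology-aware matching.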