End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
August 21, 2025
Authors: Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang, Weidi Xie
cs.AI
Abstract
Accurate diagnosis with medical large language models is hindered by
knowledge gaps and hallucinations. Retrieval and tool-augmented methods help,
but their impact is limited by weak use of external knowledge and poor
feedback-reasoning traceability. To address these challenges, we introduce
Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement
learning (RL) that enables steerable, traceable retrieval-augmented reasoning for
medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical
retrieval corpus comprising patient records and reliable medical knowledge
sources to support retrieval-aware reasoning across diagnostic scenarios. More
crucially, we frame the LLM as the core agent and the retrieval corpus as its
environment, using tailored rewards on format, retrieval, reasoning structure,
and diagnostic accuracy, thereby evolving the agentic RAG policy from
large-scale data through RL.
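To make the four reward components concrete, the following is a minimal sketch of how a composite reward over format, retrieval, reasoning structure, and diagnostic accuracy might be combined into one scalar for RL. It is not the Deep-DxSearch implementation: the tag names (`<think>`, `<diagnosis>`), the weights, and every scoring rule here are hypothetical choices made only for illustration.

```python
import re
from dataclasses import dataclass

# NOTE: all tag names, weights, and scoring rules are illustrative assumptions,
# not the reward definitions used by Deep-DxSearch.


@dataclass(frozen=True)
class RewardWeights:
    format: float = 0.1
    retrieval: float = 0.2
    structure: float = 0.2
    accuracy: float = 0.5


def _diagnoses(trajectory: str) -> list[str]:
    """Return the text of every <diagnosis>...</diagnosis> block in the rollout."""
    return re.findall(r"<diagnosis>(.*?)</diagnosis>", trajectory, re.S)


def format_reward(trajectory: str) -> float:
    """1.0 if the rollout contains exactly one diagnosis block, else 0.0."""
    return 1.0 if len(_diagnoses(trajectory)) == 1 else 0.0


def retrieval_reward(retrieved: list[str], gold: str) -> float:
    """Fraction of retrieved snippets that mention the gold diagnosis."""
    if not retrieved:
        return 0.0
    hits = sum(gold.lower() in snippet.lower() for snippet in retrieved)
    return hits / len(retrieved)


def structure_reward(trajectory: str) -> float:
    """1.0 if a <think> block opens before the <diagnosis> block, else 0.0."""
    think_idx = trajectory.find("<think>")
    diag_idx = trajectory.find("<diagnosis>")
    return 1.0 if 0 <= think_idx < diag_idx else 0.0


def accuracy_reward(trajectory: str, gold: str) -> float:
    """1.0 if the single predicted diagnosis matches the gold label exactly."""
    found = _diagnoses(trajectory)
    if len(found) != 1:
        return 0.0
    return 1.0 if found[0].strip().lower() == gold.strip().lower() else 0.0


def total_reward(
    trajectory: str,
    retrieved: list[str],
    gold: str,
    w: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of the four component rewards."""
    return (
        w.format * format_reward(trajectory)
        + w.retrieval * retrieval_reward(retrieved, gold)
        + w.structure * structure_reward(trajectory)
        + w.accuracy * accuracy_reward(trajectory, gold)
    )
```

In an RL loop of the kind the abstract describes, a scalar like `total_reward(rollout_text, retrieved_snippets, gold_label)` would be computed once per rollout and passed to the policy-gradient update. Exact string matching is the simplest possible accuracy check; a real system would likely need synonym or ontology-aware matching.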
Experiments demonstrate that our end-to-end agentic RL training framework
consistently outperforms prompt-engineering and training-free RAG approaches
across multiple data centers. After training, Deep-DxSearch achieves
substantial gains in diagnostic accuracy, surpassing strong diagnostic
baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks
for both common and rare disease diagnosis under in-distribution and
out-of-distribution settings. Moreover, ablation studies on reward design and
retrieval corpus components confirm their critical roles, underscoring the
uniqueness and effectiveness of our approach compared with traditional
implementations. Finally, case studies and interpretability analyses highlight
improvements in Deep-DxSearch's diagnostic policy, providing deeper insight
into its performance gains and supporting clinicians in delivering more
reliable and precise preliminary diagnoses. See
https://github.com/MAGIC-AI4Med/Deep-DxSearch.