End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

Abstract

Accurate diagnosis with medical large language models (LLMs) is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crucially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL. Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch's diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See https://github.com/MAGIC-AI4Med/Deep-DxSearch.
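
As a minimal illustrative sketch, and not the released implementation, the four reward signals named above (format, retrieval, reasoning structure, and diagnostic accuracy) could be combined into a single scalar for RL training as a weighted sum. The weighted-sum form, the weight values, the names `RewardWeights` and `composite_reward`, and the assumption that each component is scored in [0, 1] are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Mapping

# Names of the four reward components mentioned in the abstract.
COMPONENTS = ("format", "retrieval", "structure", "accuracy")


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights; the values here are assumptions, not the paper's."""
    format: float = 0.1
    retrieval: float = 0.2
    structure: float = 0.2
    accuracy: float = 0.5


def composite_reward(
    scores: Mapping[str, float],
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Return the weighted sum of the four component scores.

    Each score is assumed to lie in [0, 1]; a ValueError is raised otherwise.
    """
    total = 0.0
    for name in COMPONENTS:
        score = scores[name]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {score}")
        total += getattr(weights, name) * score
    return total


if __name__ == "__main__":
    # Example rollout: well-formed output, partially useful retrieval,
    # sound reasoning structure, and a correct diagnosis.
    rollout_scores = {
        "format": 1.0,
        "retrieval": 0.5,
        "structure": 1.0,
        "accuracy": 1.0,
    }
    print(composite_reward(rollout_scores))
```

In a sketch like this, giving diagnostic accuracy the largest weight keeps the final diagnosis as the dominant training signal, while the smaller terms still reward well-formed, retrieval-grounded reasoning traces.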
