
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

October 16, 2024
作者: Peng Xia, Kangyu Zhu, Haoran Li, Tianze Wang, Weijia Shi, Sheng Wang, Linjun Zhang, James Zou, Huaxiu Yao
cs.AI

Abstract

Artificial Intelligence (AI) has demonstrated significant potential in healthcare, particularly in disease diagnosis and treatment planning. Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. However, these models often suffer from factual hallucination, which can lead to incorrect diagnoses. Fine-tuning and retrieval-augmented generation (RAG) have emerged as methods to address these issues. However, the scarcity of high-quality data and distribution shifts between training data and deployment data limit the applicability of fine-tuning methods. Although RAG is lightweight and effective, existing RAG-based approaches do not generalize well across different medical domains and can cause misalignment issues, both between modalities and between the model and the ground truth. In this paper, we propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs. Our approach introduces a domain-aware retrieval mechanism, an adaptive retrieved-context selection method, and a provable RAG-based preference fine-tuning strategy. These innovations make the RAG process sufficiently general and reliable, significantly improving alignment when introducing retrieved contexts. Experimental results across five medical datasets (spanning radiology, ophthalmology, and pathology) on medical VQA and report generation demonstrate that MMed-RAG achieves an average improvement of 43.8% in the factual accuracy of Med-LVLMs. Our data and code are available at https://github.com/richard-peng-xia/MMed-RAG.
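The abstract names two retrieval-side ideas: a domain-aware retrieval mechanism and an adaptive retrieved-context selection method. A minimal sketch of how such a pipeline could fit together is shown below; all names, the keyword-based domain classifier, the toy knowledge bases, and the `margin` threshold are illustrative assumptions, not details from the paper (which, e.g., would classify domain from the medical image itself).

```python
# Hypothetical sketch, not the paper's implementation:
# (1) domain-aware retrieval: route a query to a per-domain retriever;
# (2) adaptive retrieved-context selection: keep only contexts whose
#     similarity is close to the best match, so k adapts per query
#     instead of being a fixed top-k.
from dataclasses import dataclass


@dataclass
class Retrieved:
    text: str
    score: float  # similarity in [0, 1]


# Stand-ins for real per-domain retrievers / knowledge bases.
KNOWLEDGE = {
    "radiology": [
        Retrieved("chest x-ray: cardiomegaly assessment criteria", 0.91),
        Retrieved("CT windowing basics", 0.42),
    ],
    "ophthalmology": [
        Retrieved("fundus: diabetic retinopathy grading scale", 0.88),
    ],
}


def classify_domain(query: str) -> str:
    # Toy text-based domain router for illustration only.
    return "radiology" if "x-ray" in query.lower() else "ophthalmology"


def adaptive_select(candidates: list[Retrieved], margin: float = 0.3) -> list[Retrieved]:
    # Keep contexts within `margin` of the best score; the number of
    # retained contexts therefore varies with retrieval quality.
    if not candidates:
        return []
    best = max(c.score for c in candidates)
    return [c for c in candidates if c.score >= best - margin]


def retrieve(query: str) -> list[str]:
    domain = classify_domain(query)
    return [c.text for c in adaptive_select(KNOWLEDGE[domain])]


print(retrieve("Is there cardiomegaly on this chest x-ray?"))
```

With the toy scores above, the weak "CT windowing" context falls outside the 0.3 margin of the best match and is dropped, illustrating how adaptive selection filters low-relevance contexts before they reach the language model.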

