MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

Abstract

Artificial Intelligence (AI) has demonstrated significant potential in healthcare, particularly in disease diagnosis and treatment planning. Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. However, these models often suffer from factual hallucination, which can lead to incorrect diagnoses. Fine-tuning and retrieval-augmented generation (RAG) have emerged as methods to address these issues. Fine-tuning, however, is constrained by the limited amount of high-quality data and by distribution shifts between training data and deployment data. RAG is lightweight and effective, but existing RAG-based approaches do not generalize well across medical domains and can cause misalignment, both between modalities and between the model and the ground truth. In this paper, we propose MMed-RAG, a versatile multimodal RAG system designed to enhance the factuality of Med-LVLMs. Our approach introduces a domain-aware retrieval mechanism, an adaptive method for selecting retrieved contexts, and a provable RAG-based preference fine-tuning strategy. These innovations make the RAG process sufficiently general and reliable, and significantly improve alignment when retrieved contexts are introduced. Experimental results on five medical datasets (spanning radiology, ophthalmology, and pathology) for medical VQA and report generation show that MMed-RAG improves the factual accuracy of Med-LVLMs by 43.8% on average. Our data and code are available at https://github.com/richard-peng-xia/MMed-RAG.

