RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models

Abstract

The recent emergence of Medical Large Vision Language Models (Med-LVLMs) has enhanced medical diagnosis. However, current Med-LVLMs frequently encounter factual issues, often generating responses that do not align with established medical facts. Retrieval-Augmented Generation (RAG), which utilizes external knowledge, can improve the factual accuracy of these models but introduces two major challenges. First, too few retrieved contexts might not cover all necessary information, while excessive retrieval can introduce irrelevant and inaccurate references that interfere with the model's generation. Second, in cases where the model originally responds correctly, applying RAG can lead to over-reliance on retrieved contexts, resulting in incorrect answers. To address these issues, we propose RULE, which consists of two components. First, we introduce a provably effective strategy for controlling factuality risk through the calibrated selection of the number of retrieved contexts. Second, based on samples where over-reliance on retrieved contexts led to errors, we curate a preference dataset to fine-tune the model, balancing its dependence on inherent knowledge and retrieved contexts for generation. We demonstrate the effectiveness of RULE on three medical VQA datasets, achieving an average improvement of 20.8% in factual accuracy. We publicly release our benchmark and code at https://github.com/richard-peng-xia/RULE.
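The second component described above (curating a preference dataset from samples where over-reliance on retrieved contexts led to errors) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` fields, the exact-match correctness check, and the `prompt`/`chosen`/`rejected` record layout are all assumptions made for illustration. The sketch keeps only samples that the model answered correctly without retrieval but incorrectly with it, and pairs the ground-truth answer (preferred) with the retrieval-induced wrong answer (dispreferred).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """One VQA item with the model's answers in both settings (hypothetical layout)."""
    question: str
    ground_truth: str
    answer_without_rag: str
    answer_with_rag: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def is_correct(answer: str, truth: str) -> bool:
    """Assumed correctness check: exact match after normalization."""
    return normalize(answer) == normalize(truth)


def build_preference_pairs(samples: List[Sample]) -> List[Dict[str, str]]:
    """Keep samples that were right without retrieval but wrong with it."""
    pairs = []
    for s in samples:
        right_alone = is_correct(s.answer_without_rag, s.ground_truth)
        wrong_with_rag = not is_correct(s.answer_with_rag, s.ground_truth)
        if right_alone and wrong_with_rag:
            pairs.append({
                "prompt": s.question,
                "chosen": s.ground_truth,       # preferred response
                "rejected": s.answer_with_rag,  # error attributed to over-reliance
            })
    return pairs


if __name__ == "__main__":
    demo = [
        Sample("Is there pleural effusion?", "No", "no", "Yes"),
        Sample("Is the heart enlarged?", "Yes", "Yes", "Yes"),
        Sample("Is there a fracture?", "No", "Yes", "Yes"),
    ]
    for pair in build_preference_pairs(demo):
        print(pair)
```

Records in this shape could then be passed to a preference-based fine-tuning method; which method is used, and how correctness is judged on free-form answers, would follow the released code rather than this sketch.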
