RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models
July 6, 2024
Authors: Peng Xia, Kangyu Zhu, Haoran Li, Hongtu Zhu, Yun Li, Gang Li, Linjun Zhang, Huaxiu Yao
cs.AI
Abstract
The recent emergence of Medical Large Vision Language Models (Med-LVLMs) has
enhanced medical diagnosis. However, current Med-LVLMs frequently encounter
factual issues, often generating responses that do not align with established
medical facts. Retrieval-Augmented Generation (RAG), which utilizes external
knowledge, can improve the factual accuracy of these models but introduces two
major challenges. First, limited retrieved contexts might not cover all
necessary information, while excessive retrieval can introduce irrelevant and
inaccurate references, interfering with the model's generation. Second, in
cases where the model originally responds correctly, applying RAG can lead to
an over-reliance on retrieved contexts, resulting in incorrect answers. To
address these issues, we propose RULE, which consists of two components. First,
we introduce a provably effective strategy for controlling factuality risk
through the calibrated selection of the number of retrieved contexts. Second,
based on samples where over-reliance on retrieved contexts led to errors, we
curate a preference dataset to fine-tune the model, balancing its dependence on
inherent knowledge and retrieved contexts for generation. We demonstrate the
effectiveness of RULE on three medical VQA datasets, achieving an average
improvement of 20.8% in factual accuracy. We publicly release our benchmark and
code at https://github.com/richard-peng-xia/RULE.
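
The two components described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository for that). It assumes a held-out calibration set with 0/1 error labels for each candidate number of retrieved contexts k, and it swaps in a simple Hoeffding upper bound with a union bound as the calibration test. The function names, the dictionary keys (`answer_no_rag`, `answer_rag`, `ground_truth`), and the choice of the ground-truth answer as the preferred response are all hypothetical.

```python
import math
from typing import Dict, List


def hoeffding_upper_bound(errors: List[int], delta: float) -> float:
    """Upper confidence bound on the true error rate from 0/1 calibration errors."""
    n = len(errors)
    return sum(errors) / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def calibrated_context_counts(
    errors_by_k: Dict[int, List[int]], alpha: float, delta: float
) -> List[int]:
    """Return every k whose error upper bound is at most the risk level alpha."""
    # Assumption: split delta evenly across candidates (union bound) so the
    # bound holds for all returned k simultaneously with probability >= 1 - delta.
    per_k_delta = delta / len(errors_by_k)
    return [
        k
        for k in sorted(errors_by_k)
        if hoeffding_upper_bound(errors_by_k[k], per_k_delta) <= alpha
    ]


def build_preference_pairs(samples: List[dict]) -> List[dict]:
    """Keep samples answered correctly without retrieval but wrongly with it."""
    pairs = []
    for s in samples:
        correct_alone = s["answer_no_rag"] == s["ground_truth"]
        wrong_with_rag = s["answer_rag"] != s["ground_truth"]
        if correct_alone and wrong_with_rag:
            pairs.append(
                {
                    "question": s["question"],
                    # Assumption: the ground truth serves as the preferred response.
                    "chosen": s["ground_truth"],
                    "rejected": s["answer_rag"],
                }
            )
    return pairs
```

The resulting pairs use a prompt/chosen/rejected layout, which is a common input shape for preference fine-tuning. The abstract does not specify the exact calibration test or data format.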