

GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning

June 22, 2025
作者: Bo Liu, Xiangyu Zhao, Along He, Yidi Chen, Huazhu Fu, Xiao-Ming Wu
cs.AI

Abstract

Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers. To address these issues, this work first proposes a Thinking with Visual Grounding (ThinkVG) dataset, wherein answer generation is decomposed into intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposed approach. The dataset is available at https://huggingface.co/datasets/BoKelvin/GEMeX-ThinkVG.
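The abstract describes two components: reasoning steps that are grounded in image regions, and a verifiable reward used during reinforcement-learning post-training. The following Python sketch illustrates what such a grounded training record and a simple verifiable reward might look like; the field names (`think_vg`, `region`), the `<think>`/`<answer>` response format, and the reward values are assumptions for illustration, not the authors' released data schema or implementation.

```python
# Hypothetical sketch (not the paper's released code): a ThinkVG-style training
# record with region-grounded reasoning steps, plus a toy verifiable reward.
import re

# Illustrative record: field names and box coordinates are assumptions about
# how grounded intermediate reasoning steps could be stored.
record = {
    "image": "chest_xray_001.png",
    "question": "Is there evidence of pleural effusion?",
    "think_vg": [
        {"step": "Inspect the left costophrenic angle.", "region": [412, 580, 640, 790]},
        {"step": "The angle appears blunted, suggesting fluid accumulation.", "region": [430, 600, 620, 760]},
    ],
    "answer": "Yes",
}


def verifiable_reward(response: str, gold_answer: str) -> float:
    """Toy verifiable reward: full reward if the rollout follows the assumed
    <think>...</think><answer>...</answer> format and the extracted answer
    matches the reference; a small partial reward for correct format alone."""
    match = re.search(r"<think>.*</think>\s*<answer>(.*?)</answer>", response, re.S)
    if not match:
        return 0.0
    predicted = match.group(1).strip().lower()
    if predicted == gold_answer.strip().lower():
        return 1.0
    return 0.1  # format-only reward


# Usage: score a single model rollout during RL post-training.
rollout = "<think>The left costophrenic angle is blunted.</think> <answer>Yes</answer>"
print(verifiable_reward(rollout, record["answer"]))  # -> 1.0
```

In an actual post-training loop this reward would be computed per rollout and fed to a policy-gradient update; the binary answer check is only one plausible form of "verifiable" reward the abstract could be referring to.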