

GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning

June 22, 2025
Authors: Bo Liu, Xiangyu Zhao, Along He, Yidi Chen, Huazhu Fu, Xiao-Ming Wu
cs.AI

Abstract

Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers. To address this, this work first proposes a Thinking with Visual Grounding (ThinkVG) dataset wherein the answer generation is decomposed into intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at https://huggingface.co/datasets/BoKelvin/GEMeX-ThinkVG.
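For readers who want to experiment with the released data, a minimal sketch is given below. It assumes the Hugging Face `datasets` library, guesses at the split and record schema, and illustrates one plausible form of a verifiable reward (exact answer match plus grounded-region IoU); the paper's actual reward design and dataset fields may differ.

```python
# Illustrative sketch only: the split name, record fields, and reward weighting
# below are assumptions, not the paper's actual schema or reward mechanism.
from datasets import load_dataset


def box_iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def verifiable_reward(pred_answer, pred_boxes, gold_answer, gold_boxes,
                      answer_weight=0.7, grounding_weight=0.3):
    """Toy verifiable reward: answer exact-match plus best-IoU grounding score."""
    answer_score = float(pred_answer.strip().lower() == gold_answer.strip().lower())
    if gold_boxes and pred_boxes:
        grounding_score = sum(
            max(box_iou(g, p) for p in pred_boxes) for g in gold_boxes
        ) / len(gold_boxes)
    else:
        grounding_score = 0.0
    return answer_weight * answer_score + grounding_weight * grounding_score


if __name__ == "__main__":
    # The dataset repo is public; the split name is a guess and may differ.
    ds = load_dataset("BoKelvin/GEMeX-ThinkVG", split="train")
    print(ds[0])
```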