

GEMeX-ThinkVG: Towards Thinking with Visual Grounding in Medical VQA via Reinforcement Learning

June 22, 2025
Authors: Bo Liu, Xiangyu Zhao, Along He, Yidi Chen, Huazhu Fu, Xiao-Ming Wu
cs.AI

Abstract

Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model-generated answers. To address this, this work first proposes a Thinking with Visual Grounding (ThinkVG) dataset wherein the answer generation is decomposed into intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model's reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at https://huggingface.co/datasets/BoKelvin/GEMeX-ThinkVG.
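For readers who want to experiment with the released data, a minimal sketch is given below. It assumes the Hugging Face `datasets` library, guesses at the split and record schema, and illustrates one plausible form of a verifiable reward (exact answer match plus grounded-region IoU); the paper's actual reward design and dataset fields may differ.

```python
# Illustrative sketch only: the split name, record fields, and reward weighting
# below are assumptions, not the paper's actual schema or reward mechanism.
from datasets import load_dataset


def box_iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def verifiable_reward(pred_answer, pred_boxes, gold_answer, gold_boxes,
                      answer_weight=0.7, grounding_weight=0.3):
    """Toy verifiable reward: answer exact-match plus best-IoU grounding score."""
    answer_score = float(pred_answer.strip().lower() == gold_answer.strip().lower())
    if gold_boxes and pred_boxes:
        grounding_score = sum(
            max(box_iou(g, p) for p in pred_boxes) for g in gold_boxes
        ) / len(gold_boxes)
    else:
        grounding_score = 0.0
    return answer_weight * answer_score + grounding_weight * grounding_score


if __name__ == "__main__":
    # The dataset repo is public; the split name is a guess and may differ.
    ds = load_dataset("BoKelvin/GEMeX-ThinkVG", split="train")
    print(ds[0])
```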