Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision-Language Models

Abstract

Large vision-language models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.
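
The sample-retention idea behind the causal evaluation framework can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the released code: `toy_model` is a hypothetical stand-in for an LVLM, `mask_region` uses a simple mean-fill in place of whatever counterfactual edit the authors apply, and the box format `(r0, r1, c0, c1)` is invented here. The sketch only shows the filtering rule: a sample is kept when editing the annotated region changes the model's answer.

```python
import numpy as np

def mask_region(image, box, fill=None):
    """Return a copy of `image` with `box` = (r0, r1, c0, c1) overwritten.

    Assumption: a mean-fill stands in for a real counterfactual edit.
    """
    r0, r1, c0, c1 = box
    edited = image.copy()
    edited[r0:r1, c0:c1] = image.mean() if fill is None else fill
    return edited

def is_causally_verified(model, image, box):
    """Keep a sample only if editing the annotated region flips the answer."""
    return model(image) != model(mask_region(image, box))

def toy_model(image):
    # Hypothetical stand-in for an LVLM: says "yes" if any bright pixel exists.
    return "yes" if image.max() > 0.9 else "no"

# One synthetic "image" with a single bright patch.
image = np.zeros((8, 8))
image[2:4, 2:4] = 1.0

samples = [
    {"image": image, "box": (2, 4, 2, 4)},  # annotation covers the bright patch
    {"image": image, "box": (5, 7, 5, 7)},  # annotation misses the patch
]

# Retain only samples whose annotated region is causally responsible.
kept = [s for s in samples
        if is_causally_verified(toy_model, s["image"], s["box"])]
```

In this toy setting only the first sample survives, because masking its box removes the evidence the stand-in model relies on. The same comparison of outputs before and after a targeted edit is the general shape of an intervention-based effect measurement, though the actual MedFocus procedure may differ in every detail.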