Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision-Language Models

Abstract

Large vision-language models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. Visual attribution methods are widely used to explain LVLM predictions, yet whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Applying this framework to 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.
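
The sample-retention step of the causal evaluation framework can be illustrated with a minimal sketch. It is not the released implementation: the names `Sample`, `counterfactual_edit`, and `retain_causal_samples` are hypothetical, the mean-fill editor is a crude stand-in for whatever counterfactual editing the framework actually uses, and "causally responsible" is assumed here to mean that editing the annotated region changes the model's answer.

```python
from typing import Callable, Iterable, List, NamedTuple

import numpy as np


class Sample(NamedTuple):
    """Hypothetical container for one CXR-VQA item."""
    image: np.ndarray   # 2-D grayscale image, values in [0, 1]
    mask: np.ndarray    # boolean mask of the expert-annotated region
    question: str
    answer: str


def counterfactual_edit(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stand-in editor (assumption): overwrite the annotated region with the
    mean intensity of the rest of the image. Assumes the mask does not cover
    the whole image."""
    edited = image.copy()
    edited[mask] = image[~mask].mean()
    return edited


def retain_causal_samples(
    samples: Iterable[Sample],
    predict: Callable[[np.ndarray, str], str],
    edit: Callable[[np.ndarray, np.ndarray], np.ndarray] = counterfactual_edit,
) -> List[Sample]:
    """Keep only samples whose prediction changes once the annotated region
    is edited (assumed retention criterion: a prediction flip)."""
    kept = []
    for sample in samples:
        original = predict(sample.image, sample.question)
        edited = predict(edit(sample.image, sample.mask), sample.question)
        if edited != original:
            kept.append(sample)
    return kept
```

Here `predict` stands for any wrapper that maps an image and a question to an answer string. A sample is dropped when the edit leaves the answer unchanged, since the annotated region then cannot be treated as the evidence the model relied on.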