Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision-Language Models

Abstract

Large Vision Language Models (LVLMs) show promise in medical applications, but their inability to faithfully ground responses in visual evidence raises serious concerns about clinical trustworthiness. While visual attribution methods are widely used to explain LVLM predictions, whether these explanations actually reflect the visual evidence underlying the model's decision is largely unverified, since ground-truth annotations for internal model reasoning are typically unavailable. We address this question for chest X-ray (CXR) reasoning by developing a causal evaluation framework that retains only CXR-VQA samples for which the expert-annotated region is verified, via counterfactual editing, to be causally responsible for the model's prediction. Using this framework across 11 attribution methods, six open-source LVLMs, and two output modes (direct answer and step-by-step reasoning), we find that existing attribution methods often fail to identify the evidence used by LVLMs. To address this failure, we propose MedFocus, a concept-based attribution method that localizes clinically meaningful anatomical regions via unbalanced optimal transport and measures their causal effect on model outputs through targeted interventions. MedFocus produces spatial, concept-level, and token-level attributions and substantially outperforms prior methods, taking a step toward more trustworthy attribution for medical LVLMs. Our data and code are available at https://github.com/gzxiong/medfocus/.
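
The sample-filtering idea behind the causal evaluation framework can be illustrated with a minimal sketch. It is not the released implementation: `edit_region`, `is_causally_verified`, and `toy_model` are hypothetical names, the mean-fill edit is an assumed stand-in for whatever counterfactual editing the authors use, and the toy classifier stands in for an LVLM. The only point shown is the retention rule: a sample is kept when editing the expert-annotated region changes the model's answer.

```python
import numpy as np

def edit_region(image, mask):
    """Assumed counterfactual edit: overwrite the annotated region
    with the mean intensity of the pixels outside it."""
    edited = image.copy()
    edited[mask] = image[~mask].mean()
    return edited

def is_causally_verified(model, image, mask):
    """Retain a sample only if editing the annotated region flips the answer."""
    return model(image) != model(edit_region(image, mask))

def toy_model(image):
    # Hypothetical stand-in for an LVLM: says "yes" if any pixel is bright.
    return "yes" if image.max() > 0.5 else "no"

# One toy image with a bright 2x2 "finding".
image = np.zeros((8, 8))
image[2:4, 2:4] = 1.0

# Annotation that covers the finding, and one that misses it.
on_finding = np.zeros((8, 8), dtype=bool)
on_finding[2:4, 2:4] = True
off_finding = np.zeros((8, 8), dtype=bool)
off_finding[5:7, 5:7] = True

samples = [("a", image, on_finding), ("b", image, off_finding)]
kept = [sid for sid, img, m in samples
        if is_causally_verified(toy_model, img, m)]
print(kept)
```

Under these assumptions only sample "a" survives, because removing the region in sample "b" leaves the toy model's answer unchanged. An attribution method can then be scored against annotated regions that are known to matter to the model.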