
Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

September 26, 2025
Authors: Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Hua Zhang, Xiaochun Cao
cs.AI

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code is available at https://github.com/RuoyuChen10/EAGLE.
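
To make the attribution procedure described above concrete, the following is a minimal sketch of a greedy search over sparsified image regions that combines a sufficiency-style ("insight") term with a necessity-style term. The function names (`greedy_attribution`, `token_logprob`), the masking scheme, and the simple additive combination of the two scores are assumptions for illustration; the paper and the released code at https://github.com/RuoyuChen10/EAGLE define the actual objective.

```python
"""Sketch of greedy region attribution for a target token, under assumed scoring.
Not the authors' implementation; see the EAGLE repository for the real method."""
import numpy as np


def greedy_attribution(image, regions, token_logprob, budget=10):
    """
    image:         H x W x C array.
    regions:       list of boolean H x W masks (sparsified image regions).
    token_logprob: user-supplied callable(masked_image) -> log-probability of the
                   target token under the MLLM (hypothetical interface).
    budget:        maximum number of regions to select.
    Returns the indices of the selected regions, in order of selection.
    """
    full = token_logprob(image)                      # evidence from the full image
    baseline = token_logprob(np.zeros_like(image))   # evidence with the image removed
    selected, remaining = [], list(range(len(regions)))

    def reveal(idxs):
        # Keep only the chosen regions, blank out everything else.
        mask = np.zeros(image.shape[:2], dtype=bool)
        for i in idxs:
            mask |= regions[i]
        return image * mask[..., None]

    def occlude(idxs):
        # Remove the chosen regions, keep the rest of the image.
        mask = np.ones(image.shape[:2], dtype=bool)
        for i in idxs:
            mask &= ~regions[i]
        return image * mask[..., None]

    for _ in range(min(budget, len(remaining))):
        best_idx, best_score = None, -np.inf
        for i in remaining:
            cand = selected + [i]
            # "Insight" (sufficiency): how well the selected regions alone support the token.
            insight = token_logprob(reveal(cand)) - baseline
            # "Necessity": how much removing the selected regions hurts the token.
            necessity = full - token_logprob(occlude(cand))
            score = insight + necessity              # assumed combination of the two terms
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected
```

Because the procedure only queries the model through `token_logprob`, it stays black-box: no gradients or internal activations are needed, which is consistent with the lightweight, low-memory design the abstract describes.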