Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

September 26, 2025
Authors: Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Hua Zhang, Xiaochun Cao
cs.AI

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code is available at https://github.com/RuoyuChen10/EAGLE.
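
The abstract describes EAGLE's attribution procedure as a greedy search over sparsified image regions that optimizes a combined sufficiency ("insight") and necessity objective. Below is a minimal, hypothetical sketch of that general idea, not the released implementation: the grid partitioning, the `token_logprob` placeholder, and the additive scoring are illustrative assumptions; the actual objective, sparsification, and black-box querying are defined in the paper and the linked repository.

```python
# Hypothetical sketch of greedy region attribution in the spirit of EAGLE.
# `token_logprob` is a stand-in for a black-box MLLM query returning
# log p(target token | masked image, prompt); here it returns a toy value
# so the sketch runs end to end.

import numpy as np

def split_into_regions(image: np.ndarray, grid: int = 7) -> list[np.ndarray]:
    """Partition an HxWxC image into boolean masks on a coarse grid (assumed sparsification)."""
    h, w = image.shape[:2]
    masks = []
    for i in range(grid):
        for j in range(grid):
            m = np.zeros((h, w), dtype=bool)
            m[i * h // grid:(i + 1) * h // grid,
              j * w // grid:(j + 1) * w // grid] = True
            masks.append(m)
    return masks

def token_logprob(image: np.ndarray, mask: np.ndarray, token: str) -> float:
    """Placeholder scorer: in practice this would call the MLLM with only the
    masked-in pixels visible and read off the target token's log-probability."""
    return float(mask.mean())  # toy value, not a real model call

def greedy_attribution(image, token, regions, budget=10):
    """Greedily add regions that maximize a combined sufficiency/necessity score."""
    h, w = image.shape[:2]
    selected = np.zeros((h, w), dtype=bool)
    full = np.ones((h, w), dtype=bool)
    chosen = []
    for _ in range(budget):
        best_idx, best_score = None, -np.inf
        for idx, region in enumerate(regions):
            if idx in chosen:
                continue
            candidate = selected | region
            # Sufficiency ("insight"): how well the selected regions alone support the token.
            insight = token_logprob(image, candidate, token)
            # Necessity: how much deleting the selected regions from the full image hurts the token.
            necessity = token_logprob(image, full, token) - token_logprob(image, full & ~candidate, token)
            score = insight + necessity
            if score > best_score:
                best_idx, best_score = idx, score
        chosen.append(best_idx)
        selected |= regions[best_idx]
    return chosen, selected

if __name__ == "__main__":
    img = np.random.rand(224, 224, 3)
    regions = split_into_regions(img, grid=7)
    order, mask = greedy_attribution(img, token="dog", regions=regions, budget=5)
    print("Selected region indices (most informative first):", order)
```

Because each step only requires forward queries of the model on masked inputs, a procedure of this shape stays black-box and avoids storing gradients, which is consistent with the abstract's claim of substantially lower GPU memory use.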