Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet the extent to which generated tokens depend on the visual modality remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code is available at https://github.com/RuoyuChen10/EAGLE.
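To make the greedy search over image regions concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact form of the insight and necessity scores, so the definitions used here are assumptions: insight is taken as the token probability when only the selected regions are visible, and necessity as one minus the token probability when the selected regions are removed. The names `greedy_attribution`, `token_prob`, and `toy_token_prob` are hypothetical, and the toy scorer stands in for a real black-box MLLM query.

```python
from typing import Callable, FrozenSet, List

# Hypothetical black-box scorer: maps the set of visible region indices to
# the model's probability for the token being explained.
TokenProb = Callable[[FrozenSet[int]], float]


def greedy_attribution(num_regions: int, token_prob: TokenProb) -> List[int]:
    """Rank regions by greedily maximizing insight + necessity (assumed forms)."""
    all_regions = frozenset(range(num_regions))
    selected: List[int] = []
    remaining = set(all_regions)

    def objective(subset: FrozenSet[int]) -> float:
        # Assumed insight (sufficiency): probability with only `subset` visible.
        insight = token_prob(subset)
        # Assumed necessity (indispensability): probability drop when `subset` is removed.
        necessity = 1.0 - token_prob(all_regions - subset)
        return insight + necessity

    while remaining:
        # Add the region that most improves the combined objective;
        # ties are broken by the lowest region index.
        best = max(sorted(remaining),
                   key=lambda r: objective(frozenset(selected + [r])))
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_token_prob(visible: FrozenSet[int]) -> float:
    """Toy stand-in for an MLLM: only regions 2 and 5 carry evidence."""
    weights = {2: 0.6, 5: 0.3}
    return 0.05 + sum(w for r, w in weights.items() if r in visible)


ranking = greedy_attribution(8, toy_token_prob)
print(ranking[:2])  # the two evidence-bearing regions are ranked first
```

In this sketch, a short prefix of `ranking` plays the role of the compact perceptual region. Each greedy step costs two black-box queries per candidate region, so the number of regions drives the total query count.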
