Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

Abstract

Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To address this gap, we first systematically quantify the modality preference of OLLMs using a newly curated conflict-based benchmark and a modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the "text-dominance" observed in traditional vision-language models (VLMs), most OLLMs exhibit a pronounced visual preference. To understand the underlying mechanism, we conduct layer-wise probing and show that this modality preference is not a static property but emerges progressively in the mid-to-late layers. Building on these insights, we use these internal signals to diagnose cross-modal hallucinations, achieving competitive performance on three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference
