

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

April 18, 2026
作者: Xinru Yan, Boxi Cao, Yaojie Lu, Hongyu Lin, Weixiang Zhou, Le Sun, Xianpei Han
cs.AI

Abstract

Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify the modality preference of OLLMs using a newly curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the "text dominance" of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference
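The modality selection rate described in the abstract can be sketched as a simple per-modality fraction over conflict items. The function name and the assumption that each benchmark item records which modality's (conflicting) answer the model followed are illustrative, not taken from the paper:

```python
from collections import Counter

def modality_selection_rate(choices):
    """Fraction of conflict items resolved in favor of each modality.

    `choices` holds one label per conflict item, naming the modality whose
    conflicting information the model's answer followed, e.g. "text",
    "vision", "audio", or "neither".
    """
    counts = Counter(choices)
    total = len(choices)
    return {modality: n / total for modality, n in counts.items()}

# Toy example: a model that mostly follows the visual input under conflict,
# i.e. the "visual preference" pattern the abstract reports for most OLLMs.
rates = modality_selection_rate(
    ["vision", "vision", "text", "vision", "audio", "vision"]
)
```

On this toy input the vision rate is 4/6, so the hypothetical model would be labeled visually dominant; a traditional text-dominant VLM would instead concentrate mass on "text".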