Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

Abstract

Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify the modality preference of OLLMs using a newly curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the "text-dominance" of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that this modality preference is not static but emerges progressively in the mid-to-late layers. Building on these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference
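
The abstract names a modality selection rate metric but does not define it. As a rough illustration only, the sketch below assumes the rate is the fraction of conflict samples on which the model's answer follows a given modality. The function name, the label strings, and the sample data are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def modality_selection_rate(choices):
    """Return, per modality, the fraction of conflict samples that followed it.

    `choices` holds one label per conflict sample, naming the modality whose
    content the model's answer agreed with. This definition is an assumption
    made for illustration, not the paper's formal metric.
    """
    total = len(choices)
    if total == 0:
        return {}
    counts = Counter(choices)
    return {modality: n / total for modality, n in counts.items()}


# Hypothetical labels for five conflict samples.
choices = ["visual", "visual", "text", "audio", "visual"]
rates = modality_selection_rate(choices)
```

Under this assumed definition, a model with a pronounced visual preference would show a visual rate well above the rates of the other modalities.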
