Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

Abstract

Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. This native integration, however, gives rise to a critical yet underexplored phenomenon: modality preference. To address this gap, we first systematically quantify the modality preference of OLLMs using a newly curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the "text-dominance" of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To understand the underlying mechanism, we conduct layer-wise probing and show that this modality preference is not static but emerges progressively in the mid-to-late layers. Building on these insights, we use these internal signals to diagnose cross-modal hallucinations, achieving competitive performance on three downstream multimodal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference
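
The abstract names a modality selection rate metric but does not define it. As a rough illustration only, the sketch below assumes that each conflict example pairs every modality with a different candidate answer, and that the rate for a modality is the fraction of examples on which the model's answer matches that modality's candidate. The function name, the input layout, and the "other" bucket are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def modality_selection_rate(predictions, modality_answers):
    """Share of conflict examples on which the model sided with each modality.

    predictions: list of model answers (assumed already normalized strings).
    modality_answers: list of dicts mapping a modality name to the answer
        that modality supports; the answers within one dict are assumed
        to conflict with each other.
    """
    counts = Counter()
    for pred, answers in zip(predictions, modality_answers):
        matched = [m for m, a in answers.items() if a == pred]
        # Attribute the example to a modality only when exactly one matches;
        # anything else (no match, ambiguous match) goes to "other".
        counts[matched[0] if len(matched) == 1 else "other"] += 1
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}


# Toy data: the image shows a cat, the text says dog, the audio is a cow.
conflict = {"text": "dog", "vision": "cat", "audio": "cow"}
rates = modality_selection_rate(["cat", "dog", "cat", "bird"], [conflict] * 4)
```

On this toy input the model follows the visual answer on two of four examples, the textual answer on one, and neither on the last, so a visual preference shows up as the largest rate.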
