Anisotropic Modality Alignment

Abstract

Training multimodal large language models (MLLMs) has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, allowing multimodal training to be carried out with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle is the persistent modality gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share a compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift, but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we propose the principle of anisotropic modality gap alignment: effective modality alignment should match the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose AnisoAlign, an anisotropic geometric correction framework for unpaired modality alignment. The framework leverages the internal geometric prior of the target modality and applies a bounded correction to source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and offers a new representation-alignment perspective for training multimodal models with unimodal data.
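To make the distinction between a global shift and an anisotropic residual concrete, the following is a minimal illustrative sketch on synthetic data. It is not the AnisoAlign procedure. The function name `gap_anisotropy`, the use of a covariance difference as the residual, and the synthetic embeddings are all assumptions introduced here for illustration. The sketch measures the centroid shift between two unpaired sets of embeddings and then asks how much of the remaining second-order discrepancy is carried by a few directions.

```python
import numpy as np


def gap_anisotropy(source, target, k=3):
    """Return (centroid shift norm, share of covariance discrepancy in top-k directions).

    Hypothetical diagnostic for illustration only; needs no paired samples.
    """
    # Global shift: difference between the two modality centroids.
    shift = target.mean(axis=0) - source.mean(axis=0)
    # Residual beyond the shift: difference of the two covariance matrices.
    # Covariances are translation invariant, so the shift does not affect them.
    diff = np.cov(target, rowvar=False) - np.cov(source, rowvar=False)
    # diff is symmetric, so eigvalsh applies; sort magnitudes in descending order.
    mags = np.sort(np.abs(np.linalg.eigvalsh(diff)))[::-1]
    return float(np.linalg.norm(shift)), float(mags[:k].sum() / mags.sum())


# Synthetic stand-ins for two modalities (assumed data, not real embeddings).
rng = np.random.default_rng(0)
dim, n = 32, 4000
source = rng.normal(size=(n, dim))
scales = np.ones(dim)
scales[:2] = 4.0  # stretch only two axes to create an anisotropic residual
target = rng.normal(size=(n, dim)) * scales + 0.5  # plus a constant global shift

shift_norm, top_share = gap_anisotropy(source, target, k=2)

# Translating the source onto the target centroid removes the shift
# but leaves the anisotropic residual untouched.
recentred = source + (target.mean(axis=0) - source.mean(axis=0))
shift_after, share_after = gap_anisotropy(recentred, target, k=2)

print(shift_norm, top_share, shift_after, share_after)
```

In this toy setting, mean translation closes the centroid gap while the share of the discrepancy concentrated in the two stretched directions stays the same, which is the kind of situation the abstract describes as not being a simple global shift.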