Anisotropic Modality Align

Abstract

Training multimodal large language models (MLLMs) has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, allowing multimodal training to be carried out with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle is the persistent modality gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that the representations of different modalities already share a compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift, but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we propose the principle of anisotropic modality gap alignment: effective modality alignment should match the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose AnisoAlign, an anisotropic geometric correction framework for unpaired modality alignment. The framework leverages the internal geometric prior of the target modality and applies a bounded correction to source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a structured, correctable geometric phenomenon and offers a new representation-alignment perspective for training multimodal models with unimodal data.
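To make the geometric claim above more concrete (a gap that consists of a global shift plus a residual concentrated along a few directions), the following is a minimal diagnostic sketch. It is not the AnisoAlign procedure, which the abstract does not specify. It assumes a small paired diagnostic set, uses synthetic toy embeddings, and all function names (`mean_vector`, `covariance`, `top_eigenpair`, `gap_diagnostics`) are hypothetical helpers introduced only for illustration.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows):
    """Sample covariance matrix (d x d) of the given vectors."""
    mu = mean_vector(rows)
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    n, d = len(centered), len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for row in centered:
        for i in range(d):
            for j in range(d):
                cov[i][j] += row[i] * row[j] / (n - 1)
    return cov


def top_eigenpair(cov, iters=200):
    """Power iteration: dominant eigenvalue and eigenvector of a PSD matrix."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # fixed, generic starting vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(a * b for a, b in zip(v, cv)), v


def gap_diagnostics(src, tgt):
    """Split the paired gap into a global shift and a centered residual.

    Returns (norm of the mean shift, share of residual variance carried by
    the top principal direction, that direction). A share close to 1/d
    suggests an isotropic residual; a share close to 1 suggests anisotropy.
    """
    residuals = [[s - t for s, t in zip(a, b)] for a, b in zip(src, tgt)]
    shift = mean_vector(residuals)
    cov = covariance(residuals)
    total = sum(cov[i][i] for i in range(len(cov)))
    lam, direction = top_eigenpair(cov)
    share = lam / total if total > 0 else 0.0
    return math.sqrt(sum(x * x for x in shift)), share, direction


# Toy data (assumption): the source equals the target plus a shift of 0.5
# on axis 0, strong noise along axis 0, and weak noise on the other axes.
rng = random.Random(0)
dim, count = 4, 500
tgt = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(count)]
src = [
    [t[0] + 0.5 + rng.gauss(0, 1.0)] + [t[k] + rng.gauss(0, 0.1) for k in range(1, dim)]
    for t in tgt
]
shift_norm, top_share, direction = gap_diagnostics(src, tgt)
print(f"shift norm: {shift_norm:.3f}, top-direction share: {top_share:.3f}")
```

In this toy setup, subtracting the mean shift alone would leave most of the residual variance untouched, which is the kind of situation the abstract describes as an anisotropic residual structure rather than a simple global shift.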
