ChatPaper.ai

Anisotropic Modality Align

May 8, 2026
作者: Xiaomin Yu, Yijiang Li, Yuhui Zhang, Hanzhen Zhao, Yue Yang, Hao Tang, Yue Song, Xiaobin Hu, Chengwei Qin, Shuicheng Yan, Hui Xiong
cs.AI

Abstract

Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling models to perform multimodal training with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle lies in the persistent Modality Gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift, but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we further propose the principle of anisotropic modality gap alignment: effective modality alignment should align with the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose an anisotropic geometric correction framework, AnisoAlign, for unpaired modality alignment. This framework leverages the internal geometric prior of the target modality and performs bounded correction on source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and provides a new representation alignment perspective for training multimodal models with unimodal data.
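The abstract describes a correction that uses the target modality's dominant geometric directions to shift source-modality embeddings toward the target distribution, while bounding the shift to preserve source semantics. The sketch below illustrates that general idea with a top-k SVD of the target embeddings and a norm-clipped per-sample correction; it is a minimal illustration under assumed names and parameters (`anisotropic_correct`, `k`, `max_shift`), not the paper's actual AnisoAlign implementation.

```python
import numpy as np

def anisotropic_correct(src, tgt, k=8, max_shift=1.0):
    """Illustrative sketch (not the paper's API): move unpaired source
    embeddings toward the target-modality distribution along the target's
    k dominant directions, with a bounded per-sample correction."""
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    # Dominant directions of the target modality (its internal geometric prior).
    _, _, Vt = np.linalg.svd(tgt - mu_t, full_matrices=False)
    U = Vt[:k].T                         # (d, k) top-k target directions
    # Match second moments of the source to the target, but only inside
    # the k-dimensional dominant subspace.
    cs = (src - mu_s) @ U                # source coords in that subspace
    ct = (tgt - mu_t) @ U
    scale = ct.std(axis=0) / (cs.std(axis=0) + 1e-8)
    shift = (mu_t - mu_s) + (cs * scale - cs) @ U.T
    # Bounded correction: clip each sample's shift norm so structure
    # outside the dominant subspace is left (nearly) intact.
    norms = np.linalg.norm(shift, axis=1, keepdims=True)
    shift = shift * np.minimum(1.0, max_shift / (norms + 1e-8))
    return src + shift
```

With a large `max_shift` the correction reduces to mean shift plus per-direction rescaling in the dominant subspace; shrinking `max_shift` trades alignment with the target distribution against preservation of source geometry, which is the tension the abstract's alignment principle names.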
PDF · May 12, 2026