Anisotropic Modality Align
May 8, 2026
Authors: Xiaomin Yu, Yijiang Li, Yuhui Zhang, Hanzhen Zhao, Yue Yang, Hao Tang, Yue Song, Xiaobin Hu, Chengwei Qin, Shuicheng Yan, Hui Xiong
cs.AI
Abstract
Training multimodal large language models (MLLMs) has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling multimodal training with unimodal data alone. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle is the persistent modality gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share a compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we further propose the principle of anisotropic modality-gap alignment: effective modality alignment should match the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose AnisoAlign, an anisotropic geometric correction framework for unpaired modality alignment. The framework leverages the internal geometric prior of the target modality and applies a bounded correction to source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits both in geometric diagnostics and in text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and provides a new representation-alignment perspective for training multimodal models with unimodal data.
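The core idea described above can be illustrated in miniature: estimate the target modality's dominant principal directions, then shift source embeddings toward the target distribution only along that subspace, with a bounded step so the source modality's own structure is preserved. The sketch below is an illustrative assumption, not the paper's actual AnisoAlign; the function name `anisotropic_align` and the knobs `k` and `max_step` are hypothetical, and the real framework presumably uses a richer correction than this mean shift.

```python
import numpy as np

def anisotropic_align(source, target, k=4, max_step=0.5):
    """Hypothetical sketch: shift source embeddings toward the target
    modality along the target's top-k principal directions only, with a
    bounded step size (NOT the paper's actual AnisoAlign procedure)."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    # internal geometric prior of the target modality: its top-k
    # principal directions, from an SVD of the centered embeddings
    _, _, vt = np.linalg.svd(target - mu_t, full_matrices=False)
    basis = vt[:k].T                       # (d, k), orthonormal columns
    residual = mu_t - mu_s                 # global inter-modality residual
    # correct only along the dominant subspace, and bound the step so
    # the source modality's semantic structure is largely preserved
    step = basis @ (basis.T @ residual)
    norm = np.linalg.norm(step)
    if norm > max_step:
        step *= max_step / norm
    return source + step                   # substitute target-side reps

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 16))           # stand-in "text" embeddings
tgt = rng.normal(loc=0.3, size=(100, 16))  # stand-in "image" embeddings
aligned = anisotropic_align(src, tgt)
```

Because the correction is a projection of the mean residual onto the dominant subspace (optionally scaled down by the bound), the mean gap to the target can only shrink, while directions outside the subspace, and all within-modality relative geometry, are left untouched.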