Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
February 2, 2026
Authors: Xiaomin Yu, Yi Xin, Wenjie Zhang, Chonghan Liu, Hanzhen Zhao, Xiaoxing Hu, Xinlei Yu, Ziyue Qiao, Hao Tang, Xue Yang, Xiaobin Hu, Chengwei Qin, Hui Xiong, Yu Qiao, Shuicheng Yan
cs.AI
Abstract
Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridging this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign maps text representations into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
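To make the idea of training-free, statistics-driven alignment concrete, the sketch below shows a generic whitening-and-recoloring (CORAL-style) transform that shifts text embeddings toward the empirical mean and covariance of image embeddings computed from unpaired corpora. This is an illustrative assumption only, not the paper's ReAlign procedure: the actual Anchor, Trace, and Centroid Alignment steps are defined in the paper, and every name here (fit_gaussian_stats, align_text_to_image, text_bank, image_bank) is hypothetical.

```python
import numpy as np

def fit_gaussian_stats(embeddings: np.ndarray, eps: float = 1e-5):
    """Estimate the mean and (regularized) covariance of an (N, D) embedding set."""
    mu = embeddings.mean(axis=0)
    centered = embeddings - mu
    cov = centered.T @ centered / max(len(embeddings) - 1, 1)
    cov += eps * np.eye(cov.shape[0])  # keep the covariance well-conditioned
    return mu, cov

def sym_matrix_power(cov: np.ndarray, power: float) -> np.ndarray:
    """Symmetric matrix power via eigendecomposition (cov is symmetric PSD)."""
    vals, vecs = np.linalg.eigh(cov)
    vals = np.maximum(vals, 1e-8)  # guard against tiny negative eigenvalues
    return vecs @ np.diag(vals ** power) @ vecs.T

def align_text_to_image(text_emb, text_mu, text_cov, img_mu, img_cov):
    """Whiten text embeddings with text statistics, re-color them with image
    statistics, then shift to the image centroid (second-order moment matching)."""
    whiten = sym_matrix_power(text_cov, -0.5)
    recolor = sym_matrix_power(img_cov, 0.5)
    return (text_emb - text_mu) @ whiten @ recolor + img_mu

# Usage sketch: statistics come from *unpaired* corpora encoded by a frozen
# CLIP-style encoder pair, so no image-text pairs are required.
#   text_mu, text_cov = fit_gaussian_stats(text_bank)   # (N_t, D) unpaired text embeddings
#   img_mu,  img_cov  = fit_gaussian_stats(image_bank)  # (N_i, D) unpaired image embeddings
#   pseudo_visual = align_text_to_image(text_emb, text_mu, text_cov, img_mu, img_cov)
```

The design intent of such a transform is that the aligned text embeddings can stand in for visual features during MLLM pretraining; how the paper's three alignment steps refine this beyond simple moment matching is detailed in the main text.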