
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

February 2, 2026
作者: Xiaomin Yu, Yi Xin, Wenjie Zhang, Chonghan Liu, Hanzhen Zhao, Xiaoxing Hu, Xinlei Yu, Ziyue Qiao, Hao Tang, Xue Yang, Xiaobin Hu, Chengwei Qin, Hui Xiong, Yu Qiao, Shuicheng Yan
cs.AI

Abstract

Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridging this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometry of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representations into the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.
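
The abstract names the three ReAlign steps (Anchor, Trace, and Centroid Alignment) but does not give their equations. As a hedged illustration only, the sketch below shows one conventional way to do statistics-only distribution alignment of this flavor: a mean shift to absorb a stable bias, plus a whitening-coloring transform to match second-order (anisotropic) structure, using nothing but per-modality statistics from unpaired data. The function names (`fit_gaussian_stats`, `whiten_color_transfer`) are hypothetical, not from the paper, and ReAlign's actual Anchor and Trace steps are not reproduced here.

```python
# Hedged sketch: statistics-based text-to-image distribution alignment,
# in the spirit of (but not identical to) ReAlign's Centroid Alignment.
import numpy as np

def fit_gaussian_stats(embeddings: np.ndarray):
    """Estimate mean and covariance of an embedding set of shape (n, dim)."""
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    return mu, cov

def whiten_color_transfer(text_emb, text_mu, text_cov, img_mu, img_cov, eps=1e-5):
    """Map text embeddings onto the image distribution: whiten with the text
    covariance, re-color with the image covariance, then shift to the image
    centroid. The mean offset plays the role of a 'stable bias'; the
    covariance mismatch stands in for the 'anisotropic residual'."""
    tw, tv = np.linalg.eigh(text_cov)
    iw, iv = np.linalg.eigh(img_cov)
    whiten = tv @ np.diag(1.0 / np.sqrt(np.clip(tw, 0, None) + eps)) @ tv.T
    color = iv @ np.diag(np.sqrt(np.clip(iw, 0, None) + eps)) @ iv.T
    return (text_emb - text_mu) @ whiten @ color + img_mu

# Usage on synthetic, deliberately offset/anisotropic distributions:
rng = np.random.default_rng(0)
dim = 64
img = rng.normal(size=(10_000, dim)) @ rng.normal(size=(dim, dim)) * 0.1 + 2.0
txt = rng.normal(size=(10_000, dim)) * 0.5 - 1.0  # different centroid and shape
t_mu, t_cov = fit_gaussian_stats(txt)
i_mu, i_cov = fit_gaussian_stats(img)
aligned = whiten_color_transfer(txt, t_mu, t_cov, i_mu, i_cov)
print(np.linalg.norm(aligned.mean(0) - img.mean(0)))  # ~0 after alignment
```

Note that the transform is fit from each modality's own samples independently, mirroring the abstract's central claim that geometric misalignment can be corrected from unpaired data alone, without image-text pairs.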