Scaling Language-Centric Omnimodal Representation Learning
October 13, 2025
Authors: Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong
cs.AI
Abstract
Recent multimodal embedding approaches leveraging multimodal large language
models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising
results, yet the underlying reasons behind their superiority remain
underexplored. This work argues that a crucial advantage of MLLM-based
approaches stems from implicit cross-modal alignment achieved during generative
pretraining, where the language decoder learns to exploit multimodal signals
within a shared representation space for generating unimodal outputs. Through
analysis of anisotropy and kernel similarity structure, we empirically confirm
that latent alignment emerges within MLLM representations, allowing CL to serve
as a lightweight refinement stage. Leveraging this insight, we propose a
Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive
experiments across diverse backbones and benchmarks demonstrate its
effectiveness, achieving state-of-the-art performance across modalities.
Furthermore, we identify a Generation-Representation Scaling Law (GRSL),
showing that the representational capabilities gained through contrastive
refinement scale positively with the MLLM's generative capabilities. This
suggests that improving generative abilities emerges as an effective paradigm
for enhancing representation quality. We provide a theoretical explanation of
GRSL, which formally links the MLLM's generative quality to the upper bound on
its representation performance, and validate it on a challenging, low-resource
visual-document retrieval task, showing that continual generative pretraining
before CL can further enhance the potential of a model's embedding
capabilities. Code, models, and resources are available at
https://github.com/LCO-Embedding/LCO-Embedding.
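As a rough illustration of the analyses and training objective the abstract describes, the sketch below is a minimal Python/PyTorch example, not code from the released repository. The function names (`anisotropy`, `linear_cka`, `info_nce`) and the `image_emb`/`text_emb` placeholders are assumptions for illustration: anisotropy is measured here as mean pairwise cosine similarity, cross-modal kernel similarity as linear CKA between paired image- and text-side embeddings, and the contrastive refinement stage as a symmetric InfoNCE loss over in-batch negatives.

```python
# Minimal sketch (not the authors' implementation): probing latent cross-modal
# alignment in MLLM representations and a lightweight contrastive refinement loss.
import torch
import torch.nn.functional as F


def anisotropy(X: torch.Tensor) -> float:
    """Mean pairwise cosine similarity of embeddings X with shape (n, d).
    Values near 1 indicate a highly anisotropic (narrow-cone) embedding space."""
    X = F.normalize(X, dim=-1)
    sim = X @ X.T
    n = X.shape[0]
    off_diag = sim.sum() - sim.diagonal().sum()  # exclude self-similarities
    return (off_diag / (n * (n - 1))).item()


def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
    """Linear CKA between two representation sets X (n, d1) and Y (n, d2),
    e.g., image-side vs. text-side embeddings of paired inputs."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = (Y.T @ X).norm() ** 2          # ||Y^T X||_F^2
    return (hsic / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()


def info_nce(query: torch.Tensor, target: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE over in-batch negatives, used as a lightweight
    contrastive refinement objective on top of pooled MLLM embeddings."""
    q = F.normalize(query, dim=-1)
    t = F.normalize(target, dim=-1)
    logits = q @ t.T / tau
    labels = torch.arange(q.shape[0], device=q.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))


# Hypothetical usage: image_emb and text_emb would be pooled hidden states from
# an MLLM for paired image/caption inputs, each of shape (n, d).
# print(anisotropy(text_emb), linear_cka(image_emb, text_emb))
# loss = info_nce(image_emb, text_emb)
```

Under this reading, high cross-modal CKA before any contrastive training would indicate the latent alignment inherited from generative pretraining, and the InfoNCE stage then acts only as a light refinement rather than learning alignment from scratch.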