MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
June 29, 2025
Authors: Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou
cs.AI
Abstract
Multimodal embedding models, built upon causal Vision Language Models (VLMs),
have shown promise in various tasks. However, current approaches face three key
limitations: the use of causal attention in VLM backbones is suboptimal for
embedding tasks; scalability issues due to reliance on high-quality labeled
paired data for contrastive learning; and limited diversity in training
objectives and data. To address these issues, we propose MoCa, a two-stage
framework for transforming pre-trained VLMs into effective bidirectional
multimodal embedding models. The first stage, Modality-aware Continual
Pre-training, introduces a joint reconstruction objective that simultaneously
denoises interleaved text and image inputs, enhancing bidirectional
context-aware reasoning. The second stage, Heterogeneous Contrastive
Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple
image-caption pairs to enhance generalization and alignment. Our method
addresses the stated limitations by introducing bidirectional attention through
continual pre-training, scaling effectively with massive unlabeled datasets via
joint reconstruction objectives, and utilizing diverse multimodal data for
enhanced representation robustness. Experiments demonstrate that MoCa
consistently improves performance across MMEB and ViDoRe-v2 benchmarks,
achieving new state-of-the-art results, and exhibits strong scalability with
both model size and training data on MMEB.
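The abstract names the stage-one objective but not its exact form. Below is a minimal sketch of one plausible instantiation, assuming masked-language modeling on text tokens plus MAE-style regression of masked image patches over a single interleaved sequence; `backbone`, `text_head`, and `image_head` are hypothetical modules, not MoCa's actual API:

```python
import torch.nn.functional as F

def joint_reconstruction_loss(backbone, text_head, image_head,
                              corrupted_ids, corrupted_patches,
                              text_labels, patch_targets,
                              text_mask, patch_mask):
    """Hypothetical joint denoising loss: cross-entropy on masked text tokens
    (MLM) plus MSE on masked image patches (MAE-style), computed over one
    interleaved sequence encoded with bidirectional attention."""
    # Encode the corrupted interleaved sequence; text tokens first, patches after.
    hidden = backbone(corrupted_ids, corrupted_patches)       # [B, Lt + Li, D]
    lt = corrupted_ids.size(1)

    # Text branch: predict original token ids at masked positions only.
    text_logits = text_head(hidden[:, :lt])                   # [B, Lt, V]
    mlm = F.cross_entropy(text_logits[text_mask], text_labels[text_mask])

    # Image branch: regress raw patch values at masked positions only.
    patch_pred = image_head(hidden[:, lt:])                   # [B, Li, P]
    mae = F.mse_loss(patch_pred[patch_mask], patch_targets[patch_mask])

    # Equal weighting of the two terms is an assumption.
    return mlm + mae
```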
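Stage two is described as contrastive fine-tuning over heterogeneous pairs; the standard in-batch InfoNCE loss below is one common way to realize this (the temperature value and the pooling that produces `query_emb`/`pos_emb` are assumptions, not details from the abstract):

```python
import torch
import torch.nn.functional as F

def infonce_loss(query_emb, pos_emb, temperature=0.05):
    """In-batch contrastive loss: row i of `pos_emb` is the positive for row i
    of `query_emb`; every other row in the batch acts as a negative."""
    q = F.normalize(query_emb, dim=-1)                  # [B, D]
    p = F.normalize(pos_emb, dim=-1)                    # [B, D]
    logits = q @ p.T / temperature                      # [B, B] cosine similarities
    targets = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```

Because the loss only sees pooled embeddings, a single batch can mix text-to-image, image-to-text, and interleaved-document pairs, which is presumably what makes data beyond simple image-caption pairs straightforward to fold in.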