CoMP: Continual Multimodal Pre-training for Vision Foundation Models
March 24, 2025
Authors: Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, Yu-Gang Jiang
cs.AI
Abstract
Pre-trained Vision Foundation Models (VFMs) provide strong visual
representations for a wide range of applications. In this paper, we continually
pre-train prevailing VFMs in a multimodal manner such that they can
effortlessly process visual inputs of varying sizes and produce visual
representations that are more aligned with language representations, regardless
of their original pre-training process. To this end, we introduce CoMP, a
carefully designed multimodal pre-training pipeline. CoMP uses a Continual
Rotary Position Embedding to support native resolution continual pre-training,
and an Alignment Loss between visual and textual features through language
prototypes to align multimodal representations. By three-stage training, our
VFMs achieve remarkable improvements not only in multimodal understanding but
also in other downstream tasks such as classification and segmentation.
Remarkably, CoMP-SigLIP achieves scores of 66.7 on ChartQA and 75.9 on DocVQA
with a 0.5B LLM, while maintaining an 87.4% accuracy on ImageNet-1K and a 49.5
mIoU on ADE20K under frozen chunk evaluation.
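The two mechanisms named in the abstract can be illustrated with short, self-contained sketches. Both are assumptions based only on the description above, not the authors' released implementation; the function names, tensor shapes, and hyperparameters (e.g., `temperature`, the source of the `prototypes` matrix) are illustrative.

First, a minimal sketch of 2D rotary position embeddings of the kind a Continual Rotary Position Embedding could build on: patch positions are encoded directly from their grid coordinates, so a ViT can process native-resolution inputs without resizing a learned position table (the continual-initialization details are not reproduced here).

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """positions: (L,) patch coordinates along one axis -> (L, dim // 2) rotation angles."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def apply_rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """x: (L, D) queries or keys for L patches (D divisible by 4);
    rows, cols: (L,) integer grid coordinates of each patch.
    The first half of the channel pairs is rotated by row angles, the second half
    by column angles, so any grid size is encoded from coordinates alone."""
    L, D = x.shape
    ang = torch.cat([rope_angles(rows, D // 2), rope_angles(cols, D // 2)], dim=-1)  # (L, D // 2)
    cos, sin = ang.cos(), ang.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]   # interleaved channel pairs, each (L, D // 2)
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Usage on a hypothetical 24x18 patch grid (native, non-square resolution):
# grid = torch.stack(torch.meshgrid(torch.arange(24), torch.arange(18), indexing="ij"), -1).reshape(-1, 2)
# q = apply_rope_2d(torch.randn(24 * 18, 64), grid[:, 0], grid[:, 1])
```

Second, a minimal sketch of an alignment loss that pulls visual features toward language representations through a shared set of language prototypes: both modalities are softly assigned to the prototypes, and the visual assignment is trained to match the text-side one (the paper's exact objective may differ).

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(visual_feats: torch.Tensor,
                             text_feats: torch.Tensor,
                             prototypes: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """visual_feats: (N, D) pooled visual features; text_feats: (N, D) paired text features;
    prototypes: (K, D) language prototypes (e.g., rows of a frozen LLM embedding table)."""
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)

    logits_v = v @ p.t() / temperature      # (N, K) similarity to language prototypes
    logits_t = t @ p.t() / temperature      # (N, K)

    target = F.softmax(logits_t, dim=-1).detach()   # text-side soft assignment as target
    log_pred = F.log_softmax(logits_v, dim=-1)      # visual-side log-assignment
    return F.kl_div(log_pred, target, reduction="batchmean")
```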