CoMP: Continual Multimodal Pre-training for Vision Foundation Models
March 24, 2025
Authors: Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, Yu-Gang Jiang
cs.AI
Abstract
Pre-trained Vision Foundation Models (VFMs) provide strong visual
representations for a wide range of applications. In this paper, we continually
pre-train prevailing VFMs in a multimodal manner such that they can
effortlessly process visual inputs of varying sizes and produce visual
representations that are more aligned with language representations, regardless
of their original pre-training process. To this end, we introduce CoMP, a
carefully designed multimodal pre-training pipeline. CoMP uses a Continual
Rotary Position Embedding to support native resolution continual pre-training,
and an Alignment Loss between visual and textual features through language
prototypes to align multimodal representations. By three-stage training, our
VFMs achieve remarkable improvements not only in multimodal understanding but
also in other downstream tasks such as classification and segmentation.
Remarkably, CoMP-SigLIP achieves scores of 66.7 on ChartQA and 75.9 on DocVQA
with a 0.5B LLM, while maintaining an 87.4% accuracy on ImageNet-1K and a 49.5
mIoU on ADE20K under frozen chunk evaluation.
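The two mechanisms named in the abstract can be illustrated with short, self-contained sketches. Both are assumptions based only on the description above, not the authors' released implementation; the function names, tensor shapes, and hyperparameters (e.g., `temperature`, the source of the `prototypes` matrix) are illustrative.

First, a minimal sketch of 2D rotary position embeddings of the kind a Continual Rotary Position Embedding could build on: patch positions are encoded directly from their grid coordinates, so a ViT can process native-resolution inputs without resizing a learned position table (the continual-initialization details are not reproduced here).

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """positions: (L,) patch coordinates along one axis -> (L, dim // 2) rotation angles."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def apply_rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """x: (L, D) queries or keys for L patches (D divisible by 4);
    rows, cols: (L,) integer grid coordinates of each patch.
    The first half of the channel pairs is rotated by row angles, the second half
    by column angles, so any grid size is encoded from coordinates alone."""
    L, D = x.shape
    ang = torch.cat([rope_angles(rows, D // 2), rope_angles(cols, D // 2)], dim=-1)  # (L, D // 2)
    cos, sin = ang.cos(), ang.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]   # interleaved channel pairs, each (L, D // 2)
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Usage on a hypothetical 24x18 patch grid (native, non-square resolution):
# grid = torch.stack(torch.meshgrid(torch.arange(24), torch.arange(18), indexing="ij"), -1).reshape(-1, 2)
# q = apply_rope_2d(torch.randn(24 * 18, 64), grid[:, 0], grid[:, 1])
```

Second, a minimal sketch of an alignment loss that pulls visual features toward language representations through a shared set of language prototypes: both modalities are softly assigned to the prototypes, and the visual assignment is trained to match the text-side one (the paper's exact objective may differ).

```python
import torch
import torch.nn.functional as F

def prototype_alignment_loss(visual_feats: torch.Tensor,
                             text_feats: torch.Tensor,
                             prototypes: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """visual_feats: (N, D) pooled visual features; text_feats: (N, D) paired text features;
    prototypes: (K, D) language prototypes (e.g., rows of a frozen LLM embedding table)."""
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)

    logits_v = v @ p.t() / temperature      # (N, K) similarity to language prototypes
    logits_t = t @ p.t() / temperature      # (N, K)

    target = F.softmax(logits_t, dim=-1).detach()   # text-side soft assignment as target
    log_pred = F.log_softmax(logits_v, dim=-1)      # visual-side log-assignment
    return F.kl_div(log_pred, target, reduction="batchmean")
```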