CoMP: Continual Multimodal Pre-training for Vision Foundation Models

March 24, 2025
Authors: Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, Yu-Gang Jiang
cs.AI

Abstract

Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to support native resolution continual pre-training, and an Alignment Loss between visual and textual features through language prototypes to align multimodal representations. By three-stage training, our VFMs achieve remarkable improvements not only in multimodal understanding but also in other downstream tasks such as classification and segmentation. Remarkably, CoMP-SigLIP achieves scores of 66.7 on ChartQA and 75.9 on DocVQA with a 0.5B LLM, while maintaining an 87.4% accuracy on ImageNet-1K and a 49.5 mIoU on ADE20K under frozen chunk evaluation.
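
The abstract describes an alignment loss that matches visual and textual features through a shared set of language prototypes. Below is a minimal, hypothetical sketch of one plausible reading of such a loss: pooled visual and text features are softly assigned over the prototypes, and the visual assignment is pulled toward the textual one. The class name `PrototypeAlignmentLoss` and parameters such as `temperature` are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of an alignment loss through language prototypes.
# Assumes pooled visual features are matched to paired text features by
# comparing their soft assignments over a shared prototype set (e.g.,
# rows of an LLM embedding table). Names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeAlignmentLoss(nn.Module):
    def __init__(self, prototypes: torch.Tensor, temperature: float = 0.07):
        super().__init__()
        # prototypes: (num_prototypes, dim), e.g., frozen language embeddings
        self.register_buffer("prototypes", F.normalize(prototypes, dim=-1))
        self.temperature = temperature

    def forward(self, visual_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats, text_feats: (batch, dim), pooled per image-text pair
        v = F.normalize(visual_feats, dim=-1)
        t = F.normalize(text_feats, dim=-1)

        # Soft assignment of each modality over the language prototypes.
        v_logits = v @ self.prototypes.T / self.temperature
        t_logits = t @ self.prototypes.T / self.temperature

        # Pull the visual assignment toward the (detached) textual assignment.
        t_probs = F.softmax(t_logits, dim=-1).detach()
        return F.cross_entropy(v_logits, t_probs)


# Usage sketch: 256 prototypes in a 1024-d space, batch of 8 pairs.
if __name__ == "__main__":
    protos = torch.randn(256, 1024)
    loss_fn = PrototypeAlignmentLoss(protos)
    loss = loss_fn(torch.randn(8, 1024), torch.randn(8, 1024))
    print(loss.item())
```

This is a sketch under stated assumptions; the paper's actual formulation of the prototype-based alignment loss and its Continual Rotary Position Embedding should be taken from the paper itself.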
