CoMP: Continual Multimodal Pre-training for Vision Foundation Models

March 24, 2025
Authors: Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, Yu-Gang Jiang
cs.AI

Abstract

Pre-trained Vision Foundation Models (VFMs) provide strong visual representations for a wide range of applications. In this paper, we continually pre-train prevailing VFMs in a multimodal manner such that they can effortlessly process visual inputs of varying sizes and produce visual representations that are more aligned with language representations, regardless of their original pre-training process. To this end, we introduce CoMP, a carefully designed multimodal pre-training pipeline. CoMP uses a Continual Rotary Position Embedding to support native resolution continual pre-training, and an Alignment Loss between visual and textual features through language prototypes to align multimodal representations. By three-stage training, our VFMs achieve remarkable improvements not only in multimodal understanding but also in other downstream tasks such as classification and segmentation. Remarkably, CoMP-SigLIP achieves scores of 66.7 on ChartQA and 75.9 on DocVQA with a 0.5B LLM, while maintaining an 87.4% accuracy on ImageNet-1K and a 49.5 mIoU on ADE20K under frozen chunk evaluation.
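
The abstract describes an alignment loss that matches visual and textual features through a shared set of language prototypes. Below is a minimal, hypothetical sketch of one plausible reading of such a loss: pooled visual and text features are softly assigned over the prototypes, and the visual assignment is pulled toward the textual one. The class name `PrototypeAlignmentLoss` and parameters such as `temperature` are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of an alignment loss through language prototypes.
# Assumes pooled visual features are matched to paired text features by
# comparing their soft assignments over a shared prototype set (e.g.,
# rows of an LLM embedding table). Names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeAlignmentLoss(nn.Module):
    def __init__(self, prototypes: torch.Tensor, temperature: float = 0.07):
        super().__init__()
        # prototypes: (num_prototypes, dim), e.g., frozen language embeddings
        self.register_buffer("prototypes", F.normalize(prototypes, dim=-1))
        self.temperature = temperature

    def forward(self, visual_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats, text_feats: (batch, dim), pooled per image-text pair
        v = F.normalize(visual_feats, dim=-1)
        t = F.normalize(text_feats, dim=-1)

        # Soft assignment of each modality over the language prototypes.
        v_logits = v @ self.prototypes.T / self.temperature
        t_logits = t @ self.prototypes.T / self.temperature

        # Pull the visual assignment toward the (detached) textual assignment.
        t_probs = F.softmax(t_logits, dim=-1).detach()
        return F.cross_entropy(v_logits, t_probs)


# Usage sketch: 256 prototypes in a 1024-d space, batch of 8 pairs.
if __name__ == "__main__":
    protos = torch.randn(256, 1024)
    loss_fn = PrototypeAlignmentLoss(protos)
    loss = loss_fn(torch.randn(8, 1024), torch.randn(8, 1024))
    print(loss.item())
```

This is a sketch under stated assumptions; the paper's actual formulation of the prototype-based alignment loss and its Continual Rotary Position Embedding should be taken from the paper itself.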
