Contrastive Localized Language-Image Pre-Training
October 3, 2024
Authors: Hong-You Chen, Zhengfeng Lai, Haotian Zhang, Xinze Wang, Marcin Eichner, Keen You, Meng Cao, Bowen Zhang, Yinfei Yang, Zhe Gan
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method
for training vision encoders to generate image/text representations that
facilitate various applications. Recently, CLIP has been widely adopted as
the vision backbone of multimodal large language models (MLLMs) to connect
image inputs for language interactions. The success of CLIP as a
vision-language foundation model relies on aligning web-crawled noisy text
annotations at the image level. Nevertheless, such a criterion may become
insufficient for downstream tasks that need fine-grained vision
representations, especially when region-level understanding is demanded by
MLLMs. In this paper, we improve the localization capability of CLIP with
several advances. We propose a pre-training method called Contrastive Localized
Language-Image Pre-training (CLOC) by complementing CLIP with region-text
contrastive loss and modules. We formulate a new concept, promptable
embeddings, in which the encoder produces image embeddings that are easy to
transform into region representations given spatial hints. To support large-scale
pre-training, we design a visually-enriched and spatially-localized captioning
framework to effectively generate region-text pseudo-labels at scale. By
scaling up to billions of annotated images, CLOC enables high-quality regional
embeddings for image region recognition and retrieval tasks, and can serve as
a drop-in replacement for CLIP to enhance MLLMs, especially on referring and
grounding tasks.
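
To make the two mechanisms named in the abstract concrete, below is a minimal PyTorch sketch of a promptable region embedding and a region-text contrastive loss. Everything here is an illustrative assumption rather than the paper's actual design: the `Prompter` module, the mean pooling over patches inside a box, the tensor shapes, and the temperature are all placeholders that only show the general shape of CLIP-style contrastive alignment applied at the region level.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Prompter(nn.Module):
    """Turns patch-level image embeddings into a region embedding,
    given a spatial hint (here, a normalized bounding box)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, patch_emb, boxes, grid_size):
        # patch_emb: (B, N, D) patch embeddings, N == grid_size ** 2
        # boxes:     (B, 4) normalized (x1, y1, x2, y2) spatial hints
        coords = torch.linspace(0.0, 1.0, grid_size, device=patch_emb.device)
        yy, xx = torch.meshgrid(coords, coords, indexing="ij")
        centers = torch.stack([xx.flatten(), yy.flatten()], dim=-1)  # (N, 2)
        x1, y1, x2, y2 = boxes.unbind(dim=-1)                        # each (B,)
        inside = ((centers[None, :, 0] >= x1[:, None])
                  & (centers[None, :, 0] <= x2[:, None])
                  & (centers[None, :, 1] >= y1[:, None])
                  & (centers[None, :, 1] <= y2[:, None])).float()    # (B, N)
        # Mean-pool the patches whose centers fall inside the box.
        pooled = (inside.unsqueeze(-1) * patch_emb).sum(dim=1)
        pooled = pooled / inside.sum(dim=1, keepdim=True).clamp(min=1.0)
        return self.proj(pooled)                                     # (B, D)


def region_text_contrastive_loss(region_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over matched (region, region-caption) pairs,
    mirroring CLIP's image-level objective at the region level."""
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = region_emb @ text_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Since the abstract describes CLOC as complementing (not replacing) CLIP, a region-level term like this would be added to the standard image-level CLIP loss during training.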
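The abstract also mentions a captioning framework for generating region-text pseudo-labels at scale. The snippet below is only a schematic of such a labeling loop: `detector` and `captioner` are hypothetical placeholders, and the paper's visually-enriched, spatially-localized framework is certainly more elaborate than this.

```python
def generate_region_text_pairs(image, detector, captioner, max_regions=8):
    """Illustrative pseudo-labeling loop: a region proposer suggests boxes
    and a captioner describes each one, yielding region-text pairs."""
    pairs = []
    for box in detector(image)[:max_regions]:  # (x1, y1, x2, y2) proposals
        caption = captioner(image, box)        # caption grounded to the box
        pairs.append((box, caption))
    return pairs
```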