Contrastive Feature Masking Open-Vocabulary Vision Transformer
September 2, 2023
Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo
cs.AI
Abstract
We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an
image-text pretraining methodology that achieves simultaneous learning of
image- and region-level representation for open-vocabulary object detection
(OVD). Our approach combines the masked autoencoder (MAE) objective with the
contrastive learning objective to improve the representation for localization
tasks. Unlike standard MAE, which reconstructs in pixel space, we perform
reconstruction in the joint image-text embedding space, which helps the model
better learn region-level semantics.
Moreover, we introduce Positional Embedding Dropout (PED) to address scale
variation between image-text pretraining and detection finetuning by randomly
dropping out the positional embeddings during pretraining. PED improves
detection performance and enables the use of a frozen ViT backbone as a region
classifier, preventing the forgetting of open-vocabulary knowledge during
detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT
achieves a state-of-the-art 33.9 APr, surpassing the best approach by 7.6
points, and achieves better zero-shot detection transfer. Finally, CFM-ViT
acquires strong image-level representation, outperforming the state of the art
on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.
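To make the combined objective described above concrete, below is a minimal PyTorch-style sketch of pairing an image-text contrastive loss with a masked-reconstruction loss computed in the joint embedding space rather than pixel space. It is an illustration written only from the abstract, not the authors' implementation: the module interfaces (image_encoder, text_encoder, mae_decoder, proj), the use of full-image features as the reconstruction target, and all shapes are assumptions.

```python
# Sketch only: contrastive loss + embedding-space masked reconstruction.
# All module interfaces and the reconstruction-target choice are assumptions.
import torch
import torch.nn.functional as F


def random_patch_split(batch_size, num_patches, mask_ratio, device):
    """Per-example random split of patch indices into (visible, masked) sets."""
    n_masked = int(num_patches * mask_ratio)
    perm = torch.rand(batch_size, num_patches, device=device).argsort(dim=1)
    return perm[:, n_masked:], perm[:, :n_masked]


def cfm_style_step(image_encoder, text_encoder, mae_decoder, proj,
                   images, texts, mask_ratio=0.75, temperature=0.07):
    B = images.size(0)

    # Contrastive branch: full (unmasked) image vs. paired text, CLIP-style.
    img_tokens = image_encoder(images)                              # (B, N, D) patch tokens (assumed)
    img_emb = F.normalize(proj(img_tokens.mean(dim=1)), dim=-1)     # (B, E)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)              # (B, E)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(B, device=logits.device)
    contrastive_loss = 0.5 * (F.cross_entropy(logits, labels) +
                              F.cross_entropy(logits.t(), labels))

    # Masked branch: encode only visible patches, decode the masked positions,
    # and regress them against joint-embedding-space targets instead of pixels.
    with torch.no_grad():
        target = proj(img_tokens)                                   # (B, N, E) targets (assumed)
    vis_idx, mask_idx = random_patch_split(B, img_tokens.size(1), mask_ratio, images.device)
    vis_tokens = image_encoder(images, keep_indices=vis_idx)        # assumed masked-encoding API
    pred = mae_decoder(vis_tokens, mask_idx)                        # (B, N_masked, E), assumed API
    gather_idx = mask_idx.unsqueeze(-1).expand(-1, -1, target.size(-1))
    recon_loss = F.mse_loss(pred, torch.gather(target, 1, gather_idx))

    return contrastive_loss + recon_loss
```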
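Positional Embedding Dropout can similarly be sketched as a small module that randomly withholds the positional embedding during pretraining. The abstract does not state the granularity of the dropping, so treating it as per-token dropout here, along with the class and parameter names, is an assumption.

```python
# Sketch only: positional embeddings are randomly withheld from a fraction of
# patch tokens during pretraining; names and per-token granularity are assumed.
import torch
import torch.nn as nn


class PositionalEmbeddingDropout(nn.Module):
    def __init__(self, num_patches: int, dim: int, drop_prob: float = 0.5):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.drop_prob = drop_prob

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) output of the patchify + linear projection step.
        if not self.training or self.drop_prob == 0.0:
            return patch_tokens + self.pos_embed
        keep = (torch.rand(patch_tokens.shape[:2], device=patch_tokens.device)
                >= self.drop_prob).unsqueeze(-1).to(patch_tokens.dtype)   # (B, N, 1)
        return patch_tokens + self.pos_embed * keep
```

The dropping only happens in training mode; at detection finetuning time the positional embedding is added as usual, the intent being that the pretrained features become less tied to the pretraining image scale.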