Contrastive Feature Masking Open-Vocabulary Vision Transformer
September 2, 2023
Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo
cs.AI
Abstract
We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an
image-text pretraining methodology that achieves simultaneous learning of
image- and region-level representation for open-vocabulary object detection
(OVD). Our approach incorporates the masked autoencoder (MAE) objective into
the contrastive learning objective to improve the representation for
localization tasks. Unlike standard MAE, we perform reconstruction in the joint
image-text embedding space rather than in pixel space, as is customary for the
classical MAE method, which helps the model better learn region-level
semantics.
Moreover, we introduce Positional Embedding Dropout (PED) to address scale
variation between image-text pretraining and detection finetuning by randomly
dropping out the positional embeddings during pretraining. PED improves
detection performance and enables the use of a frozen ViT backbone as a region
classifier, preventing the forgetting of open-vocabulary knowledge during
detection finetuning. On LVIS open-vocabulary detection benchmark, CFM-ViT
achieves a state-of-the-art 33.9 APr, surpassing the best approach by 7.6
points and achieves better zero-shot detection transfer. Finally, CFM-ViT
acquires strong image-level representation, outperforming the state of the art
on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.Summary
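The combined objective described above can be pictured as a CLIP-style contrastive loss plus an MAE-style reconstruction loss whose targets live in the joint image-text embedding space rather than in pixel space. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, not the authors' code: the encoder/decoder interfaces, the use of an unmasked forward pass to produce the embedding-space targets, and the loss weight `alpha` are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over L2-normalized image/text embeddings (CLIP-style).
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def masked_feature_loss(pred_tokens, target_tokens, mask):
    # Reconstruction in the joint embedding space: decoder predictions for the
    # masked patches are pulled toward embedding-space targets via cosine distance.
    pred = F.normalize(pred_tokens, dim=-1)
    target = F.normalize(target_tokens, dim=-1)
    cos_dist = 1.0 - (pred * target).sum(-1)                 # (B, N) per-token distance
    return (cos_dist * mask).sum() / mask.sum().clamp(min=1)

def pretraining_loss(image_encoder, text_encoder, decoder, images, texts, mask, alpha=1.0):
    # Hypothetical combined objective: contrastive + masked feature reconstruction.
    img_emb, visible_tokens = image_encoder(images, mask=mask)   # pooled emb + patch tokens
    with torch.no_grad():
        _, target_tokens = image_encoder(images, mask=None)      # embedding-space targets
    txt_emb = text_encoder(texts)
    pred_tokens = decoder(visible_tokens)                        # predict masked patch tokens
    return (contrastive_loss(img_emb, txt_emb)
            + alpha * masked_feature_loss(pred_tokens, target_tokens, mask))
```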
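Positional Embedding Dropout, as described in the abstract, amounts to randomly withholding the ViT's positional embeddings during pretraining. The following is a minimal sketch of one way to implement that; the 0.5 drop probability, the per-step (rather than per-token or per-sample) dropping, and the module name are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PositionalEmbeddingDropout(nn.Module):
    """Adds positional embeddings to patch tokens, but with probability
    `drop_prob` skips them entirely for a given training step."""
    def __init__(self, num_tokens: int, dim: int, drop_prob: float = 0.5):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        self.drop_prob = drop_prob

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) patch embeddings before the ViT blocks.
        if self.training and torch.rand(()) < self.drop_prob:
            return tokens                      # drop positional information this step
        return tokens + self.pos_embed         # standard additive positional embedding

# Usage with assumed sizes (196 patches of a 224x224 image, embedding dim 768):
ped = PositionalEmbeddingDropout(num_tokens=196, dim=768, drop_prob=0.5)
patches = torch.randn(2, 196, 768)
out = ped(patches)
```

Because the backbone pretrained this way is less reliant on a fixed positional grid, the abstract argues it can be kept frozen and reused as a region classifier at detection resolution without forgetting its open-vocabulary knowledge.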