Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers
May 11, 2023
Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo
cs.AI
Abstract
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a
contrastive image-text pretraining recipe to bridge the gap between image-level
pretraining and open-vocabulary object detection. At the pretraining phase, we
propose to randomly crop and resize regions of positional embeddings instead of
using the whole image positional embeddings. This better matches the use of
positional embeddings at region-level in the detection finetuning phase. In
addition, we replace the common softmax cross entropy loss in contrastive
learning with focal loss to better learn the informative yet difficult
examples. Finally, we leverage recent advances in novel object proposals to
improve open-vocabulary detection finetuning. We evaluate our full model on the
LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer.
RO-ViT achieves a state-of-the-art 32.1 AP_r on LVIS, surpassing the best
existing approach by +5.8 points in addition to competitive zero-shot transfer
detection. Surprisingly, RO-ViT improves the image-level representation as well
and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr
image-text retrieval benchmarks, outperforming competitive approaches with
larger models.
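
The cropped positional embedding idea can be illustrated with a short sketch. The PyTorch snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the function name cropped_positional_embedding, the (grid_h, grid_w, dim) layout of the embedding, and the crop-scale and aspect-ratio ranges are all placeholders chosen for the example. It randomly crops a region of the whole-image positional embedding grid and bilinearly resizes it back to the full token grid, so that during pretraining the full image is treated as if it were a region crop of a larger scene.

```python
import torch
import torch.nn.functional as F

def cropped_positional_embedding(pos_emb, crop_scale=(0.1, 1.0)):
    """Sketch of region-style cropping of a positional embedding.

    pos_emb: (grid_h, grid_w, dim) learned whole-image positional embedding.
    crop_scale: assumed range for the sampled crop area fraction.
    Returns a (grid_h, grid_w, dim) embedding: a random sub-grid of the
    original, resized back to the full token grid.
    """
    gh, gw, dim = pos_emb.shape
    # Sample a random crop area and aspect ratio (ranges are assumptions).
    area = torch.empty(1).uniform_(*crop_scale).item() * gh * gw
    aspect = torch.empty(1).uniform_(0.5, 2.0).item()
    ch = max(1, min(gh, int(round((area * aspect) ** 0.5))))
    cw = max(1, min(gw, int(round((area / aspect) ** 0.5))))
    y0 = torch.randint(0, gh - ch + 1, (1,)).item()
    x0 = torch.randint(0, gw - cw + 1, (1,)).item()
    crop = pos_emb[y0:y0 + ch, x0:x0 + cw, :]           # (ch, cw, dim)
    # Bilinearly resize the cropped grid back to the full token grid.
    crop = crop.permute(2, 0, 1).unsqueeze(0)           # (1, dim, ch, cw)
    resized = F.interpolate(crop, size=(gh, gw), mode="bilinear",
                            align_corners=False)
    return resized.squeeze(0).permute(1, 2, 0)          # (gh, gw, dim)
```

At finetuning time the detector applies positional embeddings to region-level features, so randomizing the region seen by the embedding at pretraining time is what narrows the image-to-region gap the abstract describes.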
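The focal variant of the contrastive objective can likewise be sketched as a sigmoid focal loss over the pairwise image-text similarity matrix, with matched pairs as positives and all other pairs as negatives. This is a hedged sketch: the function name, the temperature, and gamma are illustrative defaults, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def focal_contrastive_loss(img_emb, txt_emb, temperature=0.01, gamma=2.0):
    """Sketch of a focal loss replacing softmax CE in contrastive learning.

    img_emb, txt_emb: (batch, dim) embeddings, assumed L2-normalized.
    Diagonal pairs are positives, off-diagonal pairs are negatives;
    gamma down-weights easy pairs so hard, informative examples
    contribute more to the gradient.
    """
    logits = img_emb @ txt_emb.t() / temperature        # (batch, batch)
    targets = torch.eye(len(logits), device=logits.device)
    p = torch.sigmoid(logits)
    # p_t is the probability the model assigns to each pair's true label.
    p_t = p * targets + (1 - p) * (1 - targets)
    ce = F.binary_cross_entropy_with_logits(logits, targets,
                                            reduction="none")
    return ((1 - p_t) ** gamma * ce).mean()
```

Here gamma=2.0 follows the common focal-loss default; larger values concentrate the loss even more on hard pairs, which matches the abstract's motivation of better learning "informative yet difficult examples."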