Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

May 11, 2023
Authors: Dahun Kim, Anelia Angelova, Weicheng Kuo
cs.AI

Abstract

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe that bridges the gap between image-level pretraining and open-vocabulary object detection. In the pretraining phase, we propose to randomly crop and resize regions of the positional embeddings instead of using the whole-image positional embeddings. This better matches the region-level use of positional embeddings in the detection finetuning phase. In addition, we replace the softmax cross-entropy loss common in contrastive learning with focal loss to better learn informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate the full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 AP_r on LVIS, surpassing the best existing approach by +5.8 points, while also delivering competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well, achieving the state of the art on 9 out of 12 metrics on the COCO and Flickr image-text retrieval benchmarks and outperforming competitive approaches with larger models.
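The abstract gives no implementation details for the cropped positional embeddings, but a minimal sketch of the idea is shown below: upsample the learned whole-image positional embedding, take a random region crop, and resize it back to the token grid, so pretraining sees region-like positional signals the way detection finetuning does. The grid size, the upsampling factor, and the (0.1, 1.0) crop-scale range are illustrative assumptions, not values from the paper, and a CLS-token embedding (if any) is assumed to be handled separately.

```python
import torch
import torch.nn.functional as F

def cropped_positional_embedding(pos_embed, grid=14, up_factor=4):
    """Randomly crop and resize a region of the whole-image positional
    embedding (hedged sketch; hyperparameters are illustrative).

    pos_embed: (1, grid*grid, dim) learned ViT positional embedding,
    assumed to exclude any CLS-token embedding.
    """
    dim = pos_embed.shape[-1]
    # Reshape tokens to a 2D grid, channels-first for F.interpolate.
    pe = pos_embed.reshape(1, grid, grid, dim).permute(0, 3, 1, 2)
    up = grid * up_factor
    pe = F.interpolate(pe, size=(up, up), mode="bilinear", align_corners=False)
    # Sample a random square crop (area scale and location), in the spirit
    # of random-resized-crop; the (0.1, 1.0) range is an assumption.
    scale = torch.empty(1).uniform_(0.1, 1.0).item()
    side = max(1, int(up * scale ** 0.5))
    y0 = torch.randint(0, up - side + 1, (1,)).item()
    x0 = torch.randint(0, up - side + 1, (1,)).item()
    crop = pe[:, :, y0:y0 + side, x0:x0 + side]
    # Resize the crop back to the original token grid.
    crop = F.interpolate(crop, size=(grid, grid), mode="bilinear", align_corners=False)
    return crop.permute(0, 2, 3, 1).reshape(1, grid * grid, dim)
```

For example, `cropped_positional_embedding(torch.zeros(1, 14 * 14, 768))` returns a tensor of the same shape as the input, so it can drop in wherever the whole-image positional embedding would otherwise be added to the patch tokens.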
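Similarly, the abstract only states that focal loss replaces the softmax cross-entropy in the contrastive objective; the exact formulation is not given here. The sketch below assumes a sigmoid-based pairwise focal loss over the batch similarity matrix, with `temperature` and `gamma` as hypothetical hyperparameters chosen for illustration.

```python
import torch
import torch.nn.functional as F

def focal_contrastive_loss(img_emb, txt_emb, temperature=0.07, gamma=2.0):
    """Pairwise focal loss over the image-text similarity matrix
    (hedged sketch; a sigmoid-pairwise formulation is assumed).

    img_emb, txt_emb: (B, D) L2-normalized embeddings; row i of each
    is assumed to be a matched image-text pair.
    """
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarities
    targets = torch.eye(logits.shape[0], device=logits.device)
    p = torch.sigmoid(logits)
    # p_t: probability assigned to the correct decision for each pair
    # (match on the diagonal, non-match off the diagonal).
    p_t = p * targets + (1.0 - p) * (1.0 - targets)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # Down-weight easy pairs by (1 - p_t)^gamma so hard, informative
    # pairs dominate the gradient, as focal loss intends.
    return ((1.0 - p_t) ** gamma * ce).mean()
```

With `gamma = 0` this reduces to a plain sigmoid cross-entropy over all image-text pairs, which makes the focal modulation the only change relative to a standard pairwise contrastive baseline.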