Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

Abstract

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe that bridges the gap between image-level pretraining and open-vocabulary object detection. In the pretraining phase, we propose to randomly crop and resize regions of the positional embeddings instead of using the whole-image positional embeddings. This better matches how positional embeddings are used at the region level in the detection finetuning phase. In addition, we replace the common softmax cross-entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 AP_r on LVIS, surpassing the best existing approach by +5.8 points, and is also competitive in zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well, achieving the state of the art on 9 out of 12 metrics on the COCO and Flickr image-text retrieval benchmarks and outperforming competitive approaches with larger models.
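
The two pretraining changes described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names, the crop sampling rule (`min_scale`), the bilinear resize, and the sigmoid-based form of the focal loss with `gamma=2.0` are all assumptions made for illustration; the abstract states only that regions of the positional embeddings are randomly cropped and resized, and that focal loss replaces softmax cross-entropy.

```python
import math
import random


def bilinear_resize(grid, out_h, out_w):
    """Resize an H x W grid of embedding vectors to out_h x out_w (bilinear)."""
    in_h, in_w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    out = []
    for i in range(out_h):
        # Map the output row to a fractional input row (corners aligned).
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            row.append([
                (1 - wy) * ((1 - wx) * grid[y0][x0][d] + wx * grid[y0][x1][d])
                + wy * ((1 - wx) * grid[y1][x0][d] + wx * grid[y1][x1][d])
                for d in range(dim)
            ])
        out.append(row)
    return out


def cropped_positional_embedding(pos_emb, rng, min_scale=0.1):
    """Randomly crop a region of the embedding grid and resize it to full size.

    The sampling rule below (independent random height/width fractions) is a
    hypothetical choice, not taken from the paper.
    """
    h, w = len(pos_emb), len(pos_emb[0])
    crop_h = max(1, round(h * rng.uniform(min_scale, 1.0)))
    crop_w = max(1, round(w * rng.uniform(min_scale, 1.0)))
    top = rng.randint(0, h - crop_h)
    left = rng.randint(0, w - crop_w)
    region = [row[left:left + crop_w] for row in pos_emb[top:top + crop_h]]
    return bilinear_resize(region, h, w)


def focal_contrastive_loss(sim, gamma=2.0):
    """Sigmoid focal loss over a B x B image-text similarity (logit) matrix.

    Diagonal entries are matched pairs, off-diagonal entries are negatives.
    The sigmoid formulation is an assumption of this sketch.
    """
    batch = len(sim)
    total = 0.0
    for i in range(batch):
        for j in range(batch):
            p = 1.0 / (1.0 + math.exp(-sim[i][j]))
            # Probability assigned to the correct label for this pair.
            p_true = max(p if i == j else 1.0 - p, 1e-12)
            # Easy pairs (p_true near 1) are down-weighted by (1 - p_true)^gamma.
            total += -((1.0 - p_true) ** gamma) * math.log(p_true)
    return total / batch
```

In this sketch, `cropped_positional_embedding(pos_emb, random.Random(0))` would be called once per training image on a grid stored as nested lists of shape H x W x D, and `focal_contrastive_loss` would be applied to the batch similarity matrix in place of the softmax cross-entropy term. Setting `gamma=0.0` reduces the loss to plain sigmoid cross-entropy, which makes the down-weighting of easy pairs simple to inspect.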
