Contrastive Feature Masking Open-Vocabulary Vision Transformer

Abstract

We present Contrastive Feature Masking Vision Transformer (CFM-ViT), an image-text pretraining method that simultaneously learns image-level and region-level representations for open-vocabulary object detection (OVD). Our approach combines the masked autoencoder (MAE) objective with the contrastive learning objective to improve the representation for localization tasks. Unlike standard MAE, which reconstructs in pixel space, we perform reconstruction in the joint image-text embedding space, which helps the model learn region-level semantics. Moreover, we introduce Positional Embedding Dropout (PED), which randomly drops out the positional embeddings during pretraining to address the scale variation between image-text pretraining and detection finetuning. PED improves detection performance and enables the use of a frozen ViT backbone as a region classifier, preventing the forgetting of open-vocabulary knowledge during detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 APr, surpassing the previous best approach by 7.6 points, and it also achieves better zero-shot detection transfer. Finally, CFM-ViT acquires strong image-level representations, outperforming the state of the art on 8 of 12 metrics on zero-shot image-text retrieval benchmarks.
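To make the three ingredients named in the abstract concrete, the following is a minimal NumPy sketch of a contrastive image-text loss, a reconstruction loss computed on features rather than pixels, and positional embedding dropout. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value, the cosine-distance form of the reconstruction loss, and the choice to drop the positional embeddings for the whole input at once are all hypothetical choices made for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: arrays of shape (B, D); row i of each forms a pair.
    The temperature value is an assumption for this sketch.
    """
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature

    def diagonal_cross_entropy(scores):
        # Log-softmax over each row, then pick the matched (diagonal) entry.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def feature_reconstruction_loss(pred_feats, target_feats, mask):
    """Mean cosine distance between predicted and target token features,
    computed in feature space (not pixel space) and only at masked tokens.

    pred_feats, target_feats: (B, N, D); mask: (B, N) bool, True = masked.
    Using cosine distance here is an assumption of this sketch.
    """
    cos = np.sum(l2_normalize(pred_feats) * l2_normalize(target_feats), axis=-1)
    return np.sum((1.0 - cos) * mask) / max(mask.sum(), 1)


def apply_ped(patch_tokens, pos_emb, drop_prob, rng):
    """Positional embedding dropout: with probability `drop_prob`, skip
    adding the positional embeddings for the whole input (an assumption;
    other granularities are possible)."""
    if rng.random() < drop_prob:
        return patch_tokens
    return patch_tokens + pos_emb
```

In a pretraining step, the two losses would be summed (possibly with a weight, which is also an assumption here), and `apply_ped` would only be used with a nonzero `drop_prob` during pretraining.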
