Contrastive Feature Masking Open-Vocabulary Vision Transformer

Abstract

We present Contrastive Feature Masking Vision Transformer (CFM-ViT), an image-text pretraining methodology that learns image-level and region-level representations simultaneously for open-vocabulary object detection (OVD). Our approach combines the masked autoencoder (MAE) objective with the contrastive learning objective to improve the representation for localization tasks. Unlike standard MAE, we perform reconstruction in the joint image-text embedding space rather than in pixel space, which helps the model learn region-level semantics. Moreover, we introduce Positional Embedding Dropout (PED), which randomly drops the positional embeddings during pretraining to address the scale variation between image-text pretraining and detection finetuning. PED improves detection performance and enables the use of a frozen ViT backbone as a region classifier, preventing the forgetting of open-vocabulary knowledge during detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 APr, surpassing the best existing approach by 7.6 points, and achieves better zero-shot detection transfer. Finally, CFM-ViT acquires a strong image-level representation, outperforming the state of the art on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.
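
To make the idea of Positional Embedding Dropout concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract says only that positional embeddings are randomly dropped during pretraining, so the function name `add_positional_embeddings`, the `drop_prob` parameter, and the choice to drop the whole positional embedding for a sample at once are all hypothetical.

```python
import random


def add_positional_embeddings(patch_tokens, pos_embeddings, drop_prob, rng, training=True):
    """Add positional embeddings to patch tokens, or skip them with probability drop_prob.

    patch_tokens, pos_embeddings: lists of equal-length float vectors, one per patch.
    Assumption: the positional embedding is dropped for the whole sample at once;
    the abstract does not specify the granularity or the drop rate.
    """
    if training and rng.random() < drop_prob:
        # Dropped: the encoder sees the patch tokens without position information.
        return [list(tok) for tok in patch_tokens]
    # Kept: element-wise sum of each token with its positional embedding.
    return [
        [t + p for t, p in zip(tok, pos)]
        for tok, pos in zip(patch_tokens, pos_embeddings)
    ]


# Toy example: two patches with two-dimensional features.
tokens = [[1.0, 2.0], [3.0, 4.0]]
positions = [[0.5, 0.5], [0.25, 0.25]]
rng = random.Random(0)

always_dropped = add_positional_embeddings(tokens, positions, drop_prob=1.0, rng=rng)
never_dropped = add_positional_embeddings(tokens, positions, drop_prob=0.0, rng=rng)
at_inference = add_positional_embeddings(tokens, positions, drop_prob=1.0, rng=rng, training=False)
```

In this sketch the dropout is active only when `training=True`, so positional embeddings are always applied outside pretraining; whether the original method behaves this way at detection finetuning is not stated in the abstract.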
